<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import LdaModel
import gensim.matutils

# Load your dataset (assuming it's in a CSV format)
data = pd.read_csv('/content/INFO 5731_DATA_SET_ASSIGN_4.csv')

# The dataset is assumed to have a 'clean_text' column containing the text data
documents = data['clean_text'].tolist()

# Preprocess the text data (tokenization, stop word removal, etc.)
# This can be done using libraries like NLTK or spaCy

# Create a bag of words representation of the documents
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix into a Gensim corpus
corpus = gensim.matutils.Sparse2Corpus(X.T)

# Create a dictionary mapping id to word
id2word = dict((v, k) for k, v in vectorizer.vocabulary_.items())

# Build the LDA model
num_topics = 10
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, passes=10)

# Print the top 10 clusters for topic modeling
for topic_idx, topic in lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False):
    print(f"Topic {topic_idx}:")
    topic_terms = [term for term, _ in topic]
    print(", ".join(topic_terms))
    print("\n")

# Summarize and describe the topic for each cluster
for topic_idx, topic in lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False):
    print(f"Cluster {topic_idx}:")
    topic_terms = [term for term, _ in topic]
    print("Top terms:", ", ".join(topic_terms))
    print("Description: Write your description based on these terms.")
    print("\n")


Topic 0:
movie, br, just, film, time, like, really, bad, good, movies


Topic 1:
br, film, story, just, best, time, movie, old, like, characters


Topic 2:
br, movie, film, think, like, work, way, really, actually, great


Topic 3:
br, movie, film, like, bad, good, know, just, great, really


Topic 4:
br, movie, film, like, just, good, really, time, bad, story


Topic 5:
film, br, movie, like, just, good, story, characters, disney, really


Topic 6:
film, br, movie, like, just, good, think, story, films, make


Topic 7:
br, film, movie, just, like, good, little, time, story, people


Topic 8:
br, movie, just, like, game, don, good, story, film, people


Topic 9:
br, movie, film, great, story, like, acting, movies, little, really


Cluster 0:
Top terms: movie, br, just, film, time, like, really, bad, good, movies
Description: Write your description based on these terms.


Cluster 1:
Top terms: br, film, story, just, best, time, movie, old, like, characters
Description: Write your descri
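
The token `br` appears near the top of every topic above, which suggests that HTML line-break tags (`<br />`) survived in `clean_text` and were tokenized as the word "br". The following is a minimal sketch of stripping such tags before vectorizing. The helper name `strip_html_remnants` and the sample string are hypothetical, and the presence of `<br />` markup in the dataset is an assumption inferred from the output.

```python
import html
import re

# Matches any HTML tag such as <br>, <br />, or <p class="x">
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_remnants(text: str) -> str:
    """Replace HTML tags with spaces, unescape entities, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


# Hypothetical review text containing line-break tags and an escaped ampersand
sample = "Great film.<br /><br />The story &amp; acting were solid."
print(strip_html_remnants(sample))
```

The helper could be applied as `documents = [strip_html_remnants(d) for d in documents]` before calling `CountVectorizer`. If the tags were already reduced to the bare word "br" during earlier cleaning, adding "br" to the stop-word list would be the simpler option.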

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Notice: **80% data for training and 20% data for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build two sentiment classifiers respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [None]:
# Write your code here
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
data = pd.read_csv('/content/INFO 5731_DATA_SET_ASSIGN_4.csv')

# dataset has columns 'clean_text' (feature) and 'sentiment' (target)
X = data['clean_text']
y = data['sentiment']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Define classifiers
svm_classifier = SVC(kernel='linear')
rf_classifier = RandomForestClassifier(n_estimators=100)

# Cross-validation (5-fold)
def evaluate_classifier(classifier, X_train, y_train):
    cv_scores = cross_val_score(classifier, X_train, y_train, cv=5, scoring='accuracy')
    return cv_scores

# Evaluate SVM classifier
svm_cv_scores = evaluate_classifier(svm_classifier, X_train_tfidf, y_train)
print("SVM Cross-Validation Scores:", svm_cv_scores)
print("Mean Accuracy (SVM):", svm_cv_scores.mean())

# Evaluate Random Forest classifier
rf_cv_scores = evaluate_classifier(rf_classifier, X_train_tfidf, y_train)
print("Random Forest Cross-Validation Scores:", rf_cv_scores)
print("Mean Accuracy (Random Forest):", rf_cv_scores.mean())

# Train and evaluate classifiers on the test set
def evaluate_test_set(classifier, X_train, y_train, X_test, y_test):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    return accuracy, precision, recall, f1

# Evaluate SVM classifier on test set
svm_accuracy, svm_precision, svm_recall, svm_f1 = evaluate_test_set(svm_classifier, X_train_tfidf, y_train, X_test_tfidf, y_test)
print("SVM Test Accuracy:", svm_accuracy)
print("SVM Test Precision:", svm_precision)
print("SVM Test Recall:", svm_recall)
print("SVM Test F1 Score:", svm_f1)

# Evaluate Random Forest classifier on test set
rf_accuracy, rf_precision, rf_recall, rf_f1 = evaluate_test_set(rf_classifier, X_train_tfidf, y_train, X_test_tfidf, y_test)
print("Random Forest Test Accuracy:", rf_accuracy)
print("Random Forest Test Precision:", rf_precision)
print("Random Forest Test Recall:", rf_recall)
print("Random Forest Test F1 Score:", rf_f1)


SVM Cross-Validation Scores: [0.8     0.78125 0.80625 0.8     0.78125]
Mean Accuracy (SVM): 0.79375
Random Forest Cross-Validation Scores: [0.75625 0.8     0.7875  0.75625 0.73125]
Mean Accuracy (Random Forest): 0.76625
SVM Test Accuracy: 0.815
SVM Test Precision: 0.8150618028338861
SVM Test Recall: 0.815
SVM Test F1 Score: 0.8148469119085233
Random Forest Test Accuracy: 0.755
Random Forest Test Precision: 0.7570537084398977
Random Forest Test Recall: 0.755
Random Forest Test F1 Score: 0.7537082166553142
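
In the cell above, the TF-IDF vectorizer is fitted on the whole training split before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. One way to avoid this, sketched below, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so the vectorizer is refitted inside every fold. The toy texts and labels are hypothetical stand-ins for the `clean_text` and `sentiment` columns, and `LinearSVC` is used here in place of `SVC(kernel='linear')` purely for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the 'clean_text' and 'sentiment' columns
positive_texts = [
    "great film loved the story", "wonderful acting and great plot",
    "loved every minute great movie", "brilliant story wonderful cast",
    "great fun loved the characters", "wonderful film brilliant acting",
    "loved the plot great cast", "brilliant movie loved it",
    "great story wonderful ending", "loved the acting brilliant film",
]
negative_texts = [
    "terrible film hated the story", "awful acting and boring plot",
    "hated every minute terrible movie", "boring story awful cast",
    "terrible waste hated the characters", "awful film boring acting",
    "hated the plot terrible cast", "boring movie hated it",
    "terrible story awful ending", "hated the acting boring film",
]
toy_texts = positive_texts + negative_texts
toy_labels = ["positive"] * len(positive_texts) + ["negative"] * len(negative_texts)

# The vectorizer is part of the estimator, so it is refitted on each training fold
text_pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())

# Collect several metrics in a single 5-fold cross-validation run
metric_names = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
cv_results = cross_validate(text_pipeline, toy_texts, toy_labels, cv=5, scoring=metric_names)

for name in metric_names:
    print(f"{name}: {cv_results['test_' + name].mean():.3f}")
```

With the real data, `toy_texts` and `toy_labels` would be replaced by the raw `X_train` and `y_train` series, and the fitted pipeline could then be scored once on the held-out test split.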


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you select those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

1. Conduct necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the training dataset
data = pd.read_csv('/content/train.csv')

# Display basic information about the dataset
print(data.head())  # View the first few rows
print(data.info())  # Display information about columns and data types

# Identify numeric columns for imputation
numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()

# Fill missing values in numeric columns with mean
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# Check the remaining missing values (categorical columns are not imputed above)
print(data.isnull().sum())

# Split the data into training and testing sets
X = data.drop('SalePrice', axis=1)  # Features
y = data['SalePrice']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now X_train, y_train are the training data and labels
# X_test, y_test are the testing data and labels


   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

In [None]:
# Based on the EDA results, select a number of features for the regression model. Briefly explain why you select those features.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
data = pd.read_csv('/content/train.csv')

# Select relevant features and target variable
selected_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
X = data[selected_features]
y = data['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"Training RMSE: {train_rmse}")
print(f"Testing RMSE: {test_rmse}")

# Print coefficients to understand feature importance
print("Feature Coefficients:")
for feature, coef in zip(selected_features, model.coef_):
    print(f"{feature}: {coef}")

# Explanation for feature selection:
# 1. OverallQual: Quality rating impacts perceived value and market appeal.
# 2. GrLivArea: Above ground living area correlates strongly with house size and price.
# 3. GarageCars: Larger garages indicate higher capacity and often accompany higher-priced homes.
# 4. TotalBsmtSF: Basement area contributes significantly to overall square footage and value.
# 5. YearBuilt: Age of the house can influence desirability and maintenance costs.


Training RMSE: 37970.210024760556
Testing RMSE: 39763.295265780616
Feature Coefficients:
OverallQual: 20392.513001014442
GrLivArea: 48.80980118759696
GarageCars: 15144.237203506369
TotalBsmtSF: 25.36531866067783
YearBuilt: 315.92335034468454
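
The feature explanations above are qualitative. One way to ground such a choice in the EDA is to rank the numeric columns by the absolute value of their correlation with the target. The sketch below does this on synthetic data: the column names (`living_area`, `quality`, `lot_id`, `zoning`, `price`) and the generated values are hypothetical and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all columns are hypothetical)
rng = np.random.default_rng(0)
n_rows = 200
living_area = rng.uniform(50, 300, n_rows)
quality = rng.integers(1, 11, n_rows)
lot_id = rng.permutation(n_rows)  # arbitrary identifier, unrelated to price
price = 1000 * living_area + 15000 * quality + rng.normal(0, 20000, n_rows)

toy_housing = pd.DataFrame({
    "living_area": living_area,
    "quality": quality,
    "lot_id": lot_id,
    "zoning": rng.choice(["A", "B"], n_rows),  # categorical, excluded below
    "price": price,
})

# Keep numeric columns only, then correlate every feature with the target
numeric_columns = toy_housing.select_dtypes(include="number")
correlations = numeric_columns.corr()["price"].drop("price")

# Rank features by the strength of their linear relationship with the target
ranked_features = correlations.abs().sort_values(ascending=False)
print(ranked_features)
```

On the real data, the same ranking with `SalePrice` as the target would show whether the five selected features are among the most strongly correlated ones. Correlation only captures linear relationships, so it is a starting point rather than a complete selection method.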


In [None]:
#Develop a regression model. The train set should be used.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
data = pd.read_csv('/content/train.csv')

# Select relevant features and target variable
selected_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
X_train = data[selected_features]
y_train = data['SalePrice']

# Build a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict house prices on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set using RMSE
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = sqrt(MSE); avoids the version-dependent 'squared' argument
print(f"Training RMSE: {train_rmse}")

# Optionally, print the model coefficients
print("Model Coefficients:")
for feature, coef in zip(selected_features, model.coef_):
    print(f"{feature}: {coef}")

Training RMSE: 38254.68965518844
Model Coefficients:
OverallQual: 20391.140933744067
GrLivArea: 50.83150559496018
GarageCars: 14510.003299796621
TotalBsmtSF: 29.97787731828509
YearBuilt: 301.43341058974954


In [None]:
# Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

# Load the test dataset
test_data = pd.read_csv('/content/test.csv')

# Select the same features the model was trained on
selected_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
X_test = test_data[selected_features]

# test.csv has no 'SalePrice' column, so 'SaleCondition' is used as a stand-in target.
# NOTE: this compares predicted prices with category codes, so the RMSE below
# does not measure price-prediction error.
target_variable = 'SaleCondition'
y_test = test_data[target_variable]

# Encode the categorical stand-in target as integers (label encoding)
label_encoder = LabelEncoder()
y_test_encoded = label_encoder.fit_transform(y_test)

# Replace missing values in the numeric test features with the column mean
imputer = SimpleImputer(strategy='mean')
X_test_imputed = imputer.fit_transform(X_test)

# Use the model trained in the previous cell to make predictions on the test set
y_test_pred = model.predict(X_test_imputed)

# Compute RMSE as the square root of the mean squared error
test_rmse = mean_squared_error(y_test_encoded, y_test_pred) ** 0.5
print(f"Test RMSE: {test_rmse}")


Test RMSE: 191831.25716474172




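The test data does not contain the target variable 'SalePrice', so 'SaleCondition' was label-encoded and used as the target instead. Because the predictions are prices and the encoded values are category codes, the RMSE above is not a meaningful measure of the model's accuracy.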
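
When the provided test file has no target column, a common alternative is to hold out part of the labelled training data and evaluate on that instead. The sketch below illustrates this on synthetic data and reports RMSE, MAE, and R². The generated features and target are hypothetical and do not reproduce the house-price results; with the real data, the split of `train.csv` used in the feature-selection cell serves the same purpose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: the target is a noisy linear function of two features
rng = np.random.default_rng(42)
features = rng.uniform(0, 10, size=(300, 2))
target = 3.0 * features[:, 0] + 2.0 * features[:, 1] + rng.normal(0, 1.0, size=300)

# Hold out 20% of the labelled rows for evaluation
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Fit on the training portion only, then predict on the held-out portion
regressor = LinearRegression().fit(X_fit, y_fit)
holdout_pred = regressor.predict(X_holdout)

# RMSE and MAE are in the units of the target; R^2 is unitless
rmse = mean_squared_error(y_holdout, holdout_pred) ** 0.5
mae = mean_absolute_error(y_holdout, holdout_pred)
r2 = r2_score(y_holdout, holdout_pred)

print(f"Hold-out RMSE: {rmse:.3f}")
print(f"Hold-out MAE:  {mae:.3f}")
print(f"Hold-out R^2:  {r2:.3f}")
```

Reporting MAE and R² alongside RMSE helps because RMSE alone is sensitive to a few large errors, while R² expresses how much of the variance in the target the model explains.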

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any other related model.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


1. Description of the Selected PLM: BERT (Bidirectional Encoder Representations from Transformers)
Pretraining data sources: BERT was pretrained on BooksCorpus (800 million words) and English Wikipedia (2.5 billion words), using the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives.
Number of parameters: BERT-base has 110 million parameters and BERT-large has 340 million. The code below loads `distilbert-base-uncased`, a distilled version of BERT-base with about 66 million parameters.
Task-specific fine-tuning: BERT can be fine-tuned for sentiment analysis on labelled datasets such as IMDb reviews, Yelp reviews, or Twitter sentiment data. Here, the model is used in a zero-shot configuration without any fine-tuning.

In [None]:
# Write your code here
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import pipeline
import textwrap

# Load the dataset
data = pd.read_csv('/content/INFO 5731_DATA_SET_ASSIGN_4.csv')

# Assuming your dataset has columns 'clean_text' (feature) and 'sentiment' (target)
texts = data['clean_text'].tolist()
labels = data['sentiment'].tolist()

# Initialize a zero-shot classification pipeline with DistilBERT.
# NOTE: distilbert-base-uncased is a base checkpoint with no NLI fine-tuning, so its
# classification head is randomly initialized (see the warnings in the output below).
sentiment_classifier = pipeline("zero-shot-classification", model="distilbert-base-uncased")

# Specify possible labels for sentiment (positive, negative, neutral)
possible_labels = ["positive", "negative", "neutral"]

# Shorten each text to roughly its first 512 characters, breaking at whitespace.
# This is a character limit, not the model's 512-token limit.
def preprocess_text(text):
    # Wrap the text into lines of at most 512 characters without splitting words
    wrapped_text = textwrap.fill(text, width=512, break_long_words=False)
    return wrapped_text.split('\n')[0]  # Keep only the first wrapped line

# Shorten all texts before classification
prepared_texts = [preprocess_text(text) for text in texts]

# Perform zero-shot sentiment analysis and keep the top-scoring label for each text
predictions = [sentiment_classifier(text, candidate_labels=possible_labels)['labels'][0] for text in prepared_texts]

# Evaluate performance
accuracy = accuracy_score(labels, predictions)
precision = precision_score(labels, predictions, average='weighted', labels=possible_labels)
recall = recall_score(labels, predictions, average='weighted', labels=possible_labels)
f1 = f1_score(labels, predictions, average='weighted', labels=possible_labels)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)




Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Accuracy: 0.004
Precision: 0.11133333333333333
Recall: 0.003999999999999999
F1 Score: 0.007722543352601155


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
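
The warnings above indicate that `distilbert-base-uncased` has no trained classification head and no entailment label, which is consistent with the near-zero scores. Zero-shot classification pipelines are normally backed by a checkpoint fine-tuned on natural language inference (NLI). The sketch below shows how the prediction step could be organised around such a checkpoint. The choice of `facebook/bart-large-mnli` is an example rather than the model used in this notebook, and the helper names `build_nli_classifier`, `predict_labels`, and `normalize_labels` are hypothetical. Since an accuracy of 0.004 is far below chance for three classes, the gold labels in `sentiment` may also not match the candidate strings exactly. That is an assumption, and `normalize_labels` only addresses differences in case and surrounding whitespace.

```python
from typing import Callable, List


def build_nli_classifier(model_name: str = "facebook/bart-large-mnli"):
    """Create a zero-shot pipeline backed by an NLI-fine-tuned checkpoint."""
    # Imported lazily so the helpers below can be used without loading transformers
    from transformers import pipeline
    return pipeline("zero-shot-classification", model=model_name)


def predict_labels(classifier: Callable, texts: List[str],
                   candidate_labels: List[str]) -> List[str]:
    """Return the top-scoring candidate label for each text."""
    predictions = []
    for text in texts:
        result = classifier(text, candidate_labels=candidate_labels)
        # The pipeline returns labels sorted by descending score
        predictions.append(result["labels"][0])
    return predictions


def normalize_labels(labels: List[str]) -> List[str]:
    """Lower-case and strip gold labels so they can match the candidate strings."""
    return [str(label).strip().lower() for label in labels]
```

A possible usage, which downloads the model on first call, would be `classifier = build_nli_classifier()` followed by `predictions = predict_labels(classifier, prepared_texts, possible_labels)`, then scoring against `normalize_labels(labels)` with the same scikit-learn metrics as above. Keeping the classifier as a parameter also makes the prediction step testable with a lightweight stand-in.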


3. Advantages and Disadvantages of BERT for Sentiment Analysis
Advantages:

BERT's bidirectional self-attention lets it capture context and intricate linguistic patterns.
BERT-family checkpoints that have been fine-tuned on natural language inference can be applied to new label sets in a zero-shot setting, without task-specific fine-tuning.
Multilingual variants make sentiment analysis possible in several languages.
Drawbacks:

BERT is computationally expensive and needs substantial resources for inference.
Its predictions are harder to interpret than those of more established machine learning models such as logistic regression or SVM.
BERT's performance may suffer on noisy or out-of-domain data.

Challenges Encountered:

Selecting a suitable BERT model for the task and understanding its input and output formats. The checkpoint used above has no trained classification head, which the warnings and the 0.4% accuracy reflect.
Applying the zero-shot method to multiclass sentiment analysis, particularly with imbalanced classes.
Controlling memory use and inference time, particularly with large datasets or in resource-constrained deployments.

Because BERT is pretrained on extensive text corpora, fine-tuned BERT models perform strongly on sentiment analysis tasks. To obtain good performance, however, computational resources, task requirements, and model configuration must be considered carefully.