<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling, sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for the way topic modeling works, the connection to the human meanings of documents. Based on the dataset from assignment three, write a python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [31]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from bertopic import BERTopic

# Load dataset
data = pd.read_csv("labeled_imdb_reviews.csv")
data['Review'] = data['Review'].str.replace('[^\w\s]', '')  # Remove punctuation
data['Review'] = data['Review'].str.lower()  # Convert text to lowercase

# Define text representation using CountVectorizer
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(data['Review'])

# Apply LDA
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_model.fit(X)

# Apply LSA
lsa_model = TruncatedSVD(n_components=10, random_state=42)
lsa_model.fit(X)

# Apply BERTopic

topic_model = BERTopic(min_topic_size=3)  # Increase min_topic_size
topics, _ = topic_model.fit_transform(data['Review'])

# Get top words for each topic in LDA and LSA
def get_top_words(model, vectorizer, n_words=10):
    words = vectorizer.get_feature_names_out()
    topics = []
    for idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[:-n_words - 1:-1]
        topic_words = [words[i] for i in top_words_idx]
        topics.append(topic_words)
    return topics

# Get top words for each topic in LDA and LSA
lda_topics = get_top_words(lda_model, vectorizer)
lsa_topics = get_top_words(lsa_model, vectorizer)

# Print features used for topic modeling
print("Features (text representation) used for topic modeling:")
print(vectorizer.get_feature_names_out())
print()

# Print top 10 clusters/topics for LDA and LSA
print("Top 10 clusters/topics for LDA:")
for idx, topic in enumerate(lda_topics):
    print(f"Cluster {idx+1}: {', '.join(topic)}")
    # Write a human-readable summary for each topic based on the top words and your understanding of the dataset
print()

print("Top 10 clusters/topics for LSA:")
for idx, topic in enumerate(lsa_topics):
    print(f"Cluster {idx+1}: {', '.join(topic)}")
    # Write a human-readable summary for each topic based on the top words and your understanding of the dataset
print()

# Print top topics for BERTopic
print("Top clusters/topics for BERTopic:")
for topic_id in range(max(topics)):
    top_words = topic_model.get_topic(topic_id)
    top_words = [word[0] for word in top_words]  # Extracting only the words from tuples
    print(f"Cluster {topic_id+1}: {', '.join(top_words)}")
    # Write a human-readable summary for each topic based on the top words and your understanding of the dataset


Features (text representation) used for topic modeling:
['007' '10' '100' ... 'young' 'youngsters' 'zoe']

Top 10 clusters/topics for LDA:
Cluster 1: start, career, unforgettable, 10, kids, family, directing, did, perfect, acting
Cluster 2: start, career, unforgettable, 10, kids, family, directing, did, perfect, acting
Cluster 3: movie, potter, harry, film, really, good, like, philosopher, stone, long
Cluster 4: lot, film, production, did, effects, excellent, franchise, bit, dark, cinematography
Cluster 5: film, harry, just, potter, great, book, rowling, world, characters, stone
Cluster 6: movie, potter, harry, kids, movies, series, special, world, going, alan
Cluster 7: movie, harry, potter, film, movies, good, characters, book, effects, wizard
Cluster 8: harry, movie, film, potter, great, like, book, sorcerer, columbus, stone
Cluster 9: start, career, unforgettable, 10, kids, family, directing, did, perfect, acting
Cluster 10: movie, book, story, wasn, books, old, didn, right, watch,

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis also known as opinion mining is a sub field within Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentimental polarities of opinions it contains, e.g., positive, negative, neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a python program to implement a sentiment classifier and evaluate its performance. Notice: **80% data for training and 20% data for testing**.  

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build two sentiment classifiers respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [32]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Load dataset
data = pd.read_csv("labeled_imdb_reviews.csv")

# Split dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(data['Review'], data['Sentiment'], test_size=0.2, random_state=42)

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Supervised learning algorithms: Multinomial Naive Bayes and Logistic Regression
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression()
}

# Evaluate each model
for name, model in models.items():
    print(f"Model: {name}")
    # Train the model
    model.fit(X_train_tfidf, y_train)

    # Predict on the testing data
    y_pred = model.predict(X_test_tfidf)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Print evaluation metrics
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print()


Model: Multinomial Naive Bayes
Accuracy: 0.7000
Precision: 0.4900
Recall: 0.7000
F1 Score: 0.5765

Model: Logistic Regression
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000



Feature selection is an essential step in the sentiment classification process since it helps identify the underlying patterns in text data. I've decided to employ the TF-IDF (Term Frequency-Inverse Document Frequency) features for this work. The significance of each word in a document in relation to a group of documents is represented by TF-IDF. TF-IDF successfully captures the discriminative potential of words by taking into account both a word's frequency inside a document and its rarity over the whole corpus. In order to manage the dimensionality of the feature space and reduce the possibility of overfitting, I have furthermore set a maximum feature count of 5000. By focusing on the most relevant terms and ignoring frequent stop words, this selection method improves the models' overall performance and increases their capacity to generalize to new data.

# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, an supervised learning model. The training data, testing data, and data description files can be download from canvas. Here is an axample for implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct necessary Explatory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Shortly explain why you select those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [36]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Load datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Exploratory Data Analysis (EDA) and Data Cleaning

# Check for missing values
print("Missing values in train dataset:")
print(train_df.isnull().sum())

print("\nMissing values in test dataset:")
print(test_df.isnull().sum())

# Fill missing values
train_df.fillna(method='ffill', inplace=True)
test_df.fillna(method='ffill', inplace=True)

# Select numerical features
numerical_features = train_df.select_dtypes(include=[np.number])

# Split data for training and testing
X = numerical_features.drop(columns=['SalePrice'])
y = numerical_features['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Select a subset of features for the regression model
selected_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF']
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# Develop a regression model
regressor = LinearRegression()
regressor.fit(X_train_selected, y_train)

# Evaluate performance of the regression model
y_pred_train = regressor.predict(X_train_selected)
y_pred_test = regressor.predict(X_test_selected)

print("\nEvaluation metrics for the regression model:")
print("Train set:")
print("Mean Squared Error (MSE):", mean_squared_error(y_train, y_pred_train))
print("R-squared (R2) Score:", r2_score(y_train, y_pred_train))

print("\nTest set:")
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred_test))
print("R-squared (R2) Score:", r2_score(y_test, y_pred_test))

# Further steps could involve cross-validation, hyperparameter tuning, and additional feature engineering for model improvement.


Missing values in train dataset:
Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

Missing values in test dataset:
Id                 0
MSSubClass         0
MSZoning           4
LotFrontage      227
LotArea            0
                ... 
MiscVal            0
MoSold             0
YrSold             0
SaleType           1
SaleCondition      0
Length: 80, dtype: int64

Evaluation metrics for the regression model:
Train set:
Mean Squared Error (MSE): 1491922047.9872394
R-squared (R2) Score: 0.7498684807747998

Test set:
Mean Squared Error (MSE): 1598354833.0864484
R-squared (R2) Score: 0.7916184018889857


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources,  number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform the sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting, NO finetuning is required. Evaluate performance of the model by comparing with the groundtruths (labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [52]:
import pandas as pd
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load data
data = pd.read_csv("labeled_imdb_reviews.csv")
text_data = data["Review"].tolist()
labels = data["Sentiment"].tolist()

# Load DistilBERT model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Tokenize input text
encoded_text = tokenizer(text_data, padding="max_length", truncation=True, return_tensors="pt")

# Make predictions
with torch.no_grad():
    outputs = model(**encoded_text)
    predictions = torch.argmax(outputs.logits, dim=-1).cpu().numpy()
# Convert string labels to numerical labels with handling for missing labels
label_mapping = {"positive": 1, "negative": 0}
default_label = -1  # Default label for missing values
numerical_labels = [label_mapping.get(label, default_label) for label in labels]

# Filter out instances with default_label
filtered_numerical_labels = [label for label in numerical_labels if label != default_label]
filtered_predictions = [prediction for label, prediction in zip(numerical_labels, predictions) if label != default_label]

# Calculate evaluation metrics
accuracy = accuracy_score(filtered_numerical_labels, filtered_predictions)
precision = precision_score(filtered_numerical_labels, filtered_predictions, average='macro')
recall = recall_score(filtered_numerical_labels, filtered_predictions, average='macro')
f1 = f1_score(filtered_numerical_labels, filtered_predictions, average='macro')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)


Accuracy: 0.8235294117647058
Precision: 0.4666666666666667
Recall: 0.4375
F1-score: 0.45161290322580644


DistilBERT, more especially the "distilbert-base-uncased-finetuned-sst-2-english" version, is the PLM that was chosen for sentiment analysis. Hugging Face created the BERT (Bidirectional Encoder Representations from Transformers) model, which is larger and heavier than DistilBERT.

This is a synopsis of the chosen DistilBERT model:

Pretraining Data Sources: BooksCorpus and the English Wikipedia are two of the many text data sets that DistilBERT is pretrained on. The model may acquire broad language representations from these corpora since they have a wide variety of material from different disciplines.
Number of Parameters: With 66 million parameters, the "distilbert-base-uncased-finetuned-sst-2-english" variation is computationally more efficient than the original BERT model.
Task-Specific Fine-Tuning: The Stanford Sentiment Treebank (SST-2) dataset, which comprises of movie reviews tagged with sentiment polarity (positive or negative), is used to fine-tune the model. The pretrained DistilBERT model is adjusted through this fine-tuning process to the particular job of sentiment analysis, allowing it to pick up on task-specific patterns and subtleties in sentiment expression.

Pros and Cons
Because of its reduced size and advantages in efficiency and memory utilization, DistilBERT may be deployed in contexts with limited resources. DistilBERT, which benefits from pretrained representations acquired from extensive text corpora, maintains competitive performance on a variety of NLP tasks while having a smaller capacity than bigger models such as BERT. However, its generic pretrained representations might not adequately capture domain-specific subtleties, and its reliance on fine-tuning using labeled data may offer difficulties, particularly in areas with a dearth of labeled datasets. All things considered, DistilBERT is a flexible option for a range of NLP applications requiring a reasonable amount of processing power since it finds a balance between efficiency and performance.
