
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import LdaModel
from gensim.corpora import Dictionary

In [3]:
# Load the dataset
df = pd.read_csv('Antman_reviews_annotated.csv')

In [4]:
# Preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [5]:
def preprocess_text(text):
    tokens = word_tokenize(text)
    # Remove stopwords and single character tokens
    tokens = [word for word in tokens if word.lower() not in stop_words and len(word) > 1]
    # Lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmatized_tokens

In [6]:
df['clean_text'] = df['clean_text'].apply(preprocess_text)

In [7]:
# Collect the preprocessed documents as a list of token lists
clean_text_series = df['clean_text'].tolist()

# Initialize Dictionary
dictionary = Dictionary(clean_text_series)

# Convert text data to bag-of-words representation
corpus = [dictionary.doc2bow(text) for text in clean_text_series]
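
To make the bag-of-words feature concrete, here is a minimal plain-Python sketch of what the dictionary and `doc2bow` steps produce. It is not gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy documents are illustrative assumptions, and gensim may assign different integer ids.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical toy documents, already tokenized like df['clean_text']
docs = [["movi", "kang", "movi"], ["kang", "villain"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]

print(vocab)  # {'movi': 0, 'kang': 1, 'villain': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a sparse list of (token id, count) pairs, which is the representation the LDA model is trained on.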

In [8]:
# Train LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)





In [9]:
# Get the summarized topics
topics_summary = lda_model.show_topics(num_topics=10, num_words=10)

In [10]:
# Print the summarized topics
print("Summarized Topics:")
for topic_idx, summary in enumerate(topics_summary):
    print(f"Topic {topic_idx}: {summary}")

Summarized Topics:
Topic 0: (0, '0.032*"movi" + 0.014*"charact" + 0.013*"marvel" + 0.011*"kang" + 0.010*"quantum" + 0.009*"mcu" + 0.009*"like" + 0.009*"set" + 0.009*"see" + 0.008*"film"')
Topic 1: (1, '0.025*"movi" + 0.016*"time" + 0.013*"charact" + 0.012*"moment" + 0.012*"lot" + 0.011*"realli" + 0.010*"make" + 0.009*"action" + 0.009*"marvel" + 0.008*"help"')
Topic 2: (2, '0.044*"movi" + 0.021*"charact" + 0.019*"phase" + 0.014*"could" + 0.014*"marvel" + 0.012*"kang" + 0.011*"antman" + 0.010*"see" + 0.010*"get" + 0.010*"like"')
Topic 3: (3, '0.020*"film" + 0.016*"mcu" + 0.012*"marvel" + 0.012*"realli" + 0.011*"one" + 0.011*"movi" + 0.010*"hope" + 0.009*"kang" + 0.009*"someth" + 0.009*"even"')
Topic 4: (4, '0.025*"movi" + 0.020*"mcu" + 0.017*"like" + 0.013*"kang" + 0.012*"film" + 0.010*"im" + 0.010*"major" + 0.010*"antman" + 0.009*"much" + 0.009*"charact"')
Topic 5: (5, '0.024*"film" + 0.016*"charact" + 0.013*"one" + 0.013*"movi" + 0.013*"like" + 0.012*"villain" + 0.009*"didnt" + 0.009*"
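
When summarizing each cluster, it can help to turn the printed topic strings into plain word lists. The sketch below parses one of the strings shown above with a regular expression; the helper name `top_words` is an illustrative assumption.

```python
import re

def top_words(topic_string):
    """Extract (word, weight) pairs from a string like '0.032*"movi" + ...'."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 0 exactly as printed in the output above
topic_0 = ('0.032*"movi" + 0.014*"charact" + 0.013*"marvel" + 0.011*"kang" + '
           '0.010*"quantum" + 0.009*"mcu" + 0.009*"like" + 0.009*"set" + '
           '0.009*"see" + 0.008*"film"')

words = top_words(topic_0)
print([word for word, _ in words])
```

A word list like this makes it easier to see how much the topics overlap: stems such as "movi", "charact", and "marvel" appear in most of the topics printed above, so the clusters may be hard to tell apart without further stopword filtering.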

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two supervised learning algorithms/models from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance of the two algorithms you selected in terms of accuracy, precision, recall, and F1 score. The test set must be used for model evaluation in this step. Here is the reference for how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [11]:
# Write your code here
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

In [12]:
# Load the dataset
data = pd.read_csv("Antman_reviews_annotated.csv")

In [13]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['clean_text'], data['sentiment'], test_size=0.2, random_state=42)


In [14]:
# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [15]:
# Select two supervised learning algorithms
svm_classifier = LinearSVC()
nb_classifier = MultinomialNB()

In [16]:
# Train and evaluate SVM classifier
svm_classifier.fit(X_train_tfidf, y_train)
svm_pred = svm_classifier.predict(X_test_tfidf)

svm_accuracy = accuracy_score(y_test, svm_pred)
svm_precision = precision_score(y_test, svm_pred, average='weighted')
svm_recall = recall_score(y_test, svm_pred, average='weighted')
svm_f1 = f1_score(y_test, svm_pred, average='weighted')

In [17]:
# Train and evaluate Naive Bayes classifier
nb_classifier.fit(X_train_tfidf, y_train)
nb_pred = nb_classifier.predict(X_test_tfidf)

nb_accuracy = accuracy_score(y_test, nb_pred)
nb_precision = precision_score(y_test, nb_pred, average='weighted')
nb_recall = recall_score(y_test, nb_pred, average='weighted')
nb_f1 = f1_score(y_test, nb_pred, average='weighted')

In [18]:
# Cross-validation
svm_cv_scores = cross_val_score(svm_classifier, X_train_tfidf, y_train, cv=5)
nb_cv_scores = cross_val_score(nb_classifier, X_train_tfidf, y_train, cv=5)
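
The cross-validation above scores the classifiers on features from a vectorizer that was already fitted on the whole training set, so every validation fold has influenced the vocabulary and IDF weights. One common way to avoid this is to wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that both are refitted inside each fold. The sketch below uses hypothetical toy reviews and labels, not the assignment data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 10 positive and 10 negative reviews
texts = [f"great fun movi love kang {i}" for i in range(10)] + \
        [f"bad bore movi hate plot {i}" for i in range(10)]
labels = ["positive"] * 10 + ["negative"] * 10

# The vectorizer is part of the model, so it is fitted on each training fold only
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean())
```

Scoring with `f1_macro` instead of the default accuracy weights both classes equally, which matters when one class is much rarer than the other.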

In [19]:
# Print results
print("SVM Classifier Performance:")
print("Accuracy:", svm_accuracy)
print("Precision:", svm_precision)
print("Recall:", svm_recall)
print("F1 Score:", svm_f1)
print("Cross-validation Scores:", svm_cv_scores)

print("\nNaive Bayes Classifier Performance:")
print("Accuracy:", nb_accuracy)
print("Precision:", nb_precision)
print("Recall:", nb_recall)
print("F1 Score:", nb_f1)
print("Cross-validation Scores:", nb_cv_scores)

SVM Classifier Performance:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Cross-validation Scores: [1. 1. 1. 1. 1.]

Naive Bayes Classifier Performance:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Cross-validation Scores: [1. 1. 1. 1. 1.]
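
Perfect scores on every metric and every fold are worth checking against a trivial baseline, because the labels are heavily imbalanced (3,197 positive and 278 negative reviews, per the sentiment distribution printed later in this notebook). The sketch below computes, in plain Python, what a classifier that always predicts `positive` would score on those counts. The function name `prf` is an illustrative assumption.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

n_pos, n_neg = 3197, 278  # class counts from the sentiment distribution

# Baseline: predict "positive" for every review
accuracy = n_pos / (n_pos + n_neg)
pos = prf(tp=n_pos, fp=n_neg, fn=0)   # metrics for the positive class
neg = prf(tp=0, fp=0, fn=n_neg)       # metrics for the negative class
macro_f1 = (pos[2] + neg[2]) / 2

print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```

On these counts the baseline reaches 0.92 accuracy with a macro F1 below 0.5, so per-class or macro-averaged metrics say more about the negative class than accuracy or weighted averages do.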


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split the data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_squared_error


In [21]:
# Load the data
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [22]:
# Exploratory Data Analysis (EDA)
print("Train Data Info:")
print(df_train.info())
print("\nTest Data Info:")
print(df_test.info())

Train Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int6

In [23]:
# Data Cleaning
# Drop columns with too many missing values
missing_threshold = 0.5
missing_cols_train = df_train.columns[df_train.isnull().mean() > missing_threshold]
missing_cols_test = df_test.columns[df_test.isnull().mean() > missing_threshold]
df_train.drop(columns=missing_cols_train, inplace=True)
df_test.drop(columns=missing_cols_test, inplace=True)

In [24]:
# Fill missing values with mean for numerical columns
num_cols_train = df_train.select_dtypes(include=np.number).columns
num_cols_test = df_test.select_dtypes(include=np.number).columns
df_train[num_cols_train] = df_train[num_cols_train].fillna(df_train[num_cols_train].mean())
df_test[num_cols_test] = df_test[num_cols_test].fillna(df_test[num_cols_test].mean())

In [25]:
# Selecting Features
# Selecting numeric features
numeric_features = df_train.select_dtypes(include=[np.number]).columns.tolist()
selected_features = numeric_features[:10]  # Selecting the first 10 numeric features
print("\nSelected Features:", selected_features)


Selected Features: ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1']
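
Taking the first ten numeric columns includes `Id`, which is a row identifier rather than a property of the house. A possible alternative, sketched below on synthetic data, ranks numeric columns by absolute Pearson correlation with `SalePrice` and skips identifier columns. The helper `top_correlated_features`, the `Noise` column, and the generated values are illustrative assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

def top_correlated_features(df, target, k=10, exclude=("Id",)):
    """Return the k numeric columns most correlated (in absolute value) with target."""
    numeric = df.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(labels=[target])
    corr = corr.drop(labels=[c for c in exclude if c in corr.index])
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Synthetic stand-in for the training data (values are made up)
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, n)
lot_area = rng.normal(10000, 3000, n)
toy = pd.DataFrame({
    "Id": np.arange(n),
    "OverallQual": qual,
    "LotArea": lot_area,
    "Noise": rng.normal(0, 1, n),
    "SalePrice": 20000 * qual + 10 * lot_area + rng.normal(0, 5000, n),
})

selected = top_correlated_features(toy, target="SalePrice", k=2)
print(selected)
```

Ranking by correlation only captures linear, one-variable-at-a-time relationships, so it is a starting point for the feature explanation the question asks for rather than a complete selection procedure.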


In [26]:
# Prepare data for training
X_train = df_train[selected_features]
y_train = df_train['SalePrice']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [27]:
# Develop a regression model
regression = LinearRegression()
regression.fit(X_train, y_train)


In [28]:
# Make predictions on the test set
y_pred = regression.predict(X_test)

In [29]:
# Evaluate the model
r_squared = regression.score(X_test, y_test)
print('\nLinear Regression R-squared:', r_squared)



Linear Regression R-squared: 0.7525976284545081


In [30]:
# Calculate root mean squared error
mse = mean_squared_error(y_test, y_pred)  # scikit-learn's argument order is (y_true, y_pred)
rmse = np.sqrt(mse)
print('Root Mean Squared Error:', rmse)

Root Mean Squared Error: 43562.10387694369
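
For reference, both reported metrics can be computed directly from their definitions: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical toy values rather than the house-price predictions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical toy targets and predictions
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print(rmse(y_true, y_pred))       # 10.0
print(r_squared(y_true, y_pred))  # 0.992
```

RMSE is expressed in the same units as the target, so the value reported above means the predictions are off by roughly 43,562 in `SalePrice` units on the held-out split.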


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources,  number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [31]:
import pandas as pd
from transformers import pipeline

In [32]:
# Load the dataset
df = pd.read_csv('Antman_reviews_annotated.csv')


In [33]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in the dataset:")
print(missing_values)

Missing values in the dataset:
document_id    0
clean_text     0
sentiment      0
dtype: int64


In [34]:
# Drop rows with missing values in 'clean_text' and 'sentiment' columns
df = df.dropna(subset=['clean_text', 'sentiment'])

In [35]:
# Initialize sentiment analysis pipeline
emotion_pipeline = pipeline("sentiment-analysis", model="distilroberta-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
# Function to map sentiment label to emotion
def map_sentiment_to_emotion(sentiment_label):
    if sentiment_label == 'positive':
        return 'joy'
    elif sentiment_label == 'negative':
        return 'sadness'
    else:
        return 'neutral'


In [37]:
# Predict a label for a single text with the pipeline
def predict_emotion(text):
    # Keep only the first 512 characters. This is a character-level cut,
    # not a token-level one; the model's own limit is 512 tokens.
    max_seq_length = 512
    truncated_text = text[:max_seq_length]
    # Return the pipeline's top label (a generic name such as 'LABEL_0' for this checkpoint)
    return emotion_pipeline(truncated_text)[0]['label']

In [38]:
df['emotion'] = df['sentiment'].apply(lambda x: map_sentiment_to_emotion(x.lower()))
df['predicted_emotion'] = df['clean_text'].apply(predict_emotion)

In [39]:
# Display the results
print("\nProcessed DataFrame:")
print(df[['clean_text', 'sentiment', 'emotion', 'predicted_emotion']])


Processed DataFrame:
                                             clean_text sentiment emotion  \
0     a huge fan first one almost big fan second one...  positive     joy   
1     after entri phase pas without much set next bi...  positive     joy   
2     well happen the mcu run ga the last mcu film l...  positive     joy   
3     well ill start say wasnt bad movi it wasnt gre...  positive     joy   
4     i enjoy watch quantumania it mostli solid fair...  positive     joy   
...                                                 ...       ...     ...   
3470  thi film unspeak badit actual wors etern becau...  positive     joy   
3471  a fun onei terrif time watch antman wasp quant...  positive     joy   
3472  a mani other point far heyday peak mcu movi an...  positive     joy   
3473  the mcu current state absolut mess up endgam w...  positive     joy   
3474  so im go say great marvel film howev i also wo...  positive     joy   

     predicted_emotion  
0              LABEL_0  
1  

In [40]:
# Display sentiment distribution
print("\nSentiment distribution:")
print(df['sentiment'].value_counts())


Sentiment distribution:
sentiment
positive    3197
negative     278
Name: count, dtype: int64


In [41]:
# Display predicted emotion label distribution
print("\nPredicted Emotion Label Distribution:")
print(df['predicted_emotion'].value_counts())


Predicted Emotion Label Distribution:
predicted_emotion
LABEL_0    3475
Name: count, dtype: int64


In [42]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate evaluation metrics
accuracy = accuracy_score(df['emotion'], df['predicted_emotion'])
precision = precision_score(df['emotion'], df['predicted_emotion'], average='weighted')
recall = recall_score(df['emotion'], df['predicted_emotion'], average='weighted')
f1 = f1_score(df['emotion'], df['predicted_emotion'], average='weighted')


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [43]:
# Display evaluation metrics
print("\nEvaluation Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Evaluation Metrics:
Accuracy: 0.0
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
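
The zero scores come from comparing strings that can never match: the gold column holds emotion names such as `joy` while every prediction is `LABEL_0`. A small guard like the one below catches this before any metric is computed. It is a sketch under assumptions: the `POSITIVE`/`NEGATIVE` names stand in for whatever labels a sentiment-tuned checkpoint emits, and `LABEL_MAP` and `align_labels` are hypothetical names.

```python
# Hypothetical mapping from a fine-tuned checkpoint's labels to the dataset's labels
LABEL_MAP = {"POSITIVE": "positive", "NEGATIVE": "negative"}

def align_labels(predicted, gold, label_map):
    """Map predicted labels into the gold label space, failing loudly on unknowns."""
    unknown = set(predicted) - set(label_map)
    if unknown:
        raise ValueError(f"Unmapped predicted labels: {sorted(unknown)}")
    mapped = [label_map[p] for p in predicted]
    if not set(mapped) <= set(gold):
        raise ValueError("Mapped labels are not a subset of the gold labels")
    return mapped

gold = ["positive", "negative", "positive"]
mapped = align_labels(["POSITIVE", "NEGATIVE", "NEGATIVE"], gold, LABEL_MAP)
accuracy = sum(m == g for m, g in zip(mapped, gold)) / len(gold)
print(mapped, accuracy)

# The notebook's situation: generic labels from an untrained head are rejected
try:
    align_labels(["LABEL_0"] * 3, gold, LABEL_MAP)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Scoring directly against the annotated `sentiment` column, rather than an intermediate emotion column, removes one of the two mappings and one place for the label spaces to drift apart.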


DistilRoBERTa is a distilled version of RoBERTa-base. RoBERTa was pretrained with masked language modeling only (it drops BERT's next sentence prediction objective) on roughly 160 GB of English text: BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories. DistilRoBERTa was trained by knowledge distillation from RoBERTa-base on OpenWebTextCorpus, about four times less data than its teacher saw. It has 6 layers, a hidden size of 768, and 12 attention heads, for about 82 million parameters, compared with about 125 million for RoBERTa-base.

The `distilroberta-base` checkpoint used here is not fine-tuned for sentiment analysis. The Transformers `sentiment-analysis` pipeline does not fine-tune anything: it loads the checkpoint and attaches a sequence classification head, and because this checkpoint ships no such head, the head's weights are newly initialized, as the warning printed when the pipeline was created states. The pipeline therefore returns generic labels, and every one of the 3,475 reviews was assigned `LABEL_0`. The code maps the annotated sentiment labels to `joy`, `sadness`, or `neutral`, and since those strings never match `LABEL_0`, all four metrics come out as 0.0. The scores reflect this label mismatch and the untrained head rather than the model's ability to detect sentiment.

Advantages:

Efficiency: DistilRoBERTa is crafted to be more lightweight and computationally efficient compared to the original RoBERTa model. This quality makes it well-suited for deployment in scenarios where resources are limited, or when there's a need for faster inference times.

Pretraining on Web-Scale Data: DistilRoBERTa is distilled on OpenWebTextCorpus from a teacher, RoBERTa-base, that was itself pretrained on a large and varied English corpus. This helps the model capture a broad range of linguistic patterns, which is a useful starting point for tasks such as sentiment analysis.

Adaptable to Sentiment Analysis: DistilRoBERTa can be fine-tuned on labeled sentiment data so that it learns sentiment-related features. That step was not performed here, and the base checkpoint has no trained classification head, so this advantage applies only once a sentiment-tuned checkpoint is used.

Transfer Learning: As a pretrained language model, DistilRoBERTa leverages transfer learning to apply the knowledge gained during pretraining to new tasks. This capability allows the model to achieve promising results on sentiment analysis tasks even with limited task-specific data and training.

Disadvantages:

Simplified Representation: DistilRoBERTa achieves efficiency by reducing parameters and compressing the original RoBERTa model. However, this simplification may lead to loss of fine-grained linguistic nuances, possibly affecting performance on complex tasks.

Limited Training Data for Fine-Tuning: Fine-tuning DistilRoBERTa typically requires a smaller labeled dataset specific to the task, such as sentiment analysis. Obtaining a sufficiently diverse and sizable dataset for fine-tuning can be challenging, especially for niche or domain-specific tasks.

Interpretability: Like most deep learning models, understanding DistilRoBERTa's internal workings can be complex. This lack of interpretability makes it difficult to comprehend specific predictions, which may raise concerns in applications requiring transparency or accountability.

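Challenges:

Model Selection: Choosing a suitable pretrained checkpoint for sentiment analysis means weighing model size, computational resources, and task complexity, and also checking that the checkpoint has actually been fine-tuned for classification. The `distilroberta-base` checkpoint has not, which is why every prediction was `LABEL_0`.

Label Alignment: The pipeline's output labels must be mapped to the annotated labels before scoring. Comparing `LABEL_0` against `joy` and `sadness` produced an accuracy of 0.0 regardless of what the model predicted.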