
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.


In [None]:
!pip install gensim pandas nltk




In [None]:
pip install bertopic

Note: you may need to restart the kernel to use updated packages.


In [None]:
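# Import necessary libraries

import nltk
import pandas as pd
from bertopic import BERTopic
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download the NLTK resources that word_tokenize and stopwords rely on
# ('punkt_tab' is requested by newer NLTK releases, 'punkt' by older ones)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)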

# Loading my dataset
df = pd.read_csv('sentimentanalysis_reddit_data.csv')

# Preprocessing
stop_words = set(stopwords.words('english'))
df['clean_text'] = df['clean_text'].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word.isalpha() and word not in stop_words]))

# Features (text representation) which are used for topic modeling
documents = df['clean_text'].tolist()

# BERTopic Model
topic_model = BERTopic()

# Fit the model and transform the documents into topics
topics, _ = topic_model.fit_transform(documents)

# Assign topics to the DataFrame
df['topic'] = topics

# 1. Features (text representation) used for topic modeling.
print("Features (text representation) used for topic modeling:")
print("Sentence-transformer document embeddings (BERTopic's default) are clustered; c-TF-IDF is used only to rank the words describing each topic.")

# 2. Top 10 clusters for topic modeling.
print("\nTop 10 clusters for topic modeling:")
top_clusters_info = topic_model.get_topic_freq().head(10)
print(top_clusters_info)

# 3. Summarize and describe the topic for each cluster.
print("\nSummarize and describe the topic for each cluster:")
for i in range(len(top_clusters_info)):
    cluster_id = top_clusters_info.iloc[i]['Topic']
    cluster_df = df[df['topic'] == cluster_id]

    print(f"\nSummary for Cluster {i + 1} (Topic {cluster_id}):")
    print(cluster_df['clean_text'].head())

Features (text representation) used for topic modeling:
Sentence-transformer document embeddings (BERTopic's default) are clustered; c-TF-IDF is used only to rank the words describing each topic.

Top 10 clusters for topic modeling:
    Topic  Count
23      0    300
59      1    200
0       2    200
47      3    185
66     -1    156
45      4    100
55      5    100
60      6    100
44      7    100
88      8    100

Summarize and describe the topic for each cluster:

Summary for Cluster 1 (Topic 0):
24     Welcome Jurassic Park
62     Welcome Jurassic park
90     Welcome Jurassic Park
124    Welcome Jurassic Park
162    Welcome Jurassic park
Name: clean_text, dtype: object

Summary for Cluster 2 (Topic 1):
60     
89     
160    
189    
260    
Name: clean_text, dtype: object

Summary for Cluster 3 (Topic 2):
0      Jurassic Park Deleted Scene
1      Jurassic Park deleted scene
100    Jurassic Park Deleted Scene
101    Jurassic Park deleted scene
200    Jurassic Park Deleted Scene
Name: clean_text, dtype: object

Summary for Cluster 4 (Topic 3):
48     Jurassic World
85   
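The cluster summaries above list the first few documents of each topic. As a complementary view, the sketch below describes each cluster by its most frequent words. It is a minimal, standard-library illustration rather than BERTopic's own ranking; the function name `top_terms_per_topic` and the toy data are assumptions made for this example.

```python
from collections import Counter, defaultdict


def top_terms_per_topic(documents, topics, n_terms=5, skip_outliers=True):
    """Return {topic_id: [most frequent words]} for each topic.

    Assumes `documents` are cleaned, whitespace-separated strings and
    `topics` holds one integer topic id per document. BERTopic labels
    outliers with -1; those are skipped by default.
    """
    counts = defaultdict(Counter)
    for text, topic in zip(documents, topics):
        if skip_outliers and topic == -1:
            continue
        counts[topic].update(text.lower().split())
    return {
        topic: [word for word, _ in counter.most_common(n_terms)]
        for topic, counter in sorted(counts.items())
    }


# Toy data standing in for df['clean_text'] and df['topic'] (hypothetical).
docs = [
    "Welcome Jurassic Park",
    "Jurassic Park Deleted Scene",
    "Welcome Jurassic park",
    "random unrelated post",
]
doc_topics = [0, 2, 0, -1]

summary = top_terms_per_topic(docs, doc_topics, n_terms=2)
print(summary)
```

With a fitted model, `topic_model.get_topic(topic_id)` returns the c-TF-IDF-ranked `(word, score)` pairs for a topic, which is usually a better description than raw counts.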

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: use **80% of the data for training and 20% for testing**.

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two supervised learning algorithms from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
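To make the four metrics concrete for a three-class problem, here is a minimal pure-Python sketch of accuracy and macro-averaged precision, recall, and F1. Macro averaging computes each metric per class and then takes the unweighted mean. The helper name `macro_scores` and the toy labels are assumptions for illustration only.

```python
def macro_scores(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels (hypothetical), not taken from the assignment dataset.
y_true = ["pos", "pos", "neg", "neu", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neu", "neu", "pos"]
scores = macro_scores(y_true, y_pred)
print(scores)
```

Because every class counts equally under macro averaging, a classifier that ignores a minority class is penalized more than plain accuracy would suggest.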

In [None]:
# Import necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Loading my dataset
df = pd.read_csv('sentimentanalysis_reddit_data.csv')

# Features used for sentiment classification
texts = df['clean_text']  # 'clean_text' is the column in my dataset with the cleaned text
labels = df['sentiment']  # 'sentiment' is the column in my dataset with the sentiment labels

# Split the data into training 80% and testing 20% sets
texts_train, texts_test, labels_train, labels_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Vectorize the text using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(texts_train)
X_test_vectorized = vectorizer.transform(texts_test)

# 1. Features: bag-of-words token counts produced by CountVectorizer
# CountVectorizer converts each text into a vector of token counts, a simple and effective representation for short texts.

# 2. Two supervised learning algorithms are compared: Multinomial Naive Bayes and Logistic Regression
# 3. Each is evaluated with 5-fold cross-validation on the training split

# Multinomial Naive Bayes
nb_classifier = MultinomialNB()

# Cross-validation
nb_cv_accuracy = cross_val_score(nb_classifier, X_train_vectorized, labels_train, cv=5, scoring='accuracy').mean()
nb_cv_precision = cross_val_score(nb_classifier, X_train_vectorized, labels_train, cv=5, scoring='precision_macro').mean()
nb_cv_recall = cross_val_score(nb_classifier, X_train_vectorized, labels_train, cv=5, scoring='recall_macro').mean()
nb_cv_f1 = cross_val_score(nb_classifier, X_train_vectorized, labels_train, cv=5, scoring='f1_macro').mean()

# Logistic Regression
lr_classifier = LogisticRegression()

# Cross-validation
lr_cv_accuracy = cross_val_score(lr_classifier, X_train_vectorized, labels_train, cv=5, scoring='accuracy').mean()
lr_cv_precision = cross_val_score(lr_classifier, X_train_vectorized, labels_train, cv=5, scoring='precision_macro').mean()
lr_cv_recall = cross_val_score(lr_classifier, X_train_vectorized, labels_train, cv=5, scoring='recall_macro').mean()
lr_cv_f1 = cross_val_score(lr_classifier, X_train_vectorized, labels_train, cv=5, scoring='f1_macro').mean()

# Print the performance metrics
print("Performance Metrics for Multinomial Naive Bayes:")
print(f"Accuracy: {nb_cv_accuracy}")
print(f"Precision: {nb_cv_precision}")
print(f"Recall: {nb_cv_recall}")
print(f"F1 Score: {nb_cv_f1}")
print("\n")

print("Performance Metrics for Logistic Regression:")
print(f"Accuracy: {lr_cv_accuracy}")
print(f"Precision: {lr_cv_precision}")
print(f"Recall: {lr_cv_recall}")
print(f"F1 Score: {lr_cv_f1}")

Performance Metrics for Multinomial Naive Bayes:
Accuracy: 0.99825
Precision: 0.9978243978243977
Recall: 0.9969196919691969
F1 Score: 0.9973175782417488


Performance Metrics for Logistic Regression:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
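Scores this close to 1.0 deserve a sanity check. The topic-modeling output earlier in this notebook shows the same titles recurring at indices 0, 100, 200, and so on, so identical texts may sit on both sides of a random split or cross-validation fold. The sketch below measures that overlap; `duplicate_overlap` and the toy lists are assumptions for illustration, and whether the real scores are inflated this way has not been verified here.

```python
def duplicate_overlap(train_texts, test_texts):
    """Fraction of test texts that also occur (case-insensitively) in train."""
    seen = {text.strip().lower() for text in train_texts}
    test_texts = list(test_texts)
    if not test_texts:
        return 0.0
    hits = sum(text.strip().lower() in seen for text in test_texts)
    return hits / len(test_texts)


# Toy split (hypothetical) standing in for texts_train / texts_test.
train = ["Jurassic Park Deleted Scene", "Welcome to Jurassic Park"]
test = ["Jurassic Park deleted scene", "New Poster for Jurassic World"]

overlap = duplicate_overlap(train, test)
print(f"Share of test texts already seen in training: {overlap:.0%}")
```

If the overlap is high, one option is to call `df.drop_duplicates(subset='clean_text')` before `train_test_split`, so that the evaluation reflects unseen texts.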


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Loading the training dataset
train_path = "C:\\Users\\Nahid\\Downloads\\train.csv"
train_data = pd.read_csv(train_path)

# Loading the testing dataset
test_path = "C:\\Users\\Nahid\\Downloads\\test.csv"
test_data = pd.read_csv(test_path)

# Identify features (X) and target variable (y) in the training data
X_train = train_data.drop('SalePrice', axis=1)
y_train = train_data['SalePrice']

# Define numerical and categorical features
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

# Create transformers for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create the final pipeline with the model
model = LinearRegression()

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('model', model)])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions on the testing set
predictions = pipeline.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
rmse = mse**0.5
print(f'Root Mean Squared Error (RMSE): {rmse}')

# Now, use the trained model to predict house prices on the testing data
test_predictions = pipeline.predict(test_data)


Root Mean Squared Error (RMSE): 29477.1519862372


In [None]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import make_scorer

# Loading the training dataset
train_path = "C:\\Users\\Nahid\\Downloads\\train.csv"
train_data = pd.read_csv(train_path)

# Identify features (X) and target variable (y) in the training data
X_train = train_data.drop('SalePrice', axis=1)
y_train = train_data['SalePrice']

# Identify categorical and numerical features
categorical_features = X_train.select_dtypes(include=['object']).columns
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns

# Create a preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the pipeline with preprocessing steps and the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])

# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse_cv_scores = (-cv_scores)**0.5

print("Cross-Validation RMSE Scores:", rmse_cv_scores)
print("Mean RMSE:", rmse_cv_scores.mean())


Cross-Validation RMSE Scores: [26829.02988336 31781.67684855 31536.32759815 24698.39039142
 35263.77238848]
Mean RMSE: 30021.83942199284
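The linear regression RMSE comes from a single 80/20 hold-out split, while the random forest RMSE is a 5-fold average, so the two figures are not strictly comparable. A rough way to read the gap is to set it against the spread of the fold scores. The sketch below does this with the numbers printed above; the variable names are assumptions for illustration.

```python
from statistics import mean, stdev

# Values copied from the outputs above.
fold_rmse = [
    26829.02988336,
    31781.67684855,
    31536.32759815,
    24698.39039142,
    35263.77238848,
]
linear_holdout_rmse = 29477.1519862372

rf_mean = mean(fold_rmse)
rf_spread = stdev(fold_rmse)  # sample standard deviation across folds
gap = rf_mean - linear_holdout_rmse
within_spread = abs(gap) < rf_spread

print(f"Random forest CV RMSE: {rf_mean:.2f} +/- {rf_spread:.2f}")
print(f"Gap to linear hold-out RMSE: {gap:.2f}")
print(f"Gap smaller than fold-to-fold spread: {within_spread}")
```

When the gap is smaller than the fold-to-fold spread, the results do not clearly favor either model; running both pipelines through the same cross-validation folds would give a fairer comparison.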


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **pre-trained Large Language Model (LLM) from the Hugging Face Repository** for your specific task using the data collected in Assignment 3. After creating an account on Hugging Face (https://huggingface.co/), choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any Meta-based text analysis model. Provide a brief description of the selected LLM, including its original sources, significant parameters, and any task-specific fine-tuning if applied.

Perform a detailed analysis of the LLM's performance on your task, including key metrics, strengths, and limitations. Additionally, discuss any challenges encountered during the implementation and potential strategies for improvement. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [None]:
import pandas as pd
from transformers import pipeline

# Loading my dataset
df = pd.read_csv('sentimentanalysis_reddit_data.csv')

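# Drop rows with missing text or sentiment labels
df = df.dropna(subset=['clean_text', 'sentiment'])

# Define the classification pipeline using distilroberta-base.
# Note: distilroberta-base is a base checkpoint with no fine-tuned classification head,
# so the pipeline attaches a randomly initialized head (see the warning in the output)
# and its LABEL_* predictions do not correspond to real sentiments or emotions.
emotion_pipeline = pipeline("sentiment-analysis", model="distilroberta-base")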

# Function to map sentiment labels to emotions
def map_sentiment_to_emotion(sentiment_label):

    #Map positive sentiment to joy, negative sentiment to sadness, and neutral to neutral
    if sentiment_label == 'positive':
        return 'joy'
    elif sentiment_label == 'negative':
        return 'sadness'
    else:
        return 'neutral'

# Apply emotion detection to each clean_text
df['emotion'] = df['sentiment'].apply(lambda x: map_sentiment_to_emotion(x.lower()))
df['predicted_emotion'] = df['clean_text'].apply(lambda x: emotion_pipeline(x)[0]['label'])

# Display the results
print(df[['clean_text', 'sentiment', 'emotion', 'predicted_emotion']])

# Save the results to a new CSV file
df.to_csv('sentiment_analysis_results.csv', index=False)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


                                             clean_text sentiment  emotion  \
0                           Jurassic Park Deleted Scene   neutral  neutral   
1                           Jurassic Park deleted scene   neutral  neutral   
2     So, I've removed some animations from Jurassic...   neutral  neutral   
3     During the filming of Jurassic Park (1993), T-...   neutral  neutral   
4     My daughter watching Jurassic Bark for the fir...   neutral  neutral   
...                                                 ...       ...      ...   
9995  People would probably still visit Jurassic Par...  positive      joy   
9996  In Jurassic Park, you can see Dr. Wu erasing a...  negative  sadness   
9997                       Jurassic Parks & Recreation.   neutral  neutral   
9998                   Shitty New Jurassic World Poster  negative  sadness   
9999          New Poster for 'Jurassic World: Dominion'   neutral  neutral   

     predicted_emotion  
0              LABEL_1  
1            

In [None]:
# Checking the distribution of sentiment labels
print("Sentiment Label Distribution:")
print(df['sentiment'].value_counts())

# Checking the distribution of predicted emotion labels
print("\nPredicted Emotion Label Distribution:")
print(df['predicted_emotion'].value_counts())



Sentiment Label Distribution:
sentiment
neutral     5500
positive    2600
negative    1900
Name: count, dtype: int64

Predicted Emotion Label Distribution:
predicted_emotion
LABEL_1    10000
Name: count, dtype: int64
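Every one of the 10,000 rows receives `LABEL_1`, so the pipeline behaves as a constant predictor. Whichever sentiment `LABEL_1` is mapped to, its accuracy cannot exceed the share of the largest class. The sketch below computes that ceiling from the label counts printed above; the variable names are assumptions for illustration.

```python
from collections import Counter

# Label counts copied from the output above.
label_counts = Counter({"neutral": 5500, "positive": 2600, "negative": 1900})
total = sum(label_counts.values())

# Best case for a constant predictor: always guess the most frequent class.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

# Accuracy obtained if LABEL_1 were mapped to each class in turn.
accuracy_if_mapped_to = {
    label: count / total for label, count in label_counts.items()
}

print(f"Majority class: {majority_label} ({baseline_accuracy:.0%})")
print(accuracy_if_mapped_to)
```

Any model worth reporting on this dataset should beat this majority-class baseline.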


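Brief Description: The distilroberta-base model is hosted on the Hugging Face Hub and is based on the RoBERTa architecture. RoBERTa (Robustly Optimized BERT Pretraining Approach) is a modification of BERT (Bidirectional Encoder Representations from Transformers) introduced by Facebook AI in 2019. DistilRoBERTa is a distilled version of RoBERTa: a smaller, faster variant that retains much of the original model's performance. No task-specific fine-tuning was applied in this notebook.

Significant Parameters: The distilroberta-base model has about 82 million parameters, with 6 layers, a hidden size of 768, and 12 attention heads.

Strengths: The distilroberta-base model is less computationally intensive than larger models such as RoBERTa, which makes it faster and more memory-efficient.
It inherits much of RoBERTa's strong performance on natural language understanding tasks once it has been fine-tuned for them.

Limitations: Smaller models like DistilRoBERTa may not capture as much context and nuance as larger models.
Performance also depends heavily on the quality and representativeness of the training data.
Most importantly for this task, distilroberta-base ships without a trained classification head. As the warning in the output states, the head is newly initialized, and the pipeline assigned LABEL_1 to all 10,000 texts. These predictions therefore cannot be compared meaningfully with the sentiment labels. A checkpoint already fine-tuned for sentiment or emotion classification, or fine-tuning this model on the labeled data, would be needed before accuracy, precision, recall, or F1 could be reported.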


