<a href="https://colab.research.google.com/github/Madhu-3499/DataScienceEssentials/blob/main/Surisetti_Madhu_INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.
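
As an illustration of item 1, a bag-of-words representation maps each tokenized document to (token id, count) pairs, which is the form gensim's `Dictionary.doc2bow` returns. The following is a minimal standard-library sketch, not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helper names, the two toy documents are made up, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Made-up toy corpus, already tokenized.
docs = [["fast", "car", "fast"], ["bad", "movie", "car"]]
vocab = build_vocabulary(docs)
bows = [to_bow(doc, vocab) for doc in docs]

print(vocab)
print(bows)
```

Word order is discarded in this representation; only the counts per document reach the topic model.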


In [6]:
# Write your code here
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import LdaModel
from gensim.corpora import Dictionary

# Download necessary nltk resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv('cleaned_reviews.csv')

# Preprocessing function
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.lower() not in stop_words and len(word) > 1]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmatized_tokens

# Apply preprocessing
df['clean_text'] = df['cleaned_text'].apply(preprocess_text)  # 'cleaned_text' is the text column in cleaned_reviews.csv

# Prepare the text data
texts = df['clean_text'].tolist()
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=15)

# Display the topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


(0, '0.015*"realli" + 0.010*"movi" + 0.010*"bad" + 0.010*"diesel" + 0.010*"good" + 0.010*"wors" + 0.010*"couldnt" + 0.010*"mediocr" + 0.010*"isnt" + 0.010*"enjoy"')
(1, '0.025*"race" + 0.013*"even" + 0.012*"franchis" + 0.010*"film" + 0.010*"end" + 0.008*"tri" + 0.007*"one" + 0.007*"scene" + 0.006*"fast" + 0.006*"stori"')
(2, '0.001*"momoa" + 0.001*"film" + 0.001*"race" + 0.001*"bad" + 0.001*"one" + 0.001*"see" + 0.001*"start" + 0.001*"get" + 0.001*"movi" + 0.001*"fast"')
(3, '0.039*"movi" + 0.014*"charact" + 0.014*"action" + 0.014*"like" + 0.014*"scene" + 0.012*"make" + 0.011*"actual" + 0.011*"fast" + 0.010*"dont" + 0.010*"even"')
(4, '0.020*"movi" + 0.015*"action" + 0.015*"fast" + 0.010*"bad" + 0.010*"scene" + 0.008*"one" + 0.008*"watch" + 0.008*"car" + 0.007*"even" + 0.007*"get"')
(5, '0.016*"movi" + 0.014*"one" + 0.009*"screw" + 0.009*"car" + 0.009*"make" + 0.008*"famili" + 0.008*"scene" + 0.008*"good" + 0.008*"bad" + 0.007*"time"')
(6, '0.001*"film" + 0.001*"stori" + 0.001*"charact
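
To summarize each cluster, it can help to turn the printed topic strings into plain word lists. The sketch below parses one topic string copied from the output above; `parse_topic` is a hypothetical helper, and the regular expression assumes the `weight*"word"` layout shown in that output. The tokens appear stemmed (for example `franchis`, `stori`) because they arrive that way from the `cleaned_text` column.

```python
import re


def parse_topic(topic_string):
    """Split a gensim-style topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Topic 1 as printed in the output above.
topic_1 = ('0.025*"race" + 0.013*"even" + 0.012*"franchis" + 0.010*"film" + '
           '0.010*"end" + 0.008*"tri" + 0.007*"one" + 0.007*"scene" + '
           '0.006*"fast" + 0.006*"stori"')

terms = parse_topic(topic_1)
top_words = [word for word, _ in terms[:5]]
print(top_words)
```

Topics whose listed weights are all 0.001, such as topic 2 in the output, carry little distinguishing signal in the words shown and are hard to label from them.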

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% of the data is for training and 20% is for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) to build two sentiment classifiers. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
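
The cross-validation requirement in item 2 can be pictured as follows: the training data is cut into k folds, and each fold serves once as the validation set while the remaining folds are used for fitting. This is a minimal sketch of plain, unshuffled k-fold splitting with a hypothetical helper `k_fold_indices`; it ignores stratification, which scikit-learn applies by default when `cross_val_score` is given a classifier and an integer `cv`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("validate on", validation, "train on", train)
```

Every sample lands in exactly one validation fold, so the mean of the k scores uses all of the training data without ever scoring a model on rows it was fitted on.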

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import make_pipeline

# Load the dataset
df = pd.read_csv('cleaned_reviews.csv')

# If 'rating' is used for sentiment, ensure it's converted to a binary format, e.g., positive (1) or negative (0)
# Modify this section as per actual sentiment representation
# For simplicity, let's assume ratings above 3 are positive (1), and 3 or below are negative (0)
df['sentiment'] = df['rating'].apply(lambda x: 1 if x > 3 else 0)

# Splitting the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(
    df['review_text'], df['sentiment'], test_size=0.2, random_state=42)

# Feature extraction: TF-IDF converts the text to numeric features.
# Each pipeline gets its own vectorizer so the two models do not share fitted state.
def make_vectorizer():
    return TfidfVectorizer(stop_words='english', max_features=1000)

# Model 1: Logistic Regression
lr_pipeline = make_pipeline(make_vectorizer(), LogisticRegression())
lr_pipeline.fit(train_data, train_labels)

# Model 2: Multinomial Naive Bayes
nb_pipeline = make_pipeline(make_vectorizer(), MultinomialNB())
nb_pipeline.fit(train_data, train_labels)

# Evaluate with cross-validation
cv_folds = 5

lr_scores = cross_val_score(lr_pipeline, train_data, train_labels, cv=cv_folds, scoring='accuracy')
nb_scores = cross_val_score(nb_pipeline, train_data, train_labels, cv=cv_folds, scoring='accuracy')

print(f"Logistic Regression CV Accuracy: {np.mean(lr_scores):.2f}")
print(f"Multinomial Naive Bayes CV Accuracy: {np.mean(nb_scores):.2f}")

# Testing the models on the test dataset
def evaluate_model(model, test_data, test_labels):
    predictions = model.predict(test_data)
    accuracy = accuracy_score(test_labels, predictions)
    precision = precision_score(test_labels, predictions, average='weighted')
    recall = recall_score(test_labels, predictions, average='weighted')
    f1 = f1_score(test_labels, predictions, average='weighted')
    return accuracy, precision, recall, f1

lr_accuracy, lr_precision, lr_recall, lr_f1 = evaluate_model(lr_pipeline, test_data, test_labels)
nb_accuracy, nb_precision, nb_recall, nb_f1 = evaluate_model(nb_pipeline, test_data, test_labels)

print("Logistic Regression Test Metrics:")
print(f"Accuracy: {lr_accuracy:.2f}, Precision: {lr_precision:.2f}, Recall: {lr_recall:.2f}, F1 Score: {lr_f1:.2f}")

print("Multinomial Naive Bayes Test Metrics:")
print(f"Accuracy: {nb_accuracy:.2f}, Precision: {nb_precision:.2f}, Recall: {nb_recall:.2f}, F1 Score: {nb_f1:.2f}")


Logistic Regression CV Accuracy: 0.65
Multinomial Naive Bayes CV Accuracy: 0.65
Logistic Regression Test Metrics:
Accuracy: 0.60, Precision: 0.36, Recall: 0.60, F1 Score: 0.45
Multinomial Naive Bayes Test Metrics:
Accuracy: 0.60, Precision: 0.36, Recall: 0.60, F1 Score: 0.45


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
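
The identical scores for both models are consistent with a classifier that predicts a single class for every test review. If 60% of the test labels belong to that class, weighted precision is 0.6 × 0.6 = 0.36, weighted recall is 0.60, and weighted F1 is 0.6 × 0.75 = 0.45. The sketch below reproduces that arithmetic on made-up labels; `weighted_prf` is a hypothetical helper that mirrors support-weighted averaging and treats an undefined precision as 0.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1; undefined ratios count as 0."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Made-up labels: 6 of 10 items are class 0, and the model always predicts 0.
y_true = [0] * 6 + [1] * 4
y_pred = [0] * 10

p, r, f = weighted_prf(y_true, y_pred)
print(f"Precision: {p:.2f}, Recall: {r:.2f}, F1 Score: {f:.2f}")
```

Per-class metrics or a confusion matrix would make this kind of collapse visible directly, whereas weighted averages can hide it.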


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split the data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.
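
One common way to approach item 2 is to rank numeric columns by the absolute Pearson correlation with the target and keep the strongest ones. The sketch below does this with the standard library only. `pearson` and `rank_features` are hypothetical helpers, the column names are taken from the dataset, and all values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Order column names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up values; only the column names come from the dataset.
columns = {
    "Id": [1, 2, 3, 4, 5],
    "OverallQual": [7, 5, 9, 6, 8],
    "LotArea": [7000, 8000, 11000, 9500, 12000],
}
sale_price = [180000, 120000, 260000, 150000, 240000]

ranking = rank_features(columns, sale_price)
print(ranking)
```

A row identifier such as `Id` would normally be excluded before ranking, since any correlation it shows with the price is incidental.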

In [31]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_squared_error

# Load the data
df_train = pd.read_csv("/train.csv")
df_test = pd.read_csv("/test.csv")

# Exploratory Data Analysis (EDA)
print("Train Data Info:")
print(df_train.info())
print("\nTest Data Info:")
print(df_test.info())

# Data Cleaning
# Drop columns with too many missing values
missing_threshold = 0.5
missing_cols_train = df_train.columns[df_train.isnull().mean() > missing_threshold]
missing_cols_test = df_test.columns[df_test.isnull().mean() > missing_threshold]
df_train.drop(columns=missing_cols_train, inplace=True)
df_test.drop(columns=missing_cols_test, inplace=True)

# Fill missing values with mean for numerical columns
num_cols_train = df_train.select_dtypes(include=np.number).columns
num_cols_test = df_test.select_dtypes(include=np.number).columns
df_train[num_cols_train] = df_train[num_cols_train].fillna(df_train[num_cols_train].mean())
df_test[num_cols_test] = df_test[num_cols_test].fillna(df_test[num_cols_test].mean())

# Selecting numeric features
numeric_features = df_train.select_dtypes(include=[np.number]).columns.tolist()
selected_features = numeric_features[:10]  # Selecting the first 10 numeric features
print("\nSelected Features:", selected_features)

# Prepare data for training
X_train = df_train[selected_features]
y_train = df_train['SalePrice']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Develop a regression model
regression = LinearRegression()
regression.fit(X_train, y_train)

# Make predictions on the test set
y_pred = regression.predict(X_test)

# Evaluate the model
r_squared = regression.score(X_test, y_test)
print('\nLinear Regression R-squared:', r_squared)

# Calculate root mean squared error
mse = mean_squared_error(y_test, y_pred)  # scikit-learn's argument order is (y_true, y_pred)
rmse = np.sqrt(mse)
print('Root Mean Squared Error:', rmse)

Train Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int6
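
The two metrics the cell reports can be written out directly, which shows what each one measures. The following is a minimal sketch with made-up prices; `rmse` and `r_squared` are hypothetical helpers that follow the usual definitions (root of the mean squared residual, and one minus the ratio of residual to total sum of squares).

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up sale prices and predictions, each off by 10,000.
y_true = [200000, 150000, 300000, 250000]
y_pred = [210000, 140000, 290000, 260000]

error = rmse(y_true, y_pred)
score = r_squared(y_true, y_pred)
print(error, score)
```

RMSE gives the same value if the two arguments are swapped, but R-squared does not, because the total sum of squares is computed from the true values only.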

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.
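
For item 2, one possible shape for the prediction step is sketched below. It assumes a checkpoint that already ships with a sentiment classification head and is applied without further fine-tuning; the model name `distilbert-base-uncased-finetuned-sst-2-english` is an illustrative assumption, not a requirement of the assignment. `build_classifier`, `normalize_label`, and `predict_sentiment` are hypothetical helpers. `truncation=True` is passed so that reviews longer than the model's maximum input length are cut rather than raising an error.

```python
def normalize_label(raw_label):
    """Map labels such as 'POSITIVE' / 'NEGATIVE' to lowercase strings."""
    return raw_label.lower()


def build_classifier(model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Create a Hugging Face sentiment pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name)


def predict_sentiment(classifier, text):
    """Return 'positive' or 'negative' for one text, truncating long inputs."""
    result = classifier(text, truncation=True)[0]
    return normalize_label(result["label"])


# Stand-in classifier with the same call shape, so the sketch runs offline.
def fake_classifier(text, truncation=True):
    return [{"label": "POSITIVE", "score": 0.9}]


print(predict_sentiment(fake_classifier, "A fun ride from start to finish."))
```

In real use, `build_classifier()` would replace `fake_classifier`; the label names returned depend on the chosen checkpoint and should be checked against its model card.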


In [22]:
# Write your code here

import pandas as pd
from transformers import pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
df = pd.read_csv('cleaned_reviews.csv')

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in the dataset:")
print(missing_values)

# Drop rows with missing values in the 'cleaned_text' and 'review_text' columns
df = df.dropna(subset=['cleaned_text', 'review_text'])

# Initialize sentiment analysis pipeline
sentiment_analysis = pipeline("sentiment-analysis", model="distilroberta-base")

# Define a function to perform sentiment analysis and map output to expected labels
def predict_sentiment(text):
    try:
        result = sentiment_analysis(text)[0]
        if result['label'] == 'LABEL_0':
            return 'negative'
        elif result['label'] == 'LABEL_1':
            return 'positive'
    except Exception:
        return 'error'

# Apply sentiment prediction
df['predicted_sentiment'] = df['cleaned_text'].apply(predict_sentiment)

# Print example of predictions
print("\nProcessed DataFrame:")
print(df[['cleaned_text', 'review_text', 'predicted_sentiment']].head())

# Calculate and display evaluation metrics
accuracy = accuracy_score(df['review_text'], df['predicted_sentiment'])
precision = precision_score(df['review_text'], df['predicted_sentiment'], average='weighted')
recall = recall_score(df['review_text'], df['predicted_sentiment'], average='weighted')
f1 = f1_score(df['review_text'], df['predicted_sentiment'], average='weighted')

print("\nEvaluation Metrics:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")



Missing values in the dataset:
review_title         0
review_text          0
rating               0
noise_removed        0
numbers_removed      0
lowercased           0
stopwords_removed    0
stemmed              0
lemmatized           0
cleaned_text         0
dtype: int64


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Token indices sequence length is longer than the specified maximum sequence length for this model (539 > 512). Running this sequence through the model will result in indexing errors



Processed DataFrame:
                                        cleaned_text  \
0  thought couldnt possibl write someth even wors...   
1  write exact review last transform last indiana...   
2  fast furiou lot franchis point took action out...   
3  point went see fast x without clue happen last...   
4  movi start stori first ten minut fast five sto...   

                                         review_text predicted_sentiment  
0  I thought they couldn't possibly write somethi...               error  
1  I can write the exact same review for the last...            negative  
2  Fast & Furious 9 did what a lot of franchises ...            negative  
3  By this point, I went to see Fast X without a ...            negative  
4  The movie starts it story from the first ten m...            negative  

Evaluation Metrics:
Accuracy: 0.0
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
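
The 0.0 scores follow from the inputs to the metric functions: `df['review_text']` holds the raw review strings, so no predicted value (`'negative'`, `'positive'`, or `'error'`) can ever equal a "true" value. A fairer comparison needs a ground-truth label column. The sketch below derives one from `rating` using the same threshold as Question 2 (ratings above 3 are positive), which is an assumption about how the data was annotated, and skips rows where prediction failed. `rating_to_label` and `accuracy_on_valid` are hypothetical helpers and the sample values are made up.

```python
def rating_to_label(rating, positive_above=3):
    """Convert a numeric rating to 'positive' or 'negative' (threshold is an assumption)."""
    return "positive" if rating > positive_above else "negative"


def accuracy_on_valid(ratings, predictions):
    """Accuracy over rows with a usable prediction; also returns how many rows were scored."""
    pairs = [
        (rating_to_label(rating), predicted)
        for rating, predicted in zip(ratings, predictions)
        if predicted in ("positive", "negative")
    ]
    if not pairs:
        return 0.0, 0
    correct = sum(1 for truth, predicted in pairs if truth == predicted)
    return correct / len(pairs), len(pairs)


# Made-up ratings and predictions; the first prediction failed.
ratings = [1, 2, 5, 4, 3]
predictions = ["error", "negative", "negative", "positive", "negative"]

acc, scored = accuracy_on_valid(ratings, predictions)
print(f"Accuracy: {acc:.2f} on {scored} of {len(ratings)} rows")
```

Reporting how many rows were scored alongside the metric keeps failed predictions visible instead of silently counting them as misclassifications.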
