
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.
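
As a hedged illustration of item 1, the two text representations used in the code below (raw term counts for LDA, TF-IDF weights for LSA) can be sketched in plain Python. The toy documents and the helper names `term_counts` and `tf_idf` are made up for this example, and the unsmoothed `log(N / df)` weighting is the textbook form, not scikit-learn's exact formula.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["great", "film", "great", "cast"],
    ["boring", "film"],
    ["great", "story"],
]

def term_counts(doc):
    """Bag-of-words: map each term to its raw count in one document."""
    return Counter(doc)

def tf_idf(docs):
    """Textbook TF-IDF: term count * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = term_counts(doc)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights

bow = [term_counts(doc) for doc in docs]
weights = tf_idf(docs)
```

In this sketch a term that occurs in fewer documents (such as "cast") receives a higher weight than a term of equal count that occurs in more documents (such as "film"), which is the property that makes TF-IDF useful for separating topics.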


In [None]:
import pandas as pd
def load_text_data(file_path):
    labels, reviews = [], []
    with open(file_path, 'r') as f:
        for line in f:
            label, review = line.strip().split(' ', 1)  # Split into label and review
            labels.append(int(label))
            reviews.append(review)
    return pd.DataFrame({'label': labels, 'review': reviews})

In [None]:
!pip install bertopic

!pip install transformers



In [None]:
# Write your code here

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary
from bertopic import BERTopic
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')

# Load dataset
# # Ensure you have your dataset (reviews) in a CSV file named "reviews.csv" with a column "review"
# df = pd.read_csv("reviews.csv")
# text_data = df['review'].dropna().tolist()

train_data = load_text_data("stsa-train.txt")


# Separate features and labels
text_data = train_data['review']

# Preprocess text

stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    return ' '.join([word for word in tokens if word.isalpha() and word not in stop_words])

preprocessed_texts = [preprocess(text) for text in text_data]

# 1. LDA Topic Modeling
print("\n=== LDA Topic Modeling ===")

# Tokenize once. Gensim builds its own bag-of-words corpus, so a
# scikit-learn CountVectorizer matrix is not needed for LDA.
tokenized_texts = [text.split() for text in preprocessed_texts]

# Create dictionary and corpus for Gensim LDA
id2word = Dictionary(tokenized_texts)
corpus = [id2word.doc2bow(tokens) for tokens in tokenized_texts]

# Train LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=10, random_state=42, passes=10)

# Display top 10 topics
for idx, topic in lda_model.print_topics(num_topics=10, num_words=10):
    print(f"Topic {idx + 1}: {topic}")

# 2. LSA Topic Modeling
print("\n=== LSA Topic Modeling ===")
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_texts)

# Apply TruncatedSVD for LSA
lsa_model = TruncatedSVD(n_components=10, random_state=42)
lsa_topics = lsa_model.fit_transform(tfidf_matrix)

# Display top words for each topic
terms = tfidf_vectorizer.get_feature_names_out()
for idx, component in enumerate(lsa_model.components_):
    topic_terms = [terms[i] for i in component.argsort()[:-11:-1]]
    print(f"Topic {idx + 1}: {', '.join(topic_terms)}")

# 3. BERTopic Topic Modeling
print("\n=== BERTopic ===")
topic_model = BERTopic(language="english", calculate_probabilities=True)
topics, probs = topic_model.fit_transform(preprocessed_texts)

# Display top 10 topics
topic_info = topic_model.get_topic_info()
print(topic_info.head(10))

# Summarize each topic
for topic in range(10):
    print(f"Topic {topic}: {topic_model.get_topic(topic)}")




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.



=== LDA Topic Modeling ===
Topic 1: 0.024*"film" + 0.009*"movie" + 0.008*"life" + 0.008*"sweet" + 0.005*"look" + 0.005*"love" + 0.005*"big" + 0.005*"home" + 0.005*"fails" + 0.004*"couple"
Topic 2: 0.021*"one" + 0.019*"movie" + 0.011*"film" + 0.010*"best" + 0.007*"fun" + 0.007*"ever" + 0.007*"man" + 0.007*"movies" + 0.006*"made" + 0.006*"good"
Topic 3: 0.015*"film" + 0.009*"movie" + 0.008*"much" + 0.007*"moments" + 0.007*"like" + 0.006*"story" + 0.006*"true" + 0.006*"well" + 0.006*"works" + 0.005*"pretty"
Topic 4: 0.009*"like" + 0.008*"movie" + 0.007*"material" + 0.007*"love" + 0.006*"story" + 0.005*"film" + 0.005*"hours" + 0.005*"long" + 0.005*"almost" + 0.005*"two"
Topic 5: 0.019*"film" + 0.007*"plot" + 0.007*"characters" + 0.006*"director" + 0.005*"movie" + 0.005*"story" + 0.004*"quirky" + 0.004*"direction" + 0.004*"bad" + 0.004*"real"
Topic 6: 0.020*"movie" + 0.015*"like" + 0.013*"film" + 0.011*"one" + 0.009*"good" + 0.007*"bad" + 0.006*"minutes" + 0.005*"get" + 0.005*"action" + 0.

   Topic  Count                                          Name  \
0     -1   4124                        -1_film_movie_like_one   
1      0    114                         0_hate_bad_good_wrong   
2      1     98                    1_music_portrait_art_songs   
3      2     90               2_cast_actors_actress_actresses   
4      3     89                3_kids_children_parents_adults   
5      4     77                4_story_told_tale_storytelling   
6      5     72                   5_love_story_sexual_romance   
7      6     65                    6_film_screen_moving_small   
8      7     64  7_entertaining_pleasure_enjoyable_experience   
9      8     63     8_thriller_psychological_thrills_standard   

                                      Representation  \
0  [film, movie, like, one, story, even, director...   
1  [hate, bad, good, wrong, terrible, letdown, ho...   
2  [music, portrait, art, songs, colorful, artist...   
3  [cast, actors, actress, actresses, actor, lead...   
4  [

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% of the data is for training and 20% is for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) to build two sentiment classifiers. Note: Cross-validation (5-fold or 10-fold) should be conducted. Reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the accuracy, precision, recall, and F1 score of the two algorithms you selected. The test set must be used for model evaluation in this step. Reference for calculating these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
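
To make the four metrics in item 3 concrete, here is a minimal sketch that computes them from confusion-matrix counts in plain Python. The function name `binary_metrics` and the toy labels are made up for illustration; in the notebook itself the scikit-learn functions are used instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (hypothetical): 3 true positives, 3 true negatives,
# 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found; F1 is their harmonic mean.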

In [None]:
# Write your code here



import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk




train_data = load_text_data("stsa-train.txt")
test_data = load_text_data("stsa-test.txt")

# Preprocess text
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    return ' '.join([word for word in tokens if word.isalpha() and word not in stop_words])

train_data['processed_review'] = train_data['review'].apply(preprocess)
test_data['processed_review'] = test_data['review'].apply(preprocess)

# Feature Extraction (TF-IDF)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data['processed_review'])
X_test = vectorizer.transform(test_data['processed_review']) # Use transform, not fit_transform
y_train = train_data['label']
y_test = test_data['label']

# Model Selection and Training
# 1. Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_scores = cross_validate(lr_model, X_train, y_train, cv=5, scoring=['accuracy', 'precision', 'recall', 'f1'])
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

# 2. Support Vector Machine (SVM)
svm_model = SVC()
svm_scores = cross_validate(svm_model, X_train, y_train, cv=5, scoring=['accuracy', 'precision', 'recall', 'f1'])
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

# Evaluation Metrics
def print_metrics(model_name, y_true, y_pred):
    print(f"\n{model_name} Performance Metrics:")
    print(f"Accuracy: {accuracy_score(y_true, y_pred)}")
    print(f"Precision: {precision_score(y_true, y_pred)}")
    print(f"Recall: {recall_score(y_true, y_pred)}")
    print(f"F1 Score: {f1_score(y_true, y_pred)}")


print_metrics("Logistic Regression", y_test, y_pred_lr)
print_metrics("SVM", y_test, y_pred_svm)

print("\nLogistic Regression Cross-Validation Scores:")
for metric, scores in lr_scores.items():
    print(f"{metric}: {scores.mean()}")

print("\nSVM Cross-Validation Scores:")
for metric, scores in svm_scores.items():
    print(f"{metric}: {scores.mean()}")




Logistic Regression Performance Metrics:
Accuracy: 0.7984623833058759
Precision: 0.7754065040650406
Recall: 0.8393839383938394
F1 Score: 0.8061278394083465

SVM Performance Metrics:
Accuracy: 0.8028555738605162
Precision: 0.7835051546391752
Recall: 0.8360836083608361
F1 Score: 0.8089409260244811

Logistic Regression Cross-Validation Scores:
fit_time: 0.12870988845825196
score_time: 0.029739856719970703
test_accuracy: 0.7671965317919075
test_precision: 0.7607327680666651
test_recall: 0.8077562326869806
test_f1: 0.7835249655834811

SVM Cross-Validation Scores:
fit_time: 4.590025329589844
score_time: 0.9744800090789795
test_accuracy: 0.7708092485549133
test_precision: 0.7675773088191343
test_recall: 0.8041551246537397
test_f1: 0.7854243394623868


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split the data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.
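
As a hedged sketch of the evaluation in item 4, the two metrics reported later in this section (RMSE and R-squared) can be written out in plain Python. The helper names `rmse` and `r_squared` and the toy prices are made up for illustration; the notebook uses scikit-learn's `mean_squared_error` and `r2_score`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in price units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy sale prices (hypothetical); every prediction is off by exactly 10,000.
y_true = [200000.0, 150000.0, 300000.0, 250000.0]
y_pred = [210000.0, 140000.0, 290000.0, 260000.0]

error = rmse(y_true, y_pred)
fit = r_squared(y_true, y_pred)
```

RMSE is expressed in the same unit as the target, so it reads directly as a typical error in dollars, whereas R-squared is unitless and describes the share of variance explained.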

In [None]:
# prompt: In the regression problem above, the EDA used only numeric columns. Convert the string columns to categorical, one-hot encode them, and include them among the features.

# Only the imports this regression cell actually uses
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Load the training dataset
train_df = pd.read_csv("train.csv")

# Identify categorical columns
categorical_cols = train_df.select_dtypes(exclude=np.number).columns

# Create a OneHotEncoder object
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit the encoder on the categorical columns
encoder.fit(train_df[categorical_cols])

# Transform the categorical features into one-hot encoded features
encoded_features = encoder.transform(train_df[categorical_cols])

# Create a new DataFrame with the one-hot encoded features
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_cols))

train_df = pd.concat([train_df.select_dtypes(include=np.number), encoded_df], axis=1)


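# Handle missing values (replace with the column mean for numerical features)
numerical_cols = train_df.select_dtypes(include=np.number).columns
for col in numerical_cols:
    if train_df[col].isnull().any():
        # Assign the result back instead of using inplace=True on a column
        # selection, which may not modify train_df in recent pandas versions.
        train_df[col] = train_df[col].fillna(train_df[col].mean())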

# Calculate correlations with SalePrice for all features (including one-hot encoded)
correlations = train_df.corr()['SalePrice'].abs().sort_values(ascending=False)


# Skip position 0 (SalePrice itself) and keep the 39 features most correlated with it
top_features = correlations.iloc[1:40].index.tolist()

# Use selected features for training and testing
X_data = train_df[top_features]
Y_data = train_df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    X_data, Y_data, test_size=0.2, random_state=42
)

# Model Development (Linear Regression)
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)


print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")

Root Mean Squared Error: 33287.77665810376
R-squared: 0.8555372946457576


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources,  number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


### Description


For this task, I will use RoBERTa (Robustly Optimized BERT Pretraining Approach), a pre-trained transformer-based model for NLP tasks.

**Original pretraining data sources:** RoBERTa was trained on a corpus of over 160GB of text, including CC-News (Common Crawl News), OpenWebText, BookCorpus, and English Wikipedia.

**Architecture and parameters:** RoBERTa builds on BERT’s architecture but removes the Next Sentence Prediction (NSP) objective and increases the size of the pretraining data. The RoBERTa-base model contains 12 layers, 768 hidden units, 12 attention heads, and 125 million parameters.

**Task-specific fine-tuning:** RoBERTa has demonstrated strong performance on downstream tasks such as sentiment analysis once fine-tuned, but no additional fine-tuning is applied here (zero-shot setting). Note that the plain `roberta-base` checkpoint contains only the pretrained encoder, so a classification head loaded on top of it starts from random weights.

### Advantages

1. Broad pretraining: because it was pretrained on large, diverse data, RoBERTa performs well across a range of NLP tasks.
2. Zero-shot use: a checkpoint that already includes a sentiment classification head can be applied without task-specific fine-tuning.

### Disadvantages

1. Higher computational cost at inference because of the model size.
2. Performance may degrade if the input data diverges significantly from the pretraining corpus.
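
The code in the next section converts the pipeline's label string into 0 or 1. A minimal sketch of that mapping step, runnable without downloading a model, is shown here. The label names in `POSITIVE_LABELS` and the `fake_results` list are assumptions for illustration only; the actual strings depend on the chosen checkpoint's `id2label` configuration and should be checked before use.

```python
# Hypothetical label names. The real strings come from the checkpoint's
# `id2label` config; treating "LABEL_1" as positive is an assumption.
POSITIVE_LABELS = {"POSITIVE", "positive", "LABEL_1"}

def to_binary(result):
    """Map one pipeline result dict ({'label': ..., 'score': ...}) to 1 or 0."""
    return 1 if result["label"] in POSITIVE_LABELS else 0

# Stand-in outputs shaped like text-classification pipeline results.
fake_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "LABEL_1", "score": 0.60},
]

predictions = [to_binary(result) for result in fake_results]
```

Keeping the mapping in one small function makes it easy to adjust when switching checkpoints, since different models name their sentiment classes differently.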


### Code    

In [1]:
# Write your code here

from transformers import pipeline

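# Initialize the sentiment analysis pipeline with roberta-base.
# Note: roberta-base ships without a trained classification head, so the
# pipeline initializes one randomly and the LABEL_0/LABEL_1 outputs are not
# meaningful sentiment predictions unless a sentiment-tuned checkpoint is used.
classifier = pipeline("sentiment-analysis", model="roberta-base")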


# Predict sentiment
def predict_sentiment(text):
    result = classifier(text)[0]
    # print(result)
    return 1 if result['label'] == 'LABEL_1' else 0

# Apply predictions to the first 100 test reviews only
# (test_data is the DataFrame loaded in Question 2)
print(len(test_data['review']))
predictions = [predict_sentiment(review) for review in test_data['review'][:100]]

# Evaluate the performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(test_data['label'][:100], predictions)
precision = precision_score(test_data['label'][:100], predictions)
recall = recall_score(test_data['label'][:100], predictions)
f1 = f1_score(test_data['label'][:100], predictions)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")



KeyboardInterrupt: 