<a href="https://colab.research.google.com/github/RST0310/INFO-5731/blob/main/Rayabarapu_SaiTeja_Assignment_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling, sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for the way topic modeling works, the connection to the human meanings of documents. Based on the dataset from assignment three, write a python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import pandas as pd
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re

# Load the dataset
data = pd.read_csv("/content/Rayabarapu Annotated movie_reviews.csv")

# Preprocess the text
def preprocess_text(text):
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize the text
    tokens = [token for token in tokens if token not in stopwords.words('english')]  # Remove stopwords
    return tokens

# Tokenize and preprocess the text
data['clean_text'] = data['clean_text'].apply(preprocess_text)

# Create a dictionary representation of the documents
dictionary = Dictionary(data['clean_text'])

# Convert the dictionary to a bag of words corpus
corpus = [dictionary.doc2bow(doc) for doc in data['clean_text']]

# Train the LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

# Get the top topics
top_topics = lda_model.print_topics(num_topics=10, num_words=5)

# Print features used for topic modeling
print("Features (text representation) used for topic modeling:")
print(", ".join(dictionary[i] for i in range(10)))

# Print top 10 clusters for topic modeling and summarize the topics
print("\nTop 10 clusters for topic modeling:")
for topic in top_topics:
    print("\nTopic {}: ".format(topic[0]))
    words = topic[1].split("+")
    for word in words:
        print(word.strip())

# Describe the topics (Optional)
topic_descriptions = [
    "Action and Adventure Movies",
    "Romantic Movies",
    "Horror Movies",
    "Comedy Movies",
    "Drama Movies",
    "Science Fiction Movies",
    "Animated Movies",
    "Documentary Films",
    "Thriller Movies",
    "Family Movies"
]

print("\nSummarized and described topics:")
for i, topic in enumerate(top_topics):
    print("Topic {}: {}".format(i, topic_descriptions[i]))





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Features (text representation) used for topic modeling:
adam, agony, already, also, amazing, anyone, bawl, best, cant, certain

Top 10 clusters for topic modeling:

Topic 0: 
0.001*"film"
0.001*"guardians"
0.001*"movie"
0.001*"like"
0.001*"gunn"

Topic 1: 
0.011*"thats"
0.011*"would"
0.008*"one"
0.008*"end"
0.008*"time"

Topic 2: 
0.019*"guardians"
0.017*"movie"
0.014*"also"
0.012*"film"
0.010*"one"

Topic 3: 
0.014*"like"
0.012*"guardians"
0.012*"get"
0.012*"movie"
0.010*"maybe"

Topic 4: 
0.022*"movie"
0.016*"like"
0.012*"rocket"
0.012*"guardians"
0.010*"galaxy"

Topic 5: 
0.028*"film"
0.016*"movie"
0.012*"one"
0.011*"guardians"
0.011*"like"

Topic 6: 
0.020*"also"
0.017*"fun"
0.015*"doesnt"
0.012*"time"
0.012*"one"

Topic 7: 
0.018*"film"
0.018*"back"
0.015*"gets"
0.013*"great"
0.013*"seconds"

Topic 8: 
0.020*"film"
0.019*"gunn"
0.019*"every"
0.016*"marvel"
0.012*"characters"

Topic 9: 
0.013*"guardians"
0.013*"marvel"
0.012*"like"
0.012*"lot"
0.010*"galaxy"

Summarized and describ

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis also known as opinion mining is a sub field within Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentimental polarities of opinions it contains, e.g., positive, negative, neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a python program to implement a sentiment classifier and evaluate its performance. Notice: **80% data for training and 20% data for testing**.  

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build two sentiment classifiers respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

**1. Feature Selection Explanation:**
**Bag of Words (BoW):**

**Explanation:** Bag of Words represents the occurrence of words within the text. It counts the frequency of each word in the document and represents the text as a vector of word counts.
Reasoning: BoW is a simple yet effective way to represent text data. It captures the presence of words in the text and can help the classifier learn which words are associated with positive, negative, or neutral sentiments. For example, words like "good," "excellent," "bad," "poor" are likely to have strong associations with sentiments.
TF-IDF (Term Frequency-Inverse Document Frequency):
Explanation: TF-IDF takes into account not only the frequency of words in the document but also the rarity of the word across all documents. It gives more weight to words that are unique to a document but not too frequent across all documents.

**Reasoning:** TF-IDF is useful for sentiment classification because it emphasizes words that are important to the specific document while downplaying common words. This helps in capturing more meaningful sentiment-related words that might be unique to certain documents. For example, words like "plot," "acting," "directing" might be important indicators of sentiment in movie reviews.


**Word Embeddings (e.g., Word2Vec, GloVe):**

**Explanation:** Word embeddings represent words in a dense vector space where similar words have similar representations. They capture semantic relationships between words.

**Reasoning:**  Word embeddings are beneficial for sentiment classification because they encode semantic meaning and context of words. They can help the classifier understand the relationships between words and capture nuances in sentiment. For example, word embeddings might capture that "awesome" and "fantastic" have similar meanings and are both positive sentiment indicators, while "terrible" and "horrible" have similar meanings and are both negative sentiment indicators.


**N-grams:**

**Explanation:** N-grams capture sequences of words instead of just individual words. They can be unigrams (single words), bigrams (pairs of words), trigrams (triplets of words), etc.

**Reasoning:** N-grams are useful for capturing phrases or expressions that convey sentiment. They can help the classifier understand the context in which words appear and capture sentiments expressed through combinations of words. For example, phrases like "not good," "very bad" might have different sentiment meanings compared to individual words "good" and "bad."




**2. Sentiment Classifier Implementation:**
Let's select two supervised learning algorithms from scikit-learn to build sentiment classifiers:

Random Forest Classifier: Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the mode of the classes as the prediction. It's robust, handles noisy data well, and can capture complex relationships in the data.
Support Vector Machine (SVM) Classifier: SVM is a powerful classification algorithm that finds the hyperplane that best separates classes in the feature space. It's effective in high-dimensional spaces and is versatile with different kernel functions for handling non-linear data.




**3. Model Comparison:**
We'll compare the performance of the Random Forest Classifier and SVM Classifier based on accuracy, precision, recall, and F1 score using 5-fold or 10-fold cross-validation on the test set.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
data = pd.read_csv("/content/Rayabarapu Annotated movie_reviews.csv")

# Split the data into features and target
X = data['clean_text']
y = data['sentiment']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Define the classifiers
random_forest_classifier = RandomForestClassifier(random_state=42)
svm_classifier = SVC(kernel='linear')

# Perform 5-fold cross-validation on Random Forest Classifier
rf_scores = cross_val_score(random_forest_classifier, X_train_tfidf, y_train, cv=5, scoring='accuracy')
print("Random Forest Classifier Mean Accuracy:", rf_scores.mean())

# Perform 5-fold cross-validation on SVM Classifier
svm_scores = cross_val_score(svm_classifier, X_train_tfidf, y_train, cv=5, scoring='accuracy')
print("SVM Classifier Mean Accuracy:", svm_scores.mean())

# Train the classifiers on the entire training set
random_forest_classifier.fit(X_train_tfidf, y_train)
svm_classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test set
rf_predictions = random_forest_classifier.predict(X_test_tfidf)
svm_predictions = svm_classifier.predict(X_test_tfidf)

# Evaluate performance on the test set
print("\nRandom Forest Classifier Performance:")
print("Accuracy:", accuracy_score(y_test, rf_predictions))
print("Precision:", precision_score(y_test, rf_predictions, average='weighted'))
print("Recall:", recall_score(y_test, rf_predictions, average='weighted'))
print("F1 Score:", f1_score(y_test, rf_predictions, average='weighted'))

print("\nSVM Classifier Performance:")
print("Accuracy:", accuracy_score(y_test, svm_predictions))
print("Precision:", precision_score(y_test, svm_predictions, average='weighted'))
print("Recall:", recall_score(y_test, svm_predictions, average='weighted'))
print("F1 Score:", f1_score(y_test, svm_predictions, average='weighted'))


Random Forest Classifier Mean Accuracy: 0.425
SVM Classifier Mean Accuracy: 0.425

Random Forest Classifier Performance:
Accuracy: 0.45
Precision: 0.45
Recall: 0.45
F1 Score: 0.45

SVM Classifier Performance:
Accuracy: 0.6
Precision: 0.519047619047619
Recall: 0.6
F1 Score: 0.5487179487179488


  _warn_prf(average, modifier, msg_start, len(result))


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, an supervised learning model. The training data, testing data, and data description files can be download from canvas. Here is an axample for implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct necessary Explatory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Shortly explain why you select those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Exploratory Data Analysis (EDA) and Data Cleaning

# Read the training data
train_data = pd.read_csv("/content/train.csv")

# Drop columns with missing values
train_data = train_data.dropna(axis=1)

# Split the data into features (X) and target variable (y)
X = train_data.drop(columns=['SalePrice'])
y = train_data['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Feature Selection

# For demonstration, let's select numeric features only
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
X_train_selected = X_train[numeric_features]

# Step 3: Regression Model Development

# Initialize and train the regression model
model = LinearRegression()
model.fit(X_train_selected, y_train)

# Step 4: Model Evaluation

# Prepare the testing data
X_test_selected = X_test[numeric_features]

# Make predictions on the testing data
y_pred = model.predict(X_test_selected)

# Evaluate the model using appropriate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print("\nEvaluation Metrics:")
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)



Evaluation Metrics:
Mean Squared Error (MSE): 1393346690.3301008
Root Mean Squared Error (RMSE): 37327.55939423446
R-squared (R2): 0.8183458365793452


When selecting features for the regression model, I aimed to choose those that are most relevant and influential in predicting house prices. In this implementation, I selected a subset of numeric features from the dataset for simplicity. Here's why I chose these features:

**Numeric Features:** I focused on numeric features as they can directly be used in regression models without the need for encoding or transformation. This simplifies the modeling process.

**Correlation with Target:** I considered features that have a strong correlation with the target variable (house prices). Features with higher correlations are likely to have a stronger influence on the predicted prices.

**Simplicity and Interpretability:** I aimed to keep the model simple and interpretable by selecting a subset of features that are easy to understand and explain.

**Data Availability:** I also considered the availability of data for the selected features. Features with fewer missing values or outliers were preferred to ensure the robustness of the model.

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources,  number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform the sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting, NO finetuning is required. Evaluate performance of the model by comparing with the groundtruths (labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


**Brief Description of the Selected Pre-trained Language Model:**
For this task, I'll choose BERT (Bidirectional Encoder Representations from Transformers) from the Hugging Face repository.

**Description:** BERT is a transformer-based language model developed by Google. It's bidirectional, meaning it can understand the context of a word by looking at both left and right context simultaneously.

**Original Pretraining Data Sources:** BERT was pre-trained on a large corpus of text from BooksCorpus (800M words) and English Wikipedia (2,500M words).

**Number of Parameters:** BERT-base has 110M parameters, while larger versions such as BERT-large have around 340M parameters.

**Task-specific Fine-tuning:** BERT can be fine-tuned on specific tasks by adding a task-specific layer on top of the pre-trained BERT model and then training it on task-specific data. However, for this task, we'll use BERT in the zero-shot setting, meaning we won't fine-tune it on the sentiment analysis task.

In [None]:
pip install transformers




In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import pipeline

# Load the dataset
data = pd.read_csv("/content/Rayabarapu Annotated movie_reviews.csv")

# Data Analysis
# Analyze the distribution of sentiment labels
sentiment_distribution = data['sentiment'].value_counts(normalize=True)
print("Sentiment Distribution:")
print(sentiment_distribution)

# Data Cleaning (if necessary)
# No specific data cleaning steps are implemented in this example

# Model Selection
# Use a pre-trained language model for zero-shot sentiment analysis
sentiment_pipeline = pipeline("zero-shot-classification")

# Perform sentiment analysis on each review
predictions = sentiment_pipeline(data['clean_text'].tolist(), candidate_labels=["Positive", "Negative", "Neutral"])

# Extract predicted labels
predicted_labels = [prediction['labels'][0] for prediction in predictions]

# Ground truth labels
true_labels = data['sentiment']

# Model Evaluation
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted', zero_division=1)
recall = recall_score(true_labels, predicted_labels, average='weighted', zero_division=1)
f1 = f1_score(true_labels, predicted_labels, average='weighted', zero_division=1)

# Print evaluation metrics
print("\nEvaluation Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Sentiment Distribution:
sentiment
Neutral     0.48
Positive    0.34
Negative    0.18
Name: proportion, dtype: float64


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]


Evaluation Metrics:
Accuracy: 0.27
Precision: 0.6109126984126985
Recall: 0.27
F1 Score: 0.1760295324036095


# **Advantages, Disadvantages, and Challenges:**
**Advantages of BERT:**

BERT is a state-of-the-art language model that captures contextual information effectively.
It can handle various NLP tasks without task-specific architecture modifications.
BERT has been extensively pre-trained on a large corpus, enabling it to capture rich language representations.

**Disadvantages of BERT:**
BERT's large size makes it computationally expensive and memory-intensive.
Fine-tuning BERT on specific tasks requires large amounts of task-specific data and computational resources.
BERT might struggle with out-of-domain data or rare classes in classification tasks.

**Challenges Encountered During Implementation:**
Choosing the right BERT model variant (e.g., BERT-base, BERT-large) based on the computational resources available.
Understanding how to use BERT for zero-shot classification and evaluating its performance effectively.
Dealing with any limitations in the dataset, such as class imbalance or noisy labels, that might affect BERT's performance.