<a href="https://colab.research.google.com/github/MPrasanna14/prasanna_INFO5731_Fall2023/blob/main/Prasanna_Malreddy_INFO5731_Assignment_Four_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.


In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Path to the dataset created in assignment three
csv_path = 'MovieReview_Evaluation.csv'
reviews_df = pd.read_csv(csv_path)

# Using TF-IDF for text representation
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(reviews_df['Review text'])

# Apply LSA (using TruncatedSVD)
lsa_model = TruncatedSVD(n_components=10, random_state=0)
X_lsa = lsa_model.fit_transform(X_tfidf)

# Collect the 10 highest-weighted terms for each LSA component (topic)
terms = tfidf_vectorizer.get_feature_names_out()
top_words_per_topic = {}
for i, comp in enumerate(lsa_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:10]
    top_words_per_topic[i] = [t[0] for t in sorted_terms]

# Print each topic index followed by its top terms
for i in top_words_per_topic:
    print(i, top_words_per_topic[i])




0 ['film', 'movie', 'just', 'like', 'character', 'time', 'characters', 'good', 'make', 'man']
1 ['movie', 'just', 'funny', 'like', 'characters', 'scenes', 'time', 'good', 'man', 'worth']
2 ['like', 'story', 'good', 'character', 'life', 'just', 'love', 'people', 'characters', 'time']
3 ['like', 'man', 'movie', 'movies', 'film', 'lot', 'young', 'problems', 'head', 'hollywood']
4 ['character', 'good', 'love', 'characters', 'plot', 'movie', 'really', 'development', 'hammer', 'role']
5 ['character', 'just', 'like', 'carol', 'hammer', 'love', 'performance', 'especially', 'playing', 'fall']
6 ['little', 'know', 'character', 'course', 'funny', 'director', 'don', 'man', 'world', 'didn']
7 ['good', 'just', 'really', 'make', 'cast', 'quite', 'work', 'actors', 'feel', 'role']
8 ['don', 'know', 'films', 'way', 'man', 'king', 'think', 'bad', 'great', 'seen']
9 ['scenes', 'time', 'don', 'know', 'scene', 'director', 'funny', 'involving', 'final', 'action']


Topics for each cluster:

General Assessment
Customer Satisfaction
Usability and functionality
Physical attributes of laptop
Purchase price and benefits
Pre-installed software
Battery life
Product satisfaction
Customer service
Unsatisfactory experience
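
Item (2) asks for clusters, while the code above prints only the top terms per component. One possible way to turn LSA output into clusters, sketched here under assumptions, is to assign each review to the component on which it has the largest absolute weight. The toy corpus and the names `toy_reviews`, `dominant_topic`, and `cluster_sizes` are illustrative only and do not come from the assignment data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the 'Review text' column.
toy_reviews = [
    "great acting and a strong cast",
    "the cast and acting were great",
    "funny scenes and a funny script",
    "the script had many funny scenes",
    "slow plot and weak story",
    "the story and plot were slow",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_tfidf = vectorizer.fit_transform(toy_reviews)

# 3 components is an arbitrary choice for this tiny corpus.
lsa = TruncatedSVD(n_components=3, random_state=0)
X_lsa = lsa.fit_transform(X_tfidf)  # shape: (n_documents, n_components)

# Assign each document to the component where it has the largest absolute weight.
dominant_topic = np.abs(X_lsa).argmax(axis=1)

# Number of documents assigned to each component.
cluster_sizes = np.bincount(dominant_topic, minlength=lsa.n_components)

for doc, topic in zip(toy_reviews, dominant_topic):
    print(topic, doc)
print("cluster sizes:", cluster_sizes)
```

On real review data the first LSA component often absorbs general vocabulary, so many documents may land in cluster 0. Inspecting the cluster sizes is a quick way to check for this.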

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Notice: **80% data for training and 20% data for testing**.  

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two supervised learning algorithms from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance of the two algorithms you selected in terms of accuracy, precision, recall, and F1 score. Here is a reference on how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [None]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

# Path to the dataset created in assignment three
csv_path = 'MovieReview_Evaluation.csv'
reviews_df = pd.read_csv(csv_path)

# Feature extraction using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = tfidf_vectorizer.fit_transform(reviews_df['Review text'])
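# NOTE: the target should be the sentiment label column from assignment three.
# Using 'Review text' here makes nearly every review its own class, which is
# consistent with the near-zero scores in the output.
y = reviews_df['Review text']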

# Defining models
models = {
    "LR": LogisticRegression(max_iter=1000, random_state=0),
    "RF": RandomForestClassifier(random_state=0)
}

# Metrics to report: accuracy plus macro- and weighted-averaged precision, recall, and F1
scoring_updated = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro',
                   'precision_weighted', 'recall_weighted', 'f1_weighted']

# Run 5-fold cross-validation for each model and keep the per-fold scores
cv_results_updated = {}

for model_name, model in models.items():
    scores = cross_validate(model, X, y, scoring=scoring_updated, cv=5)
    cv_results_updated[model_name] = scores

# Summarize scores
summary_scores = {}

for model_name in models.keys():
    model_scores = cv_results_updated[model_name]
    summary_scores[model_name] = {
        "Average Accuracy": np.mean(model_scores['test_accuracy']),
        # Labels ending in "(Cross-validation)" are macro-averaged over classes
        "Average Precision (Cross-validation)": np.mean(model_scores['test_precision_macro']),
        "Average Recall (Cross-validation)": np.mean(model_scores['test_recall_macro']),
        "Average F1 Score (Cross-validation)": np.mean(model_scores['test_f1_macro']),
        # Labels without a suffix are weighted by class support
        "Average Precision ": np.mean(model_scores['test_precision_weighted']),
        "Average Recall ": np.mean(model_scores['test_recall_weighted']),
        "Average F1 Score ": np.mean(model_scores['test_f1_weighted'])
    }

# Print the averaged scores for each model
for i in summary_scores:
    print(i)
    for j in summary_scores[i]:
        print(j, summary_scores[i][j])
    print("\n")





  _warn_prf(average, modifier, msg_start, len(result))

LR
Average Accuracy 0.0032653121668048805
Average Precision (Cross-validation) 8.893350289473874e-06
Average Recall (Cross-validation) 0.0027218419981057847
Average F1 Score (Cross-validation) 1.77223357871724e-05
Average Precision  1.1855040582112002e-05
Average Recall  0.0032653121668048805
Average F1 Score  2.361361132862602e-05


RF
Average Accuracy 0.0032653121668048805
Average Precision (Cross-validation) 0.0001640676322693764
Average Recall (Cross-validation) 0.0015072482433106799
Average F1 Score (Cross-validation) 0.0002939171620920711
Average Precision  0.0003734638086314428
Average Recall  0.0032653121668048805
Average F1 Score  0.0006659246624520289
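
The near-zero scores above are consistent with the target `y` being the review text itself, so that nearly every review forms its own class. A minimal sketch of the intended setup follows, assuming the dataset has a separate sentiment label column. The column name `Sentiment` and the toy rows are hypothetical, not taken from the assignment data. It combines the required 80/20 split with 5-fold cross-validation on the training portion.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real column names from assignment three may differ.
toy_df = pd.DataFrame({
    "Review text": [
        "great film", "loved the cast", "wonderful story", "great acting",
        "loved it", "wonderful film", "great story", "loved the acting",
        "wonderful cast", "great cast",
        "terrible film", "boring plot", "awful story", "terrible acting",
        "boring film", "awful cast", "terrible plot", "boring story",
        "awful acting", "terrible cast",
    ],
    "Sentiment": ["positive"] * 10 + ["negative"] * 10,
})

X_text = toy_df["Review text"]
y = toy_df["Sentiment"]  # labels, not the review text

# 80% training, 20% testing, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, train_size=0.8, stratify=y, random_state=0
)

# Keeping the vectorizer inside the pipeline refits TF-IDF on each CV training fold.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, random_state=0),
)

cv_scores = cross_validate(
    model, X_train, y_train, cv=5,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
)
print("CV accuracy:", cv_scores["test_accuracy"].mean())

# Final check on the held-out 20%.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
test_accuracy = accuracy_score(y_test, y_pred)
print("test accuracy:", test_accuracy)
print("test precision / recall / F1:", precision, recall, f1)
```

The same pattern applies to a second classifier such as `RandomForestClassifier`: swap the final pipeline step and compare the resulting metrics.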






# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

train_df = 'train.csv'
test_df = 'test.csv'

train_dataset = pd.read_csv(train_df)
test_dataset = pd.read_csv(test_df)

features = train_dataset.drop(columns=['SalePrice', 'Id'])
target = train_dataset['SalePrice']

numeric_columns = features.select_dtypes(include=['int64', 'float64']).columns
categorical_columns = features.select_dtypes(include=['object']).columns

# Transformers
numeric_transformer = SimpleImputer(strategy='mean')

cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

data_preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_columns),
        ('cat', cat_transformer, categorical_columns)
    ])

# Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=0)

pipeline = Pipeline(steps=[('preprocessor', data_preprocessor),
                           ('gb_model', gb_model)])

# Data Splitting
features_train, features_valid, target_train, target_valid = train_test_split(features, target, train_size=0.8, test_size=0.2, random_state=42)

# Fit model
pipeline.fit(features_train, target_train)

valid_predictions = pipeline.predict(features_valid)

# Evaluate model
mean_squared_error_value = mean_squared_error(target_valid, valid_predictions)
root_mean_squared_error_value = np.sqrt(mean_squared_error_value)
r2_value = r2_score(target_valid, valid_predictions)

print("Mean Squared Error = ", mean_squared_error_value)
print("Root-Mean Squared Error = ", root_mean_squared_error_value)
print("R-Squared = ", r2_value)

# Predict on test data (reuse the test set loaded above instead of re-reading the file)
test_features = test_dataset.drop(columns=['Id'])
test_predictions = pipeline.predict(test_features)

# Show the first 10 predicted sale prices
for i in range(10):
    print("$", test_predictions[i])




Mean Squared Error =  803869861.5929561
Root-Mean Squared Error =  28352.598850774793
R-Squared =  0.8951974349096488
$ 127628.90410612979
$ 161072.89913083884
$ 172761.98060108643
$ 184986.37195605497
$ 200308.61022479733
$ 175566.5925633529
$ 173135.41980048126
$ 163257.60952083374
$ 182048.9136114101
$ 122050.37599910742
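
The three metrics printed above are related: RMSE is the square root of MSE, and R-squared compares the squared error with the spread of the targets around their mean. A short sketch with made-up numbers computes them by hand and checks them against scikit-learn. The arrays `y_true` and `y_pred` are illustrative and not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up sale prices and predictions, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0, 175000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 180000.0])

residuals = y_true - y_pred

# Mean squared error and its square root.
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# The hand-computed values should agree with scikit-learn.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print("MSE =", mse)
print("RMSE =", rmse)
print("R-Squared =", r2)
```

RMSE is expressed in the same units as the target (dollars here), which makes the validation RMSE of about 28,353 easier to interpret than the MSE.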


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **pre-trained Large Language Model (LLM) from the Hugging Face Repository** for your specific task using the data collected in Assignment 3. After creating an account on Hugging Face (https://huggingface.co/), choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any Meta-based text analysis model. Provide a brief description of the selected LLM, including its original sources, significant parameters, and any task-specific fine-tuning if applied.

Perform a detailed analysis of the LLM's performance on your task, including key metrics, strengths, and limitations. Additionally, discuss any challenges encountered during the implementation and potential strategies for improvement. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.
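
One possible starting point for this question, sketched under assumptions, is the `pipeline` helper from the Hugging Face `transformers` library with a sentiment checkpoint. The checkpoint `distilbert-base-uncased-finetuned-sst-2-english` is an assumed choice, not one prescribed by the assignment, and the helper names `classify_reviews` and `binary_metrics` are hypothetical. The gold labels and predictions at the bottom are made up so that the metric step runs without downloading a model.

```python
from typing import Dict, List


def classify_reviews(texts: List[str]) -> List[str]:
    """Label each text with a pre-trained sentiment model (downloads weights on first use)."""
    # Imported lazily so the metric helper below works without transformers installed.
    from transformers import pipeline

    # Assumed checkpoint: DistilBERT fine-tuned on SST-2; other checkpoints could be used.
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    # Each result is a dict with a 'label' and a 'score'; long reviews are truncated.
    results = classifier(texts, truncation=True)
    return [r["label"].lower() for r in results]


def binary_metrics(
    y_true: List[str], y_pred: List[str], positive: str = "positive"
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up gold labels and model outputs, used here so the metric step runs offline.
gold = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "positive", "positive", "negative"]
scores = binary_metrics(gold, predicted)
print(scores)
```

With network access and `transformers` installed, `predicted` could instead come from `classify_reviews(reviews_df['Review text'].tolist())`, provided the dataset also contains gold sentiment labels to compare against.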
