
# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how topics connect to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.


### Creating the Dictionary and Document-Term Matrix

In [None]:
import pandas as pd
from gensim import corpora

# Load the annotated dataset from assignment 3
df = pd.read_csv("/content/annotated_dataset (1).csv")

# Tokenize the text
tokenized_text = [text.split() for text in df['Cleaned_Review']]

# Create a Dictionary
dictionary = corpora.Dictionary(tokenized_text)

# Create a Document-Term Matrix
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized_text]

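To make the representation concrete: the dictionary maps each distinct token to an integer id, and each document becomes a list of `(token_id, count)` pairs. The plain-Python sketch below mimics that idea on two made-up reviews. It does not use gensim, the helper names `build_vocab` and `to_bow` are hypothetical, and the ids gensim assigns may be ordered differently.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Two made-up, already-cleaned reviews (illustrative only)
docs = ["great action great stori".split(), "action scene look good".split()]
vocab = build_vocab(docs)
bows = [to_bow(tokens, vocab) for tokens in docs]
print(vocab)
print(bows)
```

Each pair records how often a token occurs in a document; word order is discarded, which is the bag-of-words assumption LDA works with.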
### Training the LDA Model and Assigning Topics

In [None]:
from gensim.models import LdaModel

# Train the LDA model
num_topics = 10  # You can adjust the number of topics
lda_model = LdaModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary, passes=15)

# Display the top words for each topic
top_words_per_topic = []
for i in range(num_topics):
    topic_words = [word for word, _ in lda_model.show_topic(i, topn=10)]
    top_words_per_topic.append(topic_words)
    print(f"\nTopic {i + 1}: {', '.join(topic_words)}")

# Assign topics to documents
df['topic'] = [max(lda_model.get_document_topics(doc), key=lambda x: x[1])[0] for doc in doc_term_matrix]


Topic 1: film, watch, would, minut, danc, need, total, two, time, also

Topic 2: movi, action, one, block, like, real, song, guess, top, half

Topic 3: movi, watch, rrr, action, much, one, film, like, lot, never

Topic 4: movi, action, good, stori, rrr, make, charact, great, way, feel

Topic 5: rrr, film, bahubali, rajamouli, mass, make, look, that, two, mind

Topic 6: movi, scene, hero, well, action, stori, everyth, modern, cgi, look

Topic 7: film, say, day, im, thing, visual, even, seen, know, one

Topic 8: movi, see, ive, work, seen, rrr, great, use, get, fighter

Topic 9: film, movi, indian, rrr, watch, action, scene, one, critic, tollywood

Topic 10: seem, rajamouli, one, bheem, also, manag, film, charan, there, might


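The assignment step in the cell above keeps, for each document, the topic with the highest probability. The sketch below isolates that step on a hypothetical distribution in the `(topic_id, probability)` shape; the function name `dominant_topic` and the numbers are assumptions for illustration. It also shows that the stored ids are 0-based while the printed headings use `i + 1`.

```python
def dominant_topic(topic_probs):
    """Return the 0-based id of the most probable topic from (topic_id, prob) pairs."""
    topic_id, _ = max(topic_probs, key=lambda pair: pair[1])
    return topic_id

# Hypothetical distribution for one document: (topic_id, probability) pairs
example_distribution = [(2, 0.15), (3, 0.55), (8, 0.30)]

topic_id = dominant_topic(example_distribution)
print(f"0-based topic id: {topic_id}, printed label: Topic {topic_id + 1}")
```

A document whose probabilities are spread almost evenly still receives a single label this way, so it can be worth keeping the winning probability alongside the id.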
### Summarizing the Clusters and Saving the Results

In [None]:
# Display the top clusters and summarize topics
top_clusters = df['topic'].value_counts().head(10)
for cluster, count in top_clusters.items():
    print(f"\nCluster {cluster + 1}:")
    print(f"Number of Documents: {count}")
    print(f"Top Words: {', '.join(top_words_per_topic[cluster])}")
    print("--------")

# Save the results to a CSV file
df.to_csv("topic_modeling_results.csv", index=False)


Cluster 9:
Number of Documents: 2000
Top Words: film, movi, indian, rrr, watch, action, scene, one, critic, tollywood
--------

Cluster 3:
Number of Documents: 1200
Top Words: movi, watch, rrr, action, much, one, film, like, lot, never
--------

Cluster 1:
Number of Documents: 1200
Top Words: film, watch, would, minut, danc, need, total, two, time, also
--------

Cluster 6:
Number of Documents: 1200
Top Words: movi, scene, hero, well, action, stori, everyth, modern, cgi, look
--------

Cluster 8:
Number of Documents: 800
Top Words: movi, see, ive, work, seen, rrr, great, use, get, fighter
--------

Cluster 4:
Number of Documents: 800
Top Words: movi, action, good, stori, rrr, make, charact, great, way, feel
--------

Cluster 7:
Number of Documents: 800
Top Words: film, say, day, im, thing, visual, even, seen, know, one
--------

Cluster 2:
Number of Documents: 800
Top Words: movi, action, one, block, like, real, song, guess, top, half
--------

Cluster 5:
Number of Documents: 800
Top 

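Several clusters share generic words such as `movi`, `film`, and `rrr`. One hedged way to see how distinct two topics are is the Jaccard similarity of their top-word lists. The sketch below uses the words printed for Topic 3 and Topic 9 above; the helper `jaccard` is a hypothetical name, and this measure is only one of several that could be used.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two word lists, treated as sets."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Top words copied from the printed output for Topic 3 and Topic 9
topic_3 = ["movi", "watch", "rrr", "action", "much", "one", "film", "like", "lot", "never"]
topic_9 = ["film", "movi", "indian", "rrr", "watch", "action", "scene", "one", "critic", "tollywood"]

shared = sorted(set(topic_3) & set(topic_9))
overlap = jaccard(topic_3, topic_9)
print(shared)
print(round(overlap, 3))
```

Six of the ten words are shared here, so the words that actually separate these two topics are the remaining ones, for example `indian`, `critic`, and `tollywood` for Topic 9.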
# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

(1) Describe the features used for sentiment classification and explain why you selected them.

(2) Select two supervised learning algorithms from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, and build a sentiment classifier with each. Note: cross-validation (5-fold or 10-fold) should be conducted. Here is a reference on cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the two algorithms you selected in terms of accuracy, precision, recall, and F1 score. Here is a reference on how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

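For reference, the four metrics named in item (3) can all be derived from the counts of true positives, false positives, false negatives, and true negatives. The sketch below works through the arithmetic on made-up counts; the function name `binary_metrics` and the numbers are hypothetical and unrelated to the assignment data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for the positive class (illustrative only)
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision penalizes false positives, recall penalizes false negatives, and F1 is their harmonic mean, so it drops when either one is low.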
### Features for Sentiment Classification

In [None]:
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the annotated dataset from Assignment 3
df = pd.read_csv("/content/annotated_dataset (1).csv")

# Features: TF-IDF representation of the cleaned reviews (top 1,000 terms, English stop words removed)
# Note: the vectorizer is fit on the full dataset before the split, so test-set
# term statistics influence the features; fitting on the training split only avoids this.
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['Cleaned_Review'])

# Target: sentiment labels from the annotated dataset
y = df['sentiment']

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save the fitted TF-IDF vectorizer for later use
joblib.dump(tfidf_vectorizer, 'TF-IDF_vectorizer.pkl')


['TF-IDF_vectorizer.pkl']

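As a rough illustration of what the TF-IDF features encode, the sketch below computes the weights by hand in plain Python. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is how scikit-learn documents the default `TfidfVectorizer` weighting. The helper `tfidf_rows` and the toy documents are hypothetical, and stop-word removal and the `max_features` cap are left out.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document using smoothed IDF and L2 normalization."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized_docs:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

# Made-up, already-cleaned reviews (illustrative only)
docs = ["great action scene".split(), "bore action".split(), "great great stori".split()]
rows = tfidf_rows(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that occur in fewer documents receive a larger weight, which is why TF-IDF tends to emphasize words that distinguish one review from another rather than words common to all of them.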
### Building Sentiment Classifiers
#### SVM

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Build an SVM classifier
svm_classifier = SVC(kernel='linear', C=1)
svm_classifier.fit(X_train, y_train)

# Cross-validation (5-fold)
cv_scores_svm = cross_val_score(svm_classifier, X_train, y_train, cv=5)

# Evaluate performance on the test set
y_pred_svm = svm_classifier.predict(X_test)
report_svm = classification_report(y_test, y_pred_svm)

# Print cross-validation scores and classification report for SVM
print("SVM Cross-Validation Scores:", cv_scores_svm)
print("\nSVM Classification Report:\n", report_svm)

SVM Cross-Validation Scores: [1. 1. 1. 1. 1.]

SVM Classification Report:
               precision    recall  f1-score   support

    negative       1.00      1.00      1.00       250
    positive       1.00      1.00      1.00      1750

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000



#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Build a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Cross-validation (5-fold)
cv_scores_rf = cross_val_score(rf_classifier, X_train, y_train, cv=5)

# Evaluate performance on the test set
y_pred_rf = rf_classifier.predict(X_test)
report_rf = classification_report(y_test, y_pred_rf)

# Print cross-validation scores and classification report for Random Forest
print("Random Forest Cross-Validation Scores:", cv_scores_rf)
print("\nRandom Forest Classification Report:\n", report_rf)

Random Forest Cross-Validation Scores: [1. 1. 1. 1. 1.]

Random Forest Classification Report:
               precision    recall  f1-score   support

    negative       1.00      1.00      1.00       250
    positive       1.00      1.00      1.00      1750

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000


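`cross_val_score` reports only accuracy by default. A possible variation, sketched below on made-up toy reviews, uses `cross_validate` with several scorers and places the vectorizer inside a pipeline so that each fold fits TF-IDF on its own training part. The toy texts, the 3-fold setting, and the macro-averaged scorers are assumptions chosen to keep the example small; they are not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy reviews and labels (illustrative only, not the assignment data)
texts = [
    "great action love it", "amaz stori great visual", "love the danc scene",
    "great film must watch", "best movi ever seen", "love charact and song",
    "bore plot bad act", "wast of time bad", "terribl bore film",
    "bad cgi weak stori", "worst movi ever", "bore and too long",
]
labels = ["positive"] * 6 + ["negative"] * 6

# The vectorizer sits inside the pipeline, so each fold fits TF-IDF on its own training part
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1))
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

results = cross_validate(pipeline, texts, labels, cv=folds, scoring=scoring)
for metric in scoring:
    print(metric, results[f"test_{metric}"].mean().round(3))
```

The same pattern should work with `RandomForestClassifier` in place of `SVC`, which would give per-fold values of all four metrics for both models.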

### Comparison of SVM and Random Forest

#### Support Vector Machine (SVM) Metrics

Cross-validation scores: [1. 1. 1. 1. 1.]. Accuracy: 1.00. Precision (positive): 1.00. Recall (positive): 1.00. F1-score (positive): 1.00.

#### Random Forest Metrics

Cross-validation scores: [1. 1. 1. 1. 1.]. Accuracy: 1.00. Precision (positive): 1.00. Recall (positive): 1.00. F1-score (positive): 1.00.

#### Comparison

Accuracy: both models achieved an accuracy of 1.00 on the 2,000-review test set.

Precision: both models reached 1.00 for the positive class, meaning no false positives.

Recall: both models reached 1.00 for the positive class, meaning no false negatives.

F1-score: both models reached 1.00 for the positive class, which follows directly from perfect precision and recall.

#### Conclusion

Both the Support Vector Machine and Random Forest models achieve perfect scores on every cross-validation fold and on the held-out test set, so these metrics do not separate them. Results this uniform are unusual for review text and may point to duplicated reviews shared across the splits, so they are worth verifying before treating either model as the better one.

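One way to probe whether scores like these reflect repeated reviews across the splits is to measure how many test texts also occur verbatim in the training texts. The sketch below works on plain lists of strings; the function name `overlap_fraction` and the toy data are hypothetical. Applying it to the notebook's data would require splitting the raw `Cleaned_Review` texts with the same split as the features.

```python
def overlap_fraction(train_texts, test_texts):
    """Fraction of test texts that also appear verbatim in the training texts."""
    if not test_texts:
        return 0.0
    train_set = set(train_texts)
    repeated = sum(1 for text in test_texts if text in train_set)
    return repeated / len(test_texts)

# Made-up example: two of the four test reviews also occur in the training list
train_texts = ["great movi", "bore film", "love the action", "great movi"]
test_texts = ["great movi", "weak stori", "bore film", "nice song"]

print(f"Share of test reviews seen in training: {overlap_fraction(train_texts, test_texts):.2f}")
```

A value close to 1.0 would suggest that the classifiers are largely being tested on reviews they have already seen.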
# **Question 3: House price prediction**

(40 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


### In this dataset, the target variable is "SalePrice"

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

# Load the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Separate features and target variable
X = train_data.drop('SalePrice', axis=1)
y = train_data['SalePrice']

# Identify numerical and categorical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create the model pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', RandomForestRegressor(random_state=42))])

# Split the data into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Predictions on the validation set
valid_preds = model.predict(X_valid)

# Evaluate the model
# Take the square root explicitly; the squared=False argument was deprecated in scikit-learn 1.4
rmse = mean_squared_error(y_valid, valid_preds) ** 0.5
print(f'Root Mean Squared Error on the validation set: {rmse}')


Root Mean Squared Error on the validation set: 28561.708782396403

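The reported value is in the same units as `SalePrice`, and large individual errors weigh more heavily because each error is squared before averaging. The sketch below reproduces the calculation by hand on made-up prices; the function name `rmse` and the numbers are hypothetical and unrelated to the validation set above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sale prices and predictions (illustrative only)
actual_prices = [200_000, 150_000, 320_000]
predicted_prices = [210_000, 140_000, 290_000]
print(f"RMSE: {rmse(actual_prices, predicted_prices):.2f}")
```

In this toy case the single 30,000 error dominates the result, which illustrates why a few badly predicted houses can raise the RMSE noticeably.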