
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster. 


In [None]:
# LDA is used for topic modeling, with bag-of-words term counts as the features.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv('annotated_amazon_reviews.csv')

vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(df['clean_text'])

# Topic modeling with Latent Dirichlet Allocation (LDA)
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

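# Column j of lda.components_ corresponds to feature_names[j]; the key order of
# vectorizer.vocabulary_ is not the column order, so it cannot be indexed by position.
feature_names = vectorizer.get_feature_names_out()

def get_topic_title(topic):
    # Join the 10 highest-weighted terms of one topic row into a title
    words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    return ' '.join(words).capitalize()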

print('Top 10 Topics:')
for i, topic in enumerate(lda.components_):
    print(f'{i+1}. {get_topic_title(topic)}')

print('\nTop 10 Clusters:')
clusters = lda.transform(X)
df_clusters = pd.DataFrame(clusters, columns=[f'Cluster {i+1}' for i in range(10)])
df_clusters['Max'] = df_clusters.idxmax(axis=1)
print(df_clusters['Max'].value_counts().head(10))

print('\nCluster Summaries:')
for i in range(10):
    print(f'\nCluster {i+1} Summary: {get_topic_title(lda.components_[i])}')
    print(df[df_clusters['Max'] == f'Cluster {i+1}']['clean_text'].head(5))


Top 10 Topics:
1. Leave knock gift rain sounding cameras included connected previous commit
2. Leave knock gift rain sounding cameras included connected previous commit
3. Activate fix concert moderate son situation wouldn neutral relocate games
4. Leave knock gift rain sounding cameras included connected previous commit
5. Activate looks sounded moderate fix alarms conversationally wouldn primary couldn
6. Instantly kasa activate repeat sounded cylinder dimmed bulb glows switch
7. Couldn activate general 40 way outlets life ahead wanted floor
8. Leave knock gift rain sounding cameras included connected previous commit
9. Seating briefing 15 close nearest barely makes telling respond difficult
10. Moderate oblige activate concert mini fix ala led second area

Top 10 Clusters:
Cluster 3     5
Cluster 5     1
Cluster 9     1
Cluster 10    1
Cluster 7     1
Cluster 6     1
Name: Max, dtype: int64

Cluster Summaries:

Cluster 1 Summary: New review 4th comparison honest dot vs generation 2n

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% data for training and 20% data for testing**.

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two supervised learning algorithms from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance in terms of accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is a reference on how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
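
The cross-validated comparison asked for in (2) and (3) can be sketched as follows. This is a minimal sketch under stated assumptions, not the assignment solution: the toy reviews, the choice of `MultinomialNB` and `LinearSVC`, and the use of macro-averaged metrics are all illustrative. Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned only from the training folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews standing in for the annotated dataset
positive = [
    "great product works perfectly",
    "love the sound quality",
    "excellent value highly recommend",
    "fast setup and reliable",
    "amazing device very happy",
] * 2
negative = [
    "terrible product stopped working",
    "hate the poor sound",
    "awful value do not recommend",
    "slow setup and unreliable",
    "broken device very disappointed",
] * 2
texts = positive + negative
labels = ["positive"] * len(positive) + ["negative"] * len(negative)

# Two candidate classifiers (an illustrative choice)
models = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
}

# Macro averaging treats every class equally and also works with 3+ classes
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, clf in models.items():
    # The vectorizer is refit inside each training fold
    pipe = make_pipeline(CountVectorizer(stop_words="english"), clf)
    scores = cross_validate(pipe, texts, labels, cv=cv, scoring=scoring)
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

Each entry in `results` holds the mean of one metric over the five folds, so the two algorithms can be compared on the same splits.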

In [None]:
''' In this code, the features used for sentiment analysis are the frequency of occurrence of words in the text data.
These features are obtained using the CountVectorizer class from the scikit-learn library. '''
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv('annotated_amazon_reviews.csv')

vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(df['clean_text'])
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

features = vectorizer.get_feature_names_out()

# Train and evaluate the Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
nb_predictions = nb_classifier.predict(X_test)
print('Naive Bayes Classifier:')
print(f'Accuracy: {accuracy_score(y_test, nb_predictions)}')
print(f'Precision: {precision_score(y_test, nb_predictions, pos_label="positive")}')
print(f'Recall: {recall_score(y_test, nb_predictions, pos_label="positive")}')
print(f'F1 score: {f1_score(y_test, nb_predictions, pos_label="positive")}')

# Train and evaluate the SVM classifier, tuning hyperparameters with 3-fold stratified cross-validation
svm_classifier = SVC()
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}
grid_search = GridSearchCV(svm_classifier, param_grid, cv=StratifiedKFold(n_splits=3), n_jobs=-1)
grid_search.fit(X_train, y_train)
svm_predictions = grid_search.predict(X_test)

print('\nSVM Classifier:')
print(f'Accuracy: {accuracy_score(y_test, svm_predictions)}')
print(f'Precision: {precision_score(y_test, svm_predictions, pos_label="positive")}')
print(f'Recall: {recall_score(y_test, svm_predictions, pos_label="positive")}')
print(f'F1 score: {f1_score(y_test, svm_predictions, pos_label="positive")}')

Naive Bayes Classifier:
Accuracy: 0.5
Precision: 1.0
Recall: 0.5
F1 score: 0.6666666666666666

SVM Classifier:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0




# **Question 3: House price prediction**

(40 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
test_df['SalePrice'] = -1

combined_df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

# Preprocessing: fill missing numeric values with the column mean. One-hot encoding the
# combined frame below gives the train and test rows the same dummy columns.
combined_df.fillna(combined_df.mean(numeric_only=True), inplace=True)

combined_df = pd.get_dummies(combined_df)

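# Copy the slices so that later column assignments do not trigger SettingWithCopyWarning
train_df = combined_df[combined_df['SalePrice'] != -1].copy()
test_df = combined_df[combined_df['SalePrice'] == -1].copy()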

scaler = StandardScaler()
train_df_scaled = scaler.fit_transform(train_df.drop(['SalePrice'], axis=1))
test_df_scaled = scaler.transform(test_df.drop(['SalePrice'], axis=1))

X_train, X_val, y_train, y_val = train_test_split(train_df_scaled, train_df['SalePrice'], test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_val_pred = lr.predict(X_val)


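# RMSE as the square root of the MSE (the squared=False argument is deprecated in recent scikit-learn releases)
rmse = mean_squared_error(y_val, y_val_pred) ** 0.5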
print('Validation set RMSE:', rmse)

predictions = lr.predict(test_df_scaled)

# Add the predicted SalePrice column and write the result to test.csv (this overwrites the original file with the encoded columns)
test_df['SalePrice'] = predictions
test_df.to_csv('test.csv', index=False)


  combined_df.fillna(combined_df.mean(), inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['SalePrice'] = predictions


Validation set RMSE: 4.041821481892244e+16
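
A validation RMSE on the order of 10^16 is far larger than any plausible sale price, so one useful sanity check is to compare a model's RMSE against a baseline that always predicts the mean training price. The sketch below illustrates that check with made-up numbers; the prices, the predictions, and the helper `rmse` are hypothetical and are not taken from the house price data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical sale prices and model predictions, for illustration only
y_train = np.array([100_000.0, 150_000.0, 200_000.0, 250_000.0])
y_val = np.array([120_000.0, 180_000.0, 240_000.0])
y_val_pred = np.array([110_000.0, 190_000.0, 230_000.0])


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


model_rmse = rmse(y_val, y_val_pred)

# Baseline: always predict the mean training price (features are ignored,
# so placeholder zero columns are enough)
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train), 1)), y_train)
baseline_pred = baseline.predict(np.zeros((len(y_val), 1)))
baseline_rmse = rmse(y_val, baseline_pred)

print(f"Model RMSE:    {model_rmse:,.2f}")
print(f"Baseline RMSE: {baseline_rmse:,.2f}")
```

A model whose validation RMSE is worse than this baseline is not learning anything useful from the features. That usually points to a preprocessing or numerical problem rather than to the data itself.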
