<a href="https://colab.research.google.com/github/KarunakarMuppuri/INFO-5731-Section-020/blob/main/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling, sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for the way topic modeling works, the connection to the human meanings of documents. Based on the dataset from assignment three, write a python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster. 


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

df = pd.read_csv('annotated_reviews.csv')

vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(df['Clean_Text'])

# Apply Latent Semantic Analysis (LSA)
lsa = TruncatedSVD(n_components=10, random_state=42)
X_lsa = lsa.fit_transform(X)

def get_topic_title(cluster):
    # Get the top 10 words for the given cluster
    indices = X_lsa[:, cluster].argsort()[::-1][:10]
    words = [list(vectorizer.vocabulary_.keys())[i] for i in indices]
    return ' '.join(words).capitalize()

print('Top 10 Topics:')
for i in range(10):
    print(f'{i+1}. {get_topic_title(i)}')

clusters = lsa.transform(X)
df_clusters = pd.DataFrame(clusters, columns=[f'Cluster {i+1}' for i in range(10)])
df_clusters['Max'] = df_clusters.idxmax(axis=1)

print('\nTop 10 Clusters:')
print(df_clusters['Max'].value_counts().head(10))

print('\nCluster Summaries:')
for i in range(10):
    print(f'\nCluster {i+1} Summary: {get_topic_title(i)}')
    print(df[df_clusters['Max'] == f'Cluster {i+1}']['Clean_Text'].head(5))


Top 10 Topics:
1. Trying surface got apple tablet microsoft say like standpoint ipad
2. Trying apple tablet like standpoint microsoft say ipad surface got
3. Apple tablet microsoft ipad say standpoint like surface got trying
4. Say like microsoft ipad tablet standpoint trying surface got apple
5. Tablet like ipad standpoint trying microsoft surface got apple say
6. Like ipad microsoft apple surface got standpoint trying tablet say
7. Ipad standpoint trying say apple surface got tablet microsoft like
8. Microsoft ipad standpoint trying surface got tablet apple say like
9. Standpoint like surface got apple say microsoft tablet trying ipad
10. Surface got standpoint say ipad like microsoft tablet apple trying

Top 10 Clusters:
Cluster 1    4
Cluster 5    1
Cluster 7    1
Cluster 8    1
Cluster 4    1
Cluster 9    1
Cluster 6    1
Name: Max, dtype: int64

Cluster Summaries:

Cluster 1 Summary: Trying surface got apple tablet microsoft say like standpoint ipad
0    I got this tablet after t

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis also known as opinion mining is a sub field within Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentimental polarities of opinions it contains, e.g., positive, negative, neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a python program to implement a sentiment classifier and evaluate its performance. Notice: **80% data for training and 20% data for testing**.  

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two of the supervised learning algorithm from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build a sentiment classifier respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9. 

In [None]:
# I have implemented Decision Tree and Random forest algorithms 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv('annotated_reviews.csv')

vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(df['Clean_Text'])
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

features = vectorizer.get_feature_names_out()

# Decision Tree classifier
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)
dt_predictions = dt_classifier.predict(X_test)
print('Random Forest Classifier:')
print(f'Accuracy: {accuracy_score(y_test, rf_predictions)}\nPrecision: {precision_score(y_test, rf_predictions, pos_label="positive")}\nRecall: {recall_score(y_test, rf_predictions, pos_label="positive")}\nF1 score: {f1_score(y_test, rf_predictions, pos_label="positive")}')


# Random Forest classifier with 5-folds cross validation
rf_classifier = RandomForestClassifier()
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 5, 10]
}
grid_search = GridSearchCV(rf_classifier, param_grid, cv=StratifiedKFold(n_splits=5), n_jobs=-1)
grid_search.fit(X_train, y_train)
rf_predictions = grid_search.predict(X_test)

print('\nRandom Forest Classifier:\n'
      f'Accuracy: {accuracy_score(y_test, rf_predictions)}\n'
      f'Precision: {precision_score(y_test, rf_predictions, pos_label="positive")}\n'
      f'Recall: {recall_score(y_test, rf_predictions, pos_label="positive")}\n'
      f'F1 score: {f1_score(y_test, rf_predictions, pos_label="positive")}')


'''The features used to evaluate the random forest and decision tree algorithms would be the same as in the original code 
The frequency of occurrence of words in the text data, obtained using the CountVectorizer class from the scikit-learn library. '''

Random Forest Classifier:
Accuracy: 1.0
 Precision: 1.0
 Recall: 1.0
 F1 score: 1.0





Random Forest Classifier:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0


'The features used to evaluate the random forest and decision tree algorithms would be the same as in the original code \nThe frequency of occurrence of words in the text data, obtained using the CountVectorizer class from the scikit-learn library. '

# **Question 3: House price prediction**

(40 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, an supervised learning model. The training data, testing data, and data description files can be download from canvas. Here is an axample for implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878. 


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

test_df['SalePrice'] = -1

cols = list(test_df.columns)
saleprice_col = cols.pop(cols.index('SalePrice'))
cols.append(saleprice_col)
test_df = test_df[cols]


test_df['SalePrice'] = -1

combined_df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

combined_df.fillna(combined_df.mean(), inplace=True)

combined_df = pd.get_dummies(combined_df)

train_df = combined_df[combined_df['SalePrice'] != -1]
test_df = combined_df[combined_df['SalePrice'] == -1]

scaler = StandardScaler()
train_df_scaled = scaler.fit_transform(train_df.drop(['SalePrice'], axis=1))
test_df_scaled = scaler.transform(test_df.drop(['SalePrice'], axis=1))

X_train, X_val, y_train, y_val = train_test_split(train_df_scaled, train_df['SalePrice'], test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_val_pred = lr.predict(X_val)

rmse = mean_squared_error(y_val, y_val_pred, squared=False)
print('Validation set RMSE:', rmse)

predictions = lr.predict(test_df_scaled)

'''Here it adds a new column called Sale price to the test.csv file, 
you may see other columns filled with 0 and 1's there are created so that test.csv file have same columns that of train.csv file'''
test_df['SalePrice'] = predictions
print("SalePrice column data: ")
print(test_df['SalePrice'])

test_df.to_csv('test.csv', index=False)


  combined_df.fillna(combined_df.mean(), inplace=True)


Validation set RMSE: 1505990627819833.8
SalePrice column data: 
1460   -2.329348e+17
1461   -2.502251e+17
1462   -2.285849e+17
1463   -2.138583e+17
1464   -2.705832e+17
            ...     
2914   -2.116083e+17
2915   -2.429156e+17
2916   -2.271459e+17
2917   -2.233136e+17
2918   -2.500418e+17
Name: SalePrice, Length: 1459, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['SalePrice'] = predictions
