<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling, sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for the way topic modeling works, the connection to the human meanings of documents. Based on the dataset from assignment three, write a python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster. 


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install bertopic

In [4]:
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

In [63]:
# load the dataset
Apple_data = pd.read_csv('/content/drive/MyDrive/Assignment four/Apple_Review.csv')

# set up a vectorizer to convert text data into numerical features
Vectorizer = CountVectorizer(stop_words='english')

# create a document-term matrix
X = Vectorizer.fit_transform(Apple_data['clean_text'])

# Initialize BERTopic model
BERT_model = BERTopic()

# Fit the model to the data
topics, _ = BERT_model.fit_transform(Apple_data["clean_text"])

topics = BERT_model.get_topics()
top10_topics = pd.Series(topics).value_counts().head(10)

# printing the top 10 modeling
print("Top 10 Clusters for Topic Modeling")
print(top10_topics)

Top 10 Clusters for Topic Modeling
[(the, 0.16338758349588467), (and, 0.14166262933509235), (is, 0.13784149135357973), (on, 0.08097440971640042), (ios, 0.0758529435461054), (for, 0.06513840019093091), (battery, 0.06513840019093091), (to, 0.05951016021562593), (iphone, 0.05951016021562593), (it, 0.053670599369862036)]    1
dtype: int64


In [61]:
# Summarize and describe the topic for each cluster
for i, (cluster, count) in enumerate(top10_topics.items()):
    print(f"Cluster {i+1}:")
    print(f"Number of documents: {count}")
    cluster_id = cluster[0]
    topic_words = BERT_model.get_topic(cluster_id)
    if topic_words:
        print(f"Top words: {topic_words[:10]}")
        topic_docs = BERT_model.transform(Apple_data['clean_text'])
        top_docs = np.argsort(np.array(topic_docs)[:, cluster_id])[::-1][:3]
        print(f"Sample Documents:\n{Apple_data.iloc[top_docs]}\n")
    else:
        print(f"No topics found for cluster {cluster_id}.")

Cluster 1:
Number of documents: 1
No topics found for cluster ('the', 0.16338758349588467).


# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis also known as opinion mining is a sub field within Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentimental polarities of opinions it contains, e.g., positive, negative, neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a python program to implement a sentiment classifier and evaluate its performance. Notice: **80% data for training and 20% data for testing**.  

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two of the supervised learning algorithm from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build a sentiment classifier respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9. 

We can take into account the following features when classifying sentiment:

**Text**: The textual substance of the reviews serves as the primary feature for sentiment classification. The cleaned text from the 'clean_text' column can be used as the input for our classifier. This feature is crucial since it contains data about user opinions and experiences that may be used to gauge the tone of the review.



**Sentiment:** The user-assigned sentiment label is contained in the'sentiment' column. This label can be used as a feature by being converted into a numerical rating (e.g., positive = 1, negative = 0, neutral = 0.5). This feature can assist the classifier in learning the relationship between the review's textual content and the user-assigned sentiment label.

In [43]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Split the data into features and target variable
X = Apple_data['clean_text']
y = Apple_data['sentiment']

# Vectorize the features using TF-IDF vectorizer
Vectorizer = TfidfVectorizer()
X = Vectorizer.fit_transform(X)

# Build the logistic regression classifier
lr_model = LogisticRegression()

# Perform 5-fold cross-validation and print the evaluation metrics
lr_scores = cross_validate(lr_model, X, y, cv=5, scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'])

print("Logistic Regression Scores:")
print(f"Accuracy: {lr_scores['test_accuracy'].mean()*100:.2f}%")
print(f"Precision: {lr_scores['test_precision_macro'].mean()*100:.2f}%")
print(f"Recall: {lr_scores['test_recall_macro'].mean()*100:.2f}%")
print(f"F1 Score: {lr_scores['test_f1_macro'].mean()*100:.2f}%")

Logistic Regression Scores:
Accuracy: 50.00%
Precision: 25.00%
Recall: 50.00%
F1 Score: 33.33%


In [41]:
# Build the random forest classifier
rf_model = RandomForestClassifier()

# Perform 5-fold cross-validation and print the evaluation metrics
rf_scores = cross_validate(rf_model, X, y, cv=5, scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'])
print("\nRandom Forest Scores:")
print(f"Accuracy: {rf_scores['test_accuracy'].mean()*100:.2f}%")
print(f"Precision: {rf_scores['test_precision_macro'].mean()*100:.2f}%")
print(f"Recall: {rf_scores['test_recall_macro'].mean()*100:.2f}%")
print(f"F1 Score: {rf_scores['test_f1_macro'].mean()*100:.2f}%")


Random Forest Scores:
Accuracy: 60.00%
Precision: 45.00%
Recall: 50.00%
F1 Score: 46.67%


# **Question 3: House price prediction**

(40 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, an supervised learning model. The training data, testing data, and data description files can be download from canvas. Here is an axample for implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878. 


In [66]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the training and testing data
train_data = pd.read_csv('/content/drive/MyDrive/Assignment four/assignment4-question3-data.zip (Unzipped Files)/train.csv')
test_data= pd.read_csv('/content/drive/MyDrive/Assignment four/assignment4-question3-data.zip (Unzipped Files)/test.csv')

# Concatenate the training and testing data to ensure consistent preprocessing
concat_data = pd.concat([train_data, test_data], sort=False)

# Drop the target variable and the ID column
target_data = concat_data.drop(['SalePrice', 'Id'], axis=1)

# Handle missing values
missing_data = target_data.fillna(target_data.mean())

# One-hot encode categorical features
encode_data = pd.get_dummies(missing_data)

# Split the data back into training and testing sets
train_set = encode_data[:len(train_data)]
test_set = encode_data[len(train_data):]
train_target = train_data['SalePrice']

# Create a linear regression model and fit it to the training data
lr_model = LinearRegression()
lr_model.fit(train_set, train_target)

# Make predictions on the testing set
test_predictions = lr_model.predict(test_set)

# Save the predictions to a CSV file
House_submission= pd.DataFrame({'Id': test_data['Id'], 'SalePrice': test_predictions})
House_submission.to_csv('House Price Prediction Sales.csv', index=False)
