<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for the way topic modeling works and its connection to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [5]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the dataset
data = pd.read_csv("annotated_dataset.csv")

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['clean_text'])

# Apply Latent Dirichlet Allocation (LDA)
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_output = lda_model.fit_transform(tfidf_matrix)

# Extract and interpret the top topics
def get_top_words_for_topic(topic, feature_names, n_top_words=10):
    return [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]

feature_names = tfidf_vectorizer.get_feature_names_out()
for i, topic in enumerate(lda_model.components_):
    top_words = get_top_words_for_topic(topic, feature_names)
    print(f"Top 10 words for Topic #{i}:")
    print(top_words)
    print()


Top 10 words for Topic #0:
['the', 'to', 'of', 'and', 'that', 'fiction', 'science', 'in', 'from', 'is']

Top 10 words for Topic #1:
['five', 'best', 'how', 'great', 'time', 'also', 'without', 'get', 'any', 'hans']

Top 10 words for Topic #2:
['the', 'of', 'it', 'and', 'movie', 'this', 'in', 'to', 'have', 'or']

Top 10 words for Topic #3:
['the', 'and', 'is', 'of', 'to', 'in', 'earth', 'aging', 'with', 'planet']

Top 10 words for Topic #4:
['the', 'this', 'movie', 'fan', 'is', 'was', 'to', 'recommend', 'mission', 'of']

Top 10 words for Topic #5:
['five', 'best', 'how', 'great', 'time', 'also', 'without', 'get', 'any', 'hans']

Top 10 words for Topic #6:
['movie', 'this', 'doubt', 'the', 'again', 'watch', 'it', 'to', 'so', 'is']

Top 10 words for Topic #7:
['the', 'to', 'and', 'it', 'is', 'this', 'of', 'that', 'we', 'in']

Top 10 words for Topic #8:
['the', 'and', 're', 'faults', 'riveting', 'bloopers', 'bathroom', 'movie', 'long', 'interest']

Top 10 words for Topic #9:
['the', 'all', 

In [None]:
"""Ten topics were extracted using Latent Dirichlet Allocation (LDA). Summary:

Features (text representation) used for topic modeling:
TF-IDF (Term Frequency-Inverse Document Frequency). Each document is represented as a vector in which each dimension gives the importance of a term in that document relative to the entire corpus.

Top 10 clusters for topic modeling, with a summary of the topic for each cluster:
1. Cluster 0: Science fiction-related topics.
2. Cluster 1: General discussions about quality, time, and opinions on Hans.
3. Cluster 2: Movie-related discussions, possibly about opinions or reviews.
4. Cluster 3: Topics related to Earth, aging, and planet.
5. Cluster 4: Movie-related discussions, possibly about recommendations or opinions.
6. Cluster 5: Repeated cluster, with the same top words as Cluster 1.
7. Cluster 6: Movie-related discussions, possibly expressing doubt or recommendations.
8. Cluster 7: General discussions, possibly about opinions or experiences.
9. Cluster 8: Movie-related discussions, possibly pointing out faults or interesting aspects.
10. Cluster 9: Movie-related discussions, possibly about cinematography and production quality.
"""

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% of the data for training and 20% for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) to build two sentiment classifiers. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
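
As a small worked example of the four metrics named in step 3, the sketch below computes them by hand for a binary case from the counts of true/false positives and negatives. The function name `binary_metrics` and the eight toy labels are hypothetical and exist only for illustration; in the assignment itself the scikit-learn metric functions do the same job.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "positive", "negative", "negative",
          "negative", "positive", "negative", "positive"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels every metric comes out to 0.75, since precision and recall are both 3/4 and accuracy is 6/8.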

In [6]:
# Write your code here

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
data = pd.read_csv("annotated_dataset.csv")

# Split the data into train and test sets (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(data['clean_text'], data['sentiment'], test_size=0.2, random_state=42)

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Define classifiers
logistic_regression = LogisticRegression(max_iter=1000)
random_forest = RandomForestClassifier(n_estimators=100)

# Cross-validation (5-fold)
cv_scores_lr = cross_val_score(logistic_regression, X_train_tfidf, y_train, cv=5, scoring='accuracy')
cv_scores_rf = cross_val_score(random_forest, X_train_tfidf, y_train, cv=5, scoring='accuracy')

# Train classifiers
logistic_regression.fit(X_train_tfidf, y_train)
random_forest.fit(X_train_tfidf, y_train)

# Evaluate performance on the test set
y_pred_lr = logistic_regression.predict(X_test_tfidf)
y_pred_rf = random_forest.predict(X_test_tfidf)

# Calculate performance metrics
accuracy_lr = accuracy_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr, average='weighted')
recall_lr = recall_score(y_test, y_pred_lr, average='weighted')
f1_lr = f1_score(y_test, y_pred_lr, average='weighted')

accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf, average='weighted')
recall_rf = recall_score(y_test, y_pred_rf, average='weighted')
f1_rf = f1_score(y_test, y_pred_rf, average='weighted')

# Print performance metrics
print("Logistic Regression Performance Metrics:")
print(f"Accuracy: {accuracy_lr:.4f}")
print(f"Precision: {precision_lr:.4f}")
print(f"Recall: {recall_lr:.4f}")
print(f"F1 Score: {f1_lr:.4f}")
print()
print("Random Forest Performance Metrics:")
print(f"Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")
print(f"F1 Score: {f1_rf:.4f}")




Logistic Regression Performance Metrics:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000

Random Forest Performance Metrics:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
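
Perfect scores on every metric for both models are unusual, so it can be worth looking at the cross-validation scores as well, which the cell above computes but never prints. The sketch below is one possible way to do that, under assumptions: it wraps the vectorizer and classifier in a scikit-learn `Pipeline` so that TF-IDF is refit inside each fold, and it uses a small made-up corpus in place of `annotated_dataset.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical reviews standing in for data['clean_text'] and data['sentiment'].
positive = [
    "a great movie with a wonderful story",
    "i loved the acting and the music",
    "brilliant film and a great cast",
    "wonderful visuals and a touching ending",
    "one of the best movies this year",
    "great fun and i loved every minute",
    "a brilliant and moving story",
    "the best science fiction film in years",
    "loved it and would watch it again",
    "wonderful cast and a great script",
]
negative = [
    "a boring movie with a terrible story",
    "i hated the acting and the pacing",
    "awful film and a weak cast",
    "terrible visuals and a dull ending",
    "one of the worst movies this year",
    "boring and i hated every minute",
    "an awful and tedious story",
    "the worst science fiction film in years",
    "hated it and would not watch it again",
    "dull cast and a terrible script",
]
texts = positive + negative
labels = ["positive"] * len(positive) + ["negative"] * len(negative)

# 80/20 split, stratified so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The vectorizer lives inside the pipeline, so each CV fold fits TF-IDF
# on its own training portion only.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy per fold:", cv_scores)
print("CV accuracy mean:", cv_scores.mean())

# Final fit on the full training split, then a single test-set evaluation.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The same pipeline object can be built with `RandomForestClassifier` in place of `LogisticRegression` to compare the two models fold by fold.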


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you select those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.
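
One way to ground the feature choice in step 2 is to rank numeric columns by their absolute correlation with `SalePrice`. The sketch below illustrates the idea on synthetic data, not on the real `train.csv`: the column values, the coefficients used to generate `SalePrice`, and the 0.3 cut-off are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame(
    {
        "OverallQual": rng.integers(1, 11, size=n),
        "GrLivArea": rng.uniform(500, 3000, size=n),
        "MiscVal": rng.uniform(0, 1000, size=n),      # unrelated to price here
        "Street": rng.choice(["Pave", "Grvl"], size=n),  # non-numeric column
    }
)
df["SalePrice"] = (
    20000 * df["OverallQual"]
    + 50 * df["GrLivArea"]
    + rng.normal(0, 10000, size=n)
)

# Correlation is only defined for numeric columns.
numeric = df.select_dtypes(include="number")

# Absolute Pearson correlation of each feature with the target, strongest first.
ranking = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)

# Keep features above an (arbitrary) correlation threshold.
selected = ranking[ranking > 0.3].index.tolist()
print("Selected features:", selected)
```

Correlation only captures linear, pairwise relationships, so a ranking like this is a starting point for the explanation rather than a complete justification.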

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the data
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Explore the data
print(train_data.info())

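# Handle missing values
missing_cols = train_data.columns[train_data.isna().any()]  # Columns with missing values in the training data
train_data.drop(columns=missing_cols, inplace=True)  # Drop columns with missing values
test_data.drop(columns=missing_cols, inplace=True, errors="ignore")  # Drop the same columns in the test data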

# Select features for regression
selected_features = ['OverallQual', 'GrLivArea', 'FullBath', 'YearBuilt']

# Prepare the data
X_train = train_data[selected_features]
y_train = train_data['SalePrice']
X_test = test_data[selected_features]

# Split data for training and testing
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any other related model.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform the sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [None]:
# Write your code here
"""
1. Original Pretraining Data Sources: BERT (Bidirectional Encoder Representations from Transformers) was pre-trained on a large corpus of text data, which includes BooksCorpus (800 million words) and English Wikipedia (2.5 billion words). The model was trained using a masked language modeling (MLM) objective, where random words in the input sentences are masked, and the model is trained to predict these masked words based on the context provided by the surrounding words.

2. Number of Parameters: The BERT-base model consists of 12 transformer layers, each with 12 attention heads, for a total of about 110 million parameters. There are also larger variants such as BERT-large, which has 24 transformer layers and about 340 million parameters.

3. Task-specific Fine-tuning: BERT can be fine-tuned for various downstream natural language processing (NLP) tasks, such as sentiment analysis, text classification, question answering, and more. Fine-tuning involves adding a task-specific classification layer on top of the pre-trained BERT model and then training the entire model on task-specific labeled data. During fine-tuning, the weights of the added classification layer and some or all of the pre-trained BERT layers are updated based on the task-specific objective. Fine-tuning BERT for sentiment analysis involves training the model on a dataset where each input text is associated with a sentiment label (e.g., positive, negative, or neutral).
"""
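
The figure of roughly 110 million parameters can be checked with a little arithmetic. The sketch below tallies the weights of a BERT-base encoder under assumed configuration values that are not stated in the text above: a 30,522-token vocabulary, 512 positions, 2 segment types, hidden size 768, and a feed-forward size of 3,072 (the values commonly published for `bert-base-uncased`).

```python
# Assumed bert-base-uncased configuration (not given in the description above).
vocab_size = 30522
max_positions = 512
segment_types = 2
hidden = 768
ffn = 3072
layers = 12


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Token, position, and segment embeddings, followed by one LayerNorm.
embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

# Self-attention: query, key, value, and output projections, plus LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm

# Feed-forward block: expand to ffn, project back to hidden, plus LayerNorm.
feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm

encoder = layers * (attention + feed_forward)
pooler = linear(hidden, hidden)  # dense layer applied to the [CLS] position

total = embeddings + encoder + pooler
print(f"Embeddings: {embeddings:,}")
print(f"Encoder:    {encoder:,}")
print(f"Pooler:     {pooler:,}")
print(f"Total:      {total:,}")  # 109,482,240
```

Note that the number of attention heads does not appear in the count: the 12 heads split the 768-dimensional projections rather than adding to them.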

In [None]:
import pandas as pd
from transformers import pipeline

# Load the annotated dataset
annotated_dataset_file = "annotated_dataset.csv"
annotated_df = pd.read_csv(annotated_dataset_file)

# Extract text and ground truth sentiment labels
texts = annotated_df['clean_text'].tolist()
ground_truth_sentiments = annotated_df['sentiment'].tolist()

# Initialize the zero-shot classification pipeline
# (no model is specified, so it defaults to facebook/bart-large-mnli, a BART model rather than BERT)
classifier = pipeline("zero-shot-classification")

# Perform sentiment analysis using zero-shot classification
predicted_sentiments = classifier(texts, candidate_labels=["positive", "negative"])

# Extract the top-scoring label for each text
predicted_labels = [prediction['labels'][0] for prediction in predicted_sentiments]

# Evaluate performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(ground_truth_sentiments, predicted_labels)
precision = precision_score(ground_truth_sentiments, predicted_labels, average='weighted')
recall = recall_score(ground_truth_sentiments, predicted_labels, average='weighted')
f1 = f1_score(ground_truth_sentiments, predicted_labels, average='weighted')

# Print evaluation metrics
print("Evaluation Metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")


No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
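
The log above shows that the pipeline fell back to `facebook/bart-large-mnli`; passing `model="facebook/bart-large-mnli"` to `pipeline(...)` would make that choice explicit. The sketch below illustrates, without downloading any model, how the pipeline's output can be turned into labels and compared with annotations. The `mock_results` list imitates the dictionaries the zero-shot pipeline returns (a `labels` list sorted by descending score), and the ground-truth values and the `normalize` helper are hypothetical.

```python
# Hypothetical results in the format returned by the zero-shot pipeline:
# one dict per input, with 'labels' ordered from highest to lowest score.
mock_results = [
    {"sequence": "great movie", "labels": ["positive", "negative"], "scores": [0.97, 0.03]},
    {"sequence": "boring plot", "labels": ["negative", "positive"], "scores": [0.91, 0.09]},
    {"sequence": "it was fine", "labels": ["positive", "negative"], "scores": [0.62, 0.38]},
]

# Hypothetical annotations with inconsistent casing and whitespace.
ground_truth = ["Positive", " negative", "negative"]

candidate_labels = {"positive", "negative"}


def normalize(label):
    """Lowercase and trim a label so it matches the candidate label strings."""
    return str(label).strip().lower()


# The first entry of 'labels' is the top-scoring prediction.
predicted = [result["labels"][0] for result in mock_results]
truth = [normalize(label) for label in ground_truth]

# Fail early if the annotations use labels the model was never offered
# (for example 'neutral'), since those can never be predicted correctly.
unknown = set(truth) - candidate_labels
if unknown:
    raise ValueError(f"Ground-truth labels not among candidates: {unknown}")

accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print(f"Accuracy: {accuracy:.2f}")
```

If the annotated dataset contains a third class such as neutral, it would need to be added to `candidate_labels` in the pipeline call as well.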



In [None]:
"""

Advantages:
1. Contextual Understanding: BERT captures context effectively for nuanced sentiment analysis.
2. Pre-trained Representations: Learns rich language representations from large text corpora.
3. Fine-tuning Flexibility: Adaptable to specific sentiment analysis tasks with minimal data.
4. State-of-the-Art Performance: Achieves top performance on various NLP tasks.

Disadvantages:
1. Computational Resources: Demands significant computational power and time.
2. Large Model Size: Size may hinder deployment in resource-constrained settings.
3. Domain Adaptation: May require fine-tuning for optimal performance in specific domains.
4. Complexity: Architecture complexity makes interpretation challenging.

Challenges:
1. Resource Constraints: Access to GPUs or TPUs may limit implementation.
2. Fine-tuning Parameters: Requires tuning hyperparameters for optimal performance.
3. Data Preparation: Data preprocessing for BERT's input requirements can be intricate.
4. Evaluation: Selecting and interpreting evaluation metrics accurately is crucial.

In summary, while BERT offers robust sentiment analysis capabilities, addressing challenges like resource constraints and model complexity is essential for successful implementation."""