
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import LdaModel
import gensim

# Load the sentiment-labelled dataset created in assignment three.
# The text column is 'Review' and the label column is 'sentiment'.
data = pd.read_csv('/content/Sentimental_analysed_dataset.csv')

# (1) Features (text representation) used for topic modeling
# Using CountVectorizer to create a document-term matrix
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(data['Review'])

# Convert the document-term matrix to a Gensim corpus
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Map integer ids back to words (the inverse of vectorizer.vocabulary_)
id2word = dict((v, k) for k, v in vectorizer.vocabulary_.items())

# (2) Top 10 clusters for topic modeling
# Fit LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=10, passes=10)

# (3) Summarize and describe the topic for each cluster
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)


(0, '0.020*"movie" + 0.020*"story" + 0.013*"hope" + 0.010*"andy" + 0.010*"red"')
(1, '0.018*"movie" + 0.013*"sway" + 0.013*"let" + 0.010*"film" + 0.008*"didn"')
(2, '0.001*"film" + 0.001*"prison" + 0.001*"andy" + 0.001*"shawshank" + 0.001*"redemption"')
(3, '0.033*"movie" + 0.013*"time" + 0.012*"just" + 0.012*"great" + 0.010*"film"')
(4, '0.021*"movie" + 0.016*"shawshank" + 0.015*"prison" + 0.011*"andy" + 0.011*"redemption"')
(5, '0.023*"film" + 0.012*"shawshank" + 0.011*"andy" + 0.011*"time" + 0.009*"best"')
(6, '0.001*"film" + 0.001*"shawshank" + 0.001*"best" + 0.001*"prison" + 0.001*"redemption"')
(7, '0.022*"film" + 0.017*"shawshank" + 0.013*"best" + 0.010*"andy" + 0.010*"prison"')
(8, '0.028*"shawshank" + 0.014*"film" + 0.009*"andy" + 0.009*"just" + 0.009*"time"')
(9, '0.032*"film" + 0.015*"best" + 0.014*"shawshank" + 0.011*"like" + 0.010*"king"')
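
To support step (3), the printed topic strings can be parsed into word and weight pairs before writing a summary. The sketch below is illustrative only: `parse_topic` and `is_flat` are hypothetical helper names, the 0.002 threshold is an arbitrary assumption, and the three strings are copied from the output above. It flags topics such as 2 and 6, whose top words all carry a weight of about 0.001 and which therefore probably add little distinct meaning.

```python
import re

# Three topic strings copied from the LDA output above.
topics = [
    (0, '0.020*"movie" + 0.020*"story" + 0.013*"hope" + 0.010*"andy" + 0.010*"red"'),
    (2, '0.001*"film" + 0.001*"prison" + 0.001*"andy" + 0.001*"shawshank" + 0.001*"redemption"'),
    (9, '0.032*"film" + 0.015*"best" + 0.014*"shawshank" + 0.011*"like" + 0.010*"king"'),
]


def parse_topic(topic_str):
    """Split a print_topics() string into (word, weight) pairs."""
    matches = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in matches]


def is_flat(pairs, threshold=0.002):
    """Hypothetical heuristic: every top word has a negligible weight."""
    return all(weight < threshold for _, weight in pairs)


for topic_id, topic_str in topics:
    pairs = parse_topic(topic_str)
    words = ", ".join(word for word, _ in pairs)
    note = " (flat, likely uninformative)" if is_flat(pairs) else ""
    print(f"Topic {topic_id}: {words}{note}")
```

Several of the remaining topics share the same top words ("film", "shawshank", "andy"), so adding domain-specific stop words or reducing the number of topics may make the clusters easier to describe.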


# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: use **80% of the data for training and 20% for testing**.

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two supervised learning algorithms from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is a reference on cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
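
As a quick reference for item (3), the four metrics can be computed directly from confusion-matrix counts. The sketch below treats one class as positive; the function name `binary_metrics` and the toy labels are made up for illustration and are not taken from the assignment dataset.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With more than two classes, scikit-learn's `average='weighted'` option computes these scores per class and averages them weighted by how many true samples each class has.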

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_predict

# Load the dataset from assignment three.
# The text column is 'Review' and the label column is 'sentiment'.
data = pd.read_csv('/content/Sentimental_analysed_dataset.csv')

# Create a train-test split (80% for training, 20% for testing)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Features used for sentiment classification
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train = tfidf_vectorizer.fit_transform(train_data['Review'])
X_test = tfidf_vectorizer.transform(test_data['Review'])
y_train = train_data['sentiment']
y_test = test_data['sentiment']


# Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation on the training split, scored by accuracy
rf_scores = cross_val_score(rf_classifier, X_train, y_train, cv=5, scoring='accuracy')


# Random Forest performance metrics, computed from out-of-fold predictions
# on the training split (the held-out test split is not evaluated here)
rf_classifier.fit(X_train, y_train)
rf_predictions = cross_val_predict(rf_classifier, X_train, y_train, cv=5)
rf_accuracy = accuracy_score(y_train, rf_predictions)
rf_precision = precision_score(y_train, rf_predictions, average='weighted')
rf_recall = recall_score(y_train, rf_predictions, average='weighted')
rf_f1 = f1_score(y_train, rf_predictions, average='weighted')

# Display results

print("Random Forest Metrics:")
print("Accuracy:", rf_accuracy)
print("Precision:", rf_precision)
print("Recall:", rf_recall)
print("F1 Score:", rf_f1)


Random Forest Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
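
A perfect score on every metric is unusual for review text, so a sanity check on the labels may be worthwhile before interpreting the result. One possible cause (an assumption, not something verified against this dataset) is a label column dominated by a single class. The sketch below computes the accuracy obtained by always predicting the most frequent label; `majority_baseline` is a hypothetical helper and the labels are toy values.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels, invented for illustration; in the notebook this could be
# called as majority_baseline(list(y_train)).
labels = ["positive"] * 9 + ["negative"]
label, baseline = majority_baseline(labels)
print(f"Majority class: {label}, baseline accuracy: {baseline:.2f}")
```

If the baseline is already close to 1.0, the classifier's score says little about what it has learned.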


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [3]:
# Write your code here

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer


# Load the training data; it contains both string and numeric columns
data = pd.read_csv('/content/train.csv')

# Display basic information about the dataset
data.info()

# Separate numeric features (X) and target variable (y).
# 'SalePrice' is the target and 'Id' is a row identifier, so both are
# excluded from the features to avoid target leakage.
numeric_features = data.select_dtypes(include=[np.number]).columns
X = data[numeric_features].drop(columns=['SalePrice', 'Id'])
y = data['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preprocessing - Impute missing values and standardize the features
numeric_imputer = SimpleImputer(strategy='mean')
X_train = numeric_imputer.fit_transform(X_train)
X_test = numeric_imputer.transform(X_test)

# Standardize numeric features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC
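
The regression cell reports mean squared error and R-squared. As a reminder of what those numbers mean, the sketch below computes both by hand on toy values (not drawn from the housing data); the function names are illustrative.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, invented for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]
mse = mean_squared_error_manual(y_true, y_pred)
r2 = r2_manual(y_true, y_pred)
print(f"MSE: {mse:.2f}, R-squared: {r2:.2f}")
```

Because MSE is in squared price units, its square root (RMSE) is often easier to read alongside sale prices.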

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **pre-trained Large Language Model (LLM) from the Hugging Face Repository** for your specific task using the data collected in Assignment 3. After creating an account on Hugging Face (https://huggingface.co/), choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any Meta-based text analysis model. Provide a brief description of the selected LLM, including its original sources, significant parameters, and any task-specific fine-tuning if applied.

Perform a detailed analysis of the LLM's performance on your task, including key metrics, strengths, and limitations. Additionally, discuss any challenges encountered during the implementation and potential strategies for improvement. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


GPT-3, BERT, and RoBERTa are distinct models designed for different purposes. GPT-3 is a generative language model developed by OpenAI; it is accessed through OpenAI's API rather than downloaded from the Hugging Face Hub. BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT Pretraining Approach) are encoder models designed for natural language understanding tasks. Each model has its strengths and weaknesses, and the choice depends on the requirements of the task.

The following is a general overview of how to approach this task with a model from the Hugging Face repository, using BERT as the example. Specific implementations will vary with the model chosen.

Example using BERT:

Selecting a Model from Hugging Face:

- Go to the Hugging Face Model Hub (https://huggingface.co/models).
- Choose a pre-trained BERT model, such as bert-base-uncased or another variant suited to the task.

Brief Description:

- BERT (Bidirectional Encoder Representations from Transformers) was developed by Google for natural language understanding. It represents each word using both its left and right context.

Significant Parameters:

- Architecture: bert-base-uncased has 12 Transformer layers, a hidden size of 768, 12 attention heads, and about 110 million parameters.
- Fine-tuning: the most important hyperparameters are the learning rate, batch size, and number of training epochs.

Task-Specific Fine-Tuning:

- Depending on the task (e.g., sentiment analysis, text classification), the pre-trained BERT model may need to be fine-tuned on the task dataset.
- Fine-tuning continues training the model on task-specific labeled data so that it adapts to the particular use case.

Performance Analysis:

Metrics:

- Common metrics for text classification tasks include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).
- Text generation tasks are typically evaluated with perplexity or BLEU score.

Strengths:

- BERT is known for its contextual understanding, which makes it suitable for a wide range of NLP tasks.
- Pre-trained models from Hugging Face are easily accessible and can be fine-tuned with far less data than training from scratch would need.

Limitations:

- BERT can be computationally expensive and memory-intensive.
- Fine-tuning still requires labeled data, and harder tasks may need a substantial amount.

Challenges and Strategies for Improvement:

- Data Quality: Ensure your dataset is representative and of high quality.
- Computational Resources: Address any computational constraints, such as GPU availability.
- Fine-tuning: Experiment with various hyperparameters during fine-tuning to optimize model performance.

Conclusion:

Selecting the appropriate model depends on your specific task and dataset. Fine-tuning a pre-trained LLM can yield impressive results, but it requires careful consideration of model architecture, hyperparameters, and evaluation metrics. Regularly checking the Hugging Face Model Hub for new models and improvements can also be beneficial.
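
As a concrete starting point for the workflow described above, the sketch below wraps a Hugging Face `pipeline` for sentiment classification. It is an illustration under stated assumptions rather than the analysis itself: the checkpoint `distilbert-base-uncased-finetuned-sst-2-english` (a DistilBERT model fine-tuned on SST-2, swapped in here for plain BERT) is an assumed choice, the helper names are hypothetical, and `transformers` is imported lazily so the helper functions can be used without downloading a model.

```python
def load_classifier(model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Build a sentiment pipeline; downloads the weights on first use."""
    # Lazy import: requires `pip install transformers torch`.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name)


def to_labels(predictions):
    """Convert results like {'label': 'POSITIVE', 'score': 0.99} to lowercase labels."""
    return [pred["label"].lower() for pred in predictions]


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Intended use in the notebook (not executed here, since it needs a download):
#   classifier = load_classifier()
#   raw = classifier(list(data["Review"]), truncation=True)
#   print(accuracy(list(data["sentiment"]), to_labels(raw)))
# This assumes the 'sentiment' column holds the strings 'positive' and 'negative'.

# Offline demonstration with hand-written results in the pipeline's output format.
example = [{"label": "POSITIVE", "score": 0.99},
           {"label": "NEGATIVE", "score": 0.87}]
print(to_labels(example))
```

If the dataset's labels include a neutral class, a two-class SST-2 checkpoint cannot predict it, so either a three-class checkpoint or fine-tuning on the assignment data would be needed.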