
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.
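
For item (1), a bag-of-words count matrix is one common text representation for LDA. The sketch below builds such counts by hand for a few made-up sentences. The `docs` list and the whitespace tokenizer are illustrative assumptions, not the assignment data or scikit-learn's exact tokenization; the aim is only to show what a count-based feature matrix conceptually contains.

```python
from collections import Counter

# Hypothetical mini-corpus; not taken from the assignment dataset.
docs = [
    "great vfx and great action",
    "the story was weak",
    "action scenes and story both great",
]

# Tokenize by lowercasing and splitting on whitespace (a simplification).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocab = sorted({tok for toks in tokenized for tok in toks})


def count_vector(tokens):
    """Return term counts for one document, aligned with `vocab`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


# One row per document, one column per vocabulary term.
matrix = [count_vector(toks) for toks in tokenized]

print(vocab)
for doc, row in zip(docs, matrix):
    print(row, "<-", doc)
```

Each row is one document and each column one vocabulary term, which is the shape of input that a count-based topic model consumes.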


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the dataset
df = pd.read_csv('/content/Baahubali_Sentiment_Reviews.csv')

# Name of the column that contains the review text
text_column = 'Review Text'

# Feature extraction using CountVectorizer
vectorizer = CountVectorizer(max_df=0.85, max_features=5000, stop_words='english')
X = vectorizer.fit_transform(df[text_column])

# Topic modeling using LDA
num_topics = 10
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(X)

# Display top words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic #{topic_idx + 1}:")
    print([feature_names[i] for i in topic.argsort()[:-11:-1]])
    print()

# Assign topics to documents
df['topic'] = lda.transform(X).argmax(axis=1)
top_clusters = df['topic'].value_counts().head(10)
print("Top 10 Clusters:")
print(top_clusters)

# Summarize and describe the topic for each cluster
for cluster, count in top_clusters.items():
    cluster_df = df[df['topic'] == cluster]
    example_text = cluster_df[text_column].iloc[0]
    print(f"\nCluster #{cluster + 1} Summary:")
    print(f"Example Text: {example_text}")
    print(f"The Number of Documents in Cluster: {count}")
    print()

Topic #1:
['movie', 'baahubali', 'disappointed', 'better', 'good', 'action', 'like', 'cartoon', 'wait', 'watching']

Topic #2:
['movie', 'good', 'baahubali', 'scenes', 'like', 'vfx', 'film', 'epic', 'bahubali', 'act']

Topic #3:
['indian', 'cinema', 'great', 'anushka', 'got', 'cgi', 'soldiers', 'movie', 'best', 'trees']

Topic #4:
['film', 'baahubali', 'fight', 'know', 'climax', 'like', 'good', 'just', 'love', 'songs']

Topic #5:
['movie', 'hype', 'people', 'baahubali', 'like', 'scenes', 'villain', 'collections', 'just', 'drama']

Topic #6:
['prabhas', 'bahubali', 'performance', 'script', 'point', 'ss', 'asked', 'talking', 'vfx', 'effects']

Topic #7:
['film', 'scenes', 'indian', 'films', 'commercial', 'action', 'conclusion', 'bahubali', '10', 'chariot']

Topic #8:
['movie', 'vfx', 'indian', 'don', 'people', 'think', 'like', 'know', 'story', 'just']

Topic #9:
['movie', 'movies', 'good', 'wife', 'money', 'brother', 'bahubali', 'actually', 'effort', 'story']

Topic #10:
['movie', 'seque

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% of the data is for training and 20% is for testing**.

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two of the supervised learning algorithms from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
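
As a quick reference for item (3), the four metrics can be computed directly from prediction counts. The sketch below handles the binary case only, with made-up labels (`y_true` and `y_pred` are hypothetical); multi-class averaging such as `average='weighted'` is not covered here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is zero, i.e. when the positive
    # class is never predicted (precision) or never present (recall).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}, F1: {f1:.2f}")
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.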

In [2]:
# Write your code here
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Load the dataset
df = pd.read_csv('/content/Baahubali_Sentiment_Reviews.csv')

# 'Review Text' holds the review text; 'Sentiment' holds the sentiment label
X = df['Review Text']
y = df['Sentiment']

# Split the data into training and testing sets (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Features: TF-IDF representation of the text data
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Function to evaluate the model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    return accuracy, precision, recall, f1

# Model 1: Naive Bayes
nb_model = make_pipeline(TfidfVectorizer(max_features=5000, stop_words='english'), MultinomialNB())
nb_scores = cross_val_score(nb_model, X, y, cv=5)  # 5-fold cross-validation

# Model 2: Logistic Regression
lr_model = make_pipeline(TfidfVectorizer(max_features=5000, stop_words='english'), LogisticRegression())
lr_scores = cross_val_score(lr_model, X, y, cv=5)  # 5-fold cross-validation

# Display results
print("Naive Bayes Cross-Validation Scores:", nb_scores)
print("Mean Accuracy: {:.2f}".format(nb_scores.mean()))

print("\nLogistic Regression Cross-Validation Scores:", lr_scores)
print("Mean Accuracy: {:.2f}".format(lr_scores.mean()))

# Train and evaluate models on the test set
nb_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)

print("\nEvaluation on Test Set:")
print("\nNaive Bayes:")
nb_accuracy, nb_precision, nb_recall, nb_f1 = evaluate_model(nb_model, X_test, y_test)
print("Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1 Score: {:.2f}".format(nb_accuracy, nb_precision, nb_recall, nb_f1))

print("\nLogistic Regression:")
lr_accuracy, lr_precision, lr_recall, lr_f1 = evaluate_model(lr_model, X_test, y_test)
print("Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1 Score: {:.2f}".format(lr_accuracy, lr_precision, lr_recall, lr_f1))

Naive Bayes Cross-Validation Scores: [1. 1. 1. 1. 1.]
Mean Accuracy: 1.00

Logistic Regression Cross-Validation Scores: [1. 1. 1. 1. 1.]
Mean Accuracy: 1.00

Evaluation on Test Set:

Naive Bayes:
Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1 Score: 1.00

Logistic Regression:
Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1 Score: 1.00
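
The cross-validation in the cell above reports accuracy only. A minimal sketch of collecting precision, recall, and F1 per fold as well is shown below, using scikit-learn's `cross_validate` with several scorers. The toy `texts` and `labels` are hypothetical stand-ins for the review data, and the default vectorizer and classifier settings are assumptions rather than the configuration used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two balanced classes (not the assignment data).
texts = [f"great film number {i} loved it" for i in range(10)] + [
    f"boring film number {i} hated it" for i in range(10)
]
labels = ["positive"] * 10 + ["negative"] * 10

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary statistics leak from the held-out fold.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

scoring = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]
results = cross_validate(pipeline, texts, labels, cv=5, scoring=scoring)

# Each scorer yields one value per fold under the key "test_<scorer name>".
for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.2f}")
```

The same call works for any other scikit-learn estimator placed at the end of the pipeline, which makes a side-by-side comparison of two classifiers straightforward.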


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# Load training data
train_data = pd.read_csv('/content/train.csv')

# Assuming 'SalePrice' is the target variable
y = train_data['SalePrice']
X = train_data.drop('SalePrice', axis=1)

# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns

# Create a column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), X.columns.difference(categorical_cols)),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a linear regression model with preprocessing
model = make_pipeline(preprocessor, SimpleImputer(strategy='mean'), LinearRegression())
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Load testing data
test_data = pd.read_csv('/content/test.csv')

# Assuming the testing data has the same features as the training data
# If not, you need to preprocess the testing data accordingly

# Make predictions on the testing set
test_predictions = model.predict(test_data)

# Save the predictions to a CSV file
submission_df = pd.DataFrame({'Id': test_data['Id'], 'SalePrice': test_predictions})
submission_df.to_csv('submission.csv', index=False)


Mean Squared Error: 4262060818.240992
R-squared: 0.44434425562664803
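
Because MSE is in squared price units, it is often easier to read as a root mean squared error (RMSE), which is in the same units as `SalePrice`. The sketch below defines MSE and R-squared by hand on a few made-up prices (`y_true` and `y_pred` are hypothetical) and then takes the square root of the MSE printed above.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices, for illustration only.
y_true = [200000, 150000, 320000, 275000]
y_pred = [210000, 140000, 300000, 280000]

print(f"MSE:  {mse(y_true, y_pred):.1f}")
print(f"RMSE: {math.sqrt(mse(y_true, y_pred)):.1f}")
print(f"R^2:  {r2(y_true, y_pred):.4f}")

# RMSE corresponding to the validation MSE reported in the output above.
reported_mse = 4262060818.240992
reported_rmse = math.sqrt(reported_mse)
print(f"Reported RMSE: {reported_rmse:.0f}")
```

Read this way, the reported validation error is on the order of 65,000 in price units, and the R-squared of about 0.44 means the model explains less than half of the variance in the validation prices.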


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **pre-trained Large Language Model (LLM) from the Hugging Face Repository** for your specific task using the data collected in Assignment 3. After creating an account on Hugging Face (https://huggingface.co/), choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any Meta-based text analysis model. Provide a brief description of the selected LLM, including its original source, significant parameters, and any task-specific fine-tuning applied.

Perform a detailed analysis of the LLM's performance on your task, including key metrics, strengths, and limitations. Additionally, discuss any challenges encountered during the implementation and potential strategies for improvement. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
import torch

# Load the dataset
df = pd.read_csv('/content/Baahubali_Sentiment_Reviews.csv')

# 'Review Text' holds the review text; 'Sentiment' holds the sentiment label
X = df['Review Text'].values
y = df['Sentiment'].apply(lambda x: 1 if x == 'positive' else 0).values  # Map 'positive' to 1 and every other value to 0

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use a subset for development
X_train, _, y_train, _ = train_test_split(X_train, y_train, train_size=0.1, random_state=42)

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # 2 for binary classification

# Tokenize and encode the text data
X_train_tokens = tokenizer(list(X_train), padding=True, truncation=True, return_tensors='pt', max_length=256)
X_test_tokens = tokenizer(list(X_test), padding=True, truncation=True, return_tensors='pt', max_length=256)

# Create DataLoader for training and testing sets
train_dataset = TensorDataset(X_train_tokens['input_ids'], X_train_tokens['attention_mask'], torch.tensor(y_train, dtype=torch.long))
test_dataset = TensorDataset(X_test_tokens['input_ids'], X_test_tokens['attention_mask'], torch.tensor(y_test, dtype=torch.long))

train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Fine-tune the BERT model on your task (training loop)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
num_epochs = 1  # Reduce the number of epochs for testing
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Evaluate the model on the testing set
model.eval()
predictions = []
with torch.no_grad():
    for batch in test_dataloader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probabilities = torch.nn.functional.softmax(logits, dim=1)
        predicted_labels = torch.argmax(probabilities, dim=1)
        predictions.extend(predicted_labels.cpu().numpy())

# Calculate key metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 1.0000
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
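
An accuracy of 1.0 together with a precision, recall, and F1 of 0 is what these metrics report when neither the true labels nor the predictions contain the positive class. One possible cause, stated here as an assumption because the label values in the CSV are not shown, is that the exact-match rule `x == 'positive'` mapped every label to 0, for example if the file stores `Positive` with a capital letter. The sketch below contrasts that rule with a stricter encoder; the `raw` labels and both helper functions are hypothetical.

```python
from collections import Counter


def encode_naive(labels):
    # Mirrors the exact-match rule: only the literal string 'positive' maps to 1.
    return [1 if label == "positive" else 0 for label in labels]


def encode_strict(labels, mapping=None):
    # Normalize case and whitespace, and fail loudly on unexpected values.
    mapping = mapping or {"positive": 1, "negative": 0}
    encoded = []
    for label in labels:
        key = str(label).strip().lower()
        if key not in mapping:
            raise ValueError(f"Unexpected sentiment label: {label!r}")
        encoded.append(mapping[key])
    return encoded


# Hypothetical raw labels; the real CSV values are not shown in this notebook.
raw = ["Positive", "Negative", " positive", "NEGATIVE"]

print("naive: ", Counter(encode_naive(raw)))
print("strict:", Counter(encode_strict(raw)))
```

Printing the class counts before training is a cheap way to confirm that both classes survive the encoding step; a third label such as `neutral` would raise an error in the strict version instead of being silently folded into class 0.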
