Lab Assignment 2: Sentiment Classification with Machine Learning Approaches

Author: Ravi Teja Kondeti
ASU ID: 1234434879
Date: Feb 02, 2025

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset (first 10,000 rows)
file_path = "/content/restaurant_reviews_az.csv"
df = pd.read_csv(file_path, nrows=10000)

# Display first few rows
df.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
0,IVS7do_HBzroiCiymNdxDg,fdFgZQQYQJeEAshH4lxSfQ,sGy67CpJctjeCWClWqonjA,3,1,1,0,"OK, the hype about having Hatch chili in your ...",1/27/2020 22:59,1
1,QP2pSzSqpJTMWOCuUuyXkQ,JBLWSXBTKFvJYYiM-FnCOQ,3w7NRntdQ9h0KwDsksIt5Q,5,1,1,1,Pandemic pit stop to have an ice cream.... onl...,4/19/2020 5:33,1
2,oK0cGYStgDOusZKz9B1qug,2_9fKnXChUjC5xArfF8BLg,OMnPtRGmbY8qH_wIILfYKA,5,1,0,0,I was lucky enough to go to the soft opening a...,2/29/2020 19:43,1
3,E_ABvFCNVLbfOgRg3Pv1KQ,9MExTQ76GSKhxSWnTS901g,V9XlikTxq0My4gE8LULsjw,5,0,0,0,I've gone to claim Jumpers all over the US and...,3/14/2020 21:47,1
4,Rd222CrrnXkXukR2iWj69g,LPxuausjvDN88uPr-Q4cQA,CA5BOxKRDPGJgdUQ8OUOpw,4,1,0,0,"If you haven't been to Maynard's kitchen, it'...",1/17/2020 20:32,1


In [2]:
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Apply VADER to get sentiment scores
df['vader_score'] = df['text'].apply(lambda x: sia.polarity_scores(str(x))['compound'])

# Assign sentiment labels based on VADER score
df['vader_sentiment'] = df['vader_score'].apply(lambda x: 'positive' if x > 0.05 else ('negative' if x < -0.05 else 'neutral'))

# Display sentiment distribution
df['vader_sentiment'].value_counts()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


vader_sentiment,count
positive,8359
negative,1512
neutral,129
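
The thresholding used above, and the way the three VADER labels collapse into two classes in the next cell, can be sketched without NLTK. The helper names `label_from_compound` and `to_binary` are hypothetical; the ±0.05 cutoffs and the counts are taken from this notebook.

```python
def label_from_compound(score: float) -> str:
    """Map a VADER compound score to a label using the notebook's strict cutoffs."""
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"


def to_binary(label: str) -> int:
    """Collapse three labels to two: anything that is not 'positive' becomes 0."""
    return 1 if label == "positive" else 0


# Counts printed by value_counts() above.
counts = {"positive": 8359, "negative": 1512, "neutral": 129}
total = sum(counts.values())
class_0 = counts["negative"] + counts["neutral"]  # neutral reviews join class 0
positive_share = counts["positive"] / total

print(label_from_compound(0.05))  # a score exactly on the cutoff is neutral
print(class_0, round(positive_share, 4))
```

Roughly five in six reviews end up in class 1, so the classes are imbalanced before any model is trained.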


In [3]:
# Collapse the VADER labels to binary: positive -> 1; negative and neutral -> 0
df['sentiment_label'] = df['vader_sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

# Split data into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment_label'], test_size=0.2, random_state=42)

# Convert text into a bag of words using CountVectorizer
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
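
As a rough illustration of what the bag-of-words step produces, here is a pure-Python sketch on a made-up three-review corpus. It mimics only the counting and the `max_features` cap (keeping the most frequent terms); it does not reproduce `CountVectorizer`'s stop-word list or tie-breaking, and `toy_reviews`, `tokenize`, and `bag_of_words` are hypothetical names.

```python
import re
from collections import Counter

toy_reviews = [
    "Great tacos, great salsa",
    "Slow service, great tacos",
    "Tacos were cold",
]


def tokenize(text):
    """Lowercase and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(docs, max_features):
    """Return (vocabulary, count rows) limited to the most frequent terms."""
    corpus_counts = Counter(tok for doc in docs for tok in tokenize(doc))
    vocab = sorted(term for term, _ in corpus_counts.most_common(max_features))
    rows = []
    for doc in docs:
        doc_counts = Counter(tokenize(doc))
        rows.append([doc_counts[term] for term in vocab])
    return vocab, rows


vocab, rows = bag_of_words(toy_reviews, max_features=2)
print(vocab)  # the two most frequent terms in the toy corpus
print(rows)   # one row of counts per review
```

Each review becomes a fixed-length row of counts, which is the shape the classifiers below consume.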

In [4]:
# Corpus statistics on the training split, using simple whitespace tokenization
num_tokens = sum(len(text.split()) for text in X_train)
unique_tokens = len(set(word for text in X_train for word in text.split()))

print(f"Total Tokens: {num_tokens}")
print(f"Unique Tokens: {unique_tokens}")
print(f"Total Unique Customers: {df['user_id'].nunique()}")

Total Tokens: 689049
Unique Tokens: 40735
Total Unique Customers: 6830
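
The counts above come from whitespace splitting, which keeps case and punctuation, so "Good" and "good," are counted as different tokens. `CountVectorizer` by default lowercases the text and uses the token pattern `(?u)\b\w\w+\b`, so its vocabulary is built differently. A minimal comparison on a made-up sentence:

```python
import re

sample = "Good food, good service! A+ experience."

whitespace_tokens = sample.split()
# Default CountVectorizer behaviour: lowercase, then tokens of 2+ word characters.
pattern_tokens = re.findall(r"(?u)\b\w\w+\b", sample.lower())

print(len(set(whitespace_tokens)), whitespace_tokens)
print(len(set(pattern_tokens)), pattern_tokens)
```

The unique-token figure reported above is therefore likely an overestimate of the vocabulary the vectorizer actually sees.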


In [5]:
nb_model = MultinomialNB()
nb_model.fit(X_train_bow, y_train)

# Predictions
y_pred_nb = nb_model.predict(X_test_bow)

# Evaluate Model
print("Naïve Bayes Classifier Performance:")
print(classification_report(y_test, y_pred_nb))

Naïve Bayes Classifier Performance:
              precision    recall  f1-score   support

           0       0.57      0.72      0.64       326
           1       0.94      0.89      0.92      1674

    accuracy                           0.86      2000
   macro avg       0.75      0.81      0.78      2000
weighted avg       0.88      0.86      0.87      2000
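
To make the summary rows of the report easier to read, the sketch below recomputes them from the per-class rows. The inputs are the rounded values printed above, so the results are approximations; `macro_avg` and `weighted_avg` are hypothetical helper names.

```python
# Per-class figures copied from the Naive Bayes report above (rounded to 2 dp).
report = {
    0: {"precision": 0.57, "recall": 0.72, "f1": 0.64, "support": 326},
    1: {"precision": 0.94, "recall": 0.89, "f1": 0.92, "support": 1674},
}
total = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean over classes: each class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes weighted by support: the majority class dominates."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


# Accuracy equals support-weighted recall (correct predictions / all samples).
accuracy_estimate = weighted_avg("recall")

print(f"macro recall  ~ {macro_avg('recall'):.3f}")
print(f"weighted f1   ~ {weighted_avg('f1'):.3f}")
print(f"accuracy      ~ {accuracy_estimate:.3f}")
```

Because class 1 has about five times the support of class 0, the weighted averages and the accuracy mostly reflect performance on positive reviews; the macro averages give the negative class equal say.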



In [6]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_bow, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test_bow)

# Evaluate Model
print("SVM Classifier Performance:")
print(classification_report(y_test, y_pred_svm))

SVM Classifier Performance:
              precision    recall  f1-score   support

           0       0.66      0.64      0.65       326
           1       0.93      0.94      0.93      1674

    accuracy                           0.89      2000
   macro avg       0.80      0.79      0.79      2000
weighted avg       0.89      0.89      0.89      2000



In [7]:
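# Reload dataset (this resets df and drops the VADER columns; the TF-IDF step
# below still reuses the X_train / X_test split created earlier)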
df = pd.read_csv(file_path, nrows=10000)

# Apply TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
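
For intuition about the weights `TfidfVectorizer` assigns, here is a pure-Python sketch of the weighting under scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`): idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The toy corpus and the function name `tfidf` are assumptions for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

toy_docs = [["good", "food"], ["good", "service"], ["bad", "food"]]


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


vectors = tfidf(toy_docs)
print(vectors[1])  # "service" (in 1 document) outweighs "good" (in 2 documents)
```

Terms that appear in fewer documents receive larger weights, which is the main difference from the raw counts used for Naïve Bayes and SVM above.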

In [8]:
log_reg_model = LogisticRegression()
log_reg_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_lr = log_reg_model.predict(X_test_tfidf)

# Evaluate Model
print("Logistic Regression Performance:")
print(classification_report(y_test, y_pred_lr))

Logistic Regression Performance:
              precision    recall  f1-score   support

           0       0.78      0.52      0.62       326
           1       0.91      0.97      0.94      1674

    accuracy                           0.90      2000
   macro avg       0.85      0.74      0.78      2000
weighted avg       0.89      0.90      0.89      2000



**Comparison and Conclusion**

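	1.	Naïve Bayes: Achieved 86% accuracy. On the negative class it had the lowest precision (0.57) but the highest recall (0.72) of the three models, so it catches more negative reviews at the cost of more false alarms. It’s fast and efficient, but its precision suffers on the minority class here.
	2.	SVM: Performed better with 89% accuracy and the most even precision and recall on the negative class (0.66 and 0.64), giving the highest negative-class F1 (0.65). It’s effective for complex data but computationally intensive.
	3.	Logistic Regression: Delivered the highest accuracy (90%) with excellent recall for positive sentiment (0.97) and the highest negative-class precision (0.78). Its negative-class recall (0.52) is the lowest of the three, so it misses about half of the negative reviews. It is computationally efficient.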
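
The comparison can be checked directly against the three reports. The sketch below copies the negative-class figures and accuracies printed above and adds a majority-class baseline (always predicting positive) for context; the dictionary layout and the helper `best` are hypothetical.

```python
# Negative-class (label 0) figures and accuracy copied from the three reports.
results = {
    "Naive Bayes": {"precision_0": 0.57, "recall_0": 0.72, "f1_0": 0.64, "accuracy": 0.86},
    "SVM": {"precision_0": 0.66, "recall_0": 0.64, "f1_0": 0.65, "accuracy": 0.89},
    "Logistic Regression": {"precision_0": 0.78, "recall_0": 0.52, "f1_0": 0.62, "accuracy": 0.90},
}

# Baseline: always predict the majority class (1,674 of 2,000 test reviews are positive).
majority_baseline = 1674 / 2000


def best(metric):
    """Name of the model with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in ("accuracy", "precision_0", "recall_0", "f1_0"):
    print(f"{metric:12s} best: {best(metric)}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

No single model wins on every negative-class metric, and the accuracy gains over the baseline are a few percentage points, so the choice depends on whether missed negatives or false alarms matter more.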

**Conclusion:**
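All three models reproduce the VADER-derived labels with 86% to 90% accuracy, compared with 83.7% for always predicting positive (1,674 of the 2,000 test reviews). Because the training labels themselves come from VADER, which is lexicon-based and lacks contextual understanding, these scores measure agreement with VADER rather than improvement over it; showing that the models outperform VADER would require independent labels. Logistic Regression is the best choice by accuracy, although its negative-class recall is the weakest, while SVM gives the most balanced performance on the negative class.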

In [9]:
# Input 2: Three customer reviews
input_reviews = [
    "The service is good, but location is hard to find. Sanitation is not very good with old facilities. Food served tasted extremely fishy, making us difficult to even finish it.",
    "The restaurant is definitely one of my favorites and of my family as well. I was especially impressed with my visit a few days ago. The place is clean, and you just need to wait for fewer than 10 minutes to get food served. And of course, the food is absolutely delicious!",
    "I appreciated the friendly staff. The food was good, not amazing. The service was not prompt but almost acceptable. A reliable spot for a regular meal, but nothing extraordinary."
]

# Preprocess the input reviews using the TF-IDF vectorizer
input_tfidf = tfidf_vectorizer.transform(input_reviews)

# Predict probabilities and classes
predicted_probabilities = log_reg_model.predict_proba(input_tfidf)
predicted_classes = log_reg_model.predict(input_tfidf)

# Output predictions and probabilities
for i, review in enumerate(input_reviews):
    print(f"Review {i + 1}: {review}")
    print(f"Predicted Sentiment: {'Positive' if predicted_classes[i] == 1 else 'Negative'}")
    print(f"Probability (Negative, Positive): {predicted_probabilities[i]}")
    print()

Review 1: The service is good, but location is hard to find. Sanitation is not very good with old facilities. Food served tasted extremely fishy, making us difficult to even finish it.
Predicted Sentiment: Positive
Probability (Negative, Positive): [0.39969317 0.60030683]

Review 2: The restaurant is definitely one of my favorites and of my family as well. I was especially impressed with my visit a few days ago. The place is clean, and you just need to wait for fewer than 10 minutes to get food served. And of course, the food is absolutely delicious!
Predicted Sentiment: Positive
Probability (Negative, Positive): [0.06587643 0.93412357]

Review 3: I appreciated the friendly staff. The food was good, not amazing. The service was not prompt but almost acceptable. A reliable spot for a regular meal, but nothing extraordinary.
Predicted Sentiment: Positive
Probability (Negative, Positive): [0.00465047 0.99534953]
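
For a binary `LogisticRegression`, `predict` returns the class with the higher probability, which amounts to a 0.5 cutoff on P(positive). The sketch below applies that cutoff, and a stricter one, to the probabilities printed above (rounded to four decimals). The function `classify` and the 0.65 cutoff are hypothetical, chosen only to show how a borderline review changes label.

```python
# Positive-class probabilities printed above, rounded to four decimals.
positive_probs = [0.6003, 0.9341, 0.9953]


def classify(probs, threshold=0.5):
    """Label a review Positive when P(positive) exceeds the threshold."""
    return ["Positive" if p > threshold else "Negative" for p in probs]


default_labels = classify(positive_probs)         # same labels predict() reported
stricter_labels = classify(positive_probs, 0.65)  # 0.65 is a hypothetical cutoff

print(default_labels)
print(stricter_labels)
```

Only Review 1 sits close enough to the boundary to flip; Review 3 stays positive under any reasonable cutoff, so a threshold change alone would not help with it.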



In [13]:
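# Comment on the predictions
print("All three reviews are classified as positive. That fits Review 2, but Review 1 is mostly negative and still receives a positive probability of 0.60, and the mixed Review 3 receives 0.995. Reviews with negative or mixed sentiment pose challenges because the unigram TF-IDF features ignore word order, so the model cannot tell that a word like good is qualified or negated, and it classifies based on the overall probability score.")

All three reviews are classified as positive. That fits Review 2, but Review 1 is mostly negative and still receives a positive probability of 0.60, and the mixed Review 3 receives 0.995. Reviews with negative or mixed sentiment pose challenges because the unigram TF-IDF features ignore word order, so the model cannot tell that a word like good is qualified or negated, and it classifies based on the overall probability score.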


In [11]:
# Acknowledgments
print('I used OpenAI’s ChatGPT as a resource to clarify assignment requirements, generate initial code templates, and refine reasoning for classifications. All code was independently implemented, and no collaboration occurred for this assignment.')

I used OpenAI’s ChatGPT as a resource to clarify assignment requirements, generate initial code templates, and refine reasoning for classifications. All code was independently implemented, and no collaboration occurred for this assignment.
