## Lab Assignment 2: Sentiment Classification with Machine Learning Approaches ##
- Author Name: Balbir Singh
- ASU ID: 1233870107
- File Creation Date: (02/03/2025)

In [2]:
# Code Cell 1  - Import all the necessary libraries and restaurant review data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

# Load the dataset
file_path = "C:\\Users\\balbi\\Downloads\\restaurant_reviews_az.csv"
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,IVS7do_HBzroiCiymNdxDg,fdFgZQQYQJeEAshH4lxSfQ,sGy67CpJctjeCWClWqonjA,3,1,1,0,"OK, the hype about having Hatch chili in your ...",2020-01-27 22:59:06
1,QP2pSzSqpJTMWOCuUuyXkQ,JBLWSXBTKFvJYYiM-FnCOQ,3w7NRntdQ9h0KwDsksIt5Q,5,1,1,1,Pandemic pit stop to have an ice cream.... onl...,2020-04-19 05:33:16
2,oK0cGYStgDOusZKz9B1qug,2_9fKnXChUjC5xArfF8BLg,OMnPtRGmbY8qH_wIILfYKA,5,1,0,0,I was lucky enough to go to the soft opening a...,2020-02-29 19:43:44
3,E_ABvFCNVLbfOgRg3Pv1KQ,9MExTQ76GSKhxSWnTS901g,V9XlikTxq0My4gE8LULsjw,5,0,0,0,I've gone to claim Jumpers all over the US and...,2020-03-14 21:47:07
4,Rd222CrrnXkXukR2iWj69g,LPxuausjvDN88uPr-Q4cQA,CA5BOxKRDPGJgdUQ8OUOpw,4,1,0,0,"If you haven't been to Maynard's kitchen, it'...",2020-01-17 20:32:57


In [3]:
# Code Cell 2 -  Remove 3 star reviews from the input data, create a new column - Sentiment for the remaining reviews. For reviews with 1 or 2 star rating, set the value in the Sentiment column to 0. For reviews with 4 or 5 star rating, set the value in the sentiment column to 1. 
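# Drop neutral 3-star reviews; .copy() avoids pandas' SettingWithCopyWarning when the new column is added below
df = df[df['stars'] != 3].copy()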

# Creating the Sentiment column
df['Sentiment'] = df['stars'].apply(lambda x: 1 if x >= 4 else 0)

# Display the dataset after transformation
df.head()


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
1,QP2pSzSqpJTMWOCuUuyXkQ,JBLWSXBTKFvJYYiM-FnCOQ,3w7NRntdQ9h0KwDsksIt5Q,5,1,1,1,Pandemic pit stop to have an ice cream.... onl...,2020-04-19 05:33:16,1
2,oK0cGYStgDOusZKz9B1qug,2_9fKnXChUjC5xArfF8BLg,OMnPtRGmbY8qH_wIILfYKA,5,1,0,0,I was lucky enough to go to the soft opening a...,2020-02-29 19:43:44,1
3,E_ABvFCNVLbfOgRg3Pv1KQ,9MExTQ76GSKhxSWnTS901g,V9XlikTxq0My4gE8LULsjw,5,0,0,0,I've gone to claim Jumpers all over the US and...,2020-03-14 21:47:07,1
4,Rd222CrrnXkXukR2iWj69g,LPxuausjvDN88uPr-Q4cQA,CA5BOxKRDPGJgdUQ8OUOpw,4,1,0,0,"If you haven't been to Maynard's kitchen, it'...",2020-01-17 20:32:57,1
5,kx6O_lyLzUnA7Xip5wh2NA,YsINprB2G1DM8qG1hbrPUg,rViAhfKLKmwbhTKROM9m0w,1,0,0,0,I stay at the Main Hotel at the Casino from Ju...,2020-07-14 16:43:23,0


In [4]:
# Code Cell 3 - Conduct necessary data processing. Prepare the training and test sets on review data for machine learning classifications. 20% of the data for testing and 80% of the data for training.
# Splitting data into training and testing sets (80% train, 20% test)
X = df['text']  # Review text
y = df['Sentiment']  # Sentiment label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check distribution of training and test sets
print(f"Training set size: {len(X_train)}, Test set size: {len(X_test)}")


Training set size: 35274, Test set size: 8819
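The `stratify=y` argument in the split above keeps the positive/negative ratio the same in both subsets, which matters here because positive reviews outnumber negative ones. A minimal sketch of that behavior on hypothetical toy data (the texts and the 30/70 label split are made up for illustration, not taken from the review dataset):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 reviews, 30 negative (0) and 70 positive (1)
texts = [f"review {i}" for i in range(100)]
labels = [0] * 30 + [1] * 70

# Same settings as the notebook cell: 20% test, fixed seed, stratified on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)

# Both subsets keep the 30/70 class ratio of the full data
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without `stratify`, the class ratio in the test set would vary with the random seed, which can make per-class metrics harder to compare across runs.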


In [5]:
# Code Cell 4 -  Use Count Vectorizer and frequency count to represent documents, set maximum feature size as 1000.
vectorizer = CountVectorizer(max_features=1000)
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)

# Display the first 10 feature names (returned in alphabetical order, not ranked by frequency)
print(f"Top 10 features: {vectorizer.get_feature_names_out()[:10]}")

Top 10 features: ['00' '10' '100' '11' '12' '15' '19' '20' '25' '30']


In [6]:
# Code Cell 5 - Train a naive Bayes classification model to classify the review sentiment and evaluate its performance.
nb_model_cv = MultinomialNB()
nb_model_cv.fit(X_train_cv, y_train)
y_pred_nb_cv = nb_model_cv.predict(X_test_cv)

# Evaluation
print("Naïve Bayes (Count Vectorizer) Performance:")
print(classification_report(y_test, y_pred_nb_cv))


Naïve Bayes (Count Vectorizer) Performance:
              precision    recall  f1-score   support

           0       0.85      0.83      0.84      2463
           1       0.94      0.94      0.94      6356

    accuracy                           0.91      8819
   macro avg       0.89      0.89      0.89      8819
weighted avg       0.91      0.91      0.91      8819
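The precision and recall columns in the report above come from confusion-matrix counts. A minimal sketch of that relationship using `confusion_matrix` (imported in Code Cell 1) on hypothetical labels; the eight label pairs below are made up for illustration and are not the model's predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = negative, 1 = positive)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Positive class (1): how many predicted positives are right, and how many real positives are found
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Negative class (0): the same ratios with the roles of the classes swapped
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(cm)
print(f"class 1: precision={precision_pos:.2f}, recall={recall_pos:.2f}")
print(f"class 0: precision={precision_neg:.2f}, recall={recall_neg:.2f}")
```

Calling `confusion_matrix(y_test, y_pred_nb_cv)` in the notebook would give the raw counts behind the report for the Naïve Bayes model.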



In [7]:
# Code Cell 6 - Train an SVM model to classify the review sentiment and evaluate its performance
svm_model_cv = SVC(kernel='linear')
svm_model_cv.fit(X_train_cv, y_train)
y_pred_svm_cv = svm_model_cv.predict(X_test_cv)

# Evaluation
print("SVM (Count Vectorizer) Performance:")
print(classification_report(y_test, y_pred_svm_cv))


SVM (Count Vectorizer) Performance:
              precision    recall  f1-score   support

           0       0.91      0.89      0.90      2463
           1       0.96      0.97      0.96      6356

    accuracy                           0.95      8819
   macro avg       0.94      0.93      0.93      8819
weighted avg       0.95      0.95      0.95      8819



In [8]:
# Code Cell 7 - Use TF-IDF vectorizer to represent the documents and set the max feature size to 1000.
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Display the first 10 feature names (returned in alphabetical order, not ranked by weight)
print(f"Top 10 features: {tfidf_vectorizer.get_feature_names_out()[:10]}")


Top 10 features: ['00' '10' '100' '11' '12' '15' '19' '20' '25' '30']
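The vocabulary is identical to the Count Vectorizer's because both select the same 1000 most frequent terms; what changes is the value stored for each term. A minimal sketch on a hypothetical three-document corpus showing that TF-IDF gives less weight to a word found in every document than to a word found in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "food" appears in every document, the other words in one each
docs = ["food good", "food bad", "food slow"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_  # maps each term to its column index

# Weights for the first document ("food good")
row0 = matrix[0].toarray().ravel()
weight_food = row0[vocab["food"]]
weight_good = row0[vocab["good"]]

# Inverse document frequency learned for each term
idf_food = tfidf.idf_[vocab["food"]]
idf_good = tfidf.idf_[vocab["good"]]

print(f"food: idf={idf_food:.3f}, weight={weight_food:.3f}")
print(f"good: idf={idf_good:.3f}, weight={weight_good:.3f}")
```

Both words occur once in the first document, so a raw count would treat them equally; TF-IDF ranks "good" higher because it is rarer across the corpus.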


In [9]:
# Code Cell 8 - Train a naive Bayes classification model with TF-IDF feature values to classify the review sentiment and evaluate its performance.
nb_model_tfidf = MultinomialNB()
nb_model_tfidf.fit(X_train_tfidf, y_train)
y_pred_nb_tfidf = nb_model_tfidf.predict(X_test_tfidf)

# Evaluation
print("Naïve Bayes (TF-IDF) Performance:")
print(classification_report(y_test, y_pred_nb_tfidf))


Naïve Bayes (TF-IDF) Performance:
              precision    recall  f1-score   support

           0       0.92      0.69      0.79      2463
           1       0.89      0.98      0.93      6356

    accuracy                           0.90      8819
   macro avg       0.91      0.83      0.86      8819
weighted avg       0.90      0.90      0.89      8819



In [10]:
# Code Cell 9 - Train an SVM model with TF-IDF feature values to classify the review sentiment and evaluate its performance
svm_model_tfidf = SVC(kernel='linear')
svm_model_tfidf.fit(X_train_tfidf, y_train)
y_pred_svm_tfidf = svm_model_tfidf.predict(X_test_tfidf)

# Evaluation
print("SVM (TF-IDF) Performance:")
print(classification_report(y_test, y_pred_svm_tfidf))


SVM (TF-IDF) Performance:
              precision    recall  f1-score   support

           0       0.92      0.90      0.91      2463
           1       0.96      0.97      0.97      6356

    accuracy                           0.95      8819
   macro avg       0.94      0.93      0.94      8819
weighted avg       0.95      0.95      0.95      8819



In [11]:
# Code Cell 10 - Use VaderSentiment to predict the review sentiment and evaluate its performance.
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Function to get sentiment score
def vader_sentiment(text):
    score = sia.polarity_scores(text)['compound']
    return 1 if score >= 0 else 0  # Positive sentiment if score >= 0

# Apply to test set
y_pred_vader = X_test.apply(vader_sentiment)

# Evaluation
print("Vader Sentiment Analysis Performance:")
print(classification_report(y_test, y_pred_vader))


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\balbi\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Vader Sentiment Analysis Performance:
              precision    recall  f1-score   support

           0       0.93      0.55      0.69      2463
           1       0.85      0.99      0.91      6356

    accuracy                           0.86      8819
   macro avg       0.89      0.77      0.80      8819
weighted avg       0.87      0.86      0.85      8819
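The cell above labels a review positive whenever the compound score is at least 0, so reviews that VADER scores as exactly neutral (0.0) are counted as positive. That choice may contribute to the low recall on class 0. The VADER project's README suggests ±0.05 as typical cutoffs for a three-way split, so one possible variation is to make the threshold a parameter. A minimal sketch, in which `label_from_compound` is a hypothetical helper and the scores are made-up values rather than real VADER output:

```python
def label_from_compound(score: float, threshold: float = 0.0) -> int:
    """Map a VADER compound score in [-1, 1] to a binary label.

    Returns 1 (positive) when score >= threshold, otherwise 0 (negative).
    """
    return 1 if score >= threshold else 0


# Hypothetical compound scores: clearly negative, neutral, weakly positive, clearly positive
scores = [-0.6, 0.0, 0.03, 0.4]

# Notebook behavior: anything at or above 0 is positive
labels_default = [label_from_compound(s) for s in scores]

# Stricter cutoff: near-neutral scores fall into the negative class
labels_strict = [label_from_compound(s, threshold=0.05) for s in scores]

print("threshold 0.00:", labels_default)
print("threshold 0.05:", labels_strict)
```

Whether a stricter threshold actually improves the results on this dataset would need to be checked by rerunning the evaluation; this sketch only shows the mechanism.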



## Text Cell 11 - Model Performance Comparison & Observations
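1. Naïve Bayes (Count Vectorizer): A solid baseline at 0.91 accuracy (macro F1 0.89). Its word-independence assumption ignores word order and interactions, which may explain the gap to SVM.
2. SVM (Count Vectorizer): Reaches 0.95 accuracy (macro F1 0.93), ahead of Naïve Bayes on the same features; linear SVMs tend to do well in high-dimensional, sparse text spaces.
3. Naïve Bayes (TF-IDF): Accuracy is similar (0.90), but recall on negative reviews falls to 0.69, from 0.83 with raw counts, so the model leans toward the majority positive class on this imbalanced data.
4. SVM (TF-IDF): The best overall result at 0.95 accuracy and macro F1 0.94, though only marginally ahead of SVM with raw counts (macro F1 0.93).
5. Vader Sentiment: Requires no training and runs quickly, but has the lowest accuracy (0.86) and misses many negative reviews (recall 0.55), since a fixed lexicon cannot adapt to domain-specific wording or context.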
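The five results can also be lined up side by side. A minimal sketch that collects the accuracy, macro F1, and negative-class recall values printed in the classification reports above and ranks the models by macro F1 (the `results` structure and its key names are illustrative choices, not part of the assignment):

```python
# Metrics copied from the classification reports in this notebook
results = {
    "Naive Bayes (Count)":  {"accuracy": 0.91, "macro_f1": 0.89, "recall_neg": 0.83},
    "SVM (Count)":          {"accuracy": 0.95, "macro_f1": 0.93, "recall_neg": 0.89},
    "Naive Bayes (TF-IDF)": {"accuracy": 0.90, "macro_f1": 0.86, "recall_neg": 0.69},
    "SVM (TF-IDF)":         {"accuracy": 0.95, "macro_f1": 0.94, "recall_neg": 0.90},
    "VADER":                {"accuracy": 0.86, "macro_f1": 0.80, "recall_neg": 0.55},
}

# Rank by macro F1, which weighs both classes equally despite the imbalance
ranked = sorted(results.items(), key=lambda item: item[1]["macro_f1"], reverse=True)

print(f"{'Model':<22}{'Accuracy':>10}{'Macro F1':>10}{'Recall(0)':>11}")
for name, m in ranked:
    print(f"{name:<22}{m['accuracy']:>10.2f}{m['macro_f1']:>10.2f}{m['recall_neg']:>11.2f}")
```

Macro F1 is used for the ranking because accuracy alone can look high when a model mostly predicts the majority positive class.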

## Text Cell 12 - Acknowledgment
I used GenAI tools, such as ChatGPT, to assist in structuring the code, debugging errors, and implementing the machine learning models for sentiment analysis. The tools helped with syntax corrections, efficiency improvements, and best practices for text processing. I worked independently on this project but referred to documentation and online resources to better understand certain concepts, such as the VADER sentiment analysis library and hyperparameter tuning for SVM and Naïve Bayes models.

In [15]:
# HTML rendering
!pip install jupyter
!pip install nbconvert
!jupyter nbconvert "LA2_Singh_Balbir.ipynb" --to html




[notice] A new release of pip is available: 24.3.1 -> 25.0
[notice] To update, run: python.exe -m pip install --upgrade pip
[NbConvertApp] Converting notebook LA2_Singh_Balbir.ipynb to html
[NbConvertApp] Writing 323454 bytes to LA2_Singh_Balbir.html
