## Sentiment Analysis  

#### 1. Importing necessary libraries and dataset

In [25]:
!python -m spacy download en_core_web_lg
!pip install -q vaderSentiment

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [43]:
import pandas as pd
import numpy as np
import spacy
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from IPython.display import display

In [45]:
nlp = spacy.load("en_core_web_lg")

In [47]:
df = pd.read_csv('/Users/kavya/Downloads/GitHub/Datasets/restaurant_reviews_az.csv')
display(df.head())  # Shows first few rows

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,IVS7do_HBzroiCiymNdxDg,fdFgZQQYQJeEAshH4lxSfQ,sGy67CpJctjeCWClWqonjA,3,1,1,0,"OK, the hype about having Hatch chili in your ...",2020-01-27 22:59:06
1,QP2pSzSqpJTMWOCuUuyXkQ,JBLWSXBTKFvJYYiM-FnCOQ,3w7NRntdQ9h0KwDsksIt5Q,5,1,1,1,Pandemic pit stop to have an ice cream.... onl...,2020-04-19 05:33:16
2,oK0cGYStgDOusZKz9B1qug,2_9fKnXChUjC5xArfF8BLg,OMnPtRGmbY8qH_wIILfYKA,5,1,0,0,I was lucky enough to go to the soft opening a...,2020-02-29 19:43:44
3,E_ABvFCNVLbfOgRg3Pv1KQ,9MExTQ76GSKhxSWnTS901g,V9XlikTxq0My4gE8LULsjw,5,0,0,0,I've gone to claim Jumpers all over the US and...,2020-03-14 21:47:07
4,Rd222CrrnXkXukR2iWj69g,LPxuausjvDN88uPr-Q4cQA,CA5BOxKRDPGJgdUQ8OUOpw,4,1,0,0,"If you haven't been to Maynard's kitchen, it'...",2020-01-17 20:32:57


#### 2. Removing 3-star reviews & creating Sentiment column

In [49]:
df = df[df['stars'] != 3].copy()  # Remove neutral reviews; copy to avoid SettingWithCopyWarning
df['Sentiment'] = df['stars'].apply(lambda x: 1 if x > 3 else 0)  # 1 = positive (4-5 stars), 0 = negative (1-2 stars)
display(df.head())  # Verify sentiment column

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
1,QP2pSzSqpJTMWOCuUuyXkQ,JBLWSXBTKFvJYYiM-FnCOQ,3w7NRntdQ9h0KwDsksIt5Q,5,1,1,1,Pandemic pit stop to have an ice cream.... onl...,2020-04-19 05:33:16,1
2,oK0cGYStgDOusZKz9B1qug,2_9fKnXChUjC5xArfF8BLg,OMnPtRGmbY8qH_wIILfYKA,5,1,0,0,I was lucky enough to go to the soft opening a...,2020-02-29 19:43:44,1
3,E_ABvFCNVLbfOgRg3Pv1KQ,9MExTQ76GSKhxSWnTS901g,V9XlikTxq0My4gE8LULsjw,5,0,0,0,I've gone to claim Jumpers all over the US and...,2020-03-14 21:47:07,1
4,Rd222CrrnXkXukR2iWj69g,LPxuausjvDN88uPr-Q4cQA,CA5BOxKRDPGJgdUQ8OUOpw,4,1,0,0,"If you haven't been to Maynard's kitchen, it'...",2020-01-17 20:32:57,1
5,kx6O_lyLzUnA7Xip5wh2NA,YsINprB2G1DM8qG1hbrPUg,rViAhfKLKmwbhTKROM9m0w,1,0,0,0,I stay at the Main Hotel at the Casino from Ju...,2020-07-14 16:43:23,0


#### 3. Conducting necessary data processing & splitting into train-test sets

In [51]:
train_data, test_data, train_labels, test_labels = train_test_split(
    df['text'], df['Sentiment'], test_size=0.2, random_state=42, stratify=df['Sentiment']
)

#### 4. Applying CountVectorizer

In [53]:
count_vectorizer = CountVectorizer(stop_words='english', max_features=1000)
X_train_counts = count_vectorizer.fit_transform(train_data)
X_test_counts = count_vectorizer.transform(test_data)

#### 5. Training Naïve Bayes with CountVectorizer & Evaluating Performance

In [55]:
nb_model = MultinomialNB()
nb_model.fit(X_train_counts, train_labels)
nb_pred = nb_model.predict(X_test_counts)

In [91]:
nb_report = pd.DataFrame(classification_report(test_labels, nb_pred, output_dict=True)).transpose()
print("Naïve Bayes Classification Report:")
display(nb_report)

Naïve Bayes Classification Report:


Unnamed: 0,precision,recall,f1-score,support
0,0.854994,0.830694,0.842669,2463.0
1,0.935107,0.945406,0.940228,6356.0
accuracy,0.913369,0.913369,0.913369,0.913369
macro avg,0.895051,0.88805,0.891449,8819.0
weighted avg,0.912733,0.913369,0.912982,8819.0


#### 6. Training SVM with CountVectorizer & Evaluating Performance

In [57]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_counts, train_labels)
svm_pred = svm_model.predict(X_test_counts)

In [93]:
svm_report = pd.DataFrame(classification_report(test_labels, svm_pred, output_dict=True)).transpose()
print("SVM Classification Report:")
display(svm_report)

SVM Classification Report:


Unnamed: 0,precision,recall,f1-score,support
0,0.904282,0.874543,0.889164,2463.0
1,0.951996,0.964128,0.958024,6356.0
accuracy,0.939109,0.939109,0.939109,0.939109
macro avg,0.928139,0.919336,0.923594,8819.0
weighted avg,0.938671,0.939109,0.938793,8819.0


#### 7. Applying TF-IDF Vectorization

In [59]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data)
X_test_tfidf = tfidf_vectorizer.transform(test_data)

#### 8. Training Naïve Bayes with TF-IDF & Evaluating Performance

In [61]:
nb_tfidf_model = MultinomialNB()
nb_tfidf_model.fit(X_train_tfidf, train_labels)
nb_tfidf_pred = nb_tfidf_model.predict(X_test_tfidf)

In [95]:
nb_tfidf_report = pd.DataFrame(classification_report(test_labels, nb_tfidf_pred, output_dict=True)).transpose()
print("Naïve Bayes (TF-IDF) Classification Report:")
display(nb_tfidf_report)

Naïve Bayes (TF-IDF) Classification Report:


Unnamed: 0,precision,recall,f1-score,support
0,0.917617,0.719042,0.806283,2463.0
1,0.89955,0.974984,0.935749,6356.0
accuracy,0.903504,0.903504,0.903504,0.903504
macro avg,0.908583,0.847013,0.871016,8819.0
weighted avg,0.904596,0.903504,0.899591,8819.0


#### 9. Training SVM with TF-IDF & Evaluating Performance

In [63]:
svm_tfidf_model = SVC(kernel='linear')
svm_tfidf_model.fit(X_train_tfidf, train_labels)
svm_tfidf_pred = svm_tfidf_model.predict(X_test_tfidf)

In [97]:
svm_tfidf_report = pd.DataFrame(classification_report(test_labels, svm_tfidf_pred, output_dict=True)).transpose()
print("SVM (TF-IDF) Classification Report:")
display(svm_tfidf_report)

SVM (TF-IDF) Classification Report:


Unnamed: 0,precision,recall,f1-score,support
0,0.911542,0.878603,0.894769,2463.0
1,0.953607,0.96696,0.960237,6356.0
accuracy,0.942284,0.942284,0.942284,0.942284
macro avg,0.932575,0.922782,0.927503,8819.0
weighted avg,0.941859,0.942284,0.941953,8819.0


#### 10. Using VaderSentiment & Evaluating Performance

In [65]:
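# Note: SentimentIntensityAnalyzer is imported from the vaderSentiment package, which ships
# its own lexicon; this NLTK download is only needed when using nltk.sentiment.vader.
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# compound >= 0 means neutral scores (including exactly 0) are labeled positive
df['Vader_Sentiment'] = df['text'].apply(lambda x: 1 if sia.polarity_scores(x)['compound'] >= 0 else 0)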

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/kavya/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


#### 11. Comparing model performance & writing observations

In [75]:
model_results = pd.DataFrame({
    'Model': ['Naïve Bayes (CountVec)', 'SVM (CountVec)', 'Naïve Bayes (TF-IDF)', 'SVM (TF-IDF)', 'Vader Sentiment'],
    'Accuracy': [
        accuracy_score(test_labels, nb_pred),
        accuracy_score(test_labels, svm_pred),
        accuracy_score(test_labels, nb_tfidf_pred),
        accuracy_score(test_labels, svm_tfidf_pred),
        # VADER is scored on the full filtered dataset, not only the test split
        accuracy_score(df['Sentiment'], df['Vader_Sentiment'])
    ]
})

display(model_results)  

Unnamed: 0,Model,Accuracy
0,Naïve Bayes (CountVec),0.913369
1,SVM (CountVec),0.939109
2,Naïve Bayes (TF-IDF),0.903504
3,SVM (TF-IDF),0.942284
4,Vader Sentiment,0.866079


### Observations on Sentiment Analysis Model Performance

#### **1. Naïve Bayes (CountVectorizer)**
- **Accuracy: 91.34%**
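- A strong bag-of-words baseline, though it ignores word order and so may miss negation or sarcasm.
- Relies on raw word counts. English stop words were already removed by `CountVectorizer(stop_words='english')`, so stop words are not a factor here; the weaker spot is the negative class (recall 0.83 versus 0.95 for positive).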
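
To relate the accuracy above to the per-class rows of the classification report, the sketch below recomputes precision, recall, F1 and accuracy from confusion-matrix counts. The four counts are not notebook output: they are inferred here from the reported support, precision and recall, and should be read as an assumption.

```python
# Confusion-matrix counts, treating label 0 (negative) as the class of interest.
# Assumption: inferred from the reported support, precision and recall;
# the notebook itself never prints a confusion matrix.
tp = 2046  # negative reviews predicted negative
fn = 417   # negative reviews predicted positive
fp = 347   # positive reviews predicted negative
tn = 6009  # positive reviews predicted positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

These values line up with the class-0 row and the accuracy row of the Naïve Bayes (CountVectorizer) report.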

#### **2. SVM (CountVectorizer)**
- **Accuracy: 93.91%**
- Beats Naïve Bayes by about 2.6 percentage points on the same 1,000 count features.
- More expensive to train than Naïve Bayes, but scores higher on the held-out test set.

#### **3. Naïve Bayes (TF-IDF)**
- **Accuracy: 90.35%**
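- Accuracy is about 1 point lower than with CountVectorizer (91.34%), so TF-IDF weighting did not help Naïve Bayes here. Recall on negative reviews fell from 0.83 to 0.72, while precision on them rose from 0.85 to 0.92.
- Still assumes that words are conditionally independent given the class, which may limit accuracy.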
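
How TF-IDF reweights raw counts can be seen on a toy corpus. This is a minimal pure-Python sketch, assuming the smoothed IDF formula that scikit-learn's `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`. The three documents are made up for illustration, and the L2 normalization that `TfidfVectorizer` applies by default is omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    ["food", "great"],
    ["food", "bad"],
    ["food", "great", "service"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf(doc):
    """Return unnormalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    return {term: count * idf[term] for term, count in counts.items()}


weights = tfidf(docs[1])
print(weights)
```

In this toy corpus "food" occurs in every document and keeps the minimum weight, while the rarer "bad" is weighted more heavily, even though both occur once in the second document.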

#### **4. SVM (TF-IDF)**
- **Accuracy: 94.23%**
- Best-performing model in this comparison, though only about 0.3 points ahead of SVM with CountVectorizer.
- Benefits from TF-IDF's term weighting and from the SVM's maximum-margin separating hyperplane.
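
A linear SVM classifies a review by the sign of `w · x + b`, where `x` is the review's feature vector. The sketch below illustrates only that decision rule; the vocabulary, weights and bias are hypothetical values, not coefficients taken from the trained model.

```python
# Hypothetical per-term weights and bias (not from the fitted SVC)
weights = {"great": 1.2, "delicious": 0.9, "rude": -1.5, "cold": -0.7}
bias = 0.1


def decision_function(features):
    """Signed score w . x + b for a sparse feature dict {term: value}."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias


def predict(features):
    """Return 1 (positive) when the score is above zero, else 0 (negative)."""
    return 1 if decision_function(features) > 0 else 0


review = {"great": 0.5, "cold": 0.4}  # hypothetical TF-IDF values
print(decision_function(review), predict(review))
```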

#### **5. Vader Sentiment Analysis**
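- **Accuracy: 86.61%** (measured on the full filtered dataset rather than the test split, so it is not directly comparable with the figures above)
- A rule-based lexicon method designed for short social-media text; it may miss nuance in longer restaurant reviews.
- Does not require labeled data, making it useful for quick sentiment assessment.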
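
The notebook maps VADER's `compound` score to a label with a single cutoff at 0. The sketch below places that rule next to the thresholds commonly suggested in the VADER documentation (positive at 0.05 or above, negative at -0.05 or below). The helper names and the sample scores are hypothetical.

```python
def binary_label(compound, cutoff=0.0):
    """Rule used in the notebook: 1 if compound >= cutoff, else 0."""
    return 1 if compound >= cutoff else 0


def three_way_label(compound, margin=0.05):
    """Conventional VADER buckets: positive, neutral, or negative."""
    if compound >= margin:
        return "positive"
    if compound <= -margin:
        return "negative"
    return "neutral"


# Hypothetical compound scores, not taken from the dataset
for score in (0.62, 0.0, -0.02, -0.45):
    print(score, binary_label(score), three_way_label(score))
```

With a cutoff of 0, a review that matches no lexicon words (compound of exactly 0) is counted as positive, which is one design choice worth revisiting when comparing VADER with the trained models.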

#### **6. Overall Comparison**
- **SVM with TF-IDF (94.23%) performed the best**, narrowly ahead of SVM with CountVectorizer (93.91%).
- **Naïve Bayes performed well** (90-91% accuracy), but its assumption of word independence limits its potential.
- **Vader Sentiment is useful for quick lexicon-based analysis but not as strong as the ML models**; note that its accuracy was computed on the full dataset rather than the test split.
- **Future improvements** could include deep learning models (LSTMs, Transformers) for better sentiment detection.
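
For context on these accuracies, the test set is imbalanced: the classification reports show a support of 2,463 negative and 6,356 positive reviews. A minimal sketch of the majority-class baseline that any model here has to beat:

```python
# Test-set class supports, taken from the classification reports
support = {0: 2463, 1: 6356}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Always predicting class {majority_class} gives accuracy {baseline_accuracy:.4f}")
```

All four trained models clear this baseline by a wide margin, which is why the per-class recall on negative reviews is a more informative point of comparison than accuracy alone.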