#### <center> CISB5123 Text Analytics </center>
#### <center> Sem 2 2024/2025 </center>

## <center> Lab Assignment 2 - Sentiment Analysis on Amazon Fine Food Reviews</center>

### 1) Amirul Farhan bin Kamaruzaman, SW01082374
### 2) Maizatul Aufa binti Zamidi, SW01082394

In [4]:
#Load the Amazon Fine Food Reviews dataset
import pandas as pd
file_path = "Reviews.csv"
df = pd.read_csv(file_path)

In [5]:
#Display basic information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568428 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [6]:
df.head()

|   | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [7]:
# Keep only the star rating and the review text; drop rows with missing values
df = df[['Score', 'Text']].dropna()

In [8]:
df.head()

|   | Score | Text |
|---|---|---|
| 0 | 5 | I have bought several of the Vitality canned d... |
| 1 | 1 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 4 | This is a confection that has been around a fe... |
| 3 | 2 | If you are looking for the secret ingredient i... |
| 4 | 5 | Great taffy at a great price. There was a wid... |


In [9]:
!pip install afinn



In [10]:
import numpy as np
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from afinn import Afinn
import matplotlib.pyplot as plt
import seaborn as sns

In [11]:
# Map star ratings to sentiment labels: 4-5 positive, 3 neutral, 1-2 negative
def label_sentiment(score):
    if score >= 4:
        return "positive"
    elif score == 3:
        return "neutral"
    else:
        return "negative"

df['original_sentiment'] = df['Score'].apply(label_sentiment)
df['original_sentiment'].value_counts()

original_sentiment
positive    443777
negative     82037
neutral      42640
Name: count, dtype: int64
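
The counts above imply a heavily skewed label distribution. As a quick check, the sketch below recomputes each class's share from the printed counts. The `counts` dictionary is copied from the output above, and `majority_baseline` is a name introduced here for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {"positive": 443777, "negative": 82037, "neutral": 42640}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(shares.values())
print(f"majority baseline: {majority_baseline:.1%}")
```

Roughly 78% of the reviews are positive, so any accuracy figure reported later should be read against that baseline rather than against 33%.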

In [12]:
!pip install tqdm



### Lexicon-based approach

In [13]:
from tqdm import tqdm
tqdm.pandas()

af = Afinn()

# Sum of AFINN word valences for each review (pass the bound method directly)
df['afinn_score'] = df['Text'].progress_apply(af.score)

df['afinn_sentiment'] = df['afinn_score'].apply(
    lambda x: 'positive' if x > 0 else 'negative' if x < 0 else 'neutral'
)

df[['afinn_score', 'afinn_sentiment']].head()

100%|██████████| 568454/568454 [19:58<00:00, 474.40it/s]


|   | afinn_score | afinn_sentiment |
|---|---|---|
| 0 | 16.0 | positive |
| 1 | -2.0 | negative |
| 2 | 3.0 | positive |
| 3 | 3.0 | positive |
| 4 | 9.0 | positive |
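
AFINN scores a text by summing the valence of each word it recognizes. The sketch below imitates that idea with a tiny, hypothetical lexicon (`TOY_LEXICON` and its values are invented for illustration and are not the real AFINN entries), then applies the same sign-based thresholds used above. The helper names `toy_score` and `to_label` are also assumptions.

```python
import re

# Hypothetical mini-lexicon; the values are illustrative, not real AFINN entries.
TOY_LEXICON = {"great": 3, "good": 2, "love": 3, "bad": -2, "awful": -3, "stale": -1}

def toy_score(text):
    """Sum the valence of every known word; unknown words contribute 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return float(sum(TOY_LEXICON.get(w, 0) for w in words))

def to_label(score):
    """Same rule as the notebook: the sign of the score decides the label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ["Great taffy at a great price.", "Stale and awful.", "It arrived on Tuesday."]:
    s = toy_score(review)
    print(f"{s:+.1f}  {to_label(s):<9} {review}")
```

Because the label depends only on the sign of the sum, a few positive words can outweigh an overall critical review. Row 3 in the output above (a 2-star review scored 3.0 and labelled positive) may be an instance of this.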


### Preprocessing

In [14]:
# Preprocessing setup
import nltk
# Tokenizer models and stop-word list; newer NLTK releases may need 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
nltk.download('stopwords')

tqdm.pandas()

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

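def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, and stem."""
    # Remove punctuation before tokenizing so tokens are plain words
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    # Keep alphabetic, non-stop-word tokens and reduce each to its Porter stem
    filtered = [stemmer.stem(w) for w in tokens if w not in stop_words and w.isalpha()]
    return ' '.join(filtered)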

df['processed_text'] = df['Text'].progress_apply(preprocess) 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maiza\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maiza\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
100%|██████████| 568454/568454 [16:04<00:00, 589.37it/s] 
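
One side effect of the stop-word step is worth checking: NLTK's English stop-word list includes negations such as "not" and "no", so "not good" and "good" can collapse to the same tokens. The sketch below illustrates this with a small hand-written subset of stop words (`TOY_STOP_WORDS` is an assumption standing in for the full NLTK list, and `filter_tokens` is a simplified stand-in for `preprocess`) and shows one possible mitigation.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer.
TOY_STOP_WORDS = {"the", "is", "was", "this", "not", "no", "a", "at", "all"}
NEGATIONS = {"not", "no"}

def filter_tokens(text, stop_words):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]

review = "this was not good at all"

# Default behaviour: the negation disappears along with the other stop words.
print(filter_tokens(review, TOY_STOP_WORDS))

# Possible mitigation: remove negations from the stop-word set before filtering.
print(filter_tokens(review, TOY_STOP_WORDS - NEGATIONS))
```

In the notebook this would correspond to building `stop_words` as `set(stopwords.words('english')) - {'not', 'no', 'nor'}`. Whether it improves the classifiers here is untested.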


### Feature Extraction (TF-IDF)

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['processed_text'])

X.shape

(568454, 5000)

In [16]:
from sklearn.model_selection import train_test_split

y = df['original_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set:", X_train.shape)
print("Test set    :", X_test.shape)

Training set: (454763, 5000)
Test set    : (113691, 5000)
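
Two details of this split are worth noting: the vectorizer was fitted on the full corpus before splitting, so the IDF weights include test documents, and the split is not stratified. A minimal sketch of an alternative, assuming a scikit-learn `Pipeline` and a toy corpus invented for illustration (`texts`, `labels`), fits TF-IDF on the training fold only and preserves class proportions with `stratify`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus invented for illustration; in the notebook these would be
# df['processed_text'] and df['original_sentiment'].
texts = ["great taffi", "love product", "tasti snack", "good price",
         "bad tast", "stale cooki", "aw smell", "wast money"]
labels = ["positive"] * 4 + ["negative"] * 4

# stratify keeps the class ratio the same in both folds.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, i.e. on training text only.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LinearSVC()),
])
model.fit(X_tr, y_tr)
print(model.predict(X_te))
```

With a corpus this small the predictions are meaningless; the point is only the order of operations. On the full dataset the effect of the leakage is probably small, but fitting inside the pipeline removes the question.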


### Model Evaluation

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb = MultinomialNB()
nb.fit(X_train, y_train)

print("Naive Bayes Results:\n")
print(classification_report(y_test, nb.predict(X_test)))

Naive Bayes Results:

              precision    recall  f1-score   support

    negative       0.85      0.25      0.38     16181
     neutral       0.53      0.00      0.00      8485
    positive       0.81      1.00      0.90     89025

    accuracy                           0.81    113691
   macro avg       0.73      0.41      0.43    113691
weighted avg       0.80      0.81      0.76    113691



In [18]:
from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(X_train, y_train)

print("SVM Results:\n")
print(classification_report(y_test, svm.predict(X_test)))



SVM Results:

              precision    recall  f1-score   support

    negative       0.72      0.66      0.69     16181
     neutral       0.58      0.10      0.17      8485
    positive       0.89      0.97      0.93     89025

    accuracy                           0.86    113691
   macro avg       0.73      0.58      0.60    113691
weighted avg       0.84      0.86      0.84    113691
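
Since both models favor the majority class, one commonly used option is to reweight classes during training. The sketch below compares `LinearSVC()` with `LinearSVC(class_weight='balanced')` on synthetic data from `make_classification`. The data, the 80/12/8 class ratio, and the names `X_syn` and `y_syn` are assumptions chosen to mimic the imbalance, not the review data, so the numbers it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic 3-class data with a skew loosely resembling the review labels.
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=20, n_informative=6, n_classes=3,
    weights=[0.80, 0.12, 0.08], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

for cw in (None, "balanced"):
    clf = LinearSVC(class_weight=cw, max_iter=5000).fit(X_tr, y_tr)
    # Per-class recall, in label order 0, 1, 2.
    recalls = recall_score(y_te, clf.predict(X_te), average=None)
    print(cw, recalls.round(2))
```

In the notebook the equivalent change would be `svm = LinearSVC(class_weight='balanced')`. Reweighting typically trades some majority-class precision for minority-class recall, which would need to be confirmed on the actual test set.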



In [19]:
df.to_csv("reviews_sentiment_output.csv", index=False)

### Model Evaluation Summary
Two models, Naive Bayes and SVM (LinearSVC), were evaluated on the held-out test set of 113,691 reviews.
1) Naive Bayes
   - Accuracy = 81%
   - Very high recall for the positive class (1.00), but low recall for negative (0.25) and neutral (0.00).
   - Poor performance on the minority classes, especially neutral, which is consistent with the class imbalance in the data (about 78% of test reviews are positive).
2) SVM (LinearSVC)
   - Accuracy = 86%
   - Strong recall on positive (0.97) and moderate recall on negative (0.66).
   - Small improvement on neutral (recall = 0.10), but the model still underperforms on this class, likely because of the data imbalance.

Conclusion: SVM outperformed Naive Bayes overall (macro F1 0.60 vs. 0.43), with better F1 on both the negative (0.69 vs. 0.38) and positive (0.93 vs. 0.90) classes. Both models struggle with the neutral class, likely due to class imbalance and the subtle nature of neutral sentiment.
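
The gap between accuracy and macro-averaged F1 is what exposes the minority-class problem: the macro average weights each class equally regardless of support. As a check, the sketch below recomputes both models' macro F1 from the per-class F1 values printed in the classification reports above. The dictionary names are introduced here for illustration.

```python
# Per-class F1 scores copied from the two classification reports above.
f1 = {
    "Naive Bayes": {"negative": 0.38, "neutral": 0.00, "positive": 0.90},
    "LinearSVC":   {"negative": 0.69, "neutral": 0.17, "positive": 0.93},
}

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = {model: sum(scores.values()) / len(scores) for model, scores in f1.items()}

for model, value in macro_f1.items():
    print(f"{model}: macro F1 = {value:.2f}")
```

The results match the `macro avg` rows in the reports (0.43 and 0.60), both well below the corresponding accuracy figures.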

### Discussion
1) Naive Bayes
- Strengths: Fast and simple; good on the positive class.

- Weaknesses: Very poor on the neutral class (recall 0.00); struggles with imbalanced data.

2) SVM
- Strengths: Higher accuracy of the two models (86%); strong on positive sentiment and reasonable on negative sentiment.

- Weaknesses: Still weak on the neutral class (recall 0.10); more computationally expensive than Naive Bayes.

##### Conclusion
SVM outperforms Naive Bayes overall, but both models need improvement on neutral sentiment detection.