Name: Reshma a/p Ganesan (IS01082523)
      Najah Zdafirah binti Mohd Zakir (IS01082508)

In [2]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

In [6]:
# Load the dataset; keep only the rating and review text, dropping rows with missing values
df = pd.read_csv('Reviews.csv')
df = df[['Score', 'Text']].dropna()

# Basic Cleaning
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def clean_text(text):
    text = str(text).lower()
    # Strip markup and URLs first: once punctuation is removed, '<br />'
    # becomes 'br' and the tag pattern can no longer match.
    text = re.sub(r'<.*?>', ' ', text)  # remove HTML tags
    text = re.sub(r'http\S+|www\S+', ' ', text)  # remove URLs
    text = re.sub(r'[^a-z\s]', '', text)  # remove digits, punctuation, and other non-letters
    tokens = text.split()
    # Drop stop words, then stem what remains
    tokens = [ps.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['cleaned_text'] = df['Text'].apply(clean_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\VICTUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
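
The order of the substitutions in a cleaning function like clean_text matters: if non-letters are stripped before HTML tags, the angle brackets are already gone and the tag pattern never matches. The following is a small standard-library sketch of that effect. Stemming and stop-word removal are left out, and the function names and the sample review are made up for illustration.

```python
import re

def markup_first(text):
    # Remove tags and URLs while '<', '>', ':' and '/' are still present.
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'http\S+|www\S+', ' ', text)
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(text.split())

def letters_first(text):
    # Strip non-letters first; the tag pattern then has nothing to match.
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'http\S+|www\S+', ' ', text)
    return ' '.join(text.split())

sample = "Great taste!<br />See http://example.com for 2 more."
print(markup_first(sample))   # great taste see for more
print(letters_first(sample))  # great tastebr see for more
```

In the second variant the leftover "br" from the line-break tag is fused onto the preceding word, which would then be stemmed and counted as a distinct token.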


In [8]:
# Assign sentiment label
def label_sentiment(score):
    if score <= 2:
        return 'negative'
    elif score == 3:
        return 'neutral'
    else:
        return 'positive'

df['sentiment'] = df['Score'].apply(label_sentiment)
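
Before modelling, it can help to check how the labels are distributed, since a skewed distribution affects how accuracy should be read. A minimal sketch on a toy frame follows; the scores are hypothetical stand-ins for the Score column, not values from Reviews.csv.

```python
import pandas as pd

def label_sentiment(score):
    # Same mapping as in the notebook: 1-2 negative, 3 neutral, 4-5 positive.
    if score <= 2:
        return 'negative'
    elif score == 3:
        return 'neutral'
    else:
        return 'positive'

# Hypothetical scores for illustration only.
toy = pd.DataFrame({'Score': [1, 2, 3, 4, 5, 5, 5, 4]})
toy['sentiment'] = toy['Score'].apply(label_sentiment)

counts = toy['sentiment'].value_counts()                # absolute counts per label
shares = toy['sentiment'].value_counts(normalize=True)  # fraction of rows per label
print(counts)
print(shares)
```

On the real data the same two calls on df['sentiment'] would show the class balance directly.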


Lexicon-Based Approach (VADER)

In [13]:
%pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2
Note: you may need to restart the kernel to use updated packages.


In [15]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report

analyzer = SentimentIntensityAnalyzer()
df['vader_compound'] = df['Text'].apply(lambda x: analyzer.polarity_scores(x)['compound'])

def vader_label(score):
    if score > 0.05:
        return 'positive'
    elif score < -0.05:
        return 'negative'
    else:
        return 'neutral'

df['vader_sentiment'] = df['vader_compound'].apply(vader_label)

# Evaluation
print("VADER Classification Report:")
print(classification_report(df['sentiment'], df['vader_sentiment']))


VADER Classification Report:
              precision    recall  f1-score   support

    negative       0.59      0.40      0.47     82037
     neutral       0.14      0.04      0.06     42640
    positive       0.84      0.95      0.89    443777

    accuracy                           0.80    568454
   macro avg       0.52      0.46      0.48    568454
weighted avg       0.75      0.80      0.77    568454
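
The support column in the report above gives the class sizes, which is enough to work out what a trivial "always predict the largest class" classifier would score. A short sketch of that arithmetic, using only the numbers printed in the report:

```python
# Class sizes taken from the support column of the VADER report.
support = {'negative': 82037, 'neutral': 42640, 'positive': 443777}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(total)                                           # 568454
print(f"{majority_class}: {baseline_accuracy:.3f}")   # positive: 0.781
```

An accuracy of 0.80 is therefore only about two points above this baseline, which is why the per-class rows and the macro average are more informative here than accuracy alone.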



Machine Learning Approach (Logistic Regression)

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Feature Extraction
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['cleaned_text'])
y = df['sentiment']

# Split and Train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = model.predict(X_test)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

    negative       0.73      0.66      0.69     16181
     neutral       0.51      0.18      0.27      8485
    positive       0.90      0.97      0.93     89025

    accuracy                           0.86    113691
   macro avg       0.71      0.60      0.63    113691
weighted avg       0.84      0.86      0.85    113691
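
One caveat about the cell above: the TF-IDF vocabulary and IDF weights are fitted on all reviews before the split, so the test rows influence the features. A possible alternative is sketched below, assuming scikit-learn's Pipeline so that the vectorizer is fitted on the training rows only. The stratified split and class_weight='balanced' are additions for illustration and were not part of the run reported above; the toy texts and labels are hypothetical stand-ins for df['cleaned_text'] and df['sentiment'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned toy reviews (4 per class).
texts = [
    "love great tast", "best coffe ever", "delici fresh", "perfect snack",
    "terribl stale", "aw tast", "wast money", "arriv broken",
    "okay noth special", "averag price", "fine expect", "decent bland",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Split the raw text first; stratify keeps the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training rows only, then the classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(list(zip(y_test, y_pred)))
```

Whether this changes the scores on the full dataset is not something the notebook measures; the sketch only shows the mechanics.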



In [19]:
df.to_csv('Processed_Reviews.csv', index=False)


Comparison of VADER and Logistic Regression Models:
VADER:

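Accuracy: 80%, measured on all 568,454 reviews. Always predicting "positive" would already score about 78%, because 443,777 of the reviews are positive.

Strength: High recall on positive reviews (0.95).

Weakness: Recall is only 0.04 for neutral reviews and 0.40 for negative reviews, so most non-positive reviews are mislabeled.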
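
One reason the lexicon approach may assign so few neutral labels is that the neutral band is narrow: only compound scores between -0.05 and 0.05 count as neutral. The sketch below parameterizes that cutoff to show how the labels shift when the band is widened. The threshold argument, the value 0.35, and the compound scores are hypothetical and were not evaluated on the dataset.

```python
def vader_label(score, threshold=0.05):
    # Scores inside [-threshold, threshold] are treated as neutral.
    if score > threshold:
        return 'positive'
    elif score < -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical compound scores, for illustration only.
compounds = [0.92, 0.30, 0.04, -0.20, -0.75]

narrow = [vader_label(c) for c in compounds]
wide = [vader_label(c, threshold=0.35) for c in compounds]
print(narrow)  # ['positive', 'positive', 'neutral', 'negative', 'negative']
print(wide)    # ['positive', 'neutral', 'neutral', 'neutral', 'negative']
```

A wider band trades positive and negative recall for neutral recall, so any change would need to be checked against the classification report.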

Logistic Regression:

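Accuracy: 86%, measured on the 113,691-review held-out test set (20% of the data).

Strength: Stronger than VADER on both negative (F1 0.69) and positive (F1 0.93) reviews, with higher overall accuracy.

Weakness: Neutral recall is only 0.18, so most neutral reviews are still mislabeled.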

Conclusion:
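Logistic Regression is the better model here: accuracy is 86% versus 80%, and macro-averaged F1 is 0.63 versus 0.48. Most of the gain comes from the negative class (F1 0.69 versus 0.47).

VADER needs no training and identifies positive reviews well, but neither model handles the neutral class reliably (F1 0.06 for VADER, 0.27 for Logistic Regression).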
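
To see where the neutral reviews end up, a confusion matrix is more direct than the per-class scores. A minimal sketch using scikit-learn's confusion_matrix on hypothetical toy labels follows; on the notebook's data the same call would take y_test and y_pred, or df['sentiment'] and df['vader_sentiment'].

```python
from sklearn.metrics import confusion_matrix

labels = ['negative', 'neutral', 'positive']

# Hypothetical true and predicted labels, for illustration only.
y_true = ['negative', 'neutral', 'neutral', 'positive', 'positive', 'neutral']
y_hat = ['negative', 'positive', 'positive', 'positive', 'positive', 'neutral']

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
# [[1 0 0]
#  [0 1 2]
#  [0 0 2]]
```

In this toy case the middle row shows two of three neutral reviews predicted as positive, which is the kind of pattern the low neutral recall suggests.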

VADER (Lexicon-Based Approach)
✅ Strengths:

Easy to use with no need for training data.

Handles social media-type text and emoticons well.

❌ Weaknesses:

Doesn't adapt to new domains or context-specific meanings.

May misclassify sarcasm, negation that falls outside its simple built-in rules, or domain-specific words.

Logistic Regression (ML-Based Approach)
✅ Strengths:

Learns word weights from the dataset, so it adapts to the domain's vocabulary.

Accuracy can improve with more data and hyperparameter tuning.

❌ Weaknesses:

Requires labeled training data and preprocessing.

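Performance may drop if the dataset is imbalanced or noisy. This dataset is imbalanced (about 78% of reviews are positive), which likely contributes to the low neutral recall.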