# Nanyang Polytechnic (NYP)

# IT2311 Assignment - Task 2: Sentiment Classification

We are required to build a sentiment classification model to predict the sentiment of video game review text. Businesses will be able to use this model to predict the sentiment of a new review.

Complete the following sub-tasks:
1. **Load Data**: Load the dataset and perform initial exploration
2. **Data Preparation**: Prepare text representations and engineer features for classification
3. **Modelling**: Build sentiment classifiers using different algorithms and text representations
4. **Evaluation**: Evaluate results, compare models, and select the best model with justification

For each sub-task, the rationale for every step is explained in detail.

**Done by: \<Enter your name and admin number here\>**

## Import Libraries and Download Packages

We begin by importing all necessary libraries:
- **pandas / numpy**: Data manipulation and numerical operations
- **matplotlib / seaborn**: Data visualization
- **nltk**: Natural Language Processing (tokenization, stopwords, lemmatization)
- **sklearn**: Machine learning models, evaluation metrics, vectorizers, and pipelines
- **re / warnings**: Text cleaning utilities and warning suppression

These libraries form the standard toolkit for NLP-based text classification tasks.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, f1_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print('All libraries imported successfully.')

In [None]:
# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
print('NLTK data downloaded successfully.')

---
## 1. Load Data

We load the Amazon Video Game Reviews dataset (`Task_2_SA_video_game_reviews.json`) which contains 50,000 product reviews. Each review includes a numeric rating (1.0–5.0), review title, full review text, and metadata such as product ID, user ID, timestamp, helpful votes, and purchase verification.

**Rationale**: Understanding the raw data structure is the critical first step. We examine the shape, data types, missing values, and the distribution of the target variable (rating) to inform our preprocessing decisions.

In [None]:
# Load the JSON dataset
df = pd.read_json('Task_2_SA_video_game_reviews.json', lines=True)
print(f'Dataset shape: {df.shape}')
print(f'Number of rows: {df.shape[0]:,}')
print(f'Number of columns: {df.shape[1]}')
df.head()

In [None]:
# Display data types and non-null counts
df.info()

In [None]:
# Statistical summary of numerical columns
df.describe()

In [None]:
# Check for missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
print('Missing values per column:')
missing_df[missing_df['Missing Count'] > 0]

### 1.1 Rating Distribution

We visualise the distribution of ratings to understand the class balance. This is crucial because:
- Imbalanced classes can bias models toward the majority class
- It helps us decide how to map ratings to sentiment categories
- It informs whether we need stratified sampling or class weighting

In [None]:
# Visualise rating distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of rating counts
rating_counts = df['rating'].value_counts().sort_index()
axes[0].bar(rating_counts.index.astype(str), rating_counts.values, color=sns.color_palette('viridis', 5))
axes[0].set_title('Distribution of Ratings', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Count')
for i, v in enumerate(rating_counts.values):
    axes[0].text(i, v + 200, f'{v:,}', ha='center', fontsize=10)

# Percentage pie chart
axes[1].pie(rating_counts.values, labels=[f'{r} Star' for r in rating_counts.index.astype(int)],
            autopct='%1.1f%%', colors=sns.color_palette('viridis', 5), startangle=90)
axes[1].set_title('Rating Distribution (%)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print('\nRating value counts:')
print(rating_counts)

**Observation**: The ratings are likely skewed toward higher values (4–5 stars), which is common in product reviews. This imbalance will need to be addressed during our sentiment mapping and model training.

---
## 2. Data Preparation

Data preparation is one of the most critical steps in an NLP pipeline, because every downstream model only sees what survives it. We will:
1. Create sentiment labels from ratings
2. Clean and preprocess the text
3. Engineer features using text vectorization
4. Split the data for training and testing

Each decision is justified with clear rationale.

### 2.1 Sentiment Label Creation

We map the 5-point rating scale to three sentiment categories:

| Rating | Sentiment |
|--------|-----------|
| 1.0 – 2.0 | **Negative** |
| 3.0 | **Neutral** |
| 4.0 – 5.0 | **Positive** |

**Rationale for this mapping**:
- **1–2 stars** clearly indicate dissatisfaction — grouping them captures varying degrees of negativity
- **3 stars** represents an ambivalent or mixed review — the reviewer is neither satisfied nor dissatisfied
- **4–5 stars** indicate satisfaction — grouping them captures both moderate and strong approval
- This 3-class mapping is standard in sentiment analysis literature and provides a good balance between granularity and model performance
- A binary (positive/negative) mapping would lose the nuance of neutral reviews, which are valuable for business insights

In [None]:
# Map ratings to sentiment labels
def map_sentiment(rating):
    if rating <= 2.0:
        return 'Negative'
    elif rating == 3.0:
        return 'Neutral'
    else:
        return 'Positive'

df['sentiment'] = df['rating'].apply(map_sentiment)

# Display sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
sentiment_pct = (sentiment_counts / len(df) * 100).round(2)

print('Sentiment Distribution:')
print('=' * 40)
for sent in ['Positive', 'Neutral', 'Negative']:
    print(f'{sent:>10}: {sentiment_counts[sent]:>6,} ({sentiment_pct[sent]:>5.1f}%)')
print(f'{"Total":>10}: {len(df):>6,}')

In [None]:
# Visualise sentiment distribution
fig, ax = plt.subplots(figsize=(8, 5))
colors = {'Positive': '#2ecc71', 'Neutral': '#f39c12', 'Negative': '#e74c3c'}
order = ['Negative', 'Neutral', 'Positive']
bars = ax.bar(order, [sentiment_counts[s] for s in order],
              color=[colors[s] for s in order], edgecolor='black', linewidth=0.5)

for bar, label in zip(bars, order):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2., height + 200,
            f'{height:,}\n({sentiment_pct[label]:.1f}%)',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

ax.set_title('Sentiment Class Distribution', fontsize=14, fontweight='bold')
ax.set_xlabel('Sentiment')
ax.set_ylabel('Count')
plt.tight_layout()
plt.show()

#### 2.1.1 Addressing Class Imbalance

**Observation**: The dataset is likely imbalanced with the Positive class dominating. This is expected in product reviews where satisfied customers are more numerous.

**Strategy to handle imbalance**:
1. Use **stratified train-test split** to maintain class proportions in both sets
2. Use **`class_weight='balanced'`** in models that support it (Logistic Regression, LinearSVC) — this automatically adjusts weights inversely proportional to class frequencies
3. Evaluate using **macro-averaged F1-score** rather than accuracy alone, as accuracy can be misleading with imbalanced data
4. Examine **per-class precision and recall** to ensure the model performs well on minority classes
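
As a rough illustration of point 2, scikit-learn documents the `balanced` heuristic as `n_samples / (n_classes * count_of_class)`. The sketch below recomputes that formula in plain Python. The helper name `balanced_class_weights` and the toy label counts are assumptions made for illustration; they are not taken from the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up, deliberately imbalanced labels (not the real class counts)
toy_labels = ['Positive'] * 8 + ['Neutral'] * 1 + ['Negative'] * 1
toy_weights = balanced_class_weights(toy_labels)

for label, weight in toy_weights.items():
    print(f'{label:>10}: {weight:.3f}')
```

The rarer a class is, the larger its weight, so errors on Neutral and Negative reviews cost more during training than errors on Positive reviews.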

### 2.2 Text Preprocessing

Text preprocessing is essential for reducing noise and improving model performance. Our pipeline:

1. **Combine title + text**: The review title often contains a concise sentiment summary (e.g., "Terrible game!" or "Absolutely love it!") that adds valuable signal
2. **Handle missing text**: Replace NaN values with empty strings to avoid errors
3. **Clean text**: Remove HTML tags, URLs, special characters, digits, and extra whitespace
4. **Lowercase**: Normalize case to treat "Good" and "good" as the same token
5. **Tokenize**: Split text into individual words for further processing
6. **Remove stopwords**: Remove common English words (the, is, at, etc.) that carry little semantic meaning
7. **Lemmatize**: Reduce words to their base forms (e.g., "playing" → "play", "games" → "game") for better generalization

**Rationale**: Each step reduces the vocabulary size and noise, allowing the model to focus on words that are truly indicative of sentiment.
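
One caveat for step 6: NLTK's English stopword list includes negations such as "not", "no", and "nor". The sketch below uses two hypothetical mini stopword sets (not the NLTK list itself) to show what happens to a negated phrase when the negation is or is not treated as a stopword.

```python
def make_bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [f'{a} {b}' for a, b in zip(tokens, tokens[1:])]

# Hypothetical mini stopword sets, for illustration only
toy_stopwords_drop_negation = {'this', 'is', 'a', 'not'}
toy_stopwords_keep_negation = {'this', 'is', 'a'}

toy_tokens = 'this is not a good game'.split()

tokens_dropped = [t for t in toy_tokens if t not in toy_stopwords_drop_negation]
tokens_kept = [t for t in toy_tokens if t not in toy_stopwords_keep_negation]

print(make_bigrams(tokens_dropped))  # the negation is gone
print(make_bigrams(tokens_kept))     # 'not good' survives as a bigram
```

If negation matters for the downstream features, the negation words need to be excluded from the stopword set before filtering.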

In [None]:
# Handle missing text values
df['title'] = df['title'].fillna('')
df['text'] = df['text'].fillna('')

# Combine title and text for richer features
df['combined_text'] = df['title'] + ' ' + df['text']

print(f'Sample combined text (first review):')
print(f'Title: {df["title"].iloc[0]}')
print(f'Text:  {df["text"].iloc[0][:200]}...')
print(f'Combined: {df["combined_text"].iloc[0][:200]}...')

In [None]:
# Initialize NLP tools
lemmatizer = WordNetLemmatizer()
# Keep negation words: NLTK's English stopword list contains them, and dropping
# them would stop the bigram features in Section 2.4 from capturing "not good"
negations = {'no', 'not', 'nor'}
stop_words = set(stopwords.words('english')) - negations

def clean_text(text):
    """Strip markup, URLs, and non-letter characters, then lowercase."""
    # Replace HTML tags with a space so neighbouring words are not fused
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', ' ', text)
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_text(text):
    """Full preprocessing: clean, tokenize, remove stopwords, lemmatize."""
    text = clean_text(text)
    tokens = word_tokenize(text)
    # Drop stopwords and very short tokens, but always keep negations
    tokens = [lemmatizer.lemmatize(t) for t in tokens
              if t in negations or (t not in stop_words and len(t) > 2)]
    return ' '.join(tokens)

# Test the preprocessing pipeline
sample = 'This game is ABSOLUTELY amazing!!! I love it <br> so much... http://example.com 5/5'
print(f'Original:     {sample}')
print(f'Cleaned:      {clean_text(sample)}')
print(f'Preprocessed: {preprocess_text(sample)}')

In [None]:
# Apply preprocessing to every row of the dataset
print('Preprocessing text... This may take a few minutes.')
df['processed_text'] = df['combined_text'].apply(preprocess_text)
print(f'Preprocessing complete. {len(df):,} reviews processed.')

# Show before and after examples
print('\n--- Before and After Preprocessing ---')
for i in range(3):
    print(f'\nReview {i+1}:')
    print(f'  Original:     {df["combined_text"].iloc[i][:100]}...')
    print(f'  Preprocessed: {df["processed_text"].iloc[i][:100]}...')

### 2.3 Exploratory Text Analysis

Before building models, we explore the processed text to understand patterns across sentiment classes. This helps validate our preprocessing and reveals insights about the language used in each sentiment category.

In [None]:
# Text length analysis by sentiment
df['text_length'] = df['processed_text'].apply(lambda x: len(x.split()))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
order = ['Negative', 'Neutral', 'Positive']
colors_list = ['#e74c3c', '#f39c12', '#2ecc71']
bp = sns.boxplot(x='sentiment', y='text_length', data=df, order=order,
                 palette=colors_list, ax=axes[0], showfliers=False)
axes[0].set_title('Review Length by Sentiment', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Sentiment')
axes[0].set_ylabel('Word Count')

# Histogram
for sent, color in zip(order, colors_list):
    subset = df[df['sentiment'] == sent]['text_length']
    axes[1].hist(subset, bins=50, alpha=0.5, label=sent, color=color, range=(0, 200))
axes[1].set_title('Distribution of Review Lengths', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].legend()

plt.tight_layout()
plt.show()

# Summary statistics
print('Average word count by sentiment:')
print(df.groupby('sentiment')['text_length'].describe()[['mean', 'std', 'min', 'max']].round(1))

In [None]:
# Most frequent words per sentiment class
from collections import Counter

fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sentiments = ['Negative', 'Neutral', 'Positive']
colors_map = {'Negative': '#e74c3c', 'Neutral': '#f39c12', 'Positive': '#2ecc71'}

for ax, sent in zip(axes, sentiments):
    text = ' '.join(df[df['sentiment'] == sent]['processed_text'].values)
    word_freq = Counter(text.split())
    top_words = word_freq.most_common(20)
    words, counts = zip(*top_words)
    ax.barh(range(len(words)), counts, color=colors_map[sent])
    ax.set_yticks(range(len(words)))
    ax.set_yticklabels(words)
    ax.invert_yaxis()
    ax.set_title(f'Top 20 Words — {sent}', fontsize=13, fontweight='bold')
    ax.set_xlabel('Frequency')

plt.tight_layout()
plt.show()

**Observations to check against the plots above**:
- Negative reviews often run longer, since dissatisfied customers tend to explain their complaints in detail
- Positive reviews typically feature words such as "great", "fun", "love", "good"
- Negative reviews typically feature words such as "bad", "worst", "waste", "boring"
- Where the classes show distinct word patterns like these, bag-of-words and TF-IDF features should capture sentiment well

### 2.4 Feature Engineering — Text Vectorization

We convert the preprocessed text into numerical features using two methods:

1. **TF-IDF (Term Frequency–Inverse Document Frequency)**:
   - Weights words by their importance in a document relative to the corpus
   - Common words (appearing in many reviews) get lower weight
   - Rare, distinctive words get higher weight
   - **Best for**: Logistic Regression and SVM, which benefit from continuous, normalized features

2. **Count Vectorizer (Bag of Words)**:
   - Simply counts word occurrences in each document
   - **Best for**: Multinomial Naive Bayes, which models word counts as a multinomial distribution

**Rationale for TF-IDF as primary vectorizer**:
- TF-IDF generally outperforms raw counts because it downweights ubiquitous terms
- It produces normalized features that work well with linear models
- The `max_features=20000` limit controls dimensionality while retaining the most informative terms
- Using both unigrams and bigrams (`ngram_range=(1,2)`) captures phrases like "not good" that reverse sentiment
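
To make the weighting concrete, here is a minimal plain-Python sketch of the TF-IDF formula that scikit-learn documents for `TfidfVectorizer` with `sublinear_tf=True` and the default smoothed IDF and L2 normalisation. The helper `tfidf_vector` and the three-document toy corpus are hypothetical; the notebook itself uses `TfidfVectorizer` directly.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs):
    """TF-IDF weights for one document, L2-normalised."""
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        sublinear_tf = 1.0 + math.log(tf)                          # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0   # smoothed IDF
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Made-up corpus: 'game' appears everywhere, 'fun' only once
toy_corpus = [
    ['great', 'game'],
    ['bad', 'game'],
    ['great', 'fun', 'game'],
]
toy_doc_freq = Counter(term for doc in toy_corpus for term in set(doc))
toy_tfidf = tfidf_vector(toy_corpus[2], toy_doc_freq, len(toy_corpus))

for term, weight in sorted(toy_tfidf.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{term:>6}: {weight:.3f}')
```

In the third toy document the rare word "fun" receives the largest weight and the ubiquitous word "game" the smallest, which is the downweighting effect described above.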

In [None]:
# Prepare features and target
X = df['processed_text']
y = df['sentiment']

# Encode target labels (shown for reference; the models below are trained on the string labels in y)
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(f'Label encoding: {dict(zip(le.classes_, le.transform(le.classes_)))}')

# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'\nTraining set size: {len(X_train):,}')
print(f'Testing set size:  {len(X_test):,}')
print(f'\nTraining set sentiment distribution:')
print(y_train.value_counts())
print(f'\nTest set sentiment distribution:')
print(y_test.value_counts())

In [None]:
# TF-IDF Vectorizer
tfidf = TfidfVectorizer(
    max_features=20000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95,
    sublinear_tf=True
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f'TF-IDF feature matrix shape (train): {X_train_tfidf.shape}')
print(f'TF-IDF feature matrix shape (test):  {X_test_tfidf.shape}')
print(f'Vocabulary size: {len(tfidf.vocabulary_):,} terms')

# Count Vectorizer (for Naive Bayes)
count_vec = CountVectorizer(
    max_features=20000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

X_train_count = count_vec.fit_transform(X_train)
X_test_count = count_vec.transform(X_test)

print(f'\nCount Vectorizer feature matrix shape (train): {X_train_count.shape}')
print(f'Count Vectorizer feature matrix shape (test):  {X_test_count.shape}')

---
## 3. Modelling

We build **three classification models** using different algorithms. Each model is selected for specific strengths in text classification:

| Model | Algorithm | Vectorizer | Key Strength |
|-------|-----------|------------|-------------|
| Model 1 | Logistic Regression | TF-IDF | Strong baseline, interpretable, probabilistic output |
| Model 2 | Linear SVC | TF-IDF | Excellent in high-dimensional spaces, margin-based |
| Model 3 | Multinomial Naive Bayes | Count Vectorizer | Fast, probabilistic, natural fit for word counts |

**Why these three models?**
- They represent three fundamentally different approaches: discriminative linear (LR), maximum-margin (SVM), and generative probabilistic (NB)
- All are well-established for text classification in both academic literature and industry practice
- They are computationally efficient for large, high-dimensional text data
- Each uses hyperparameter tuning via GridSearchCV with 3-fold cross-validation to find optimal settings

### 3.1 Model 1: Logistic Regression with TF-IDF

**Rationale for selection**:
- Logistic Regression is one of the **strongest baselines** for text classification and frequently matches or outperforms more complex models
- It is **highly interpretable** — we can examine feature coefficients to understand which words drive predictions
- It provides **probabilistic outputs** (class probabilities), useful for confidence scoring
- The regularization parameter `C` controls the trade-off between fitting the training data and generalizing to new data
- Using `class_weight='balanced'` compensates for class imbalance by giving minority classes higher weight

**Hyperparameters tuned**:
- `C`: Inverse regularization strength — lower values = stronger regularization
- `solver`: Optimization algorithm, fixed to 'lbfgs' (a single grid value, so it is not actually varied), which is efficient for multiclass problems with an L2 penalty

In [None]:
# Model 1: Logistic Regression
print('Training Model 1: Logistic Regression with TF-IDF')
print('=' * 55)

# Hyperparameter tuning
lr_params = {
    'C': [0.1, 1.0, 10.0],
    'solver': ['lbfgs'],
    'max_iter': [1000]
}

lr_grid = GridSearchCV(
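    # lbfgs already fits a multinomial model for multiclass targets; the explicit
    # multi_class argument is deprecated in recent scikit-learn releases
    LogisticRegression(class_weight='balanced', random_state=42),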
    lr_params,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=0
)

lr_grid.fit(X_train_tfidf, y_train)

print(f'Best parameters: {lr_grid.best_params_}')
print(f'Best CV F1 (macro): {lr_grid.best_score_:.4f}')

# Predictions
lr_model = lr_grid.best_estimator_
lr_train_pred = lr_model.predict(X_train_tfidf)
lr_test_pred = lr_model.predict(X_test_tfidf)

lr_train_acc = accuracy_score(y_train, lr_train_pred)
lr_test_acc = accuracy_score(y_test, lr_test_pred)

print(f'\nTraining Accuracy: {lr_train_acc:.4f}')
print(f'Testing Accuracy:  {lr_test_acc:.4f}')
print(f'\nClassification Report (Test Set):')
print(classification_report(y_test, lr_test_pred))

### 3.2 Model 2: Linear Support Vector Machine (LinearSVC) with TF-IDF

**Rationale for selection**:
- SVMs are **theoretically well-suited** for high-dimensional, sparse data like text features — they find the optimal hyperplane that maximizes the margin between classes
- LinearSVC is specifically designed for **efficient large-scale** text classification
- It is often among the **strongest classical (non-neural) models** on text classification benchmarks
- Unlike kernel SVMs, LinearSVC training time scales roughly linearly with the number of samples, making it practical for 50K reviews
- The `class_weight='balanced'` parameter handles our class imbalance

**Hyperparameters tuned**:
- `C`: Regularization parameter — controls the penalty for misclassification
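
To show what `C` trades off, the sketch below evaluates a simplified binary version of the objective that a linear SVM minimises: a regularisation term plus `C` times the squared hinge loss (squared hinge is the documented default loss of `LinearSVC`). The function name, the toy points, and the fixed weight vector are assumptions for illustration; the real model handles the three classes one-vs-rest and optimises the weights itself.

```python
def squared_hinge_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b)) ** 2)."""
    regulariser = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        loss += max(0.0, 1.0 - margin) ** 2
    return regulariser + C * loss

# Made-up points: the third one sits exactly on the decision boundary
svm_X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
svm_y = [1, -1, -1]
svm_w, svm_b = [1.0, -1.0], 0.0

svm_objectives = {C: squared_hinge_objective(svm_w, svm_b, svm_X, svm_y, C)
                  for C in (0.1, 1.0, 10.0)}
for C, value in svm_objectives.items():
    print(f'C={C:>4}: objective = {value:.2f}')
```

The margin violation of the third point is multiplied by `C`, so larger values of `C` push the optimiser to fit the training points more tightly at the expense of a wider margin.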

In [None]:
# Model 2: Linear SVC
print('Training Model 2: LinearSVC with TF-IDF')
print('=' * 55)

# Hyperparameter tuning
svc_params = {
    'C': [0.1, 1.0, 10.0],
    'max_iter': [5000]
}

svc_grid = GridSearchCV(
    LinearSVC(class_weight='balanced', random_state=42),
    svc_params,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=0
)

svc_grid.fit(X_train_tfidf, y_train)

print(f'Best parameters: {svc_grid.best_params_}')
print(f'Best CV F1 (macro): {svc_grid.best_score_:.4f}')

# Predictions
svc_model = svc_grid.best_estimator_
svc_train_pred = svc_model.predict(X_train_tfidf)
svc_test_pred = svc_model.predict(X_test_tfidf)

svc_train_acc = accuracy_score(y_train, svc_train_pred)
svc_test_acc = accuracy_score(y_test, svc_test_pred)

print(f'\nTraining Accuracy: {svc_train_acc:.4f}')
print(f'Testing Accuracy:  {svc_test_acc:.4f}')
print(f'\nClassification Report (Test Set):')
print(classification_report(y_test, svc_test_pred))

### 3.3 Model 3: Multinomial Naive Bayes with Count Vectorizer

**Rationale for selection**:
- Multinomial Naive Bayes is a **classic text classification algorithm** that treats documents as bags of words with multinomial frequency distributions
- It provides a **fundamentally different approach** — a generative probabilistic model versus the discriminative models above
- Despite its "naive" independence assumption, which word features clearly violate, it often performs **surprisingly well** on text data because classification only needs the correct class to score highest, not accurate probability estimates
- It is **extremely fast** to train and predict, making it ideal for production systems
- We use **Count Vectorizer** (instead of TF-IDF) because MultinomialNB expects raw frequency counts that follow a multinomial distribution

**Hyperparameters tuned**:
- `alpha`: Laplace smoothing parameter — prevents zero probabilities for unseen words; higher values increase smoothing
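
As a small worked example of the smoothing formula, the sketch below estimates per-class word probabilities as `(count + alpha) / (total + alpha * vocabulary_size)`. The helper `smoothed_word_probs` and the toy counts are hypothetical.

```python
from collections import Counter

def smoothed_word_probs(class_tokens, vocabulary, alpha):
    """Additive-smoothed estimate of P(word | class)."""
    counts = Counter(class_tokens)
    denominator = len(class_tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

toy_vocab = ['great', 'fun', 'bad']
toy_positive_tokens = ['great', 'great', 'fun']  # 'bad' never occurs in this class

nb_probs = {alpha: smoothed_word_probs(toy_positive_tokens, toy_vocab, alpha)
            for alpha in (0.01, 1.0)}
for alpha, probs in nb_probs.items():
    print(f'alpha={alpha}: P(bad | Positive) = {probs["bad"]:.4f}')
```

With `alpha=1.0` the unseen word receives a probability of 1/6, while `alpha=0.01` leaves it close to zero; the grid search picks the value that works best on the validation folds.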

In [None]:
# Model 3: Multinomial Naive Bayes
print('Training Model 3: Multinomial Naive Bayes with Count Vectorizer')
print('=' * 55)

# Hyperparameter tuning
nb_params = {
    'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]
}

nb_grid = GridSearchCV(
    MultinomialNB(),
    nb_params,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=0
)

nb_grid.fit(X_train_count, y_train)

print(f'Best parameters: {nb_grid.best_params_}')
print(f'Best CV F1 (macro): {nb_grid.best_score_:.4f}')

# Predictions
nb_model = nb_grid.best_estimator_
nb_train_pred = nb_model.predict(X_train_count)
nb_test_pred = nb_model.predict(X_test_count)

nb_train_acc = accuracy_score(y_train, nb_train_pred)
nb_test_acc = accuracy_score(y_test, nb_test_pred)

print(f'\nTraining Accuracy: {nb_train_acc:.4f}')
print(f'Testing Accuracy:  {nb_test_acc:.4f}')
print(f'\nClassification Report (Test Set):')
print(classification_report(y_test, nb_test_pred))

---
## 4. Evaluation

We now comprehensively compare all three models across multiple metrics to select the best one. Our evaluation strategy:

1. **Confusion matrices** — visualise where each model makes errors
2. **Performance comparison table** — compare Accuracy, Precision, Recall, and F1-Score
3. **Bar chart comparison** — visual summary of key metrics
4. **Cross-validation** — assess model robustness and variance
5. **Error analysis** — examine misclassified examples
6. **Best model selection** — justified recommendation

### 4.1 Confusion Matrix Visualizations

Confusion matrices show the detailed breakdown of correct and incorrect predictions for each class. This reveals:
- Which sentiment classes are easiest/hardest to classify
- Common misclassification patterns (e.g., Neutral being confused with Positive)

In [None]:
# Confusion matrices for all three models
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

models_info = [
    ('Logistic Regression', lr_test_pred),
    ('LinearSVC', svc_test_pred),
    ('Multinomial NB', nb_test_pred)
]

labels = ['Negative', 'Neutral', 'Positive']

for ax, (name, preds) in zip(axes, models_info):
    cm = confusion_matrix(y_test, preds, labels=labels)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=labels, yticklabels=labels)
    ax.set_title(f'{name}', fontsize=13, fontweight='bold')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.suptitle('Confusion Matrices — All Models', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

### 4.2 Performance Comparison Table

We compute multiple metrics for a comprehensive comparison:
- **Accuracy**: Overall correct predictions (can be misleading with imbalanced data)
- **Precision (macro)**: Average precision across all classes — measures false positive control
- **Recall (macro)**: Average recall across all classes — measures false negative control
- **F1-Score (macro)**: Unweighted mean of the per-class F1 scores, where each class's F1 is the harmonic mean of its precision and recall. This is our primary metric because it treats all classes equally regardless of size
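
A short worked example shows why accuracy alone can mislead on imbalanced data. The toy labels below are made up, and `per_class_f1` is a hypothetical helper that mirrors the usual definition; the notebook itself relies on `sklearn.metrics`.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class: harmonic mean of its precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A classifier that always predicts the majority class
toy_true = ['Positive'] * 8 + ['Neutral'] + ['Negative']
toy_pred = ['Positive'] * 10

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_macro_f1 = sum(per_class_f1(toy_true, toy_pred, c)
                   for c in ('Negative', 'Neutral', 'Positive')) / 3

print(f'Accuracy: {toy_accuracy:.3f}')
print(f'Macro F1: {toy_macro_f1:.3f}')
```

The majority-class predictor looks respectable on accuracy but scores poorly on macro F1, because the two minority classes each contribute an F1 of zero to the average.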

In [None]:
# Build comparison table
results = {}
model_preds = {
    'Logistic Regression': (lr_test_pred, lr_train_acc),
    'LinearSVC': (svc_test_pred, svc_train_acc),
    'Multinomial NB': (nb_test_pred, nb_train_acc)
}

for name, (preds, train_acc) in model_preds.items():
    results[name] = {
        'Train Accuracy': train_acc,
        'Test Accuracy': accuracy_score(y_test, preds),
        'Precision (macro)': precision_score(y_test, preds, average='macro'),
        'Recall (macro)': recall_score(y_test, preds, average='macro'),
        'F1-Score (macro)': f1_score(y_test, preds, average='macro')
    }

results_df = pd.DataFrame(results).T
results_df = results_df.round(4)

print('\n========== MODEL PERFORMANCE COMPARISON ==========')
print(results_df.to_string())
print('\nBest model by F1-Score (macro):', results_df['F1-Score (macro)'].idxmax())

### 4.3 Visual Comparison of Model Performance

In [None]:
# Bar chart comparison
metrics_to_plot = ['Test Accuracy', 'Precision (macro)', 'Recall (macro)', 'F1-Score (macro)']
x = np.arange(len(metrics_to_plot))
width = 0.25

fig, ax = plt.subplots(figsize=(12, 6))

model_names = list(results.keys())
colors = ['#3498db', '#e74c3c', '#2ecc71']

for i, (model_name, color) in enumerate(zip(model_names, colors)):
    values = [results[model_name][m] for m in metrics_to_plot]
    bars = ax.bar(x + i * width, values, width, label=model_name, color=color, edgecolor='black', linewidth=0.5)
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width() / 2., bar.get_height() + 0.005,
                f'{val:.3f}', ha='center', va='bottom', fontsize=9, fontweight='bold')

ax.set_xlabel('Metric', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x + width)
ax.set_xticklabels(metrics_to_plot)
ax.legend(loc='lower right')
ax.set_ylim(0, 1.1)
ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

### 4.4 Cross-Validation for Robustness

We perform 5-fold cross-validation to assess each model's stability and generalization ability. A model with low variance across folds is more reliable than one with high variance, even if the latter has a slightly higher mean score.

In [None]:
# Cross-validation comparison
print('Performing 5-fold cross-validation...')
print('=' * 55)

cv_models = {
    'Logistic Regression': (lr_model, X_train_tfidf),
    'LinearSVC': (svc_model, X_train_tfidf),
    'Multinomial NB': (nb_model, X_train_count)
}

cv_results = {}
for name, (model, X_data) in cv_models.items():
    scores = cross_val_score(model, X_data, y_train, cv=5, scoring='f1_macro', n_jobs=-1)
    cv_results[name] = scores
    print(f'{name:>25}: Mean F1 = {scores.mean():.4f} (+/- {scores.std():.4f})  | Folds: {np.round(scores, 4)}')

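# Visualise CV results
fig, ax = plt.subplots(figsize=(10, 5))
cv_colors = ['#3498db', '#e74c3c', '#2ecc71']  # same palette as the bar chart in 4.3
bp = ax.boxplot([cv_results[name] for name in cv_results], patch_artist=True)
# Set tick labels separately; the boxplot `labels` keyword is deprecated in recent Matplotlib
ax.set_xticklabels(list(cv_results.keys()))

for patch, color in zip(bp['boxes'], cv_colors):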
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.set_title('Cross-Validation F1-Score (Macro) Distribution', fontsize=14, fontweight='bold')
ax.set_ylabel('F1-Score (Macro)')
plt.tight_layout()
plt.show()
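
The cross-validation above runs on matrices that were vectorised once on the full training set, so every validation fold has already influenced the vocabulary and IDF weights. One way to avoid this, sketched below under the assumption that refitting the vectoriser per fold is affordable, is to wrap the vectoriser and classifier in a `Pipeline` and cross-validate on raw text. The toy reviews and `toy_` names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up reviews, four per class
toy_reviews = [
    'great fun game', 'love this game', 'great story love it', 'fun and great',
    'okay game', 'average story', 'okay but average', 'fine game okay',
    'bad boring game', 'worst game', 'boring and bad', 'worst story bad',
]
toy_sentiments = ['Positive'] * 4 + ['Neutral'] * 4 + ['Negative'] * 4

# The vectoriser is refit on the training part of each fold only
toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])

toy_scores = cross_val_score(toy_pipeline, toy_reviews, toy_sentiments,
                             cv=2, scoring='f1_macro')
print(f'Fold scores: {toy_scores}')
```

On the real data the same pattern would take `X_train` (the processed text) instead of `X_train_tfidf`, at the cost of refitting the vectoriser in every fold.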

### 4.5 Model Strengths and Weaknesses

| Model | Strengths | Weaknesses |
|-------|-----------|------------|
| **Logistic Regression** | Interpretable coefficients; probabilistic output; good balance of precision/recall; handles imbalanced classes well with `class_weight` | May underperform on highly non-linear decision boundaries |
| **LinearSVC** | Maximizes class margin for better generalization; scales well to large feature spaces; typically top performer for text | No probability estimates by default; sensitive to C parameter; less interpretable than LR |
| **Multinomial NB** | Extremely fast training/prediction; works well with small training data; natural probabilistic model for word counts | Strong independence assumption may hurt; struggles with complex feature interactions; no `class_weight` option (imbalance has to be handled through `class_prior` or `sample_weight` instead) |

**Key observations**:
- The Neutral class is typically the hardest to classify, because neutral reviews use language that overlaps with both positive and negative sentiments; the confusion matrices in 4.1 show whether this holds for each model
- The Positive class typically has the highest recall because of its larger representation in the data
- The gap between training and testing accuracy indicates whether a model is overfitting

### 4.6 Error Analysis

Examining misclassified examples helps us understand model limitations and identify potential improvements.

In [None]:
# Error analysis using the best model
# Determine best model for error analysis
best_model_name = results_df['F1-Score (macro)'].idxmax()
if best_model_name == 'Logistic Regression':
    best_preds = lr_test_pred
elif best_model_name == 'LinearSVC':
    best_preds = svc_test_pred
else:
    best_preds = nb_test_pred

# Create a DataFrame for analysis
test_df = pd.DataFrame({
    'text': X_test.values,
    'actual': y_test.values,
    'predicted': best_preds
})

misclassified = test_df[test_df['actual'] != test_df['predicted']]
correct = test_df[test_df['actual'] == test_df['predicted']]

print(f'Best model for error analysis: {best_model_name}')
print(f'Total test samples: {len(test_df):,}')
print(f'Correctly classified: {len(correct):,} ({len(correct)/len(test_df)*100:.1f}%)')
print(f'Misclassified: {len(misclassified):,} ({len(misclassified)/len(test_df)*100:.1f}%)')

# Misclassification patterns
print(f'\n--- Misclassification Patterns ---')
misclass_patterns = misclassified.groupby(['actual', 'predicted']).size().reset_index(name='count')
misclass_patterns = misclass_patterns.sort_values('count', ascending=False)
print(misclass_patterns.to_string(index=False))

In [None]:
# Show sample misclassified reviews
print('\n--- Sample Misclassified Reviews ---\n')
for actual_label in ['Negative', 'Neutral', 'Positive']:
    subset = misclassified[misclassified['actual'] == actual_label].head(2)
    if len(subset) > 0:
        print(f'Actual: {actual_label}')
        for _, row in subset.iterrows():
            text_preview = row['text'][:120] + '...' if len(row['text']) > 120 else row['text']
            print(f'  Predicted: {row["predicted"]} | Text: {text_preview}')
        print()

### 4.7 Best Model Selection

Based on our comprehensive evaluation, we now select the best model with clear justification.

In [None]:
# Final model selection summary
print('=' * 60)
print('           FINAL MODEL SELECTION SUMMARY')
print('=' * 60)

best_name = results_df['F1-Score (macro)'].idxmax()
best_row = results_df.loc[best_name]

print(f'\n>>> BEST MODEL: {best_name} <<<\n')
print(f'Test Accuracy:     {best_row["Test Accuracy"]:.4f}')
print(f'Precision (macro): {best_row["Precision (macro)"]:.4f}')
print(f'Recall (macro):    {best_row["Recall (macro)"]:.4f}')
print(f'F1-Score (macro):  {best_row["F1-Score (macro)"]:.4f}')

print('\n--- Ranking by F1-Score (Macro) ---')
ranking = results_df['F1-Score (macro)'].sort_values(ascending=False)
for i, (model, score) in enumerate(ranking.items(), 1):
    marker = ' <<<' if model == best_name else ''
    print(f'  {i}. {model}: {score:.4f}{marker}')

print('\n--- Cross-Validation Stability ---')
for name, scores in cv_results.items():
    stability = 'High' if scores.std() < 0.01 else 'Medium' if scores.std() < 0.02 else 'Low'
    print(f'  {name}: std = {scores.std():.4f} ({stability} stability)')

### Justification for Best Model Selection

The best model is selected based on the following criteria, ranked by importance:

1. **F1-Score (Macro)**: Our primary metric. It is the unweighted mean of the per-class F1-scores, so every sentiment class counts equally regardless of its size. A model therefore cannot score well by predicting only the majority Positive class; it must also perform on the minority classes (Negative and Neutral).

2. **Cross-Validation Stability**: A model whose score varies little across folds (low standard deviation) is more trustworthy for production deployment than one with high variance.

3. **Generalization Gap**: A small gap between training and testing accuracy suggests the model is not overfitting to the training data.

4. **Practical Considerations**:
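   - **Logistic Regression**: Offers the best interpretability of the three. Businesses can see *why* a review is classified as positive or negative by examining the words with the largest coefficients for each class. It also outputs class probabilities (`predict_proba`), which are often reasonably well calibrated, although this should be checked rather than assumed.
   - **LinearSVC**: Often a strong performer on high-dimensional, sparse text features, but it has no `predict_proba` method; probability estimates require an extra calibration step such as `CalibratedClassifierCV`.
   - **Multinomial NB**: Cheapest to train and fast at inference, ideal if speed is the primary concern.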

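The model with the highest macro F1-score is therefore selected, with cross-validation stability, the generalization gap, and the practical considerations above used to confirm that it offers the best overall balance of accuracy, robustness, and practical utility for a business sentiment analysis system.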
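
To make the primary criterion concrete, the sketch below compares accuracy with macro and weighted F1 on invented toy labels (an assumption for illustration, not the notebook's results). A degenerate classifier that always predicts the majority class still reaches high accuracy, while macro F1 exposes that two of the three classes are never predicted.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels invented for illustration: 8 Positive, 1 Neutral, 1 Negative
y_true = ["Positive"] * 8 + ["Neutral", "Negative"]
# A degenerate classifier that always predicts the majority class
y_pred = ["Positive"] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"Accuracy:    {acc:.3f}")          # 0.800
print(f"Macro F1:    {macro_f1:.3f}")     # 0.296
print(f"Weighted F1: {weighted_f1:.3f}")  # 0.711
```

The Positive class has precision 0.8 and recall 1.0, giving an F1 of about 0.889; the other two classes score 0, so the unweighted mean is about 0.296. Weighted F1 (about 0.711) hides the problem because it is dominated by the majority class.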

### 4.8 Business Recommendation

**How can businesses use this sentiment classification model?**

1. **Automated Review Monitoring**: Deploy the model to automatically classify incoming reviews into Positive, Neutral, and Negative categories in real time. This enables product managers to quickly identify and respond to negative feedback.

2. **Customer Satisfaction Tracking**: Aggregate sentiment predictions over time to track customer satisfaction trends. A sudden increase in negative sentiment could indicate a product defect, a bad update, or a service issue that needs immediate attention.

3. **Product Improvement Prioritization**: Analyse the most common words and phrases in negative reviews (using the model's interpretable features) to identify specific issues customers complain about — e.g., "lag", "crash", "overpriced".

4. **Review Summarization**: Filter and surface the most informative reviews for each sentiment class to help potential buyers make informed decisions.

5. **Competitive Analysis**: Apply the model to competitor product reviews to benchmark sentiment and identify areas of competitive advantage or weakness.
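
The satisfaction-tracking idea in point 2 could look roughly like the sketch below. It assumes the model's predictions have already been stored alongside a review date; the column names `date` and `predicted`, the sample rows, and the 0.5 alert threshold are all hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical scored reviews; in practice 'predicted' would come from the model
scored = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
        "2024-01-08", "2024-01-09", "2024-01-10", "2024-01-11",
    ]),
    "predicted": [
        "Positive", "Positive", "Negative", "Positive",
        "Negative", "Negative", "Neutral", "Negative",
    ],
})

# Share of Negative predictions per calendar week (weeks end on Sunday)
is_negative = scored["predicted"].eq("Negative").astype(float)
weekly_negative = (
    is_negative.set_axis(scored["date"])
               .resample("W")
               .mean()
)

# Flag weeks whose negative share exceeds an arbitrary, assumed threshold
ALERT_THRESHOLD = 0.5
alerts = weekly_negative[weekly_negative > ALERT_THRESHOLD]

print(weekly_negative)
print("Weeks needing attention:")
print(alerts)
```

In this toy data the negative share rises from 0.25 in the first week to 0.75 in the second, so only the second week is flagged. A real deployment would need a threshold tuned to the product's normal baseline.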

**Limitations to be aware of**:
- The model is trained on video game reviews and may not generalize well to other product categories
- Sarcasm and irony (e.g., "Great, another broken game") may be misclassified
- The Neutral class remains the most challenging to predict accurately
- Model performance should be monitored and retrained periodically as review language evolves

---
## Citation

**Dataset Source:**

Hou, Y., Li, J., He, Z., Yan, A., Chen, X., & McAuley, J. (2024). *Bridging Language and Items for Retrieval and Recommendation*. arXiv preprint arXiv:2403.03952.

---
## Submission

To export this notebook as HTML for submission:

1. In Jupyter Notebook: **File → Download as → HTML (.html)**
2. In JupyterLab: **File → Export Notebook As → HTML**
3. Via command line: `jupyter nbconvert --to html SentimentClassificationStarter.ipynb`

Ensure all cells have been run and outputs are visible before exporting.