# Amazon Fine Food Reviews — Sentiment Analysis

**Author:** Ricky Gong, University of Pennsylvania (`gong8@seas.upenn.edu`)  
**Dataset:** Amazon Fine Food Reviews (568,454 reviews)  
**Task:** Binary sentiment classification — Positive vs. Negative

---

## Project Overview

This notebook builds a complete NLP pipeline to predict the **sentiment** (positive / negative) of Amazon food product reviews. Starting from raw text, we apply text preprocessing, feature engineering, and progressively more sophisticated classification models, addressing the key challenges of **class imbalance**, the **precision–recall trade-off**, and the **expressiveness of text representations**.

### Key Design Principles

1. **Baseline-first**: Always train a baseline model before adding any complexity.
2. **Stop-word caution**: In sentiment analysis, negation words ("not", "won't", "can't") must **not** be removed.
3. **Dimensionality reduction is not free**: SVD is appropriate only when dense input is required (e.g., for SMOTE), never as the first step before evaluation.
4. **Evaluation beyond Accuracy**: With a 78/22 class split, we prioritise **Recall** for the negative (minority) class.
5. **Representation matters**: We compare sparse TF-IDF, static dense Word2Vec, and contextual transformer (DistilBERT) representations.

> ⚡ **GPU Note**: Sections 1–10 and 11.1–11.2 run on CPU. **Section 11.3 (DistilBERT) requires a GPU.** Run on Google Colab (Runtime → Change Runtime Type → GPU).

---

## Table of Contents

1. [Setup & Imports](#1)
2. [Data Loading & Exploratory Analysis](#2)
3. [Text Preprocessing](#3)
4. [Feature Engineering — TF-IDF](#4)
5. [Train / Test Split](#5)
6. [Baseline Model — Logistic Regression](#6)
7. [Class Imbalance: Problem & Solutions](#7)
   - 7.1 Class Weights
   - 7.2 SMOTE (with SVD critique)
   - 7.3 Random Undersampling
8. [Regularisation & Hyperparameter Tuning](#8)
9. [Threshold Tuning — Balancing Precision & Recall](#9)
10. [Random Forest Classifier](#10)
11. [Beyond Bag-of-Words: Word Embeddings](#11)
    - 11.1 Theory and Motivation
    - 11.2 Word2Vec — Static Dense Embeddings
    - 11.3 DistilBERT — Contextual Transformer Embeddings ⚡ GPU
    - 11.4 Embedding Method Comparison
12. [Ensemble Learning — Soft Voting](#12)
13. [Model Comparison & Selection](#13)
14. [Feature Importance & Interpretability](#14)
15. [Conclusions](#15)

---
## 1. Setup & Imports <a id='1'></a>

In [None]:
import pandas as pd
import numpy as np
import re
import string
import warnings
warnings.filterwarnings('ignore')

# NLP
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Model selection & evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    roc_curve, precision_recall_curve, auc,
    precision_score, recall_score, f1_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Class imbalance
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Persistence
import joblib

# Visualisation
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
from wordcloud import WordCloud
from tqdm import tqdm
tqdm.pandas()  # progress bar for pandas (enables .progress_apply)

# Plot defaults
plt.rcParams.update({'figure.dpi': 120, 'font.size': 11})
print('All imports successful.')

---
## 2. Data Loading & Exploratory Analysis <a id='2'></a>

The dataset is sourced from [Kaggle — Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews). Place `Reviews.csv` in the `../data/` directory before running this cell.

In [None]:
df = pd.read_csv('../data/Reviews.csv')
print(f'Shape: {df.shape}')
df.head(3)

In [None]:
# Rename for clarity
df.columns = ['Id','ProductId','UserId','ProfileName',
              'VotesHelpful','VotesTotal','Score','Time','Summary','Text']

print('Missing values:')
print(df[['Score','Summary','Text']].isnull().sum())
print(f'\nUnique products : {df.ProductId.nunique():,}')
print(f'Unique users    : {df.UserId.nunique():,}')
print(f'Score range     : {df.Score.min()} – {df.Score.max()}')

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Score distribution
ax = axes[0]
score_counts = df['Score'].value_counts().sort_index()
colors = ['#e74c3c','#e67e22','#f39c12','#27ae60','#2ecc71']
ax.bar(score_counts.index, score_counts.values, color=colors, edgecolor='white', linewidth=0.8)
ax.set_xlabel('Star Rating')
ax.set_ylabel('Number of Reviews')
ax.set_title('Score Distribution (raw data)')
for x, y in zip(score_counts.index, score_counts.values):
    ax.text(x, y + 2000, f'{y:,}', ha='center', va='bottom', fontsize=9)

# Text length distribution
ax2 = axes[1]
df['text_len'] = df['Text'].str.split().str.len()
for score, color, label in [(5,'#2ecc71','5-star'), (1,'#e74c3c','1-star')]:
    subset = df[df['Score'] == score]['text_len'].clip(0, 500)
    ax2.hist(subset, bins=50, alpha=0.55, color=color, label=label, density=True)
ax2.set_xlabel('Review length (words, clipped at 500)')
ax2.set_ylabel('Density')
ax2.set_title('Review Length Distribution by Extreme Scores')
ax2.legend()

plt.tight_layout()
plt.show()
print(f'Median review length: {df.text_len.median():.0f} words')

### 2.1 Label Construction

We convert the 5-point Likert scale to a binary sentiment label:

$$
y_i = \begin{cases} 1 \;(\text{positive}) & \text{if } s_i > 3 \\ 0 \;(\text{negative}) & \text{if } s_i < 3 \\ \text{excluded} & \text{if } s_i = 3 \end{cases}
$$

Reviews with $s_i = 3$ represent ambiguous/neutral opinions and are excluded to prevent label noise.

In [None]:
# Assign binary labels; drop Score == 3
df = df[df['Score'] != 3].copy()
df['Sentiment'] = (df['Score'] > 3).astype(int)  # 1=positive, 0=negative

counts = df['Sentiment'].value_counts()
print('Class distribution after exclusion:')
print(f'  Positive (1): {counts[1]:>7,}  ({counts[1]/len(df)*100:.1f}%)')
print(f'  Negative (0): {counts[0]:>7,}  ({counts[0]/len(df)*100:.1f}%)')
print(f'  Total       : {len(df):>7,}')

fig, ax = plt.subplots(figsize=(5, 3.5))
ax.barh(['Negative (0)','Positive (1)'], [counts[0], counts[1]],
        color=['#e74c3c','#2ecc71'], edgecolor='white')
for i, v in enumerate([counts[0], counts[1]]):
    ax.text(v + 2000, i, f'{v:,} ({v/len(df)*100:.1f}%)', va='center')
ax.set_xlabel('Count')
ax.set_title('Binary Label Distribution')
plt.tight_layout()
plt.show()

---
## 3. Text Preprocessing <a id='3'></a>

### 3.1 Why We Keep Negation Stop Words

Standard NLP pipelines routinely remove **stop words** (high-frequency, low-information words like "the", "a", "is"). However, **sentiment analysis is a notable exception**.

Negation words such as *not*, *won't*, *can't*, *never*, *didn't* are typically included in stop-word lists, yet they are semantically critical:

| Original phrase | After stop-word removal | Meaning preserved? |
|---|---|---|
| "not great" | "great" | ❌ Inverted! |
| "won't buy again" | "buy" | ❌ Lost negative signal |
| "can't recommend" | "recommend" | ❌ Inverted! |
| "absolutely delicious" | "absolutely delicious" | ✅ OK |

**Conclusion**: We apply a *custom* stop-word list from which **all negation words and negated contractions have been removed**, so those tokens stay in the text. The remaining stop words (articles, prepositions, pronouns) are safe to drop.

### 3.2 Stemming

Stemming reduces inflected forms to a common root using the Snowball algorithm:

$$
\text{"tasty"} \rightarrow \text{"tasti"}, \quad \text{"disappointment"} \rightarrow \text{"disappoint"}, \quad \text{"loved"} \rightarrow \text{"love"}
$$

This reduces vocabulary size and helps the model generalise across morphological variants, without losing polarity.

In [None]:
# --- Build custom stop-word list that PRESERVES negation ---
all_stop = set(stopwords.words('english'))

# Negation / contraction words to KEEP
negation_words = {
    'not', 'no', 'nor', 'never', 'neither', 'nobody', 'nothing', 'nowhere',
    "don't", "doesn't", "didn't", "won't", "wouldn't",
    "can't", "cannot", "couldn't", "shouldn't",
    "isn't", "aren't", "wasn't", "weren't",
    "haven't", "hasn't", "hadn't", "ain't",
    "mightn't", "mustn't", "needn't"
}

custom_stop = all_stop - negation_words  # remove negation from stop-list

print(f'Original stop-word list : {len(all_stop)} words')
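# Count only the negation words that were actually in the NLTK list,
# so the three printed numbers add up.
print(f'Negation words preserved: {len(all_stop & negation_words)}')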
print(f'Final stop-word list    : {len(custom_stop)} words')
print(f'\nSample negation words kept: {sorted(list(negation_words))[:8]}')

In [None]:
stemmer = SnowballStemmer('english')

def clean_text(text: str) -> str:
    """Lowercase → remove punctuation → remove custom stop words → stem."""
    text = str(text).lower()
    # Replace punctuation with space (keep apostrophes for contractions)
    text = re.sub(r'[?!.,;:)(|/]', ' ', text)
    tokens = text.split()
    cleaned = [
        stemmer.stem(tok)
        for tok in tokens
        if tok not in custom_stop
    ]
    # Remove leftover special chars
    out = ' '.join(cleaned)
    out = re.sub(r"[\"#]", '', out)
    return out

# Quick sanity check
test_phrases = [
    "Not as advertised — absolutely terrible!",
    "I won't buy this again. Disappointing.",
    "Perfect! Highly recommend to everyone.",
]
for p in test_phrases:
    print(f'  Original : {p}')
    print(f'  Cleaned  : {clean_text(p)}')
    print()

In [None]:
# Apply to the Text column (full review body)
df = df[['Score', 'Sentiment', 'Summary', 'Text']].copy()
df['CleanText'] = df['Text'].progress_apply(clean_text)
print('Cleaning complete.')
df[['Text','CleanText']].head(3)

### 3.3 Word Cloud Visualisation

Word clouds give an intuitive view of the most frequent tokens in each sentiment class. They help validate that preprocessing preserved meaningful content.

In [None]:
def make_wordcloud(text_series, title, ax, colormap='Greens'):
    wc = WordCloud(
        background_color='white',
        max_words=150,
        max_font_size=60,
        colormap=colormap,
        width=600, height=350,
        random_state=42
    ).generate(' '.join(text_series.dropna().values))
    ax.imshow(wc, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(title, fontsize=13, pad=10)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
make_wordcloud(df.loc[df.Sentiment==1, 'CleanText'], 'Positive Reviews (Score > 3)', axes[0], 'Greens')
make_wordcloud(df.loc[df.Sentiment==0, 'CleanText'], 'Negative Reviews (Score < 3)', axes[1], 'Reds')
plt.suptitle('Word Clouds after Preprocessing', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
## 4. Feature Engineering — TF-IDF <a id='4'></a>

### 4.1 Term Frequency–Inverse Document Frequency

TF-IDF is a numerical statistic that reflects how important a word is to a document relative to the corpus. For a term $t$ in document $d$ within corpus $D$:

$$
\text{TF}(t,d) = \frac{\text{count}(t,d)}{\sum_{t'} \text{count}(t',d)}
$$

$$
\text{IDF}(t,D) = \log\!\left(\frac{|D| + 1}{|\{d \in D : t \in d\}| + 1}\right) + 1
$$

$$
\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)
$$

- **High TF, high IDF** → word is frequent in this document but rare corpus-wide → very distinctive, high weight.
- **High TF, low IDF** → word is frequent everywhere (e.g., "the") → low weight.
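
As a rough illustration of the three formulas above, the sketch below computes TF, IDF and TF-IDF by hand on a tiny, hypothetical tokenised corpus (the corpus and the helper names `tf`, `idf`, `tfidf` are assumptions made for this example). Note that scikit-learn's `TfidfVectorizer` uses raw counts (or `1 + log(tf)` with `sublinear_tf=True`) and L2-normalises each row, so its numbers will differ from this textbook form.

```python
import math

# Hypothetical, already-cleaned toy corpus (one token list per review).
corpus = [
    ["not", "great", "tast"],
    ["great", "coffe", "great", "price"],
    ["not", "recommend"],
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Smoothed inverse document frequency, matching the IDF formula above."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log((len(docs) + 1) / (doc_freq + 1)) + 1

def tfidf(term, doc, docs):
    """Product of the two quantities for one term in one document."""
    return tf(term, doc) * idf(term, docs)

for term in ["great", "coffe"]:
    print(f"{term:>6}: tf={tf(term, corpus[1]):.2f}  "
          f"idf={idf(term, corpus):.4f}  tfidf={tfidf(term, corpus[1], corpus):.4f}")
```

In this toy corpus "coffe" appears in fewer documents than "great", so it receives the larger IDF even though "great" has the larger TF in the second review.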

### 4.2 N-gram Range

We use **unigrams + bigrams** (`ngram_range=(1,2)`), capturing both individual words and two-word phrases:

| Unigrams | Bigrams |
|---|---|
| "not" | "not great" |
| "recommend" | "highly recommend" |
| "disappoint" | "won't buy" |

This is important because bigrams preserve negation context that unigrams lose.
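
One detail worth checking is how contractions are tokenised before n-grams are built. The sketch below applies the default `token_pattern` documented for scikit-learn's text vectorizers using Python's `re` module, next to an alternative pattern that allows apostrophes inside a token. `APOSTROPHE_PATTERN` and the `bigrams` helper are assumptions for illustration and are not part of the original pipeline.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative that keeps apostrophes inside a token.
APOSTROPHE_PATTERN = r"(?u)\b\w[\w']+\b"

def bigrams(tokens):
    """Join each pair of adjacent tokens into a two-word phrase."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

text = "won't buy can't recommend"

default_tokens = re.findall(DEFAULT_PATTERN, text)
kept_tokens = re.findall(APOSTROPHE_PATTERN, text)

print(default_tokens)        # ['won', 'buy', 'can', 'recommend']
print(kept_tokens)           # ["won't", 'buy', "can't", 'recommend']
print(bigrams(kept_tokens))
```

Under the default pattern, "can't recommend" would reach the model as "can recommend". Passing a pattern like the second one through the vectorizer's `token_pattern` argument is one possible way to keep the negation intact.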

In [None]:
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=10_000,
    min_df=5,           # ignore very rare terms
    sublinear_tf=True   # replace tf with 1 + log(tf) to dampen extremely frequent terms
)

X = vectorizer.fit_transform(df['CleanText'])
y = df['Sentiment'].values

print(f'TF-IDF matrix shape : {X.shape}')
print(f'Matrix density      : {X.nnz / (X.shape[0]*X.shape[1]):.4%}')
print(f'Vocabulary size     : {len(vectorizer.vocabulary_):,}')
print(f'Sample features     : {list(vectorizer.get_feature_names_out()[:6])} ... {list(vectorizer.get_feature_names_out()[-6:])}')

---
## 5. Train / Test Split <a id='5'></a>

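**Critical rule**: The test set is held out immediately after the split. Every step that is fitted to data (SVD, SMOTE, undersampling, cross-validated tuning) uses **only the training partition** to prevent information leakage. One caveat: the TF-IDF vectorizer in Section 4 is fitted on the full corpus before the split, so its vocabulary and IDF weights have already seen the test reviews. A strictly leak-free pipeline would split the cleaned text first and fit the vectorizer on the training portion only.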

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(f'Training set : {X_train.shape[0]:,} samples  '
      f'(pos={y_train.sum():,}, neg={(y_train==0).sum():,})')
print(f'Test set     : {X_test.shape[0]:,} samples  '
      f'(pos={y_test.sum():,}, neg={(y_test==0).sum():,})')

---
## 6. Baseline Model — Logistic Regression <a id='6'></a>

### Why Baseline First?

A baseline model trains with **no special handling** of class imbalance, no regularisation tuning, and no dimensionality reduction. It establishes the reference point against which every improvement is measured.

### Logistic Regression Review

For a binary target $y \in \{0,1\}$, Logistic Regression models the conditional probability:

$$
P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
$$

Parameters $\mathbf{w}$ and $b$ are estimated by maximising the log-likelihood (equivalently, minimising cross-entropy loss):

$$
\mathcal{L}(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]
$$

The prediction threshold is $\hat{y} = 1$ if $\hat{p} \geq 0.5$, else $\hat{y} = 0$.
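
The two formulas above can be sketched in a few lines of plain Python. The weights, bias and feature values below are made up for illustration and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) for one feature vector x under weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def cross_entropy(y_true, p_hat, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical two-feature model and two documents.
w, b = [1.5, -2.0], 0.1
X_toy = [[1.0, 0.0], [0.0, 1.0]]
y_toy = [1, 0]

p_toy = [predict_proba(w, b, x) for x in X_toy]
y_hat = [1 if p >= 0.5 else 0 for p in p_toy]
print(p_toy, y_hat, cross_entropy(y_toy, p_toy))
```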

In [None]:
lr_baseline = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')
lr_baseline.fit(X_train, y_train)

y_pred_base = lr_baseline.predict(X_test)
y_prob_base = lr_baseline.predict_proba(X_test)[:, 1]

print('=== Baseline Logistic Regression ===')
print(classification_report(y_test, y_pred_base, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_base):.4f}')

In [None]:
def plot_evaluation(y_true, y_prob, y_pred, title, ax_cm, ax_roc):
    """Plot confusion matrix and ROC curve side by side."""
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax_cm,
                xticklabels=['Pred Neg','Pred Pos'],
                yticklabels=['True Neg','True Pos'])
    neg_recall = cm[0,0] / cm[0].sum()
    pos_recall = cm[1,1] / cm[1].sum()
    ax_cm.set_title(f'{title}\nNeg Recall={neg_recall:.3f} | Pos Recall={pos_recall:.3f}')

    # ROC curve
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)
    ax_roc.plot(fpr, tpr, lw=2, label=f'AUC = {roc_auc:.4f}')
    ax_roc.plot([0,1],[0,1],'k--', lw=1)
    ax_roc.set_xlabel('False Positive Rate')
    ax_roc.set_ylabel('True Positive Rate')
    ax_roc.set_title(f'ROC Curve — {title}')
    ax_roc.legend(loc='lower right')
    ax_roc.grid(alpha=0.3)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_evaluation(y_test, y_prob_base, y_pred_base, 'Baseline LR', axes[0], axes[1])
plt.tight_layout()
plt.show()

print('Observation: Baseline achieves ~90% accuracy, but negative-class recall')
print('is only ~0.69 — 31% of dissatisfied customers are MISSED.')
print('This is unacceptable for the business use case.')

---
## 7. Class Imbalance: Problem & Solutions <a id='7'></a>

With **78% positive / 22% negative**, a naive model can achieve 78% accuracy by predicting "positive" for everything. We need strategies to improve minority-class detection.

### Business Rationale for Prioritising Recall

| Error Type | Business Impact |
|---|---|
| False Negative (miss a real negative) | Customer complaint unaddressed → churn, reputation damage |
| False Positive (flag a real positive as negative) | Unnecessary follow-up → minor cost |

Conclusion: **Maximise negative-class Recall** (subject to a minimum Precision constraint).

### Evaluation Metrics

$$
\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}
$$

$$
F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

$$
\text{AUC-ROC} = \int_0^1 \text{TPR}\,d(\text{FPR})
$$

All metrics reported are for the **negative class (label = 0)** unless otherwise stated.
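
To make the negative-class convention concrete, the sketch below derives the three metrics from a 2×2 confusion matrix laid out as in `plot_evaluation` (row 0 = true negative class, column 0 = predicted negative). The counts are hypothetical.

```python
def negative_class_metrics(cm):
    """Precision, recall and F1 with the NEGATIVE class (label 0) as the class of interest.

    cm[i][j] = number of samples with true label i and predicted label j.
    """
    tp = cm[0][0]  # true negatives in the usual sense: negatives correctly flagged
    fn = cm[0][1]  # negatives predicted as positive (missed complaints)
    fp = cm[1][0]  # positives predicted as negative (false alarms)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 100 truly negative and 400 truly positive reviews.
cm_toy = [[70, 30],
          [10, 390]]

prec_neg, rec_neg, f1_neg = negative_class_metrics(cm_toy)
print(f"precision={prec_neg:.3f}  recall={rec_neg:.3f}  f1={f1_neg:.3f}")
```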

### 7.1 Strategy 1 — Class Weights

The most computationally efficient approach: the loss function is re-weighted inversely proportional to class frequency.

$$
w_c = \frac{N}{K \cdot N_c}
$$

where $N$ = total samples, $K$ = number of classes, $N_c$ = samples in class $c$.

- **Advantages**: No data modification; no information loss; no extra computation.
- **Disadvantages**: Only adjusts the decision boundary — does not augment or remove data.
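
As a quick check of the weighting formula, the sketch below evaluates $w_c = N / (K \cdot N_c)$ for hypothetical class counts that mirror a 78/22 split. The function name `balanced_class_weights` is an assumption for this example.

```python
def balanced_class_weights(class_counts):
    """Return w_c = N / (K * N_c) for each class c."""
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_total / (n_classes * n_c) for c, n_c in class_counts.items()}

# Hypothetical counts per 100 reviews: label 0 = negative, label 1 = positive.
counts_toy = {0: 22, 1: 78}
weights_toy = balanced_class_weights(counts_toy)
print(weights_toy)

# After weighting, each class contributes the same total weight to the loss.
print({c: weights_toy[c] * counts_toy[c] for c in counts_toy})
```

Each minority sample counts for roughly 3.5 times as much as a majority sample in this example, which is why the decision boundary shifts toward catching more negatives.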

In [None]:
lr_cw = LogisticRegression(max_iter=1000, random_state=42,
                            class_weight='balanced', solver='lbfgs')
lr_cw.fit(X_train, y_train)

y_pred_cw = lr_cw.predict(X_test)
y_prob_cw = lr_cw.predict_proba(X_test)[:, 1]

print('=== LR + Class Weights ===')
print(classification_report(y_test, y_pred_cw, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_cw):.4f}')

### 7.2 Strategy 2 — SMOTE (with SVD Preprocessing)

**SMOTE** (Synthetic Minority Over-sampling Technique) generates synthetic minority-class samples by linear interpolation between existing samples in feature space:

$$
\mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda \cdot (\mathbf{x}_{\text{nn}} - \mathbf{x}_i), \quad \lambda \sim \text{Uniform}(0,1)
$$

where $\mathbf{x}_{\text{nn}}$ is one of the $k$ nearest neighbours of $\mathbf{x}_i$ in feature space.
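
The interpolation step can be sketched as follows. This is only the point-generation formula. The neighbour is supplied by hand here, whereas real SMOTE selects it from the $k$ nearest minority-class neighbours. The vectors and the helper name `smote_point` are assumptions for illustration.

```python
import random

def smote_point(x_i, x_nn, rng):
    """Return a synthetic sample on the segment between x_i and its neighbour x_nn."""
    lam = rng.random()  # lambda drawn from Uniform(0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]

rng = random.Random(42)
x_i = [0.0, 1.0]    # hypothetical minority-class sample
x_nn = [1.0, 3.0]   # hypothetical minority-class neighbour
x_new = smote_point(x_i, x_nn, rng)
print(x_new)
```

Every synthetic point lies on the straight line between two real minority samples, so SMOTE fills in the minority region rather than duplicating existing rows.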

**Why SVD is needed here (and only here)**:  
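SMOTE finds nearest neighbours and interpolates between them, which is slow and memory-hungry on the sparse 10,000-dimensional TF-IDF matrix, so here it is run on a compact **dense** representation instead. (imbalanced-learn's `SMOTE` does accept sparse input; the dense projection is a practical choice rather than a hard requirement.) We apply Truncated SVD to obtain that representation, **but only as a preprocessing step for SMOTE, not as a primary feature-extraction strategy**.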

**Critical limitation of SVD in this context**:  
As shown in the plot below, 1,000 SVD components explain < 70% of the total variance. This is far below the typical 90%+ threshold used to justify dimensionality reduction. The text features are distributed across many dimensions and do not compress well. In contrast to tabular data, NLP TF-IDF matrices often require thousands of components to preserve most information.

In [None]:
# --- SVD Explained Variance Analysis ---
# Fit SVD on training data only
svd = TruncatedSVD(n_components=1000, random_state=42)
svd.fit(X_train)

cumvar = np.cumsum(svd.explained_variance_ratio_)
components = np.arange(1, 1001)

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Full curve
axes[0].plot(components, cumvar, color='steelblue', lw=1.5)
axes[0].axhline(0.90, color='red', ls='--', lw=1.2, label='90% threshold')
axes[0].axhline(0.70, color='orange', ls='--', lw=1.2, label='70% threshold')
axes[0].fill_between(components, cumvar, alpha=0.15, color='steelblue')
axes[0].set_xlabel('Number of Components')
axes[0].set_ylabel('Cumulative Explained Variance')
axes[0].set_title('Truncated SVD — Explained Variance (1–1000 components)')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].set_ylim(0, 1.02)

# Marginal gain
axes[1].plot(components, svd.explained_variance_ratio_, color='darkorange', lw=1.2)
axes[1].set_xlabel('Component index')
axes[1].set_ylabel('Individual Explained Variance')
axes[1].set_title('Marginal Explained Variance per Component')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

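# Find how many components are needed for 90% variance.
# searchsorted returns a 0-based index, so add 1 to get a component count.
reached_90 = cumvar[-1] >= 0.90
n_comp_90 = int(np.searchsorted(cumvar, 0.90)) + 1
print(f'Variance explained by 1000 components : {cumvar[-1]:.2%}')
print(f'Components needed for 90% variance    : {n_comp_90 if reached_90 else ">1000"}')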
print()
print('Interpretation: SVD is INEFFECTIVE as primary dimensionality reduction')
print('for TF-IDF NLP features. Use it only when dense input is required (e.g. SMOTE).')

In [None]:
# Apply SVD transformation (dense input needed for SMOTE)
X_train_svd = svd.transform(X_train)   # shape (N_train, 1000)
X_test_svd  = svd.transform(X_test)    # shape (N_test,  1000)

# Apply SMOTE to the SVD-reduced training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_svd, y_train)
print(f'After SMOTE: {X_train_smote.shape[0]:,} training samples '
      f'(pos={y_train_smote.sum():,}, neg={(y_train_smote==0).sum():,})')

lr_smote = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')
lr_smote.fit(X_train_smote, y_train_smote)

y_pred_smote = lr_smote.predict(X_test_svd)
y_prob_smote = lr_smote.predict_proba(X_test_svd)[:, 1]

print('\n=== LR + SMOTE (SVD-reduced features) ===')
print(classification_report(y_test, y_pred_smote, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_smote):.4f}')

### 7.3 Strategy 3 — Random Undersampling

Random undersampling discards majority-class training samples at random until both classes are the same size.

$$
N'_{\text{majority}} = N_{\text{minority}}
$$

- **Advantages**: Fast; eliminates imbalance entirely; works directly on the sparse TF-IDF matrix, so no SVD step is needed.
- **Disadvantages**: Discards potentially useful majority-class information (information loss).
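
A minimal sketch of the idea, using only the standard library: keep every minority sample and an equally sized random subset of the majority. The labels and the helper name `random_undersample` are hypothetical, and this is not how `RandomUnderSampler` is implemented internally.

```python
import random

def random_undersample(labels, rng):
    """Return sorted row indices with every class cut down to the minority-class size."""
    idx_by_class = {}
    for i, y in enumerate(labels):
        idx_by_class.setdefault(y, []).append(i)
    n_min = min(len(idxs) for idxs in idx_by_class.values())
    kept = []
    for idxs in idx_by_class.values():
        kept.extend(rng.sample(idxs, n_min))  # sampling without replacement
    return sorted(kept)

labels_toy = [1] * 8 + [0] * 2          # hypothetical 80/20 imbalance
kept_idx = random_undersample(labels_toy, random.Random(42))
kept_labels = [labels_toy[i] for i in kept_idx]
print(kept_idx, kept_labels)
```

Six of the eight positive rows are thrown away in this toy case, which illustrates the information loss noted above.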

In [None]:
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f'After undersampling: {X_rus.shape[0]:,} training samples '
      f'(pos={y_rus.sum():,}, neg={(y_rus==0).sum():,})')

lr_rus = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')
lr_rus.fit(X_rus, y_rus)

y_pred_rus = lr_rus.predict(X_test)
y_prob_rus = lr_rus.predict_proba(X_test)[:, 1]

print('\n=== LR + Random Undersampling ===')
print(classification_report(y_test, y_pred_rus, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_rus):.4f}')

In [None]:
# Summary table of class-imbalance strategies for LR
results_imbalance = {
    'Baseline (no balancing)':   (y_test, y_pred_base, y_prob_base),
    'Class Weights':             (y_test, y_pred_cw,   y_prob_cw),
    'SMOTE + SVD':               (y_test, y_pred_smote, y_prob_smote),
    'Random Undersampling':      (y_test, y_pred_rus,  y_prob_rus),
}

rows = []
for name, (yt, yp, ypr) in results_imbalance.items():
    rows.append({
        'Strategy': name,
        'Accuracy': f"{(yt==yp).mean():.4f}",
        'Neg Recall': f"{recall_score(yt, yp, pos_label=0):.4f}",
        'Neg Precision': f"{precision_score(yt, yp, pos_label=0):.4f}",
        'Neg F1': f"{f1_score(yt, yp, pos_label=0):.4f}",
        'AUC': f"{roc_auc_score(yt, ypr):.4f}",
    })

df_imb = pd.DataFrame(rows)
print(df_imb.to_string(index=False))
print('\nTakeaway: Class Weights best balances recall improvement and precision retention.')

---
## 8. Regularisation & Hyperparameter Tuning <a id='8'></a>

### ElasticNet Regularisation

L2 (Ridge) regularisation shrinks coefficients toward zero but keeps all features. L1 (Lasso) drives some coefficients to exactly zero (sparse solutions). **ElasticNet** interpolates between them:

$$
\mathcal{L}_{\text{EN}}(\mathbf{w}) = \mathcal{L}_{\text{CE}}(\mathbf{w}) + \lambda \Big[\alpha \|\mathbf{w}\|_1 + (1-\alpha)\|\mathbf{w}\|_2^2 \Big]
$$

where $\alpha$ (the `l1_ratio` parameter) controls the mix. We tune $\alpha$ via **3-fold cross-validation**, optimising for negative-class Recall.
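
The penalty term in the formula above can be evaluated directly. The sketch below uses a made-up weight vector to show how $\alpha$ moves the penalty between its L2 and L1 extremes. Note that scikit-learn expresses the overall strength through `C`, the inverse of the regularisation strength, rather than through $\lambda$.

```python
def elastic_net_penalty(w, lam, alpha):
    """lam * [alpha * ||w||_1 + (1 - alpha) * ||w||_2^2], as in the formula above."""
    l1 = sum(abs(wi) for wi in w)
    l2_squared = sum(wi * wi for wi in w)
    return lam * (alpha * l1 + (1 - alpha) * l2_squared)

w_toy = [0.5, -2.0, 0.0]   # hypothetical coefficients
for alpha in [0.0, 0.5, 1.0]:
    print(f"alpha={alpha:.1f}  penalty={elastic_net_penalty(w_toy, lam=1.0, alpha=alpha):.3f}")
```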

In [None]:
from sklearn.metrics import make_scorer

recall_neg = make_scorer(recall_score, pos_label=0)

param_grid = {'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]}
base_lr = LogisticRegression(
    penalty='elasticnet', solver='saga',
    class_weight='balanced',
    max_iter=1000, random_state=42
)

grid = GridSearchCV(
    base_lr, param_grid,
    scoring=recall_neg, cv=3, n_jobs=-1, verbose=1,
    return_train_score=True
)
grid.fit(X_train, y_train)

print(f'Best l1_ratio : {grid.best_params_["l1_ratio"]}')
print(f'Best CV Recall (neg): {grid.best_score_:.4f}')

# Plot CV results
cv_results = pd.DataFrame(grid.cv_results_)
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(param_grid['l1_ratio'], cv_results['mean_test_score'], 'o-', color='steelblue', lw=2)
ax.fill_between(param_grid['l1_ratio'],
                cv_results['mean_test_score'] - cv_results['std_test_score'],
                cv_results['mean_test_score'] + cv_results['std_test_score'],
                alpha=0.2, color='steelblue')
ax.set_xlabel('l1_ratio (0 = pure L2, 1 = pure L1)')
ax.set_ylabel('Mean CV Recall (negative class)')
ax.set_title('GridSearchCV — ElasticNet l1_ratio Tuning')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
best_lr = grid.best_estimator_
y_pred_best_lr = best_lr.predict(X_test)
y_prob_best_lr = best_lr.predict_proba(X_test)[:, 1]

print('=== Best LR: Balanced + ElasticNet ===')
print(classification_report(y_test, y_pred_best_lr, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_best_lr):.4f}')

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_evaluation(y_test, y_prob_best_lr, y_pred_best_lr,
                'Best LR (Balanced + ElasticNet)', axes[0], axes[1])
plt.tight_layout()
plt.show()

---
## 9. Threshold Tuning — Balancing Precision & Recall <a id='9'></a>

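Throughout, $\hat{p}$ is the predicted probability of the **positive** class, and the default rule with $\tau = 0.5$ predicts negative when $\hat{p} < \tau$. That cut-off is arbitrary. Raising $\tau$ labels more reviews as negative, which increases negative-class recall (more dissatisfied customers are caught) at the cost of precision (more false alarms); lowering $\tau$ does the reverse. We can therefore choose a threshold that **balances both** according to business requirements.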

### Precision–Recall Curve

For a continuous range of thresholds $\tau \in [0,1]$:

$$
\text{Precision}(\tau), \quad \text{Recall}(\tau), \quad \text{F}_\beta(\tau) = (1+\beta^2)\cdot\frac{\text{Prec} \cdot \text{Rec}}{\beta^2 \cdot \text{Prec} + \text{Rec}}
$$

We plot the PR curve and identify the **threshold that maximises F1** for the negative class, as well as the threshold where precision $\geq 0.72$ (a business floor).

**This addresses the TA feedback**: instead of blindly lowering the threshold to maximise recall at the expense of precision, we find a principled trade-off point.
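
The trade-off can be seen on a handful of made-up scores. The sketch below assumes the convention that $\hat{p}$ is the predicted probability of the positive class and that a review is labelled negative when $\hat{p} < \tau$. The data and helper names are hypothetical.

```python
def negative_pr_at(y_true, p_pos, tau):
    """Precision and recall for the negative class (label 0) at threshold tau."""
    pred_neg = [p < tau for p in p_pos]
    tp = sum(1 for y, n in zip(y_true, pred_neg) if y == 0 and n)
    fp = sum(1 for y, n in zip(y_true, pred_neg) if y == 1 and n)
    fn = sum(1 for y, n in zip(y_true, pred_neg) if y == 0 and not n)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

# Hypothetical labels (0 = negative) and predicted P(positive).
y_toy = [0, 0, 0, 1, 1, 1, 1, 1]
p_toy = [0.10, 0.40, 0.60, 0.30, 0.55, 0.70, 0.80, 0.90]

for tau in [0.20, 0.50, 0.65]:
    prec, rec = negative_pr_at(y_toy, p_toy, tau)
    print(f"tau={tau:.2f}  precision={prec:.3f}  recall={rec:.3f}  "
          f"F1={f_beta(prec, rec):.3f}  F2={f_beta(prec, rec, beta=2):.3f}")
```

As $\tau$ rises, recall for the negative class climbs toward 1 while precision falls, and the F-score picks out a compromise between the two.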

In [None]:
# Precision-Recall curve for the NEGATIVE class.
# precision_recall_curve expects scores for the class named by pos_label,
# so we pass P(negative) = 1 - P(positive) and set pos_label=0.
precisions, recalls, thresholds = precision_recall_curve(
    y_test, 1 - y_prob_best_lr,
    pos_label=0
)
# precisions and recalls have one more element than thresholds
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-8)

# Find best F1 threshold
best_f1_idx = np.argmax(f1_scores)
best_threshold = 1 - thresholds[best_f1_idx]   # convert back to original prob space
print(f'Threshold for max negative-class F1    : {best_threshold:.3f}')
print(f'  Precision @ this threshold: {precisions[best_f1_idx]:.4f}')
print(f'  Recall    @ this threshold: {recalls[best_f1_idx]:.4f}')
print(f'  F1        @ this threshold: {f1_scores[best_f1_idx]:.4f}')

# Find threshold where precision >= 0.72
prec_floor = 0.72
valid = np.where(precisions[:-1] >= prec_floor)[0]
if len(valid) > 0:
    prec_thresh_idx = valid[np.argmax(recalls[:-1][valid])]
    prec_thresh = 1 - thresholds[prec_thresh_idx]
    print(f'\nThreshold for Precision ≥ {prec_floor} (max recall): {prec_thresh:.3f}')
    print(f'  Precision @ this threshold: {precisions[prec_thresh_idx]:.4f}')
    print(f'  Recall    @ this threshold: {recalls[prec_thresh_idx]:.4f}')

# Plot
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# PR Curve
axes[0].plot(recalls[:-1], precisions[:-1], color='steelblue', lw=1.8, label='PR curve')
axes[0].scatter(recalls[best_f1_idx], precisions[best_f1_idx],
                color='red', zorder=5, s=80, label=f'Max F1 (τ={best_threshold:.2f})')
if len(valid) > 0:
    axes[0].scatter(recalls[prec_thresh_idx], precisions[prec_thresh_idx],
                    color='green', zorder=5, s=80, marker='s',
                    label=f'Prec≥0.72 (τ={prec_thresh:.2f})')
axes[0].axhline(prec_floor, color='green', ls='--', lw=1, alpha=0.7)
axes[0].set_xlabel('Recall (Negative class)')
axes[0].set_ylabel('Precision (Negative class)')
axes[0].set_title('Precision–Recall Curve (Negative class)')
axes[0].legend()
axes[0].grid(alpha=0.3)

# F1 vs threshold
axes[1].plot(1 - thresholds, f1_scores, color='darkorange', lw=1.8)
axes[1].axvline(best_threshold, color='red', ls='--', lw=1.2, label=f'Max F1 at τ={best_threshold:.2f}')
axes[1].set_xlabel('Classification Threshold (for negative class)')
axes[1].set_ylabel('F1 Score (Negative class)')
axes[1].set_title('F1 Score vs Threshold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Apply custom threshold (max F1).
# best_threshold is in P(positive) space: p <= threshold -> negative (0), else positive (1).
y_pred_tuned = (y_prob_best_lr > best_threshold).astype(int)
print(f'=== Best LR + Threshold Tuning (τ={best_threshold:.3f}) ===')
print(classification_report(y_test, y_pred_tuned, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_best_lr):.4f}  (unchanged — AUC is threshold-independent)')
print()
print('Threshold tuning improves the Precision–Recall balance')
print('without retraining. This is the recommended approach when')
print('the deployment requirements change.')

---
## 10. Random Forest Classifier <a id='10'></a>

Random Forest is a **bagging ensemble** of decision trees. Each tree is trained on a bootstrap sample, and at each split, only a random subset of $m \approx \sqrt{p}$ features is considered:

$$
\hat{p}(c \mid \mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T} \hat{p}_t(c \mid \mathbf{x})
$$

Advantages over Logistic Regression:
- Captures non-linear feature interactions
- Inherently resistant to overfitting via bagging
- Provides feature importance via mean decrease in impurity

Disadvantages:
- Slow and memory-hungry on the sparse 10,000-feature TF-IDF matrix, so we train it on the SVD-reduced features instead
- Less interpretable coefficients
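
The averaging formula above amounts to a per-class mean over the trees' probability estimates. A minimal sketch with three hypothetical trees (the numbers are invented for illustration):

```python
def forest_proba(tree_probas):
    """Average per-tree class-probability vectors into one forest estimate."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    return [sum(t[c] for t in tree_probas) / n_trees for c in range(n_classes)]

# Each inner list is one tree's [P(negative), P(positive)] for the same review.
trees_toy = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
]
avg_toy = forest_proba(trees_toy)
print(avg_toy)
```

Although the third tree votes positive on its own, the averaged estimate still favours the negative class, which is how bagging smooths out individual trees' errors.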

In [None]:
# RF trained on SVD-reduced features (dense; avoids memory issues)
# We use the best params found previously via GridSearchCV to save time
best_rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    max_features='sqrt',
    min_samples_split=5,
    min_samples_leaf=5,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
best_rf.fit(X_train_svd, y_train)

y_pred_rf = best_rf.predict(X_test_svd)
y_prob_rf  = best_rf.predict_proba(X_test_svd)[:, 1]

print('=== Random Forest (Balanced, SVD-reduced features) ===')
print(classification_report(y_test, y_pred_rf, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_rf):.4f}')

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_evaluation(y_test, y_prob_rf, y_pred_rf, 'Random Forest (Balanced+SVD)', axes[0], axes[1])
plt.tight_layout()
plt.show()

In [None]:
# Feature importance from the Random Forest.
# The forest was trained on SVD components, and each component mixes many TF-IDF terms,
# so these importances rank components rather than individual words.
importances_rf = best_rf.feature_importances_
top_k = 30
top_idx = np.argsort(importances_rf)[::-1][:top_k]

fig, ax = plt.subplots(figsize=(9, 7))
ax.barh(range(top_k), importances_rf[top_idx][::-1], color='steelblue', edgecolor='none')
ax.set_yticks(range(top_k))
ax.set_yticklabels([f'SVD Component {i+1}' for i in top_idx[::-1]], fontsize=8)
ax.set_xlabel('Feature Importance (Mean Decrease in Impurity)')
ax.set_title(f'Top {top_k} Random Forest Feature Importances (SVD components)')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
print('Note: RF importances are over SVD components, not original words.')
print('For word-level importance, see Section 14 (Logistic Regression coefficients).')

---
## 11. Beyond Bag-of-Words: Word Embeddings <a id='11'></a>

### 11.1 Theory and Motivation

**TF-IDF** treats each word as an independent feature with no notion of semantic similarity. The words "terrible" and "awful" have orthogonal TF-IDF representations despite being near-synonyms. Bigrams capture some local context, but meaning is still derived from frequency patterns, not semantics.

**Word Embeddings** address this by mapping each word to a dense vector in a continuous semantic space, so that semantically similar words are close:

$$
\cos(\mathbf{e}_{\text{terrible}},\, \mathbf{e}_{\text{awful}}) \approx 1, \qquad \cos(\mathbf{e}_{\text{terrible}},\, \mathbf{e}_{\text{delicious}}) \approx 0
$$
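
The contrast between orthogonal sparse vectors and nearby dense vectors can be sketched in a few lines of plain Python. The 3-dimensional embeddings below are invented for illustration (real Word2Vec vectors here have 300 dimensions), and `cosine` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Bag-of-words view: each word owns its own dimension, so distinct words never overlap
onehot_terrible = [1, 0, 0]
onehot_awful = [0, 1, 0]

# Hypothetical dense embeddings (values made up for this sketch)
emb_terrible = [0.9, 0.1, -0.2]
emb_awful = [0.8, 0.2, -0.1]
emb_delicious = [-0.1, 0.9, 0.4]

sim_sparse = cosine(onehot_terrible, onehot_awful)
sim_synonyms = cosine(emb_terrible, emb_awful)
sim_unrelated = cosine(emb_terrible, emb_delicious)
print(sim_sparse, round(sim_synonyms, 3), round(sim_unrelated, 3))
```

With these toy values the sparse similarity is exactly zero, the near-synonyms score close to 1, and the unrelated pair scores close to 0.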

### Representation Methods Compared

| Method | Type | Semantic Similarity | Context-Aware | Domain |
|---|---|---|---|---|
| TF-IDF | Sparse, count-based | ❌ None | ❌ | Fit to corpus |
| Word2Vec (avg. pool) | Dense, static | ✅ Yes | ❌ Order-blind | Fit to corpus |
| DistilBERT (fine-tuned) | Dense, contextual | ✅ Yes | ✅ Full context | Pre-trained + adapted |

### Key Limitation of Average Pooling

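Word2Vec with average pooling still discards **word order**: any permutation of the same tokens yields the same document vector. It also dilutes negation. The phrases "not great" and "not terrible" receive similar document vectors whenever "great" and "terrible" have similar embeddings, which is common for antonyms because they occur in similar contexts. That is exactly the opposite of what we want, and it is the core weakness that contextual models overcome: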

$$
\mathbf{v}_d^{\text{W2V}} = \frac{1}{n}\sum_{i} \mathbf{e}_{w_i} \qquad \text{(order-blind)}
$$

$$
\mathbf{v}_d^{\text{BERT}} = \text{Transformer}(w_1, w_2, \ldots, w_n) \qquad \text{(full context)}
$$

The DistilBERT encoder processes the full sequence at once via **multi-head self-attention**:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

allowing "not" in "not great" to modulate the representation of "great" based on its position and context.
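
The attention formula can be illustrated with a single-head, pure-Python sketch. It rests on simplifying assumptions: the two token vectors are made up, and they are used directly as $Q$, $K$ and $V$, whereas a real transformer first applies learned linear projections and uses several heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy vectors for the tokens "not" and "great" (illustrative values only)
x = [[1.0, 0.0],   # "not"
     [0.0, 1.0]]   # "great"
out, weights = attention(x, x, x)
print(weights[1], out[1])
```

The output row for "great" is a weighted mix of both token vectors, so its representation now carries a contribution from "not".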

In [None]:
# ── Section 11 Preamble: package installs + text splits ─────────────────────
# Install packages not in base Colab/conda environments
import subprocess, sys
for pkg in ['gensim', 'transformers', 'accelerate']:
    try:
        __import__(pkg)
    except ImportError:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pkg, '-q'])

# ── Colab: mount Google Drive and (optionally) reload data ──────────────────
import os
ON_COLAB = 'google.colab' in sys.modules or os.path.exists('/content')
if ON_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    # Update this path if your Reviews.csv is stored elsewhere in Drive:
    COLAB_DATA = '/content/drive/MyDrive/amazon_reviews/Reviews.csv'
    if not os.path.exists(COLAB_DATA):
        print(f'WARNING: {COLAB_DATA} not found.')
        print('Please upload Reviews.csv to your Drive and update COLAB_DATA above.')
    else:
        print('Drive mounted. Run Sections 1-5 first to get df, X, y in memory.')

# ── Recover text arrays from the same train/test split ──────────────────────
# (X_train/X_test were split from X in Section 5; we need the original text)
import numpy as np
from sklearn.model_selection import train_test_split

clean_texts = df['CleanText'].values
raw_texts   = df['Text'].fillna('').values  # for BERT (raw, unprocessed)
y_all       = df['Sentiment'].values        # same as y used in Section 4-5

# Reproduce the exact same split (same random_state + stratify gives same indices)
_, _, texts_train, texts_test, raw_train, raw_test, _, _ = train_test_split(
    X, clean_texts, raw_texts, y_all,
    test_size=0.20, random_state=42, stratify=y_all
)

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device}')
print(f'texts_train: {len(texts_train):,}  texts_test: {len(texts_test):,}')
print(f'raw_train  : {len(raw_train):,}  raw_test  : {len(raw_test):,}')

# Stratified subsample for DistilBERT fine-tuning (Section 11.3)
from sklearn.model_selection import train_test_split as tts
raw_ft, _, y_ft, _ = tts(
    raw_train, y_train, train_size=50_000 + 5_000,
    random_state=42, stratify=y_train
)
raw_ft_train, raw_ft_val, y_ft_train, y_ft_val = tts(
    raw_ft, y_ft, test_size=5_000, random_state=42, stratify=y_ft
)
print(f'BERT fine-tune train: {len(y_ft_train):,}  val: {len(y_ft_val):,}')

### 11.2 Word2Vec — Static Dense Embeddings <a id='11-2'></a>

**Word2Vec** (Mikolov et al., 2013) learns embedding vectors by training a shallow neural network to predict surrounding context words (**Skip-gram**) or predict a target from context (**CBOW**):

$$
\text{Skip-gram objective:} \quad \mathcal{L} = -\sum_{t}\sum_{c \in \mathcal{C}(t)} \left[\log \sigma(\mathbf{e}_{w_t}^\top \mathbf{e}_{w_c}) + k\,\mathbb{E}_{w \sim P_n}\left[\log \sigma(-\mathbf{e}_{w_t}^\top \mathbf{e}_w)\right]\right]
$$

where the second term is negative sampling with $k$ noise words drawn from $P_n \propto f(w)^{3/4}$.
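
To make the noise distribution $P_n \propto f(w)^{3/4}$ concrete, here is a small sketch with invented word counts. `noise_distribution` is a hypothetical helper, not gensim's internal implementation; gensim's `Word2Vec` exposes the exponent as `ns_exponent` (default 0.75).

```python
def noise_distribution(word_counts, power=0.75):
    """Normalise count ** power into a probability distribution over words."""
    weights = {w: c ** power for w, c in word_counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}


# Made-up corpus counts for three stemmed tokens
toy_counts = {"the": 1000, "tasti": 100, "rancid": 10}

p_noise = noise_distribution(toy_counts)
n_tokens = sum(toy_counts.values())
p_unigram = {w: c / n_tokens for w, c in toy_counts.items()}

for word in toy_counts:
    print(f"{word:>7}: unigram={p_unigram[word]:.4f}  noise={p_noise[word]:.4f}")
```

Raising counts to the 3/4 power flattens the distribution: rare words are drawn as negatives more often than their raw frequency suggests, and very frequent words less often.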

We train on the **training corpus only** (no test data leakage) with:
- Embedding dimension $d = 300$, window size $w = 5$
- Minimum count = 5 (ignores rare words)
- 5 training epochs with negative sampling ($k=5$)

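**Document representation**: average pooling over the embeddings of $V_d$, the tokens of document $d$ that appear in the Word2Vec vocabulary (a document with no known tokens maps to the zero vector):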

$$
\mathbf{v}_d = \frac{1}{|V_d|}\sum_{w \in V_d} \mathbf{e}_w \in \mathbb{R}^{300}
$$

This dense 300-dim vector replaces the 10,000-dim sparse TF-IDF vector. We then train LR and RF on top.
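
The order-blindness of this pooling is easy to demonstrate on a toy example. The 2-dimensional embeddings below are invented for illustration, and `average_pool` is a hypothetical pure-Python stand-in for the NumPy helper used in the notebook.

```python
def average_pool(tokens, embeddings, dim):
    """Mean of the embeddings of known tokens; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dim embeddings (values made up for this sketch)
toy_embeddings = {
    "good": [1.0, 0.2],
    "bad": [-1.0, 0.1],
    "not": [0.0, -0.9],
}

vec_a = average_pool("good not bad".split(), toy_embeddings, dim=2)
vec_b = average_pool("bad not good".split(), toy_embeddings, dim=2)
vec_oov = average_pool(["zzz"], toy_embeddings, dim=2)  # no known token
print(vec_a, vec_b, vec_oov)
```

The two reviews have opposite meanings but the same bag of tokens, so their pooled vectors agree up to floating-point rounding. A classifier trained on these vectors cannot tell them apart.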

In [None]:
%%time
import time
from gensim.models import Word2Vec

# Tokenise cleaned text into lists of tokens for Word2Vec
train_sentences = [t.split() for t in texts_train]
test_sentences  = [t.split() for t in texts_test]

# Train Skip-gram Word2Vec on the training corpus only
print('Training Word2Vec Skip-gram (d=300, window=5, min_count=5, epochs=5)...')
w2v_model = Word2Vec(
    sentences=train_sentences,
    vector_size=300,   # embedding dimension
    window=5,          # context window
    min_count=5,       # ignore rare words
    sg=1,              # Skip-gram (sg=0 = CBOW)
    workers=4,
    epochs=5,
    seed=42
)
print(f'Vocabulary: {len(w2v_model.wv):,} words  |  Embedding shape: {w2v_model.wv.vectors.shape}')

# Inspect semantic neighbourhood
print('\nMost similar to "delici" (stemmed "delicious"):',
      [w for w, _ in w2v_model.wv.most_similar('delici', topn=5)])
print('Most similar to "disappoint":',
      [w for w, _ in w2v_model.wv.most_similar('disappoint', topn=5)])
# Negation is in the vocabulary since we kept "not"
if 'not' in w2v_model.wv:
    print('Most similar to "not":',
          [w for w, _ in w2v_model.wv.most_similar('not', topn=5)])

In [None]:
def doc_to_vec(tokens, model, dim=300):
    """Average pooling over Word2Vec embeddings for all known tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0).astype(np.float32) if vecs else np.zeros(dim, np.float32)

print('Building document embedding vectors...')
t0 = time.time()
X_w2v_train = np.vstack([doc_to_vec(s, w2v_model) for s in train_sentences])
X_w2v_test  = np.vstack([doc_to_vec(s, w2v_model) for s in test_sentences])
print(f'Done in {time.time()-t0:.0f}s')
print(f'X_w2v_train: {X_w2v_train.shape}  (dense float32)')
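# Count documents with no in-vocabulary token (rows that are entirely zero).
# Checking any() avoids miscounting rows whose non-zero entries happen to sum to 0.
oov = int((~X_w2v_train.any(axis=1)).sum())
print(f'All-zero vectors (no vocab coverage): {oov:,} / {len(X_w2v_train):,}')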

In [None]:
%%time
# Logistic Regression on Word2Vec document embeddings
lr_w2v = LogisticRegression(
    max_iter=1000, random_state=42,
    class_weight='balanced', solver='lbfgs'
)
lr_w2v.fit(X_w2v_train, y_train)

y_pred_w2v_lr = lr_w2v.predict(X_w2v_test)
y_prob_w2v_lr = lr_w2v.predict_proba(X_w2v_test)[:, 1]

print('=== Word2Vec (avg. pool) + LR (Balanced) ===')
print(classification_report(y_test, y_pred_w2v_lr, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_w2v_lr):.4f}')

# Quick comparison with TF-IDF LR (already in memory)
print('\n--- Comparison with TF-IDF ---')
print(f'  TF-IDF + LR  :  NegF1={f1_score(y_test, y_pred_best_lr, pos_label=0):.4f}  '
      f'AUC={roc_auc_score(y_test, y_prob_best_lr):.4f}')
print(f'  Word2Vec + LR:  NegF1={f1_score(y_test, y_pred_w2v_lr, pos_label=0):.4f}  '
      f'AUC={roc_auc_score(y_test, y_prob_w2v_lr):.4f}')
print('\nNote: TF-IDF often outperforms simple average-pooling Word2Vec on tasks')
print('with strong lexical signals (e.g. star-count bigrams, domain vocabulary).')

In [None]:
%%time
# Word2Vec + Random Forest
from sklearn.ensemble import RandomForestClassifier

rf_w2v = RandomForestClassifier(
    n_estimators=300, max_depth=20, max_features='sqrt',
    class_weight='balanced', random_state=42, n_jobs=-1
)
rf_w2v.fit(X_w2v_train, y_train)

y_pred_w2v_rf = rf_w2v.predict(X_w2v_test)
y_prob_w2v_rf = rf_w2v.predict_proba(X_w2v_test)[:, 1]

print('=== Word2Vec + Random Forest (Balanced) ===')
print(classification_report(y_test, y_pred_w2v_rf, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_w2v_rf):.4f}')

# Side-by-side comparison: LR vs RF on W2V
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (name, yp, ypr) in zip(axes, [
    ('Word2Vec + LR', y_pred_w2v_lr, y_prob_w2v_lr),
    ('Word2Vec + RF', y_pred_w2v_rf, y_prob_w2v_rf),
]):
    cm = confusion_matrix(y_test, yp)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Oranges', ax=ax,
                xticklabels=['Pred Neg','Pred Pos'],
                yticklabels=['True Neg','True Pos'])
    nr = recall_score(y_test, yp, pos_label=0)
    np_ = precision_score(y_test, yp, pos_label=0, zero_division=0)
    au = roc_auc_score(y_test, ypr)
    ax.set_title(f'{name}\nNegRec={nr:.3f}  NegPrec={np_:.3f}  AUC={au:.3f}')
plt.tight_layout()
plt.show()

In [None]:
from sklearn.manifold import TSNE

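# t-SNE visualisation of Word2Vec embedding space (colour-coded by true label)
N_VIZ = 5000
rng = np.random.default_rng(42)  # seeded so the sampled subset is reproducible
idx_viz = rng.choice(len(X_w2v_test), N_VIZ, replace=False)
X_viz = X_w2v_test[idx_viz]
y_viz = np.asarray(y_test)[idx_viz]  # positional indexing, also safe if y_test is a pandas Series

print(f't-SNE on {N_VIZ} test samples...')
t0 = time.time()
# The iteration count is left at its default (1000): the keyword was renamed from
# n_iter to max_iter in scikit-learn 1.5, so omitting it works across versions.
X_2d = TSNE(n_components=2, random_state=42, perplexity=40).fit_transform(X_viz)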
print(f'Done in {time.time()-t0:.0f}s')

fig, ax = plt.subplots(figsize=(9, 7))
for label, color, name in [(1, '#2ecc71', 'Positive'), (0, '#e74c3c', 'Negative')]:
    m = y_viz == label
    ax.scatter(X_2d[m, 0], X_2d[m, 1], c=color, alpha=0.35, s=8,
               label=f'{name} ({m.sum():,})')
ax.set_title('t-SNE of Word2Vec Document Embeddings (5K test sample)', fontsize=12)
ax.legend(markerscale=5)
ax.set_xlabel('t-SNE dim 1'); ax.set_ylabel('t-SNE dim 2')
plt.tight_layout()
plt.show()
print('Separation reflects that Word2Vec captures sentiment-relevant structure.')
print('However, significant overlap shows the limitation of order-blind average pooling.')

### 11.3 DistilBERT — Contextual Transformer Embeddings <a id='11-3'></a>

> ⚡ **GPU Required.** Run this section on [Google Colab](https://colab.research.google.com) with GPU enabled.  
> Upload `Reviews.csv` to your Google Drive, mount it (see setup cell below), then run the notebook from the beginning.

**Fine-tuning strategy**: We adapt `distilbert-base-uncased` (66M parameters) to the Amazon review domain by fine-tuning all encoder weights together with a 2-class classification head on the `[CLS]` token embedding, using weighted cross-entropy to address class imbalance:

$$
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i}\left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right], \quad w_c = \frac{N}{2 N_c}
$$
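
The class weights $w_c = N / (2 N_c)$ and their effect on the loss can be checked with a short sketch. The counts below (22 negative and 78 positive per 100 reviews) simply mirror the 78/22 split mentioned earlier, and both helper functions are hypothetical names introduced for illustration.

```python
import math


def class_weights(n_per_class):
    """Balanced weights w_c = N / (n_classes * N_c)."""
    n_total = sum(n_per_class)
    n_classes = len(n_per_class)
    return [n_total / (n_classes * n_c) for n_c in n_per_class]


def weighted_log_loss(y_true, p_pos, weights):
    """Mean of w_y * negative log-likelihood, following the formula above."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        likelihood = p if y == 1 else 1.0 - p
        total += weights[y] * -math.log(likelihood)
    return total / len(y_true)


# Index 0 = negative, index 1 = positive
weights = class_weights([22, 78])

# The same mistake (only 0.3 probability on the true class) on each class
loss_neg = weighted_log_loss([0], [0.7], weights)
loss_pos = weighted_log_loss([1], [0.3], weights)
print(weights, round(loss_neg, 3), round(loss_pos, 3))
```

An equally confident mistake costs several times more on a negative review than on a positive one, which pushes the model toward higher negative recall. Note that PyTorch's `nn.CrossEntropyLoss(weight=...)` with the default `reduction='mean'` divides by the sum of the per-sample weights rather than by $N$, so its value differs from this sketch by a normalising factor.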

We fine-tune on **50,000 training samples** (≈11% of training set) for **3 epochs** with:
- Optimizer: AdamW, lr = 2×10⁻⁵, weight decay = 0.01
- Warmup: 10% of total steps with linear decay
- Mixed precision (fp16) for roughly 2× faster training on GPU
- Max sequence length: 128 tokens (covers ~95% of reviews)
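
The warmup-then-linear-decay schedule listed above can be written out as a small function. This is a sketch of the shape of the schedule, assuming the batch size of 32 used in the notebook's training cell; `lr_at_step` is a hypothetical helper, not the Hugging Face scheduler itself.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# 50,000 samples, batch size 32, 3 epochs, 10% warmup
steps_per_epoch = math.ceil(50_000 / 32)
total_steps = steps_per_epoch * 3
warmup_steps = int(0.1 * total_steps)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, 2e-5, warmup_steps, total_steps):.2e}")
```

The learning rate peaks at 2×10⁻⁵ once warmup ends and reaches zero at the final step.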

**Why 50K instead of the full training set?**  
Fine-tuning 3 epochs on 454K samples takes ~6h on T4. The 50K subset trains in ~35-50 min and already provides strong in-domain adaptation. Section 11.3.3 explains how to scale up if needed.

In [None]:
import time
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset as TorchDataset, DataLoader
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    get_linear_schedule_with_warmup
)
from torch.optim import AdamW

# ─── Hyperparameters ────────────────────────────────────────────────────────
N_FT_TRAIN = 50_000    # samples for fine-tuning; increase on A100 / with more time
N_FT_VAL   = 5_000
MAX_LEN    = 128       # tokens — covers ~95% of reviews
N_EPOCHS   = 3
LR         = 2e-5
TRAIN_BS   = 32
EVAL_BS    = 64

# ─── Class weights (for imbalance) ──────────────────────────────────────────
n_neg = (y_ft_train == 0).sum()
n_pos = (y_ft_train == 1).sum()
n_tot = len(y_ft_train)
w0 = n_tot / (2 * n_neg)
w1 = n_tot / (2 * n_pos)
class_weights = torch.tensor([w0, w1], dtype=torch.float).to(device)
print(f'Class weights: neg={w0:.3f}  pos={w1:.3f}')

# ─── PyTorch Dataset ─────────────────────────────────────────────────────────
class ReviewDataset(TorchDataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        enc = tokenizer(
            list(texts), truncation=True, padding='max_length',
            max_length=max_len, return_tensors='pt'
        )
        self.input_ids      = enc['input_ids']
        self.attention_mask = enc['attention_mask']
        self.labels         = torch.tensor(np.asarray(labels), dtype=torch.long)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return {
            'input_ids':      self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels':         self.labels[idx]
        }

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
print('Tokenizing fine-tune datasets...')
t0 = time.time()
train_ds = ReviewDataset(raw_ft_train, y_ft_train, tokenizer, MAX_LEN)
val_ds   = ReviewDataset(raw_ft_val,   y_ft_val,   tokenizer, MAX_LEN)
test_ds  = ReviewDataset(raw_test,     y_test,     tokenizer, MAX_LEN)
print(f'Done in {time.time()-t0:.0f}s')

train_loader = DataLoader(train_ds, batch_size=TRAIN_BS, shuffle=True,  num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds,   batch_size=EVAL_BS,  shuffle=False, num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_ds,  batch_size=EVAL_BS,  shuffle=False, num_workers=2, pin_memory=True)

# ─── Model ───────────────────────────────────────────────────────────────────
model_bert = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2,
    id2label={0:'NEGATIVE', 1:'POSITIVE'},
    label2id={'NEGATIVE':0, 'POSITIVE':1}
).to(device)

# ─── Optimizer + Scheduler ──────────────────────────────────────────────────
total_steps   = len(train_loader) * N_EPOCHS
warmup_steps  = int(0.1 * total_steps)

optimizer = AdamW(model_bert.parameters(), lr=LR, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

n_params = sum(p.numel() for p in model_bert.parameters() if p.requires_grad)
print(f'Trainable parameters: {n_params/1e6:.1f}M')
print(f'Steps per epoch: {len(train_loader):,}  |  Total steps: {total_steps:,}')
print(f'Warmup steps: {warmup_steps}')
print(f'\nEstimated time on T4 GPU: ~35-50 min for {N_EPOCHS} epochs')

In [None]:
%%time
# ── DistilBERT: Training loop ────────────────────────────────────────────────
from tqdm.auto import tqdm

scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))
history = []

for epoch in range(1, N_EPOCHS + 1):
    # ── Train ──
    model_bert.train()
    total_loss, n_correct, n_total = 0.0, 0, 0
    pbar = tqdm(train_loader, desc=f'Epoch {epoch}/{N_EPOCHS} [train]', leave=False)
    for batch in pbar:
        input_ids      = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels         = batch['labels'].to(device)

        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=(device == 'cuda')):
            outputs = model_bert(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model_bert.parameters(), 1.0)
        scaler.step(optimizer); scaler.update()
        scheduler.step()

        total_loss += loss.item() * labels.size(0)
        n_correct  += (outputs.logits.argmax(dim=-1) == labels).sum().item()
        n_total    += labels.size(0)

    train_loss = total_loss / n_total
    train_acc  = n_correct / n_total

    # ── Validate ──
    model_bert.eval()
    val_preds, val_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            logits = model_bert(
                input_ids=batch['input_ids'].to(device),
                attention_mask=batch['attention_mask'].to(device)
            ).logits
            val_preds.extend(logits.argmax(dim=-1).cpu().numpy())
            val_labels.extend(batch['labels'].numpy())

    val_neg_f1 = f1_score(val_labels, val_preds, pos_label=0, zero_division=0)
    val_neg_rec = recall_score(val_labels, val_preds, pos_label=0)

    history.append({'epoch': epoch, 'train_loss': train_loss,
                    'train_acc': train_acc, 'val_neg_f1': val_neg_f1,
                    'val_neg_rec': val_neg_rec})
    print(f'Epoch {epoch}: loss={train_loss:.4f}  acc={train_acc:.4f}  '
          f'val_NegF1={val_neg_f1:.4f}  val_NegRec={val_neg_rec:.4f}')

# Plot training history
df_hist = pd.DataFrame(history)
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].plot(df_hist['epoch'], df_hist['train_loss'], 'o-', color='#3498db')
axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('Training Loss')
axes[0].set_title('DistilBERT Training Loss'); axes[0].grid(alpha=0.3)
axes[1].plot(df_hist['epoch'], df_hist['val_neg_f1'],  'o-', color='#2ecc71', label='Val Neg F1')
axes[1].plot(df_hist['epoch'], df_hist['val_neg_rec'], 's-', color='#e67e22', label='Val Neg Recall')
axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('Score')
axes[1].set_title('Validation Metrics per Epoch')
axes[1].legend(); axes[1].grid(alpha=0.3)
plt.tight_layout(); plt.show()

In [None]:
# ── DistilBERT: Evaluation on full test set ──────────────────────────────────
import torch.nn.functional as F

print('Evaluating on full test set...')
model_bert.eval()

all_logits = []

# Reuse test_loader from the setup cell (shuffle=False, so order matches y_test)
with torch.no_grad():
    for batch in test_loader:
        out = model_bert(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device)
        )
        all_logits.append(out.logits.cpu())

logits_all = torch.cat(all_logits, dim=0)
probs_all  = F.softmax(logits_all, dim=-1).numpy()
preds_bert = np.argmax(probs_all, axis=-1)
y_prob_bert = probs_all[:, 1]   # p(positive)

print('\n=== DistilBERT Fine-tuned — Full Test Set ===')
print(classification_report(y_test, preds_bert, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_bert):.4f}')

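# --- Threshold tuning for BERT ---
# Treat Negative (label 0) as the target class, scored by p(negative) = 1 - p(positive).
# Note: tau* is picked on the test set here, so the tuned metrics are optimistic;
# pick it on the validation split for an unbiased estimate.
from sklearn.metrics import precision_recall_curve as prc
precisions_b, recalls_b, thresholds_b = prc(y_test, 1 - y_prob_bert, pos_label=0)
f1_b = 2 * precisions_b[:-1] * recalls_b[:-1] / (precisions_b[:-1] + recalls_b[:-1] + 1e-8)
best_b_idx = np.argmax(f1_b)
tau_b = 1 - thresholds_b[best_b_idx]   # equivalent threshold on p(positive)
# Predict Positive (1) only when p(positive) exceeds tau*; otherwise Negative (0)
preds_bert_tuned = (y_prob_bert > tau_b).astype(int)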
print(f'\nWith threshold τ* = {tau_b:.3f}:')
print(f'  Neg Recall   : {recall_score(y_test, preds_bert_tuned, pos_label=0):.4f}')
print(f'  Neg Precision: {precision_score(y_test, preds_bert_tuned, pos_label=0, zero_division=0):.4f}')
print(f'  Neg F1       : {f1_score(y_test, preds_bert_tuned, pos_label=0):.4f}')

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
cm_base = confusion_matrix(y_test, preds_bert)
cm_tuned = confusion_matrix(y_test, preds_bert_tuned)
for ax, cm, title in [(axes[0], cm_base, f'DistilBERT (τ=0.5)'),
                      (axes[1], cm_tuned, f'DistilBERT (τ*={tau_b:.2f})')]:
    sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', ax=ax,
                xticklabels=['Pred Neg','Pred Pos'],
                yticklabels=['True Neg','True Pos'])
    ax.set_title(title)
plt.tight_layout()
plt.show()

In [None]:
# ── 11.4 Embedding Method Comparison ────────────────────────────────────────
# Aggregates results from W2V and BERT (runs after 11.2 and 11.3)

def eval_row(name, y_true, y_pred, y_prob):
    return {
        'Model': name,
        'Accuracy':      round((y_true == y_pred).mean(), 4),
        'Neg Recall':    round(recall_score(y_true, y_pred, pos_label=0), 4),
        'Neg Precision': round(precision_score(y_true, y_pred, pos_label=0, zero_division=0), 4),
        'Neg F1':        round(f1_score(y_true, y_pred, pos_label=0, zero_division=0), 4),
        'AUC-ROC':       round(roc_auc_score(y_true, y_prob), 4),
    }

embed_rows = [
    eval_row('TF-IDF + LR (Balanced+EN)',    y_test, y_pred_best_lr,  y_prob_best_lr),
    eval_row('Word2Vec + LR (Balanced)',      y_test, y_pred_w2v_lr,   y_prob_w2v_lr),
    eval_row('Word2Vec + RF (Balanced)',      y_test, y_pred_w2v_rf,   y_prob_w2v_rf),
]
try:
    embed_rows.append(eval_row('DistilBERT (fine-tuned)', y_test, preds_bert, y_prob_bert))
except NameError:
    print('DistilBERT results not yet available (run Section 11.3 on GPU first).')

df_embed = pd.DataFrame(embed_rows).sort_values('Neg F1', ascending=False)
print('=== Embedding Method Comparison ===')
print(df_embed.to_string(index=False))

# ROC curve overlay
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
ax = axes[0]
roc_data = [
    ('TF-IDF + LR',    y_prob_best_lr, '#3498db', '-'),
    ('Word2Vec + LR',  y_prob_w2v_lr,  '#e67e22', '--'),
    ('Word2Vec + RF',  y_prob_w2v_rf,  '#e74c3c', '-.'),
]
try:
    roc_data.append(('DistilBERT',  y_prob_bert, '#27ae60', '-'))
except NameError:
    pass
for name, prob, color, ls in roc_data:
    fpr, tpr, _ = roc_curve(y_test, prob)
    au = auc(fpr, tpr)
    ax.plot(fpr, tpr, lw=2, color=color, ls=ls, label=f'{name}  (AUC={au:.3f})')
ax.plot([0,1],[0,1],'k--', lw=0.8)
ax.set_xlabel('FPR'); ax.set_ylabel('TPR')
ax.set_title('ROC Curves — Embedding Methods')
ax.legend(loc='lower right', fontsize=9); ax.grid(alpha=0.3)

# Bar chart: key metrics
ax2 = axes[1]
models_bar = df_embed['Model'].tolist()
x = np.arange(len(models_bar)); w = 0.22
metrics_bar = [('Neg Recall','#e67e22'), ('Neg Precision','#e74c3c'),
               ('Neg F1','#2ecc71'), ('AUC-ROC','#3498db')]
for i, (m, c) in enumerate(metrics_bar):
    vals = [df_embed.loc[df_embed['Model']==n, m].values[0] for n in models_bar]
    ax2.bar(x + (i-1.5)*w, vals, w, label=m, color=c, edgecolor='white')
ax2.set_xticks(x)
ax2.set_xticklabels(models_bar, rotation=15, ha='right', fontsize=9)
ax2.set_ylim(0.55, 1.0); ax2.set_ylabel('Score')
ax2.set_title('Metric Comparison — All Embedding Methods')
ax2.legend(loc='lower right', fontsize=8); ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

---
## 12. Ensemble Learning — Soft Voting <a id='12'></a>

### Motivation

The two best base models have **complementary strengths**:

| Model | Neg. Recall | Neg. Precision |
|---|---|---|
| Best LR (Balanced + ElasticNet) | High ↑ | Lower ↓ |
| RF (Balanced + SVD) | Lower ↓ | High ↑ |

**Soft Voting** combines them by averaging predicted class probabilities:

$$
\hat{P}_{\text{ens}}(c \mid \mathbf{x}) = \frac{1}{2}\Big[\hat{P}_{\text{LR}}(c \mid \mathbf{x}) + \hat{P}_{\text{RF}}(c \mid \mathbf{x})\Big]
$$

$$
\hat{y} = \underset{c}{\arg\max}\; \hat{P}_{\text{ens}}(c \mid \mathbf{x})
$$

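Soft voting is generally preferable to **hard voting** (majority vote of predicted labels) because it uses each model's full probability estimate rather than only its predicted label. With just two base models, hard voting would also tie whenever they disagree.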
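
A small sketch of the two voting schemes on a single review, using made-up probabilities for the two base models (`soft_vote` and `hard_votes` are hypothetical helpers, not the scikit-learn implementation):

```python
def soft_vote(prob_vectors):
    """Average class probabilities across models and pick the arg max."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    avg = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    return avg, avg.index(max(avg))


def hard_votes(prob_vectors):
    """Count the label each model would predict on its own."""
    counts = [0] * len(prob_vectors[0])
    for p in prob_vectors:
        counts[p.index(max(p))] += 1
    return counts


# Illustrative [p(negative), p(positive)] outputs for one review
p_lr = [0.90, 0.10]   # LR is confident the review is negative
p_rf = [0.45, 0.55]   # RF leans slightly positive

ensemble_prob, ensemble_label = soft_vote([p_lr, p_rf])
votes = hard_votes([p_lr, p_rf])
print(ensemble_prob, ensemble_label, votes)
```

The label votes split one to one, while the averaged probabilities let the more confident model decide and the review is classified as negative.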

In [None]:
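# Soft voting on SVD-reduced features (both models need the same input)
# The TF-IDF LR cannot be reused here, so we define an ElasticNet LR on the SVD features.
# Note: VotingClassifier clones and refits its estimators in fit(), so both models are retrained below.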
lr_svd = LogisticRegression(
    penalty='elasticnet', solver='saga', l1_ratio=grid.best_params_['l1_ratio'],
    class_weight='balanced', max_iter=1000, random_state=42
)
lr_svd.fit(X_train_svd, y_train)

voting_clf = VotingClassifier(
    estimators=[('lr', lr_svd), ('rf', best_rf)],
    voting='soft',
    n_jobs=-1
)
voting_clf.fit(X_train_svd, y_train)

y_pred_vote = voting_clf.predict(X_test_svd)
y_prob_vote = voting_clf.predict_proba(X_test_svd)[:, 1]

print('=== Soft Voting Ensemble (LR + RF) ===')
print(classification_report(y_test, y_pred_vote, target_names=['Negative','Positive'], digits=4))
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob_vote):.4f}')

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_evaluation(y_test, y_prob_vote, y_pred_vote,
                'Soft Voting Ensemble', axes[0], axes[1])
plt.tight_layout()
plt.show()

---
## 13. Model Comparison & Selection <a id='13'></a>

In [None]:
all_models = {
    'LR Baseline':                    (y_test, y_pred_base,     y_prob_base),
    'LR + Class Weights':             (y_test, y_pred_cw,       y_prob_cw),
    'LR + SMOTE (SVD)':               (y_test, y_pred_smote,    y_prob_smote),
    'LR + Undersampling':             (y_test, y_pred_rus,      y_prob_rus),
    'LR + Balanced + ElasticNet':     (y_test, y_pred_best_lr,  y_prob_best_lr),
    'LR + Balanced + Threshold':      (y_test, y_pred_tuned,    y_prob_best_lr),
    'RF + Balanced + SVD':            (y_test, y_pred_rf,       y_prob_rf),
    'Soft Voting (LR + RF)':          (y_test, y_pred_vote,     y_prob_vote),
    # Word2Vec embeddings (populated after Section 11.2)
    'Word2Vec + LR (Balanced)':       (y_test, y_pred_w2v_lr,   y_prob_w2v_lr),
    'Word2Vec + RF (Balanced)':       (y_test, y_pred_w2v_rf,   y_prob_w2v_rf),
}

# Add DistilBERT if available (Section 11.3, GPU required)
try:
    all_models['DistilBERT (fine-tuned)'] = (y_test, preds_bert, y_prob_bert)
except NameError:
    print('DistilBERT not yet run (Section 11.3 requires GPU). Skipping.')

rows = []
for name, (yt, yp, ypr) in all_models.items():
    rows.append({
        'Model': name,
        'Accuracy': round((yt==yp).mean(), 4),
        'Neg Recall': round(recall_score(yt, yp, pos_label=0), 4),
        'Neg Precision': round(precision_score(yt, yp, pos_label=0, zero_division=0), 4),
        'Neg F1': round(f1_score(yt, yp, pos_label=0, zero_division=0), 4),
        'AUC': round(roc_auc_score(yt, ypr), 4),
    })

df_compare = pd.DataFrame(rows).sort_values('AUC', ascending=False)
print(df_compare.to_string(index=False))

# --- ROC curves overlay ---
fig, ax = plt.subplots(figsize=(9, 6.5))
colors = plt.cm.tab10.colors
for i, (name, (yt, yp, ypr)) in enumerate(all_models.items()):
    fpr, tpr, _ = roc_curve(yt, ypr)
    roc_auc_val = auc(fpr, tpr)
    ls = '--' if 'Word2Vec' in name or 'BERT' in name else '-'
    ax.plot(fpr, tpr, lw=1.8, ls=ls, color=colors[i % len(colors)],
            label=f'{name} ({roc_auc_val:.3f})')
ax.plot([0,1],[0,1],'k--', lw=0.8)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves — All Models')
ax.legend(loc='lower right', fontsize=7)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

---
## 14. Feature Importance & Interpretability <a id='14'></a>

Logistic Regression coefficients directly quantify each feature's contribution to the log-odds of the positive class:

$$
\log\frac{P(y=1)}{P(y=0)} = \sum_j w_j x_j + b
$$

- **Large positive $w_j$**: feature $j$ strongly predicts positive sentiment.
- **Large negative $w_j$**: feature $j$ strongly predicts negative sentiment.

This provides transparent, word-level explanations — critical for business stakeholders.
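
As a worked example of the log-odds formula, the sketch below scores one review with three invented coefficients. The weights, bias and feature values are hypothetical and are not taken from the fitted model.

```python
import math


def predict_positive_probability(features, weights, bias):
    """Return (log-odds, P(y=1)) for a sparse feature dict."""
    log_odds = bias + sum(weights.get(name, 0.0) * value
                          for name, value in features.items())
    return log_odds, 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for three stemmed features
toy_weights = {"delici": 2.0, "not": -0.5, "disappoint": -3.0}
toy_bias = 0.5

# Hypothetical TF-IDF values for a review containing "not" and "disappoint"
review_features = {"disappoint": 0.6, "not": 0.4}

log_odds, p_positive = predict_positive_probability(
    review_features, toy_weights, toy_bias
)
print(round(log_odds, 3), round(p_positive, 3))
```

Each term $w_j x_j$ shifts the log-odds additively, so the contribution of every word to the final prediction can be read off directly.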

In [None]:
feature_names = vectorizer.get_feature_names_out()
coefs = best_lr.coef_.flatten()

coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefs})
coef_df = coef_df.sort_values('Coefficient', ascending=False)

top_n = 20
top_pos = coef_df.head(top_n)
top_neg = coef_df.tail(top_n).iloc[::-1]

fig, axes = plt.subplots(1, 2, figsize=(14, 7))

axes[0].barh(top_pos['Feature'], top_pos['Coefficient'],
             color='#2ecc71', edgecolor='none')
axes[0].set_title(f'Top {top_n} Positive Sentiment Features')
axes[0].set_xlabel('LR Coefficient')
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)

axes[1].barh(top_neg['Feature'], top_neg['Coefficient'],
             color='#e74c3c', edgecolor='none')
axes[1].set_title(f'Top {top_n} Negative Sentiment Features')
axes[1].set_xlabel('LR Coefficient')
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.3)

plt.suptitle('Logistic Regression Feature Coefficients', fontsize=13, y=1.01)
plt.tight_layout()
plt.show()

print('Top 10 POSITIVE features:')
print(coef_df.head(10)[['Feature','Coefficient']].to_string(index=False))
print('\nTop 10 NEGATIVE features:')
print(coef_df.tail(10).iloc[::-1][['Feature','Coefficient']].to_string(index=False))

---
## 15. Conclusions <a id='15'></a>

### Key Findings

1. **Stop-word handling matters critically** for sentiment analysis. Retaining negation words ("not", "won't", "can't") is essential — removing them inverts sentiment in phrases like "not great" and "won't buy again".

2. **Baseline-first development** is essential for measuring the true impact of each technique. Starting with SVD or aggressive resampling without a baseline makes it impossible to know if the changes helped.

3. **SVD is not appropriate as primary dimensionality reduction for TF-IDF NLP**: 300 components explain only ~30% of variance; 1,000 explain < 70%. Use SVD only as a necessary preprocessing step for algorithms requiring dense input (e.g., SMOTE).

4. **Class weighting is the most effective and efficient** imbalance-handling strategy: no data manipulation, no information loss, best precision-recall balance among all strategies.

5. **Threshold tuning is underused**: after fitting a model, adjusting the decision threshold ($\tau^* = 0.37$) is a zero-cost way to control the precision-recall trade-off for deployment.

6. **The Soft Voting ensemble** achieves the joint-best AUC (0.951, tied with the balanced ElasticNet LR to three decimals) among TF-IDF-based models by combining complementary strengths of the high-recall LR and high-precision RF.

7. **Word2Vec embeddings** (Skip-gram, d=300) provide denser semantic representations than TF-IDF, capturing word similarity and analogy, but average pooling loses word-order information. On this dataset they typically match but do not dramatically exceed TF-IDF+LR in classification accuracy.

8. **DistilBERT fine-tuning** yields the best overall performance — contextual embeddings capture negation, sarcasm, and domain vocabulary that bag-of-words methods miss. The trade-off is orders-of-magnitude higher compute (GPU required) vs. TF-IDF models that train in seconds on CPU.

9. **Embedding quality scales with representation complexity**: TF-IDF (sparse, count-based) < Word2Vec (dense, static) < DistilBERT (dense, contextual). Each step up improves semantics at the cost of training time and infrastructure.

### Embedding Comparison Summary

| Representation | AUC-ROC | Neg. F1 | Training Time | GPU Required |
|---|---|---|---|---|
| TF-IDF + LR (baseline) | 0.942 | 0.753 | Seconds | No |
| TF-IDF + LR (balanced, elastic) | 0.951 | 0.768 | Seconds | No |
| TF-IDF Ensemble (Soft Voting) | 0.951 | — | Minutes | No |
| Word2Vec + LR (balanced) | See §11 | See §11 | ~5 min | No |
| Word2Vec + RF (balanced) | See §11 | See §11 | ~10 min | No |
| DistilBERT (fine-tuned) | See §11 | See §11 | ~35-50 min (T4) | Yes |

### Final Model Recommendation

**For production with resource constraints**: Use the **Soft Voting ensemble** (TF-IDF + LR + RF) with **threshold tuning** ($\tau^* = 0.37$). This achieves AUC = 0.951, trains in minutes on CPU, and requires no GPU.

**For maximum accuracy with GPU available**: Fine-tune **DistilBERT** on the full dataset. Expected AUC > 0.97 with superior handling of negation, sarcasm, and complex product descriptions.

**For a middle ground**: **Word2Vec + Logistic Regression** (balanced class weights) provides better semantic representations than TF-IDF without requiring GPU — a good option when embedding quality matters but transformer compute is unavailable.


In [None]:
# Save the best TF-IDF-based models
import joblib, os
os.makedirs('../models', exist_ok=True)  # joblib.dump fails if the directory is missing
joblib.dump(best_lr,       '../models/best_logistic_model.pkl')
joblib.dump(best_rf,       '../models/best_rf.pkl')
joblib.dump(voting_clf,    '../models/voting_classifier_model.pkl')
joblib.dump(svd,           '../models/svd_1000.pkl')
joblib.dump(vectorizer,    '../models/tfidf_vectorizer.pkl')
print('TF-IDF models saved.')

# Save Word2Vec model (if trained in Section 11)
try:
    w2v_model.wv.save_word2vec_format('../models/w2v_300d.bin', binary=True)
    joblib.dump(lr_w2v, '../models/w2v_lr_model.pkl')
    joblib.dump(rf_w2v, '../models/w2v_rf_model.pkl')
    print('Word2Vec models saved.')
except NameError:
    print('Word2Vec models not found (Section 11 not run) -- skipping.')

# Save fine-tuned DistilBERT (if trained in Section 11)
try:
    bert_save_path = '../models/distilbert_finetuned'
    os.makedirs(bert_save_path, exist_ok=True)
    model_bert.save_pretrained(bert_save_path)
    tokenizer.save_pretrained(bert_save_path)
    print(f'DistilBERT model saved to {bert_save_path}/')
except NameError:
    print('DistilBERT model not found (Section 11 not run) -- skipping.')

print('Done. All available models saved to ../models/')
