# Project: Sentiment Analysis on Movie Reviews
**Description:** This notebook demonstrates [sentiment analysis](https://github.com/Aronno1920/Sentiment-Analysis) on the IMDB movie reviews dataset using three text representations: TF-IDF, Word2Vec, and BERT (DistilBERT) embeddings. Each representation feeds a logistic regression classifier; we train the models, evaluate them, and compare their performance.
<br/> <br/>
**Submitted by:** *Selim Ahmed*


## Setup & Install Dependencies

In [4]:
# ===== Step 1: Uninstall conflicting packages =====
# !pip uninstall -y numpy scipy torch scikit-learn spacy thinc gensim \
# opencv-python opencv-python-headless opencv-contrib-python albumentations \
# albucore dopamine-rl tsfresh transformers datasets

# ===== Step 2: Install compatible versions =====
!pip install numpy pandas
!pip install scikit-learn torch
!pip install gensim
!pip install transformers datasets
!pip install matplotlib seaborn
!pip install beautifulsoup4 sentencepiece


## Import required libraries

In [3]:
import os, re, string, random, time
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import seaborn as sns

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,   # used by evaluate()
    confusion_matrix,        # used by evaluate()
    precision_recall_fscore_support,
)
from gensim.models import Word2Vec

import torch
from transformers import AutoTokenizer, AutoModel

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


## Function: Helper Functions
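
As a rough sketch of what the `evaluate` helper below reports, the binary metrics can be computed by hand from the four confusion-matrix counts. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of the notebook, which relies on scikit-learn for these numbers.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels by hand."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (made up for illustration): 3 TP, 1 FN, 1 FP, 3 TN
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.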

In [None]:
# Clean and tokenize text
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Only parse if '<' and '>' exist (likely HTML)
def clean_text(text):
    if "<" in text and ">" in text:
        return BeautifulSoup(text, "html.parser").get_text(" ")
    return text

# Simple tokenization: lowercase + keep only words
def tokenize_for_w2v(text):
    text = clean_text(text)
    tokens = re.findall(r"\b\w+\b", text.lower())
    return tokens


# Evaluation metrics
def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    print("Accuracy:", f"{acc:.4f}")
    print("Precision:", f"{precision:.4f}")
    print("Recall:", f"{recall:.4f}")
    print("F1:", f"{f1:.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred, digits=4))
    cm = confusion_matrix(y_true, y_pred)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}, cm


# plot confusion matrix:
def plot_confusion_matrix(cm, title):
    plt.figure(figsize=(4,4))
    plt.imshow(cm, interpolation="nearest")
    plt.title(title)
    plt.xticks([0,1], ["neg(0)","pos(1)"])
    plt.yticks([0,1], ["neg(0)","pos(1)"])
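    for (i, j), z in np.ndenumerate(cm):
        plt.text(j, i, str(z), ha='center', va='center')
    # confusion_matrix puts true labels on rows and predictions on columns
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()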

## Load IMDB dataset

In [None]:
# Download and load the IMDB dataset from the Hugging Face Hub
imdb = load_dataset("imdb")

In [None]:
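def split_train_test(test_size=0.2, random_state=SEED):
    # Use the official IMDB training split and hold out a stratified
    # fraction of it for testing (imdb["test"] is not used for modeling)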
    texts_full = list(imdb["train"]["text"])
    labels_full = list(imdb["train"]["label"])

    # Stratified split: train / test
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts_full,
        labels_full,
        test_size=test_size,
        stratify=labels_full,
        random_state=random_state
    )

    return train_texts, train_labels, test_texts, test_labels

In [None]:
# Usage example
train_texts, train_labels, test_texts, test_labels = split_train_test()

print("Training examples:", len(train_texts))
print("Test examples:", len(test_texts))

## EDA

### Inspect the data

In [None]:
# Look at the first training example
print(imdb['train'][0])

# Sample review
print("Text snippet:", imdb['train'][0]['text'][:500])
print("Label:", imdb['train'][0]['label'])  # 0=negative, 1=positive

### Basic statistics

In [None]:
# Convert to DataFrame for easier analysis
train_df = pd.DataFrame(imdb['train'])
test_df  = pd.DataFrame(imdb['test'])

# Check shape
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

# Check class distribution
print("Train label counts:\n", train_df['label'].value_counts())
print("Test label counts:\n", test_df['label'].value_counts())

### Text length analysis

In [None]:
# Number of words per review
train_df['num_words'] = train_df['text'].apply(lambda x: len(x.split()))
test_df['num_words']  = test_df['text'].apply(lambda x: len(x.split()))

# Basic stats
print("Train review length stats:\n", train_df['num_words'].describe())
print("Test review length stats:\n", test_df['num_words'].describe())

### Visualizations

In [None]:
# Class distribution
sns.countplot(x='label', data=train_df)
plt.title("Train set label distribution (0=neg, 1=pos)")
plt.show()

# Review length distribution
plt.figure(figsize=(10,5))
sns.histplot(train_df['num_words'], bins=50, kde=True)
plt.title("Train review length distribution")
plt.xlabel("Number of words")
plt.show()

train_df.sample(5, random_state=SEED)[["label","text"]]

### Sample text inspection

In [None]:
# Random positive review (seeded so the same sample appears on every run)
print("Random positive review:\n", train_df[train_df['label']==1]['text'].sample(1, random_state=SEED).values[0][:500])

# Random negative review
print("Random negative review:\n", train_df[train_df['label']==0]['text'].sample(1, random_state=SEED).values[0][:500])

## Data Preprocessing

In [None]:
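# Pre-compiled patterns for tag and punctuation removal
TAG_RE = re.compile(r"<[^>]+>")
PUNCT_RE = re.compile(r"[^a-z0-9\s]")

# Strip HTML markup with the clean_text helper defined above
# (this notebook has no validation DataFrame, only train and test)
for df in (train_df, test_df):
    df["text_clean"] = df["text"].apply(clean_text)

train_df.head(3)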

In [None]:
def preprocess_text(text):
    text = text.lower()
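    text = re.sub(r"<.*?>", " ", text)      # replace HTML tags with a space so adjacent words do not merge
    text = re.sub(r"[^a-z\s]", "", text)    # remove punctuation/numbers (text is already lowercase)
    text = re.sub(r"\s+", " ", text)        # collapse repeated whitespace
    return text.strip()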

train_df["clean_text"] = train_df["text"].apply(preprocess_text)
test_df["clean_text"] = test_df["text"].apply(preprocess_text)

train_df.head()

## Train Model

### TF-IDF + Logistic Regression
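
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on a three-document toy corpus. It uses raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows, which mirrors the formula scikit-learn documents for its default settings. The function `toy_tfidf` and the corpus are assumptions made for illustration; the notebook's `TfidfVectorizer` additionally removes English stop words and caps the vocabulary at `max_features`, which this sketch does not reproduce.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows (toy re-implementation)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = ["great movie great cast", "boring movie", "great fun"]
toy_vocab_terms, toy_rows = toy_tfidf(toy_docs)

# Weights for the second document, "boring movie"
toy_weights = dict(zip(toy_vocab_terms, toy_rows[1]))
print({tok: round(w, 3) for tok, w in toy_weights.items() if w > 0})
```

In the second document, "boring" receives a larger weight than "movie" because it occurs in fewer documents. Word order plays no role, which is why TF-IDF is called a bag-of-words representation.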

In [None]:
def train_tfidf(train_texts, train_labels, test_texts, test_labels, max_features=5000):
    print("Training TF-IDF vectorizer...")
    vectorizer = TfidfVectorizer(max_features=max_features, stop_words="english")
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    print("Training Logistic Regression on TF-IDF...")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)

    preds = clf.predict(X_test)
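    # evaluate() returns (metrics dict, confusion matrix); unpack both
    metrics, cm = evaluate(test_labels, preds)
    plot_confusion_matrix(cm, "TF-IDF + Logistic Regression")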
    return metrics, (clf, vectorizer)

# Run TF-IDF model
train_texts, train_labels, test_texts, test_labels = split_train_test()
metrics_tfidf, tfidf_obj = train_tfidf(train_texts, train_labels, test_texts, test_labels)
print("TF-IDF metrics:", metrics_tfidf)


### Word2Vec + Logistic Regression
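
The `sentence_vector` helper below represents a review as the mean of its word vectors. The following sketch, using made-up 3-dimensional vectors in place of a trained Word2Vec model (the names `toy_vectors` and `average_vector` are hypothetical), shows two consequences of that choice: word order is discarded, and a text with no known words falls back to the zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained Word2Vec model
toy_vectors = {
    "not":  np.array([-1.0, 0.0, 0.5]),
    "good": np.array([1.0, 1.0, 0.0]),
    "bad":  np.array([-1.0, 1.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Mean of the known word vectors; zero vector if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

vec_a = average_vector(["not", "good"], toy_vectors)
vec_b = average_vector(["good", "not"], toy_vectors)      # same words, different order
vec_c = average_vector(["unseen", "words"], toy_vectors)  # nothing in the vocabulary

print(vec_a, vec_b, vec_c)
print("order-invariant:", np.allclose(vec_a, vec_b))
```

Because the average is identical for both orderings, any signal carried by word order or negation scope has to be recovered by the classifier from the averaged vector alone.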

In [None]:
# Sentence Vector using Word2Vec embeddings
def sentence_vector(tokens, model):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)


def train_word2vec(train_texts, train_labels, test_texts, test_labels):
    # Tokenize all texts
    train_tokens = [tokenize_for_w2v(t) for t in train_texts]
    test_tokens = [tokenize_for_w2v(t) for t in test_texts]

    print("Training Word2Vec embeddings...")
    w2v_model = Word2Vec(
        sentences=train_tokens,
        vector_size=100,
        window=5,
        min_count=2,
        workers=4,
        seed=SEED  # fixes initialization; multi-threaded training can still vary slightly between runs
    )

    # Convert sentences to vectors
    X_train = np.vstack([sentence_vector(t, w2v_model) for t in train_tokens])
    X_test  = np.vstack([sentence_vector(t, w2v_model) for t in test_tokens])

    print("Training Logistic Regression on Word2Vec...")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)

    preds = clf.predict(X_test)
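    # evaluate() returns (metrics dict, confusion matrix); unpack both
    metrics, cm = evaluate(test_labels, preds)
    plot_confusion_matrix(cm, "Word2Vec + Logistic Regression")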
    return metrics, (clf, w2v_model)

# --- Run Training ---
metrics_w2v, w2v_obj = train_word2vec(train_texts, train_labels, test_texts, test_labels)
print("Word2Vec metrics:", metrics_w2v)

### BERT + Logistic Regression
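
The `bert_encode` function below turns per-token hidden states into one vector per review by averaging only the real tokens and ignoring padding. Here is a small NumPy sketch of that masked mean pooling on a toy batch. The function name `masked_mean_pool` and the clipping of the denominator are assumptions added for illustration; the notebook's version operates on PyTorch tensors and divides by the mask sum directly.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden:         (batch, seq_len, dim) token embeddings
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)           # guard against division by zero
    return summed / counts

# Toy batch: one sequence with three positions, the last of which is padding
toy_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])

toy_pooled = masked_mean_pool(toy_hidden, toy_mask)
print(toy_pooled)  # [[2. 3.]]
```

A plain mean over all three positions would be dominated by the padded vector; the mask keeps the pooled embedding independent of how much padding a batch happens to contain.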

In [None]:
def train_bert(train_texts, train_labels, test_texts, test_labels, limit_train=5000, limit_test=2000):
    print("Loading DistilBERT tokenizer and model...")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    bert_model = AutoModel.from_pretrained("distilbert-base-uncased").to(DEVICE)
    bert_model.eval()

    # ✅ Stratified sampling to keep class balance
    X_train_texts, _, y_train, _ = train_test_split(
        train_texts, train_labels,
        train_size=min(limit_train, len(train_texts)),
        stratify=train_labels,
        random_state=42
    )

    X_test_texts, _, y_test, _ = train_test_split(
        test_texts, test_labels,
        train_size=min(limit_test, len(test_texts)),
        stratify=test_labels,
        random_state=42
    )

    def clean_text(text):
        # Simple cleaning: remove HTML tags if present
        return BeautifulSoup(text, "html.parser").get_text(" ")

    def bert_encode(texts):
        embeddings = []
        batch_size = 32
        for i in range(0, len(texts), batch_size):
            batch = [clean_text(t) for t in texts[i:i+batch_size]]
            enc = tokenizer(batch, truncation=True, padding=True, max_length=512, return_tensors="pt").to(DEVICE)
            with torch.no_grad():
                out = bert_model(**enc).last_hidden_state
                mask = enc["attention_mask"].unsqueeze(-1).expand(out.shape).float()
                pooled = (out * mask).sum(1) / mask.sum(1)
                embeddings.append(pooled.cpu().numpy())
        return np.vstack(embeddings)

    # Encode text into embeddings
    X_train = bert_encode(X_train_texts)
    X_test = bert_encode(X_test_texts)

    print("Training Logistic Regression on BERT embeddings...")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    preds = clf.predict(X_test)
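    # evaluate() returns (metrics dict, confusion matrix); unpack both
    metrics, cm = evaluate(y_test, preds)
    plot_confusion_matrix(cm, "BERT + Logistic Regression")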
    return metrics, (clf, tokenizer, bert_model)

# Run training
metrics_bert, bert_obj = train_bert(train_texts, train_labels, test_texts, test_labels)
print("BERT metrics:", metrics_bert)


## Compare Models

In [None]:
results = pd.DataFrame([
    {"Model": "TF-IDF + LR", **metrics_tfidf},
    {"Model": "Word2Vec + LR", **metrics_w2v},
    {"Model": "BERT + LR", **metrics_bert}
])
results


## Function: Get Prediction with Probability
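
The "probability" reported by the function below is `np.max(clf.predict_proba(X))`, that is, the probability of whichever class was predicted, so for a binary classifier it is always at least 50%. A minimal sketch of how a binary logistic regression arrives at that number, with hypothetical weights and features chosen only for illustration (`predict_with_confidence` is not part of the notebook):

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_with_confidence(weights, bias, features):
    """Return (label, confidence) for a binary logistic regression model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p_positive = sigmoid(z)
    label = 1 if p_positive > 0.5 else 0
    confidence = max(p_positive, 1.0 - p_positive)  # probability of the predicted class
    return label, confidence

# Hypothetical weights and feature values, chosen only for illustration
toy_label, toy_confidence = predict_with_confidence([2.0, -1.0], 0.0, [0.5, 2.0])
print(toy_label, round(toy_confidence, 4))
```

A confidence near 50% therefore signals an uncertain prediction regardless of which label is shown.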

In [None]:
import numpy as np
import torch
import pandas as pd

# MODELS dictionary
MODELS = {
    "tfidf": tfidf_obj,
    "word2vec": w2v_obj,
    "bert": bert_obj
}

def get_prediction_with_proba(text: str) -> pd.DataFrame:
    """
    Predict sentiment for a single text using all three models:
    TF-IDF, Word2Vec, BERT.
    Returns: pd.DataFrame with columns ['model', 'prediction', 'probability']
    """
    results = []

    for model_name, model_obj in MODELS.items():
        try:
            if model_name == "tfidf":
                clf, tfidf = model_obj
                X = tfidf.transform([text])
                pred = clf.predict(X)[0]
                proba = float(np.max(clf.predict_proba(X)))

            elif model_name == "word2vec":
                clf, w2v = model_obj
                tokens = tokenize_for_w2v(text)
                vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
                X = (
                    np.mean(vecs, axis=0).reshape(1, -1)
                    if vecs else np.zeros((1, w2v.vector_size))
                )
                pred = clf.predict(X)[0]
                proba = float(np.max(clf.predict_proba(X)))

            elif model_name == "bert":
                clf, tokenizer, bert_model = model_obj
                bert_model.eval()

                enc = tokenizer(
                    [clean_text(text)],
                    truncation=True,
                    padding=True,
                    max_length=512,
                    return_tensors="pt"
                ).to(bert_model.device)

                with torch.no_grad():
                    out = bert_model(**enc).last_hidden_state
                    mask = enc["attention_mask"].unsqueeze(-1).expand(out.shape).float()
                    pooled = (out * mask).sum(1) / mask.sum(1)
                    X = pooled.cpu().numpy()

                pred = clf.predict(X)[0]
                proba = float(np.max(clf.predict_proba(X)))

            else:
                raise ValueError(f"Unsupported model: {model_name}")

            results.append({
                "model": model_name,
                "prediction": "positive" if pred == 1 else "negative",
                "probability": f"{proba * 100:.2f}%"
            })

        except Exception as e:
            # Report the failure for this model and continue with the others
            print(f"[{model_name}] prediction failed: {e}")
            results.append({
                "model": model_name,
                "prediction": None,
                "probability": None
            })

    return pd.DataFrame(results)

## Predict Using Three Models

In [None]:
df = get_prediction_with_proba("One reviews on here gave this 1/10 because he said it was the same exact concept as a number of other movies he listed where children go on a rampage killing everyone in a town. Did this person even bother to watch this or was he on his phone the entire time? That isn't even remotely close to this plot. Another person gave the same rating because it didn't scare him enough. You're supposed to rate films based on things like story, characters, cinematography, etc.")
print(df)



## Conclusion

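- **Best Model**: BERT (DistilBERT embeddings) achieved the highest accuracy, followed by TF-IDF, then Word2Vec. Note that the BERT classifier was trained on a 5,000-review subsample and scored on 2,000 held-out reviews, while the other two used the full split.  
- **Why BERT works better**: It produces contextual embeddings, so the same word gets a different vector depending on its neighbors. TF-IDF is a bag-of-words representation, and Word2Vec assigns each word a single static vector.  
- **Training Time**: BERT was the slowest and required the most memory; TF-IDF was the fastest.  
- **Errors**: TF-IDF and Word2Vec often failed on sarcastic or long reviews. BERT handled them better but still misclassified ambiguous sentences.  
- **Trade-offs**:  
  - TF-IDF: Fast and interpretable, but weaker accuracy.  
  - Word2Vec: Captures semantic similarity, but averaging word vectors discards word order and context.  
  - BERT: Best accuracy, but computationally expensive.  
- **Generalization**: BERT generalizes better to unseen words because subword tokenization splits an out-of-vocabulary word into known pieces, whereas TF-IDF and Word2Vec simply drop words they did not see during training.  
- **Recommendation**: For production where speed matters, TF-IDF or Word2Vec may suffice. For research or when accuracy matters most, BERT is the best choice.  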
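
The generalization point above can be illustrated with a toy version of the greedy longest-match-first subword splitting used by BERT-style WordPiece tokenizers. This is a simplified sketch under stated assumptions: the function `wordpiece_tokenize` and the five-entry vocabulary are hypothetical, and a real tokenizer also handles casing, punctuation, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_subword_vocab = {"un", "watch", "##watch", "##able", "movie"}

print(wordpiece_tokenize("unwatchable", toy_subword_vocab))
print(wordpiece_tokenize("movie", toy_subword_vocab))
print(wordpiece_tokenize("zzz", toy_subword_vocab))
```

A word such as "unwatchable" that never appears whole in the vocabulary is still represented through pieces the model has seen, whereas the TF-IDF and Word2Vec pipelines in this notebook would skip it entirely.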