# Week 8 — Final Sentiment Analysis Model (Binary)

**Thesis Context**: This notebook reproduces and improves the end‑to‑end experiment for sentiment analysis of Amazon reviews, with a specific focus on **time‑of‑day and negativity**.

**Research Questions**:
- **RQ1**: Does time‑of‑day relate to negativity patterns in reviews?
- **RQ2**: Do engineered time‑based features improve sentiment prediction beyond text‑only models?

**Classification (Binary)**:
- Negative (0): rating ≤ 2
- Positive (1): rating ≥ 3
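
As a minimal sketch of this mapping, assuming ratings on a 1 to 5 star scale (the helper name `rating_to_label` is illustrative and not part of the notebook):

```python
def rating_to_label(rating: float) -> int:
    """Map a star rating to the binary sentiment label described above."""
    # Ratings of 2 stars or fewer are Negative (0); everything else is Positive (1).
    return 0 if rating <= 2 else 1


labels = [rating_to_label(r) for r in [1, 2, 3, 4, 5]]
print(labels)
```

Under this scheme, 3‑star reviews count as Positive rather than forming a separate neutral class.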

**Seed Requirement**:
This notebook runs the full pipeline **twice**:
- Run A: seed = 319302
- Run B: random 6‑digit seed generated at runtime

**Note**: This notebook **does not use Unsloth**.


## Colab Setup (optional Google Drive)
If you want to store outputs in Google Drive, uncomment and run the cell below, then set `USE_DRIVE = True` in the configuration cell.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')


In [None]:
# === 1) Install + Imports (single cell) ===
import os
import sys
import json
import subprocess

# Disable W&B by default
os.environ["WANDB_DISABLED"] = "true"

# Install required packages (single clean cell)
packages = [
    "pandas>=2.0.0",
    "numpy>=1.24.0",
    "matplotlib>=3.7.0",
    "seaborn>=0.12.0",
    "scikit-learn>=1.3.0",
    "transformers>=4.40.0",
    "datasets>=2.18.0",
    "evaluate>=0.4.1",
    "accelerate>=0.20.0",
    "pyarrow>=10.0.0",
]
for pkg in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from datetime import datetime
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    classification_report, confusion_matrix
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack, csr_matrix
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# GPU check
if torch.cuda.is_available():
    print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️  No GPU detected. DistilBERT training on CPU will be slow; keep FAST_RUN = True.")


In [None]:
# === 2) Configuration ===
DATA_PATH = "/content/Amazon_Data.csv"  # set your path
FILE_TYPE = "auto"  # "csv", "parquet", "jsonl", or "auto"
TEXT_COL = None     # set if your column names differ
RATING_COL = None   # set if your column names differ
TIME_COL = None     # set if your column names differ

STUDENT_SEED = 319302
FAST_RUN = True
SAMPLE_FOR_TEXT = 100000  # for heavy text models; set None to use full train

MAX_SEQ_LEN_BERT = 256
BERT_EPOCHS = 1 if FAST_RUN else 3
BERT_BATCH = 16

USE_DRIVE = False
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/GRAD699/Week8/"

# Output folders
OUTPUT_DIR = DRIVE_OUTPUT_DIR if USE_DRIVE else "outputs"
FIGURES_DIR = os.path.join(OUTPUT_DIR, "figures")
MODELS_DIR = os.path.join(OUTPUT_DIR, "models")
TABLES_DIR = os.path.join(OUTPUT_DIR, "tables")
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(FIGURES_DIR, exist_ok=True)
os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(TABLES_DIR, exist_ok=True)

label_map = {0: "Negative", 1: "Positive"}

print("✓ Config loaded")


## Utilities

In [None]:
def set_global_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

def parse_timestamp(series: pd.Series) -> pd.Series:
    if pd.api.types.is_numeric_dtype(series):
        max_val = series.max()
        if max_val > 1e12:
            return pd.to_datetime(series, errors="coerce", unit="ms")
        if max_val > 1e9:
            return pd.to_datetime(series, errors="coerce", unit="s")
    return pd.to_datetime(series, errors="coerce")

def daypart_from_hour(hour: int) -> str:
    if 0 <= hour <= 4:
        return "late_night"
    if 5 <= hour <= 11:
        return "morning"
    if 12 <= hour <= 16:
        return "afternoon"
    if 17 <= hour <= 20:
        return "evening"
    return "night"

def load_data(path: str, file_type: str) -> pd.DataFrame:
    ftype = file_type.lower() if isinstance(file_type, str) else "auto"
    if ftype == "auto":
        if path.endswith(".parquet"):
            ftype = "parquet"
        elif path.endswith(".jsonl"):
            ftype = "jsonl"
        else:
            ftype = "csv"

    if ftype == "csv":
        return pd.read_csv(path)
    if ftype == "parquet":
        return pd.read_parquet(path)
    if ftype == "jsonl":
        return pd.read_json(path, lines=True)
    raise ValueError("FILE_TYPE must be one of: csv, parquet, jsonl, auto")

def find_column(df, candidates, override=None):
    if override:
        if override in df.columns:
            return override
        raise ValueError(f"Column '{override}' not found in dataset.")
    for c in candidates:
        if c in df.columns:
            return c
    return None

def chronological_split(df, train_ratio=0.8, val_ratio=0.1):
    n = len(df)
    n_train = int(train_ratio * n)
    n_val = int(val_ratio * n)
    train = df.iloc[:n_train].copy()
    val = df.iloc[n_train:n_train + n_val].copy()
    test = df.iloc[n_train + n_val:].copy()
    return train, val, test

def eval_binary_metrics(y_true, y_pred, y_proba=None):
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    if y_proba is not None:
        try:
            metrics["roc_auc"] = roc_auc_score(y_true, y_proba)
        except ValueError:
            # ROC-AUC is undefined when y_true contains only one class
            metrics["roc_auc"] = float("nan")
    else:
        metrics["roc_auc"] = float("nan")
    return metrics

def build_time_features(df):
    df = df.copy()
    df["hour"] = df["timestamp"].dt.hour
    df["day_of_week"] = df["timestamp"].dt.dayofweek
    df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
    df["daypart"] = df["hour"].apply(daypart_from_hour)
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    return df

def build_time_matrix(train_df, val_df, test_df):
    num_features = ["hour_sin", "hour_cos", "is_weekend"]
    cat_features = ["day_of_week", "daypart"]
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed
    enc = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
    enc.fit(train_df[cat_features])
    train_cat = enc.transform(train_df[cat_features])
    val_cat = enc.transform(val_df[cat_features])
    test_cat = enc.transform(test_df[cat_features])
    train_num = csr_matrix(train_df[num_features].values)
    val_num = csr_matrix(val_df[num_features].values)
    test_num = csr_matrix(test_df[num_features].values)
    return (hstack([train_num, train_cat]),
            hstack([val_num, val_cat]),
            hstack([test_num, test_cat]))


## 3) Load data (raw showcase)

In [None]:
df_raw = load_data(DATA_PATH, FILE_TYPE)

text_col = find_column(df_raw, ["text", "review", "reviewText"], TEXT_COL)
rating_col = find_column(df_raw, ["rating", "stars", "overall"], RATING_COL)
time_col = find_column(df_raw, ["timestamp", "reviewTime", "time"], TIME_COL)

if text_col is None or rating_col is None or time_col is None:
    raise ValueError("Missing required columns. Please set TEXT_COL, RATING_COL, TIME_COL.")

df_raw = df_raw.rename(columns={text_col: "text", rating_col: "rating", time_col: "timestamp"})

print("Shape:", df_raw.shape)
print("Columns:", df_raw.columns.tolist())
print("Dtypes:")
print(df_raw.dtypes)

print("\nSample rows:")
display(df_raw[["timestamp", "rating", "text"]].head(20))

print("\nMissingness summary:")
print(df_raw[["text", "rating", "timestamp"]].isna().sum())

plt.figure(figsize=(6, 4))
df_raw["rating"].value_counts().sort_index().plot(kind="bar")
plt.title("Raw Rating Distribution")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, "raw_rating_distribution.png"), dpi=150)
plt.show()


## 4) Clean + preprocess + label

In [None]:
df = df_raw.dropna(subset=["text", "rating", "timestamp"]).copy()
df["timestamp"] = parse_timestamp(df["timestamp"])
df = df.dropna(subset=["timestamp"]).copy()
df["text"] = df["text"].astype(str).str.strip()
df = df[df["text"].str.len() > 0].copy()

# Binary label mapping
df["label"] = df["rating"].apply(lambda r: 0 if r <= 2 else 1)
df["label_name"] = df["label"].map(label_map)

# Basic text features
df["review_len"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()

# Sort chronologically
df = df.sort_values("timestamp").reset_index(drop=True)

print("Cleaned shape:", df.shape)
print(df[["label_name"]].value_counts())


## 5) Feature engineering (time-of-day)
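
Among the features built in this step, the hour is encoded as a sine/cosine pair so that 23:00 and 00:00 end up close together instead of 23 units apart. A small illustrative sketch using only the standard library (the helper names `encode_hour` and `distance` are assumptions for this example, not notebook functions):

```python
import math


def encode_hour(hour: int) -> tuple[float, float]:
    """Project an hour (0-23) onto the unit circle, as hour_sin / hour_cos do."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def distance(h1: int, h2: int) -> float:
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(h1), encode_hour(h2))


# Compare neighbouring hours across midnight with hours half a day apart.
print(round(distance(23, 0), 3), round(distance(0, 1), 3), round(distance(0, 12), 3))
```

In this encoding, midnight is exactly as close to 23:00 as it is to 01:00, while noon is the farthest point from midnight.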

In [None]:
df = build_time_features(df)
print("✓ Time features created")


## 6) Chronological split (train/val/test)
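
The split is positional on data already sorted by timestamp: the earliest 80% of reviews go to train, the next 10% to validation, and the remainder to test. A dependency‑free sketch of the same slicing logic on a toy list (the function name `split_sorted` is hypothetical):

```python
def split_sorted(items: list, train_ratio: float = 0.8, val_ratio: float = 0.1):
    """Slice an already time-sorted sequence into train/val/test without shuffling."""
    n_train = int(train_ratio * len(items))
    n_val = int(val_ratio * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains, including rounding leftovers
    return train, val, test


timestamps = list(range(25))  # stand-in for 25 sorted review timestamps
train, val, test = split_sorted(timestamps)
print(len(train), len(val), len(test))
```

Because `int()` truncates, any rows left over from rounding end up in the test set.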

In [None]:
df_train, df_val, df_test = chronological_split(df, train_ratio=0.8, val_ratio=0.1)

print("Train range:", df_train["timestamp"].min(), "to", df_train["timestamp"].max())
print("Val range:", df_val["timestamp"].min(), "to", df_val["timestamp"].max())
print("Test range:", df_test["timestamp"].min(), "to", df_test["timestamp"].max())

assert df_train["timestamp"].max() <= df_val["timestamp"].min(), "Train/Val overlap"
assert df_val["timestamp"].max() <= df_test["timestamp"].min(), "Val/Test overlap"

print("Sizes:", len(df_train), len(df_val), len(df_test))
print("Train label distribution:\n", df_train["label_name"].value_counts())
print("Val label distribution:\n", df_val["label_name"].value_counts())
print("Test label distribution:\n", df_test["label_name"].value_counts())


## 7) RQ1 Visualizations (negativity vs time)
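
The negativity rate of a group is the share of its reviews with label 0. A minimal sketch of that aggregation in plain Python, on made‑up `(hour, label)` pairs that are purely illustrative and not drawn from the dataset:

```python
from collections import defaultdict

# Toy (hour, label) pairs; 0 = Negative, 1 = Positive. Values are invented.
reviews = [(2, 0), (2, 0), (2, 1), (14, 1), (14, 1), (14, 0), (14, 1)]

counts = defaultdict(lambda: [0, 0])  # hour -> [negative count, total count]
for hour, label in reviews:
    counts[hour][0] += int(label == 0)
    counts[hour][1] += 1

neg_rate = {hour: neg / total for hour, (neg, total) in counts.items()}
print(neg_rate)
```

Rates for hours with few reviews are noisy, so it helps to read each rate next to the review volume for the same hour.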

In [None]:
# Negativity rate by hour
neg_by_hour = df.groupby("hour")["label"].apply(lambda x: (x == 0).mean()).reset_index(name="neg_rate")
vol_by_hour = df.groupby("hour").size().reset_index(name="count")

plt.figure(figsize=(10, 4))
plt.plot(neg_by_hour["hour"], neg_by_hour["neg_rate"], marker="o")
plt.title("Negativity Rate by Hour")
plt.xlabel("Hour")
plt.ylabel("Negativity Rate")
plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, "negativity_by_hour.png"), dpi=150)
plt.show()

plt.figure(figsize=(10, 4))
plt.bar(vol_by_hour["hour"], vol_by_hour["count"], color="steelblue")
plt.title("Review Volume by Hour")
plt.xlabel("Hour")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, "volume_by_hour.png"), dpi=150)
plt.show()

# Negativity rate by daypart
neg_by_daypart = df.groupby("daypart")["label"].apply(lambda x: (x == 0).mean()).reset_index(name="neg_rate")
vol_by_daypart = df.groupby("daypart").size().reset_index(name="count")

plt.figure(figsize=(8, 4))
sns.barplot(x="daypart", y="neg_rate", data=neg_by_daypart, color="salmon")
plt.title("Negativity Rate by Daypart")
plt.ylabel("Negativity Rate")
plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, "negativity_by_daypart.png"), dpi=150)
plt.show()

plt.figure(figsize=(8, 4))
sns.barplot(x="daypart", y="count", data=vol_by_daypart, color="steelblue")
plt.title("Review Volume by Daypart")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, "volume_by_daypart.png"), dpi=150)
plt.show()

# Heatmap: negativity rate by hour x day_of_week
heat = df.groupby(["hour", "day_of_week"])["label"].apply(lambda x: (x == 0).mean()).reset_index()
heat_pivot = heat.pivot(index="hour", columns="day_of_week", values="label")

plt.figure(figsize=(8, 6))
sns.heatmap(heat_pivot, cmap="Reds")
plt.title("Negativity Rate Heatmap (Hour x Day of Week)")
plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, "negativity_heatmap.png"), dpi=150)
plt.show()


## 8) Run the full pipeline twice (Run A + Run B)

In [None]:
run_b_seed = random.SystemRandom().randint(100000, 999999)
print("Run A seed:", STUDENT_SEED)
print("Run B seed:", run_b_seed)

metrics_rows = []
run_summaries = []

def run_models(seed: int, run_name: str):
    set_global_seed(seed)

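    # Subsample for text models (train only). head() keeps the earliest
    # SAMPLE_FOR_TEXT rows, so this step is deterministic and seed-independent.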
    train_text = df_train
    if SAMPLE_FOR_TEXT is not None and len(df_train) > SAMPLE_FOR_TEXT:
        train_text = df_train.head(SAMPLE_FOR_TEXT).copy()

    # Baseline 1: TF-IDF text only
    tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), min_df=2)
    X_train = tfidf.fit_transform(train_text["text"].values)
    X_val = tfidf.transform(df_val["text"].values)
    X_test = tfidf.transform(df_test["text"].values)

    y_train = train_text["label"].values
    y_val = df_val["label"].values
    y_test = df_test["label"].values

    clf_text = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf_text.fit(X_train, y_train)

    test_pred = clf_text.predict(X_test)
    test_proba = clf_text.predict_proba(X_test)[:, 1]
    m1 = eval_binary_metrics(y_test, test_pred, test_proba)

    metrics_rows.append({"run": run_name, "model": "Baseline_TFIDF_Text", **m1})

    # Baseline 2: TF-IDF + time features
    X_train_time, X_val_time, X_test_time = build_time_matrix(train_text, df_val, df_test)
    X_train_combined = hstack([X_train, X_train_time])
    X_test_combined = hstack([X_test, X_test_time])

    clf_time = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf_time.fit(X_train_combined, y_train)

    test_pred2 = clf_time.predict(X_test_combined)
    test_proba2 = clf_time.predict_proba(X_test_combined)[:, 1]
    m2 = eval_binary_metrics(y_test, test_pred2, test_proba2)

    metrics_rows.append({"run": run_name, "model": "Baseline_TFIDF_Time", **m2})

    # DistilBERT
    bert_train = train_text
    bert_val = df_val
    bert_test = df_test
    if FAST_RUN:
        bert_train = bert_train.head(min(20000, len(bert_train)))
        bert_val = bert_val.head(min(5000, len(bert_val)))
        bert_test = bert_test.head(min(5000, len(bert_test)))

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

    def tokenize_fn(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=MAX_SEQ_LEN_BERT)

    train_ds = Dataset.from_pandas(bert_train[["text", "label"]])
    val_ds = Dataset.from_pandas(bert_val[["text", "label"]])
    test_ds = Dataset.from_pandas(bert_test[["text", "label"]])

    train_ds = train_ds.map(tokenize_fn, batched=True)
    val_ds = val_ds.map(tokenize_fn, batched=True)
    test_ds = test_ds.map(tokenize_fn, batched=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        probs = torch.softmax(torch.tensor(logits), dim=1).numpy()[:, 1]
        preds = np.argmax(logits, axis=1)
        return eval_binary_metrics(labels, preds, probs)

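    # `evaluation_strategy` was renamed to `eval_strategy` in newer transformers
    # releases; use whichever name the installed version accepts.
    eval_key = (
        "eval_strategy"
        if "eval_strategy" in TrainingArguments.__dataclass_fields__
        else "evaluation_strategy"
    )
    training_args = TrainingArguments(
        output_dir=os.path.join(MODELS_DIR, f"distilbert_{run_name}"),
        num_train_epochs=BERT_EPOCHS,
        per_device_train_batch_size=BERT_BATCH,
        per_device_eval_batch_size=BERT_BATCH,
        save_strategy="epoch",
        logging_steps=50,
        seed=seed,
        data_seed=seed,
        fp16=torch.cuda.is_available(),
        report_to="none",
        **{eval_key: "epoch"},
    )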

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    test_output = trainer.predict(test_ds)
    logits = test_output.predictions
    probs = torch.softmax(torch.tensor(logits), dim=1).numpy()[:, 1]
    preds = np.argmax(logits, axis=1)

    m3 = eval_binary_metrics(bert_test["label"].values, preds, probs)

    # Confusion matrix for the transformer model (DistilBERT)
    cm_bert = confusion_matrix(bert_test["label"].values, preds)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm_bert, annot=True, fmt="d", cmap="Blues",
                xticklabels=["Negative", "Positive"],
                yticklabels=["Negative", "Positive"])
    plt.title(f"Confusion Matrix - DistilBERT ({run_name})")
    plt.ylabel("True")
    plt.xlabel("Predicted")
    plt.tight_layout()
    plt.savefig(os.path.join(FIGURES_DIR, f"cm_distilbert_{run_name}.png"), dpi=150)
    plt.show()

    metrics_rows.append({"run": run_name, "model": "DistilBERT", **m3})

    # Save prediction examples from Run A
    if run_name == "Run_A":
        pred_examples = pd.DataFrame({
            "text": bert_test["text"].values,
            "gold_label": bert_test["label"].map(label_map),
            "pred_label": [label_map[int(p)] for p in preds],
        })
        pred_examples.to_csv(os.path.join(TABLES_DIR, "pred_examples.csv"), index=False)

    return m1, m2, m3

run_models(STUDENT_SEED, "Run_A")
run_models(run_b_seed, "Run_B")


## 9) Results + Comparison Tables

In [None]:
metrics_df = pd.DataFrame(metrics_rows)
display(metrics_df)

# Save metrics
metrics_df.to_csv(os.path.join(OUTPUT_DIR, "metrics.csv"), index=False)
with open(os.path.join(OUTPUT_DIR, "metrics.json"), "w") as f:
    json.dump(metrics_rows, f, indent=2)

# Run A vs Run B comparison table
comparison = metrics_df.pivot_table(index="model", columns="run", values=["accuracy", "precision", "recall", "f1", "roc_auc"])
display(comparison)

# RQ1 aggregates (stable across runs)
max_hour = int(neg_by_hour.loc[neg_by_hour["neg_rate"].idxmax(), "hour"])
max_daypart = neg_by_daypart.loc[neg_by_daypart["neg_rate"].idxmax(), "daypart"]
late_night_rate = float(neg_by_daypart[neg_by_daypart["daypart"] == "late_night"]["neg_rate"].values[0])
overall_neg_rate = float((df["label"] == 0).mean())
pct_diff = (late_night_rate - overall_neg_rate) * 100

summary_text = f"""
### RQ1 Summary
- Highest negativity hour: {max_hour}
- Highest negativity daypart: {max_daypart}
- Late‑night negativity rate: {late_night_rate:.3f}
- Overall negativity rate: {overall_neg_rate:.3f}
- Late‑night vs overall difference: {pct_diff:.2f} percentage points
"""
print(summary_text)


## 10) Interpretation (RQ1 + RQ2)

**RQ1**: The hour/daypart aggregates are stable across runs because they are computed from the full dataset with deterministic grouping (no randomness). The highest‑negativity hour and daypart are therefore identical in both runs.

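**RQ2**: Model metrics can differ between runs, but not equally for every model. The two TF‑IDF + logistic regression baselines use a fixed chronological split and scikit‑learn's default `lbfgs` solver, which involves no randomness, so their metrics should match across Run A and Run B up to floating‑point noise. DistilBERT is seed‑sensitive: classification‑head initialization, batch shuffling, dropout, and GPU nondeterminism all affect training, so its accuracy, F1, and ROC‑AUC can shift between seeds even with identical data and hyperparameters.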
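
The effect of seeding can be illustrated without any training. In the sketch below, Python's `random` module stands in for weight initialization and batch shuffling; `simulate_init` is a hypothetical helper, not part of the pipeline:

```python
import random


def simulate_init(seed: int, n: int = 3) -> list[float]:
    """Draw n pseudo-random 'weights' from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


same_a = simulate_init(319302)
same_b = simulate_init(319302)
other = simulate_init(123456)

print(same_a == same_b)  # compares two draws made with the same seed
print(same_a == other)   # compares draws made with different seeds
```

Reusing a seed reproduces the same draws, while a new seed produces a different starting point, which is the source of run‑to‑run variation in a trained model.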

Use the comparison tables above to document whether time‑based features improved performance (Baseline 2 vs Baseline 1) and whether DistilBERT outperforms the classical baselines.
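
One way to read the comparison is to compute, for each run, the metric difference between the two baselines. The sketch below assumes rows shaped like the entries of `metrics_rows`; the F1 values are placeholders invented for illustration, not results from this notebook, and `f1_delta` is a hypothetical helper:

```python
# Placeholder rows mimicking metrics_rows; the numbers are NOT real results.
rows = [
    {"run": "Run_A", "model": "Baseline_TFIDF_Text", "f1": 0.90},
    {"run": "Run_A", "model": "Baseline_TFIDF_Time", "f1": 0.91},
    {"run": "Run_B", "model": "Baseline_TFIDF_Text", "f1": 0.90},
    {"run": "Run_B", "model": "Baseline_TFIDF_Time", "f1": 0.91},
]


def f1_delta(rows: list[dict], run: str) -> float:
    """Return F1(text + time) minus F1(text only) for one run."""
    by_model = {r["model"]: r["f1"] for r in rows if r["run"] == run}
    return by_model["Baseline_TFIDF_Time"] - by_model["Baseline_TFIDF_Text"]


for run in ("Run_A", "Run_B"):
    print(run, round(f1_delta(rows, run), 4))
```

A positive delta favors the time‑augmented model; a delta near zero suggests the time features add little beyond the text.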


## 11) Publish to GitHub

**Option 1 (Recommended: Manual)**
```bash
git status
git add "Week 8/Final Sentiment Analysis Model.ipynb"
git commit -m "Add Week 8 final sentiment analysis notebook"
git push
```

**Option 2 (From Colab)**
```bash
git clone https://github.com/<your-username>/<your-repo>.git
cd <your-repo>
cp "/content/Final Sentiment Analysis Model.ipynb" .
git add "Final Sentiment Analysis Model.ipynb"
git commit -m "Add Week 8 final sentiment analysis notebook"
git push https://<YOUR_TOKEN>@github.com/<your-username>/<your-repo>.git
```
Use a GitHub Personal Access Token (PAT) for authentication. Do not hardcode tokens in the notebook.
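
One way to avoid hardcoding the token is to read it from an environment variable at runtime. A minimal sketch, assuming the token is stored in a variable whose name you choose (here `DEMO_TOKEN`, set to a dummy value for the example) and using a hypothetical helper `build_push_url`:

```python
import os


def build_push_url(username: str, repo: str, token_var: str = "GITHUB_TOKEN") -> str:
    """Build an authenticated HTTPS remote URL from a token held in the environment."""
    token = os.environ.get(token_var)
    if not token:
        raise RuntimeError(f"Set the {token_var} environment variable first.")
    return f"https://{token}@github.com/{username}/{repo}.git"


# Dummy value for demonstration only; set the real variable outside the notebook.
os.environ["DEMO_TOKEN"] = "dummy-token"
url = build_push_url("your-username", "your-repo", token_var="DEMO_TOKEN")
print(url)
```

With a real token, avoid printing the URL, since cell outputs are saved with the notebook.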

In [None]:
# === 12) Zip outputs + download ===
import shutil
zip_path = shutil.make_archive("outputs_week8", "zip", OUTPUT_DIR)
print("Created:", zip_path)
print("Size (MB):", os.path.getsize(zip_path) / (1024 * 1024))


In [None]:
# Download in Colab
from google.colab import files
files.download("outputs_week8.zip")
