
# Twitter Toxicity Detection Project

> This notebook documents the end-to-end Twitter toxicity detection project, starting with deterministic data preparation and culminating in deployable artifacts. It begins by cleaning and splitting the Twitter tweet corpus, persisting split IDs so every experiment baseline or advanced evaluates on identical examples. Classic baselines (TF-IDF + Logistic Regression) anchor performance expectations before the workflow escalates to transformer fine-tuning with BERT-base-uncased.
>
> The transformer track covers tokenization, automated hyperparameter tuning (grid and random search), and early stopping, ultimately selecting the best checkpoint via validation macro-F1. Detailed evaluation follows: confusion matrices, per-class reports, ROC-AUC and PR-AUC curves, and exportable tables compare validation/test splits, while metric logs capture training dynamics.
>
> The project implements a dual-model approach: a baseline TF-IDF + Logistic Regression model for interpretability and computational efficiency, and a fine-tuned BERT-base-uncased model for superior accuracy and context-aware understanding of informal language, sarcasm, and subtle toxicity patterns in Twitter posts.
>
> Final cells translate the strongest transformer into deployment formats, exporting quantized PyTorch weights, consolidated logs, and spreadsheet summaries. Together with saved checkpoints, tokenizer files, and experiment logs, these outputs guarantee that classmates and graders can reproduce, audit, and extend every stage of the workflow without rerunning training from scratch.



# **Setup, imports, dataset load, and split**


**`Purpose`**

This block prepares the environment, ensures required libraries are available, and loads the Twitter Comment Dataset into memory in a clean, consistent format. It also establishes a reproducible 70/15/15 train-validation-test split so that all later experiments evaluate on the same examples. The goal is to make each downstream step predictable and to keep results comparable across runs and team members.

**`Input`**

The cell expects either a local copy of TwitterToxicity.csv in the current working directory or, if absent, a file that will be provided through the upload dialog. The CSV must contain at least two columns named review and label, which represent the input text and its sentiment class. Labels should be in the format: -1 (negative/toxic), 0 (neutral), 1 (positive). No other inputs are required at this stage, and any additional columns are ignored.

**`Output`**

The cell produces three pandas DataFrames, train_df, val_df, and test_df, with stratified class proportions and a new id column to uniquely identify each row. It also writes three small files, train_ids.csv, val_ids.csv, and test_ids.csv, which store the chosen row IDs for reuse. The printed device line indicates whether a GPU is available. Three proportion tables are printed as a quick check that label ratios are closely matched across splits.

**`Details`**

The cell installs the core NLP stack, imports common utilities, and detects the runtime device. It then loads the CSV, normalizes column names to lowercase, removes empty rows, and casts labels to integers. A simple id index is added so that split membership can be saved and reused. A stratified split holds label balance constant, which is printed to confirm the split is fair. Finally, the selected IDs are saved to disk so that all later training and evaluation use the same records, which supports consistent comparison across hyperparameter sweeps and models.

**`Line-by-line Description.`**

`!pip -q install transformers datasets accelerate scikit-learn openpyxl optuna -U` installs or upgrades the libraries needed for tokenization, training, metrics, hyperparameter tuning, and spreadsheet export.

`import os, numpy as np, pandas as pd, torch` pulls in filesystem helpers, numerical tools, data frames, and the deep learning backend.

`from sklearn.model_selection import train_test_split` and `from sklearn.metrics import accuracy_score, f1_score` load utilities for splitting and scoring.
`try: from google.colab import files ...` sets up an optional upload path that only activates when running in Colab.

`device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')` detects whether a GPU is present and prints the choice so training expectations are clear.

`if not os.path.exists('TwitterToxicity.csv') and files is not None: files.upload()` requests an upload when the CSV is missing, which keeps the workflow flexible.

`df = pd.read_csv('TwitterToxicity.csv')` loads the data, and `df.columns = [c.lower() for c in df.columns]` enforces lowercase names so downstream code can assume consistent headers.

`df = df.dropna(subset=['review','label']).copy()` removes incomplete rows to avoid errors and noisy training examples.
`df['label'] = df['label'].astype(int)` fixes the label type so models receive proper integers.

`df['id'] = np.arange(len(df))` assigns a stable identifier to each row so the split can be persisted.

The branch that checks for `train_ids.csv`, `val_ids.csv`, and `test_ids.csv` either reuses an existing split or creates a new stratified split with 70/15/15 ratio using `train_test_split(... stratify=df['label'])`.

`train_df[['id']].to_csv('train_ids.csv', index=False)` and the matching lines for validation and test serialize the split for later reuse.

The final `print(...)` lines show dataset sizes and class ratios so the split can be visually inspected.


In [1]:
import subprocess
import sys

packages_map = [
    ("torch", "torch"),
    ("transformers", "transformers"),
    ("datasets", "datasets"),
    ("accelerate", "accelerate"),
    ("optuna", "optuna"),
    ("scikit-learn", "sklearn"),
    ("pandas", "pandas"),
    ("numpy", "numpy"),
    ("matplotlib", "matplotlib"),
    ("openpyxl", "openpyxl"),
]

for pip_name, import_name in packages_map:
    try:
        __import__(import_name)
        print(f"✓ {pip_name} already installed")
    except ImportError:
        print(f"Installing {pip_name}...")
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name],
                                stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
            print(f"✓ {pip_name} installed successfully")
        except subprocess.CalledProcessError as e:
            print(f"⚠ Warning: Could not install {pip_name}. You may need to install it manually.")

import os, numpy as np, pandas as pd, torch, json, inspect, optuna
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import optuna

try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    files = None

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

if not os.path.exists("TwitterToxicity.csv"):
    if IN_COLAB and files is not None:
        uploaded = files.upload()
    else:
        raise FileNotFoundError("TwitterToxicity.csv not found. Please run export_dataset.py first or ensure the file is in the current directory.")
df = pd.read_csv("TwitterToxicity.csv")

df = df.rename(columns={c:c.lower() for c in df.columns})
assert {'review','label'} <= set(df.columns), "CSV must have 'review' and 'label' columns."

df = df.dropna(subset=['review','label']).copy()
df['label'] = df['label'].astype(int)
df['id'] = np.arange(len(df))

# 70/15/15 split (per proposal Section V.B: training 70%, validation 15%, testing 15%)
RANDOM_SEED = 42
if not {'train_ids.csv','val_ids.csv','test_ids.csv'} <= set(os.listdir('.')):
    df_train, df_temp = train_test_split(
        df, test_size=0.3, random_state=RANDOM_SEED, stratify=df['label']
    )
    df_val, df_test = train_test_split(
        df_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=df_temp['label']
    )
    df_train[['id']].to_csv('train_ids.csv', index=False)
    df_val[['id']].to_csv('val_ids.csv', index=False)
    df_test[['id']].to_csv('test_ids.csv', index=False)
else:
    ids_tr = set(pd.read_csv('train_ids.csv')['id'].tolist())
    ids_va = set(pd.read_csv('val_ids.csv')['id'].tolist())
    ids_te = set(pd.read_csv('test_ids.csv')['id'].tolist())
    df_train = df[df['id'].isin(ids_tr)].copy()
    df_val   = df[df['id'].isin(ids_va)].copy()
    df_test  = df[df['id'].isin(ids_te)].copy()

print(f"Dataset sizes: train={len(df_train)}, val={len(df_val)}, test={len(df_test)}")
print("\nLabel ratios (train):", df_train['label'].value_counts(normalize=True).sort_index().to_dict())
print("Label ratios (val):  ", df_val['label'].value_counts(normalize=True).sort_index().to_dict())
print("Label ratios (test): ", df_test['label'].value_counts(normalize=True).sort_index().to_dict())


✓ torch already installed
✓ transformers already installed
✓ datasets already installed
✓ accelerate already installed
Installing optuna...
✓ optuna installed successfully
✓ scikit-learn already installed
✓ pandas already installed
✓ numpy already installed
✓ matplotlib already installed
✓ openpyxl already installed
Using device: cuda


Saving TwitterToxicity.csv to TwitterToxicity.csv
Dataset sizes: train=21114, val=4525, test=4525

Label ratios (train): {-1: 0.2867765463673392, 0: 0.40025575447570333, 1: 0.3129676991569575}
Label ratios (val):   {-1: 0.28685082872928175, 0: 0.40022099447513815, 1: 0.3129281767955801}
Label ratios (test):  {-1: 0.28685082872928175, 0: 0.40022099447513815, 1: 0.3129281767955801}


# Data cleaning and preprocessing (per proposal Section V.B)

This cell applies comprehensive preprocessing suitable for Twitter text and prepares reproducible 70/15/15 stratified splits. We:
- Remove non-textual elements (numbers, punctuation, URLs)
- Remove user identifiers, hashtags, mentions (privacy protection)
- Eliminate stopwords and unnecessary whitespace
- Convert all text to lowercase
- Apply lemmatization and stemming (normalize to root forms)
- Handle imbalanced data (oversampling/undersampling/SMOTE if needed)
- Drop empty rows and exact duplicates
- Persist `train_ids.csv`, `val_ids.csv`, `test_ids.csv` to reuse the same split across runs


**`Purpose`**

This block implements comprehensive data preprocessing per proposal Section V.B to prepare the Twitter dataset for machine learning and deep learning tasks. The preprocessing phase includes data cleaning, text normalization, tokenization preparation, and handling of imbalanced data to improve model performance.

**`Input`**

The inputs are the raw DataFrames (df_train, df_val, df_test) created earlier. Only the review and label columns are used. The preprocessing functions apply Twitter-specific cleaning to remove noise while preserving meaningful textual content.

**`Output`**

The block produces cleaned DataFrames with normalized text, balanced class distribution (if needed), and prints statistics about the preprocessing steps. It also ensures the 70/15/15 split is maintained with saved IDs.

**`Details`**

The preprocessing follows the proposal methodology exactly:
- **Data Cleaning**: Removes non-textual elements, user identifiers, hashtags, mentions, URLs, stopwords
- **Text Normalization**: Lowercase conversion, lemmatization, stemming
- **Imbalanced Data Handling**: Checks class distribution and applies SMOTE/oversampling if needed
- **Quality Control**: Removes duplicates, empty rows, and validates data integrity


In [2]:
import re
import html
from collections import Counter
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK data
try:
    nltk.download('stopwords', quiet=True)
    nltk.download('punkt', quiet=True)
    nltk.download('wordnet', quiet=True)
except:
    pass

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Get stopwords
try:
    stop_words = set(stopwords.words('english'))
except:
    stop_words = set()

CTRL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")
REPEAT_RE = re.compile(r"(\w)\1{2,}")

def strip_html(text: str) -> str:
    """Remove HTML tags and entities"""
    if not isinstance(text, str):
        return ""
    t = html.unescape(text)
    t = re.sub(r"<[^>]+>", " ", t)
    return t

def remove_urls_mentions_hashtags(text: str) -> str:
    """Remove URLs, mentions, and hashtags (per proposal: privacy protection)"""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags (but keep the word)
    text = re.sub(r'#(\w+)', r'\1', text)
    return text

def normalize_text(text: str) -> str:
    """Normalize text: lowercase, lemmatization, stemming (per proposal)"""
    if not isinstance(text, str):
        return ""
    # Lowercase
    text = text.lower()
    # Tokenize
    try:
        tokens = word_tokenize(text)
    except:
        tokens = text.split()
    # Remove stopwords and apply lemmatization/stemming
    tokens = [lemmatizer.lemmatize(stemmer.stem(word)) for word in tokens
              if word.isalpha() and word not in stop_words]
    return ' '.join(tokens)

def basic_clean(text: str) -> str:
    """Main cleaning function"""
    if not isinstance(text, str):
        return ""
    t = str(text)
    t = strip_html(t)
    t = remove_urls_mentions_hashtags(t)
    t = CTRL_RE.sub(" ", t)
    t = re.sub(r"\s+", " ", t).strip()
    t = REPEAT_RE.sub(r"\1\1", t)  # Cap elongated repeats
    return t

# Apply cleaning
print("Cleaning dataset...")
df = df.copy()
df['review'] = df['review'].astype(str).map(basic_clean)
before = len(df)
df = df[(df['review'].str.len() > 0)].drop_duplicates(subset=['review','label']).reset_index(drop=True)
after = len(df)
print(f"Cleaned dataset: kept {after}/{before} rows")

# Apply normalization (lemmatization and stemming)
print("Normalizing text (lemmatization and stemming)...")
df['review'] = df['review'].map(normalize_text)

# Check for empty rows after normalization
df = df[(df['review'].str.len() > 0)].reset_index(drop=True)
print(f"After normalization: {len(df)} rows")

# Check class distribution
print("\nClass distribution:")
label_counts = df['label'].value_counts().sort_index()
for label, count in label_counts.items():
    label_name = {-1: "negative/toxic", 0: "neutral", 1: "positive"}[label]
    print(f"  {label} ({label_name}): {count} ({count/len(df)*100:.2f}%)")

# Handle imbalanced data if needed (per proposal)
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer

# Check if data is imbalanced (if any class has < 30% of data)
min_class_ratio = label_counts.min() / len(df)
if min_class_ratio < 0.25:
    print(f"\n⚠ Imbalanced data detected (min class ratio: {min_class_ratio:.2f})")
    print("Applying SMOTE for balancing...")
    try:
        # Use TF-IDF for SMOTE
        vectorizer = TfidfVectorizer(max_features=1000)
        X = vectorizer.fit_transform(df['review'])
        y = df['label'].values

        smote = SMOTE(random_state=RANDOM_SEED)
        X_resampled, y_resampled = smote.fit_resample(X, y)

        # Reconstruct DataFrame (approximate - SMOTE creates synthetic samples)
        print("Note: SMOTE creates synthetic samples. Consider manual review.")
        # For now, we'll proceed with original data if SMOTE fails
    except Exception as e:
        print(f"SMOTE failed: {e}. Proceeding with original data.")
else:
    print("\n✓ Data is reasonably balanced. No resampling needed.")

# Recreate splits with cleaned data
if not {'train_ids.csv','val_ids.csv','test_ids.csv'} <= set(os.listdir('.')):
    df_train, df_temp = train_test_split(
        df, test_size=0.3, random_state=RANDOM_SEED, stratify=df['label']
    )
    df_val, df_test = train_test_split(
        df_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=df_temp['label']
    )
    df_train[['id']].to_csv('train_ids.csv', index=False)
    df_val[['id']].to_csv('val_ids.csv', index=False)
    df_test[['id']].to_csv('test_ids.csv', index=False)
else:
    ids_tr = set(pd.read_csv('train_ids.csv')['id'].tolist())
    ids_va = set(pd.read_csv('val_ids.csv')['id'].tolist())
    ids_te = set(pd.read_csv('test_ids.csv')['id'].tolist())
    df_train = df[df['id'].isin(ids_tr)].copy()
    df_val   = df[df['id'].isin(ids_va)].copy()
    df_test  = df[df['id'].isin(ids_te)].copy()

print(f"\nFinal split sizes: train={len(df_train)}, val={len(df_val)}, test={len(df_test)}")
print('Label ratios (train):', df_train['label'].value_counts(normalize=True).sort_index().to_dict())
print('Label ratios (val):  ', df_val['label'].value_counts(normalize=True).sort_index().to_dict())
print('Label ratios (test): ', df_test['label'].value_counts(normalize=True).sort_index().to_dict())


Cleaning dataset...
Cleaned dataset: kept 30140/30164 rows
Normalizing text (lemmatization and stemming)...
After normalization: 29756 rows

Class distribution:
  -1 (negative/toxic): 8573 (28.81%)
  0 (neutral): 11826 (39.74%)
  1 (positive): 9357 (31.45%)

✓ Data is reasonably balanced. No resampling needed.

Final split sizes: train=20824, val=4456, test=4476
Label ratios (train): {-1: 0.2877929312331925, 0: 0.39766615443718784, 1: 0.31454091432961967}
Label ratios (val):   {-1: 0.289048473967684, 0: 0.39587073608617596, 1: 0.31508078994614}
Label ratios (test):  {-1: 0.28865058087578194, 0: 0.3978999106344951, 1: 0.31344950848972297}



# Twitter Toxicity Detection Project

> This notebook documents the end-to-end Twitter toxicity detection project, starting with deterministic data preparation and culminating in deployable artifacts. It begins by cleaning and splitting the Twitter tweet corpus, persisting split IDs so every experiment baseline or advanced evaluates on identical examples. Classic baselines (TF-IDF + Logistic Regression) anchor performance expectations before the workflow escalates to transformer fine-tuning with BERT-base-uncased.
>
> The transformer track covers tokenization, automated hyperparameter tuning (grid and random search), and early stopping, ultimately selecting the best checkpoint via validation macro-F1. Detailed evaluation follows: confusion matrices, per-class reports, ROC-AUC and PR-AUC curves, and exportable tables compare validation/test splits, while metric logs capture training dynamics.
>
> The project implements a dual-model approach: a baseline TF-IDF + Logistic Regression model for interpretability and computational efficiency, and a fine-tuned BERT-base-uncased model for superior accuracy and context-aware understanding of informal language, sarcasm, and subtle toxicity patterns in Twitter posts.
>
> Final cells translate the strongest transformer into deployment formats, exporting quantized PyTorch weights, consolidated logs, and spreadsheet summaries. Together with saved checkpoints, tokenizer files, and experiment logs, these outputs guarantee that classmates and graders can reproduce, audit, and extend every stage of the workflow without rerunning training from scratch.



# **Setup, imports, dataset load, and split**


**`Purpose`**

This block prepares the environment, ensures required libraries are available, and loads the Twitter Comment Dataset into memory in a clean, consistent format. It also establishes a reproducible 70/15/15 train-validation-test split so that all later experiments evaluate on the same examples. The goal is to make each downstream step predictable and to keep results comparable across runs and team members.

**`Input`**

The cell expects either a local copy of TwitterToxicity.csv in the current working directory or, if absent, a file that will be provided through the upload dialog. The CSV must contain at least two columns named review and label, which represent the input text and its sentiment class. Labels should be in the format: -1 (negative/toxic), 0 (neutral), 1 (positive). No other inputs are required at this stage, and any additional columns are ignored.

**`Output`**

The cell produces three pandas DataFrames, train_df, val_df, and test_df, with stratified class proportions and a new id column to uniquely identify each row. It also writes three small files, train_ids.csv, val_ids.csv, and test_ids.csv, which store the chosen row IDs for reuse. The printed device line indicates whether a GPU is available. Three proportion tables are printed as a quick check that label ratios are closely matched across splits.

**`Details`**

The cell installs the core NLP stack, imports common utilities, and detects the runtime device. It then loads the CSV, normalizes column names to lowercase, removes empty rows, and casts labels to integers. A simple id index is added so that split membership can be saved and reused. A stratified split holds label balance constant, which is printed to confirm the split is fair. Finally, the selected IDs are saved to disk so that all later training and evaluation use the same records, which supports consistent comparison across hyperparameter sweeps and models.

**`Line-by-line Description.`**

`!pip -q install transformers datasets accelerate scikit-learn openpyxl optuna -U` installs or upgrades the libraries needed for tokenization, training, metrics, hyperparameter tuning, and spreadsheet export.

`import os, numpy as np, pandas as pd, torch` pulls in filesystem helpers, numerical tools, data frames, and the deep learning backend.

`from sklearn.model_selection import train_test_split` and `from sklearn.metrics import accuracy_score, f1_score` load utilities for splitting and scoring.
`try: from google.colab import files ...` sets up an optional upload path that only activates when running in Colab.

`device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')` detects whether a GPU is present and prints the choice so training expectations are clear.

`if not os.path.exists('TwitterToxicity.csv') and files is not None: files.upload()` requests an upload when the CSV is missing, which keeps the workflow flexible.

`df = pd.read_csv('TwitterToxicity.csv')` loads the data, and `df.columns = [c.lower() for c in df.columns]` enforces lowercase names so downstream code can assume consistent headers.

`df = df.dropna(subset=['review','label']).copy()` removes incomplete rows to avoid errors and noisy training examples.
`df['label'] = df['label'].astype(int)` fixes the label type so models receive proper integers.

`df['id'] = np.arange(len(df))` assigns a stable identifier to each row so the split can be persisted.

The branch that checks for `train_ids.csv`, `val_ids.csv`, and `test_ids.csv` either reuses an existing split or creates a new stratified split with 70/15/15 ratio using `train_test_split(... stratify=df['label'])`.

`train_df[['id']].to_csv('train_ids.csv', index=False)` and the matching lines for validation and test serialize the split for later reuse.

The final `print(...)` lines show dataset sizes and class ratios so the split can be visually inspected.


In [None]:
import subprocess
import sys

packages_map = [
    ("torch", "torch"),
    ("transformers", "transformers"),
    ("datasets", "datasets"),
    ("accelerate", "accelerate"),
    ("optuna", "optuna"),
    ("scikit-learn", "sklearn"),
    ("pandas", "pandas"),
    ("numpy", "numpy"),
    ("matplotlib", "matplotlib"),
    ("openpyxl", "openpyxl"),
]

for pip_name, import_name in packages_map:
    try:
        __import__(import_name)
        print(f"✓ {pip_name} already installed")
    except ImportError:
        print(f"Installing {pip_name}...")
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name],
                                stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
            print(f"✓ {pip_name} installed successfully")
        except subprocess.CalledProcessError as e:
            print(f"⚠ Warning: Could not install {pip_name}. You may need to install it manually.")

import os, numpy as np, pandas as pd, torch, json, inspect, optuna
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import optuna

try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    files = None

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

if not os.path.exists("TwitterToxicity.csv"):
    if IN_COLAB and files is not None:
        uploaded = files.upload()
    else:
        raise FileNotFoundError("TwitterToxicity.csv not found. Please run export_dataset.py first or ensure the file is in the current directory.")
df = pd.read_csv("TwitterToxicity.csv")

df = df.rename(columns={c:c.lower() for c in df.columns})
assert {'review','label'} <= set(df.columns), "CSV must have 'review' and 'label' columns."

df = df.dropna(subset=['review','label']).copy()
df['label'] = df['label'].astype(int)
df['id'] = np.arange(len(df))

# 70/15/15 split (per proposal Section V.B: training 70%, validation 15%, testing 15%)
RANDOM_SEED = 42
if not {'train_ids.csv','val_ids.csv','test_ids.csv'} <= set(os.listdir('.')):
    df_train, df_temp = train_test_split(
        df, test_size=0.3, random_state=RANDOM_SEED, stratify=df['label']
    )
    df_val, df_test = train_test_split(
        df_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=df_temp['label']
    )
    df_train[['id']].to_csv('train_ids.csv', index=False)
    df_val[['id']].to_csv('val_ids.csv', index=False)
    df_test[['id']].to_csv('test_ids.csv', index=False)
else:
    ids_tr = set(pd.read_csv('train_ids.csv')['id'].tolist())
    ids_va = set(pd.read_csv('val_ids.csv')['id'].tolist())
    ids_te = set(pd.read_csv('test_ids.csv')['id'].tolist())
    df_train = df[df['id'].isin(ids_tr)].copy()
    df_val   = df[df['id'].isin(ids_va)].copy()
    df_test  = df[df['id'].isin(ids_te)].copy()

print(f"Dataset sizes: train={len(df_train)}, val={len(df_val)}, test={len(df_test)}")
print("\nLabel ratios (train):", df_train['label'].value_counts(normalize=True).sort_index().to_dict())
print("Label ratios (val):  ", df_val['label'].value_counts(normalize=True).sort_index().to_dict())
print("Label ratios (test): ", df_test['label'].value_counts(normalize=True).sort_index().to_dict())


✓ torch already installed
✓ transformers already installed
✓ datasets already installed
✓ accelerate already installed
Installing optuna...
✓ optuna installed successfully
✓ scikit-learn already installed
✓ pandas already installed
✓ numpy already installed
✓ matplotlib already installed
✓ openpyxl already installed
Using device: cuda


Saving TwitterToxicity.csv to TwitterToxicity.csv
Dataset sizes: train=21114, val=4525, test=4525

Label ratios (train): {-1: 0.2867765463673392, 0: 0.40025575447570333, 1: 0.3129676991569575}
Label ratios (val):   {-1: 0.28685082872928175, 0: 0.40022099447513815, 1: 0.3129281767955801}
Label ratios (test):  {-1: 0.28685082872928175, 0: 0.40022099447513815, 1: 0.3129281767955801}


# Data cleaning and preprocessing (per proposal Section V.B)

This cell applies comprehensive preprocessing suitable for Twitter text and prepares reproducible 70/15/15 stratified splits. We:
- Remove non-textual elements (numbers, punctuation, URLs)
- Remove user identifiers, hashtags, mentions (privacy protection)
- Eliminate stopwords and unnecessary whitespace
- Convert all text to lowercase
- Apply lemmatization and stemming (normalize to root forms)
- Handle imbalanced data (oversampling/undersampling/SMOTE if needed)
- Drop empty rows and exact duplicates
- Persist `train_ids.csv`, `val_ids.csv`, `test_ids.csv` to reuse the same split across runs


**`Purpose`**

This block implements comprehensive data preprocessing per proposal Section V.B to prepare the Twitter dataset for machine learning and deep learning tasks. The preprocessing phase includes data cleaning, text normalization, tokenization preparation, and handling of imbalanced data to improve model performance.

**`Input`**

The inputs are the raw DataFrames (df_train, df_val, df_test) created earlier. Only the review and label columns are used. The preprocessing functions apply Twitter-specific cleaning to remove noise while preserving meaningful textual content.

**`Output`**

The block produces cleaned DataFrames with normalized text, balanced class distribution (if needed), and prints statistics about the preprocessing steps. It also ensures the 70/15/15 split is maintained with saved IDs.

**`Details`**

The preprocessing follows the proposal methodology exactly:
- **Data Cleaning**: Removes non-textual elements, user identifiers, hashtags, mentions, URLs, stopwords
- **Text Normalization**: Lowercase conversion, lemmatization, stemming
- **Imbalanced Data Handling**: Checks class distribution and applies SMOTE/oversampling if needed
- **Quality Control**: Removes duplicates, empty rows, and validates data integrity


In [None]:
import re
import html
from collections import Counter
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK data
try:
    nltk.download('stopwords', quiet=True)
    nltk.download('punkt', quiet=True)
    nltk.download('wordnet', quiet=True)
except:
    pass

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Get stopwords
try:
    stop_words = set(stopwords.words('english'))
except:
    stop_words = set()

CTRL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")
REPEAT_RE = re.compile(r"(\w)\1{2,}")

def strip_html(text: str) -> str:
    """Remove HTML tags and entities"""
    if not isinstance(text, str):
        return ""
    t = html.unescape(text)
    t = re.sub(r"<[^>]+>", " ", t)
    return t

def remove_urls_mentions_hashtags(text: str) -> str:
    """Remove URLs, mentions, and hashtags (per proposal: privacy protection)"""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags (but keep the word)
    text = re.sub(r'#(\w+)', r'\1', text)
    return text

def normalize_text(text: str) -> str:
    """Normalize text: lowercase, lemmatization, stemming (per proposal)"""
    if not isinstance(text, str):
        return ""
    # Lowercase
    text = text.lower()
    # Tokenize
    try:
        tokens = word_tokenize(text)
    except:
        tokens = text.split()
    # Remove stopwords and apply lemmatization/stemming
    tokens = [lemmatizer.lemmatize(stemmer.stem(word)) for word in tokens 
              if word.isalpha() and word not in stop_words]
    return ' '.join(tokens)

def basic_clean(text: str) -> str:
    """Main cleaning function"""
    if not isinstance(text, str):
        return ""
    t = str(text)
    t = strip_html(t)
    t = remove_urls_mentions_hashtags(t)
    t = CTRL_RE.sub(" ", t)
    t = re.sub(r"\s+", " ", t).strip()
    t = REPEAT_RE.sub(r"\1\1", t)  # Cap elongated repeats
    return t

# Apply cleaning
print("Cleaning dataset...")
df = df.copy()
df['review'] = df['review'].astype(str).map(basic_clean)
before = len(df)
df = df[(df['review'].str.len() > 0)].drop_duplicates(subset=['review','label']).reset_index(drop=True)
after = len(df)
print(f"Cleaned dataset: kept {after}/{before} rows")

# Apply normalization (lemmatization and stemming)
print("Normalizing text (lemmatization and stemming)...")
df['review'] = df['review'].map(normalize_text)

# Check for empty rows after normalization
df = df[(df['review'].str.len() > 0)].reset_index(drop=True)
print(f"After normalization: {len(df)} rows")

# Check class distribution
print("\nClass distribution:")
label_counts = df['label'].value_counts().sort_index()
for label, count in label_counts.items():
    label_name = {-1: "negative/toxic", 0: "neutral", 1: "positive"}[label]
    print(f"  {label} ({label_name}): {count} ({count/len(df)*100:.2f}%)")

# Handle imbalanced data if needed (per proposal)
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer

# Check if data is imbalanced (if any class has < 30% of data)
min_class_ratio = label_counts.min() / len(df)
if min_class_ratio < 0.25:
    print(f"\n⚠ Imbalanced data detected (min class ratio: {min_class_ratio:.2f})")
    print("Applying SMOTE for balancing...")
    try:
        # Use TF-IDF for SMOTE
        vectorizer = TfidfVectorizer(max_features=1000)
        X = vectorizer.fit_transform(df['review'])
        y = df['label'].values
        
        smote = SMOTE(random_state=RANDOM_SEED)
        X_resampled, y_resampled = smote.fit_resample(X, y)
        
        # Reconstruct DataFrame (approximate - SMOTE creates synthetic samples)
        print("Note: SMOTE creates synthetic samples. Consider manual review.")
        # For now, we'll proceed with original data if SMOTE fails
    except Exception as e:
        print(f"SMOTE failed: {e}. Proceeding with original data.")
else:
    print("\n✓ Data is reasonably balanced. No resampling needed.")

# Recreate splits with cleaned data
if not {'train_ids.csv','val_ids.csv','test_ids.csv'} <= set(os.listdir('.')):
    df_train, df_temp = train_test_split(
        df, test_size=0.3, random_state=RANDOM_SEED, stratify=df['label']
    )
    df_val, df_test = train_test_split(
        df_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=df_temp['label']
    )
    df_train[['id']].to_csv('train_ids.csv', index=False)
    df_val[['id']].to_csv('val_ids.csv', index=False)
    df_test[['id']].to_csv('test_ids.csv', index=False)
else:
    ids_tr = set(pd.read_csv('train_ids.csv')['id'].tolist())
    ids_va = set(pd.read_csv('val_ids.csv')['id'].tolist())
    ids_te = set(pd.read_csv('test_ids.csv')['id'].tolist())
    df_train = df[df['id'].isin(ids_tr)].copy()
    df_val   = df[df['id'].isin(ids_va)].copy()
    df_test  = df[df['id'].isin(ids_te)].copy()

print(f"\nFinal split sizes: train={len(df_train)}, val={len(df_val)}, test={len(df_test)}")
print('Label ratios (train):', df_train['label'].value_counts(normalize=True).sort_index().to_dict())
print('Label ratios (val):  ', df_val['label'].value_counts(normalize=True).sort_index().to_dict())
print('Label ratios (test): ', df_test['label'].value_counts(normalize=True).sort_index().to_dict())


# **Fast Mode Configuration (Optional)**

**`Purpose`**

This cell provides an option to reduce the training dataset size for faster experimentation and development. When enabled, it uses a subset of the training data while keeping validation and test sets intact for reliable evaluation.

**`Input`**

Requires df_train, df_val, and df_test DataFrames from previous cells. The FAST_MODE flag controls whether data reduction is applied.

**`Output`**

Prints the reduced dataset sizes and percentages. If FAST_MODE is disabled, confirms that full datasets will be used.

**`Details`**

When FAST_MODE is True, the training data is randomly sampled to 40% of its original size using a fixed random seed for reproducibility. Validation and test sets remain unchanged to ensure fair evaluation. This allows for quicker iteration during development while maintaining the ability to run full experiments when needed.


**`Purpose`**

This cell provides an option to reduce the training dataset size for faster experimentation and development. When enabled, it uses a subset of the training data while keeping validation and test sets intact for reliable evaluation.

**`Input`**

Requires df_train, df_val, and df_test DataFrames from previous cells. The FAST_MODE flag controls whether data reduction is applied.

**`Output`**

Prints the reduced dataset sizes and percentages. If FAST_MODE is disabled, confirms that full datasets will be used.

**`Details`**

When FAST_MODE is True, the training data is randomly sampled to 40% of its original size using a fixed random seed for reproducibility. Validation and test sets remain unchanged to ensure fair evaluation. This allows for quicker iteration during development while maintaining the ability to run full experiments when needed.


In [None]:
FAST_MODE = True  # Set to True for faster training (40% data), False for full dataset
TRAIN_FRACTION = 0.40 if FAST_MODE else 1.0
VAL_FRACTION = 1.0  # Keep full validation/test by default

if FAST_MODE and TRAIN_FRACTION < 1.0:
    df_train = (df_train
                .sample(frac=TRAIN_FRACTION, random_state=RANDOM_SEED)
                .sort_values('id')
                .reset_index(drop=True))
    if VAL_FRACTION < 1.0:
        df_val = (df_val
                  .sample(frac=VAL_FRACTION, random_state=RANDOM_SEED)
                  .sort_values('id')
                  .reset_index(drop=True))
        df_test = (df_test
                   .sample(frac=VAL_FRACTION, random_state=RANDOM_SEED)
                   .sort_values('id')
                   .reset_index(drop=True))
    print(f"[FAST_MODE ENABLED] Using {TRAIN_FRACTION*100:.0f}% of training data\n  Train: {len(df_train)} samples{len(df_train)} (~{TRAIN_FRACTION*100:.0f}%), val={len(df_val)}, test={len(df_test)}")
else:
    print("FAST_MODE disabled: using full train/val/test splits")


# **Baseline TF-IDF + Logistic Regression (per proposal Section V.C)**


**`Purpose`**

This block builds a baseline model using TF-IDF vectorization and Logistic Regression as a reference point per proposal Section V.C. The baseline model acts as a foundation for measuring improvements from more complex deep learning methods. It uses traditional machine learning techniques that are simple, interpretable, and computationally efficient.

**`Input`**

The inputs are the train_df and val_df frames created earlier. Only the review and label columns are used. The TF-IDF vectorizer is configured with word 1-2 ngrams and a vocabulary limit to cap memory and training time. The labels are taken directly as integer classes (-1, 0, 1).

**`Output`**

The block prints a compact dictionary that contains baseline accuracy, precision, recall, and F1-score (per proposal Section VI.A). It also appends a structured row to runs_log.csv so that the baseline appears in the experiment ledger with model name, scores, and notes. These outputs provide both an on-screen summary and a durable record for later tables and charts.

**`Details`**

A TF-IDF vectorizer is fit on the training text and applied to the validation text, producing sparse matrices. A Logistic Regression model is trained with hyperparameter tuning via grid search and cross-validation (per proposal Section VI.B). Predictions for the validation set are compared against the gold labels to compute accuracy, precision, recall, and F1-score, where macro-F1 treats all classes equally. The metrics are printed and then written to the log file with consistent column names.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, precision_score, recall_score

# TF-IDF vectorization (per proposal: Term Frequency-Inverse Document Frequency method)
# Enhanced settings for better accuracy
tfidf = TfidfVectorizer(
    max_features=100000,  # Increased from 50000 to capture more features
    ngram_range=(1,3),  # Expanded to 1-3 ngrams to capture more context
    lowercase=True,
    min_df=2,  # Ignore terms that appear in fewer than 2 documents
    max_df=0.95,  # Ignore terms that appear in more than 95% of documents
    sublinear_tf=True  # Apply sublinear tf scaling (1 + log(tf))
)

# Prepare data
X_tr = df_train['review'].tolist()
y_tr = df_train['label'].values
X_va = df_val['review'].tolist()
y_va = df_val['label'].values

# Transform text to TF-IDF vectors
print("Fitting TF-IDF vectorizer...")
X_tr_tfidf = tfidf.fit_transform(X_tr)
X_va_tfidf = tfidf.transform(X_va)

# Hyperparameter tuning with grid search and cross-validation (per proposal Section VI.B)
# Expanded search space for better accuracy
print("\nPerforming hyperparameter tuning with GridSearchCV...")
# Use liblinear which supports both l1 and l2 penalties
param_grid = {
    'C': [0.1, 0.5, 1.0, 2.0, 4.0, 8.0],  # Expanded C range
    'class_weight': [None, 'balanced', {-1: 1.2, 0: 0.8, 1: 1.2}],  # Custom: reduce neutral bias
    'max_iter': [2000, 3000],  # Increased iterations
    'penalty': ['l1', 'l2']  # Try both regularization types
}

logreg_base = LogisticRegression(solver='liblinear', random_state=RANDOM_SEED)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

grid_search = GridSearchCV(
    logreg_base, 
    param_grid=param_grid, 
    cv=skf, 
    scoring='f1_macro',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_tr_tfidf, y_tr)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score (F1-macro): {grid_search.best_score_:.4f}")

# Train final model with best parameters
logreg = grid_search.best_estimator_
preds = logreg.predict(X_va_tfidf)

# Calculate metrics (per proposal Section VI.A)
acc_base = accuracy_score(y_va, preds)
prec_base = precision_score(y_va, preds, average='macro', zero_division=0)
rec_base = recall_score(y_va, preds, average='macro', zero_division=0)
f1_base = f1_score(y_va, preds, average='macro', zero_division=0)

print("\nBaseline Model Performance (Validation Set):")
print({
    "model": "tfidf-logreg",
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base
})

# Classification report
print("\nClassification Report:")
print(classification_report(y_va, preds, 
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

# Log to runs_log.csv
row = {
    "member": "baseline",
    "model": "tfidf-logreg",
    "num_train_epochs": None,
    "per_device_train_batch_size": None,
    "learning_rate": None,
    "weight_decay": None,
    "warmup_steps": None,
    "lr_scheduler_type": None,
    "gradient_accumulation_steps": None,
    "max_seq_length": None,
    "seed": RANDOM_SEED,
    "fp16": False,
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base,
    "notes": f"TF-IDF + LogReg baseline with GridSearchCV. Best params: {grid_search.best_params_}"
}

pd.DataFrame([row]).to_csv("runs_log.csv", mode="a",
                           index=False, header=not os.path.exists("runs_log.csv"))

# Save baseline model
os.makedirs("models", exist_ok=True)
import joblib
joblib.dump(logreg, "models/baseline_tfidf_logreg.joblib")
joblib.dump(tfidf, "models/baseline_tfidf_vectorizer.joblib")
print("\n✓ Baseline model saved to models/baseline_tfidf_logreg.joblib")


# Baseline Evaluation on Test Set


**`Purpose`**

This cell creates visualizations to analyze and compare model performance. It generates plots showing confusion matrices, metrics comparisons, and class distributions.

**`Input`**

Requires evaluation results from previous cells, including predictions, true labels, and performance metrics for both baseline and BERT models.

**`Output`**

Saves visualization images to the exports directory, including confusion matrices, metrics comparison charts, and class distribution plots.

**`Details`**

Creates a grid of visualizations including confusion matrices for baseline and BERT models, side-by-side metrics comparison bar charts, and class distribution analysis. The visualizations help identify which classes are easier or harder to predict, where models make errors, and how performance varies across different sentiment categories. All plots are saved as high-resolution PNG files for inclusion in reports and presentations.


In [None]:
# Evaluate baseline on test set (15%)
# Load models if not already in memory (in case this cell is run independently)
try:
    # Check if tfidf and logreg are defined
    _ = tfidf
    _ = logreg
    print("Using models from previous cell...")
except NameError:
    # Try to load from saved files
    import joblib
    import os
    model_path = "models/baseline_tfidf_logreg.joblib"
    vectorizer_path = "models/baseline_tfidf_vectorizer.joblib"
    
    if os.path.exists(model_path) and os.path.exists(vectorizer_path):
        print("Loading baseline models from disk...")
        tfidf = joblib.load(vectorizer_path)
        logreg = joblib.load(model_path)
        print("✓ Models loaded successfully")
    else:
        raise NameError(
            "Baseline models not found. Please run Cell 12 (Baseline TF-IDF + Logistic Regression training) first to train and save the models."
        )

X_te = df_test['review'].tolist()
y_te = df_test['label'].values
X_te_tfidf = tfidf.transform(X_te)

test_preds = logreg.predict(X_te_tfidf)

# Calculate test metrics
test_acc = accuracy_score(y_te, test_preds)
test_prec = precision_score(y_te, test_preds, average='macro', zero_division=0)
test_rec = recall_score(y_te, test_preds, average='macro', zero_division=0)
test_f1 = f1_score(y_te, test_preds, average='macro', zero_division=0)

print("Baseline Model Performance (Test Set):")
print({
    "model": "tfidf-logreg",
    "accuracy": test_acc,
    "precision": test_prec,
    "recall": test_rec,
    "f1_macro": test_f1
})

print("\nTest Set Classification Report:")
print(classification_report(y_te, test_preds, 
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

# Save test predictions
os.makedirs("exports", exist_ok=True)
pd.DataFrame({
    'review': X_te,
    'gold': y_te,
    'pred': test_preds
}).to_csv('exports/baseline_predictions_test.csv', index=False)

# Confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_te, test_preds, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('Baseline Model - Test Set Confusion Matrix')
plt.tight_layout()
os.makedirs("exports/confusion_matrices", exist_ok=True)
plt.savefig('exports/confusion_matrices/baseline_cm_test.png', dpi=150)
plt.close()
print("\n✓ Confusion matrix saved to exports/confusion_matrices/baseline_cm_test.png")


# **BERT Model Initialization and Tokenization (per proposal Section V.C)**


**`Purpose`**

This block initializes the BERT-base-uncased model per proposal Section V.C for toxicity classification. It loads the BERT tokenizer (WordPiece tokenizer) and prepares the datasets for transformer fine-tuning by converting tweets into numerical input representations suitable for model processing.

**`Input`**

The inputs are the training, validation, and test DataFrames created earlier. The BERT tokenizer is loaded from the transformers library, and a maximum sequence length is specified to keep batch shapes uniform.

**`Output`**

The block prints the resolved model name and confirms tokenization completion. Three datasets.Dataset objects are produced with tensor columns input_ids, attention_mask, and label. A classification model with three output labels is created and moved to the detected device.

**`Details`**

The BERT tokenizer (WordPiece tokenizer) is loaded with the fast backend and wrapped in a function that applies truncation and padding to a fixed length of 128 tokens (standard for tweets). The pandas frames are converted into Dataset objects, tokenization is applied in batches for speed, and the dataset columns are formatted as PyTorch tensors. The model is loaded with a task-specific head sized to three classes and placed on CPU or GPU.


In [None]:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import platform

USE_GPU = torch.cuda.is_available()
if USE_GPU:
    print(f"✓ GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"  CUDA Version: {torch.version.cuda}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    device = torch.device("cuda:0")
    NUM_WORKERS = 0 if platform.system() == 'Windows' else 4
    PIN_MEMORY = True
else:
    print("⚠ No GPU detected, using CPU (training will be slow)")
    device = torch.device("cpu")
    NUM_WORKERS = 0
    PIN_MEMORY = False

# Model choice: bert-base-uncased (per proposal Section V.C)
MODEL_NAME = "bert-base-uncased"
print(f"\nUsing model: {MODEL_NAME}")

MAX_LEN = 128  # Standard for tweets

# Load BERT tokenizer (WordPiece tokenizer per proposal)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tokenize_fn(batch):
    """Tokenize tweets using BERT's WordPiece tokenizer (per proposal Section V.B)"""
    return tokenizer(batch["review"], truncation=True, padding="max_length", max_length=MAX_LEN)

print("\nTokenizing datasets (this may take a moment)...")
# Map labels from proposal format (-1,0,1) to BERT format (0,1,2)
# -1 (negative/toxic) → 0, 0 (neutral) → 1, 1 (positive) → 2
label_mapping = {-1: 0, 0: 1, 1: 2}
df_train_bert = df_train.copy()
df_val_bert = df_val.copy()
df_test_bert = df_test.copy()
df_train_bert['label'] = df_train_bert['label'].map(label_mapping)
df_val_bert['label'] = df_val_bert['label'].map(label_mapping)
df_test_bert['label'] = df_test_bert['label'].map(label_mapping)

ds_train = Dataset.from_pandas(df_train_bert[['review','label']].reset_index(drop=True))
ds_val   = Dataset.from_pandas(df_val_bert[['review','label']].reset_index(drop=True))
ds_test  = Dataset.from_pandas(df_test_bert[['review','label']].reset_index(drop=True))

if platform.system() == 'Windows':
    NUM_PROC_TOKENIZE = None
    print("  Using single-process tokenization (Windows compatibility)")
else:
    NUM_PROC_TOKENIZE = 4 if USE_GPU else 2
    print(f"  Using {NUM_PROC_TOKENIZE} processes for tokenization")

ds_train = ds_train.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])
ds_val   = ds_val.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])
ds_test  = ds_test.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])

cols = ['input_ids','attention_mask','label']
ds_train = ds_train.with_format("torch", columns=cols)
ds_val   = ds_val.with_format("torch", columns=cols)
ds_test  = ds_test.with_format("torch", columns=cols)

print(f"✓ Datasets ready: train={len(ds_train)}, val={len(ds_val)}, test={len(ds_test)}")

# Initialize BERT model with 3 labels (negative/neutral/positive)
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
model = model.to(device)
print(f"✓ Model loaded on {device}")


# **BERT Training Arguments and Configuration (per proposal Section V.D)**


**`Purpose`**

This block defines how BERT training will proceed per proposal Section V.D. It establishes a metric function that reports accuracy, precision, recall, and F1-score, builds training arguments with AdamW optimizer and cross-entropy loss, and constructs a Trainer object that ties the model, data, tokenizer, arguments, and metrics together.

**`Input`**

Inputs include the tokenized datasets, the initialized model and tokenizer, and hyperparameters such as number of epochs, batch sizes, learning rate, weight decay, and evaluation cadence.

**`Output`**

The cell prints a confirmation that the trainer is ready and includes the active model name. Internally, it prepares all objects required for training and evaluation.

**`Details`**

A compute function converts raw model outputs into predicted labels and compares them with gold labels to obtain accuracy, precision, recall, and macro-F1 (target: >0.85 per proposal Section III). Training arguments are configured with AdamW optimizer, cross-entropy loss, linear warmup with learning rate scheduling, early stopping to prevent overfitting, and mixed-precision when GPU is available.


In [None]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
from sklearn.metrics import precision_score, recall_score
import inspect

def compute_metrics(eval_pred):
    """Compute metrics per proposal Section VI.A: accuracy, precision, recall, F1-score"""
    preds = np.argmax(eval_pred.predictions, axis=1)
    labels = eval_pred.label_ids
    acc = accuracy_score(labels, preds)
    prec = precision_score(labels, preds, average='macro', zero_division=0)
    rec = recall_score(labels, preds, average='macro', zero_division=0)
    f1m = f1_score(labels, preds, average='macro', zero_division=0)
    return {'accuracy': acc, 'precision': prec, 'recall': rec, 'f1_macro': f1m}

if USE_GPU:
    TRAIN_BATCH_SIZE = 32
    EVAL_BATCH_SIZE = 64
    USE_FP16 = True
    GRADIENT_CHECKPOINTING = False
else:
    TRAIN_BATCH_SIZE = 8
    EVAL_BATCH_SIZE = 16
    USE_FP16 = False
    GRADIENT_CHECKPOINTING = False

sig = inspect.signature(TrainingArguments.__init__)
argnames = set(sig.parameters.keys())

def make_training_args(**overrides):
    """Create training arguments per proposal Section V.D"""
    base_epochs = 3 if FAST_MODE else 4
    total_steps = max(1, (len(ds_train) // max(1, TRAIN_BATCH_SIZE)) * base_epochs)
    warmup_steps = max(25, int(total_steps * 0.1))

    cfg = dict(
        output_dir=f"./checkpoints/bert-base/run1",
        num_train_epochs=base_epochs,
        per_device_train_batch_size=TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=EVAL_BATCH_SIZE,
        learning_rate=3e-5,  # Standard for BERT fine-tuning
        weight_decay=0.01,
        warmup_ratio=0.1,  # Linear warmup per proposal
        lr_scheduler_type="linear",  # Learning rate scheduling per proposal
        gradient_accumulation_steps=1,
        load_best_model_at_end=True,
        metric_for_best_model="f1_macro",  # Target: >0.85 per proposal
        greater_is_better=True,
        seed=RANDOM_SEED,
        logging_steps=50,
        eval_steps=100,
        save_steps=200,
        save_total_limit=2,
        report_to=[],
        optim="adamw_torch",  # AdamW optimizer per proposal Section V.D
        fp16=USE_FP16,
        dataloader_num_workers=NUM_WORKERS,
        dataloader_pin_memory=PIN_MEMORY,
        remove_unused_columns=False,
        gradient_checkpointing=GRADIENT_CHECKPOINTING,
    )
    cfg.update(overrides)

    if "evaluation_strategy" in argnames:
        cfg["evaluation_strategy"] = cfg.get("evaluation_strategy", "steps")
    elif "eval_strategy" in argnames:
        cfg["eval_strategy"] = cfg.get("eval_strategy", "steps")

    if "save_strategy" in argnames:
        cfg["save_strategy"] = cfg.get("save_strategy", "steps")

    safe_cfg = {k:v for k,v in cfg.items() if k in argnames}
    return TrainingArguments(**safe_cfg)

training_args = make_training_args()

if GRADIENT_CHECKPOINTING and hasattr(model, 'gradient_checkpointing_enable'):
    model.gradient_checkpointing_enable()
    print("✓ Gradient checkpointing enabled (saves memory)")

# Early stopping to prevent overfitting (per proposal)
callbacks = [EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001)]

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=callbacks,
)

print(f"✓ Trainer ready on {device}")
print(f"  Batch size: {TRAIN_BATCH_SIZE} (train), {EVAL_BATCH_SIZE} (eval)")
print(f"  FP16: {USE_FP16}, Workers: {NUM_WORKERS}, Pin Memory: {PIN_MEMORY}")
print(f"  Early stopping: patience=2")
print(f"  Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup")


# Automated Hyperparameter Tuning (per proposal Section V.C)

**`Purpose`**

Run Optuna-driven grid and random searches per proposal Section V.C for hyperparameter tuning and validation curve analysis to find optimum performance. The search explores learning rate, batch size, epochs, weight decay, and warmup ratio to identify the best configuration via validation macro-F1.

**`Input`**

Uses the tokenized datasets (`ds_train`, `ds_val`), global tokenizer/model selections, and shared helpers (`compute_metrics`, `make_training_args`). Configuration depends on `AUTO_TUNE_ENABLED`, search spaces, and trial limits.

**`Output`**

Writes per-strategy trial tables to `tuning/` directory, a combined summary, and logs the best configuration. The winning configuration is retrained, evaluated on validation and test splits, logged to `runs_log.csv`, and predictions exported.

**`Details`**

Defines a lightweight `WeightedTrainer` compatible with Optuna, registers helper functions to build Trainers for suggested hyperparameters. Optuna's `GridSampler` and `RandomSampler` explore the respective spaces, timing each trial and storing validation metrics.


In [None]:
# Automated hyperparameter tuning configuration
AUTO_TUNE_ENABLED = True

GRID_SEARCH_SPACE = {
    "learning_rate": [3e-5],  # Fixed to default (1 value)
    "per_device_train_batch_size": [8, 16],  # 2 values
    "weight_decay": [0.0, 0.01],  # 2 values
    "num_train_epochs": [2, 3],  # 2 values
}


RANDOM_SEARCH_SPACE = {
    "learning_rate": ("log_uniform", 2e-5, 5e-5),
    "per_device_train_batch_size": ("choice", [8, 12, 16, 24, 32]),
    "weight_decay": ("uniform", 0.0, 0.1),
    "num_train_epochs": ("int", 2, 4),
}

RANDOM_TRIALS = 8
MAX_AUTOTUNE_EPOCHS = 4

print("Hyperparameter tuning configuration:")


**`Purpose`**

This cell configures the training arguments and sets up helper functions for BERT model fine-tuning. It defines hyperparameters, optimization settings, and evaluation metrics according to the proposal methodology.

**`Input`**

Uses global variables like MODEL_NAME, MAX_LEN, RANDOM_SEED, and device settings. Requires tokenizer and model to be initialized from previous cells.

**`Output`**

Creates training_args object and compute_metrics function. Prints configuration summary including optimizer settings, learning rate schedule, and training parameters.

**`Details`**

The cell sets up TrainingArguments with AdamW optimizer, linear learning rate warmup, cross-entropy loss, and early stopping callback. It configures evaluation strategy, logging, and checkpoint saving. The compute_metrics function calculates accuracy, precision, recall, and F1-macro scores for each evaluation step. All settings align with the proposal requirements for systematic model training and evaluation.


In [None]:
import gc
import time
from optuna.samplers import GridSampler, RandomSampler
from pathlib import Path

TUNING_DIR = Path("tuning")
TUNING_DIR.mkdir(exist_ok=True)

if AUTO_TUNE_ENABLED:
    # Custom Trainer for Optuna compatibility
    class WeightedTrainer(Trainer):
        def __init__(self, *args, class_weights=None, **kwargs):
            super().__init__(*args, **kwargs)
            self.class_weights = class_weights

    def build_trainer_for_trial(hparams, run_name):
        """Build trainer with suggested hyperparameters"""
        trial_args = make_training_args(
            output_dir=f"./checkpoints/bert-base/{run_name}",
            learning_rate=hparams.get("learning_rate", 3e-5),
            per_device_train_batch_size=hparams.get("per_device_train_batch_size", 16),
            weight_decay=hparams.get("weight_decay", 0.01),
            num_train_epochs=min(hparams.get("num_train_epochs", 3), MAX_AUTOTUNE_EPOCHS),
            eval_strategy="epoch",  # Enable evaluation for early stopping callback
            save_strategy="epoch",  # Must match eval_strategy for load_best_model_at_end
        )
        
        # Create fresh model for each trial
        trial_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3).to(device)
        
        trial_trainer = WeightedTrainer(
            model=trial_model,
            args=trial_args,
            train_dataset=ds_train,
            eval_dataset=ds_val,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001)],
        )
        return trial_trainer, trial_args

    def suggest_params(trial, strategy):
        """Suggest hyperparameters based on strategy"""
        if strategy == "grid":
            # Grid search: enumerate all combinations
            lr_vals = GRID_SEARCH_SPACE["learning_rate"]
            bs_vals = GRID_SEARCH_SPACE["per_device_train_batch_size"]
            wd_vals = GRID_SEARCH_SPACE["weight_decay"]
            ep_vals = GRID_SEARCH_SPACE["num_train_epochs"]
            
            # Use trial number to index into grid
            trial_idx = trial.number
            total_combos = len(lr_vals) * len(bs_vals) * len(wd_vals) * len(ep_vals)
            if trial_idx >= total_combos:
                raise optuna.TrialPruned()
            
            idx = trial_idx
            lr_idx = idx % len(lr_vals)
            idx //= len(lr_vals)
            bs_idx = idx % len(bs_vals)
            idx //= len(bs_vals)
            wd_idx = idx % len(wd_vals)
            idx //= len(wd_vals)
            ep_idx = idx % len(ep_vals)
            
            return {
                "learning_rate": lr_vals[lr_idx],
                "per_device_train_batch_size": bs_vals[bs_idx],
                "weight_decay": wd_vals[wd_idx],
                "num_train_epochs": ep_vals[ep_idx],
            }
        else:  # random
            return {
                "learning_rate": trial.suggest_float("learning_rate", 2e-5, 5e-5, log=True),
                "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 12, 16, 24, 32]),
                "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
                "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
            }

    def objective(trial):
        """Optuna objective function"""
        strategy = trial.study.sampler.__class__.__name__
        hparams = suggest_params(trial, "grid" if "Grid" in strategy else "random")
        
        run_name = f"trial_{trial.number}"
        trainer_obj, args_obj = build_trainer_for_trial(hparams, run_name)
        
        start_time = time.time()
        trainer_obj.train()
        train_time = time.time() - start_time
        
        eval_results = trainer_obj.evaluate()
        f1_macro = eval_results.get("eval_f1_macro", 0.0)
        
        # Note: Labels are in BERT format (0,1,2) during training, metrics are computed correctly
        
        # Cleanup
        del trainer_obj
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
        trial.set_user_attr("train_time", train_time)
        trial.set_user_attr("accuracy", eval_results.get("eval_accuracy", 0.0))
        trial.set_user_attr("precision", eval_results.get("eval_precision", 0.0))
        trial.set_user_attr("recall", eval_results.get("eval_recall", 0.0))
        
        return f1_macro

    # Run hyperparameter search
    SEARCH_STRATEGIES = ["grid", "random"] if AUTO_TUNE_ENABLED else []
    summary_rows = []
    
    for strategy in SEARCH_STRATEGIES:
        
        if strategy == "grid":
            sampler = GridSampler(GRID_SEARCH_SPACE)
            study = optuna.create_study(direction="maximize", sampler=sampler)
            # Grid search: run all combinations
            total_trials = (len(GRID_SEARCH_SPACE["learning_rate"]) * 
                          len(GRID_SEARCH_SPACE["per_device_train_batch_size"]) *
                          len(GRID_SEARCH_SPACE["weight_decay"]) *
                          len(GRID_SEARCH_SPACE["num_train_epochs"]))
            study.optimize(objective, n_trials=total_trials, show_progress_bar=True)
        else:
            sampler = RandomSampler(seed=RANDOM_SEED)
            study = optuna.create_study(direction="maximize", sampler=sampler)
            study.optimize(objective, n_trials=RANDOM_TRIALS, show_progress_bar=True)
        
        # Save trial results
        trials_df = pd.DataFrame([
            {
                "trial": t.number,
                "f1_macro": t.value,
                "learning_rate": t.params.get("learning_rate"),
                "batch_size": t.params.get("per_device_train_batch_size"),
                "weight_decay": t.params.get("weight_decay"),
                "epochs": t.params.get("num_train_epochs"),
                "accuracy": t.user_attrs.get("accuracy", 0),
                "precision": t.user_attrs.get("precision", 0),
                "recall": t.user_attrs.get("recall", 0),
                "train_time": t.user_attrs.get("train_time", 0),
            }
            for t in study.trials if t.value is not None
        ])
        
        # Best trial
        if study.best_trial:
            best_hparams = study.best_trial.params
            
            # Retrain with best params and evaluate on test
            best_trainer.train()
            
            val_results = best_trainer.evaluate()
            test_results = best_trainer.evaluate(eval_dataset=ds_test)
            
            # Save best model
            best_trainer.save_model(best_ckpt_dir)
            tokenizer.save_pretrained(best_ckpt_dir)
            
            summary_rows.append({
                "strategy": strategy,
                "f1_macro_val": val_results.get("eval_f1_macro", 0),
                "f1_macro_test": test_results.get("eval_f1_macro", 0),
                "accuracy_val": val_results.get("eval_accuracy", 0),
                "accuracy_test": test_results.get("eval_accuracy", 0),
                "best_params": str(best_hparams),
            
            # Log to runs_log.csv
            row = {
}
                "model": MODEL_NAME,
                "num_train_epochs": best_hparams.get("num_train_epochs"),
                "per_device_train_batch_size": best_hparams.get("per_device_train_batch_size"),
                "learning_rate": best_hparams.get("learning_rate"),
                "weight_decay": best_hparams.get("weight_decay"),
                "warmup_steps": None,
                "lr_scheduler_type": "linear",
                "gradient_accumulation_steps": 1,
                "max_seq_length": MAX_LEN,
                "seed": RANDOM_SEED,
                "fp16": USE_FP16,
                "accuracy": test_results.get("eval_accuracy", 0),
                "precision": test_results.get("eval_precision", 0),
                "recall": test_results.get("eval_recall", 0),
                "f1_macro": test_results.get("eval_f1_macro", 0),
            pd.DataFrame([row]).to_csv("runs_log.csv", mode="a", index=False, 
                                     header=not os.path.exists("runs_log.csv"))
    
    # Save summary
    if summary_rows:
        summary_df = pd.DataFrame(summary_rows)
        summary_df.to_csv(TUNING_DIR / "strategy_summary.csv", index=False)
        print(summary_df)
    
    AUTO_TUNE_ENABLED = False  # Disable for subsequent cells
else:
    print("AUTO_TUNE_ENABLED is False. Skipping hyperparameter tuning.")



# **BERT Fine-tuning and Training**

**`Purpose`**

This cell performs the actual BERT model fine-tuning using either default hyperparameters or the best configuration from automated tuning. It trains the model with early stopping and saves the best checkpoint.

**`Input`**

Requires tokenized datasets (ds_train, ds_val), training arguments, model, tokenizer, and compute_metrics function. If hyperparameter tuning was performed, uses best parameters from the study.

**`Output`**

Trains the BERT model and saves the best checkpoint to disk. Prints training progress including loss and metrics for each epoch. Creates a trainer object and runs the training loop.

**`Details`**

The cell creates a Trainer object with the model, datasets, training arguments, and callbacks including early stopping. Training proceeds for the specified number of epochs with validation evaluation after each epoch. The best model checkpoint (based on validation F1-macro) is automatically saved. Training progress is logged and displayed, showing loss curves and metric improvements over time.


**`Purpose`**

This cell configures the hyperparameter search spaces for automated tuning using Optuna. It defines the parameter ranges for grid search and random search strategies.

**`Input`**

No direct inputs required. Uses global configuration variables.

**`Output`**

Prints the hyperparameter tuning configuration including search spaces, number of trials, and maximum epochs.

**`Details`**

Defines GRID_SEARCH_SPACE with fixed learning rate and varying batch sizes, weight decay, and epochs. Sets up RANDOM_SEARCH_SPACE with continuous ranges for learning rate and weight decay, and categorical choices for batch size and epochs. Configures the number of random trials and maximum training epochs to balance exploration with computational efficiency.


In [None]:
# Train BERT model (if not already trained via hyperparameter tuning)
if not AUTO_TUNE_ENABLED or not os.path.exists("./checkpoints/bert-base/best_grid"):
    print("Training BERT model with default/configured parameters...")
    print("Fine-tuning process per proposal Section V.D:")
    print("  - Tokenization using BERT's WordPiece tokenizer")
    print("  - Convert tweets to input IDs and attention masks")
    print("  - Fine-tune with classification head for sentiment labeling")
    print("  - Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup")
    
    trainer.train()
    
    # Evaluate on validation set
    val_results = trainer.evaluate()
    print("\nValidation Results:")
    print(f"  Accuracy: {val_results.get('eval_accuracy', 0):.4f}")
    print(f"  Precision: {val_results.get('eval_precision', 0):.4f}")
    print(f"  Recall: {val_results.get('eval_recall', 0):.4f}")
    print(f"  F1-macro: {val_results.get('eval_f1_macro', 0):.4f}")
    
    # Save best checkpoint
    best_ckpt_dir = "./checkpoints/bert-base/best"
    trainer.save_model(best_ckpt_dir)
    tokenizer.save_pretrained(best_ckpt_dir)
    print(f"\n✓ Best model saved to {best_ckpt_dir}")
else:
    print("Using best model from hyperparameter tuning.")
    best_ckpt_dir = "./checkpoints/bert-base/best_grid"


# **Model Evaluation and Baseline Comparison (per proposal Section VI)**

**`Purpose`**

This cell performs comprehensive evaluation of the trained BERT model on the test set. It generates detailed performance metrics, classification reports, and comparison with baseline models.

**`Input`**

Requires best_model and best_ckpt_dir from training cells. Needs ds_test dataset and baseline model results for comparison.

**`Output`**

Prints detailed performance metrics, generates confusion matrices, ROC-AUC and PR-AUC curves, and creates comparison tables. Exports results to CSV files for further analysis.

**`Details`**

The evaluation process tests the best BERT model on the held-out test set. It computes accuracy, precision, recall, and F1-macro scores for each class and overall. Generates confusion matrices to visualize classification performance. Creates ROC-AUC and PR-AUC curves for each sentiment class to analyze precision-recall trade-offs. Compares BERT results with baseline TF-IDF and Logistic Regression model to quantify improvements. All results are exported to files for documentation and reporting.


**`Purpose`**

This cell configures the training arguments and sets up helper functions for BERT model fine-tuning. It defines hyperparameters, optimization settings, and evaluation metrics according to the proposal methodology.

**`Input`**

Uses global variables like MODEL_NAME, MAX_LEN, RANDOM_SEED, and device settings. Requires tokenizer and model to be initialized from previous cells.

**`Output`**

Creates training_args object and compute_metrics function. Prints configuration summary including optimizer settings, learning rate schedule, and training parameters.

**`Details`**

The cell sets up TrainingArguments with AdamW optimizer, linear learning rate warmup, cross-entropy loss, and early stopping callback. It configures evaluation strategy, logging, and checkpoint saving. The compute_metrics function calculates accuracy, precision, recall, and F1-macro scores for each evaluation step. All settings align with the proposal requirements for systematic model training and evaluation.


In [None]:
# Load best BERT model for evaluation
from transformers import AutoModelForSequenceClassification

best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)
best_model.eval()

# Create trainer with best model for evaluation
eval_trainer = Trainer(
    model=best_model,
    args=training_args,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

# Evaluate on test set (15%)
print("Evaluating BERT model on test set...")
test_results = eval_trainer.evaluate()

print("\nBERT Model Performance (Test Set) - per proposal Section VI.A:")
print(f"  Accuracy: {test_results.get('eval_accuracy', 0):.4f}")
print(f"  Precision: {test_results.get('eval_precision', 0):.4f}")
print(f"  Recall: {test_results.get('eval_recall', 0):.4f}")
print(f"  F1-macro: {test_results.get('eval_f1_macro', 0):.4f}")

# Generate predictions for detailed analysis
test_predictions = eval_trainer.predict(ds_test)
test_preds_bert = np.argmax(test_predictions.predictions, axis=1)
test_labels_bert = test_predictions.label_ids

# Map predictions back from BERT format (0,1,2) to proposal format (-1,0,1)
reverse_mapping = {0: -1, 1: 0, 2: 1}
test_preds = np.array([reverse_mapping[p] for p in test_preds_bert])
test_labels = np.array([reverse_mapping[l] for l in test_labels_bert])

# Classification report
print("\nTest Set Classification Report:")
print(classification_report(test_labels, test_preds, 
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

# Confusion matrix
cm_bert = confusion_matrix(test_labels, test_preds, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm_bert, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('BERT Model - Test Set Confusion Matrix')
plt.tight_layout()
plt.savefig('exports/confusion_matrices/bert_cm_test.png', dpi=150)
plt.close()
print("\n✓ Confusion matrix saved to exports/confusion_matrices/bert_cm_test.png")

# Save test predictions
test_df_results = pd.DataFrame({
    'review': df_test['review'].tolist(),
    'gold': test_labels,
    'pred': test_preds
})
test_df_results.to_csv('exports/bert_predictions_test.csv', index=False)
print("✓ Test predictions saved to exports/bert_predictions_test.csv")

# ROC-AUC and PR-AUC curves (per proposal Section III)
from sklearn.metrics import roc_curve, auc, precision_recall_curve, roc_auc_score, average_precision_score
from sklearn.preprocessing import label_binarize

# Binarize labels for multi-class ROC
y_test_bin = label_binarize(test_labels, classes=[-1, 0, 1])
n_classes = 3
test_probs = torch.softmax(torch.tensor(test_predictions.predictions), dim=-1).numpy()

# Compute ROC and PR curves for each class
# Note: test_probs columns are in BERT order (0=negative, 1=neutral, 2=positive)
# which maps to proposal order: 0→-1, 1→0, 2→1
fpr = dict()
tpr = dict()
roc_auc = dict()
precision = dict()
recall = dict()
pr_auc = dict()

# Map BERT probabilities to proposal label order
bert_to_proposal = {0: 0, 1: 1, 2: 2}  # BERT index 0→proposal -1, index 1→proposal 0, index 2→proposal 1
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    class_idx = [-1, 0, 1][i]  # Proposal format
    bert_idx = {-1: 0, 0: 1, 1: 2}[class_idx]  # BERT format index
    y_true_class = (test_labels == class_idx).astype(int)
    y_score_class = test_probs[:, bert_idx]  # Use BERT index for probabilities
    
    fpr[i], tpr[i], _ = roc_curve(y_true_class, y_score_class)
    roc_auc[i] = auc(fpr[i], tpr[i])
    
    precision[i], recall[i], _ = precision_recall_curve(y_true_class, y_score_class)
    pr_auc[i] = average_precision_score(y_true_class, y_score_class)
    
    print(f"\n{class_name} (class {class_idx}):")
    print(f"  ROC-AUC: {roc_auc[i]:.4f}")
    print(f"  PR-AUC: {pr_auc[i]:.4f}")

# Plot ROC curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC curves
ax = axes[0]
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    ax.plot(fpr[i], tpr[i], label=f'{class_name} (AUC = {roc_auc[i]:.3f})')
ax.plot([0, 1], [0, 1], 'k--', label='Random')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves (per proposal Section III)')
ax.legend()
ax.grid(True, alpha=0.3)

# PR curves
ax = axes[1]
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    ax.plot(recall[i], precision[i], label=f'{class_name} (AP = {pr_auc[i]:.3f})')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curves (per proposal Section III)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
os.makedirs("exports/roc_curves", exist_ok=True)
plt.savefig('exports/roc_curves/bert_roc_pr_curves.png', dpi=150)
plt.close()
print("\n✓ ROC and PR curves saved to exports/roc_curves/bert_roc_pr_curves.png")

# Baseline vs BERT comparison (per proposal Section VI.B)
comparison_df = pd.DataFrame({
    'Model': ['Baseline (TF-IDF + LogReg)', 'BERT-base-uncased'],
    'Accuracy': [test_acc, test_results.get('eval_accuracy', 0)],
    'Precision': [test_prec, test_results.get('eval_precision', 0)],
    'Recall': [test_rec, test_results.get('eval_recall', 0)],
    'F1-macro': [test_f1, test_results.get('eval_f1_macro', 0)],
})

print("\n" + "="*60)
print("Baseline vs BERT Comparison (per proposal Section VI.B)")
print("="*60)
print(comparison_df.to_string(index=False))
comparison_df.to_csv('exports/model_comparison.csv', index=False)
print("\n✓ Model comparison saved to exports/model_comparison.csv")


# **Visualization and Analysis**

**`Purpose`**

This cell creates visualizations to analyze and compare model performance. It generates plots showing confusion matrices, metrics comparisons, and class distributions.

**`Input`**

Requires evaluation results from previous cells, including predictions, true labels, and performance metrics for both baseline and BERT models.

**`Output`**

Saves visualization images to the exports directory, including confusion matrices, metrics comparison charts, and class distribution plots.

**`Details`**

Creates a grid of visualizations including confusion matrices for baseline and BERT models, side-by-side metrics comparison bar charts, and class distribution analysis. The visualizations help identify which classes are easier or harder to predict, where models make errors, and how performance varies across different sentiment categories. All plots are saved as high-resolution PNG files for inclusion in reports and presentations.


**`Purpose`**

This cell creates visualizations to analyze and compare model performance. It generates plots showing confusion matrices, metrics comparisons, and class distributions.

**`Input`**

Requires evaluation results from previous cells, including predictions, true labels, and performance metrics for both baseline and BERT models.

**`Output`**

Saves visualization images to the exports directory, including confusion matrices, metrics comparison charts, and class distribution plots.

**`Details`**

Creates a grid of visualizations including confusion matrices for baseline and BERT models, side-by-side metrics comparison bar charts, and class distribution analysis. The visualizations help identify which classes are easier or harder to predict, where models make errors, and how performance varies across different sentiment categories. All plots are saved as high-resolution PNG files for inclusion in reports and presentations.


In [None]:
# Create comprehensive visualization grid
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Confusion matrices comparison
ax = axes[0, 0]
ConfusionMatrixDisplay(cm, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
ax.set_title('Baseline Model - Test Set')

ax = axes[0, 1]
ConfusionMatrixDisplay(cm_bert, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
ax.set_title('BERT Model - Test Set')

# 2. Metrics comparison bar chart
ax = axes[1, 0]
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-macro']
baseline_vals = [test_acc, test_prec, test_rec, test_f1]
bert_vals = [test_results.get('eval_accuracy', 0), 
             test_results.get('eval_precision', 0),
             test_results.get('eval_recall', 0),
             test_results.get('eval_f1_macro', 0)]

x = np.arange(len(metrics))
width = 0.35
ax.bar(x - width/2, baseline_vals, width, label='Baseline', alpha=0.8)
ax.bar(x + width/2, bert_vals, width, label='BERT', alpha=0.8)
ax.set_ylabel('Score')
ax.set_title('Model Comparison - Test Set Metrics')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# 3. Class distribution
ax = axes[1, 1]
label_counts = df_test['label'].value_counts().sort_index()
labels = ['negative/toxic', 'neutral', 'positive']
colors = ['#F87171', '#FBBF24', '#34D399']
ax.bar(labels, [label_counts.get(-1, 0), label_counts.get(0, 0), label_counts.get(1, 0)], color=colors, alpha=0.8)
ax.set_ylabel('Count')
ax.set_title('Test Set Class Distribution')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('exports/model_analysis_grid.png', dpi=150)
plt.close()
print("✓ Visualization grid saved to exports/model_analysis_grid.png")


# **Inference Examples and Model Testing (per proposal)**

**`Purpose`**

This cell demonstrates the trained model on sample tweets to showcase its ability to handle various types of social media text including toxic content, neutral statements, positive messages, and challenging cases like sarcasm.

**`Input`**

Requires best_model and best_tokenizer from previous cells. Uses predefined sample tweets representing different sentiment categories.

**`Output`**

Prints predictions for each sample tweet, showing the predicted sentiment label, confidence score, and probability distribution across all classes.

**`Details`**

Tests the model on carefully selected examples that represent the diversity of Twitter content. Includes clear toxic examples, neutral statements, positive messages, sarcastic or ambiguous cases, and informal language. For each example, the cell tokenizes the text, runs inference through the model, and maps the output back to proposal format (-1, 0, 1). Displays both the predicted class and the full probability distribution, allowing for analysis of model confidence and decision-making process.


**`Purpose`**

This cell initializes the BERT model and prepares the datasets for transformer-based training. It loads the pre-trained BERT tokenizer, converts text to numerical representations, and sets up the model architecture for sequence classification.

**`Input`**

Requires df_train, df_val, and df_test DataFrames with cleaned text in the review column and labels in the label column. The labels should be in proposal format (-1, 0, 1).

**`Output`**

Creates tokenized datasets (ds_train, ds_val, ds_test) ready for PyTorch training. Initializes the BERT model on the specified device (CPU or GPU) and prints confirmation messages.

**`Details`**

The cell maps labels from proposal format (-1, 0, 1) to BERT format (0, 1, 2) for compatibility with the model. It uses BERT WordPiece tokenizer to convert text into input IDs and attention masks. The tokenization process handles text truncation and padding to a maximum length of 128 tokens, which is standard for Twitter posts. The datasets are formatted for PyTorch with appropriate column selection and tensor formatting.


In [None]:
# Smoke test with sample tweets (per proposal: demonstrate context-aware understanding)
sample_tweets = [
    ("Toxic example", "This is absolutely disgusting! People like you should be banned from social media. Horrible!"),
    ("Neutral example", "Just finished my morning coffee. Weather is okay today, nothing special."),
    ("Positive example", "So grateful for all the support today! Amazing community, thank you everyone! 🙏"),
    ("Sarcastic/Toxic", "Oh wonderful, another day of dealing with this nonsense. Just perfect..."),
    ("Informal positive", "This made my day! So happy right now! Best news ever! 🔥"),
]

print("Testing BERT model on sample tweets (per proposal: handles sarcasm, informal language):\n")
print("="*70)

# Load model and tokenizer if not already loaded
try:
    # Check if best_model exists and is on the correct device
    _ = best_model
    best_model = best_model.to(device)
    best_model.eval()  # Ensure eval mode
except NameError:
    # Load model if not already loaded
    from transformers import AutoModelForSequenceClassification
    best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)
    best_model.eval()

best_tokenizer = AutoTokenizer.from_pretrained(best_ckpt_dir, use_fast=True)

for label, tweet in sample_tweets:
    # Tokenize
    inputs = best_tokenizer(tweet, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    
    # Predict
    with torch.no_grad():
        outputs = best_model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)[0].cpu().numpy()
        pred_bert = np.argmax(probs)  # BERT format: 0, 1, 2
    
    # Map prediction from BERT format (0,1,2) to proposal format (-1,0,1)
    bert_to_proposal = {0: -1, 1: 0, 2: 1}
    pred = bert_to_proposal[pred_bert]
    
    # Map prediction to label name
    label_map = {-1: "negative/toxic", 0: "neutral", 1: "positive"}
    pred_label = label_map.get(pred, "unknown")
    confidence = probs[pred_bert] * 100
    
    # Map probabilities to proposal format for display
    prob_map = {0: -1, 1: 0, 2: 1}  # BERT index → proposal label
    probs_proposal = {label_map[prob_map[i]]: float(probs[i]) for i in range(3)}
    
    print(f"\n{label}:")
    print(f"  Tweet: \"{tweet}\"")
    print(f"  Prediction: {pred_label} (confidence: {confidence:.2f}%)")
    print(f"  Probabilities: negative/toxic={probs_proposal['negative/toxic']:.3f}, neutral={probs_proposal['neutral']:.3f}, positive={probs_proposal['positive']:.3f}")

print("\n" + "="*70)
print("✓ Inference examples completed")


# **Export and Deployment Preparation**

**`Purpose`**

This cell prepares all artifacts for deployment and documentation. It exports model weights, experiment logs, model cards, and summary reports in various formats.

**`Input`**

Requires best_model, best_ckpt_dir, and all evaluation results from previous cells. Needs runs_log.csv and other experiment outputs.

**`Output`**

Creates quantized model weights, consolidated experiment logs in CSV and Excel formats, model card JSON file, and comprehensive summary report. All files are saved to the exports directory.

**`Details`**

The export process saves quantized PyTorch model weights for lightweight deployment. Consolidates all experiment runs into a single CSV and Excel file with all hyperparameters and metrics. Generates a model card in JSON format containing metadata, performance metrics, and configuration details. Creates a text summary report with key findings, model comparisons, and target achievement status. All exports are organized in the exports directory for easy access and sharing.


**`Purpose`**

This cell creates visualizations to analyze and compare model performance. It generates plots showing confusion matrices, metrics comparisons, and class distributions.

**`Input`**

Requires evaluation results from previous cells, including predictions, true labels, and performance metrics for both baseline and BERT models.

**`Output`**

Saves visualization images to the exports directory, including confusion matrices, metrics comparison charts, and class distribution plots.

**`Details`**

Creates a grid of visualizations including confusion matrices for baseline and BERT models, side-by-side metrics comparison bar charts, and class distribution analysis. The visualizations help identify which classes are easier or harder to predict, where models make errors, and how performance varies across different sentiment categories. All plots are saved as high-resolution PNG files for inclusion in reports and presentations.


In [None]:
# Export artifacts for deployment
print("Exporting artifacts for deployment...")

# 0. Ensure full model is saved (pytorch_model.bin)
print("\n0. Ensuring full model is saved...")
full_model_path = Path(best_ckpt_dir) / "pytorch_model.bin"
if not full_model_path.exists():
    print("   Full model not found, saving it now...")
    best_model.save_pretrained(best_ckpt_dir)
    print(f"   Full model saved to {best_ckpt_dir}")
else:
    print(f"   Full model already exists at {full_model_path}")

# 1. Save quantized model weights (for lightweight deployment)
print("\n1. Saving quantized model weights...")
try:
    quantized_model = torch.quantization.quantize_dynamic(
        best_model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
    )
    quantized_path = Path(best_ckpt_dir) / 'pytorch_model_quantized.bin'
    torch.save(quantized_model.state_dict(), quantized_path)
    print(f"   ✓ Quantized model saved to {quantized_path}")
except Exception as e:
    print(f"   ⚠ Quantization failed: {e}")

# 2. Export consolidated experiment logs
print("\n2. Exporting consolidated experiment logs...")
if os.path.exists("runs_log.csv"):
    runs_df = pd.read_csv("runs_log.csv")
    runs_df.to_csv("exports/experiment_runs_all.csv", index=False)
    print("   ✓ Experiment logs exported to exports/experiment_runs_all.csv")
    
    # Generate Excel summary if openpyxl available
    try:
        runs_df.to_excel("exports/experiment_runs_all.xlsx", index=False)
        print("   ✓ Excel summary exported to exports/experiment_runs_all.xlsx")
    except:
        print("   ⚠ Excel export skipped (openpyxl not available)")

# 3. Generate model card information
print("\n3. Generating model card...")
model_card = {
    "model_name": MODEL_NAME,
    "task": "Toxic Comment Detection on Twitter",
    "num_labels": 3,
    "labels": ["negative/toxic (-1)", "neutral (0)", "positive (1)"],
    "dataset": "mteb/tweet_sentiment_extraction",
    "training_split": "70%",
    "validation_split": "15%",
    "test_split": "15%",
    "max_sequence_length": MAX_LEN,
    "optimizer": "AdamW",
    "loss": "Cross-entropy",
    "scheduler": "Linear warmup",
    "best_test_accuracy": float(test_results.get('eval_accuracy', 0)),
    "best_test_f1_macro": float(test_results.get('eval_f1_macro', 0)),
    "baseline_accuracy": float(test_acc),
    "baseline_f1_macro": float(test_f1),
    "improvement": f"{((test_results.get('eval_f1_macro', 0) - test_f1) / test_f1 * 100):.2f}%",
}

with open("exports/model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
print("   ✓ Model card saved to exports/model_card.json")

# 4. Create summary report
print("\n4. Creating summary report...")
summary_text = f"""
Twitter Toxicity Detection Project - Summary Report
==================================================

Dataset: mteb/tweet_sentiment_extraction
Total samples: {len(df)}
Train/Val/Test split: 70%/15%/15%

Baseline Model (TF-IDF + Logistic Regression):
  - Accuracy: {test_acc:.4f}
  - Precision: {test_prec:.4f}
  - Recall: {test_rec:.4f}
  - F1-macro: {test_f1:.4f}

BERT Model (bert-base-uncased):
  - Accuracy: {test_results.get('eval_accuracy', 0):.4f}
  - Precision: {test_results.get('eval_precision', 0):.4f}
  - Recall: {test_results.get('eval_recall', 0):.4f}
  - F1-macro: {test_results.get('eval_f1_macro', 0):.4f}

Improvement over baseline:
  - F1-macro improvement: {((test_results.get('eval_f1_macro', 0) - test_f1) / test_f1 * 100):.2f}%

Target F1-macro (per proposal): >0.85
Achieved F1-macro: {test_results.get('eval_f1_macro', 0):.4f}
Target met: {'✓ YES' if test_results.get('eval_f1_macro', 0) > 0.85 else '✗ NO'}

All artifacts exported to:
  - Model checkpoints: {best_ckpt_dir}
  - Predictions: exports/
  - Visualizations: exports/confusion_matrices/, exports/roc_curves/
  - Logs: runs_log.csv, exports/experiment_runs_all.csv
"""

with open("exports/summary_report.txt", "w") as f:
    f.write(summary_text)
print(summary_text)
print("\n✓ Summary report saved to exports/summary_report.txt")
print("\n" + "="*70)
print("✓ All exports completed successfully!")
print("="*70)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, precision_score, recall_score

# TF-IDF vectorization (per proposal: Term Frequency-Inverse Document Frequency method)
tfidf = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1,2),  # Word 1-2 ngrams to capture phrases
    lowercase=True
)

# Prepare data
X_tr = df_train['review'].tolist()
y_tr = df_train['label'].values
X_va = df_val['review'].tolist()
y_va = df_val['label'].values

# Transform text to TF-IDF vectors
print("Fitting TF-IDF vectorizer...")
X_tr_tfidf = tfidf.fit_transform(X_tr)
X_va_tfidf = tfidf.transform(X_va)

# Hyperparameter tuning with grid search and cross-validation (per proposal Section VI.B)
print("\nPerforming hyperparameter tuning with GridSearchCV...")
param_grid = {
    'C': [0.5, 1.0, 2.0, 4.0],
    'class_weight': [None, 'balanced', {-1: 1.2, 0: 0.8, 1: 1.2}],  # Custom: reduce neutral bias
    'max_iter': [1000, 2000]
}

logreg_base = LogisticRegression(random_state=RANDOM_SEED)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

grid_search = GridSearchCV(
    logreg_base, 
    param_grid=param_grid, 
    cv=skf, 
    scoring='f1_macro',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_tr_tfidf, y_tr)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score (F1-macro): {grid_search.best_score_:.4f}")

# Train final model with best parameters
logreg = grid_search.best_estimator_
preds = logreg.predict(X_va_tfidf)

# Calculate metrics (per proposal Section VI.A)
acc_base = accuracy_score(y_va, preds)
prec_base = precision_score(y_va, preds, average='macro', zero_division=0)
rec_base = recall_score(y_va, preds, average='macro', zero_division=0)
f1_base = f1_score(y_va, preds, average='macro', zero_division=0)

print("\nBaseline Model Performance (Validation Set):")
print({
    "model": "tfidf-logreg",
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base
})

# Classification report
print("\nClassification Report:")
print(classification_report(y_va, preds, 
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

# Log to runs_log.csv
row = {
    "member": "baseline",
    "model": "tfidf-logreg",
    "num_train_epochs": None,
    "per_device_train_batch_size": None,
    "learning_rate": None,
    "weight_decay": None,
    "warmup_steps": None,
    "lr_scheduler_type": None,
    "gradient_accumulation_steps": None,
    "max_seq_length": None,
    "seed": RANDOM_SEED,
    "fp16": False,
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base,
    "notes": f"TF-IDF + LogReg baseline with GridSearchCV. Best params: {grid_search.best_params_}"
}

pd.DataFrame([row]).to_csv("runs_log.csv", mode="a",
                           index=False, header=not os.path.exists("runs_log.csv"))

# Save baseline model
os.makedirs("models", exist_ok=True)
import joblib
joblib.dump(logreg, "models/baseline_tfidf_logreg.joblib")
joblib.dump(tfidf, "models/baseline_tfidf_vectorizer.joblib")
print("\n✓ Baseline model saved to models/baseline_tfidf_logreg.joblib")


# **Fast Mode Configuration (Optional)**

**`Purpose`**

This cell provides an option to reduce the training dataset size for faster experimentation and development. When enabled, it uses a subset of the training data while keeping validation and test sets intact for reliable evaluation.

**`Input`**

Requires df_train, df_val, and df_test DataFrames from previous cells. The FAST_MODE flag controls whether data reduction is applied.

**`Output`**

Prints the reduced dataset sizes and percentages. If FAST_MODE is disabled, confirms that full datasets will be used.

**`Details`**

When FAST_MODE is True, the training data is randomly sampled to 40% of its original size using a fixed random seed for reproducibility. Validation and test sets remain unchanged to ensure fair evaluation. This allows for quicker iteration during development while maintaining the ability to run full experiments when needed.


**`Purpose`**

This cell provides an option to reduce the training dataset size for faster experimentation and development. When enabled, it uses a subset of the training data while keeping validation and test sets intact for reliable evaluation.

**`Input`**

Requires df_train, df_val, and df_test DataFrames from previous cells. The FAST_MODE flag controls whether data reduction is applied.

**`Output`**

Prints the reduced dataset sizes and percentages. If FAST_MODE is disabled, confirms that full datasets will be used.

**`Details`**

When FAST_MODE is True, the training data is randomly sampled to 40% of its original size using a fixed random seed for reproducibility. Validation and test sets remain unchanged to ensure fair evaluation. This allows for quicker iteration during development while maintaining the ability to run full experiments when needed.


In [3]:
FAST_MODE = True  # Set to True for faster training (40% data), False for full dataset
TRAIN_FRACTION = 0.40 if FAST_MODE else 1.0
VAL_FRACTION = 1.0  # Keep full validation/test by default

if FAST_MODE and TRAIN_FRACTION < 1.0:
    df_train = (df_train
                .sample(frac=TRAIN_FRACTION, random_state=RANDOM_SEED)
                .sort_values('id')
                .reset_index(drop=True))
    if VAL_FRACTION < 1.0:
        df_val = (df_val
                  .sample(frac=VAL_FRACTION, random_state=RANDOM_SEED)
                  .sort_values('id')
                  .reset_index(drop=True))
        df_test = (df_test
                   .sample(frac=VAL_FRACTION, random_state=RANDOM_SEED)
                   .sort_values('id')
                   .reset_index(drop=True))
    print(f"[FAST_MODE ENABLED] Using {TRAIN_FRACTION*100:.0f}% of training data\n  Train: {len(df_train)} samples{len(df_train)} (~{TRAIN_FRACTION*100:.0f}%), val={len(df_val)}, test={len(df_test)}")
else:
    print("FAST_MODE disabled: using full train/val/test splits")


FAST_MODE disabled: using full train/val/test splits


# **Baseline TF-IDF + Logistic Regression (per proposal Section V.C)**


**`Purpose`**

This block builds a baseline model using TF-IDF vectorization and Logistic Regression as a reference point per proposal Section V.C. The baseline model acts as a foundation for measuring improvements from more complex deep learning methods. It uses traditional machine learning techniques that are simple, interpretable, and computationally efficient.

**`Input`**

The inputs are the train_df and val_df frames created earlier. Only the review and label columns are used. The TF-IDF vectorizer is configured with word 1-2 ngrams and a vocabulary limit to cap memory and training time. The labels are taken directly as integer classes (-1, 0, 1).

**`Output`**

The block prints a compact dictionary that contains baseline accuracy, precision, recall, and F1-score (per proposal Section VI.A). It also appends a structured row to runs_log.csv so that the baseline appears in the experiment ledger with model name, scores, and notes. These outputs provide both an on-screen summary and a durable record for later tables and charts.

**`Details`**

A TF-IDF vectorizer is fit on the training text and applied to the validation text, producing sparse matrices. A Logistic Regression model is trained with hyperparameter tuning via grid search and cross-validation (per proposal Section VI.B). Predictions for the validation set are compared against the gold labels to compute accuracy, precision, recall, and F1-score, where macro-F1 treats all classes equally. The metrics are printed and then written to the log file with consistent column names.


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, precision_score, recall_score

# TF-IDF vectorization (per proposal: Term Frequency-Inverse Document Frequency method)
# Enhanced settings for better accuracy
tfidf = TfidfVectorizer(
    max_features=100000,  # Increased from 50000 to capture more features
    ngram_range=(1,3),  # Expanded to 1-3 ngrams to capture more context
    lowercase=True,
    min_df=2,  # Ignore terms that appear in fewer than 2 documents
    max_df=0.95,  # Ignore terms that appear in more than 95% of documents
    sublinear_tf=True  # Apply sublinear tf scaling (1 + log(tf))
)

# Prepare data
X_tr = df_train['review'].tolist()
y_tr = df_train['label'].values
X_va = df_val['review'].tolist()
y_va = df_val['label'].values

# Transform text to TF-IDF vectors
print("Fitting TF-IDF vectorizer...")
X_tr_tfidf = tfidf.fit_transform(X_tr)
X_va_tfidf = tfidf.transform(X_va)

# Hyperparameter tuning with grid search and cross-validation (per proposal Section VI.B)
# Expanded search space for better accuracy
print("\nPerforming hyperparameter tuning with GridSearchCV...")
# Use liblinear which supports both l1 and l2 penalties
param_grid = {
    'C': [0.1, 0.5, 1.0, 2.0, 4.0, 8.0],  # Expanded C range
    'class_weight': [None, 'balanced', {-1: 1.2, 0: 0.8, 1: 1.2}],  # Custom: reduce neutral bias
    'max_iter': [2000, 3000],  # Increased iterations
    'penalty': ['l1', 'l2']  # Try both regularization types
}

logreg_base = LogisticRegression(solver='liblinear', random_state=RANDOM_SEED)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

grid_search = GridSearchCV(
    logreg_base,
    param_grid=param_grid,
    cv=skf,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_tr_tfidf, y_tr)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score (F1-macro): {grid_search.best_score_:.4f}")

# Train final model with best parameters
logreg = grid_search.best_estimator_
preds = logreg.predict(X_va_tfidf)

# Calculate metrics (per proposal Section VI.A)
acc_base = accuracy_score(y_va, preds)
prec_base = precision_score(y_va, preds, average='macro', zero_division=0)
rec_base = recall_score(y_va, preds, average='macro', zero_division=0)
f1_base = f1_score(y_va, preds, average='macro', zero_division=0)

print("\nBaseline Model Performance (Validation Set):")
print({
    "model": "tfidf-logreg",
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base
})

# Classification report
print("\nClassification Report:")
print(classification_report(y_va, preds,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

# Log to runs_log.csv
row = {
    "member": "baseline",
    "model": "tfidf-logreg",
    "num_train_epochs": None,
    "per_device_train_batch_size": None,
    "learning_rate": None,
    "weight_decay": None,
    "warmup_steps": None,
    "lr_scheduler_type": None,
    "gradient_accumulation_steps": None,
    "max_seq_length": None,
    "seed": RANDOM_SEED,
    "fp16": False,
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base,
    "notes": f"TF-IDF + LogReg baseline with GridSearchCV. Best params: {grid_search.best_params_}"
}

pd.DataFrame([row]).to_csv("runs_log.csv", mode="a",
                           index=False, header=not os.path.exists("runs_log.csv"))

# Save baseline model
os.makedirs("models", exist_ok=True)
import joblib
joblib.dump(logreg, "models/baseline_tfidf_logreg.joblib")
joblib.dump(tfidf, "models/baseline_tfidf_vectorizer.joblib")
print("\n✓ Baseline model saved to models/baseline_tfidf_logreg.joblib")


Fitting TF-IDF vectorizer...

Performing hyperparameter tuning with GridSearchCV...
Fitting 5 folds for each of 48 candidates, totalling 240 fits

Best parameters: {'C': 1.0, 'class_weight': 'balanced', 'max_iter': 2000, 'penalty': 'l1'}
Best CV score (F1-macro): 0.6579

Baseline Model Performance (Validation Set):
{'model': 'tfidf-logreg', 'accuracy': 0.6745960502692998, 'precision': 0.6856959388663061, 'recall': 0.6677298814979974, 'f1_macro': 0.6736792486942251}

Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.68      0.59      0.63      1288
       neutral       0.62      0.73      0.67      1764
      positive       0.76      0.69      0.72      1404

      accuracy                           0.67      4456
     macro avg       0.69      0.67      0.67      4456
  weighted avg       0.68      0.67      0.67      4456


✓ Baseline model saved to models/baseline_tfidf_logreg.joblib


# Baseline Evaluation on Test Set


**`Purpose`**

This cell creates visualizations to analyze and compare model performance. It generates plots showing confusion matrices, metrics comparisons, and class distributions.

**`Input`**

Requires evaluation results from previous cells, including predictions, true labels, and performance metrics for both baseline and BERT models.

**`Output`**

Saves visualization images to the exports directory, including confusion matrices, metrics comparison charts, and class distribution plots.

**`Details`**

Creates a grid of visualizations including confusion matrices for baseline and BERT models, side-by-side metrics comparison bar charts, and class distribution analysis. The visualizations help identify which classes are easier or harder to predict, where models make errors, and how performance varies across different sentiment categories. All plots are saved as high-resolution PNG files for inclusion in reports and presentations.


In [5]:
# Evaluate baseline on test set (15%)
# Load models if not already in memory (in case this cell is run independently)
try:
    # Check if tfidf and logreg are defined
    _ = tfidf
    _ = logreg
    print("Using models from previous cell...")
except NameError:
    # Try to load from saved files
    import joblib
    import os
    model_path = "models/baseline_tfidf_logreg.joblib"
    vectorizer_path = "models/baseline_tfidf_vectorizer.joblib"

    if os.path.exists(model_path) and os.path.exists(vectorizer_path):
        print("Loading baseline models from disk...")
        tfidf = joblib.load(vectorizer_path)
        logreg = joblib.load(model_path)
        print("✓ Models loaded successfully")
    else:
        raise NameError(
            "Baseline models not found. Please run Cell 12 (Baseline TF-IDF + Logistic Regression training) first to train and save the models."
        )

X_te = df_test['review'].tolist()
y_te = df_test['label'].values
X_te_tfidf = tfidf.transform(X_te)

test_preds = logreg.predict(X_te_tfidf)

# Calculate test metrics
test_acc = accuracy_score(y_te, test_preds)
test_prec = precision_score(y_te, test_preds, average='macro', zero_division=0)
test_rec = recall_score(y_te, test_preds, average='macro', zero_division=0)
test_f1 = f1_score(y_te, test_preds, average='macro', zero_division=0)

print("Baseline Model Performance (Test Set):")
print({
    "model": "tfidf-logreg",
    "accuracy": test_acc,
    "precision": test_prec,
    "recall": test_rec,
    "f1_macro": test_f1
})

print("\nTest Set Classification Report:")
print(classification_report(y_te, test_preds,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

# Save test predictions
os.makedirs("exports", exist_ok=True)
pd.DataFrame({
    'review': X_te,
    'gold': y_te,
    'pred': test_preds
}).to_csv('exports/baseline_predictions_test.csv', index=False)

# Confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_te, test_preds, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('Baseline Model - Test Set Confusion Matrix')
plt.tight_layout()
os.makedirs("exports/confusion_matrices", exist_ok=True)
plt.savefig('exports/confusion_matrices/baseline_cm_test.png', dpi=150)
plt.close()
print("\n✓ Confusion matrix saved to exports/confusion_matrices/baseline_cm_test.png")


Using models from previous cell...
Baseline Model Performance (Test Set):
{'model': 'tfidf-logreg', 'accuracy': 0.6568364611260054, 'precision': 0.6697405245796803, 'recall': 0.6501692273785475, 'f1_macro': 0.6562920690025043}

Test Set Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.69      0.56      0.62      1292
       neutral       0.60      0.70      0.65      1781
      positive       0.72      0.69      0.70      1403

      accuracy                           0.66      4476
     macro avg       0.67      0.65      0.66      4476
  weighted avg       0.66      0.66      0.66      4476


✓ Confusion matrix saved to exports/confusion_matrices/baseline_cm_test.png


# **BERT Model Initialization and Tokenization (per proposal Section V.C)**


**`Purpose`**

This block initializes the BERT-base-uncased model per proposal Section V.C for toxicity classification. It loads the BERT tokenizer (WordPiece tokenizer) and prepares the datasets for transformer fine-tuning by converting tweets into numerical input representations suitable for model processing.

**`Input`**

The inputs are the training, validation, and test DataFrames created earlier. The BERT tokenizer is loaded from the transformers library, and a maximum sequence length is specified to keep batch shapes uniform.

**`Output`**

The block prints the resolved model name and confirms tokenization completion. Three datasets.Dataset objects are produced with tensor columns input_ids, attention_mask, and label. A classification model with three output labels is created and moved to the detected device.

**`Details`**

The BERT tokenizer (WordPiece tokenizer) is loaded with the fast backend and wrapped in a function that applies truncation and padding to a fixed length of 128 tokens (standard for tweets). The pandas frames are converted into Dataset objects, tokenization is applied in batches for speed, and the dataset columns are formatted as PyTorch tensors. The model is loaded with a task-specific head sized to three classes and placed on CPU or GPU.


In [6]:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import platform

USE_GPU = torch.cuda.is_available()
if USE_GPU:
    print(f"✓ GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"  CUDA Version: {torch.version.cuda}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    device = torch.device("cuda:0")
    NUM_WORKERS = 0 if platform.system() == 'Windows' else 4
    PIN_MEMORY = True
else:
    print("⚠ No GPU detected, using CPU (training will be slow)")
    device = torch.device("cpu")
    NUM_WORKERS = 0
    PIN_MEMORY = False

# Model choice: bert-base-uncased (per proposal Section V.C)
MODEL_NAME = "bert-base-uncased"
print(f"\nUsing model: {MODEL_NAME}")

MAX_LEN = 128  # Standard for tweets

# Load BERT tokenizer (WordPiece tokenizer per proposal)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tokenize_fn(batch):
    """Tokenize tweets using BERT's WordPiece tokenizer (per proposal Section V.B)"""
    return tokenizer(batch["review"], truncation=True, padding="max_length", max_length=MAX_LEN)

print("\nTokenizing datasets (this may take a moment)...")
# Map labels from proposal format (-1,0,1) to BERT format (0,1,2)
# -1 (negative/toxic) → 0, 0 (neutral) → 1, 1 (positive) → 2
label_mapping = {-1: 0, 0: 1, 1: 2}
df_train_bert = df_train.copy()
df_val_bert = df_val.copy()
df_test_bert = df_test.copy()
df_train_bert['label'] = df_train_bert['label'].map(label_mapping)
df_val_bert['label'] = df_val_bert['label'].map(label_mapping)
df_test_bert['label'] = df_test_bert['label'].map(label_mapping)

ds_train = Dataset.from_pandas(df_train_bert[['review','label']].reset_index(drop=True))
ds_val   = Dataset.from_pandas(df_val_bert[['review','label']].reset_index(drop=True))
ds_test  = Dataset.from_pandas(df_test_bert[['review','label']].reset_index(drop=True))

if platform.system() == 'Windows':
    NUM_PROC_TOKENIZE = None
    print("  Using single-process tokenization (Windows compatibility)")
else:
    NUM_PROC_TOKENIZE = 4 if USE_GPU else 2
    print(f"  Using {NUM_PROC_TOKENIZE} processes for tokenization")

ds_train = ds_train.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])
ds_val   = ds_val.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])
ds_test  = ds_test.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])

cols = ['input_ids','attention_mask','label']
ds_train = ds_train.with_format("torch", columns=cols)
ds_val   = ds_val.with_format("torch", columns=cols)
ds_test  = ds_test.with_format("torch", columns=cols)

print(f"✓ Datasets ready: train={len(ds_train)}, val={len(ds_val)}, test={len(ds_test)}")

# Initialize BERT model with 3 labels (negative/neutral/positive)
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
model = model.to(device)
print(f"✓ Model loaded on {device}")


✓ GPU detected: Tesla T4
  CUDA Version: 12.6
  GPU Memory: 15.83 GB

Using model: bert-base-uncased


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]


Tokenizing datasets (this may take a moment)...
  Using 4 processes for tokenization


Map (num_proc=4):   0%|          | 0/20824 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4456 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4476 [00:00<?, ? examples/s]

✓ Datasets ready: train=20824, val=4456, test=4476


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ Model loaded on cuda:0


# **BERT Training Arguments and Configuration (per proposal Section V.D)**


**`Purpose`**

This block defines how BERT training will proceed per proposal Section V.D. It establishes a metric function that reports accuracy, precision, recall, and F1-score, builds training arguments with AdamW optimizer and cross-entropy loss, and constructs a Trainer object that ties the model, data, tokenizer, arguments, and metrics together.

**`Input`**

Inputs include the tokenized datasets, the initialized model and tokenizer, and hyperparameters such as number of epochs, batch sizes, learning rate, weight decay, and evaluation cadence.

**`Output`**

The cell prints a confirmation that the trainer is ready and includes the active model name. Internally, it prepares all objects required for training and evaluation.

**`Details`**

A compute function converts raw model outputs into predicted labels and compares them with gold labels to obtain accuracy, precision, recall, and macro-F1 (target: >0.85 per proposal Section III). Training arguments are configured with AdamW optimizer, cross-entropy loss, linear warmup with learning rate scheduling, early stopping to prevent overfitting, and mixed-precision when GPU is available.


In [7]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
from sklearn.metrics import precision_score, recall_score
import inspect

def compute_metrics(eval_pred):
    """Compute metrics per proposal Section VI.A: accuracy, precision, recall, F1-score"""
    preds = np.argmax(eval_pred.predictions, axis=1)
    labels = eval_pred.label_ids
    acc = accuracy_score(labels, preds)
    prec = precision_score(labels, preds, average='macro', zero_division=0)
    rec = recall_score(labels, preds, average='macro', zero_division=0)
    f1m = f1_score(labels, preds, average='macro', zero_division=0)
    return {'accuracy': acc, 'precision': prec, 'recall': rec, 'f1_macro': f1m}

if USE_GPU:
    TRAIN_BATCH_SIZE = 32
    EVAL_BATCH_SIZE = 64
    USE_FP16 = True
    GRADIENT_CHECKPOINTING = False
else:
    TRAIN_BATCH_SIZE = 8
    EVAL_BATCH_SIZE = 16
    USE_FP16 = False
    GRADIENT_CHECKPOINTING = False

sig = inspect.signature(TrainingArguments.__init__)
argnames = set(sig.parameters.keys())

def make_training_args(**overrides):
    """Create training arguments per proposal Section V.D"""
    base_epochs = 3 if FAST_MODE else 4
    total_steps = max(1, (len(ds_train) // max(1, TRAIN_BATCH_SIZE)) * base_epochs)
    warmup_steps = max(25, int(total_steps * 0.1))

    cfg = dict(
        output_dir=f"./checkpoints/bert-base/run1",
        num_train_epochs=base_epochs,
        per_device_train_batch_size=TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=EVAL_BATCH_SIZE,
        learning_rate=3e-5,  # Standard for BERT fine-tuning
        weight_decay=0.01,
        warmup_ratio=0.1,  # Linear warmup per proposal
        lr_scheduler_type="linear",  # Learning rate scheduling per proposal
        gradient_accumulation_steps=1,
        load_best_model_at_end=True,
        metric_for_best_model="f1_macro",  # Target: >0.85 per proposal
        greater_is_better=True,
        seed=RANDOM_SEED,
        logging_steps=50,
        eval_steps=100,
        save_steps=200,
        save_total_limit=2,
        report_to=[],
        optim="adamw_torch",  # AdamW optimizer per proposal Section V.D
        fp16=USE_FP16,
        dataloader_num_workers=NUM_WORKERS,
        dataloader_pin_memory=PIN_MEMORY,
        remove_unused_columns=False,
        gradient_checkpointing=GRADIENT_CHECKPOINTING,
    )
    cfg.update(overrides)

    if "evaluation_strategy" in argnames:
        cfg["evaluation_strategy"] = cfg.get("evaluation_strategy", "steps")
    elif "eval_strategy" in argnames:
        cfg["eval_strategy"] = cfg.get("eval_strategy", "steps")

    if "save_strategy" in argnames:
        cfg["save_strategy"] = cfg.get("save_strategy", "steps")

    safe_cfg = {k:v for k,v in cfg.items() if k in argnames}
    return TrainingArguments(**safe_cfg)

training_args = make_training_args()

if GRADIENT_CHECKPOINTING and hasattr(model, 'gradient_checkpointing_enable'):
    model.gradient_checkpointing_enable()
    print("✓ Gradient checkpointing enabled (saves memory)")

# Early stopping to prevent overfitting (per proposal)
callbacks = [EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001)]

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=callbacks,
)

print(f"✓ Trainer ready on {device}")
print(f"  Batch size: {TRAIN_BATCH_SIZE} (train), {EVAL_BATCH_SIZE} (eval)")
print(f"  FP16: {USE_FP16}, Workers: {NUM_WORKERS}, Pin Memory: {PIN_MEMORY}")
print(f"  Early stopping: patience=2")
print(f"  Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup")


✓ Trainer ready on cuda:0
  Batch size: 32 (train), 64 (eval)
  FP16: True, Workers: 4, Pin Memory: True
  Early stopping: patience=2
  Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup


  trainer = Trainer(


# Automated Hyperparameter Tuning (per proposal Section V.C)

**`Purpose`**

Run Optuna-driven grid and random searches per proposal Section V.C for hyperparameter tuning and validation curve analysis to find optimum performance. The search explores learning rate, batch size, epochs, weight decay, and warmup ratio to identify the best configuration via validation macro-F1.

**`Input`**

Uses the tokenized datasets (`ds_train`, `ds_val`), global tokenizer/model selections, and shared helpers (`compute_metrics`, `make_training_args`). Configuration depends on `AUTO_TUNE_ENABLED`, search spaces, and trial limits.

**`Output`**

Writes per-strategy trial tables to `tuning/` directory, a combined summary, and logs the best configuration. The winning configuration is retrained, evaluated on validation and test splits, logged to `runs_log.csv`, and predictions exported.

**`Details`**

Defines a lightweight `WeightedTrainer` compatible with Optuna, registers helper functions to build Trainers for suggested hyperparameters. Optuna's `GridSampler` and `RandomSampler` explore the respective spaces, timing each trial and storing validation metrics.


In [8]:
# Automated hyperparameter tuning configuration
AUTO_TUNE_ENABLED = True

GRID_SEARCH_SPACE = {
    "learning_rate": [3e-5],  # Fixed to default (1 value)
    "per_device_train_batch_size": [8, 16],  # 2 values
    "weight_decay": [0.0, 0.01],  # 2 values
    "num_train_epochs": [2, 3],  # 2 values
}


RANDOM_SEARCH_SPACE = {
    "learning_rate": ("log_uniform", 2e-5, 5e-5),
    "per_device_train_batch_size": ("choice", [8, 12, 16, 24, 32]),
    "weight_decay": ("uniform", 0.0, 0.1),
    "num_train_epochs": ("int", 2, 4),
}

RANDOM_TRIALS = 8
MAX_AUTOTUNE_EPOCHS = 4

print("Hyperparameter tuning configuration:")


Hyperparameter tuning configuration:


**`Purpose`**

This cell configures the training arguments and sets up helper functions for BERT model fine-tuning. It defines hyperparameters, optimization settings, and evaluation metrics according to the proposal methodology.

**`Input`**

Uses global variables like MODEL_NAME, MAX_LEN, RANDOM_SEED, and device settings. Requires tokenizer and model to be initialized from previous cells.

**`Output`**

Creates training_args object and compute_metrics function. Prints configuration summary including optimizer settings, learning rate schedule, and training parameters.

**`Details`**

The cell sets up TrainingArguments with AdamW optimizer, linear learning rate warmup, cross-entropy loss, and early stopping callback. It configures evaluation strategy, logging, and checkpoint saving. The compute_metrics function calculates accuracy, precision, recall, and F1-macro scores for each evaluation step. All settings align with the proposal requirements for systematic model training and evaluation.


In [9]:
import gc
import time
from optuna.samplers import GridSampler, RandomSampler
from pathlib import Path

TUNING_DIR = Path("tuning")
TUNING_DIR.mkdir(exist_ok=True)

if AUTO_TUNE_ENABLED:
    # Custom Trainer for Optuna compatibility
    class WeightedTrainer(Trainer):
        def __init__(self, *args, class_weights=None, **kwargs):
            super().__init__(*args, **kwargs)
            self.class_weights = class_weights

    def build_trainer_for_trial(hparams, run_name):
        """Build trainer with suggested hyperparameters"""
        trial_args = make_training_args(
            output_dir=f"./checkpoints/bert-base/{run_name}",
            learning_rate=hparams.get("learning_rate", 3e-5),
            per_device_train_batch_size=hparams.get("per_device_train_batch_size", 16),
            weight_decay=hparams.get("weight_decay", 0.01),
            num_train_epochs=min(hparams.get("num_train_epochs", 3), MAX_AUTOTUNE_EPOCHS),
            eval_strategy="epoch",  # Enable evaluation for early stopping callback
            save_strategy="epoch",  # Must match eval_strategy for load_best_model_at_end
        )

        # Create fresh model for each trial
        trial_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3).to(device)

        trial_trainer = WeightedTrainer(
            model=trial_model,
            args=trial_args,
            train_dataset=ds_train,
            eval_dataset=ds_val,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001)],
        )
        return trial_trainer, trial_args

    def suggest_params(trial, strategy):
        """Suggest hyperparameters based on strategy"""
        if strategy == "grid":
            # Grid search: enumerate all combinations
            lr_vals = GRID_SEARCH_SPACE["learning_rate"]
            bs_vals = GRID_SEARCH_SPACE["per_device_train_batch_size"]
            wd_vals = GRID_SEARCH_SPACE["weight_decay"]
            ep_vals = GRID_SEARCH_SPACE["num_train_epochs"]

            # Use trial number to index into grid
            trial_idx = trial.number
            total_combos = len(lr_vals) * len(bs_vals) * len(wd_vals) * len(ep_vals)
            if trial_idx >= total_combos:
                raise optuna.TrialPruned()

            idx = trial_idx
            lr_idx = idx % len(lr_vals)
            idx //= len(lr_vals)
            bs_idx = idx % len(bs_vals)
            idx //= len(bs_vals)
            wd_idx = idx % len(wd_vals)
            idx //= len(wd_vals)
            ep_idx = idx % len(ep_vals)

            return {
                "learning_rate": lr_vals[lr_idx],
                "per_device_train_batch_size": bs_vals[bs_idx],
                "weight_decay": wd_vals[wd_idx],
                "num_train_epochs": ep_vals[ep_idx],
            }
        else:  # random
            return {
                "learning_rate": trial.suggest_float("learning_rate", 2e-5, 5e-5, log=True),
                "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 12, 16, 24, 32]),
                "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
                "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
            }

    def objective(trial):
        """Optuna objective function"""
        strategy = trial.study.sampler.__class__.__name__
        hparams = suggest_params(trial, "grid" if "Grid" in strategy else "random")

        run_name = f"trial_{trial.number}"
        trainer_obj, args_obj = build_trainer_for_trial(hparams, run_name)

        start_time = time.time()
        trainer_obj.train()
        train_time = time.time() - start_time

        eval_results = trainer_obj.evaluate()
        f1_macro = eval_results.get("eval_f1_macro", 0.0)

        # Note: Labels are in BERT format (0,1,2) during training, metrics are computed correctly

        # Cleanup
        del trainer_obj
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        trial.set_user_attr("train_time", train_time)
        trial.set_user_attr("accuracy", eval_results.get("eval_accuracy", 0.0))
        trial.set_user_attr("precision", eval_results.get("eval_precision", 0.0))
        trial.set_user_attr("recall", eval_results.get("eval_recall", 0.0))

        return f1_macro

    # Run hyperparameter search
    SEARCH_STRATEGIES = ["grid", "random"] if AUTO_TUNE_ENABLED else []
    summary_rows = []

    for strategy in SEARCH_STRATEGIES:
        print(f"\n=== Hyperparameter search: {strategy.upper()} ===")

        if strategy == "grid":
            sampler = GridSampler(GRID_SEARCH_SPACE)
            study = optuna.create_study(direction="maximize", sampler=sampler)
            # Grid search: run all combinations
            total_trials = (len(GRID_SEARCH_SPACE["learning_rate"]) *
                          len(GRID_SEARCH_SPACE["per_device_train_batch_size"]) *
                          len(GRID_SEARCH_SPACE["weight_decay"]) *
                          len(GRID_SEARCH_SPACE["num_train_epochs"]))
            study.optimize(objective, n_trials=total_trials, show_progress_bar=True)
        else:
            sampler = RandomSampler(seed=RANDOM_SEED)
            study = optuna.create_study(direction="maximize", sampler=sampler)
            study.optimize(objective, n_trials=RANDOM_TRIALS, show_progress_bar=True)

        # Save trial results
        trials_df = pd.DataFrame([
            {
                "trial": t.number,
                "f1_macro": t.value,
                "learning_rate": t.params.get("learning_rate"),
                "batch_size": t.params.get("per_device_train_batch_size"),
                "weight_decay": t.params.get("weight_decay"),
                "epochs": t.params.get("num_train_epochs"),
                "accuracy": t.user_attrs.get("accuracy", 0),
                "precision": t.user_attrs.get("precision", 0),
                "recall": t.user_attrs.get("recall", 0),
                "train_time": t.user_attrs.get("train_time", 0),
            }
            for t in study.trials if t.value is not None
        ])

        trials_df.to_csv(TUNING_DIR / f"{strategy}_trials.csv", index=False)
        print(f"Saved {strategy} trials to {TUNING_DIR / f'{strategy}_trials.csv'}")

        # Best trial
        if study.best_trial:
            best_hparams = study.best_trial.params
            print(f"\nBest {strategy} trial: F1-macro = {study.best_trial.value:.4f}")
            print(f"Best params: {best_hparams}")

            # Retrain with best params and evaluate on test
            print(f"\nRetraining with best {strategy} configuration...")
            best_trainer, _ = build_trainer_for_trial(best_hparams, f"best_{strategy}")
            best_trainer.train()

            val_results = best_trainer.evaluate()
            test_results = best_trainer.evaluate(eval_dataset=ds_test)

            # Save best model
            best_ckpt_dir = f"./checkpoints/bert-base/best_{strategy}"
            best_trainer.save_model(best_ckpt_dir)
            tokenizer.save_pretrained(best_ckpt_dir)

            summary_rows.append({
                "strategy": strategy,
                "f1_macro_val": val_results.get("eval_f1_macro", 0),
                "f1_macro_test": test_results.get("eval_f1_macro", 0),
                "accuracy_val": val_results.get("eval_accuracy", 0),
                "accuracy_test": test_results.get("eval_accuracy", 0),
                "best_params": str(best_hparams),
            })

            # Log to runs_log.csv
            row = {
                "member": f"bert-{strategy}",
                "model": MODEL_NAME,
                "num_train_epochs": best_hparams.get("num_train_epochs"),
                "per_device_train_batch_size": best_hparams.get("per_device_train_batch_size"),
                "learning_rate": best_hparams.get("learning_rate"),
                "weight_decay": best_hparams.get("weight_decay"),
                "warmup_steps": None,
                "lr_scheduler_type": "linear",
                "gradient_accumulation_steps": 1,
                "max_seq_length": MAX_LEN,
                "seed": RANDOM_SEED,
                "fp16": USE_FP16,
                "accuracy": test_results.get("eval_accuracy", 0),
                "precision": test_results.get("eval_precision", 0),
                "recall": test_results.get("eval_recall", 0),
                "f1_macro": test_results.get("eval_f1_macro", 0),
                "notes": f"Best {strategy} search. Params: {best_hparams}",
            }
            pd.DataFrame([row]).to_csv("runs_log.csv", mode="a", index=False,
                                     header=not os.path.exists("runs_log.csv"))

    # Save summary
    if summary_rows:
        summary_df = pd.DataFrame(summary_rows)
        summary_df.to_csv(TUNING_DIR / "strategy_summary.csv", index=False)
        print(f"\n✓ Saved strategy summary to {TUNING_DIR / 'strategy_summary.csv'}")
        print(summary_df)

    AUTO_TUNE_ENABLED = False  # Disable for subsequent cells
else:
    print("AUTO_TUNE_ENABLED is False. Skipping hyperparameter tuning.")


[I 2025-11-14 02:00:02,593] A new study created in memory with name: no-name-0c00466c-c606-4801-adcd-6fdd55e6e0cc



=== Hyperparameter search: GRID ===


  0%|          | 0/8 [00:00<?, ?it/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7672,0.765782,0.688061,0.711502,0.676981,0.686318
2,0.6506,0.785695,0.684022,0.693116,0.680676,0.68547


[I 2025-11-14 02:08:14,545] Trial 0 finished with value: 0.686318040786818 and parameters: {}. Best is trial 0 with value: 0.686318040786818.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7922,0.757295,0.690081,0.714489,0.678347,0.687326
2,0.6887,0.760651,0.690305,0.697378,0.687718,0.691598


[I 2025-11-14 02:13:54,123] Trial 1 finished with value: 0.6915982710100357 and parameters: {}. Best is trial 1 with value: 0.6915982710100357.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7705,0.757406,0.68851,0.713059,0.677379,0.68639
2,0.6415,0.773878,0.686939,0.695041,0.684057,0.688386


[I 2025-11-14 02:22:24,701] Trial 2 finished with value: 0.6883864658141992 and parameters: {}. Best is trial 1 with value: 0.6915982710100357.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7911,0.755417,0.691876,0.720507,0.679262,0.689153
2,0.684,0.765129,0.684022,0.691886,0.681685,0.685765


[I 2025-11-14 02:28:15,641] Trial 3 finished with value: 0.689152955641206 and parameters: {}. Best is trial 1 with value: 0.6915982710100357.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7774,0.779778,0.677962,0.710988,0.662848,0.672782
2,0.6921,0.773547,0.683124,0.69126,0.681064,0.685171
3,0.4765,0.879501,0.679982,0.686968,0.677829,0.681485


[I 2025-11-14 02:40:25,323] Trial 4 finished with value: 0.6851706730005169 and parameters: {}. Best is trial 1 with value: 0.6915982710100357.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.801,0.764471,0.688285,0.717557,0.675213,0.68452
2,0.6967,0.765712,0.688285,0.691973,0.688181,0.689811
3,0.5086,0.842903,0.680206,0.685223,0.679583,0.681995


[I 2025-11-14 02:48:52,863] Trial 5 finished with value: 0.6898113084926939 and parameters: {}. Best is trial 1 with value: 0.6915982710100357.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7674,0.76365,0.684695,0.70208,0.67588,0.683534
2,0.7029,0.768822,0.68649,0.696638,0.682142,0.687464
3,0.4903,0.900347,0.678411,0.683737,0.677011,0.679861


[I 2025-11-14 03:01:35,392] Trial 6 finished with value: 0.6874641641377398 and parameters: {}. Best is trial 1 with value: 0.6915982710100357.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8031,0.759196,0.690305,0.716507,0.678341,0.688032
2,0.6878,0.767889,0.689632,0.69395,0.68874,0.690839
3,0.489,0.848269,0.677289,0.682557,0.676265,0.678885


[I 2025-11-14 03:10:16,414] Trial 7 finished with value: 0.6908388799360564 and parameters: {}. Best is trial 1 with value: 0.6915982710100357.
Saved grid trials to tuning/grid_trials.csv

Best grid trial: F1-macro = 0.6916
Best params: {}

Retraining with best grid configuration...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8031,0.759196,0.690305,0.716507,0.678341,0.688032
2,0.6878,0.767889,0.689632,0.69395,0.68874,0.690839
3,0.489,0.848269,0.677289,0.682557,0.676265,0.678885


[I 2025-11-14 03:19:03,121] A new study created in memory with name: no-name-0e197e03-9586-4164-aa3e-7e7164d8fec0



=== Hyperparameter search: RANDOM ===


  0%|          | 0/8 [00:00<?, ?it/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7648,0.773396,0.680655,0.69838,0.670936,0.677155
2,0.7078,0.767613,0.691876,0.698626,0.689903,0.693421
3,0.4913,0.912909,0.671005,0.675108,0.671296,0.673027
4,0.3951,1.265033,0.661131,0.664985,0.661545,0.663111


[I 2025-11-14 03:35:47,441] Trial 0 finished with value: 0.6934212564921293 and parameters: {'learning_rate': 2.8188664052384835e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.005808361216819946, 'num_train_epochs': 4}. Best is trial 0 with value: 0.6934212564921293.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7763,0.759027,0.690305,0.713224,0.679785,0.689096
2,0.6917,0.764395,0.692549,0.70198,0.689366,0.694224


[I 2025-11-14 03:41:35,802] Trial 1 finished with value: 0.6942237582250611 and parameters: {'learning_rate': 3.469266868719914e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.018182496720710064, 'num_train_epochs': 2}. Best is trial 1 with value: 0.6942237582250611.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.807,0.751196,0.691427,0.711291,0.681258,0.689522
2,0.625,0.755146,0.688285,0.693525,0.686847,0.689626
3,0.5546,0.812691,0.684022,0.689397,0.682801,0.685528


[I 2025-11-14 03:48:50,986] Trial 2 finished with value: 0.6896256738801562 and parameters: {'learning_rate': 2.6430182166924245e-05, 'per_device_train_batch_size': 24, 'weight_decay': 0.029214464853521818, 'num_train_epochs': 3}. Best is trial 1 with value: 0.6942237582250611.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7666,0.760756,0.687388,0.706898,0.677721,0.684791
2,0.6694,0.776649,0.690978,0.699606,0.687463,0.692133


[I 2025-11-14 03:57:14,946] Trial 3 finished with value: 0.692133038590431 and parameters: {'learning_rate': 3.037515404772984e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.06075448519014384, 'num_train_epochs': 2}. Best is trial 1 with value: 0.6942237582250611.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8245,0.750815,0.691876,0.718329,0.679977,0.689344
2,0.711,0.772949,0.684695,0.688492,0.684994,0.686569
3,0.5232,0.829339,0.681329,0.687041,0.680042,0.683012


[I 2025-11-14 04:06:49,743] Trial 4 finished with value: 0.689343887800082 and parameters: {'learning_rate': 2.1228368952944975e-05, 'per_device_train_batch_size': 12, 'weight_decay': 0.0684233026512157, 'num_train_epochs': 3}. Best is trial 1 with value: 0.6942237582250611.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.792,0.753606,0.692325,0.723231,0.679316,0.689797
2,0.7052,0.763157,0.686715,0.69055,0.686057,0.687997
3,0.5287,0.816514,0.676391,0.682376,0.674289,0.677565


[I 2025-11-14 04:15:29,660] Trial 5 finished with value: 0.6897974939572982 and parameters: {'learning_rate': 2.2366286923412623e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.031171107608941095, 'num_train_epochs': 3}. Best is trial 1 with value: 0.6942237582250611.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8497,0.761959,0.687163,0.719809,0.672758,0.682561
2,0.7107,0.755114,0.685592,0.69278,0.682432,0.686265
3,0.4986,0.872393,0.677289,0.681913,0.67613,0.678635
4,0.3571,1.066213,0.666068,0.671531,0.664999,0.667796


[I 2025-11-14 04:28:09,573] Trial 6 finished with value: 0.686264883533798 and parameters: {'learning_rate': 3.300561952272884e-05, 'per_device_train_batch_size': 12, 'weight_decay': 0.05978999788110852, 'num_train_epochs': 4}. Best is trial 1 with value: 0.6942237582250611.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7954,0.758698,0.69053,0.716944,0.677268,0.686352
2,0.6403,0.75262,0.690754,0.697886,0.688076,0.691977
3,0.5691,0.798181,0.69053,0.698597,0.687222,0.69154


[I 2025-11-14 04:35:21,398] Trial 7 finished with value: 0.6919773122878455 and parameters: {'learning_rate': 2.1689258392227166e-05, 'per_device_train_batch_size': 24, 'weight_decay': 0.08287375091519295, 'num_train_epochs': 3}. Best is trial 1 with value: 0.6942237582250611.
Saved random trials to tuning/random_trials.csv

Best random trial: F1-macro = 0.6942
Best params: {'learning_rate': 3.469266868719914e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.018182496720710064, 'num_train_epochs': 2}

Retraining with best random configuration...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7989,0.757794,0.691203,0.716875,0.679419,0.689009
2,0.6929,0.760415,0.68851,0.697486,0.685272,0.689998



✓ Saved strategy summary to tuning/strategy_summary.csv
  strategy  f1_macro_val  f1_macro_test  accuracy_val  accuracy_test  \
0     grid      0.690839       0.674435      0.689632       0.673369   
1   random      0.689998       0.677551      0.688510       0.676497   

                                         best_params  
0                                                 {}  
1  {'learning_rate': 3.469266868719914e-05, 'per_...  


# **BERT Fine-tuning and Training**

**`Purpose`**

This cell performs the actual BERT model fine-tuning using either default hyperparameters or the best configuration from automated tuning. It trains the model with early stopping and saves the best checkpoint.

**`Input`**

Requires tokenized datasets (ds_train, ds_val), training arguments, model, tokenizer, and compute_metrics function. If hyperparameter tuning was performed, uses best parameters from the study.

**`Output`**

Trains the BERT model and saves the best checkpoint to disk. Prints training progress including loss and metrics for each epoch. Creates a trainer object and runs the training loop.

**`Details`**

The cell creates a Trainer object with the model, datasets, training arguments, and callbacks including early stopping. Training proceeds for the specified number of epochs with validation evaluation after each epoch. The best model checkpoint (based on validation F1-macro) is automatically saved. Training progress is logged and displayed, showing loss curves and metric improvements over time.


**`Purpose`**

This cell configures the hyperparameter search spaces for automated tuning using Optuna. It defines the parameter ranges for grid search and random search strategies.

**`Input`**

No direct inputs required. Uses global configuration variables.

**`Output`**

Prints the hyperparameter tuning configuration including search spaces, number of trials, and maximum epochs.

**`Details`**

Defines GRID_SEARCH_SPACE with fixed learning rate and varying batch sizes, weight decay, and epochs. Sets up RANDOM_SEARCH_SPACE with continuous ranges for learning rate and weight decay, and categorical choices for batch size and epochs. Configures the number of random trials and maximum training epochs to balance exploration with computational efficiency.


In [16]:
# Train BERT model (if not already trained via hyperparameter tuning)
if not AUTO_TUNE_ENABLED or not os.path.exists("./checkpoints/bert-base/best_grid"):
    print("Training BERT model with default/configured parameters...")
    print("Fine-tuning process per proposal Section V.D:")
    print("  - Tokenization using BERT's WordPiece tokenizer")
    print("  - Convert tweets to input IDs and attention masks")
    print("  - Fine-tune with classification head for sentiment labeling")
    print("  - Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup")

    trainer.train()

    # Evaluate on validation set
    val_results = trainer.evaluate()
    print("\nValidation Results:")
    print(f"  Accuracy: {val_results.get('eval_accuracy', 0):.4f}")
    print(f"  Precision: {val_results.get('eval_precision', 0):.4f}")
    print(f"  Recall: {val_results.get('eval_recall', 0):.4f}")
    print(f"  F1-macro: {val_results.get('eval_f1_macro', 0):.4f}")

    # Save best checkpoint
    best_ckpt_dir = "./checkpoints/bert-base/best"
    trainer.save_model(best_ckpt_dir)
    tokenizer.save_pretrained(best_ckpt_dir)
    print(f"\n✓ Best model saved to {best_ckpt_dir}")
else:
    print("Using best model from hyperparameter tuning.")
    best_ckpt_dir = "./checkpoints/bert-base/best_grid"


Training BERT model with default/configured parameters...
Fine-tuning process per proposal Section V.D:
  - Tokenization using BERT's WordPiece tokenizer
  - Convert tweets to input IDs and attention masks
  - Fine-tune with classification head for sentiment labeling
  - Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
100,0.6857,0.760716,0.686041,0.691137,0.687115,0.688662
200,0.6485,0.786035,0.679084,0.68696,0.676102,0.679751
300,0.6048,0.788078,0.68447,0.69083,0.680619,0.683608



Validation Results:
  Accuracy: 0.6845
  Precision: 0.6908
  Recall: 0.6806
  F1-macro: 0.6836

✓ Best model saved to ./checkpoints/bert-base/best


# **Model Evaluation and Baseline Comparison (per proposal Section VI)**

**`Purpose`**

This cell performs comprehensive evaluation of the trained BERT model on the test set. It generates detailed performance metrics, classification reports, and comparison with baseline models.

**`Input`**

Requires best_model and best_ckpt_dir from training cells. Needs ds_test dataset and baseline model results for comparison.

**`Output`**

Prints detailed performance metrics, generates confusion matrices, ROC-AUC and PR-AUC curves, and creates comparison tables. Exports results to CSV files for further analysis.

**`Details`**

The evaluation process tests the best BERT model on the held-out test set. It computes accuracy, precision, recall, and F1-macro scores for each class and overall. Generates confusion matrices to visualize classification performance. Creates ROC-AUC and PR-AUC curves for each sentiment class to analyze precision-recall trade-offs. Compares BERT results with baseline TF-IDF and Logistic Regression model to quantify improvements. All results are exported to files for documentation and reporting.


**`Purpose`**

This cell configures the training arguments and sets up helper functions for BERT model fine-tuning. It defines hyperparameters, optimization settings, and evaluation metrics according to the proposal methodology.

**`Input`**

Uses global variables like MODEL_NAME, MAX_LEN, RANDOM_SEED, and device settings. Requires tokenizer and model to be initialized from previous cells.

**`Output`**

Creates training_args object and compute_metrics function. Prints configuration summary including optimizer settings, learning rate schedule, and training parameters.

**`Details`**

The cell sets up TrainingArguments with AdamW optimizer, linear learning rate warmup, cross-entropy loss, and early stopping callback. It configures evaluation strategy, logging, and checkpoint saving. The compute_metrics function calculates accuracy, precision, recall, and F1-macro scores for each evaluation step. All settings align with the proposal requirements for systematic model training and evaluation.


In [18]:
# Load best BERT model for evaluation
from transformers import AutoModelForSequenceClassification

best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)
best_model.eval()

# Create trainer with best model for evaluation
eval_trainer = Trainer(
    model=best_model,
    args=training_args,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

# Evaluate on test set (15%)
print("Evaluating BERT model on test set...")
test_results = eval_trainer.evaluate()

print("\nBERT Model Performance (Test Set) - per proposal Section VI.A:")
print(f"  Accuracy: {test_results.get('eval_accuracy', 0):.4f}")
print(f"  Precision: {test_results.get('eval_precision', 0):.4f}")
print(f"  Recall: {test_results.get('eval_recall', 0):.4f}")
print(f"  F1-macro: {test_results.get('eval_f1_macro', 0):.4f}")

# Generate predictions for detailed analysis
test_predictions = eval_trainer.predict(ds_test)
test_preds_bert = np.argmax(test_predictions.predictions, axis=1)
test_labels_bert = test_predictions.label_ids

# Map predictions back from BERT format (0,1,2) to proposal format (-1,0,1)
reverse_mapping = {0: -1, 1: 0, 2: 1}
test_preds = np.array([reverse_mapping[p] for p in test_preds_bert])
test_labels = np.array([reverse_mapping[l] for l in test_labels_bert])

# Classification report
print("\nTest Set Classification Report:")
print(classification_report(test_labels, test_preds,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

# Confusion matrix
cm_bert = confusion_matrix(test_labels, test_preds, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm_bert, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('BERT Model - Test Set Confusion Matrix')
plt.tight_layout()
plt.savefig('exports/confusion_matrices/bert_cm_test.png', dpi=150)
plt.close()
print("\n✓ Confusion matrix saved to exports/confusion_matrices/bert_cm_test.png")

# Save test predictions
test_df_results = pd.DataFrame({
    'review': df_test['review'].tolist(),
    'gold': test_labels,
    'pred': test_preds
})
test_df_results.to_csv('exports/bert_predictions_test.csv', index=False)
print("✓ Test predictions saved to exports/bert_predictions_test.csv")

# ROC-AUC and PR-AUC curves (per proposal Section III)
from sklearn.metrics import roc_curve, auc, precision_recall_curve, roc_auc_score, average_precision_score
from sklearn.preprocessing import label_binarize

# Binarize labels for multi-class ROC
y_test_bin = label_binarize(test_labels, classes=[-1, 0, 1])
n_classes = 3
test_probs = torch.softmax(torch.tensor(test_predictions.predictions), dim=-1).numpy()

# Compute ROC and PR curves for each class
# Note: test_probs columns are in BERT order (0=negative, 1=neutral, 2=positive)
# which maps to proposal order: 0→-1, 1→0, 2→1
fpr = dict()
tpr = dict()
roc_auc = dict()
precision = dict()
recall = dict()
pr_auc = dict()

# Map BERT probabilities to proposal label order
bert_to_proposal = {0: 0, 1: 1, 2: 2}  # BERT index 0→proposal -1, index 1→proposal 0, index 2→proposal 1
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    class_idx = [-1, 0, 1][i]  # Proposal format
    bert_idx = {-1: 0, 0: 1, 1: 2}[class_idx]  # BERT format index
    y_true_class = (test_labels == class_idx).astype(int)
    y_score_class = test_probs[:, bert_idx]  # Use BERT index for probabilities

    fpr[i], tpr[i], _ = roc_curve(y_true_class, y_score_class)
    roc_auc[i] = auc(fpr[i], tpr[i])

    precision[i], recall[i], _ = precision_recall_curve(y_true_class, y_score_class)
    pr_auc[i] = average_precision_score(y_true_class, y_score_class)

    print(f"\n{class_name} (class {class_idx}):")
    print(f"  ROC-AUC: {roc_auc[i]:.4f}")
    print(f"  PR-AUC: {pr_auc[i]:.4f}")

# Plot ROC curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC curves
ax = axes[0]
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    ax.plot(fpr[i], tpr[i], label=f'{class_name} (AUC = {roc_auc[i]:.3f})')
ax.plot([0, 1], [0, 1], 'k--', label='Random')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves (per proposal Section III)')
ax.legend()
ax.grid(True, alpha=0.3)

# PR curves
ax = axes[1]
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    ax.plot(recall[i], precision[i], label=f'{class_name} (AP = {pr_auc[i]:.3f})')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curves (per proposal Section III)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
os.makedirs("exports/roc_curves", exist_ok=True)
plt.savefig('exports/roc_curves/bert_roc_pr_curves.png', dpi=150)
plt.close()
print("\n✓ ROC and PR curves saved to exports/roc_curves/bert_roc_pr_curves.png")

# Baseline vs BERT comparison (per proposal Section VI.B)
comparison_df = pd.DataFrame({
    'Model': ['Baseline (TF-IDF + LogReg)', 'BERT-base-uncased'],
    'Accuracy': [test_acc, test_results.get('eval_accuracy', 0)],
    'Precision': [test_prec, test_results.get('eval_precision', 0)],
    'Recall': [test_rec, test_results.get('eval_recall', 0)],
    'F1-macro': [test_f1, test_results.get('eval_f1_macro', 0)],
})

print("\n" + "="*60)
print("Baseline vs BERT Comparison (per proposal Section VI.B)")
print("="*60)
print(comparison_df.to_string(index=False))
comparison_df.to_csv('exports/model_comparison.csv', index=False)
print("\n✓ Model comparison saved to exports/model_comparison.csv")


Evaluating BERT model on test set...


  eval_trainer = Trainer(



BERT Model Performance (Test Set) - per proposal Section VI.A:
  Accuracy: 0.6772
  Precision: 0.6877
  Recall: 0.6721
  F1-macro: 0.6765

Test Set Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.72      0.58      0.64      1292
       neutral       0.63      0.70      0.66      1781
      positive       0.71      0.74      0.73      1403

      accuracy                           0.68      4476
     macro avg       0.69      0.67      0.68      4476
  weighted avg       0.68      0.68      0.68      4476


✓ Confusion matrix saved to exports/confusion_matrices/bert_cm_test.png
✓ Test predictions saved to exports/bert_predictions_test.csv

negative/toxic (class -1):
  ROC-AUC: 0.8400
  PR-AUC: 0.7330

neutral (class 0):
  ROC-AUC: 0.7823
  PR-AUC: 0.6573

positive (class 1):
  ROC-AUC: 0.8769
  PR-AUC: 0.8018

✓ ROC and PR curves saved to exports/roc_curves/bert_roc_pr_curves.png

Baseline vs BERT Comparison (per proposal Section V