
# Twitter Toxicity Detection Project

> This notebook documents the end-to-end Twitter toxicity detection project, starting with deterministic data preparation and culminating in deployable artifacts. It begins by cleaning and splitting the tweet corpus, persisting split IDs so that every experiment, baseline or advanced, evaluates on identical examples. Classic baselines (TF-IDF + Logistic Regression) anchor performance expectations before the workflow escalates to transformer fine-tuning with BERT-base-uncased.
>
> The transformer track covers tokenization, automated hyperparameter tuning (grid and random search), and early stopping, ultimately selecting the best checkpoint via validation macro-F1. Detailed evaluation follows: confusion matrices, per-class reports, ROC-AUC and PR-AUC curves, and exportable tables compare validation/test splits, while metric logs capture training dynamics.
>
> The project implements a dual-model approach: a baseline TF-IDF + Logistic Regression model for interpretability and computational efficiency, and a fine-tuned BERT-base-uncased model for superior accuracy and context-aware understanding of informal language, sarcasm, and subtle toxicity patterns in Twitter posts.
>
> Final cells translate the strongest transformer into deployment formats, exporting quantized PyTorch weights, consolidated logs, and spreadsheet summaries. Together with saved checkpoints, tokenizer files, and experiment logs, these outputs let classmates and graders reproduce, audit, and extend every stage of the workflow without rerunning training from scratch.



# **Setup, imports, dataset load, and split**


**`Purpose`**

This block prepares the environment, ensures required libraries are available, and loads the Twitter Comment Dataset into memory in a clean, consistent format. It also establishes a reproducible 70/15/15 train-validation-test split so that all later experiments evaluate on the same examples. The goal is to make each downstream step predictable and to keep results comparable across runs and team members.

**`Input`**

The cell expects either a local copy of TwitterToxicity.csv in the current working directory or, if absent, a file that will be provided through the upload dialog. The CSV must contain at least two columns named review and label, which represent the input text and its sentiment class. Labels should be in the format: -1 (negative/toxic), 0 (neutral), 1 (positive). No other inputs are required at this stage, and any additional columns are ignored.

**`Output`**

The cell produces three pandas DataFrames, df_train, df_val, and df_test, with stratified class proportions and a new id column that uniquely identifies each row. It also writes three small files, train_ids.csv, val_ids.csv, and test_ids.csv, which store the chosen row IDs for reuse, and exports the full test split to ground_truth_test_set.csv. The printed device line indicates whether a GPU is available. Three label-ratio dictionaries are printed as a quick check that class proportions closely match across splits.

**`Details`**

The cell installs the core NLP stack, imports common utilities, and detects the runtime device. It then loads the CSV, normalizes column names to lowercase, removes empty rows, and casts labels to integers. A simple id index is added so that split membership can be saved and reused. A stratified split holds label balance constant, which is printed to confirm the split is fair. Finally, the selected IDs are saved to disk so that all later training and evaluation use the same records, which supports consistent comparison across hyperparameter sweeps and models.
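
The two-stage split and the ID round trip described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's cell: the `toy` frame, its 1,000 rows, and the temporary directory are assumptions made for the example, while the seed, the `test_size` values, and the `*_ids.csv` naming follow the notebook.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # same seed value the notebook uses

# Hypothetical toy frame: 1,000 rows with three labels in a rough 30/40/30 mix.
rng = np.random.default_rng(SEED)
toy = pd.DataFrame({
    "review": [f"text {i}" for i in range(1000)],
    "label": rng.choice([-1, 0, 1], size=1000, p=[0.3, 0.4, 0.3]),
})
toy["id"] = np.arange(len(toy))

# Stage 1 holds out 30% of the rows; stage 2 halves that hold-out into val and test.
train, temp = train_test_split(toy, test_size=0.3, random_state=SEED, stratify=toy["label"])
val, test = train_test_split(temp, test_size=0.5, random_state=SEED, stratify=temp["label"])

# Persist only the ids, then rebuild each split from them as a later run would.
with tempfile.TemporaryDirectory() as tmp:
    for name, part in [("train", train), ("val", val), ("test", test)]:
        part[["id"]].to_csv(Path(tmp) / f"{name}_ids.csv", index=False)
    reloaded = {
        name: toy[toy["id"].isin(pd.read_csv(Path(tmp) / f"{name}_ids.csv")["id"])]
        for name in ("train", "val", "test")
    }

print({name: len(part) for name, part in reloaded.items()})
```

Saving only the `id` column keeps the split files small, and rebuilding with `isin` means a later cell can change the text columns without changing which rows belong to each split.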

**`Line-by-line Description.`**

The `packages_map` loop tries to import each required library and, on `ImportError`, installs it with `subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name])`. Together these packages cover tokenization, training, metrics, hyperparameter tuning, and spreadsheet export.

`import os, numpy as np, pandas as pd, torch` pulls in filesystem helpers, numerical tools, data frames, and the deep learning backend.

`from sklearn.model_selection import train_test_split` and `from sklearn.metrics import accuracy_score, f1_score` load utilities for splitting and scoring.

`try: from google.colab import files ...` sets up an optional upload path that only activates when running in Colab.

`device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')` detects whether a GPU is present and prints the choice so training expectations are clear.

`if not os.path.exists("TwitterToxicity.csv"):` handles a missing CSV: in Colab it requests an upload through `files.upload()`, and outside Colab it raises `FileNotFoundError` with a hint to place the file in the current directory.

`df = pd.read_csv("TwitterToxicity.csv")` loads the data, `df = df.rename(columns={c:c.lower() for c in df.columns})` enforces lowercase column names, and the following `assert` stops the run early if `review` or `label` is missing, so downstream code can assume consistent headers.

`df = df.dropna(subset=['review','label']).copy()` removes incomplete rows to avoid errors and noisy training examples.

`df['label'] = df['label'].astype(int)` fixes the label type so models receive proper integers.

`df['id'] = np.arange(len(df))` assigns a stable identifier to each row so the split can be persisted.

The branch that checks for `train_ids.csv`, `val_ids.csv`, and `test_ids.csv` either reuses an existing split or creates a new stratified split with 70/15/15 ratio using `train_test_split(... stratify=df['label'])`.

`df_train[['id']].to_csv('train_ids.csv', index=False)` and the matching lines for validation and test serialize the split for later reuse.

The final `print(...)` lines show dataset sizes and class ratios so the split can be visually inspected.


In [1]:
import subprocess
import sys

packages_map = [
    ("torch", "torch"),
    ("transformers", "transformers"),
    ("datasets", "datasets"),
    ("accelerate", "accelerate"),
    ("optuna", "optuna"),
    ("scikit-learn", "sklearn"),
    ("pandas", "pandas"),
    ("numpy", "numpy"),
    ("matplotlib", "matplotlib"),
    ("openpyxl", "openpyxl"),
]

for pip_name, import_name in packages_map:
    try:
        __import__(import_name)
        print(f"✓ {pip_name} already installed")
    except ImportError:
        print(f"Installing {pip_name}...")
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name],
                                stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
            print(f"✓ {pip_name} installed successfully")
        except subprocess.CalledProcessError as e:
            print(f"⚠ Warning: Could not install {pip_name}. You may need to install it manually.")

import os, numpy as np, pandas as pd, torch, json, inspect, optuna
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    files = None

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

if not os.path.exists("TwitterToxicity.csv"):
    if IN_COLAB and files is not None:
        uploaded = files.upload()
    else:
        raise FileNotFoundError("TwitterToxicity.csv not found. Please run export_dataset.py first or ensure the file is in the current directory.")
df = pd.read_csv("TwitterToxicity.csv")

df = df.rename(columns={c:c.lower() for c in df.columns})
assert {'review','label'} <= set(df.columns), "CSV must have 'review' and 'label' columns."

df = df.dropna(subset=['review','label']).copy()
df['label'] = df['label'].astype(int)
df['id'] = np.arange(len(df))

# 70/15/15 split (per proposal Section V.B: training 70%, validation 15%, testing 15%)
RANDOM_SEED = 42
if not {'train_ids.csv','val_ids.csv','test_ids.csv'} <= set(os.listdir('.')):
    df_train, df_temp = train_test_split(
        df, test_size=0.3, random_state=RANDOM_SEED, stratify=df['label']
    )
    df_val, df_test = train_test_split(
        df_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=df_temp['label']
    )
    df_train[['id']].to_csv('train_ids.csv', index=False)
    df_val[['id']].to_csv('val_ids.csv', index=False)
    df_test[['id']].to_csv('test_ids.csv', index=False)
else:
    ids_tr = set(pd.read_csv('train_ids.csv')['id'].tolist())
    ids_va = set(pd.read_csv('val_ids.csv')['id'].tolist())
    ids_te = set(pd.read_csv('test_ids.csv')['id'].tolist())
    df_train = df[df['id'].isin(ids_tr)].copy()
    df_val   = df[df['id'].isin(ids_va)].copy()
    df_test  = df[df['id'].isin(ids_te)].copy()

print(f"Dataset sizes: train={len(df_train)}, val={len(df_val)}, test={len(df_test)}")
print("\nLabel ratios (train):", df_train['label'].value_counts(normalize=True).sort_index().to_dict())
print("Label ratios (val):  ", df_val['label'].value_counts(normalize=True).sort_index().to_dict())
print("Label ratios (test): ", df_test['label'].value_counts(normalize=True).sort_index().to_dict())

# Export ground-truth test set BEFORE any downsampling
# This ensures the exported file always contains the full test set (required deliverable)
# The ground_truth_test_set.csv should match the test_ids.csv exactly
# This file is required as Deliverable 2: Ground-Truth Test Set (Final Version)
if 'test_ids.csv' in os.listdir('.'):
    # Load full test set from test_ids.csv to ensure consistency
    test_ids = pd.read_csv('test_ids.csv')['id'].tolist()
    df_test_full = df[df['id'].isin(test_ids)].copy()
    ground_truth_test_set = df_test_full[['review', 'label']].copy()
    ground_truth_test_set.to_csv('ground_truth_test_set.csv', index=False)
    print(f"✓ Exported ground-truth test set: ground_truth_test_set.csv ({len(ground_truth_test_set)} samples)")
    print(f"  Label distribution: {ground_truth_test_set['label'].value_counts().sort_index().to_dict()}")
    # Verify the export
    if os.path.exists('ground_truth_test_set.csv'):
        verify_df = pd.read_csv('ground_truth_test_set.csv')
        assert len(verify_df) == len(ground_truth_test_set), "Export verification failed"
        assert list(verify_df.columns) == ['review', 'label'], "Export columns incorrect"
        print(f"  ✓ Verification passed: File contains {len(verify_df)} samples with 'review' and 'label' columns")
else:
    # If test_ids.csv doesn't exist yet, export from current df_test
    ground_truth_test_set = df_test[['review', 'label']].copy()
    ground_truth_test_set.to_csv('ground_truth_test_set.csv', index=False)
    print(f"✓ Exported ground-truth test set: ground_truth_test_set.csv ({len(ground_truth_test_set)} samples)")


✓ torch already installed
✓ transformers already installed
✓ datasets already installed
✓ accelerate already installed
Installing optuna...
✓ optuna installed successfully
✓ scikit-learn already installed
✓ pandas already installed
✓ numpy already installed
✓ matplotlib already installed
✓ openpyxl already installed
Using device: cuda


Saving TwitterToxicity.csv to TwitterToxicity.csv
Dataset sizes: train=21114, val=4525, test=4525

Label ratios (train): {-1: 0.2867765463673392, 0: 0.40025575447570333, 1: 0.3129676991569575}
Label ratios (val):   {-1: 0.28685082872928175, 0: 0.40022099447513815, 1: 0.3129281767955801}
Label ratios (test):  {-1: 0.28685082872928175, 0: 0.40022099447513815, 1: 0.3129281767955801}


# Data cleaning and preprocessing

This cell applies comprehensive preprocessing suitable for Twitter text and prepares reproducible 70/15/15 stratified splits. We:
- Remove non-textual elements (numbers, punctuation, URLs)
- Remove user identifiers, hashtags, mentions (privacy protection)
- Eliminate stopwords and unnecessary whitespace
- Convert all text to lowercase
- Apply lemmatization and stemming (normalize to root forms)
- Handle imbalanced data (oversampling/undersampling/SMOTE if needed)
- Drop empty rows and exact duplicates
- Persist `train_ids.csv`, `val_ids.csv`, `test_ids.csv` to reuse the same split across runs


**`Purpose`**

This block implements comprehensive data preprocessing per proposal Section V.B to prepare the Twitter dataset for machine learning and deep learning tasks. The preprocessing phase includes data cleaning, text normalization, tokenization preparation, and handling of imbalanced data to improve model performance.

**`Input`**

The input is the `df` DataFrame loaded earlier, together with the saved split ID files; df_train, df_val, and df_test are rebuilt from the cleaned `df` at the end of the cell. Only the review and label columns are used for cleaning. The preprocessing functions apply Twitter-specific cleaning to remove noise while preserving meaningful textual content.

**`Output`**

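The block produces cleaned DataFrames with normalized text and prints statistics about the preprocessing steps, including the class distribution. If the smallest class falls below 25% of rows, it also runs SMOTE on a TF-IDF matrix; as written, the resampled matrices are not written back to the DataFrames. The splits are rebuilt from the saved IDs, so rows dropped during cleaning disappear from whichever split they belonged to.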

**`Details`**

The preprocessing follows the proposal methodology exactly:
- **Data Cleaning**: Removes non-textual elements, user identifiers, hashtags, mentions, URLs, stopwords
- **Text Normalization**: Lowercase conversion, lemmatization, stemming
- **Imbalanced Data Handling**: Checks class distribution and applies SMOTE/oversampling if needed
- **Quality Control**: Removes duplicates, empty rows, and validates data integrity
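
The cleaning steps listed above can be tried on a single made-up tweet. The sketch below is illustrative only: `clean_tweet` and the sample string are hypothetical, and unlike the notebook's `basic_clean` it also lowercases the text so the result is visible in one pass. The regular expressions are the same patterns the cell uses.

```python
import html
import re

URL_RE = re.compile(r"http\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
TAG_RE = re.compile(r"<[^>]+>")
REPEAT_RE = re.compile(r"(\w)\1{2,}")


def clean_tweet(text: str) -> str:
    """Illustrative cleaner that mirrors the listed steps (not the notebook's exact function)."""
    text = html.unescape(text)           # turn entities such as &amp; into characters
    text = TAG_RE.sub(" ", text)         # drop HTML tags
    text = URL_RE.sub("", text)          # drop URLs
    text = MENTION_RE.sub("", text)      # drop @mentions
    text = HASHTAG_RE.sub(r"\1", text)   # keep the hashtag word, drop the '#'
    text = re.sub(r"\s+", " ", text).strip()
    text = REPEAT_RE.sub(r"\1\1", text)  # cap runs of a repeated character at two
    return text.lower()


sample = "@someone this is sooooo bad &amp; <b>rude</b> https://t.co/abc #NotOkay"
print(clean_tweet(sample))  # this is soo bad & rude notokay
```

Stopword removal, stemming, and lemmatization are left out here because they depend on NLTK data downloads; in the notebook they run afterwards in `normalize_text`.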

**`Line-by-line Description.`**

`import re, html` imports regular expressions for pattern matching and HTML entity unescaping.

`from nltk.corpus import stopwords` and `from nltk.stem import PorterStemmer, WordNetLemmatizer` load NLTK utilities for stopword removal, stemming, and lemmatization.

`nltk.download('stopwords', quiet=True)` downloads required NLTK data files if not already present.

`stemmer = PorterStemmer()` and `lemmatizer = WordNetLemmatizer()` initialize the stemmer and lemmatizer objects used for text normalization.

`stop_words = set(stopwords.words('english'))` loads English stopwords into a set for efficient lookup.

`CTRL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")` and `REPEAT_RE = re.compile(r"(\w)\1{2,}")` compile regex patterns for removing control characters and elongated word repeats.

`def strip_html(text: str) -> str:` defines a function to remove HTML tags and unescape HTML entities from text.

`def remove_urls_mentions_hashtags(text: str) -> str:` defines a function that removes URLs and user mentions (per proposal: privacy protection) and strips the `#` from hashtags while keeping the hashtag word.

`def normalize_text(text: str) -> str:` defines a function that lowercases the text, tokenizes it, drops stopwords and non-alphabetic tokens, and then stems each remaining token before lemmatizing the stem.

`def basic_clean(text: str) -> str:` defines the main cleaning function that combines HTML stripping, URL/mention/hashtag removal, control character removal, whitespace normalization, and elongated repeat capping.

`print("Cleaning dataset...")` provides user feedback about the cleaning process starting.

`df = df.copy()` creates a copy of the dataframe to avoid modifying the original.

`df['review'] = df['review'].astype(str).map(basic_clean)` applies the basic cleaning function to all review text, converting to string type first.

`df = df[(df['review'].str.len() > 0)].drop_duplicates(subset=['review','label']).reset_index(drop=True)` removes empty rows and exact duplicates based on review text and label.

`print("Normalizing text (lemmatization and stemming)...")` provides user feedback about normalization starting.

`df['review'] = df['review'].map(normalize_text)` applies text normalization (lowercase, lemmatization, stemming) to all reviews.

`label_counts = df['label'].value_counts().sort_index()` counts occurrences of each label and sorts by label value.

`print(f"  {label} ({label_name}): {count} ({count/len(df)*100:.2f}%)")` displays class distribution statistics with percentages.

`min_class_ratio = label_counts.min() / len(df)` calculates the minimum class ratio to check for imbalance.

`if min_class_ratio < 0.25:` checks if data is imbalanced (any class has less than 25% of data).

`vectorizer = TfidfVectorizer(max_features=1000)` creates a TF-IDF vectorizer for SMOTE (if needed), limiting features to 1000.

`smote = SMOTE(random_state=RANDOM_SEED)` initializes SMOTE oversampler with random seed for reproducibility.

`if not {'train_ids.csv','val_ids.csv','test_ids.csv'} <= set(os.listdir('.')):` checks if split ID files exist to determine whether to create new splits or reuse existing ones.

`df_train, df_temp = train_test_split(df, test_size=0.3, random_state=RANDOM_SEED, stratify=df['label'])` creates initial 70/30 train-temp split with stratified sampling.

`df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=df_temp['label'])` splits temp data into 50/50 validation and test sets (15% each of total).

`df_train[['id']].to_csv('train_ids.csv', index=False)` saves training split IDs to disk for reproducibility.

`print(f"\nFinal split sizes: train={len(df_train)}, val={len(df_val)}, test={len(df_test)}")` displays final split sizes after preprocessing.


In [2]:
import re
import html
from collections import Counter
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

try:
    nltk.download('stopwords', quiet=True)
    nltk.download('punkt', quiet=True)
    nltk.download('wordnet', quiet=True)
except Exception:
    # Offline runs fall through; the fallbacks below handle missing NLTK data.
    pass

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Get stopwords
try:
    stop_words = set(stopwords.words('english'))
except Exception:
    # Stopword corpus unavailable: continue without stopword filtering.
    stop_words = set()

CTRL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")
REPEAT_RE = re.compile(r"(\w)\1{2,}")

def strip_html(text: str) -> str:
    """Remove HTML tags and entities"""
    if not isinstance(text, str):
        return ""
    t = html.unescape(text)
    t = re.sub(r"<[^>]+>", " ", t)
    return t

def remove_urls_mentions_hashtags(text: str) -> str:
    """Remove URLs, mentions, and hashtags (per proposal: privacy protection)"""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags (but keep the word)
    text = re.sub(r'#(\w+)', r'\1', text)
    return text

def normalize_text(text: str) -> str:
    """Normalize text: lowercase, lemmatization, stemming (per proposal)"""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    try:
        tokens = word_tokenize(text)
    except Exception:
        # Tokenizer data unavailable: fall back to whitespace splitting.
        tokens = text.split()
    tokens = [lemmatizer.lemmatize(stemmer.stem(word)) for word in tokens
              if word.isalpha() and word not in stop_words]
    return ' '.join(tokens)

def basic_clean(text: str) -> str:
    """Main cleaning function"""
    if not isinstance(text, str):
        return ""
    t = str(text)
    t = strip_html(t)
    t = remove_urls_mentions_hashtags(t)
    t = CTRL_RE.sub(" ", t)
    t = re.sub(r"\s+", " ", t).strip()
    t = REPEAT_RE.sub(r"\1\1", t)  # Cap elongated repeats
    return t

print("Cleaning dataset...")
df = df.copy()
df['review'] = df['review'].astype(str).map(basic_clean)
before = len(df)
df = df[(df['review'].str.len() > 0)].drop_duplicates(subset=['review','label']).reset_index(drop=True)
after = len(df)
print(f"Cleaned dataset: kept {after}/{before} rows")

print("Normalizing text (lemmatization and stemming)...")
df['review'] = df['review'].map(normalize_text)

df = df[(df['review'].str.len() > 0)].reset_index(drop=True)
print(f"After normalization: {len(df)} rows")

print("\nClass distribution:")
label_counts = df['label'].value_counts().sort_index()
for label, count in label_counts.items():
    label_name = {-1: "negative/toxic", 0: "neutral", 1: "positive"}[label]
    print(f"  {label} ({label_name}): {count} ({count/len(df)*100:.2f}%)")

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer

min_class_ratio = label_counts.min() / len(df)
if min_class_ratio < 0.25:
    print(f"\n⚠ Imbalanced data detected (min class ratio: {min_class_ratio:.2f})")
    print("Applying SMOTE for balancing...")
    try:
        vectorizer = TfidfVectorizer(max_features=1000)
        X = vectorizer.fit_transform(df['review'])
        y = df['label'].values

        smote = SMOTE(random_state=RANDOM_SEED)
        X_resampled, y_resampled = smote.fit_resample(X, y)

        print("Note: SMOTE creates synthetic samples. Consider manual review.")

    except Exception as e:
        print(f"SMOTE failed: {e}. Proceeding with original data.")
else:
    print("\n✓ Data is reasonably balanced. No resampling needed.")

if not {'train_ids.csv','val_ids.csv','test_ids.csv'} <= set(os.listdir('.')):
    df_train, df_temp = train_test_split(
        df, test_size=0.3, random_state=RANDOM_SEED, stratify=df['label']
    )
    df_val, df_test = train_test_split(
        df_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=df_temp['label']
    )
    df_train[['id']].to_csv('train_ids.csv', index=False)
    df_val[['id']].to_csv('val_ids.csv', index=False)
    df_test[['id']].to_csv('test_ids.csv', index=False)
else:
    ids_tr = set(pd.read_csv('train_ids.csv')['id'].tolist())
    ids_va = set(pd.read_csv('val_ids.csv')['id'].tolist())
    ids_te = set(pd.read_csv('test_ids.csv')['id'].tolist())
    df_train = df[df['id'].isin(ids_tr)].copy()
    df_val   = df[df['id'].isin(ids_va)].copy()
    df_test  = df[df['id'].isin(ids_te)].copy()

print(f"\nFinal split sizes: train={len(df_train)}, val={len(df_val)}, test={len(df_test)}")
print('Label ratios (train):', df_train['label'].value_counts(normalize=True).sort_index().to_dict())
print('Label ratios (val):  ', df_val['label'].value_counts(normalize=True).sort_index().to_dict())
print('Label ratios (test): ', df_test['label'].value_counts(normalize=True).sort_index().to_dict())


Cleaning dataset...
Cleaned dataset: kept 30140/30164 rows
Normalizing text (lemmatization and stemming)...
After normalization: 29756 rows

Class distribution:
  -1 (negative/toxic): 8573 (28.81%)
  0 (neutral): 11826 (39.74%)
  1 (positive): 9357 (31.45%)

✓ Data is reasonably balanced. No resampling needed.

Final split sizes: train=20824, val=4456, test=4476
Label ratios (train): {-1: 0.2877929312331925, 0: 0.39766615443718784, 1: 0.31454091432961967}
Label ratios (val):   {-1: 0.289048473967684, 0: 0.39587073608617596, 1: 0.31508078994614}
Label ratios (test):  {-1: 0.28865058087578194, 0: 0.3978999106344951, 1: 0.31344950848972297}
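
The "reasonably balanced" verdict above follows directly from the printed class counts. As a small worked check, assuming only those counts and the 0.25 threshold used in the cell:

```python
# Class counts copied from the printed class distribution.
label_counts = {-1: 8573, 0: 11826, 1: 9357}
THRESHOLD = 0.25  # same cut-off the cell compares against

total = sum(label_counts.values())
min_class_ratio = min(label_counts.values()) / total
needs_resampling = min_class_ratio < THRESHOLD

print(f"total={total}, min ratio={min_class_ratio:.4f}, resample={needs_resampling}")
# total=29756, min ratio=0.2881, resample=False
```

The smallest class holds about 28.8% of the rows, which is above the threshold, so the SMOTE branch is skipped.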


# Fast mode toggle

Set `FAST_MODE` to `True` to downsample the training split to 40% of its rows for quicker CPU experiments. The validation and test splits stay at full size by default.
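
A quick way to see what the toggle does is to sample a stand-in frame with the same row count as the cleaned training split. The sketch below is an assumption-laden illustration: `toy_train` is hypothetical and holds only an `id` column, while the fraction and seed match the cell.

```python
import numpy as np
import pandas as pd

RANDOM_SEED = 42
TRAIN_FRACTION = 0.40

# Hypothetical stand-in for df_train with as many rows as the cleaned training split.
toy_train = pd.DataFrame({"id": np.arange(20824)})

first = toy_train.sample(frac=TRAIN_FRACTION, random_state=RANDOM_SEED).sort_values("id")
second = toy_train.sample(frac=TRAIN_FRACTION, random_state=RANDOM_SEED).sort_values("id")

print(len(first))                        # rows kept after downsampling
print(first["id"].equals(second["id"]))  # same seed selects the same rows
```

Because the seed is fixed, repeated runs keep the same subset. Note that `sample` draws rows uniformly rather than per class, so label ratios in the reduced split can drift slightly from the stratified originals.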


In [3]:
FAST_MODE = True  # Set to True for faster training (40% data), False for full dataset
TRAIN_FRACTION = 0.40 if FAST_MODE else 1.0
VAL_FRACTION = 1.0  # Keep full validation/test by default

if FAST_MODE and TRAIN_FRACTION < 1.0:
    df_train = (df_train
                .sample(frac=TRAIN_FRACTION, random_state=RANDOM_SEED)
                .sort_values('id')
                .reset_index(drop=True))
    if VAL_FRACTION < 1.0:
        df_val = (df_val
                  .sample(frac=VAL_FRACTION, random_state=RANDOM_SEED)
                  .sort_values('id')
                  .reset_index(drop=True))
        df_test = (df_test
                   .sample(frac=VAL_FRACTION, random_state=RANDOM_SEED)
                   .sort_values('id')
                   .reset_index(drop=True))
    print(f"[FAST_MODE ENABLED] Using {TRAIN_FRACTION*100:.0f}% of training data\n  Train: {len(df_train)} samples (~{TRAIN_FRACTION*100:.0f}%), val={len(df_val)}, test={len(df_test)}")
else:
    print("FAST_MODE disabled: using full train/val/test splits")


[FAST_MODE ENABLED] Using 40% of training data
  Train: 8330 samples (~40%), val=4456, test=4476


# **Baseline TF-IDF + Logistic Regression (per proposal Section V.C)**


**`Purpose`**

This block builds a baseline model using TF-IDF vectorization and Logistic Regression as a reference point per proposal Section V.C. The baseline model acts as a foundation for measuring improvements from more complex deep learning methods. It uses traditional machine learning techniques that are simple, interpretable, and computationally efficient.

**`Input`**

The inputs are the df_train and df_val frames created earlier. Only the review and label columns are used. The TF-IDF vectorizer is configured with word 1-3 n-grams and a 100,000-feature vocabulary limit to cap memory and training time. The labels are taken directly as integer classes (-1, 0, 1).

**`Output`**

The block prints a compact dictionary that contains baseline accuracy, precision, recall, and F1-score (per proposal Section VI.A). It also appends a structured row to runs_log.csv so that the baseline appears in the experiment ledger with model name, scores, and notes. These outputs provide both an on-screen summary and a durable record for later tables and charts.

**`Details`**

A TF-IDF vectorizer is fit on the training text and applied to the validation text, producing sparse matrices. A Logistic Regression model is trained with hyperparameter tuning via grid search and cross-validation (per proposal Section VI.B). Predictions for the validation set are compared against the gold labels to compute accuracy, precision, recall, and F1-score, where macro-F1 treats all classes equally. The metrics are printed and then written to the log file with consistent column names.
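
To make the "treats all classes equally" point concrete, the sketch below computes macro-F1 by hand on a tiny made-up prediction set. The `macro_f1` helper and the toy labels are hypothetical and written for illustration; the notebook itself relies on scikit-learn's `f1_score(..., average='macro', zero_division=0)`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class (0).
y_true = [0, 0, 0, 0, 1, -1]
y_pred = [0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}, macro_f1={macro_f1(y_true, y_pred):.3f}")
# accuracy=0.667, macro_f1=0.267
```

Accuracy looks acceptable because the majority class dominates, while macro-F1 drops sharply because two of the three classes are never predicted. That is the reason the grid search is scored on macro-F1.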

**`Line-by-line Description.`**

`from sklearn.feature_extraction.text import TfidfVectorizer` imports the TF-IDF vectorizer for converting text to numerical features.

`from sklearn.linear_model import LogisticRegression` imports Logistic Regression classifier for baseline model.

`from sklearn.model_selection import GridSearchCV, StratifiedKFold` imports grid search and stratified K-fold cross-validation utilities.

`tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1,3), ...)` creates a TF-IDF vectorizer with 1-3 ngrams, max 100k features, and sublinear term frequency scaling.

`X_tr = df_train['review'].tolist()` and `y_tr = df_train['label'].values` extract the training text as a list and the labels as an array.

`X_va = df_val['review'].tolist()` and `y_va = df_val['label'].values` extract the validation text and labels.

`X_tr_tfidf = tfidf.fit_transform(X_tr)` fits the TF-IDF vectorizer on training text and transforms it to a sparse feature matrix.

`X_va_tfidf = tfidf.transform(X_va)` transforms validation text using the fitted vectorizer.

`param_grid = {'C': [0.1, 0.5, 1.0, 2.0, 4.0, 8.0], ...}` defines hyperparameter search space including regularization strength (C), class weights, max iterations, and penalty type.

`logreg_base = LogisticRegression(solver='liblinear', random_state=RANDOM_SEED)` creates base Logistic Regression model with liblinear solver.

`skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)` creates 5-fold stratified cross-validation splitter.

`grid_search = GridSearchCV(logreg_base, param_grid=param_grid, cv=skf, scoring='f1_macro', ...)` creates grid search object configured to maximize macro-F1 score.

`grid_search.fit(X_tr_tfidf, y_tr)` runs grid search with cross-validation to find best hyperparameters.
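
The cost of this search can be estimated before running it. As a small sketch, scikit-learn's `ParameterGrid` enumerates the same combinations that `GridSearchCV` will try; the grid below is copied from the cell, and `n_splits` matches the 5-fold splitter.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": [0.1, 0.5, 1.0, 2.0, 4.0, 8.0],
    "class_weight": [None, "balanced", {-1: 1.2, 0: 0.8, 1: 1.2}],
    "max_iter": [2000, 3000],
    "penalty": ["l1", "l2"],
}
n_splits = 5

# 6 values of C x 3 class weights x 2 iteration caps x 2 penalties
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, n_candidates * n_splits)  # 72 360
```

Each added value multiplies the number of fits, so trimming a dimension such as `max_iter` would halve the search time.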

`logreg = grid_search.best_estimator_` extracts the best model from grid search results.

`preds = logreg.predict(X_va_tfidf)` generates predictions on validation set using best model.

`acc_base = accuracy_score(y_va, preds)` calculates validation accuracy.

`prec_base = precision_score(y_va, preds, average='macro', zero_division=0)` calculates macro-averaged precision.

`rec_base = recall_score(y_va, preds, average='macro', zero_division=0)` calculates macro-averaged recall.

`f1_base = f1_score(y_va, preds, average='macro', zero_division=0)` calculates macro-averaged F1-score.

`print(classification_report(y_va, preds, ...))` generates and displays detailed per-class classification report.

`row = {...}` creates a dictionary row with model metadata and performance metrics for logging.

`pd.DataFrame([row]).to_csv("runs_log.csv", mode="a", ...)` appends the row to experiment log file.

`joblib.dump(logreg, "models/baseline_tfidf_logreg.joblib")` saves the trained model to disk for later use.

`joblib.dump(tfidf, "models/baseline_tfidf_vectorizer.joblib")` saves the fitted vectorizer to disk for consistent feature extraction.
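
Saving the vectorizer alongside the classifier matters because the model only understands the feature columns that this exact vectorizer produced. The round trip can be sketched as follows, assuming a made-up four-sentence corpus and a temporary directory in place of `models/`:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; the notebook fits on df_train['review'] instead.
texts = ["you are awful", "awful and rude", "have a great day", "great work today"]
labels = [-1, -1, 1, 1]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Persist both artifacts, then load them back as a later cell or script would.
with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(model, Path(tmp) / "model.joblib")
    joblib.dump(vectorizer, Path(tmp) / "vectorizer.joblib")
    loaded_model = joblib.load(Path(tmp) / "model.joblib")
    loaded_vectorizer = joblib.load(Path(tmp) / "vectorizer.joblib")

new_texts = ["what an awful take", "great point"]
original = model.predict(vectorizer.transform(new_texts))
restored = loaded_model.predict(loaded_vectorizer.transform(new_texts))
print((original == restored).all())
```

The restored pair reproduces the original predictions, which is what the test-set evaluation cell depends on when it loads the artifacts from disk.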


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, precision_score, recall_score


tfidf = TfidfVectorizer(
    max_features=100000,
    ngram_range=(1,3),
    lowercase=True,
    min_df=2,
    max_df=0.95,
    sublinear_tf=True
)

X_tr = df_train['review'].tolist()
y_tr = df_train['label'].values
X_va = df_val['review'].tolist()
y_va = df_val['label'].values

print("Fitting TF-IDF vectorizer...")
X_tr_tfidf = tfidf.fit_transform(X_tr)
X_va_tfidf = tfidf.transform(X_va)

print("\nPerforming hyperparameter tuning with GridSearchCV...")
param_grid = {
    'C': [0.1, 0.5, 1.0, 2.0, 4.0, 8.0],
    'class_weight': [None, 'balanced', {-1: 1.2, 0: 0.8, 1: 1.2}],
    'max_iter': [2000, 3000],
    'penalty': ['l1', 'l2']
}

logreg_base = LogisticRegression(solver='liblinear', random_state=RANDOM_SEED)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

grid_search = GridSearchCV(
    logreg_base,
    param_grid=param_grid,
    cv=skf,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_tr_tfidf, y_tr)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score (F1-macro): {grid_search.best_score_:.4f}")

logreg = grid_search.best_estimator_
preds = logreg.predict(X_va_tfidf)

acc_base = accuracy_score(y_va, preds)
prec_base = precision_score(y_va, preds, average='macro', zero_division=0)
rec_base = recall_score(y_va, preds, average='macro', zero_division=0)
f1_base = f1_score(y_va, preds, average='macro', zero_division=0)

print("\nBaseline Model Performance (Validation Set):")
print({
    "model": "tfidf-logreg",
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base
})

print("\nClassification Report:")
print(classification_report(y_va, preds,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

row = {
    "member": "baseline",
    "model": "tfidf-logreg",
    "num_train_epochs": None,
    "per_device_train_batch_size": None,
    "learning_rate": None,
    "weight_decay": None,
    "warmup_steps": None,
    "lr_scheduler_type": None,
    "gradient_accumulation_steps": None,
    "max_seq_length": None,
    "seed": RANDOM_SEED,
    "fp16": False,
    "accuracy": acc_base,
    "precision": prec_base,
    "recall": rec_base,
    "f1_macro": f1_base,
    "notes": f"TF-IDF + LogReg baseline with GridSearchCV. Best params: {grid_search.best_params_}"
}

pd.DataFrame([row]).to_csv("runs_log.csv", mode="a",
                           index=False, header=not os.path.exists("runs_log.csv"))

os.makedirs("models", exist_ok=True)
import joblib
joblib.dump(logreg, "models/baseline_tfidf_logreg.joblib")
joblib.dump(tfidf, "models/baseline_tfidf_vectorizer.joblib")
print("\n✓ Baseline model saved to models/baseline_tfidf_logreg.joblib")


Fitting TF-IDF vectorizer...

Performing hyperparameter tuning with GridSearchCV...
Fitting 5 folds for each of 72 candidates, totalling 360 fits

Best parameters: {'C': 2.0, 'class_weight': {-1: 1.2, 0: 0.8, 1: 1.2}, 'max_iter': 2000, 'penalty': 'l1'}
Best CV score (F1-macro): 0.6390

Baseline Model Performance (Validation Set):
{'model': 'tfidf-logreg', 'accuracy': 0.6364452423698385, 'precision': 0.6419309572443174, 'recall': 0.6337474541408289, 'f1_macro': 0.6368850204779207}

Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.63      0.56      0.59      1288
       neutral       0.59      0.64      0.62      1764
      positive       0.71      0.69      0.70      1404

      accuracy                           0.64      4456
     macro avg       0.64      0.63      0.64      4456
  weighted avg       0.64      0.64      0.64      4456


✓ Baseline model saved to models/baseline_tfidf_logreg.joblib
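
The `macro avg` and `weighted avg` rows of the validation report can be checked by hand from the per-class rows. The snippet below is a small illustrative check, not part of the original pipeline. It uses the rounded per-class F1 scores and supports printed above, so it can only be expected to agree with the report to two decimals.

```python
# Per-class F1 and support copied from the validation report above (rounded values).
f1_per_class = {"negative/toxic": 0.59, "neutral": 0.62, "positive": 0.70}
support = {"negative/toxic": 1288, "neutral": 1764, "positive": 1404}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class is weighted by its number of examples.
total = sum(support.values())
weighted_f1 = sum(f1_per_class[c] * support[c] for c in f1_per_class) / total

print(f"macro F1 ~ {macro_f1:.2f}, weighted F1 ~ {weighted_f1:.2f}")
```

Because the three classes are of similar size, the two averages land close together; macro-F1 is the one used for model selection in this project.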


# **Baseline Evaluation on Test Set**


In [5]:

try:
    _ = tfidf
    _ = logreg
    print("Using models from previous cell...")
except NameError:

    import joblib
    import os
    model_path = "models/baseline_tfidf_logreg.joblib"
    vectorizer_path = "models/baseline_tfidf_vectorizer.joblib"

    if os.path.exists(model_path) and os.path.exists(vectorizer_path):
        print("Loading baseline models from disk...")
        tfidf = joblib.load(vectorizer_path)
        logreg = joblib.load(model_path)
        print("✓ Models loaded successfully")
    else:
        raise FileNotFoundError(
            "Baseline models not found. Please run Cell 12 (Baseline TF-IDF + Logistic Regression training) first to train and save the models."
        )

X_te = df_test['review'].tolist()
y_te = df_test['label'].values
X_te_tfidf = tfidf.transform(X_te)

test_preds = logreg.predict(X_te_tfidf)

test_acc = accuracy_score(y_te, test_preds)
test_prec = precision_score(y_te, test_preds, average='macro', zero_division=0)
test_rec = recall_score(y_te, test_preds, average='macro', zero_division=0)
test_f1 = f1_score(y_te, test_preds, average='macro', zero_division=0)

print("Baseline Model Performance (Test Set):")
print({
    "model": "tfidf-logreg",
    "accuracy": test_acc,
    "precision": test_prec,
    "recall": test_rec,
    "f1_macro": test_f1
})

print("\nTest Set Classification Report:")
print(classification_report(y_te, test_preds,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

os.makedirs("exports", exist_ok=True)
pd.DataFrame({
    'review': X_te,
    'gold': y_te,
    'pred': test_preds
}).to_csv('exports/baseline_predictions_test.csv', index=False)

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_te, test_preds, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('Baseline Model - Test Set Confusion Matrix')
plt.tight_layout()
os.makedirs("exports/confusion_matrices", exist_ok=True)
plt.savefig('exports/confusion_matrices/baseline_cm_test.png', dpi=150)
plt.show()  # Display in notebook
print("\n✓ Confusion matrix saved to exports/confusion_matrices/baseline_cm_test.png")


Using models from previous cell...
Baseline Model Performance (Test Set):
{'model': 'tfidf-logreg', 'accuracy': 0.6224307417336908, 'precision': 0.6293671945894598, 'recall': 0.6196908253354203, 'f1_macro': 0.6233782959331148}

Test Set Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.62      0.56      0.59      1292
       neutral       0.58      0.63      0.60      1781
      positive       0.69      0.67      0.68      1403

      accuracy                           0.62      4476
     macro avg       0.63      0.62      0.62      4476
  weighted avg       0.62      0.62      0.62      4476


✓ Confusion matrix saved to exports/confusion_matrices/baseline_cm_test.png


# **BERT Model Initialization and Tokenization**


**`Purpose`**

This block initializes the BERT-base-uncased model per proposal Section V.C for toxicity classification. It loads the BERT tokenizer (WordPiece tokenizer) and prepares the datasets for transformer fine-tuning by converting tweets into numerical input representations suitable for model processing.

**`Input`**

The inputs are the training, validation, and test DataFrames created earlier. The BERT tokenizer is loaded from the transformers library, and a maximum sequence length is specified to keep batch shapes uniform.

**`Output`**

The block prints the resolved model name and confirms that tokenization has completed. It produces three `datasets.Dataset` objects with tensor columns `input_ids`, `attention_mask`, and `label`, and creates a classification model with three output labels that is moved to the detected device.

**`Details`**

The BERT WordPiece tokenizer is loaded with the fast backend and wrapped in a function that truncates or pads every tweet to a fixed length of 128 tokens (the limit chosen in this notebook for tweets). The pandas frames are converted into `Dataset` objects, tokenization is applied in batches for speed, and the dataset columns are formatted as PyTorch tensors. The model is loaded with a task-specific classification head sized to three classes and placed on the GPU when one is available, otherwise on the CPU.

**`Line-by-line Description.`**

`from datasets import Dataset` imports HuggingFace Dataset class for efficient data handling.

`from transformers import AutoTokenizer, AutoModelForSequenceClassification` imports BERT tokenizer and model classes.

`USE_GPU = torch.cuda.is_available()` checks if GPU is available for training acceleration.

`device = torch.device("cuda:0")` or `device = torch.device("cpu")` sets the device for model placement.

`NUM_WORKERS = 0 if platform.system() == 'Windows' else 4` sets the number of data-loading workers when a GPU is present: 0 on Windows (single-process loading) and 4 elsewhere. Without a GPU the cell sets `NUM_WORKERS = 0`.

`MODEL_NAME = "bert-base-uncased"` defines the base BERT model identifier.

`MAX_LEN = 128` sets maximum sequence length for tokenization (standard for tweets).

`tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)` loads BERT's WordPiece tokenizer with fast backend.

`def tokenize_fn(batch):` defines a tokenization function that applies truncation and padding to fixed length.

`return tokenizer(batch["review"], truncation=True, padding="max_length", max_length=MAX_LEN)` tokenizes text with truncation and padding to max length.

`label_mapping = {-1: 0, 0: 1, 1: 2}` maps the proposal's label format (-1, 0, 1) to the contiguous class indices (0, 1, 2) that the classification head expects.

`df_train_bert['label'] = df_train_bert['label'].map(label_mapping)` converts training labels to BERT format.

`ds_train = Dataset.from_pandas(df_train_bert[['review','label']].reset_index(drop=True))` converts pandas DataFrame to HuggingFace Dataset.

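`NUM_PROC_TOKENIZE = 4 if USE_GPU else 2` sets the number of processes for parallel tokenization (4 with a GPU, 2 without). On Windows the cell sets it to `None` instead, which keeps tokenization in a single process.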

`ds_train = ds_train.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])` applies tokenization in batches with parallel processing, removing original review column.

`ds_train = ds_train.with_format("torch", columns=cols)` formats dataset columns as PyTorch tensors for model input.

`num_labels = 3` sets number of classification classes (negative/toxic, neutral, positive).

`model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)` loads BERT model with classification head sized to three classes.

`model = model.to(device)` moves model to appropriate device (CPU or GPU).
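
To make the fixed-length step concrete, here is a minimal sketch of what truncation and padding to `MAX_LEN` do to a sequence of token IDs, together with the label remapping. It is a simplified stand-in rather than the tokenizer's implementation: `pad_or_truncate`, `PAD_ID`, and the token IDs are hypothetical names and values chosen for illustration, and the real tokenizer additionally handles special tokens such as `[CLS]` and `[SEP]`.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: illustrative padding id, not read from a real tokenizer


def pad_or_truncate(token_ids: List[int], max_len: int) -> Dict[str, List[int]]:
    """Cut a sequence to max_len, then right-pad it with PAD_ID."""
    kept = token_ids[:max_len]
    n_pad = max_len - len(kept)
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding that the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Label remapping used in the notebook: class indices must start at 0.
label_mapping = {-1: 0, 0: 1, 1: 2}
inverse_mapping = {v: k for k, v in label_mapping.items()}

short = pad_or_truncate([11, 22, 33], max_len=5)       # shorter than max_len: padded
long = pad_or_truncate(list(range(1, 9)), max_len=5)   # longer than max_len: truncated
print(short)
print(long)
print([label_mapping[y] for y in (-1, 0, 1)])
```

Every example ends up with the same length, which is what keeps batch shapes uniform. The inverse mapping is useful when predictions need to be reported in the original (-1, 0, 1) format.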


In [6]:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import platform

USE_GPU = torch.cuda.is_available()
if USE_GPU:
    print(f"✓ GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"  CUDA Version: {torch.version.cuda}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    device = torch.device("cuda:0")
    NUM_WORKERS = 0 if platform.system() == 'Windows' else 4
    PIN_MEMORY = True
else:
    print("⚠ No GPU detected, using CPU (training will be slow)")
    device = torch.device("cpu")
    NUM_WORKERS = 0
    PIN_MEMORY = False

MODEL_NAME = "bert-base-uncased"
print(f"\nUsing model: {MODEL_NAME}")

MAX_LEN = 128

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tokenize_fn(batch):
    """Tokenize tweets using BERT's WordPiece tokenizer (per proposal Section V.B)"""
    return tokenizer(batch["review"], truncation=True, padding="max_length", max_length=MAX_LEN)

print("\nTokenizing datasets (this may take a moment)...")

label_mapping = {-1: 0, 0: 1, 1: 2}
df_train_bert = df_train.copy()
df_val_bert = df_val.copy()
df_test_bert = df_test.copy()
df_train_bert['label'] = df_train_bert['label'].map(label_mapping)
df_val_bert['label'] = df_val_bert['label'].map(label_mapping)
df_test_bert['label'] = df_test_bert['label'].map(label_mapping)

ds_train = Dataset.from_pandas(df_train_bert[['review','label']].reset_index(drop=True))
ds_val   = Dataset.from_pandas(df_val_bert[['review','label']].reset_index(drop=True))
ds_test  = Dataset.from_pandas(df_test_bert[['review','label']].reset_index(drop=True))

if platform.system() == 'Windows':
    NUM_PROC_TOKENIZE = None
    print("  Using single-process tokenization (Windows compatibility)")
else:
    NUM_PROC_TOKENIZE = 4 if USE_GPU else 2
    print(f"  Using {NUM_PROC_TOKENIZE} processes for tokenization")

ds_train = ds_train.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])
ds_val   = ds_val.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])
ds_test  = ds_test.map(tokenize_fn, batched=True, num_proc=NUM_PROC_TOKENIZE, remove_columns=['review'])

cols = ['input_ids','attention_mask','label']
ds_train = ds_train.with_format("torch", columns=cols)
ds_val   = ds_val.with_format("torch", columns=cols)
ds_test  = ds_test.with_format("torch", columns=cols)

print(f"✓ Datasets ready: train={len(ds_train)}, val={len(ds_val)}, test={len(ds_test)}")

num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
model = model.to(device)
print(f"✓ Model loaded on {device}")


✓ GPU detected: Tesla T4
  CUDA Version: 12.6
  GPU Memory: 15.83 GB

Using model: bert-base-uncased



Tokenizing datasets (this may take a moment)...
  Using 4 processes for tokenization


✓ Datasets ready: train=8330, val=4456, test=4476


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ Model loaded on cuda:0


# **BERT Training Arguments and Configuration**


**`Purpose`**

This block defines how BERT training will proceed per proposal Section V.D. It establishes a metric function that reports accuracy, precision, recall, and F1-score, builds training arguments with AdamW optimizer and cross-entropy loss, and constructs a Trainer object that ties the model, data, tokenizer, arguments, and metrics together.

**`Input`**

Inputs include the tokenized datasets, the initialized model and tokenizer, and hyperparameters such as number of epochs, batch sizes, learning rate, weight decay, and evaluation cadence.

**`Output`**

The cell prints a confirmation that the trainer is ready and includes the active model name. Internally, it prepares all objects required for training and evaluation.

**`Details`**

A compute function converts raw model outputs (logits) into predicted labels with an argmax and compares them with the gold labels to obtain accuracy, precision, recall, and macro-F1 (target: >0.85 per proposal Section III). Training arguments configure the AdamW optimizer, a linear learning-rate schedule with warmup, and mixed precision when a GPU is available. The model's default cross-entropy loss is used, and an early-stopping callback guards against overfitting.
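
The metric computation can be sketched without any libraries. The following is a simplified illustration of argmax followed by macro-F1, not the scikit-learn implementation the notebook calls; the helper names and the logits are hypothetical. Like `zero_division=0` in the notebook, a class with no true positives, false positives, or false negatives scores 0.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit in one row."""
    return max(range(len(row)), key=lambda i: row[i])


def macro_f1(labels: List[int], preds: List[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1 over a fixed set of classes."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Hypothetical logits for four examples and three classes.
logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0], [1.2, 0.9, 0.1]]
labels = [0, 1, 2, 1]
preds = [argmax(row) for row in logits]
print(preds, round(macro_f1(labels, preds, num_classes=3), 4))
```

The last example is misclassified as class 0, which lowers the F1 of both class 0 (one false positive) and class 1 (one false negative) while class 2 stays perfect.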

**`Line-by-line Description.`**

`from transformers import TrainingArguments, Trainer, EarlyStoppingCallback` imports HuggingFace training utilities.

`def compute_metrics(eval_pred):` defines a function to compute evaluation metrics from model predictions.

`preds = np.argmax(eval_pred.predictions, axis=1)` extracts predicted class indices by finding maximum logit value.

`labels = eval_pred.label_ids` extracts true labels from evaluation predictions.

`acc = accuracy_score(labels, preds)` calculates accuracy as proportion of correctly classified samples.

`prec = precision_score(labels, preds, average='macro', zero_division=0)` calculates macro-averaged precision across classes.

`rec = recall_score(labels, preds, average='macro', zero_division=0)` calculates macro-averaged recall across classes.

`f1m = f1_score(labels, preds, average='macro', zero_division=0)` calculates macro-averaged F1-score (primary metric).

`return {'accuracy': acc, 'precision': prec, 'recall': rec, 'f1_macro': f1m}` returns dictionary of computed metrics.

`if USE_GPU: TRAIN_BATCH_SIZE = 32` sets larger batch size for GPU training (32 for GPU, 8 for CPU).

`USE_FP16 = True` (set only when a GPU is available) enables mixed-precision training, which speeds up training and reduces memory usage.

`def make_training_args(**overrides):` defines a function to create training arguments with overridable defaults.

`base_epochs = 3 if FAST_MODE else 4` sets number of training epochs based on fast mode setting.

`total_steps = max(1, (len(ds_train) // max(1, TRAIN_BATCH_SIZE)) * base_epochs)` estimates the total number of training steps.

`warmup_steps = max(25, int(total_steps * 0.1))` estimates the warmup steps as 10% of the total (minimum 25). Neither value is passed to `cfg`; the warmup length actually used comes from `warmup_ratio=0.1`.

`cfg = dict(output_dir=f"./checkpoints/bert-base/run1", ...)` creates training configuration dictionary.

`learning_rate=3e-5` sets learning rate for AdamW optimizer (standard for BERT fine-tuning).

`weight_decay=0.01` sets the weight-decay coefficient applied by AdamW, a regularizer that helps limit overfitting.

`warmup_ratio=0.1` sets learning rate warmup ratio (linear warmup over first 10% of steps).

`lr_scheduler_type="linear"` specifies linear learning rate decay after warmup.

`load_best_model_at_end=True` enables loading best model checkpoint at end of training.

`metric_for_best_model="f1_macro"` sets macro-F1 as the metric for selecting best model checkpoint.

`early_stopping_patience=2` configures early stopping to stop training if validation metric doesn't improve for 2 evaluations.

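`callbacks = [EarlyStoppingCallback(early_stopping_patience=2, ...)]` creates the early stopping callback. The elided argument is `early_stopping_threshold=0.001`, the minimum improvement in the monitored metric that counts as progress.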

`trainer = Trainer(model=model, args=training_args, train_dataset=ds_train, ...)` creates Trainer object tying together model, data, arguments, and metrics.
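
The shape of a linear schedule with warmup can be sketched as a small function. This is an illustration of the general technique under stated assumptions, not the library's code: `linear_warmup_decay` is a hypothetical helper, the epoch count of 4 assumes `FAST_MODE` is off, and the step counts are rounded up here, whereas the notebook's own estimate uses floor division and gives (8330 // 32) * 4 = 1040 steps with 104 warmup steps.

```python
import math


def linear_warmup_decay(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values from this notebook's GPU run: 8330 training examples, batch size 32,
# learning rate 3e-5, warmup ratio 0.1. Four epochs is an assumption.
steps_per_epoch = math.ceil(8330 / 32)        # 261 optimizer steps per epoch
total_steps = steps_per_epoch * 4             # 1044
warmup_steps = math.ceil(total_steps * 0.1)   # 105

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_warmup_decay(step, 3e-5, warmup_steps, total_steps):.2e}")
```

The rate starts at zero, reaches the configured 3e-5 at the end of warmup, and then falls linearly back to zero at the final step.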


In [7]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
from sklearn.metrics import precision_score, recall_score
import inspect

def compute_metrics(eval_pred):
    """Compute metrics per proposal Section VI.A: accuracy, precision, recall, F1-score"""
    preds = np.argmax(eval_pred.predictions, axis=1)
    labels = eval_pred.label_ids
    acc = accuracy_score(labels, preds)
    prec = precision_score(labels, preds, average='macro', zero_division=0)
    rec = recall_score(labels, preds, average='macro', zero_division=0)
    f1m = f1_score(labels, preds, average='macro', zero_division=0)
    return {'accuracy': acc, 'precision': prec, 'recall': rec, 'f1_macro': f1m}

if USE_GPU:
    TRAIN_BATCH_SIZE = 32
    EVAL_BATCH_SIZE = 64
    USE_FP16 = True
    GRADIENT_CHECKPOINTING = False
else:
    TRAIN_BATCH_SIZE = 8
    EVAL_BATCH_SIZE = 16
    USE_FP16 = False
    GRADIENT_CHECKPOINTING = False

sig = inspect.signature(TrainingArguments.__init__)
argnames = set(sig.parameters.keys())

def make_training_args(**overrides):
    """Create training arguments per proposal Section V.D"""
    base_epochs = 3 if FAST_MODE else 4
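    # Step estimates kept for reference only: they are not passed to `cfg`,
    # which relies on `warmup_ratio` for the warmup length.
    total_steps = max(1, (len(ds_train) // max(1, TRAIN_BATCH_SIZE)) * base_epochs)
    warmup_steps = max(25, int(total_steps * 0.1))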

    cfg = dict(
        output_dir=f"./checkpoints/bert-base/run1",
        num_train_epochs=base_epochs,
        per_device_train_batch_size=TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=EVAL_BATCH_SIZE,
        learning_rate=3e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        gradient_accumulation_steps=1,
        load_best_model_at_end=True,
        metric_for_best_model="f1_macro",
        greater_is_better=True,
        seed=RANDOM_SEED,
        logging_steps=50,
        eval_steps=100,
        save_steps=200,
        save_total_limit=2,
        report_to=[],
        optim="adamw_torch",
        fp16=USE_FP16,
        dataloader_num_workers=NUM_WORKERS,
        dataloader_pin_memory=PIN_MEMORY,
        remove_unused_columns=False,
        gradient_checkpointing=GRADIENT_CHECKPOINTING,
    )
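    cfg.update(overrides)

    # The argument is named `evaluation_strategy` in older transformers releases
    # and `eval_strategy` in newer ones. Accept either spelling from the caller
    # and pass the value under whichever name this installation supports.
    requested = cfg.pop("evaluation_strategy", None)
    requested = cfg.pop("eval_strategy", requested) or "steps"
    if "eval_strategy" in argnames:
        cfg["eval_strategy"] = requested
    elif "evaluation_strategy" in argnames:
        cfg["evaluation_strategy"] = requested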

    if "save_strategy" in argnames:
        cfg["save_strategy"] = cfg.get("save_strategy", "steps")

    safe_cfg = {k:v for k,v in cfg.items() if k in argnames}
    return TrainingArguments(**safe_cfg)

training_args = make_training_args()

if GRADIENT_CHECKPOINTING and hasattr(model, 'gradient_checkpointing_enable'):
    model.gradient_checkpointing_enable()
    print("✓ Gradient checkpointing enabled (saves memory)")

callbacks = [EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001)]

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=callbacks,
)

print(f"✓ Trainer ready on {device}")
print(f"  Batch size: {TRAIN_BATCH_SIZE} (train), {EVAL_BATCH_SIZE} (eval)")
print(f"  FP16: {USE_FP16}, Workers: {NUM_WORKERS}, Pin Memory: {PIN_MEMORY}")
print(f"  Early stopping: patience=2")
print(f"  Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup")


✓ Trainer ready on cuda:0
  Batch size: 32 (train), 64 (eval)
  FP16: True, Workers: 4, Pin Memory: True
  Early stopping: patience=2
  Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup
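
The patience and threshold settings reported above interact as follows. This is a simplified sketch of patience-based early stopping, not the callback's actual code: `first_stop_index` is a hypothetical helper and the score history is invented for illustration.

```python
from typing import List, Optional


def first_stop_index(scores: List[float], patience: int, threshold: float) -> Optional[int]:
    """Return the evaluation index at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, score in enumerate(scores):
        if best is None or score - best > threshold:
            best = score       # meaningful improvement: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement beyond the threshold
            if bad_evals >= patience:
                return i
    return None


# Hypothetical validation macro-F1 values, one per evaluation.
history = [0.600, 0.650, 0.649, 0.6505, 0.640]
print(first_stop_index(history, patience=2, threshold=0.001))
```

With `patience=2` and `threshold=0.001`, the gain from 0.650 to 0.6505 is too small to count as progress, so the run stops at the fourth evaluation (index 3).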




# **Automated Hyperparameter Tuning**

**`Purpose`**

Run Optuna-driven grid and random searches, per proposal Section V.C, to tune hyperparameters and support validation curve analysis. The search explores learning rate, batch size, number of epochs, and weight decay, and selects the best configuration by validation macro-F1.

**`Input`**

Uses the tokenized datasets (`ds_train`, `ds_val`), global tokenizer/model selections, and shared helpers (`compute_metrics`, `make_training_args`). Configuration depends on `AUTO_TUNE_ENABLED`, search spaces, and trial limits.

**`Output`**

Writes per-strategy trial tables and a combined strategy summary to the `tuning/` directory and prints the best configuration for each strategy. The winning configuration is retrained, evaluated on the validation and test splits, saved as a checkpoint, and logged to `runs_log.csv`.

**`Details`**

The cell defines a lightweight `WeightedTrainer` subclass and helper functions that build a Trainer for each set of suggested hyperparameters. One Optuna study is created with a `GridSampler` and one with a `RandomSampler`; each trial is timed, and its validation metrics are stored as trial attributes.
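
For the random search, the learning rate is drawn on a logarithmic scale. The sketch below shows what log-uniform sampling means, using only the standard library. It is an illustration of the idea rather than Optuna's sampler, and `sample_log_uniform` and the seed are hypothetical choices.

```python
import math
import random


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)  # assumption: seed chosen only for this illustration
draws = [sample_log_uniform(rng, 2e-5, 5e-5) for _ in range(8)]
for lr in draws:
    print(f"{lr:.2e}")
```

Sampling in log space spreads trials evenly across orders of magnitude, which matters more for wide ranges than for the narrow 2e-5 to 5e-5 interval used here.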

**`Line-by-line Description.`**

`AUTO_TUNE_ENABLED = True` enables/disables automated hyperparameter tuning.

`GRID_SEARCH_SPACE = {"learning_rate": [3e-5], ...}` defines grid search hyperparameter space with discrete values.

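`RANDOM_SEARCH_SPACE = {"learning_rate": ("log_uniform", 2e-5, 5e-5), ...}` documents the random search space with continuous distributions. `suggest_params` does not read this dictionary; it repeats the same ranges in its `trial.suggest_*` calls, so the two must be kept in sync by hand.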

`RANDOM_TRIALS = 8` sets number of random search trials to run.

`MAX_AUTOTUNE_EPOCHS = 4` sets maximum epochs per trial during hyperparameter tuning.

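`class WeightedTrainer(Trainer):` defines a Trainer subclass that stores an optional `class_weights` argument. It does not override the loss computation, so as written it trains exactly like the standard `Trainer`.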

`def build_trainer_for_trial(hparams, run_name):` defines function to build a Trainer instance with suggested hyperparameters.

`trial_args = make_training_args(..., learning_rate=hparams.get("learning_rate", 3e-5), ...)` creates training arguments with trial-specific hyperparameters.

`trial_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3).to(device)` creates fresh model instance for each trial.

`trial_trainer = WeightedTrainer(model=trial_model, args=trial_args, ...)` creates Trainer instance for current trial.

`def suggest_params(trial, strategy):` defines function to suggest hyperparameters based on search strategy (grid or random).

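`if strategy == "grid":` handles grid search by decoding `trial.number` into one combination of the grid values. This branch never calls `trial.suggest_*`, so Optuna records no parameters for grid trials: the log reports `parameters: {}`, `study.best_trial.params` is empty, and the retraining step for the grid strategy falls back to the defaults in `build_trainer_for_trial`.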

`lr_idx = idx % len(lr_vals)` and similar lines compute parameter indices for grid search combination.

`return {...}` returns dictionary of suggested hyperparameters for current trial.

`else: return {...}` handles random search by using Optuna's suggest methods for continuous parameters.

`def objective(trial):` defines Optuna objective function that trains and evaluates a model for one trial.

`strategy = trial.study.sampler.__class__.__name__` determines search strategy (GridSampler or RandomSampler) from study.

`hparams = suggest_params(trial, "grid" if "Grid" in strategy else "random")` gets suggested hyperparameters for current trial.

`start_time = time.time()` records the trial start time so the training duration can be measured.

`trainer_obj.train()` trains the model for the current trial with the suggested hyperparameters.

`eval_results = trainer_obj.evaluate()` evaluates trained model on validation set.

`f1_macro = eval_results.get("eval_f1_macro", 0.0)` extracts validation F1-macro (objective to maximize).

`trial.set_user_attr("train_time", train_time)` stores trial execution time as user attribute.

`return f1_macro` returns validation F1-macro as Optuna objective value.

`sampler = GridSampler(GRID_SEARCH_SPACE)` creates grid search sampler for exhaustive search.

`study = optuna.create_study(direction="maximize", sampler=sampler)` creates Optuna study configured to maximize F1-macro.

`study.optimize(objective, n_trials=total_trials, show_progress_bar=True)` runs grid search optimization for all combinations.

`sampler = RandomSampler(seed=RANDOM_SEED)` creates random search sampler with fixed seed for reproducibility.

`study.optimize(objective, n_trials=RANDOM_TRIALS, ...)` runs random search optimization for specified number of trials.

`trials_df = pd.DataFrame([...])` creates DataFrame with trial results (hyperparameters and metrics).

`trials_df.to_csv(TUNING_DIR / f"{strategy}_trials.csv", index=False)` saves trial results to CSV file.

`best_hparams = study.best_trial.params` extracts best hyperparameters from completed study.

`best_trainer.train()` retrains model with best hyperparameters on full training set.

`best_trainer.save_model(best_ckpt_dir)` saves best model checkpoint to disk.

`tokenizer.save_pretrained(best_ckpt_dir)` saves tokenizer alongside model checkpoint.

`row = {...}` creates log entry with best trial results for experiment ledger.

`pd.DataFrame([row]).to_csv("runs_log.csv", mode="a", ...)` appends best trial results to experiment log.
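
The index arithmetic in the grid branch of `suggest_params` treats the trial number as a mixed-radix number, one digit per hyperparameter. The sketch below reproduces that decoding in isolation with the grid values from this notebook. `decode_grid_index` and `GRID` are hypothetical names used only for illustration.

```python
from itertools import product
from typing import Dict, List

GRID = {
    "learning_rate": [3e-5],
    "per_device_train_batch_size": [8, 16],
    "weight_decay": [0.0, 0.01],
    "num_train_epochs": [2, 3],
}


def decode_grid_index(index: int, grid: Dict[str, List]) -> Dict[str, object]:
    """Map a trial number to one grid combination (first key varies fastest)."""
    combo = {}
    for name, values in grid.items():
        combo[name] = values[index % len(values)]
        index //= len(values)
    return combo


combos = [decode_grid_index(i, GRID) for i in range(8)]
for i, combo in enumerate(combos):
    print(i, combo)

# Every combination from the Cartesian product is visited exactly once.
expected = {tuple(p) for p in product(*GRID.values())}
assert {tuple(c.values()) for c in combos} == expected
```

Trial 0 therefore maps to batch size 8, weight decay 0.0, and 2 epochs. A possible alternative, not used in this notebook, is to call `trial.suggest_categorical` with the same parameter names so that the values chosen by the sampler are recorded on each trial.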


In [8]:
AUTO_TUNE_ENABLED = True

GRID_SEARCH_SPACE = {
    "learning_rate": [3e-5],
    "per_device_train_batch_size": [8, 16],
    "weight_decay": [0.0, 0.01],
    "num_train_epochs": [2, 3],
}


RANDOM_SEARCH_SPACE = {
    "learning_rate": ("log_uniform", 2e-5, 5e-5),
    "per_device_train_batch_size": ("choice", [8, 12, 16, 24, 32]),
    "weight_decay": ("uniform", 0.0, 0.1),
    "num_train_epochs": ("int", 2, 4),
}

RANDOM_TRIALS = 8
MAX_AUTOTUNE_EPOCHS = 4

print("Hyperparameter tuning configuration:")


Hyperparameter tuning configuration:


In [9]:
import gc
import time
from pathlib import Path

import optuna  # needed for optuna.create_study and optuna.TrialPruned below
from optuna.samplers import GridSampler, RandomSampler

TUNING_DIR = Path("tuning")
TUNING_DIR.mkdir(exist_ok=True)

if AUTO_TUNE_ENABLED:

    class WeightedTrainer(Trainer):
        def __init__(self, *args, class_weights=None, **kwargs):
            super().__init__(*args, **kwargs)
            self.class_weights = class_weights

    def build_trainer_for_trial(hparams, run_name):
        """Build trainer with suggested hyperparameters"""
        trial_args = make_training_args(
            output_dir=f"./checkpoints/bert-base/{run_name}",
            learning_rate=hparams.get("learning_rate", 3e-5),
            per_device_train_batch_size=hparams.get("per_device_train_batch_size", 16),
            weight_decay=hparams.get("weight_decay", 0.01),
            num_train_epochs=min(hparams.get("num_train_epochs", 3), MAX_AUTOTUNE_EPOCHS),
            eval_strategy="epoch",
            save_strategy="epoch",
        )

        trial_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3).to(device)

        trial_trainer = WeightedTrainer(
            model=trial_model,
            args=trial_args,
            train_dataset=ds_train,
            eval_dataset=ds_val,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001)],
        )
        return trial_trainer, trial_args

    def suggest_params(trial, strategy):
        """Suggest hyperparameters based on strategy"""
        if strategy == "grid":
            lr_vals = GRID_SEARCH_SPACE["learning_rate"]
            bs_vals = GRID_SEARCH_SPACE["per_device_train_batch_size"]
            wd_vals = GRID_SEARCH_SPACE["weight_decay"]
            ep_vals = GRID_SEARCH_SPACE["num_train_epochs"]

            trial_idx = trial.number
            total_combos = len(lr_vals) * len(bs_vals) * len(wd_vals) * len(ep_vals)
            if trial_idx >= total_combos:
                raise optuna.TrialPruned()

            idx = trial_idx
            lr_idx = idx % len(lr_vals)
            idx //= len(lr_vals)
            bs_idx = idx % len(bs_vals)
            idx //= len(bs_vals)
            wd_idx = idx % len(wd_vals)
            idx //= len(wd_vals)
            ep_idx = idx % len(ep_vals)

            return {
                "learning_rate": lr_vals[lr_idx],
                "per_device_train_batch_size": bs_vals[bs_idx],
                "weight_decay": wd_vals[wd_idx],
                "num_train_epochs": ep_vals[ep_idx],
            }
        else:
            return {
                "learning_rate": trial.suggest_float("learning_rate", 2e-5, 5e-5, log=True),
                "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 12, 16, 24, 32]),
                "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
                "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
            }

    def objective(trial):
        """Optuna objective function"""
        strategy = trial.study.sampler.__class__.__name__
        hparams = suggest_params(trial, "grid" if "Grid" in strategy else "random")

        run_name = f"trial_{trial.number}"
        trainer_obj, args_obj = build_trainer_for_trial(hparams, run_name)

        start_time = time.time()
        trainer_obj.train()
        train_time = time.time() - start_time

        eval_results = trainer_obj.evaluate()
        f1_macro = eval_results.get("eval_f1_macro", 0.0)


        del trainer_obj
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        trial.set_user_attr("train_time", train_time)
        trial.set_user_attr("accuracy", eval_results.get("eval_accuracy", 0.0))
        trial.set_user_attr("precision", eval_results.get("eval_precision", 0.0))
        trial.set_user_attr("recall", eval_results.get("eval_recall", 0.0))

        return f1_macro

    SEARCH_STRATEGIES = ["grid", "random"]
    summary_rows = []

    for strategy in SEARCH_STRATEGIES:
        print(f"\n=== Hyperparameter search: {strategy.upper()} ===")

        if strategy == "grid":
            sampler = GridSampler(GRID_SEARCH_SPACE)
            study = optuna.create_study(direction="maximize", sampler=sampler)

            total_trials = (len(GRID_SEARCH_SPACE["learning_rate"]) *
                          len(GRID_SEARCH_SPACE["per_device_train_batch_size"]) *
                          len(GRID_SEARCH_SPACE["weight_decay"]) *
                          len(GRID_SEARCH_SPACE["num_train_epochs"]))
            study.optimize(objective, n_trials=total_trials, show_progress_bar=True)
        else:
            sampler = RandomSampler(seed=RANDOM_SEED)
            study = optuna.create_study(direction="maximize", sampler=sampler)
            study.optimize(objective, n_trials=RANDOM_TRIALS, show_progress_bar=True)

        trials_df = pd.DataFrame([
            {
                "trial": t.number,
                "f1_macro": t.value,
                "learning_rate": t.params.get("learning_rate"),
                "batch_size": t.params.get("per_device_train_batch_size"),
                "weight_decay": t.params.get("weight_decay"),
                "epochs": t.params.get("num_train_epochs"),
                "accuracy": t.user_attrs.get("accuracy", 0),
                "precision": t.user_attrs.get("precision", 0),
                "recall": t.user_attrs.get("recall", 0),
                "train_time": t.user_attrs.get("train_time", 0),
            }
            for t in study.trials if t.value is not None
        ])

        trials_df.to_csv(TUNING_DIR / f"{strategy}_trials.csv", index=False)
        print(f"Saved {strategy} trials to {TUNING_DIR / f'{strategy}_trials.csv'}")

        if study.best_trial:
            best_hparams = study.best_trial.params
            print(f"\nBest {strategy} trial: F1-macro = {study.best_trial.value:.4f}")
            print(f"Best params: {best_hparams}")

            print(f"\nRetraining with best {strategy} configuration...")
            best_trainer, _ = build_trainer_for_trial(best_hparams, f"best_{strategy}")
            best_trainer.train()

            val_results = best_trainer.evaluate()
            test_results = best_trainer.evaluate(eval_dataset=ds_test)

            best_ckpt_dir = f"./checkpoints/bert-base/best_{strategy}"
            best_trainer.save_model(best_ckpt_dir)
            tokenizer.save_pretrained(best_ckpt_dir)

            summary_rows.append({
                "strategy": strategy,
                "f1_macro_val": val_results.get("eval_f1_macro", 0),
                "f1_macro_test": test_results.get("eval_f1_macro", 0),
                "accuracy_val": val_results.get("eval_accuracy", 0),
                "accuracy_test": test_results.get("eval_accuracy", 0),
                "best_params": str(best_hparams),
            })

            row = {
                "member": f"bert-{strategy}",
                "model": MODEL_NAME,
                "num_train_epochs": best_hparams.get("num_train_epochs"),
                "per_device_train_batch_size": best_hparams.get("per_device_train_batch_size"),
                "learning_rate": best_hparams.get("learning_rate"),
                "weight_decay": best_hparams.get("weight_decay"),
                "warmup_steps": None,
                "lr_scheduler_type": "linear",
                "gradient_accumulation_steps": 1,
                "max_seq_length": MAX_LEN,
                "seed": RANDOM_SEED,
                "fp16": USE_FP16,
                "accuracy": test_results.get("eval_accuracy", 0),
                "precision": test_results.get("eval_precision", 0),
                "recall": test_results.get("eval_recall", 0),
                "f1_macro": test_results.get("eval_f1_macro", 0),
                "notes": f"Best {strategy} search. Params: {best_hparams}",
            }
            pd.DataFrame([row]).to_csv("runs_log.csv", mode="a", index=False,
                                     header=not os.path.exists("runs_log.csv"))

    if summary_rows:
        summary_df = pd.DataFrame(summary_rows)
        summary_df.to_csv(TUNING_DIR / "strategy_summary.csv", index=False)
        print(f"\n✓ Saved strategy summary to {TUNING_DIR / 'strategy_summary.csv'}")
        print(summary_df)

    AUTO_TUNE_ENABLED = False  # Disable for subsequent cells (the next cell therefore takes its training branch)
else:
    print("AUTO_TUNE_ENABLED is False. Skipping hyperparameter tuning.")


[I 2025-11-15 18:21:14,034] A new study created in memory with name: no-name-820d2c9a-1bc1-43cb-b600-8feeccc1ae5a



=== Hyperparameter search: GRID ===


  0%|          | 0/8 [00:00<?, ?it/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.804,0.805001,0.672576,0.698924,0.660293,0.666503
2,0.594,0.814846,0.663375,0.675542,0.657824,0.663941




[I 2025-11-15 18:24:46,323] Trial 0 finished with value: 0.6665025386088582 and parameters: {}. Best is trial 0 with value: 0.6665025386088582.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.807,0.793709,0.670781,0.687766,0.662287,0.669839
2,0.6836,0.814609,0.667415,0.678369,0.663614,0.66904




[I 2025-11-15 18:27:20,229] Trial 1 finished with value: 0.6698388377079203 and parameters: {}. Best is trial 1 with value: 0.6698388377079203.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.803,0.813714,0.672127,0.697949,0.659464,0.667231
2,0.6003,0.814008,0.670332,0.682572,0.665384,0.671474




[I 2025-11-15 18:30:58,376] Trial 2 finished with value: 0.6714738862200339 and parameters: {}. Best is trial 2 with value: 0.6714738862200339.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.801,0.796792,0.671005,0.685829,0.663921,0.670905
2,0.6962,0.804887,0.674147,0.686506,0.669793,0.675779




[I 2025-11-15 18:33:36,168] Trial 3 finished with value: 0.6757790326459004 and parameters: {}. Best is trial 3 with value: 0.6757790326459004.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8062,0.829206,0.660458,0.687764,0.649172,0.65321
2,0.6234,0.841678,0.660682,0.662634,0.666174,0.663596
3,0.5624,0.947541,0.663375,0.67226,0.660821,0.665291




[I 2025-11-15 18:38:49,645] Trial 4 finished with value: 0.6652912087643376 and parameters: {}. Best is trial 3 with value: 0.6757790326459004.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8053,0.803696,0.674372,0.6915,0.665387,0.672725
2,0.6867,0.850119,0.653725,0.659186,0.658635,0.657006
3,0.5413,0.878534,0.668986,0.680247,0.664491,0.670168




[I 2025-11-15 18:42:36,663] Trial 5 finished with value: 0.6727248673894713 and parameters: {}. Best is trial 3 with value: 0.6757790326459004.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.801,0.802905,0.666966,0.694174,0.654394,0.66155
2,0.6172,0.838843,0.666293,0.67011,0.666691,0.668244
3,0.5869,0.963637,0.658438,0.667403,0.655145,0.659637




[I 2025-11-15 18:48:00,610] Trial 6 finished with value: 0.6682440866561308 and parameters: {}. Best is trial 3 with value: 0.6757790326459004.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8144,0.813432,0.671903,0.689374,0.663077,0.670486
2,0.6944,0.832471,0.656867,0.663201,0.658041,0.659767
3,0.544,0.866975,0.666068,0.676887,0.661804,0.667297




[I 2025-11-15 18:51:53,127] Trial 7 finished with value: 0.6704863186873188 and parameters: {}. Best is trial 3 with value: 0.6757790326459004.
Saved grid trials to tuning/grid_trials.csv

Best grid trial: F1-macro = 0.6758
Best params: {}

Retraining with best grid configuration...


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8144,0.813432,0.671903,0.689374,0.663077,0.670486
2,0.6944,0.832471,0.656867,0.663201,0.658041,0.659767
3,0.544,0.866975,0.666068,0.676887,0.661804,0.667297


[I 2025-11-15 18:55:54,547] A new study created in memory with name: no-name-13191712-178d-4111-aaff-d96d4e793921



=== Hyperparameter search: RANDOM ===


  0%|          | 0/8 [00:00<?, ?it/s]

  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8163,0.828483,0.656867,0.69171,0.642601,0.646828
2,0.6814,0.846385,0.661131,0.664143,0.664714,0.663858
3,0.6369,0.86437,0.666517,0.673371,0.66339,0.666907
4,0.5014,1.242701,0.659785,0.665737,0.657958,0.661111




[I 2025-11-15 19:03:07,174] Trial 0 finished with value: 0.6669069431026896 and parameters: {'learning_rate': 2.8188664052384835e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.005808361216819946, 'num_train_epochs': 4}. Best is trial 0 with value: 0.6669069431026896.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8146,0.786872,0.672127,0.693082,0.661959,0.670463
2,0.6771,0.804838,0.672127,0.685505,0.666572,0.673072




[I 2025-11-15 19:05:47,402] Trial 1 finished with value: 0.673071977117174 and parameters: {'learning_rate': 3.469266868719914e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.018182496720710064, 'num_train_epochs': 2}. Best is trial 1 with value: 0.673071977117174.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7814,0.800841,0.660682,0.680086,0.651235,0.658005
2,0.7152,0.805236,0.667864,0.678382,0.666181,0.670285
3,0.5575,0.845732,0.669659,0.680261,0.665156,0.67064




[I 2025-11-15 19:09:07,746] Trial 2 finished with value: 0.6706395736073101 and parameters: {'learning_rate': 2.6430182166924245e-05, 'per_device_train_batch_size': 24, 'weight_decay': 0.029214464853521818, 'num_train_epochs': 3}. Best is trial 1 with value: 0.673071977117174.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8095,0.796785,0.675943,0.698231,0.664984,0.671845
2,0.614,0.805018,0.666966,0.681973,0.660829,0.66789




[I 2025-11-15 19:12:47,049] Trial 3 finished with value: 0.6718453749775296 and parameters: {'learning_rate': 3.037515404772984e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.06075448519014384, 'num_train_epochs': 2}. Best is trial 1 with value: 0.673071977117174.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8257,0.805086,0.671454,0.687644,0.662599,0.669382
2,0.7247,0.820491,0.665171,0.67719,0.661595,0.66729
3,0.57,0.872881,0.666966,0.67826,0.662877,0.668488




[I 2025-11-15 19:17:00,459] Trial 4 finished with value: 0.6693821574448732 and parameters: {'learning_rate': 2.1228368952944975e-05, 'per_device_train_batch_size': 12, 'weight_decay': 0.0684233026512157, 'num_train_epochs': 3}. Best is trial 1 with value: 0.673071977117174.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.8224,0.792184,0.673025,0.688721,0.6647,0.671654
2,0.6958,0.829545,0.665171,0.671275,0.6662,0.667939
3,0.5737,0.869241,0.664722,0.673936,0.661633,0.666355




[I 2025-11-15 19:20:53,478] Trial 5 finished with value: 0.6716542632585442 and parameters: {'learning_rate': 2.2366286923412623e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.031171107608941095, 'num_train_epochs': 3}. Best is trial 1 with value: 0.673071977117174.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.837,0.83434,0.659336,0.684898,0.646875,0.652746
2,0.7249,0.862079,0.645422,0.649346,0.654454,0.648553
3,0.4837,0.895082,0.64789,0.658666,0.6441,0.649425




[I 2025-11-15 19:25:07,903] Trial 6 finished with value: 0.652745635150938 and parameters: {'learning_rate': 3.300561952272884e-05, 'per_device_train_batch_size': 12, 'weight_decay': 0.05978999788110852, 'num_train_epochs': 4}. Best is trial 1 with value: 0.673071977117174.


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7969,0.804314,0.66921,0.685319,0.660572,0.666287
2,0.7118,0.800847,0.677065,0.686665,0.672167,0.677417
3,0.5849,0.837268,0.664946,0.674823,0.661075,0.666177




[I 2025-11-15 19:28:28,625] Trial 7 finished with value: 0.6774166339229465 and parameters: {'learning_rate': 2.1689258392227166e-05, 'per_device_train_batch_size': 24, 'weight_decay': 0.08287375091519295, 'num_train_epochs': 3}. Best is trial 7 with value: 0.6774166339229465.
Saved random trials to tuning/random_trials.csv

Best random trial: F1-macro = 0.6774
Best params: {'learning_rate': 2.1689258392227166e-05, 'per_device_train_batch_size': 24, 'weight_decay': 0.08287375091519295, 'num_train_epochs': 3}

Retraining with best random configuration...


  super().__init__(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
1,0.7969,0.804314,0.66921,0.685319,0.660572,0.666287
2,0.7118,0.800847,0.677065,0.686665,0.672167,0.677417
3,0.5849,0.837268,0.664946,0.674823,0.661075,0.666177



✓ Saved strategy summary to tuning/strategy_summary.csv
  strategy  f1_macro_val  f1_macro_test  accuracy_val  accuracy_test  \
0     grid      0.670486       0.667874      0.671903       0.669124   
1   random      0.677417       0.668883      0.677065       0.667337   

                                         best_params  
0                                                 {}  
1  {'learning_rate': 2.1689258392227166e-05, 'per...  
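
The two rows of the strategy summary above can also be compared programmatically. The following is a minimal sketch, not part of the original notebook: it hard-codes the values printed above and uses a hypothetical helper, `pick_strategy`, to choose a strategy by validation macro-F1 so that the test split stays out of model selection.

```python
# Rows copied from the strategy summary printed above.
summary_rows = [
    {"strategy": "grid", "f1_macro_val": 0.670486, "f1_macro_test": 0.667874},
    {"strategy": "random", "f1_macro_val": 0.677417, "f1_macro_test": 0.668883},
]


def pick_strategy(rows, metric="f1_macro_val"):
    """Return the row with the highest value of `metric` (hypothetical helper)."""
    if not rows:
        raise ValueError("no summary rows to compare")
    return max(rows, key=lambda row: row[metric])


best = pick_strategy(summary_rows)
print(best["strategy"])  # prints "random"
```

Selecting on `f1_macro_val` rather than `f1_macro_test` is a design choice of this sketch; the two strategies differ by about 0.001 on test macro-F1, so the choice has little practical effect here.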


# **BERT Fine-tuning and Training**

**`Purpose`**

Train the BERT model per proposal Section V.D with either default hyperparameters or the best configuration from automated hyperparameter tuning. This block performs the actual fine-tuning of the BERT-base-uncased transformer model for toxicity classification, converting the pre-trained transformer into a task-specific classifier.

**`Input`**

Uses the tokenized datasets (`ds_train`, `ds_val`), the initialized BERT model and tokenizer, and training arguments configured in previous cells. If hyperparameter tuning was enabled, the best configuration from grid or random search is used; otherwise, default parameters are applied.

**`Output`**

Prints validation metrics (accuracy, precision, recall, F1-macro) after training and saves the best model checkpoint to the `checkpoints/bert-base/best/` directory. The saved checkpoint includes model weights, tokenizer files, and configuration, enabling later evaluation and deployment.

**`Details`**

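The fine-tuning process follows the proposal methodology: tweets are tokenized with BERT's WordPiece tokenizer and converted to input IDs and attention masks, and the model is fine-tuned with a classification head for sentiment labeling using the AdamW optimizer, cross-entropy loss, and a linear warmup scheduler. Early stopping limits overfitting by monitoring validation F1-macro. Which path runs depends on the cell's condition: when `AUTO_TUNE_ENABLED` is `False` or no `best_grid` checkpoint exists, the cell trains with the parameters configured in earlier cells; otherwise it skips training and reuses the `best_grid` checkpoint. Because the tuning cell resets `AUTO_TUNE_ENABLED` to `False` when it finishes, running this cell right after tuning still takes the training path.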
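
The early-stopping rule mentioned above can be sketched in plain Python. This is an illustrative simplification, not the notebook's implementation: the function name `early_stopping_step`, the patience value, and the scores are all hypothetical, and the sketch ignores any minimum-improvement threshold.

```python
def early_stopping_step(scores, patience):
    """Return the evaluation index at which training would stop, or None.

    `scores` holds one validation score per evaluation (higher is better).
    Training stops once `patience` consecutive evaluations fail to beat
    the best score seen so far.
    """
    best = float("-inf")
    stale = 0  # evaluations since the last improvement
    for i, score in enumerate(scores):
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# Hypothetical validation F1-macro values, one per evaluation.
f1_history = [0.60, 0.65, 0.66, 0.655, 0.658]
print(early_stopping_step(f1_history, patience=2))  # prints 4
```

With `patience=2` the run stops at index 4 because neither of the last two evaluations beats the best score of 0.66; the checkpoint worth keeping is the one from index 2.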

**`Line-by-line Description.`**

`if not AUTO_TUNE_ENABLED or not os.path.exists("./checkpoints/bert-base/best_grid"):` checks whether hyperparameter tuning is disabled or no best grid checkpoint exists. If either holds, the cell trains with the configured parameters; the tuned checkpoint is reused only when tuning is still enabled and the `best_grid` directory exists.

`print("Training BERT model with default/configured parameters...")` and subsequent print statements provide user feedback about the training process and methodology being followed.

`trainer.train()` initiates the fine-tuning process, iterating through the training dataset with the configured hyperparameters, optimizer (AdamW), loss function (cross-entropy), and learning rate scheduler (linear warmup).

`val_results = trainer.evaluate()` runs evaluation on the validation dataset after training completes, computing accuracy, precision, recall, and F1-macro using the `compute_metrics` function defined earlier.

`print(f"  Accuracy: {val_results.get('eval_accuracy', 0):.4f}")` and similar print statements display validation performance metrics formatted to four decimal places for easy reading.

`best_ckpt_dir = "./checkpoints/bert-base/best"` sets the checkpoint directory path where the best model will be saved.

`trainer.save_model(best_ckpt_dir)` saves the trained model weights, configuration, and tokenizer files to the checkpoint directory, enabling later loading for evaluation or deployment.

`tokenizer.save_pretrained(best_ckpt_dir)` saves the BERT tokenizer files (vocab, config) alongside the model so the complete pipeline can be restored.

`else: print("Using best model from hyperparameter tuning.")` handles the case where tuning is enabled and a `best_grid` checkpoint already exists: training is skipped and `best_ckpt_dir` is pointed at that previously tuned checkpoint instead.
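
The linear warmup scheduler named in the `trainer.train()` description scales the base learning rate up linearly during warmup and then decays it linearly to zero. A minimal sketch of that multiplier follows; the function name and all numeric values (`base_lr`, `warmup_steps`, `total_steps`) are assumptions chosen for illustration, not the notebook's settings.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up from 0 to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Decay linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5       # hypothetical base learning rate
warmup_steps = 100   # hypothetical
total_steps = 1000   # hypothetical

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With `warmup_steps` set to 0 the first branch never runs and the schedule reduces to a plain linear decay from the base rate.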


In [10]:
if not AUTO_TUNE_ENABLED or not os.path.exists("./checkpoints/bert-base/best_grid"):
    print("Training BERT model with default/configured parameters...")
    print("Fine-tuning process per proposal Section V.D:")
    print("  - Tokenization using BERT's WordPiece tokenizer")
    print("  - Convert tweets to input IDs and attention masks")
    print("  - Fine-tune with classification head for sentiment labeling")
    print("  - Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup")

    trainer.train()

    val_results = trainer.evaluate()
    print("\nValidation Results:")
    print(f"  Accuracy: {val_results.get('eval_accuracy', 0):.4f}")
    print(f"  Precision: {val_results.get('eval_precision', 0):.4f}")
    print(f"  Recall: {val_results.get('eval_recall', 0):.4f}")
    print(f"  F1-macro: {val_results.get('eval_f1_macro', 0):.4f}")

    best_ckpt_dir = "./checkpoints/bert-base/best"
    trainer.save_model(best_ckpt_dir)
    tokenizer.save_pretrained(best_ckpt_dir)
    print(f"\n✓ Best model saved to {best_ckpt_dir}")
else:
    print("Using best model from hyperparameter tuning.")
    best_ckpt_dir = "./checkpoints/bert-base/best_grid"


Training BERT model with default/configured parameters...
Fine-tuning process per proposal Section V.D:
  - Tokenization using BERT's WordPiece tokenizer
  - Convert tweets to input IDs and attention masks
  - Fine-tune with classification head for sentiment labeling
  - Optimizer: AdamW, Loss: Cross-entropy, Scheduler: Linear warmup


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Macro
100,0.9693,0.902755,0.598743,0.608631,0.592695,0.598071
200,0.8316,0.809114,0.659336,0.689359,0.645769,0.65611
300,0.7124,0.840141,0.661355,0.67625,0.657057,0.66334
400,0.7024,0.815131,0.659785,0.672662,0.656397,0.662081
500,0.7205,0.806869,0.662926,0.676182,0.65908,0.664945
600,0.599,0.848802,0.662029,0.671106,0.658562,0.663132
700,0.5729,0.847344,0.66158,0.675217,0.655634,0.662122



Validation Results:
  Accuracy: 0.6593
  Precision: 0.6894
  Recall: 0.6458
  F1-macro: 0.6561

✓ Best model saved to ./checkpoints/bert-base/best


# **Model Evaluation and Baseline Comparison**


# Baseline Evaluation on Validation Set

**`Purpose`**

Evaluate the baseline TF-IDF + Logistic Regression model on the validation set per proposal Section VI.A before test set evaluation. This ensures consistency in evaluation methodology and provides a validation baseline for comparison with BERT model performance, allowing detection of overfitting or dataset-specific issues.

**`Input`**

Uses the trained baseline model (`logreg`), the fitted TF-IDF vectorizer (`tfidf`), and the validation data (`X_va` and `y_va`, taken from `df_val`). If these objects are not already in memory from previous training cells, the cell loads them from `models/baseline_tfidf_logreg.joblib` and `models/baseline_tfidf_vectorizer.joblib`.

**`Output`**

Prints validation set performance metrics (accuracy, precision, recall, F1-macro) and a detailed classification report. Generates and saves a confusion matrix visualization to `exports/confusion_matrices/baseline_cm_val.png`. These outputs provide a validation performance baseline for comparison with test set results and BERT model metrics.

**`Details`**

Transforms the validation text with the fitted TF-IDF vectorizer, then generates predictions and probability scores from the baseline Logistic Regression model. Computes classification metrics, including per-class precision, recall, and F1-score as well as the macro-averaged metrics specified in proposal Section VI.A. Creates a confusion matrix visualization showing true vs predicted labels across all three classes (negative/toxic, neutral, positive) to identify classification patterns and error distributions.
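
To make the confusion-matrix layout concrete, the sketch below builds one by hand for the label order `[-1, 0, 1]`. It is an illustration only: `confusion_counts` is a hypothetical helper and the labels are toy data, not values from the dataset.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a confusion matrix as nested lists.

    Rows correspond to true labels and columns to predicted labels,
    both in the order given by `labels`.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


# Toy labels in the proposal format: -1 = negative/toxic, 0 = neutral, 1 = positive.
y_true = [-1, -1, 0, 0, 0, 1, 1]
y_pred = [-1, 0, 0, 0, 1, 1, -1]

cm = confusion_counts(y_true, y_pred, labels=[-1, 0, 1])
for row in cm:
    print(row)
```

The diagonal holds the correct predictions; for example, the first row shows that one negative/toxic example was classified correctly and one was mistaken for neutral.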

**`Line-by-line Description.`**

`print("Evaluating baseline model on validation set...")` provides user feedback indicating that validation evaluation is starting.

`X_val_tfidf = tfidf.transform(X_va)` transforms the validation text data using the pre-fitted TF-IDF vectorizer, converting raw text into sparse numerical feature vectors.

`val_preds_baseline = logreg.predict(X_val_tfidf)` generates class predictions for each validation sample using the trained Logistic Regression model.

`val_probs_baseline = logreg.predict_proba(X_val_tfidf)` computes probability distributions over all three classes for each validation sample, enabling confidence analysis.

`val_acc_baseline = accuracy_score(y_va, val_preds_baseline)` calculates the proportion of correctly classified samples (accuracy metric).

`val_prec_baseline = precision_score(y_va, val_preds_baseline, average='macro', zero_division=0)` computes macro-averaged precision across all classes, treating each class equally.

`val_rec_baseline = recall_score(y_va, val_preds_baseline, average='macro', zero_division=0)` computes macro-averaged recall across all classes, measuring what fraction of each class's true examples the model recovers.

`val_f1_baseline = f1_score(y_va, val_preds_baseline, average='macro', zero_division=0)` computes the macro-averaged F1-score, the unweighted mean of the per-class F1 values (each of which is the harmonic mean of that class's precision and recall), as the primary performance metric.

`print(f"  Accuracy: {val_acc_baseline:.4f}")` and subsequent print statements display formatted validation metrics with four decimal places.

`print(classification_report(y_va, val_preds_baseline, ...))` generates a detailed per-class classification report showing precision, recall, F1-score, and support for each sentiment class.

`cm_val_baseline = confusion_matrix(y_va, val_preds_baseline, labels=[-1, 0, 1])` creates a confusion matrix showing the count of true vs predicted labels for all three classes.

`fig, ax = plt.subplots(figsize=(6, 5))` creates a matplotlib figure and axes for plotting the confusion matrix.

`ConfusionMatrixDisplay(cm_val_baseline, display_labels=[...]).plot(ax=ax, colorbar=False)` visualizes the confusion matrix with color-coded cells showing prediction accuracy patterns.

`plt.savefig('exports/confusion_matrices/baseline_cm_val.png', dpi=150)` saves the confusion matrix visualization to disk at 150 DPI for inclusion in reports.
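
The macro-averaged metrics described above can be derived directly from a confusion matrix. The sketch below is a plain-Python illustration with hypothetical helper names and a toy matrix; it also shows that macro-F1 is the mean of per-class F1 scores, which generally differs from the harmonic mean of macro precision and macro recall.

```python
def per_class_scores(matrix):
    """Return (precision, recall, f1) per class from a square confusion matrix.

    Rows are true labels and columns are predicted labels. Undefined ratios
    are reported as 0.0, mirroring the effect of zero_division=0.
    """
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        predicted = sum(matrix[r][k] for r in range(n))  # column sum
        actual = sum(matrix[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 across classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


toy_matrix = [[5, 5, 0], [0, 10, 0], [0, 5, 5]]  # hypothetical counts
macro_p, macro_r, macro_f1 = macro_average(per_class_scores(toy_matrix))
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)
print(f"macro precision={macro_p:.4f} recall={macro_r:.4f} f1={macro_f1:.4f}")
print(f"harmonic mean of macro precision and recall={harmonic:.4f}")
```

On this toy matrix every class has an F1 of 2/3, so macro-F1 is 2/3, while the harmonic mean of the macro precision (5/6) and macro recall (2/3) is 20/27.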

In [12]:
try:
    _ = tfidf
    _ = logreg
    print("Using models from previous cell...")
except NameError:
    import joblib
    import os
    model_path = "models/baseline_tfidf_logreg.joblib"
    vectorizer_path = "models/baseline_tfidf_vectorizer.joblib"

    if os.path.exists(model_path) and os.path.exists(vectorizer_path):
        print("Loading baseline models from disk...")
        tfidf = joblib.load(vectorizer_path)
        logreg = joblib.load(model_path)
        print("✓ Models loaded successfully")
    else:
        raise FileNotFoundError(
            "Baseline models not found. Please run Cell 12 (Baseline TF-IDF + Logistic Regression training) first to train and save the models."
        )

print("Evaluating baseline model on validation set...")
X_val_tfidf = tfidf.transform(X_va)
val_preds_baseline = logreg.predict(X_val_tfidf)
val_probs_baseline = logreg.predict_proba(X_val_tfidf)

val_acc_baseline = accuracy_score(y_va, val_preds_baseline)
val_prec_baseline = precision_score(y_va, val_preds_baseline, average='macro', zero_division=0)
val_rec_baseline = recall_score(y_va, val_preds_baseline, average='macro', zero_division=0)
val_f1_baseline = f1_score(y_va, val_preds_baseline, average='macro', zero_division=0)

print("\nBaseline Model Performance (Validation Set):")
print(f"  Accuracy: {val_acc_baseline:.4f}")
print(f"  Precision: {val_prec_baseline:.4f}")
print(f"  Recall: {val_rec_baseline:.4f}")
print(f"  F1-macro: {val_f1_baseline:.4f}")

print("\nValidation Set Classification Report:")
print(classification_report(y_va, val_preds_baseline,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

cm_val_baseline = confusion_matrix(y_va, val_preds_baseline, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm_val_baseline, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('Baseline Model - Validation Set Confusion Matrix')
plt.tight_layout()
plt.savefig('exports/confusion_matrices/baseline_cm_val.png', dpi=150)
plt.show()  # Display in notebook
print("\n✓ Validation confusion matrix saved to exports/confusion_matrices/baseline_cm_val.png")


Using models from previous cell...
Evaluating baseline model on validation set...

Baseline Model Performance (Validation Set):
  Accuracy: 0.6364
  Precision: 0.6419
  Recall: 0.6337
  F1-macro: 0.6369

Validation Set Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.63      0.56      0.59      1288
       neutral       0.59      0.64      0.62      1764
      positive       0.71      0.69      0.70      1404

      accuracy                           0.64      4456
     macro avg       0.64      0.63      0.64      4456
  weighted avg       0.64      0.64      0.64      4456


✓ Validation confusion matrix saved to exports/confusion_matrices/baseline_cm_val.png


# BERT Evaluation on Validation Set

**`Purpose`**

Evaluate the best fine-tuned BERT model on the validation set per proposal Section VI.A before test set evaluation. This assessment establishes the BERT model's validation performance, allows direct comparison with the baseline model's validation metrics, and checks that the transformer generalizes to validation data it was not trained on.

**`Input`**

Uses the best BERT model checkpoint (from `checkpoints/bert-base/best_grid`, `best_random`, or `best`), the tokenized validation dataset (`ds_val`), the training arguments (`training_args`), the tokenizer, and the `compute_metrics` function. The model is reused from memory when `best_model` and `best_ckpt_dir` already exist; otherwise it is loaded from the checkpoint directory established during training or hyperparameter tuning.

**`Output`**

Prints validation set performance metrics (accuracy, precision, recall, F1-macro) and a detailed classification report. Generates and saves a confusion matrix visualization to `exports/confusion_matrices/bert_cm_val.png`. These outputs enable validation performance comparison between baseline and BERT models, and verify model behavior before final test evaluation.

**`Details`**

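If no model is already in memory, loads the best available BERT checkpoint, creates a Trainer instance with the validation dataset, and runs evaluation using the same metrics computation as training. Checkpoint selection is by directory existence in a fixed order (`best_grid`, then `best_random`, then `best`), not by score, so the grid-search checkpoint is used whenever it exists even though the random-search run reported the higher validation F1-macro in the strategy summary (0.6774 vs 0.6705). Converts BERT's internal label mapping (0, 1, 2) back to the proposal format (-1, 0, 1) for consistent reporting. Generates predictions, computes the confusion matrix, and visualizes classification performance across all three sentiment classes to identify strengths and weaknesses in the model's predictions.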
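
The conversion from logits to proposal-format labels can be illustrated without any libraries. In this sketch the mapping matches the notebook's `reverse_mapping`, while the helper names and the logits are hypothetical.

```python
# Same mapping as the notebook's reverse_mapping: BERT class index -> proposal label.
REVERSE_MAPPING = {0: -1, 1: 0, 2: 1}


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_predictions(logits):
    """Turn per-sample logits into proposal-format labels (-1, 0, 1)."""
    return [REVERSE_MAPPING[argmax(row)] for row in logits]


# Hypothetical logits for three tweets, one row per tweet.
logits = [
    [2.1, 0.3, -1.0],   # highest score at index 0 -> negative/toxic (-1)
    [-0.5, 0.2, 1.7],   # highest score at index 2 -> positive (1)
    [0.1, 0.9, 0.4],    # highest score at index 1 -> neutral (0)
]
print(decode_predictions(logits))  # prints [-1, 1, 0]
```

Applying the same mapping to both predictions and true labels keeps the BERT report directly comparable with the baseline report, which already uses -1, 0, and 1.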

**`Line-by-line Description.`**

`try: _ = best_model` attempts to access the best_model variable to check if it exists from previous cells, avoiding redundant loading.

`except NameError:` handles the case where best_model doesn't exist, triggering model loading from checkpoint.

`if os.path.exists("./checkpoints/bert-base/best_grid"):` checks for the best grid search checkpoint first, prioritizing hyperparameter-tuned models.

`best_ckpt_dir = "./checkpoints/bert-base/best_grid"` sets the checkpoint directory path to the grid search results.

`elif os.path.exists("./checkpoints/bert-base/best_random"):` falls back to random search checkpoint if grid search results aren't available.

`elif os.path.exists("./checkpoints/bert-base/best"):` falls back to default training checkpoint if no hyperparameter tuning was performed.

`best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)` loads the trained BERT model from the checkpoint directory and moves it to the appropriate device (CPU or GPU).

`best_model.eval()` sets the model to evaluation mode, which disables dropout so that inference is deterministic. (BERT uses layer normalization rather than batch normalization, so there are no batch statistics to freeze.)

`val_eval_trainer = Trainer(model=best_model, args=training_args, eval_dataset=ds_val, ...)` creates a Trainer instance configured for evaluation on the validation dataset.

`val_results = val_eval_trainer.evaluate()` runs evaluation using the Trainer's built-in metrics computation, returning accuracy, precision, recall, and F1-macro.

`print(f"  Accuracy: {val_results.get('eval_accuracy', 0):.4f}")` and similar print statements display formatted validation metrics.

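`val_predictions = val_eval_trainer.predict(ds_val)` runs the model over all validation samples and returns the raw logits in `val_predictions.predictions` together with the true labels in `val_predictions.label_ids`, which the cell stores as `val_labels_bert`.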

`val_preds_bert = np.argmax(val_predictions.predictions, axis=1)` extracts predicted class indices by finding the maximum logit value for each sample.

`reverse_mapping = {0: -1, 1: 0, 2: 1}` defines the mapping to convert BERT's internal label encoding (0,1,2) back to proposal format (-1,0,1).

`val_preds = np.array([reverse_mapping[p] for p in val_preds_bert])` converts predicted labels from BERT format to proposal format for consistent reporting.

`val_labels = np.array([reverse_mapping[l] for l in val_labels_bert])` converts true labels from BERT format to proposal format.

`print(classification_report(val_labels, val_preds, ...))` generates a detailed per-class classification report with proposal-format labels.

`cm_val_bert = confusion_matrix(val_labels, val_preds, labels=[-1, 0, 1])` creates a confusion matrix using proposal-format labels.

`ConfusionMatrixDisplay(cm_val_bert, ...).plot(ax=ax, colorbar=False)` visualizes the confusion matrix with class names matching the proposal format.

`plt.savefig('exports/confusion_matrices/bert_cm_val.png', dpi=150)` saves the confusion matrix visualization to disk.
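The checkpoint fallback order described above (grid search first, then random search, then the default run) can also be written as a small helper. This is a minimal sketch rather than the notebook's own code: `pick_checkpoint` and its `root` parameter are hypothetical names, and only the three directory names are taken from the evaluation cell.

```python
from pathlib import Path

# Directory names used by the notebook, in priority order.
CANDIDATES = ("best_grid", "best_random", "best")


def pick_checkpoint(root="./checkpoints/bert-base"):
    """Return the first existing checkpoint directory under `root` (hypothetical helper)."""
    for name in CANDIDATES:
        candidate = Path(root) / name
        if candidate.is_dir():
            return candidate
    # Same message as the notebook cell raises when nothing is found.
    raise FileNotFoundError("No checkpoint directory found. Please run training first.")
```

Keeping the priority order in one tuple means the evaluation, inference, and export cells could share the same lookup instead of repeating the `if`/`elif` chain.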

In [13]:
try:
    _ = best_model
    _ = best_ckpt_dir
    print("Using model from previous cells...")
except NameError:
    from transformers import AutoModelForSequenceClassification
    from pathlib import Path
    import os

    if os.path.exists("./checkpoints/bert-base/best_grid"):
        best_ckpt_dir = "./checkpoints/bert-base/best_grid"
    elif os.path.exists("./checkpoints/bert-base/best_random"):
        best_ckpt_dir = "./checkpoints/bert-base/best_random"
    elif os.path.exists("./checkpoints/bert-base/best"):
        best_ckpt_dir = "./checkpoints/bert-base/best"
    else:
        raise FileNotFoundError("No checkpoint directory found. Please run training first.")

    best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)
    best_model.eval()
    print(f"Model loaded from {best_ckpt_dir}")

print("\nEvaluating BERT model on validation set...")

val_eval_trainer = Trainer(
    model=best_model,
    args=training_args,
    eval_dataset=ds_val,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

val_results = val_eval_trainer.evaluate()

print("\nBERT Model Performance (Validation Set):")
print(f"  Accuracy: {val_results.get('eval_accuracy', 0):.4f}")
print(f"  Precision: {val_results.get('eval_precision', 0):.4f}")
print(f"  Recall: {val_results.get('eval_recall', 0):.4f}")
print(f"  F1-macro: {val_results.get('eval_f1_macro', 0):.4f}")

val_predictions = val_eval_trainer.predict(ds_val)
val_preds_bert = np.argmax(val_predictions.predictions, axis=1)
val_labels_bert = val_predictions.label_ids

reverse_mapping = {0: -1, 1: 0, 2: 1}
val_preds = np.array([reverse_mapping[p] for p in val_preds_bert])
val_labels = np.array([reverse_mapping[l] for l in val_labels_bert])

print("\nValidation Set Classification Report:")
print(classification_report(val_labels, val_preds,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

cm_val_bert = confusion_matrix(val_labels, val_preds, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm_val_bert, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('BERT Model - Validation Set Confusion Matrix')
plt.tight_layout()
plt.savefig('exports/confusion_matrices/bert_cm_val.png', dpi=150)
plt.show()  # Display in notebook
print("\n✓ Validation confusion matrix saved to exports/confusion_matrices/bert_cm_val.png")


Model loaded from ./checkpoints/bert-base/best_grid

Evaluating BERT model on validation set...





BERT Model Performance (Validation Set):
  Accuracy: 0.6719
  Precision: 0.6894
  Recall: 0.6631
  F1-macro: 0.6705

Validation Set Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.72      0.56      0.63      1288
       neutral       0.61      0.74      0.67      1764
      positive       0.74      0.69      0.71      1404

      accuracy                           0.67      4456
     macro avg       0.69      0.66      0.67      4456
  weighted avg       0.68      0.67      0.67      4456


✓ Validation confusion matrix saved to exports/confusion_matrices/bert_cm_val.png


In [14]:
from transformers import AutoModelForSequenceClassification

best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)
best_model.eval()

eval_trainer = Trainer(
    model=best_model,
    args=training_args,
    eval_dataset=ds_test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

print("Evaluating BERT model on test set...")
test_results = eval_trainer.evaluate()

print("\nBERT Model Performance (Test Set) - per proposal Section VI.A:")
print(f"  Accuracy: {test_results.get('eval_accuracy', 0):.4f}")
print(f"  Precision: {test_results.get('eval_precision', 0):.4f}")
print(f"  Recall: {test_results.get('eval_recall', 0):.4f}")
print(f"  F1-macro: {test_results.get('eval_f1_macro', 0):.4f}")

test_predictions = eval_trainer.predict(ds_test)
test_preds_bert = np.argmax(test_predictions.predictions, axis=1)
test_labels_bert = test_predictions.label_ids

reverse_mapping = {0: -1, 1: 0, 2: 1}
test_preds = np.array([reverse_mapping[p] for p in test_preds_bert])
test_labels = np.array([reverse_mapping[l] for l in test_labels_bert])

print("\nTest Set Classification Report:")
print(classification_report(test_labels, test_preds,
                          target_names=['negative/toxic', 'neutral', 'positive'],
                          zero_division=0))

cm_bert = confusion_matrix(test_labels, test_preds, labels=[-1, 0, 1])
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm_bert, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
plt.title('BERT Model - Test Set Confusion Matrix')
plt.tight_layout()
plt.savefig('exports/confusion_matrices/bert_cm_test.png', dpi=150)
plt.show()  # Display in notebook
print("\n✓ Confusion matrix saved to exports/confusion_matrices/bert_cm_test.png")

test_df_results = pd.DataFrame({
    'review': df_test['review'].tolist(),
    'gold': test_labels,
    'pred': test_preds
})
test_df_results.to_csv('exports/bert_predictions_test.csv', index=False)
print("✓ Test predictions saved to exports/bert_predictions_test.csv")

from sklearn.metrics import roc_curve, auc, precision_recall_curve, roc_auc_score, average_precision_score
from sklearn.preprocessing import label_binarize

y_test_bin = label_binarize(test_labels, classes=[-1, 0, 1])
n_classes = 3
test_probs = torch.softmax(torch.tensor(test_predictions.predictions), dim=-1).numpy()

fpr = dict()
tpr = dict()
roc_auc = dict()
precision = dict()
recall = dict()
pr_auc = dict()

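# Columns of test_probs follow BERT's internal class order (0, 1, 2),
# while test_labels use the proposal format (-1, 0, 1).
proposal_to_bert = {-1: 0, 0: 1, 1: 2}
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    class_idx = [-1, 0, 1][i]
    bert_idx = proposal_to_bert[class_idx]
    # One-vs-rest: the current class is the positive label, the other two are negative.
    y_true_class = (test_labels == class_idx).astype(int)
    y_score_class = test_probs[:, bert_idx]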

    fpr[i], tpr[i], _ = roc_curve(y_true_class, y_score_class)
    roc_auc[i] = auc(fpr[i], tpr[i])

    precision[i], recall[i], _ = precision_recall_curve(y_true_class, y_score_class)
    pr_auc[i] = average_precision_score(y_true_class, y_score_class)

    print(f"\n{class_name} (class {class_idx}):")
    print(f"  ROC-AUC: {roc_auc[i]:.4f}")
    print(f"  PR-AUC: {pr_auc[i]:.4f}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax = axes[0]
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    ax.plot(fpr[i], tpr[i], label=f'{class_name} (AUC = {roc_auc[i]:.3f})')
ax.plot([0, 1], [0, 1], 'k--', label='Random')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves (per proposal Section III)')
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[1]
for i, class_name in enumerate(['negative/toxic', 'neutral', 'positive']):
    ax.plot(recall[i], precision[i], label=f'{class_name} (AP = {pr_auc[i]:.3f})')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curves (per proposal Section III)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
os.makedirs("exports/roc_curves", exist_ok=True)
plt.savefig('exports/roc_curves/bert_roc_pr_curves.png', dpi=150)
plt.show()  # Display in notebook
print("\n✓ ROC and PR curves saved to exports/roc_curves/bert_roc_pr_curves.png")

comparison_df = pd.DataFrame({
    'Model': ['Baseline (TF-IDF + LogReg)', 'BERT-base-uncased'],
    'Accuracy': [test_acc, test_results.get('eval_accuracy', 0)],
    'Precision': [test_prec, test_results.get('eval_precision', 0)],
    'Recall': [test_rec, test_results.get('eval_recall', 0)],
    'F1-macro': [test_f1, test_results.get('eval_f1_macro', 0)],
})

print("\n" + "="*60)
print("Baseline vs BERT Comparison (per proposal Section VI.B)")
print("="*60)
print(comparison_df.to_string(index=False))
comparison_df.to_csv('exports/model_comparison.csv', index=False)
print("\n✓ Model comparison saved to exports/model_comparison.csv")


Evaluating BERT model on test set...





BERT Model Performance (Test Set) - per proposal Section VI.A:
  Accuracy: 0.6691
  Precision: 0.6884
  Recall: 0.6599
  F1-macro: 0.6679

Test Set Classification Report:
                precision    recall  f1-score   support

negative/toxic       0.73      0.56      0.63      1292
       neutral       0.60      0.74      0.66      1781
      positive       0.73      0.68      0.71      1403

      accuracy                           0.67      4476
     macro avg       0.69      0.66      0.67      4476
  weighted avg       0.68      0.67      0.67      4476


✓ Confusion matrix saved to exports/confusion_matrices/bert_cm_test.png
✓ Test predictions saved to exports/bert_predictions_test.csv

negative/toxic (class -1):
  ROC-AUC: 0.8299
  PR-AUC: 0.7159

neutral (class 0):
  ROC-AUC: 0.7678
  PR-AUC: 0.6500

positive (class 1):
  ROC-AUC: 0.8601
  PR-AUC: 0.7787

✓ ROC and PR curves saved to exports/roc_curves/bert_roc_pr_curves.png

Baseline vs BERT Comparison (per proposal Section V

# Validation vs Test Set Comparison

**`Purpose`**

Compare model performance between validation and test sets per proposal Section VI.A to detect overfitting and assess generalization. This analysis checks whether both the baseline and BERT models maintain consistent performance across the validation and test splits; a small gap suggests that hyperparameter selection did not overfit to the validation data.

**`Input`**

Uses validation and test set performance metrics from both baseline and BERT models, including accuracy, precision, recall, and F1-macro scores. Requires that validation evaluation (baseline and BERT) and test evaluation (baseline and BERT) cells have been executed previously.

**`Output`**

Prints validation and test accuracy and F1-macro for both models, together with the absolute difference between the two splits. Saves a CSV comparison table containing all four metrics to `exports/val_test_comparison.csv` and generates a 2x2 figure (one grouped bar chart per metric, comparing validation vs test performance) saved to `exports/val_test_comparison.png`. These outputs help detect overfitting and provide visual evidence of model generalization.

**`Details`**

Compares four metric sets: baseline validation, baseline test, BERT validation, and BERT test. Computes the absolute difference between validation and test scores for accuracy and F1-macro to identify performance gaps that may indicate overfitting. Creates a structured DataFrame with all four metrics for both models across both splits, then visualizes the comparison using grouped bar charts for each metric (accuracy, precision, recall, F1-macro) to highlight performance consistency or discrepancies between validation and test evaluations.
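The gap computation can be sketched independently of the notebook state. The snippet below is an illustration under stated assumptions: `val_test_gaps` is a hypothetical helper, and the scores are the accuracy and F1-macro values printed in this notebook's validation vs test comparison output.

```python
# Accuracy and F1-macro as printed in this notebook's comparison output.
reported = {
    "Baseline": {"val": {"accuracy": 0.6364, "f1_macro": 0.6369},
                 "test": {"accuracy": 0.6224, "f1_macro": 0.6234}},
    "BERT": {"val": {"accuracy": 0.6719, "f1_macro": 0.6705},
             "test": {"accuracy": 0.6691, "f1_macro": 0.6679}},
}


def val_test_gaps(scores):
    """Absolute validation-test gap per metric (hypothetical helper)."""
    return {metric: abs(scores["val"][metric] - scores["test"][metric])
            for metric in scores["val"]}


gaps = {model: val_test_gaps(scores) for model, scores in reported.items()}
for model, gap in gaps.items():
    # Round only for display; the underlying gaps keep full float precision.
    print(model, {metric: round(value, 4) for metric, value in gap.items()})
```

With these scores the BERT gaps stay below 0.003 while the baseline gaps are about 0.014, which is consistent with neither model having been tuned heavily against the validation split.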

In [15]:

print("="*70)
print("Validation vs Test Set Performance Comparison")
print("="*70)

print("\nBaseline Model (TF-IDF + Logistic Regression):")
print(f"  Validation - Accuracy: {val_acc_baseline:.4f}, F1-macro: {val_f1_baseline:.4f}")
try:
    print(f"  Test       - Accuracy: {test_acc:.4f}, F1-macro: {test_f1:.4f}")
    print(f"  Difference - Accuracy: {abs(val_acc_baseline - test_acc):.4f}, F1-macro: {abs(val_f1_baseline - test_f1):.4f}")
except NameError:
    print("  Test results not yet available. Run test evaluation cell first.")
    # Placeholder zeros (not real scores) so the comparison table below can still be built.
    test_acc = test_prec = test_rec = test_f1 = 0

print("\nBERT Model (bert-base-uncased):")
print(f"  Validation - Accuracy: {val_results.get('eval_accuracy', 0):.4f}, F1-macro: {val_results.get('eval_f1_macro', 0):.4f}")
try:
    print(f"  Test       - Accuracy: {test_results.get('eval_accuracy', 0):.4f}, F1-macro: {test_results.get('eval_f1_macro', 0):.4f}")
    val_test_diff_acc = abs(val_results.get('eval_accuracy', 0) - test_results.get('eval_accuracy', 0))
    val_test_diff_f1 = abs(val_results.get('eval_f1_macro', 0) - test_results.get('eval_f1_macro', 0))
    print(f"  Difference - Accuracy: {val_test_diff_acc:.4f}, F1-macro: {val_test_diff_f1:.4f}")
except NameError:
    print("  Test results not yet available. Run test evaluation cell first.")
    # Placeholder zeros for BERT only; baseline test metrics are left untouched.
    test_results = {'eval_accuracy': 0, 'eval_precision': 0, 'eval_recall': 0, 'eval_f1_macro': 0}

comparison_df = pd.DataFrame({
    'Model': ['Baseline (Val)', 'Baseline (Test)', 'BERT (Val)', 'BERT (Test)'],
    'Accuracy': [val_acc_baseline, test_acc, val_results.get('eval_accuracy', 0), test_results.get('eval_accuracy', 0)],
    'Precision': [val_prec_baseline, test_prec, val_results.get('eval_precision', 0), test_results.get('eval_precision', 0)],
    'Recall': [val_rec_baseline, test_rec, val_results.get('eval_recall', 0), test_results.get('eval_recall', 0)],
    'F1-macro': [val_f1_baseline, test_f1, val_results.get('eval_f1_macro', 0), test_results.get('eval_f1_macro', 0)]
})

comparison_df.to_csv('exports/val_test_comparison.csv', index=False)
print("\n✓ Validation vs Test comparison saved to exports/val_test_comparison.csv")

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-macro']
for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    x = np.arange(2)
    width = 0.35
    baseline_vals = [comparison_df.iloc[0][metric], comparison_df.iloc[1][metric]]
    bert_vals = [comparison_df.iloc[2][metric], comparison_df.iloc[3][metric]]
    ax.bar(x - width/2, baseline_vals, width, label='Baseline', alpha=0.8)
    ax.bar(x + width/2, bert_vals, width, label='BERT', alpha=0.8)
    ax.set_xlabel('Dataset')
    ax.set_ylabel(metric)
    ax.set_title(f'{metric}: Validation vs Test')
    ax.set_xticks(x)
    ax.set_xticklabels(['Validation', 'Test'])
    ax.legend()
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('exports/val_test_comparison.png', dpi=150)
plt.show()  # Display in notebook
print("✓ Validation vs Test comparison visualization saved to exports/val_test_comparison.png")


Validation vs Test Set Performance Comparison

Baseline Model (TF-IDF + Logistic Regression):
  Validation - Accuracy: 0.6364, F1-macro: 0.6369
  Test       - Accuracy: 0.6224, F1-macro: 0.6234
  Difference - Accuracy: 0.0140, F1-macro: 0.0135

BERT Model (bert-base-uncased):
  Validation - Accuracy: 0.6719, F1-macro: 0.6705
  Test       - Accuracy: 0.6691, F1-macro: 0.6679
  Difference - Accuracy: 0.0028, F1-macro: 0.0026

✓ Validation vs Test comparison saved to exports/val_test_comparison.csv
✓ Validation vs Test comparison visualization saved to exports/val_test_comparison.png


# Visualization and Analysis

**`Purpose`**

Generate a comprehensive visualization grid per proposal Section VI.B to analyze and compare model performance, classification patterns, and dataset characteristics. This block creates a unified visual analysis combining confusion matrices, metric comparisons, and class distribution to provide a holistic view of model behavior and dataset composition.

**`Input`**

Uses confusion matrices (`cm`, `cm_bert`) from test set evaluations, test set performance metrics for both baseline and BERT models, and test dataset class distribution. Requires that test evaluation cells for both models have been executed to populate these variables.

**`Output`**

Generates a 2x2 visualization grid saved to `exports/model_analysis_grid.png` containing: (1) baseline model confusion matrix, (2) BERT model confusion matrix, (3) side-by-side metric comparison bar chart, and (4) test set class distribution. This unified visualization enables quick comparison of model performance and identification of classification patterns.

**`Details`**

Creates a multi-panel figure with four subplots. The top row displays confusion matrices for both baseline and BERT models, showing true vs predicted label distributions for all three classes. The bottom left panel shows a grouped bar chart comparing accuracy, precision, recall, and F1-macro between baseline and BERT models on the test set. The bottom right panel visualizes the test set class distribution using colored bars. All visualizations use consistent labeling and styling for easy interpretation and comparison.
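The class-distribution panel relies on counting labels in a fixed order and reporting zero for any class that happens to be absent. A minimal sketch of that counting step, using only the standard library, is shown below; `class_counts` is a hypothetical helper and the label list is illustrative, whereas the notebook reads the labels from `df_test['label']`.

```python
from collections import Counter

# Proposal-format labels mapped to the display names used in the figures.
CLASS_NAMES = {-1: "negative/toxic", 0: "neutral", 1: "positive"}


def class_counts(labels):
    """Count labels per class in fixed order, reporting 0 for absent classes."""
    counts = Counter(labels)
    return {name: counts.get(value, 0) for value, name in CLASS_NAMES.items()}


# Illustrative labels only; these are not the notebook's test data.
toy_labels = [1, 0, 0, -1, 1, 0]
print(class_counts(toy_labels))
```

Fixing the order in `CLASS_NAMES` keeps the bars aligned with the red, yellow, and green color list even when a split is missing one of the classes.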

**`Line-by-line Description.`**

`fig, axes = plt.subplots(2, 2, figsize=(14, 12))` creates a 2x2 grid of subplots with specified figure size for the comprehensive visualization grid.

`ax = axes[0, 0]` selects the top-left subplot for the baseline model confusion matrix.

`ConfusionMatrixDisplay(cm, display_labels=[...]).plot(ax=ax, colorbar=False)` visualizes the baseline model's confusion matrix showing true vs predicted labels.

`ax.set_title('Baseline Model - Test Set')` sets the subplot title to identify which model and dataset are displayed.

`ax = axes[0, 1]` selects the top-right subplot for the BERT model confusion matrix.

`ConfusionMatrixDisplay(cm_bert, display_labels=[...]).plot(ax=ax, colorbar=False)` visualizes the BERT model's confusion matrix for comparison.

`ax = axes[1, 0]` selects the bottom-left subplot for the metric comparison bar chart.

`metrics = ['Accuracy', 'Precision', 'Recall', 'F1-macro']` defines the list of metrics to compare between models.

`baseline_vals = [test_acc, test_prec, test_rec, test_f1]` extracts baseline model test set metrics into a list for plotting.

`bert_vals = [test_results.get('eval_accuracy', 0), ...]` extracts BERT model test set metrics into a list for plotting.

`x = np.arange(len(metrics))` creates x-axis positions for each metric in the comparison chart.

`width = 0.35` sets the bar width for grouped bars, allowing baseline and BERT bars to appear side-by-side.

`ax.bar(x - width/2, baseline_vals, width, label='Baseline', alpha=0.8)` creates grouped bars for baseline metrics with left offset.

`ax.bar(x + width/2, bert_vals, width, label='BERT', alpha=0.8)` creates grouped bars for BERT metrics with right offset.

`ax.set_ylabel('Score')` labels the y-axis as score values.

`ax.set_title('Model Comparison - Test Set Metrics')` sets the subplot title.

`ax.set_xticks(x)` and `ax.set_xticklabels(metrics)` positions and labels x-axis ticks with metric names.

`ax.legend()` displays the legend distinguishing baseline and BERT bars.

`ax.grid(True, alpha=0.3, axis='y')` adds horizontal grid lines for easier value reading.

`ax = axes[1, 1]` selects the bottom-right subplot for class distribution visualization.

`label_counts = df_test['label'].value_counts().sort_index()` counts occurrences of each label in the test set and sorts by label value.

`labels = ['negative/toxic', 'neutral', 'positive']` defines human-readable label names for display.

`colors = ['#F87171', '#FBBF24', '#34D399']` defines color scheme (red, yellow, green) matching sentiment classes.

`ax.bar(labels, [...], color=colors, alpha=0.8)` creates colored bars showing the count of each class in the test set.

`ax.set_ylabel('Count')` labels the y-axis as sample count.

`ax.set_title('Test Set Class Distribution')` sets the subplot title.

`plt.tight_layout()` automatically adjusts subplot spacing to prevent label overlap.

`plt.savefig('exports/model_analysis_grid.png', dpi=150)` saves the complete visualization grid to disk at 150 DPI.


In [16]:
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

ax = axes[0, 0]
ConfusionMatrixDisplay(cm, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
ax.set_title('Baseline Model - Test Set')

ax = axes[0, 1]
ConfusionMatrixDisplay(cm_bert, display_labels=['negative/toxic', 'neutral', 'positive']).plot(ax=ax, colorbar=False)
ax.set_title('BERT Model - Test Set')

ax = axes[1, 0]
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-macro']
baseline_vals = [test_acc, test_prec, test_rec, test_f1]
bert_vals = [test_results.get('eval_accuracy', 0),
             test_results.get('eval_precision', 0),
             test_results.get('eval_recall', 0),
             test_results.get('eval_f1_macro', 0)]

x = np.arange(len(metrics))
width = 0.35
ax.bar(x - width/2, baseline_vals, width, label='Baseline', alpha=0.8)
ax.bar(x + width/2, bert_vals, width, label='BERT', alpha=0.8)
ax.set_ylabel('Score')
ax.set_title('Model Comparison - Test Set Metrics')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

ax = axes[1, 1]
label_counts = df_test['label'].value_counts().sort_index()
labels = ['negative/toxic', 'neutral', 'positive']
colors = ['#F87171', '#FBBF24', '#34D399']
ax.bar(labels, [label_counts.get(-1, 0), label_counts.get(0, 0), label_counts.get(1, 0)], color=colors, alpha=0.8)
ax.set_ylabel('Count')
ax.set_title('Test Set Class Distribution')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('exports/model_analysis_grid.png', dpi=150)
plt.show()  # Display in notebook
print("✓ Visualization grid saved to exports/model_analysis_grid.png")


✓ Visualization grid saved to exports/model_analysis_grid.png


# Inference Examples and Model Testing

**`Purpose`**

Demonstrate BERT model inference on diverse sample tweets per proposal Section III to probe how the model handles sarcasm, informal language, and subtle toxicity patterns. This block provides practical examples of model predictions, showing confidence scores and probability distributions across sentiment classes for tweet types commonly encountered on social media.

**`Input`**

Uses the best trained BERT model and tokenizer (loaded from checkpoint directory), and a curated set of sample tweets including toxic, neutral, positive, sarcastic/toxic, and informal positive examples. The model should be in evaluation mode and loaded onto the appropriate device (CPU or GPU).

**`Output`**

Prints detailed inference results for each sample tweet, including the predicted sentiment label, confidence percentage, and probability distribution across all three classes (negative/toxic, neutral, positive). The output gives transparency into prediction confidence and class probabilities, and it also exposes failure cases: in the run recorded here, the sarcastic example is predicted as positive with 76.33% confidence.

**`Details`**

Iterates through a diverse set of sample tweets representing different linguistic challenges: explicit toxicity, neutral statements, positive sentiment, sarcastic/toxic language (where surface meaning contradicts true sentiment), and informal positive expressions with emojis and slang. For each tweet, tokenizes the input using the BERT tokenizer with truncation and padding, runs inference through the model to obtain logits, applies softmax to compute probability distributions, and extracts the predicted class along with confidence scores. Converts BERT's internal label encoding (0, 1, 2) to proposal format (-1, 0, 1) and displays human-readable predictions with probability breakdowns for transparency and interpretability.
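The post-processing described above (softmax over the logits, argmax, label conversion, confidence) can be sketched without loading a model. This is an illustration, not the notebook's code: `softmax` and `decode` are hypothetical helpers written in plain Python, and the logits are made up, whereas the notebook obtains them from `best_model(**inputs).logits`.

```python
import math

# Mappings as defined in the notebook's inference cell.
BERT_TO_PROPOSAL = {0: -1, 1: 0, 2: 1}
LABEL_MAP = {-1: "negative/toxic", 0: "neutral", 1: "positive"}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode(logits):
    """Return (proposal label, readable name, confidence in percent) for one example."""
    probs = softmax(logits)
    # Index of the largest probability, in BERT's internal encoding (0, 1, 2).
    pred_bert = max(range(len(probs)), key=probs.__getitem__)
    pred = BERT_TO_PROPOSAL[pred_bert]
    return pred, LABEL_MAP[pred], probs[pred_bert] * 100


# Made-up logits for illustration only.
print(decode([2.0, 0.5, -1.0]))
```

Because the confidence is just the largest softmax probability, a high value does not guarantee a correct prediction; it only says how strongly the model prefers one class over the others.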

**`Line-by-line Description.`**

`sample_tweets = [(...), ...]` defines a list of tuples, each pairing a label description with tweet text, that represent diverse linguistic challenges (toxic, neutral, positive, sarcastic, informal).

`print("Testing BERT model on sample tweets (per proposal: handles sarcasm, informal language):\n")` provides context about the inference demonstration purpose.

`try: _ = best_model` checks if the best_model variable exists from previous cells to avoid redundant loading.

`best_model = best_model.to(device)` ensures the model is on the correct device (CPU or GPU) for inference.

`best_model.eval()` sets the model to evaluation mode, which disables dropout for consistent predictions. BERT uses layer normalization rather than batch normalization, so there are no running batch statistics to freeze.

`except NameError:` handles the case where best_model doesn't exist, triggering loading from checkpoint.

`best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)` loads the trained BERT model from the checkpoint directory.

`best_tokenizer = AutoTokenizer.from_pretrained(best_ckpt_dir, use_fast=True)` loads the BERT tokenizer matching the model checkpoint.

`for label, tweet in sample_tweets:` iterates through each sample tweet to demonstrate inference on diverse examples.

`inputs = best_tokenizer(tweet, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)` tokenizes the tweet text into PyTorch tensors, truncating to at most 128 tokens, then moves the tensors to the device. `padding=True` pads to the longest sequence in the batch rather than to `max_length`, so a single tweet is not padded.

`with torch.no_grad():` disables gradient computation during inference to save memory and speed up prediction.

`outputs = best_model(**inputs)` runs forward pass through the BERT model, generating raw logits for each class.

`probs = torch.softmax(outputs.logits, dim=-1)[0].cpu().numpy()` applies softmax to convert logits to probability distribution over classes, extracts first sample, and converts to numpy array.

`pred_bert = np.argmax(probs)` finds the index of the class with highest probability (BERT's internal encoding: 0, 1, 2).

`bert_to_proposal = {0: -1, 1: 0, 2: 1}` defines mapping from BERT's label encoding to proposal format.

`pred = bert_to_proposal[pred_bert]` converts predicted label from BERT format to proposal format (-1, 0, 1).

`label_map = {-1: "negative/toxic", 0: "neutral", 1: "positive"}` defines human-readable label names.

`pred_label = label_map.get(pred, "unknown")` converts predicted label value to human-readable string.

`confidence = probs[pred_bert] * 100` calculates prediction confidence as percentage (probability of predicted class times 100).

`prob_map = {0: -1, 1: 0, 2: 1}` defines mapping for converting probability indices to proposal format.

`probs_proposal = {label_map[prob_map[i]]: float(probs[i]) for i in range(3)}` creates a dictionary mapping human-readable class names to their probabilities.

`print(f"\n{label}:")` prints the sample tweet category (e.g., "Toxic example").

`print(f"  Tweet: \"{tweet}\"")` displays the original tweet text in quotes.

`print(f"  Prediction: {pred_label} (confidence: {confidence:.2f}%)")` displays the predicted sentiment class and confidence percentage.

`print(f"  Probabilities: negative/toxic={...}, neutral={...}, positive={...}")` displays the full probability distribution across all three classes for transparency.


In [17]:
sample_tweets = [
    ("Toxic example", "This is absolutely disgusting! People like you should be banned from social media. Horrible!"),
    ("Neutral example", "Just finished my morning coffee. Weather is okay today, nothing special."),
    ("Positive example", "So grateful for all the support today! Amazing community, thank you everyone! 🙏"),
    ("Sarcastic/Toxic", "Oh wonderful, another day of dealing with this nonsense. Just perfect..."),
    ("Informal positive", "This made my day! So happy right now! Best news ever! 🔥"),
]

print("Testing BERT model on sample tweets (per proposal: handles sarcasm, informal language):\n")
print("="*70)

try:
    _ = best_model
    best_model = best_model.to(device)
    best_model.eval()
except NameError:
    from transformers import AutoModelForSequenceClassification
    best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)
    best_model.eval()

best_tokenizer = AutoTokenizer.from_pretrained(best_ckpt_dir, use_fast=True)

for label, tweet in sample_tweets:
    inputs = best_tokenizer(tweet, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)

    with torch.no_grad():
        outputs = best_model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)[0].cpu().numpy()
        pred_bert = np.argmax(probs)

    bert_to_proposal = {0: -1, 1: 0, 2: 1}
    pred = bert_to_proposal[pred_bert]

    label_map = {-1: "negative/toxic", 0: "neutral", 1: "positive"}
    pred_label = label_map.get(pred, "unknown")
    confidence = probs[pred_bert] * 100

    prob_map = {0: -1, 1: 0, 2: 1}
    probs_proposal = {label_map[prob_map[i]]: float(probs[i]) for i in range(3)}

    print(f"\n{label}:")
    print(f"  Tweet: \"{tweet}\"")
    print(f"  Prediction: {pred_label} (confidence: {confidence:.2f}%)")
    print(f"  Probabilities: negative/toxic={probs_proposal['negative/toxic']:.3f}, neutral={probs_proposal['neutral']:.3f}, positive={probs_proposal['positive']:.3f}")

print("\n" + "="*70)
print("✓ Inference examples completed")


Testing BERT model on sample tweets (per proposal: handles sarcasm, informal language):


Toxic example:
  Tweet: "This is absolutely disgusting! People like you should be banned from social media. Horrible!"
  Prediction: negative/toxic (confidence: 86.36%)
  Probabilities: negative/toxic=0.864, neutral=0.118, positive=0.018

Neutral example:
  Tweet: "Just finished my morning coffee. Weather is okay today, nothing special."
  Prediction: neutral (confidence: 60.14%)
  Probabilities: negative/toxic=0.172, neutral=0.601, positive=0.227

Positive example:
  Tweet: "So grateful for all the support today! Amazing community, thank you everyone! 🙏"
  Prediction: positive (confidence: 93.71%)
  Probabilities: negative/toxic=0.023, neutral=0.039, positive=0.937

Sarcastic/Toxic:
  Tweet: "Oh wonderful, another day of dealing with this nonsense. Just perfect..."
  Prediction: positive (confidence: 76.33%)
  Probabilities: negative/toxic=0.077, neutral=0.160, positive=0.763

Informal positive:


# Export and Deployment Preparation

**`Purpose`**

Export all necessary artifacts for model deployment and reproduction per proposal Section VII. This block prepares quantized model weights, consolidated experiment logs, model metadata, and summary reports to ensure that classmates and graders can reproduce, audit, and extend the workflow without rerunning training from scratch.

**`Input`**

Uses the best trained BERT model checkpoint, experiment logs from `runs_log.csv`, test set evaluation results, and all previous model performance metrics. Requires that training, evaluation, and comparison cells have been executed to generate the necessary artifacts.

**`Output`**

Saves quantized PyTorch model weights to checkpoint directory (`pytorch_model_quantized.bin`), exports consolidated experiment logs to CSV and Excel formats in `exports/` directory, generates a JSON model card (`exports/model_card.json`) with metadata and performance metrics, and creates a text summary report (`exports/summary_report.txt`) documenting dataset statistics, model performance, and improvement metrics. Ensures all deployment artifacts are properly formatted and accessible.

**`Details`**

The export process has four steps: (1) it saves the full model in PyTorch format (`pytorch_model.bin`) and creates a quantized version (`pytorch_model_quantized.bin`) using dynamic quantization, which stores the weights of linear layers as 8-bit integers to reduce model size with the aim of keeping accuracy close to the full-precision model; (2) it consolidates all experiment runs from `runs_log.csv` into exportable CSV and Excel formats for easy sharing and analysis; (3) it generates a comprehensive model card in JSON format containing model architecture details, training configuration, dataset information, and performance metrics; (4) it creates a human-readable summary report comparing baseline and BERT model performance, documenting F1-macro improvements and target achievement status. The logs, model card, and report are organized in the `exports/` directory with clear naming conventions, while the model weights stay in the checkpoint directory.
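As an illustration of step (3), a JSON model card can be written with the standard library alone. This is a sketch under stated assumptions: `write_model_card` and every field name in the card are hypothetical, since the notebook's actual schema is not shown in this section. The metric values are the test-set scores printed by the BERT evaluation cell.

```python
import json
from pathlib import Path


def write_model_card(path, metrics, checkpoint_dir):
    """Write a small JSON model card (field names are illustrative, not the notebook's schema)."""
    card = {
        "model_name": "bert-base-uncased",
        "labels": {"-1": "negative/toxic", "0": "neutral", "1": "positive"},
        "checkpoint_dir": str(checkpoint_dir),
        "test_metrics": metrics,
    }
    path = Path(path)
    # Create the target directory (for example exports/) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(card, indent=2), encoding="utf-8")
    return card


# Test-set metrics as printed by the BERT evaluation cell in this notebook.
test_metrics = {"accuracy": 0.6691, "precision": 0.6884, "recall": 0.6599, "f1_macro": 0.6679}

# Example call (commented out so that running the sketch has no side effects):
# write_model_card("exports/model_card.json", test_metrics, "./checkpoints/bert-base/best_grid")
```

Writing the card from the same dictionary that holds the evaluation results avoids copying numbers by hand, which is where model cards and reports tend to drift apart.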

**`Line-by-line Description.`**

`print("Exporting artifacts for deployment...")` provides user feedback about the export process starting.

`try: _ = best_model` checks if best_model exists from previous cells to avoid redundant loading.

`except NameError:` handles missing model by loading from checkpoint directory.

`if os.path.exists("./checkpoints/bert-base/best_grid"):` checks for hyperparameter-tuned grid search checkpoint first (highest priority).

`best_ckpt_dir = "./checkpoints/bert-base/best_grid"` sets checkpoint path to grid search results.

`elif os.path.exists("./checkpoints/bert-base/best_random"):` falls back to random search checkpoint if grid search unavailable.

`elif os.path.exists("./checkpoints/bert-base/best"):` falls back to default training checkpoint if no tuning was performed.

`best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)` loads the trained model from checkpoint.
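The `if`/`elif` chain described above amounts to "take the first directory that exists, in priority order". A compact sketch of that logic, assuming a hypothetical helper `pick_checkpoint` and a throwaway temporary directory in place of `./checkpoints/bert-base`:

```python
from pathlib import Path
import tempfile

def pick_checkpoint(root, names=("best_grid", "best_random", "best")):
    """Return the first existing sub-directory of `root`, in priority order."""
    for name in names:
        candidate = Path(root) / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"No checkpoint found under {root}; tried {list(names)}")

# Demo: only "best_random" and "best" exist, so "best_random" wins.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("best_random", "best"):
        (Path(tmp) / name).mkdir()
    chosen = pick_checkpoint(tmp).name

print(chosen)  # best_random
```

Keeping the priority order in one tuple makes it easier to add another tuning strategy later without growing the `elif` chain.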

`full_model_path = Path(best_ckpt_dir) / "pytorch_model.bin"` constructs the path for saving PyTorch model weights.

`safetensors_path = Path(best_ckpt_dir) / "model.safetensors"` constructs the path for safetensors format (may exist from HuggingFace save).

`best_model.save_pretrained(best_ckpt_dir, safe_serialization=False)` saves model using HuggingFace's save method with PyTorch format (not safetensors).

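`best_model.cpu()` moves the model to the CPU in place so that the saved tensors are CPU tensors. This lets the file load on machines without a GPU without passing `map_location` to `torch.load`.

`torch.save(best_model.state_dict(), full_model_path)` explicitly saves the model state dictionary as `pytorch_model.bin`. This is the same file that `save_pretrained` writes when `safe_serialization=False`, so the explicit save overwrites it and guarantees the file exists for deployment.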

`best_model.to(device)` moves model back to original device for subsequent operations.

`file_size = os.path.getsize(full_model_path) / (1024*1024)` calculates model file size in megabytes for verification.

`if safetensors_path.exists(): safetensors_path.unlink()` removes safetensors file to use PyTorch format exclusively.
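The verify-then-clean-up step can be sketched with the standard library alone. The helper name `size_mb` and the placeholder files below are assumptions for illustration; no real weights are involved.

```python
from pathlib import Path
import tempfile

def size_mb(path):
    """File size in megabytes (1 MB taken as 1024 * 1024 bytes, as in the cell)."""
    return Path(path).stat().st_size / (1024 * 1024)

with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "pytorch_model.bin"
    duplicate = Path(tmp) / "model.safetensors"
    weights.write_bytes(b"\0" * (2 * 1024 * 1024))  # 2 MB placeholder file
    duplicate.write_bytes(b"\0")

    verified_mb = size_mb(weights)
    # Only delete the alternative format once the primary file is confirmed present.
    if weights.exists():
        duplicate.unlink(missing_ok=True)
    duplicate_left = duplicate.exists()

print(f"{verified_mb:.2f} MB, duplicate left: {duplicate_left}")  # 2.00 MB, duplicate left: False
```

Deleting `model.safetensors` only after the `.bin` file is verified avoids ending up with no weights at all if the save failed.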

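`quantized_model = torch.quantization.quantize_dynamic(best_model.cpu(), {torch.nn.Linear}, dtype=torch.qint8)` applies dynamic quantization to the `torch.nn.Linear` layers: their weights are stored as 8-bit integers, which reduces model size, and activations are quantized on the fly at inference time. Note that `best_model.cpu()` moves the original model to the CPU in place, so `best_model` stays on the CPU after this call. The cell output shows that PyTorch flags this eager-mode API for migration to torchao.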

`quantized_path = Path(best_ckpt_dir) / 'pytorch_model_quantized.bin'` constructs path for quantized model weights.

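`torch.save(quantized_model.state_dict(), quantized_path)` saves the quantized model's state dict to disk. To reload it, the same `quantize_dynamic` call must first be applied to a freshly constructed model, because the quantized state dict does not match the keys of the original float architecture.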
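To make "converting weights to 8-bit integers" concrete, the sketch below quantizes a handful of floats with a simplified symmetric per-tensor scheme in plain Python. The helper names `quantize_int8` and `dequantize` and the sample weights are illustrative assumptions; this shows only the underlying arithmetic, not PyTorch's actual kernels.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.50, -1.27, 0.03, 1.00]  # made-up values, not real model weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes)  # [50, -127, 3, 100]
```

Each weight now needs one byte instead of four, and the reconstruction error per weight is at most half the scale. That rounding error is why a quantized model should be re-evaluated before deployment.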

`if os.path.exists("runs_log.csv"):` checks if experiment log file exists before exporting.

`runs_df = pd.read_csv("runs_log.csv")` loads all experiment runs into a pandas DataFrame.

`runs_df.to_csv("exports/experiment_runs_all.csv", index=False)` exports experiment logs to CSV format in exports directory.

`runs_df.to_excel("exports/experiment_runs_all.xlsx", index=False)` exports experiment logs to Excel format for easy sharing and analysis.

`model_card = {...}` creates a dictionary containing model metadata, configuration, and performance metrics.

`"model_name": MODEL_NAME` records the base model identifier (bert-base-uncased).

`"task": "Toxic Comment Detection on Twitter"` specifies the task domain.

`"num_labels": 3` records the number of classification classes.

`"labels": ["negative/toxic (-1)", "neutral (0)", "positive (1)"]` lists all class labels with their encodings.

`"best_test_accuracy": float(test_results.get('eval_accuracy', 0))` records final test set accuracy.

`"best_test_f1_macro": float(test_results.get('eval_f1_macro', 0))` records final test set F1-macro score.

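`"improvement": f"{((test_results.get('eval_f1_macro', 0) - test_f1) / test_f1 * 100):.2f}%"` calculates the relative F1-macro improvement of BERT over the baseline (`test_f1`), as a percentage of the baseline score, and formats it to two decimal places. The expression divides by `test_f1`, so it assumes the baseline F1-macro is nonzero.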
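As a worked check of the improvement formula, the sketch below plugs in the F1-macro values reported in this notebook's summary output (0.6679 for BERT, 0.6234 for the baseline). The helper `relative_improvement` and its zero-baseline guard are additions for illustration, not part of the original cell.

```python
def relative_improvement(new_score, baseline_score):
    """Percentage change of new_score relative to baseline_score, or None if undefined."""
    if baseline_score == 0:
        return None  # avoid ZeroDivisionError when the baseline scored 0
    return (new_score - baseline_score) / baseline_score * 100

pct = relative_improvement(0.6679, 0.6234)
label = f"{pct:.2f}%" if pct is not None else "n/a"

print(label)  # 7.14%
```

This is a relative gain: the absolute F1-macro difference is about 0.0445, which is 7.14% of the baseline score.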

`with open("exports/model_card.json", "w") as f: json.dump(model_card, f, indent=2)` saves model card as formatted JSON file.
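A small round-trip sketch shows what the model card looks like on disk and how a consumer might read it back. The card below contains only a subset of the fields, filled with values reported elsewhere in this notebook, and it writes to a temporary directory rather than `exports/`.

```python
import json
import tempfile
from pathlib import Path

card = {
    "model_name": "bert-base-uncased",
    "num_labels": 3,
    "best_test_f1_macro": float(0.6679),  # float() mirrors the cell's conversion
    "baseline_f1_macro": float(0.6234),
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_card.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(card, f, indent=2)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["best_test_f1_macro"] > loaded["baseline_f1_macro"])  # True
```

The explicit `float(...)` conversions in the cell matter when metrics arrive as NumPy scalars: types such as `numpy.float32` are not JSON-serializable, so `json.dump` would raise `TypeError` without them.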

`summary_text = f"""..."""` creates a formatted text summary report using f-string template.

`with open("exports/summary_report.txt", "w") as f: f.write(summary_text)` saves the human-readable summary report to text file.

`print("="*70)` and `print("✓ All exports completed successfully!")` provide visual confirmation that the export process has finished.


In [18]:
print("Exporting artifacts for deployment...")

try:
    _ = best_model
    _ = best_ckpt_dir
    print("Using model from previous cells...")
except NameError:
    print("Loading model from saved checkpoint...")
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from pathlib import Path
    import os

    if os.path.exists("./checkpoints/bert-base/best_grid"):
        best_ckpt_dir = "./checkpoints/bert-base/best_grid"
        print(f"   Found best_grid checkpoint: {best_ckpt_dir}")
    elif os.path.exists("./checkpoints/bert-base/best_random"):
        best_ckpt_dir = "./checkpoints/bert-base/best_random"
        print(f"   Found best_random checkpoint: {best_ckpt_dir}")
    elif os.path.exists("./checkpoints/bert-base/best"):
        best_ckpt_dir = "./checkpoints/bert-base/best"
        print(f"   Found best checkpoint: {best_ckpt_dir}")
    else:
        raise FileNotFoundError(
            "No checkpoint directory found. Please run the training cells first.\n"
            "Expected locations: checkpoints/bert-base/best, checkpoints/bert-base/best_grid, or checkpoints/bert-base/best_random"
        )

    best_model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_dir).to(device)
    best_model.eval()
    print(f"   ✓ Model loaded from {best_ckpt_dir}")

print("\n0. Ensuring full model is saved...")
full_model_path = Path(best_ckpt_dir) / "pytorch_model.bin"
safetensors_path = Path(best_ckpt_dir) / "model.safetensors"

print("   Saving model in PyTorch format (pytorch_model.bin)...")
best_model.save_pretrained(best_ckpt_dir, safe_serialization=False)
print("   [OK] Model saved via save_pretrained()")

print("   Explicitly saving state_dict as pytorch_model.bin...")
best_model.cpu()
torch.save(best_model.state_dict(), full_model_path)
best_model.to(device)
print("   [OK] State dict saved directly")

if full_model_path.exists():
    file_size = os.path.getsize(full_model_path) / (1024*1024)
    print(f"   [OK] Verified: pytorch_model.bin exists ({file_size:.2f} MB)")

    if safetensors_path.exists():
        print("   [INFO] Removing model.safetensors (using pytorch_model.bin instead)")
        safetensors_path.unlink()
else:
    print("   [ERROR] pytorch_model.bin was not saved! Check permissions and disk space.")

print("\n1. Saving quantized model weights...")
try:
    quantized_model = torch.quantization.quantize_dynamic(
        best_model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
    )
    quantized_path = Path(best_ckpt_dir) / 'pytorch_model_quantized.bin'
    torch.save(quantized_model.state_dict(), quantized_path)
    print(f"   ✓ Quantized model saved to {quantized_path}")
except Exception as e:
    print(f"   ⚠ Quantization failed: {e}")

print("\n2. Exporting consolidated experiment logs...")
if os.path.exists("runs_log.csv"):
    runs_df = pd.read_csv("runs_log.csv")
    runs_df.to_csv("exports/experiment_runs_all.csv", index=False)
    print("   ✓ Experiment logs exported to exports/experiment_runs_all.csv")

    try:
        runs_df.to_excel("exports/experiment_runs_all.xlsx", index=False)
        print("   ✓ Excel summary exported to exports/experiment_runs_all.xlsx")
    except ImportError:  # pandas raises ImportError when openpyxl is missing
        print("   ⚠ Excel export skipped (openpyxl not available)")

print("\n3. Generating model card...")
model_card = {
    "model_name": MODEL_NAME,
    "task": "Toxic Comment Detection on Twitter",
    "num_labels": 3,
    "labels": ["negative/toxic (-1)", "neutral (0)", "positive (1)"],
    "dataset": "mteb/tweet_sentiment_extraction",
    "training_split": "70%",
    "validation_split": "15%",
    "test_split": "15%",
    "max_sequence_length": MAX_LEN,
    "optimizer": "AdamW",
    "loss": "Cross-entropy",
    "scheduler": "Linear warmup",
    "best_test_accuracy": float(test_results.get('eval_accuracy', 0)),
    "best_test_f1_macro": float(test_results.get('eval_f1_macro', 0)),
    "baseline_accuracy": float(test_acc),
    "baseline_f1_macro": float(test_f1),
    "improvement": f"{((test_results.get('eval_f1_macro', 0) - test_f1) / test_f1 * 100):.2f}%",
}

with open("exports/model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
print("   ✓ Model card saved to exports/model_card.json")

print("\n4. Creating summary report...")
summary_text = f"""
Twitter Toxicity Detection Project - Summary Report
==================================================

Dataset: mteb/tweet_sentiment_extraction
Total samples: {len(df)}
Train/Val/Test split: 70%/15%/15%

Baseline Model (TF-IDF + Logistic Regression):
  - Accuracy: {test_acc:.4f}
  - Precision: {test_prec:.4f}
  - Recall: {test_rec:.4f}
  - F1-macro: {test_f1:.4f}

BERT Model (bert-base-uncased):
  - Accuracy: {test_results.get('eval_accuracy', 0):.4f}
  - Precision: {test_results.get('eval_precision', 0):.4f}
  - Recall: {test_results.get('eval_recall', 0):.4f}
  - F1-macro: {test_results.get('eval_f1_macro', 0):.4f}

Improvement over baseline:
  - F1-macro improvement: {((test_results.get('eval_f1_macro', 0) - test_f1) / test_f1 * 100):.2f}%

Target F1-macro (per proposal): >0.85
Achieved F1-macro: {test_results.get('eval_f1_macro', 0):.4f}
Target met: {'✓ YES' if test_results.get('eval_f1_macro', 0) > 0.85 else '✗ NO'}

All artifacts exported to:
  - Model checkpoints: {best_ckpt_dir}
  - Predictions: exports/
  - Visualizations: exports/confusion_matrices/, exports/roc_curves/
  - Logs: runs_log.csv, exports/experiment_runs_all.csv
"""

with open("exports/summary_report.txt", "w") as f:
    f.write(summary_text)
print(summary_text)
print("\n✓ Summary report saved to exports/summary_report.txt")
print("\n" + "="*70)
print("✓ All exports completed successfully!")
print("="*70)


Exporting artifacts for deployment...
Using model from previous cells...

0. Ensuring full model is saved...
   Saving model in PyTorch format (pytorch_model.bin)...
   [OK] Model saved via save_pretrained()
   Explicitly saving state_dict as pytorch_model.bin...
   [OK] State dict saved directly
   [OK] Verified: pytorch_model.bin exists (417.73 MB)
   [INFO] Removing model.safetensors (using pytorch_model.bin instead)

1. Saving quantized model weights...


For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  quantized_model = torch.quantization.quantize_dynamic(


   ✓ Quantized model saved to checkpoints/bert-base/best_grid/pytorch_model_quantized.bin

2. Exporting consolidated experiment logs...
   ✓ Experiment logs exported to exports/experiment_runs_all.csv
   ✓ Excel summary exported to exports/experiment_runs_all.xlsx

3. Generating model card...
   ✓ Model card saved to exports/model_card.json

4. Creating summary report...

Twitter Toxicity Detection Project - Summary Report
==================================================

Dataset: mteb/tweet_sentiment_extraction
Total samples: 29756
Train/Val/Test split: 70%/15%/15%

Baseline Model (TF-IDF + Logistic Regression):
  - Accuracy: 0.6224
  - Precision: 0.6294
  - Recall: 0.6197
  - F1-macro: 0.6234

BERT Model (bert-base-uncased):
  - Accuracy: 0.6691
  - Precision: 0.6884
  - Recall: 0.6599
  - F1-macro: 0.6679

Improvement over baseline:
  - F1-macro improvement: 7.14%

Target F1-macro (per proposal): >0.85
Achieved F1-macro: 0.6679
Target met: ✗ NO

All artifacts exported to:
  - Model checkpoints: ./checkpoints/bert-base/best_grid
  