# SC4002 Assignment - Part 3.1, 3.3, & 3.4: Hyperparameter Tuning

**Team 3: Aaron Chen & Javier Tin**

This notebook performs a **reproducible grid search** to find the best hyperparameters, within the tested grid, for the required models:
1.  **BiLSTM (Part 3.1)**
2.  **BiGRU (Part 3.1)**
3.  **BiLSTM + Attention (Part 3.3)**
4.  **BiLSTM + Focal Loss (Part 3.4)**

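This allows a fair comparison, because each architecture is represented by the best configuration found for it on the validation set. The run recorded here searches only the `Attention` and `FocalLoss` variants; the two baselines are supported by the same training loop.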

All training is **seeded (SEED = 42)** for reproducible results.
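One caveat worth noting: seeding once at the top of the notebook makes a full rerun reproducible, but the random state each run starts from then depends on how many draws the earlier runs consumed. A minimal sketch of the difference, using Python's `random` module as a stand-in for model initialization (the function `fake_run` is hypothetical and not part of this notebook):

```python
import random

SEED = 42

def fake_run(n_draws):
    """Hypothetical stand-in for one training run: consumes n_draws random
    numbers and returns the first one (think of it as the first weight)."""
    draws = [random.random() for _ in range(n_draws)]
    return draws[0]

# Seed once, then run A followed by B.
random.seed(SEED)
fake_run(3)
b_after_a = fake_run(3)

# Seed once, then run B on its own.
random.seed(SEED)
b_alone = fake_run(3)

# Re-seed immediately before each run: B no longer depends on what ran before it.
random.seed(SEED)
fake_run(3)
random.seed(SEED)
b_reseeded = fake_run(3)

print(b_after_a == b_alone)   # order-dependent under a single seed
print(b_reseeded == b_alone)  # order-independent when re-seeding per run
```

In this notebook the analogous change would be calling `set_seed(SEED)` at the start of each grid-search iteration, so that a configuration gives the same result whether it is run alone or as part of the full grid.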

## 1. Imports & Setup

This cell imports the data iterators, vocabulary fields, and embedding helper from our compliant `data_pipeline.py` and sets the random seed.

In [1]:
# === Core PyTorch Imports ===
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import time
import itertools
import json
import pandas as pd

# === Plotting Imports ===
import matplotlib.pyplot as plt
import numpy as np
import sys
import os
import random

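def set_seed(seed):
    """Sets the random seed for full reproducibility."""
    # cuBLAS needs this workspace setting for deterministic behavior on CUDA.
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
    # Only affects subprocesses; the current interpreter's hash seed is fixed at startup.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Raise an error if an operation has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False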

SEED = 42
set_seed(SEED)

# === Check PyTorch and CUDA Versions ===
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")

# === Data Pipeline Import ===
try:
    from data_pipeline import (
        train_iterator, 
        valid_iterator, 
        test_iterator, 
        TEXT, 
        LABEL, 
        create_embedding_layer,
        device,
        BATCH_SIZE
    )
    print("\n✓ Successfully imported data pipeline.")
    print(f"  - Using device: {device}")
    print(f"  - Batch Size: {BATCH_SIZE}")
except ImportError:
    print("--- ERROR ---")
    print("Could not import from 'data_pipeline.py'.")
    print("Please make sure 'data_pipeline.py' is in the same directory as this notebook.")
    raise  # Stop here: every later cell depends on these names.

PyTorch version: 2.5.1
CUDA available: True
CUDA version: 12.1
Current device: 0
Device name: NVIDIA GeForce RTX 4060 Laptop GPU


✓ Successfully imported data pipeline.
  - Using device: cuda
  - Batch Size: 64


## 2. Define Hyperparameter Grid

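The training loop in Section 4 supports four model types (`BiLSTM`, `BiGRU`, `Attention`, `FocalLoss`). The grid below, as run here, covers `Attention` and `FocalLoss`, each across three values of hidden size, layer count, dropout, and weight decay.

**Note:** `Attention` refers to the `ImprovedBiLSTM_Attention` architecture.

**Note:** `FocalLoss` refers to the *baseline `BiLSTM_Model`* architecture, but trained with `FocalLoss`.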

In [2]:
param_grid = {
    'model_type': ['Attention', 'FocalLoss'],
    'hidden_dim': [256, 384, 512],
    'n_layers': [2, 3, 4],
    'dropout': [0.4, 0.5, 0.6],
    'weight_decay': [5e-6, 1e-5, 5e-5] 
}

# Create all combinations
keys, values = zip(*param_grid.items())
hyperparam_combos = [dict(zip(keys, v)) for v in itertools.product(*values)]

print(f"Total combinations to test: {len(hyperparam_combos)}")
print("\n--- First 5 Combinations ---")
for combo in hyperparam_combos[:5]:
    print(combo)

Total combinations to test: 162

--- First 5 Combinations ---
{'model_type': 'Attention', 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.4, 'weight_decay': 5e-06}
{'model_type': 'Attention', 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.4, 'weight_decay': 1e-05}
{'model_type': 'Attention', 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.4, 'weight_decay': 5e-05}
{'model_type': 'Attention', 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.5, 'weight_decay': 5e-06}
{'model_type': 'Attention', 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.5, 'weight_decay': 1e-05}


## 3. Model & Training Definitions

Here we define all models (Baselines and Improved) and the Focal Loss function.
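
To make the focal-loss weighting concrete, the sketch below evaluates `alpha * (1 - p_t) ** gamma * CE` in plain Python for a single example, without label smoothing. The function name `focal_term` and the two probabilities are illustrative choices, not values from this notebook.

```python
import math

def focal_term(p_t, alpha=1.0, gamma=2.0):
    """Focal loss for one example whose true class has predicted probability p_t.
    Assumes no label smoothing, so cross-entropy is simply -log(p_t)."""
    ce = -math.log(p_t)
    return alpha * (1.0 - p_t) ** gamma * ce

# Compare an easy example (p_t = 0.9) with a hard one (p_t = 0.1).
for p_t in (0.9, 0.1):
    ce = -math.log(p_t)
    fl = focal_term(p_t)
    print(f"p_t={p_t}: CE={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.2f}")
```

With `gamma = 2`, the easy example keeps about 1% of its cross-entropy while the hard example keeps 81%, which is how the loss shifts the gradient toward misclassified samples.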

In [3]:
# === 1. Baseline BiLSTM Model (Part 3.1) ===
class BiLSTM_Model(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        self.embedding = create_embedding_layer(freeze=False)
        self.lstm = nn.LSTM(
            input_size=emb_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout if n_layers > 1 else 0,
            batch_first=False
        )
        fc_input_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Linear(fc_input_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, text, lengths):
        embedded = self.embedding(text)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.to('cpu'), enforce_sorted=False)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        last_hidden_state = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        dropped_hidden = self.dropout(last_hidden_state)
        prediction = self.fc(dropped_hidden)
        return prediction

# === 2. Baseline BiGRU Model (Part 3.1) ===
class BiGRU_Model(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        self.embedding = create_embedding_layer(freeze=False)
        self.gru = nn.GRU(
            input_size=emb_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout if n_layers > 1 else 0,
            batch_first=False
        )
        fc_input_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Linear(fc_input_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, text, lengths):
        embedded = self.embedding(text)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.to('cpu'), enforce_sorted=False)
        packed_output, hidden = self.gru(packed_embedded)
        last_hidden_state = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        dropped_hidden = self.dropout(last_hidden_state)
        prediction = self.fc(dropped_hidden)
        return prediction

# === 3. IMPROVED Attention Module (Part 3.3) ===
class ImprovedAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attention = nn.Linear(hidden_dim * 2, hidden_dim * 2)
        self.context_vector = nn.Linear(hidden_dim * 2, 1, bias=False)
        self.dropout = nn.Dropout(0.1)
        
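    def forward(self, lstm_output, final_hidden):
        # final_hidden is accepted for interface compatibility but is not used.
        # lstm_output: [seq_len, batch, hidden_dim * 2] -> [batch, seq_len, hidden_dim * 2]
        lstm_output = lstm_output.permute(1, 0, 2)
        energy = torch.tanh(self.attention(lstm_output))
        # Softmax over the time dimension; padded positions are not masked here.
        attention_weights = F.softmax(self.context_vector(energy), dim=1)
        attention_weights = self.dropout(attention_weights)
        # Weighted sum over time: [batch, 1, seq_len] x [batch, seq_len, H] -> [batch, 1, H]
        context = torch.bmm(attention_weights.transpose(1, 2), lstm_output)
        return context.squeeze(1)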

# === 4. IMPROVED BiLSTM + Attention (Part 3.3) ===
class ImprovedBiLSTM_Attention(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        self.embedding = create_embedding_layer(freeze=False)
        self.lstm = nn.LSTM(
            input_size=emb_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout if n_layers > 1 else 0,
            batch_first=False
        )
        self.layer_norm = nn.LayerNorm(hidden_dim * 2)
        self.attention = ImprovedAttention(hidden_dim)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, lengths):
        embedded = self.embedding(text)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.to('cpu'), enforce_sorted=False)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=False)
        output = self.layer_norm(output)
        context_vector = self.attention(output, hidden)
        dropped_context = self.dropout(context_vector)
        prediction = self.fc(dropped_context)
        return prediction

# === 5. IMPROVED Focal Loss (Part 3.4) ===
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, label_smoothing=0.1):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.label_smoothing = label_smoothing
        
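    def forward(self, inputs, targets):
        # Per-sample cross-entropy (with label smoothing), kept unreduced.
        ce_loss = F.cross_entropy(inputs, targets, reduction='none', label_smoothing=self.label_smoothing)
        # exp(-CE) equals the true-class probability only when label_smoothing == 0;
        # with smoothing it is an approximation of p_t.
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()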

# === Training/Evaluation Function Definitions ===

def get_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim=True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc

def train_epoch(model, iterator, optimizer, criterion):
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, lengths = batch.text
        predictions = model(text, lengths)
        loss = criterion(predictions, batch.label)
        loss.backward()
        # Add gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

def evaluate_epoch(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    with torch.no_grad():
        for batch in iterator:
            text, lengths = batch.text
            predictions = model(text, lengths)
            loss = criterion(predictions, batch.label)
            acc = get_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

print("✓ All model and loss function classes defined.")

✓ All model and loss function classes defined.
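
One detail of the attention module above may be worth revisiting: the softmax runs over every time step of the padded batch, so padding positions appear to receive some attention weight. A common remedy is to exclude them before normalizing. The sketch below shows the idea in plain Python for a single sequence; `masked_softmax` is a hypothetical helper, not part of this notebook.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(scores, length):
    """Softmax over the first `length` scores; later (padded) positions get weight 0."""
    return softmax(scores[:length]) + [0.0] * (len(scores) - length)

# A sequence of true length 2, padded to length 4.
scores = [1.0, 2.0, 0.0, 0.0]

unmasked = softmax(scores)
masked = masked_softmax(scores, length=2)

print(sum(unmasked[2:]))  # weight leaked to padding without a mask
print(masked)             # padding positions carry exactly zero weight
```

In PyTorch the same effect is usually obtained by setting the padded scores to `-inf` (for example with `Tensor.masked_fill`) before `F.softmax`, using a mask built from `lengths`.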


## 4. Run Grid Search

This cell loops through every combination, trains each one for `N_EPOCHS` epochs, and keeps the best validation accuracy seen in any epoch. It will take a long time to run!

In [None]:
# === Get Static Parameters ===
INPUT_DIM = len(TEXT.vocab)
OUTPUT_DIM = len(LABEL.vocab)
EMBEDDING_DIM = create_embedding_layer().embedding_dim
N_EPOCHS = 10 # Train each combo for 10 epochs

grid_search_results = []
total_runs = len(hyperparam_combos)

print(f"--- Starting Grid Search for {total_runs} combinations --- This will take some time. ---")

for i, params in enumerate(hyperparam_combos):
    run_num = i + 1
    print(f"\n{'='*20} RUN {run_num}/{total_runs} {'='*20}")
    print(f"Params: {params}")
    
    # 1. Instantiate Model based on model_type
    model_type = params['model_type']
    
    if model_type == 'BiLSTM':
        model = BiLSTM_Model(
            INPUT_DIM, EMBEDDING_DIM, params['hidden_dim'], OUTPUT_DIM,
            params['n_layers'], True, params['dropout']
        ).to(device)
    elif model_type == 'BiGRU':
        model = BiGRU_Model(
            INPUT_DIM, EMBEDDING_DIM, params['hidden_dim'], OUTPUT_DIM,
            params['n_layers'], True, params['dropout']
        ).to(device)
    elif model_type == 'Attention':
        model = ImprovedBiLSTM_Attention(
            INPUT_DIM, EMBEDDING_DIM, params['hidden_dim'], OUTPUT_DIM,
            params['n_layers'], True, params['dropout']
        ).to(device)
    elif model_type == 'FocalLoss':
        model = BiLSTM_Model( # Using baseline arch for this test
            INPUT_DIM, EMBEDDING_DIM, params['hidden_dim'], OUTPUT_DIM,
            params['n_layers'], True, params['dropout']
        ).to(device)
    else:
        raise ValueError(f"Unknown model_type: {model_type}")
    
    # 2. Instantiate Optimizer and Criterion
    optimizer = optim.Adam(model.parameters(), weight_decay=params['weight_decay'])
    
    if model_type == 'FocalLoss':
        criterion = FocalLoss(alpha=1, gamma=2, label_smoothing=0.1).to(device)
    else:
        criterion = nn.CrossEntropyLoss().to(device)
    
    best_valid_acc = -1.0
    start_run_time = time.time()
    
    # 3. Training Loop for this combination
    for epoch in range(N_EPOCHS):
        train_epoch(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate_epoch(model, valid_iterator, criterion)
        
        if valid_acc > best_valid_acc:
            best_valid_acc = valid_acc
    
    end_run_time = time.time()
    run_duration_mins = (end_run_time - start_run_time) / 60
    
    # 4. Store Results
    result = params.copy()
    result['best_valid_acc'] = best_valid_acc * 100 # As percentage
    result['time_mins'] = run_duration_mins
    grid_search_results.append(result)
    
    print(f"Run {run_num} complete. Time: {run_duration_mins:.2f}m. Best Valid Acc: {best_valid_acc*100:.2f}%")

print("\n--- GRID SEARCH COMPLETE ---")

--- Starting Grid Search for 162 combinations --- This will take some time. ---

Params: {'model_type': 'Attention', 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.4, 'weight_decay': 5e-06}
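
Because the full search is slow, an interrupted cell loses every result held in `grid_search_results`. One possible safeguard is to write each finished run to disk and skip configurations that are already recorded. The sketch below is a self-contained illustration: `run_one` is a hypothetical placeholder for the training loop above, and the file name is arbitrary.

```python
import itertools
import json
import os
import tempfile

def combo_key(params):
    """Stable string key for a hyperparameter dict."""
    return json.dumps(params, sort_keys=True)

def run_one(params):
    """Hypothetical placeholder for training one configuration."""
    return {'best_valid_acc': 0.0}

def resumable_search(combos, path):
    # Load results from a previous (possibly interrupted) session.
    results = []
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)
    done = {combo_key(r['params']) for r in results}

    for params in combos:
        if combo_key(params) in done:
            continue  # already finished earlier
        results.append({'params': params, **run_one(params)})
        # Rewrite the file after every run so at most one run is ever lost.
        with open(path, 'w') as f:
            json.dump(results, f, indent=2)
    return results

grid = {'hidden_dim': [256, 384], 'n_layers': [2, 3]}
combos = [dict(zip(grid, v)) for v in itertools.product(*grid.values())]

path = os.path.join(tempfile.mkdtemp(), 'grid_partial.json')
first = resumable_search(combos[:2], path)   # simulate an interrupted session
second = resumable_search(combos, path)      # resume: only the remaining runs execute
print(len(first), len(second))
```

Rewriting the whole file after each run is simple and adequate for a few hundred records; an append-only format would be an alternative for much larger searches.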


## 5. Analyze Results

Now we can load all the new results into a `pandas.DataFrame` to find the winning model for each category.

In [None]:
results_df = pd.DataFrame(grid_search_results)

# Save all results to one file
results_df.to_json("grid_search_results_all.json", orient="records", indent=4)
print("✓ Saved all tuning results to grid_search_results_all.json")

# --- Find the best model for each category ---
print("\n======================================================")
print("           WINNING HYPERPARAMETERS")
print("======================================================")



# 3.3 - Best Attention Model
best_attention = results_df[results_df['model_type'] == 'Attention'].sort_values(by='best_valid_acc', ascending=False).iloc[0]
print("\n--- [Part 3.3] Best BiLSTM + Attention ---")
print(f"Accuracy: {best_attention['best_valid_acc']:.2f}%")
print(best_attention.drop('best_valid_acc').to_dict())

# 3.4 - Best Focal Loss Model
best_focal = results_df[results_df['model_type'] == 'FocalLoss'].sort_values(by='best_valid_acc', ascending=False).iloc[0]
print("\n--- [Part 3.4] Best BiLSTM + Focal Loss ---")
print(f"Accuracy: {best_focal['best_valid_acc']:.2f}%")
print(best_focal.drop('best_valid_acc').to_dict())

print("\n======================================================")

✓ Saved all tuning results to grid_search_results_all.json

           WINNING HYPERPARAMETERS

--- [Part 3.3] Best BiLSTM + Attention ---
Accuracy: 87.24%
{'model_type': 'Attention', 'hidden_dim': 384, 'n_layers': 4, 'dropout': 0.5, 'weight_decay': 5e-06, 'time_mins': 0.25966018438339233}

--- [Part 3.4] Best BiLSTM + Focal Loss ---
Accuracy: 88.80%
{'model_type': 'FocalLoss', 'hidden_dim': 384, 'n_layers': 3, 'dropout': 0.6, 'weight_decay': 1e-05, 'time_mins': 0.18910448948542277}
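
The two selection blocks above repeat the same filter-and-sort pattern per model type. If more model types are added to the grid, the selection can be written once and applied to every group. A minimal sketch in plain Python follows; the records are made-up placeholders, not results from this notebook.

```python
def best_per_model(records, metric='best_valid_acc'):
    """Return {model_type: record with the highest metric}.
    Ties keep the record that appears first."""
    best = {}
    for rec in records:
        current = best.get(rec['model_type'])
        if current is None or rec[metric] > current[metric]:
            best[rec['model_type']] = rec
    return best

# Placeholder records with the same kind of keys the grid search stores.
records = [
    {'model_type': 'A', 'hidden_dim': 256, 'best_valid_acc': 80.0},
    {'model_type': 'A', 'hidden_dim': 384, 'best_valid_acc': 82.5},
    {'model_type': 'B', 'hidden_dim': 256, 'best_valid_acc': 81.0},
]

winners = best_per_model(records)
for model_type, rec in winners.items():
    print(model_type, rec)
```

With pandas, an equivalent one-liner would be `results_df.loc[results_df.groupby('model_type')['best_valid_acc'].idxmax()]`.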



### Next Steps

1.  This notebook has found the **best** hyperparameters within the grid for each model type it searched (`Attention` and `FocalLoss` in the run shown here).
2.  Now, create a **new, clean notebook** (like your `SC4002_RNN_Experiments.ipynb`).
3.  In that notebook, train **only the winning configuration of each model** using its optimized parameters.
4.  Run `get_topic_accuracy` on all of them and generate your final comparison tables and conclusions for the report.