# **Kaggle Challenge: Pirate Pain Dataset üè¥‚Äç‚ò†Ô∏è (v3: Selective Scaling)**

This notebook implements a robust K-Fold Cross-Validation and Ensembling strategy. This version includes a key fix: **Selective Scaling**.

**Strategy:**
1.  **Feature Engineering:** Create an `is_pirate` binary feature (1 if pirate, 0 otherwise).
2.  **Selective Scaling:** Apply `StandardScaler` *only* to the 35 continuous joint/pain features. The `is_pirate` feature is left as a raw 0/1 binary input.
3.  **Hyperparameter Search:** Use Ray Tune & Optuna on a single 80/20 split to find a good set of hyperparameters (`FINAL_CONFIG`).
4.  **K-Fold Training:** Train `K=5` models on 5 different folds, using the `FINAL_CONFIG`. Each model is trained with early stopping and saved to disk.
5.  **Ensemble Prediction:** Load all 5 models, average their (softmax) probabilities on the test set, and aggregate these probabilities for a final, robust submission.

## ‚öôÔ∏è 1. Setup & Libraries

In [1]:
# Set seed for reproducibility
SEED = 123

# Import necessary libraries
import os
import logging
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import copy
from itertools import product
import time

# Set environment variables before importing modules
os.environ['MPLCONFIGDIR'] = os.getcwd() + '/configs/'

# Suppress warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=Warning)

# --- PyTorch Imports ---
import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import TensorDataset, DataLoader

# --- Sklearn Imports ---
from sklearn.preprocessing import StandardScaler, LabelEncoder, StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix

# --- Ray[tune] & Optuna Imports ---
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch
from functools import partial

# --- Setup Directories & Device ---
logs_dir = "tensorboard"
os.makedirs("models", exist_ok=True)
os.makedirs("submissions", exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

if torch.cuda.is_available():
    device = torch.device("cuda")
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.benchmark = True
    print("\n--- Using GPU ---")
else:
    device = torch.device("cpu")
    print("\n--- Using CPU ---")

print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")

# Configure plot display settings
sns.set_theme(font_scale=1.4)
sns.set_style('white')
plt.rc('font', size=14)
%matplotlib inline

2025-11-12 14:21:54,391	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2025-11-12 14:21:54,617	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.



--- Using GPU ---
PyTorch version: 2.5.1
Device: cuda


## üîÑ 2. Data Loading & Feature Engineering

In [2]:
print("--- 1. Loading Data ---")

# --- Define File Paths and Features ---
DATA_DIR = "data"
X_TRAIN_PATH = os.path.join(DATA_DIR, "pirate_pain_train.csv")
Y_TRAIN_PATH = os.path.join(DATA_DIR, "pirate_pain_train_labels.csv")
X_TEST_PATH = os.path.join(DATA_DIR, "pirate_pain_test.csv")
SUBMISSION_PATH = os.path.join(DATA_DIR, "sample_submission.csv")

try:
    # Load features and labels
    features_long_df = pd.read_csv(X_TRAIN_PATH)
    labels_df = pd.read_csv(Y_TRAIN_PATH)
    X_test_long_df = pd.read_csv(X_TEST_PATH)
    
    # --- Define constants ---
    N_TIMESTEPS = 160
    JOINT_FEATURES = [f"joint_{i:02d}" for i in range(31)]
    PAIN_FEATURES = [f"pain_survey_{i}" for i in range(1, 5)]
    FEATURES = JOINT_FEATURES + PAIN_FEATURES
    N_FEATURES_ORIGINAL = len(FEATURES) # This is 35
    LABEL_MAPPING = {'no_pain': 0, 'low_pain': 1, 'high_pain': 2}
    N_CLASSES = len(LABEL_MAPPING)

    # --- Reshape function ---
    def reshape_data(df, features_list, n_timesteps):
        df_pivot = df.pivot(index='sample_index', columns='time', values=features_list)
        data_2d = df_pivot.values
        n_samples = data_2d.shape[0]
        data_3d = data_2d.reshape(n_samples, len(features_list), n_timesteps)
        return data_3d.transpose(0, 2, 1)

    # --- Load and reshape X_train_full (35 features) ---
    X_train_full = reshape_data(
        features_long_df[features_long_df['sample_index'].isin(labels_df['sample_index'].unique())], 
        FEATURES, 
        N_TIMESTEPS
    )
    
    # --- Load and reshape X_test (35 features) ---
    X_test_full = reshape_data(
        X_test_long_df, FEATURES, N_TIMESTEPS
    )

    # --- Load and prepare y_train_full ---
    y_train_full_df = labels_df.sort_values(by='sample_index')
    le = LabelEncoder()
    le.fit(list(LABEL_MAPPING.keys()))
    y_train_full = le.transform(y_train_full_df['label'])
    
    print(f"Loaded X_train_full (shape: {X_train_full.shape}) and y_train_full (shape: {y_train_full.shape})")
    print(f"Loaded X_test_full (shape: {X_test_full.shape})")

    # --- 2. Engineer 'is_pirate' Feature (for Train) ---
    print("\n--- 2. Engineering 'is_pirate' Feature ---")
    static_cols = ['sample_index', 'n_legs', 'n_hands', 'n_eyes']
    static_df = features_long_df[static_cols].drop_duplicates().set_index('sample_index')
    
    pirate_filter = (
        (static_df['n_legs'] == 'one+peg_leg') |
        (static_df['n_hands'] == 'one+hook_hand') |
        (static_df['n_eyes'] == 'one+eye_patch')
    )
    pirate_indices = static_df[pirate_filter].index
    sample_indices_ordered = sorted(features_long_df[features_long_df['sample_index'].isin(labels_df['sample_index'].unique())]['sample_index'].unique())
    is_pirate_map = np.array([1 if idx in pirate_indices else 0 for idx in sample_indices_ordered])
    pirate_feature_broadcast = np.tile(is_pirate_map.reshape(-1, 1, 1), (1, N_TIMESTEPS, 1))
    
    # Concatenate with X_train_full
    X_train_full_engineered = np.concatenate([X_train_full, pirate_feature_broadcast], axis=2)
    
    # --- 3. Engineer 'is_pirate' Feature (for Test) ---
    static_df_test = X_test_long_df[static_cols].drop_duplicates().set_index('sample_index')
    pirate_filter_test = (
        (static_df_test['n_legs'] == 'one+peg_leg') |
        (static_df_test['n_hands'] == 'one+hook_hand') |
        (static_df_test['n_eyes'] == 'one+eye_patch')
    )
    pirate_indices_test = static_df_test[pirate_filter_test].index
    sample_indices_test_ordered = sorted(X_test_long_df['sample_index'].unique())
    is_pirate_map_test = np.array([1 if idx in pirate_indices_test else 0 for idx in sample_indices_test_ordered])
    pirate_feature_broadcast_test = np.tile(is_pirate_map_test.reshape(-1, 1, 1), (1, N_TIMESTEPS, 1))
    
    # Concatenate with X_test_full
    X_test_full_engineered = np.concatenate([X_test_full, pirate_feature_broadcast_test], axis=2)
    
    N_FEATURES_NEW = X_train_full_engineered.shape[2] # This will be 36
    print(f"Created X_train_full_engineered (shape: {X_train_full_engineered.shape})")
    print(f"Created X_test_full_engineered (shape: {X_test_full_engineered.shape})")
    print(f"N_FEATURES is now: {N_FEATURES_NEW}")

    # --- 4. Calculate Class Weights ---
    print("\n--- 3. Calculating Class Weights ---")
    class_counts_series = labels_df['label'].value_counts()
    counts_ordered = class_counts_series.reindex(LABEL_MAPPING.keys()).values
    class_weights_tensor = 1.0 / torch.tensor(counts_ordered, dtype=torch.float)
    class_weights_tensor = class_weights_tensor / class_weights_tensor.sum() # Normalize weights
    class_weights_tensor = class_weights_tensor.to(device)
    
    print(f"Class counts (0, 1, 2): {counts_ordered}")
    print(f"Calculated class weights: {class_weights_tensor}")

except FileNotFoundError as e:
    print(f"Error: Could not find a required file. {e}")
except Exception as e:
    print(f"An error occurred: {e}")


--- 1. Loading Data ---
Loaded X_train_full (shape: (661, 160, 35)) and y_train_full (shape: (661,))
Loaded X_test_full (shape: (1324, 160, 35))

--- 2. Engineering 'is_pirate' Feature ---
Created X_train_full_engineered (shape: (661, 160, 36))
Created X_test_full_engineered (shape: (1324, 160, 36))
N_FEATURES is now: 36

--- 3. Calculating Class Weights ---
Class counts (0, 1, 2): [511  94  56]
Calculated class weights: tensor([0.0643, 0.3493, 0.5864], device='cuda:0')


## üõ†Ô∏è 3. Helper Functions

In [3]:
def create_sliding_windows(X_3d, y=None, window_size=100, stride=20):
    """
    Takes 3D data (n_samples, n_timesteps, n_features)
    and creates overlapping windows.
    """
    new_X = []
    new_y = []
    # This new array tracks which original sample each window came from.
    window_indices = [] 
    
    n_samples, n_timesteps, n_features = X_3d.shape
    
    # Iterate over each original sample
    for i in range(n_samples):
        sample = X_3d[i]
        
        # Slide a window over this sample
        idx = 0
        while (idx + window_size) <= n_timesteps:
            window = sample[idx : idx + window_size]
            new_X.append(window)
            window_indices.append(i) # Track the original sample index (0, 1, 2...)
            
            if y is not None:
                new_y.append(y[i]) # The label is the same for all windows
                
            idx += stride
            
    if y is not None:
        # Return new X, new y, and the index mapping
        return np.array(new_X), np.array(new_y), np.array(window_indices)
    else:
        # Return new X and the index mapping
        return np.array(new_X), np.array(window_indices)
    
def make_loader(ds, batch_size, shuffle, drop_last):
    """Creates a PyTorch DataLoader with optimized settings."""
    return DataLoader(
        ds,
        batch_size=int(batch_size), # Ensure batch_size is an int
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=0,
        pin_memory=True,
        pin_memory_device="cuda" if torch.cuda.is_available() else "",
        prefetch_factor=None,
    )

def recurrent_summary(model, input_size):
    """Custom summary function that correctly counts parameters for RNN/GRU/LSTM layers."""
    output_shapes = {}
    hooks = []

    def get_hook(name):
        def hook(module, input, output):
            if isinstance(output, tuple):
                shape1 = list(output[0].shape)
                shape1[0] = -1  # Replace batch dimension with -1

                if isinstance(output[1], tuple):  # LSTM case: (h_n, c_n)
                    shape2 = list(output[1][0].shape)
                else:  # RNN/GRU case: h_n only
                    shape2 = list(output[1].shape)
                shape2[1] = -1
                output_shapes[name] = f"[{shape1}, {shape2}]"
            else:
                shape = list(output.shape)
                shape[0] = -1
                output_shapes[name] = f"{shape}"
        return hook

    try:
        device_summary = next(model.parameters()).device
    except StopIteration:
        device_summary = torch.device("cpu")

    dummy_input = torch.randn(1, *input_size).to(device_summary)

    for name, module in model.named_children():
        if isinstance(module, (nn.Linear, nn.RNN, nn.GRU, nn.LSTM)):
            hook_handle = module.register_forward_hook(get_hook(name))
            hooks.append(hook_handle)

    model.eval()
    with torch.no_grad():
        try:
            model(dummy_input)
        except Exception as e:
            print(f"Error during dummy forward pass: {e}")
            for h in hooks:
                h.remove()
            return

    for h in hooks:
        h.remove()

    print("-" * 79)
    print(f"{'Layer (type)':<25} {'Output Shape':<28} {'Param #':<18}")
    print("=" * 79)

    total_params = 0
    total_trainable_params = 0

    for name, module in model.named_children():
        if name in output_shapes:
            module_params = sum(p.numel() for p in module.parameters())
            trainable_params = sum(p.numel() for p in module.parameters() if p.requires_grad)

            total_params += module_params
            total_trainable_params += trainable_params

            layer_name = f"{name} ({type(module).__name__})"
            output_shape_str = str(output_shapes[name])
            params_str = f"{trainable_params:,}"

            print(f"{layer_name:<25} {output_shape_str:<28} {params_str:<15}")

    print("=" * 79)
    print(f"Total params: {total_params:,}")
    print(f"Trainable params: {total_trainable_params:,}")
    print(f"Non-trainable params: {total_params - total_trainable_params:,}")
    print("-" * 79)

## üß† 4. Model & Training Engine

In [None]:
class RecurrentClassifier(nn.Module):
    """
    Generic RNN classifier (RNN, LSTM, GRU).
    Uses the last hidden state for classification.
    """
    def __init__(
            self,
            input_size,
            hidden_size,
            num_layers,
            num_classes,
            rnn_type='GRU',
            bidirectional=False,
            dropout_rate=0.2
            ):
        super().__init__()

        self.rnn_type = rnn_type
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.bidirectional = bidirectional

        rnn_map = {
            'RNN': nn.RNN,
            'LSTM': nn.LSTM,
            'GRU': nn.GRU
        }
        rnn_module = rnn_map[rnn_type]

        # Dropout is only applied between layers (if num_layers > 1)
        dropout_val = dropout_rate if num_layers > 1 else 0

        self.rnn = rnn_module(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,       # Input shape: (batch, seq_len, features)
            bidirectional=bidirectional,
            dropout=dropout_val
        )

        if self.bidirectional:
            classifier_input_size = hidden_size * 2 # Concat fwd + bwd
        else:
            classifier_input_size = hidden_size

        self.classifier = nn.Linear(classifier_input_size, num_classes)

    def forward(self, x):
        """ x shape: (batch_size, seq_length, input_size) """
        rnn_out, hidden = self.rnn(x)

        if self.rnn_type == 'LSTM':
            hidden = hidden[0] # Use only the hidden state, not the cell state

        # Get the last layer's hidden state
        if self.bidirectional:
            # Reshape to (num_layers, num_directions, batch, hidden_size)
            hidden = hidden.view(self.num_layers, 2, -1, self.hidden_size)
            # Concat the last fwd and bwd hidden states
            hidden_to_classify = torch.cat([hidden[-1, 0, :, :], hidden[-1, 1, :, :]], dim=1)
        else:
            # Just take the last layer's hidden state
            hidden_to_classify = hidden[-1]

        logits = self.classifier(hidden_to_classify)
        return logits

def train_one_epoch(model, train_loader, criterion, optimizer, scaler, device, l1_lambda=0, l2_lambda=0):
    model.train()
    running_loss = 0.0
    all_predictions = []
    all_targets = []

    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)

        with torch.amp.autocast(device_type=device.type, enabled=(device.type == 'cuda')):
            logits = model(inputs)
            loss = criterion(logits, targets)
            
            # Add L1/L2 regularization if provided
            if l1_lambda > 0 or l2_lambda > 0:
                l1_norm = sum(p.abs().sum() for p in model.parameters())
                l2_norm = sum(p.pow(2).sum() for p in model.parameters())
                loss = loss + l1_lambda * l1_norm + l2_lambda * l2_norm

        scaler.scale(loss).backward()
        
        # Unscale gradients before clipping
        scaler.unscale_(optimizer) 
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5) 

        scaler.step(optimizer)
        scaler.update()

        if torch.isnan(loss):
            print(f"Warning: NaN loss detected in batch {batch_idx}. Skipping batch.")
            continue

        running_loss += loss.item() * inputs.size(0)
        predictions = logits.argmax(dim=1)
        all_predictions.append(predictions.cpu().numpy())
        all_targets.append(targets.cpu().numpy())

    if not all_targets:
        return 0.0, 0.0 # Return 0 if all batches were nan

    epoch_loss = running_loss / len(np.concatenate(all_targets))
    epoch_f1 = f1_score(
        np.concatenate(all_targets),
        np.concatenate(all_predictions),
        average='weighted'
    )
    return epoch_loss, epoch_f1

def validate_one_epoch(model, val_loader, criterion, device):
    model.eval()
    running_loss = 0.0
    all_predictions = []
    all_targets = []

    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)

            with torch.amp.autocast(device_type=device.type, enabled=(device.type == 'cuda')):
                logits = model(inputs)
                loss = criterion(logits, targets)

            running_loss += loss.item() * inputs.size(0)
            predictions = logits.argmax(dim=1)
            all_predictions.append(predictions.cpu().numpy())
            all_targets.append(targets.cpu().numpy())

    epoch_loss = running_loss / len(val_loader.dataset.tensors[1])
    epoch_f1 = f1_score(
        np.concatenate(all_targets),
        np.concatenate(all_predictions),
        average='weighted'
    )
    return epoch_loss, epoch_f1

def log_metrics_to_tensorboard(writer, epoch, train_loss, train_f1, val_loss, val_f1, model):
    writer.add_scalar('Loss/Training', train_loss, epoch)
    writer.add_scalar('Loss/Validation', val_loss, epoch)
    writer.add_scalar('F1/Training', train_f1, epoch)
    writer.add_scalar('F1/Validation', val_f1, epoch)


def objective_function_cv(config, X_full_engineered, y_full, class_weights_tensor):
    """
    Robust Objective: Runs 3-Fold CV for EVERY trial.
    Reports the AVERAGE validation F1 across folds to Ray Tune.
    """
    # Define CV strategy inside the trial (e.g., 3-Fold is usually enough for HPO)
    N_HPO_FOLDS = 3 
    skf = StratifiedKFold(n_splits=N_HPO_FOLDS, shuffle=True, random_state=SEED)
    
    # Prepare data loaders for all folds UP FRONT to save time in the loop
    fold_loaders = []
    
    # We need to scale inside the folds to avoid leakage, just like in the final loop
    for train_idx, val_idx in skf.split(X_full_engineered, y_full):
        
        # 1. Split
        X_train_fold = X_full_engineered[train_idx]
        y_train_fold = y_full[train_idx]
        X_val_fold = X_full_engineered[val_idx]
        y_val_fold = y_full[val_idx]
        
        # 2. Selective Scale (Fit on Train, Transform Train & Val)
        scaler_fold = StandardScaler()
        
        # Separate continuous vs binary
        X_train_cont = X_train_fold[:, :, :35]
        X_train_bin = X_train_fold[:, :, 35:]
        X_val_cont = X_val_fold[:, :, :35]
        X_val_bin = X_val_fold[:, :, 35:]
        
        # Fit/Transform
        ns, ts, f_cont = X_train_cont.shape
        X_train_cont_2d = X_train_cont.reshape(ns * ts, f_cont)
        scaler_fold.fit(X_train_cont_2d)
        
        X_train_scaled_cont = scaler_fold.transform(X_train_cont_2d).reshape(ns, ts, f_cont)
        X_val_scaled_cont = scaler_fold.transform(X_val_cont.reshape(-1, f_cont)).reshape(-1, ts, f_cont)
        
        # Re-concat
        X_train_final = np.concatenate([X_train_scaled_cont, X_train_bin], axis=2)
        X_val_final = np.concatenate([X_val_scaled_cont, X_val_bin], axis=2)
        
        # 3. Windowing
        X_train_w, y_train_w, _ = create_sliding_windows(
            X_train_final, y_train_fold, config["window_size"], config["stride"]
        )
        X_val_w, y_val_w, _ = create_sliding_windows(
            X_val_final, y_val_fold, config["window_size"], config["stride"]
        )
        
        # 4. Loaders
        train_ds = TensorDataset(torch.from_numpy(X_train_w).float(), torch.from_numpy(y_train_w).long())
        val_ds = TensorDataset(torch.from_numpy(X_val_w).float(), torch.from_numpy(y_val_w).long())
        
        t_loader = make_loader(train_ds, config["batch_size"], shuffle=True, drop_last=True)
        v_loader = make_loader(val_ds, config["batch_size"], shuffle=False, drop_last=False)
        
        fold_loaders.append((t_loader, v_loader))

    # Initialize K models and K optimizers
    models = []
    optimizers = []
    scalers = []
    
    for _ in range(N_HPO_FOLDS):
        model = RecurrentClassifier(
            input_size=36, # Hardcoded for engineered features
            hidden_size=config["hidden_size"],
            num_layers=config["num_layers"],
            num_classes=3,
            dropout_rate=config["dropout_rate"],
            bidirectional=config["bidirectional"],
            rnn_type=config["rnn_type"]
        ).to(device)
        
        if torch.__version__[0] >= "2": model = torch.compile(model)
        
        optim = torch.optim.AdamW(model.parameters(), lr=config["lr"], weight_decay=config["l2_lambda"])
        scaler = torch.amp.GradScaler(enabled=(device.type == 'cuda'))
        
        models.append(model)
        optimizers.append(optim)
        scalers.append(scaler)

    criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)
    EPOCHS = 100 # Reduce epochs slightly since we are doing 3x the work
    
    # --- The Parallel Training Loop ---
    for epoch in range(1, EPOCHS + 1):
        
        fold_val_f1s = []
        fold_train_losses = []
        
        # Train each fold for 1 epoch
        for i in range(N_HPO_FOLDS):
            train_loader, val_loader = fold_loaders[i]
            model = models[i]
            optimizer = optimizers[i]
            scaler = scalers[i]
            
            t_loss, _ = train_one_epoch(model, train_loader, criterion, optimizer, scaler, device)
            _, v_f1 = validate_one_epoch(model, val_loader, criterion, device)
            
            fold_train_losses.append(t_loss)
            fold_val_f1s.append(v_f1)
        
        # Calculate AVERAGE metrics across the 3 folds
        avg_val_f1 = np.mean(fold_val_f1s)
        avg_train_loss = np.mean(fold_train_losses)
        
        # Report the AVERAGE to Ray Tune
        # If the config is bad on *any* fold, the average drops, and ASHA kills it.
        tune.report({
            "val_f1": avg_val_f1,
            "train_loss": avg_train_loss
        })

def fit(model, train_loader, val_loader, epochs, criterion, optimizer, scaler, device,
        l1_lambda=0, l2_lambda=0, patience=0, evaluation_metric="val_f1", mode='max',
        restore_best_weights=True, writer=None, verbose=10, experiment_name=""):
    """
    Full training loop with early stopping, model checkpointing, and logging.
    """
    training_history = {
        'train_loss': [], 'val_loss': [],
        'train_f1': [], 'val_f1': []
    }
    
    model_path = f"models/{experiment_name}_best_model.pt"

    if patience > 0:
        patience_counter = 0
        best_metric = float('-inf') if mode == 'max' else float('inf')
        best_epoch = 0

    print(f"--- Starting Training: {experiment_name} ---")
    print(f"Will train for {epochs} epochs with patience={patience} monitoring {evaluation_metric}")

    for epoch in range(1, epochs + 1):
        train_loss, train_f1 = train_one_epoch(
            model, train_loader, criterion, optimizer, scaler, device, l1_lambda, l2_lambda
        )

        val_loss, val_f1 = validate_one_epoch(
            model, val_loader, criterion, device
        )

        training_history['train_loss'].append(train_loss)
        training_history['val_loss'].append(val_loss)
        training_history['train_f1'].append(train_f1)
        training_history['val_f1'].append(val_f1)

        if writer is not None:
            log_metrics_to_tensorboard(
                writer, epoch, train_loss, train_f1, val_loss, val_f1, model
            )

        if verbose > 0 and (epoch % verbose == 0 or epoch == 1):
            print(f"Epoch {epoch:3d}/{epochs} | "
                  f"Train: Loss={train_loss:.4f}, F1={train_f1:.4f} | "
                  f"Val: Loss={val_loss:.4f}, F1={val_f1:.4f}")

        if patience > 0:
            current_metric = training_history[evaluation_metric][-1]
            is_improvement = (current_metric > best_metric) if mode == 'max' else (current_metric < best_metric)

            if is_improvement:
                best_metric = current_metric
                best_epoch = epoch
                torch.save(model.state_dict(), model_path)
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    print(f"\nEarly stopping triggered after {epoch} epochs.")
                    break

    if restore_best_weights and patience > 0:
        print(f"Restoring best model from epoch {best_epoch} with {evaluation_metric} {best_metric:.4f}")
        model.load_state_dict(torch.load(model_path))

    if patience == 0:
        print("Training complete. Saving final model.")
        torch.save(model.state_dict(), model_path.replace("_best_model.pt", "_final_model.pt"))

    if writer is not None:
        writer.close()
    
    print(f"--- Finished Training: {experiment_name} ---")
    return model, training_history, best_epoch if 'best_epoch' in locals() else epochs

## üß™ 5. Phase 1: Hyperparameter Search

### 5.1. Preprocessing for HPO (Selective Scaling)

We create a single 80/20 split and apply our selective scaling: scale features 0-34, but not feature 35 (`is_pirate`).

In [5]:
# --- 1. Split Data (NON-WINDOWED) ---
print("--- Splitting NON-WINDOWED data for HPO ---")
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)

for train_idx, val_idx in sss.split(X_train_full_engineered, y_train_full):
    X_train_split_full = X_train_full_engineered[train_idx]
    y_train_split_full = y_train_full[train_idx]
    X_val_split_full = X_train_full_engineered[val_idx]
    y_val_split_full = y_train_full[val_idx]

print(f"  X_train_split_full: {X_train_split_full.shape}")
print(f"  X_val_split_full:   {X_val_split_full.shape}")

# --- 2. Scale Features (SELECTIVELY) ---
print("\n--- Applying Selective Scaling for HPO ---")
scaler_hpo = StandardScaler()

# 1. Separate continuous (features 0-34) and binary (feature 35)
X_train_cont = X_train_split_full[:, :, :35]
X_train_bin = X_train_split_full[:, :, 35:] # The 'is_pirate' feature
X_val_cont = X_val_split_full[:, :, :35]
X_val_bin = X_val_split_full[:, :, 35:]

# 2. Fit scaler ONLY on 2D-reshaped CONTINUOUS training data
ns, ts, f_cont = X_train_cont.shape
X_train_cont_2d = X_train_cont.reshape(ns * ts, f_cont)
scaler_hpo.fit(X_train_cont_2d)
print(f"Fitted scaler on continuous training data shape: {X_train_cont_2d.shape}")

# 3. Transform continuous parts of both train and val
X_train_scaled_2d = scaler_hpo.transform(X_train_cont_2d)
X_train_scaled_cont = X_train_scaled_2d.reshape(ns, ts, f_cont)

ns_val, ts_val, f_val_cont = X_val_cont.shape
X_val_cont_2d = X_val_cont.reshape(ns_val * ts_val, f_val_cont)
X_val_scaled_2d = scaler_hpo.transform(X_val_cont_2d)
X_val_scaled_cont = X_val_scaled_2d.reshape(ns_val, ts_val, f_val_cont)

# 4. Re-concatenate with the UNTOUCHED binary feature
X_train_full_scaled = np.concatenate([X_train_scaled_cont, X_train_bin], axis=2)
X_val_full_scaled = np.concatenate([X_val_scaled_cont, X_val_bin], axis=2)

print(f"  X_train_full_scaled (final): {X_train_full_scaled.shape}")
print(f"  X_val_full_scaled (final):   {X_val_full_scaled.shape}")

# Verify the binary feature is still 0/1
# print(f"Min/Max of pirate feature in scaled train: {X_train_full_scaled[:, :, 35].min()}, {X_train_full_scaled[:, :, 35].max()}")

# Clean up
del X_train_cont, X_train_bin, X_val_cont, X_val_bin
del X_train_cont_2d, X_train_scaled_2d, X_val_cont_2d, X_val_scaled_2d
del X_train_scaled_cont, X_val_scaled_cont

--- Splitting NON-WINDOWED data for HPO ---
  X_train_split_full: (528, 160, 36)
  X_val_split_full:   (133, 160, 36)

--- Applying Selective Scaling for HPO ---
Fitted scaler on continuous training data shape: (84480, 35)
  X_train_full_scaled (final): (528, 160, 36)
  X_val_full_scaled (final):   (133, 160, 36)


### 5.2. HPO Search Execution (Ray Tune + Optuna)

In [None]:
# --- 1. Define the Search Space for Optuna --
search_space = {
    # Windowing params
    "window_size": tune.choice([10, 20, 30]),
    "stride": tune.choice([1, 2, 5]),
    
    # Model params
    "rnn_type": tune.choice(['GRU']),
    "lr": tune.loguniform(1e-5, 5e-3),
    "batch_size": tune.choice([64, 128, 256]),  
    "hidden_size": tune.choice([128, 256, 384]),
    "num_layers": tune.choice([2, 3]),
    "dropout_rate": tune.uniform(0.1, 0.6),
    "bidirectional": tune.choice([True, False]),
    "l2_lambda": tune.loguniform(1e-7, 1e-3) # This is weight_decay in AdamW
}
# --- 2. Define the Optimizer (Optuna) and Scheduler (ASHA) ---
optuna_search = OptunaSearch(
    metric="val_f1",
    mode="max"
)

scheduler = ASHAScheduler(
    metric="val_f1",
    mode="max",
    grace_period=20,  # Min epochs a trial must run
    reduction_factor=2  # How aggressively to stop trials
)

# --- 3. Initialize Ray ---
# Shutdown previous sessions if any (helps in notebooks)
if ray.is_initialized():
    ray.shutdown()

ray_logs_path = os.path.abspath("./ray_results")
os.makedirs(ray_logs_path, exist_ok=True)
os.environ["RAY_TEMP_DIR"] = ray_logs_path

ray.init(
    num_cpus=16, 
    num_gpus=1, 
    ignore_reinit_error=True,
    log_to_driver=False # Suppress logs in notebook
)

def short_trial_name(trial):
    """Creates a short, unique name for each trial folder."""
    return f"{trial.trainable_name}_{trial.trial_id}"


# --- 4. Run the Tuner --
print("Starting hyperparameter search...")

# Use tune.with_parameters to pass our NON-WINDOWED, SCALED numpy arrays
# and the class weights to the objective function
objective_with_data = tune.with_parameters(
    objective_function_cv, # Use the new function
    X_full_engineered=X_train_full_engineered, # The full (661, 160, 36) array
    y_full=y_train_full,                       # The full labels
    class_weights_tensor=class_weights_tensor
)

analysis = tune.run(
    objective_with_data,
    resources_per_trial={"cpu": 4, "gpu": 0.5}, # Increased GPU usage per trial slightly
    config=search_space,
    num_samples=20, # Reduce samples if time is an issue
    search_alg=optuna_search,
    scheduler=scheduler,
    name="pirate_pain_robust_cv_search",
    trial_dirname_creator=short_trial_name,
    verbose=1
)

print("\n--- Search Complete ---\n")

# --- 5. Get Best Results ---
print("Getting best trial from analysis...")
best_trial = analysis.get_best_trial(metric="val_f1", mode="max", scope="all")
if best_trial:
    FINAL_CONFIG = best_trial.config
    FINAL_BEST_VAL_F1 = best_trial.last_result["val_f1"]
    
    print(f"Best validation F1 score: {FINAL_BEST_VAL_F1:.4f}")
    print("Best hyperparameters found:")
    print(FINAL_CONFIG)
else:
    print("ERROR: No trials completed successfully. Using a default config.")
    # Fallback config in case HPO fails
    FINAL_CONFIG = {
        'window_size': 100, 'stride': 10, 'rnn_type': 'GRU', 'lr': 0.0005,
        'batch_size': 64, 'hidden_size': 256, 'num_layers': 3,
        'dropout_rate': 0.5, 'bidirectional': True, 'l2_lambda': 1e-06
    }
    FINAL_BEST_VAL_F1 = 0.0

# Clean up HPO data
del X_train_full_scaled, y_train_split_full, X_val_full_scaled, y_val_split_full, scaler_hpo

0,1
Current time:,2025-11-12 15:50:01
Running for:,01:27:54.21
Memory:,8.0/13.9 GiB

Trial name,status,loc,batch_size,bidirectional,dropout_rate,hidden_size,l2_lambda,lr,num_layers,rnn_type,stride,window_size,iter,total time (s),train_loss,train_f1,val_loss
objective_function_a81fccea,TERMINATED,127.0.0.1:26188,256,False,0.389684,128,0.000150741,5.51364e-05,3,LSTM,20,80,100,20.3301,0.0584517,0.871666,0.236291
objective_function_a15f428d,TERMINATED,127.0.0.1:6444,64,True,0.599838,384,0.000441689,0.00101293,2,LSTM,20,80,100,366.082,1.60095e-05,1.0,0.300293
objective_function_4b1bc738,TERMINATED,127.0.0.1:31576,64,False,0.318428,384,1.52309e-07,5.13627e-05,2,LSTM,10,80,100,376.06,0.00438218,0.998521,0.477483
objective_function_38242623,TERMINATED,127.0.0.1:19052,256,False,0.312721,384,1.54554e-07,0.000977984,2,LSTM,20,100,100,66.4265,0.00358984,0.992023,0.299066
objective_function_7dbe0c92,TERMINATED,127.0.0.1:12636,128,True,0.226078,128,2.62517e-05,1.0298e-05,2,GRU,10,120,20,27.5393,0.551215,0.735775,0.565655
objective_function_fff3a8cf,TERMINATED,127.0.0.1:8348,128,True,0.101754,128,1.24821e-07,1.92944e-05,3,LSTM,20,80,20,19.8883,0.275991,0.714153,0.309728
objective_function_3d561f97,TERMINATED,127.0.0.1:6920,128,False,0.18108,256,0.000130863,0.0039916,2,GRU,20,100,100,112.006,6.07378e-06,1.0,0.509429
objective_function_a5220080,TERMINATED,127.0.0.1:15936,256,False,0.172465,128,1.61589e-07,0.000111708,2,GRU,20,100,20,11.6405,0.278291,0.74776,0.355103
objective_function_f4adbcc3,TERMINATED,127.0.0.1:31896,64,True,0.325689,384,3.01678e-06,1.42631e-05,3,GRU,10,120,40,336.235,0.0595018,0.922545,0.270003
objective_function_4c27ad08,TERMINATED,127.0.0.1:32592,256,True,0.593208,128,2.64184e-06,9.28073e-05,2,LSTM,10,120,20,16.6816,0.165329,0.787004,0.223046


2025-11-12 15:50:01,581	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'c:/Users/Karim Negm/Documents/AN2DL/Challenge 1/ray_results/pirate_pain_optuna_search' in 0.0743s.
2025-11-12 15:50:01,619	INFO tune.py:1041 -- Total run time: 5274.29 seconds (5274.12 seconds for the tuning loop).



--- Search Complete ---

Getting best trial from analysis...
Best validation F1 score: 0.9377
Best hyperparameters found:
{'window_size': 80, 'stride': 10, 'rnn_type': 'GRU', 'lr': 0.0004812212932637246, 'batch_size': 64, 'hidden_size': 384, 'num_layers': 3, 'dropout_rate': 0.5385357600632669, 'bidirectional': True, 'l2_lambda': 0.0007087527682059368}


## üèÜ 6. Phase 2: K-Fold Ensemble Training

In [7]:
# ===================================================================
# --- üèÜ FINAL MODEL CONFIGURATION üèÜ ---
# ===================================================================
print("--- üèÜ Final Configuration Set --- ")
print(f"Best Val F1 from HPO search: {FINAL_BEST_VAL_F1:.4f}")
print(FINAL_CONFIG)

# --- Set variables for the K-Fold & submission cells ---
FINAL_MODEL_TYPE = FINAL_CONFIG["rnn_type"]
FINAL_HIDDEN_SIZE = FINAL_CONFIG["hidden_size"]
FINAL_HIDDEN_LAYERS = FINAL_CONFIG["num_layers"]
FINAL_BIDIRECTIONAL = FINAL_CONFIG["bidirectional"]
FINAL_DROPOUT_RATE = FINAL_CONFIG["dropout_rate"]
FINAL_LEARNING_RATE = FINAL_CONFIG["lr"]
FINAL_L2_LAMBDA = FINAL_CONFIG["l2_lambda"]
FINAL_BATCH_SIZE = FINAL_CONFIG["batch_size"]
FINAL_WINDOW_SIZE = FINAL_CONFIG["window_size"]
FINAL_STRIDE = FINAL_CONFIG["stride"]
N_SPLITS = 5 # Number of folds

FINAL_EXPERIMENT_NAME = f"{FINAL_MODEL_TYPE}_H{FINAL_HIDDEN_SIZE}_L{FINAL_HIDDEN_LAYERS}_B{FINAL_BIDIRECTIONAL}_Optuna_KFold_Ensemble"
submission_filename_base = f"submission_{FINAL_EXPERIMENT_NAME}_w{FINAL_WINDOW_SIZE}_s{FINAL_STRIDE}.csv"
print(f"Submission name will be: {submission_filename_base}")

--- üèÜ Final Configuration Set --- 
Best Val F1 from HPO search: 0.9377
{'window_size': 80, 'stride': 10, 'rnn_type': 'GRU', 'lr': 0.0004812212932637246, 'batch_size': 64, 'hidden_size': 384, 'num_layers': 3, 'dropout_rate': 0.5385357600632669, 'bidirectional': True, 'l2_lambda': 0.0007087527682059368}
Submission name will be: submission_GRU_H384_L3_BTrue_Optuna_KFold_Ensemble_w80_s10.csv


In [8]:
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED) 
print(f"--- Starting {N_SPLITS}-Fold CV Training ---")
print(f"Splitting original engineered data: {X_train_full_engineered.shape}")
print(f"Using Class Weights: {class_weights_tensor.cpu().numpy()}")

fold_val_f1_list = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_train_full_engineered, y_train_full)):
    fold_name = f"kfold_fold_{fold+1}"
    print(f"\n--- Fold {fold+1}/{N_SPLITS} --- ({fold_name}) ---")
    
    X_train_fold_full = X_train_full_engineered[train_idx]
    y_train_fold_full = y_train_full[train_idx]
    X_val_fold_full = X_train_full_engineered[val_idx]
    y_val_fold_full = y_train_full[val_idx]

    # --- Scale INSIDE the fold (SELECTIVELY) ---
    scaler_fold = StandardScaler()
    
    # 1. Separate continuous (features 0-34) and binary (feature 35)
    X_train_cont = X_train_fold_full[:, :, :35]
    X_train_bin = X_train_fold_full[:, :, 35:] # The 'is_pirate' feature
    X_val_cont = X_val_fold_full[:, :, :35]
    X_val_bin = X_val_fold_full[:, :, 35:]

    # 2. Fit scaler ONLY on 2D-reshaped CONTINUOUS training data
    ns, ts, f_cont = X_train_cont.shape
    X_train_cont_2d = X_train_cont.reshape(ns * ts, f_cont)
    scaler_fold.fit(X_train_cont_2d)

    # 3. Transform continuous parts of both train and val
    X_train_scaled_2d = scaler_fold.transform(X_train_cont_2d)
    X_train_fold_scaled_cont = X_train_scaled_2d.reshape(ns, ts, f_cont)

    ns_val, ts_val, f_val_cont = X_val_cont.shape
    X_val_cont_2d = X_val_cont.reshape(ns_val * ts_val, f_val_cont)
    X_val_scaled_2d = scaler_fold.transform(X_val_cont_2d)
    X_val_fold_scaled_cont = X_val_scaled_2d.reshape(ns_val, ts_val, f_val_cont)

    # 4. Re-concatenate with the UNTOUCHED binary feature
    X_train_fold_scaled = np.concatenate([X_train_fold_scaled_cont, X_train_bin], axis=2)
    X_val_fold_scaled = np.concatenate([X_val_fold_scaled_cont, X_val_bin], axis=2)

    # --- Create Sliding Windows (POST-SPLIT) ---
    X_train_w, y_train_w, _ = create_sliding_windows(
        X_train_fold_scaled, y_train_fold_full, 
        window_size=FINAL_WINDOW_SIZE, stride=FINAL_STRIDE
    )
    X_val_w, y_val_w, _ = create_sliding_windows(
        X_val_fold_scaled, y_val_fold_full, 
        window_size=FINAL_WINDOW_SIZE, stride=FINAL_STRIDE
    )
    print(f"  Fold Train Windows: {X_train_w.shape}, Fold Val Windows: {X_val_w.shape}")

    # --- Create Tensors, datasets and dataloaders --
    X_train_fold = torch.from_numpy(X_train_w).float()
    y_train_fold = torch.from_numpy(y_train_w).long()
    X_val_fold = torch.from_numpy(X_val_w).float()
    y_val_fold = torch.from_numpy(y_val_w).long()
    train_ds_fold = TensorDataset(X_train_fold, y_train_fold)
    val_ds_fold = TensorDataset(X_val_fold, y_val_fold)
    
    train_loader_fold = make_loader(train_ds_fold, batch_size=FINAL_BATCH_SIZE, shuffle=True, drop_last=True)
    val_loader_fold = make_loader(val_ds_fold, batch_size=FINAL_BATCH_SIZE, shuffle=False, drop_last=False)
    
    # --- Create a fresh model (using FINAL_CONFIG) ---
    model_fold = RecurrentClassifier(
        input_size=N_FEATURES_NEW, # This is 36
        hidden_size=FINAL_HIDDEN_SIZE, num_layers=FINAL_HIDDEN_LAYERS,
        num_classes=N_CLASSES, dropout_rate=FINAL_DROPOUT_RATE,
        bidirectional=FINAL_BIDIRECTIONAL, rnn_type=FINAL_MODEL_TYPE
    ).to(device)
    
    if torch.__version__[0] >= "2": model_fold = torch.compile(model_fold)
    optimizer_fold = torch.optim.AdamW(model_fold.parameters(), lr=FINAL_LEARNING_RATE, weight_decay=FINAL_L2_LAMBDA)
    scaler_fold_amp = torch.amp.GradScaler(enabled=(device.type == 'cuda'))
    criterion_fold = nn.CrossEntropyLoss(weight=class_weights_tensor)
    
    # --- Train this fold with early stopping ---
    model_fold, _, _ = fit(
        model=model_fold, train_loader=train_loader_fold,
        val_loader=val_loader_fold, epochs=300,
        criterion=criterion_fold, optimizer=optimizer_fold,
        scaler=scaler_fold_amp, device=device,
        writer=None, verbose=25,
        experiment_name=fold_name, patience=30
    )
    
    val_loss, val_f1 = validate_one_epoch(model_fold, val_loader_fold, criterion_fold, device)
    fold_val_f1_list.append(val_f1)
    print(f"Fold {fold+1} Best Model Val F1: {val_f1:.4f}")

print(f"\n--- üèÜ K-Fold Training Complete ---")
print(f"Fold F1 scores: {[round(f, 4) for f in fold_val_f1_list]}")
print(f"Average F1 across folds: {np.mean(fold_val_f1_list):.4f}")

# Clean up
del X_train_fold, y_train_fold, X_val_fold, y_val_fold
del X_train_w, y_train_w, X_val_w, y_val_w
del X_train_fold_full, y_train_fold_full, X_val_fold_full, y_val_fold_full
del X_train_cont, X_train_bin, X_val_cont, X_val_bin
del X_train_cont_2d, X_train_scaled_2d, X_val_cont_2d, X_val_scaled_2d
del X_train_fold_scaled_cont, X_val_fold_scaled_cont

--- Starting 5-Fold CV Training ---
Splitting original engineered data: (661, 160, 36)
Using Class Weights: [0.06426252 0.34934196 0.5863955 ]

--- Fold 1/5 --- (kfold_fold_1) ---
  Fold Train Windows: (4752, 80, 36), Fold Val Windows: (1197, 80, 36)
--- Starting Training: kfold_fold_1 ---
Will train for 300 epochs with patience=30 monitoring val_f1
Epoch   1/300 | Train: Loss=0.2969, F1=0.7717 | Val: Loss=0.3028, F1=0.8111
Epoch  25/300 | Train: Loss=0.0045, F1=0.9968 | Val: Loss=0.2908, F1=0.9147

Early stopping triggered after 48 epochs.
Restoring best model from epoch 18 with val_f1 0.9205
--- Finished Training: kfold_fold_1 ---
Fold 1 Best Model Val F1: 0.9205

--- Fold 2/5 --- (kfold_fold_2) ---
  Fold Train Windows: (4761, 80, 36), Fold Val Windows: (1188, 80, 36)
--- Starting Training: kfold_fold_2 ---
Will train for 300 epochs with patience=30 monitoring val_f1
Epoch   1/300 | Train: Loss=0.3057, F1=0.7609 | Val: Loss=0.2484, F1=0.8271
Epoch  25/300 | Train: Loss=0.0067, F1=0.

## üì¨ 7. Phase 3: Ensemble Submission

In [9]:
print("\n--- Preparing full dataset for FINAL SCALER ---")

# --- 1. Prepare Final Scaler (Fit on ALL training data) ---
scaler_final = StandardScaler()

# 1. Separate continuous (features 0-34) and binary (feature 35)
X_train_full_cont = X_train_full_engineered[:, :, :35]
X_train_full_bin = X_train_full_engineered[:, :, 35:] # The 'is_pirate' feature

# 2. Fit scaler ONLY on 2D-reshaped CONTINUOUS training data
ns, ts, f_cont = X_train_full_cont.shape
X_train_full_cont_2d = X_train_full_cont.reshape(ns * ts, f_cont)
scaler_final.fit(X_train_full_cont_2d)
print(f"Fitted FINAL scaler on continuous training data shape: {X_train_full_cont_2d.shape}")

# --- 2. Prepare, Scale (Selectively), and Window the TEST data ---
print("\n--- Preparing Test Set (Selective Scaling) ---")

# 1. Separate test data
X_test_cont = X_test_full_engineered[:, :, :35]
X_test_bin = X_test_full_engineered[:, :, 35:]

# 2. Transform continuous part of test data
ns_test, ts_test, f_test_cont = X_test_cont.shape
X_test_cont_2d = X_test_cont.reshape(ns_test * ts_test, f_test_cont)
X_test_scaled_2d = scaler_final.transform(X_test_cont_2d)
X_test_scaled_cont = X_test_scaled_2d.reshape(ns_test, ts_test, f_test_cont)

# 3. Re-concatenate with the UNTOUCHED binary feature
X_test_final_scaled = np.concatenate([X_test_scaled_cont, X_test_bin], axis=2)
print(f"Created final scaled test set (shape: {X_test_final_scaled.shape})")

# --- 3. Apply Sliding Windows ---
print("--- Applying sliding windows to final test set ---")
X_test_final_windowed, test_window_indices = create_sliding_windows(
    X_test_final_scaled, y=None, 
    window_size=FINAL_WINDOW_SIZE, stride=FINAL_STRIDE
)
print(f"Test windowed shape: {X_test_final_windowed.shape}")

# --- 4. Create Final TestLoader ---
final_test_features = torch.from_numpy(X_test_final_windowed).float()
final_test_ds = TensorDataset(final_test_features)
test_loader = make_loader(final_test_ds, batch_size=FINAL_BATCH_SIZE, shuffle=False, drop_last=False)
print("Final TestLoader created.")

# --- 5. Get Predictions from all K-Fold Models ---
all_fold_probabilities = []
print(f"\n--- Generating predictions from {N_SPLITS} fold models ---")

for fold in range(N_SPLITS):
    fold_name = f"kfold_fold_{fold+1}"
    model_path = f"models/{fold_name}_best_model.pt"
    print(f"Loading model {fold+1}/{N_SPLITS} from {model_path}...")

    # Create a fresh model shell
    model_fold = RecurrentClassifier(
        input_size=N_FEATURES_NEW, # Use 36
        hidden_size=FINAL_HIDDEN_SIZE, num_layers=FINAL_HIDDEN_LAYERS,
        num_classes=N_CLASSES, dropout_rate=FINAL_DROPOUT_RATE,
        bidirectional=FINAL_BIDIRECTIONAL, rnn_type=FINAL_MODEL_TYPE
    ).to(device)
    
    # Load the saved weights (with compile-fix)
    state_dict = torch.load(model_path, map_location=device)
    # Remove the '_orig_mod.' prefix if model was compiled
    new_state_dict = {k.replace('_orig_mod.', ''): v for k, v in state_dict.items()}
    model_fold.load_state_dict(new_state_dict)
    model_fold.eval()

    # Get Softmax probabilities
    fold_predictions = []
    with torch.no_grad():
        for (inputs,) in test_loader: 
            inputs = inputs.to(device)
            with torch.amp.autocast(device_type=device.type, enabled=(device.type == 'cuda')):
                logits = model_fold(inputs)
                probs = torch.softmax(logits, dim=1)
                fold_predictions.append(probs.cpu().numpy())
    all_fold_probabilities.append(np.concatenate(fold_predictions))

# --- 6. Average the Probabilities ---
print(f"\n--- Averaging {len(all_fold_probabilities)} sets of probabilities... ---")
mean_probabilities = np.mean(all_fold_probabilities, axis=0)
print(f"Mean probability matrix shape: {mean_probabilities.shape}")

# --- 7. Aggregate Mean Probabilities (MEAN) ---
print("Aggregating window probabilities to sample predictions (using MEAN)...")
prob_cols = [f"prob_{i}" for i in range(N_CLASSES)]
df_probs = pd.DataFrame(mean_probabilities, columns=prob_cols)
df_probs['original_index'] = test_window_indices 
agg_probs = df_probs.groupby('original_index')[prob_cols].mean().values
print(f"Aggregated to {len(agg_probs)} final probability vectors.")

# --- 8. Get Final Predictions and Save ---
final_predictions_numeric = np.argmax(agg_probs, axis=1)
predicted_labels = le.inverse_transform(final_predictions_numeric)

print("Loading sample submission file for correct formatting...")
test_sample_indices = sorted(X_test_long_df['sample_index'].unique())

if len(predicted_labels) != len(test_sample_indices):
    print(f"ERROR: Prediction count mismatch!")
else:
    print("Prediction count matches. Creating submission.")
    final_submission_df = pd.DataFrame({
        'sample_index': test_sample_indices,
        'label': predicted_labels 
    })
    final_submission_df['sample_index'] = final_submission_df['sample_index'].apply(lambda x: f"{x:03d}")

    submission_filepath = os.path.join("submissions", submission_filename_base)
    final_submission_df.to_csv(submission_filepath, index=False)

    print(f"\nSuccessfully saved to {submission_filepath}!")
    print("This file is correctly formatted for Kaggle:")
    print(final_submission_df.head())

# Clean up
del all_fold_probabilities, final_test_features, final_test_ds, test_loader
del X_test_full_engineered, X_test_final_scaled, X_test_final_windowed
del X_train_full_cont, X_train_full_bin, X_train_full_cont_2d, scaler_final
del X_test_cont, X_test_bin, X_test_cont_2d, X_test_scaled_2d, X_test_scaled_cont


--- Preparing full dataset for FINAL SCALER ---
Fitted FINAL scaler on continuous training data shape: (105760, 35)

--- Preparing Test Set (Selective Scaling) ---
Created final scaled test set (shape: (1324, 160, 36))
--- Applying sliding windows to final test set ---
Test windowed shape: (11916, 80, 36)
Final TestLoader created.

--- Generating predictions from 5 fold models ---
Loading model 1/5 from models/kfold_fold_1_best_model.pt...
Loading model 2/5 from models/kfold_fold_2_best_model.pt...
Loading model 3/5 from models/kfold_fold_3_best_model.pt...
Loading model 4/5 from models/kfold_fold_4_best_model.pt...
Loading model 5/5 from models/kfold_fold_5_best_model.pt...

--- Averaging 5 sets of probabilities... ---
Mean probability matrix shape: (11916, 3)
Aggregating window probabilities to sample predictions (using MEAN)...
Aggregated to 1324 final probability vectors.
Loading sample submission file for correct formatting...
Prediction count matches. Creating submission.

Succe