# Notebook 3: Predictive Model Training and Validation

**Project:** `PharmaControl-Pro`
**Goal:** Build, train, and validate the predictive 'ML kernel' that will power our MPC controller. This involves defining a sophisticated Transformer-based architecture, creating a custom loss function, and using a systematic approach for hyperparameter tuning.

### Table of Contents
1. [Model Architecture: A Transformer for Time-Series](#1.-Model-Architecture:-A-Transformer-for-Time-Series)
2. [Implementing a Custom Loss Function](#2.-Implementing-a-Custom-Loss-Function)
3. [Hyperparameter Tuning with Optuna](#3.-Hyperparameter-Tuning-with-Optuna)
4. [Training the Final Model](#4.-Training-the-Final-Model)
5. [Model Validation and Baseline Comparison](#5.-Model-Validation-and-Baseline-Comparison)

---
## 1. Model Architecture: A Transformer for Time-Series

The paper mentions a "Transformer-inspired" architecture. We will implement a robust **Encoder-Decoder** model using PyTorch's `nn.Transformer` components. This architecture is well-suited for sequence-to-sequence tasks like ours.

*   **Encoder:** Its job is to read the historical data (the last `L` steps of CMAs and CPPs) and compress this information into a rich, contextualized memory. It uses self-attention to understand the relationships within the historical sequence.
*   **Decoder:** Its job is to generate the future prediction. At each future time step `t` (from 1 to `H`), it looks at the entire encoded memory (via cross-attention) and combines that context with the *planned* control action for that future step (`future_U[t]`) to make a prediction. This structure explicitly models the relationship between future actions and future outcomes.

We will define this model in `src/model_architecture.py`.

In [1]:
%%writefile ../src/model_architecture.py
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    """Injects positional information into the input sequence."""
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

class GranulationPredictor(nn.Module):
    """
    A Transformer-based Encoder-Decoder model for predicting granulation CMAs.
    """
    def __init__(self, cma_features, cpp_features, d_model=64, nhead=4, 
                 num_encoder_layers=2, num_decoder_layers=2, dim_feedforward=256, dropout=0.1):
        super().__init__()
        self.d_model = d_model

        # --- Input Embeddings ---
        self.cma_encoder_embedding = nn.Linear(cma_features, d_model)
        self.cpp_encoder_embedding = nn.Linear(cpp_features, d_model)
        self.cpp_decoder_embedding = nn.Linear(cpp_features, d_model)

        self.pos_encoder = PositionalEncoding(d_model, dropout)
        
        # --- Transformer --- 
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        
        # --- Output Layer ---
        # Maps the decoder output back to the desired number of CMA features
        self.output_linear = nn.Linear(d_model, cma_features)
        
    def forward(self, past_cmas, past_cpps, future_cpps):
        # src: source sequence to the encoder (historical data)
        # tgt: target sequence to the decoder (planned future actions)
        
        # Embed and combine historical inputs for the encoder
        past_cma_emb = self.cma_encoder_embedding(past_cmas)
        past_cpp_emb = self.cpp_encoder_embedding(past_cpps)
        src = self.pos_encoder(past_cma_emb + past_cpp_emb)
        
        # Embed future control actions for the decoder
        tgt = self.pos_encoder(self.cpp_decoder_embedding(future_cpps))
        
        # The decoder needs a target mask to prevent it from seeing future positions
        # when making a prediction at the current position.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        
        # Pass through the transformer
        output = self.transformer(src, tgt, tgt_mask=tgt_mask)
        
        # Final linear layer to get CMA predictions
        prediction = self.output_linear(output)
        
        return prediction

Overwriting ../src/model_architecture.py


--- 
## 2. Implementing a Custom Loss Function

The paper mentions a custom loss function designed to "prevent fitting irrelevant short-time dynamics." This implies that errors further out in the prediction horizon are more important than immediate, transient errors.

We can implement this by creating a **weighted Mean Squared Error (MSE)** loss. We'll assign a weight to each of the `H` steps in the horizon, with weights increasing over time. This forces the model to prioritize long-term accuracy.

In [2]:
import torch
import torch.nn as nn

class WeightedHorizonMSELoss(nn.Module):
    """Calculates MSE with a linearly increasing weight over the horizon."""
    def __init__(self, horizon: int, start_weight: float = 0.5, end_weight: float = 1.5):
        super().__init__()
        # Create a weight tensor of shape (1, horizon, 1) for broadcasting
        weights = torch.linspace(start_weight, end_weight, horizon).view(1, -1, 1)
        self.register_buffer('weights', weights)
        
    def forward(self, prediction, target):
        # prediction and target shape: (batch_size, horizon, features)
        loss = (prediction - target) ** 2
        weighted_loss = loss * self.weights
        return torch.mean(weighted_loss)

---
## 3. Hyperparameter Tuning with Optuna

Choosing the right hyperparameters (like learning rate, model size, etc.) is critical for performance. Manually guessing these values is inefficient. We will use **Optuna**, a powerful hyperparameter optimization framework, to systematically search for the best combination.

We'll define an `objective` function that takes a trial, builds a model with the suggested hyperparameters, trains it for a few epochs, and returns the validation loss. Optuna will then intelligently choose the next set of hyperparameters to try.

In [3]:
import optuna
import torch.optim as optim
import joblib
import pandas as pd
import os, sys
sys.path.append('..')  # Add parent directory to Python path
from src.model_architecture import GranulationPredictor
from src.dataset import GranulationDataset
from torch.utils.data import DataLoader

# --- Load Pre-processed Data (from Notebook 2) ---
DATA_DIR = '../data'
df_train = pd.read_csv(os.path.join(DATA_DIR, 'train_data.csv'))
df_val = pd.read_csv(os.path.join(DATA_DIR, 'validation_data.csv'))
CMA_COLS = ['d50', 'lod']
CPP_COLS = ['spray_rate', 'air_flow', 'carousel_speed', 'specific_energy', 'froude_number_proxy']
LOOKBACK = 36
HORIZON = 72
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def objective(trial):
    """Optuna objective function for hyperparameter tuning."""
    # --- Hyperparameters to Tune ---
    d_model = trial.suggest_categorical('d_model', [32, 64, 128])
    nhead = trial.suggest_categorical('nhead', [2, 4, 8])
    num_encoder_layers = trial.suggest_int('num_encoder_layers', 1, 3)
    num_decoder_layers = trial.suggest_int('num_decoder_layers', 1, 3)
    lr = trial.suggest_float('lr', 1e-5, 1e-3, log=True)
    dropout = trial.suggest_float('dropout', 0.1, 0.3)
    
    # --- Model, Loss, Optimizer ---
    model = GranulationPredictor(
        cma_features=len(CMA_COLS),
        cpp_features=len(CPP_COLS),
        d_model=d_model, nhead=nhead,
        num_encoder_layers=num_encoder_layers,
        num_decoder_layers=num_decoder_layers,
        dropout=dropout
    ).to(DEVICE)
    
    criterion = WeightedHorizonMSELoss(horizon=HORIZON).to(DEVICE)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    # --- DataLoaders ---
    train_dataset = GranulationDataset(df_train, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
    val_dataset = GranulationDataset(df_val, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)

    # --- Training & Validation Loop (abbreviated for tuning) ---
    NUM_EPOCHS_TUNE = 5 # Use fewer epochs for faster tuning
    for epoch in range(NUM_EPOCHS_TUNE):
        model.train()
        for batch in train_loader:
            past_cmas, past_cpps, future_cpps, future_cmas_target = [b.to(DEVICE) for b in batch]
            optimizer.zero_grad()
            prediction = model(past_cmas, past_cpps, future_cpps)
            loss = criterion(prediction, future_cmas_target)
            loss.backward()
            optimizer.step()
            
    # --- Final Validation ---
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            past_cmas, past_cpps, future_cpps, future_cmas_target = [b.to(DEVICE) for b in batch]
            prediction = model(past_cmas, past_cpps, future_cpps)
            val_loss += criterion(prediction, future_cmas_target).item()
    
    return val_loss / len(val_loader)

# --- Run the Optuna Study ---
# Note: This can take a long time. For a real run, use n_trials=50 or more.
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20) # Use 20 trials for demonstration

print("Best trial found:")
best_trial = study.best_trial
print(f"  Value: {best_trial.value}")
print("  Params: ")
for key, value in best_trial.params.items():
    print(f"    {key}: {value}")

[I 2025-08-11 04:22:01,146] A new study created in memory with name: no-name-c684ecfa-4828-4fc5-8fbe-007c0b004710
[I 2025-08-11 04:30:10,517] Trial 0 finished with value: 0.11015949748894747 and parameters: {'d_model': 32, 'nhead': 2, 'num_encoder_layers': 1, 'num_decoder_layers': 2, 'lr': 1.2821130237925689e-05, 'dropout': 0.17258260070762318}. Best is trial 0 with value: 0.11015949748894747.
[I 2025-08-11 04:53:59,368] Trial 1 finished with value: 0.008515874800436637 and parameters: {'d_model': 64, 'nhead': 8, 'num_encoder_layers': 2, 'num_decoder_layers': 3, 'lr': 0.0001892126403831818, 'dropout': 0.2931766576970832}. Best is trial 1 with value: 0.008515874800436637.
[I 2025-08-11 04:55:54,443] Trial 2 finished with value: 0.006501479471540626 and parameters: {'d_model': 32, 'nhead': 2, 'num_encoder_layers': 2, 'num_decoder_layers': 1, 'lr': 0.0005239860415912874, 'dropout': 0.16107836352578545}. Best is trial 2 with value: 0.006501479471540626.
[I 2025-08-11 05:04:49,936] Trial 3 

Best trial found:
  Value: 0.0030854975133586455
  Params: 
    d_model: 128
    nhead: 8
    num_encoder_layers: 3
    num_decoder_layers: 2
    lr: 0.000838934349288504
    dropout: 0.10122835808149047


In [4]:
study.best_trial
best_trial = study.best_trial

#best_trial=optuna.create_study(direction='minimize').best_trial
Best trial found:
  Value: 0.0030854975133586455
  Params: 
    d_model: 128
    nhead: 8
    num_encoder_layers: 3
    num_decoder_layers: 2
    lr: 0.000838934349288504
    dropout: 0.10122835808149047

--- 
## 4. Training the Final Model

Now that we have the best hyperparameters from our Optuna study, we will train a new model from scratch using these parameters for a larger number of epochs to ensure it converges properly. We will also implement early stopping to prevent overfitting.

In [6]:
# --- Final Model Training ---
from tqdm import tqdm
import copy

BEST_HPARAMS = best_trial.params
MODEL_SAVE_PATH = os.path.join(DATA_DIR, 'best_predictor_model.pth')
NUM_EPOCHS_FINAL = 10   # Use 50 epochs for final training
PATIENCE = 5 # For early stopping

# Re-create datasets and loaders
train_dataset = GranulationDataset(df_train, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
val_dataset = GranulationDataset(df_val, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)

final_model = GranulationPredictor(
    cma_features=len(CMA_COLS),
    cpp_features=len(CPP_COLS),
    d_model=BEST_HPARAMS['d_model'], nhead=BEST_HPARAMS['nhead'],
    num_encoder_layers=BEST_HPARAMS['num_encoder_layers'],
    num_decoder_layers=BEST_HPARAMS['num_decoder_layers'],
    dropout=BEST_HPARAMS['dropout']
).to(DEVICE)

criterion = WeightedHorizonMSELoss(horizon=HORIZON).to(DEVICE)
optimizer = optim.Adam(final_model.parameters(), lr=BEST_HPARAMS['lr'])

best_val_loss = float('inf')
epochs_no_improve = 0
best_model_wts = copy.deepcopy(final_model.state_dict())

for epoch in range(NUM_EPOCHS_FINAL):
    final_model.train()
    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS_FINAL} [T]")
    for batch in pbar:
        past_cmas, past_cpps, future_cpps, future_cmas_target = [b.to(DEVICE) for b in batch]
        optimizer.zero_grad()
        prediction = final_model(past_cmas, past_cpps, future_cpps)
        loss = criterion(prediction, future_cmas_target)
        loss.backward()
        optimizer.step()
        pbar.set_postfix({'loss': loss.item()})
        
    # Validation phase
    final_model.eval()
    current_val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            past_cmas, past_cpps, future_cpps, future_cmas_target = [b.to(DEVICE) for b in batch]
            prediction = final_model(past_cmas, past_cpps, future_cpps)
            current_val_loss += criterion(prediction, future_cmas_target).item()
    avg_val_loss = current_val_loss / len(val_loader)
    print(f"Epoch {epoch+1} - Validation Loss: {avg_val_loss:.6f}")
    
    # Early stopping logic
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        best_model_wts = copy.deepcopy(final_model.state_dict())
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
    
    if epochs_no_improve >= PATIENCE:
        print(f"Early stopping triggered after {epoch+1} epochs.")
        break

# Load best model weights and save
final_model.load_state_dict(best_model_wts)
torch.save(final_model.state_dict(), MODEL_SAVE_PATH)
print(f"Final model saved to {MODEL_SAVE_PATH}")

Epoch 1/10 [T]: 100%|██████████| 82/82 [06:23<00:00,  4.67s/it, loss=0.0549]


Epoch 1 - Validation Loss: 0.135291


Epoch 2/10 [T]: 100%|██████████| 82/82 [06:27<00:00,  4.73s/it, loss=0.0112]


Epoch 2 - Validation Loss: 0.004649


Epoch 3/10 [T]: 100%|██████████| 82/82 [06:29<00:00,  4.75s/it, loss=0.00797]


Epoch 3 - Validation Loss: 0.004767


Epoch 4/10 [T]: 100%|██████████| 82/82 [06:39<00:00,  4.87s/it, loss=0.00625]


Epoch 4 - Validation Loss: 0.003481


Epoch 5/10 [T]: 100%|██████████| 82/82 [06:46<00:00,  4.96s/it, loss=0.00608]


Epoch 5 - Validation Loss: 0.003709


Epoch 6/10 [T]: 100%|██████████| 82/82 [06:24<00:00,  4.69s/it, loss=0.00469]


Epoch 6 - Validation Loss: 0.003002


Epoch 7/10 [T]: 100%|██████████| 82/82 [04:04<00:00,  2.98s/it, loss=0.00417]


Epoch 7 - Validation Loss: 0.005967


Epoch 8/10 [T]: 100%|██████████| 82/82 [01:50<00:00,  1.35s/it, loss=0.00521]


Epoch 8 - Validation Loss: 0.003038


Epoch 9/10 [T]: 100%|██████████| 82/82 [01:50<00:00,  1.35s/it, loss=0.00363]


Epoch 9 - Validation Loss: 0.003246


Epoch 10/10 [T]: 100%|██████████| 82/82 [01:55<00:00,  1.41s/it, loss=0.00387]


Epoch 10 - Validation Loss: 0.002896
Final model saved to ../data/best_predictor_model.pth


---
## 5. Model Validation and Baseline Comparison

The final step is to perform an unbiased evaluation of our trained model on the held-out test set. We will visualize its predictions and calculate the Mean Absolute Error (MAE), comparing it to a simpler baseline model to prove the value of our complex architecture.

In [None]:
import matplotlib.pyplot as plt

# Load test data
df_test = pd.read_csv(os.path.join(DATA_DIR, 'test_data.csv'))
test_dataset = GranulationDataset(df_test, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False) # Batch size 1 for individual plots

# Load scalers for inverse transform
scalers = joblib.load(os.path.join(DATA_DIR, 'scalers.joblib'))

final_model.eval()
with torch.no_grad():
    # Get one sample from the test set
    past_cmas, past_cpps, future_cpps, future_cmas_target = next(iter(test_loader))
    past_cmas, past_cpps, future_cpps = [b.to(DEVICE) for b in [past_cmas, past_cpps, future_cpps]]
    
    prediction_scaled = final_model(past_cmas, past_cpps, future_cpps).squeeze(0).cpu().numpy()
    target_scaled = future_cmas_target.squeeze(0).cpu().numpy()

# Inverse transform to original scale for plotting
prediction_unscaled = np.zeros_like(prediction_scaled)
target_unscaled = np.zeros_like(target_scaled)
for i, col in enumerate(CMA_COLS):
    prediction_unscaled[:, i] = scalers[col].inverse_transform(prediction_scaled[:, i].reshape(-1, 1)).flatten()
    target_unscaled[:, i] = scalers[col].inverse_transform(target_scaled[:, i].reshape(-1, 1)).flatten()

# --- Plotting --- 
fig, axes = plt.subplots(len(CMA_COLS), 1, figsize=(15, 8), sharex=True)
fig.suptitle('Model Prediction vs. Ground Truth on Test Set', fontsize=16)
for i, col in enumerate(CMA_COLS):
    axes[i].plot(target_unscaled[:, i], label='Ground Truth', color='blue', linestyle='--')
    axes[i].plot(prediction_unscaled[:, i], label='Prediction', color='red')
    axes[i].set_ylabel(col)
    axes[i].legend()
    axes[i].grid(True)
axes[-1].set_xlabel('Time Steps into Horizon')
plt.show()

# TODO: Implement and train a baseline MLP model and compare its final test MAE here.

FileNotFoundError: [Errno 2] No such file or directory: '../data/test_data.csv'