# Notebook 4: Training and Validation
## Properly Training NAM for Time Series Data

**Learning Objectives:**
- Implement walk-forward validation for time series
- Train with appropriate callbacks and monitoring
- Avoid data leakage in validation
- Optimize hyperparameters

---

## Time Series Validation Strategy

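Standard k-fold cross-validation assigns rows to folds without regard to time, so a model can end up trained on observations that come after the ones it is validated on. For time series we must respect temporal order to avoid this kind of data leakage.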
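
To make the leakage concrete, here is a minimal sketch that compares one fold of an ordinary 5-fold split with a chronological split. The 20-row index array and the `has_leakage` helper are made up for illustration and assume rows are already sorted by date.

```python
import numpy as np

# Illustrative only: 20 rows, assumed to be sorted by date
indices = np.arange(20)

# One fold of ordinary 5-fold CV: validate on a middle block,
# train on everything else, including rows that come later in time
val_kfold = indices[8:12]
train_kfold = np.concatenate([indices[:8], indices[12:]])

# Chronological split: validation rows all come after the training rows
train_chrono, val_chrono = indices[:16], indices[16:]

def has_leakage(train_idx, val_idx):
    """True if any training row is later than the earliest validation row."""
    return bool(train_idx.max() > val_idx.min())

print("k-fold fold leaks future data:      ", has_leakage(train_kfold, val_kfold))
print("chronological split leaks future data:", has_leakage(train_chrono, val_chrono))
```

The k-fold fold trains on rows 12 to 19 while validating on rows 8 to 11, which is exactly the situation walk-forward validation is meant to rule out.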

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

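# Load data and sort chronologically: the index-based splits below
# rely on row order matching time order
data = pd.read_csv('data/processed/mmm_data_with_features.csv')
data['Date'] = pd.to_datetime(data['Date'])
data = data.sort_values('Date', kind='stable').reset_index(drop=True)
print(f"Data shape: {data.shape}")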

# Prepare features and target
exclude_cols = ['Date', 'GMV', 'product_category', 'product_subcategory']
feature_cols = [col for col in data.columns if col not in exclude_cols]
X = data[feature_cols].values
y = data['GMV'].values

print(f"Features: {X.shape}")
print(f"Target: {y.shape}")

## Walk-Forward Validation

Walk-forward validation uses an expanding training window:
- Train on past data, validate on future data
- Gradually expand training set
- Never use future information for past predictions

In [None]:
def create_walk_forward_splits(X, y, n_splits=5, test_size=0.2):
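    '''
    Create expanding-window (walk-forward) validation splits for time series.

    Assumes rows are in chronological order. Returns a list of
    (train_idx, val_idx) tuples; each validation window starts where
    its training window ends.
    '''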
    n_samples = len(X)
    test_samples = int(n_samples * test_size)

    splits = []
    min_train_size = n_samples // (n_splits + 2)

    for i in range(n_splits):
        train_end = min_train_size * (i + 2)
        val_start = train_end
        val_end = min(val_start + test_samples, n_samples)

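        # val_end is already capped at n_samples, so check for an empty
        # validation window instead (e.g. when test_size is very small)
        if val_end <= val_start:
            break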

        train_idx = np.arange(0, train_end)
        val_idx = np.arange(val_start, val_end)

        splits.append((train_idx, val_idx))

        print(f"Split {i+1}: Train [0:{train_end}], Val [{val_start}:{val_end}]")

    return splits

# Create validation splits
splits = create_walk_forward_splits(X, y, n_splits=3)
print(f"\nCreated {len(splits)} walk-forward splits")
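
For comparison, scikit-learn ships an expanding-window splitter, `TimeSeriesSplit`, that could stand in for the hand-written function above. The sketch below runs it on a 12-row placeholder array (made up for illustration), again assuming rows are in chronological order. Its window sizes differ from those of `create_walk_forward_splits`, so the two are not interchangeable without checking the resulting indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data: 12 chronologically ordered rows, 2 features (illustrative only)
X_demo = np.arange(24).reshape(12, 2)

tscv = TimeSeriesSplit(n_splits=3)
sk_splits = [(train_idx, val_idx) for train_idx, val_idx in tscv.split(X_demo)]

for i, (train_idx, val_idx) in enumerate(sk_splits, start=1):
    print(f"Split {i}: Train [0:{train_idx[-1] + 1}], "
          f"Val [{val_idx[0]}:{val_idx[-1] + 1}]")
    # Every validation row comes strictly after every training row
    assert train_idx.max() < val_idx.min()
```

Whichever splitter is used, the assertion in the loop is a cheap guard worth keeping: it fails loudly if a split ever places training rows after validation rows.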

## Training with Callbacks

We'll use callbacks to:
- Stop early if no improvement (prevent overfitting)
- Reduce learning rate when stuck (escape plateaus)
- Save the best model
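
To see what the patience setting does, the following is a simplified, pure-Python model of a patience-based stopping rule. It is a sketch of the idea rather than Keras's implementation (it ignores options such as `min_delta`), and the loss values and the `simulate_early_stopping` helper are invented for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a patience-based stopping rule."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 2
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=3)
print(f"Stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

Training stops three epochs after the last improvement, and restoring the best weights corresponds to going back to `best_epoch` rather than keeping the weights from the final epoch.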

In [None]:
def train_nam_model(model, X_train, y_train, X_val, y_val, epochs=100):
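    '''
    Train a NAM model with early stopping, learning-rate reduction,
    and checkpointing of the best weights.

    Scalers are fit on the training window only and then applied to the
    validation window, so no validation statistics leak into training.
    Pass a freshly built model for each split; reusing one model carries
    its weights over from the previous split.

    Returns (history, scaler_X, scaler_y).
    '''

    # Scale features and target using training-window statistics only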
    scaler_X = StandardScaler()
    scaler_y = StandardScaler()

    X_train_scaled = scaler_X.fit_transform(X_train)
    X_val_scaled = scaler_X.transform(X_val)

    y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()
    y_val_scaled = scaler_y.transform(y_val.reshape(-1, 1)).flatten()

    # Callbacks
    callbacks = [
        keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=15,
            restore_best_weights=True,
            verbose=1
        ),
        keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=5,
            min_lr=1e-6,
            verbose=1
        ),
        keras.callbacks.ModelCheckpoint(
            'models/best_nam.keras',
            monitor='val_loss',
            save_best_only=True,
            verbose=0
        )
    ]

    # Train: the model is fed one (n_samples, 1) array per feature
    history = model.fit(
        [X_train_scaled[:, i:i+1] for i in range(X_train_scaled.shape[1])],
        y_train_scaled,
        validation_data=(
            [X_val_scaled[:, i:i+1] for i in range(X_val_scaled.shape[1])],
            y_val_scaled
        ),
        epochs=epochs,
        batch_size=32,
        callbacks=callbacks,
        verbose=0
    )

    return history, scaler_X, scaler_y

# Example training (simplified)
print("Training would proceed with walk-forward validation...")
print("Each split would be trained and evaluated separately")
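
The placeholder above leaves the per-split loop unwritten. One way it might look is sketched below. Because the NAM builder from the earlier notebooks is not shown here, a least-squares fit (`fit_linear_stand_in`, a hypothetical helper) stands in for building and training a fresh model, and the data and splits are synthetic. In the notebook, that call would presumably be replaced by building a new NAM and calling `train_nam_model`, with the splits coming from `create_walk_forward_splits`.

```python
import numpy as np

def fit_linear_stand_in(X_train, y_train):
    """Hypothetical stand-in for building and training a fresh NAM on one split."""
    A = np.column_stack([X_train, np.ones(len(X_train))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda X: np.column_stack([X, np.ones(len(X))]) @ coef

def run_walk_forward(X, y, splits):
    """Fit and score one model per split, using training-window statistics only."""
    results = []
    for fold, (train_idx, val_idx) in enumerate(splits, start=1):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        # Scaling statistics come from the training window only
        x_mean, x_std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
        y_mean, y_std = y_train.mean(), y_train.std() + 1e-8

        predict = fit_linear_stand_in((X_train - x_mean) / x_std,
                                      (y_train - y_mean) / y_std)

        # Map predictions back to original units before scoring
        y_pred = predict((X_val - x_mean) / x_std) * y_std + y_mean
        mae = float(np.mean(np.abs(y_val - y_pred)))
        results.append({'fold': fold, 'n_train': len(train_idx), 'MAE': mae})
    return results

# Synthetic demo data (made up for illustration, not the MMM dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=100)
demo_splits = [(np.arange(0, 40), np.arange(40, 60)),
               (np.arange(0, 60), np.arange(60, 80)),
               (np.arange(0, 80), np.arange(80, 100))]

fold_results = run_walk_forward(X_demo, y_demo, demo_splits)
for row in fold_results:
    print(row)
```

Two details carry over regardless of the model: scaling statistics are recomputed inside each split from the training window alone, and predictions are converted back to original units before any metric is calculated.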

## Performance Metrics

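For business decisions, we need interpretable metrics, computed on GMV in its original units rather than on the scaled target:
- **R^2**: Share of the variance in GMV explained by the model
- **MAPE**: Average absolute error as a percentage of actual GMV
- **SMAPE**: Symmetric variant of MAPE that divides by the average of actual and predicted values
- **MAE**: Average absolute error in dollars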
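
Written out, the metrics computed in this section can be defined as follows, with $y_i$ the observed GMV, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the observed values, and $n$ the number of validation rows. The index sets $S$ and $T$ are notation introduced here for the rows with non-zero denominators, matching the masks used in `calculate_metrics`.

$$
\begin{aligned}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\mathrm{MAE} &= \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \\
\mathrm{MAPE} &= \frac{100}{|S|} \sum_{i \in S} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \quad S = \{\, i : y_i \neq 0 \,\} \\
\mathrm{SMAPE} &= \frac{100}{|T|} \sum_{i \in T} \frac{\left| y_i - \hat{y}_i \right|}{\left( |y_i| + |\hat{y}_i| \right) / 2}, \quad T = \{\, i : |y_i| + |\hat{y}_i| \neq 0 \,\}
\end{aligned}
$$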

In [None]:
def calculate_metrics(y_true, y_pred):
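    '''
    Calculate R^2, MAE, MAPE, and SMAPE.

    Pass values in original GMV units so that MAE is in dollars.
    '''
    # Accept lists and pandas Series as well as NumPy arrays
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    # MAPE: rows where y_true is zero are skipped to avoid division by zero
    mask = y_true != 0
    mape = np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

    # SMAPE: rows where both y_true and y_pred are zero are skipped
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    mask = denominator != 0
    smape = np.mean(np.abs(y_true[mask] - y_pred[mask]) / denominator[mask]) * 100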

    return {
        'R^2': r2,
        'MAE': mae,
        'MAPE': mape,
        'SMAPE': smape
    }

# Example metrics
y_true_example = np.array([100, 200, 300, 400, 500])
y_pred_example = np.array([110, 190, 310, 380, 520])

metrics = calculate_metrics(y_true_example, y_pred_example)
print("Example Metrics:")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.2f}")

## Key Takeaways

### Training Best Practices:
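1. **Walk-forward validation** keeps future observations out of the training set
2. **Early stopping** limits overfitting by halting once validation loss stops improving
3. **Learning rate reduction** helps convergence when validation loss plateaus
4. **Feature scaling**, fit on the training window only, is essential for neural networks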

### Validation Strategy:
- Never use future data to predict the past
- Expand training set progressively
- Evaluate on truly unseen data

### Next Steps:
In Notebook 5, we'll generate comprehensive diagnostic visualizations.