# 03 — Deep Learning: Exploratory LSTM
## HVAC Market Analysis — Metropolitan France (96 departments)

### METHODOLOGICAL DISCLAIMER

This notebook is included for **educational and exploratory purposes**.

With a small dataset (5,376 rows = 56 months × 96 departments), an LSTM network
**is NOT the optimal model**. Classical models (Ridge, LightGBM) are expected
to outperform it. This notebook demonstrates:

1. How to adapt an LSTM to tabular time series data
2. Best practices (lookback, early stopping, normalization)
3. Limitations of deep learning on small-to-medium datasets

**Architecture**:
- LSTM 1 layer, 32 units, dropout 0.3
- 3-month sequences (minimal lookback)
- Loss = HuberLoss (robust to outliers, delta=1.0)
- Optimizer = Adam (lr=0.001)
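
To make the loss choice concrete, here is a minimal NumPy sketch of the Huber loss with `delta=1.0`: quadratic for small errors, linear beyond `delta`. The function name `huber` and the sample residuals are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Element-wise Huber loss (illustrative helper, not from the notebook)."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2                  # used where |r| <= delta
    linear = delta * (r - 0.5 * delta)        # used where |r| > delta
    return np.where(r <= delta, quadratic, linear)

# Made-up residuals spanning both regimes
residuals = np.array([0.5, 1.0, 3.0, 100.0])
losses = huber(residuals)
print(losses)
```

For residuals much larger than `delta`, the loss grows linearly, so on a target measured in raw counts it behaves much like MAE.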

In [1]:
# ============================================================
# IMPORTS
# ============================================================
import sys
sys.path.insert(0, '..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print(f'PyTorch version : {torch.__version__}')
print(f'Device : {"GPU" if torch.cuda.is_available() else "CPU"}')

PyTorch version : 2.10.0+cu128
Device : CPU


---
## 1. Data preparation

In [None]:
# ============================================================
# 1.1 — Load and prepare the data
# ============================================================
TARGET = 'nb_installations_pac'
TRAIN_END = 202406
VAL_END = 202412

df = pd.read_csv('../data/features/hvac_features_dataset.csv')

# Temporal split
df_train = df[df['date_id'] <= TRAIN_END].copy()
df_val = df[(df['date_id'] > TRAIN_END) & (df['date_id'] <= VAL_END)].copy()
df_test = df[df['date_id'] > VAL_END].copy()

# Numeric features (exclude identifiers, metadata, other targets, outlier flags)
EXCLUDE_COLS = {
    'date_id', 'dept', 'dept_name', 'city_ref', 'latitude', 'longitude',
    'n_valid_features', 'pct_valid_features',
    'nb_installations_clim', 'nb_dpe_total', 'nb_dpe_classe_ab',
    'pct_pac', 'pct_clim', 'pct_classe_ab',
}
OUTLIER_PATTERNS = ['_outlier_iqr', '_outlier_zscore', '_outlier_iforest',
                    '_outlier_consensus', '_outlier_score']

feature_cols = [
    c for c in df.columns
    if c not in EXCLUDE_COLS and c != TARGET
    and not any(p in c for p in OUTLIER_PATTERNS)
    and df[c].dtype in [np.float64, np.int64, np.float32, np.int32]
]

# Prepare X and y
X_train, y_train = df_train[feature_cols], df_train[TARGET]
X_val, y_val = df_val[feature_cols], df_val[TARGET]
X_test, y_test = df_test[feature_cols], df_test[TARGET]

# Drop all-NaN columns before imputation (SimpleImputer silently drops them,
# causing shape mismatch when rebuilding arrays)
all_nan_cols = [c for c in feature_cols if X_train[c].isna().all()]
if all_nan_cols:
    print(f'Dropping {len(all_nan_cols)} all-NaN columns: {all_nan_cols}')
    feature_cols = [c for c in feature_cols if c not in all_nan_cols]
    X_train = df_train[feature_cols]
    X_val = df_val[feature_cols]
    X_test = df_test[feature_cols]

# Imputation + normalization
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

X_train_np = scaler.fit_transform(imputer.fit_transform(X_train)).astype(np.float32)
X_val_np = scaler.transform(imputer.transform(X_val)).astype(np.float32)
X_test_np = scaler.transform(imputer.transform(X_test)).astype(np.float32)

y_train_np = y_train.values.astype(np.float32)
y_val_np = y_val.values.astype(np.float32)
y_test_np = y_test.values.astype(np.float32)

print(f'Features: {len(feature_cols)}')
print(f'Train: {X_train_np.shape}, Val: {X_val_np.shape}, Test: {X_test_np.shape}')

In [None]:
# ============================================================
# 1.2 — Create temporal sequences for the LSTM
# ============================================================
# LSTM needs 3D input: (batch, timesteps, features)
# We create sliding windows of size LOOKBACK

LOOKBACK = 3  # 3 months of context

def create_sequences(X, y, lookback=LOOKBACK):
    """Create temporal sequences for the LSTM.
    
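    Input: X (n_samples, n_features), y (n_samples,)
    Output: X_seq (n_sequences, lookback, n_features), y_seq (n_sequences,)
    
    For each row index t >= lookback, the sequence is:
    X[t-lookback : t] -> y[t]  (features up to t-1 predict the target at t)
    
    Note: windows are taken over consecutive rows, so rows must form a single
    chronologically ordered series. With stacked panel data (several
    departments per date_id), a window can span different departments.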
    """
    X_seq, y_seq = [], []
    for i in range(lookback, len(X)):
        X_seq.append(X[i - lookback : i])
        y_seq.append(y[i])
    return np.array(X_seq), np.array(y_seq)

# Training sequences
X_seq_train, y_seq_train = create_sequences(X_train_np, y_train_np)

# For val and test: use the end of the previous set as context
X_for_val = np.vstack([X_train_np[-LOOKBACK:], X_val_np])
y_for_val = np.concatenate([y_train_np[-LOOKBACK:], y_val_np])
X_seq_val, y_seq_val = create_sequences(X_for_val, y_for_val)

X_for_test = np.vstack([X_val_np[-LOOKBACK:], X_test_np])
y_for_test = np.concatenate([y_val_np[-LOOKBACK:], y_test_np])
X_seq_test, y_seq_test = create_sequences(X_for_test, y_for_test)

print(f'Sequences created (lookback={LOOKBACK}):')
print(f'  Train: {X_seq_train.shape} -> {y_seq_train.shape}')
print(f'  Val:   {X_seq_val.shape} -> {y_seq_val.shape}')
print(f'  Test:  {X_seq_test.shape} -> {y_seq_test.shape}')
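
Because the dataset stacks 96 departments, sliding windows taken over consecutive rows can mix rows from different departments. One possible alternative, sketched below on toy data, builds the windows separately within each department. The helper name `create_panel_sequences` and the toy frame are hypothetical; the `dept` and `date_id` column names are the ones listed in `EXCLUDE_COLS`.

```python
import numpy as np
import pandas as pd

def create_panel_sequences(df, feature_cols, target, lookback=3,
                           group_col='dept', time_col='date_id'):
    """Build (n_seq, lookback, n_features) windows within each group."""
    X_seq, y_seq = [], []
    ordered = df.sort_values([group_col, time_col])
    for _, g in ordered.groupby(group_col):
        X = g[feature_cols].to_numpy(dtype=np.float32)
        y = g[target].to_numpy(dtype=np.float32)
        for i in range(lookback, len(g)):
            X_seq.append(X[i - lookback:i])   # periods t-lookback .. t-1
            y_seq.append(y[i])                # target at period t
    return np.array(X_seq), np.array(y_seq)

# Toy panel: 2 departments x 5 months, one feature
toy = pd.DataFrame({
    'dept': ['01'] * 5 + ['02'] * 5,
    'date_id': [202401, 202402, 202403, 202404, 202405] * 2,
    'x': np.arange(10, dtype=float),
    'y': np.arange(10, dtype=float) * 10,
})
X_toy, y_toy = create_panel_sequences(toy, ['x'], 'y', lookback=3)
print(X_toy.shape, y_toy.shape)
```

Each department contributes `n_months - lookback` windows, and no window crosses a department boundary. Scaling and imputation would still need to be fitted on the training period only.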

---
## 2. LSTM Architecture

In [None]:
# ============================================================
# 2.1 — LSTM network definition
# ============================================================
# Intentionally simple architecture to avoid overfitting
#
# Input (batch, lookback=3, n_features)
#   -> LSTM (32 units, 1 layer)
#   -> Dropout (0.3) — regularization
#   -> Linear (32 -> 1) — prediction

class LSTMNet(nn.Module):
    def __init__(self, n_features, hidden_size=32, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=1,        # Single layer (limited dataset)
            batch_first=True,    # Format (batch, seq, features)
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, 1)
    
    def forward(self, x):
        # x shape: (batch, lookback, n_features)
        lstm_out, (h_n, c_n) = self.lstm(x)
        # Take the output of the last timestep
        last_hidden = lstm_out[:, -1, :]  # (batch, hidden_size)
        out = self.dropout(last_hidden)
        out = self.fc(out)                 # (batch, 1)
        return out

# Instantiate the model
n_features = X_seq_train.shape[2]
HIDDEN_SIZE = 32

model = LSTMNet(n_features, HIDDEN_SIZE)
print(model)
print(f'\nParameters: {sum(p.numel() for p in model.parameters()):,}')
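
The parameter count printed above can be checked by hand. In a single-layer `nn.LSTM`, each of the four gates has an input weight matrix, a recurrent weight matrix and two bias vectors; the linear head adds `hidden_size + 1` parameters. A small sketch, where `lstm_param_count` is a hypothetical helper and `n_features=40` is an arbitrary illustrative value rather than the notebook's actual feature count:

```python
def lstm_param_count(n_features, hidden_size=32):
    """Parameters of a 1-layer LSTM followed by Linear(hidden_size, 1)."""
    per_gate = (n_features * hidden_size      # input-to-hidden weights
                + hidden_size * hidden_size   # hidden-to-hidden weights
                + 2 * hidden_size)            # two bias vectors
    lstm = 4 * per_gate                       # input, forget, cell, output gates
    head = hidden_size + 1                    # linear layer weights + bias
    return lstm + head

n_params = lstm_param_count(n_features=40, hidden_size=32)
print(n_params)
```

Comparing this number with the number of training sequences gives a quick sense of how over-parameterized the model is for the available data.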

---
## 3. Training

In [None]:
# ============================================================
# 3.1 — Training configuration
# ============================================================
EPOCHS = 150
BATCH_SIZE = 16
LEARNING_RATE = 0.001
PATIENCE = 20  # Early stopping

# PyTorch DataLoader
train_ds = TensorDataset(
    torch.FloatTensor(X_seq_train),
    torch.FloatTensor(y_seq_train),
)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=False)

# Loss and optimizer
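# Huber: quadratic for |error| <= delta, linear beyond (less outlier-sensitive than MSE).
# y is in raw counts (not scaled), so with delta=1.0 most errors fall in the linear regime.
criterion = nn.HuberLoss(delta=1.0)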
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Learning rate scheduler: reduce LR when validation loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10
)

print(f'Hyperparameters:')
print(f'  Max epochs    : {EPOCHS}')
print(f'  Batch size    : {BATCH_SIZE}')
print(f'  Learning rate : {LEARNING_RATE} (with ReduceLROnPlateau)')
print(f'  Patience (ES) : {PATIENCE}')
print(f'  Hidden size   : {HIDDEN_SIZE}')
print(f'  Lookback      : {LOOKBACK}')
print(f'  Loss function : HuberLoss (delta=1.0)')
print(f'  Gradient clip : max_norm=1.0 (prevents exploding gradients)')

In [None]:
# ============================================================
# 3.2 — Training loop with early stopping + LR scheduling
# ============================================================
# Best practices applied:
# 1. Early stopping (patience=20) — stop when validation loss plateaus
# 2. Best weight restoration — use weights from the best epoch
# 3. Gradient clipping (max_norm=1.0) — prevent exploding gradients in LSTM
# 4. Learning rate scheduling (ReduceLROnPlateau) — adapt LR to loss plateau

train_losses = []
val_losses = []
lr_history = []
best_val_loss = float('inf')
patience_counter = 0
best_state = None

# Validation tensors (no DataLoader, predict in one pass)
X_val_tensor = torch.FloatTensor(X_seq_val)
y_val_tensor = torch.FloatTensor(y_seq_val)

for epoch in range(EPOCHS):
    # --- Training ---
    model.train()
    epoch_loss = 0.0
    n_batches = 0
    
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        y_pred = model(X_batch).squeeze(-1)  # (batch, 1) -> (batch,); also safe for a batch of size 1
        loss = criterion(y_pred, y_batch)
        loss.backward()
        
        # Gradient clipping (prevents exploding gradients in LSTM)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        epoch_loss += loss.item() * len(y_batch)
        n_batches += len(y_batch)
    
    train_loss = epoch_loss / n_batches
    train_losses.append(train_loss)
    
    # --- Validation ---
    model.eval()
    with torch.no_grad():
        val_pred = model(X_val_tensor).squeeze()
        val_loss = criterion(val_pred, y_val_tensor).item()
    val_losses.append(val_loss)
    
    # Learning rate scheduling
    scheduler.step(val_loss)
    current_lr = optimizer.param_groups[0]['lr']
    lr_history.append(current_lr)
    
    # --- Early stopping ---
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
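        # Save best weights. Clone each tensor: state_dict().copy() is shallow
        # and would keep following later in-place optimizer updates.
        best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}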
    else:
        patience_counter += 1
    
    # Log every 20 epochs
    if (epoch + 1) % 20 == 0 or patience_counter >= PATIENCE:
        print(f'  Epoch {epoch+1:3d}/{EPOCHS} — '
              f'Train loss: {train_loss:.4f}, Val loss: {val_loss:.4f} '
              f'(patience: {patience_counter}/{PATIENCE}, lr: {current_lr:.6f})')
    
    if patience_counter >= PATIENCE:
        print(f'\n  Early stopping at epoch {epoch+1}')
        break

# Restore best weights
if best_state is not None:
    model.load_state_dict(best_state)
    print(f'  Best weights restored (val_loss = {best_val_loss:.4f})')
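
One detail worth checking when saving the best weights: `dict.copy()` is shallow, so the copied dictionary still points at the same underlying arrays, and in-place updates change the "snapshot" too. The NumPy sketch below is an analogy only (plain arrays stand in for parameter tensors, and all names are illustrative); it shows why a deep copy gives a stable snapshot.

```python
import copy
import numpy as np

# Stand-in for a model's parameters (illustrative, not a real state_dict)
params = {'weight': np.array([1.0, 2.0]), 'bias': np.array([0.5])}

shallow = params.copy()           # new dict, same array objects
deep = copy.deepcopy(params)      # new dict, independent arrays

# Simulate an optimizer step that updates the parameters in place
params['weight'] += 10.0

print(shallow['weight'])  # follows the in-place update
print(deep['weight'])     # keeps the values from snapshot time
```

In PyTorch the tensors returned by `state_dict()` share storage with the live parameters, so the same reasoning applies; `copy.deepcopy(model.state_dict())` or cloning each tensor gives an independent snapshot.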

In [None]:
# ============================================================
# 3.3 — Learning curves + LR scheduling visualization
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
fig.suptitle('LSTM — Training Diagnostics', fontsize=14)

# Learning curves
axes[0].plot(train_losses, label='Train loss', linewidth=2)
axes[0].plot(val_losses, label='Val loss', linewidth=2)
axes[0].axhline(best_val_loss, color='red', linestyle='--', alpha=0.5,
           label=f'Best val loss = {best_val_loss:.4f}')

best_epoch = val_losses.index(min(val_losses))
axes[0].axvline(best_epoch, color='green', linestyle='--', alpha=0.5,
           label=f'Best epoch = {best_epoch+1}')

axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('HuberLoss')
axes[0].set_title('Learning Curves (Train vs Validation)')
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3)

# Learning rate schedule
axes[1].plot(lr_history, linewidth=2, color='purple')
axes[1].axvline(best_epoch, color='green', linestyle='--', alpha=0.5,
           label=f'Best epoch = {best_epoch+1}')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Learning Rate')
axes[1].set_title('Learning Rate Schedule (ReduceLROnPlateau)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Overfitting gap
gap = (val_losses[best_epoch] - train_losses[best_epoch]) / train_losses[best_epoch] * 100
print(f'At best epoch ({best_epoch+1}):')
print(f'  Train loss: {train_losses[best_epoch]:.4f}')
print(f'  Val loss:   {val_losses[best_epoch]:.4f}')
print(f'  Gap: {gap:.1f}%')
print(f'  Final LR: {lr_history[-1]:.6f} (started at {LEARNING_RATE})')

---
## 4. Evaluation

In [None]:
# ============================================================
# 4.1 — Predictions on validation and test sets
# ============================================================
model.eval()
with torch.no_grad():
    y_pred_val = model(torch.FloatTensor(X_seq_val)).squeeze().numpy()
    y_pred_test = model(torch.FloatTensor(X_seq_test)).squeeze().numpy()

# Clip negative predictions (we predict counts)
y_pred_val = np.clip(y_pred_val, 0, None)
y_pred_test = np.clip(y_pred_test, 0, None)

# Metrics
print('LSTM (EXPLORATORY)')
print('=' * 50)
for name, y_true, y_pred in [('Validation', y_seq_val, y_pred_val),
                               ('Test', y_seq_test, y_pred_test)]:
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f'  {name:12s} : RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.4f}')

In [None]:
# ============================================================
# 4.2 — Predictions vs Actual
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle(f'LSTM — Predictions vs Actual ({TARGET})', fontsize=14)

for ax, name, y_true, y_pred in [
    (axes[0], 'Validation', y_seq_val, y_pred_val),
    (axes[1], 'Test', y_seq_test, y_pred_test),
]:
    ax.plot(range(len(y_true)), y_true, 'b-o', markersize=3, label='Actual', linewidth=1.5)
    ax.plot(range(len(y_pred)), y_pred, 'r--s', markersize=3, label='Predicted', linewidth=1.5)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    ax.set_title(f'{name} (RMSE={rmse:.2f}, R2={r2:.3f})')
    ax.set_xlabel('Temporal index')
    ax.set_ylabel(TARGET)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# 4.3 — Residual analysis
# ============================================================
residuals = y_seq_test - y_pred_test

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('LSTM — Residual Analysis (test set)', fontsize=14)

# Distribution
axes[0].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
axes[0].axvline(0, color='red', linestyle='--')
axes[0].set_title('Residual Distribution')
axes[0].set_xlabel('Residual (actual - predicted)')

# Residuals vs predictions
axes[1].scatter(y_pred_test, residuals, alpha=0.6, s=30)
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_title('Residuals vs Predictions')
axes[1].set_xlabel('Prediction')
axes[1].set_ylabel('Residual')
axes[1].grid(True, alpha=0.3)

# QQ-plot
from scipy import stats
stats.probplot(residuals, dist='norm', plot=axes[2])
axes[2].set_title('QQ-plot (residual normality)')

plt.tight_layout()
plt.show()

### 4.4 — Hyperparameter Sensitivity Analysis (Ablation Study)

To verify that the poor LSTM performance is **not** due to suboptimal hyperparameters,
we systematically test variations of the 3 key parameters:

1. **hidden_size**: Controls model capacity (16, 32, 64)
2. **dropout**: Controls regularization (0.1, 0.3, 0.5)
3. **lookback**: Controls temporal context length (1, 3, 6 months)

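Each configuration is trained from scratch with the same protocol (early stopping, gradient clipping, LR scheduling), on a shorter budget than the main run (100 epochs, patience 15).
If **no configuration** significantly improves performance, hyperparameter tuning is unlikely to be the bottleneck, and **data volume** is the more plausible cause.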
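
The one-factor-at-a-time design described above can also be generated from a baseline dictionary instead of being written by hand. A minimal sketch, in which `one_at_a_time` is a hypothetical helper name and the value lists repeat the ones given above:

```python
def one_at_a_time(baseline, variations):
    """Return the baseline plus configs that each change one parameter."""
    configs = [dict(baseline, name='baseline')]
    for param, values in variations.items():
        for value in values:
            if value == baseline[param]:
                continue  # already covered by the baseline config
            configs.append(dict(baseline, **{param: value},
                                name=f'{param}={value}'))
    return configs

baseline = {'hidden_size': 32, 'dropout': 0.3, 'lookback': 3}
variations = {
    'hidden_size': [16, 32, 64],
    'dropout': [0.1, 0.3, 0.5],
    'lookback': [1, 3, 6],
}
grid = one_at_a_time(baseline, variations)
print(len(grid))
for cfg in grid:
    print(cfg)
```

This design isolates the effect of each parameter but does not explore interactions (for example, a larger hidden size combined with a longer lookback).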

In [None]:
# ============================================================
# 4.4 — Ablation study: systematic HP sensitivity
# ============================================================
# Test 7 configurations: the baseline plus 6 variants, each changing one parameter

configs = [
    {'name': 'hidden=16',  'hidden_size': 16, 'dropout': 0.3, 'lookback': 3},
    {'name': 'hidden=32*', 'hidden_size': 32, 'dropout': 0.3, 'lookback': 3},  # baseline
    {'name': 'hidden=64',  'hidden_size': 64, 'dropout': 0.3, 'lookback': 3},
    {'name': 'drop=0.1',   'hidden_size': 32, 'dropout': 0.1, 'lookback': 3},
    {'name': 'drop=0.5',   'hidden_size': 32, 'dropout': 0.5, 'lookback': 3},
    {'name': 'look=1',     'hidden_size': 32, 'dropout': 0.3, 'lookback': 1},
    {'name': 'look=6',     'hidden_size': 32, 'dropout': 0.3, 'lookback': 6},
]

def train_lstm_config(cfg, X_train_np, y_train_np, X_val_np, y_val_np, 
                      X_test_np, y_test_np, epochs=100, patience=15):
    """Train a single LSTM configuration and return metrics."""
    lookback = cfg['lookback']
    
    # Create sequences with this lookback
    def make_seq(X, y, lb):
        Xs, ys = [], []
        for i in range(lb, len(X)):
            Xs.append(X[i - lb : i])
            ys.append(y[i])
        return np.array(Xs), np.array(ys)
    
    X_s_train, y_s_train = make_seq(X_train_np, y_train_np, lookback)
    
    X_for_v = np.vstack([X_train_np[-lookback:], X_val_np])
    y_for_v = np.concatenate([y_train_np[-lookback:], y_val_np])
    X_s_val, y_s_val = make_seq(X_for_v, y_for_v, lookback)
    
    X_for_t = np.vstack([X_val_np[-lookback:], X_test_np])
    y_for_t = np.concatenate([y_val_np[-lookback:], y_test_np])
    X_s_test, y_s_test = make_seq(X_for_t, y_for_t, lookback)
    
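    if len(X_s_train) < 5:
        # Too few sequences to train: return NaNs (with 'epochs' so the caller can still log it)
        return {'val_rmse': float('nan'), 'test_rmse': float('nan'),
                'val_r2': float('nan'), 'test_r2': float('nan'), 'epochs': 0}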
    
    # Build model
    m = LSTMNet(X_s_train.shape[2], cfg['hidden_size'], cfg['dropout'])
    opt = torch.optim.Adam(m.parameters(), lr=0.001)
    crit = nn.HuberLoss(delta=1.0)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=8, factor=0.5)
    
    ds = TensorDataset(torch.FloatTensor(X_s_train), torch.FloatTensor(y_s_train))
    loader = DataLoader(ds, batch_size=16, shuffle=False)
    
    best_vl = float('inf')
    pat_cnt = 0
    best_st = None
    
    for ep in range(epochs):
        m.train()
        for xb, yb in loader:
            opt.zero_grad()
            pred = m(xb).squeeze(-1)  # (batch, 1) -> (batch,); also safe for a batch of size 1
            loss = crit(pred, yb)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(m.parameters(), max_norm=1.0)
            opt.step()
        
        m.eval()
        with torch.no_grad():
            vp = m(torch.FloatTensor(X_s_val)).squeeze()
            vl = crit(vp, torch.FloatTensor(y_s_val)).item()
        sched.step(vl)
        
        if vl < best_vl:
            best_vl = vl
            pat_cnt = 0
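            # Clone tensors: state_dict().copy() is shallow and would track later updates
            best_st = {k: v.detach().clone() for k, v in m.state_dict().items()}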
        else:
            pat_cnt += 1
        if pat_cnt >= patience:
            break
    
    if best_st is not None:
        m.load_state_dict(best_st)
    
    m.eval()
    with torch.no_grad():
        pv = np.clip(m(torch.FloatTensor(X_s_val)).squeeze().numpy(), 0, None)
        pt = np.clip(m(torch.FloatTensor(X_s_test)).squeeze().numpy(), 0, None)
    
    return {
        'val_rmse': np.sqrt(mean_squared_error(y_s_val, pv)),
        'test_rmse': np.sqrt(mean_squared_error(y_s_test, pt)),
        'val_r2': r2_score(y_s_val, pv),
        'test_r2': r2_score(y_s_test, pt),
        'epochs': ep + 1,
    }

# Run all configurations
print('LSTM ABLATION STUDY')
print('=' * 80)
ablation_results = []
for i, cfg in enumerate(configs):
    print(f'  [{i+1}/{len(configs)}] Training {cfg["name"]}...', end=' ', flush=True)
    res = train_lstm_config(cfg, X_train_np, y_train_np, X_val_np, y_val_np,
                           X_test_np, y_test_np)
    res['config'] = cfg['name']
    ablation_results.append(res)
    print(f'Val RMSE={res["val_rmse"]:.1f}, Test R2={res["test_r2"]:.3f} ({res["epochs"]} epochs)')

df_ablation = pd.DataFrame(ablation_results)
print('\n' + df_ablation[['config', 'val_rmse', 'test_rmse', 'val_r2', 'test_r2']].to_string(index=False))

In [None]:
# ============================================================
# 4.4b — Ablation study visualization
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle('LSTM Ablation Study — Hyperparameter Sensitivity', fontsize=14)

# Val RMSE by configuration
colors = ['steelblue'] * 3 + ['darkorange'] * 2 + ['darkgreen'] * 2
bars = axes[0].bar(range(len(df_ablation)), df_ablation['val_rmse'], color=colors, edgecolor='black')
axes[0].set_xticks(range(len(df_ablation)))
axes[0].set_xticklabels(df_ablation['config'], rotation=45, ha='right')
axes[0].set_ylabel('Validation RMSE')
axes[0].set_title('Val RMSE by configuration (lower = better)')
axes[0].grid(True, alpha=0.3, axis='y')

# Annotate
for bar, val in zip(bars, df_ablation['val_rmse']):
    axes[0].annotate(f'{val:.1f}', xy=(bar.get_x() + bar.get_width()/2, val),
                    ha='center', va='bottom', fontsize=9)

# Test R2 by configuration
bars2 = axes[1].bar(range(len(df_ablation)), df_ablation['test_r2'], color=colors, edgecolor='black')
axes[1].set_xticks(range(len(df_ablation)))
axes[1].set_xticklabels(df_ablation['config'], rotation=45, ha='right')
axes[1].set_ylabel('Test R2')
axes[1].set_title('Test R2 by configuration (higher = better)')
axes[1].axhline(0, color='red', linestyle='--', alpha=0.5, label='R2=0 (mean baseline)')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Best configuration
best_cfg = df_ablation.loc[df_ablation['val_rmse'].idxmin()]
print(f'\nBest configuration: {best_cfg["config"]}')
print(f'  Val RMSE = {best_cfg["val_rmse"]:.1f}, Test R2 = {best_cfg["test_r2"]:.3f}')
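# State the data-volume conclusion only when the results actually support it
if df_ablation['test_r2'].max() < 0.5:
    print('\nConclusion: no configuration achieves acceptable performance (test R2 > 0.5).'
          '\n-> Hyperparameter choice alone does not explain the poor LSTM performance;'
          ' insufficient data is the more likely cause.')
else:
    print('\nConclusion: at least one configuration reaches test R2 >= 0.5;'
          ' hyperparameters matter here and deserve further tuning.')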

---
## 5. Comparison with classical models

Load results from the training pipeline to compare.

In [None]:
# ============================================================
# 5.1 — Load training results
# ============================================================
try:
    df_results = pd.read_csv('../data/models/training_results.csv')
    print('Results loaded from training_results.csv')
    print(df_results.to_string(index=False))
except FileNotFoundError:
    print('File training_results.csv not found.')
    print('Run first: python -m src.pipeline train')

In [None]:
# ============================================================
# 5.2 — Full comparison table
# ============================================================
lstm_rmse_val = np.sqrt(mean_squared_error(y_seq_val, y_pred_val))
lstm_rmse_test = np.sqrt(mean_squared_error(y_seq_test, y_pred_test))
lstm_r2_val = r2_score(y_seq_val, y_pred_val)
lstm_r2_test = r2_score(y_seq_test, y_pred_test)

print('\nFULL COMPARISON')
print('=' * 70)
print(f'{"Model":15s} | {"Val RMSE":>10s} | {"Test RMSE":>10s} | {"Val R2":>10s} | {"Test R2":>10s}')
print('-' * 70)

# Classical models (available only if 5.1 loaded training_results.csv)
if 'df_results' in dir() and df_results is not None:
    for _, row in df_results.iterrows():
        print(f'{row["model"]:15s} | {row.get("val_rmse", np.nan):10.2f} | '
              f'{row.get("test_rmse", np.nan):10.2f} | {row.get("val_r2", np.nan):10.4f} | '
              f'{row.get("test_r2", np.nan):10.4f}')

# The LSTM row is always printed so the table is an actual comparison
print(f'{"LSTM (this nb)":15s} | {lstm_rmse_val:10.2f} | {lstm_rmse_test:10.2f} | '
      f'{lstm_r2_val:10.4f} | {lstm_r2_test:10.4f}')

print('=' * 70)

---
## 6. Conclusions

### LSTM results (96 departments, ~5376 rows):

| Model | Val RMSE | Val R² | Test RMSE | Test R² |
|-------|----------|--------|-----------|---------|
| **LSTM** | 100.10 | 0.091 | 139.84 | -0.699 |
| _LightGBM (ref)_ | 10.75 | 0.990 | 12.10 | 0.987 |
| _Ridge (ref)_ | 16.28 | 0.976 | 21.29 | 0.961 |

- **LSTM performance is very poor** (R² < 0.1 on validation, negative on test), as expected with limited data
- The model essentially fails to generalize — it predicts close to the mean on validation and diverges on test
- Early stopping prevented complete overfitting, but the data volume is fundamentally insufficient for deep learning
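
A negative R² means the model does worse than a constant prediction equal to the mean of the evaluated targets. The sketch below illustrates this on made-up numbers (not results from this notebook), using the definition R² = 1 − SS_res / SS_tot; the function name `r2` is illustrative.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])
mean_pred = np.full_like(y_true, y_true.mean())   # constant mean baseline
biased_pred = y_true + 20.0                       # right shape, wrong level

r2_mean = r2(y_true, mean_pred)       # exactly the baseline
r2_biased = r2(y_true, biased_pred)   # negative: worse than the baseline
print(r2_mean, r2_biased)
```

A level shift between the training and test periods is one way a model can track the shape of the series and still get a strongly negative test R².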

### Lessons learned:
1. Deep learning typically needs **far more data** than is available here: with ~3400 training sequences, the LSTM does not come close to the classical baselines
2. On small-to-medium tabular datasets, **regularized classical models** (Ridge, LightGBM) are vastly superior
3. The LSTM architecture (1 layer, 32 units) was intentionally conservative, but even larger networks would not overcome the data limitation
4. Tabular data with mixed feature types (temporal, economic, geographic) is not the ideal use case for an LSTM, which is better suited to raw sequential signals

### Recommendation:
- Use **LightGBM** as the production model (best R²=0.987 on test)
- Ridge as interpretable fallback (R²=0.961 on test)
- The LSTM might be worth revisiting if the dataset reached >50,000 sequences (e.g., weekly granularity across all 96 departments over 10+ years)
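
As a rough order-of-magnitude check on these figures, the number of sequences available when each department is windowed separately is `n_departments * (n_periods - lookback)`. The helper below is a hypothetical sketch; the weekly scenario assumes 52 weeks per year.

```python
def n_sequences(n_departments, n_periods, lookback=3):
    """Sliding windows available when each department is windowed separately."""
    return n_departments * max(n_periods - lookback, 0)

current = n_sequences(96, 56)            # 56 months of monthly data, all splits
weekly_10y = n_sequences(96, 52 * 10)    # 10 years of weekly data
print(current, weekly_10y)
```

Ten years of weekly data lands just under 50,000 sequences, which is consistent with the "10+ years" estimate.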