# 04 — Results and Model Comparison
## HVAC Market Analysis — Metropolitan France (96 departments)

**Objective**: Synthesize and analyze the results of all trained models.

**Plan**:
1. Load results
2. Final comparison table
3. Prediction visualizations
4. Feature importance (SHAP)
5. Error analysis
6. Final recommendations

In [1]:
# ============================================================
# IMPORTS
# ============================================================
import sys
sys.path.insert(0, '..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from pathlib import Path
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

from config.settings import config
print('Imports OK')

Imports OK


---
## 1. Load training results

Results are saved by `python -m src.pipeline train` in `data/models/`.

In [None]:
# ============================================================
# 1.1 — Load the CSV summary
# ============================================================
results_path = Path('../data/models/training_results.csv')
if results_path.exists():
    df_results = pd.read_csv(results_path)
    print(f'Results loaded: {len(df_results)} models')
    display(df_results)
else:
    print('No results found. Run: python -m src.pipeline train')

In [None]:
# ============================================================
# 1.2 — Load dataset and recreate splits
# ============================================================
TARGET = 'nb_installations_pac'
TRAIN_END = 202406
VAL_END = 202412

df = pd.read_csv('../data/features/hvac_features_dataset.csv')

# Time axis (built before splitting so each split also carries 'date')
df['date'] = pd.to_datetime(df['date_id'].astype(str), format='%Y%m')

# Chronological splits on the YYYYMM integer key
df_train = df[df['date_id'] <= TRAIN_END].copy()
df_val = df[(df['date_id'] > TRAIN_END) & (df['date_id'] <= VAL_END)].copy()
df_test = df[df['date_id'] > VAL_END].copy()

print(f'Train: {len(df_train)} | Val: {len(df_val)} | Test: {len(df_test)}')

In [None]:
# ============================================================
# 1.3 — Load saved models and re-predict
# ============================================================
# Use the same EXCLUDE_COLS as train.py so features match saved models
# Outlier flags are excluded to prevent data leakage

models_dir = Path('../data/models')

EXCLUDE_COLS = {
    'date_id', 'dept', 'dept_name', 'city_ref', 'latitude', 'longitude',
    'n_valid_features', 'pct_valid_features',
}
OTHER_TARGETS = {
    'nb_installations_clim', 'nb_dpe_total', 'nb_dpe_classe_ab',
}
OUTLIER_PATTERNS = ['_outlier_iqr', '_outlier_zscore', '_outlier_iforest',
                    '_outlier_consensus', '_outlier_score']

feature_cols = [
    c for c in df.columns
    if c not in EXCLUDE_COLS and c not in OTHER_TARGETS and c != TARGET
    and not any(p in c for p in OUTLIER_PATTERNS)
    and df[c].dtype in [np.float64, np.int64, np.float32, np.int32]
]

# Prepare X and y
X_train, y_train = df_train[feature_cols], df_train[TARGET]
X_val, y_val = df_val[feature_cols], df_val[TARGET]
X_test, y_test = df_test[feature_cols], df_test[TARGET]

# Drop all-NaN columns before imputation (SimpleImputer silently drops them,
# causing shape mismatch when rebuilding the DataFrame)
all_nan_cols = [c for c in feature_cols if X_train[c].isna().all()]
if all_nan_cols:
    print(f'Dropping {len(all_nan_cols)} all-NaN columns: {all_nan_cols}')
    feature_cols = [c for c in feature_cols if c not in all_nan_cols]
    X_train = df_train[feature_cols]
    X_val = df_val[feature_cols]
    X_test = df_test[feature_cols]

# Imputation and scaling
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=feature_cols, index=X_train.index)
X_val_imp = pd.DataFrame(imputer.transform(X_val), columns=feature_cols, index=X_val.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=feature_cols, index=X_test.index)

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_imp), columns=feature_cols, index=X_train.index)
X_val_scaled = pd.DataFrame(scaler.transform(X_val_imp), columns=feature_cols, index=X_val.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_imp), columns=feature_cols, index=X_test.index)

# Load and predict with each model
predictions = {}

# Ridge (uses scaled data)
ridge_path = models_dir / 'ridge_model.pkl'
if ridge_path.exists():
    with open(ridge_path, 'rb') as f:
        ridge_model = pickle.load(f)
    predictions['Ridge'] = {
        'val': np.clip(ridge_model.predict(X_val_scaled), 0, None),
        'test': np.clip(ridge_model.predict(X_test_scaled), 0, None),
    }
    print('Ridge loaded')

# LightGBM (uses imputed data, NOT scaled)
lgb_path = models_dir / 'lightgbm_model.pkl'
if lgb_path.exists():
    with open(lgb_path, 'rb') as f:
        lgb_model = pickle.load(f)
    predictions['LightGBM'] = {
        'val': np.clip(lgb_model.predict(X_val_imp), 0, None),
        'test': np.clip(lgb_model.predict(X_test_imp), 0, None),
    }
    print('LightGBM loaded')

print(f'\nModels loaded: {list(predictions.keys())}')
print(f'Features: {len(feature_cols)}')

---
## 2. Final comparison table
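
The metrics in this section can be written out from their definitions. The sketch below is only an illustration in plain NumPy: `regression_metrics` is a hypothetical helper, the toy values are made up, and the notebook itself relies on `sklearn.metrics`. It also shows why MAPE is restricted to rows with a strictly positive actual value, since the ratio is undefined when the actual value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R2 and masked MAPE (%) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))

    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float('nan')

    # MAPE only over rows where the actual value is strictly positive
    mask = y_true > 0
    mape = float(np.mean(np.abs(err[mask] / y_true[mask])) * 100) if mask.any() else float('nan')

    return {'RMSE': rmse, 'MAE': mae, 'R2': r2, 'MAPE (%)': mape}

# Toy values (made up for illustration, not project data)
toy_metrics = regression_metrics([0, 10, 20, 40], [2, 8, 25, 40])
print(toy_metrics)
```

The first toy row has an actual value of 0, so it contributes to RMSE, MAE and R2 but is left out of the MAPE average.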

In [None]:
# ============================================================
# 2.1 — Detailed metrics per model
# ============================================================
comparison = []

for model_name, preds in predictions.items():
    row = {'Model': model_name}
    for split, y_true_arr, y_pred_arr in [
        ('Val', y_val.values, preds['val']),
        ('Test', y_test.values, preds['test']),
    ]:
        row[f'{split} RMSE'] = np.sqrt(mean_squared_error(y_true_arr, y_pred_arr))
        row[f'{split} MAE'] = mean_absolute_error(y_true_arr, y_pred_arr)
        row[f'{split} R2'] = r2_score(y_true_arr, y_pred_arr)
        # MAPE
        mask = y_true_arr > 0
        if mask.any():
            row[f'{split} MAPE (%)'] = np.mean(
                np.abs((y_true_arr[mask] - y_pred_arr[mask]) / y_true_arr[mask])
            ) * 100
    comparison.append(row)

df_comp = pd.DataFrame(comparison).round(4)

print('FINAL MODEL COMPARISON')
print('=' * 80)
print(df_comp.to_string(index=False))
print('=' * 80)

In [None]:
# ============================================================
# 2.2 — Comparative metric chart
# ============================================================
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('Model Comparison — Key Metrics', fontsize=14)

for ax, metric_pair, title in zip(axes,
    [('Val RMSE', 'Test RMSE'), ('Val MAE', 'Test MAE'), ('Val R2', 'Test R2')],
    ['RMSE (lower = better)', 'MAE (lower = better)', 'R2 (higher = better)']):
    
    x = np.arange(len(df_comp))
    width = 0.35
    
    bars1 = ax.bar(x - width/2, df_comp[metric_pair[0]], width, 
                    label='Validation', color='steelblue')
    bars2 = ax.bar(x + width/2, df_comp[metric_pair[1]], width,
                    label='Test', color='darkorange')
    
    ax.set_title(title)
    ax.set_xticks(x)
    ax.set_xticklabels(df_comp['Model'])
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    # Add value annotations
    for bar in list(bars1) + list(bars2):
        h = bar.get_height()
        if not np.isnan(h):
            ax.annotate(f'{h:.2f}', xy=(bar.get_x() + bar.get_width()/2, h),
                       ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.show()

---
## 3. Prediction visualizations

In [None]:
# ============================================================
# 3.1 — Overlaid predictions on the test set
# ============================================================
fig, ax = plt.subplots(figsize=(16, 7))

# Actual values
ax.plot(range(len(y_test)), y_test.values, 'ko-', markersize=4, 
        label='Actual', linewidth=2, zorder=5)

# Predictions from each model
colors = {'Ridge': 'blue', 'LightGBM': 'green', 'Prophet': 'purple'}
markers = {'Ridge': 's', 'LightGBM': '^', 'Prophet': 'D'}

for model_name, preds in predictions.items():
    rmse = np.sqrt(mean_squared_error(y_test.values, preds['test']))
    # Pass style as keyword arguments: a format string built from color[0]
    # would read 'p' (purple) as a pentagon marker and 'g' (gray) as green
    ax.plot(range(len(y_test)), preds['test'],
            color=colors.get(model_name, 'gray'),
            marker=markers.get(model_name, 'o'),
            linestyle='--', markersize=3, linewidth=1.5, alpha=0.8,
            label=f'{model_name} (RMSE={rmse:.2f})')

ax.set_title(f'Predictions vs Actual — Test Set ({TARGET})', fontsize=14)
ax.set_xlabel('Temporal index (month x department)')
ax.set_ylabel(TARGET)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# 3.2 — Scatter plot: Predicted vs Actual (best model)
# ============================================================
# Identify the best model
best_model_name = df_comp.sort_values('Val RMSE').iloc[0]['Model']
best_preds = predictions[best_model_name]

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle(f'{best_model_name} — Predicted vs Actual', fontsize=14)

for ax, name, y_true, y_pred in [
    (axes[0], 'Validation', y_val.values, best_preds['val']),
    (axes[1], 'Test', y_test.values, best_preds['test']),
]:
    ax.scatter(y_true, y_pred, alpha=0.6, s=30, color='steelblue')
    
    # y = x line (perfect prediction)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    ax.plot(lims, lims, 'r--', linewidth=2, label='Perfect prediction')
    
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    ax.set_title(f'{name} (R2={r2:.3f}, RMSE={rmse:.2f})')
    ax.set_xlabel('Actual')
    ax.set_ylabel('Predicted')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## 4. Feature Importance

In [None]:
# ============================================================
# 4.1 — Feature importance comparison
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
fig.suptitle('Top 15 Features by Model', fontsize=14)

# Ridge: absolute coefficients
if 'Ridge' in predictions and hasattr(ridge_model, 'coef_'):
    imp_ridge = pd.Series(np.abs(ridge_model.coef_), index=feature_cols).sort_values(ascending=False)
    imp_ridge.head(15).iloc[::-1].plot(kind='barh', ax=axes[0], color='steelblue')
    axes[0].set_title('Ridge — |Coefficients|')

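# LightGBM: feature_importances_ follows the model's importance_type
# (the scikit-learn wrapper defaults to 'split' counts, not gain)
if 'LightGBM' in predictions and hasattr(lgb_model, 'feature_importances_'):
    imp_type = getattr(lgb_model, 'importance_type', 'split')
    imp_lgb = pd.Series(lgb_model.feature_importances_, index=feature_cols).sort_values(ascending=False)
    imp_lgb.head(15).iloc[::-1].plot(kind='barh', ax=axes[1], color='darkgreen')
    axes[1].set_title(f'LightGBM ({imp_type} importance)')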

for ax in axes:
    ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# 4.2 — SHAP analysis (LightGBM)
# ============================================================
# SHAP shows the impact of each feature on EACH individual prediction
# More informative than global gain importance

try:
    import shap
    
    explainer = shap.TreeExplainer(lgb_model)
    shap_values = explainer.shap_values(X_val_imp)
    
    # Summary plot: each point = one prediction
    # Horizontal position = impact on prediction
    # Color = feature value (red = high, blue = low)
    fig, ax = plt.subplots(figsize=(12, 8))
    shap.summary_plot(shap_values, X_val_imp, max_display=20, show=False)
    plt.title('SHAP — Feature Impact (LightGBM, validation set)', fontsize=12)
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print('SHAP not available. Install with: pip install shap')

---
## 5. Error analysis

In [None]:
# ============================================================
# 5.1 — Residual analysis by model
# ============================================================
model_names = list(predictions.keys())
n_models = len(model_names)

# squeeze=False always returns a 2D axes array, even with a single model
fig, axes = plt.subplots(n_models, 2, figsize=(14, 4 * n_models), squeeze=False)
fig.suptitle('Residual Analysis (test set)', fontsize=14)

for i, model_name in enumerate(model_names):
    residuals = y_test.values - predictions[model_name]['test']
    
    # Distribution
    axes[i, 0].hist(residuals, bins=25, edgecolor='black', alpha=0.7)
    axes[i, 0].axvline(0, color='red', linestyle='--', linewidth=2)
    axes[i, 0].set_title(f'{model_name} — Residual Distribution')
    axes[i, 0].set_xlabel('Residual (actual - predicted)')
    
    # Residuals vs predictions
    axes[i, 1].scatter(predictions[model_name]['test'], residuals, alpha=0.5, s=25)
    axes[i, 1].axhline(0, color='red', linestyle='--', linewidth=2)
    axes[i, 1].set_title(f'{model_name} — Residuals vs Predictions')
    axes[i, 1].set_xlabel('Prediction')
    axes[i, 1].set_ylabel('Residual')
    axes[i, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# 5.2 — Error by department (best model)
# ============================================================
best_preds_test = predictions[best_model_name]['test']
df_test_with_preds = df_test.copy()
df_test_with_preds['prediction'] = best_preds_test
df_test_with_preds['residual'] = df_test_with_preds[TARGET] - df_test_with_preds['prediction']
df_test_with_preds['abs_error'] = df_test_with_preds['residual'].abs()

col_name = 'dept_name' if 'dept_name' in df_test.columns else 'dept'
error_by_dept = df_test_with_preds.groupby(col_name).agg({
    'abs_error': 'mean',
    'residual': ['mean', 'std'],
    TARGET: 'mean',
}).round(2)
error_by_dept.columns = ['MAE', 'Mean bias', 'Residual std', f'{TARGET} mean']
error_by_dept = error_by_dept.sort_values('MAE', ascending=False)

print(f'Error by department ({best_model_name}) — Top 20:')
error_by_dept.head(20)

In [None]:
# ============================================================
# 5.3 — Error by month (seasonal error patterns)
# ============================================================
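# Derive the calendar month from the YYYYMM key if the dataset has no 'month' column
if 'month' not in df_test_with_preds.columns:
    df_test_with_preds['month'] = df_test_with_preds['date_id'] % 100

error_by_month = df_test_with_preds.groupby('month').agg({
    'abs_error': 'mean',
    'residual': 'mean',
}).round(2)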

month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

fig, ax = plt.subplots(figsize=(12, 5))
bars = ax.bar(range(len(error_by_month)), error_by_month['abs_error'], 
              color='steelblue', edgecolor='black')
ax.set_xticks(range(len(error_by_month)))
ax.set_xticklabels([month_labels[int(m) - 1] for m in error_by_month.index])
ax.set_title(f'{best_model_name} — Mean Error by Month (test)', fontsize=14)
ax.set_ylabel('MAE')
ax.grid(True, alpha=0.3, axis='y')

# Add value annotations
for bar in bars:
    h = bar.get_height()
    ax.annotate(f'{h:.1f}', xy=(bar.get_x() + bar.get_width()/2, h),
               ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

---
## 6. Final recommendations

In [None]:
# ============================================================
# 6.1 — Summary
# ============================================================
print('=' * 70)
print('FINAL SUMMARY — Phase 4 ML Modeling')
print('=' * 70)
print(f'\nTarget variable: {TARGET}')
print(f'Dataset: {len(df)} rows x {len(df.columns)} columns')
print(f'Departments: {df["dept"].nunique()}')
print(f'Split: Train {len(df_train)} | Val {len(df_val)} | Test {len(df_test)}')
print(f'\nResults:')
print('-' * 70)
print(f'{"Model":15s} | {"Val RMSE":>10s} | {"Test RMSE":>10s} | {"Val R2":>10s} | {"Rank":>10s}')
print('-' * 70)

sorted_comp = df_comp.sort_values('Val RMSE')
for rank, (_, row) in enumerate(sorted_comp.iterrows(), 1):
    medal = {1: '1st', 2: '2nd', 3: '3rd', 4: '4th'}.get(rank, f'{rank}th')
    print(f'{row["Model"]:15s} | {row["Val RMSE"]:10.2f} | {row["Test RMSE"]:10.2f} | '
          f'{row["Val R2"]:10.4f} | {medal:>10s}')

print('-' * 70)
print(f'\nBest model: {best_model_name}')
print('=' * 70)

---
## Conclusions and recommendations

### Model hierarchy (96 departments, ~5376 rows):

| Rank | Model | Val R² | Test R² | Strengths | Weaknesses |
|------|-------|--------|---------|-----------|------------|
| 1 | **LightGBM** | 0.990 | 0.987 | Captures non-linearities, best overall | Requires careful regularization |
| 2 | **Ridge** | 0.976 | 0.961 | Robust, interpretable, fast | Linear only |
| 3 | **Prophet** | 0.616 | 0.530 | Native seasonality decomposition | Per-department training, limited data per series |
| 4 | **LSTM** | 0.091 | -0.699 | Flexible architecture | Insufficient data, poor generalization |

### Most important features (consensus across 3 methods):
1. **Temporal lags** (lag_1m of target) — dominant predictor, strong auto-correlation (~0.85)
2. **Rolling means** (3/6/12 months) — noise smoothing, captures momentum
3. **Temperature / HDD / CDD** — direct physical impact on HVAC demand
4. **Household confidence** — proxy for investment intention in home equipment
5. **Total DPE volume** — real estate activity indicator (volume scaler)
6. **Economic indicators** (IPI C28, business climate) — industry-level context
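
As an illustration of the first two feature families, lag and rolling-mean columns can be built per department with pandas. This is a minimal sketch on made-up values: `lag_1m` is the name used above, while `rolling_mean_3m` and the choice to shift before rolling (so the current month's target never enters its own feature) are assumptions, not a description of the project's feature pipeline.

```python
import pandas as pd

# Toy panel: 2 departments x 5 months (values made up for illustration)
toy = pd.DataFrame({
    'dept': ['01'] * 5 + ['02'] * 5,
    'date_id': [202401, 202402, 202403, 202404, 202405] * 2,
    'nb_installations_pac': [10, 12, 11, 15, 14, 100, 90, 95, 105, 110],
})
toy = toy.sort_values(['dept', 'date_id']).reset_index(drop=True)

grouped = toy.groupby('dept')['nb_installations_pac']

# Lag of one month: previous month's value within the same department
toy['lag_1m'] = grouped.shift(1)

# Rolling mean over the 3 previous months; shifting first keeps the
# current month's target out of its own feature
toy['rolling_mean_3m'] = grouped.transform(lambda s: s.shift(1).rolling(3).mean())

print(toy)
```

Grouping by `dept` before shifting matters: without it, the first month of one department would inherit the last month of the previous one.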

### Key findings:
- **No overfitting detected**: LightGBM R² gap = 0.5%, Ridge R² gap = 1.2% — both HEALTHY
- **CV stability confirmed**: LightGBM CoV = 4.7%, Ridge CoV = 9.5% — both STABLE across temporal folds
- **GridSearchCV validated**: the manually chosen hyperparameters were the best of the 144 combinations searched
- **Prophet limitation**: per-department training with only ~36 months per series is insufficient
- **LSTM limitation**: an ablation study (7 configs) indicates that the issue is data volume rather than tuning
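
The CV stability figure can be reproduced in spirit with a coefficient of variation (standard deviation divided by mean) of a per-fold score. The sketch below assumes expanding-window temporal folds and RMSE as the score; the data are synthetic, and `expanding_window_folds` and `fit_ridge` are hypothetical helpers written in plain NumPy, not the project's training code.

```python
import numpy as np

def expanding_window_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs where each test block follows its training data."""
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        test_end = (k + 1) * block if k < n_folds else n_samples
        yield np.arange(train_end), np.arange(train_end, test_end)

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

# Synthetic, time-ordered data (illustration only, not the project dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

fold_rmse = []
for train_idx, test_idx in expanding_window_folds(len(y), n_folds=5):
    coef, intercept = fit_ridge(X[train_idx], y[train_idx])
    pred = X[test_idx] @ coef + intercept
    fold_rmse.append(float(np.sqrt(np.mean((y[test_idx] - pred) ** 2))))

fold_rmse = np.array(fold_rmse)
cov_pct = fold_rmse.std(ddof=1) / fold_rmse.mean() * 100
print(f'Fold RMSE: {np.round(fold_rmse, 3)} | CoV = {cov_pct:.1f}%')
```

A low CoV means the score changes little from one temporal fold to the next; the threshold at which a model counts as stable is a project convention rather than a statistical rule.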

### Error analysis insights:
- Largest errors occur in high-volume departments (Île-de-France), because absolute error scales with volume
- Seasonal error pattern: slightly higher errors in autumn (Oct-Nov) when heat pump demand peaks
- Residuals are approximately normally distributed and centered near zero (no systematic bias)
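
To separate "large error" from "large department", MAE can be compared with each department's average volume. The following sketch uses made-up numbers and hypothetical column names (`actual`, `prediction`, `mae_pct_of_volume`); it only illustrates the idea that a department can have a much larger absolute error and still a smaller relative one.

```python
import pandas as pd

# Toy test-set rows (values made up): one small and one large department
toy = pd.DataFrame({
    'dept': ['A', 'A', 'A', 'B', 'B', 'B'],
    'actual': [10.0, 12.0, 8.0, 1000.0, 1200.0, 800.0],
    'prediction': [11.0, 11.0, 9.0, 950.0, 1260.0, 840.0],
})
toy['residual'] = toy['actual'] - toy['prediction']
toy['abs_error'] = toy['residual'].abs()

by_dept = toy.groupby('dept').agg(
    mae=('abs_error', 'mean'),
    mean_bias=('residual', 'mean'),
    mean_actual=('actual', 'mean'),
)

# Relative error: MAE as a share of the department's average volume
by_dept['mae_pct_of_volume'] = by_dept['mae'] / by_dept['mean_actual'] * 100

print(by_dept.round(2))
```

In this toy case department B has 50 times the MAE of department A but half its relative error, which is the pattern to look for before concluding that high-volume departments are poorly predicted.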

### Improvement suggestions:
- **Ensemble** Ridge + LightGBM (weighted average) — may capture complementary patterns
- **Prediction confidence intervals** via quantile regression or conformal prediction
- **Error-weighted retraining**: give higher weight to high-volume departments where absolute error matters most
- **Regional grouping**: analyze prediction quality by region (AURA, IDF, etc.) for deployment priorities
- **MAPE-based evaluation**: report the MAPE already computed in section 2.1 alongside RMSE for a percentage-error perspective
- **Monthly model retraining** with new DPE data to capture market evolution
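
The ensemble suggestion could start from something as simple as a weighted average whose weight is chosen on the validation set. The sketch below is one possible approach on synthetic predictions; `best_blend_weight` is a hypothetical helper, and the noise levels of the two stand-in models are arbitrary.

```python
import numpy as np

def best_blend_weight(y_val, pred_a, pred_b, grid=None):
    """Return (w, rmse) minimizing validation RMSE of w*pred_a + (1-w)*pred_b."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    best_w, best_rmse = None, np.inf
    for w in grid:
        blend = w * pred_a + (1.0 - w) * pred_b
        rmse = float(np.sqrt(np.mean((y_val - blend) ** 2)))
        if rmse < best_rmse:
            best_w, best_rmse = float(w), rmse
    return best_w, best_rmse

# Synthetic stand-ins for validation targets and two models' predictions
rng = np.random.default_rng(42)
y_val_toy = rng.uniform(50, 500, size=200)
pred_lgb_toy = y_val_toy + rng.normal(scale=10, size=200)    # lower-noise model
pred_ridge_toy = y_val_toy + rng.normal(scale=20, size=200)  # higher-noise model

w, blend_rmse = best_blend_weight(y_val_toy, pred_lgb_toy, pred_ridge_toy)
rmse_lgb = float(np.sqrt(np.mean((y_val_toy - pred_lgb_toy) ** 2)))
print(f'Best weight on first model: {w:.2f} | blend RMSE {blend_rmse:.2f} vs {rmse_lgb:.2f}')
```

With this notebook's variables the call would look like `best_blend_weight(y_val.values, predictions['LightGBM']['val'], predictions['Ridge']['val'])`. The selected weight should then be applied unchanged to the test predictions, so that the test set stays out of the tuning loop.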

### Production recommendations:
- **Recommended model**: LightGBM (best performance, good balance of accuracy and robustness)
- **Interpretability fallback**: Ridge (coefficients directly show feature impact)
- **Retraining frequency**: monthly with new DPE data
- **Next step**: Full feature review with SHAP analysis (Notebook 05)