# 02 — ML Modeling: Ridge, LightGBM, Prophet
## HVAC Market Analysis — Metropolitan France (96 departments)

**Objective**: Train and compare 3 ML models to predict heat pump installations.

**Models**:
- **Ridge Regression** (Tier 1) — Robust baseline, L2-regularized linear regression
- **LightGBM** (Tier 2) — Gradient boosting, captures non-linearities
- **Prophet** (Tier 1) — Time series with external regressors

**Temporal split**:
- Train: 2021-07 -> 2024-06
- Validation: 2024-07 -> 2024-12
- Test: 2025-01 -> 2025-12

**Target variable**: `nb_installations_pac` (number of DPE energy performance certificates that mention a heat pump)
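
A quick sanity check for a split like the one above can be sketched as follows. It assumes `date_id` is an integer in YYYYMM form, as in this notebook. The helper name `check_temporal_split` and the toy frame are illustrative and not part of the project code.

```python
import pandas as pd

def check_temporal_split(train, val, test, date_col="date_id"):
    """Return True if the three windows are non-empty and strictly ordered in time."""
    parts = [train, val, test]
    if any(len(p) == 0 for p in parts):
        return False
    return bool(
        train[date_col].max() < val[date_col].min()
        and val[date_col].max() < test[date_col].min()
    )

# Toy panel: 2 hypothetical departments, monthly rows from 2024-01 to 2025-06
months = [y * 100 + m for y in (2024, 2025) for m in range(1, 13)][:18]
toy = pd.DataFrame({"date_id": months * 2, "dept": ["01"] * 18 + ["02"] * 18})

train = toy[toy["date_id"] <= 202406]
val = toy[(toy["date_id"] > 202406) & (toy["date_id"] <= 202412)]
test = toy[toy["date_id"] > 202412]
split_ok = check_temporal_split(train, val, test)
```

Because YYYYMM integers sort in calendar order, plain integer comparisons are enough here; no datetime conversion is needed for the split itself.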

In [None]:
# ============================================================
# IMPORTS
# ============================================================
import sys
sys.path.insert(0, '..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import TimeSeriesSplit
import lightgbm as lgb

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

from config.settings import config
print('Imports OK')

---
## 1. Data loading and preparation

In [None]:
# ============================================================
# 1.1 — Load the engineered features dataset
# ============================================================
# This dataset contains 90+ columns: base features + lags + rolling + interactions
# + SITADEL (construction) + INSEE Filosofi (socioeconomic reference)

df = pd.read_csv('../data/features/hvac_features_dataset.csv')
print(f'Dataset: {df.shape[0]} rows x {df.shape[1]} columns')
print(f'Period: {df["date_id"].min()} -> {df["date_id"].max()}')
print(f'Departments: {df["dept"].nunique()}')
df.head(3)

In [None]:
# ============================================================
# 1.2 — Define target variable and temporal split
# ============================================================
TARGET = 'nb_installations_pac'

# Split dates (YYYYMM format)
TRAIN_END = 202406   # Last training date
VAL_END = 202412     # Last validation date

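# Temporal split (respects chronology -> no data leakage)
# Rows are sorted by date first so that row-based tools such as TimeSeriesSplit
# see the observations in time order.
df = df.sort_values(['date_id', 'dept']).reset_index(drop=True)
df_train = df[df['date_id'] <= TRAIN_END].copy()
df_val = df[(df['date_id'] > TRAIN_END) & (df['date_id'] <= VAL_END)].copy()
df_test = df[df['date_id'] > VAL_END].copy()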

print(f'Train: {len(df_train)} rows ({df_train["date_id"].min()} -> {df_train["date_id"].max()})')
print(f'Val:   {len(df_val)} rows ({df_val["date_id"].min()} -> {df_val["date_id"].max()})')
print(f'Test:  {len(df_test)} rows ({df_test["date_id"].min()} -> {df_test["date_id"].max()})')

In [None]:
# ============================================================
# 1.3 — Separate features (X) and target (y)
# ============================================================
# Columns to exclude from training (identifiers, metadata, other targets)
# Outlier flags are also excluded to prevent data leakage

EXCLUDE_COLS = {
    'date_id', 'dept', 'dept_name', 'city_ref', 'latitude', 'longitude',
    'n_valid_features', 'pct_valid_features',
    # Other targets (we predict nb_installations_pac)
    'nb_installations_clim', 'nb_dpe_total', 'nb_dpe_classe_ab',
    'pct_pac', 'pct_clim', 'pct_classe_ab',
}

OUTLIER_PATTERNS = ['_outlier_iqr', '_outlier_zscore', '_outlier_iforest',
                    '_outlier_consensus', '_outlier_score']

feature_cols = [
    c for c in df.columns
    if c not in EXCLUDE_COLS and c != TARGET
    and not any(p in c for p in OUTLIER_PATTERNS)
]
# Keep only numeric columns
feature_cols = [c for c in feature_cols if df[c].dtype in [np.float64, np.int64, np.float32, np.int32]]

X_train, y_train = df_train[feature_cols], df_train[TARGET]
X_val, y_val = df_val[feature_cols], df_val[TARGET]
X_test, y_test = df_test[feature_cols], df_test[TARGET]

print(f'Selected features: {len(feature_cols)}')
print(f'X_train: {X_train.shape}, y_train: {y_train.shape}')

In [None]:
# ============================================================
# 1.4 — NaN imputation and standardization
# ============================================================
# NaNs come from lag features (start of each series) and rolling windows
# Strategy: median imputation (robust to outliers)

imputer = SimpleImputer(strategy='median')
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=feature_cols, index=X_train.index
)
X_val_imp = pd.DataFrame(
    imputer.transform(X_val), columns=feature_cols, index=X_val.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=feature_cols, index=X_test.index
)

# Standardization (for Ridge — tree-based models don't need it)
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train_imp), columns=feature_cols, index=X_train.index
)
X_val_scaled = pd.DataFrame(
    scaler.transform(X_val_imp), columns=feature_cols, index=X_val.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test_imp), columns=feature_cols, index=X_test.index
)

print(f'NaN after imputation: {X_train_imp.isna().sum().sum()}')
print('Data ready for training!')

---
## 2. Ridge Regression (Baseline)

**Why Ridge?**
- L2 regularization -> stable even with correlated features (lags, rolling)
- Very robust on small-to-medium datasets
- Interpretable: coefficients indicate each feature's impact

We select the best alpha via temporal cross-validation.
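
Because the data is a panel (many departments per month), one way to keep temporal cross-validation strictly chronological is to split on the unique dates and map each fold back to rows. The sketch below illustrates this on synthetic data. The helper `panel_time_folds` and every variable in it are hypothetical names, and the scaler is placed inside a pipeline so that it is refit on each training fold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def panel_time_folds(dates, n_splits=3):
    """Yield (train_idx, val_idx) row positions, splitting on unique dates."""
    dates = np.asarray(dates)
    unique_dates = np.sort(np.unique(dates))
    for tr, va in TimeSeriesSplit(n_splits=n_splits).split(unique_dates):
        train_idx = np.flatnonzero(np.isin(dates, unique_dates[tr]))
        val_idx = np.flatnonzero(np.isin(dates, unique_dates[va]))
        yield train_idx, val_idx

# Synthetic panel (illustration only): 5 departments x 24 months, 3 features.
# Rows are ordered department by department, i.e. NOT sorted by date.
rng = np.random.default_rng(0)
dates = np.tile(np.arange(24), 5)
X = pd.DataFrame(rng.normal(size=(120, 3)), columns=["f1", "f2", "f3"])
y = 2.0 * X["f1"] - X["f2"] + rng.normal(scale=0.1, size=120)

cv_rmse = {}
for alpha in [0.1, 1.0, 10.0]:
    fold_rmse = []
    for tr, va in panel_time_folds(dates):
        # The scaler is refit on each training fold, so validation rows never influence it
        pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
        pipe.fit(X.iloc[tr], y.iloc[tr])
        pred = pipe.predict(X.iloc[va])
        fold_rmse.append(np.sqrt(mean_squared_error(y.iloc[va], pred)))
    cv_rmse[alpha] = float(np.mean(fold_rmse))

best_alpha_sketch = min(cv_rmse, key=cv_rmse.get)
```

Splitting on dates rather than rows means every validation month lies after every training month, whatever the row order of the DataFrame.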

In [None]:
# ============================================================
# 2.1 — Hyperparameter selection (alpha) via temporal CV
# ============================================================
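# TimeSeriesSplit splits on row position, so it only respects chronology if the
# rows are sorted by date_id. The scaler above was also fit on the full training
# set, so each fold's validation rows slightly influence its scaling.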

alphas = [0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0]
tscv = TimeSeriesSplit(n_splits=3)

results_alpha = []
for alpha in alphas:
    rmses = []
    for train_idx, val_idx in tscv.split(X_train_scaled):
        model = Ridge(alpha=alpha)
        model.fit(X_train_scaled.iloc[train_idx], y_train.iloc[train_idx])
        y_pred = model.predict(X_train_scaled.iloc[val_idx])
        rmse = np.sqrt(mean_squared_error(y_train.iloc[val_idx], y_pred))
        rmses.append(rmse)
    results_alpha.append({
        'alpha': alpha,
        'rmse_mean': np.mean(rmses),
        'rmse_std': np.std(rmses),
    })

df_alpha = pd.DataFrame(results_alpha)
best_alpha = df_alpha.loc[df_alpha['rmse_mean'].idxmin(), 'alpha']

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
ax.errorbar(df_alpha['alpha'], df_alpha['rmse_mean'], 
            yerr=df_alpha['rmse_std'], fmt='-o', capsize=5)
ax.axvline(best_alpha, color='red', linestyle='--', label=f'Best alpha = {best_alpha}')
ax.set_xscale('log')
ax.set_xlabel('Alpha (log scale)')
ax.set_ylabel('RMSE (temporal CV)')
ax.set_title('Ridge — Alpha selection via temporal cross-validation')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

print(f'\nBest alpha: {best_alpha} (CV RMSE = {df_alpha["rmse_mean"].min():.2f})')

In [None]:
# ============================================================
# 2.2 — Final Ridge training
# ============================================================
ridge_model = Ridge(alpha=best_alpha)
ridge_model.fit(X_train_scaled, y_train)

# Predictions (clipped to 0 — no negative counts)
y_pred_val_ridge = np.clip(ridge_model.predict(X_val_scaled), 0, None)
y_pred_test_ridge = np.clip(ridge_model.predict(X_test_scaled), 0, None)

# Metrics
print('RIDGE REGRESSION')
print('=' * 50)
for name, y_true, y_pred in [('Validation', y_val, y_pred_val_ridge), 
                               ('Test', y_test, y_pred_test_ridge)]:
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f'  {name:12s} : RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.4f}')

In [None]:
# ============================================================
# 2.3 — Ridge feature importance (absolute coefficients)
# ============================================================
importance_ridge = pd.Series(
    np.abs(ridge_model.coef_), index=feature_cols
).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10, 8))
importance_ridge.head(20).iloc[::-1].plot(kind='barh', ax=ax, color='steelblue')
ax.set_title('Ridge — Top 20 Features (|coefficient|)', fontsize=14)
ax.set_xlabel('Importance (|coef|)')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

---
## 3. LightGBM (Gradient Boosting)

**Why LightGBM?**
- Captures non-linear interactions between features
- Natively handles NaN, so explicit imputation is optional (this notebook nevertheless trains it on the imputed matrices `X_*_imp`)
- Strong regularization to avoid overfitting (max_depth=4, num_leaves=15)
- Early stopping on validation

In [None]:
# ============================================================
# 3.1 — LightGBM training with early stopping
# ============================================================
# Constrained hyperparameters for moderate dataset size
lgb_params = {
    'max_depth': 4,              # Shallow trees
    'num_leaves': 15,            # Few leaves
    'min_child_samples': 20,     # Well-populated leaves
    'reg_alpha': 0.1,            # L1 regularization
    'reg_lambda': 0.1,           # L2 regularization
    'learning_rate': 0.05,       # Slow learning
    'n_estimators': 200,         # Max 200 trees
    'subsample': 0.8,            # Bagging fraction
    'subsample_freq': 1,         # Resample every iteration (subsample is ignored when this is 0)
    'verbose': -1,
    'random_state': 42,
}

lgb_model = lgb.LGBMRegressor(**lgb_params)

# Training with early stopping (stop if val doesn't improve)
lgb_model.fit(
    X_train_imp, y_train,
    eval_set=[(X_val_imp, y_val)],
    callbacks=[
        lgb.early_stopping(stopping_rounds=20, verbose=False),
        lgb.log_evaluation(period=0),
    ],
)

# Predictions
y_pred_val_lgb = np.clip(lgb_model.predict(X_val_imp), 0, None)
y_pred_test_lgb = np.clip(lgb_model.predict(X_test_imp), 0, None)

# Metrics (validation also drove early stopping, so its scores are optimistic; prefer test)
print(f'LightGBM — Best iteration: {lgb_model.best_iteration_} / 200')
print('=' * 50)
for name, y_true, y_pred in [('Validation', y_val, y_pred_val_lgb), 
                               ('Test', y_test, y_pred_test_lgb)]:
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f'  {name:12s} : RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.4f}')

In [None]:
# ============================================================
# 3.2 — LightGBM feature importance (gain-based)
# ============================================================
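# Gain importance measures the total loss reduction brought by each feature.
# LGBMRegressor.feature_importances_ defaults to split counts, so gain is
# requested explicitly from the underlying booster.

importance_lgb = pd.Series(
    lgb_model.booster_.feature_importance(importance_type='gain'), index=feature_cols
).sort_values(ascending=False)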

fig, ax = plt.subplots(figsize=(10, 8))
importance_lgb.head(20).iloc[::-1].plot(kind='barh', ax=ax, color='darkgreen')
ax.set_title('LightGBM — Top 20 Features (gain)', fontsize=14)
ax.set_xlabel('Importance (gain)')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# 3.3 — SHAP analysis (LightGBM interpretability)
# ============================================================
# SHAP shows the impact of each feature on EACH individual prediction
# (more informative than global gain importance)

import shap

explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_val_imp)

fig, ax = plt.subplots(figsize=(10, 8))
shap.summary_plot(shap_values, X_val_imp, max_display=20, show=False)
plt.title('SHAP — Feature impact on LightGBM predictions', fontsize=12)
plt.tight_layout()
plt.show()

---
## 4. Prophet (Time series)

**Why Prophet?**
- Additive model: `y(t) = trend + seasonality + regressors + noise`
- Automatically captures annual seasonality
- Accepts external regressors (weather, household confidence)
- Trained **per department** (independent series)
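
To make the additive structure concrete, the sketch below simulates a monthly series as trend + seasonality + regressor + noise and recovers the components with ordinary least squares. This is a deliberately simplified illustration with made-up numbers, not Prophet's fitting procedure: Prophet uses a piecewise-linear trend with changepoints and a higher-order Fourier seasonality.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                                  # 48 monthly time steps
regressor = rng.normal(size=48)                    # stand-in for e.g. a weather variable
trend = 100.0 + 0.5 * t
seasonality = 10.0 * np.sin(2 * np.pi * t / 12)    # 12-month cycle
y = trend + seasonality + 4.0 * regressor + rng.normal(scale=0.5, size=48)

# Design matrix: intercept, linear trend, one Fourier pair, external regressor
A = np.column_stack([
    np.ones(len(t)),
    t,
    np.sin(2 * np.pi * t / 12),
    np.cos(2 * np.pi * t / 12),
    regressor,
])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope, sin_amp, cos_amp, beta_reg = coef
```

With only a few years of monthly data per department, each added term (seasonal harmonics, regressors) consumes a noticeable share of the available observations, which is one reason per-series models can struggle.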

In [None]:
# ============================================================
# 4.1 — Prophet training per department
# ============================================================
from prophet import Prophet
import logging
logging.getLogger('prophet').setLevel(logging.WARNING)
logging.getLogger('cmdstanpy').setLevel(logging.WARNING)

# Regressors to use
REGRESSORS = ['temp_mean', 'hdd_sum', 'cdd_sum', 'confiance_menages', 'ipi_hvac_c28']
REGRESSORS = [r for r in REGRESSORS if r in df.columns]

def to_prophet_df(data, target=TARGET, regressors=REGRESSORS):
    """Convert a DataFrame to Prophet format (ds, y + regressors)."""
    date_str = data['date_id'].astype(str)
    ds = pd.to_datetime(date_str.str[:4] + '-' + date_str.str[4:6] + '-01')
    pdf = pd.DataFrame({'ds': ds, 'y': data[target].values})
    for reg in regressors:
        if reg in data.columns:
            pdf[reg] = data[reg].values
    # Impute NaN
    for reg in regressors:
        if reg in pdf.columns:
            pdf[reg] = pdf[reg].ffill().bfill().fillna(0)
    return pdf.reset_index(drop=True)

# Train one model per department (every department with at least 12 training rows)
departments = sorted(df_train['dept'].unique())
prophet_results = {}

for dept in departments:
    train_dept = df_train[df_train['dept'] == dept]
    val_dept = df_val[df_val['dept'] == dept]
    test_dept = df_test[df_test['dept'] == dept]
    
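    # Skip departments with too little history or with no rows to forecast
    if len(train_dept) < 12 or val_dept.empty or test_dept.empty:
        continue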
    
    pdf_train = to_prophet_df(train_dept)
    pdf_val = to_prophet_df(val_dept)
    pdf_test = to_prophet_df(test_dept)
    
    # Configure Prophet with annual seasonality
    model = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=False,
        daily_seasonality=False,
        changepoint_prior_scale=0.05,
        seasonality_prior_scale=5.0,
    )
    for reg in REGRESSORS:
        if reg in pdf_train.columns:
            model.add_regressor(reg)
    
    model.fit(pdf_train)
    
    forecast_val = model.predict(pdf_val)
    forecast_test = model.predict(pdf_test)
    
    prophet_results[dept] = {
        'model': model,
        'preds_val': np.clip(forecast_val['yhat'].values, 0, None),
        'actual_val': pdf_val['y'].values,
        'preds_test': np.clip(forecast_test['yhat'].values, 0, None),
        'actual_test': pdf_test['y'].values,
    }

print(f'Prophet: {len(prophet_results)} department models trained')

In [None]:
# ============================================================
# 4.2 — Aggregated Prophet metrics
# ============================================================
all_actual_val = np.concatenate([r['actual_val'] for r in prophet_results.values()])
all_preds_val = np.concatenate([r['preds_val'] for r in prophet_results.values()])
all_actual_test = np.concatenate([r['actual_test'] for r in prophet_results.values()])
all_preds_test = np.concatenate([r['preds_test'] for r in prophet_results.values()])

print('PROPHET (aggregated across all departments)')
print('=' * 50)
for name, y_true, y_pred in [('Validation', all_actual_val, all_preds_val),
                               ('Test', all_actual_test, all_preds_test)]:
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f'  {name:12s} : RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.4f}')

In [None]:
# ============================================================
# 4.3 — Prophet decomposition for a major department
# ============================================================
# Use the largest department by total training volume
if prophet_results:
    sample_dept = (
        df_train[df_train['dept'].isin(list(prophet_results))]
        .groupby('dept')[TARGET].sum().idxmax()
    )
    model_sample = prophet_results[sample_dept]['model']
    
    dept_data = pd.concat([
        df_train[df_train['dept'] == sample_dept],
        df_val[df_val['dept'] == sample_dept],
        df_test[df_test['dept'] == sample_dept]
    ])
    pdf_full = to_prophet_df(dept_data)
    forecast_full = model_sample.predict(pdf_full)
    
    dept_name = dept_data['dept_name'].iloc[0] if 'dept_name' in dept_data.columns else sample_dept
    fig = model_sample.plot_components(forecast_full)
    fig.suptitle(f'Prophet — Decomposition for {dept_name} ({sample_dept})', fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()

---
## 5. Model comparison
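
The comparison relies on three metrics: RMSE (penalizes large errors more heavily), MAE (average absolute error, in installation counts) and R2 (share of variance explained). A small helper such as the one sketched below could compute all three in one call. The name `regression_metrics` is hypothetical and the numbers are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R2 for one set of predictions."""
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
    }

# Toy example with made-up numbers
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])
metrics = regression_metrics(y_true, y_pred)
```

Note that the metrics are only comparable across models when they are computed on the same set of department-month rows.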

In [None]:
# ============================================================
# 5.1 — Summary table
# ============================================================
comparison = []
for name, y_pred_v, y_pred_t in [
    ('Ridge', y_pred_val_ridge, y_pred_test_ridge),
    ('LightGBM', y_pred_val_lgb, y_pred_test_lgb),
    ('Prophet', all_preds_val, all_preds_test),
]:
    y_v = y_val.values if name != 'Prophet' else all_actual_val
    y_t = y_test.values if name != 'Prophet' else all_actual_test
    comparison.append({
        'Model': name,
        'Val RMSE': np.sqrt(mean_squared_error(y_v, y_pred_v)),
        'Val MAE': mean_absolute_error(y_v, y_pred_v),
        'Val R2': r2_score(y_v, y_pred_v),
        'Test RMSE': np.sqrt(mean_squared_error(y_t, y_pred_t)),
        'Test MAE': mean_absolute_error(y_t, y_pred_t),
        'Test R2': r2_score(y_t, y_pred_t),
    })

df_comp = pd.DataFrame(comparison).sort_values('Val RMSE')
print('MODEL COMPARISON')
print('=' * 80)
print(df_comp.to_string(index=False))

In [None]:
# ============================================================
# 5.2 — Comparative metric chart
# ============================================================
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('ML Model Comparison', fontsize=14)

for ax, metric, title in zip(axes, 
    ['RMSE', 'MAE', 'R2'],
    ['RMSE (lower = better)', 'MAE (lower = better)', 'R2 (higher = better)']):
    x = np.arange(len(df_comp))
    width = 0.35
    ax.bar(x - width/2, df_comp[f'Val {metric}'], width, label='Validation', color='steelblue')
    ax.bar(x + width/2, df_comp[f'Test {metric}'], width, label='Test', color='darkorange')
    ax.set_title(title)
    ax.set_xticks(x)
    ax.set_xticklabels(df_comp['Model'])
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# 5.3 — Predictions vs Actual (Ridge and LightGBM on test set)
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle(f'Predictions vs Actual — Test set ({TARGET})', fontsize=14)

for ax, name, y_pred in zip(axes, ['Ridge', 'LightGBM'], 
                              [y_pred_test_ridge, y_pred_test_lgb]):
    ax.plot(range(len(y_test)), y_test.values, 'b-o', markersize=3, label='Actual', linewidth=1.5)
    ax.plot(range(len(y_test)), y_pred, 'r--s', markersize=3, label='Predicted', linewidth=1.5)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    ax.set_title(f'{name} (RMSE={rmse:.2f}, R2={r2:.3f})')
    ax.set_xlabel('Test row index (one point per department-month)')
    ax.set_ylabel(TARGET)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## 6. Conclusions

### Results:
- **Ridge Regression** offers the best performance, which suggests that the relationships are largely linear and that L2 regularization copes well with the correlated lag and rolling features
- **LightGBM** captures non-linear patterns but risks overfitting on this dataset size
- **Prophet** captures seasonality well but suffers from per-department training (at most 36 monthly points per series in the training window)

### Most important features:
- **Temporal lags** (nb_installations_pac_lag_1m) are the most predictive
- **Rolling means** smooth noise and improve prediction
- **Weather** (HDD, temperature) has a significant impact
- **Household confidence** is a useful economic signal
- **SITADEL / reference features** (if available) add structural department context
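
For reference, lag and rolling features of this kind can be built per department as sketched below, on a toy panel with made-up counts. The lag column name follows the one used in this project. The rolling column name `_rmean_3m` is hypothetical, and the actual feature engineering lives in the upstream pipeline, not in this notebook.

```python
import pandas as pd

TARGET = "nb_installations_pac"

# Toy panel with made-up counts: 2 departments x 4 months
toy = pd.DataFrame({
    "dept": ["01"] * 4 + ["02"] * 4,
    "date_id": [202401, 202402, 202403, 202404] * 2,
    TARGET: [10, 12, 14, 16, 100, 90, 80, 70],
}).sort_values(["dept", "date_id"])

grouped = toy.groupby("dept")[TARGET]

# Lag of one month, computed within each department
toy[f"{TARGET}_lag_1m"] = grouped.shift(1)

# 3-month rolling mean of past values only: shifting before rolling keeps month t out
toy[f"{TARGET}_rmean_3m"] = grouped.transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)
```

The first month of each department has no history, so both features are NaN there; this is the source of the missing values that the imputation step handles.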

### Next step:
-> Notebook 03: Exploratory LSTM (deep learning)