# 05 — Feature Review and SHAP Analysis
## HVAC Market Analysis — Metropolitan France (96 departments)

### Purpose

This notebook fulfills the **CLAUDE.md requirement** for a collaborative column review.
It documents:

1. **Every selected column** in the ML dataset, its content, and relevance
2. **Keep/drop/add decisions** with justification
3. **Feature importance analysis** (Ridge coefficients, LightGBM gain, SHAP values)
4. **Evaluation of new candidate variables** (revenu_median, prix_m2_median, nb_logements_total, pct_maisons, SITADEL)

This review is a **portfolio deliverable**, so it is written to be clear, readable, and well documented.

In [None]:
# ============================================================
# IMPORTS
# ============================================================
import sys
sys.path.insert(0, '..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

print('Imports OK')

---
## 1. Load dataset and models

In [None]:
# ============================================================
# 1.1 — Load ML dataset and features dataset
# ============================================================
df_ml = pd.read_csv('../data/features/hvac_ml_dataset.csv')
df_feat = pd.read_csv('../data/features/hvac_features_dataset.csv')

print(f'ML dataset: {df_ml.shape[0]} rows x {df_ml.shape[1]} columns')
print(f'Features dataset: {df_feat.shape[0]} rows x {df_feat.shape[1]} columns')
print(f'Departments: {df_ml["dept"].nunique()}')
print(f'\nML dataset columns: {sorted(df_ml.columns.tolist())}')

In [None]:
# ============================================================
# 1.2 — Load trained models
# ============================================================
models_dir = Path('../data/models')

ridge_model = None
lgb_model = None

ridge_path = models_dir / 'ridge_model.pkl'
if ridge_path.exists():
    with open(ridge_path, 'rb') as f:
        ridge_model = pickle.load(f)
    print(f'Ridge model loaded ({len(ridge_model.coef_)} features)')

lgb_path = models_dir / 'lightgbm_model.pkl'
if lgb_path.exists():
    with open(lgb_path, 'rb') as f:
        lgb_model = pickle.load(f)
    print(f'LightGBM model loaded ({lgb_model.n_features_} features)')

---
## 2. Column inventory and classification

Every column in the ML dataset is categorized and documented.

In [None]:
# ============================================================
# 2.1 — Classify all columns by category
# ============================================================
COLUMN_CATEGORIES = {
    'Identifiers': {
        'date_id': 'Temporal key (YYYYMM format) — used for splitting, not as feature',
        'dept': 'Department code — used for grouping, not as feature',
        'dept_name': 'Department name — metadata, not a feature',
        'city_ref': 'Reference city (prefecture) — metadata',
        'latitude': 'City latitude — geographic metadata',
        'longitude': 'City longitude — geographic metadata',
    },
    'Target variables': {
        'nb_dpe_total': 'Total DPE count per month/dept — proxy for real estate activity',
        'nb_installations_pac': 'DPE with heat pump — PRIMARY TARGET',
        'nb_installations_clim': 'DPE with air conditioning',
        'nb_dpe_classe_ab': 'DPE class A or B (high performance)',
        'pct_pac': 'Heat pump rate (%) — derived from targets',
        'pct_clim': 'AC rate (%) — derived from targets',
        'pct_classe_ab': 'Class A-B rate (%) — derived from targets',
    },
    'Weather (local, per dept)': {
        'temp_mean': 'Monthly mean temperature (C) — direct impact on heating/cooling demand',
        'temp_max': 'Monthly max temperature (C) — heatwave indicator',
        'temp_min': 'Monthly min temperature (C) — frost indicator',
        'hdd_sum': 'Heating Degree Days (base 18C) — heating demand proxy',
        'cdd_sum': 'Cooling Degree Days (base 18C) — cooling demand proxy',
        'precipitation_sum': 'Monthly precipitation sum (mm)',
        'nb_jours_canicule': 'Heatwave days (>35C) — AC demand driver',
        'nb_jours_gel': 'Frost days (<0C) — heating demand driver',
    },
    'Economic (national)': {
        'confiance_menages': 'Household confidence index (INSEE) — investment intention proxy',
        'climat_affaires_indus': 'Business climate, industry (INSEE)',
        'climat_affaires_bat': 'Business climate, construction (INSEE) — sector-specific',
        'ipi_manufacturing': 'Industrial Production Index, manufacturing (INSEE)',
        'ipi_hvac_c28': 'IPI HVAC sector C28 (Eurostat) — industry activity',
        'ipi_hvac_c2825': 'IPI HVAC sub-sector C2825 (Eurostat) — specific HVAC production',
    },
    'Calendar': {
        'year': 'Year — captures long-term trend',
        'month': 'Month number (1-12)',
        'quarter': 'Quarter (1-4)',
        'is_heating': 'Binary: heating season (Oct-Mar)',
        'is_cooling': 'Binary: cooling season (Jun-Sep)',
        'month_sin': 'Cyclical encoding of month (sin component)',
        'month_cos': 'Cyclical encoding of month (cos component)',
    },
    'SITADEL (construction, per dept)': {
        'nb_logements_autorises': 'Total authorized housing units — construction activity',
        'nb_logements_individuels': 'Authorized individual housing — houses (more heat pumps)',
        'nb_logements_collectifs': 'Authorized collective housing — apartments',
        'surface_autorisee_m2': 'Total authorized surface (m2) — construction volume',
    },
    'INSEE Filosofi reference (static, per dept)': {
        'revenu_median': 'Median income — linked to aid eligibility (MaPrimeRenov, CEE)',
        'prix_m2_median': 'Median price/m2 — real estate market activity proxy',
        'nb_logements_total': 'Total housing stock — normalization base',
        'pct_maisons': 'Percentage of houses (vs apartments) — structural driver',
    },
}

# Display the inventory
for category, cols in COLUMN_CATEGORIES.items():
    present = [c for c in cols if c in df_ml.columns]
    missing = [c for c in cols if c not in df_ml.columns]
    print(f'\n=== {category} ({len(present)}/{len(cols)} present) ===')
    for col, desc in cols.items():
        status = 'OK' if col in df_ml.columns else 'MISSING'
        print(f'  [{status:7s}] {col:35s} : {desc}')

---
## 3. Feature importance analysis

We compare three importance methods:
1. **Ridge |coefficients|** — linear importance after standardization
2. **LightGBM gain** — total split gain per feature
3. **SHAP values** — game-theoretic attribution per prediction
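
The three scores live on different scales (coefficient magnitudes, split gain, and SHAP values in target units), so their raw values cannot be compared directly. One possible way to put them side by side is to rank features within each method and average the ranks. The sketch below uses made-up toy scores, and `mean_rank` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

# Toy importance scores (illustrative values only, not results from this project).
toy_scores = pd.DataFrame(
    {
        "ridge_abs_coef": [0.80, 0.10, 0.40],
        "lgb_gain": [1200.0, 900.0, 50.0],
        "mean_abs_shap": [3.2, 2.9, 0.4],
    },
    index=["feat_a", "feat_b", "feat_c"],
)


def mean_rank(scores: pd.DataFrame) -> pd.Series:
    """Rank features within each method (1 = most important), then average the ranks."""
    ranks = scores.rank(ascending=False, method="average")
    return ranks.mean(axis=1).sort_values()


consensus = mean_rank(toy_scores)
print(consensus)
```

A low mean rank means the feature is near the top under all three methods; a feature that ranks high under one method only is worth a closer look rather than an automatic keep.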

In [None]:
# ============================================================
# 3.1 — Prepare features for importance analysis
# ============================================================
TARGET = 'nb_installations_pac'
TRAIN_END = 202406

EXCLUDE_COLS = {
    'date_id', 'dept', 'dept_name', 'city_ref', 'latitude', 'longitude',
    'n_valid_features', 'pct_valid_features',
    'nb_installations_clim', 'nb_dpe_total', 'nb_dpe_classe_ab',
    'pct_pac', 'pct_clim', 'pct_classe_ab',
}
OUTLIER_PATTERNS = ['_outlier_iqr', '_outlier_zscore', '_outlier_iforest',
                    '_outlier_consensus', '_outlier_score']

feature_cols = [
    c for c in df_feat.columns
    if c not in EXCLUDE_COLS and c != TARGET
    and not any(p in c for p in OUTLIER_PATTERNS)
    and df_feat[c].dtype in [np.float64, np.int64, np.float32, np.int32]
]

df_train = df_feat[df_feat['date_id'] <= TRAIN_END]
X_train = df_train[feature_cols]

# Drop all-NaN columns before imputation (SimpleImputer silently drops them,
# causing shape mismatch when rebuilding the DataFrame)
all_nan_cols = [c for c in feature_cols if X_train[c].isna().all()]
if all_nan_cols:
    print(f'Dropping {len(all_nan_cols)} all-NaN columns: {all_nan_cols}')
    feature_cols = [c for c in feature_cols if c not in all_nan_cols]
    X_train = df_train[feature_cols]

# Imputation for SHAP
imputer = SimpleImputer(strategy='median')
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=feature_cols, index=X_train.index
)

print(f'Feature columns for importance: {len(feature_cols)}')

In [None]:
# ============================================================
# 3.2 — Ridge importance (absolute coefficients)
# ============================================================
if ridge_model is not None and len(ridge_model.coef_) == len(feature_cols):
    imp_ridge = pd.Series(
        np.abs(ridge_model.coef_), index=feature_cols
    ).sort_values(ascending=False)
    
    print('Ridge — Top 25 features by |coefficient|:')
    print('-' * 60)
    for i, (feat, val) in enumerate(imp_ridge.head(25).items(), 1):
        sign = '+' if ridge_model.coef_[feature_cols.index(feat)] > 0 else '-'
        print(f'  {i:2d}. {feat:45s} {sign}{val:.4f}')
else:
    print('Ridge model not available or feature mismatch.')
    print(f'  Model features: {len(ridge_model.coef_) if ridge_model is not None else "N/A"}')
    print(f'  Current features: {len(feature_cols)}')
    imp_ridge = None

In [None]:
# ============================================================
# 3.3 — LightGBM importance (gain)
# ============================================================
if lgb_model is not None and lgb_model.n_features_ == len(feature_cols):
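    # feature_importances_ follows the model's importance_type (split counts by
    # default), so request gain explicitly from the underlying booster.
    imp_lgb = pd.Series(
        lgb_model.booster_.feature_importance(importance_type='gain'),
        index=feature_cols,
    ).sort_values(ascending=False)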
    
    print('LightGBM — Top 25 features by gain:')
    print('-' * 60)
    for i, (feat, val) in enumerate(imp_lgb.head(25).items(), 1):
        print(f'  {i:2d}. {feat:45s} {val:.0f}')
else:
    print('LightGBM model not available or feature mismatch.')
    imp_lgb = None

In [None]:
# ============================================================
# 3.4 — SHAP values (LightGBM)
# ============================================================
if lgb_model is not None and lgb_model.n_features_ == len(feature_cols):
    import shap
    
    explainer = shap.TreeExplainer(lgb_model)
    # Use a sample for SHAP (faster)
    sample_size = min(500, len(X_train_imp))
    X_sample = X_train_imp.sample(sample_size, random_state=42)
    shap_values = explainer.shap_values(X_sample)
    
    # Summary plot
    fig, ax = plt.subplots(figsize=(12, 10))
    shap.summary_plot(shap_values, X_sample, max_display=25, show=False)
    plt.title('SHAP — Feature impact on predictions (LightGBM)', fontsize=13)
    plt.tight_layout()
    plt.show()
    
    # Mean absolute SHAP values
    shap_importance = pd.Series(
        np.abs(shap_values).mean(axis=0), index=feature_cols
    ).sort_values(ascending=False)
    
    print('\nTop 25 features by mean |SHAP|:')
    print('-' * 60)
    for i, (feat, val) in enumerate(shap_importance.head(25).items(), 1):
        print(f'  {i:2d}. {feat:45s} {val:.4f}')
else:
    print('SHAP analysis skipped (model not available or feature mismatch).')
    shap_importance = None

In [None]:
# ============================================================
# 3.5 — Combined importance comparison
# ============================================================
fig, axes = plt.subplots(1, 3, figsize=(20, 10))
fig.suptitle('Feature Importance — 3 Methods Compared (Top 20)', fontsize=14)

if imp_ridge is not None:
    imp_ridge.head(20).iloc[::-1].plot(kind='barh', ax=axes[0], color='steelblue')
    axes[0].set_title('Ridge |coefficients|')

if imp_lgb is not None:
    imp_lgb.head(20).iloc[::-1].plot(kind='barh', ax=axes[1], color='darkgreen')
    axes[1].set_title('LightGBM gain')

if shap_importance is not None:
    shap_importance.head(20).iloc[::-1].plot(kind='barh', ax=axes[2], color='darkorange')
    axes[2].set_title('Mean |SHAP|')

for ax in axes:
    ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

---
## 4. Column-level keep/drop decisions

Based on the importance analysis, correlation study (notebook 01), and domain knowledge,
here are the keep/drop decisions for each feature category.

In [None]:
# ============================================================
# 4.1 — Feature decisions table
# ============================================================
FEATURE_DECISIONS = [
    # (feature, decision, category, justification)
    # --- DPE features ---
    ('nb_dpe_total', 'KEEP as target', 'DPE', 'Primary volume indicator, strong predictor'),
    ('nb_installations_pac', 'TARGET', 'DPE', 'Primary prediction target'),
    ('nb_installations_clim', 'KEEP as target', 'DPE', 'Secondary target, useful for multi-target'),
    ('nb_dpe_classe_ab', 'KEEP as target', 'DPE', 'Quality indicator'),
    ('pct_pac', 'DROP from features', 'DPE', 'Derived from target — leakage risk'),
    ('pct_clim', 'DROP from features', 'DPE', 'Derived from target — leakage risk'),
    ('pct_classe_ab', 'DROP from features', 'DPE', 'Derived from target — leakage risk'),
    
    # --- Weather ---
    ('temp_mean', 'KEEP', 'Weather', 'Direct impact on heating/cooling demand'),
    ('temp_max', 'KEEP', 'Weather', 'Heatwave indicator, drives AC demand'),
    ('temp_min', 'KEEP', 'Weather', 'Frost indicator, drives heating demand'),
    ('hdd_sum', 'KEEP', 'Weather', 'Heating Degree Days — strong predictor per domain logic'),
    ('cdd_sum', 'KEEP', 'Weather', 'Cooling Degree Days — AC demand proxy'),
    ('precipitation_sum', 'KEEP', 'Weather', 'Weather context, low importance but no cost'),
    ('nb_jours_canicule', 'KEEP', 'Weather', 'Extreme heat days — AC demand driver'),
    ('nb_jours_gel', 'KEEP', 'Weather', 'Extreme cold days — heating demand driver'),
    
    # --- Economic ---
    ('confiance_menages', 'KEEP', 'Economic', 'Household confidence — investment intention proxy'),
    ('climat_affaires_indus', 'KEEP', 'Economic', 'Industry business climate'),
    ('climat_affaires_bat', 'KEEP', 'Economic', 'Construction business climate — sector-specific'),
    ('ipi_manufacturing', 'KEEP', 'Economic', 'Manufacturing production index'),
    ('ipi_hvac_c28', 'KEEP', 'Economic', 'HVAC-specific industrial production'),
    ('ipi_hvac_c2825', 'KEEP', 'Economic', 'HVAC sub-sector production'),
    
    # --- Calendar ---
    ('month', 'KEEP', 'Calendar', 'Seasonality encoding'),
    ('quarter', 'KEEP', 'Calendar', 'Quarterly patterns'),
    ('year', 'KEEP', 'Calendar', 'Long-term trend'),
    ('is_heating', 'KEEP', 'Calendar', 'Heating season binary'),
    ('is_cooling', 'KEEP', 'Calendar', 'Cooling season binary'),
    ('month_sin', 'KEEP', 'Calendar', 'Cyclical month encoding (sin)'),
    ('month_cos', 'KEEP', 'Calendar', 'Cyclical month encoding (cos)'),
    
    # --- SITADEL (NEW) ---
    ('nb_logements_autorises', 'KEEP', 'SITADEL', 'Construction activity — EVALUATE with SHAP'),
    ('nb_logements_individuels', 'KEEP', 'SITADEL', 'Individual housing — houses get more heat pumps'),
    ('nb_logements_collectifs', 'KEEP', 'SITADEL', 'Collective housing — constrained by co-ownership'),
    ('surface_autorisee_m2', 'KEEP', 'SITADEL', 'Construction volume — EVALUATE with SHAP'),
    
    # --- INSEE Filosofi reference (NEW) ---
    ('revenu_median', 'KEEP', 'Reference', 'Median income — linked to aid eligibility'),
    ('prix_m2_median', 'KEEP', 'Reference', 'Price/m2 — real estate market proxy'),
    ('nb_logements_total', 'KEEP', 'Reference', 'Housing stock — normalization base'),
    ('pct_maisons', 'KEEP', 'Reference', 'House % — structural driver (houses -> more PAC)'),
]

df_decisions = pd.DataFrame(FEATURE_DECISIONS, 
                            columns=['Feature', 'Decision', 'Category', 'Justification'])

print(f'Feature decisions: {len(df_decisions)} columns reviewed')
print(f'  KEEP: {(df_decisions["Decision"] == "KEEP").sum()}')
print(f'  DROP: {df_decisions["Decision"].str.contains("DROP").sum()}')
print(f'  TARGET: {df_decisions["Decision"].str.contains("TARGET|target").sum()}')
print()
display(df_decisions)

---
## 5. Evaluation of new candidate variables

These variables were added per CLAUDE.md recommendations. We evaluate their contribution.
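
The reference variables are static per department, so they repeat on every monthly row. A row-level correlation therefore rests on far fewer independent values than the row count suggests. A minimal sketch of a department-level check, assuming the `dept`, `revenu_median` and `nb_installations_pac` columns described above and using toy values only:

```python
import pandas as pd

# Toy monthly panel: 3 departments x 2 months (illustrative values only).
toy = pd.DataFrame({
    "dept": ["01", "01", "02", "02", "03", "03"],
    "revenu_median": [21000, 21000, 24000, 24000, 27000, 27000],  # static per dept
    "nb_installations_pac": [10, 14, 20, 22, 30, 36],
})

# One row per department: the static value and the mean monthly target.
per_dept = toy.groupby("dept").agg(
    revenu_median=("revenu_median", "first"),
    mean_pac=("nb_installations_pac", "mean"),
)

r_rows = toy["revenu_median"].corr(toy["nb_installations_pac"])
r_dept = per_dept["revenu_median"].corr(per_dept["mean_pac"])
print(f"row-level r={r_rows:+.3f} (n={len(toy)}), "
      f"dept-level r={r_dept:+.3f} (n={len(per_dept)})")
```

Reporting the department-level `n` next to the coefficient makes it clearer how much evidence actually supports a static variable.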

In [None]:
# ============================================================
# 5.1 — New features: presence and statistics
# ============================================================
NEW_FEATURES = {
    'revenu_median': 'INSEE Filosofi — directly linked to MaPrimeRenov aid eligibility',
    'prix_m2_median': 'INSEE — proxy for real estate market activity and housing type',
    'nb_logements_total': 'INSEE Recensement — total housing stock for normalization',
    'pct_maisons': 'INSEE — houses vs apartments, structural HVAC driver',
    'nb_logements_autorises': 'SITADEL — construction permits, new housing activity',
    'nb_logements_individuels': 'SITADEL — individual housing (more heat pump friendly)',
    'nb_logements_collectifs': 'SITADEL — collective housing (constrained installations)',
    'surface_autorisee_m2': 'SITADEL — total authorized surface for construction',
}

print('New candidate variables evaluation:')
print('=' * 80)
for feat, desc in NEW_FEATURES.items():
    if feat in df_ml.columns:
        series = df_ml[feat]
        corr = series.corr(df_ml[TARGET])
        null_pct = series.isna().mean() * 100
        print(f'\n  {feat}')
        print(f'    Description: {desc}')
        print(f'    Present: YES | NaN: {null_pct:.1f}% | Corr with target: {corr:+.3f}')
        print(f'    Stats: mean={series.mean():.1f}, std={series.std():.1f}, '
              f'min={series.min():.1f}, max={series.max():.1f}')
    else:
        print(f'\n  {feat}')
        print(f'    Description: {desc}')
        print(f'    Present: NO — run "python -m src.pipeline process" to generate')

In [None]:
# ============================================================
# 5.2 — Correlation of new features with target
# ============================================================
available_new = [f for f in NEW_FEATURES if f in df_ml.columns and df_ml[f].notna().sum() > 10]

if available_new:
    n = len(available_new)
    ncols = min(4, n)
    nrows = (n + ncols - 1) // ncols
    # squeeze=False always returns a 2D array of axes, even for a 1x1 grid
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 4 * nrows), squeeze=False)
    axes = axes.ravel()
    fig.suptitle(f'New features vs {TARGET}', fontsize=14)
    
    for ax, feat in zip(axes, available_new):
        mask = df_ml[[feat, TARGET]].dropna().index
        ax.scatter(df_ml.loc[mask, feat], df_ml.loc[mask, TARGET], alpha=0.2, s=8)
        r = df_ml[feat].corr(df_ml[TARGET])
        ax.set_title(f'{feat}\n(r={r:+.3f})', fontsize=9)
        ax.set_xlabel(feat, fontsize=8)
        ax.grid(True, alpha=0.3)
    
    # Hide empty axes
    for i in range(n, nrows * ncols):
        axes[i].set_visible(False)
    
    plt.tight_layout()
    plt.show()
else:
    print('No new features available in dataset. Run the pipeline first.')

---
## 6. Conclusions and recommendations

### Feature review summary

| Category | Count | Decision | Rationale |
|----------|-------|----------|-----------|
| Identifiers | 6 | EXCLUDE from features | Metadata only (dept, date_id, lat/lon) |
| Targets | 7 | TARGET / EXCLUDE | Prediction targets or derived ratios (leakage risk) |
| Weather | 8 | **KEEP all** | Direct physical drivers of HVAC demand (HDD, CDD, temp) |
| Economic | 6 | **KEEP all** | National investment context (confidence, IPI, business climate) |
| Calendar | 7 | **KEEP all** | Seasonality and trend capture (month, quarter, year, sin/cos) |
| SITADEL | 4 | **KEEP, EVALUATE** | Construction activity — evaluate impact with SHAP |
| Reference | 4 | **KEEP, EVALUATE** | Socioeconomic and housing context (median income, price/m2, housing stock, share of houses); evaluate with SHAP |

### Key findings from importance analysis

1. **Temporal lags dominate** across all 3 methods (Ridge, LightGBM, SHAP): lag_1m is the single most predictive feature, reflecting strong month-to-month autocorrelation in heat pump installations
2. **Weather features consistently rank high** — HDD (heating degree days), temperature, and CDD capture the physical relationship between climate and HVAC demand
3. **Economic indicators provide moderate but consistent signal** — household confidence and IPI HVAC (C28) appear in the top 15 across methods
4. **Calendar features are important** — month (sin/cos encoding) captures seasonality, year captures the long-term growth trend
5. **Leakage risk confirmed**: pct_pac, pct_clim, pct_classe_ab are derived from the target and must be excluded from features
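
The leakage point in finding 5 can be checked mechanically. The sketch below assumes `pct_pac` is computed as `100 * nb_installations_pac / nb_dpe_total`; the notebook only states that it is derived from the targets, so the exact formula is an assumption. `is_derived_ratio` is a hypothetical helper and the data are toy values.

```python
import numpy as np
import pandas as pd


def is_derived_ratio(df, ratio_col, num_col, den_col, scale=100.0):
    """True if ratio_col equals scale * num_col / den_col wherever den_col > 0."""
    valid = df[den_col] > 0
    expected = scale * df.loc[valid, num_col] / df.loc[valid, den_col]
    return bool(np.allclose(df.loc[valid, ratio_col], expected))


# Toy rows (illustrative values only).
toy = pd.DataFrame({
    "nb_dpe_total": [100, 200, 0, 50],
    "nb_installations_pac": [10, 30, 0, 5],
    "temp_mean": [4.0, 12.5, 18.0, 9.0],
})
# Assumed definition of the ratio; rows with a zero denominator are set to 0.
toy["pct_pac"] = (100.0 * toy["nb_installations_pac"] / toy["nb_dpe_total"]).fillna(0.0)

leaky = is_derived_ratio(toy, "pct_pac", "nb_installations_pac", "nb_dpe_total")
clean = is_derived_ratio(toy, "temp_mean", "nb_installations_pac", "nb_dpe_total")
print(f"pct_pac derived from target: {leaky}, temp_mean derived from target: {clean}")
```

A column that passes this check reconstructs the target exactly once the denominator is known, which is why it belongs in `EXCLUDE_COLS`.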

### SITADEL & Reference features status
- **Column naming fix applied**: the 3 column mismatches (DATE_REELLE_AUTORISATION, NB_LGT_COL_CREES, SURF_HAB_CREEE) have been resolved in the collector
- **To integrate**: run `python -m src.pipeline merge` then `python -m src.pipeline features` to regenerate the ML dataset with SITADEL and reference features
- **Expected impact**: SITADEL features (construction permits) should add department-level temporal signal; reference features (median income, price/m2, housing stock, share of houses) should help explain inter-department variance

### Improvement suggestions
- **Recursive Feature Elimination (RFE)**: systematically test removing the least important features — may simplify the model without losing performance
- **Boruta feature selection**: all-relevant feature selection to distinguish truly informative features from noise
- **Partial Dependence Plots (PDP)**: visualize non-linear relationships for top 5 LightGBM features
- **Feature interaction analysis**: test explicit interaction terms (e.g., HDD x department_type, confidence x season)
- **SHAP interaction values**: identify the most important feature pairs (`shap.TreeExplainer.shap_interaction_values`)
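
As a sketch of the interaction-term suggestion, products of existing columns can be added before retraining. The column names come from the inventory in section 2; the `_x_` naming scheme and the `add_interactions` helper are hypothetical, and the values are toy data.

```python
import pandas as pd


def add_interactions(df: pd.DataFrame, pairs: list[tuple[str, str]]) -> pd.DataFrame:
    """Return a copy of df with one product column per (left, right) pair."""
    out = df.copy()
    for left, right in pairs:
        out[f"{left}_x_{right}"] = out[left] * out[right]
    return out


# Toy rows (illustrative values only): one heating-season month, one off-season month.
toy = pd.DataFrame({
    "hdd_sum": [320.0, 15.0],
    "confiance_menages": [92.0, 88.0],
    "is_heating": [1, 0],
})
enriched = add_interactions(
    toy, [("hdd_sum", "is_heating"), ("confiance_menages", "is_heating")]
)
print(enriched.columns.tolist())
```

Multiplying by the binary `is_heating` flag zeroes the term outside the heating season, which gives a linear model such as Ridge a season-specific slope.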

### Next steps (with full data)

When SITADEL/reference features are integrated:
1. Re-run SHAP analysis to evaluate new features' contribution
2. Apply feature selection (keep only features with significant SHAP importance)
3. Re-evaluate model performance with the enriched feature set
4. Consider feature stability analysis (do importances change across temporal CV folds?)
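
The stability check in step 4 needs folds that respect time order. A minimal sketch of expanding-window splits on `date_id` (YYYYMM integers, as in the dataset); `expanding_folds` is a hypothetical helper and the fold sizes are arbitrary.

```python
def expanding_folds(date_ids, n_folds=3, val_len=2):
    """Yield (train, validation) month lists; each fold trains on every month
    that precedes its validation window."""
    months = sorted(set(date_ids))
    for k in range(n_folds, 0, -1):
        val_end = len(months) - (k - 1) * val_len
        val_start = val_end - val_len
        if val_start <= 0:
            continue  # not enough history for this fold
        yield months[:val_start], months[val_start:val_end]


# Ten consecutive toy months: 202301 .. 202310.
toy_months = [202301 + i for i in range(10)]
folds = list(expanding_folds(toy_months, n_folds=3, val_len=2))
for train, val in folds:
    print(f"train {train[0]}..{train[-1]} -> validate {val}")
```

Importances would then be recomputed on each training window and their rankings compared across folds, for example with a rank correlation.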