# 🔋 Energy Consumption Prediction - Seattle Buildings

## 🎯 Objective

This notebook develops **Machine Learning** models to predict the **energy consumption** of non-residential buildings in Seattle.

**Target variable**: `SiteEnergyUseWN(kBtu)` - total site energy use (Weather Normalized)

**Approach**:
- Comparison of 18 different models
- Hyperparameter optimization with GridSearchCV and 10-fold cross-validation
- Feature importance analysis and interpretability (SHAP)


---
## 1. 📦 Imports and Configuration


In [None]:
# =============================================================================
# IMPORTS
# =============================================================================

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Models - Baseline
from sklearn.dummy import DummyRegressor

# Models - Linear
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Models - SVM
from sklearn.svm import SVR

# Models - Ensemble
from sklearn.ensemble import (
    RandomForestRegressor, 
    GradientBoostingRegressor, 
    AdaBoostRegressor
)

# Models - Neural Network
from sklearn.neural_network import MLPRegressor

# Models - XGBoost
from xgboost import XGBRegressor

# Target Transformation
from sklearn.compose import TransformedTargetRegressor

# Interpretability
import shap

# Configuration
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')

print("✅ Imports loaded successfully")


---
## 2. 📁 Data Loading


In [None]:
# =============================================================================
# LOADING THE CLEANED DATA
# =============================================================================

DATA_PATH = '../data/data_cleaned.csv'

# Load the dataset, indexed by building ID
data = pd.read_csv(DATA_PATH, index_col='OSEBuildingID')

# Drop rows that still contain NaN or infinite values
data = data[~data.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

print(f"📊 Dimensions: {data.shape}")
print("✅ Data loaded and cleaned")
data.head()
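
As a minimal illustration of the row filter above, the sketch below applies the same `isin(...).any(axis=1)` mask to a small made-up DataFrame. The `toy` frame and its values are assumptions, not project data. `DataFrame.isin` treats `NaN` as a match, so a single mask removes both missing and infinite values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each contain a non-finite value
toy = pd.DataFrame({
    "PropertyGFATotal": [50_000.0, np.inf, 120_000.0, 80_000.0],
    "ENERGYSTARScore": [75.0, 60.0, np.nan, 90.0],
})

# Same mask as in the loading cell: flag rows with NaN or +/- infinity
bad_rows = toy.isin([np.nan, np.inf, -np.inf]).any(axis=1)
toy_clean = toy[~bad_rows]

print(f"Rows before: {len(toy)}, after: {len(toy_clean)}")
```

Only the rows whose values are all finite survive the filter.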


---
## 3. 🎯 Feature and Target Preparation


In [None]:
# =============================================================================
# FEATURE AND TARGET DEFINITION
# =============================================================================

# Structural features
STRUCTURAL_FEATURES = [
    'Age', 'NumberofBuildings', 'NumberofFloors', 
    'PropertyGFATotal', 'PropertyGFAParking_Pct', 'PropertyGFABuilding_Pct',
    'LargestPropertyUseTypeGFA', 'ENERGYSTARScore'
]

# Categorical features (one-hot encoded)
PROPERTY_TYPE_FEATURES = [col for col in data.columns if col.startswith('PropType_')]
DISTRICT_FEATURES = [col for col in data.columns if col.startswith('District_')]

# All features
FEATURE_COLUMNS = STRUCTURAL_FEATURES + PROPERTY_TYPE_FEATURES + DISTRICT_FEATURES

# Target variable
TARGET = 'SiteEnergyUseWN(kBtu)'

# Keep only the features that are actually present in the dataset
available_features = [col for col in FEATURE_COLUMNS if col in data.columns]
print(f"✅ Available features: {len(available_features)}/{len(FEATURE_COLUMNS)}")

# Build X and y
X = data[available_features]
y = data[TARGET]

print(f"📊 X shape: {X.shape}")
print(f"📊 y shape: {y.shape}")
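
The `PropType_` and `District_` columns selected by prefix above are assumed to come from one-hot encoding in an earlier cleaning step that is not shown here. A minimal sketch of how such columns could be produced and then picked up by prefix, using a made-up `raw` frame (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical raw data before encoding
raw = pd.DataFrame({
    "PropType": ["Office", "Hotel", "Office"],
    "Age": [30, 12, 55],
})

# One-hot encode the categorical column; new columns are named "<prefix>_<category>"
encoded = pd.get_dummies(raw, columns=["PropType"], prefix="PropType")

# Select the encoded columns by prefix, as the notebook does
prop_cols = [col for col in encoded.columns if col.startswith("PropType_")]
print(prop_cols)
```

Selecting by prefix means the feature list follows the data automatically when categories are added or removed upstream.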


---
## 4. ✂️ Train/Test Split


In [None]:
# =============================================================================
# SPLIT TRAIN/TEST
# =============================================================================

RANDOM_STATE = 42
TEST_SIZE = 0.2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=TEST_SIZE, 
    random_state=RANDOM_STATE
)

print(f"📊 Train set: {X_train.shape[0]} samples")
print(f"📊 Test set: {X_test.shape[0]} samples")


---
## 5. 🔧 Preprocessing

### 5.1 Log Transformation of Skewed Features


In [None]:
# =============================================================================
# LOG1P TRANSFORMATION OF SKEWED FEATURES
# =============================================================================

# Features to transform (heavily right-skewed distributions)
LOG_TRANSFORM_COLS = ['NumberofFloors', 'PropertyGFATotal', 'LargestPropertyUseTypeGFA']

# Create copies for the two approaches (raw features vs. log-transformed features)
X_train_std = X_train.copy()
X_test_std = X_test.copy()

X_train_log = X_train.copy()
X_test_log = X_test.copy()

# Apply log1p to the skewed features
for col in LOG_TRANSFORM_COLS:
    if col in X_train_log.columns:
        X_train_log[col] = np.log1p(X_train_log[col])
        X_test_log[col] = np.log1p(X_test_log[col])

print("✅ log1p transformation applied")
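
To illustrate why `log1p` helps with skewed features, the sketch below compares the skewness of a synthetic right-skewed variable before and after the transformation. The log-normal sample is an assumption standing in for a floor-area column; it is not drawn from the Seattle dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed feature (log-normal), standing in for a floor-area column
gfa = pd.Series(rng.lognormal(mean=11.0, sigma=1.0, size=2_000))

skew_raw = gfa.skew()
skew_log = np.log1p(gfa).skew()

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after  log1p: {skew_log:.2f}")

# expm1 is the inverse of log1p, so the transformation loses no information
roundtrip = np.expm1(np.log1p(gfa))
```

`log1p` is also defined at zero, which makes it a safer choice than a plain logarithm for counts such as `NumberofFloors`.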


### 5.2 Standardization


In [None]:
# =============================================================================
# STANDARDIZATION (SCALING)
# =============================================================================

# Scaler for the untransformed features (fitted on the training set only)
scaler_std = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler_std.fit_transform(X_train_std),
    columns=X_train_std.columns,
    index=X_train_std.index
)
X_test_scaled = pd.DataFrame(
    scaler_std.transform(X_test_std),
    columns=X_test_std.columns,
    index=X_test_std.index
)

# Scaler for the log-transformed features
scaler_log = StandardScaler()
X_train_log_scaled = pd.DataFrame(
    scaler_log.fit_transform(X_train_log),
    columns=X_train_log.columns,
    index=X_train_log.index
)
X_test_log_scaled = pd.DataFrame(
    scaler_log.transform(X_test_log),
    columns=X_test_log.columns,
    index=X_test_log.index
)

print("✅ Standardization applied")
print(f"   - X_train_scaled: {X_train_scaled.shape}")
print(f"   - X_train_log_scaled: {X_train_log_scaled.shape}")
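
The scalers above are fitted on the training split only, which keeps test-set statistics out of the preprocessing. When the same scaling is combined with cross-validation, one common option is to wrap the scaler and the model in a `Pipeline`, so the scaler is refitted on each training fold. This is an alternative to the manual approach used here, not something this notebook does. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the building features)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# The scaler is part of the estimator, so each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")

print(f"Mean CV R2: {scores.mean():.3f}")
```

The pipeline object can also be passed directly to `GridSearchCV`, with parameter names prefixed by the step name (for example `ridge__alpha`).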


---
## 6. 📏 Baseline Model

The baseline model always predicts the mean of the training target. It is the reference against which every improvement is measured.


In [None]:
# =============================================================================
# BASELINE MODEL (DUMMY REGRESSOR)
# =============================================================================

baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_scaled, y_train)
y_pred_baseline = baseline.predict(X_test_scaled)

# np.sqrt(MSE): the squared=False argument is deprecated in recent scikit-learn releases
BASELINE_RMSE = np.sqrt(mean_squared_error(y_test, y_pred_baseline))

print(f"📏 BASELINE RMSE: {BASELINE_RMSE:,.0f} kBtu")
print("   (Constant prediction = training mean)")
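
A short sketch of what the mean baseline implies, on synthetic data (the arrays below are assumptions, not the Seattle data). When evaluated on the same data it was fitted on, a mean predictor has an RMSE equal to the standard deviation of the target and an R² of zero. On a held-out test set the values are close to, but generally not exactly, these.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))  # features are ignored by DummyRegressor
y_demo = rng.lognormal(mean=14.0, sigma=1.0, size=100)  # hypothetical skewed target

dummy = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
pred = dummy.predict(X_demo)

rmse = np.sqrt(mean_squared_error(y_demo, pred))
r2 = r2_score(y_demo, pred)

print(f"RMSE: {rmse:,.0f} | std of y: {y_demo.std():,.0f} | R2: {r2:.3f}")
```

Any model worth keeping should beat this RMSE, which is why every result in the notebook is reported as an improvement relative to it.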


---
## 7. 🤖 Model Training

### Shared configuration


In [None]:
# =============================================================================
# CONFIGURATION GRIDSEARCHCV
# =============================================================================

# Evaluation metrics
SCORING = {
    'RMSE': 'neg_root_mean_squared_error',
    'MAE': 'neg_mean_absolute_error',
    'R2': 'r2'
}

# Main metric used for the refit
REFIT_METRIC = 'RMSE'

# Cross-validation folds
CV_FOLDS = 10

def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Fit a model on the training set and return its test-set metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # np.sqrt(MSE): the squared=False argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    improvement = (BASELINE_RMSE - rmse) / BASELINE_RMSE * 100
    
    return {
        'Model': model_name,
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2,
        'Improvement': improvement
    }

# Results storage
results = []

print("✅ Configuration ready for training")
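
The `SCORING`, `REFIT_METRIC` and `CV_FOLDS` settings above match the arguments of a multi-metric `GridSearchCV`. The search code itself is not part of this notebook, so the following is only a sketch of how such a search could look, on synthetic data and with a hypothetical `alpha` grid:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for the scaled training set)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=42
)

scoring = {
    "RMSE": "neg_root_mean_squared_error",
    "MAE": "neg_mean_absolute_error",
    "R2": "r2",
}
param_grid = {"alpha": np.logspace(-2, 3, 6)}  # hypothetical grid

search = GridSearchCV(
    Ridge(),
    param_grid=param_grid,
    scoring=scoring,
    refit="RMSE",  # the final model is refitted with the best RMSE parameters
    cv=10,
)
search.fit(X_demo, y_demo)

# Error metrics are stored negated so that "greater is better"; flip the sign
best_rmse = -search.best_score_
print(f"Best params: {search.best_params_}")
print(f"Best CV RMSE: {best_rmse:.2f}")
```

With several metrics, `refit` must name one of the scoring keys; per-metric results are available in `search.cv_results_` under keys such as `mean_test_MAE`.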


### 7.1 Linear Models


In [None]:
# =============================================================================
# LINEAR MODELS
# =============================================================================
# =============================================================================

# Linear Regression
lr = LinearRegression()
results.append(evaluate_model(
    lr, X_train_scaled, X_test_scaled, y_train, y_test, 
    'Linear Regression'
))

# Ridge Regression (best hyperparameters found by GridSearchCV)
ridge_params = {'alpha': 2069.14}
ridge = Ridge(**ridge_params)
results.append(evaluate_model(
    ridge, X_train_scaled, X_test_scaled, y_train, y_test,
    'Ridge'
))

# Lasso Regression
lasso_params = {'alpha': 6294988.99}
lasso = Lasso(**lasso_params)
results.append(evaluate_model(
    lasso, X_train_scaled, X_test_scaled, y_train, y_test,
    'Lasso'
))

print("✅ Linear models trained")
for r in results[-3:]:
    print(f"   {r['Model']}: RMSE = {r['RMSE']:,.0f} ({r['Improvement']:+.1f}%)")
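
Ridge and Lasso differ in how the `alpha` penalty acts on the coefficients: Ridge shrinks them towards zero, while Lasso can set some of them exactly to zero and so performs a form of feature selection. The sketch below shows this on synthetic data; the data and the `alpha` value are assumptions chosen for illustration, unrelated to the tuned values above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with only 3 informative features out of 10
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=10.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print(f"Zero coefficients - Ridge: {n_zero_ridge}, Lasso: {n_zero_lasso}")
```

Because the penalty is applied to coefficients on the scale of the target, a target measured in kBtu can call for much larger `alpha` values than a unit-scale target.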


### 7.2 Support Vector Regression


In [None]:
# =============================================================================
# SVR (Support Vector Regression)
# =============================================================================

# SVR standard
svr_params = {'C': 1000, 'degree': 1, 'kernel': 'poly'}
svr = SVR(**svr_params)
results.append(evaluate_model(
    svr, X_train_scaled, X_test_scaled, y_train, y_test,
    'SVR'
))

# SVR with TransformedTarget (log transformation of y)
svr_tt_params = {'C': 1, 'degree': 1, 'kernel': 'rbf'}  # 'degree' is ignored by the RBF kernel
svr_tt = TransformedTargetRegressor(
    regressor=SVR(**svr_tt_params),
    func=np.log1p,
    inverse_func=np.expm1
)
results.append(evaluate_model(
    svr_tt, X_train_log_scaled, X_test_log_scaled, y_train, y_test,
    'SVR (TT)'
))

print("✅ SVR models trained")
for r in results[-2:]:
    print(f"   {r['Model']}: RMSE = {r['RMSE']:,.0f} ({r['Improvement']:+.1f}%)")
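
`TransformedTargetRegressor` fits the wrapped model on `func(y)` and applies `inverse_func` to its predictions, so metrics are still computed on the original kBtu scale. The sketch below checks that the wrapper gives the same predictions as doing the log transformation by hand. The data and the choice of `LinearRegression` are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 2))
# Hypothetical positive, right-skewed target
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=150))

# Wrapper: fits on log1p(y) and returns predictions on the original scale
tt = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
).fit(X_demo, y_demo)

# Manual equivalent: transform y, fit, then invert the predictions
manual = LinearRegression().fit(X_demo, np.log1p(y_demo))
manual_pred = np.expm1(manual.predict(X_demo))

same = np.allclose(tt.predict(X_demo), manual_pred)
print(f"Wrapper and manual predictions match: {same}")
```

The wrapper keeps the transformation attached to the model, which avoids forgetting the inverse step at prediction time.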


### 7.3 Ensemble Methods


In [None]:
# =============================================================================
# RANDOM FOREST
# =============================================================================

rf_params = {
    'max_depth': 100,
    'min_samples_leaf': 1,
    'min_samples_split': 10,
    'n_estimators': 10,
    'random_state': 42
}

rf = RandomForestRegressor(**rf_params)
results.append(evaluate_model(
    rf, X_train_scaled, X_test_scaled, y_train, y_test,
    'Random Forest'
))

# Random Forest with TransformedTarget
rf_tt = TransformedTargetRegressor(
    regressor=RandomForestRegressor(**rf_params),
    func=np.log1p,
    inverse_func=np.expm1
)
results.append(evaluate_model(
    rf_tt, X_train_log_scaled, X_test_log_scaled, y_train, y_test,
    'Random Forest (TT)'
))

print("✅ Random Forest models trained")
for r in results[-2:]:
    print(f"   {r['Model']}: RMSE = {r['RMSE']:,.0f} ({r['Improvement']:+.1f}%)")


In [None]:
# =============================================================================
# GRADIENT BOOSTING
# =============================================================================

gb_params = {
    'learning_rate': 0.1,
    'loss': 'huber',
    'max_depth': 6,
    'min_samples_leaf': 17,
    'random_state': 42
}

gb = GradientBoostingRegressor(**gb_params)
results.append(evaluate_model(
    gb, X_train_scaled, X_test_scaled, y_train, y_test,
    'Gradient Boosting'
))

# Gradient Boosting with TransformedTarget
gb_tt_params = {
    'learning_rate': 0.1,
    'loss': 'huber',
    'max_depth': 4,
    'min_samples_leaf': 9,
    'random_state': 42
}
gb_tt = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(**gb_tt_params),
    func=np.log1p,
    inverse_func=np.expm1
)
results.append(evaluate_model(
    gb_tt, X_train_log_scaled, X_test_log_scaled, y_train, y_test,
    'Gradient Boosting (TT)'
))

print("✅ Gradient Boosting models trained")
for r in results[-2:]:
    print(f"   {r['Model']}: RMSE = {r['RMSE']:,.0f} ({r['Improvement']:+.1f}%)")


In [None]:
# =============================================================================
# ADABOOST
# =============================================================================

ada_params = {
    'learning_rate': 0.01,
    'loss': 'linear',
    'n_estimators': 200,
    'random_state': 42
}

ada = AdaBoostRegressor(**ada_params)
results.append(evaluate_model(
    ada, X_train_scaled, X_test_scaled, y_train, y_test,
    'AdaBoost'
))

print("✅ AdaBoost trained")
print(f"   AdaBoost: RMSE = {results[-1]['RMSE']:,.0f} ({results[-1]['Improvement']:+.1f}%)")


In [None]:
# =============================================================================
# XGBOOST
# =============================================================================

xgb_params = {
    'learning_rate': 0.01,
    'max_depth': 6,
    'n_estimators': 100,
    'random_state': 42
}

xgb = XGBRegressor(**xgb_params)
results.append(evaluate_model(
    xgb, X_train_scaled, X_test_scaled, y_train, y_test,
    'XGBoost'
))

# XGBoost with TransformedTarget
xgb_tt_params = {
    'learning_rate': 0.05,
    'max_depth': 6,
    'n_estimators': 500,
    'random_state': 42
}
xgb_tt = TransformedTargetRegressor(
    regressor=XGBRegressor(**xgb_tt_params),
    func=np.log1p,
    inverse_func=np.expm1
)
results.append(evaluate_model(
    xgb_tt, X_train_log_scaled, X_test_log_scaled, y_train, y_test,
    'XGBoost (TT)'
))

print("✅ XGBoost models trained")
for r in results[-2:]:
    print(f"   {r['Model']}: RMSE = {r['RMSE']:,.0f} ({r['Improvement']:+.1f}%)")


---
## 8. 📊 Results Comparison


In [None]:
# =============================================================================
# SUMMARY TABLE
# =============================================================================

# Build the results DataFrame
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSE', ascending=True)
results_df['Rank'] = range(1, len(results_df) + 1)

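# Add the baseline (MAE and R2 computed on the test set, like the other rows)
baseline_row = pd.DataFrame([{
    'Model': 'Baseline (Mean)',
    'RMSE': BASELINE_RMSE,
    'MAE': mean_absolute_error(y_test, y_pred_baseline),
    'R2': r2_score(y_test, y_pred_baseline),
    'Improvement': 0,
    'Rank': len(results_df) + 1
}])
results_df = pd.concat([results_df, baseline_row], ignore_index=True)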

# Display
print("🏆 MODEL RANKING (by RMSE)\n")
print(results_df[['Rank', 'Model', 'RMSE', 'Improvement']].to_string(index=False))
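
The `Improvement` column in the ranking is the relative RMSE reduction versus the baseline, as computed in `evaluate_model`. A small worked example of that formula follows; the helper name `improvement_pct` and the numbers are hypothetical and serve only to show the sign convention.

```python
def improvement_pct(baseline_rmse: float, model_rmse: float) -> float:
    """Relative RMSE reduction versus the baseline, in percent."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100


# Hypothetical values, for illustration only
better = improvement_pct(20.0, 15.0)  # model RMSE below the baseline
worse = improvement_pct(20.0, 25.0)   # model RMSE above the baseline

print(f"Better model: {better:+.1f}%")
print(f"Worse model:  {worse:+.1f}%")
```

A positive value means the model beats the baseline, and a negative value means it performs worse than always predicting the mean.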


In [None]:
# =============================================================================
# PERFORMANCE VISUALIZATION
# =============================================================================

fig, ax = plt.subplots(figsize=(12, 8))

# Build the bar plot (green: > 30% improvement, blue: > 0%, red: no improvement)
colors = ['#2ecc71' if imp > 30 else '#3498db' if imp > 0 else '#e74c3c' 
          for imp in results_df['Improvement']]

bars = ax.barh(results_df['Model'], results_df['RMSE'], color=colors)

# Add the baseline reference line
ax.axvline(x=BASELINE_RMSE, color='red', linestyle='--', linewidth=2, label='Baseline')

# Annotations
for i, (rmse, imp) in enumerate(zip(results_df['RMSE'], results_df['Improvement'])):
    ax.text(rmse + 500000, i, f'{imp:+.1f}%', va='center', fontsize=9)

ax.set_xlabel('RMSE (kBtu)', fontsize=12)
ax.set_title('Model Performance Comparison', fontsize=14)
ax.legend(loc='lower right')

plt.tight_layout()
plt.show()


---
## 9. 🔍 Feature Importance

### 9.1 Random Forest Feature Importance


In [None]:
# =============================================================================
# FEATURE IMPORTANCE - RANDOM FOREST
# =============================================================================

# Retrain the best model (Random Forest)
best_rf = RandomForestRegressor(**rf_params)
best_rf.fit(X_train_scaled, y_train)

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X_train_scaled.columns,
    'Importance': best_rf.feature_importances_
}).sort_values('Importance', ascending=False)

# Visualisation
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(data=feature_importance.head(15), x='Importance', y='Feature', ax=ax, palette='viridis')
ax.set_title('Top 15 Features - Random Forest', fontsize=14)
ax.set_xlabel('Importance')
plt.tight_layout()
plt.show()

print("\n📊 Top 5 most important features:")
print(feature_importance.head().to_string(index=False))
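
The `feature_importances_` values used above are impurity-based: they are computed from the training data and sum to 1. A common cross-check, which this notebook does not perform, is permutation importance measured on held-out data. The sketch below computes both on synthetic data; all names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features out of 6
X_demo, y_demo = make_regression(
    n_samples=400, n_features=6, n_informative=2, noise=5.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances (derived from the training data, sum to 1)
impurity_rank = np.argsort(forest.feature_importances_)[::-1]

# Permutation importances (score drop when a feature is shuffled on held-out data)
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print(f"Impurity ranking:    {impurity_rank}")
print(f"Permutation ranking: {perm_rank}")
```

When the two rankings disagree strongly, the impurity-based ranking is usually the one to question, since it is known to favor high-cardinality features.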


### 9.2 SHAP Values (Interpretability)


In [None]:
# =============================================================================
# SHAP ANALYSIS
# =============================================================================

# Build the SHAP explainer
explainer = shap.TreeExplainer(best_rf)
shap_values = explainer.shap_values(X_train_scaled)

# Summary plot
print("📊 SHAP Summary Plot - Feature impact on predictions")
shap.summary_plot(shap_values, X_train_scaled, show=True)


In [None]:
# =============================================================================
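# SHAP FORCE PLOT - INDIVIDUAL EXAMPLE
# =============================================================================

# Force plot for the first training sample
shap.initjs()
print("📊 SHAP Force Plot - Example of an individual explanation")
# expected_value may be a scalar or a length-1 array depending on the SHAP version
base_value = float(np.ravel(explainer.expected_value)[0])
shap.force_plot(
    base_value, 
    shap_values[0], 
    X_train_scaled.iloc[0],
    matplotlib=True,
    show=False  # keep the figure open so the layout can be adjusted before display
)
plt.tight_layout()
plt.show()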


---
## 10. 📋 Conclusions

### Best Model

| Criterion | Value |
|---------|--------|
| **Model** | Random Forest |
| **RMSE** | ~12.9M kBtu |
| **Improvement vs. Baseline** | ~45% |

### Most Important Features

1. **PropertyGFATotal** - Total floor area of the building
2. **LargestPropertyUseTypeGFA** - Floor area of the main use type
3. **ENERGYSTARScore** - Energy performance score
4. **Age** - Age of the building
5. **NumberofFloors** - Number of floors

### Key Insights

- **Structural features** (floor area, number of floors) are the best predictors
- **ENERGYSTARScore** is useful but not essential (the model still performs well without it)
- The **log transformation** of the target (TransformedTargetRegressor) improves some models
- **Ensemble models** (RF, GB, XGBoost) clearly outperform the linear models

### Next Step

➡️ **03_prediction_co2.ipynb**: Prediction of CO2 emissions
