# Project: Energy Demand Prediction

**IFT3395/IFT6390 - Fundamentals of Machine Learning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pierrelux/mlbook/blob/main/exercises/projet_energie.ipynb)

**Kaggle competition:** [Join the competition](https://www.kaggle.com/t/72daeb9bff104caf912f9a0b0f42eb5a)

---

## Context

Hydro-Québec publishes open data on the electricity consumption of customers enrolled in a demand-management program. The data include hourly consumption, weather conditions, and peak-event indicators.

Your mission: build a model that predicts energy consumption using **only** the methods covered in chapters 1 to 5 of the course.

## Learning objectives

By the end of this project, you will be able to:

1. Implement ordinary least squares (OLS) from scratch
2. Implement logistic regression with gradient descent
3. Apply Ridge regularization and interpret its effects
4. Build a two-stage model: classification → regression
5. Use predicted probabilities as features

## Evaluation

| Component | Weight | Description |
|-----------|--------|-------------|
| **Oral interview** | **60%** | Verification of understanding |
| Submitted code | 20% | Completion of parts 1-7 |
| Kaggle | 10% | Leaderboard position |
| Written report | 10% | Analysis and reflection |

### Oral interview rubric (60%)

| Criterion | Points | What is assessed |
|-----------|--------|------------------|
| OLS derivation at the board | 15 | Mastery of the analytical solution |
| Gradient descent explanation | 10 | Understanding of the update steps |
| Justification of choices | 15 | Why these features? Why TimeSeriesSplit? |
| Theory questions | 10 | Ridge = MAP, cross-entropy, etc. |
| Live modifications | 10 | Adapt the code and predict the effects |

**Important**: The oral interview is the main component of the evaluation. You must be able to explain and justify every line of code you submit.

### ⚠️ Warning about the use of AI tools

Tools such as ChatGPT, Cursor, and Copilot can help you, **but**:
- You must understand **every line** of code you submit
- The oral interview will quickly reveal whether you understand it or not
- **60% of the grade** depends on your ability to explain your work

**Advice**: Use these tools to learn, not to avoid learning. Code copied without understanding leads to failure at the oral interview.

---

## Part 0: Setup and data loading

Run this cell to import the libraries and load the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Plot configuration
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
%config InlineBackend.figure_format = 'retina'

print("Setup complete!")

### Loading the data

The data come from Hydro-Québec's open dataset [consommation-clients-evenements-pointe](https://donnees.hydroquebec.com/explore/dataset/consommation-clients-evenements-pointe/). We load them directly from GitHub.

In [None]:
# URLs of the data on GitHub
BASE_URL = "https://raw.githubusercontent.com/pierrelux/mlbook/main/data/"

# Load the data
print("Loading data from GitHub...")
train = pd.read_csv(BASE_URL + "energy_train.csv", parse_dates=['horodatage_local'])

# For local evaluation: test set with the target (energie_kwh)
test = pd.read_csv(BASE_URL + "energy_test_avec_cible.csv", parse_dates=['horodatage_local'])

# For Kaggle: test set without the target (to generate predictions)
test_kaggle = pd.read_csv(BASE_URL + "energy_test.csv", parse_dates=['horodatage_local'])

print(f"Training set: {len(train)} observations")
print(f"Test set: {len(test)} observations")
print(f"\nTraining period: {train['horodatage_local'].min()} to {train['horodatage_local'].max()}")
print(f"Test period: {test['horodatage_local'].min()} to {test['horodatage_local'].max()}")

In [None]:
# Data preview
print("Available columns:")
print(train.columns.tolist())
print(f"\nProportion of peak events (train): {train['evenement_pointe'].mean():.1%}")
train.head()

### Variable descriptions

The data contain weather and time measurements used to predict energy consumption.

In [None]:
# Variable descriptions
print("Weather variables:")
print("  - temperature_ext: Mean outdoor temperature (°C)")
print("  - humidite: Mean relative humidity (%)")
print("  - vitesse_vent: Mean wind speed (km/h)")
print("  - neige: Mean snowfall")
print("  - irradiance_solaire: Mean solar irradiance")

print("\nTime variables:")
print("  - heure, mois, jour, jour_semaine: Time components")
print("  - heure_sin, heure_cos, mois_sin, mois_cos: Cyclical encoding")
print("  - est_weekend, est_ferie: Binary indicators")

print("\nOther:")
print("  - evenement_pointe: Peak-event indicator (classification)")
print("  - energie_kwh: Target variable (consumption in kWh)")

print("\nBasic statistics:")
train[['temperature_ext', 'humidite', 'energie_kwh']].describe()
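
The `heure_sin`/`heure_cos` and `mois_sin`/`mois_cos` columns listed above are cyclical encodings. The dataset does not state the exact formula used to build them, so the sketch below is only an assumption about the usual construction: each value is mapped to an angle on the unit circle, so that hour 23 and hour 0 end up as neighbours. The helper `cyclical_encode` and the `_demo` variables are hypothetical names introduced for illustration.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map a periodic quantity to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours_demo = np.arange(24)
sin_demo, cos_demo = cyclical_encode(hours_demo, period=24)

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in the (sin, cos) plane."""
    return np.hypot(sin_demo[h1] - sin_demo[h2], cos_demo[h1] - cos_demo[h2])

# With the raw integer, 23h and 0h are 23 units apart; once encoded,
# they are exactly as close as 0h and 1h.
print(f"raw gap 23h vs 0h: {abs(23 - 0)}")
print(f"encoded distance 23h vs 0h: {encoded_distance(23, 0):.3f}")
print(f"encoded distance 0h vs 1h:  {encoded_distance(0, 1):.3f}")
```

Both coordinates are needed: the sine alone cannot distinguish 3h from 9h, for example, since they share the same sine value.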

In [None]:
# IMPORTANT: the temporal split has already been done
# The test data cover the period starting February 1, 2024
# Do NOT shuffle the data - this is a time series!

print("⚠️  WARNING: temporal split")
print("The train/test sets are already separated chronologically.")
print("Do NOT use random cross-validation (information leakage).")
print("\nFor validation, use a temporal split of train:")
print("  - E.g.: train[:6000] for training, train[6000:] for validation")

# Note: there is a distribution shift between train (winter) and test (spring/summer)
# This is a realistic challenge! Consider features that generalize well.
print("\n📊 Distribution shift:")
print(f"  Train: {train['energie_kwh'].mean():.1f} kWh (winter)")
print(f"  Test:  {test['energie_kwh'].mean():.1f} kWh (spring/summer)")
print("  → The model must generalize across seasons!")
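
The temporal validation recommended above can also be run over several folds with scikit-learn's `TimeSeriesSplit`, which always trains on the past and validates on the block that follows. This is a minimal sketch on synthetic data: the arrays `X_demo` and `y_demo`, the `Ridge(alpha=1.0)` model, and the choice of 4 folds are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, chronologically ordered data standing in for the real features
rng = np.random.default_rng(0)
n = 600
X_demo = rng.normal(size=(n, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

tscv = TimeSeriesSplit(n_splits=4)
scores = []
for fold, (idx_train, idx_val) in enumerate(tscv.split(X_demo)):
    # Every training index precedes every validation index: no look-ahead
    assert idx_train.max() < idx_val.min()
    model = Ridge(alpha=1.0).fit(X_demo[idx_train], y_demo[idx_train])
    score = r2_score(y_demo[idx_val], model.predict(X_demo[idx_val]))
    scores.append(score)
    print(f"fold {fold}: train=[0, {idx_train.max()}], "
          f"val=[{idx_val.min()}, {idx_val.max()}], R2={score:.3f}")
```

Unlike a shuffled K-fold, the training window grows from fold to fold, so the early folds see less data. That is worth keeping in mind when comparing their scores.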

### Visual exploration

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Consumption vs temperature
axes[0, 0].scatter(train['temperature_ext'], train['energie_kwh'], alpha=0.3, s=5)
axes[0, 0].set_xlabel('Temperature (°C)')
axes[0, 0].set_ylabel('Energy consumed (kWh)')
axes[0, 0].set_title('Consumption vs Temperature')

# Distribution of consumption
axes[0, 1].hist(train['energie_kwh'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].set_xlabel('Energy (kWh)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of consumption')

# Hourly profile
profil_horaire = train.groupby('heure')['energie_kwh'].mean()
axes[1, 0].bar(profil_horaire.index, profil_horaire.values)
axes[1, 0].set_xlabel('Hour')
axes[1, 0].set_ylabel('Mean energy (kWh)')
axes[1, 0].set_title('Hourly consumption profile')

# Peak events by hour
pointe_horaire = train.groupby('heure')['evenement_pointe'].mean()
axes[1, 1].bar(pointe_horaire.index, pointe_horaire.values)
axes[1, 1].set_xlabel('Hour')
axes[1, 1].set_ylabel('Proportion of peak events')
axes[1, 1].set_title('Frequency of peak events')

plt.tight_layout()

In [None]:
# ============================================
# IN-DEPTH EXPLORATORY ANALYSIS
# ============================================
import seaborn as sns

print("="*60)
print("DETAILED STATISTICAL ANALYSIS")
print("="*60)

# 1. Correlation matrix of the main variables
features_corr = ['temperature_ext', 'humidite', 'vitesse_vent', 
                 'irradiance_solaire', 'clients_connectes', 'energie_kwh']

corr_matrix = train[features_corr].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Correlation Matrix - Key Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nTop 5 correlations with energie_kwh:")
corr_with_target = corr_matrix['energie_kwh'].sort_values(ascending=False)[1:6]
print(corr_with_target)


# 2. Distribution by month and day of the week
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Consumption by month
monthly_stats = train.groupby('mois')['energie_kwh'].agg(['mean', 'std'])
axes[0, 0].bar(monthly_stats.index, monthly_stats['mean'], 
               yerr=monthly_stats['std'], capsize=5, alpha=0.7, color='steelblue')
axes[0, 0].set_xlabel('Month')
axes[0, 0].set_ylabel('Mean consumption (kWh)')
axes[0, 0].set_title('Monthly Variations (with standard deviation)')
axes[0, 0].grid(True, alpha=0.3, axis='y')

# Consumption by day of the week
weekly_stats = train.groupby('jour_semaine')['energie_kwh'].mean()
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
axes[0, 1].bar(range(7), weekly_stats, color='coral', alpha=0.7)
axes[0, 1].set_xticks(range(7))
axes[0, 1].set_xticklabels(days)
axes[0, 1].set_ylabel('Mean consumption (kWh)')
axes[0, 1].set_title('Weekly Variations')
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Temperature vs consumption (with a fitted trend)
axes[1, 0].scatter(train['temperature_ext'], train['energie_kwh'], 
                   alpha=0.2, s=3, c=train['heure'], cmap='viridis')
# Add a degree-2 polynomial trend curve
z = np.polyfit(train['temperature_ext'], train['energie_kwh'], 2)
p = np.poly1d(z)
temp_range = np.linspace(train['temperature_ext'].min(), 
                         train['temperature_ext'].max(), 100)
axes[1, 0].plot(temp_range, p(temp_range), 'r-', linewidth=2, 
                label='Polynomial trend')
axes[1, 0].set_xlabel('Temperature (°C)')
axes[1, 0].set_ylabel('Energy (kWh)')
axes[1, 0].set_title('Temperature-Consumption Relationship (colored by hour)')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# clients_connectes vs energie_kwh (very important!)
axes[1, 1].scatter(train['clients_connectes'], train['energie_kwh'], 
                   alpha=0.3, s=5, c='green')
axes[1, 1].set_xlabel('Number of connected customers')
axes[1, 1].set_ylabel('Energy (kWh)')
axes[1, 1].set_title('Impact of the Number of Customers (strong correlation!)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


# 3. Analysis of peak events
print("\n" + "="*60)
print("PEAK EVENT ANALYSIS")
print("="*60)

pointe_stats = train.groupby('evenement_pointe')['energie_kwh'].describe()
print("\nConsumption statistics by type:")
print(pointe_stats)

print(f"\nPeak/normal consumption ratio: "
      f"{pointe_stats.loc[1, 'mean'] / pointe_stats.loc[0, 'mean']:.2f}x")

# Comparative boxplot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].boxplot([train[train['evenement_pointe']==0]['energie_kwh'],
                 train[train['evenement_pointe']==1]['energie_kwh']],
                labels=['Normal', 'Peak'])
axes[0].set_ylabel('Consumption (kWh)')
axes[0].set_title('Consumption Distribution: Normal vs Peak')
axes[0].grid(True, alpha=0.3, axis='y')

# Peak-event rate by temperature bin and hour
temp_bins = pd.cut(train['temperature_ext'], bins=5)
pointe_by_temp = train.groupby([temp_bins, 'heure'])['evenement_pointe'].mean().unstack()

im = axes[1].imshow(pointe_by_temp.T, cmap='YlOrRd', aspect='auto')
axes[1].set_xlabel('Temperature range')
axes[1].set_ylabel('Hour')
axes[1].set_title('Peak Probability by Temperature and Hour')
plt.colorbar(im, ax=axes[1], label='P(peak)')

plt.tight_layout()
plt.show()


# 4. Train/test shift check (comparison of means)
print("\n" + "="*60)
print("TRAIN/TEST SHIFT ANALYSIS")
print("="*60)

comparison = pd.DataFrame({
    'Variable': ['Temperature', 'Humidity', 'Wind', 'Energy', 'Peak (%)'],
    'Train (mean)': [
        train['temperature_ext'].mean(),
        train['humidite'].mean(),
        train['vitesse_vent'].mean(),
        train['energie_kwh'].mean(),
        train['evenement_pointe'].mean()*100
    ],
    'Test (mean)': [
        test['temperature_ext'].mean(),
        test['humidite'].mean(),
        test['vitesse_vent'].mean(),
        test['energie_kwh'].mean(),
        test['evenement_pointe'].mean()*100
    ]
})
comparison['Gap (%)'] = 100 * (comparison['Test (mean)'] - 
                               comparison['Train (mean)']) / comparison['Train (mean)']

print(comparison.to_string(index=False))
print("\nObservation: the train/test shift is significant!")
print("Strategy: use features that generalize well (degree-days, etc.)")

---

## Part 1: OLS implementation (10%)

Before using scikit-learn, you must implement the analytical ordinary least squares solution.

**Reminder**: The OLS solution is given by:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
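
One way to arrive at this formula, assuming $\mathbf{X}$ (including its column of ones) has full column rank so that $\mathbf{X}^\top \mathbf{X}$ is invertible: expand the squared error, set its gradient to zero, and solve the resulting normal equations.

$$
\begin{aligned}
L(\boldsymbol{\beta}) &= \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = \mathbf{y}^\top \mathbf{y} - 2\boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} \\
\nabla_{\boldsymbol{\beta}} L &= -2\mathbf{X}^\top \mathbf{y} + 2\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{0} \\
\mathbf{X}^\top \mathbf{X}\, \hat{\boldsymbol{\beta}} &= \mathbf{X}^\top \mathbf{y}
\end{aligned}
$$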

For numerical stability, prefer `np.linalg.solve` to explicit matrix inversion.
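
To make the advice concrete, here is a minimal sketch on synthetic data that solves the normal equations three ways: as a linear system with `np.linalg.solve`, with an explicit inverse, and with `np.linalg.lstsq`, which works on the design matrix directly. The data, the `_demo` names, and `beta_true` are assumptions made for illustration. On a well-conditioned problem like this one the three agree, and the difference only shows up when $\mathbf{X}^\top \mathbf{X}$ is close to singular.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X_demo = rng.normal(size=(n, 3))
beta_true = np.array([1.5, 2.0, -1.0, 0.5])  # [intercept, coef1, coef2, coef3]

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(n), X_demo])
y_demo = A @ beta_true + rng.normal(scale=0.05, size=n)

# Normal equations solved as a linear system (no explicit inverse is formed)
beta_solve = np.linalg.solve(A.T @ A, A.T @ y_demo)

# Same estimate via explicit inversion: more work, and less accurate
# when A.T @ A is ill-conditioned
beta_inv = np.linalg.inv(A.T @ A) @ A.T @ y_demo

# Least-squares solver applied to A itself, without forming A.T @ A
beta_lstsq, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("solve :", np.round(beta_solve, 3))
print("inv   :", np.round(beta_inv, 3))
print("lstsq :", np.round(beta_lstsq, 3))
print(f"cond(A.T @ A) = {np.linalg.cond(A.T @ A):.1f}")
```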

In [None]:
def ols_fit(X, y):
    """
    Computes the OLS coefficients.
    
    Parameters:
        X : ndarray of shape (n, p) - feature matrix (WITHOUT a column of 1s)
        y : ndarray of shape (n,) - target vector
    
    Returns:
        beta : ndarray of shape (p+1,) - coefficients [intercept, coef1, coef2, ...]
    
    Hint: Add a column of 1s to X for the intercept.
    """
    # YOUR CODE HERE
    # 1. Add a column of 1s for the intercept
    n = X.shape[0]
    X_with_intercept = np.column_stack((np.ones(n), X))
    # 2. Solve the system X^T X beta = X^T y
    XTX = X_with_intercept.T @ X_with_intercept
    XTy = X_with_intercept.T @ y
    # 3. Return beta
    beta = np.linalg.solve(XTX, XTy)
    return beta


def ols_predict(X, beta):
    """
    Predicts with the OLS coefficients.
    
    Parameters:
        X : ndarray of shape (n, p) - features (WITHOUT a column of 1s)
        beta : ndarray of shape (p+1,) - coefficients [intercept, coef1, ...]
    
    Returns:
        y_pred : ndarray of shape (n,)
    """
    # YOUR CODE HERE
    n = X.shape[0]
    X_with_intercept = np.column_stack((np.ones(n), X))
    y_pred = X_with_intercept @ beta
    return y_pred

In [None]:
# Test of your implementation
# Simple features to start with
features_base = ['temperature_ext', 'humidite', 'vitesse_vent']

X_train_base = train[features_base].values
y_train = train['energie_kwh'].values
X_test_base = test[features_base].values
y_test = test['energie_kwh'].values

# Your implementation
beta_ols = ols_fit(X_train_base, y_train)
y_pred_ols = ols_predict(X_test_base, beta_ols)

# Validation against sklearn
model_sklearn = LinearRegression()
model_sklearn.fit(X_train_base, y_train)
y_pred_sklearn = model_sklearn.predict(X_test_base)

# Comparison
print("Comparison of your OLS vs sklearn:")
print(f"  Intercept - Yours: {beta_ols[0]:.4f}, sklearn: {model_sklearn.intercept_:.4f}")
print(f"  Coefficients close: {np.allclose(beta_ols[1:], model_sklearn.coef_, atol=1e-4)}")
print(f"\nR² on test: {r2_score(y_test, y_pred_ols):.4f}")

In [None]:
# OLS diagnostics

print("\n" + "="*60)
print("OLS DIAGNOSTICS")
print("="*60)

# 1. Residual analysis
residus_ols = y_test - y_pred_ols

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Residuals vs predictions
axes[0, 0].scatter(y_pred_ols, residus_ols, alpha=0.4, s=10)
axes[0, 0].axhline(0, color='red', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Predictions')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Predictions (heteroscedasticity?)')
axes[0, 0].grid(True, alpha=0.3)

# Q-Q plot (normality of the residuals)
from scipy import stats
stats.probplot(residus_ols, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot (Normality Check)')
axes[0, 1].grid(True, alpha=0.3)

# Histogram of the residuals
axes[1, 0].hist(residus_ols, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(residus_ols.mean(), color='red', linestyle='--',
                   label=f'Mean: {residus_ols.mean():.2f}')
axes[1, 0].set_xlabel('Residual')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Autocorrelation of the residuals (important for time series!)
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(pd.Series(residus_ols), ax=axes[1, 1])
axes[1, 1].set_title('Autocorrelation of Residuals')
axes[1, 1].set_xlabel('Lag')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nResidual statistics:")
print(f"  Mean: {residus_ols.mean():.4f} (should be ~0)")
print(f"  Standard deviation: {residus_ols.std():.2f}")
print(f"  Min: {residus_ols.min():.2f}, Max: {residus_ols.max():.2f}")

# Shapiro-Wilk test (normality); skipped for large samples,
# where its p-value is unreliable
if len(residus_ols) < 5000:
    stat, p_value = stats.shapiro(residus_ols)
    print(f"  Shapiro-Wilk test: p-value = {p_value:.4f}")
    if p_value < 0.05:
        print("    → Residuals are NOT normal (but acceptable for large datasets)")

# 2. OLS coefficients
print(f"\n{'='*60}")
print("INTERPRETATION OF THE OLS COEFFICIENTS")
print("="*60)

coef_df = pd.DataFrame({
    'Feature': features_base,
    'Coefficient': beta_ols[1:],
    '|Coefficient|': np.abs(beta_ols[1:])
}).sort_values('|Coefficient|', ascending=False)

print(coef_df.to_string(index=False))

print(f"\nIntercept: {beta_ols[0]:.2f} kWh")
print("\nExample interpretation:")
print(f"  - {features_base[0]}: coefficient = {beta_ols[1]:.2f}")
print(f"    → +1°C → {beta_ols[1]:+.2f} kWh of consumption")

---

## Part 2: Logistic regression with gradient descent (15%)

Implement logistic regression for binary classification.

**Reminders**:
- Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- Cross-entropy loss: $L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]$
- Gradient: $\nabla L = \frac{1}{n} \mathbf{X}^\top (\sigma(\mathbf{X}\boldsymbol{\beta}) - \mathbf{y})$
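
A common way to convince yourself that the gradient formula above is right (and that your implementation matches it) is a finite-difference check. The sketch below compares the analytic gradient with a central-difference approximation of the loss on random data. Every name ending in `_demo`, the random data, and the step size `h` are assumptions made for illustration.

```python
import numpy as np

def sigmoid_demo(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def loss_demo(X, y, beta, eps=1e-15):
    """Binary cross-entropy, with probabilities clipped away from 0 and 1."""
    p = np.clip(sigmoid_demo(X @ beta), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad_demo(X, y, beta):
    """Gradient from the formula: X^T (sigma(X beta) - y) / n."""
    return X.T @ (sigmoid_demo(X @ beta) - y) / len(y)

def numeric_grad_demo(X, y, beta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (loss_demo(X, y, beta + step) - loss_demo(X, y, beta - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X_chk = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])  # with intercept column
y_chk = (rng.random(200) < 0.3).astype(float)
beta_chk = rng.normal(scale=0.5, size=4)

max_gap = np.max(np.abs(analytic_grad_demo(X_chk, y_chk, beta_chk)
                        - numeric_grad_demo(X_chk, y_chk, beta_chk)))
print(f"max |analytic - numeric| = {max_gap:.2e}")
```

If the gap is not tiny, the usual suspects are a missing $1/n$ factor or a sign error in `sigmoid(X @ beta) - y`.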

In [None]:
def sigmoid(z):
    """
    Sigmoid function.
    
    Hint: For numerical stability, clip z between -500 and 500.
    """
    # YOUR CODE HERE
    z_clipped = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z_clipped))

def cross_entropy_loss(y_true, y_pred_proba):
    """
    Computes the binary cross-entropy loss.
    
    Hint: Clip the probabilities to avoid log(0).
    """
    # YOUR CODE HERE
    epsilon = 1e-15
    y_pred_proba_clipped = np.clip(y_pred_proba, epsilon, 1 - epsilon)
    loss = -np.mean(y_true * np.log(y_pred_proba_clipped) + (1 - y_true) * np.log(1 - y_pred_proba_clipped))
    return loss

def logistic_gradient(X, y, beta):
    """
    Computes the gradient of the cross-entropy loss.
    
    Parameters:
        X : ndarray (n, p+1) - features WITH a column of 1s
        y : ndarray (n,) - binary labels
        beta : ndarray (p+1,) - current coefficients
    
    Returns:
        gradient : ndarray (p+1,)
    """
    # YOUR CODE HERE
    n = len(y)
    z = X @ beta
    y_pred_proba = sigmoid(z)
    error = y_pred_proba - y
    gradient = (X.T @ error) / n
    
    return gradient

def logistic_fit_gd(X, y, lr=0.1, n_iter=1000, verbose=False):
    """
    Trains logistic regression by gradient descent.
    
    Parameters:
        X : ndarray (n, p) - features WITHOUT a column of 1s
        y : ndarray (n,) - binary labels (0 or 1)
        lr : float - learning rate
        n_iter : int - number of iterations
        verbose : bool - print progress
    
    Returns:
        beta : ndarray (p+1,) - coefficients [intercept, coef1, ...]
        losses : list - loss history
    """
    # YOUR CODE HERE
    n, p = X.shape
    # 1. Add a column of 1s to X
    X_with_intercept = np.column_stack([np.ones(n), X])
    # 2. Initialize beta to zero
    beta = np.zeros(p + 1)
    losses = []
    # 3. Gradient descent loop
    for i in range(n_iter):
        Z = X_with_intercept @ beta
        y_pred_proba = sigmoid(Z)
        
        loss = cross_entropy_loss(y, y_pred_proba)
        losses.append(loss)
        
        gradient = logistic_gradient(X_with_intercept, y, beta)
        
        beta -= lr * gradient
        
        if verbose and (i % 100 == 0 or i == n_iter - 1):
            print(f"Iteration {i+1}/{n_iter}, Loss: {loss:.4f}")
    # 4. Return beta and the loss history
    return beta, losses


def logistic_predict_proba(X, beta):
    """
    Returns the probabilities P(Y=1|X).
    """
    # YOUR CODE HERE
    n = X.shape[0]
    X_with_intercept = np.column_stack([np.ones(n), X])
    z = X_with_intercept @ beta
    return sigmoid(z)

In [None]:
# Test on peak-event prediction
# Features for classification
features_clf = ['temperature_ext', 'heure_sin', 'heure_cos', 'est_weekend']

X_train_clf = train[features_clf].values
y_train_clf = train['evenement_pointe'].values
X_test_clf = test[features_clf].values
y_test_clf = test['evenement_pointe'].values

# Standardize (recommended for gradient descent); the scaler is fit on train only
scaler = StandardScaler()
X_train_clf_scaled = scaler.fit_transform(X_train_clf)
X_test_clf_scaled = scaler.transform(X_test_clf)

# Train your model
beta_log, losses = logistic_fit_gd(X_train_clf_scaled, y_train_clf, lr=0.1, n_iter=500, verbose=True)

# Plot the convergence curve
plt.figure(figsize=(8, 5))
plt.plot(losses)
plt.xlabel('Iteration')
plt.ylabel('Loss (cross-entropy)')
plt.title('Convergence of gradient descent')
plt.grid(True, alpha=0.3)
plt.tight_layout()

In [None]:
# Evaluation
proba_train = logistic_predict_proba(X_train_clf_scaled, beta_log)
proba_test = logistic_predict_proba(X_test_clf_scaled, beta_log)

y_pred_train = (proba_train >= 0.5).astype(int)
y_pred_test = (proba_test >= 0.5).astype(int)

print("Evaluation of your logistic regression:")
print(f"  Accuracy (train): {accuracy_score(y_train_clf, y_pred_train):.4f}")
print(f"  Accuracy (test): {accuracy_score(y_test_clf, y_pred_test):.4f}")
print("\nClassification report (test):")
print(classification_report(y_test_clf, y_pred_test, target_names=['Normal', 'Peak']))

In [None]:
print("\n" + "="*60)
print("IN-DEPTH CLASSIFICATION ANALYSIS")
print("="*60)

# 1. Detailed confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test_clf, y_pred_test)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Confusion matrix
im = axes[0].imshow(cm, cmap='Blues')
axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(['Normal', 'Peak'])
axes[0].set_yticklabels(['Normal', 'Peak'])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# Annotate each cell with its count
for i in range(2):
    for j in range(2):
        text = axes[0].text(j, i, cm[i, j],
                           ha="center", va="center", color="black", fontsize=16)

plt.colorbar(im, ax=axes[0])

# Compute the metrics
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print("\nDetailed metrics:")
print(f"  True Negatives:  {tn}")
print(f"  False Positives: {fp} (false alarms)")
print(f"  False Negatives: {fn} (missed peaks)")
print(f"  True Positives:  {tp}")
print(f"\n  Precision: {precision:.4f} (When a peak is predicted, it is correct in {precision*100:.1f}% of cases)")
print(f"  Recall:    {recall:.4f} ({recall*100:.1f}% of actual peaks are detected)")
print(f"  F1-score:  {f1:.4f}")


# 2. ROC curve and AUC
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test_clf, proba_test)
roc_auc = auc(fpr, tpr)

axes[1].plot(fpr, tpr, color='darkorange', lw=2,
            label=f'ROC (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
            label='Chance')
axes[1].set_xlabel('False Positive Rate (FPR)')
axes[1].set_ylabel('True Positive Rate (TPR)')
axes[1].set_title('ROC Curve')
axes[1].legend(loc="lower right")
axes[1].grid(True, alpha=0.3)

print(f"\n  AUC-ROC: {roc_auc:.4f} (1.0 = perfect, 0.5 = chance)")


# 3. Precision-recall as a function of the threshold
from sklearn.metrics import precision_recall_curve

precision_vals, recall_vals, thresholds_pr = precision_recall_curve(y_test_clf, proba_test)

axes[2].plot(recall_vals, precision_vals, color='green', lw=2)
axes[2].set_xlabel('Recall')
axes[2].set_ylabel('Precision')
axes[2].set_title('Precision-Recall Curve')
axes[2].grid(True, alpha=0.3)

# Mark the point whose threshold is closest to 0.5
idx_05 = np.argmin(np.abs(thresholds_pr - 0.5))
axes[2].plot(recall_vals[idx_05], precision_vals[idx_05], 'ro',
            markersize=10, label='Threshold = 0.5')
axes[2].legend()

plt.tight_layout()
plt.show()


# 4. Analysis by threshold
print(f"\n{'='*60}")
print("EFFECT OF THE DECISION THRESHOLD")
print("="*60)

seuils_test = [0.3, 0.4, 0.5, 0.6, 0.7]
results_seuils = []

for seuil in seuils_test:
    y_pred_seuil = (proba_test >= seuil).astype(int)
    acc = accuracy_score(y_test_clf, y_pred_seuil)
    cm_seuil = confusion_matrix(y_test_clf, y_pred_seuil)
    tn, fp, fn, tp = cm_seuil.ravel()
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    results_seuils.append({
        'Threshold': seuil,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'FP': fp,
        'FN': fn
    })

df_seuils = pd.DataFrame(results_seuils)
print(df_seuils.to_string(index=False))

print("\nInterpretation:")
print("  - Low threshold (0.3): more detections → high recall, but more false alarms")
print("  - High threshold (0.7): fewer false alarms → high precision, but missed peaks")
print("  - The right trade-off depends on the business context!")

---

## Part 3: Feature engineering (15%)

**From this point on, you may use scikit-learn.**

Create time-based features to improve the regression model.

### Features to implement:

1. **Lags**: consumption in previous hours
2. **Rolling statistics**: moving average, moving standard deviation
3. **Interactions**: temperature × hour, etc.

Implement **at least 3 new features**.
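
One pitfall with lags and rolling statistics of the target is that the window can include the very value being predicted. A minimal sketch of one way to avoid this, on a small synthetic hourly series: shift the series by one step before computing the rolling statistic, so each row only sees past values. The `demo` DataFrame and the new column names are hypothetical; only `temperature_ext`, `energie_kwh`, and the 18 °C base come from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the real data (72 consecutive hours)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temperature_ext": rng.normal(-5, 4, size=72),
    "energie_kwh": rng.normal(100, 10, size=72),
})

# Lag: consumption observed one hour earlier
demo["energie_lag1"] = demo["energie_kwh"].shift(1)

# Rolling mean over the 6 PREVIOUS hours: shifting first keeps the
# current hour's target out of its own feature
demo["energie_roll6_past"] = demo["energie_kwh"].shift(1).rolling(6).mean()

# Heating degree feature relative to an 18 °C base
demo["degres_chauffage"] = np.maximum(18 - demo["temperature_ext"], 0)

# The first rows have no complete history and are dropped
demo_clean = demo.dropna().reset_index(drop=True)
print(f"{len(demo)} rows -> {len(demo_clean)} rows after dropping incomplete windows")
```

The same caution applies at prediction time: a lag of the target is only usable if the past consumption is actually known when the forecast is made.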

In [None]:
def creer_caracteristiques(df):
    """
    Creates additional features.
    
    YOU MUST IMPLEMENT AT LEAST 3 NEW FEATURES.
    
    Ideas:
    - Lags: df['energie_kwh'].shift(1), shift(24)
    - Rolling means: df['energie_kwh'].shift(1).rolling(6).mean()
    - Interactions: df['temperature_ext'] * df['heure_cos']
    - Heating degree-days: np.maximum(18 - df['temperature_ext'], 0)
    """
    df = df.copy()
    
    # YOUR CODE HERE
    # Lags of the target: 1 hour, 1 day, 1 week
    df['energie_lag1'] = df['energie_kwh'].shift(1)
    df['energie_lag24'] = df['energie_kwh'].shift(24)
    df['energie_lag168'] = df['energie_kwh'].shift(168)
    
    # Rolling statistics over PAST values only: shift(1) keeps the current
    # hour's target out of its own features (otherwise: target leakage)
    energie_passee = df['energie_kwh'].shift(1)
    df['energie_rolling_6h'] = energie_passee.rolling(6).mean()
    df['energie_rolling_24h'] = energie_passee.rolling(24).mean()
    df['energie_rolling_std_24h'] = energie_passee.rolling(
        window=24, 
        min_periods=1
    ).std().fillna(0)
    df['energie_rolling_max_12h'] = energie_passee.rolling(
        window=12,
        min_periods=1
    ).max()
    
    # Interactions between temperature and time
    df['temp_heure_cos'] = df['temperature_ext'] * df['heure_cos']
    df['temp_heure_sin'] = df['temperature_ext'] * df['heure_sin']
    df['temp_weekend'] = df['temperature_ext'] * df['est_weekend']
    df['temp_mois_sin'] = df['temperature_ext'] * df['mois_sin']
    df['temp_mois_cos'] = df['temperature_ext'] * df['mois_cos']
    
    # Weather transformations
    df['degres_jours_chauffage'] = np.maximum(18 - df['temperature_ext'], 0)
    df['degres_jours_clim'] = np.maximum(df['temperature_ext'] - 22, 0)
    df['temp_squared'] = df['temperature_ext'] ** 2
    df['temp_ressentie'] = df['temperature_ext'] - 0.5 * df['vitesse_vent']
    df['humidite_temp'] = df['humidite'] * np.abs(df['temperature_ext']) / 100
    
    # Time-of-day and season indicators
    df['est_pointe_matin'] = ((df['heure'] >= 7) & (df['heure'] <= 9)).astype(int)
    df['est_pointe_soir'] = ((df['heure'] >= 17) & (df['heure'] <= 20)).astype(int)
    df['est_nuit'] = ((df['heure'] >= 0) & (df['heure'] <= 6)).astype(int)
    df['est_hiver'] = df['mois'].isin([12, 1, 2]).astype(int)
    df['est_ete'] = df['mois'].isin([6, 7, 8]).astype(int)
    
    # Temperature dynamics
    df['temp_rolling_mean_3h'] = df['temperature_ext'].rolling(
        window=3, 
        min_periods=1
    ).mean()
    df['temp_diff'] = df['temperature_ext'].diff().fillna(0)
    df['temp_amplitude_24h'] = (
        df['temperature_ext'].rolling(window=24, min_periods=1).max() - 
        df['temperature_ext'].rolling(window=24, min_periods=1).min()
    )
    
    if 'clients_connectes' in df.columns:
        df['clients_temp'] = df['clients_connectes'] * df['temperature_ext']
        # WARNING: computed from the current target; do not use as a model input
        df['energie_per_client'] = df['energie_kwh'] / (df['clients_connectes'] + 1)
        df['clients_weekend'] = df['clients_connectes'] * df['est_weekend']
    
    return df

# Apply to the data
train_eng = creer_caracteristiques(train)
test_eng = creer_caracteristiques(test)

# Drop the rows with NaN (caused by the lags)
train_eng = train_eng.dropna().reset_index(drop=True)
test_eng = test_eng.dropna().reset_index(drop=True)

print(f"New columns: {[c for c in train_eng.columns if c not in train.columns]}")

In [None]:
train_enrichi = creer_caracteristiques(train)
test_enrichi = creer_caracteristiques(test)

# Drop the NaN rows (caused by the lags/rolling windows)
train_enrichi = train_enrichi.dropna().reset_index(drop=True)
test_enrichi = test_enrichi.dropna().reset_index(drop=True)

# Check the new columns
nouvelles_cols = [c for c in train_enrichi.columns if c not in train.columns]
print(f"Number of new features: {len(nouvelles_cols)}")
print("\nNew features created:")
for col in nouvelles_cols:
    print(f"  - {col}")

# Check the correlations with the target
correlations = train_enrichi[nouvelles_cols + ['energie_kwh']].corr()['energie_kwh'].sort_values(ascending=False)
print("\nTop 10 features by correlation with energie_kwh:")
print(correlations.head(10))

features_to_use = [
    # Basic weather
    'temperature_ext', 'humidite', 'vitesse_vent', 'irradiance_solaire',
    
    # Cyclical time
    'heure_sin', 'heure_cos', 'mois_sin', 'mois_cos',
    'jour_semaine_sin', 'jour_semaine_cos',
    
    # Binary indicators
    'est_weekend', 'est_ferie', 'est_pointe_matin', 'est_pointe_soir',
    
    # VERY IMPORTANT
    'clients_connectes',
    
    # Lags (careful with Kaggle!)
    'energie_lag1', 'energie_lag24',
    
    # Rolling (names must match those created in creer_caracteristiques)
    'energie_rolling_6h', 'energie_rolling_24h',
    
    # Interactions
    'temp_heure_cos', 'temp_weekend',
    
    # Weather transformations
    'degres_jours_chauffage', 'temp_squared'
]

# Keep only the columns that actually exist (missing names are silently dropped)
features_disponibles = [f for f in features_to_use if f in train_enrichi.columns]

print(f"\nSelected features: {len(features_disponibles)}")

X_train = train_enrichi[features_disponibles].values
y_train = train_enrichi['energie_kwh'].values
X_test = test_enrichi[features_disponibles].values
y_test = test_enrichi['energie_kwh'].values

# Train a simple model as a test
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

from sklearn.metrics import r2_score
print(f"\nR² with enriched features: {r2_score(y_test, model.predict(X_test)):.4f}")

In [None]:
print("="*60)
print("IMPACT ANALYSIS OF THE NEW FEATURES")
print("="*60)

# 1. Correlation heatmap for the new features
nouvelles_features_sample = nouvelles_cols[:15]  # first 15 new features (not ranked)
corr_nouvelles = train_enrichi[nouvelles_features_sample + ['energie_kwh']].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_nouvelles, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=0.5, cbar_kws={'label': 'Correlation'})
plt.title('Correlations of New Features with the Target', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# 2. Incremental test: add feature groups one at a time
print(f"\n{'='*60}")
print("INCREMENTAL TEST - IMPACT BY FEATURE TYPE")
print("="*60)

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Baseline: basic features
features_baseline = ['temperature_ext', 'humidite', 'vitesse_vent',
                     'heure_sin', 'heure_cos', 'mois_sin', 'mois_cos',
                     'est_weekend', 'clients_connectes']

# Cumulative feature groups
feature_groups = {
    'Baseline': features_baseline,
    '+ Lags': features_baseline + ['energie_lag1', 'energie_lag24'],
    '+ Rolling': features_baseline + ['energie_lag1', 'energie_lag24',
                                      'energie_rolling_6h', 'energie_rolling_24h'],
    '+ Interactions': features_baseline + ['energie_lag1', 'energie_lag24',
                                           'energie_rolling_6h', 'energie_rolling_24h',
                                           'temp_heure_cos', 'temp_weekend'],
    '+ Transformations': features_baseline + ['energie_lag1', 'energie_lag24',
                                               'energie_rolling_6h', 'energie_rolling_24h',
                                               'temp_heure_cos', 'temp_weekend',
                                               'degres_jours_chauffage', 'temp_squared']
}

results_incremental = []

for name, feats in feature_groups.items():
    feats_avail = [f for f in feats if f in train_enrichi.columns]
    
    X_tr = train_enrichi[feats_avail].values
    y_tr = train_enrichi['energie_kwh'].values
    X_te = test_enrichi[feats_avail].values
    y_te = test_enrichi['energie_kwh'].values
    
    model_temp = Ridge(alpha=1.0)
    model_temp.fit(X_tr, y_tr)
    
    r2_test = r2_score(y_te, model_temp.predict(X_te))
    rmse_test = np.sqrt(mean_squared_error(y_te, model_temp.predict(X_te)))
    
    results_incremental.append({
        'Configuration': name,
        'Nb Features': len(feats_avail),
        'R² Test': r2_test,
        'RMSE Test': rmse_test
    })

df_incremental = pd.DataFrame(results_incremental)
print(df_incremental.to_string(index=False))

# Plot the improvement
plt.figure(figsize=(10, 6))
plt.plot(df_incremental['Configuration'], df_incremental['R² Test'],
         'o-', linewidth=2, markersize=10, color='steelblue')
plt.xlabel('Configuration')
plt.ylabel('R² Test')
plt.title('Incremental Impact of Features', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# The sign is formatted explicitly so a degradation is not printed as "+-..."
print(f"\nTotal change: {(df_incremental.iloc[-1]['R² Test'] - df_incremental.iloc[0]['R² Test']):+.4f} R²")

---

## Part 4: Ridge Regression (15%)

With several correlated features, regularization becomes useful.

1. Train a Ridge model with cross-validation to choose λ
2. Compare its performance with OLS
3. Analyze how the coefficients change
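
To make the motivation concrete, here is a minimal sketch on synthetic data. Everything in it (the toy data, the seed, and the helper name `ridge_closed_form`) is an assumption made for illustration and is not part of the assignment. It solves the regularized normal equations directly and compares λ = 0 (OLS) with λ = 1 when two features are almost identical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y (no intercept, for illustration only)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)      # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)   # true weights are (1, 1)

w_ols = ridge_closed_form(X, y, lam=0.0)
w_ridge = ridge_closed_form(X, y, lam=1.0)

cond_ols = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
print("cond(X^T X):      ", cond_ols)
print("cond(X^T X + I):  ", cond_ridge)
```

With nearly collinear columns, the matrix `X.T @ X` is badly conditioned, so the OLS weights can swing in opposite directions to fit noise. Adding λ to the diagonal improves the conditioning and pulls the weights back toward a shared, smaller solution.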

In [None]:
# Define your features for the regression
# EDIT THIS LIST to match the features you created in Part 3
# IMPORTANT: clients_connectes is a very important variable!
features_reg = [
    'temperature_ext', 'humidite', 'vitesse_vent', 'irradiance_solaire',
    'heure_sin', 'heure_cos', 'mois_sin', 'mois_cos',
    'jour_semaine_sin', 'jour_semaine_cos',
    'est_weekend', 'est_ferie',
    'clients_connectes',  # Do not forget!
    # Add your features here
]

# Merge with the Part 3 list, dropping duplicates while keeping the order
# (a plain += would repeat columns such as temperature_ext in the design matrix)
features_reg = list(dict.fromkeys(features_reg + features_to_use))

# Check that all columns exist
features_disponibles = [f for f in features_reg if f in train_eng.columns]
print(f"Features used: {len(features_disponibles)}")

X_train_reg = train_eng[features_disponibles].values
y_train_reg = train_eng['energie_kwh'].values
X_test_reg = test_eng[features_disponibles].values
y_test_reg = test_eng['energie_kwh'].values

In [None]:
# OLS model (baseline)
model_ols = LinearRegression()
model_ols.fit(X_train_reg, y_train_reg)

y_pred_ols_test = model_ols.predict(X_test_reg)
y_pred_ols_train = model_ols.predict(X_train_reg)

r2_ols_train = r2_score(y_train_reg, y_pred_ols_train)
r2_ols_test = r2_score(y_test_reg, y_pred_ols_test)
rmse_ols_test = np.sqrt(mean_squared_error(y_test_reg, y_pred_ols_test))

print("OLS (baseline):")
print(f"  R² train: {r2_ols_train:.4f}")
print(f"  R² test:  {r2_ols_test:.4f}")
print(f"  RMSE test: {rmse_ols_test:.4f}")

if r2_ols_train - r2_ols_test > 0.1:
    print("Warning: the train-test R² gap suggests overfitting with OLS!")


In [None]:
# Ridge model with cross-validation
# CAREFUL: use TimeSeriesSplit for time-ordered data!
from sklearn.model_selection import TimeSeriesSplit

alphas = [0.01, 0.1, 1, 10, 100, 1000]
tscv = TimeSeriesSplit(n_splits=5)

model_ridge = RidgeCV(alphas=alphas, cv=tscv)
model_ridge.fit(X_train_reg, y_train_reg)

y_pred_ridge_train = model_ridge.predict(X_train_reg)
y_pred_ridge_test = model_ridge.predict(X_test_reg)

r2_ridge_train = r2_score(y_train_reg, y_pred_ridge_train)
r2_ridge_test = r2_score(y_test_reg, y_pred_ridge_test)
rmse_ridge_test = np.sqrt(mean_squared_error(y_test_reg, y_pred_ridge_test))

print(f"\nRidge (λ={model_ridge.alpha_}):")
print(f"  R² train: {r2_ridge_train:.4f}")
print(f"  R² test:  {r2_ridge_test:.4f}")
print(f"  RMSE test: {rmse_ridge_test:.4f}")
print(f"  R² train-test gap: {r2_ridge_train - r2_ridge_test:.4f}")
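
The reason for `TimeSeriesSplit` can be seen by printing the folds it produces. The sketch below uses 12 toy time steps (the names `X_demo`, `tscv_demo`, and `folds` are illustrative assumptions): every validation block comes strictly after the data used for training, so the model is never scored on observations that precede what it has already seen.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)   # 12 time-ordered observations (toy data)
tscv_demo = TimeSeriesSplit(n_splits=3)

folds = list(tscv_demo.split(X_demo))
for k, (train_idx, val_idx) in enumerate(folds):
    # The training window grows; the validation window always lies in its future
    print(f"fold {k}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

A shuffled K-fold split would instead mix future and past rows in the training set, which tends to give optimistic scores when lagged or rolling features are present.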

In [None]:
# Compare OLS and Ridge coefficients
coef_comparison = pd.DataFrame({
    'Feature': features_disponibles,
    'OLS': model_ols.coef_,
    'Ridge': model_ridge.coef_
})
coef_comparison['Shrinkage (%)'] = 100 * (1 - np.abs(coef_comparison['Ridge']) / (np.abs(coef_comparison['OLS']) + 1e-8))
coef_comparison = coef_comparison.sort_values('Shrinkage (%)', ascending=False)

print("\nCoefficient comparison (sorted by shrinkage):")
print(coef_comparison.to_string(index=False))

In [None]:
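# Fit a fixed λ=1 Ridge so that the 'Ridge (λ=1)' row reports its own scores
# (the original table reused the RidgeCV scores for that row)
from sklearn.linear_model import Ridge

model_ridge_1 = Ridge(alpha=1.0).fit(X_train_reg, y_train_reg)
y_pred_ridge1_test = model_ridge_1.predict(X_test_reg)
r2_ridge1_train = r2_score(y_train_reg, model_ridge_1.predict(X_train_reg))
r2_ridge1_test = r2_score(y_test_reg, y_pred_ridge1_test)
rmse_ridge1_test = np.sqrt(mean_squared_error(y_test_reg, y_pred_ridge1_test))

results = pd.DataFrame({
    'Model': ['OLS', 'Ridge (λ=1)', f'RidgeCV (λ={model_ridge.alpha_})'],
    'R² train': [r2_ols_train, r2_ridge1_train, r2_ridge_train],
    'R² test': [r2_ols_test, r2_ridge1_test, r2_ridge_test],
    'RMSE test': [rmse_ols_test, rmse_ridge1_test, rmse_ridge_test],
})
results['Gap'] = (results['R² train'] - results['R² test']).abs()

print(results.to_string(index=False))

# Best model (by test R²)
best_idx = results['R² test'].idxmax()
print(f"\nBest model: {results.loc[best_idx, 'Model']}")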

In [None]:
# Coefficient analysis: OLS vs the cross-validated Ridge
coef_comparison = pd.DataFrame({
    'Feature': features_disponibles,
    'OLS': model_ols.coef_,
    'Ridge': model_ridge.coef_
})

# Compute the shrinkage
coef_comparison['Shrinkage (%)'] = 100 * (
    1 - np.abs(coef_comparison['Ridge']) / (np.abs(coef_comparison['OLS']) + 1e-8)
)

# Sort by shrinkage
coef_comparison = coef_comparison.sort_values('Shrinkage (%)', ascending=False)

print(coef_comparison.to_string(index=False))

print("\nObservations:")
print(f"  - Mean shrinkage: {coef_comparison['Shrinkage (%)'].mean():.1f}%")
print(f"  - Max shrinkage: {coef_comparison['Shrinkage (%)'].max():.1f}%")
print(f"  - Most shrunk feature: {coef_comparison.iloc[0]['Feature']}")

In [None]:
# Try several values of λ
lambdas_test = np.logspace(-2, 4, 50)  # 0.01 to 10000
coefficients_path = []

for lam in lambdas_test:
    model_temp = Ridge(alpha=lam)
    # Same design matrix as model_ridge, so columns line up with features_disponibles
    model_temp.fit(X_train_reg, y_train_reg)
    coefficients_path.append(model_temp.coef_)

coefficients_path = np.array(coefficients_path)

# Plot
plt.figure(figsize=(12, 6))
for i, feature in enumerate(features_disponibles[:10]):  # first 10 features
    plt.plot(lambdas_test, coefficients_path[:, i], label=feature, linewidth=2)

plt.xscale('log')
plt.xlabel('λ (log scale)', fontsize=12)
plt.ylabel('Coefficient', fontsize=12)
plt.title('Ridge Regularization Path', fontsize=14, fontweight='bold')
plt.axvline(model_ridge.alpha_, color='red', linestyle='--',
            linewidth=2, label=f'Selected λ = {model_ridge.alpha_}')
plt.grid(True, alpha=0.3)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
plt.tight_layout()
plt.show()

In [None]:
r2_train_list = []
r2_test_list = []

for lam in lambdas_test:
    model_temp = Ridge(alpha=lam)
    # Same matrices as model_ridge, so the selected-λ line is comparable
    model_temp.fit(X_train_reg, y_train_reg)

    r2_train_list.append(r2_score(y_train_reg, model_temp.predict(X_train_reg)))
    r2_test_list.append(r2_score(y_test_reg, model_temp.predict(X_test_reg)))

plt.figure(figsize=(10, 6))
plt.plot(lambdas_test, r2_train_list, label='R² train', linewidth=2, color='blue')
plt.plot(lambdas_test, r2_test_list, label='R² test', linewidth=2, color='orange')
plt.axvline(model_ridge.alpha_, color='red', linestyle='--',
            linewidth=2, label=f'Selected λ = {model_ridge.alpha_}')

plt.xscale('log')
plt.xlabel('λ (log scale)', fontsize=12)
plt.ylabel('R²', fontsize=12)
plt.title('Ridge Validation Curve', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()

**Questions for the oral interview**:
- Why does Ridge help when features are correlated?
- Which feature was shrunk the most? Why?
- How can Ridge be interpreted as MAP estimation?
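
As a starting point for the MAP question, here is a short derivation sketch. The notation is an assumption made here (noise variance $\sigma^2$, prior variance $\tau^2$); the course notes may use different symbols. Assume Gaussian noise and a zero-mean isotropic Gaussian prior on the weights:

$$
y_i = \mathbf{w}^\top \mathbf{x}_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})
$$

Maximizing the log-posterior is the same as minimizing the negative log-likelihood plus the negative log-prior:

$$
\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) \right] = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2} \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert^2 + \frac{1}{2\tau^2} \lVert \mathbf{w} \rVert^2 \right]
$$

Multiplying the objective by $2\sigma^2$ does not change the minimizer and gives the Ridge problem with $\lambda = \sigma^2 / \tau^2$:

$$
\hat{\mathbf{w}}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2 \right] = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}
$$

A tighter prior (small $\tau^2$) therefore corresponds to a larger $\lambda$ and stronger shrinkage.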

---

## Part 5: Classification sub-task (15%)

Train a classifier to predict peak events, then use the predicted probability as a feature for the regression.

**Steps**:
1. Train LogisticRegression on `evenement_pointe`
2. Extract `P(pointe)` for each observation
3. Add this probability as a feature for Ridge
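
The three steps can be sketched end to end on synthetic data. The toy variables below (`temp`, `peak`, `energy`) and their generating process are assumptions for illustration only; they are not the project's data. Scores are computed in-sample, so they are optimistic and only meant to show the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(0, 10, size=n)                             # toy outdoor temperature
peak = (temp + rng.normal(0, 3, size=n) < -8).astype(int)    # toy peak-event label
energy = 100 - 2 * temp + 30 * peak + rng.normal(0, 5, size=n)

X_base = temp.reshape(-1, 1)

# Step 1: train the classifier on the peak label
clf = LogisticRegression(max_iter=1000).fit(X_base, peak)

# Step 2: extract P(peak) for each observation
p_peak = clf.predict_proba(X_base)[:, 1]

# Step 3: add the probability as an extra feature for Ridge
X_aug = np.column_stack([X_base, p_peak])
r2_base = Ridge(alpha=1.0).fit(X_base, energy).score(X_base, energy)
r2_aug = Ridge(alpha=1.0).fit(X_aug, energy).score(X_aug, energy)

print(f"R² without P(peak): {r2_base:.4f}")
print(f"R² with P(peak):    {r2_aug:.4f}")
```

Here the probability is a smooth nonlinear function of temperature, which lets the linear second stage capture a jump that a single temperature coefficient cannot.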

In [None]:
# Features for the classification
# Use features that do not "cheat" (no past consumption to predict the peak)
features_pointe = ['temperature_ext', 'humidite', 'vitesse_vent', 'heure_sin', 'heure_cos', 'est_weekend', 'clients_connectes']

X_train_pointe = train_eng[features_pointe].values
y_train_pointe = train_eng['evenement_pointe'].values
X_test_pointe = test_eng[features_pointe].values
y_test_pointe = test_eng['evenement_pointe'].values

# Train the classifier
clf_pointe = LogisticRegression(max_iter=1000)
clf_pointe.fit(X_train_pointe, y_train_pointe)

# Evaluation
print("Peak-event classification:")
print(f"  Accuracy (train): {clf_pointe.score(X_train_pointe, y_train_pointe):.4f}")
print(f"  Accuracy (test): {clf_pointe.score(X_test_pointe, y_test_pointe):.4f}")

In [None]:
# Extract the probabilities
train_eng['P_pointe'] = clf_pointe.predict_proba(X_train_pointe)[:, 1]
test_eng['P_pointe'] = clf_pointe.predict_proba(X_test_pointe)[:, 1]

print("Distribution of P(pointe):")
print(f"  Train: mean={train_eng['P_pointe'].mean():.3f}, std={train_eng['P_pointe'].std():.3f}")
print(f"  Test:  mean={test_eng['P_pointe'].mean():.3f}, std={test_eng['P_pointe'].std():.3f}")

**Interview question**: Why use P(pointe) instead of a 0/1 indicator?
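
One way to think about this question, using made-up probabilities (the values in `p` are hypothetical, not model output): thresholding discards the classifier's confidence.

```python
import numpy as np

p = np.array([0.02, 0.49, 0.51, 0.98])   # hypothetical predicted P(peak)
hard = (p >= 0.5).astype(int)            # thresholded 0/1 indicator

for p_i, h_i in zip(p, hard):
    print(f"P(peak)={p_i:.2f} -> indicator={h_i}")
```

After thresholding, 0.49 and 0.51 receive opposite labels although they are nearly identical, while 0.51 and 0.98 become indistinguishable. Keeping the probability lets the regression weight its contribution by how likely a peak actually is.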

In [None]:
print("="*60)
print("CALIBRATION AND DISTRIBUTION ANALYSIS OF P(pointe)")
print("="*60)

# Probabilities and labels analyzed in this cell (from clf_pointe above)
proba_test = test_eng['P_pointe'].values
y_test_clf = y_test_pointe

# 1. Calibration plot
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Split into probability bins
n_bins = 10
bins = np.linspace(0, 1, n_bins + 1)
bin_centers = (bins[:-1] + bins[1:]) / 2

# Compute the observed fraction of positives per bin
bin_indices = np.digitize(proba_test, bins) - 1
bin_indices = np.clip(bin_indices, 0, n_bins - 1)

fraction_positive = np.zeros(n_bins)
count_per_bin = np.zeros(n_bins)

for i in range(n_bins):
    mask = bin_indices == i
    if mask.sum() > 0:
        fraction_positive[i] = y_test_clf[mask].mean()
        count_per_bin[i] = mask.sum()

# Calibration curve (empty bins are skipped rather than plotted as 0)
non_empty = count_per_bin > 0
axes[0].plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
axes[0].plot(bin_centers[non_empty], fraction_positive[non_empty], 'o-', linewidth=2,
            label='Our model', markersize=8)
axes[0].set_xlabel('Predicted probability')
axes[0].set_ylabel('Observed fraction of peaks')
axes[0].set_title('Calibration Curve')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Distribution of P(pointe) by true class
axes[1].hist(proba_test[y_test_clf==0], bins=30, alpha=0.5,
            label='Normal (class 0)', color='blue', density=True)
axes[1].hist(proba_test[y_test_clf==1], bins=30, alpha=0.5,
            label='Peak (class 1)', color='red', density=True)
axes[1].axvline(0.5, color='black', linestyle='--', label='Threshold 0.5')
axes[1].set_xlabel('P(pointe)')
axes[1].set_ylabel('Density')
axes[1].set_title('Distribution of P(pointe) by True Class')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

# 2D histogram
axes[2].hist2d(test_eng['temperature_ext'], test_eng['P_pointe'],
              bins=30, cmap='YlOrRd')
axes[2].set_xlabel('Temperature (°C)')
axes[2].set_ylabel('P(pointe)')
axes[2].set_title('P(pointe) vs Temperature')
plt.colorbar(axes[2].collections[0], ax=axes[2], label='Frequency')

plt.tight_layout()
plt.show()

# 2. Segmented analysis
print(f"\n{'='*60}")
print("P(pointe) BY SEGMENT")
print("="*60)

segments = {
    'Train': train_eng['P_pointe'],
    'Test': test_eng['P_pointe'],
    'Actual peak': test_eng[test_eng['evenement_pointe']==1]['P_pointe'],
    'Actual normal': test_eng[test_eng['evenement_pointe']==0]['P_pointe']
}

for name, data in segments.items():
    print(f"\n{name}:")
    print(f"  Min:  {data.min():.3f}")
    print(f"  Q25:  {data.quantile(0.25):.3f}")
    print(f"  Median: {data.median():.3f}")
    print(f"  Q75:  {data.quantile(0.75):.3f}")
    print(f"  Max:  {data.max():.3f}")

---

## Part 6: Combined model (10%)

Assemble the final model by adding `P_pointe` as a feature.

In [None]:
# Final features (with P_pointe)
features_final = features_disponibles + ['P_pointe']

X_train_final = train_eng[features_final].values
y_train_final = train_eng['energie_kwh'].values
X_test_final = test_eng[features_final].values
y_test_final = test_eng['energie_kwh'].values

# Final Ridge model
model_final = RidgeCV(alphas=[0.1, 1, 10, 100], cv=TimeSeriesSplit(n_splits=5))
model_final.fit(X_train_final, y_train_final)
y_pred_final = model_final.predict(X_test_final)

print("Final model (Ridge + P_pointe):")
print(f"  Selected λ: {model_final.alpha_}")
print(f"  R² train: {model_final.score(X_train_final, y_train_final):.4f}")
print(f"  R² test:  {r2_score(y_test_final, y_pred_final):.4f}")
print(f"  RMSE test: {np.sqrt(mean_squared_error(y_test_final, y_pred_final)):.4f}")

In [None]:
# Residual plots
residus = y_test_final - y_pred_final

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Histogram of the residuals
axes[0].hist(residus, bins=50, edgecolor='black', alpha=0.7)
axes[0].axvline(0, color='red', linestyle='--', label='Zero')
axes[0].set_xlabel('Residual')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of residuals')
axes[0].legend()

# Predictions vs actual values
axes[1].scatter(y_test_final, y_pred_final, alpha=0.3, s=5)
axes[1].plot([y_test_final.min(), y_test_final.max()], 
             [y_test_final.min(), y_test_final.max()], 'r--', label='Perfect')
axes[1].set_xlabel('Actual energy (kWh)')
axes[1].set_ylabel('Predicted energy (kWh)')
axes[1].set_title('Predictions vs Actual')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
print("="*60)
print("IN-DEPTH ERROR ANALYSIS")
print("="*60)

# 1. Errors by segment
residus_final = y_test_final - y_pred_final
test_analysis = test_eng.copy()
test_analysis['residus'] = residus_final
test_analysis['erreur_abs'] = np.abs(residus_final)
test_analysis['erreur_pct'] = 100 * np.abs(residus_final) / (y_test_final + 1)

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Errors by hour
erreurs_heure = test_analysis.groupby('heure')['erreur_abs'].mean()
axes[0, 0].bar(erreurs_heure.index, erreurs_heure.values, color='coral')
axes[0, 0].set_xlabel('Hour')
axes[0, 0].set_ylabel('Mean absolute error (kWh)')
axes[0, 0].set_title('Errors by Hour of Day')
axes[0, 0].grid(True, alpha=0.3, axis='y')

# Errors by temperature
temp_bins = pd.cut(test_analysis['temperature_ext'], bins=8)
erreurs_temp = test_analysis.groupby(temp_bins)['erreur_abs'].mean()
axes[0, 1].bar(range(len(erreurs_temp)), erreurs_temp.values, color='steelblue')
axes[0, 1].set_xlabel('Temperature range')
axes[0, 1].set_ylabel('Mean absolute error (kWh)')
axes[0, 1].set_title('Errors by Temperature')
# Pin the tick positions to the bars before labelling them
axes[0, 1].set_xticks(range(len(erreurs_temp)))
axes[0, 1].set_xticklabels([f"{int(i.left)}-{int(i.right)}" 
                             for i in erreurs_temp.index], rotation=45)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Errors by type (peak vs normal)
erreurs_type = test_analysis.groupby('evenement_pointe')['erreur_abs'].mean()
axes[0, 2].bar(['Normal', 'Peak'], erreurs_type.values, color=['green', 'red'])
axes[0, 2].set_ylabel('Mean absolute error (kWh)')
axes[0, 2].set_title('Errors: Normal vs Peak')
axes[0, 2].grid(True, alpha=0.3, axis='y')

print("\nMean errors:")
print(f"  Normal: {erreurs_type[0]:.2f} kWh")
print(f"  Peak: {erreurs_type[1]:.2f} kWh")
print(f"  Peak/normal ratio: {erreurs_type[1]/erreurs_type[0]:.2f}x")

# Residuals vs important features
axes[1, 0].scatter(test_analysis['temperature_ext'], residus_final, 
                   alpha=0.4, s=10, c=test_analysis['heure'], cmap='viridis')
axes[1, 0].axhline(0, color='red', linestyle='--')
axes[1, 0].set_xlabel('Temperature (°C)')
axes[1, 0].set_ylabel('Residual (kWh)')
axes[1, 0].set_title('Residuals vs Temperature (colored by hour)')
plt.colorbar(axes[1, 0].collections[0], ax=axes[1, 0], label='Hour')

axes[1, 1].scatter(test_analysis['clients_connectes'], residus_final,
                   alpha=0.4, s=10, color='purple')
axes[1, 1].axhline(0, color='red', linestyle='--')
axes[1, 1].set_xlabel('Connected clients')
axes[1, 1].set_ylabel('Residual (kWh)')
axes[1, 1].set_title('Residuals vs Number of Clients')

# Top 10 worst predictions
axes[1, 2].axis('off')
pires = test_analysis.nlargest(10, 'erreur_abs')[
    ['heure', 'temperature_ext', 'evenement_pointe', 'erreur_abs']
]
table_text = "TOP 10 WORST PREDICTIONS\n\n"
table_text += pires.to_string(index=False, float_format='%.1f')
axes[1, 2].text(0.1, 0.5, table_text, fontsize=9, 
                family='monospace', verticalalignment='center')

plt.tight_layout()
plt.show()

# 2. Metrics by consumption quantile
print(f"\n{'='*60}")
print("PERFORMANCE BY CONSUMPTION LEVEL")
print("="*60)

test_analysis['consommation_level'] = pd.qcut(y_test_final, q=4,
                                               labels=['Low', 'Medium', 'High', 'Very high'])

perf_by_level = test_analysis.groupby('consommation_level').agg({
    'erreur_abs': 'mean',
    'erreur_pct': 'mean',
    'residus': lambda x: r2_score(y_test_final[x.index], 
                                   y_pred_final[x.index])
}).round(2)
perf_by_level.columns = ['MAE (kWh)', 'MAPE (%)', 'R²']

print(perf_by_level)

print("\nTakeaway: is the model better on some consumption ranges?")

---

## Part 7: Extension (10%) - Choose ONE option

### Option A: External weather data
Use the `meteostat` library to add extra weather data (e.g., atmospheric pressure, dew point).

### Option B: Multiclass classification
Instead of binary (peak/normal), create 3+ consumption classes (low/medium/high) and use softmax.
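
For Option B, the two building blocks could look like the sketch below. The tercile binning and the toy values are assumptions for illustration; the assignment does not prescribe how the classes are defined. The `softmax` helper subtracts the row maximum before exponentiating, a standard trick to avoid overflow.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy consumption values (kWh), binned into 3 classes by terciles
energy = np.array([20.0, 35.0, 50.0, 80.0, 120.0, 300.0])
edges = np.quantile(energy, [1 / 3, 2 / 3])
classes = np.digitize(energy, edges)      # 0 = low, 1 = medium, 2 = high

# Toy class scores: one ordinary row, one row that would overflow without the shift
scores = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax(scores)

print("classes:", classes.tolist())
print("probabilities:\n", probs)
```

Each row of `probs` sums to 1, and the predicted class is the column with the largest probability.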

### Option C: In-depth error analysis
Identify when the model makes the most errors and propose improvements.

In [None]:
# YOUR EXTENSION HERE
# State which option you chose and why.

# Chosen option: ___
# Justification: ___

---

## Kaggle Score Simulation & Diagnostics

The two cells below serve two purposes:
1. **Kaggle Simulation**: Replicate exactly what happens when Kaggle evaluates your submission (RMSE on `energy_test.csv` targets). This uses the clean model only (no energy lags).
2. **Diagnostic Output**: Print a structured text block with all the information needed to diagnose performance issues. Copy-paste the output for analysis.

In [44]:
# ================================================================
# CELL A v3: KAGGLE SCORE SIMULATION - Per-Poste Ridge Models
# ================================================================
# Root cause of bad RMSE: Postes A/B/C have very different consumption
#   Poste A: ~50 kWh (test)   - small
#   Poste B: ~72 kWh (test)   - medium (64% of test!)
#   Poste C: ~269 kWh (test)  - large  (74% of train!)
# A single model learns an intercept ~216 (train mean) and can't
# shift enough per poste. Solution: one Ridge model per poste.
# Also: weather lags are now computed per-poste (shift across postes
# was giving wrong values).
# Requires: train, test, clf_pointe (from earlier cells)
# ================================================================

import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# -------------------------------------------------------------------
# 1. FEATURE ENGINEERING (applied per-poste, so lags are correct)
# -------------------------------------------------------------------
def creer_caracteristiques_v3(df):
    """
    Feature engineering for a SINGLE poste's data.
    Data is sorted by time so shift/rolling operate correctly.
    Original DataFrame index is preserved for alignment.
    """
    df = df.sort_values('horodatage_local').copy()

    # Weather x time interactions
    df['temp_heure_cos'] = df['temperature_ext'] * df['heure_cos']
    df['temp_heure_sin'] = df['temperature_ext'] * df['heure_sin']
    df['temp_weekend'] = df['temperature_ext'] * df['est_weekend']
    df['temp_mois_sin'] = df['temperature_ext'] * df['mois_sin']
    df['temp_mois_cos'] = df['temperature_ext'] * df['mois_cos']

    # Degree-days
    df['degres_jours_chauffage'] = np.maximum(18 - df['temperature_ext'], 0)
    df['degres_jours_clim'] = np.maximum(df['temperature_ext'] - 22, 0)

    # Weather transforms
    df['temp_squared'] = df['temperature_ext'] ** 2
    df['temp_ressentie'] = df['temperature_ext'] - 0.5 * df['vitesse_vent']
    df['humidite_temp'] = df['humidite'] * np.abs(df['temperature_ext']) / 100

    # Time indicators
    df['est_pointe_matin'] = ((df['heure'] >= 7) & (df['heure'] <= 9)).astype(int)
    df['est_pointe_soir'] = ((df['heure'] >= 17) & (df['heure'] <= 20)).astype(int)
    df['est_nuit'] = ((df['heure'] >= 0) & (df['heure'] <= 6)).astype(int)

    # Weather lags (correct: single-poste, sorted by time)
    df['temp_lag1'] = df['temperature_ext'].shift(1).fillna(df['temperature_ext'].iloc[0])
    df['temp_lag24'] = df['temperature_ext'].shift(24).fillna(df['temperature_ext'].iloc[0])
    df['temp_diff'] = df['temperature_ext'].diff().fillna(0)
    df['temp_amplitude_24h'] = (
        df['temperature_ext'].rolling(window=24, min_periods=1).max() -
        df['temperature_ext'].rolling(window=24, min_periods=1).min()
    )

    # Infrastructure interactions
    if 'clients_connectes' in df.columns:
        df['clients_temp'] = df['clients_connectes'] * df['temperature_ext']
        df['clients_weekend'] = df['clients_connectes'] * df['est_weekend']
        df['clients_heure_cos'] = df['clients_connectes'] * df['heure_cos']

    if 'tstats_intelligents_connectes' in df.columns:
        df['tstats_temp'] = df['tstats_intelligents_connectes'] * df['temperature_ext']
        df['ratio_tstats_clients'] = (
            df['tstats_intelligents_connectes'] / (df['clients_connectes'] + 1))

    if 'irradiance_solaire' in df.columns:
        df['irradiance_temp'] = df['irradiance_solaire'] * df['temperature_ext']

    return df

# -------------------------------------------------------------------
# 2. BUILD PER-POSTE DATA
# -------------------------------------------------------------------
postes = sorted(train['poste'].unique())
train_parts = {}
test_parts = {}

for p in postes:
    train_parts[p] = creer_caracteristiques_v3(train[train['poste'] == p])
    test_parts[p] = creer_caracteristiques_v3(test[test['poste'] == p])

# Feature list (same for all postes, since the same function is applied)
features_clean = [f for f in train_parts[postes[0]].select_dtypes(include=[np.number]).columns
                  if f not in ['energie_kwh']]

# Add P_pointe if classifier available
try:
    _feats_pointe = ['temperature_ext', 'humidite', 'vitesse_vent',
                     'heure_sin', 'heure_cos', 'est_weekend', 'clients_connectes']
    for p in postes:
        train_parts[p]['P_pointe'] = clf_pointe.predict_proba(
            train_parts[p][_feats_pointe].values)[:, 1]
        test_parts[p]['P_pointe'] = clf_pointe.predict_proba(
            test_parts[p][_feats_pointe].values)[:, 1]
    if 'P_pointe' not in features_clean:
        features_clean.append('P_pointe')
    print("P_pointe added.")
except NameError:
    print("clf_pointe not found, skipping P_pointe.")

print(f"Features per model: {len(features_clean)}")
print(f"Postes: {postes}")

# Reassemble full DataFrames (for Cell B compatibility)
train_clean = pd.concat([train_parts[p] for p in postes])
test_clean = pd.concat([test_parts[p] for p in postes])

# -------------------------------------------------------------------
# 3. TRAIN PER-POSTE RIDGE MODELS
# -------------------------------------------------------------------
alphas_grid = [0.01, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000]
models = {}
scalers_per_poste = {}

print("\nPer-poste model training:")
for p in postes:
    X_tr = train_parts[p][features_clean].values
    y_tr = train_parts[p]['energie_kwh'].values

    scaler_p = StandardScaler()
    X_tr_s = scaler_p.fit_transform(X_tr)

    n_splits = min(5, max(2, len(X_tr) // 50))
    tscv_p = TimeSeriesSplit(n_splits=n_splits)

    ridge_p = RidgeCV(alphas=alphas_grid, cv=tscv_p, scoring='neg_mean_squared_error')
    ridge_p.fit(X_tr_s, y_tr)

    # Evaluate
    X_te = test_parts[p][features_clean].values
    y_te = test_parts[p]['energie_kwh'].values
    X_te_s = scaler_p.transform(X_te)

    y_pred_tr = ridge_p.predict(X_tr_s)
    y_pred_te = ridge_p.predict(X_te_s)

    rmse_tr = np.sqrt(mean_squared_error(y_tr, y_pred_tr))
    rmse_te = np.sqrt(mean_squared_error(y_te, y_pred_te))
    r2_tr = r2_score(y_tr, y_pred_tr)
    r2_te = r2_score(y_te, y_pred_te) if len(y_te) > 1 else float('nan')

    models[p] = ridge_p
    scalers_per_poste[p] = scaler_p

    print(f"  Poste {p}: alpha={ridge_p.alpha_}, n_tr={len(y_tr)}, n_te={len(y_te)}, "
          f"RMSE_tr={rmse_tr:.2f}, RMSE_te={rmse_te:.2f}, "
          f"R2_tr={r2_tr:.4f}, R2_te={r2_te:.4f}")

# -------------------------------------------------------------------
# 4. PREDICT ON TEST (preserving original row order)
# -------------------------------------------------------------------
y_pred_test_c = pd.Series(index=test.index, dtype=float)

for p in postes:
    te_p = test_parts[p]
    X_te = te_p[features_clean].values
    X_te_s = scalers_per_poste[p].transform(X_te)
    preds = np.maximum(models[p].predict(X_te_s), 0)
    y_pred_test_c.loc[te_p.index] = preds

y_pred_test_c = y_pred_test_c.values
y_test_clean = test['energie_kwh'].values

# -------------------------------------------------------------------
# 5. KAGGLE SCORE = TEST SCORE
# -------------------------------------------------------------------
# test IS energy_test_avec_cible.csv, so test RMSE = Kaggle RMSE
rmse_sim = np.sqrt(mean_squared_error(y_test_clean, y_pred_test_c))
r2_sim = r2_score(y_test_clean, y_pred_test_c)
mae_sim = mean_absolute_error(y_test_clean, y_pred_test_c)

# Build merged DataFrame (for Cell B)
merged = pd.DataFrame({
    'horodatage_local': test['horodatage_local'].values,
    'poste': test['poste'].values,
    'y_true': y_test_clean,
    'y_pred': y_pred_test_c
})

# y_sim_pred for submission (same order as test)
y_sim_pred = y_pred_test_c.copy()

try:
    sample_sub = pd.read_csv('ift-3395-6390-prediction-energetique/sample_submission.csv')
    expected_rows = len(sample_sub)
except FileNotFoundError:
    expected_rows = len(test)

# -------------------------------------------------------------------
# 6. REPORT
# -------------------------------------------------------------------
print("\n" + "=" * 70)
print("KAGGLE SCORE SIMULATION (v3: Per-Poste Ridge Models)")
print("=" * 70)
print(f"  Models:      Ridge per poste ({', '.join(postes)})")
print(f"  Features:    {len(features_clean)} per model (no poste dummies needed)")
print(f"  Rows:        {len(merged)} (expected {expected_rows})")
print()
print(f"  RMSE:  {rmse_sim:.4f} kWh   <-- Kaggle score")
print(f"  MAE:   {mae_sim:.4f} kWh")
print(f"  R2:    {r2_sim:.4f}")
print()
print(f"  Pred stats:  min={merged['y_pred'].min():.2f}, "
      f"mean={merged['y_pred'].mean():.2f}, max={merged['y_pred'].max():.2f}")
print(f"  Truth stats: min={merged['y_true'].min():.2f}, "
      f"mean={merged['y_true'].mean():.2f}, max={merged['y_true'].max():.2f}")
print()
print(f"  Expected rows by Kaggle: {expected_rows}")
print(f"  Your submission rows:    {len(merged)}")

print("\n  Per-poste RMSE:")
for p in postes:
    mask = merged['poste'] == p
    sub = merged[mask]
    rmse_p = np.sqrt(mean_squared_error(sub['y_true'], sub['y_pred']))
    r2_p = r2_score(sub['y_true'], sub['y_pred']) if len(sub) > 1 else float('nan')
    bias_p = (sub['y_true'] - sub['y_pred']).mean()
    print(f"    Poste {p}: RMSE={rmse_p:.2f}, R2={r2_p:.4f}, bias={bias_p:+.2f}, n={mask.sum()}")

print("=" * 70)

P_pointe added.
Features per model: 44
Postes: ['A', 'B', 'C']

Per-poste model training:
  Poste A: alpha=500.0, n_tr=1751, n_te=474, RMSE_tr=25.55, RMSE_te=16.74, R2_tr=0.8339, R2_te=0.2834
  Poste B: alpha=10.0, n_tr=366, n_te=1126, RMSE_tr=17.85, RMSE_te=44.83, R2_tr=0.7666, R2_te=-0.5415
  Poste C: alpha=10.0, n_tr=6129, n_te=154, RMSE_tr=171.32, RMSE_te=186.61, R2_tr=0.4401, R2_te=-3.6472

KAGGLE SCORE SIMULATION (v3: Per-Poste Ridge Models)
  Models:      Ridge per poste (A, B, C)
  Features:    44 per model (no poste dummies needed)
  Rows:        1754 (expected 1754)

  RMSE:  66.3865 kWh   <-- Kaggle score
  MAE:   42.6369 kWh
  R2:    0.1196

  Pred stats:  min=0.00, mean=116.87, max=702.43
  Truth stats: min=17.76, mean=83.74, max=538.48

  Expected rows by Kaggle: 1754
  Your submission rows:    1754

  Per-poste RMSE:
    Poste A: RMSE=16.74, R2=0.2834, bias=+2.84, n=474
    Poste B: RMSE=44.55, R2=-0.5218, bias=-29.51, n=1126
    Poste C: RMSE=186.61, R2=-3.6472, bias=-1
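
The per-poste figures above can be reconciled with the overall score, because the pooled MSE is the row-weighted average of the per-poste MSEs. The sketch below copies the numbers from the printed output; the row count for Poste C is taken from the training log (`n_te=154`) since the per-poste line is cut off. The variable names are illustrative.

```python
import numpy as np

# Per-poste test sizes and RMSEs copied from the printed output (postes A, B, C)
n = np.array([474, 1126, 154])
rmse = np.array([16.74, 44.55, 186.61])

sse = n * rmse**2                              # sum of squared errors per poste
rmse_total = np.sqrt(sse.sum() / n.sum())      # pooled RMSE over all rows
share = sse / sse.sum()                        # each poste's share of squared error

print(f"pooled RMSE: {rmse_total:.2f} kWh")
for name, k, s in zip("ABC", n, share):
    print(f"poste {name}: {k / n.sum():.1%} of rows, {s:.1%} of squared error")
```

The pooled value lands close to the reported 66.3865 kWh. Poste C accounts for most of the squared error despite contributing fewer than a tenth of the test rows, so improving that model is likely the most direct way to lower the Kaggle score.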

In [None]:
# ================================================================
# CELL B v3: DIAGNOSTIC OUTPUT (copy-paste the output to me)
# ================================================================
# Run AFTER Cell A. Uses: train, test, train_clean, test_clean, postes,
# train_parts, test_parts, features_clean, models, scalers_per_poste,
# merged, rmse_sim, mae_sim, r2_sim
# ================================================================

print("=" * 80)
print("DIAGNOSTIC DUMP v3 -- COPY EVERYTHING BELOW THIS LINE")
print("=" * 80)

# ---- SECTION 1: DATASET OVERVIEW ----
print("\n[1] DATASET OVERVIEW")
print(f"  train shape: {train.shape}")
print(f"  test shape:  {test.shape}")
print(f"  train period: {train['horodatage_local'].min()} -> {train['horodatage_local'].max()}")
print(f"  test period:  {test['horodatage_local'].min()} -> {test['horodatage_local'].max()}")

print(f"\n  train energie_kwh: mean={train['energie_kwh'].mean():.2f}, "
      f"std={train['energie_kwh'].std():.2f}, "
      f"min={train['energie_kwh'].min():.2f}, max={train['energie_kwh'].max():.2f}")
print(f"  test energie_kwh:  mean={test['energie_kwh'].mean():.2f}, "
      f"std={test['energie_kwh'].std():.2f}, "
      f"min={test['energie_kwh'].min():.2f}, max={test['energie_kwh'].max():.2f}")

for col in ['temperature_ext', 'humidite', 'vitesse_vent', 'irradiance_solaire', 'clients_connectes']:
    if col in train.columns and col in test.columns:
        print(f"  {col}: train_mean={train[col].mean():.2f}, test_mean={test[col].mean():.2f}, "
              f"shift={100*(test[col].mean()-train[col].mean())/(abs(train[col].mean())+1e-8):+.1f}%")

# ---- SECTION 1B: PER-POSTE OVERVIEW ----
print("\n[1B] PER-POSTE OVERVIEW")
for p in postes:
    tr_mask = train['poste'] == p
    te_mask = test['poste'] == p
    print(f"  Poste {p}: train n={tr_mask.sum()}, mean_kwh={train.loc[tr_mask, 'energie_kwh'].mean():.2f} | "
          f"test n={te_mask.sum()}, mean_kwh={test.loc[te_mask, 'energie_kwh'].mean():.2f}")

# ---- SECTION 2: PER-POSTE MODEL CONFIGS ----
print("\n[2] PER-POSTE MODEL CONFIGURATIONS")
print(f"  Number of features: {len(features_clean)}")
for p in postes:
    print(f"  Poste {p}: alpha={models[p].alpha_}, intercept={models[p].intercept_:.4f}")

# ---- SECTION 3: TOP COEFFICIENTS PER POSTE ----
print("\n[3] TOP 15 COEFFICIENTS PER POSTE")
for p in postes:
    print(f"\n  --- Poste {p} ---")
    coef_pairs = sorted(zip(features_clean, models[p].coef_),
                        key=lambda x: abs(x[1]), reverse=True)
    for i, (feat, coef) in enumerate(coef_pairs[:15], 1):
        print(f"    {i:2d}. {feat:40s} {coef:+10.4f}")

# ---- SECTION 4: KAGGLE RESULTS ----
print("\n[4] KAGGLE SIMULATION RESULTS")
print(f"  RMSE:  {rmse_sim:.4f}")
print(f"  MAE:   {mae_sim:.4f}")
print(f"  R2:    {r2_sim:.4f}")
print(f"  Rows:  {len(merged)}")

# ---- SECTION 5: RESIDUAL ANALYSIS ----
print("\n[5] RESIDUAL ANALYSIS")
residuals = merged['y_true'].values - merged['y_pred'].values
abs_residuals = np.abs(residuals)
merged_diag = merged.copy()
merged_diag['residual'] = residuals
merged_diag['abs_err'] = abs_residuals

print(f"  Residual mean:   {residuals.mean():.4f} (bias)")
print(f"  Residual std:    {residuals.std():.4f}")
for p_val in [50, 75, 90, 95, 99]:
    print(f"  |residual| P{p_val}: {np.percentile(abs_residuals, p_val):.2f}")

# ---- SECTION 5B: PER-POSTE RESIDUAL ----
print("\n[5B] PER-POSTE RESIDUAL ANALYSIS")
for p in postes:
    mask = merged_diag['poste'] == p
    sub = merged_diag[mask]
    rmse_p = np.sqrt((sub['residual']**2).mean())
    r2_p = r2_score(sub['y_true'], sub['y_pred']) if len(sub) > 1 else float('nan')
    print(f"  Poste {p}: RMSE={rmse_p:.2f}, bias={sub['residual'].mean():+.2f}, "
          f"MAE={sub['abs_err'].mean():.2f}, R2={r2_p:.4f}, n={mask.sum()}")

# ---- SECTIONS 6-11: ERRORS BY CATEGORY ----
try:
    _info_cols = ['horodatage_local', 'poste', 'heure', 'mois', 'temperature_ext',
                  'clients_connectes', 'evenement_pointe', 'est_weekend']
    _info_avail = [c for c in _info_cols if c in test.columns]
    analysis = merged_diag.merge(test[_info_avail], on=['horodatage_local', 'poste'], how='left')

    print("\n[6] MAE BY HOUR")
    for h in range(24):
        mask = analysis['heure'] == h
        if mask.sum() > 0:
            print(f"  Hour {h:2d}: MAE={analysis.loc[mask, 'abs_err'].mean():6.2f}, "
                  f"mean_conso={analysis.loc[mask, 'y_true'].mean():6.2f}, n={mask.sum()}")

    print("\n[7] MAE BY MONTH")
    for m in sorted(analysis['mois'].unique()):
        mask = analysis['mois'] == m
        if mask.sum() > 0:
            print(f"  Month {m:2d}: MAE={analysis.loc[mask, 'abs_err'].mean():6.2f}, n={mask.sum()}")

    print("\n[8] MAE BY TEMPERATURE BIN")
    _tbins = pd.cut(analysis['temperature_ext'], bins=[-30, -10, 0, 10, 20, 40])
    for tb, grp in analysis.groupby(_tbins, observed=False):
        if len(grp) > 0:
            print(f"  {str(tb):15s}: MAE={grp['abs_err'].mean():6.2f}, "
                  f"bias={grp['residual'].mean():+6.2f}, n={len(grp)}")

    print("\n[9] ERRORS: POINTE vs NORMAL")
    for ev, label in [(0, 'Normal'), (1, 'Pointe')]:
        mask = analysis['evenement_pointe'] == ev
        if mask.sum() > 0:
            print(f"  {label:8s}: MAE={analysis.loc[mask, 'abs_err'].mean():.2f}, "
                  f"bias={analysis.loc[mask, 'residual'].mean():+.2f}, n={mask.sum()}")

    print("\n[10] ERRORS: WEEKEND vs WEEKDAY")
    for we, label in [(0, 'Weekday'), (1, 'Weekend')]:
        mask = analysis['est_weekend'] == we
        if mask.sum() > 0:
            print(f"  {label:8s}: MAE={analysis.loc[mask, 'abs_err'].mean():.2f}, "
                  f"bias={analysis.loc[mask, 'residual'].mean():+.2f}, n={mask.sum()}")

    print("\n[11] TOP 15 WORST PREDICTIONS")
    worst = analysis.nlargest(15, 'abs_err')
    cols_show = ['horodatage_local', 'poste', 'heure', 'mois', 'temperature_ext',
                 'y_true', 'y_pred', 'abs_err']
    cols_avail = [c for c in cols_show if c in worst.columns]
    print(worst[cols_avail].to_string(index=False))

except Exception as e:
    print(f"  Error in sections 6-11: {e}")

# ---- SECTION 12: FEATURE CORRELATIONS PER POSTE ----
print("\n[12] TOP FEATURE CORRELATIONS PER POSTE")
for p in postes:
    try:
        _feats_avail = [f for f in features_clean if f in train_parts[p].columns]
        _corrs = train_parts[p][_feats_avail + ['energie_kwh']].corr()['energie_kwh'].drop('energie_kwh')
        _corrs_sorted = _corrs.abs().sort_values(ascending=False)
        print(f"\n  --- Poste {p} (top 10) ---")
        for i, feat in enumerate(_corrs_sorted.head(10).index, 1):
            print(f"    {i:2d}. {feat:40s} r={_corrs[feat]:+.4f}")
    except Exception as e:
        print(f"  Error for poste {p}: {e}")

# ---- SECTION 13: PIPELINE HEALTH ----
print("\n[13] DATA PIPELINE HEALTH")
print(f"  train_clean shape: {train_clean.shape}")
print(f"  test_clean shape:  {test_clean.shape}")
print(f"  merged shape:      {merged.shape}")
any_nan = False
for p in postes:
    nans = train_parts[p][features_clean].isnull().sum()
    if nans.sum() > 0:
        print(f"  WARNING NaN in train poste {p}: {nans[nans>0].to_dict()}")
        any_nan = True
    nans_te = test_parts[p][features_clean].isnull().sum()
    if nans_te.sum() > 0:
        print(f"  WARNING NaN in test poste {p}: {nans_te[nans_te>0].to_dict()}")
        any_nan = True
if not any_nan:
    print(f"  No NaN in features (good)")

# ---- SECTION 14: SAMPLE PREDICTIONS ----
print("\n[14] SAMPLE PREDICTIONS (first 20 rows)")
print(merged[['horodatage_local', 'poste', 'y_true', 'y_pred']].head(20).to_string(index=False))

print("\n" + "=" * 80)
print("END OF DIAGNOSTIC DUMP v3 -- COPY EVERYTHING ABOVE THIS LINE")
print("=" * 80)

---

## Kaggle Submission

Generate your submission file for the competition.

In [None]:
# Generate the predictions for Kaggle
# Uses y_sim_pred from Cell A (predictions in the same row order as test)

submission = pd.DataFrame({
    'id': range(len(y_sim_pred)),
    'energie_kwh': y_sim_pred
})

submission.to_csv('submission.csv', index=False)
print(f"Submission file created: submission.csv ({len(submission)} rows)")
print(f"Predictions: min={y_sim_pred.min():.2f}, mean={y_sim_pred.mean():.2f}, max={y_sim_pred.max():.2f}")
submission.head()

Submission file created: submission.csv (1754 rows)


|   | id | energie_kwh |
|---|----|-------------|
| 0 | 0 | 398.873974 |
| 1 | 1 | 349.478473 |
| 2 | 2 | 338.598378 |
| 3 | 3 | 326.020949 |
| 4 | 4 | 27.960704 |
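
Before uploading, it may be worth checking the file against what the simulation reports (1754 expected rows). The following is a minimal sketch: `check_submission` is a hypothetical helper that does not exist in the notebook, the sample values are illustrative, and the non-negativity rule is an assumption based on energy being a non-negative quantity.

```python
import numpy as np

def check_submission(ids, preds, expected_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    ids = np.asarray(ids)
    preds = np.asarray(preds, dtype=float)
    problems = []
    if len(preds) != expected_rows:
        problems.append(f"row count {len(preds)} != expected {expected_rows}")
    if len(ids) != len(preds):
        problems.append("ids and predictions differ in length")
    if len(np.unique(ids)) != len(ids):
        problems.append("duplicate ids")
    if not np.all(np.isfinite(preds)):
        problems.append("NaN or infinite predictions")
    elif np.any(preds < 0):  # assumption: kWh cannot be negative
        problems.append("negative predictions")
    return problems

# Illustrative inputs, not the real submission.
good = check_submission(range(5), [398.9, 349.5, 338.6, 326.0, 28.0], expected_rows=5)
bad = check_submission([0, 1, 1], [10.0, float("nan"), 5.0], expected_rows=4)
print(good)
print(bad)
```

In the notebook this could be called as `check_submission(submission['id'].to_numpy(), submission['energie_kwh'].to_numpy(), expected_rows)`, assuming `expected_rows` from Cell A is still in scope.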


---

## Preparation questions for the oral interview

Be prepared to answer these questions:

### Fundamentals
1. Derive the OLS solution on the board.
2. Why did you use a temporal split rather than a random one?
3. What do you see in your residuals?
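
As a reminder for question 2, a temporal split keeps every training row earlier than every validation row, so the model is never evaluated on a period it has already seen. The sketch below hand-rolls an expanding-window scheme in that spirit; `expanding_window_splits` is a hypothetical helper written for illustration, it assumes rows are already sorted by time, and it is not claimed to reproduce scikit-learn's `TimeSeriesSplit` fold boundaries exactly.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where training always precedes testing.

    Hypothetical helper for illustration; rows are assumed sorted by time.
    """
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold
        # The last fold absorbs any leftover rows.
        test_end = n_samples if k == n_splits else train_end + fold
        yield np.arange(train_end), np.arange(train_end, test_end)

splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

A random split would instead mix future rows into the training set, which tends to give optimistic validation scores on time-ordered data.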

### Regularization
4. Why does Ridge help with correlated features?
5. How did you choose λ?
6. Which coefficient was shrunk the most? Why?
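
To build intuition for question 4, the sketch below compares the closed-form OLS and Ridge solutions on two nearly identical features. Everything here is a synthetic assumption (the data, the noise levels, λ = 10, no intercept), and `ridge_closed_form` is a hypothetical helper, not the notebook's implementation.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y; lam = 0 gives plain OLS."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)           # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)  # synthetic target

w_ols = ridge_closed_form(X, y, lam=0.0)
w_ridge = ridge_closed_form(X, y, lam=10.0)

print("cond(X^T X):       ", np.linalg.cond(X.T @ X))
print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

Adding λ to the diagonal lifts the smallest eigenvalue of `X.T @ X`, which is the direction along which the two correlated features can trade weight almost freely. Ridge therefore splits the effect between them instead of letting the pair take large, noise-driven values of opposite sign.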

### Classification
7. Which binary target did you choose? Justify.
8. Your classifier outputs P=0.7. What does that mean?
9. Why use P(pointe) rather than a 0/1 indicator?
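
For questions 8 and 9, it can help to see what thresholding discards. In the sketch below the linear scores are made-up values (an assumption, not classifier output from the project); the point is only that several distinct probabilities collapse onto the same 0/1 value, and that the indicator changes with the chosen threshold while the probability does not.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores w^T x for five hours.
scores = np.array([-2.0, -0.2, 0.1, 1.0, 3.0])
p_pointe = sigmoid(scores)

# Hard indicator obtained at several candidate thresholds.
indicators = {}
for threshold in (0.3, 0.5, 0.7):
    indicators[threshold] = (p_pointe >= threshold).astype(int).tolist()
    print(f"threshold={threshold}: indicator={indicators[threshold]}")

print("probabilities:", np.round(p_pointe, 3).tolist())
```

Feeding `p_pointe` to the second-stage regression lets it weigh a confident peak differently from a borderline one, which a 0/1 column cannot express.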

### Probabilistic theory
10. Explain Ridge as MAP estimation.
11. Why does logistic regression minimize cross-entropy?
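
Both questions rest on a likelihood view of the model. The outline below is a sketch under stated assumptions: Gaussian noise with variance $\sigma^2$, an independent zero-mean Gaussian prior with variance $\tau^2$ on each weight, and labels $y_i \in \{0, 1\}$. The symbols $\sigma^2$, $\tau^2$ and $s(\cdot)$ are notation chosen here and may differ from the course notes.

For Ridge, with $y_i = w^\top x_i + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ and $w \sim \mathcal{N}(0, \tau^2 I)$:

$$
\hat{w}_{\text{MAP}}
= \arg\max_w \left[ \log p(y \mid X, w) + \log p(w) \right]
= \arg\min_w \left[ \frac{1}{2\sigma^2} \sum_i \left( y_i - w^\top x_i \right)^2 + \frac{1}{2\tau^2} \lVert w \rVert_2^2 \right]
$$

Multiplying the objective by $2\sigma^2$ does not change the minimizer and gives the Ridge objective:

$$
\hat{w}_{\text{MAP}} = \arg\min_w \left[ \sum_i \left( y_i - w^\top x_i \right)^2 + \lambda \lVert w \rVert_2^2 \right],
\qquad \lambda = \frac{\sigma^2}{\tau^2}
$$

For logistic regression, with $p_i = s(w^\top x_i)$ and $s(z) = 1 / (1 + e^{-z})$, the Bernoulli negative log-likelihood is

$$
-\log p(y \mid X, w) = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$

which is the cross-entropy between the labels and the predicted probabilities, so maximizing the likelihood and minimizing the cross-entropy are the same problem.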

### Synthesis
12. Walk through your complete model step by step.
13. Which improvement in R² was the most important?
14. Change this threshold live: what do you predict?