## EDA conclusion

The following variables explain a large share of the variability in consumption:

* Temperature (and therefore the season).

* Time of day.

* Day type (weekday/weekend).

* School holidays and bank holidays.

* TEMPO days.

In [504]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge

In [505]:
df = pd.read_csv("final_df.csv", parse_dates = ["Date"], low_memory = False )
df['Date'] = pd.to_datetime(df['Date'])

## Preprocessing

### Cropping and cleaning

In [506]:
## Remove useless variables ##

cols_to_drop_pre = [
    'Périmètre','Nature',
    'Fioul','Charbon','Gaz','Nucléaire','Eolien','Solaire','Hydraulique','Pompage','Bioénergies','Ech. physiques',
    'Fioul - TAC','Fioul - Cogén.','Fioul - Autres',
    'Gaz - TAC','Gaz - Cogén.','Gaz - CCG','Gaz - Autres',
    'Hydraulique - Fil de l?eau + éclusée','Hydraulique - Lacs','Hydraulique - STEP turbinage',
    'Bioénergies - Déchets','Bioénergies - Biomasse','Bioénergies - Biogaz',
    ' Stockage batterie','Déstockage batterie',
    'Eolien terrestre','Eolien offshore',
    'Ech. comm. Angleterre','Ech. comm. Espagne','Ech. comm. Italie',
    'Ech. comm. Suisse','Ech. comm. Allemagne-Belgique',
    'Taux de Co2'
]

cols_to_drop_pre = [c for c in cols_to_drop_pre if c in df.columns]
df = df.drop(columns=cols_to_drop_pre)

In [507]:
## Remove the rows before the first recorded consumption value ##

df = df.set_index('Date')

valid = df['Consommation'].dropna()
min_date = valid.index.min()
max_date = valid.index.max()

df = df.loc[df.index >= min_date]

df = df.reset_index()

In [508]:
## Standardize to a 30-minute consumption regime ##

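df['Date'] = pd.to_datetime(df['Date'])

# Drop missing hours first: astype(str) would turn NaN into the string 'nan', which dropna cannot detect.
df = df.dropna(subset=['Heures'])
df['Heures'] = df['Heures'].astype(str)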

df['DateTime'] = pd.to_datetime(
    df['Date'].dt.date.astype(str) + ' ' + df['Heures'],
    format = '%Y-%m-%d %H:%M',
    errors = 'coerce'    # transforms invalid formats into NaT
)

df = df.dropna(subset = ['DateTime'])

df = df.set_index('DateTime').sort_index()

df = df[df.index.minute.isin([0, 30])].copy()

# Go back to a RangeIndex and drop the temporary DateTime column
df = df.reset_index(drop = False).drop(columns = ['DateTime'])

In [491]:
## Handle outliers ?

### Data splitting

In [510]:
df = df.set_index('Date')
df = df.sort_index()

train_end = '2022-12-31'
valid_start = '2023-01-01'
valid_end   = '2023-12-31'
test_start  = '2024-01-01'

train = df.loc[:train_end]
valid = df.loc[valid_start:valid_end]
test  = df.loc[test_start:]

df = df.reset_index()
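
As a small sanity check of the chronological split, the sketch below reproduces the same date bounds on a synthetic daily index (the `toy` frame and its values are made up for illustration) and verifies that the three subsets neither overlap nor lose rows.

```python
import pandas as pd

# Synthetic daily index spanning the three periods used in the notebook.
idx = pd.date_range("2022-12-30", "2024-01-02", freq="D")
toy = pd.DataFrame({"value": range(len(idx))}, index=idx)

toy_train = toy.loc[:"2022-12-31"]
toy_valid = toy.loc["2023-01-01":"2023-12-31"]
toy_test = toy.loc["2024-01-01":]

# Label-based slicing on a sorted DatetimeIndex includes both bounds,
# so adjacent subsets must not share a boundary date.
no_overlap = (
    toy_train.index.max() < toy_valid.index.min()
    and toy_valid.index.max() < toy_test.index.min()
)

# Every row should land in exactly one subset.
all_rows_used = len(toy_train) + len(toy_valid) + len(toy_test) == len(toy)
```

The same two checks can be run on the real `train`, `valid` and `test` frames before any feature is computed.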

### Feature engineering

**Features to add**

* *HDD and CDD* -> "Heating Degree Days" and "Cooling Degree Days" are indicators widely used in building energy analysis; they quantify how demanding a climate is in terms of heating and cooling needs.
* *hour_sin and hour_cos* -> The EDA showed that electricity consumption varies cyclically over the day. These two features encode that cycle.
* *day_of_week* -> The EDA showed that, on average, weekend consumption is ~11% lower than weekday consumption.
* *lag_24h* -> Consumption exactly 24 hours earlier, at the same time of day. Useful because electricity demand at a given hour is often very close to the demand at the same hour the day before.
* *roll_mean_7d* -> Mean consumption over the 7 previous days, again at the same time of day. It captures the weekly trend, i.e. whether this time slot has been generally high or low over the past week.
* *resid_j-1* -> Error of the "naive" forecast made the day before (J-1) for this same time slot. It tells the model how the benchmark is performing: if it under- or over-forecasts, the model can adjust.
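
As a rough illustration of the first two bullet points, the sketch below computes the cyclic hour encoding and the degree-day features on a few made-up rows. The `toy` frame, its values and the `circle_distance` helper are hypothetical and only serve the example; the base temperature of 18°C is the one used in this notebook.

```python
import numpy as np
import pandas as pd

T_REF = 18.0  # base temperature in °C

# Made-up time slots (in decimal hours) and mean temperatures.
toy = pd.DataFrame({
    "hour": [23.5, 0.0, 0.5, 12.0],
    "temp_mean": [10.0, 18.0, 25.0, 5.0],
})

# Cyclic encoding: the hour is mapped onto the unit circle,
# so 23:30 and 00:00 end up close to each other.
angle = 2 * np.pi * toy["hour"] / 24
toy["hour_sin"] = np.sin(angle)
toy["hour_cos"] = np.cos(angle)

# Degree-day style features: only the deficit (HDD) or the excess (CDD)
# relative to T_REF is kept, the other side is clipped to zero.
toy["HDD"] = (T_REF - toy["temp_mean"]).clip(lower=0)
toy["CDD"] = (toy["temp_mean"] - T_REF).clip(lower=0)


def circle_distance(i, j):
    """Euclidean distance between rows i and j in (hour_sin, hour_cos) space."""
    a = toy.loc[i, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    b = toy.loc[j, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))
```

With this encoding, `circle_distance(0, 1)` (23:30 vs 00:00) equals `circle_distance(1, 2)` (00:00 vs 00:30), whereas a raw hour column would place 23:30 and 00:00 almost 24 hours apart.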

In [511]:
## Feature engineering class ##

class FeatureEngineering(BaseEstimator, TransformerMixin):
    
    def __init__(self, T_ref = 18.0, freq_minutes = 30):
        # 18°C is the usual base temperature at which an "average" building needs neither heating (internal gains are sufficient) nor air conditioning.
        self.T_ref = T_ref
        # Store the parameter unchanged so that get_params() and clone() work (scikit-learn convention).
        self.freq_minutes = freq_minutes

    @property
    def steps_per_day(self):
        # The raw data is quarter-hourly (checked in various_check), but only the :00 and :30 rows are kept above, hence 48 steps per day by default.
        return int(24 * 60 / self.freq_minutes)

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        
        df_fe = X.copy()

        # 1) Hour cyclic
        df_fe = df_fe.reset_index()
        df_fe.index = pd.to_datetime(df_fe['Date'].dt.date.astype(str) + ' ' + df_fe['Heures'].astype(str), format = '%Y-%m-%d %H:%M')
        df_fe = df_fe.sort_index()
        hours = df_fe.index.hour + df_fe.index.minute / 60.0
        df_fe['hour_sin'] = np.sin(2 * np.pi * hours / 24)
        df_fe['hour_cos'] = np.cos(2 * np.pi * hours / 24)
        df_fe = df_fe.reset_index(drop = True)

        # 2) day_of_week
        df_fe = df_fe.set_index('Date')
        df_fe['day_of_week'] = df_fe.index.dayofweek

        # 3) HDD & CDD
        df_fe['HDD'] = (self.T_ref - df_fe['Avg_temp_mean']).clip(lower=0)
        df_fe['CDD'] = (df_fe['Avg_temp_mean'] - self.T_ref).clip(lower=0)

        # 4) lag_24h (24h = steps_per_day rows)
        df_fe['lag_24h'] = df_fe['Consommation'].shift(periods = self.steps_per_day)

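        # 5) roll_mean_7d (rolling window of 7 days on lag_24h, per time slot)
        # Group by the 'Heures' slot: the index is 'Date', and assuming it holds calendar dates only,
        # df_fe.index.time would be midnight for every row and the window would span 7 rows instead of 7 days.
        df_fe['roll_mean_7d'] = (
            df_fe
              .groupby('Heures')['lag_24h']
              .transform(lambda s: s.rolling(window=7).mean())
        )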

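        # 6) resid_j-1: error of the forecast made the day before ("Prévision J-1") for the same time slot
        # NOTE: this uses the consumption of the current slot, which also enters the target;
        # shift the column by steps_per_day if only past information may be used.
        df_fe['resid_j-1'] = df_fe['Consommation'] - df_fe['Prévision J-1']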

        return df_fe
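
The `lag_24h` and `roll_mean_7d` steps can be checked by hand on synthetic data. The sketch below is only an illustration under stated assumptions: the `toy` frame is made up, the data is assumed to be gap-free and half-hourly, and the load is set to the day number so that the expected values are obvious.

```python
import numpy as np
import pandas as pd

STEPS_PER_DAY = 48  # half-hourly data, no missing slots (assumption)

# Ten days of synthetic half-hourly load: the value is simply the day number.
idx = pd.date_range("2023-01-01", periods=10 * STEPS_PER_DAY, freq="30min")
toy = pd.DataFrame({"load": idx.dayofyear.to_numpy(dtype=float)}, index=idx)

# lag_24h: value observed exactly one day earlier, at the same time slot.
toy["lag_24h"] = toy["load"].shift(STEPS_PER_DAY)

# roll_mean_7d: for each time slot, mean of lag_24h over the 7 most recent days.
# Grouping by time of day keeps the slots separate, so the window covers
# 7 days rather than 7 consecutive half-hours.
toy["roll_mean_7d"] = (
    toy.groupby(toy.index.time)["lag_24h"]
       .transform(lambda s: s.rolling(window=7).mean())
)

last = toy.iloc[-1]  # last slot of day 10
```

On day 10, `lag_24h` is the load of day 9 and `roll_mean_7d` is the mean of days 3 to 9. The first 7 days of a subset have no complete window, which is why those rows are dropped before fitting.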

In [512]:
## Apply feature engineering to the subsets ##

fe = FeatureEngineering()

train_fe        = fe.fit_transform(train)
train_fe_clean  = train_fe.dropna(subset = ['lag_24h','roll_mean_7d'])
y_train_resid   = train_fe_clean['Consommation'] - train_fe_clean['Prévision J']
X_train_resid   = train_fe_clean.drop(columns = [
    'Heures','Consommation','Prévision J','Prévision J-1',
    'Avg_temp_min','Avg_temp_max','Avg_temp_mean'
])

valid_fe       = fe.transform(valid)                                    
valid_fe_clean = valid_fe.dropna(subset = ['lag_24h','roll_mean_7d'])      
y_valid_resid  = valid_fe_clean['Consommation'] - valid_fe_clean['Prévision J']
X_valid_resid  = valid_fe_clean.drop( columns = [
    'Heures','Consommation','Prévision J','Prévision J-1',
    'Avg_temp_min','Avg_temp_max','Avg_temp_mean'
]) 

test_fe        = fe.transform(test)
test_fe_clean  = test_fe.dropna(subset = ['lag_24h','roll_mean_7d'])
y_test_resid   = test_fe_clean['Consommation'] - test_fe_clean['Prévision J']
X_test_resid   = test_fe_clean.drop(columns = [
    'Heures','Consommation','Prévision J','Prévision J-1',
    'Avg_temp_min','Avg_temp_max','Avg_temp_mean'
])

In [513]:
## Feature cleaning / scaling transformer ##

# Impute missing temperature features with the mean and add a missing-value indicator column
temperature_imputer = SimpleImputer(strategy = 'mean', add_indicator = True)

# Impute missing TEMPO colours with 'BLEU' (blue), the most common day type
tempo_imputer = Pipeline([
    ('imputer', SimpleImputer(strategy = 'constant', fill_value = 'BLEU')),
    ('ohe', OneHotEncoder(drop = 'first')) # Drop the first category to avoid perfect collinearity (dummy variable trap).
])

numerical_feats = ['resid_j-1','hour_sin','hour_cos','lag_24h','roll_mean_7d']
categorial_feats = ['day_of_week']
binary_feats = ['Bank holidays','School holidays']

# Cleaning and scaling transformer
ImputingAndScaling = ColumnTransformer(
    [
        # Impute temperature
        ('temp', temperature_imputer, ['HDD','CDD']),
        # Impute TEMPO
        ('tempo', tempo_imputer, ['Type de jour TEMPO']),
        # Standardize continuous numerical variables (zero mean, unit variance)
        ('num', StandardScaler(), numerical_feats),
        # One-hot encoding of categorical variables (Day of week)
        ('cat', OneHotEncoder(drop = 'first'), categorial_feats),
        # Binary indicators (Bank holidays, School holidays) are kept as they are.
    ],
    remainder = 'passthrough',
    force_int_remainder_cols = False # Passthrough columns referenced by name (to avoid the warning)
)
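
To see what this kind of transformer produces, the sketch below applies a reduced version of it to four made-up rows. The `toy` frame, the short column name `tempo` and the `sparse_threshold = 0.0` setting (used only to get a dense, printable array) are assumptions of the example, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one missing temperature feature and one missing TEMPO colour.
toy = pd.DataFrame({
    "HDD": [8.0, np.nan, 0.0, 13.0],
    "tempo": pd.Series(["BLEU", "ROUGE", np.nan, "BLANC"], dtype=object),
    "lag_24h": [50.0, 60.0, 70.0, 80.0],
})

toy_tempo_imputer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="BLEU")),
    ("ohe", OneHotEncoder(drop="first")),
])

toy_prep = ColumnTransformer(
    [
        # Mean imputation plus one indicator column flagging imputed rows.
        ("temp", SimpleImputer(strategy="mean", add_indicator=True), ["HDD"]),
        # Constant imputation, then one-hot encoding without the first category.
        ("tempo", toy_tempo_imputer, ["tempo"]),
        # Zero mean, unit variance.
        ("num", StandardScaler(), ["lag_24h"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

out = toy_prep.fit_transform(toy)
# Column layout: HDD, HDD-missing flag, tempo=BLEU, tempo=ROUGE, scaled lag_24h.
```

The missing `HDD` is replaced by the mean of the observed values and flagged in the second column, and the dropped category (`BLANC`, first in alphabetical order) is represented by zeros in both TEMPO columns.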

In [514]:
## Build the full pipeline (cleaning / scaling + Ridge) and fit it on the training set ##

pipe  = Pipeline([
    ('prep',  ImputingAndScaling),
    ('model', Ridge())
])

pipe.fit(X_train_resid, y_train_resid)