# SEATTLE ENERGY BENCHMARKING PROJECT
## Notebook: modeling pipeline and multi-model experimentation



---

### Overview
* **Target population**: Non-residential buildings
* **Created**: 12 January 2026
* **Objective**: Categorize the variables for 3 distinct models

### Three-model strategy
1. **Model 1 (purely predictive)**: authorized variables only
2. **Model 2 (partial data leakage)**: authorized variables + ENERGY STAR Score

### Deliverables
1. Dataset after first-pass filtering (non-residential)
2. Categorization of the 46 variables
3. Descriptive statistics of the target
4. Missing-value analysis
5. Correlations among the authorized variables

---

# Notebook table of contents

- [Section 0: Package imports](#section-0)
- [Section 1: Loading and filtering](#section-1)
- [Section 2: Variable categorization (3 models)](#section-2)
- [Section 3: Train-test split](#section-3)
- [Section 3: Model training](#section-3)
- [Section 4: Performance evaluation](#section-4)
- [Section 5: MLflow integration](#section-5)
- [Section 6: Documentation and summary](#section-6)


<a id="section-0"></a>
# Section 0: Package imports

In [2]:
# Standard library
import logging
import sys
from pathlib import Path

# Data handling and plotting
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Make the project's utility functions in src/ importable
PROJECT_ROOT = Path.cwd().parent
SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

# Modeling pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    RobustScaler,
    StandardScaler,
)
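
The imports above suggest a preprocessing-plus-regressor pipeline. The following is only a minimal sketch of how these classes typically fit together, not the pipeline this notebook actually builds: the toy values, the imputation strategy, and the hyperparameters are all assumptions chosen for illustration. Only the two column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values) reusing two column names from the dataset
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "LargestPropertyUseTypeGFA": rng.uniform(1e4, 3e5, size=40),
    "PrimaryPropertyType": rng.choice(["Hotel", "Warehouse", "Other"], size=40),
})
X_toy.loc[0, "LargestPropertyUseTypeGFA"] = np.nan  # exercise the imputer
y_toy = rng.normal(4.0, 1.0, size=40)  # stand-in for a log-scale target

numeric_cols = ["LargestPropertyUseTypeGFA"]
categorical_cols = ["PrimaryPropertyType"]

# Numeric columns: median imputation, then standardization.
# Categorical columns: one-hot encoding that ignores unseen categories.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
```

Wrapping preprocessing and the estimator in one `Pipeline` means the imputer and scaler are fitted on the training data only, which avoids leaking test-set statistics into the model.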

In [3]:
# Display configuration
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

# Add the project root to sys.path so that `src` can be imported as a package
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# Import the feature_engineering module
from src.feature_engineering import (
    create_ratio_features,
    create_interaction_features,
    create_temporal_features,
    create_polynomial_features,
    fit_aggregated_features,
    transform_aggregated_features,
    print_feature_summary
)

# Constants
REFERENCE_YEAR = 2016  # Year of the dataset

# Paths
processed = Path('../data/processed_data')
interim = Path('../data/interim_data')
interim.mkdir(parents=True, exist_ok=True)

print(f"Reference year: {REFERENCE_YEAR}")

Reference year: 2016


<a id="section-1"></a>
# Section 1: Loading and filtering

In [8]:
# Load the data produced by notebook 02
train_df = pd.read_csv(interim / 'train_with_features.csv')
test_df = pd.read_csv(interim / 'test_with_features.csv')

print(f"Train: {train_df.shape}")
print(f"Test: {test_df.shape}")
print(f"\nColumns: {train_df.shape[1]}")

Train: (1332, 31)
Test: (334, 31)

Columns: 31
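
The shapes printed above correspond to roughly an 80/20 split (334 of 1,666 rows in the test set). A quick consistency check between the two files can catch a mismatched export early. The helper below is a hypothetical sketch, not a function from `src`, and it runs on toy frames rather than the real CSV files.

```python
import pandas as pd


def check_split_consistency(train: pd.DataFrame, test: pd.DataFrame, target: str) -> dict:
    """Return basic consistency facts about a train/test pair (hypothetical helper)."""
    return {
        "same_columns": list(train.columns) == list(test.columns),
        "target_present": target in train.columns and target in test.columns,
        "test_share": len(test) / (len(train) + len(test)),
    }


# Toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({
    "GFA_sqrt": [178.6, 552.8, 301.9, 313.8],
    "TotalGHGEmissions_log": [2.69, 6.34, 5.17, 3.70],
})
toy_test = pd.DataFrame({
    "GFA_sqrt": [145.0],
    "TotalGHGEmissions_log": [4.28],
})

report = check_split_consistency(toy_train, toy_test, "TotalGHGEmissions_log")
print(report)
```

On the real data the call would be `check_split_consistency(train_df, test_df, 'TotalGHGEmissions_log')`.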


In [10]:
# Preview the data
train_df.head()

Unnamed: 0,OSEBuildingID,BuildingType,PrimaryPropertyType,Address,Neighborhood,Latitude,Longitude,ListOfAllPropertyUseTypes,LargestPropertyUseType,LargestPropertyUseTypeGFA,SecondLargestPropertyUseType,SecondLargestPropertyUseTypeGFA,ENERGYSTARScore,SteamUse(kBtu),Electricity(kWh),NaturalGas(therms),TotalGHGEmissions,TotalGHGEmissions_log,GFA_per_floor,Parking_ratio,Building_age_squared,Is_old_building,Size_floors,Age_size,Age_floors,GFA_sqrt,Floors_squared,Neighborhood_mean,Neighborhood_std,PrimaryPropertyType_mean,PrimaryPropertyType_std
0,23701.0,NonResidential,Warehouse,1136 S. Albro Place,GREATER DUWAMISH,47.5465,-122.31704,"Non-Refrigerated Warehouse, Office",Non-Refrigerated Warehouse,28000.0,Office,4000.0,73.0,0.0,575815.1,0.0,13.7,2.687847,31899.9681,0.0,2809,1,31900.0,1690700.0,53.0,178.605711,1.0,3.502194,1.267156,3.105404,1.199811
1,401.0,NonResidential,Hotel,1325 6th Ave,DOWNTOWN,47.60968,-122.33379,Hotel,Hotel,310000.0,Parking,11745.0,58.0,845964.44375,4272584.0,28667.04883,564.285,6.337328,24542.006061,0.0,7396,1,3804072.6,26277128.0,1070.7,552.763964,155.0025,4.339052,1.371857,5.200315,1.025148
2,238.0,Nonresidential COS,Small- and Mid-Sized Office,1300 N 97th ST,NORTHWEST,47.70044,-122.34136,"Data Center, Distribution Center, Office, Othe...",Office,57968.0,Distribution Center,32881.0,73.0,0.0,2079128.0,23783.73047,175.77,5.174849,45564.977218,0.0,3481,1,182260.0,5376670.0,118.0,301.877459,4.0,3.974065,1.308375,3.29727,0.907349
3,600.0,NonResidential,Warehouse,4100 4th Avenue South,GREATER DUWAMISH,47.56558,-122.32889,Non-Refrigerated Warehouse,Non-Refrigerated Warehouse,98480.0,Parking,11745.0,65.0,0.0,704380.7,4282.180176,39.5,3.701302,98479.90152,0.0,3600,1,98480.0,5908800.0,60.0,313.815232,1.0,3.502194,1.267156,3.105404,1.199811
4,21336.0,NonResidential,Other,1004 Boren Ave,EAST,47.6097,-122.325,Social/Meeting Hall,Social/Meeting Hall,20411.0,Parking,11745.0,73.0,0.0,227122.7,12395.33008,71.23,4.279855,7012.730996,0.0,10816,1,63114.6,2187972.8,312.0,145.04551,9.0,4.224068,1.359667,4.282176,1.305329
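
The transformation that produced `TotalGHGEmissions_log` lives in an earlier notebook that is not shown here, but the preview values are consistent with `log1p`, that is ln(1 + x). The sketch below checks this on rows 0, 2, and 3 of the preview and shows the matching inverse, `expm1`, which would be needed to report errors in the original emission units. Treat the log1p reading as an inference from the printed values, not a confirmed fact about the upstream code.

```python
import numpy as np

# Values copied from rows 0, 2 and 3 of the preview above
raw = np.array([13.7, 175.77, 39.5])               # TotalGHGEmissions
logged = np.array([2.687847, 5.174849, 3.701302])  # TotalGHGEmissions_log

# log1p(x) = ln(1 + x) matches the logged column to the printed precision
matches_log1p = np.allclose(np.log1p(raw), logged, atol=1e-5)

# expm1 inverts log1p, mapping log-scale predictions back to emission units
restored = np.expm1(np.log1p(raw))
```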


In [None]:
# All columns of train_df
list(train_df.columns)

['OSEBuildingID',
 'BuildingType',
 'PrimaryPropertyType',
 'Address',
 'Neighborhood',
 'Latitude',
 'Longitude',
 'ListOfAllPropertyUseTypes',
 'LargestPropertyUseType',
 'LargestPropertyUseTypeGFA',
 'SecondLargestPropertyUseType',
 'SecondLargestPropertyUseTypeGFA',
 'ENERGYSTARScore',
 'SteamUse(kBtu)',
 'Electricity(kWh)',
 'NaturalGas(therms)',
 'TotalGHGEmissions',
 'TotalGHGEmissions_log',
 'GFA_per_floor',
 'Parking_ratio',
 'Building_age_squared',
 'Is_old_building',
 'Size_floors',
 'Age_size',
 'Age_floors',
 'GFA_sqrt',
 'Floors_squared',
 'Neighborhood_mean',
 'Neighborhood_std',
 'PrimaryPropertyType_mean',
 'PrimaryPropertyType_std']
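
The deliverables include a missing-value analysis. One possible way to summarize missing values per column is sketched below. The helper name and the toy values are assumptions for illustration; only the column names are taken from the list above.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most affected first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("n_missing", ascending=False)


# Toy frame with hypothetical gaps
toy = pd.DataFrame({
    "ENERGYSTARScore": [73.0, np.nan, 58.0, np.nan],
    "Neighborhood": ["DOWNTOWN", "EAST", None, "NORTHWEST"],
    "GFA_sqrt": [178.6, 552.8, 301.9, 313.8],
})
report = missing_value_report(toy)
print(report)
```

On the real data the call would be `missing_value_report(train_df)`.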

<a id="section-2"></a>
# Section 2: Variable categorization (3 models)

In [None]:
# MODEL 1: authorized variables (available at permit time)
variables_autorisees = [
    # Identification & location
    'BuildingType', 'PrimaryPropertyType', 'City', 'State', 'ZipCode',
    'CouncilDistrictCode', 'Neighborhood', 'Latitude', 'Longitude',

    # Structural characteristics
    'YearBuilt', 'NumberofBuildings', 'NumberofFloors',
    'PropertyGFATotal', 'PropertyGFAParking', 'PropertyGFABuilding(s)',

    # Usage typology
    'ListOfAllPropertyUseTypes', 'LargestPropertyUseType',
    'LargestPropertyUseTypeGFA', 'SecondLargestPropertyUseType',
    'SecondLargestPropertyUseTypeGFA', 'ThirdLargestPropertyUseType',
    'ThirdLargestPropertyUseTypeGFA'
]

# MODEL 2: authorized variables + ENERGY STAR (PARTIAL DATA LEAKAGE)
variable_energystar = ['ENERGYSTARScore']

# Variables to exclude (identifiers)
variables_id = [
    'OSEBuildingID', 'DataYear', 'PropertyName', 'Address',
    'TaxParcelIdentificationNumber', 'Comments', 'Outlier',
    'DefaultData', 'ComplianceStatus'
]

# Target variable
target = 'TotalGHGEmissions_log'

# All explanatory variables (Model 2 feature set)
variables_exp_tot = variables_autorisees + variable_energystar


print("CATEGORIZATION OF THE 46 VARIABLES")

print(f"\n MODEL 1 - Authorized variables: {len(variables_autorisees)}")
print(f" MODEL 2 - ENERGY STAR: {len(variable_energystar)}")
print(f" ID variables (excluded): {len(variables_id)}")
print(f" Target variable: {target}")


CATEGORIZATION OF THE 46 VARIABLES

 MODEL 1 - Authorized variables: 22
 MODEL 2 - ENERGY STAR: 1
 ID variables (excluded): 9
 Target variable: TotalGHGEmissions_log
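
Several names in `variables_autorisees` (for example `City`, `State`, and `YearBuilt`) do not appear in the `train_df` column list printed in Section 1, so selecting them directly with `train_df[variables_autorisees]` would raise a `KeyError`. A minimal sketch of a guard is shown below: it splits a wanted list into the names that exist and the names that do not. The helper is hypothetical, and the two lists are short excerpts rather than the full sets.

```python
def split_known_columns(wanted, available):
    """Split `wanted` into names present in `available` and names that are not."""
    available_set = set(available)
    present = [col for col in wanted if col in available_set]
    missing = [col for col in wanted if col not in available_set]
    return present, missing


# Excerpts: a few authorized variables and a few train_df columns
wanted = ['BuildingType', 'PrimaryPropertyType', 'City', 'YearBuilt', 'Neighborhood']
available = ['OSEBuildingID', 'BuildingType', 'PrimaryPropertyType',
             'Neighborhood', 'GFA_sqrt']

present, missing = split_known_columns(wanted, available)
print("Usable:", present)
print("Not in the dataframe:", missing)
```

With the real objects the call would be `split_known_columns(variables_autorisees, train_df.columns)`. Reviewing the `missing` list makes it explicit which raw variables were dropped or replaced by engineered features upstream.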


<a id="section-3"></a>
# Section 3 : 