# SEATTLE ENERGY BENCHMARKING PROJECT
## Notebook – Modeling pipeline and multi-model experimentation



---

### Identity
* **Target population**: Non-residential buildings
* **Creation date**: January 12, 2026
* **Objective**: Categorize the variables for 3 distinct models

### 3-Model Strategy
1. **Model 1 (Purely Predictive)**: Allowed variables only
2. **Model 2 (Partial Data Leakage)**: Model 1 variables + ENERGY STAR Score

### Deliverables
1. Dataset with a first-pass filter applied (non-residential buildings only)
2. Categorization of the 46 variables
3. Descriptive statistics of the target
4. Missing-value analysis
5. Correlations among the allowed variables

---

# Notebook table of contents

- [Section 0: Package imports](#section-0)
- [Section 1: Loading and Filtering](#section-1)
- [Section 2: Variable Categorization (3 Models)](#section-2)
- [Section 3: Train-test split](#section-3)
- [Section 3: Model training](#section-3)
- [Section 4: Performance evaluation](#section-4)
- [Section 5: MLflow integration](#section-5)
- [Section 6: Documentation and summary](#section-6)


<a id="section-0"></a>
# Section 0: Package imports

In [1]:
import logging
import sys
from pathlib import Path

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Make the project's utility functions importable
PROJECT_ROOT = Path.cwd().parent
SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

# Data pipeline (project modules, currently disabled)
# from utils.config_loader import load_config, create_directories
# from utils.eda_logger import setup_eda_logger
# from data.load_data import load_data_raw
# from data.clean_data import run_cleaning_pipeline
# from feature_engineering.build_features import run_feature_engineering_pipeline
# import copy

# Modeling pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    RobustScaler,
    StandardScaler,
)

<a id="section-1"></a>
# Section 1: Loading and Filtering

In [2]:
# Load the dataset
url = "https://raw.githubusercontent.com/MouslyDiaw/handson-machine-learning/master/data/2016_Building_Energy_Benchmarking.csv"
df = pd.read_csv(url)

print(f"Initial dataset: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

Initial dataset: 3376 rows, 46 columns


| | OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | ... | Electricity(kWh) | Electricity(kBtu) | NaturalGas(therms) | NaturalGas(kBtu) | DefaultData | Comments | ComplianceStatus | Outlier | TotalGHGEmissions | GHGEmissionsIntensity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016 | NonResidential | Hotel | Mayflower park hotel | 405 Olive way | Seattle | WA | 98101.0 | 659000030 | ... | 1156514.0 | 3946027.0 | 12764.5293 | 1276453.0 | False | | Compliant | | 249.98 | 2.83 |
| 1 | 2 | 2016 | NonResidential | Hotel | Paramount Hotel | 724 Pine street | Seattle | WA | 98101.0 | 659000220 | ... | 950425.2 | 3242851.0 | 51450.81641 | 5145082.0 | False | | Compliant | | 295.86 | 2.86 |
| 2 | 3 | 2016 | NonResidential | Hotel | 5673-The Westin Seattle | 1900 5th Avenue | Seattle | WA | 98101.0 | 659000475 | ... | 14515440.0 | 49526664.0 | 14938.0 | 1493800.0 | False | | Compliant | | 2089.28 | 2.19 |
| 3 | 5 | 2016 | NonResidential | Hotel | HOTEL MAX | 620 STEWART ST | Seattle | WA | 98101.0 | 659000640 | ... | 811525.3 | 2768924.0 | 18112.13086 | 1811213.0 | False | | Compliant | | 286.43 | 4.67 |
| 4 | 8 | 2016 | NonResidential | Hotel | WARWICK SEATTLE HOTEL (ID8) | 401 LENORA ST | Seattle | WA | 98121.0 | 659000970 | ... | 1573449.0 | 5368607.0 | 88039.98438 | 8803998.0 | False | | Compliant | | 505.01 | 2.88 |
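
The section title mentions filtering, and the first deliverable is a dataset restricted to non-residential buildings, but the cell above only loads the data. Below is a minimal sketch of one possible filter. It assumes that residential rows are those whose `BuildingType` label contains "Multifamily". That rule, the helper name `filter_non_residential`, and the toy labels are assumptions for illustration, not the notebook's confirmed method.

```python
import pandas as pd


def filter_non_residential(df: pd.DataFrame, col: str = "BuildingType") -> pd.DataFrame:
    """Drop rows whose building-type label contains 'Multifamily'.

    Assumption: 'Multifamily' marks residential buildings; adjust the rule
    after inspecting the label counts in the real dataset.
    """
    is_residential = df[col].astype(str).str.contains("Multifamily", case=False, na=False)
    return df.loc[~is_residential].reset_index(drop=True)


# Toy frame standing in for the real dataset (labels chosen for illustration).
toy = pd.DataFrame({
    "OSEBuildingID": [1, 2, 3, 4],
    "BuildingType": ["NonResidential", "Multifamily LR (1-4)", "Campus", "Multifamily HR (10+)"],
})
toy_filtered = filter_non_residential(toy)
print(f"After filtering: {toy_filtered.shape[0]} rows out of {toy.shape[0]}")
```

In the notebook, the same helper could be applied to `df` right after loading. Inspecting `df["BuildingType"].value_counts()` first would confirm which labels actually need to be excluded.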


<a id="section-2"></a>
# Section 2: Variable Categorization (3 Models)

In [4]:
# MODEL 1: Allowed variables (available at permit time)
variables_autorisees = [
    # Identification & location
    'BuildingType', 'PrimaryPropertyType', 'City', 'State', 'ZipCode',
    'CouncilDistrictCode', 'Neighborhood', 'Latitude', 'Longitude',

    # Structural characteristics
    'YearBuilt', 'NumberofBuildings', 'NumberofFloors',
    'PropertyGFATotal', 'PropertyGFAParking', 'PropertyGFABuilding(s)',

    # Usage typology
    'ListOfAllPropertyUseTypes', 'LargestPropertyUseType',
    'LargestPropertyUseTypeGFA', 'SecondLargestPropertyUseType',
    'SecondLargestPropertyUseTypeGFA', 'ThirdLargestPropertyUseType',
    'ThirdLargestPropertyUseTypeGFA'
]

# MODEL 2: ENERGY STAR (PARTIAL DATA LEAKAGE)
variable_energystar = ['ENERGYSTARScore']

# Variables to exclude (identifiers and administrative metadata)
variables_id = [
    'OSEBuildingID', 'DataYear', 'PropertyName', 'Address',
    'TaxParcelIdentificationNumber', 'Comments', 'Outlier',
    'DefaultData', 'ComplianceStatus'
]

# Target variable
target = 'TotalGHGEmissions'
# All explanatory variables (Model 1 + ENERGY STAR)
variables_exp_tot = variables_autorisees + variable_energystar


print("CATEGORIZATION OF THE 46 VARIABLES")

print(f"\n MODEL 1 - Allowed variables: {len(variables_autorisees)}")
print(f" MODEL 2 - ENERGY STAR: {len(variable_energystar)}")
print(f" ID variables (excluded): {len(variables_id)}")
print(f" Target variable: {target}")


CATEGORIZATION OF THE 46 VARIABLES

 MODEL 1 - Allowed variables: 22
 MODEL 2 - ENERGY STAR: 1
 ID variables (excluded): 9
 Target variable: TotalGHGEmissions
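
The counts above add up to 22 + 1 + 9 + 1 (target) = 33 of the 46 columns, so 13 columns are not assigned to any list in this cell. A small check like the following sketch can make that gap explicit and catch misspelled or doubly assigned column names. The helper `summarize_categories` and the toy inputs are hypothetical and only illustrate the idea.

```python
def summarize_categories(columns, categories):
    """Compare a category mapping against the available column names.

    Returns three lists:
      - uncategorized: columns that belong to no category
      - missing: categorized names that are not actual columns
      - duplicated: names assigned to more than one category
    """
    assigned = [name for names in categories.values() for name in names]
    assigned_set = set(assigned)
    column_set = set(columns)
    uncategorized = [c for c in columns if c not in assigned_set]
    missing = [c for c in assigned if c not in column_set]
    duplicated = sorted({c for c in assigned if assigned.count(c) > 1})
    return uncategorized, missing, duplicated


# Toy inputs standing in for df.columns and the notebook's lists.
toy_columns = [
    "OSEBuildingID", "YearBuilt", "ENERGYSTARScore",
    "Electricity(kWh)", "TotalGHGEmissions",
]
toy_categories = {
    "allowed": ["YearBuilt"],
    "energystar": ["ENERGYSTARScore"],
    "id": ["OSEBuildingID"],
    "target": ["TotalGHGEmissions"],
}
uncategorized, missing, duplicated = summarize_categories(toy_columns, toy_categories)
print("Uncategorized:", uncategorized)
print("Missing from data:", missing)
print("Assigned twice:", duplicated)
```

In the notebook, the call would presumably take `list(df.columns)` and a dictionary built from `variables_autorisees`, `variable_energystar`, `variables_id`, and `[target]`.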


<a id="section-3"></a>
# Section 3: Train-test split