# SEATTLE ENERGY BENCHMARKING PROJECT
## Notebook 06: modeling pipeline and first experiments
---

### Document identity
* **Status:** Phase 1 (exploration & prototyping)
* **Last updated:** 10 January 2026
* **Notebook dependencies:** Notebooks 0 to 7, feature engineering

### Description
This notebook is the first step in building the predictive models. It uses the cleaned and enriched data to test several modeling approaches, compare their performance, and establish a baseline. The goal is to document a reproducible pipeline and identify the most promising models.


### Main objectives
1. Load the prepared dataset from `processed/`.
2. Set up the train/test split.
3. Train the baseline models:
   - Linear regression.
   - Ridge/Lasso regression.
   - Random Forest.
4. Evaluate performance with RMSE, MAE, and R².
5. Integrate MLflow to track runs (parameters, metrics, artifacts).
6. Document the results and generate a summary report.
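
As a rough illustration of objectives 2 to 4, the sketch below splits a dataset, fits the four baseline models, and computes RMSE, MAE, and R² for each. It is only a minimal sketch: the data is synthetic (generated with `make_regression`, not the project's `processed/` dataset), and the hyperparameters (`alpha`, `n_estimators`, `test_size`, `random_state`) are placeholder assumptions rather than the values used in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset (assumption: the real
# features and target would be loaded from processed/ instead).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Hold out 20% of the rows for testing; the seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline models with placeholder hyperparameters.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        # RMSE is the square root of the mean squared error.
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }

for name, m in results.items():
    print(f"{name:>13}  RMSE={m['rmse']:.2f}  MAE={m['mae']:.2f}  R2={m['r2']:.3f}")
```

Collecting the metrics in a plain dictionary keeps the comparison step independent of any tracking tool; the same values could later be logged to MLflow or written to a report table.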

---
### Critical dependencies
* `src.feature_engineering`: feature pipeline.
* `src.utils`: split and metric functions.
* `sklearn`: modeling library.
* `mlflow`: experiment tracking.

### DELIVERABLES
1. Trained baseline models saved to `models/`.
2. Metric results in the MLflow UI.
3. Performance comparison tables (`reports/model_baseline.md`).
4. Visualizations of residuals and error distributions.
5. Documented notebook, reproducible via scripts.

---

# 📚 Table of contents

- [Section 0: Package imports](#section-0)
- [Section 1: Loading the feature engineering data](#section-1)
- [Section 2: Preprocessing and train/test split](#section-2)
- [Section 3: Training the baseline models](#section-3)
- [Section 4: Performance evaluation](#section-4)
- [Section 5: MLflow integration](#section-5)
- [Section 6: Documentation and summary](#section-6)

> Note: the table of contents is indicative. Use your editor's built-in navigation (e.g. the VS Code outline) to jump to sections quickly.
---

<a id="section-0"></a>
# Section 0: Package imports

In [1]:
import logging
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Make the project's src/ directory importable from the notebook
PROJECT_ROOT = Path.cwd().parent
SRC_PATH = PROJECT_ROOT / "src"

if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

# Project utilities and pipeline steps
from data.clean_data import run_cleaning_pipeline
from data.load_data import load_data_raw
from feature_engineering.build_features import run_feature_engineering_pipeline
from utils.config_loader import create_directories, load_config
from utils.eda_logger import setup_eda_logger

In [2]:
# Configure the logger so messages are displayed in the notebook
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger("notebook")

<a id="section-1"></a>
# Section 1: Loading the data

In [3]:
cfg = load_config()
create_directories(cfg)  # create the directories if they do not exist

# A. Load raw data
logger.info("--- 1. LOADING ---")
df_raw = load_data_raw(cfg)

# B. Cleaning
logger.info("--- 2. CLEANING ---")
df_cleaned = run_cleaning_pipeline(df_raw, cfg)


2026-01-11 01:54:07,854 - Configuration 'config' loaded (project_root=C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2)
2026-01-11 01:54:07,857 - Directory ready: C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2\data\raw
2026-01-11 01:54:07,859 - Directory ready: C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2\data\interim
2026-01-11 01:54:07,860 - Directory ready: C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2\data\processed
2026-01-11 01:54:07,862 - Directory ready: C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2\figures
2026-01-11 01:54:07,864 - Directory ready: C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2\reports
2026-01-11 01:54:07,864 - --- 1. LOADING ---
2026-01-11 01:54:07,908 - DataFrame loaded: 3376 rows, 46 columns
2026-01-11 01:54:07,944 - ✔️ 2016_Building_Energy_Benchmarking.csv: identical to the previous version.
2026-01-11 01:54:07,944 - --- 2. CLEANI

   [Audit] section_0: -1752 rows exported to section_0_removed.csv
   [Audit] section_2: -36 rows exported to section_2_removed.csv


2026-01-11 01:54:08,667 - --- Running: section_1 ---
2026-01-11 01:54:08,816 - ✓ Data saved to: C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2\data\interim\data_cleaned.csv


   [Audit] section_3: -59 rows exported to section_3_removed.csv
   [Audit] section_1: -108 rows exported to section_1_removed.csv


In [4]:

# C. Feature Engineering 
logger.info("--- 3. FEATURE ENGINEERING ---")
df_final = run_feature_engineering_pipeline(df_cleaned, cfg)

2026-01-11 01:54:31,114 - --- 3. FEATURE ENGINEERING ---
2026-01-11 01:54:31,116 - --- Starting: Feature Engineering ---
2026-01-11 01:54:31,297 - ✓ Feature Engineering complete. Shape: (1421, 68)


✓ Feature engineering saved to: C:\Users\HP\Desktop\temp\TODO\SEMESTRE_1\ML1\ML-prediction-CO2\data\processed\model_input.csv


In [None]:
print(f"Rows after cleaning: {len(df_cleaned)}")
print(f"Rows after FE: {len(df_final)}")

Rows after cleaning: 1421
Rows after FE: 1421
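
The row counts above can be cross-checked against the `[Audit]` lines in the cleaning log: subtracting the rows removed by each cleaning section from the raw row count should give the post-cleaning total. The short check below is only a sketch of that arithmetic; the numbers are copied by hand from the log output, and the variable names are illustrative rather than part of the project code.

```python
# Row counts copied from the log output of the loading and cleaning steps.
raw_rows = 3376
removed_by_section = {
    "section_0": 1752,
    "section_2": 36,
    "section_3": 59,
    "section_1": 108,
}

# Rows expected to remain once every cleaning section has run.
expected_clean_rows = raw_rows - sum(removed_by_section.values())
print(expected_clean_rows)  # 1421, matching len(df_cleaned)
```

Feature engineering reports a shape of (1421, 68), so it adds columns without dropping any rows.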


<a id="section-2"></a>
# Section 2: Preprocessing and train/test split


In [6]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1421 entries, 0 to 3375
Data columns (total 68 columns):
 #   Column                               Non-Null Count  Dtype   
---  ------                               --------------  -----   
 0   BuildingType                         1421 non-null   object  
 1   PrimaryPropertyType                  1421 non-null   object  
 2   ZipCode                              1421 non-null   float64 
 3   CouncilDistrictCode                  1421 non-null   int64   
 4   Neighborhood                         1421 non-null   object  
 5   Latitude                             1421 non-null   float64 
 6   Longitude                            1421 non-null   float64 
 7   YearBuilt                            1421 non-null   int64   
 8   NumberofBuildings                    1421 non-null   float64 
 9   NumberofFloors                       1421 non-null   int64   
 10  PropertyGFATotal                     1421 non-null   int64   
 11  PropertyGFAParking    