# Step 2: Feature Engineering

## Objective
Create new features from the existing structural data to improve the prediction of:
- `SiteEnergyUse(kBtu)`: total energy consumption
- `TotalGHGEmissions`: total greenhouse gas emissions (CO2)

## Strategy
We create features covering **6 categories**:
1. **Energy sources**: diversity and types of energy used
2. **Temporality**: age and construction era
3. **Structure**: area ratios, height, complexity
4. **Uses**: diversity and types of use
5. **Location**: geographic position
6. **Grouping**: cardinality reduction

**Important note**: the energy-source features must be created BEFORE the leakage columns are dropped, because they are derived from those columns.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List

# Configuration
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Constants
TARGET_CO2 = 'TotalGHGEmissions'
TARGET_ENERGY = 'SiteEnergyUse(kBtu)'
CIBLES = [TARGET_CO2, TARGET_ENERGY]  # the two prediction targets

# Load the cleaned data produced in step 1
df_origin = pd.read_csv('data/data_EDA.csv')

df = df_origin.copy()
print("📊 Dataset loaded")
print(f"   Shape : {df.shape}")
print(f"   Rows : {len(df):,} buildings")
print(f"   Columns : {df.shape[1]}")

📊 Dataset loaded
   Shape : (1630, 38)
   Rows : 1,630 buildings
   Columns : 38


## 1. ENERGY SOURCE features

### Objective
Create features indicating **which types of energy source** each building uses.

### Why does it matter?
- The **type of energy** directly influences **CO2 emissions**
- Natural gas emits roughly 5x more CO2 than electricity (hydroelectric in Seattle)
- **Energy diversity** is an indicator of building complexity
- A multi-source building generally has higher consumption

### ⚠️ BEWARE OF DATA LEAKAGE
- ❌ We CANNOT use the consumption **values** (`SteamUse(kBtu)`, `Electricity(kBtu)`, `NaturalGas(kBtu)`)
- ✅ We CAN use the **existence** of these sources (presence/absence)
- This structural information is treated as **independent** of consumption intensity
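
The value-versus-presence distinction can be sketched on a toy frame. The three buildings and their numbers are invented for illustration, and `LEAKAGE_COLUMNS` and `model_input` are hypothetical names; only the column names come from the dataset. The flags are derived first and the raw consumption columns are dropped afterwards, which is the order the strategy note asks for.

```python
import pandas as pd

# Hypothetical toy data: three buildings, values invented for illustration.
toy = pd.DataFrame({
    "SteamUse(kBtu)": [0.0, 1200.0, 0.0],
    "Electricity(kBtu)": [5000.0, 8000.0, 300.0],
    "NaturalGas(kBtu)": [0.0, 2500.0, 900.0],
})

LEAKAGE_COLUMNS = ["SteamUse(kBtu)", "Electricity(kBtu)", "NaturalGas(kBtu)"]

# 1) Derive presence/absence flags while the raw columns still exist.
flags = pd.DataFrame({
    "has_steam": (toy["SteamUse(kBtu)"] > 0).astype(int),
    "has_electricity": (toy["Electricity(kBtu)"] > 0).astype(int),
    "has_natural_gas": (toy["NaturalGas(kBtu)"] > 0).astype(int),
})

# 2) Only then drop the consumption values, which would leak the target.
model_input = pd.concat([toy, flags], axis=1).drop(columns=LEAKAGE_COLUMNS)
print(model_input)
```

Note that a missing consumption value compares as `False` against `> 0`, so it is flagged as "absent" rather than propagated as missing.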

In [2]:
print("\n" + "="*80)
print("🔋 CREATING ENERGY SOURCE FEATURES")
print("="*80)

# Feature 1: steam heating present
df['has_steam'] = (df['SteamUse(kBtu)'] > 0).astype(int)
print(f"✅ has_steam : {df['has_steam'].sum()} buildings with steam ({df['has_steam'].mean()*100:.1f}%)")

# Feature 2: electricity present
df['has_electricity'] = (df['Electricity(kBtu)'] > 0).astype(int)
print(f"✅ has_electricity : {df['has_electricity'].sum()} buildings with electricity ({df['has_electricity'].mean()*100:.1f}%)")

# Feature 3: natural gas present
df['has_natural_gas'] = (df['NaturalGas(kBtu)'] > 0).astype(int)
print(f"✅ has_natural_gas : {df['has_natural_gas'].sum()} buildings with gas ({df['has_natural_gas'].mean()*100:.1f}%)")

# Feature 4: energy source diversity (0 to 3)
df['energy_source_diversity'] = (
    df['has_steam'] +
    df['has_electricity'] +
    df['has_natural_gas']
)
print("\n✅ energy_source_diversity :")
print(df['energy_source_diversity'].value_counts().sort_index())

# Feature 5: multi-energy indicator (2+ sources)
df['is_multi_energy'] = (df['energy_source_diversity'] >= 2).astype(int)
print(f"\n✅ is_multi_energy : {df['is_multi_energy'].sum()} multi-energy buildings ({df['is_multi_energy'].mean()*100:.1f}%)")

print("\n📊 Summary: 5 features created for energy sources")


🔋 CREATING ENERGY SOURCE FEATURES
✅ has_steam : 114 buildings with steam (7.0%)
✅ has_electricity : 1630 buildings with electricity (100.0%)
✅ has_natural_gas : 1177 buildings with gas (72.2%)

✅ energy_source_diversity :
energy_source_diversity
1     402
2    1165
3      63
Name: count, dtype: int64

✅ is_multi_energy : 1228 multi-energy buildings (75.3%)

📊 Summary: 5 features created for energy sources


## 2. TEMPORALITY features

### Objective
Capture the impact of **building age** on energy consumption.

### Why does it matter?
- Each **decade** has its own construction standards
- Age is more informative for an ML model than a raw year

In [3]:
print("\n" + "="*80)
print("📅 CREATING TEMPORALITY FEATURES")
print("="*80)

# Reference year: year of the readings (2016)
REFERENCE_YEAR = 2016

# Feature 6: building age (in years as of 2016)
df['building_age'] = REFERENCE_YEAR - df['YearBuilt']
print(f"✅ building_age : min={df['building_age'].min()}, max={df['building_age'].max()}, mean={df['building_age'].mean():.1f} years")

# Feature 7: broad age classes
# pd.cut closes bins on the right by default: age <= 20 is "recent",
# 20 < age <= 50 is "intermediate", age > 50 is "old".
df["building_age_bucket"] = pd.cut(
    df["building_age"],
    bins=[-np.inf, 20, 50, np.inf],
    labels=["recent", "intermediate", "old"]
)

# Feature 8: decade of construction
df['decade_built'] = ((df['YearBuilt'] // 10) * 10).astype(str)
print(f"\n✅ decade_built : {df['decade_built'].nunique()} decades represented")
print(df['decade_built'].value_counts().sort_index().tail(10))

# Feature 9: old building (> 50 years)
df['is_old_building'] = (df['building_age'] > 50).astype(int)
print(f"\n✅ is_old_building : {df['is_old_building'].sum()} old buildings ({df['is_old_building'].mean()*100:.1f}%)")

# Feature 10: recent building (< 20 years, i.e. built after 1996)
df['is_recent_building'] = (df['building_age'] < 20).astype(int)
print(f"✅ is_recent_building : {df['is_recent_building'].sum()} recent buildings ({df['is_recent_building'].mean()*100:.1f}%)")

print("\n📊 Summary: 5 features created for temporality")


📅 CREATING TEMPORALITY FEATURES
✅ building_age : min=1, max=116, mean=54.0 years

✅ decade_built : 12 decades represented
decade_built
1920    168
1930     54
1940     63
1950    161
1960    225
1970    164
1980    168
1990    147
2000    197
2010     56
Name: count, dtype: int64

✅ is_old_building : 814 old buildings (49.9%)
✅ is_recent_building : 314 recent buildings (19.3%)

📊 Summary: 5 features created for temporality

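The age classes and the two threshold flags use slightly different boundary rules. The sketch below, on five invented ages, shows how `pd.cut` with its default right-closed bins treats the edges compared with the strict inequalities used for `is_recent_building` and `is_old_building`. The `comparison` frame is a hypothetical name used only for display.

```python
import numpy as np
import pandas as pd

# Invented ages placed on and around the bin edges.
ages = pd.Series([19, 20, 21, 50, 51])

# Same bins and labels as the notebook cell; intervals are closed on the right by default.
buckets = pd.cut(ages, bins=[-np.inf, 20, 50, np.inf],
                 labels=["recent", "intermediate", "old"])

# Threshold flags with strict inequalities, as in the notebook.
is_recent = (ages < 20).astype(int)
is_old = (ages > 50).astype(int)

comparison = pd.DataFrame({
    "age": ages,
    "bucket": buckets.astype(str),
    "is_recent": is_recent,
    "is_old": is_old,
})
print(comparison)
```

A building aged exactly 20 falls in the "recent" bucket while its `is_recent` flag is 0. If the two definitions are meant to agree, one option is to align the bin edges with the flag thresholds.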

## 3. STRUCTURE features

### Objective
Create **ratios and indicators** describing the physical composition of the buildings.

### Why does it matter?
- **Parking**: areas that are often unheated → a high parking ratio means lower consumption per unit of floor area
- **Height**: tall buildings (>10 floors) → elevators, pumps, vertical ventilation → higher consumption
- **Area per floor**: spread-out vs compact buildings → different thermal efficiency
- **Campus**: multiple buildings → distribution networks → energy losses

In [4]:
print("\n" + "="*80)
print("🏗️ CREATING STRUCTURE FEATURES")
print("="*80)

# Feature 11: parking ratio (parking area / total area)
df['parking_ratio'] = df['PropertyGFAParking'] / df['PropertyGFATotal']
# NaN comes from 0/0 or missing values; x/0 gives inf, which fillna leaves untouched
df['parking_ratio'] = df['parking_ratio'].fillna(0)
print(f"✅ parking_ratio : mean={df['parking_ratio'].mean():.3f}, max={df['parking_ratio'].max():.3f}")

# Feature 12: parking present
df['has_parking'] = (df['PropertyGFAParking'] > 0).astype(int)
print(f"✅ has_parking : {df['has_parking'].sum()} buildings with parking ({df['has_parking'].mean()*100:.1f}%)")

# Feature 13: building ratio (building area / total area)
df['building_ratio'] = df['PropertyGFABuilding(s)'] / df['PropertyGFATotal']
# An undefined ratio (NaN) is treated as "all building", i.e. ratio = 1
df['building_ratio'] = df['building_ratio'].fillna(1)
print(f"✅ building_ratio : mean={df['building_ratio'].mean():.3f}")

# Feature 14: area per floor (building GFA / number of floors)
# Zero floors gives inf/NaN; fall back to the full building area
df['gfa_per_floor'] = df['PropertyGFABuilding(s)'] / df['NumberofFloors']
df['gfa_per_floor'] = df['gfa_per_floor'].replace([np.inf, -np.inf], np.nan).fillna(df['PropertyGFABuilding(s)'])
print(f"✅ gfa_per_floor : mean={df['gfa_per_floor'].mean():.0f} sqft/floor")

# Feature 15: floors per building
df['floor_per_building'] = df['NumberofFloors'] / df['NumberofBuildings']
df['floor_per_building'] = df['floor_per_building'].replace([np.inf, -np.inf], np.nan).fillna(df['NumberofFloors'])
print(f"✅ floor_per_building : mean={df['floor_per_building'].mean():.1f} floors/building")

# Feature 16: tall building (> 10 floors)
df['is_tall_building'] = (df['NumberofFloors'] > 10).astype(int)
print(f"✅ is_tall_building : {df['is_tall_building'].sum()} tall buildings ({df['is_tall_building'].mean()*100:.1f}%)")

# Feature 17: large building (> 100,000 sqft)
df['is_large_building'] = (df['PropertyGFATotal'] > 100000).astype(int)
print(f"✅ is_large_building : {df['is_large_building'].sum()} large buildings ({df['is_large_building'].mean()*100:.1f}%)")

# Feature 18: campus (several buildings)
df['is_campus'] = (df['NumberofBuildings'] > 1).astype(int)
print(f"✅ is_campus : {df['is_campus'].sum()} campuses ({df['is_campus'].mean()*100:.1f}%)")

print("\n📊 Summary: 8 features created for structure")


🏗️ CREATING STRUCTURE FEATURES
✅ parking_ratio : mean=0.062, max=0.895
✅ has_parking : 331 buildings with parking (20.3%)
✅ building_ratio : mean=0.938
✅ gfa_per_floor : mean=35367 sqft/floor
✅ floor_per_building : mean=4.1 floors/building
✅ is_tall_building : 111 tall buildings (6.8%)
✅ is_large_building : 433 large buildings (26.6%)
✅ is_campus : 52 campuses (3.2%)

📊 Summary: 8 features created for structure

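Several of the ratios above divide by a column that may contain zeros. One way to handle both `0/0` (NaN) and `x/0` (inf) in a single place is a small helper, sketched below. `safe_ratio` is a hypothetical name, not something defined in the notebook, and the toy values are invented.

```python
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series, fill_value=0.0) -> pd.Series:
    """Divide two columns, mapping zero or missing denominators to fill_value."""
    # Masking zeros to NaN first means x/0 yields NaN instead of +/-inf.
    ratio = numerator / denominator.where(denominator != 0)
    return ratio.fillna(fill_value)

# Hypothetical toy data covering the normal case, 0/0 and x/0.
toy = pd.DataFrame({
    "PropertyGFAParking": [25_000, 0, 5_000],
    "PropertyGFATotal": [100_000, 0, 0],
})
toy["parking_ratio"] = safe_ratio(toy["PropertyGFAParking"], toy["PropertyGFATotal"])
print(toy)
```

Because `fillna` also accepts a Series, the same helper could reproduce the fallback used for `gfa_per_floor` by passing the building area column as `fill_value`.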

## 4. USE features

### Objective
Capture the **diversity and types of use** of the buildings.

### Why does it matter?
- **Restaurant**: kitchens + refrigeration → 3-5x more energy-intensive than an office
- **Data center**: servers running 24/7 + intensive cooling → 10-20x more energy-intensive
- **Multiple uses**: extended hours, varied needs → a more complex consumption profile
- **Use dominance**: specialized building (90% offices) vs mixed building (40% offices + 30% restaurant)

In [5]:
print("\n" + "="*80)
print("🏢 CREATING USE FEATURES")
print("="*80)

# Feature 19: multiple uses (a comma is present in the list)
df['has_multiple_uses'] = df['ListOfAllPropertyUseTypes'].str.contains(',', na=False).astype(int)
print(f"✅ has_multiple_uses : {df['has_multiple_uses'].sum()} multi-use buildings ({df['has_multiple_uses'].mean()*100:.1f}%)")

# Feature 20: number of distinct uses (number of commas + 1)
df['use_count'] = df['ListOfAllPropertyUseTypes'].str.count(',') + 1
print("\n✅ use_count :")
print(df['use_count'].value_counts().sort_index())

# Feature 21: restaurant present (the pattern also matches any label containing "Food")
df['has_restaurant'] = df['ListOfAllPropertyUseTypes'].str.contains('Restaurant|Food', case=False, na=False).astype(int)
print(f"\n✅ has_restaurant : {df['has_restaurant'].sum()} buildings with a restaurant ({df['has_restaurant'].mean()*100:.1f}%)")

# Feature 22: retail present
df['has_retail'] = df['ListOfAllPropertyUseTypes'].str.contains('Retail|Store', case=False, na=False).astype(int)
print(f"✅ has_retail : {df['has_retail'].sum()} buildings with retail ({df['has_retail'].mean()*100:.1f}%)")

# Feature 23: data center present
df['has_data_center'] = df['ListOfAllPropertyUseTypes'].str.contains('Data Center', case=False, na=False).astype(int)
print(f"✅ has_data_center : {df['has_data_center'].sum()} buildings with a data center ({df['has_data_center'].mean()*100:.1f}%)")

# Feature 24: share of the main use (dominance)
# Not capped at 1: the reported use area can exceed PropertyGFATotal
df['largest_use_ratio'] = df['LargestPropertyUseTypeGFA'] / df['PropertyGFATotal']
df['largest_use_ratio'] = df['largest_use_ratio'].fillna(1)
print(f"\n✅ largest_use_ratio : mean={df['largest_use_ratio'].mean():.3f} (1 = single use)")

print("\n📊 Summary: 6 features created for uses")


🏢 CREATING USE FEATURES
✅ has_multiple_uses : 848 multi-use buildings (52.0%)

✅ use_count :
use_count
1     782
2     494
3     203
4      81
5      41
6      19
7       4
8       1
9       3
11      1
13      1
Name: count, dtype: int64

✅ has_restaurant : 133 buildings with a restaurant (8.2%)
✅ has_retail : 300 buildings with retail (18.4%)
✅ has_data_center : 43 buildings with a data center (2.6%)

✅ largest_use_ratio : mean=0.869 (1 = single use)

📊 Summary: 6 features created for uses

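Counting commas assumes that no use-type label contains a comma itself. Some property-type labels in this style do, for example "Personal Services (Health/Beauty, Dry Cleaning, etc)". Whether such labels occur in this dataset is an assumption worth checking. The sketch below, with an invented toy series and a hypothetical `count_uses` helper, splits only on commas that sit outside parentheses.

```python
import re
import pandas as pd

# Matches a comma (plus following spaces) that is not inside parentheses.
TOP_LEVEL_COMMA = re.compile(r",\s*(?![^()]*\))")

def count_uses(value) -> int:
    """Count use types in a comma-separated list, ignoring commas inside parentheses."""
    if not isinstance(value, str) or not value.strip():
        return 0  # missing or empty list
    return len(TOP_LEVEL_COMMA.split(value))

# Hypothetical examples, including a label with internal commas and a missing value.
toy = pd.Series([
    "Office",
    "Office, Parking, Retail Store",
    "Office, Personal Services (Health/Beauty, Dry Cleaning, etc)",
    None,
])
naive = toy.str.count(",") + 1
robust = toy.apply(count_uses)
print(pd.DataFrame({"naive": naive, "robust": robust}))
```

On the third row the comma count reports 4 uses where there are 2. The regular expression does not handle nested parentheses, which is an accepted limitation of this sketch.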

## 5. LOCATION features

### Objective
Capture the impact of **geographic position** on consumption.

### Why does it matter?
- **Downtown**: dense, older, high-rise buildings → high consumption
- **Outskirts**: recent, spread-out, less dense buildings → moderate consumption
- **Industrial areas**: specific uses (warehouses, production)
- **Distance to the center**: proxy for age and urban type

In [6]:
print("\n" + "="*80)
print("📍 CREATING LOCATION FEATURES")
print("="*80)

# Approximate center of Seattle (Downtown)
CENTER_LAT = 47.6062
CENTER_LON = -122.3321

# Feature 25: distance to the center (Euclidean distance in degrees, an approximation:
# a degree of longitude is shorter than a degree of latitude at this latitude)
df['distance_to_center'] = np.sqrt(
    (df['Latitude'] - CENTER_LAT)**2 +
    (df['Longitude'] - CENTER_LON)**2
)
print(f"✅ distance_to_center : mean={df['distance_to_center'].mean():.4f}°, max={df['distance_to_center'].max():.4f}°")

# Feature 26: downtown building
df['is_downtown'] = (df['Neighborhood'] == 'DOWNTOWN').astype(int)
print(f"✅ is_downtown : {df['is_downtown'].sum()} downtown buildings ({df['is_downtown'].mean()*100:.1f}%)")

# Feature 27: industrial area
industrial_zones = ['GREATER DUWAMISH', 'BALLARD', 'INTERBAY']
df['is_industrial_area'] = df['Neighborhood'].isin(industrial_zones).astype(int)
print(f"✅ is_industrial_area : {df['is_industrial_area'].sum()} buildings in an industrial area ({df['is_industrial_area'].mean()*100:.1f}%)")

print("\n📊 Summary: 3 features created for location")


📍 CREATING LOCATION FEATURES
✅ distance_to_center : mean=0.0439°, max=0.1298°
✅ is_downtown : 353 downtown buildings (21.7%)
✅ is_industrial_area : 410 buildings in an industrial area (25.2%)

📊 Summary: 3 features created for location

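The distance above is measured in degrees, which weights east-west and north-south offsets equally even though a degree of longitude covers less ground at Seattle's latitude. A possible alternative, sketched below and not part of the notebook, is the haversine great-circle distance in kilometres. The function name `haversine_km` and the 6371 km mean Earth radius are choices made for this illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

CENTER_LAT, CENTER_LON = 47.6062, -122.3321  # same reference point as the notebook

# One degree of offset in each direction from the center.
north = haversine_km(CENTER_LAT + 1.0, CENTER_LON, CENTER_LAT, CENTER_LON)
east = haversine_km(CENTER_LAT, CENTER_LON + 1.0, CENTER_LAT, CENTER_LON)
print(f"1 degree north: {north:.1f} km, 1 degree east: {east:.1f} km")
```

The function also accepts pandas Series, so it could be applied as `haversine_km(df['Latitude'], df['Longitude'], CENTER_LAT, CENTER_LON)`. Over a city-sized area the ranking of buildings by distance changes little, so this mainly matters for interpretability of the unit.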

## 6. Feature engineering summary

### Summary of the features created

In [7]:
print("\n" + "="*80)
print("📊 FULL FEATURE ENGINEERING SUMMARY")
print("="*80)

# List of the new features kept for the checks below
new_features = [
    # Category 1: energy sources (4 of the 5 created; has_electricity is not listed)
    'has_steam', 'has_natural_gas', 'energy_source_diversity', 'is_multi_energy',
    
    # Category 2: temporality (4 of the 5 created; building_age_bucket is not listed)
    'building_age', 'decade_built', 'is_old_building', 'is_recent_building',
    
    # Category 3: structure (8)
    'parking_ratio', 'has_parking', 'building_ratio', 'gfa_per_floor', 'floor_per_building',
    'is_tall_building', 'is_large_building', 'is_campus',
    
    # Category 4: uses (6)
    'has_multiple_uses', 'use_count', 'has_restaurant', 'has_retail', 'has_data_center', 'largest_use_ratio',
    
    # Category 5: location (3)
    'distance_to_center', 'is_downtown', 'is_industrial_area'
]

print(f"\n✅ TOTAL : {len(new_features)} new features created\n")

# Hard-coded counts: they do not sum to len(new_features), and the Grouping and
# Interaction features are not created in the cells above
categories = {
    '🔋 Energy sources': 5,
    '📅 Temporality': 5,
    '🏗️ Structure': 8,
    '🏢 Uses': 6,
    '📍 Location': 3,
    '🗂️ Grouping': 2,
    '🔗 Interaction': 4
}

for category, count in categories.items():
    print(f"{category:30s} : {count:2d} features")

print(f"\n{'='*80}")
print(f"Final dataset shape : {df.shape}")
print(f"  • Rows : {len(df):,} buildings")
print(f"  • Columns : {df.shape[1]} (including {len(new_features)} new features)")
print(f"{'='*80}")


📊 FULL FEATURE ENGINEERING SUMMARY

✅ TOTAL : 25 new features created

🔋 Energy sources               :  5 features
📅 Temporality                  :  5 features
🏗️ Structure                   :  8 features
🏢 Uses                         :  6 features
📍 Location                     :  3 features
🗂️ Grouping                    :  2 features
🔗 Interaction                  :  4 features

Final dataset shape : (1630, 65)
  • Rows : 1,630 buildings
  • Columns : 65 (including 25 new features)

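In the summary cell the per-category counts are typed by hand and add up to 33, while `new_features` holds 25 names. One way to keep the two in step, sketched here with a hypothetical `features_by_category` mapping, is to store the names per category and derive both the flat list and the counts from it.

```python
# Hypothetical mapping: category name -> feature names, as listed in new_features.
features_by_category = {
    "Energy sources": ["has_steam", "has_natural_gas",
                       "energy_source_diversity", "is_multi_energy"],
    "Temporality": ["building_age", "decade_built",
                    "is_old_building", "is_recent_building"],
    "Structure": ["parking_ratio", "has_parking", "building_ratio", "gfa_per_floor",
                  "floor_per_building", "is_tall_building", "is_large_building", "is_campus"],
    "Uses": ["has_multiple_uses", "use_count", "has_restaurant",
             "has_retail", "has_data_center", "largest_use_ratio"],
    "Location": ["distance_to_center", "is_downtown", "is_industrial_area"],
}

# The flat list and the counts are both derived, so they cannot drift apart.
new_features = [name for names in features_by_category.values() for name in names]

for category, names in features_by_category.items():
    print(f"{category:20s} : {len(names):2d} features")
print(f"{'Total':20s} : {len(new_features):2d} features")
```

Adding `has_electricity` or `building_age_bucket` to the list would then update the totals automatically.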

## 7. Feature quality checks

We check the new features for missing values, infinite values and implausible ranges.

In [8]:
print("\n" + "="*80)
print("🔍 FEATURE QUALITY CHECKS")
print("="*80)

# Check for missing values in the new features
print("\n1️⃣ Missing values in the new features:")
missing_in_new_features = df[new_features].isnull().sum()
if missing_in_new_features.sum() == 0:
    print("   ✅ No missing values!")
else:
    print(missing_in_new_features[missing_in_new_features > 0])

# Check for infinite values
print("\n2️⃣ Infinite values in the new features:")
numeric_new_features = df[new_features].select_dtypes(include=[np.number]).columns
inf_count = np.isinf(df[numeric_new_features]).sum()
if inf_count.sum() == 0:
    print("   ✅ No infinite values!")
else:
    print(inf_count[inf_count > 0])

# Descriptive statistics of the numeric features
print("\n3️⃣ Statistics of the numeric features created:")
display(df[numeric_new_features].describe().round(2))


🔍 FEATURE QUALITY CHECKS

1️⃣ Missing values in the new features:
   ✅ No missing values!

2️⃣ Infinite values in the new features:
   ✅ No infinite values!

3️⃣ Statistics of the numeric features created:


| | has_steam | has_natural_gas | energy_source_diversity | is_multi_energy | building_age | is_old_building | is_recent_building | parking_ratio | has_parking | building_ratio | gfa_per_floor | floor_per_building | is_tall_building | is_large_building | is_campus | has_multiple_uses | use_count | has_restaurant | has_retail | has_data_center | largest_use_ratio | distance_to_center | is_downtown | is_industrial_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 | 1630.0 |
| mean | 0.07 | 0.72 | 1.79 | 0.75 | 53.96 | 0.5 | 0.19 | 0.06 | 0.2 | 0.94 | 35366.97 | 4.06 | 0.07 | 0.27 | 0.03 | 0.52 | 1.91 | 0.08 | 0.18 | 0.03 | 0.87 | 0.04 | 0.22 | 0.25 |
| std | 0.26 | 0.45 | 0.49 | 0.43 | 32.66 | 0.5 | 0.39 | 0.14 | 0.4 | 0.14 | 124930.02 | 6.2 | 0.25 | 0.44 | 0.18 | 0.5 | 1.23 | 0.27 | 0.39 | 0.16 | 0.31 | 0.03 | 0.41 | 0.43 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 1818.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.19 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 2.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12302.69 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.69 | 0.02 | 0.0 | 0.0 |
| 50% | 0.0 | 1.0 | 2.0 | 1.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 21005.5 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.96 | 0.04 | 0.0 | 0.0 |
| 75% | 0.0 | 1.0 | 2.0 | 1.0 | 86.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 33700.88 | 4.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.07 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 3.0 | 1.0 | 116.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 | 4660078.0 | 76.0 | 1.0 | 1.0 | 1.0 | 1.0 | 13.0 | 1.0 | 1.0 | 1.0 | 6.43 | 0.13 | 1.0 | 1.0 |
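
The missing-value and infinity checks both pass, yet the statistics show `largest_use_ratio` reaching 6.43, although a share of total area would be expected to stay within 0 and 1. A range check can surface this kind of value. The sketch below uses a hypothetical helper, `count_out_of_range`, and a toy frame that reuses a few extremes from the statistics; it is one possible check, not the notebook's own.

```python
import pandas as pd

def count_out_of_range(frame: pd.DataFrame, columns, low=0.0, high=1.0) -> pd.Series:
    """Return, per column, how many values fall outside [low, high]."""
    subset = frame[list(columns)]
    # NaN compares as False on both sides, so missing values are not counted here.
    outside = (subset < low) | (subset > high)
    return outside.sum()

# Toy rows; 0.9 and 6.43 echo the maxima reported in the statistics.
toy = pd.DataFrame({
    "parking_ratio": [0.0, 0.25, 0.9],
    "largest_use_ratio": [0.19, 1.0, 6.43],
})
violations = count_out_of_range(toy, ["parking_ratio", "largest_use_ratio"])
print(violations)
```

Applied to the ratio columns of `df`, a non-zero count would point to rows worth reviewing during the outlier handling planned for step 3.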


## 8. Saving the enriched dataset

We save the dataset with all the new features created.

⚠️ **Important**: the leakage columns are still kept at this point (they are removed in step 3).

In [9]:
print("\n" + "="*80)
print("💾 SAVING THE ENRICHED DATASET")
print("="*80)

# Save the dataset
output_path = 'data/data_feature_engineering.csv'
df.to_csv(output_path, index=False)

print(f"\n✅ Dataset saved : {output_path}")
print(f"   Shape : {df.shape}")
print(f"   Size : {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n" + "="*80)
print("🎉 STEP 2 - FEATURE ENGINEERING COMPLETE!")
print("="*80)
print(f"\n📦 {len(new_features)} new features created")
print(f"📊 {len(df)} buildings enriched")
print("✅ Dataset ready for step 3 (preparation for modeling)")
print("\n💡 Next step:")
print("   - Drop the leakage columns")
print("   - Handle outliers")
print("   - Encode categorical variables")
print("   - Normalize numeric features")


💾 SAVING THE ENRICHED DATASET

✅ Dataset saved : data/data_feature_engineering.csv
   Shape : (1630, 65)
   Size : 1.78 MB

🎉 STEP 2 - FEATURE ENGINEERING COMPLETE!

📦 25 new features created
📊 1630 buildings enriched
✅ Dataset ready for step 3 (preparation for modeling)

💡 Next step:
   - Drop the leakage columns
   - Handle outliers
   - Encode categorical variables
   - Normalize numeric features