**Project 3: House price prediction with classical models**

Objective: Predict house prices from property characteristics and geographic area

Techniques: Feature Engineering, Random Forest, Gradient Boosting

Dataset: House Prices - Advanced Regression

Tasks:

•	Encoding, imputation

•	Transformation pipeline

•	Model comparison

•	Feature importance




A **pipeline** is a chain of data-processing and modeling steps bundled into a single object.
It prepares the data and trains the model in one call, so no step has to be run by hand and the same transformations are applied at training time and at prediction time.

For example:

    Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),   # fill in missing values
        ('scaler', StandardScaler()),                    # standardize the features
        ('model', RandomForestRegressor()),              # train the model
    ])
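To make the idea concrete, here is a minimal runnable sketch of such a pipeline. It uses synthetic data (an assumption for illustration, not the House Prices dataset), and the names `X_demo`, `y_demo`, and `demo_pipe` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only (not the House Prices dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

demo_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),   # fill in missing values
    ("scaler", StandardScaler()),                    # standardize the features
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# One fit call runs imputation, scaling, and model training in order
demo_pipe.fit(X_demo[:150], y_demo[:150])

# predict reuses the medians and scaling statistics learned during fit
demo_pred = demo_pipe.predict(X_demo[150:])
print(demo_pred.shape)
```

Because the imputer and scaler are fitted only on the rows passed to `fit`, the held-out rows never influence the preprocessing statistics.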

In [1]:
import warnings
warnings.filterwarnings('ignore')


import os
import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.metrics import mean_squared_error

In [3]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


def log_transform(y):
    # House prices are usually right-skewed; taking the log reduces the
    # influence of very large values.
    return np.log1p(y)


def inv_log_transform(y):
    # A model trained on log-transformed prices predicts in log space, so
    # predictions must be mapped back to the original scale.
    return np.expm1(y)
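The two helpers are inverses of each other, which the short sketch below checks on made-up prices. It also shows one possible alternative, scikit-learn's `TransformedTargetRegressor`, which applies the forward transform before fitting and the inverse after predicting. The data and the choice of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Made-up prices, for illustration only
prices = np.array([50_000.0, 120_000.0, 200_000.0, 755_000.0])

# Round trip: log1p followed by expm1 recovers the original values
logged = np.log1p(prices)
restored = np.expm1(logged)
print(np.allclose(restored, prices))

# Alternative: let scikit-learn handle the target transform.
# func is applied to y before fit, inverse_func to the predictions.
X_toy = np.array([[50.0], [110.0], [180.0], [400.0]])
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
ttr.fit(X_toy, prices)
pred_prices = ttr.predict(X_toy)  # already back on the price scale
print(pred_prices.round(0))
```

With the wrapper, predictions come back in price units directly, so there is no separate inverse step to forget.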

In [5]:
train_path = '/content/drive/MyDrive/ai/dane/train.csv'
test_path = '/content/drive/MyDrive/ai/dane/test.csv'


if not os.path.exists(train_path) or not os.path.exists(test_path):
    raise FileNotFoundError(f"Expected train.csv and test.csv at {train_path} and {test_path}.")


train = pd.read_csv(train_path)
test = pd.read_csv(test_path)


test_ids = test['Id']

In [7]:
def feature_engineering(df):
  df = df.copy()


  # Age of the house
  df['HouseAge'] = df['YrSold'] - df['YearBuilt']
  df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']


  # Total area
  df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
  df['TotalPorchSF'] = df['OpenPorchSF'] + df['EnclosedPorch'] + df['3SsnPorch'] + df['ScreenPorch']


  # Simplify some quality/condition scores (convert to numeric)
  qual_dict = {
  'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1
  }
  for col in ['ExterQual', 'ExterCond', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual']:
    if col in df.columns:
      df[col] = df[col].map(qual_dict)


  # Bathrooms total
  df['TotalBath'] = df['FullBath'] + 0.5 * df['HalfBath'] + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath']


  # Simplify overall quality*condition
  if 'OverallQual' in df.columns and 'OverallCond' in df.columns:
    df['OverallScore'] = df['OverallQual'] * df['OverallCond']


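  # Neighborhood average price (target encoding).
  # Caution: the mean is computed over the whole training frame, so each row's
  # own SalePrice contributes to its feature and the value leaks across CV folds.
  # Test rows have no SalePrice, get NaN here, and are filled by the median imputer.
  if 'SalePrice' in df.columns:
    df['Neighborhood_MeanPrice'] = df.groupby('Neighborhood')['SalePrice'].transform('mean')
  else:
    df['Neighborhood_MeanPrice'] = np.nan  # imputed later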


  return df


# Apply feature engineering
train = feature_engineering(train)
test = feature_engineering(test)
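The neighborhood mean price above is a form of target encoding. One common way to reduce the leakage it can introduce is to compute the means out-of-fold, so that a row's own price never feeds its own feature. The sketch below is one possible version on a toy frame; the helper `oof_mean_encode` and the toy data are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_mean_encode(df, group_col, target_col, n_splits=5, seed=42):
    """Hypothetical helper: out-of-fold mean of target_col per group_col."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Means come from the other folds only
        fold_means = df.iloc[fit_idx].groupby(group_col)[target_col].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][group_col].map(fold_means).to_numpy()
    # Groups absent from the fitting folds fall back to the global mean
    return encoded.fillna(df[target_col].mean())


# Toy training frame, for illustration only
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "SalePrice": [100, 110, 120, 200, 210, 220, 300, 310, 320, 330],
})
toy["Neighborhood_MeanPrice"] = oof_mean_encode(toy, "Neighborhood", "SalePrice")

# Test rows use means fitted on the full training frame;
# unseen neighborhoods fall back to the global training mean
full_means = toy.groupby("Neighborhood")["SalePrice"].mean()
toy_test = pd.DataFrame({"Neighborhood": ["B", "C", "Z"]})
toy_test["Neighborhood_MeanPrice"] = (
    toy_test["Neighborhood"].map(full_means).fillna(toy["SalePrice"].mean())
)
print(toy_test)
```

Unlike the NaN-then-median fallback, this gives test rows a real per-neighborhood value. Recent scikit-learn releases (1.3 and later) also provide `sklearn.preprocessing.TargetEncoder`, which could be placed inside the pipeline so the encoding is refitted within each cross-validation fold.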

In [8]:
train = train.drop(columns=['Id'])
y = train['SalePrice']
X = train.drop(columns=['SalePrice'])


numeric_feats = X.select_dtypes(include=[np.number]).columns.tolist()
cat_feats = X.select_dtypes(include=['object']).columns.tolist()


print(f'Numeric features: {len(numeric_feats)}, Categorical features: {len(cat_feats)}')

Numeric features: 49, Categorical features: 37


In [20]:

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])


categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])


preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_feats),
    ('cat', categorical_pipeline, cat_feats),
])


y_trans = log_transform(y)

In [14]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y_trans, test_size=0.2, random_state=42)

In [17]:
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf_pipeline = Pipeline([
('pre', preprocessor),
('rf', rf)
])
gbr = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)

gbr_pipeline = Pipeline([
('pre', preprocessor),
('gbr', gbr)
])


kf = KFold(n_splits=5, shuffle=True, random_state=42)


print('\nBaseline cross-val RMSE (log target)')
for name, pipe in [('RandomForest', rf_pipeline), ('GradientBoosting', gbr_pipeline)]:
  scores = cross_val_score(pipe, X, y_trans, scoring='neg_mean_squared_error', cv=kf, n_jobs=-1)
  rmses = np.sqrt(-scores)
  print(f'{name}: mean RMSE = {rmses.mean():.4f}, std = {rmses.std():.4f}')


rf_param_grid = {
'rf__n_estimators': [200, 400],
'rf__max_depth': [10, 20, None],
'rf__min_samples_split': [2, 5]
}


gbr_param_grid = {
'gbr__n_estimators': [200, 400],
'gbr__learning_rate': [0.05, 0.1],
'gbr__max_depth': [3, 5]
}


print('\nRunning GridSearch on RandomForest')
rf_search = GridSearchCV(rf_pipeline, rf_param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
rf_search.fit(X_train, y_train)
print('RF best params:', rf_search.best_params_)


print('\nRunning GridSearch on GradientBoosting')
gbr_search = GridSearchCV(gbr_pipeline, gbr_param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
gbr_search.fit(X_train, y_train)
print('GBR best params:', gbr_search.best_params_)


rf_best = rf_search.best_estimator_
gbr_best = gbr_search.best_estimator_


rf_pred_log = rf_best.predict(X_valid)
gbr_pred_log = gbr_best.predict(X_valid)


print('\nValidation RMSE (log target):')
print('RF:', rmse(y_valid, rf_pred_log))
print('GBR:', rmse(y_valid, gbr_pred_log))


Baseline cross-val RMSE (log target)
RandomForest: mean RMSE = 0.1393, std = 0.0184
GradientBoosting: mean RMSE = 0.1299, std = 0.0152

Running GridSearch on RandomForest
Fitting 5 folds for each of 12 candidates, totalling 60 fits
RF best params: {'rf__max_depth': 20, 'rf__min_samples_split': 2, 'rf__n_estimators': 400}

Running GridSearch on GradientBoosting
Fitting 5 folds for each of 8 candidates, totalling 40 fits
GBR best params: {'gbr__learning_rate': 0.05, 'gbr__max_depth': 3, 'gbr__n_estimators': 400}

Validation RMSE (log target):
RF: 0.14147320108207873
GBR: 0.1321363389319283
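The RMSE values above are on the log scale, where an error of about 0.13 corresponds roughly to a typical multiplicative error of exp(0.13) - 1, or about 14%. To report an error in price units instead, the predictions can be mapped back with `expm1` before scoring. The sketch below uses made-up prices and prediction errors, not the notebook's validation data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


# Made-up prices and log-scale prediction errors, for illustration only
true_prices = np.array([120_000.0, 185_000.0, 250_000.0, 410_000.0])
pred_log = np.log1p(true_prices) + np.array([0.10, -0.05, 0.02, -0.12])

# Error on the log scale (what the notebook reports)
log_rmse = rmse(np.log1p(true_prices), pred_log)

# Error in price units: invert the log transform first
price_rmse = rmse(true_prices, np.expm1(pred_log))

print(f"RMSE (log target): {log_rmse:.4f}")
print(f"RMSE (price units): {price_rmse:.0f}")
```

In the notebook itself, the equivalent call would be along the lines of `rmse(inv_log_transform(y_valid), inv_log_transform(gbr_pred_log))`. The price-scale figure is dominated by expensive houses, which is one reason the log-scale metric is often preferred for model comparison.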


In [24]:
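# Reuse the preprocessor already fitted inside the best RF pipeline, so the
# feature names match the columns the forest was actually trained on.
fitted_pre = rf_best.named_steps['pre']
num_names = numeric_feats
cat_ohe = fitted_pre.named_transformers_['cat'].named_steps['onehot']
cat_names = cat_ohe.get_feature_names_out(cat_feats).tolist()
all_feature_names = num_names + cat_names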


rf_model = rf_best.named_steps['rf']
rf_importances = rf_model.feature_importances_
feat_imp = pd.Series(rf_importances, index=all_feature_names).sort_values(ascending=False)
print('\nTop 30 feature importances (Random Forest)')
print(feat_imp.head(30))


Top 30 feature importances (Random Forest)
OverallQual               0.366555
TotalSF                   0.357776
Neighborhood_MeanPrice    0.061588
OverallScore              0.025318
GrLivArea                 0.012905
GarageArea                0.011790
LotArea                   0.009444
TotalBath                 0.009373
BsmtFinSF1                0.007998
BsmtUnfSF                 0.006631
HouseAge                  0.006340
RemodAge                  0.005830
GarageCars                0.005682
YearBuilt                 0.005249
1stFlrSF                  0.005187
TotalBsmtSF               0.004774
YearRemodAdd              0.004299
CentralAir_Y              0.004107
KitchenQual               0.003825
2ndFlrSF                  0.003806
LotFrontage               0.003769
CentralAir_N              0.003415
TotalPorchSF              0.003391
GarageYrBlt               0.003101
OverallCond               0.003010
OpenPorchSF               0.002907
MoSold                    0.002875
GarageType_