# Preprocessing Validation — Issue 5

This notebook validates the preprocessing pipeline defined in `src/preprocessing.py`.

Goals:
- Ensure preprocessing runs end-to-end on raw data
- Verify drop, imputation and encoding logic
- Confirm leakage-safe behavior using cross-validation

This notebook is **not** intended for model optimization.

In [None]:
import sys
from pathlib import Path
import pandas as pd

# Trouve la racine projet en remontant jusqu'à trouver requirements.txt (ou autre marqueur)
PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "requirements.txt").exists():
    if PROJECT_ROOT.parent == PROJECT_ROOT:
        raise RuntimeError("Project root not found (requirements.txt missing).")
    PROJECT_ROOT = PROJECT_ROOT.parent

sys.path.insert(0, str(PROJECT_ROOT))  # insert(0) > append() : priorité

from src.preprocessing import TARGET_COL, fit_transform_preview, validate_with_cv


In [11]:
DATA_PATH = PROJECT_ROOT / "data" / "raw" / "train.csv"
df = pd.read_csv(DATA_PATH)

print("Loaded:", DATA_PATH)
print("Shape:", df.shape)
df.head()


Loaded: C:\Users\SlaDe\Documents\real-estate-price-prediction\data\raw\train.csv
Shape: (2197, 82)


Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,534,531363010,20,RL,80.0,9605,Pave,,Reg,Lvl,...,0,,,,0,4,2009,WD,Normal,159000
1,803,906203120,20,RL,90.0,14684,Pave,,IR1,Lvl,...,0,,,,0,6,2009,WD,Normal,271900
2,956,916176030,20,RL,,14375,Pave,,IR1,Lvl,...,0,,,,0,1,2009,COD,Abnorml,137500
3,460,528180130,120,RL,48.0,6472,Pave,,Reg,Lvl,...,0,,,,0,4,2009,WD,Normal,248500
4,487,528290030,80,RL,61.0,9734,Pave,,IR1,Lvl,...,0,,,,0,5,2009,WD,Normal,167000


In [5]:
fit_transform_preview(df, n_rows=3)


Raw X shape: (2197, 81)
y shape: (2197,)
Transformed X shape: (2197, 202)
n_features_out: 202
first 30 feature names: ['Lot Frontage', 'Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val', 'MS Zoning_A (agr)', 'MS Zoning_C (all)']
Preview (first rows):
[[8.0000e+01 9.6050e+03 0.0000e+00 0.0000e+00 0.0000e+00 1.2180e+03
  1.2180e+03 1.2180e+03 0.0000e+00 0.0000e+00 1.2180e+03 0.0000e+00
  0.0000e+00 1.0000e+00 1.0000e+00 3.0000e+00 1.0000e+00 6.0000e+00
  0.0000e+00 2.0000e+00 5.7600e+02 0.0000e+00 1.7800e+02 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+0

## Cross-Validation Check

We run a 5-fold cross-validation using a simple Ridge baseline.
The objective is to verify that:
- preprocessing is fit only on training folds
- no data leakage occurs
- the pipeline is ready for downstream model comparison

In [6]:
results = validate_with_cv(
    df,
    cv=5,
    scoring="neg_root_mean_squared_error"
)

results


{'cv': 5,
 'scoring': 'neg_root_mean_squared_error',
 'mean_score': -32094.347320662328,
 'std_score': 4623.63736439117,
 'all_scores': [-36962.89715009125,
  -37339.54728640161,
  -26531.648479773565,
  -32499.825524618333,
  -27137.818162426873]}

In [7]:
import numpy as np

scores = np.array(results["all_scores"], dtype=float)

# scores = neg RMSE -> RMSE = -score
rmse_mean = (-scores).mean()
rmse_std  = (-scores).std()

print(f"CV folds: {results['cv']}")
print(f"Scoring: {results['scoring']}")
print(f"RMSE mean: {rmse_mean:.4f}")
print(f"RMSE std:  {rmse_std:.4f}")
print("Fold scores (neg RMSE):", scores)


CV folds: 5
Scoring: neg_root_mean_squared_error
RMSE mean: 32094.3473
RMSE std:  4623.6374
Fold scores (neg RMSE): [-36962.89715009 -37339.5472864  -26531.64847977 -32499.82552462
 -27137.81816243]


## Conclusion

- Preprocessing pipeline executes without error
- Feature schema is stable
- Cross-validation confirms leakage-safe behavior