<a href="https://colab.research.google.com/github/NassimZahri/Data_Mining/blob/main/03_transformation_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 03 — Transformation & Feature Engineering
Categorical encoding, scaling, binning, date-based features, outlier handling, log transforms, and more.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load sales (with parsed dates) and the product catalogue
DATA_DIR = Path('data')
ventes = pd.read_csv(DATA_DIR / 'ventes.csv', parse_dates=['date'])
produits = pd.read_csv(DATA_DIR / 'produits.csv')

# Left join keeps every sale, even when its product_id has no catalogue match
df = ventes.merge(produits[['product_id','category']], on='product_id', how='left').copy()

# Derived features: line total, month (1-12), day of week (Monday=0 ... Sunday=6)
df['total'] = df['price'] * df['quantity']
df['month'] = df['date'].dt.month
df['dow'] = df['date'].dt.dayofweek
df.head()


## 1. Binning & Winsorization

In [None]:
# Bin prices into fixed ranges
df['price_bin'] = pd.cut(df['price'], bins=[0,20,50,100,200, np.inf], include_lowest=True)

# Simple winsorization (cap at the 1st and 99th percentiles)
p1, p99 = df['total'].quantile([0.01, 0.99])
df['total_cap'] = df['total'].clip(lower=p1, upper=p99)
df[['total', 'total_cap']].head()
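
The introduction also lists log transforms among the tools for skewed values. The following is a minimal sketch on made-up numbers rather than `ventes.csv`, comparing percentile capping with `np.log1p`. The `toy` series and its values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed totals (not taken from ventes.csv)
toy = pd.Series([5.0, 8.0, 12.0, 15.0, 20.0, 30.0, 45.0, 5000.0], name='total')

# Capping: values outside [P1, P99] are pulled back to the bounds
p1, p99 = toy.quantile([0.01, 0.99])
toy_cap = toy.clip(lower=p1, upper=p99)

# Log transform: log1p(x) = log(1 + x), defined at 0, compresses large values
toy_log = np.log1p(toy)

summary = pd.DataFrame({'total': toy, 'total_cap': toy_cap, 'total_log': toy_log})
print(summary.round(2))
```

On a sample this small the P99 bound is interpolated close to the maximum, so capping changes little, whereas the log compresses the outlier strongly while preserving the ordering of the values.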


## 2. Encoding + Scaling via Pipeline

In [None]:
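# month and dow are scaled as plain numbers here
num_cols = ['price','quantity','month','dow']
cat_cols = ['category']

preprocess = ColumnTransformer([
    # Standardize numeric columns to zero mean and unit variance
    ('num', StandardScaler(), num_cols),
    # One-hot encode; categories unseen during fit encode as all zeros at transform time
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])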

pipe = Pipeline([('prep', preprocess)])
X = df[num_cols + cat_cols]
Xt = pipe.fit_transform(X)
Xt.shape


## 3. EXERCISE
- Create an `is_weekend` variable (1 for Saturday/Sunday, 0 otherwise) and measure its effect on `total` (mean per group).
- Create `price_per_unit = total / quantity` (watch out for division by zero).
- Add One-Hot encoding for the city and rebuild the `ColumnTransformer`.
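
As a starting point for the first two items, here is one possible approach on a tiny hypothetical frame. The dates and amounts are invented, and the zero-quantity row is there only to exercise the division guard.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for df
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08']),
    'total': [40.0, 90.0, 30.0, 0.0],
    'quantity': [2, 3, 1, 0],
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
toy['is_weekend'] = (toy['date'].dt.dayofweek >= 5).astype(int)

# Replace zero quantities with NaN so the division yields NaN instead of inf
toy['price_per_unit'] = toy['total'] / toy['quantity'].replace(0, np.nan)

# Mean total per group
print(toy.groupby('is_weekend')['total'].mean())
```

Leaving the undefined ratio as `NaN` keeps the decision about imputing or dropping those rows explicit; other conventions are equally defensible.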