<a href="https://colab.research.google.com/github/NassimZahri/Data_Mining/blob/main/07_mini_projet_end_to_end.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 07: End-to-End Mini-Project
Goal: build a complete workflow, from loading the data to training a small predictive model.

**Task:** predict `quantity` from simple features (price, promo, month, day of week, category).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

DATA_DIR = Path('data')
ventes = pd.read_csv(DATA_DIR / 'ventes.csv', parse_dates=['date'])
produits = pd.read_csv(DATA_DIR / 'produits.csv')
df = ventes.merge(produits[['product_id','category']], on='product_id', how='left')

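# Calendar features derived from the sale date
df['month'] = df['date'].dt.month
df['dow'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6

# Minimal cleaning: fill missing prices with the median;
# treat missing quantities as 0 and clip negative values to 0
df['price'] = df['price'].fillna(df['price'].median())
df['quantity'] = df['quantity'].fillna(0).clip(lower=0)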

X = df[['price','promo','month','dow','category']]
y = df['quantity']

num_cols = ['price','promo','month','dow']
cat_cols = ['category']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

model = Pipeline([
    ('prep', preprocess),
    ('linreg', LinearRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print('MAE =', round(mae,3), '| R2 =', round(r2,3))
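
The cleaning step above fills missing prices with a median computed on the full table, before the train/test split. One possible variation, sketched here on synthetic data, moves the imputation into the pipeline with `SimpleImputer`, so the median is learned from the training rows only. The `toy` frame, its values, and the names `preprocess_safe` and `model_safe` are illustrative assumptions, not part of the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the merged sales table (hypothetical values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "price": rng.uniform(1, 20, n),
    "promo": rng.integers(0, 2, n),
    "month": rng.integers(1, 13, n),
    "dow": rng.integers(0, 7, n),
    "category": rng.choice(["A", "B", "C"], n),
})
toy["quantity"] = np.round(
    10 - 0.3 * toy["price"] + 2 * toy["promo"] + rng.normal(0, 1, n)
).clip(lower=0)
toy.loc[toy.index[::10], "price"] = np.nan  # inject some missing prices

toy_num_cols = ["price", "promo", "month", "dow"]
toy_cat_cols = ["category"]

# Imputation now happens inside the pipeline, before scaling
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess_safe = ColumnTransformer([
    ("num", numeric, toy_num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), toy_cat_cols),
])
model_safe = Pipeline([
    ("prep", preprocess_safe),
    ("linreg", LinearRegression()),
])

X_toy = toy[toy_num_cols + toy_cat_cols]
y_toy = toy["quantity"]
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
model_safe.fit(X_toy_train, y_toy_train)
pred_toy = model_safe.predict(X_toy_test)

# Median of `price` as learned by the imputer (training rows only)
learned_median = (
    model_safe.named_steps["prep"]
    .named_transformers_["num"]
    .named_steps["impute"]
    .statistics_[0]
)
print("median learned on the training split:", round(learned_median, 3))
```

With this layout the test rows never influence the fill value, and the same pipeline object can be reused on new data that contains missing prices.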


## 1. Importance (simple coefficient-based approach)

In [None]:
# Retrieve the fitted linear regression and preprocessing steps
lin = model.named_steps['linreg']
prep = model.named_steps['prep']

# Feature names after one-hot encoding
num_names = num_cols
cat_names = list(prep.named_transformers_['cat'].get_feature_names_out(cat_cols))
feat_names = num_names + cat_names

coefs = pd.Series(lin.coef_, index=feat_names).sort_values(key=abs, ascending=False)
coefs.head(10)


In [None]:
plt.figure()
coefs.head(15).plot(kind='barh')
plt.title('Top 15 coefficients (approximate importance)')
plt.gca().invert_yaxis()
plt.show()


## 2. EXERCISE
- Replace `LinearRegression` with `RandomForestRegressor` and compare MAE/R2.
- Add a feature `price_per_unit = total / max(quantity,1)` and observe the impact.
- Plot the errors (`y_test - pred`).
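
As a starting point for the first and third bullets, the sketch below swaps the estimator inside the same kind of pipeline and keeps the residuals for plotting. It runs on synthetic data, so the printed numbers say nothing about the real sales data; `ex`, `build_model`, and the hyperparameters are hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the feature table (hypothetical values)
rng = np.random.default_rng(2)
n = 500
ex = pd.DataFrame({
    "price": rng.uniform(1, 20, n),
    "promo": rng.integers(0, 2, n),
    "month": rng.integers(1, 13, n),
    "dow": rng.integers(0, 7, n),
    "category": rng.choice(["A", "B", "C"], n),
})
ex_y = (
    12 - 0.4 * ex["price"] + 2.5 * ex["promo"]
    + np.where(ex["category"] == "A", 1.0, 0.0)
    + rng.normal(0, 1, n)
).clip(lower=0)


def build_model(estimator):
    """Wrap any regressor in the same preprocessing pipeline."""
    prep = ColumnTransformer([
        ("num", StandardScaler(), ["price", "promo", "month", "dow"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ])
    return Pipeline([("prep", prep), ("reg", estimator)])


ex_X_train, ex_X_test, ex_y_train, ex_y_test = train_test_split(
    ex, ex_y, test_size=0.2, random_state=0
)

candidates = {
    "linreg": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
residuals = {}
for name, estimator in candidates.items():
    fitted = build_model(estimator).fit(ex_X_train, ex_y_train)
    ex_pred = fitted.predict(ex_X_test)
    results[name] = {
        "MAE": mean_absolute_error(ex_y_test, ex_pred),
        "R2": r2_score(ex_y_test, ex_pred),
    }
    residuals[name] = ex_y_test - ex_pred  # errors, as in `y_test - pred`

print(pd.DataFrame(results).T.round(3))
```

Each entry of `residuals` can then be passed to `plt.hist(..., bins=30)` or plotted against the predictions to look for structure in the errors. The `price_per_unit` bullet is left out of the sketch: that feature is computed from `quantity` itself, so any gain it produces should be read with target leakage in mind.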