# 03 - Data Drift Analysis (Train vs Test)

This notebook compares the feature distributions between:
- **Reference**: `application_train.csv` (train)
- **Current**: `application_test.csv` (test)

It generates an interactive HTML report with **Evidently**.


### Contents
- [1. Preparation](#1-preparation)
- [2. Evidently report](#2-evidently)
- [3. Monitoring simulation](#3-monitoring)


## <a id="1-preparation"></a>1. Preparation


### 1.1 Imports & configuration


In [1]:
import os
import warnings
from datetime import datetime, timezone
from pathlib import Path

import joblib
import pandas as pd
from evidently import Report
from evidently.presets import DataDriftPreset

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Output directory for the reports
report_dir = "../artifacts/reports"
os.makedirs(report_dir, exist_ok=True)



### 1.2 Loading the data


In [2]:
# Load the raw train and test sets
df_train = pd.read_csv("../data/raw/application_train.csv")
df_test = pd.read_csv("../data/raw/application_test.csv")

print(f"Train shape: {df_train.shape}")
print(f"Test shape:  {df_test.shape}")
print(f"\nColumns only in train: {set(df_train.columns) - set(df_test.columns)}")
print(f"Columns only in test:  {set(df_test.columns) - set(df_train.columns)}")


Train shape: (307511, 122)
Test shape:  (48744, 121)

Columns only in train: {'TARGET'}
Columns only in test:  set()


## <a id="2-evidently"></a>2. Evidently report


[Evidently](https://www.evidentlyai.com/) is an open-source library for monitoring ML models. It generates interactive HTML reports for analysing *data drift*.

In this notebook, the report is saved to `../artifacts/reports/`.


### 2.1 Preparing the data (reference vs current)

Evidently compares two datasets:
- **Reference**: the training data (train)
- **Current**: the new data (test)

Only the **features common to both** are kept (excluding `TARGET` and the ID).


In [3]:
TARGET_COL = "TARGET"
ID_COL = "SK_ID_CURR"

# Shared columns, excluding the target and the ID
feature_cols = [
    col
    for col in df_train.columns
    if col not in [TARGET_COL, ID_COL] and col in df_test.columns
]

print(f"\nNumber of features used: {len(feature_cols)}")

reference_data = df_train[feature_cols].copy()   # Train = reference
current_data   = df_test[feature_cols].copy()    # Test  = current

print(f"Reference (Train): {reference_data.shape}")
print(f"Current (Test):   {current_data.shape}")



Number of features used: 120
Reference (Train): (307511, 120)
Current (Test):   (48744, 120)


### 2.2 Generating the drift report (HTML)

The report analyses drift for every feature, using statistical tests and interactive visualisations.


In [4]:
data_drift_report = Report([
    DataDriftPreset()
])

# Compare the current data (test) against the reference data (train)
drift_eval = data_drift_report.run(
    current_data=current_data,
    reference_data=reference_data,
)

# Save as HTML
drift_html_path = os.path.join(report_dir, "evidently_data_drift_report.html")
drift_eval.save_html(drift_html_path)

print(f"\n✅ Data drift HTML report saved: {drift_html_path}")
print(f" file://{os.path.abspath(drift_html_path)}")



✅ Data drift HTML report saved: ../artifacts/reports/evidently_data_drift_report.html
 file:///Users/ely/Developer/home_credit_project01/home_credit_project/artifacts/reports/evidently_data_drift_report.html
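
To give a feel for what a per-feature statistical test measures, here is a minimal, library-free sketch of the two-sample Kolmogorov-Smirnov statistic. It is an illustration only: Evidently selects its own drift methods, and they are not necessarily this one. The function name `ks_statistic` and the sample values are made up for this example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The largest gap is always reached at one of the observed values
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        cdf_cur = bisect_right(cur, x) / len(cur)  # share of current values <= x
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


# Hypothetical values: the current sample is shifted to the right
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all. Turning the statistic into a drift decision still requires a threshold or a p-value, which this sketch leaves out.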


### 2.3 Quick reading

Note: even when the overall drift looks low (e.g. ~7.5% in one run), it can still affect features that are key to the model.


## <a id="3-monitoring"></a>3. Monitoring simulation


### 3.1 Logging predictions

We simulate the logging of predictions (as if they came from an API) and store them in a Parquet file so that their evolution can be monitored over time: `../artifacts/predictions/predictions_log.parquet`.


In [5]:
# --- Paths
PRED_DIR = Path("../artifacts/predictions")
PRED_DIR.mkdir(parents=True, exist_ok=True)
PRED_LOG_PATH = PRED_DIR / "predictions_log.parquet"

MODEL_PATH = Path("../artifacts/models/champion_pipeline.joblib")
MODEL_VERSION = "champion_v1"

# --- Load the model
model = joblib.load(MODEL_PATH)

# --- Simulate "API events" from the test set
proba = model.predict_proba(current_data)[:, 1]

events = pd.DataFrame({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "sk_id_curr": df_test["SK_ID_CURR"].values if "SK_ID_CURR" in df_test.columns else range(len(df_test)),
    "model_version": MODEL_VERSION,
    "proba_default": proba,
})

# Decision threshold applied to the predicted default probability
THRESHOLD = 0.54
events["decision"] = (events["proba_default"] >= THRESHOLD).map({True: "REFUSED", False: "ACCEPTED"})
events["threshold_used"] = THRESHOLD

# --- Append to the existing Parquet log
# Note: re-running this cell appends the same test events again
if PRED_LOG_PATH.exists():
    old = pd.read_parquet(PRED_LOG_PATH)
    events = pd.concat([old, events], ignore_index=True)

events.to_parquet(PRED_LOG_PATH, index=False)
print(f"✅ Events saved: {PRED_LOG_PATH} ({len(events)} rows)")



✅ Events saved: ../artifacts/predictions/predictions_log.parquet (97488 rows)
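
As one possible way to follow the log over time, the sketch below computes the share of `REFUSED` decisions per day. It uses plain Python dictionaries that mimic the `timestamp` and `decision` columns of the log so that it runs without the Parquet file. The function name `daily_refusal_rate` and the sample events are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def daily_refusal_rate(events):
    """Return {date: share of REFUSED decisions} from logged prediction events."""
    counts = defaultdict(lambda: [0, 0])  # date -> [refused, total]
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date()
        counts[day][0] += event["decision"] == "REFUSED"
        counts[day][1] += 1
    return {day: refused / total for day, (refused, total) in sorted(counts.items())}


# Made-up events with the same column names as the prediction log
sample_events = [
    {"timestamp": "2024-01-01T09:00:00+00:00", "decision": "ACCEPTED"},
    {"timestamp": "2024-01-01T10:00:00+00:00", "decision": "REFUSED"},
    {"timestamp": "2024-01-02T09:00:00+00:00", "decision": "ACCEPTED"},
]

for day, rate in daily_refusal_rate(sample_events).items():
    print(day, rate)
```

With the real log, the records could be obtained with `pd.read_parquet(PRED_LOG_PATH).to_dict("records")`. Because every simulated event in this notebook gets the timestamp of the run, the real log only shows variation across days once predictions are logged on several days.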


### 3.2 Example alert rule (to adapt)

The block below is a **skeleton** (to adapt) that raises an alert based on the overall drift and/or on critical features.


In [None]:
# NOTE: drift_df (a per-column drift table) is not built in this notebook; create it before uncommenting.
# ALERT_PATH = Path("../artifacts/reports/alerts.log")

# # simple rule
# share_drifted = drift_df["drift_detected"].mean() if "drift_detected" in drift_df.columns else None

# # if your column is not named drift_detected, print drift_df.columns and adapt
# print("drift_df columns:", list(drift_df.columns))

# CRITICAL_FEATURES = [
#     "AMT_CREDIT", "AMT_GOODS_PRICE", "AMT_ANNUITY", "EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"
# ]

# critical_drift = drift_df[
#     (drift_df["column_name"].isin(CRITICAL_FEATURES)) &
#     (drift_df.get("drift_detected", False) == True)
# ]

# ALERT = False

# # Example conditions
# if share_drifted is not None and share_drifted > 0.10:
#     ALERT = True

# if not critical_drift.empty:
#     ALERT = True

# if ALERT:
#     msg = f"[{datetime.now(timezone.utc).isoformat()}] ALERT: drift detected. share_drifted={share_drifted}, critical={critical_drift['column_name'].tolist()}"
#     print("🚨", msg)
#     ALERT_PATH.parent.mkdir(parents=True, exist_ok=True)
#     with open(ALERT_PATH, "a") as f:
#         f.write(msg + "\n")
# else:
#     print(f"✅ OK: no alert. share_drifted={share_drifted}")
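
For reference, the same two rules can be written as a small runnable function. This is a sketch under assumptions: it takes a plain `{feature: drift_detected}` mapping rather than an Evidently result, and the name `evaluate_alert` and the example flags are invented for illustration. The 0.10 threshold and the feature names come from the skeleton above.

```python
def evaluate_alert(drift_flags, critical_features, max_share=0.10):
    """Apply the two skeleton rules to a {feature: drift_detected} mapping.

    Returns (alert, share_drifted, critical_drifted).
    """
    if not drift_flags:
        raise ValueError("drift_flags must not be empty")
    # Rule 1: share of features flagged as drifted
    share_drifted = sum(drift_flags.values()) / len(drift_flags)
    # Rule 2: any critical feature flagged as drifted
    critical_drifted = sorted(f for f in critical_features if drift_flags.get(f, False))
    alert = share_drifted > max_share or bool(critical_drifted)
    return alert, share_drifted, critical_drifted


# Made-up flags for illustration; real ones would come from the drift report
flags = {
    "AMT_CREDIT": False,
    "AMT_GOODS_PRICE": False,
    "AMT_ANNUITY": False,
    "EXT_SOURCE_1": True,
}
alert, share, critical = evaluate_alert(flags, ["AMT_CREDIT", "EXT_SOURCE_1"])
print(alert, share, critical)  # True 0.25 ['EXT_SOURCE_1']
```

Keeping the rule in a pure function makes it easy to unit-test the thresholds separately from the code that extracts drift flags from a report.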

## In production

Define a reference dataset (e.g. one stable month of production), log the prediction requests, then run a periodic job that compares new production data against the reference dataset and applies alert rules.
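
The first step of such a job, separating the reference window from the recent data, could look like the sketch below. It assumes each logged request carries an ISO 8601 `timestamp` with a UTC offset. The function name `split_windows`, the window boundaries and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def split_windows(records, reference_start, reference_end, current_start):
    """Split logged requests into a fixed reference window and a recent current window."""
    reference, current = [], []
    for record in records:
        ts = datetime.fromisoformat(record["timestamp"])
        if reference_start <= ts < reference_end:
            reference.append(record)
        elif ts >= current_start:
            current.append(record)
    return reference, current


utc = timezone.utc
# Made-up logged requests; a real job would read them from the request log
logged = [
    {"timestamp": "2024-01-10T12:00:00+00:00", "AMT_CREDIT": 250000.0},
    {"timestamp": "2024-01-25T12:00:00+00:00", "AMT_CREDIT": 310000.0},
    {"timestamp": "2024-03-02T12:00:00+00:00", "AMT_CREDIT": 900000.0},
]
reference, current = split_windows(
    logged,
    reference_start=datetime(2024, 1, 1, tzinfo=utc),  # start of the stable month
    reference_end=datetime(2024, 2, 1, tzinfo=utc),
    current_start=datetime(2024, 3, 1, tzinfo=utc),    # start of the period to check
)
print(len(reference), len(current))  # 2 1
```

The two resulting sets play the same roles as `reference_data` and `current_data` earlier in this notebook: they would be passed to the drift report, and the alert rules would then be applied to its result.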
