
# Smoking & Drinking Risk — Teacher–Student (XGBoost) Pipeline
**Generated:** 2025-10-16 13:07  

This notebook implements a **learning with privileged information** workflow (Teacher–Student / knowledge distillation):  
- **Teacher model:** trains a regressor on **all** features (including clinical/laboratory ones) to predict a **risk index** built from the clinical markers.  
- **Student model:** learns to **imitate** the Teacher's predictions using **only simple features** (demographics, anthropometrics, blood pressure, sight/hearing, smoking, alcohol).  
- **Production:** at the branch, **only the Student** is used, with no blood tests required.

Finally, the `risk index` is mapped to an **insurance premium**.
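
The flow described above can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook's actual pipeline: the random data, the `GradientBoostingRegressor` stand-ins, and every variable name below are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500

# Simple features: available both in training and at the branch
X_simple = rng.normal(size=(n, 3))
# Privileged clinical features: correlated with the first simple feature
X_clinical = X_simple[:, :1] + rng.normal(scale=0.5, size=(n, 2))
# Synthetic risk index built from the clinical features only
risk = X_clinical.mean(axis=1)

# Teacher: sees simple + clinical features and learns the risk index
X_all = np.hstack([X_simple, X_clinical])
teacher = GradientBoostingRegressor(random_state=0).fit(X_all, risk)

# Student: sees only the simple features and learns to imitate the teacher
soft_labels = teacher.predict(X_all)
student = GradientBoostingRegressor(random_state=0).fit(X_simple, soft_labels)

# At inference time only the simple features are required
student_risk = student.predict(X_simple[:5])
print(student_risk.shape)
```

The Student never receives the clinical columns, so whatever it learns about risk has to come through the signal those columns share with the simple features.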


## 1) Configuration

In [None]:

# === Config ===
DATA_PATH = "train.csv"   # <-- Change this if your CSV file has a different name
SAVE_MODEL_DIR = "models"
BASE_PREMIUM = 500.0      # base premium (€/year)
MAX_INCREASE = 0.8        # maximum surcharge = +80%

import os
os.makedirs(SAVE_MODEL_DIR, exist_ok=True)
print("Config ok.")


## 2) Library imports

In [None]:

import numpy as np
import pandas as pd

# Matplotlib only (no seaborn), 1 plot per figure as required
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# XGBoost (fallback to GradientBoosting if unavailable)
try:
    import xgboost as xgb
    HAS_XGB = True
except Exception as e:
    HAS_XGB = False
    from sklearn.ensemble import GradientBoostingRegressor

print("XGBoost available:", HAS_XGB)


## 3) Loading the dataset

In [None]:

df = pd.read_csv(DATA_PATH)
print(df.shape)
df.head()


## 4) Cleaning and mappings

In [None]:

# Work on a copy to be safe
df = df.copy()

# Map known categories
if "DRK_YN" in df.columns:
    df["DRK_YN"] = df["DRK_YN"].map({"Y":1, "N":0}).astype("Int64")

# sex -> 0/1 where possible
if "sex" in df.columns:
    # normalize the labels before mapping; unknown labels become NaN
    df["sex"] = df["sex"].astype(str).str.strip().str.lower()
    map_sex = {"male":1, "m":1, "1":1, "female":0, "f":0, "0":0}
    df["sex"] = df["sex"].map(lambda x: map_sex[x] if x in map_sex else np.nan)
    df["sex"] = df["sex"].astype("Float64")

# SMK_stat_type_cd is already coded as 1/2/3
if "SMK_stat_type_cd" in df.columns:
    df["SMK_stat_type_cd"] = pd.to_numeric(df["SMK_stat_type_cd"], errors="coerce").astype("Float64")

# Create BMI if height and weight are available (the formula assumes cm and kg)
if set(["height","weight"]).issubset(df.columns):
    h = pd.to_numeric(df["height"], errors="coerce")
    w = pd.to_numeric(df["weight"], errors="coerce")
    df["BMI"] = w / ( (h/100.0)**2 )
    df["BMI"] = df["BMI"].astype("Float64")

# Convert every remaining object column to numeric where possible
# (errors="ignore" is deprecated in recent pandas, so catch the failure instead)
for c in df.columns:
    if df[c].dtype == "object":
        try:
            df[c] = pd.to_numeric(df[c])
        except (ValueError, TypeError):
            # leave non-numeric columns unchanged
            pass

print("Null values per column (top 30):")
print(df.isnull().sum().sort_values(ascending=False).head(30))
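
The category mappings above turn any unrecognized label into a missing value without reporting it, and those rows are later dropped. A small sketch of a mapping helper that also returns the labels it could not map; `map_with_report` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

def map_with_report(series, mapping):
    """Map values via `mapping`; return (mapped series, sorted list of unmapped labels)."""
    cleaned = series.astype(str).str.strip().str.lower()
    mapped = cleaned.map(mapping)
    # Labels that produced a missing value after mapping
    unmapped = sorted(cleaned[mapped.isna()].unique())
    return mapped.astype("Float64"), unmapped

raw = pd.Series(["Male", " f", "M", "unknown", "female"])
mapped, unmapped = map_with_report(raw, {"male": 1, "m": 1, "female": 0, "f": 0})
print(unmapped)
```

Inspecting `unmapped` before the `dropna` step shows whether rows are being lost to spelling variants rather than to genuinely missing data.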


## 5) Defining the feature groups

In [None]:

# "Simple" features (available at the branch)
simple_features = [
    col for col in [
        "sex","age","height","weight","waistline",
        "sight_left","sight_right","hear_left","hear_right",
        "SBP","DBP","SMK_stat_type_cd","DRK_YN","BMI"
    ] if col in df.columns
]

# Clinical features (laboratory/urine/enzymes), used ONLY in training
candidate_clinical = [
    "HDL_chole","LDL_chole","triglyceride","hemoglobin",
    "urine_protein","serum_creatinine","SGOT_AST","SGOT_ALT","gamma_GTP"
]
clinical_features = [c for c in candidate_clinical if c in df.columns]

print("Simple features:", simple_features)
print("Clinical features:", clinical_features)

# Drop rows with NaN in the features used by the teacher (for simplicity)
used_for_teacher = list(set(simple_features + clinical_features))
df_clean = df.dropna(subset=used_for_teacher).reset_index(drop=True)
print("Shape after dropna for teacher features:", df_clean.shape)


## 6) Building the 'health_risk' target via PCA on the clinical features

In [None]:

# Standardize the clinical features and apply PCA (1 component)
scaler = StandardScaler()
X_clin = scaler.fit_transform(df_clean[clinical_features])

pca = PCA(n_components=1, random_state=42)
risk_raw = pca.fit_transform(X_clin).ravel()

# Rescale to [0,1]
def minmax(x):
    x = np.asarray(x, dtype=float)
    return (x - np.nanmin(x)) / (np.nanmax(x) - np.nanmin(x) + 1e-12)

df_clean["health_risk"] = minmax(risk_raw)

print("Variance explained by the first PCA component:", float(pca.explained_variance_ratio_[0]))
df_clean[["health_risk"]].describe()
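
The sign of a principal component is arbitrary, so nothing in the PCA step guarantees that a higher score means higher clinical risk. One possible safeguard, sketched here under assumptions, is to flip the score when it correlates negatively with a marker chosen as the reference for "higher is worse". The `orient_scores` helper and the toy numbers are hypothetical.

```python
import numpy as np

def orient_scores(scores, reference):
    """Flip the sign of `scores` if they correlate negatively with `reference`."""
    scores = np.asarray(scores, dtype=float)
    reference = np.asarray(reference, dtype=float)
    corr = np.corrcoef(scores, reference)[0, 1]
    return -scores if corr < 0 else scores

# Toy check: a score that falls as the reference marker rises gets flipped
reference = np.array([1.0, 2.0, 3.0, 4.0])
scores = np.array([0.9, 0.4, -0.2, -1.1])
oriented = orient_scores(scores, reference)
```

In the notebook this would apply to `risk_raw` before `minmax`. Which marker should define the direction of risk is a domain decision that the code cannot make.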


## 7) Train/val/test split

In [None]:

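# Train/val/test split for the teacher
# to_numpy(dtype=float) avoids an object array caused by the nullable Int64/Float64 columns
X_teacher = df_clean[simple_features + clinical_features].to_numpy(dtype=float)
y_teacher = df_clean["health_risk"].to_numpy(dtype=float)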

X_tr, X_te, y_tr, y_te = train_test_split(X_teacher, y_teacher, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

len_tr, len_va, len_te = len(y_tr), len(y_va), len(y_te)
print("Train/Val/Test sizes:", len_tr, len_va, len_te)
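
The scaler and PCA that define `health_risk` are fitted on all rows before this split, so the target of the test rows depends on statistics that include them. A stricter variant, sketched below on synthetic data, splits row indices first and fits the transformers on the training rows only. The random matrix stands in for the clinical features and is an assumption of the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_clin = rng.normal(size=(200, 4))   # stand-in for the clinical feature matrix

# Split row indices first, then fit the transformers on training rows only
idx_train, idx_test = train_test_split(
    np.arange(len(X_clin)), test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_clin[idx_train])
pca = PCA(n_components=1, random_state=42).fit(scaler.transform(X_clin[idx_train]))

# Apply the already-fitted transformers to every row to obtain the raw score
risk_raw = pca.transform(scaler.transform(X_clin)).ravel()
```

Whether the difference matters in practice depends on the dataset size. With many rows, the statistics fitted on the training subset and on the full table are likely to be close.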


## 8) Training the Teacher model (all features)

In [None]:

if HAS_XGB:
    teacher = xgb.XGBRegressor(
        n_estimators=400,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_alpha=0.0,
        reg_lambda=1.0,
        random_state=42,
        n_jobs=-1,
        tree_method="hist"
    )
else:
    teacher = GradientBoostingRegressor(
        n_estimators=400,
        learning_rate=0.05,
        max_depth=3,
        random_state=42
    )

teacher.fit(X_tr, y_tr)

pred_va = teacher.predict(X_va)
pred_te = teacher.predict(X_te)

def reg_report(y_true, y_pred):
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        # squared=False is no longer accepted by recent scikit-learn; take the square root explicitly
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred)))
    }

print("Teacher Val:", reg_report(y_va, pred_va))
print("Teacher Test:", reg_report(y_te, pred_te))


## 9) Training the Student model (simple features only): distillation

In [None]:

# Distilled labels from the teacher on ALL of df_clean
teacher_full_pred = teacher.predict(df_clean[simple_features + clinical_features].to_numpy(dtype=float))
df_clean["teacher_pred"] = teacher_full_pred

# Split for the Student using ONLY the simple features
X_student_all = df_clean[simple_features].to_numpy(dtype=float)
y_student_all = df_clean["teacher_pred"].to_numpy(dtype=float)

Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(X_student_all, y_student_all, test_size=0.2, random_state=42)
Xs_tr, Xs_va, ys_tr, ys_va = train_test_split(Xs_tr, ys_tr, test_size=0.2, random_state=42)

if HAS_XGB:
    student = xgb.XGBRegressor(
        n_estimators=600,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.9,
        colsample_bytree=0.9,
        random_state=42,
        n_jobs=-1,
        tree_method="hist"
    )
else:
    student = GradientBoostingRegressor(
        n_estimators=600,
        learning_rate=0.05,
        max_depth=3,
        random_state=42
    )

student.fit(Xs_tr, ys_tr)

stu_va = student.predict(Xs_va)
stu_te = student.predict(Xs_te)

print("Student Val:", reg_report(ys_va, stu_va))
print("Student Test:", reg_report(ys_te, stu_te))
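
The Student metrics above are computed against the Teacher's predictions, so they measure fidelity to the Teacher rather than accuracy on the `health_risk` target. A sketch of a report that shows both; the `distillation_report` helper and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def distillation_report(y_true, teacher_pred, student_pred):
    """Report student fidelity (vs. the teacher) and accuracy (vs. the true target)."""
    return {
        "fidelity_R2": r2_score(teacher_pred, student_pred),
        "fidelity_MAE": mean_absolute_error(teacher_pred, student_pred),
        "accuracy_R2": r2_score(y_true, student_pred),
        "accuracy_MAE": mean_absolute_error(y_true, student_pred),
    }

y_true = np.array([0.10, 0.40, 0.60, 0.90])
teacher_pred = np.array([0.12, 0.38, 0.61, 0.88])
student_pred = np.array([0.20, 0.35, 0.55, 0.80])
report = distillation_report(y_true, teacher_pred, student_pred)
print(report)
```

Because the Teacher and Student splits use the same row count, `test_size`, and `random_state`, the rows of `Xs_te` should line up with `y_te`, which would let `y_te` serve as `y_true` here. That alignment is worth verifying before relying on it.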


## 10) Feature importance (Student)

In [None]:

# Get the importances (XGBoost or sklearn)
if HAS_XGB:
    importances = student.feature_importances_
else:
    importances = getattr(student, "feature_importances_", np.zeros(len(simple_features)))

# Sort and plot with matplotlib
idx_sorted = np.argsort(importances)[::-1]
features_sorted = [simple_features[i] for i in idx_sorted]
importances_sorted = importances[idx_sorted]

plt.figure(figsize=(8, 5))
plt.bar(range(len(importances_sorted)), importances_sorted)
plt.xticks(range(len(importances_sorted)), features_sorted, rotation=45, ha="right")
plt.title("Student Feature Importances")
plt.tight_layout()
plt.show()
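
Built-in importances reflect how the trees happened to split during training. As a cross-check, permutation importance measures how much the score drops when a single feature is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data with a `GradientBoostingRegressor` stand-in; the data and model are assumptions of the example, not the notebook's Student.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)   # only feature 0 drives the target

model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature several times and record the average drop in R2
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Features ranked by permutation importance:", ranking)
```

In the notebook, the same call would take `student`, `Xs_va`, and `ys_va`, so that the importances are computed on rows the model was not fitted on.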


## 11) Pricing function

In [None]:

def compute_premium(risk_index, base=BASE_PREMIUM, max_increase=MAX_INCREASE):
    # risk_index may fall outside [0,1] -> clamp
    r = float(risk_index)
    r = max(0.0, min(1.0, r))
    return base * (1.0 + max_increase * r)

# Example on the Student's test set
risk_pred_test = student.predict(Xs_te)
# optional rescaling to [0,1]
risk_min, risk_max = np.min(risk_pred_test), np.max(risk_pred_test) + 1e-12
risk_norm = (risk_pred_test - risk_min) / (risk_max - risk_min)

premiums = [compute_premium(r) for r in risk_norm]
print("Example premiums (first 10):", premiums[:10])
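
The pricing rule is linear in the clamped risk: with the configured values, a risk of 0.5 gives 500 × (1 + 0.8 × 0.5) = 700 €/year, and the premium always stays between 500 and 900. A vectorized sketch of the same rule; `compute_premium_vec` is a hypothetical name, and the constants are repeated here only to keep the block self-contained.

```python
import numpy as np

BASE_PREMIUM = 500.0   # same values as in the config cell
MAX_INCREASE = 0.8

def compute_premium_vec(risk_index, base=BASE_PREMIUM, max_increase=MAX_INCREASE):
    """Vectorized premium: clamp risk to [0, 1], then scale the base premium linearly."""
    r = np.clip(np.asarray(risk_index, dtype=float), 0.0, 1.0)
    return base * (1.0 + max_increase * r)

# Out-of-range risks are clamped to the minimum and maximum premium
premiums = compute_premium_vec([-0.2, 0.0, 0.5, 1.0, 1.3])
print(premiums)
```

Working on whole arrays avoids the Python-level loop when pricing an entire batch of predictions.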


## 12) Helper: prediction for a single client (simple features only)

In [None]:

# Note: the Student is trained on df_clean (after dropna).
# For a new client, pass exactly the simple_features, in this order.

def predict_client_premium(client_dict):
    # Build the vector in the same order as simple_features
    x = []
    for col in simple_features:
        if col not in client_dict:
            raise ValueError(f"Missing required feature: {col}")
        x.append(client_dict[col])
    # Force a float array so that mixed or nullable dtypes do not produce an object array
    x = np.array(x, dtype=float).reshape(1, -1)
    # Risk prediction from the Student
    risk_pred = student.predict(x)[0]
    # Rescaling reuses risk_min/risk_max computed on the test batch in section 11 (illustrative only)
    r = (risk_pred - risk_min) / (risk_max - risk_min)
    premium = compute_premium(r)
    return float(r), float(premium)

# Example input (adapt to your real data)
example_client = {k: df_clean.iloc[0][k] for k in simple_features}
risk_idx, prem = predict_client_premium(example_client)
print("Risk index (normalized):", round(risk_idx, 3), " -> Premium (€):", round(prem, 2))
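
The helper above rescales with bounds taken from one test batch, which would not be available in production. One alternative, sketched here, is to compute the bounds once on the Student's training predictions and reuse them, clipping anything outside. The helper names `fit_risk_bounds` and `normalize_risk` are hypothetical.

```python
import numpy as np

def fit_risk_bounds(train_predictions):
    """Return (min, max) of the predictions the bounds are calibrated on."""
    p = np.asarray(train_predictions, dtype=float)
    return float(p.min()), float(p.max())

def normalize_risk(risk_pred, bounds):
    """Rescale to [0, 1] using fixed bounds; values outside the bounds are clipped."""
    lo, hi = bounds
    if hi <= lo:
        raise ValueError("Degenerate bounds: max must be greater than min")
    return float(np.clip((risk_pred - lo) / (hi - lo), 0.0, 1.0))

bounds = fit_risk_bounds([0.2, 0.35, 0.5, 0.8])
print(normalize_risk(0.5, bounds), normalize_risk(0.95, bounds))
```

Storing `bounds` alongside the saved models would keep the normalized index comparable between training and deployment.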


## 13) Saving the models

In [None]:

import pickle

with open(os.path.join(SAVE_MODEL_DIR, "teacher_model.pkl"), "wb") as f:
    pickle.dump(teacher, f)
with open(os.path.join(SAVE_MODEL_DIR, "student_model.pkl"), "wb") as f:
    pickle.dump(student, f)

meta = {
    "simple_features": simple_features,
    "clinical_features": clinical_features,
    "pca_variance_explained": float(pca.explained_variance_ratio_[0]),
    "base_premium": BASE_PREMIUM,
    "max_increase": MAX_INCREASE,
    "has_xgboost": bool(HAS_XGB)
}
with open(os.path.join(SAVE_MODEL_DIR, "metadata.pkl"), "wb") as f:
    pickle.dump(meta, f)

print("Models saved to:", SAVE_MODEL_DIR)
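
At the branch, the saved artifacts have to be loaded back before scoring. A minimal loading sketch that reuses the file names from the cell above; the `load_artifacts` helper is hypothetical, and the round trip uses placeholder objects in a temporary directory instead of a trained model.

```python
import os
import pickle
import tempfile

def load_artifacts(model_dir):
    """Load the student model and the metadata written by the save cell."""
    with open(os.path.join(model_dir, "student_model.pkl"), "rb") as f:
        student = pickle.load(f)
    with open(os.path.join(model_dir, "metadata.pkl"), "rb") as f:
        meta = pickle.load(f)
    return student, meta

# Round-trip demo with placeholder objects instead of a trained model
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "student_model.pkl"), "wb") as f:
        pickle.dump({"placeholder": "student"}, f)
    with open(os.path.join(tmp_dir, "metadata.pkl"), "wb") as f:
        pickle.dump({"simple_features": ["age", "BMI"]}, f)
    loaded_student, loaded_meta = load_artifacts(tmp_dir)

print(loaded_meta["simple_features"])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. The feature order stored in the metadata is what the prediction helper needs in order to build its input vector.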



## Notes and possible improvements
- **Premium calibration:** use historical claims and premium data to tune `BASE_PREMIUM` and `MAX_INCREASE`.
- **Explainability:** SHAP (if available) could be added to explain the Student's predictions.
- **Validation:** add K-Fold CV and report metrics across folds, so that overfitting shows up as variation between folds.
- **Robustness:** handle outliers, for example with winsorization and robust transformations for heavy-tailed variables.
- **Privacy:** the clinical markers are used only in training (Teacher) and are never requested at the branch (Student).
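
The validation note can be sketched with the `KFold` class that the notebook already imports. The example below runs on synthetic data with a `GradientBoostingRegressor` stand-in. The data, model settings, and variable names are assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=300)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mae = []
for train_idx, val_idx in kf.split(X):
    # Fit a fresh model on each training fold and score it on the held-out fold
    model = GradientBoostingRegressor(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

print("MAE per fold:", np.round(fold_mae, 4))
print("Mean / std:", np.mean(fold_mae), np.std(fold_mae))
```

A large spread across folds would suggest that a single train/val/test split gives an unreliable picture of model quality.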
