# 02 – Binary Model Training (Diabetes Risk: Yes/No)

**Goal:** We train a binary model that distinguishes only between **no diabetes** (0) and **diabetes risk** (1).
**Why?** The "prediabetes" class is extremely rare in the dataset and hard to separate from the other classes, so a binary setup is often more stable.

## 1) Data

We load the already feature-engineered **binary dataset** (`diabetes_fe_binary.csv`) so that training and the app later use exactly the same features.

In [33]:
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score, roc_auc_score
import joblib

PROJECT_ROOT = Path("..")
DATA_PATH = PROJECT_ROOT / "data" / "processed" / "diabetes_fe_binary.csv"
MODEL_OUT = PROJECT_ROOT / "models" / "diabetes_binary_model.joblib"

df = pd.read_csv(DATA_PATH)
df.head()

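| | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | ... | Education | Income | inactive | cardio_risk_sum | low_fruits | low_veggies | lifestyle_risk_sum | poor_health | mental_physical_burden | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 4.0 | 3.0 | 1 | 2.0 | 1 | 0 | 3.0 | 1 | 33.0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 6.0 | 1.0 | 0 | 0.0 | 1 | 1 | 3.0 | 0 | 0.0 | 0 |
| 2 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 4.0 | 8.0 | 1 | 2.0 | 0 | 1 | 2.0 | 1 | 60.0 | 0 |
| 3 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 3.0 | 6.0 | 0 | 1.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 |
| 4 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 5.0 | 4.0 | 0 | 2.0 | 0 | 0 | 0.0 | 0 | 3.0 | 0 |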


## 2) Train/Test Split

We split the data with **stratification** so that the ratio of 0/1 labels stays the same in the training and test sets and the evaluation is fair.

In [34]:
target = "Diabetes_binary"
X = df.drop(columns=[target])
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Train label distribution:\n", y_train.value_counts(normalize=True).round(3))

Train shape: (202944, 28)
Test shape: (50736, 28)
Train label distribution:
 Diabetes_binary
0    0.842
1    0.158
Name: proportion, dtype: float64
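
The effect of `stratify=y` can be checked on toy data. The following is a minimal sketch that uses synthetic labels (not the diabetes dataset) with roughly the same imbalance; the names `X_demo` and `y_demo` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly 16 % positives (illustrative, not the real data)
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.16).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# Stratified split: the class ratio is preserved in both partitions
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Plain random split: the ratios may drift apart by chance
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("stratified train/test positive rate:", round(y_tr_s.mean(), 4), round(y_te_s.mean(), 4))
print("random     train/test positive rate:", round(y_tr_r.mean(), 4), round(y_te_r.mean(), 4))
```

With stratification the two positive rates differ by at most a rounding effect of a few samples, which is why test metrics for the minority class are comparable to the training distribution.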


## 3) Model

We use **logistic regression** as a robust, interpretable baseline and set `class_weight="balanced"` to account for the class imbalance (which remains even in the binary setting).

In [35]:

from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score, classification_report, confusion_matrix, roc_auc_score
import pandas as pd
import numpy as np

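# Pipeline: scaling + classifier. No resampling step (e.g. SMOTE) is used here;
# the class imbalance is handled via class_weight="balanced".
pipe_logreg = ImbPipeline([
    ("scaler", StandardScaler()),
    # Note: the solver is overridden to "saga" by the search below, and
    # l1_ratio only takes effect when an elastic-net penalty is active.
    ("clf", LogisticRegression(random_state=42, class_weight="balanced", max_iter=1000, solver="lbfgs", l1_ratio=0.5)),
])
# Five evenly spaced max_iter candidates: [4000, 4500, 5000, 5500, 6000]
iters = [4000 + i * (6000 - 4000) // (5 - 1) for i in range(5)]
# Parameter space (stable, compatible combinations for LogisticRegression)
param_dist = {
    "clf__C": np.logspace(-4, 2, 50),                 # inverse regularization strength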
    "clf__solver": ["saga"],
    "clf__tol": [1e-4, 1e-3, 1e-2],
    "clf__max_iter": iters
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(f1_score, pos_label=1)

rs_log = RandomizedSearchCV(
    estimator=pipe_logreg,
    param_distributions=param_dist,
    n_iter=80,
    scoring=scorer,
    cv=cv,
    n_jobs=-1,
    random_state=42,
    return_train_score=True,
    verbose=1

)

# Fit (uses the X_train / y_train created earlier in the notebook)
rs_log.fit(X_train, y_train)

print("Best CV F1:", rs_log.best_score_)
print("Best params:", rs_log.best_params_)


Fitting 5 folds for each of 80 candidates, totalling 400 fits
Best CV F1: 0.4702790759399398
Best params: {'clf__tol': 0.001, 'clf__solver': 'saga', 'clf__max_iter': 4500, 'clf__C': np.float64(0.0007196856730011522)}
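
For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * count(class))`. The short sketch below reproduces that heuristic on a synthetic label vector that mirrors the training distribution printed in section 2 (84.2 % vs. 15.8 %); the vector `y_demo` is an assumption for illustration and not the real training data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 842 negatives and 158 positives (same ratio as the training split)
y_demo = np.array([0] * 842 + [1] * 158)

# Manual computation of the "balanced" heuristic
counts = np.bincount(y_demo)
manual = len(y_demo) / (len(counts) * counts)

# Same result via scikit-learn's helper function
sk = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

print("manual :", manual.round(3))
print("sklearn:", sk.round(3))
```

Under this ratio, each positive sample counts roughly 5.3 times as much as a negative one in the loss, which pushes the model toward higher recall on class 1 at the cost of precision.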


## 4) Evaluation

In addition to the report we look at the **confusion matrix**, because it shows whether the model misses risk cases (false negatives).
We also use **F1 (for class 1)** and **ROC-AUC**, which assess quality under class imbalance better than accuracy alone.

In [36]:
# Evaluation on the test data
y_pred = rs_log.predict(X_test)
# Not every estimator exposes predict_proba (LogisticRegression does); guard defensively
y_proba = None
if hasattr(rs_log.best_estimator_.named_steps["clf"], "predict_proba"):
    y_proba = rs_log.predict_proba(X_test)[:, 1]

print("=== Classification Report (LogReg best) ===")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print(f"F1 (positive class): {f1_score(y_test, y_pred):.4f}")
if y_proba is not None:
    print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")

# Optional: show the top CV configurations
res = pd.DataFrame(rs_log.cv_results_)
top = res.sort_values("mean_test_score", ascending=False).head(10)[[
    "params", "mean_test_score", "std_test_score"
]]
print("Top 10 CV configurations:")
print(top.to_string(index=False))

=== Classification Report (LogReg best) ===
              precision    recall  f1-score   support

           0       0.94      0.73      0.82     42741
           1       0.34      0.76      0.47      7995

    accuracy                           0.73     50736
   macro avg       0.64      0.74      0.65     50736
weighted avg       0.85      0.73      0.76     50736

Confusion Matrix:
[[31012 11729]
 [ 1917  6078]]
F1 (positive class): 0.4711
ROC-AUC: 0.8173
Top 10 CV configurations:
                                                                                              params  mean_test_score  std_test_score
  {'clf__tol': 0.001, 'clf__solver': 'saga', 'clf__max_iter': 4500, 'clf__C': 0.0007196856730011522}         0.470279        0.002928
  {'clf__tol': 0.001, 'clf__solver': 'saga', 'clf__max_iter': 5000, 'clf__C': 0.0005428675439323859}         0.470094        0.002758
 {'clf__tol': 0.0001, 'clf__solver': 'saga', 'clf__max_iter': 4000, 'clf__C': 0.0009540954763499944}       
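
To make the link between the confusion matrix and the reported scores explicit, the following sketch recomputes the class-1 metrics by hand from the four counts printed above. The layout (rows are true classes, columns are predicted classes) follows scikit-learn's `confusion_matrix` convention; the variable names are illustrative.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class
tn, fp = 31012, 11729
fn, tp = 1917, 6078

precision = tp / (tp + fp)          # share of predicted risk cases that are real
recall = tp / (tp + fn)             # share of real risk cases that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The results agree with the classification report (0.34, 0.76, 0.47 and 0.73). The 1,917 false negatives are the risk cases the model misses, and the 11,729 false positives explain the low precision.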

## 4.1) Random Forest for Comparison

In [37]:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score, classification_report, confusion_matrix, roc_auc_score
import pandas as pd
import joblib

# Pipeline with SMOTE. Embedding SMOTE in the pipeline avoids data leakage:
# resampling is applied only to the training folds, never to validation data.
# Scaling is not needed by the forest itself, but SMOTE is distance-based.
pipe_rf_tune = ImbPipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
    ("clf", RandomForestClassifier(random_state=42, n_jobs=-1, class_weight="balanced"))
])

# Search space (RandomizedSearchCV)
param_dist = {
    "clf__n_estimators": [100, 200, 500, 800],
    "clf__max_depth": [None, 6, 8, 12, 16],
    "clf__max_features": ["sqrt", "log2", 0.2, 0.5],
    "clf__min_samples_split": [2, 5, 10],
    "clf__min_samples_leaf": [1, 2, 4],
    "clf__bootstrap": [True, False]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(f1_score, pos_label=1)

rs_rf = RandomizedSearchCV(
    estimator=pipe_rf_tune,
    param_distributions=param_dist,
    n_iter=40,
    scoring=scorer,
    cv=cv,
    n_jobs=-1,
    random_state=42,
    return_train_score=True,
    verbose=1
)

# Fit (runs the search)
rs_rf.fit(X_train, y_train)

# Results
print("Best CV F1:", rs_rf.best_score_)
print("Best params:", rs_rf.best_params_)

# Save the best (refitted) pipeline
out_path = MODEL_OUT.with_suffix(".rf_best.joblib")
joblib.dump(rs_rf.best_estimator_, out_path)
print("Best RF model saved:", out_path)

# Short evaluation on the test data
y_pred_rf = rs_rf.predict(X_test)
y_proba_rf = rs_rf.predict_proba(X_test)[:, 1]
print("=== Classification Report (RF best) ===")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print(f"F1 (positive class): {f1_score(y_test, y_pred_rf):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_rf):.4f}")

# Optional: brief summary of the top configurations
res = pd.DataFrame(rs_rf.cv_results_)
top = res.sort_values("mean_test_score", ascending=False).head(10)[[
    "params", "mean_test_score", "std_test_score"
]]
print("Top 10 CV configurations:")
print(top.to_string(index=False))



Fitting 5 folds for each of 40 candidates, totalling 200 fits
Best CV F1: 0.4847691344481865
Best params: {'clf__n_estimators': 100, 'clf__min_samples_split': 10, 'clf__min_samples_leaf': 1, 'clf__max_features': 0.5, 'clf__max_depth': 12, 'clf__bootstrap': True}
Best RF model saved: ..\models\diabetes_binary_model.rf_best.joblib
=== Classification Report (RF best) ===
              precision    recall  f1-score   support

           0       0.92      0.81      0.86     42741
           1       0.39      0.64      0.48      7995

    accuracy                           0.79     50736
   macro avg       0.66      0.73      0.67     50736
weighted avg       0.84      0.79      0.80     50736

Confusion Matrix:
[[34715  8026]
 [ 2878  5117]]
F1 (positive class): 0.4842
ROC-AUC: 0.8165
Top 10 CV configurations:
                                                                                                                                                          params  mean_test_s

## 4.2) MLP for Comparison (with Hyperparameter Tuning)

In [38]:

from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import make_scorer, f1_score
import joblib

# Base pipeline (uses the same X_train / y_train as the rest of the notebook)
pipe_mlp = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", MLPClassifier(
        max_iter=2000,
        early_stopping=True,       # required so that validation_scores_ is available
        n_iter_no_change=25,
        tol=1e-4,
        random_state=42,
        verbose=False
    ))
])

# Search space (randomized search)
param_dist = {
    "clf__hidden_layer_sizes": [(32,16), (64,32), (128,64), (64,32,16), (128,64,32)],
    "clf__activation": ["relu", "tanh"],
    "clf__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "clf__learning_rate_init": [1e-4, 5e-4, 1e-3, 5e-3],
    "clf__learning_rate": ["constant", "adaptive"],
    # solver is left at its default, 'adam'
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(f1_score, pos_label=1)

rs = RandomizedSearchCV(
    pipe_mlp,
    param_distributions=param_dist,
    n_iter=25,
    scoring=scorer,
    cv=cv,
    n_jobs=-1,
    random_state=42,
    return_train_score=True,
    verbose=1
)

# Fit (runs the search)
rs.fit(X_train, y_train)

# Brief summary of the results
print("Best F1 (cv):", rs.best_score_)
print("Best params:", rs.best_params_)


Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best F1 (cv): 0.305753115193338
Best params: {'clf__learning_rate_init': 0.001, 'clf__learning_rate': 'constant', 'clf__hidden_layer_sizes': (32, 16), 'clf__alpha': 0.0001, 'clf__activation': 'tanh'}
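
`MLPClassifier` has no `class_weight` parameter, so unlike the two previous models it is trained here without any correction for the class imbalance, which may partly explain the lower F1. One possible remedy is to lower the decision threshold on `predict_proba` instead of relying on the default of 0.5. The sketch below shows the idea on synthetic data; the dataset, the logistic-regression stand-in model and the threshold values are all illustrative assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 16 % positives); stand-in for the real features
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Unweighted model: at threshold 0.5 it tends to favour the majority class
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Lowering the threshold trades precision for recall on the positive class
recalls = []
for threshold in (0.5, 0.3, 0.2):
    y_hat = (proba >= threshold).astype(int)
    recalls.append(recall_score(y_te, y_hat))
    prec = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={prec:.3f}  recall={recalls[-1]:.3f}")
```

Recall can only stay equal or rise as the threshold drops, because every sample predicted positive at a higher threshold is still predicted positive at a lower one. If a threshold is tuned this way, it should be chosen on validation data rather than on the test set.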


## 5) Saving

We save the trained pipeline as a `.joblib` file so that the web app can load it directly and make reproducible predictions.

In [39]:
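# Save the fitted best pipeline from the search. pipe_logreg itself is never
# fitted, because RandomizedSearchCV works on clones of the estimator.
joblib.dump(rs_log.best_estimator_, MODEL_OUT)
print("✅ Model saved:", MODEL_OUT)

✅ Model saved: ..\models\diabetes_binary_model.joblib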
