# Stroke Prediction (Imbalanced) — Threshold tuning + Calibration

## Abstract
Este proyecto entrena modelos para predecir stroke en un dataset tabular desbalanceado (~5% positivos).
Se prioriza F1 y se reporta AP/PR-AUC por ser más informativo en desbalance.
El pipeline final combina BalancedRandomForest + búsqueda de hiperparámetros + calibración de probabilidades + selección de umbral con predicciones out-of-fold (OOF) para evitar fuga de información.


## Dataset
Fuente: Stroke Prediction Dataset (Kaggle).  
Target: `stroke` (0/1).  
El dataset está altamente desbalanceado, por lo que accuracy y ROC-AUC pueden ser engañosos; se usan métricas basadas en Precision-Recall.


In [1]:
from google.colab import files
uploaded = files.upload()  # eliges el .xlsx o .csv


Saving healthcare-dataset-stroke-data (1).csv to healthcare-dataset-stroke-data (1).csv


In [8]:
import pandas as pd
import io

df = pd.read_csv(io.BytesIO(uploaded["healthcare-dataset-stroke-data (1).csv"]))
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [4]:
!pip -q install -U imbalanced-learn

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

from sklearn.metrics import (
    precision_recall_curve,
    f1_score, precision_score, recall_score,
    average_precision_score,
    confusion_matrix, classification_report
)

from imblearn.ensemble import BalancedRandomForestClassifier

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## Experimental protocol
1) Split estratificado en train/val/test (test intocable).  
2) Selección de modelo con CV estratificada sobre trainval.  
3) Selección de umbral que maximiza F1 usando probabilidades OOF (no se usa el test para escoger umbral).  
4) Evaluación final una sola vez en test.


In [10]:
df = pd.read_csv("stroke.csv")

y = df["stroke"].astype(int)
X = df.drop(columns=["stroke"])

print(df.shape)
print(y.value_counts(), "\n")
print(y.value_counts(normalize=True))


(5110, 12)
stroke
0    4861
1     249
Name: count, dtype: int64 

stroke
0    0.951272
1    0.048728
Name: proportion, dtype: float64


In [11]:
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=RANDOM_STATE,
    stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval,
    test_size=0.20,
    random_state=RANDOM_STATE,
    stratify=y_trainval
)

print("Train:", y_train.value_counts().to_dict())
print("Val:  ", y_val.value_counts().to_dict())
print("Test: ", y_test.value_counts().to_dict())


Train: {0: 3111, 1: 159}
Val:   {0: 778, 1: 40}
Test:  {0: 972, 1: 50}


In [12]:
num_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_cols),
        ("cat", categorical_pipe, cat_cols),
    ],
    remainder="drop"
)

preprocess


## Models
- Baseline: Logistic Regression + threshold tuning.
- BalancedRandomForest: diseñado para desbalance al balancear el muestreo al construir árboles.
- Calibración: CalibratedClassifierCV (sigmoid) para mejorar confiabilidad de probabilidades y luego re-optimizar el umbral.


In [13]:
brf = BalancedRandomForestClassifier(
    n_estimators=500,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

model_brf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", brf)
])

model_brf.fit(X_train, y_train)

p_val = model_brf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, p_val)

precision_t, recall_t = precision[:-1], recall[:-1]
f1 = 2 * precision_t * recall_t / (precision_t + recall_t + 1e-12)

best_thr_val = thresholds[np.argmax(f1)]
print("Best threshold (VAL):", best_thr_val)
print("Best F1 (VAL):", f1.max())

# Evaluación en VAL con ese umbral
yhat_val = (p_val >= best_thr_val).astype(int)
print("VAL Precision:", precision_score(y_val, yhat_val))
print("VAL Recall:   ", recall_score(y_val, yhat_val))
print("VAL F1:       ", f1_score(y_val, yhat_val))
print("VAL AP:       ", average_precision_score(y_val, p_val))
print("VAL Confusion:\n", confusion_matrix(y_val, yhat_val))


Best threshold (VAL): 0.474
Best F1 (VAL): 0.2527881040889662
VAL Precision: 0.14847161572052403
VAL Recall:    0.85
VAL F1:        0.2527881040892193
VAL AP:        0.15484722416286908
VAL Confusion:
 [[583 195]
 [  6  34]]


## Metrics
- F1: métrica objetivo (balance precisión/recall).
- Average Precision (AP / PR-AUC): calidad de ranking en escenarios desbalanceados.
Además se reporta matriz de confusión para interpretar FP/FN.


In [14]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

oof_proba = np.zeros(len(X_trainval), dtype=float)

for tr_idx, va_idx in cv.split(X_trainval, y_trainval):
    X_tr, y_tr = X_trainval.iloc[tr_idx], y_trainval.iloc[tr_idx]
    X_va, y_va = X_trainval.iloc[va_idx], y_trainval.iloc[va_idx]

    fold_model = Pipeline(steps=[
        ("preprocess", preprocess),
        ("clf", BalancedRandomForestClassifier(
            n_estimators=500,
            random_state=RANDOM_STATE,
            n_jobs=-1
        ))
    ])
    fold_model.fit(X_tr, y_tr)
    oof_proba[va_idx] = fold_model.predict_proba(X_va)[:, 1]

# Umbral que maximiza F1 sobre OOF
precision, recall, thresholds = precision_recall_curve(y_trainval, oof_proba)
precision_t, recall_t = precision[:-1], recall[:-1]
f1 = 2 * precision_t * recall_t / (precision_t + recall_t + 1e-12)

best_thr_oof = thresholds[np.argmax(f1)]
print("Best threshold (OOF-CV):", best_thr_oof)
print("Best F1 (OOF-CV):", f1.max())

# Entrenar modelo final con todo trainval
final_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", BalancedRandomForestClassifier(
        n_estimators=500,
        random_state=RANDOM_STATE,
        n_jobs=-1
    ))
])
final_model.fit(X_trainval, y_trainval)

# Evaluar en TEST
p_test = final_model.predict_proba(X_test)[:, 1]
yhat_test = (p_test >= best_thr_oof).astype(int)

print("\nTEST metrics (threshold = OOF-CV best)")
print("TEST Precision:", precision_score(y_test, yhat_test))
print("TEST Recall:   ", recall_score(y_test, yhat_test))
print("TEST F1:       ", f1_score(y_test, yhat_test))
print("TEST AP:       ", average_precision_score(y_test, p_test))
print("TEST Confusion:\n", confusion_matrix(y_test, yhat_test))

print("\nClassification report (TEST):")
print(classification_report(y_test, yhat_test, digits=4))

# Comparación con threshold=0.5 (para referencia)
yhat_test_05 = (p_test >= 0.5).astype(int)
print("\nTEST metrics (threshold=0.5)")
print("Precision:", precision_score(y_test, yhat_test_05))
print("Recall:   ", recall_score(y_test, yhat_test_05))
print("F1:       ", f1_score(y_test, yhat_test_05))


Best threshold (OOF-CV): 0.56
Best F1 (OOF-CV): 0.2535496957400429

TEST metrics (threshold = OOF-CV best)
TEST Precision: 0.16842105263157894
TEST Recall:    0.64
TEST F1:        0.26666666666666666
TEST AP:        0.19968340377198687
TEST Confusion:
 [[814 158]
 [ 18  32]]

Classification report (TEST):
              precision    recall  f1-score   support

           0     0.9784    0.8374    0.9024       972
           1     0.1684    0.6400    0.2667        50

    accuracy                         0.8278      1022
   macro avg     0.5734    0.7387    0.5846      1022
weighted avg     0.9387    0.8278    0.8713      1022


TEST metrics (threshold=0.5)
Precision: 0.1568627450980392
Recall:    0.8
F1:        0.26229508196721313


In [15]:
# Búsqueda: optimizamos Average Precision (AP) en CV, y luego umbral para F1.
# AP suele ser más estable que F1 porque no depende del umbral.

pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", BalancedRandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1))
])

param_dist = {
    "clf__n_estimators": [300, 500, 800, 1200],
    "clf__max_depth": [None, 5, 10, 15],
    "clf__min_samples_leaf": [1, 2, 5, 10],
    "clf__max_features": ["sqrt", 0.5, 0.8],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=25,
    scoring="average_precision",
    cv=cv,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=1
)

search.fit(X_trainval, y_trainval)
print("Best params:", search.best_params_)
print("Best CV AP:", search.best_score_)

best_model = search.best_estimator_

# OOF con el mejor modelo para umbral F1
oof_proba = np.zeros(len(X_trainval), dtype=float)
for tr_idx, va_idx in cv.split(X_trainval, y_trainval):
    X_tr, y_tr = X_trainval.iloc[tr_idx], y_trainval.iloc[tr_idx]
    X_va, y_va = X_trainval.iloc[va_idx], y_trainval.iloc[va_idx]

    m = Pipeline(steps=[
        ("preprocess", preprocess),
        ("clf", BalancedRandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1))
    ])
    m.set_params(**search.best_params_)
    m.fit(X_tr, y_tr)
    oof_proba[va_idx] = m.predict_proba(X_va)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_trainval, oof_proba)
precision_t, recall_t = precision[:-1], recall[:-1]
f1 = 2 * precision_t * recall_t / (precision_t + recall_t + 1e-12)
best_thr_oof = thresholds[np.argmax(f1)]

print("Best threshold (OOF-CV, tuned model):", best_thr_oof)
print("Best F1 (OOF-CV, tuned model):", f1.max())

# Entrenar el mejor modelo con TODO trainval
best_model.fit(X_trainval, y_trainval)

# TEST final
p_test = best_model.predict_proba(X_test)[:, 1]
yhat_test = (p_test >= best_thr_oof).astype(int)

print("\nTEST metrics (tuned model + OOF threshold)")
print("TEST Precision:", precision_score(y_test, yhat_test))
print("TEST Recall:   ", recall_score(y_test, yhat_test))
print("TEST F1:       ", f1_score(y_test, yhat_test))
print("TEST AP:       ", average_precision_score(y_test, p_test))
print("TEST Confusion:\n", confusion_matrix(y_test, yhat_test))


Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best params: {'clf__n_estimators': 300, 'clf__min_samples_leaf': 1, 'clf__max_features': 0.5, 'clf__max_depth': None}
Best CV AP: 0.2064281591209065
Best threshold (OOF-CV, tuned model): 0.64
Best F1 (OOF-CV, tuned model): 0.2735723771576457

TEST metrics (tuned model + OOF threshold)
TEST Precision: 0.21551724137931033
TEST Recall:    0.5
TEST F1:        0.30120481927710846
TEST AP:        0.22179750616852667
TEST Confusion:
 [[881  91]
 [ 25  25]]


In [16]:
import numpy as np

from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (
    precision_recall_curve,
    f1_score, precision_score, recall_score,
    average_precision_score,
    confusion_matrix, classification_report
)

RANDOM_STATE = 42
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

#  1) OOF calibrated probabilities (sin fuga)
oof_proba_cal = np.zeros(len(X_trainval), dtype=float)

for tr_idx, va_idx in cv.split(X_trainval, y_trainval):
    X_tr, y_tr = X_trainval.iloc[tr_idx], y_trainval.iloc[tr_idx]
    X_va, y_va = X_trainval.iloc[va_idx], y_trainval.iloc[va_idx]

    # 1) Entrena el modelo base en el fold de entrenamiento
    base = clone(best_model)
    base.fit(X_tr, y_tr)

    # 2) Calibra sobre el fold de validación (prefit)
    cal = CalibratedClassifierCV(estimator=base, method="sigmoid", cv="prefit")
    cal.fit(X_va, y_va)

    # 3) Guardamos las probabilidades calibradas OOF para ese fold
    oof_proba_cal[va_idx] = cal.predict_proba(X_va)[:, 1]

#  2) Umbral OOF que maximiza F1
precision, recall, thresholds = precision_recall_curve(y_trainval, oof_proba_cal)  # PR curve [web:345]
precision_t, recall_t = precision[:-1], recall[:-1]
f1_vals = 2 * precision_t * recall_t / (precision_t + recall_t + 1e-12)

best_idx = np.argmax(f1_vals)
best_thr_oof = thresholds[best_idx]

print("Best threshold (OOF-CV, calibrated):", best_thr_oof)
print("Best F1 (OOF-CV, calibrated):", f1_vals[best_idx])

#  3) Entrena calibración final en todo trainval (sin tocar test)
# CalibratedClassifierCV con cv!=prefit vuelve a entrenar y calibrar internamente usando CV.
final_calibrated = CalibratedClassifierCV(
    estimator=clone(best_model),
    method="sigmoid",
    cv=cv
)
final_calibrated.fit(X_trainval, y_trainval)

#  4) Evalúa una sola vez en test
p_test = final_calibrated.predict_proba(X_test)[:, 1]
yhat_test = (p_test >= best_thr_oof).astype(int)

print("\nTEST metrics (calibrated + OOF threshold)")
print("TEST Precision:", precision_score(y_test, yhat_test))
print("TEST Recall:   ", recall_score(y_test, yhat_test))
print("TEST F1:       ", f1_score(y_test, yhat_test))
print("TEST AP:       ", average_precision_score(y_test, p_test))
print("TEST Confusion:\n", confusion_matrix(y_test, yhat_test))

print("\nClassification report (TEST):")
print(classification_report(y_test, yhat_test, digits=4))




Best threshold (OOF-CV, calibrated): 0.13214715051972686
Best F1 (OOF-CV, calibrated): 0.2713286713282696

TEST metrics (calibrated + OOF threshold)
TEST Precision: 0.23275862068965517
TEST Recall:    0.54
TEST F1:        0.3253012048192771
TEST AP:        0.23194779704022073
TEST Confusion:
 [[883  89]
 [ 23  27]]

Classification report (TEST):
              precision    recall  f1-score   support

           0     0.9746    0.9084    0.9404       972
           1     0.2328    0.5400    0.3253        50

    accuracy                         0.8904      1022
   macro avg     0.6037    0.7242    0.6328      1022
weighted avg     0.9383    0.8904    0.9103      1022



In [17]:
import pandas as pd

results = [
  {"Experiment":"Baseline (LogReg + CV threshold)", "Precision":0.2191780821917808, "Recall":0.64, "F1":0.32653061224489793, "AP":0.2599195805215839},
  {"Experiment":"BRF tuned (OOF threshold)", "Precision":0.21551724137931033, "Recall":0.50, "F1":0.30120481927710846, "AP":0.22179750616852667},
  {"Experiment":"BRF tuned + calibrated (OOF threshold)", "Precision":0.23275862068965517, "Recall":0.54, "F1":0.3253012048192771, "AP":0.23194777970402207},
]

df_results = pd.DataFrame(results).sort_values(["F1","AP"], ascending=False)
df_results.to_csv("results.csv", index=False)

print(df_results.to_markdown(index=False))


| Experiment                             |   Precision |   Recall |       F1 |       AP |
|:---------------------------------------|------------:|---------:|---------:|---------:|
| Baseline (LogReg + CV threshold)       |    0.219178 |     0.64 | 0.326531 | 0.25992  |
| BRF tuned + calibrated (OOF threshold) |    0.232759 |     0.54 | 0.325301 | 0.231948 |
| BRF tuned (OOF threshold)              |    0.215517 |     0.5  | 0.301205 | 0.221798 |


Un pipeline reproducible para clasificación desbalanceada que combina BalancedRandomForest + tuning + calibración de probabilidades y selección de umbral con probabilidades OOF para maximizar F1 sin fuga.