# Heart Disease UCI #

URL: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

- Binary target variable: HeartDisease (1 = heart disease present, 0 = absent).

## Step 1: Load the dataset ##
The dataset is downloaded with KaggleHub and loaded into a pandas DataFrame for initial exploration.

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/fedesoriano/heart-failure-prediction?dataset_version_number=1...


100%|██████████| 8.56k/8.56k [00:00<00:00, 3.94MB/s]

Extracting files...
Path to dataset files: C:\Users\Alejo\.cache\kagglehub\datasets\fedesoriano\heart-failure-prediction\versions\1





## Viewing the DataFrame ##

In [3]:
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Build the CSV path in an OS-independent way
df = pd.read_csv(os.path.join(path, "heart.csv"))
df.head()


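| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |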


## Step 2: Train/test split ##
The target variable (HeartDisease) is separated from the features, the categorical features are one-hot encoded, and the data is split 80/20 into training and test sets, stratified on the target.

In [7]:
X = df.drop("HeartDisease", axis=1)
y = df["HeartDisease"]

# One-hot encode the categorical variables
X = pd.get_dummies(X, drop_first=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
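
To make the effect of `drop_first=True` and `stratify=y` concrete, here is a minimal sketch on a toy frame. The rows and the names `toy`, `X_toy`, `y_toy` are hypothetical stand-ins, not records from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the heart dataset (hypothetical values, not real records).
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58, 61, 52],
    "Sex": ["M", "F", "M", "F", "M", "M", "F", "M", "M", "F"],
    "HeartDisease": [0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
})

X_toy = pd.get_dummies(toy.drop("HeartDisease", axis=1), drop_first=True)
y_toy = toy["HeartDisease"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# drop_first=True keeps a single indicator for the binary Sex column.
print(list(X_toy.columns))
# Stratification keeps the class ratio of the full toy set in both splits.
print(y_tr.mean(), y_te.mean())
```

With five positives and five negatives, both splits end up with the same 50/50 class ratio, which is what stratification is meant to guarantee.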

## Step 3: Preprocessing (scaling) ##
Standard scaling (StandardScaler) is applied to the features, which were already one-hot encoded in Step 2. The scaler is fit on the training set only and then applied to the test set.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler())
])

X_train_scaled = pipeline.fit_transform(X_train)
X_test_scaled = pipeline.transform(X_test)
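
An alternative, shown only as a sketch and not as the notebook's method, is to do the encoding and the scaling in one `ColumnTransformer`, so that both are learned from the training rows alone. The toy rows and the names `train`, `test`, `preprocess` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows that reuse three of the dataset's column names.
train = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "MaxHR": [172, 156, 98, 108],
    "Sex": ["M", "F", "M", "F"],
})
test = pd.DataFrame({"Age": [54], "MaxHR": [122], "Sex": ["M"]})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "MaxHR"]),      # scale numeric columns
        ("cat", OneHotEncoder(drop="first"), ["Sex"]),    # encode categorical columns
    ],
    sparse_threshold=0,  # always return a dense array
)

train_scaled = preprocess.fit_transform(train)  # statistics learned from train only
test_scaled = preprocess.transform(test)        # reused unchanged on test

print(train_scaled.shape, test_scaled.shape)
print(train_scaled[:, :2].mean(axis=0))  # numeric columns are centered on train
```

Because the transformer is a single object, it can also be placed as the first step of a `Pipeline` in front of the classifier.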

## Step 4: Baseline model (default Random Forest) ##
A baseline RandomForestClassifier is trained with default hyperparameters and evaluated with 5-fold cross-validation using AUC-ROC.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf_baseline = RandomForestClassifier(random_state=42)
baseline_cv_auc = cross_val_score(rf_baseline, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"AUC baseline (CV): {baseline_cv_auc.mean():.4f}")


AUC baseline (CV): 0.9228
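
In the cell above, the scaler was fit on the whole training set before cross-validation, so each validation fold contributes to the scaling statistics. Tree-based models are insensitive to monotonic rescaling, so the effect here is probably negligible, but the pattern matters for other estimators. A minimal sketch of the leak-free variant follows, using synthetic data from `make_classification` as a stand-in (the names `X_demo`, `y_demo`, `model` are assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the unscaled X_train, y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# The scaler is part of the estimator, so it is refit inside every CV fold.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"AUC per fold: {scores.round(4)}")
print(f"Mean AUC: {scores.mean():.4f}")
```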


## Step 5: Defining the hyperparameter space ##
A set of candidate values is defined for the hyperparameters that the following steps will tune.



In [None]:
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
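
To get a feel for the size of this space, the grid can be counted with `ParameterGrid`. This is only a quick sanity check; the `n_iter = 20` value is the budget used by the random search in Step 6.

```python
from sklearn.model_selection import ParameterGrid

param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Number of distinct combinations: 3 * 4 * 3 * 3
n_combinations = len(ParameterGrid(param_dist))
n_iter = 20  # random-search budget from Step 6

print(n_combinations)                    # 108
print(f"{n_iter / n_combinations:.1%}")  # 18.5%
```

A random search with 20 iterations therefore samples under a fifth of the discrete grid.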

## Step 6: Tuning with random search ##
The hyperparameters are tuned with random search (RandomizedSearchCV) in this section and with Bayesian optimization (Optuna) in the next. Performance is measured with cross-validated AUC-ROC.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train_scaled, y_train)
print(f"AUC random search (CV): {random_search.best_score_:.4f}")

AUC random search (CV): 0.9286
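
Besides `best_score_`, the fitted search object exposes the winning configuration and the per-candidate results. The sketch below runs a smaller search on synthetic data to show how they can be inspected; the data, the reduced grid, and the names `search` and `results` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a reduced grid to keep the example fast.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [20, 50],
        "max_depth": [None, 5],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_demo, y_demo)

# Best configuration found by the search
print(search.best_params_)

# One row per sampled candidate, ranked by mean validation AUC
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)
print(top)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top candidates actually differ.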


## Tuning with Optuna ##

In [None]:
import optuna

def objective(trial):
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int('n_estimators', 50, 200),
        max_depth=trial.suggest_int('max_depth', 3, 20),
        min_samples_split=trial.suggest_int('min_samples_split', 2, 10),
        min_samples_leaf=trial.suggest_int('min_samples_leaf', 1, 4),
        random_state=42
    )
    auc = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='roc_auc').mean()
    return auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)
print(f"Best AUC Optuna (CV): {study.best_value:.4f}")

[I 2025-05-25 20:23:46,509] A new study created in memory with name: no-name-271d72ad-938d-4a61-9973-a42fb69022f9
[I 2025-05-25 20:23:48,057] Trial 0 finished with value: 0.9252637836919678 and parameters: {'n_estimators': 158, 'max_depth': 9, 'min_samples_split': 8, 'min_samples_leaf': 2}. Best is trial 0 with value: 0.9252637836919678.
[I 2025-05-25 20:23:49,120] Trial 1 finished with value: 0.9256374319246948 and parameters: {'n_estimators': 115, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 4}. Best is trial 1 with value: 0.9256374319246948.
[I 2025-05-25 20:23:49,919] Trial 2 finished with value: 0.9264024442886232 and parameters: {'n_estimators': 80, 'max_depth': 20, 'min_samples_split': 4, 'min_samples_leaf': 3}. Best is trial 2 with value: 0.9264024442886232.
[I 2025-05-25 20:23:50,918] Trial 3 finished with value: 0.9266656980478117 and parameters: {'n_estimators': 100, 'max_depth': 16, 'min_samples_split': 6, 'min_samples_leaf': 3}. Best is trial 3 with value: 

Best AUC Optuna (CV): 0.9289


## Step 7: Comparison with cross-validation ##
The three models (baseline, random search, Optuna) are compared on the training set using cross-validation.

In [15]:
best_rf_optuna = RandomForestClassifier(**study.best_params, random_state=42)
optuna_cv_auc = cross_val_score(best_rf_optuna, X_train_scaled, y_train, cv=5, scoring='roc_auc')

print(f"""
Baseline model AUC: {baseline_cv_auc.mean():.4f}
Random search model AUC: {random_search.best_score_:.4f}
Optuna model AUC: {optuna_cv_auc.mean():.4f}
""")


Baseline model AUC: 0.9228
Random search model AUC: 0.9286
Optuna model AUC: 0.9289
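
The three means differ only in the third decimal place, so it helps to look at the spread across folds before declaring a winner. Here is a minimal sketch of such a summary. The helper `summarize_cv` and the fold scores are hypothetical, made up for illustration, and are not the notebook's values.

```python
import numpy as np

def summarize_cv(scores):
    """Return the mean and sample standard deviation of fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

# Hypothetical fold scores for illustration only.
fold_scores = {
    "model_a": [0.91, 0.93, 0.92, 0.94, 0.90],
    "model_b": [0.92, 0.93, 0.93, 0.94, 0.91],
}

for name, scores in fold_scores.items():
    mean, std = summarize_cv(scores)
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

In the notebook, the same helper could be applied to `baseline_cv_auc` and `optuna_cv_auc`, which already hold per-fold arrays. If the gap between means is smaller than the fold-to-fold spread, the ranking is not conclusive.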



## Step 8: Evaluation on the test set ##
The three models are refit on the full training set and evaluated on the held-out test set, computing AUC-ROC for each.

In [16]:
from sklearn.metrics import roc_auc_score

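# Fit each model on the full training set.
# RandomizedSearchCV already refits best_estimator_ by default (refit=True),
# so the second call is redundant but harmless.
rf_baseline.fit(X_train_scaled, y_train)
random_search.best_estimator_.fit(X_train_scaled, y_train)
best_rf_optuna.fit(X_train_scaled, y_train)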

df_probs = pd.DataFrame({
    'baseline': rf_baseline.predict_proba(X_test_scaled)[:,1],
    'random': random_search.best_estimator_.predict_proba(X_test_scaled)[:,1],
    'optuna': best_rf_optuna.predict_proba(X_test_scaled)[:,1],
})

auc_baseline_test = roc_auc_score(y_test, df_probs['baseline'])
auc_random_test = roc_auc_score(y_test, df_probs['random'])
auc_optuna_test = roc_auc_score(y_test, df_probs['optuna'])

print(f"""
AUC test baseline: {auc_baseline_test:.4f}
AUC test random search: {auc_random_test:.4f}
AUC test Optuna: {auc_optuna_test:.4f}
""")



AUC test baseline: 0.9314
AUC test random search: 0.9298
AUC test Optuna: 0.9330
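
As a reminder of what these numbers measure, AUC-ROC is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The small worked example below uses made-up labels and scores to check that reading against `roc_auc_score`.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]

# Count positive/negative pairs ranked correctly; ties count as half.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc)                     # 0.75
print(roc_auc_score(y_true, y_score))   # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. A test AUC around 0.93 means roughly 93% of such pairs are ranked correctly.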

