## Boosting Project: *Diabetes diagnosis prediction*



- *We start from the same dataset used for the previous Random Forest model, so it has already been through a full EDA.*

- *It has 8 predictor variables (patient characteristics), 1 binary target variable (whether or not the patient has diabetes) and 22524 patients.*

- *We are going to train 4 models: a first XGBoost with default settings, a second and a third that try to improve the hyperparameters and, finally, a fourth XGBoost with the best hyperparameters found + Early Stopping.*

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from imblearn.metrics import specificity_score
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

In [2]:
ds = pd.read_csv('DT_diabetes.csv')
ds

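| | gender | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | diabetes | smoking_never |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 44 | 0 | 0 | 19 | 6 | 200 | 1 | 1 |
| 1 | 0 | 53 | 0 | 0 | 27 | 6 | 85 | 0 | 1 |
| 2 | 0 | 67 | 0 | 0 | 25 | 5 | 200 | 0 | 1 |
| 3 | 1 | 37 | 0 | 0 | 25 | 3 | 159 | 0 | 0 |
| 4 | 1 | 67 | 0 | 1 | 27 | 6 | 200 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22519 | 0 | 61 | 0 | 0 | 30 | 6 | 240 | 1 | 0 |
| 22520 | 1 | 22 | 0 | 0 | 29 | 6 | 80 | 0 | 0 |
| 22521 | 1 | 66 | 0 | 0 | 27 | 5 | 155 | 0 | 0 |
| 22522 | 0 | 24 | 0 | 0 | 35 | 4 | 100 | 0 | 1 |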


In [3]:
X_train, X_test, y_train, y_test = train_test_split(ds.drop(['diabetes'], axis=1), ds['diabetes'], test_size=0.2, random_state=63)

### 1. Training XGBoost with default settings

In [4]:
xgboost_model_1 = XGBClassifier(random_state = 63)
xgboost_model_1.fit(X_train, y_train)

In [5]:
test_pred = xgboost_model_1.predict(X_test)
train_pred = xgboost_model_1.predict(X_train)

In [6]:
def get_metrics(y_train, y_test, y_pred_train, y_pred_test):
    # Compute metrics for the training set
    # (AUC is computed here from hard class predictions, not probabilities)
    train_accuracy = accuracy_score(y_train, y_pred_train)
    train_f1 = f1_score(y_train, y_pred_train)
    train_auc = roc_auc_score(y_train, y_pred_train)
    train_precision = precision_score(y_train, y_pred_train)
    train_recall = recall_score(y_train, y_pred_train)
    train_specificity = specificity_score(y_train, y_pred_train)

    # Compute metrics for the test set
    test_accuracy = accuracy_score(y_test, y_pred_test)
    test_f1 = f1_score(y_test, y_pred_test)
    test_auc = roc_auc_score(y_test, y_pred_test)
    test_precision = precision_score(y_test, y_pred_test)
    test_recall = recall_score(y_test, y_pred_test)
    test_specificity = specificity_score(y_test, y_pred_test)

    # Compute the difference between train and test metrics
    diff_accuracy = train_accuracy - test_accuracy
    diff_f1 = train_f1 - test_f1
    diff_auc = train_auc - test_auc
    diff_precision = train_precision - test_precision
    diff_recall = train_recall - test_recall
    diff_specificity = train_specificity - test_specificity

    # Build a DataFrame with the results
    metrics_df = pd.DataFrame(
        [
            [train_accuracy, train_f1, train_auc, train_precision, train_recall, train_specificity],
            [test_accuracy, test_f1, test_auc, test_precision, test_recall, test_specificity],
            [diff_accuracy, diff_f1, diff_auc, diff_precision, diff_recall, diff_specificity],
        ],
        columns=['Accuracy', 'F1', 'AUC', 'Precision', 'Recall', 'Specificity'],
        index=['Train', 'Test', 'Diferencia'],
    )

    return metrics_df
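
For reference, the threshold-based metrics reported by `get_metrics` can all be derived from the four confusion-matrix counts. The sketch below is a minimal pure-Python illustration of those formulas, not the scikit-learn or imbalanced-learn implementation; the helper name `metrics_from_counts` and the example counts are hypothetical.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Derive threshold-based metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)            # of predicted positives, how many are real
    recall = tp / (tp + fn)               # sensitivity / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow
example = metrics_from_counts(tp=80, fp=10, tn=90, fn=20)
for name, value in example.items():
    print(f"{name:<12} {value:.4f}")
```

Recall and specificity look at the positive and negative class separately, which is why they are useful alongside accuracy when the two classes are not equally frequent.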

In [7]:
get_metrics(y_train, y_test, train_pred, test_pred)

| | Accuracy | F1 | AUC | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| Train | 0.931517 | 0.905455 | 0.9211 | 0.932166 | 0.880232 | 0.961967 |
| Test | 0.89323 | 0.859562 | 0.882429 | 0.888889 | 0.832109 | 0.932749 |
| Diferencia | 0.038287 | 0.045893 | 0.038671 | 0.043277 | 0.048124 | 0.029219 |
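
One detail about the AUC column: `get_metrics` passes hard 0/1 predictions to `roc_auc_score`. With binary predictions the ROC curve has a single operating point, so the area reduces to the mean of recall and specificity. The test row above is consistent with that: (0.832109 + 0.932749) / 2 ≈ 0.882429. The sketch below illustrates the effect with a small hand-written pairwise AUC; the helper `pairwise_auc` and the toy labels and scores are hypothetical. In the notebook itself, a probability-based AUC would typically be obtained by passing `predict_proba(X)[:, 1]` to `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.60, 0.90]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

recall = sum(1 for y, h in zip(y_true, hard) if y == 1 and h == 1) / y_true.count(1)
specificity = sum(1 for y, h in zip(y_true, hard) if y == 0 and h == 0) / y_true.count(0)

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_hard = pairwise_auc(y_true, hard)

print("AUC from probabilities:", round(auc_from_scores, 4))
print("AUC from hard labels:  ", round(auc_from_hard, 4))
print("(recall + specificity) / 2:", round((recall + specificity) / 2, 4))
```

Thresholding discards the ranking information inside each predicted class, so the two AUC values generally differ.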


- *The model seems to work quite well even with all default parameters. Still, there is some overfitting (train metrics are roughly 0.03 to 0.05 above test), so we will run a GridSearch to look for better hyperparameters.*


### 2. Training XGBoost with a first hyperparameter search

- *I chose the candidate values in the param_grid after searching the documentation for ways to reduce overfitting in XGBoost.*

In [12]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Initialize the model
xgboost_model_2 = xgb.XGBClassifier()

# Define the hyperparameters to tune
param_grid = {
    'eta': [0.05, 0.1],
    'max_depth': [4, 5],
    'min_child_weight': [1, 3],
    'subsample': [0.8, 1],
    'gamma': [0, 0.1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=xgboost_model_2, param_grid=param_grid,
                           scoring='accuracy', cv=3, verbose=1, n_jobs=-1)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 32 candidates, totalling 96 fits
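
The count in the log line above follows directly from the grid: five hyperparameters with two values each give 2^5 = 32 combinations, and each one is fitted once per fold. A minimal sketch of that arithmetic using only the standard library (the variable names are illustrative):

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'eta': [0.05, 0.1],
    'max_depth': [4, 5],
    'min_child_weight': [1, 3],
    'subsample': [0.8, 1],
    'gamma': [0, 0.1],
}
cv_folds = 3

# Enumerate every combination of hyperparameter values
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` additionally refits the best combination once on the whole training set; that extra fit is not part of the count in the log.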


In [13]:
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_


y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

In [14]:
print("Best hyperparameters:", best_params)
get_metrics(y_train, y_test, y_pred_train, y_pred_test)

Best hyperparameters: {'eta': 0.1, 'gamma': 0, 'max_depth': 4, 'min_child_weight': 1, 'subsample': 0.8}


| | Accuracy | F1 | AUC | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| Train | 0.906432 | 0.870507 | 0.893788 | 0.898525 | 0.844183 | 0.943393 |
| Test | 0.900333 | 0.867902 | 0.888576 | 0.904908 | 0.833804 | 0.943348 |
| Diferencia | 0.006099 | 0.002605 | 0.005212 | -0.006383 | 0.010379 | 4.5e-05 |


*We see a clear improvement in the metrics, above all in the smaller gap between train and test (the accuracy gap drops from 0.038 to 0.006), so the model's overfitting has been reduced.*

### 3. Training XGBoost with a second hyperparameter search

*We try to improve the metrics further.*

In [33]:
xgboost_model_3 = xgb.XGBClassifier()

param_grid = {
    'eta': [0.05, 0.1],
    'max_depth': [3, 4],
    'min_child_weight': [1, 2],
    'subsample': [0.8, 0.7],
    'gamma': [0, 0.1]
}


grid_search = GridSearchCV(estimator=xgboost_model_3, param_grid=param_grid, 
                           scoring='accuracy', cv=3, verbose=1, n_jobs=-1)


grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 32 candidates, totalling 96 fits


In [34]:
best_params = grid_search.best_params_


best_model = grid_search.best_estimator_

# Predictions on the train and test sets
y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

In [35]:
print("Best hyperparameters:", best_params)
get_metrics(y_train, y_test, y_pred_train, y_pred_test)

Best hyperparameters: {'eta': 0.1, 'gamma': 0, 'max_depth': 3, 'min_child_weight': 2, 'subsample': 0.8}


| | Accuracy | F1 | AUC | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| Train | 0.903019 | 0.865724 | 0.889758 | 0.896019 | 0.837411 | 0.942105 |
| Test | 0.896419 | 0.86078 | 0.884034 | 0.894215 | 0.829755 | 0.938313 |
| Diferencia | 0.0066 | 0.004944 | 0.005724 | 0.001804 | 0.007656 | 0.003792 |


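*The second search selects {'eta': 0.1, 'gamma': 0, 'max_depth': 3, 'min_child_weight': 2, 'subsample': 0.8}, which differs from the first search ({'eta': 0.1, 'gamma': 0, 'max_depth': 4, 'min_child_weight': 1, 'subsample': 0.8}) only in `max_depth` and `min_child_weight`. It does not improve on the first search: every test metric is slightly lower (for example, accuracy 0.896 versus 0.900), and the train/test gaps are of similar size.*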

### 4. Training XGBoost with the best hyperparameters found + Early Stopping

*The metrics are already quite satisfactory, but we will try training the model with Early Stopping: training stops if performance on the validation set does not improve for a fixed number of rounds, which can help prevent overfitting.*
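
The stopping rule itself can be sketched in a few lines of plain Python. This is an illustration of the idea, not XGBoost's internal code; the function name `early_stopping` and the validation losses are hypothetical.

```python
def early_stopping(val_losses, patience):
    """Return (best_round, last_round_run) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds pass without the
    validation loss dropping below the best value seen so far.
    """
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            return best_round, rnd  # patience exhausted: stop here
    return best_round, len(val_losses) - 1  # never triggered: all rounds ran


# Hypothetical validation log-loss per boosting round
val_losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.45, 0.48]
best_round, last_round = early_stopping(val_losses, patience=3)
print(f"best round: {best_round}, stopped after round: {last_round}")
```

The losses monitored here come from a validation set that is separate from the test set, so the test metrics remain an unbiased check of the final model.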

In [None]:
# Split the data into train, validation and test sets (70/15/15)
X_train, X_intermediate, y_train, y_intermediate = train_test_split(ds.drop(['diabetes'], axis=1), ds['diabetes'], test_size=0.3, random_state=63)
X_val, X_test, y_val, y_test = train_test_split(X_intermediate, y_intermediate, test_size=0.5, random_state=63)

# Initialize the model with the hyperparameters.
# eval_metric and early_stopping_rounds are set on the constructor, as expected by
# XGBoost >= 1.6 (passing them to fit() was deprecated and then removed in 2.0).
xgboost_model_final = xgb.XGBClassifier(
    eta=0.1,
    max_depth=4,
    min_child_weight=1,
    subsample=0.7,
    colsample_bytree=0.8,
    gamma=0.1,
    objective='binary:logistic',
    n_estimators=1000,  # upper bound; early stopping decides how many rounds are used
    random_state=63,
    eval_metric="logloss",
    early_stopping_rounds=10,
)

# Train the model with early stopping, monitoring log-loss on the validation set
xgboost_model_final.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    verbose=True,
)

# Predict on the test and train sets
test_pred = xgboost_model_final.predict(X_test)
train_pred = xgboost_model_final.predict(X_train)
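
This three-way split leaves fewer rows for training than the 80/20 split used in the earlier sections. The sketch below estimates the sizes, assuming scikit-learn rounds a fractional `test_size` up and gives the remainder to the training side; the helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), assuming the test size is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


n_patients = 22524  # dataset size stated at the top of the notebook

# 80/20 split used for models 1 to 3
train_80, test_20 = split_sizes(n_patients, 0.2)

# 70/30 split, then the 30% is halved into validation and test
train_70, holdout_30 = split_sizes(n_patients, 0.3)
val_15, test_15 = split_sizes(holdout_30, 0.5)

print(f"80/20    -> train={train_80}, test={test_20}")
print(f"70/15/15 -> train={train_70}, val={val_15}, test={test_15}")
```

Because the final model is trained and evaluated on different rows than the earlier ones, its metrics are not strictly comparable with theirs.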

In [31]:
train_pred = xgboost_model_final.predict(X_train)
test_pred = xgboost_model_final.predict(X_test)


get_metrics(y_train, y_test, train_pred, test_pred)

| | Accuracy | F1 | AUC | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| Train | 0.910567 | 0.876359 | 0.898115 | 0.905582 | 0.848964 | 0.947267 |
| Test | 0.893164 | 0.857594 | 0.882096 | 0.883022 | 0.833589 | 0.930602 |
| Diferencia | 0.017403 | 0.018765 | 0.01602 | 0.02256 | 0.015375 | 0.016665 |


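*The model does not seem to have improved with early stopping; if anything, the overfitting is worse (an accuracy gap of 0.017 versus about 0.006 for the grid-search models). Why? One possible explanation is that this run is not a like-for-like comparison: it uses a 70/15/15 split instead of 80/20, its hyperparameters (`subsample=0.7`, `gamma=0.1`, `colsample_bytree=0.8`) are not the ones selected by either grid search, and with `n_estimators=1000` early stopping may allow more boosting rounds than the default used by the earlier models.*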
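
To make the comparison across the four models easier to see, the sketch below collects the accuracy figures reported in the tables above and computes the train/test gap for each. The numbers are copied from this notebook's outputs; the dictionary layout and the labels are illustrative.

```python
# (train accuracy, test accuracy) as reported in the metric tables
reported_accuracy = {
    "1. default": (0.931517, 0.893230),
    "2. first grid search": (0.906432, 0.900333),
    "3. second grid search": (0.903019, 0.896419),
    "4. early stopping": (0.910567, 0.893164),
}

# Train/test gap as a rough indicator of overfitting
gaps = {name: train - test for name, (train, test) in reported_accuracy.items()}

best_test = max(reported_accuracy, key=lambda name: reported_accuracy[name][1])
smallest_gap = min(gaps, key=gaps.get)

for name, (train, test) in reported_accuracy.items():
    print(f"{name:<24} train={train:.4f}  test={test:.4f}  gap={gaps[name]:.4f}")
print("highest test accuracy:", best_test)
print("smallest gap:", smallest_gap)
```

Model 4 was evaluated on a different test split than models 1 to 3, so its row is only roughly comparable with the others.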

In [36]:
from joblib import dump
# Save the fitted estimator: GridSearchCV fits clones, so xgboost_model_3 itself is never trained
dump(best_model, 'modelo_entrenado_xgboost.joblib')

['modelo_entrenado_xgboost.joblib']