## Random Forest Project: *Diabetes Diagnosis*

- *We start from the same dataset used for the Decision Tree model, so it has already been through a complete EDA.*
- *It contains 8 predictor variables (patient characteristics), 1 binary target variable (whether or not the patient has diabetes), and 22524 patients.*

- *We will train 2 models: a Random Forest with default hyperparameters and another with hyperparameter optimization. To conclude, we will compare the metrics of both models.*

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Explicit imports instead of `from sklearn.metrics import *`
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             precision_score, recall_score)
from imblearn.metrics import specificity_score

In [11]:
ds = pd.read_csv('DT_diabetes.csv')
ds

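| | gender | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | diabetes | smoking_never |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 44 | 0 | 0 | 19 | 6 | 200 | 1 | 1 |
| 1 | 0 | 53 | 0 | 0 | 27 | 6 | 85 | 0 | 1 |
| 2 | 0 | 67 | 0 | 0 | 25 | 5 | 200 | 0 | 1 |
| 3 | 1 | 37 | 0 | 0 | 25 | 3 | 159 | 0 | 0 |
| 4 | 1 | 67 | 0 | 1 | 27 | 6 | 200 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22519 | 0 | 61 | 0 | 0 | 30 | 6 | 240 | 1 | 0 |
| 22520 | 1 | 22 | 0 | 0 | 29 | 6 | 80 | 0 | 0 |
| 22521 | 1 | 66 | 0 | 0 | 27 | 5 | 155 | 0 | 0 |
| 22522 | 0 | 24 | 0 | 0 | 35 | 4 | 100 | 0 | 1 |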


### 1. Random Forest model without optimization

In [12]:
# Separate predictors and target, then hold out 20% of the rows for testing
X = ds.drop(['diabetes'], axis=1)
y = ds['diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=63)

In [14]:
model = RandomForestClassifier(random_state = 63)
model.fit(X_train, y_train)

In [20]:
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

In [18]:
def get_metrics(y_train, y_test, y_pred_train, y_pred_test):
    """Return a DataFrame with train metrics, test metrics, and their difference."""
    # Compute metrics for the training set
    # (AUC is computed here from hard class predictions, not probabilities)
    train_accuracy = accuracy_score(y_train, y_pred_train)
    train_f1 = f1_score(y_train, y_pred_train)
    train_auc = roc_auc_score(y_train, y_pred_train)
    train_precision = precision_score(y_train, y_pred_train)
    train_recall = recall_score(y_train, y_pred_train)
    train_specificity = specificity_score(y_train, y_pred_train)

    # Compute metrics for the test set
    test_accuracy = accuracy_score(y_test, y_pred_test)
    test_f1 = f1_score(y_test, y_pred_test)
    test_auc = roc_auc_score(y_test, y_pred_test)
    test_precision = precision_score(y_test, y_pred_test)
    test_recall = recall_score(y_test, y_pred_test)
    test_specificity = specificity_score(y_test, y_pred_test)

    # Compute the gap between training and test metrics
    diff_accuracy = train_accuracy - test_accuracy
    diff_f1 = train_f1 - test_f1
    diff_auc = train_auc - test_auc
    diff_precision = train_precision - test_precision
    diff_recall = train_recall - test_recall
    diff_specificity = train_specificity - test_specificity

    # Build a DataFrame with the results
    metrics_df = pd.DataFrame(
        [[train_accuracy, train_f1, train_auc, train_precision, train_recall, train_specificity],
         [test_accuracy, test_f1, test_auc, test_precision, test_recall, test_specificity],
         [diff_accuracy, diff_f1, diff_auc, diff_precision, diff_recall, diff_specificity]],
        columns=['Accuracy', 'F1', 'AUC', 'Precision', 'Recall', 'Specificity'],
        index=['Train', 'Test', 'Diferencia'])

    return metrics_df

In [21]:
get_metrics(y_train, y_test, train_pred, test_pred)

| | Accuracy | F1 | AUC | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| Train | 0.99384 | 0.991716 | 0.993003 | 0.993718 | 0.989721 | 0.996285 |
| Test | 0.881687 | 0.843649 | 0.869529 | 0.876829 | 0.812889 | 0.92617 |
| Diferencia | 0.112153 | 0.148067 | 0.123474 | 0.116889 | 0.176833 | 0.070116 |
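
The AUC column is computed from hard class predictions (`predict`), which reduces the ROC curve to a single operating point. A minimal sketch, on synthetic data rather than the diabetes dataset, of how the score could instead be computed from predicted probabilities with `predict_proba`; the data, sizes, and variable names below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=63)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=63)

clf = RandomForestClassifier(n_estimators=50, random_state=63).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only one threshold of the ROC curve is used
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the predicted probability of the positive class: full ROC curve
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```

With this approach `get_metrics` would need the probability arrays as extra arguments, so it is a possible refinement rather than a drop-in change.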


### 2. Random Forest optimized with GridSearch

In [33]:

forest = RandomForestClassifier(random_state=63)

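# Define the hyperparameter space to explore
param_grid = {
    'n_estimators': [100, 200],
    # Note: scikit-learn >= 1.3 rejects 'auto' for RandomForestClassifier, so every
    # fit that uses it fails and is scored as NaN (see the warning in the output)
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False]
}

# Configure GridSearchCV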
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=5, n_jobs=-1, verbose=0, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predictions on the training and test sets (metrics are computed in the next cell)
y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

160 fits failed out of a total of 480.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
58 fits failed with the following error:
Traceback (most recent call last):
  File "/home/vscode/.local/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/vscode/.local/lib/python3.10/site-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/home/vscode/.local/lib/python3.10/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/home/vscode/.local/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints


In [34]:
print("Mejores hiperparámetros:", best_params)
get_metrics(y_train, y_test, y_pred_train, y_pred_test)

Mejores hiperparámetros: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}


| | Accuracy | F1 | AUC | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| Train | 0.90782 | 0.867639 | 0.888146 | 0.932831 | 0.810964 | 0.965328 |
| Test | 0.894784 | 0.855664 | 0.877015 | 0.927393 | 0.794234 | 0.959795 |
| Diferencia | 0.013036 | 0.011975 | 0.011131 | 0.005438 | 0.01673 | 0.005533 |


In [30]:
# We run a second hyperparameter search around the previous best values to try to improve on them, although the result is very similar.

forest_2 = RandomForestClassifier(random_state=63)

# Define the hyperparameter space to explore
param_grid_2 = {
    'n_estimators': [90, 100],
    'max_features': ['sqrt'],
    'max_depth': [8, 10],
    'min_samples_split': [3, 5],
    'min_samples_leaf': [2],
    'bootstrap': [True]
}

# Configure GridSearchCV
grid_search_2 = GridSearchCV(estimator=forest_2, param_grid=param_grid_2, cv=5, n_jobs=-1, verbose=0, scoring='accuracy')
grid_search_2.fit(X_train, y_train)

best_params = grid_search_2.best_params_
print("Mejores hiperparámetros:", best_params)

best_model_2 = grid_search_2.best_estimator_

# Predictions on the training and test sets + metrics
y_pred_train = best_model_2.predict(X_train)
y_pred_test = best_model_2.predict(X_test)

get_metrics(y_train, y_test, y_pred_train, y_pred_test)

Mejores hiperparámetros: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}


| | Accuracy | F1 | AUC | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| Train | 0.907709 | 0.867353 | 0.887846 | 0.933551 | 0.809921 | 0.96577 |
| Test | 0.894118 | 0.854883 | 0.876466 | 0.92556 | 0.794234 | 0.958699 |
| Diferencia | 0.013591 | 0.01247 | 0.011379 | 0.007991 | 0.015687 | 0.007072 |


*Final conclusions:*

- *The gap between the train and test metrics shrinks after hyperparameter optimization (for accuracy, from 0.112 to 0.013), which means we have reduced overfitting.*
- *Most test metrics improve with the optimized model (accuracy, F1, AUC, precision, and specificity), while the train metrics drop, which indicates better generalization to unseen data. The exception is test recall, which falls slightly from 0.813 to 0.794.*
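
To make this comparison concrete, the test rows of the baseline table and the first GridSearch table can be subtracted directly. A small sketch; the numbers are copied from the outputs in this notebook, and the row labels `Baseline` and `GridSearch` are names chosen here for illustration:

```python
import pandas as pd

# Test-set metrics copied from the two result tables in this notebook
columns = ['Accuracy', 'F1', 'AUC', 'Precision', 'Recall', 'Specificity']
test_metrics = pd.DataFrame(
    [[0.881687, 0.843649, 0.869529, 0.876829, 0.812889, 0.92617],
     [0.894784, 0.855664, 0.877015, 0.927393, 0.794234, 0.959795]],
    columns=columns,
    index=['Baseline', 'GridSearch'],
)

# Positive values mean the optimized model scores higher on the test set
delta = test_metrics.loc['GridSearch'] - test_metrics.loc['Baseline']
print(delta.round(4))
```

Recall is the only metric with a negative change, which is worth keeping in mind for a diagnostic task where missed positive cases are costly.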

*QUESTIONS*:
- *I would like to know what the "warning" messages that appear after the GridSearch hyperparameter search mean. The search still runs and returns a result, but I want to know whether they appear because I am doing something wrong or whether they always appear when using GridSearch.*

- *Feedback on things I could improve :) Thank you!*
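
A likely explanation for the warning: it reports 160 failed fits out of 480, which is exactly the share of the first grid that uses `max_features='auto'`. Assuming scikit-learn 1.3 or later is installed, `'auto'` is no longer an accepted value for `RandomForestClassifier` (it previously behaved like `'sqrt'` for classifiers), so the message would come from that grid entry and not from GridSearch itself. The sketch below only counts fits with `ParameterGrid` and does not train anything; `cv_folds` and `fixed_grid` are illustrative names:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the first search
param_grid = {
    'n_estimators': [100, 200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False],
}
cv_folds = 5  # matches cv=5 in GridSearchCV

# Every parameter combination is fitted once per fold
combos = list(ParameterGrid(param_grid))
total_fits = len(combos) * cv_folds
auto_fits = sum(p['max_features'] == 'auto' for p in combos) * cv_folds
print(total_fits, auto_fits)  # 480 160

# Grid without the unsupported value
fixed_grid = {**param_grid, 'max_features': ['sqrt', 'log2']}
fixed_fits = len(ParameterGrid(fixed_grid)) * cv_folds
print(fixed_fits)  # 320
```

Because the failed fits are scored as NaN and skipped when ranking, the selected hyperparameters should be unaffected. Passing `error_score='raise'` to `GridSearchCV`, as the warning itself suggests, would surface the underlying error at the first failing fit.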

In [35]:
# Save the model
from joblib import dump
dump(best_model, 'modelo_entrenado_randomforest.joblib')

['modelo_entrenado_randomforest.joblib']
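
A saved model can be restored later with `joblib.load`. A minimal round-trip sketch on synthetic data, assuming a temporary file and a small stand-in classifier (the file name `demo_randomforest.joblib` and the variable names are hypothetical, not part of this project):

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=63)
clf = RandomForestClassifier(n_estimators=20, random_state=63).fit(X_demo, y_demo)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_randomforest.joblib')
    dump(clf, path)
    restored = load(path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same_predictions)
```

The restored model expects the same feature columns, in the same order, as the data it was trained on.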