### Improving accuracy in the diabetes detection project
In this exercise we continue analyzing the diabetes detection model, this time using XGBoost and comparing it with the previous models. These were the earlier results:
| Metric        | Decision Tree | Optimized Random Forest  |
| ------------- | ------------- | ------------------------ |
| **Accuracy**  | **0.7792**    | 0.7532                   |
| **Precision** | 0.6667        | 0.6441                   |
| **Recall**    | **0.7407**    | 0.6909                   |
| **F1 Score**  | **0.7018**    | 0.6667                   |
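
As a reminder of how the four metrics in the table relate to each other, all of them can be derived from the counts of a binary confusion matrix. The sketch below is illustrative only: the helper name `binary_metrics` is hypothetical, and so are the counts passed to it. They were picked so that the output matches the Decision Tree column to four decimals, but the actual confusion matrix is not shown in this notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 154-row test set (assumption, not from the source)
metrics = binary_metrics(tp=40, fp=20, fn=14, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Since F1 is the harmonic mean of precision and recall, a model can only score well on it when both are reasonably high, which is why it is a useful complement to accuracy on an imbalanced target.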


## 1. Import libraries and the previous model

In [48]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

from sklearn.metrics import mean_squared_error, r2_score
import pickle
from pickle import dump

import warnings
warnings.filterwarnings('ignore')

In [49]:
archivo = "https://breathecode.herokuapp.com/asset/internal-link?id=930&path=diabetes.csv"

df = pd.read_csv(archivo, sep=",")

# Define X and y as in the original project
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

# Same split as in the original project
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [50]:
# Load the previous model
with open("modelo_randomforest_opt_diabetes.pkl", "rb") as f:
    modelo_cargado = pickle.load(f)
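
For context, a `.pkl` file like the one loaded here is typically written with `pickle.dump` and read back with `pickle.load`. The sketch below shows that round trip with a plain dictionary standing in for the trained model. The stand-in object, the temporary directory and the file name `modelo_demo.pkl` are assumptions for illustration; they are not the original Random Forest file.

```python
import os
import pickle
import tempfile

# Stand-in for a trained estimator (hypothetical; the real file holds a Random Forest)
modelo_demo = {"name": "stand-in model", "params": {"n_estimators": 100, "random_state": 42}}

with tempfile.TemporaryDirectory() as tmp_dir:
    ruta = os.path.join(tmp_dir, "modelo_demo.pkl")

    # Serialize the object to disk in binary mode
    with open(ruta, "wb") as f:
        pickle.dump(modelo_demo, f)

    # Read it back, mirroring how the previous model is loaded
    with open(ruta, "rb") as f:
        modelo_cargado = pickle.load(f)

# The loaded object is an equal but separate copy of the original
print(modelo_cargado == modelo_demo)
```

Unpickling can execute arbitrary code, so files of this kind should only be loaded from trusted sources.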

In [51]:
X_train.head()

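|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI  | DiabetesPedigreeFunction | Age |
| --- | ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- |
| 60  | 2           | 84      | 0             | 0             | 0       | 0.0  | 0.304                    | 21  |
| 618 | 9           | 112     | 82            | 24            | 0       | 28.2 | 1.282                    | 50  |
| 346 | 1           | 139     | 46            | 19            | 83      | 28.7 | 0.654                    | 22  |
| 294 | 0           | 161     | 50            | 0             | 0       | 21.9 | 0.254                    | 65  |
| 231 | 6           | 134     | 80            | 37            | 370     | 46.2 | 0.238                    | 46  |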


In [52]:
# Models
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb = XGBClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, subsample=0.8, colsample_bytree=0.8, random_state=42, eval_metric="logloss")
lgb = LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state = 42)

# Training
ada.fit(X_train, y_train)
gb.fit(X_train, y_train)
xgb.fit(X_train, y_train)
lgb.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 213, number of negative: 401
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000067 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 664
[LightGBM] [Info] Number of data points in the train set: 614, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.346906 -> initscore=-0.632669
[LightGBM] [Info] Start training from score -0.632669


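LGBMClassifier parameters:

| Parameter         | Value  |
| ----------------- | ------ |
| boosting_type     | 'gbdt' |
| num_leaves        | 31     |
| max_depth         | -1     |
| learning_rate     | 0.1    |
| n_estimators      | 100    |
| subsample_for_bin | 200000 |
| objective         |        |
| class_weight      |        |
| min_split_gain    | 0.0    |
| min_child_weight  | 0.001  |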


In [53]:
# Predictions
ada_y_pred_test = ada.predict(X_test)
gb_y_pred_test = gb.predict(X_test)
xgb_y_pred_test = xgb.predict(X_test)
lgb_y_pred_test = lgb.predict(X_test)

ada_y_pred_train = ada.predict(X_train)
gb_y_pred_train = gb.predict(X_train)
xgb_y_pred_train = xgb.predict(X_train)
lgb_y_pred_train = lgb.predict(X_train)

In [54]:
# Metrics
ada_accuracy_test = accuracy_score(y_test, ada_y_pred_test)
ada_accuracy_train = accuracy_score(y_train, ada_y_pred_train)

gb_accuracy_test = accuracy_score(y_test, gb_y_pred_test)
gb_accuracy_train = accuracy_score(y_train, gb_y_pred_train)

xgb_accuracy_test = accuracy_score(y_test, xgb_y_pred_test)
xgb_accuracy_train = accuracy_score(y_train, xgb_y_pred_train)

lgb_accuracy_test = accuracy_score(y_test, lgb_y_pred_test)
lgb_accuracy_train = accuracy_score(y_train, lgb_y_pred_train)

print('AdaBoost')
print("Accuracy Test: ", ada_accuracy_test)
print("Accuracy Train: ", ada_accuracy_train)

print('Gradient Boosting')
print("Accuracy Test: ", gb_accuracy_test)
print("Accuracy Train: ", gb_accuracy_train)

print('XGBoost')
print("Accuracy Test: ", xgb_accuracy_test)
print("Accuracy Train: ", xgb_accuracy_train)

print('LightGBM')
print("Accuracy Test: ", lgb_accuracy_test)
print("Accuracy Train: ", lgb_accuracy_train)


AdaBoost
Accuracy Test:  0.7792207792207793
Accuracy Train:  0.7785016286644951
Gradient Boosting
Accuracy Test:  0.7402597402597403
Accuracy Train:  0.9381107491856677
XGBoost
Accuracy Test:  0.7532467532467533
Accuracy Train:  0.9153094462540716
LightGBM
Accuracy Test:  0.7207792207792207
Accuracy Train:  1.0
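
One simple way to quantify the overfitting visible in these numbers is the gap between train and test accuracy. The sketch below uses the accuracies printed above, rounded to four decimals. The dictionary name `results` and the helper `gap` are illustrative and not part of the original notebook.

```python
# (test accuracy, train accuracy) copied from the printed output, rounded to 4 decimals
results = {
    "AdaBoost": (0.7792, 0.7785),
    "Gradient Boosting": (0.7403, 0.9381),
    "XGBoost": (0.7532, 0.9153),
    "LightGBM": (0.7208, 1.0000),
}


def gap(test_acc, train_acc):
    """Train minus test accuracy; a large positive value suggests overfitting."""
    return round(train_acc - test_acc, 4)


# Sort models from the smallest to the largest train/test gap
ranking = sorted(results.items(), key=lambda item: gap(*item[1]))
for model, (test_acc, train_acc) in ranking:
    print(f"{model:<18} test={test_acc:.4f} train={train_acc:.4f} gap={gap(test_acc, train_acc):+.4f}")
```

With these figures AdaBoost has a gap of about -0.0007, XGBoost about 0.16, Gradient Boosting about 0.20 and LightGBM about 0.28.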


After trying the different algorithms, the results show that Decision Tree and AdaBoost are the most stable models: both reach a test accuracy of 0.7792 with only a small difference between train and test accuracy. The rest show varying degrees of overfitting.


| Model                          | Accuracy Test | Accuracy Train | My comments                                                    |
| ------------------------------ | ------------- | -------------- | -------------------------------------------------------------- |
| **Decision Tree**              | **0.7792**    | 0.8127         | Works reasonably well on this dataset                          |
| **Optimized Random Forest**    | 0.7532        | 0.91           | Shows overfitting                                              |
| **AdaBoost (100 estimators)**  | **0.7792**    | 0.7785         | Best balance; no overfitting                                   |
| **AdaBoost (200 estimators)**  | 0.7597        | 0.7833         | Beyond 100 trees it adds noise and does not improve the result |
| **Gradient Boosting**          | 0.7403        | 0.9381         | Severe overfitting                                             |
| **XGBoost**                    | 0.7532        | 0.9153         | Moderate overfitting; needs tuning                             |
| **LightGBM**                   | 0.7208        | **1.000**      | Not suitable for this dataset; overfits                        |
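
Several comments in the table point to overfitting that might be reduced with tuning. One possible approach, sketched here under stated assumptions, is a small cross-validated grid search over parameters that act as regularizers (shallower trees, row subsampling). The example uses scikit-learn's `GradientBoostingClassifier` on synthetic data from `make_classification`, chosen only to match the shape implied by the training log (614 training rows, 8 features). The grid values, the variable names `search` and `best_model`, and the synthetic data are all assumptions, so the scores it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): same shape as the diabetes set, different content
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Small illustrative grid: tree depth and row subsampling both limit model complexity
param_grid = {
    "max_depth": [1, 2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training split with the best parameters
best_model = search.best_estimator_
print("Best params:", search.best_params_)
print("Accuracy Train:", best_model.score(X_train, y_train))
print("Accuracy Test: ", best_model.score(X_test, y_test))
```

The same pattern should carry over to `XGBClassifier` or `LGBMClassifier`, since both expose a scikit-learn compatible interface, although their parameter names differ in places.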
