# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a binary classification problem in which we try to predict 1 (diabetic) or 0 (non-diabetic). All the variables are numeric.

### 1. Load the libraries you consider common to the notebook

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Read the data from [this address](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```

In [5]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

df = pd.read_csv(url, names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'], sep=',')
df

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


### 3. Bagging
For this section you will have to build an ensemble using the bagging technique ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)), combining 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) estimators. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this section and the following ones, you do not need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a random seed.
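
To make the idea behind bagging concrete, here is a minimal pure-Python sketch: each base learner is trained on a bootstrap sample (rows drawn with replacement) and the ensemble predicts by majority vote. The one-dimensional threshold "stump", the toy data and the helper names (`fit_stump`, `bagging_fit`, `bagging_predict`) are assumptions made for illustration; this is not how `BaggingClassifier` is implemented internally.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold on a 1-D feature that misclassifies the fewest points.
    The stump predicts 1 when x > threshold, else 0."""
    best_threshold, best_errors = None, len(points) + 1
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_estimators, seed):
    """Train one stump per bootstrap sample of the data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap: draw len(points) rows with replacement
        sample = [rng.choice(points) for _ in points]
        stumps.append(fit_stump(sample))
    return stumps

def bagging_predict(stumps, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): (feature value, class label)
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
stumps = bagging_fit(data, n_estimators=25, seed=2023)
low_pred = bagging_predict(stumps, 0.2)
high_pred = bagging_predict(stumps, 5.0)
print(low_pred, high_pred)
```

An odd number of estimators avoids ties in the vote for a two-class problem.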

In [6]:
x = df.drop(columns=["class"])
y = df["class"]

In [7]:
print("Features shape:", x.shape)
print("Target shape:", y.shape)

Features shape: (768, 8)
Target shape: (768,)


In [8]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

In [26]:
seed = 2023

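# 100 decision trees, each trained on a bootstrap sample of the rows.
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`.
tree_class = DecisionTreeClassifier()
bagging = BaggingClassifier(base_estimator=tree_class, n_estimators=100, random_state=seed)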

In [27]:
resultado_bag = cross_val_score(bagging,x, y,cv=10)
print(resultado_bag.mean())

0.7643369788106631
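
The value above is the mean of ten per-fold accuracies. As a rough sketch of what a 10-fold split looks like for 768 rows, the hypothetical helper `kfold_indices` below builds contiguous, unshuffled folds. Note that when `cross_val_score` receives an integer `cv` and a classifier, scikit-learn uses stratified folds; this sketch ignores stratification for simplicity.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(768, 10))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

Each row is used for testing exactly once and for training in the other nine folds.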


### 4. Random Forest
In this case, train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and a `max_features` of 3, again with cross-validation.

In [1]:
from sklearn.ensemble import RandomForestClassifier

In [11]:
arbolito = RandomForestClassifier(n_estimators=100,max_features=3, random_state=seed)

In [12]:
resultado_arbolito = cross_val_score(arbolito,x,y,cv=10)
print(resultado_arbolito.mean())

0.7681989063568011
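
With `max_features=3`, each split in each tree only considers a random subset of 3 of the 8 features, which is what decorrelates the trees compared with plain bagging. The snippet below is only an illustration of that sampling step using the standard library; `feature_names` and `candidates_per_split` are hypothetical names, and the real selection happens inside scikit-learn's tree builder.

```python
import math
import random

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
rng = random.Random(2023)

# Draw the candidate features for three hypothetical splits
candidates_per_split = [rng.sample(feature_names, k=3) for _ in range(3)]
for candidates in candidates_per_split:
    print(candidates)

# Number of distinct 3-feature subsets a split can draw from
n_subsets = math.comb(len(feature_names), 3)
print(n_subsets)
```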


### 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 trees.

In [14]:
from sklearn.ensemble import AdaBoostClassifier

In [34]:
# Base learner: an unpruned decision tree (AdaBoost's default is a depth-1 stump)
dtc = DecisionTreeClassifier()
ada_b = AdaBoostClassifier(base_estimator=dtc, n_estimators=30, random_state=seed, learning_rate=0.3)

In [35]:
result_ada = cross_val_score(ada_b,x,y,cv=10)
print(result_ada.mean())

0.7004272043745728
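
AdaBoost works by reweighting the samples after every round so that the next learner concentrates on the mistakes. The sketch below shows one round of the discrete two-class update (weighted error, learner weight, renormalized sample weights). The function name `adaboost_round` and the toy weights are assumptions for illustration, not scikit-learn's code. It also hints at a possible reason for the low score above: a fully grown tree usually has zero training error, which leaves boosting with nothing to reweight.

```python
import math

def adaboost_round(weights, misclassified, learning_rate=1.0):
    """One discrete AdaBoost round for a two-class problem.

    weights:       current sample weights
    misclassified: booleans, True where the weak learner was wrong
    Returns the learner weight (alpha) and the normalized new sample weights.
    """
    err = sum(w for w, miss in zip(weights, misclassified) if miss) / sum(weights)
    if err <= 0 or err >= 0.5:
        raise ValueError("the update needs a weighted error strictly between 0 and 0.5")
    alpha = learning_rate * math.log((1 - err) / err)
    # Only misclassified samples are up-weighted; normalization shrinks the rest
    updated = [w * math.exp(alpha) if miss else w for w, miss in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five equally weighted samples, one of them misclassified
alpha, new_weights = adaboost_round([0.2] * 5, [False, False, True, False, False])
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

With `learning_rate=1.0`, the misclassified samples end up holding half of the total weight after the update.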


### 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.

In [17]:
from sklearn.ensemble import GradientBoostingClassifier

In [20]:
grad_boo = GradientBoostingClassifier(n_estimators=100,random_state=seed)

In [21]:
result_grad = cross_val_score(grad_boo,x,y,cv=10)

In [22]:
print(result_grad.mean())

0.7604066985645933
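
Gradient boosting adds estimators one at a time, each fitted to the errors left by the current ensemble and scaled by a learning rate. The sketch below shows that loop for a regression problem with squared error, where the "errors" are simply the residuals; `GradientBoostingClassifier` applies the same idea to the gradient of a classification loss. The regression stump, the toy data and all function names are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost(xs, ys, n_estimators, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy data (hypothetical)
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

mse_baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_1 = mse(ys, gradient_boost(xs, ys, n_estimators=1, learning_rate=0.1))
mse_100 = mse(ys, gradient_boost(xs, ys, n_estimators=100, learning_rate=0.1))
print(mse_baseline, mse_1, mse_100)
```

A smaller learning rate makes each step more conservative, which is why it is usually tuned together with the number of estimators.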


### 7. XGBoost
For this section use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) with 100 estimators. XGBoost is not part of sklearn's suite of models, so you will have to install it with `pip install xgboost`.

In [23]:
# !pip install XGBoost

Collecting XGBoost
  Downloading xgboost-1.6.2-py3-none-win_amd64.whl (125.4 MB)
     ------------------------------------- 125.4/125.4 MB 17.7 MB/s eta 0:00:00
Installing collected packages: XGBoost
Successfully installed XGBoost-1.6.2


In [29]:
from xgboost import XGBClassifier

In [31]:
xg = XGBClassifier(n_estimators = 100, random_state = seed)

In [32]:
result_xg = cross_val_score(xg,x,y,cv=10)

In [33]:
print(result_xg.mean())

0.7357142857142857


### 8. Primeros resultados
Create a dataframe with the results and their algorithms, sorted from highest to lowest.

In [46]:
# Use a separate name so the original dataset in `df` is not overwritten
results = pd.DataFrame([resultado_bag.mean(),resultado_arbolito.mean(),result_ada.mean(),result_grad.mean(),result_xg.mean()], columns=["Resultado"], index=["BAG", "RF", "ADA", "GRAD", "XG"])

In [49]:
results.sort_values(by="Resultado",ascending=False)

Unnamed: 0,Resultado
RF,0.768199
BAG,0.764337
GRAD,0.760407
XG,0.735714
ADA,0.700427


### 9. Hiperparametrización
Train the models again, but this time split the dataset into train/test and use a grid search to find the best hyperparameters.

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [51]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(614, 8)
(154, 8)
(614,)
(154,)


In [61]:
# Bagging
# Define the base estimator before passing it to the ensemble
tree_class = DecisionTreeClassifier()
grid_bag = BaggingClassifier(base_estimator=tree_class,random_state=seed)
parametros = {
    "n_estimators": [10,20,40,60,80,100],
    "max_samples" : [20,40,60,80,100],  # integers: absolute number of rows per estimator
    "max_features":[2,3,4,5,6,7,8],
    "bootstrap": [True, False]}
grid_search = GridSearchCV(estimator=grid_bag,param_grid=parametros,cv=10)
grid_search.fit(x_train,y_train)


GridSearchCV(cv=10,
             estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                         random_state=2023),
             param_grid={'bootstrap': [True, False],
                         'max_features': [2, 3, 4, 5, 6, 7, 8],
                         'max_samples': [20, 40, 60, 80, 100],
                         'n_estimators': [10, 20, 40, 60, 80, 100]})

In [62]:
print("Best parameters:", grid_search.best_params_)
print("Score:", grid_search.best_score_)

Best parameters: {'bootstrap': False, 'max_features': 8, 'max_samples': 40, 'n_estimators': 100}
Score: 0.778476996298255


In [66]:
# Random Forest
grid_arbolito = RandomForestClassifier(random_state=seed)
param_arbolito = {
    "n_estimators": [10,20,40,60,80,100],
    "max_depth": [2,4,6,8,10],
    "max_features": [2,3,4,5,6,7,8],
    "min_samples_leaf": [2,4,6,8,10],
    "min_samples_split": [2,4,6,8,10]
}
grid_arbol = GridSearchCV(estimator=grid_arbolito,param_grid=param_arbolito, cv=10)
grid_arbol.fit(x_train, y_train)


GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=2023),
             param_grid={'max_depth': [2, 4, 6, 8, 10],
                         'max_features': [2, 3, 4, 5, 6, 7, 8],
                         'min_samples_leaf': [2, 4, 6, 8, 10],
                         'min_samples_split': [2, 4, 6, 8, 10],
                         'n_estimators': [10, 20, 40, 60, 80, 100]})

In [67]:
print("Best parameters:", grid_arbol.best_params_)
print("Score:", grid_arbol.best_score_)

Best parameters: {'max_depth': 8, 'max_features': 3, 'min_samples_leaf': 2, 'min_samples_split': 8, 'n_estimators': 80}
Score: 0.7865150713907985
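
An exhaustive grid search trains one model per parameter combination and per fold, so its cost grows multiplicatively with every value added. The short calculation below counts the candidates and fits implied by the Bagging and Random Forest grids above; the helper `count_fits` is a hypothetical name. It leaves out the final refit of the best candidate that `GridSearchCV` performs by default.

```python
from itertools import product

bagging_grid = {
    "n_estimators": [10, 20, 40, 60, 80, 100],
    "max_samples": [20, 40, 60, 80, 100],
    "max_features": [2, 3, 4, 5, 6, 7, 8],
    "bootstrap": [True, False],
}
rf_grid = {
    "n_estimators": [10, 20, 40, 60, 80, 100],
    "max_depth": [2, 4, 6, 8, 10],
    "max_features": [2, 3, 4, 5, 6, 7, 8],
    "min_samples_leaf": [2, 4, 6, 8, 10],
    "min_samples_split": [2, 4, 6, 8, 10],
}

def count_fits(param_grid, cv):
    """Return (number of candidates, number of model fits across all folds)."""
    n_candidates = sum(1 for _ in product(*param_grid.values()))
    return n_candidates, n_candidates * cv

bagging_counts = count_fits(bagging_grid, cv=10)
rf_counts = count_fits(rf_grid, cv=10)
print(bagging_counts)  # (420, 4200)
print(rf_counts)       # (5250, 52500)
```

If the search becomes too slow, passing `n_jobs=-1` to `GridSearchCV` or switching to `RandomizedSearchCV` are common ways to reduce the wall-clock time.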


In [68]:
#Ada Boost
ada = AdaBoostClassifier(random_state=seed)
param_ada = {
    "n_estimators": [10,20,40,60,80,100,200],
    "learning_rate": [0.1, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    "algorithm": ["SAMME", "SAMME.R"]
}
grid_ada = GridSearchCV(estimator=ada, param_grid=param_ada, cv=10)
grid_ada.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=AdaBoostClassifier(random_state=2023),
             param_grid={'algorithm': ['SAMME', 'SAMME.R'],
                         'learning_rate': [0.1, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8,
                                           0.9, 1],
                         'n_estimators': [10, 20, 40, 60, 80, 100, 200]})

In [69]:
print("Best parameters:", grid_ada.best_params_)
print("Score:", grid_ada.best_score_)

Best parameters: {'algorithm': 'SAMME', 'learning_rate': 0.5, 'n_estimators': 100}
Score: 0.7751983077736648


In [70]:
#Gradient Boosting
gb = GradientBoostingClassifier(random_state=seed)
param_gb = {
    "n_estimators": [10,20,40,60,80,100,200],
    "learning_rate": [0.1, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    "max_depth": [2,4,6,8,10,15]
}
grid_gb = GridSearchCV(estimator=gb, param_grid=param_gb, cv=10)
grid_gb.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=GradientBoostingClassifier(random_state=2023),
             param_grid={'learning_rate': [0.1, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8,
                                           0.9, 1],
                         'max_depth': [2, 4, 6, 8, 10, 15],
                         'n_estimators': [10, 20, 40, 60, 80, 100, 200]})

In [71]:
print("Best parameters:", grid_gb.best_params_)
print("Score:", grid_gb.best_score_)

Best parameters: {'learning_rate': 0.4, 'max_depth': 4, 'n_estimators': 60}
Score: 0.7833685880486515


In [72]:
# XG Boost
xgb = XGBClassifier(random_state= seed)
param_xgb = {
    "n_estimators": [10,20,40,60,80,100,200],
    "learning_rate": [0.1, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    "max_depth": [2,4,6,8,10,15],
    "booster":["gbtree", "gblinear"]
}
grid_xgb = GridSearchCV(estimator=xgb, param_grid=param_xgb, cv=10)
grid_xgb.fit(x_train,y_train)

Parameters: { "max_depth" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.



GridSearchCV(cv=10,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_c...
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                        

In [73]:
print("Best parameters:", grid_xgb.best_params_)
print("Score:", grid_xgb.best_score_)

Best parameters: {'booster': 'gbtree', 'learning_rate': 0.6, 'max_depth': 2, 'n_estimators': 20}
Score: 0.7865415124272871
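
The `"max_depth" might not be used` warnings most likely come from the `gblinear` candidates, since `max_depth` is a tree parameter and the linear booster has no trees. One way to avoid them is to give `GridSearchCV` a list of grids, one per booster, which it accepts as `param_grid`. The sketch below builds such a list (the name `split_grid` and the helper `n_candidates` are hypothetical) and compares its size with the single combined grid.

```python
from itertools import product

n_estimators = [10, 20, 40, 60, 80, 100, 200]
learning_rates = [0.1, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
max_depths = [2, 4, 6, 8, 10, 15]

# Single grid, as used above: every booster is crossed with every max_depth
combined_grid = {
    "n_estimators": n_estimators,
    "learning_rate": learning_rates,
    "max_depth": max_depths,
    "booster": ["gbtree", "gblinear"],
}

# One grid per booster: max_depth is only searched for the tree booster
split_grid = [
    {"booster": ["gbtree"], "n_estimators": n_estimators,
     "learning_rate": learning_rates, "max_depth": max_depths},
    {"booster": ["gblinear"], "n_estimators": n_estimators,
     "learning_rate": learning_rates},
]

def n_candidates(grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(len(list(product(*g.values()))) for g in grids)

combined_count = n_candidates(combined_grid)
split_count = n_candidates(split_grid)
print(combined_count, split_count)  # 756 441
```

Besides silencing the warning, the split grid skips the `gblinear` candidates that differ only in an unused `max_depth`.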


### 10. Final conclusions

In [74]:
print("Score Bagging:", grid_search.best_score_)
print("Score Random Forest:", grid_arbol.best_score_)
print("Score AdaBoost:", grid_ada.best_score_)
print("Score GB:", grid_gb.best_score_)
print("Score XGB:", grid_xgb.best_score_)

Score Bagging: 0.778476996298255
Score Random Forest: 0.7865150713907985
Score AdaBoost: 0.7751983077736648
Score GB: 0.7833685880486515
Score XGB: 0.7865415124272871


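The best cross-validated scores were obtained with XGBoost and Random Forest (about 0.7865 each), followed by Gradient Boosting (0.7834), Bagging (0.7785) and AdaBoost (0.7752). These are 10-fold cross-validation scores on the training split; the held-out test set has not been evaluated.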
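
A natural follow-up is to score the tuned model on the held-out split, since `best_score_` only reflects cross-validation on the training data. The sketch below shows the pattern with `GridSearchCV.score`, which evaluates the refitted best estimator. To stay self-contained it uses synthetic data from `make_classification` as a stand-in for the diabetes dataset and a deliberately tiny grid; `X_demo`, `y_demo`, `demo_grid` and the split names are hypothetical. With the notebook's objects, the equivalent call would be along the lines of `grid_arbol.score(x_test, y_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in with the same shape as the diabetes data (assumption)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2023)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Tiny grid so the example runs quickly
demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=2023),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=5,
)
demo_grid.fit(X_tr, y_tr)

cv_score = demo_grid.best_score_          # mean accuracy across the CV folds
test_score = demo_grid.score(X_te, y_te)  # accuracy of the refitted best model on unseen rows
print("CV score:", cv_score)
print("Test score:", test_score)
```

Comparing the two numbers gives a rough idea of how optimistic the cross-validated estimate is for the selected hyperparameters.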