# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a binary classification problem: we try to predict 1 (diabetic) or 0 (non-diabetic).

### 1. Load the libraries that are shared across the notebook

In [34]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score

### 2. Read the data from [this URL](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```
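
As a minimal sketch of how the column names can be attached while reading, the helper below passes `names` straight to `pd.read_csv`. The function name `load_pima` is hypothetical, and the offline demo uses only the first two rows shown in this notebook's output so that it runs without network access.

```python
import io

import pandas as pd

URL = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']


def load_pima(source=URL):
    """Read the headerless CSV and label its columns (hypothetical helper)."""
    # header=None: the file has no header row; names= supplies the labels.
    return pd.read_csv(source, header=None, names=names)


# Offline demo with the first two rows of the dataset.
sample = io.StringIO("6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n")
demo = load_pima(sample)
print(demo.shape)  # (2, 9)
```

Calling `load_pima()` with no argument would read from the URL above instead of the in-memory sample.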

In [35]:
df = pd.read_csv("dataensembles.csv",header = None)

In [36]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df.columns = names
df

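|     | preg | plas | pres | skin | test | mass | pedi  | age | class |
|-----|------|------|------|------|------|------|-------|-----|-------|
| 0   | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1   | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2   | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3   | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4   | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |
| ... | ...  | ...  | ...  | ...  | ...  | ...  | ...   | ... | ...   |
| 763 | 10   | 101  | 76   | 48   | 180  | 32.9 | 0.171 | 63  | 0     |
| 764 | 2    | 122  | 70   | 27   | 0    | 36.8 | 0.340 | 27  | 0     |
| 765 | 5    | 121  | 72   | 23   | 112  | 26.2 | 0.245 | 30  | 0     |
| 766 | 1    | 126  | 60   | 0    | 0    | 30.1 | 0.349 | 47  | 1     |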


### 3. Bagging
For this section, build an ensemble using the bagging technique ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)) that combines 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) estimators. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this section and the following ones, you do not need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a random seed.
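
One possible way to follow the statement literally (100 bagged trees scored with 10-fold cross validation, no train/test split) is sketched below. It is not the approach taken in the cells that follow. The data is a synthetic stand-in from `make_classification` with the same shape as the diabetes table, and the seed value is arbitrary; both are assumptions made so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # arbitrary value, fixed only for reproducibility

# Stand-in data (assumption): 768 rows x 8 features, like the diabetes table.
X, y = make_classification(n_samples=768, n_features=8, random_state=SEED)

kfold = KFold(n_splits=10, shuffle=True, random_state=SEED)

# The base tree is passed positionally because its keyword name
# differs between scikit-learn releases.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=SEED),
    n_estimators=100,
    random_state=SEED,
)

# One accuracy value per fold; the mean summarizes the ensemble.
bagging_scores = cross_val_score(bagging, X, y, cv=kfold, scoring="accuracy")
print(bagging_scores.mean())
```

With the real data, `X` and `y` would be the feature columns and the `class` column of `df`.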

In [37]:
x = df[['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']]
y = df['class']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=42)

In [38]:
prueba = DecisionTreeClassifier(max_depth=5)

In [39]:
parametros = {
    "max_depth": np.arange(3,9),
    "min_samples_split" : [10,20,50],
    "min_samples_leaf": [30,40,50]
}

gs = GridSearchCV(estimator=prueba, param_grid=parametros, cv=10, scoring='accuracy', verbose=2)
gs.fit(x_train,y_train)
print(gs.best_estimator_)
print(gs.best_score_)
print(gs.best_params_)

Fitting 10 folds for each of 54 candidates, totalling 540 fits
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=10; total time=   0.0s
[CV] END max_depth=3, min_samples_leaf=30, min_samples_split=20; total time=   0.0s
[CV] END max_

In [40]:
estimator = gs.best_estimator_
bagg = BaggingClassifier(estimator = estimator,n_estimators=100,max_samples=100, bootstrap=True, random_state=42)

In [41]:
parametros = {
    "bootstrap": [True,False],
    "n_estimators" : [100,200],
    "max_samples":[50,100,200]
}

gs = GridSearchCV(estimator=bagg, param_grid=parametros, cv=10, scoring='accuracy', verbose=2)
gs.fit(x_train,y_train)
print(gs.best_estimator_)
print(gs.best_score_)
print(gs.best_params_)

Fitting 10 folds for each of 12 candidates, totalling 120 fits
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=100; total time=   0.1s
[CV] END ...bootstrap=True, max_samples=50, n_estimators=200; total time=   0.3s
[CV] END ...bootstrap=True, max_samples=50, n_

In [43]:
bagg2 = gs.best_estimator_
pred = bagg2.predict(x_test)

In [44]:
accuracy_score(y_test,pred)

0.7359307359307359

### 4. Random Forest
Here, train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and `max_features` set to 3, again using cross validation.
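
A minimal sketch of this step, assuming the same 10-fold setup as the bagging section. The data is a synthetic stand-in (an assumption) and the seed is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # arbitrary value, fixed only for reproducibility

# Stand-in data (assumption): 768 rows x 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=SEED)

kfold = KFold(n_splits=10, shuffle=True, random_state=SEED)

# 100 trees; each split considers 3 randomly chosen features.
forest = RandomForestClassifier(n_estimators=100, max_features=3, random_state=SEED)

rf_scores = cross_val_score(forest, X, y, cv=kfold, scoring="accuracy")
print(rf_scores.mean())
```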

### 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 trees.
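
A minimal sketch, assuming the default base learner (a depth-1 decision tree) is acceptable and reusing 10-fold cross validation. The synthetic data and the seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # arbitrary value, fixed only for reproducibility

# Stand-in data (assumption): 768 rows x 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=SEED)

kfold = KFold(n_splits=10, shuffle=True, random_state=SEED)

# 30 boosting rounds; the default weak learner is a decision stump.
ada = AdaBoostClassifier(n_estimators=30, random_state=SEED)

ada_scores = cross_val_score(ada, X, y, cv=kfold, scoring="accuracy")
print(ada_scores.mean())
```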

### 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.
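
A minimal sketch with every other hyperparameter left at its default, again scored with 10-fold cross validation. The synthetic data and the seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # arbitrary value, fixed only for reproducibility

# Stand-in data (assumption): 768 rows x 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=SEED)

kfold = KFold(n_splits=10, shuffle=True, random_state=SEED)

# 100 sequentially fitted trees, each correcting the previous ones.
gboost = GradientBoostingClassifier(n_estimators=100, random_state=SEED)

gb_scores = cross_val_score(gboost, X, y, cv=kfold, scoring="accuracy")
print(gb_scores.mean())
```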

### 7. XGBoost
For this section use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) with 100 estimators (in the `xgboost` package the class is named `XGBClassifier`). XGBoost is not part of scikit-learn's model suite, so you will need to install it with `pip install xgboost`.
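
The sketch below assumes the `xgboost` package and its `XGBClassifier` class. It uses a manual fold loop with a fresh model per fold, and skips the evaluation when the package is not installed. The helper name `xgb_cv_accuracy`, the synthetic data, and the seed are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

try:
    from xgboost import XGBClassifier  # third-party: pip install xgboost
except ImportError:
    XGBClassifier = None  # package missing; the evaluation below is skipped

SEED = 42  # arbitrary value, fixed only for reproducibility

# Stand-in data (assumption): 768 rows x 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=SEED)


def xgb_cv_accuracy(X, y, n_splits=10):
    """Mean accuracy over a manual k-fold loop (hypothetical helper)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    fold_scores = []
    for train_idx, test_idx in folds.split(X):
        # A new 100-estimator model is fitted on each training fold.
        model = XGBClassifier(n_estimators=100, random_state=SEED)
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        fold_scores.append(accuracy_score(y[test_idx], predictions))
    return float(np.mean(fold_scores))


xgb_accuracy = xgb_cv_accuracy(X, y) if XGBClassifier is not None else None
print(xgb_accuracy)
```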

### 8. First results
Create a dataframe with the algorithms and their results, sorted from highest to lowest score.
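
One way to build such a table is sketched below. The helper name `results_table` is hypothetical, and the scores are placeholder numbers for illustration only, not results from this notebook.

```python
import pandas as pd


def results_table(scores):
    """Return algorithm names and scores, best score first (hypothetical helper)."""
    table = pd.DataFrame(
        {"algorithm": list(scores.keys()), "accuracy": list(scores.values())}
    )
    # Sort descending and renumber the rows so the index reflects the ranking.
    return table.sort_values("accuracy", ascending=False).reset_index(drop=True)


# Placeholder values (assumption); replace with the mean CV score of each model.
placeholder_scores = {"model_a": 0.70, "model_b": 0.80, "model_c": 0.75}
ranking = results_table(placeholder_scores)
print(ranking)
```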

### 9. Hyperparameter tuning
Train the models again, this time splitting the dataset into train/test and using a grid search to find the best hyperparameters.
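
A minimal sketch of the procedure for a single model, here a random forest. The grid values, the 70/30 split, the 5-fold inner cross validation, and the synthetic data are illustrative assumptions. The same pattern would be repeated for the other models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

SEED = 42  # arbitrary value, fixed only for reproducibility

# Stand-in data (assumption): 768 rows x 8 features.
X, y = make_classification(n_samples=768, n_features=8, random_state=SEED)

# Hold out 30% of the rows; the grid search only ever sees the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED, stratify=y
)

# Small illustrative grid (assumption): 2 x 2 x 2 = 8 candidates.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [2, 3],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# By default the best candidate is refitted on all training rows,
# so search.predict uses that refitted model.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, test_accuracy)
```

Comparing `search.best_score_` (cross-validated on train) with `test_accuracy` gives a rough idea of how well the tuned model generalizes.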

### 10. Final conclusions

We choose the model that generalizes best, that is, the one with the best metric on the test set (the tuned Random Forest).