# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a binary classification problem in which we will try to predict 1 (diabetic) or 0 (not diabetic). All variables are numeric.

## 1. Load the libraries you consider common to the notebook

In [20]:
import pandas as pd
import numpy as np

## 2. Read the data from [this address](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```

preg = Number of times pregnant

plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test

pres = Diastolic blood pressure (mm Hg)

skin = Triceps skin fold thickness (mm)

test = 2-Hour serum insulin (mu U/ml)

mass = Body mass index (weight in kg/(height in m)^2)

pedi = Diabetes pedigree function

age = Age (years)

class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)

In [21]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)
df

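|  | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |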
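
Because the CSV file has no header row, `names=` supplies the column labels. The following is a minimal offline sketch of the same `read_csv` call. It assumes only the first five rows shown above and uses an in-memory buffer (`io.StringIO`) in place of the URL; the variable names `raw` and `sample` are illustrative.

```python
import io

import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# First five rows of the dataset as shown above, in the raw headerless CSV layout
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)

# Passing names= makes pandas treat the first line as data, not as a header
sample = pd.read_csv(raw, names=names)

print(sample.shape)
print(sample['class'].value_counts())
```

Checking the shape and the class counts right after loading is a cheap way to confirm that no row was swallowed as a header.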


## 3. Bagging
In this section you will build an ensemble using the bagging technique ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)), combining 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) models. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this section and the following ones, you do not need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a seed.

In [22]:
array = df.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
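
The positional slicing above relies on `class` being the last column. A possible alternative is to select by column name, sketched here on a small stand-in frame. `toy`, `X_toy` and `y_toy` are hypothetical names, and the values are taken from three of the rows shown earlier.

```python
import pandas as pd

# Stand-in frame with two feature columns and the target
toy = pd.DataFrame({'preg': [6, 1, 8], 'plas': [148, 85, 183], 'class': [1, 0, 1]})

# Everything except the target column becomes the feature matrix
X_toy = toy.drop(columns='class').to_numpy()
y_toy = toy['class'].to_numpy()

print(X_toy.shape, y_toy.shape)
```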

In [23]:
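from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross-validation (no shuffling, so the folds are deterministic)
kfold = model_selection.KFold(n_splits=10)
cart = DecisionTreeClassifier()
num_trees = 100
# Bag 100 decision trees, each trained on a bootstrap sample of the rows.
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`, and the old
# name was removed in 1.4; use `estimator=cart` on recent versions.
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed,
                          oob_score=True, bootstrap=True)
# Mean accuracy over the 10 folds
results_bagg = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
results_bagg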

0.7720437457279563
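
`KFold(n_splits=10)` without `shuffle=True` cuts the rows into contiguous blocks, so the seed set earlier has no effect on the folds themselves. The sketch below shows the resulting fold sizes and a shuffled variant. It assumes the full dataset has 768 rows and uses a placeholder matrix (`X_dummy`), since only the row count matters here.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 768                      # assumed number of rows in the dataset
X_dummy = np.zeros((n_samples, 8))   # placeholder features; only the row count matters

# Unshuffled folds, as in the cell above: contiguous blocks of rows
kfold_plain = KFold(n_splits=10)
fold_sizes = [len(test_idx) for _, test_idx in kfold_plain.split(X_dummy)]
print(fold_sizes)

# Shuffled variant: the seed now decides which rows land in each fold
kfold_shuffled = KFold(n_splits=10, shuffle=True, random_state=7)
first_test = next(iter(kfold_shuffled.split(X_dummy)))[1]
print(first_test[:5])
```

Shuffling is worth considering when the rows might be ordered in some meaningful way; whether that applies to this file is not stated in the exercise.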

In [26]:
model.fit(X,Y)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100,
                  oob_score=True, random_state=7)

In [27]:
model.oob_score_

0.76953125
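
The out-of-bag score works because each bootstrap sample leaves some rows undrawn, and those rows can serve as validation data for the tree trained on that sample. The following is a rough NumPy sketch of how large that left-out share tends to be. It assumes 768 rows and 100 bootstrap rounds, and it does not reproduce scikit-learn's internal sampling.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples = 768   # assumed row count
n_rounds = 100    # one bootstrap sample per tree

oob_fractions = []
for _ in range(n_rounds):
    # Draw n_samples row indices with replacement (one bootstrap sample)
    drawn = rng.integers(0, n_samples, size=n_samples)
    # Rows never drawn are "out of bag" for this tree
    n_oob = n_samples - np.unique(drawn).size
    oob_fractions.append(n_oob / n_samples)

mean_oob = float(np.mean(oob_fractions))
print(round(mean_oob, 3))  # expected to be close to (1 - 1/n)^n, roughly 1/e
```

With about a third of the rows left out per tree, every row is out of bag for many of the 100 trees, which is what makes `oob_score_` a usable estimate without a separate test set.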

## 4. Random Forest
In this case, train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and `max_features` set to 3, again using cross-validation.

In [32]:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

num_trees = 100
max_features = 3  # number of features considered at each split
kfold = model_selection.KFold(n_splits=10)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed, oob_score=True)
# Mean accuracy over the 10 folds
results_rf = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
results_rf

0.7733766233766234

In [33]:
model.fit(X, Y)

RandomForestClassifier(max_features=3, oob_score=True, random_state=7)

In [34]:
model.oob_score_

0.7682291666666666
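
With `max_features=3`, each split considers only 3 randomly chosen features out of the 8 available, which decorrelates the trees. Once fitted, a forest also exposes impurity-based feature importances. The sketch below shows how to read them, using synthetic data from `make_classification` as a stand-in for the diabetes dataset, so the printed values say nothing about the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 8 features (not the diabetes data)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   random_state=7)

forest = RandomForestClassifier(n_estimators=100, max_features=3, random_state=7)
forest.fit(X_syn, y_syn)

# Impurity-based importances, one per feature, normalised to sum to 1
importances = forest.feature_importances_
for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```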

## 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 estimators.

In [53]:
# AdaBoost classification
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

num_trees = 30
kfold = model_selection.KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
# Keep the per-fold scores so that both mean and standard deviation can be reported
results_ada = model_selection.cross_val_score(model, X, Y, cv=kfold)
results_mean = results_ada.mean()
results_std = results_ada.std()

print('Result: {:.2f}+-{:.2f}'.format(results_mean, results_std))

Result: 0.76+-0.05
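
Boosting differs from bagging in that each new weak learner focuses on the samples the previous ones got wrong. The sketch below walks through one round of the textbook discrete AdaBoost re-weighting on ten made-up samples. It illustrates the idea only and is not scikit-learn's exact implementation, whose update differs in details.

```python
import math

# Ten samples with uniform weights; True marks a sample the weak learner got wrong
weights = [0.1] * 10
misclassified = [True, True, True] + [False] * 7

# Weighted error of the weak learner and its vote weight (alpha)
error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
alpha = 0.5 * math.log((1 - error) / error)

# Up-weight the mistakes, down-weight the hits, then renormalise
updated = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
total = sum(updated)
updated = [w / total for w in updated]

# Share of the total weight now carried by the misclassified samples
wrong_mass = sum(w for w, wrong in zip(updated, misclassified) if wrong)
print(round(alpha, 3), round(wrong_mass, 3))
```

After the update the misclassified samples carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.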


## 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.

In [61]:
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

# The exercise asks for 100 estimators; this run used 1000 as part of the
# comparison of ensemble sizes in the cells below.
num_trees = 1000
kfold = model_selection.KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
# Keep the per-fold scores so that both mean and standard deviation can be reported
results_gb = model_selection.cross_val_score(model, X, Y, cv=kfold)
results_mean = results_gb.mean()
results_std = results_gb.std()

print('Result: {:.2f}+-{:.2f}'.format(results_mean, results_std))

Result: 0.75+-0.05


In [56]:
results_gb

array([0.74025974, 0.81818182, 0.72727273, 0.64935065, 0.79220779,
       0.77922078, 0.79220779, 0.83116883, 0.68421053, 0.76315789])

In [58]:
results_gb # 10 est

array([0.66233766, 0.79220779, 0.71428571, 0.66233766, 0.77922078,
       0.80519481, 0.84415584, 0.84415584, 0.72368421, 0.71052632])

In [60]:
results_gb # 100 est

array([0.74025974, 0.81818182, 0.74025974, 0.63636364, 0.80519481,
       0.79220779, 0.80519481, 0.83116883, 0.72368421, 0.78947368])

In [62]:
results_gb # 1000 est

array([0.7012987 , 0.81818182, 0.68831169, 0.64935065, 0.79220779,
       0.76623377, 0.75324675, 0.83116883, 0.73684211, 0.75      ])
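
The cells above re-run the whole cross-validation for each ensemble size. When the goal is only to see how accuracy evolves with the number of trees, one option is to fit once and use `staged_predict`, which yields the predictions after each boosting stage. The sketch below assumes synthetic data from `make_classification` and a single hold-out split instead of the 10-fold setup, so its numbers are not comparable with the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the diabetes dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=7)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.25, random_state=7)

gb = GradientBoostingClassifier(n_estimators=200, random_state=7)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 200 trees
staged_acc = [accuracy_score(y_val, y_pred) for y_pred in gb.staged_predict(X_val)]

best_n = staged_acc.index(max(staged_acc)) + 1
print(f"best validation accuracy {max(staged_acc):.3f} at {best_n} trees")
```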

## 7. XGBoost
For this section use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) (the `XGBClassifier` class in the `xgboost` package) with 100 estimators. XGBoost is not part of the sklearn suite of models, so you will have to install it with `pip install`.

In [11]:
#!pip install xgboost

In [42]:
from xgboost import XGBClassifier

kfold = model_selection.KFold(n_splits=10)
model = XGBClassifier(n_estimators=100)
# Keep the per-fold scores so that both mean and standard deviation can be reported
results_xgb = model_selection.cross_val_score(model, X, Y, cv=kfold)

results_mean = results_xgb.mean()
results_std = results_xgb.std()

print('Result: {:.2f}+-{:.2f}'.format(results_mean, results_std))

Result: 0.74+-0.05


## 8. Resultados
Create a pandas Series with the results and their algorithms, sorted from highest to lowest.

In [48]:
# Mean cross-validated accuracy of each model, in the same order as the labels
resul = [results_bagg, results_rf, results_ada.mean(), results_gb.mean(), results_xgb.mean()]
algori = ["Bagging DT", "Random Forest", "Ada Boost", "GradientBoosting", "XGBoost"]

resultados = pd.Series(resul, index=algori).sort_values(ascending=False)
resultados

Random Forest       0.773377
Bagging DT          0.772044
GradientBoosting    0.768199
Ada Boost           0.760458
XGBoost             0.739559
dtype: float64
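
The ranking compares mean accuracies only, and the gaps between models are small relative to the fold-to-fold spread. As a sketch, the GradientBoosting entry can be recomputed together with its standard deviation from the per-fold scores reported earlier for 100 estimators; `folds_gb_100` is an illustrative name for that array.

```python
import numpy as np

# Per-fold accuracies reported above for GradientBoosting with 100 estimators
folds_gb_100 = np.array([0.74025974, 0.81818182, 0.74025974, 0.63636364, 0.80519481,
                         0.79220779, 0.80519481, 0.83116883, 0.72368421, 0.78947368])

mean_gb = folds_gb_100.mean()
std_gb = folds_gb_100.std()   # population standard deviation (ddof=0), NumPy's default
print(f"{mean_gb:.6f} +- {std_gb:.3f}")
```

The mean reproduces the GradientBoosting value in the table, and the spread is larger than the difference between the first and last models in the ranking, so the order should be read with some caution.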