# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a classification problem in which we try to predict 1 (diabetic) or 0 (not diabetic). All variables are numeric.

### 1. Load the libraries you consider common to the notebook

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import seaborn as sns

### 2. Read the data from [this URL](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```

In [13]:
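url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

# Note: the file has no header row, so read_csv consumes the first record as
# the header (767 rows remain). Passing header=None and names=... would keep it.
df = pd.read_csv(url, sep=",")
df.columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df.head()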

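|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |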


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 767 entries, 0 to 766
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   preg    767 non-null    int64  
 1   plas    767 non-null    int64  
 2   pres    767 non-null    int64  
 3   skin    767 non-null    int64  
 4   test    767 non-null    int64  
 5   mass    767 non-null    float64
 6   pedi    767 non-null    float64
 7   age     767 non-null    int64  
 8   class   767 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
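
The `df.info()` output above reports 767 entries because the CSV has no header line, so `read_csv` uses the first record as column names and one record is lost. The following is a minimal sketch of the difference, using a small in-memory CSV built from three records shown in the `df.head()` output; the names `sample_csv`, `df_default` and `df_full` are made up for this illustration.

```python
import io
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Three header-less records copied from the df.head() output (illustrative sample).
sample_csv = io.StringIO(
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
)

# Default behaviour: the first record becomes the header and drops out of the data.
df_default = pd.read_csv(sample_csv)
df_default.columns = names

# header=None keeps every record; names= assigns the column labels directly.
sample_csv.seek(0)
df_full = pd.read_csv(sample_csv, header=None, names=names)

print(len(df_default), len(df_full))  # prints: 2 3
```

Applied to the real URL, the `header=None, names=names` form should keep the first record as data; the scores recorded later in this notebook were computed without it.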


### 3. Bagging
For this section you will build an ensemble using bagging ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)), combining 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) estimators. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this section and the following ones you do not need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a seed.

In [18]:
X = df[["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age"]]
y = df["class"]

In [19]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

estimator = DecisionTreeClassifier(random_state=0)
bc = BaggingClassifier(estimator, n_estimators=100, 
                        bootstrap=True, oob_score=True)

In [20]:
bc.fit(X, y)

BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=0),
                  n_estimators=100, oob_score=True)

In [26]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

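kfold = KFold(n_splits=10)

# Note: with 0/1 labels, the RMSE of each fold equals sqrt(misclassification rate);
# scoring='accuracy' is the more common choice for classification.
cv_score = cross_val_score(bc,
                           X, y,
                           cv=kfold,
                           scoring='neg_root_mean_squared_error')
print(cv_score.mean(), cv_score.std())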

-0.47620552942163297 0.07258527289486412


In [27]:
bc.oob_score_

0.7679269882659713
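
The cross-validation cell above scores a classifier with a regression metric. As a hedged sketch of the alternative, the block below runs the same kind of bagging ensemble with `scoring='accuracy'` and checks how the two metrics relate. It uses synthetic data from `make_classification` as a stand-in for the diabetes features (an assumption, so the numbers will not match the notebook), fewer trees to keep it quick, and the names `X_demo`, `y_demo`, `bag_demo` and `kfold_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the eight diabetes features (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Fixing random_state on the ensemble makes both cross-validation runs identical.
bag_demo = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                             n_estimators=25, random_state=0)
kfold_demo = KFold(n_splits=10)

acc = cross_val_score(bag_demo, X_demo, y_demo, cv=kfold_demo, scoring='accuracy')
neg_rmse = cross_val_score(bag_demo, X_demo, y_demo, cv=kfold_demo,
                           scoring='neg_root_mean_squared_error')

# With 0/1 labels every squared error is 0 or 1, so RMSE = sqrt(1 - accuracy) per fold.
print(acc.mean(), acc.std())
print(np.allclose(-neg_rmse, np.sqrt(1 - acc)))
```

Reporting accuracy directly makes the cross-validated score comparable with `oob_score_`, which is also an accuracy.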

### 4. Random Forest
In this case, train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and `max_features` set to 3. Use cross-validation here as well.

In [28]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, 
                            bootstrap=True,
                            oob_score=True,
                            random_state=0,
                            max_features=3
                            )


In [29]:
rf.fit(X, y)

RandomForestClassifier(max_features=3, oob_score=True, random_state=0)

In [30]:
kfold = KFold(n_splits=10)

cv_score = cross_val_score(rf,
                           X, y, 
                           cv=kfold, 
                           scoring='neg_root_mean_squared_error')
print(cv_score.mean(), cv_score.std())

-0.4701053032928863 0.06718369642547754


In [31]:
rf.oob_score_

0.7679269882659713

### 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 trees.

In [32]:
from sklearn.ensemble import AdaBoostClassifier
estimator = DecisionTreeClassifier(random_state=0)


ada_clf = AdaBoostClassifier(estimator, n_estimators=30,
                             random_state=0)

 

In [33]:
ada_clf.fit(X, y)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(random_state=0),
                   n_estimators=30, random_state=0)

In [36]:
from sklearn.metrics import accuracy_score

# Accuracy on the same data used for fitting (training accuracy, no held-out set)
y_pred = ada_clf.predict(X)
acc = accuracy_score(y, y_pred)
acc

1.0
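
A score of 1.0 measured on the fitting data says little about how the model generalizes, since fully grown trees can memorize the training set. The sketch below contrasts training accuracy with 10-fold cross-validated accuracy for the same kind of AdaBoost model. It runs on synthetic data (an assumption, standing in for the diabetes features), and `X_demo`, `y_demo` and `ada_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); flip_y adds some label noise.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     flip_y=0.1, random_state=0)

ada_demo = AdaBoostClassifier(DecisionTreeClassifier(random_state=0),
                              n_estimators=30, random_state=0)

# Accuracy on the data used for fitting.
train_acc = accuracy_score(y_demo, ada_demo.fit(X_demo, y_demo).predict(X_demo))

# Mean accuracy over 10 held-out folds (cross_val_score refits clones of the model).
cv_acc = cross_val_score(ada_demo, X_demo, y_demo, cv=10, scoring='accuracy').mean()

print(train_acc, cv_acc)
```

The gap between the two printed values is a rough indication of overfitting; the same comparison could be applied to the gradient boosting and XGBoost models in the following sections.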

### 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.

In [37]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0, n_estimators=100)

clf.fit(X, y)
y_hat = clf.predict(X)

accuracy_score(y, y_hat)

0.894393741851369

### 7. XGBoost
For this section use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) with 100 estimators (in the `xgboost` package the class is called `XGBClassifier`). XGBoost is not part of sklearn's suite of models, so you will have to install it with `pip install`.

In [None]:
#!pip install xgboost

In [38]:
from xgboost import XGBClassifier

xgb_clas = XGBClassifier(n_estimators=100, random_state=0)

xgb_clas.fit(X, y)

y_pred = xgb_clas.predict(X)

accuracy_score(y, y_pred)

1.0

### 8. Resultados
Create a pandas Series with the results and their algorithms, sorted from highest to lowest.
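
One possible way to do this is sketched below. The values are copied from the cell outputs earlier in the notebook; the index labels and the name `results` are made up for this illustration. Note that the recorded scores mix out-of-bag accuracy (bagging, random forest) with training accuracy (the boosting models), so the ranking is illustrative rather than a fair comparison.

```python
import pandas as pd

# Scores copied from the notebook outputs; labels are hypothetical.
# OOB = out-of-bag accuracy, train = accuracy on the fitting data.
results = pd.Series({
    'Bagging (OOB)': 0.7679269882659713,
    'RandomForest (OOB)': 0.7679269882659713,
    'AdaBoost (train)': 1.0,
    'GradientBoosting (train)': 0.894393741851369,
    'XGBoost (train)': 1.0,
}, name='accuracy')

# Sort from highest to lowest score.
results = results.sort_values(ascending=False)
print(results)
```

For a like-for-like ranking, each model would need to be scored with the same procedure, for example cross-validated accuracy with the same folds.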