# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a classification problem in which we try to predict 1 (diabetic) or 0 (non-diabetic). All variables are numeric.

## 1. Load the libraries you consider common to the notebook

In [27]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import KFold

%matplotlib inline


## 2. Read the data from [this address](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```

In [2]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', names= names)

In [4]:
df

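|     | preg | plas | pres | skin | test | mass | pedi  | age | class |
|-----|------|------|------|------|------|------|-------|-----|-------|
| 0   | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1   | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2   | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3   | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4   | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |
| ... | ...  | ...  | ...  | ...  | ...  | ...  | ...   | ... | ...   |
| 763 | 10   | 101  | 76   | 48   | 180  | 32.9 | 0.171 | 63  | 0     |
| 764 | 2    | 122  | 70   | 27   | 0    | 36.8 | 0.340 | 27  | 0     |
| 765 | 5    | 121  | 72   | 23   | 112  | 26.2 | 0.245 | 30  | 0     |
| 766 | 1    | 126  | 60   | 0    | 0    | 30.1 | 0.349 | 47  | 1     |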


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   preg    768 non-null    int64  
 1   plas    768 non-null    int64  
 2   pres    768 non-null    int64  
 3   skin    768 non-null    int64  
 4   test    768 non-null    int64  
 5   mass    768 non-null    float64
 6   pedi    768 non-null    float64
 7   age     768 non-null    int64  
 8   class   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


## 3. Bagging
For this section you will create an ensemble using the bagging technique ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)), combining 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) estimators. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this section and the following ones, there is no need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a seed.

In [6]:
#Set X and y_target
X = df.drop(columns='class')
y = df['class']

print ("X_shape", X.shape)
print ("y_shape", y.shape)

X_shape (768, 8)
y_shape (768,)


In [7]:
seed= 42

In [33]:
kfold = KFold(n_splits=10)
tree_cls = DecisionTreeClassifier()

# The base estimator is passed positionally: its keyword is `base_estimator`
# in older scikit-learn releases and `estimator` in newer ones
bagging_cls = BaggingClassifier(tree_cls,
                                n_estimators=100,
                                random_state= seed)

# Mean accuracy over the 10 folds
result_bagg = cross_val_score(bagging_cls, X, y, cv= kfold ).mean()

In [34]:
result_bagg

0.775974025974026
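
`KFold(n_splits=10)` as used above splits the rows in their original order, so `seed` only affects the estimator, not the folds. A minimal sketch of a seeded, shuffled splitter follows. The synthetic data from `make_classification` and the `*_demo` names are assumptions for illustration and stand in for the real `X` and `y`, so the score will not match the value above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

seed = 42

# Synthetic stand-in for the diabetes features/target (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=seed)

# shuffle=True makes random_state meaningful: same seed, same folds on every run
kfold_shuffled = KFold(n_splits=10, shuffle=True, random_state=seed)

scores_demo = cross_val_score(DecisionTreeClassifier(random_state=seed),
                              X_demo, y_demo, cv=kfold_shuffled)
print(scores_demo.mean())
```

Passing `kfold_shuffled` as `cv` in place of `kfold` would be the only change needed in the cells of this notebook.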

In [10]:
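# Per-fold scores, stored under a separate name so the mean in result_bagg is not overwritten.
# With an integer cv, scikit-learn uses stratified folds for classifiers, so these
# scores differ from the KFold run above.
scores_bagg = cross_val_score(bagging_cls, X, y, cv= 10 )
scores_bagg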

array([0.7012987 , 0.77922078, 0.79220779, 0.66233766, 0.76623377,
       0.79220779, 0.83116883, 0.85714286, 0.72368421, 0.80263158])

## 4. Random Forest
In this case, train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and a `max_features` of 3, again with cross-validation.

In [11]:
kfold = KFold(n_splits=10)
rand_forest_cls = RandomForestClassifier(n_estimators=100,
                                         max_features=3,
                                         random_state= seed)

result_rand_forest_cls = cross_val_score(rand_forest_cls, X, y, cv= kfold ).mean()
result_rand_forest_cls

0.7642857142857143
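
`max_features=3` caps how many of the 8 features each tree considers when searching for a split. The sketch below fits a forest on synthetic stand-in data (an assumption, not the diabetes set; the `*_demo` names are hypothetical) and inspects that setting on the fitted trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 8 features, like the diabetes data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

forest_demo = RandomForestClassifier(n_estimators=100, max_features=3, random_state=42)
forest_demo.fit(X_demo, y_demo)

# Number of fitted trees in the ensemble
print(len(forest_demo.estimators_))
# Number of features each tree considers per split
print(forest_demo.estimators_[0].max_features_)
```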

## 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 trees.

In [12]:
kfold = KFold(n_splits=10)
tree_cls = DecisionTreeClassifier()
adaBoost_cls = AdaBoostClassifier(tree_cls,
                                  n_estimators=30,
                                  random_state= seed)

result_adaB_cls = cross_val_score(adaBoost_cls, X, y, cv= kfold ).mean()
result_adaB_cls

0.695164046479836
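
The score above comes from boosting fully grown trees. When no base estimator is passed, scikit-learn's `AdaBoostClassifier` boosts depth-1 trees (decision stumps) instead. The sketch below compares the two choices on synthetic stand-in data (an assumption; the `*_demo` names are hypothetical), and results on the diabetes set may well differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features/target (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
kfold_demo = KFold(n_splits=10)

candidates = {
    # Fully grown trees, as in the cell above
    "full_tree": AdaBoostClassifier(DecisionTreeClassifier(),
                                    n_estimators=30, random_state=42),
    # Decision stumps, written out explicitly for comparison
    "stump": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                n_estimators=30, random_state=42),
}

# Mean cross-validated accuracy per candidate
ada_scores = {name: cross_val_score(model, X_demo, y_demo, cv=kfold_demo).mean()
              for name, model in candidates.items()}
print(ada_scores)
```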

## 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.

In [35]:
grad_boost_cls = GradientBoostingClassifier(n_estimators=100,
                                            random_state= seed)
resuld_gradBoost_cls = cross_val_score(grad_boost_cls, 
                                        X,
                                        y,
                                        cv= kfold).mean()

In [36]:
resuld_gradBoost_cls

0.7642857142857143

In [37]:
cross_val_score(grad_boost_cls, X, y, cv= kfold)

array([0.71428571, 0.81818182, 0.74025974, 0.63636364, 0.80519481,
       0.79220779, 0.80519481, 0.83116883, 0.71052632, 0.78947368])
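
The per-fold scores are worth summarizing with their spread as well as their mean. A short sketch using the ten values printed above, hard-coded so it runs on its own with only the standard library (the name `fold_scores` is hypothetical):

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the GradientBoosting output above
fold_scores = [0.71428571, 0.81818182, 0.74025974, 0.63636364, 0.80519481,
               0.79220779, 0.80519481, 0.83116883, 0.71052632, 0.78947368]

# Mean accuracy and sample standard deviation across folds
print(f"mean={mean(fold_scores):.4f}, std={stdev(fold_scores):.4f}")
```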

## 7. XGBoost
For this section, use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) with 100 estimators (`XGBClassifier` in the `xgboost` package, as imported above). XGBoost is not part of scikit-learn's suite of models, so you will need to install it with `pip install xgboost`.

In [17]:
#!pip install xgboost



In [30]:
model = XGBClassifier(n_estimators= 100)

resuld_xgdBoost_cls = cross_val_score(model, 
                                      X,
                                      y,
                                      cv= 10).mean()
                                   



In [31]:
resuld_xgdBoost_cls

0.7357142857142857

## 8. Results
Create a Series with the results and their algorithms, sorted from highest to lowest.

In [49]:
dict_results_models = {'models': ['Bagging_cls','Random_forest_cls','AdaBoost_cls', 'GradientBoosting_cls', 'XGBoost_cls' ],
                        'results':[result_bagg, result_rand_forest_cls ,result_adaB_cls, resuld_gradBoost_cls,resuld_xgdBoost_cls]}

df_results = pd.DataFrame(dict_results_models)

In [50]:
df_results

|   | models               | results  |
|---|----------------------|----------|
| 0 | Bagging_cls          | 0.775974 |
| 1 | Random_forest_cls    | 0.764286 |
| 2 | AdaBoost_cls         | 0.695164 |
| 3 | GradientBoosting_cls | 0.764286 |
| 4 | XGBoost_cls          | 0.735714 |


In [51]:
df_results = df_results.sort_values('results', ascending=False)

In [52]:
df_results

|   | models               | results  |
|---|----------------------|----------|
| 0 | Bagging_cls          | 0.775974 |
| 1 | Random_forest_cls    | 0.764286 |
| 3 | GradientBoosting_cls | 0.764286 |
| 4 | XGBoost_cls          | 0.735714 |
| 2 | AdaBoost_cls         | 0.695164 |
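
The exercise asks for a Series, while the cells above build a DataFrame. A sketch of the Series variant follows, using the mean scores reported above, hard-coded here so the snippet runs on its own (the name `results_series` is hypothetical).

```python
import pandas as pd

# Mean cross-validated accuracy per model, copied from the results above
results_series = pd.Series({
    'Bagging_cls': 0.775974,
    'Random_forest_cls': 0.764286,
    'AdaBoost_cls': 0.695164,
    'GradientBoosting_cls': 0.764286,
    'XGBoost_cls': 0.735714,
}, name='results')

# Sort from highest to lowest score
results_series = results_series.sort_values(ascending=False)
print(results_series)
```

In the notebook itself, the same Series could be built from the existing variables instead of literals, for example via `df_results.set_index('models')['results']`.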
