![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Building and implementing Boosting-based models
In this notebook you will learn to build and apply AdaBoost, Gradient Boosting, and XGBoost models. You will develop the first model both manually and with the sklearn library, and the other two using only the corresponding library.

## General instructions

The models you build in this notebook must predict whether a user stops using a company's services (churn) based on several variables. For more details on the churn dataset, see: http://srepho.github.io/Churn/Churn

To complete the activity, just follow the instructions associated with each notebook cell.

In [9]:
import warnings
warnings.filterwarnings('ignore')

In [10]:
# Import the libraries and load the dataset
import pandas as pd
import numpy as np

# Load the data from a .csv file
data = pd.read_csv('https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/datasets/churn.csv')

In [11]:
# Select the numeric variables (X)
X = data.iloc[:, [1,2,6,7,8,9,10]].astype(float)
# Convert the boolean variables to floats (1.0 where the value is 'no', 0.0 otherwise)
X = X.join((data.iloc[:, [4,5]] == 'no').astype(float))

# Define the binary target variable (y)
y = (data.iloc[:, -1] == 'True.').astype(int)
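
To see what the `== 'no'` comparison produces, the sketch below applies the same expression to a tiny hand-made DataFrame. The column names `intl_plan` and `vmail_plan` and the `'yes'` values are hypothetical stand-ins and are not taken from `churn.csv`.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical stand-ins
toy = pd.DataFrame({'intl_plan': ['no', 'yes', 'no'],
                    'vmail_plan': ['yes', 'yes', 'no']})

# Same expression as in the cell above: True where the value is 'no', then cast to float
encoded = (toy == 'no').astype(float)
print(encoded)
```

Note that this encoding maps `'no'` to 1.0 and every other value to 0.0, so a larger feature value means the plan is absent.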

In [12]:
# Split the predictors (X) and the target (y) into training and test sets using train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
n_samples = X_train.shape[0]

## Manual AdaBoost

In [13]:
# Number of decision trees in the model
n_estimators = 10

# DataFrame that stores the observation weights used by each decision tree
weights = pd.DataFrame(index=X_train.index, columns=list(range(n_estimators)))

# Assign the same weight to every observation for tree 0
t = 0
weights[t] = 1 / n_samples

# Display the weights DataFrame
weights.head()

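| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2953 | 0.000448 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 617 | 0.000448 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | 0.000448 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 853 | 0.000448 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2510 | 0.000448 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |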


In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Define and fit the first decision tree (a depth-1 DecisionTreeClassifier)
trees = []
trees.append(DecisionTreeClassifier(max_depth=1))
trees[t].fit(X_train, y_train, sample_weight=weights[t].values)
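
The `sample_weight` argument is what lets each tree focus on different observations. The following is a minimal sketch on made-up one-feature data (the arrays and weight values are assumptions chosen for illustration) showing how a depth-1 tree changes its prediction once a single observation carries more weight.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: a single feature, with one positive example at x = 2
X_toy = np.array([[0], [1], [2], [3], [4]])
y_toy = np.array([0, 0, 1, 0, 0])

# Uniform weights: the positive example is outvoted in whichever leaf it lands
uniform = np.full(5, 1 / 5)
stump_uniform = DecisionTreeClassifier(max_depth=1).fit(X_toy, y_toy, sample_weight=uniform)

# Heavier weight on the positive example: its leaf now has a weighted majority of class 1
heavy = np.array([1, 1, 5, 1, 1]) / 9
stump_heavy = DecisionTreeClassifier(max_depth=1).fit(X_toy, y_toy, sample_weight=heavy)

pred_uniform = stump_uniform.predict([[2]])[0]
pred_heavy = stump_heavy.predict([[2]])[0]
print(pred_uniform, pred_heavy)
```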

In [17]:
# Estimate the error of the first decision tree (1 - weighted balanced accuracy on the training set)
y_pred_ = trees[t].predict(X_train)
error = []
error.append(1 - metrics.balanced_accuracy_score(y_train, y_pred_, sample_weight = weights[t].values))
error[t]

0.37950927163251025
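
For comparison, the textbook AdaBoost formulation measures the error of tree t as the weighted misclassification rate, whereas the cell above uses one minus the weighted balanced accuracy, so the two values will generally differ. A sketch of the textbook definition follows; the symbols (w for the normalized weights, h_t for the prediction of tree t) are notation introduced here and do not appear in the notebook.

```latex
\epsilon_t = \sum_{i=1}^{n} w_i^{(t)} \, \mathbb{1}\left[ h_t(x_i) \neq y_i \right],
\qquad \text{with} \quad \sum_{i=1}^{n} w_i^{(t)} = 1
```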

In [18]:
# Compute the alpha factor of the first decision tree
alpha = []
alpha.append(np.log((1 - error[t]) / error[t])/2)
alpha[t]

0.24581581731037072
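
The alpha value above can be reproduced by hand from the reported error. The short check below uses only the standard library; the variable names are chosen for this example.

```python
import math

error_0 = 0.37950927163251025  # error of the first tree, copied from the output above

# alpha = 0.5 * ln((1 - error) / error)
odds = (1 - error_0) / error_0
alpha_0 = 0.5 * math.log(odds)
print(alpha_0)

# An error of exactly 0.5 (no better than chance) gives alpha = 0
alpha_chance = 0.5 * math.log((1 - 0.5) / 0.5)
print(alpha_chance)
```

The smaller the error, the larger the ratio inside the logarithm, and therefore the more weight the tree receives in the final vote.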

In [19]:
# Update the weights for the second tree (t+1): increase the weights of the misclassified observations
weights[t + 1] = weights[t]
filter_ = y_pred_ != y_train

weights.loc[filter_, t + 1] = weights.loc[filter_, t] * np.exp(alpha[t])

In [20]:
# Normalize the weights so that they sum to 1
weights[t + 1] = weights[t + 1] / weights[t + 1].sum()
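
The update and normalization steps can be traced on a handful of observations. In this sketch the misclassification mask and the value of `alpha_t` are made up for illustration.

```python
import numpy as np

# Five observations with equal starting weights
w = np.full(5, 1 / 5)

# Suppose the tree misclassified observations 1 and 4 (hypothetical mask)
misclassified = np.array([False, True, False, False, True])

alpha_t = 0.25  # illustrative value, not taken from the notebook

# Scale up only the misclassified observations, then renormalize so the weights sum to 1
w_next = np.where(misclassified, w * np.exp(alpha_t), w)
w_next = w_next / w_next.sum()
print(w_next.round(4))
```

After normalization the correctly classified observations end up with a smaller weight than before, which is why the weights of the first rows drop from 0.000448 to 0.000432 in the notebook.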

In [21]:
# Display the weights DataFrame
weights.head()

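| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2953 | 0.000448 | 0.000432 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 617 | 0.000448 | 0.000432 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | 0.000448 | 0.000432 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 853 | 0.000448 | 0.000432 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2510 | 0.000448 | 0.000432 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |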


In [23]:
# Loop over the remaining trees, repeating the calculations shown above
for t in range(1, n_estimators):
    # Define and fit tree t
    trees.append(DecisionTreeClassifier(max_depth=1))
    trees[t].fit(X_train, y_train, sample_weight=weights[t].values)
    y_pred_ = trees[t].predict(X_train)
    # Estimate the error of tree t
    # (arguments are passed as (y_pred_, y_train), the reverse of the first-tree cell;
    # balanced accuracy is not symmetric in its arguments)
    error.append(1 - metrics.balanced_accuracy_score(y_pred_, y_train, sample_weight = weights[t].values))
    # Compute the alpha factor of tree t
    alpha.append(np.log((1 - error[t]) / error[t]) / 2)
    # Update and normalize the weights for tree t+1
    weights[t + 1] = weights[t]
    filter_ = y_pred_ != y_train
    weights.loc[filter_, t + 1] = weights.loc[filter_, t] * np.exp(alpha[t])
    weights[t + 1] = weights[t + 1] / weights[t + 1].sum()
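
The same steps can be collected into one function. The sketch below is a hypothetical helper, `fit_manual_adaboost`, that keeps the weights in a NumPy array and, unlike the cells above, uses the weighted misclassification rate as the error. It runs on synthetic data so that it is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_manual_adaboost(X, y, n_estimators=10):
    """Fit depth-1 trees sequentially, reweighting misclassified samples.

    Hypothetical helper: the error is the weighted misclassification rate,
    which differs from the balanced-accuracy error used in the notebook.
    """
    w = np.full(len(y), 1 / len(y))
    trees, alphas = [], []
    for _ in range(n_estimators):
        tree = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        err = w[miss].sum()  # weights sum to 1, so this is the weighted error rate
        if err <= 0 or err >= 0.5:
            break  # stop on a perfect or no-better-than-chance tree
        alpha = 0.5 * np.log((1 - err) / err)
        trees.append(tree)
        alphas.append(alpha)
        # Boost the weights of misclassified samples, then renormalize
        w = np.where(miss, w * np.exp(alpha), w)
        w = w / w.sum()
    return trees, np.array(alphas)


# Synthetic data, used here only so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
trees_demo, alphas_demo = fit_manual_adaboost(X_demo, y_demo, n_estimators=5)
print(len(trees_demo), alphas_demo.round(3))
```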

In [24]:
# Display the error of each tree
error

[0.37950927163251025,
 0.1761671507959327,
 0.29887226314008886,
 0.2888088327626901,
 0.3847613492923222,
 0.3886197394111994,
 0.37902179410944004,
 0.4372322506461914,
 0.42778833896841784,
 0.4440647604728645]
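
Each of these errors translates into an alpha through the formula used in the loop. The check below recomputes the alphas from the listed errors with the standard library only; the variable names are chosen for this example.

```python
import math

# Errors of the ten trees, copied from the output above
errors = [0.37950927163251025, 0.1761671507959327, 0.29887226314008886,
          0.2888088327626901, 0.3847613492923222, 0.3886197394111994,
          0.37902179410944004, 0.4372322506461914, 0.42778833896841784,
          0.4440647604728645]

# alpha = 0.5 * ln((1 - error) / error), the same formula used in the loop
alphas = [0.5 * math.log((1 - e) / e) for e in errors]

# The tree with the smallest error receives the largest alpha
best = min(range(len(errors)), key=lambda i: errors[i])
print(best, round(alphas[best], 4))
```

Because every error is below 0.5, every alpha is positive. The later trees, whose errors approach 0.5, contribute the least to the final vote.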

In [25]:
# Display the weights DataFrame
weights.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2953 | 0.000448 | 0.000432 | 0.000358 | 0.000314 | 0.000268 | 0.000243 | 0.00022 | 0.000194 | 0.000208 | 0.000226 | 0.00024 |
| 617 | 0.000448 | 0.000432 | 0.000358 | 0.000314 | 0.000268 | 0.000243 | 0.00022 | 0.000194 | 0.000208 | 0.000226 | 0.00024 |
| 26 | 0.000448 | 0.000432 | 0.000358 | 0.000314 | 0.000268 | 0.000243 | 0.00022 | 0.000194 | 0.000184 | 0.000172 | 0.000164 |
| 853 | 0.000448 | 0.000432 | 0.000358 | 0.000314 | 0.000268 | 0.000243 | 0.00022 | 0.000194 | 0.000208 | 0.000226 | 0.00024 |
| 2510 | 0.000448 | 0.000432 | 0.000358 | 0.000314 | 0.000268 | 0.000243 | 0.00022 | 0.000194 | 0.000208 | 0.000226 | 0.000215 |


In [26]:
# Only trees with an error below 0.5 are used
new_n_estimators = np.sum([x<0.5 for x in error])

In [27]:
# Predict on the test set with each tree
y_pred_all = np.zeros((X_test.shape[0], new_n_estimators))
for t in range(new_n_estimators):
    y_pred_all[:, t] = trees[t].predict(X_test)

In [29]:
# Final prediction: weight each tree's predictions by its alpha factor
y_pred = (np.sum(y_pred_all * alpha[:new_n_estimators], axis=1) >= 1).astype(int)
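
The cell above compares the weighted sum of the votes against a fixed threshold of 1. A common alternative for 0/1 labels is a weighted majority vote, which predicts 1 when class 1 gathers at least half of the total alpha. The sketch below illustrates that variant on made-up predictions and weights; it is not the rule used in this notebook.

```python
import numpy as np

# Hypothetical 0/1 predictions of three trees (columns) for four observations (rows)
preds = np.array([[1, 0, 1],
                  [0, 0, 1],
                  [1, 1, 0],
                  [0, 0, 0]])
alphas = np.array([0.8, 0.3, 0.4])  # illustrative weights

# Weighted sum of the votes for class 1
score = preds @ alphas

# Predict 1 when class 1 gathers at least half of the total alpha
y_vote = (score >= alphas.sum() / 2).astype(int)
print(y_vote)
```

With this rule the threshold scales with the alphas, so the decision does not depend on how many trees are kept or on the magnitude of their weights.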

In [30]:
# Print the model's performance (F1 score, accuracy); sklearn metrics expect (y_true, y_pred)
metrics.f1_score(y_test.values, y_pred), metrics.accuracy_score(y_test.values, y_pred)

(0.37344398340248963, 0.8627272727272727)

In [31]:
# Fit a single decision tree to compare its performance
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.f1_score(y_test.values, y_pred), metrics.accuracy_score(y_test.values, y_pred)

(0.33986928104575165, 0.8163636363636364)
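
Both results pair a fairly high accuracy with a much lower F1 score. The sketch below computes the two metrics from confusion-matrix counts to show how that can happen when positives are rare and often missed. The counts are invented for illustration and are not taken from the churn data.

```python
def f1_and_accuracy(tp, fp, fn, tn):
    """Compute F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy


# Hypothetical counts: few positives, most of them missed
f1, acc = f1_and_accuracy(tp=10, fp=5, fn=30, tn=155)
print(round(f1, 3), round(acc, 3))

# Swapping false positives and false negatives leaves both metrics unchanged
f1_swapped, acc_swapped = f1_and_accuracy(tp=10, fp=30, fn=5, tn=155)
```

Accuracy counts the many correctly predicted negatives, while F1 only looks at how well the positive class is recovered.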

## AdaBoost using sklearn

In [32]:
# Import and define the AdaBoostClassifier model
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
clf

In [33]:
# Fit the AdaBoostClassifier model and evaluate its performance
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.f1_score(y_test.values, y_pred), metrics.accuracy_score(y_test.values, y_pred)

(0.36771300448430494, 0.8718181818181818)
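
The model above uses the default settings. As a sketch of how the main hyperparameters can be set and how accuracy evolves as trees are added, the example below fits `AdaBoostClassifier` on synthetic data. The data and the values of `n_estimators`, `learning_rate`, and `random_state` are illustrative assumptions, not recommendations for the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data so the sketch runs on its own; not the churn dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# n_estimators, learning_rate and random_state are illustrative choices
ada = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0)
ada.fit(Xtr, ytr)

# staged_score yields the test accuracy after each boosting round
staged_acc = list(ada.staged_score(Xte, yte))
print(len(staged_acc), round(staged_acc[-1], 3))
```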

## Gradient Boosting using sklearn

In [34]:
# Import and define the GradientBoostingClassifier model
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf

In [35]:
# Fit the GradientBoostingClassifier model and evaluate its performance
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.f1_score(y_test.values, y_pred), metrics.accuracy_score(y_test.values, y_pred)

(0.4666666666666666, 0.8981818181818182)
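
To inspect how the number of boosting stages affects the result, `staged_predict` returns the predictions after each stage. The sketch below runs on synthetic data and writes out `n_estimators`, `learning_rate`, and `max_depth` explicitly; the values shown are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data so the sketch runs on its own; not the churn dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(Xtr, ytr)

# F1 on the held-out split after each boosting stage
staged_f1 = [f1_score(yte, y_stage) for y_stage in gb.staged_predict(Xte)]
best_stage = max(range(len(staged_f1)), key=lambda i: staged_f1[i]) + 1
print(best_stage, round(staged_f1[best_stage - 1], 3))
```

In practice the number of stages would be chosen on a separate validation split rather than on the test set, since picking it on the test set gives an optimistic estimate.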

## XGBoost using the xgboost library (sklearn-compatible API)
Install the xgboost library with the command: pip install xgboost

In [36]:
# Import and define the XGBClassifier model

from xgboost import XGBClassifier
clf = XGBClassifier()
clf

In [37]:
# Fit the XGBClassifier model and evaluate its performance
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.f1_score(y_test.values, y_pred), metrics.accuracy_score(y_test.values, y_pred)

(0.4298245614035088, 0.8818181818181818)
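
Since every model in this notebook exposes the same `fit` and `predict` methods, the repeated evaluation cells can be folded into one helper. The sketch below uses a hypothetical function, `evaluate`, on synthetic data and sklearn models only. An `XGBClassifier` could be added to the dictionary in the same way if xgboost is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, Xtr, Xte, ytr, yte):
    """Fit a classifier and return (F1, accuracy) on the test split."""
    y_hat = model.fit(Xtr, ytr).predict(Xte)
    return f1_score(yte, y_hat), accuracy_score(yte, y_hat)


# Synthetic data so the sketch runs on its own; not the churn dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

models = {
    'tree': DecisionTreeClassifier(random_state=0),
    'adaboost': AdaBoostClassifier(random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
}
results = {name: evaluate(m, Xtr, Xte, ytr, yte) for name, m in models.items()}
for name, (f1, acc) in results.items():
    print(f'{name}: F1={f1:.3f}, accuracy={acc:.3f}')
```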