# XGBoost: Extreme Gradient Boosting

This example is based on https://www.datacamp.com/community/tutorials/xgboost-in-python. The analysis presented there is interesting. We will use the California housing prices dataset and the diabetes dataset.

In [None]:
#!pip install xgboost

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from xgboost import XGBRegressor, XGBRFRegressor, XGBClassifier, XGBRFClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import load_diabetes, fetch_california_housing, load_wine

from sklearn.metrics import mean_squared_error, ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import train_test_split

We will use one dataset to predict the price of a house and another dataset to predict diabetes progression.

In [None]:
diabetes = load_diabetes()
california = fetch_california_housing()

In [None]:
data_diabetes = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
data_california = pd.DataFrame(california.data, columns=california.feature_names)

Let's look at the data:

In [None]:
data_diabetes.describe()

The column names are not very helpful, so we need to look at the dataset description:

In [None]:
print(diabetes.DESCR)
print(california.DESCR)

Let's attach the target variable of each dataset as a new column and then separate the features from the target:

In [None]:
data_diabetes['DIAB'] = diabetes.target
data_california['PRICE'] = california.target

In [None]:
Xb, yb = data_diabetes.iloc[:,:-1],data_diabetes.iloc[:,-1]
Xc, yc = data_california.iloc[:,:-1],data_california.iloc[:,-1]

Let's split both datasets into two subsets (train and test):

In [None]:
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.2, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=50)

In [None]:
Xb_train.shape, Xc_train.shape

Now let's create and train the XGBoost model. The API is similar to sklearn's :D

In [None]:
XGBRFRegressor?

In [None]:
argumentos = dict(objective='reg:squarederror',
                  colsample_bytree=0.3,
                  learning_rate=0.2,
                  max_depth=10,
                  alpha=50,  # L1 regularization (alias of reg_alpha)
                  n_estimators=10)  # play with these values

xg_regb = [XGBRegressor(**argumentos), 
        XGBRFRegressor()]

xg_regc = [XGBRegressor(**argumentos),
        XGBRFRegressor()]

diabetes_set = (xg_regb, Xb_train, yb_train, Xb_test, yb_test)
california_set = (xg_regc, Xc_train, yc_train, Xc_test, yc_test)

for (dataset_models, X_train, y_train, X_test, y_test) in (diabetes_set, california_set):
    for model in dataset_models:
        print(model)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, preds))
        print(f"RMSE: {rmse:f}")

        # One scatter plot per feature: true vs. predicted target values
        plt.figure(figsize=(15, 7))
        n_cols = X_test.shape[1] // 2 + 1
        for i, var in enumerate(X_test.columns):
            plt.subplot(2, n_cols, i + 1)
            plt.scatter(X_test.loc[:, var], y_test, label='trueval')
            plt.scatter(X_test.loc[:, var], preds, label='predicted')
            plt.title(var)
        plt.legend()
        plt.show()
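
The RMSE printed above is easier to interpret next to a trivial baseline. The sketch below computes the RMSE by hand on made-up toy numbers (not results from the models above) and compares it with a predictor that always returns the mean of the targets. The helper `rmse` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, chosen only to make the arithmetic easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Baseline: always predict the mean of the targets
baseline = np.full_like(y_true, y_true.mean())

model_rmse = rmse(y_true, y_pred)
baseline_rmse = rmse(y_true, baseline)

print(f"model RMSE:    {model_rmse:.4f}")
print(f"baseline RMSE: {baseline_rmse:.4f}")
```

A model is only useful if its RMSE is clearly below that of such a baseline. Note also that the RMSE is expressed in the units of the target, so values from the two datasets are not directly comparable.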


In [None]:
xg_regb[0]

### We can plot the tree

In [None]:
from xgboost import to_graphviz
to_graphviz(xg_regb[0])

# What about classification problems?

In [None]:
from sklearn.model_selection import KFold

In [None]:
wine = load_wine()

X = wine["data"]
y = wine["target"]

FOLDS=4
cv = KFold(n_splits=FOLDS, shuffle=True, random_state=4)
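
To see what `KFold` does with these settings, here is a small sketch on a toy array of 10 samples (made-up data, not the wine dataset). The names `X_toy` and `cv_toy` are introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
cv_toy = KFold(n_splits=4, shuffle=True, random_state=4)

val_sizes = []
seen = []
for fold, (train_idx, val_idx) in enumerate(cv_toy.split(X_toy)):
    val_sizes.append(len(val_idx))
    seen.extend(val_idx.tolist())
    print(f"fold {fold + 1}: train={len(train_idx)} val={len(val_idx)}")

# Every sample is used for validation exactly once across the folds
print(sorted(seen) == list(range(10)))
```

Since 10 is not divisible by 4, the first folds get one extra validation sample. Note that plain `KFold` does not balance the classes across folds. For classification, sklearn's `StratifiedKFold` is a common alternative.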

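The `objective` parameter is the objective function to minimize. For multiclass classification problems we usually use `multi:softmax`, which relies on the softmax function to turn the model's raw per-class scores into a "probability" for each class. Note that with `multi:softmax` XGBoost returns the most probable class, whereas `multi:softprob` returns the per-class probabilities themselves.

The softmax function has the form:

$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K{e^{z_j}}}$ for $i=1, \dots, K$ and $\mathbf{z} = (z_1, \dots, z_K) \in \mathbb{R}^K $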
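
The formula can be checked numerically with a few lines of NumPy. This is a sketch on made-up scores, not XGBoost's internal implementation. Subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D score vector z = (z_1, ..., z_K)."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()        # numerical stability, result is unchanged
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Toy raw scores for K = 3 classes
scores = np.array([1.0, 2.0, 3.0])
probs = softmax(scores)

# The predicted class is the one with the highest probability
predicted_class = int(np.argmax(probs))

print(probs)
print(probs.sum())
print(predicted_class)
```

The outputs are positive and sum to 1, which is why they can be read as class probabilities. Taking the `argmax` gives the predicted class.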

See the [documentation](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters) for other objective functions.

In [None]:
clfs = [XGBClassifier(objective = "multi:softmax", colsample_bytree = 0.3, learning_rate = 0.1,
                           max_depth = 5, alpha = 10, n_estimators = 10),
        XGBRFClassifier(objective = "multi:softmax", colsample_bytree = 0.3, learning_rate = 0.1,
                           max_depth = 5, alpha = 10, n_estimators = 10),
        RandomForestClassifier()]

clfs_names = ['XGBC', 'XGBRFC', 'RF']


In [None]:
for clf, name in zip(clfs, clfs_names):
    avg_accuracy = 0
    print(name)
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[val_idx], y[val_idx]
        clf.fit(X_train, y_train)
        preds = clf.predict(X_test)
        accuracy = accuracy_score(y_test, preds)
        avg_accuracy += accuracy
        print(f"Acc. fold {fold+1}: {accuracy * 100.0:.2f}")
        # Show the confusion matrix only for the XGBoost classifier
        if name == 'XGBC':
            ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
    avg_accuracy /= FOLDS
    print(f'Avg. accuracy = {avg_accuracy * 100:.2f}')
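
The manual loop above can also be written with sklearn's `cross_val_score`, which fits a fresh clone of the estimator on each fold. Here is a minimal sketch for the random forest only. The `random_state=0` on the classifier is an assumption added for reproducibility and is not part of the original setup.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = load_wine(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=4)

# One accuracy value per fold
scores = cross_val_score(
    RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)

for fold, acc in enumerate(scores, start=1):
    print(f"Acc. fold {fold}: {acc * 100:.2f}")
print(f"Avg. accuracy = {scores.mean() * 100:.2f}")
```

The explicit loop remains useful when something extra is needed per fold, such as the confusion matrix shown for `XGBC`.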

## References:

  * [Wikipedia article on XGBoost (in Spanish)](https://es.wikipedia.org/wiki/XGBoost)
  * [XGBoost documentation](https://xgboost.readthedocs.io/en/stable/)
  * [Gradient Boosting in scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting)