# Extreme Gradient Boosting (XGBoost)

A supervised learning algorithm based on gradient boosting, designed to be highly efficient, flexible, and scalable.

It builds decision trees sequentially, each one correcting the errors of the previous models, and optimizes a loss function using its gradient.
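
To make the sequential idea concrete, here is a minimal from-scratch sketch for squared-error regression. It is not how XGBoost is implemented internally; the synthetic data, the names `X_demo`, `y_demo`, and the values of `learning_rate` and `n_rounds` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data, not from this notebook).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # shrinkage applied to every new tree (assumed value)
n_rounds = 50        # number of boosting rounds (assumed value)

# Start from a constant prediction; the mean minimizes squared error.
pred = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - pred) ** 2)]

for _ in range(n_rounds):
    # For squared loss, the negative gradient is simply the residual.
    residual = y_demo - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    # Add a shrunken copy of the new tree's output to the running prediction.
    pred += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - pred) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each tree is fitted to what the current ensemble still gets wrong, so the training error shrinks round after round.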

XGBoost adds improvements such as L1 and L2 regularization, efficient handling of missing values, and advanced pruning and parallelization techniques, which make it one of the most powerful and widely used tools in data science competitions and real-world applications.

It stands out for its high predictive performance.

Let's load the libraries needed to implement the model.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

In [None]:
import pandas as pd

Let's load the dataset we will use.

In [None]:
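# as_frame=True already returns X as a DataFrame and y as a Series;
# keeping y one-dimensional avoids scikit-learn's column-vector warning at fit time
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)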

Let's take a look at the dataset.

In [None]:
X

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


As we can see, all the variables are continuous, but their scales differ widely, so we apply a RobustScaler (tree-based models do not strictly require scaling, but it does no harm).
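
For reference, with its default settings RobustScaler subtracts the median and divides by the interquartile range, which keeps a single outlier from dominating the scale. A minimal sketch on a made-up column (the `values` array is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One column with a clear outlier (illustrative values).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Same computation by hand: (x - median) / (Q3 - Q1).
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(scaled.ravel())
print("matches manual computation:", np.allclose(scaled, manual))
```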

In [None]:
X.dtypes

Unnamed: 0,0
mean radius,float64
mean texture,float64
mean perimeter,float64
mean area,float64
mean smoothness,float64
mean compactness,float64
mean concavity,float64
mean concave points,float64
mean symmetry,float64
mean fractal dimension,float64


In [None]:
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.pipeline import Pipeline

Preprocessing.

In [None]:
pre = Pipeline([
    ('scaler', RobustScaler())
])

Let's start with the Gradient Boosting method.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

Let's build the model pipeline.

In [None]:
pipe_gb = Pipeline([
    ('pre', pre),
    ('gb', GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=1.0,
        max_depth=3,
        random_state=42
    ))
])

Let's train the model, make predictions, and evaluate them with the metrics.

In [None]:
pipe_gb.fit(X_tr,y_tr)
y_pred = pipe_gb.predict(X_te)
print(confusion_matrix(y_te,y_pred))
print(classification_report(y_te,y_pred))


[[41  2]
 [ 2 69]]
              precision    recall  f1-score   support

           0       0.95      0.95      0.95        43
           1       0.97      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114



Now we repeat the same process with XGBoost.

In [None]:
!pip install xgboost



In [None]:
from xgboost import XGBClassifier

XGBoost is an optimized and extended implementation of gradient boosting. When you need high performance, have large datasets, or want fine-grained control, XGBoost is usually the better choice.

Note: boosting means that each stage is trained to reduce the error left by the previous stages.
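
One way to see this stage-by-stage improvement is to track the training loss as trees are added. The sketch below uses scikit-learn's `staged_predict_proba`; the hyperparameter values and the names `gb_demo` and `stage_losses` are assumptions for illustration. It measures training loss only; error on held-out data can stop improving earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X_demo, y_demo = load_breast_cancer(return_X_y=True)

gb_demo = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42
)
gb_demo.fit(X_demo, y_demo)

# staged_predict_proba yields the ensemble's probabilities after 1, 2, ..., 50 trees.
stage_losses = [
    log_loss(y_demo, proba) for proba in gb_demo.staged_predict_proba(X_demo)
]

for stage in (0, 9, 49):
    print(f"after {stage + 1:2d} trees: training log loss = {stage_losses[stage]:.4f}")
```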

Let's create the model pipeline.

In [None]:
pipe_xgb = Pipeline([
    ('pre', pre),
    ('xgb', XGBClassifier(
        n_estimators=100,
        learning_rate=1.0,
        max_depth=3,
        random_state=42
    ))
])

Training, prediction, and evaluation of the model with the metrics.

In [None]:
pipe_xgb.fit(X_tr,y_tr)
y_pred = pipe_xgb.predict(X_te)
print(confusion_matrix(y_te,y_pred))
print(classification_report(y_te,y_pred))

[[40  3]
 [ 3 68]]
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



When the dataset is very large, it is advisable to switch to the GPU to obtain results faster.

Let's build the same model, but with the device set to GPU.

In [None]:
pipe_xgb_gpu = Pipeline([
    ('pre', pre),
    ('xgb', XGBClassifier(
        n_estimators=100,
        learning_rate=1.0,
        max_depth=3,
        random_state=42,
        device='gpu'  # needs XGBoost 2.0+ and a CUDA-capable GPU
    ))
])

Training, prediction, and evaluation of the results with the metrics.

In [None]:
pipe_xgb_gpu.fit(X_tr,y_tr)
y_pred = pipe_xgb_gpu.predict(X_te)
print(confusion_matrix(y_te,y_pred))
print(classification_report(y_te,y_pred))

[[40  3]
 [ 2 69]]
              precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114






Now let's apply the same methods to the Titanic dataset, where the target variable (Survived) is categorical.

In [None]:
import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/titanic-dataset


In [None]:
data = pd.read_csv(path + '/Titanic-Dataset.csv')

Recall that this dataset contains missing values, so we fill them in.

In [None]:
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked']= data['Embarked'].fillna(data['Embarked'].mode()[0])

In [None]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,C148,C


Next we use LabelEncoder to turn the categorical variables into integer codes (0 and 1 for Sex; 0, 1, and 2 for Embarked) so they are easier to handle. Encoding them is essential before using XGBoost, because recent versions raise an error on non-numeric (string) columns.
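
As a quick illustration of what LabelEncoder produces, the sketch below encodes a made-up series with the three Embarked ports (`embarked_demo` is an assumption, not the real column). Classes are sorted alphabetically and numbered from 0.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A made-up sample with the three Embarked values (illustrative only).
embarked_demo = pd.Series(["S", "C", "Q", "S"])

encoder = LabelEncoder()
codes = encoder.fit_transform(embarked_demo)

print(list(encoder.classes_))  # ['C', 'Q', 'S']
print(list(codes))             # [2, 0, 1, 2]
```

The integer codes carry no real order between ports, which is why the pipeline later one-hot encodes these columns.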

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
data['Sex'] = LabelEncoder().fit_transform(data['Sex'])
data['Embarked'] = LabelEncoder().fit_transform(data['Embarked'])

In [None]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.000000,1,0,A/5 21171,7.2500,,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.000000,1,0,PC 17599,71.2833,C85,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.000000,0,0,STON/O2. 3101282,7.9250,,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.000000,1,0,113803,53.1000,C123,2
4,5,0,3,"Allen, Mr. William Henry",1,35.000000,0,0,373450,8.0500,,2
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,27.000000,0,0,211536,13.0000,,2
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.000000,0,0,112053,30.0000,B42,2
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,29.699118,1,2,W./C. 6607,23.4500,,2
889,890,1,1,"Behr, Mr. Karl Howell",1,26.000000,0,0,111369,30.0000,C148,0


Let's separate the target from the features, and then split the features into numeric and categorical.

In [None]:
y = data['Survived']
X = data.drop(columns=['Survived','Name','Ticket','Cabin','PassengerId'])

In [None]:
num = ['Age','SibSp','Parch','Fare']
cat = ['Sex','Embarked','Pclass']

In [None]:
X = X[num+cat]

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

Preprocessing.

In [None]:
prep = ColumnTransformer([
    ('num', RobustScaler(), num),
    ('cat', OneHotEncoder(drop='first'), cat)
])

Split the dataset into training and test sets.

In [None]:
X_tr, X_te, y_tr, y_te = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)
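
The `stratify=y` argument keeps the class proportions the same in both splits, which matters when the classes are imbalanced. A minimal sketch with made-up labels (70 negatives and 30 positives, an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70% class 0, 30% class 1.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_train_s, y_test_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The share of positives is preserved in both splits.
print("full:", y_demo.mean(), "train:", y_train_s.mean(), "test:", y_test_s.mean())
```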

Build the model with a more extensive hyperparameter configuration.

In [None]:
pipe_xgb_gpu_2 = Pipeline([
    ('prep', prep),
    ('xgb', XGBClassifier(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=3,
        enable_categorical=True,
        random_state=42,
        device='gpu',
        objective='binary:logistic',
        colsample_bytree=0.5,
        eval_metric='logloss'
    ))
])

Training, prediction, and evaluation of the results with the metrics.

In [None]:
pipe_xgb_gpu_2.fit(X_tr,y_tr)
y_pred = pipe_xgb_gpu_2.predict(X_te)
print(confusion_matrix(y_te,y_pred))
print(classification_report(y_te,y_pred))

[[97 13]
 [23 46]]
              precision    recall  f1-score   support

           0       0.81      0.88      0.84       110
           1       0.78      0.67      0.72        69

    accuracy                           0.80       179
   macro avg       0.79      0.77      0.78       179
weighted avg       0.80      0.80      0.80       179
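
The figures for class 1 in this report can be recovered by hand from the confusion matrix, where rows are true classes and columns are predicted classes. A short worked check (variable names are illustrative):

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[97, 13],
               [23, 46]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # 46 / 59
recall_1 = tp / (tp + fn)      # 46 / 69
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()  # 143 / 179

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.78 recall=0.67 f1=0.72 accuracy=0.80
```

The recall of 0.67 means the model misses about one in three actual survivors, which the overall accuracy of 0.80 does not show.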

