# XGBoost

While the Sklearn API might feel more comfortable at first, you’ll later find that the native XGBoost API contains some excellent features that the former doesn’t support.

We will use the Diamonds dataset throughout the tutorial. It is built into the Seaborn library. It has a nice combination of numeric and categorical features and over 50k observations, enough to comfortably showcase the advantages of XGBoost.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import xgboost as xgb

diamonds = sns.load_dataset("diamonds")
diamonds.shape
diamonds.head()
diamonds.describe()
diamonds.describe(exclude=np.number)

| statistic | cut | color | clarity |
|---|---|---|---|
| count | 53940 | 53940 | 53940 |
| unique | 5 | 7 | 8 |
| top | Ideal | G | SI1 |
| freq | 21551 | 11292 | 13065 |


# Predict diamond prices from physical measurements
The dataset has three categorical columns. Normally, you would encode them with ordinal or one-hot encoding, but XGBoost can handle categoricals internally.

The way to enable this feature is to cast the categorical columns into the Pandas `category` data type (by default, they are treated as text columns).

In [None]:
# Extract feature and target arrays
X, y = diamonds.drop('price', axis=1), diamonds[['price']]

# Extract text features
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert to Pandas category
for col in cats:
   X[col] = X[col].astype('category')

# we have three `category` features
X.dtypes
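To make the effect of the cast concrete, here is a small sketch on a hypothetical three-row frame (the values are made up for illustration and are not taken from the Diamonds dataset). It shows that the column's dtype changes and that Pandas keeps integer codes behind the text labels.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Ideal"], "carat": [0.23, 0.21, 0.31]})

# Cast the text column to the Pandas category dtype
toy["cut"] = toy["cut"].astype("category")

# Categories are the sorted unique labels; codes are their integer positions
categories = list(toy["cut"].cat.categories)
codes = toy["cut"].cat.codes.tolist()

print(toy.dtypes)
print(categories, codes)
```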

Let’s split the data into train and test sets (0.25 test size, the default of `train_test_split`):

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# XGBoost DMatrix

XGBoost comes with its own class for storing datasets called DMatrix. It is a highly optimized class for memory and speed. That's why converting datasets into this format is a requirement for the native XGBoost API.

The class accepts both the training features and the labels. To enable automatic encoding of Pandas category columns, we also set enable_categorical to True.

In [None]:
# Create regression matrices
dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)

# XGBoost Regression

Next, choose a value for the objective parameter. It tells XGBoost the machine learning problem you are trying to solve and what metrics or loss functions to use to solve that problem.

To predict diamond prices, which is a regression problem, you can use the common reg:squarederror objective. Usually, the name of the objective also contains the name of the loss function for the problem. For regression, it is common to use Root Mean Squared Error (RMSE), which is the square root of the mean of the squared differences between actual and predicted values. Here is what the metric looks like when implemented in NumPy:

In [None]:
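import numpy as np

# Illustrative values; replace with your own ground truth and predictions
actual = np.array([3.0, 5.0, 2.5])
predicted = np.array([2.5, 5.0, 3.0])

mse = np.mean((actual - predicted) ** 2)  # mean squared error
rmse = np.sqrt(mse)  # root mean squared error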

A note on the difference between a loss function and a performance metric: A loss function is used by machine learning models to minimize the differences between the actual (ground truth) values and model predictions. On the other hand, a metric (or metrics) is chosen by the machine learning engineer to measure the similarity between ground truth and model predictions.

In short, a loss function is minimized during training to guide the model on where to improve, while a metric is used during evaluation to measure overall performance. Whether a metric should be maximized or minimized depends on the metric: higher is better for accuracy, lower is better for RMSE.
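The distinction can be sketched with plain NumPy. In this simplified example (not how XGBoost trains), a single constant prediction is fitted by gradient descent on the squared-error loss, and two metrics are reported afterwards. The data values and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical ground-truth prices; values are illustrative only
actual = np.array([300.0, 320.0, 410.0, 390.0])

def squared_error_loss(pred, target):
    """Loss minimized during training: mean squared error."""
    return np.mean((pred - target) ** 2)

# "Training": fit a single constant prediction by gradient descent on the loss
constant = 0.0
learning_rate = 0.1
for _ in range(200):
    gradient = 2 * np.mean(constant - actual)  # d(loss)/d(constant)
    constant -= learning_rate * gradient

# "Evaluation": report metrics chosen by the engineer
predicted = np.full_like(actual, constant)
rmse = np.sqrt(squared_error_loss(predicted, actual))
mae = np.mean(np.abs(predicted - actual))
print(f"constant={constant:.2f}, RMSE={rmse:.2f}, MAE={mae:.2f}")
```

The loss drives the updates inside the loop, while RMSE and MAE are only computed once training is finished.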

# Training

The chosen objective function and any other hyperparameters of XGBoost should be specified in a dictionary, which by convention should be called params.

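Inside this initial params, we are also setting tree_method to gpu_hist, which enables GPU acceleration. If you don't have a GPU, you can omit the parameter or set it to hist. Note that XGBoost 2.0 and later deprecate gpu_hist in favor of setting tree_method to hist together with device set to cuda.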

We set another parameter called num_boost_round, which stands for number of boosting rounds. Internally, XGBoost minimizes the squared-error loss in small incremental rounds (more on this later). This parameter specifies the number of those rounds.

The ideal number of rounds is found through hyperparameter tuning. For now, we will just set it to 1000:

In [1]:
# Define hyperparameters
params = {"objective": "reg:squarederror", "tree_method": "gpu_hist"}

n = 1000
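To give some intuition for what a boosting round does, here is a heavily simplified sketch in NumPy. It is not XGBoost's actual algorithm: each round fits a one-split tree (a stump) to the current residuals and adds a shrunken version of it to the prediction. The function names, the synthetic data, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value, right_value = residuals[left].mean(), residuals[~left].mean()
        error = np.sum((residuals - np.where(left, left_value, right_value)) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def boost(x, y, num_boost_round, learning_rate=0.3):
    """Simplified boosting: each round fits a stump to the current residuals."""
    prediction = np.full_like(y, y.mean())
    rmse_per_round = []
    for _ in range(num_boost_round):
        residuals = y - prediction  # negative gradient of the squared error
        threshold, left_value, right_value = fit_stump(x, residuals)
        prediction += learning_rate * np.where(x <= threshold, left_value, right_value)
        rmse_per_round.append(np.sqrt(np.mean((y - prediction) ** 2)))
    return rmse_per_round

# Hypothetical synthetic data: a noisy curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) * 100 + rng.normal(0, 5, size=200)

history = boost(x, y, num_boost_round=50)
print(f"RMSE after round 1: {history[0]:.2f}, after round 50: {history[-1]:.2f}")
```

Each extra round lowers the training error a little further, which is why the number of rounds has to be tuned rather than simply maximized.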

Next, we create a list of two tuples that each contain two elements. The first element is the array for the model to evaluate, and the second is the array’s name.

In [2]:
evals = [(dtrain_reg, "train"), (dtest_reg, "validation")]


## XGBoost with Cross-Validation

While the test set waits in the corner, we split the training set into 3, 5, 7, or k splits, called folds. Then, we train the model k times. Each time, we use k-1 folds for training and the remaining fold for validation. This process is called k-fold cross-validation:

In [None]:
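# Train a single model with early stopping; `model` is used for evaluation later
model = xgb.train(
   params=params,
   dtrain=dtrain_reg,
   num_boost_round=n,
   evals=evals,
   verbose_eval=10, # Every ten rounds
   early_stopping_rounds=50, # Activate early stopping
)

# Run 5-fold cross-validation on the training data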
results = xgb.cv(
   params=params,
   dtrain=dtrain_reg,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)
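The k-fold procedure described above can be sketched in a few lines of NumPy. This is only an illustration of how the indices are partitioned, not what `xgb.cv` does internally; the helper name `k_fold_indices` is hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))
```

Every sample lands in the validation fold exactly once, so each row contributes to the validation score one time across the k runs.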

In [None]:
best_rmse = results['test-rmse-mean'].min()

best_rmse
# 550.8959336674216

# Evaluation

During the boosting rounds, the model object has learned all the patterns of the training set it possibly can. Now, we must measure its performance by testing it on unseen data. That's where our `dtest_reg` DMatrix comes into play.

This step of the process is called model evaluation (or inference). Once you generate predictions with predict, you pass them to Sklearn's mean_squared_error function to compare against y_test:

In [None]:
from sklearn.metrics import mean_squared_error

preds = model.predict(dtest_reg)

# Take the square root manually; the `squared` argument is deprecated in recent Sklearn releases
rmse = np.sqrt(mean_squared_error(y_test, preds))

print(f"RMSE of the base model: {rmse:.3f}")
# RMSE of the base model: 543.203