# XGBoost

Extreme Gradient Boosting (XGBoost), introduced by Chen and Guestrin (2016), is a scalable end-to-end tree boosting system. It adds a regularization term to the learning objective to address overfitting and uses parallel processing to accelerate training. The authors attribute its scalability to several systems and algorithmic optimizations, including a sparsity-aware tree learning algorithm for sparse data and a theoretically justified weighted quantile sketch procedure that handles instance weights during approximate tree learning.

Given a training set $\{(x_i, y_i)\}_{i=1}^N$, Chen and Guestrin (2016) define a tree ensemble model that uses $K$ additive functions as

$$
\hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i),\ f_k \in \mathcal{F}
$$

where $\mathcal{F}$ is the space of regression trees (CART). Each tree $f_k$ consists of a structure $q$, whose decision rules map an example to one of its leaves, and a vector of leaf scores $w$. The final prediction for an example is the sum of the scores of the leaves it reaches across all $K$ trees. To learn the set of functions used in the model, we minimize the following regularized objective:

$$
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega (f_k)
$$

where $\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$, $T$ is the number of leaves in the tree, $l$ is a differentiable convex loss function, and $\Omega$ penalizes the complexity of the regression tree functions. When the regularization parameters $\gamma$ and $\lambda$ are set to zero, the objective reduces to that of traditional gradient tree boosting. Besides this regularized objective, XGBoost uses two additional techniques to further prevent overfitting: shrinkage, introduced by Friedman (2002), and feature (column) subsampling.
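
To make the notation concrete, here is a minimal pure-Python sketch of the additive model and the regularized objective. It is not the XGBoost implementation: the `Tree` class, the two hand-written stumps, and the toy data are hypothetical, and squared error stands in for the generic loss $l$.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Tree:
    """One regression tree f(x) = w[q(x)] (illustrative, not an XGBoost type)."""
    q: Callable[[Sequence[float]], int]  # structure: maps an input to a leaf index
    w: List[float]                       # leaf scores; T = len(w)

    def predict(self, x: Sequence[float]) -> float:
        return self.w[self.q(x)]


def ensemble_predict(trees: List[Tree], x: Sequence[float]) -> float:
    # y_hat = sum over k of f_k(x)
    return sum(t.predict(x) for t in trees)


def omega(tree: Tree, gamma: float, lam: float) -> float:
    # Omega(f) = gamma * T + 0.5 * lambda * ||w||^2
    return gamma * len(tree.w) + 0.5 * lam * sum(wj * wj for wj in tree.w)


def objective(trees, X, y, gamma: float, lam: float) -> float:
    # Assumption: l(y_hat, y) = (y_hat - y)^2 is used as the example loss.
    loss = sum((ensemble_predict(trees, x) - yi) ** 2 for x, yi in zip(X, y))
    penalty = sum(omega(t, gamma, lam) for t in trees)
    return loss + penalty


# Two hypothetical depth-1 trees (stumps) on a two-feature input.
tree1 = Tree(q=lambda x: 0 if x[0] < 0.5 else 1, w=[1.0, 3.0])
tree2 = Tree(q=lambda x: 0 if x[1] < 2.0 else 1, w=[-0.5, 0.5])
trees = [tree1, tree2]

X = [[0.2, 1.0], [0.8, 3.0]]
y = [0.0, 4.0]

preds = [ensemble_predict(trees, x) for x in X]
print(preds)  # [0.5, 3.5]
print(objective(trees, X, y, gamma=0.1, lam=0.1))
print(objective(trees, X, y, gamma=0.0, lam=0.0))  # loss only, no penalty
```

With `gamma=0` and `lam=0` the penalty vanishes and only the training loss remains, which is the unregularized case described above.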

In [1]:
import sys

sys.path.insert(0, '..')

In [2]:
from data.dataset import Dataset
from sklearn.metrics import mean_squared_error, r2_score
from model.xgboost.XGBoost import XGBoost

In [41]:
TRN = Dataset.load_csv("ds/UMAP_100d_TRN")
TST = Dataset.load_csv("ds/UMAP_100d_TST")

In [21]:
TRN = Dataset.load_csv("ds/08SC/TRN_All")
TST = Dataset.load_csv("ds/TST_All")

In [None]:
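# Comments assume the project's wrapper keeps the usual xgboost meanings
# for these argument names.
model = XGBoost(
    task_type="regression",
    learning_rate=0.01,  # shrinkage applied to each new tree
    n_estimators=800,    # number of boosting rounds
    subsample=0.2,       # fraction of training rows sampled per tree
    reg_alpha=0.1,       # L1 penalty on leaf weights
    reg_lambda=0.1,      # L2 penalty on leaf weights (lambda)
    gamma=0.1,           # minimum loss reduction required to split (gamma)
)
model.fit(TRN)
# model.fit(TRN, verbose=True, eval_set=[(TST.X, TST.y)])

# Evaluate on the held-out test set.
y_pred = model.predict(TST)
print(r2_score(TST.y, y_pred))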
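
To illustrate what shrinkage (`learning_rate`), row subsampling (`subsample`), and the penalties $\lambda$ and $\gamma$ do during training, here is a toy boosting loop in pure Python. It is a sketch under stated assumptions, not the library's algorithm: it uses one-dimensional inputs, depth-1 trees, and the loss $l = \frac{1}{2}(y - \hat{y})^2$, for which every Hessian equals 1. It omits the L1 penalty (`reg_alpha`) and column subsampling. The function names `fit_stump` and `boost` are hypothetical, and the argument names only mirror the cell above.

```python
import random


def fit_stump(xs, grads, lam, gamma):
    """Fit a depth-1 tree on 1-D inputs for squared-error loss.

    With Hessians equal to 1, a leaf holding n rows with gradient sum G
    gets the score -G / (n + lam).
    """
    def score(G, n):
        return G * G / (n + lam)

    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G_total, n_total = sum(grads), len(xs)
    best = None  # (gain, threshold, G_left, n_left)
    G_left = 0.0
    for pos in range(n_total - 1):
        i, j = order[pos], order[pos + 1]
        G_left += grads[i]
        if xs[i] == xs[j]:
            continue  # cannot split between equal feature values
        n_left = pos + 1
        gain = 0.5 * (score(G_left, n_left)
                      + score(G_total - G_left, n_total - n_left)
                      - score(G_total, n_total)) - gamma
        if best is None or gain > best[0]:
            best = (gain, (xs[i] + xs[j]) / 2, G_left, n_left)

    if best is None or best[0] <= 0:
        # No split pays for its gamma penalty: keep a single leaf.
        w = -G_total / (n_total + lam)
        return lambda x: w
    _, thr, G_l, n_l = best
    w_l = -G_l / (n_l + lam)
    w_r = -(G_total - G_l) / (n_total - n_l + lam)
    return lambda x: w_l if x < thr else w_r


def boost(xs, ys, n_estimators, learning_rate, subsample, lam, gamma, seed=0):
    rng = random.Random(seed)
    trees = []
    preds = [0.0] * len(xs)
    k = max(1, int(subsample * len(xs)))
    for _ in range(n_estimators):
        rows = rng.sample(range(len(xs)), k)      # row subsampling
        grads = [preds[i] - ys[i] for i in rows]  # g_i for squared error
        tree = fit_stump([xs[i] for i in rows], grads, lam, gamma)
        trees.append(tree)
        # Shrinkage: each new tree contributes only a fraction of its score.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learning_rate * t(x) for t in trees)


# Toy data: a step function on [0, 2).
xs = [i / 10 for i in range(20)]
ys = [0.0 if x < 1.0 else 2.0 for x in xs]

toy_model = boost(xs, ys, n_estimators=200, learning_rate=0.1,
                  subsample=0.5, lam=0.1, gamma=0.0)
mse = sum((toy_model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(mse)
```

A smaller `learning_rate` makes each tree's contribution smaller, so more rounds are needed to reach the same fit; this is the trade-off behind pairing `learning_rate=0.01` with `n_estimators=800` in the cell above.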