# XGBoost

One of the most promising and scalable end-to-end tree boosting systems is Extreme Gradient Boosting (XGBoost), introduced by Chen and Guestrin (2016). It adds a regularization term to the learning objective to limit overfitting and uses parallel processing to accelerate training. The authors attribute XGBoost's scalability to several systems and algorithmic optimizations, including a sparsity-aware tree learning algorithm for sparse data and a theoretically justified weighted quantile sketch procedure that handles instance weights in approximate tree learning.

Given a training set $\{(x_i, y_i)\}_{i=1}^N$, Chen and Guestrin (2016) define a tree ensemble model that uses $K$ additive functions as

$$
\hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i),\ f_k \in \mathcal{F}
$$

where $\mathcal{F}$ is the space of regression trees (CART). Each tree $f_k$ consists of a structure $q$, whose decision rules map an example to one of the tree's leaves, and a vector of leaf scores $w$. The final prediction is the sum of the scores of the leaves that the example reaches across all trees. To learn the set of functions used in the model, we minimize the following regularized objective:

$$
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega (f_k)
$$

where $\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$, $T$ is the number of leaves in the tree, $w$ is the vector of leaf scores, $l$ is a differentiable convex loss function, and $\Omega$ penalizes the complexity of the regression tree functions. When the regularization parameters $\gamma$ and $\lambda$ are set to zero, the objective reduces to that of traditional gradient tree boosting. Besides the regularized objective, XGBoost uses two additional techniques to further prevent overfitting: shrinkage, introduced by Friedman (2002), and column (feature) subsampling.
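
To make the two formulas concrete, the following is a minimal pure-Python sketch of the additive prediction and the regularized objective. It is an illustration only, not XGBoost's implementation: the names `Tree`, `predict`, `omega`, and `objective` are hypothetical, the two stumps are written by hand rather than learned, and squared error is assumed as the loss $l$.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Tree:
    """One regression tree f_k: a structure q plus a vector of leaf scores w."""
    q: Callable[[Sequence[float]], int]  # maps a feature vector to a leaf index
    w: List[float]                       # one score per leaf, so T = len(w)

    def __call__(self, x: Sequence[float]) -> float:
        return self.w[self.q(x)]


def predict(trees: List[Tree], x: Sequence[float]) -> float:
    """phi(x) = sum over k of f_k(x)."""
    return sum(tree(x) for tree in trees)


def omega(tree: Tree, gamma: float, lam: float) -> float:
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2."""
    return gamma * len(tree.w) + 0.5 * lam * sum(wj * wj for wj in tree.w)


def objective(trees, X, y, gamma, lam):
    """Training loss plus complexity penalty (squared-error loss is an assumption)."""
    loss = sum((predict(trees, x) - yi) ** 2 for x, yi in zip(X, y))
    penalty = sum(omega(tree, gamma, lam) for tree in trees)
    return loss + penalty


# Two hand-written stumps (hypothetical values, not learned from data).
trees = [
    Tree(q=lambda x: 0 if x[0] < 1.0 else 1, w=[-0.5, 0.5]),
    Tree(q=lambda x: 0 if x[1] < 0.0 else 1, w=[0.25, -0.25]),
]
X = [[0.0, -1.0], [2.0, 1.0]]
y = [0.0, 0.0]

print(predict(trees, X[0]))                         # sum of the two leaf scores
print(objective(trees, X, y, gamma=0.0, lam=0.0))   # unregularized: loss only
print(objective(trees, X, y, gamma=1.0, lam=2.0))   # loss plus penalty
```

With `gamma=0.0` and `lam=0.0` the penalty vanishes and only the training loss remains, which mirrors the remark that the objective then reduces to traditional gradient tree boosting.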

In [None]:
import sys

sys.path.insert(0, '..')

In [None]:
from data.dataset import Dataset
from sklearn.metrics import mean_squared_error, r2_score
from model.xgboost.XGBoost import XGBoost
from model.abstractmodel import AbstractModel
from ray import tune

In [None]:
TRN = Dataset.load_csv("ds/TRN_All")
TST1 = Dataset.load_csv("ds/TST_1_All")
TST2 = Dataset.load_csv("ds/TST_2_All")

In [None]:
SPACE = {
    "n_estimators": tune.choice([100, 150, 200, 250, 300, 400, 500, 700, 900]),
    "learning_rate": tune.choice([0.1, 0.01, 0.001]),
    "gamma": tune.uniform(0, 1),
    "subsample": tune.uniform(0, 1),
    "reg_alpha": tune.uniform(0, 1),
    "reg_lambda": tune.uniform(0, 1),
}

In [None]:
def trainable_func(config: dict, dataset: Dataset):
    # Hold out a validation split so each trial is scored on data it was not fit on.
    trn, val = dataset.split()
    model = XGBoost(
        task_type="regression",
        n_estimators=config["n_estimators"],
        learning_rate=config["learning_rate"],
        gamma=config["gamma"],  # sampled in SPACE, so it has to be passed to the model
        subsample=config["subsample"],
        reg_alpha=config["reg_alpha"],
        reg_lambda=config["reg_lambda"],
        verbose=False,
        # eval_metric=mean_squared_error,
        # early_stopping_rounds=10
    )
    # model.fit(trn, verbose=True, eval_set=[(val.X, val.y)])
    model.fit(trn)
    pred = model.predict(val)
    # Report the validation RMSE for this trial.
    tune.report(rmse=mean_squared_error(val.y, pred, squared=False))

In [None]:
from ray.tune.search import BasicVariantGenerator

tuner = AbstractModel.tuner(
    trainable_func,
    SPACE,
    num_samples=100,
    search_alg=BasicVariantGenerator(max_concurrent=1),
    dataset=TRN,
    metric_columns=["rmse"]
)
tune_result = tuner.fit()

In [None]:
best_result = tune_result.get_best_result(metric="rmse", mode="min")
best_result.config

In [None]:
TRN = Dataset.load_csv("ds/all/TRN_All")
TST1 = Dataset.load_csv("ds/all/TST_1_All")
TST2 = Dataset.load_csv("ds/all/TST_2_All")

In [None]:
# Overrides TRN from the previous cell; the test sets below are unchanged.
TRN = Dataset.load_csv("ds/2019trn/TRN_All")
TST1 = Dataset.load_csv("ds/all/TST_1_All")
TST2 = Dataset.load_csv("ds/all/TST_2_All")

In [None]:
trn_sets, val_sets = TRN.k_fold_split(5)

In [None]:
k_pred_tst1 = []
k_pred_tst2 = []

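# Train one model per fold and collect its predictions on both test sets.
# The validation folds are not used in this loop.
for trn, val in zip(trn_sets, val_sets):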
    model = XGBoost(
        task_type="regression",
        learning_rate=0.1,
        n_estimators=175,
        subsample=0.4,
        # reg_alpha=0.,
        # reg_lambda=0.4,
        # gamma=0.3,
        # grow_policy="lossguide",
        # random_state=1234
    )
    model.fit(trn)
    k_pred_tst1.append(model.predict(TST1))
    k_pred_tst2.append(model.predict(TST2))

In [None]:
import pandas as pd

# Average the per-fold predictions into one ensemble prediction per test set.
pred_tst1 = pd.concat([pd.DataFrame(k_pred) for k_pred in k_pred_tst1], axis=1).mean(axis=1)
pred_tst2 = pd.concat([pd.DataFrame(k_pred) for k_pred in k_pred_tst2], axis=1).mean(axis=1)

In [None]:
print(f"TST1 : RMSE {mean_squared_error(TST1.y, pred_tst1, squared=False)}")
print(f"TST2 : RMSE {mean_squared_error(TST2.y, pred_tst2, squared=False)}")

In [None]:
print(f"TST1 : R^2 {r2_score(TST1.y, pred_tst1)}")
print(f"TST2 : R^2 {r2_score(TST2.y, pred_tst2)}")

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

model_name = "XGBoost"
min_ax1, max_ax1 = -7, -1
min_ax2, max_ax2 = -11, -1

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))

sns.lineplot(x=[min_ax1, max_ax1], y=[min_ax1, max_ax1], ax=ax1, color="black")
sns.lineplot(x=[min_ax2, max_ax2], y=[min_ax2, max_ax2], ax=ax2, color="black")

sns.regplot(
    data=pd.DataFrame({
        "True logS (mol/L)": TST1.y.values[:, 0],
        "Predicted logS (mol/L)": pred_tst1
    }),
    x="True logS (mol/L)",
    y="Predicted logS (mol/L)",
    ax=ax1
)
sns.regplot(
    data=pd.DataFrame({
        "True logS (mol/L)": TST2.y.values[:, 0],
        "Predicted logS (mol/L)": pred_tst2
    }),
    x="True logS (mol/L)",
    y="Predicted logS (mol/L)",
    ax=ax2
)

ax1.set_xlim(min_ax1, max_ax1)
ax1.set_ylim(min_ax1, max_ax1)
ax2.set_xlim(min_ax2, max_ax2)
ax2.set_ylim(min_ax2, max_ax2)

ax1.set_title(f"2019 Solubility Challenge Test Set 1 ({model_name})\n"
              f"RMSE: {mean_squared_error(TST1.y, pred_tst1, squared=False):.3f}, $R^2$: {r2_score(TST1.y, pred_tst1):.3f}")
ax2.set_title(f"2019 Solubility Challenge Test Set 2 ({model_name})\n"
              f"RMSE: {mean_squared_error(TST2.y, pred_tst2, squared=False):.3f}, $R^2$: {r2_score(TST2.y, pred_tst2):.3f}")

# plt.axis("equal")

In [None]:
TRN = Dataset.load_csv("ds/all/UMAP_100d_TRN")
TST1 = Dataset.load_csv("ds/all/UMAP_100d_TST1")
TST2 = Dataset.load_csv("ds/all/UMAP_100d_TST2")

In [None]:
model = XGBoost(
    task_type="regression",
    learning_rate=0.01,
    n_estimators=600,
    subsample=0.3,
    reg_alpha=0.2,
    reg_lambda=0.1,
    gamma=0.1,
)
model.fit(TRN, verbose=True, eval_set=[(TST1.X, TST1.y)])
print(r2_score(TST1.y, model.predict(TST1)))

In [None]:
model.fit(TRN, verbose=True, eval_set=[(TST1.X, TST1.y)])
print(r2_score(TST1.y, model.predict(TST1)))
model.fit(TRN, verbose=True, eval_set=[(TST2.X, TST2.y)])
print(r2_score(TST2.y, model.predict(TST2)))