## XGBoost

XGBoost stands for eXtreme Gradient Boosting. Like GBDT, it is an ensemble learning algorithm, but it usually performs better than GBDT.

Like GBDT, XGBoost is an additive model composed of multiple base models, so it can be expressed as:
$$
\hat{y}_{i} = \sum_{k=1}^{K} f_{k}(x_{i})
$$

Suppose the tree to be trained in the $t$-th iteration is $f_{t}(x)$. Then:
$$
\hat{y}_{i}^{(t)} = \sum_{k=1}^{t} f_{k}(x_{i}) = \hat{y}_{i}^{(t-1)} + f_{t}(x_{i})
$$
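
The additive form above can be sketched in a few lines of plain Python. This is only an illustration: `stump_a` and `stump_b` are hypothetical stand-ins for trained trees, not part of any library.

```python
def stump_a(x):
    # hypothetical base model: a one-split "tree" on a scalar input
    return 1.0 if x > 0.5 else -1.0


def stump_b(x):
    # second hypothetical base model
    return 0.25 if x > 0.2 else -0.25


def staged_predictions(models, x):
    """Return [y_hat^(1), ..., y_hat^(t)] for a single input x."""
    y_hat = 0.0  # y_hat^(0): prediction before any tree is added
    stages = []
    for f_t in models:
        y_hat = y_hat + f_t(x)  # y_hat^(t) = y_hat^(t-1) + f_t(x)
        stages.append(y_hat)
    return stages


stages = staged_predictions([stump_a, stump_b], x=0.3)
print(stages)
```

The last entry of `stages` equals the sum of all base-model outputs, which is the $K$-tree prediction $\hat{y}_{i}$.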

The original form of the objective function is:
$$
Obj = \sum_{i=1}^{n} l(y_{i}, \hat{y}_{i}) + \sum_{i=1}^{t} \Omega (f_{i})
$$

$\sum^{t}_{i=1} \Omega (f_{i})$ is the regularization term of the objective function. It is the sum of the complexities of all $t$ trees and helps prevent the model from overfitting.

XGBoost also uses the forward stagewise algorithm. Taking the model at step $t$ as an example, the prediction for the $i$-th sample $x_i$ is:
$$
\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t-1)} + f_{t}(x_{i})
$$

Because $\hat{y}_{i}^{(t-1)}$ is the prediction from step $t-1$, it can be treated as a known constant at step $t$, while $f_{t}(x_{i})$ is the tree to be learned at step $t$. The regularization term can be split in the same way: since the structures of the first $t-1$ trees are already determined, the sum of their complexities is also a constant. The objective function can then be rewritten as:
$$
\begin{aligned}
O b j^{(t)} &=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t)}\right)+\sum_{i=1}^{t} \Omega\left(f_{i}\right) \\
&=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)\right)+\Omega (f_{t})+\sum_{i=1}^{t-1} \Omega\left(f_{i}\right) \\
&=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)\right)+\Omega\left(f_{t}\right)+\text { constant }
\end{aligned}
$$

Then, using a second-order Taylor expansion around $\hat{y}_{i}^{(t-1)}$, the loss function can be approximated as:
$$
l\left(y_{i}, \hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)\right) \simeq l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)+g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)
$$

$g_{i}$ is the first derivative of the loss function and $h_{i}$ is the second derivative, both taken with respect to the previous prediction $\hat{y}_{i}^{(t-1)}$. Because XGBoost uses second-derivative information, any custom loss function must be twice differentiable.
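
To see how close the second-order expansion is, the sketch below compares the exact binary log loss with its Taylor approximation for one sample. The log loss is taken as a function of the raw score (before the sigmoid), for which $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$. The numeric values are arbitrary and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logloss(y, z):
    # binary log loss as a function of the raw score z
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_hess(y, z):
    # first and second derivatives of logloss with respect to z
    p = sigmoid(z)
    return p - y, p * (1 - p)


y, z_prev, step = 1.0, 0.3, 0.1  # label, y_hat^(t-1), f_t(x_i)
g, h = grad_hess(y, z_prev)

exact = logloss(y, z_prev + step)
approx = logloss(y, z_prev) + g * step + 0.5 * h * step ** 2
print(exact, approx)
```

For a small step the two values agree closely. The gap grows with the size of $f_t(x_i)$, which is one reason a small learning rate is commonly used.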

Using the squared loss as an example:
$$
\begin{aligned}
&l(y_{i}, \hat{y}_{i}^{(t-1)}) = (y_{i}-\hat{y}_{i}^{(t-1)})^{2} \\
&g_{i}=\frac{\partial l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)}{\partial \hat{y}_{i}^{(t-1)}}=-2\left(y_{i}-\hat{y}_{i}^{(t-1)}\right)\\
&h_{i}=\frac{\partial^{2} l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)}{\partial\left(\hat{y}_{i}^{(t-1)}\right)^{2}}=2
\end{aligned}
$$
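
As a quick check of these two derivatives, the following sketch evaluates them for one sample and confirms that, for the squared loss, the second-order expansion reproduces the loss after the update (all higher derivatives are zero). The numbers are arbitrary illustration values.

```python
def square_loss(y, y_hat):
    return (y - y_hat) ** 2


def square_grad_hess(y, y_hat):
    # g = -2 (y - y_hat), h = 2, as derived above
    return -2.0 * (y - y_hat), 2.0


y, y_prev, step = 3.0, 2.5, 0.4  # label, y_hat^(t-1), f_t(x_i)
g, h = square_grad_hess(y, y_prev)

exact = square_loss(y, y_prev + step)
taylor = square_loss(y, y_prev) + g * step + 0.5 * h * step ** 2
print(g, h, exact, taylor)
```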

Substituting this second-order Taylor expansion into the XGBoost objective function derived above gives the approximate objective:
$$
O b j^{(t)} \simeq \sum_{i=1}^{n}\left[l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)+g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right)+\text { constant }
$$

The terms $l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)$ and the constant do not depend on $f_{t}$, so they can be dropped. The simplified objective function is:
$$
O b j^{(t)} \simeq \sum_{i=1}^{n}\left[g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right)
$$

Therefore, at each step it is only necessary to compute the first and second derivatives of the loss function and then optimize the objective function to obtain $f_{t}(x)$. The boosting model is then obtained by adding up the trees from every step.

A decision tree has two key components: the weight vector of the leaf nodes, $w$, and the mapping from instances to leaf nodes, $q$. A tree can therefore be written as:
$$
f_{t}(x) = w_{q(x)}
$$

The regularization term $\Omega$ measures model complexity. For a single tree it is determined by the number of leaf nodes $T$ and the leaf weights $w$, and the complexity of the whole model is the sum over all trees. For one tree:
$$
\Omega(f_{t}) = \gamma T + \frac{1}{2}\lambda \sum^{T}_{j=1}w^{2}_{j}
$$
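
A minimal sketch of these two definitions, assuming a scalar input and a hypothetical three-leaf tree (the split points, weights, $\gamma$ and $\lambda$ are made up for illustration):

```python
def q(x):
    # hypothetical tree structure: maps an input to a leaf index
    if x < 2.0:
        return 0
    if x < 5.0:
        return 1
    return 2


w = [0.5, -1.0, 2.0]  # one weight per leaf, so T = 3


def f_t(x):
    # a tree is the leaf weight selected by the structure q
    return w[q(x)]


def tree_complexity(leaf_weights, gamma, lam):
    # Omega(f_t) = gamma * T + 0.5 * lambda * sum_j w_j^2
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w_j * w_j for w_j in leaf_weights)


prediction = f_t(3.0)
omega = tree_complexity(w, gamma=0.1, lam=2.0)
print(prediction, omega)
```

Larger $\gamma$ penalizes every extra leaf, while larger $\lambda$ shrinks the leaf weights toward zero.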

Next, group the samples by leaf node: all samples $x_i$ assigned to the $j$-th leaf form the index set $I_{j}=\{i \mid q(x_{i})=j \}$. The objective function can then be written as:
$$
\begin{aligned}
O b j^{(t)} & \simeq \sum_{i=1}^{n}\left[g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right) \\
&=\sum_{i=1}^{n}\left[g_{i} w_{q\left(x_{i}\right)}+\frac{1}{2} h_{i} w_{q\left(x_{i}\right)}^{2}\right]+\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2} \\
&=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T
\end{aligned}
$$

Define $G_{j}=\sum _{i \in I_{j}}g_{i}$ and $H_{j}=\sum _{i \in I_{j}}h_{i}$:

$ \enspace \text{•} \enspace G_{j}$: the sum of the first-order partial derivatives of the samples in leaf node $j$. It is a constant at step $t$ once the tree structure is fixed.

$ \enspace \text{•} \enspace H_{j}$: the sum of the second-order partial derivatives of the samples in leaf node $j$. It is a constant at step $t$ once the tree structure is fixed.

Substituting $G_{j}$ and $H_{j}$ into the objective function above gives the final form of the XGBoost objective:
$$
Obj^{(t)}=\sum_{j=1}^{T}\left[G_{j} w_{j}+\frac{1}{2}(H_{j}+\lambda) w_{j}^{2}\right]+\gamma T
$$
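
The regrouping step can be checked numerically. The sketch below uses made-up gradients, Hessians, and leaf assignments, computes $G_j$ and $H_j$, and confirms that the per-leaf form of the objective equals the per-sample form it was derived from.

```python
def leaf_sums(leaf_index, g, h, n_leaves):
    # accumulate G_j and H_j over the samples assigned to each leaf
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for j, g_i, h_i in zip(leaf_index, g, h):
        G[j] += g_i
        H[j] += h_i
    return G, H


def objective_by_leaf(G, H, w, gamma, lam):
    per_leaf = sum(G_j * w_j + 0.5 * (H_j + lam) * w_j ** 2
                   for G_j, H_j, w_j in zip(G, H, w))
    return per_leaf + gamma * len(w)


def objective_by_sample(leaf_index, g, h, w, gamma, lam):
    per_sample = sum(g_i * w[j] + 0.5 * h_i * w[j] ** 2
                     for j, g_i, h_i in zip(leaf_index, g, h))
    omega = gamma * len(w) + 0.5 * lam * sum(w_j ** 2 for w_j in w)
    return per_sample + omega


# illustrative values only: 5 samples, 2 leaves
leaf_index = [0, 0, 1, 1, 1]          # q(x_i) for each sample
g = [-1.0, -0.5, 0.5, 1.0, 0.5]
h = [1.0, 1.0, 1.0, 1.0, 1.0]
w = [0.5, -0.5]

G, H = leaf_sums(leaf_index, g, h, n_leaves=2)
obj_leaf = objective_by_leaf(G, H, w, gamma=0.1, lam=1.0)
obj_sample = objective_by_sample(leaf_index, g, h, w, gamma=0.1, lam=1.0)
print(G, H, obj_leaf, obj_sample)
```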



For a quadratic function of one variable, $Gx+\frac{1}{2}Hx^{2}$ with $H>0$ (that is, $a=\frac{1}{2}H$, $b=G$, $c=0$), the vertex formula gives the minimizer and the minimum value:
$$
\begin{gathered}
x^{*}=-\frac{b}{2 a}=-\frac{G}{H} \\
y^{*}=\frac{4 a c-b^{2}}{4 a}=-\frac{G^{2}}{2 H}
\end{gathered}
$$

Isolating the terms for a single leaf node $j$ from the objective function gives:
$$
G_{j} w_{j}+\frac{1}{2}(H_{j}+\lambda) w_{j}^{2}
$$

As shown above, $G_{j}$ and $H_{j}$ for the $t$-th tree can be computed. This expression is therefore a quadratic function of a single variable, the leaf weight $w_{j}$, and when $H_{j}+\lambda>0$ its minimum can be found with the vertex formula. Because the leaves are independent of each other, the whole objective function reaches its optimum when every leaf reaches its own optimum.

When the tree structure is fixed, setting the derivative of the expression above with respect to $w_{j}$ to zero gives the optimal leaf weight and the optimal objective value:
$$
\begin{gathered}
w_{j}^{*}=-\frac{G_{j}}{H_{j}+\lambda} \\
O b j=-\frac{1}{2} \sum_{j=1}^{T} \frac{G_{j}^{2}}{H_{j}+\lambda}+\gamma T
\end{gathered}
$$
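
The closed-form solution can be verified with a short sketch. The values of $G_j$, $H_j$, $\gamma$ and $\lambda$ below are illustrative, and the check simply confirms that moving any leaf weight away from $w_j^{*}$ increases the objective.

```python
def optimal_weights(G, H, lam):
    # w_j* = -G_j / (H_j + lambda)
    return [-G_j / (H_j + lam) for G_j, H_j in zip(G, H)]


def optimal_objective(G, H, gamma, lam):
    # Obj* = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    return -0.5 * sum(G_j ** 2 / (H_j + lam) for G_j, H_j in zip(G, H)) + gamma * len(G)


def objective(G, H, w, gamma, lam):
    # objective for arbitrary leaf weights w
    return sum(G_j * w_j + 0.5 * (H_j + lam) * w_j ** 2
               for G_j, H_j, w_j in zip(G, H, w)) + gamma * len(w)


G, H = [-1.5, 2.0], [2.0, 3.0]  # illustrative per-leaf sums
gamma, lam = 0.1, 1.0

w_star = optimal_weights(G, H, lam)
obj_star = optimal_objective(G, H, gamma, lam)
obj_perturbed = objective(G, H, [w_star[0] + 0.1, w_star[1]], gamma, lam)
print(w_star, obj_star, obj_perturbed)
```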

Compared with GBDT, XGBoost differs mainly in how the split gain is calculated, how leaf values are computed, and in its use of the second-order derivative of the loss function.

With second-order derivative information, XGBoost optimizes an approximation that stays close to the true loss. Its node-splitting procedure is essentially the same as that of a CART tree, but the gain of a split is calculated differently.

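Suppose a node is split on some feature, and let $G_{L}, H_{L}$ and $G_{R}, H_{R}$ be the gradient and Hessian sums of the samples sent to the left and right child. Before splitting, the objective function of the node is: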
$$
Obj_{1} = - \frac{1}{2}[\frac{(G_{L}+G_{R})^{2}}{H_{L}+H_{R}+ \lambda}]+ \gamma
$$

The objective function after splitting is:
$$
Obj_{2} = - \frac{1}{2}[\frac{G_{L}^{2}}{H_{L}+ \lambda}+\frac{G_{R}^{2}}{H_{R}+ \lambda}]+ 2 \gamma
$$

The gain of the split is the objective before splitting minus the objective after splitting:
$$
Gain = \frac{1}{2}[\frac{G_{L}^{2}}{H_{L}+ \lambda}+\frac{G_{R}^{2}}{H_{R}+ \lambda}-\frac{(G_{L}+G_{R})^{2}}{H_{L}+H_{R}+ \lambda}]- \gamma
$$
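
The gain formula leads naturally to a split search. The sketch below is a simple exhaustive scan over one feature, written under the assumption that every midpoint between consecutive distinct values is a candidate threshold. It illustrates the formula and is not a description of how the `xgboost` library searches for splits. The data are made up.

```python
def split_gain(G_L, H_L, G_R, H_R, gamma, lam):
    def score(G, H):
        return G * G / (H + lam)

    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma


def best_split(x, g, h, gamma, lam):
    """Scan one feature; return (threshold, gain) of the best positive-gain split."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    G_L = H_L = 0.0
    best_threshold, best_gain = None, 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue  # cannot split between equal feature values
        gain = split_gain(G_L, H_L, G_total - G_L, H_total - H_L, gamma, lam)
        if gain > best_gain:
            best_threshold, best_gain = 0.5 * (x[i] + x[nxt]), gain
    return best_threshold, best_gain


x = [1.0, 2.0, 3.0, 4.0]      # one feature
g = [-1.0, -1.0, 1.0, 1.0]    # illustrative gradients
h = [1.0, 1.0, 1.0, 1.0]      # illustrative Hessians

threshold, gain = best_split(x, g, h, gamma=0.1, lam=1.0)
print(threshold, gain)
```

If no candidate has positive gain, the function returns `None` as the threshold, which corresponds to leaving the node unsplit. This is how $\gamma$ acts as a minimum gain required for a split.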

In [1]:
import numpy as np
from CART import TreeNode, BinaryDecisionTree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from utils import cat_label_convert

In [2]:
class XGBoostTree(BinaryDecisionTree):
    # y arrives as [y_true | y_pred] stacked column-wise; split it back in two
    def _split(self, y):
        col = int(np.shape(y)[1]/2)
        y, y_pred = y[:, :col], y[:, col:]
        return y, y_pred

    # structure score 0.5 * G^2 / H of one node, summed over output columns
    # (G and H are sums of the raw gradients and Hessians; lambda = 0 here)
    def _gain(self, y, y_pred):
        G = self.loss.gradient(y, y_pred).sum(axis=0)
        H = self.loss.hess(y, y_pred).sum(axis=0)
        return 0.5 * np.sum(np.power(G, 2) / H)

    # gain of a split: score(left) + score(right) - score(parent)
    def _gain_by_taylor(self, y, y1, y2):
        # separate labels from current predictions for parent and children
        y, y_pred = self._split(y)
        y1, y1_pred = self._split(y1)
        y2, y2_pred = self._split(y2)

        true_gain = self._gain(y1, y1_pred)
        false_gain = self._gain(y2, y2_pred)
        gain = self._gain(y, y_pred)
        return true_gain + false_gain - gain

    # leaf value G / H (Newton step); the boosting loop subtracts it,
    # which matches the optimal weight w* = -G / H with lambda = 0
    def _approximate_update(self, y):
        y, y_pred = self._split(y)
        gradient = np.sum(self.loss.gradient(y, y_pred), axis=0)
        hessian = np.sum(self.loss.hess(y, y_pred), axis=0)
        update_approximation = gradient / hessian
        return update_approximation

    def fit(self, X, y):
        # plug the XGBoost gain and leaf-value rules into the CART base class
        self._impurity_calculation = self._gain_by_taylor
        self._leaf_value_calculation = self._approximate_update
        super().fit(X, y)

In [3]:
# loss function for classification
class Sigmoid:
    def __call__(self, x):
        return 1 / (1 + np.exp(-x))

    def gradient(self, x):
        return self.__call__(x) * (1 - self.__call__(x))

class LogLoss:
    def __init__(self):
        sigmoid = Sigmoid()
        self._func = sigmoid
        self._grad = sigmoid.gradient
    
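    # define loss function (negative log-likelihood of the sigmoid output)
    def loss(self, y, y_pred):
        p = self._func(y_pred)
        # clip the probability, not the raw score, to avoid log(0)
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))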

    # first-order derivative
    def gradient(self, y, y_pred):
        p = self._func(y_pred)
        return -(y - p)

    # second-order derivative
    def hess(self, y, y_pred):
        p = self._func(y_pred)
        return p * (1 - p)

In [4]:
# define XGBoost model based on the forward stagewise algorithm
class XGBoost:
    def __init__(self, n_estimators=200, learning_rate=0.001, min_samples_split=2, min_gini_impurity=999, max_depth=2):
        # number of tree
        self.n_estimators = n_estimators
        # step size for weight update
        self.learning_rate = learning_rate
        self.min_samples_split = min_samples_split
        self.min_gini_impurity = min_gini_impurity
        self.max_depth = max_depth

        # square loss for regression
        # self.loss = SquaresLoss()
        # logarithmic loss for classification
        self.loss = LogLoss()
        # initialize the list for classification tree
        self.trees = []
        # build the decision tree in iteration
        for _ in range(n_estimators):
            tree = XGBoostTree(
                    min_samples_split=self.min_samples_split,
                    min_gini_impurity=self.min_gini_impurity,
                    max_depth=self.max_depth,
                    loss=self.loss)
            self.trees.append(tree)

    def fit(self, X, y):
        y = cat_label_convert(y)
        y_pred = np.zeros(np.shape(y))
        # accumulate results after fitting each tree
        for i in range(self.n_estimators):
            tree = self.trees[i]
            y_true_pred = np.concatenate((y, y_pred), axis=1)
            tree.fit(X, y_true_pred)
            iter_pred = tree.predict(X)
            y_pred -= np.multiply(self.learning_rate, iter_pred)

    def predict(self, X):
        y_pred = None
        # prediction in iteration
        for tree in self.trees:
            iter_pred = tree.predict(X)
            if y_pred is None:
                y_pred = np.zeros_like(iter_pred)
            y_pred -= np.multiply(self.learning_rate, iter_pred)
        y_pred = np.exp(y_pred) / np.sum(np.exp(y_pred), axis=1, keepdims=True)
        # transform the prediction into label
        y_pred = np.argmax(y_pred, axis=1)
        return y_pred

In [5]:
from sklearn import datasets
data = datasets.load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)  
clf = XGBoost()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output (truncated warning and traceback):
  return np.array([X_left, X_right])
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [None]:
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt

params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',   
    'num_class': 3,     
    'gamma': 0.1,
    'max_depth': 2,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'eta': 0.001,
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, label=y_train)
num_rounds = 200
model = xgb.train(params, dtrain, num_rounds)

dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
plot_importance(model)
plt.show()