## GBDT

GBDT (Gradient Boosting Decision Tree) is one of the core models of ensemble learning and is built on decision trees. It combines three ingredients: decision trees, boosting, and gradient descent.

A decision tree is built by repeatedly selecting the feature and split point that best satisfy a criterion such as information gain (or another criterion, for example Gini impurity or squared error). Boosting is an ensemble learning method that linearly combines multiple weak learners into a strong learner. In GBDT, the weak learner is a CART tree. Using gradient descent to optimize the boosting tree model yields the gradient boosting tree model.

A boosting tree model can be written as follows:
$$
f_{M}(x)=\sum_{m=1}^{M} T\left(x ; \Theta_{m}\right)
$$

Given the model from the previous step, the model at the $m$-th step is:
$$
f_{m}(x)=f_{m-1}(x)+T\left(x ; \Theta_{m}\right)
$$

Then we optimize the parameters of the next tree by the following objective function:
$$
\hat{\Theta}_{m}=\underset{\Theta_{m}}{\arg \min } \sum_{i=1}^{N} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+T\left(x_{i} ; \Theta_{m}\right)\right)
$$

Taking the boosting tree for a regression problem as an example, a regression tree with $J$ leaf regions $R_{j}$ and constant leaf outputs $c_{j}$ can be expressed as follows:
$$
T\left(x ; \Theta \right) = \sum ^{J}_{j=1}c_{j}I(x \in R_{j})
$$
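
To make the formula concrete, here is a minimal sketch of a regression tree evaluated as a piecewise-constant function on one-dimensional inputs. The function name `tree_predict`, the thresholds, and the leaf values are hypothetical and chosen only for illustration; they are not part of the `CART` module used later.

```python
import numpy as np

def tree_predict(x, thresholds, leaf_values):
    """Evaluate T(x) = sum_j c_j * I(x in R_j) for 1-D inputs.

    The regions R_j are the intervals cut out by the sorted thresholds,
    so len(leaf_values) must equal len(thresholds) + 1.
    """
    x = np.asarray(x, dtype=float)
    # index j of the region that contains each x
    region = np.searchsorted(thresholds, x, side="right")
    return np.asarray(leaf_values, dtype=float)[region]

# hypothetical tree: R_1 = (-inf, 2), R_2 = [2, 5), R_3 = [5, inf)
thresholds = np.array([2.0, 5.0])
leaf_values = np.array([1.0, 3.5, 7.0])   # c_1, c_2, c_3

pred = tree_predict([0.0, 2.0, 4.9, 10.0], thresholds, leaf_values)
print(pred)
```

Each input falls into exactly one region, so only one indicator in the sum is non-zero and the prediction is simply the value $c_{j}$ of that region.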

The initial model, the model at the $m$-th step, and the final model are:
$$
\begin{aligned}
&f_{0}(x)=0 \\
&f_{m}(x)=f_{m-1}(x)+T\left(x ; \Theta_{m}\right), m=1,2, \cdots, M \\
&f_{M}(x)=\sum_{m=1}^{M} T\left(x ; \Theta_{m}\right)
\end{aligned}
$$

Given the model $f_{m-1}(x)$ from step $m-1$, we solve:
$$
\hat{\Theta}_{m}=\underset{\Theta_{m}}{\arg \min } \sum_{i=1}^{N} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+T\left(x_{i} ; \Theta_{m}\right)\right)
$$

When the loss function is squared loss:
$$
L(y, f(x))=(y-f(x))^{2}
$$

The corresponding loss can be expressed as:
$$
\begin{aligned}
&L\left(y, f_{m-1}(x)+T\left(x ; \Theta_{m}\right)\right) \\
=&\left[y-f_{m-1}(x)-T\left(x ; \Theta_{m}\right)\right]^{2} \\
=&\left[r-T\left(x ; \Theta_{m}\right)\right]^{2}
\end{aligned}
$$

Here $r=y-f_{m-1}(x)$ is the residual of the current model, so with squared loss each iteration of the boosting tree model fits a new tree to the residuals.
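
The identity can be checked numerically. The sketch below uses made-up targets, made-up predictions of the previous model, and an arbitrary candidate tree output; it only verifies that the squared loss of the updated model equals the squared error between the residual and the tree output.

```python
import numpy as np

# hypothetical targets and predictions of the previous model f_{m-1}
y = np.array([3.0, 5.0, 8.0, 10.0])
f_prev = np.array([4.0, 4.0, 9.0, 9.0])

# residuals of the current model
r = y - f_prev

# output of an arbitrary candidate tree T(x; Theta_m) on the same samples
T = np.array([-0.5, 0.5, 0.0, 1.0])

# squared loss of the updated model f_{m-1} + T
loss_direct = np.sum((y - (f_prev + T)) ** 2)
# squared error between the residuals and the tree output
loss_residual = np.sum((r - T) ** 2)

print(loss_direct, loss_residual)
```

Because the two quantities agree for any tree output, minimizing the loss of the updated model is the same as regressing the new tree on the residuals.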

However, in practice, not every loss function is as easy to optimize as the squared loss. Gradient boosting therefore uses the negative gradient of the loss function, evaluated at the current model, as an approximation of the residual in the regression boosting tree:
$$
r_{im}=-\left[\frac{\partial L\left(y_{i}, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)}\right]_{f(x)=f_{m-1}(x)}
$$
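
As an illustration, the sketch below estimates the negative gradient by a central finite difference for two losses. It assumes the squared loss carries a factor of $\frac{1}{2}$ (as the `SquareLoss` class later in this section does), so that the negative gradient equals the residual exactly; without that factor it is twice the residual. The absolute loss and the helper names are additions for illustration only.

```python
import numpy as np

def squared_loss(y, f):
    # 0.5 * (y - f)^2, so that -dL/df = y - f
    return 0.5 * (y - f) ** 2

def absolute_loss(y, f):
    # |y - f|, so that -dL/df = sign(y - f) wherever y != f
    return np.abs(y - f)

def negative_gradient(loss, y, f, eps=1e-6):
    # central finite-difference estimate of -dL/df at the current predictions
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

# hypothetical targets and current predictions f_{m-1}(x_i)
y = np.array([3.0, 5.0, 8.0])
f = np.array([4.0, 4.0, 9.5])

r_sq = negative_gradient(squared_loss, y, f)
r_abs = negative_gradient(absolute_loss, y, f)
print(r_sq)    # close to the residuals y - f
print(r_abs)   # close to sign(y - f)
```

For the absolute loss the pseudo-residual only carries the sign of the error, which is one reason the leaf values are re-estimated after each tree is fitted.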

Therefore, combining the boosting tree with gradient boosting, the general procedure of the GBDT algorithm can be summarized as:

(1) Initialize the weak learner:
$$
f_{0}(x) = \arg \min_{c} \sum^{N}_{i=1}L(y_{i}, c)
$$

(2) For $m = 1,2, \cdots , M$:

Calculate the negative gradient (pseudo-residual) for every sample $i= 1,2, \cdots , N$:
$$
r_{im}=-\left[\frac{\partial L\left(y_{i}, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)}\right]_{f(x)=f_{m-1}(x)}
$$

The residuals obtained in the previous step are used as the new target values: the data $(x_{i}, r_{im}), i= 1,2, \cdots , N$ serve as the training data for the $m$-th regression tree, whose leaf regions are $R_{jm}, j= 1,2, \cdots , J$, where $J$ is the number of leaf nodes of the regression tree.

Calculate the best-fit value for each leaf region $j= 1,2, \cdots , J$:
$$
r_{j m}=\underset{r}{\arg \min } \sum_{x_{i} \in R_{j m}} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+r \right)
$$

Update the strong learner:
$$
f_{m}(x) = f_{m-1}(x) + \sum^{J}_{j=1} r_{jm}I(x \in R_{jm})
$$

(3) Obtain the final learner:
$$
f(x)=f_{M}(x) = f_{0}(x) + \sum^{M}_{m=1} \sum^{J}_{j=1} r_{jm}I(x \in R_{jm})
$$
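
The three steps can be sketched end to end in NumPy for the special case of squared loss on one-dimensional inputs. This is a simplified illustration under stated assumptions, not the implementation used in the rest of this section: the weak learner is a depth-1 tree (a stump), the helper names are hypothetical, the data are synthetic, and a `learning_rate` shrinkage factor is added even though it does not appear in the formulas above.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree to targets r; returns (threshold, left, right)."""
    best = (np.inf, None, r.mean(), r.mean())
    xs = np.unique(x)
    # candidate thresholds: midpoints between consecutive distinct inputs
    for t in (xs[:-1] + xs[1:]) / 2.0:
        left, right = r[x <= t], r[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[0]:
            # for squared loss the optimal leaf value is the mean residual in the leaf
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def stump_predict(x, stump):
    t, left_value, right_value = stump
    if t is None:  # no valid split was found
        return np.full(x.shape, left_value, dtype=float)
    return np.where(x <= t, left_value, right_value)

def gbdt_fit(x, y, n_estimators=100, learning_rate=0.1):
    f0 = y.mean()                          # step (1): best constant for squared loss
    pred = np.full(y.shape, f0, dtype=float)
    stumps = []
    for _ in range(n_estimators):          # step (2)
        r = y - pred                       # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(x, r)            # fit the tree and its leaf values
        pred += learning_rate * stump_predict(x, stump)   # update the learner
        stumps.append(stump)
    return f0, stumps

def gbdt_predict(x, f0, stumps, learning_rate=0.1):
    pred = np.full(x.shape, f0, dtype=float)   # step (3): f_0 plus all trees
    for stump in stumps:
        pred += learning_rate * stump_predict(x, stump)
    return pred

# synthetic data, for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)

f0, stumps = gbdt_fit(x, y)
mse_init = np.mean((y - f0) ** 2)
mse_final = np.mean((y - gbdt_predict(x, f0, stumps)) ** 2)
print(mse_init, mse_final)
```

With squared loss, fitting the stump to the residuals already yields the optimal leaf values, so the separate leaf-value step collapses into the tree fit. For other losses the leaf values would be recomputed from the loss itself.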

In [9]:
import numpy as np
import matplotlib.pyplot as plt
from CART import TreeNode, BinaryDecisionTree, ClassificationTree, RegressionTree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from CART import feature_split, calculate_gini, data_shuffle

In [10]:
class GBDT(object):
    def __init__(self, n_estimators, learning_rate, min_samples_split, min_gini_impurity, max_depth, regression):
        # number of tree
        self.n_estimators = n_estimators
        # step size for weight update
        self.learning_rate = learning_rate
        self.min_samples_split = min_samples_split
        self.min_gini_impurity = min_gini_impurity
        self.max_depth = max_depth
        # True for regression, False for classification
        self.regression = regression
        # squared loss is used for regression
        self.loss = SquareLoss()
        # classification needs a different loss; note that the softmax
        # loss class referenced here is not defined in this section
        if not self.regression:
            self.loss = SoftMaxLoss()
        # combine multiple trees to form a strong learner
        self.estimators = []
        for i in range(self.n_estimators):
            self.estimators.append(RegressionTree(min_samples_split=self.min_samples_split,
                                                min_gini_impurity=self.min_gini_impurity,
                                                max_depth=self.max_depth))

    def fit(self, X, y):
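        # initialize the additive model with the first tree, fitted on the targets
        self.estimators[0].fit(X, y)
        # current prediction of the ensemble
        y_pred = self.estimators[0].predict(X)
        # forward stagewise training of the remaining trees
        for i in range(1, self.n_estimators):
            # gradient of the loss with respect to the current prediction
            gradient = self.loss.gradient(y, y_pred)
            # fit the next tree to the gradient ...
            self.estimators[i].fit(X, gradient)
            # ... and step against it, i.e. along the negative gradient
            y_pred -= np.multiply(self.learning_rate, self.estimators[i].predict(X))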

    def predict(self, X):
        # prediction for regression tree
        y_pred = self.estimators[0].predict(X)
        for i in range(1, self.n_estimators):
            y_pred -= np.multiply(self.learning_rate, self.estimators[i].predict(X))
        # prediction for classification tree
        if not self.regression:
            # transform the predicted value into probability
            y_pred = np.exp(y_pred) / np.expand_dims(np.sum(np.exp(y_pred), axis=1), axis=1)
            # transform into label
            y_pred = np.argmax(y_pred, axis=1)
        return y_pred

In [11]:
# regression tree
class GBDTRegressor(GBDT):
    def __init__(self, n_estimators=300, learning_rate=0.1, min_samples_split=2,
                min_var_reduction=1e-6, max_depth=3):
        super(GBDTRegressor, self).__init__(n_estimators=n_estimators,
                                            learning_rate=learning_rate,
                                            min_samples_split=min_samples_split,
                                            min_gini_impurity=min_var_reduction,
                                            max_depth=max_depth,
                                            regression=True)

In [12]:
# classification tree
class GBDTClassifier(GBDT):
    def __init__(self, n_estimators=200, learning_rate=.5, min_samples_split=2,
                min_info_gain=1e-6, max_depth=2):
        super(GBDTClassifier, self).__init__(n_estimators=n_estimators,
                                             learning_rate=learning_rate,
                                             min_samples_split=min_samples_split,
                                             min_gini_impurity=min_info_gain,
                                             max_depth=max_depth,
                                             regression=False)

    def fit(self, X, y):
        super(GBDTClassifier, self).fit(X, y)

In [13]:
class SquareLoss():
    def loss(self, y, y_pred):
        return 0.5 * np.power((y - y_pred), 2)

    def gradient(self, y, y_pred):
        return -(y - y_pred)

In [14]:
from sklearn.datasets import fetch_california_housing
# California housing data (used here in place of the earlier Boston housing example)
X, y = fetch_california_housing(return_X_y=True)
X = X.astype(np.float32)
# hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = GBDTRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Color map
cmap = plt.get_cmap('viridis')
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error of NumPy GBRT:", mse)

# Plot the results
m1 = plt.scatter(range(X_test.shape[0]), y_test, color=cmap(0.5), s=10)
m2 = plt.scatter(range(X_test.shape[0]), y_pred, color='black', s=10)
plt.suptitle("Regression Tree")
plt.title("MSE: %.2f" % mse, fontsize=10)
plt.xlabel('sample')
plt.ylabel('house price')
plt.legend((m1, m2), ("Test data", "Prediction"), loc='lower right')
plt.show()
