# Gradient Boosting

Suppose we want to use a predictor vector 
$x=(x_1,\ldots,x_k)$ to fit a model $Y=f(x),$ and that our aim is 
to minimize the expected loss

$$
E[ \ell(Y,f(x)) ]
$$

where $\ell$ is some loss function. Using training data 
$(x^{(i)},Y^{(i)}),i=1,\ldots,N,$ we try to minimize 
the *empirical risk*

$$
\frac{1}{N}\sum_{i=1}^N  \ell(Y^{(i)},f(x^{(i)}))
$$

**Examples of loss functions**

For example, using squared error loss we take 
$\ell(y,f(x)) = (y-f(x))^2.$
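
As a small illustration, the empirical risk under squared error loss can be computed directly. This is only a sketch: the function names `squared_error` and `empirical_risk` and the toy numbers are assumptions made for the example, not part of the notes.

```python
import numpy as np

def squared_error(y, fx):
    """Squared error loss l(y, f(x)) = (y - f(x))^2, applied elementwise."""
    return (y - fx) ** 2

def empirical_risk(loss, y, fx):
    """Average of the loss over the training sample."""
    return np.mean(loss(y, fx))

# Toy responses and fitted values (made up for illustration).
y = np.array([1.0, 2.0, 3.0])
fx = np.array([1.5, 2.0, 2.0])

risk = empirical_risk(squared_error, y, fx)
print(risk)
```

Passing the loss as an argument keeps the averaging step separate from the choice of loss, which mirrors the way the empirical risk is defined above.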

Another example is the loss function that is equivalent to the one we use in logistic regression. Here, **assuming that the response is $\pm 1$ valued** and viewing $f(x)$ as our estimate of $P[Y=1|x],$ we can take

$$
\ell(y,f(x)) = -\frac{1+y}{2}\log f(x) - \frac{1-y}{2}\log\left( 1-f(x) \right)
$$

so that the loss is $-\log f(x)$ when $y=1$ and $-\log(1-f(x))$ when $y=-1.$

To understand the connection with maximum likelihood, note that maximizing the log-likelihood 

$$
\sum_{i=1}^N \left\{ \frac{1+Y^{(i)}}{2} \log f(x^{(i)}) + \frac{1-Y^{(i)}}{2} \log (1-f(x^{(i)})) \right\}
$$

is equivalent to minimizing the expression

$$
-\frac{1}{N}\sum_{i=1}^N 
\left\{ \frac{1+Y^{(i)}}{2} \log f(x^{(i)}) + \frac{1-Y^{(i)}}{2} \log (1-f(x^{(i)})) \right\}
$$

which is the empirical risk for this loss.

Suppose we wish to predict a categorical variable $Y$ taking values in $\{ 1,\ldots,K\}$ with a probability vector $f(x) = (f_1(x),\ldots,f_K(x)),$ where we want $f_j$ close to 1 if $Y=j$ and $f_i$ close to zero for $i \neq j.$ A commonly used loss function is the cross-entropy
$$
\ell(y,f(x)) = - \sum_{j=1}^K I_{\{y=j\}} \log(f_j(x))
$$
and when we have data $Y^{(i)}$ and predictions $(\hat{f}^{(i)}_1,\ldots,\hat{f}^{(i)}_K)$  we try to minimize
$$
- \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^K I_{\{Y^{(i)}=j\}} \log(\hat{f}^{(i)}_j) 
$$
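
The average cross-entropy above can be sketched in a few lines of NumPy. The function name `cross_entropy_risk` and the toy probabilities are assumptions for illustration; labels are taken to be coded $1,\ldots,K$ as in the text.

```python
import numpy as np

def cross_entropy_risk(y, probs):
    """Mean cross-entropy.

    y     : integer labels in 1..K, length N
    probs : N x K array, row i holds the predicted class probabilities
    """
    n = len(y)
    # Only the probability assigned to the observed class contributes,
    # because the indicator is zero for every other class.
    p_observed = probs[np.arange(n), y - 1]
    return -np.mean(np.log(p_observed))

# Toy example with K = 3 classes (numbers made up for illustration).
y = np.array([1, 3, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.3, 0.5, 0.2]])

ce = cross_entropy_risk(y, probs)
print(ce)
```

The loss is zero only when every observation receives probability 1 on its observed class, and it grows without bound as that probability approaches zero.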




**Decision/regression stumps**

To illustrate how gradient boosting works, we focus on the simple case in which we build a classifier/prediction function $f$ using only the simplest possible trees, namely trees of depth 1 (stumps). Specifying one of these trees requires selection of

- a variable to split on
- a threshold (assume all predictor variables are continuous for simplicity)
- a final prediction to make at each leaf

Let ${\cal H}$ denote the family of all possible prediction functions of this form.
We will start the process of learning $f$ by picking the $f_0$ in ${\cal H}$ that minimizes the empirical risk.
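
To make the choice of $f_0$ concrete, here is a minimal sketch of an exhaustive search over stumps. It assumes squared error loss (so the best prediction at each leaf is the leaf mean) and takes midpoints between consecutive observed values as candidate thresholds; the names `fit_stump` and `predict_stump` are hypothetical helpers, not library functions.

```python
import numpy as np

def fit_stump(X, y):
    """Search all (variable, threshold) pairs for the least-squares stump.

    Returns (variable index, threshold, left prediction, right prediction),
    or None if no variable takes more than one distinct value.
    """
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            # Under squared error the optimal leaf prediction is the leaf mean.
            a, b = y[left].mean(), y[~left].mean()
            sse = np.sum((y[left] - a) ** 2) + np.sum((y[~left] - b) ** 2)
            if sse < best_sse:
                best_sse, best = sse, (j, t, a, b)
    return best

def predict_stump(stump, X):
    j, t, a, b = stump
    return np.where(X[:, j] <= t, a, b)

# Toy data (made up): the response depends only on the first variable.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

f0 = fit_stump(X, y)
print(f0, predict_stump(f0, X))
```

The three ingredients listed above (variable, threshold, leaf predictions) are exactly the four numbers returned by `fit_stump`.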

**Regularity**

It is important to note that in the case of least squares, assuming all of the $x^{(i)}$ are distinct, we can always find a function $f$ that makes our empirical loss equal to zero: we simply take $f(x^{(i)}) = Y^{(i)}.$ But this would definitely lead to over-fitting. Our true aim is to minimize the expected loss, which refers to a hypothetical, not yet observed pair $(x,Y).$ So it makes sense to restrict our possible choices of $f$ to have some degree of *regularity*. 

Now that we've picked $f_0,$ think of how we might modify it by adding another term $cf_1,$ where $c \in \mathbb{R}$ and $f_1 \in {\cal H},$ to make our function 

$$
f = f_0 + cf_1
$$

have improved empirical risk.  Think of $c$ as being small, so we are talking about making a small perturbation of $f_0.$
Then a first-order Taylor expansion in $c$ gives an approximation to the empirical risk for this new function:

$$
\frac{1}{N}\sum_{i=1}^N  \ell(Y^{(i)},f_0(x^{(i)})+cf_1(x^{(i)})) 
$$

$$
\approx 
\frac{1}{N}\sum_{i=1}^N  \ell(Y^{(i)},f_0(x^{(i)}))+
\frac{c}{N}\sum_{i=1}^N  \ell_2(Y^{(i)},f_0(x^{(i)})) f_1(x^{(i)}).
$$

Here the term 

$$
\ell_2(Y^{(i)},f_0(x^{(i)}))
$$ 

means we differentiate $\ell$ with respect to the second argument and evaluate at the current 
$(Y^{(i)},f_0(x^{(i)})).$
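
The quality of this first-order approximation can be checked numerically. The sketch below assumes squared error loss, for which the derivative in the second argument is $-2(y-f),$ and uses simulated values for $Y^{(i)},$ $f_0(x^{(i)})$ and $f_1(x^{(i)});$ all names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
f0 = rng.normal(size=50)   # current fitted values f_0(x_i)
f1 = rng.normal(size=50)   # values f_1(x_i) of the candidate perturbation

def risk(f):
    """Empirical risk under squared error loss."""
    return np.mean((y - f) ** 2)

# Derivative of (y - f)^2 with respect to f, evaluated at the current fit.
grad = -2.0 * (y - f0)

for c in (0.1, 0.01, 0.001):
    exact = risk(f0 + c * f1)
    approx = risk(f0) + c * np.mean(grad * f1)
    # The gap shrinks like c^2, as expected for a first-order expansion.
    print(c, exact, approx, exact - approx)
```

For squared error the gap is exactly $c^2$ times the average of $f_1(x^{(i)})^2,$ so the linear term dominates the change in risk once $c$ is small.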

In gradient boosting, the next step is to choose the $f_1 \in {\cal H}$ that makes the linear term of this approximation as negative as possible, and to choose the constant $c$ (the step size). We can view the sum in that term as a dot product 
between the gradient (writing $\frac{\partial \ell}{\partial f}$ for $\ell_2$)
$$
\left[\begin{array}{c}
\frac{\partial \ell}{\partial f}(Y^{(1)},f_0(x^{(1)}))\\
\vdots\\
\frac{\partial \ell}{\partial f}(Y^{(N)},f_0(x^{(N)}))\\
\end{array}
\right]
$$
and 
$$
\left[\begin{array}{c}
f_1(x^{(1)})\\
\vdots\\
f_1(x^{(N)})\\
\end{array}
\right]
$$
and we pick our $f_1\in {\cal H}$ to be as close as possible to the negative of the gradient.


Typically we take $c\in (0,1),$ often a small value like 0.1, as a form of regularization.

The gradient boosting scheme is iterative. At each step we have our current $f,$ which is an expression of the form

$$
f = \sum_{j=0}^m c_j f_j
$$ 

with each $f_j \in {\cal H},$ and we compute the approximation to the empirical risk when we add a new term:

$$
\frac{1}{N}\sum_{i=1}^N  \ell(Y^{(i)},f(x^{(i)})+cf_{m+1}(x^{(i)})) 
$$

$$
\approx 
\frac{1}{N}\sum_{i=1}^N  \ell(Y^{(i)},f(x^{(i)}))+
\frac{c}{N}\sum_{i=1}^N  \ell_2(Y^{(i)},f(x^{(i)})) f_{m+1}(x^{(i)}).
$$

Again we choose $f_{m+1}\in {\cal H}$ to minimize this expression, choose a new step size $c_{m+1},$ and revise $f$ to give

$$
f = \sum_{j=0}^{m+1} c_j f_j.
$$ 

There are various schemes for choosing the sequence of step sizes, e.g.
take $c_j = c_0$ for all $j,$ or $c_j = 1/(j+1).$
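
The iteration can be sketched as a short loop. This is a minimal illustration under several assumptions that are not in the notes: a constant step size, a constant starting fit equal to the mean response (rather than the best stump), each new stump fitted to the negative gradient by least squares, and simulated data. The helper names are hypothetical.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares stump for targets r: (variable, threshold, left, right)."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            a, b = r[left].mean(), r[~left].mean()
            sse = np.sum((r[left] - a) ** 2) + np.sum((r[~left] - b) ** 2)
            if sse < best_sse:
                best_sse, best = sse, (j, t, a, b)
    return best

def predict_stump(stump, X):
    j, t, a, b = stump
    return np.where(X[:, j] <= t, a, b)

def gradient_boost(X, y, loss_grad, n_steps=100, c=0.1):
    """Boost stumps with constant step size c.

    loss_grad(y, f) must return the derivative of the loss in its
    second argument, evaluated at each observation.
    """
    f = np.full(len(y), y.mean())      # constant starting fit (assumption)
    stumps = []
    for _ in range(n_steps):
        g = loss_grad(y, f)            # gradient at the current fit
        stump = fit_stump(X, -g)       # stump closest to the negative gradient
        f = f + c * predict_stump(stump, X)
        stumps.append(stump)
    return stumps, f

def squared_error_grad(y, f):
    return -2.0 * (y - f)

# Simulated one-dimensional regression problem.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

risk_start = np.mean((y - y.mean()) ** 2)
stumps, f = gradient_boost(X, y, squared_error_grad)
risk_end = np.mean((y - f) ** 2)
print(risk_start, risk_end)
```

Only `loss_grad` depends on the loss, so switching to another differentiable loss means supplying a different derivative function while the loop stays the same.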

**Least Squares**

In the case of least squares, the gradient becomes $-2e,$ which is proportional to $-e,$ where $e$ is the vector of residuals
$$
e= \left[\begin{array}{c}
Y^{(1)}-f(x^{(1)})\\
\vdots\\
Y^{(N)}-f(x^{(N)})\\
\end{array}
\right],
$$
which is straightforward to calculate.

For a general loss function, the negative of the gradient is referred to as the generalized residual vector.

In the case of decision stumps, our update step amounts to adding on a stump predictor that closely resembles the negative gradient, which for least squares is proportional to the residual vector $e.$


**More general classes of predictors/classifiers**

Our ${\cal H}$ can consist of trees of higher depth, but for successful use of *boosting trees* this depth is usually limited to be relatively small. The whole idea is to combine many *weak learners* to produce good performance.


**Gradient boosting using xgboost: Classification**

In [1]:
!pip install xgboost



In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb

# Load the penguins data and encode the species as integer class labels 0, 1, 2
df = pd.read_csv("penguins.csv")
df["Y"] = df["species"].map({"Adelie": 0, "Gentoo": 1, "Chinstrap": 2})

In [2]:
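# Random split: two thirds of the rows for training, the rest for testing
N = df.shape[0]
p = np.random.permutation(N)
n_train = int(2 * N / 3)
Itrain = p[:n_train]
Itest = p[n_train:]
# The predictors are the columns at positions 2, 3 and 4
Xtrain = df.loc[Itrain, df.columns[2:5]].values
Xtest = df.loc[Itest, df.columns[2:5]].values
Ytrain = df.Y[Itrain]
Ytest = df.Y[Itest]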

In [3]:
xgb_classifier = xgb.XGBClassifier()
xgb_classifier.fit(Xtrain,Ytrain)
Ypred=xgb_classifier.predict(Xtest)
pd.crosstab(Ytest,Ypred)

| Y (true) | predicted 0 | predicted 1 | predicted 2 |
|---|---|---|---|
| 0 | 51 | 0 | 2 |
| 1 | 0 | 38 | 0 |
| 2 | 2 | 0 | 21 |
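
Summary measures can be read off this cross-tabulation. The sketch below copies the counts from the output above into a NumPy array and computes the overall test accuracy and the per-class recall; since the train/test split is random and unseeded, these numbers describe this particular run only.

```python
import numpy as np

# Counts copied from the cross-tabulation: rows are true labels,
# columns are predicted labels.
confusion = np.array([[51, 0, 2],
                      [0, 38, 0],
                      [2, 0, 21]])

# Overall accuracy: correctly classified cases (the diagonal) over all cases.
accuracy = np.trace(confusion) / confusion.sum()

# Per-class recall: diagonal entry divided by the row total for that class.
recall = np.diag(confusion) / confusion.sum(axis=1)

print(accuracy)
print(recall)
```

Here 110 of the 114 test cases lie on the diagonal, and the four errors are all confusions between classes 0 and 2.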


**Prediction probabilities**

In [4]:
pprobs=xgb_classifier.predict_proba(Xtest)
print(pprobs.shape)
print(pprobs)

(114, 3)
[[2.85434187e-04 9.98185813e-01 1.52872584e-03]
 [9.96466041e-01 1.04290445e-03 2.49109627e-03]
 [2.85508781e-04 9.98446643e-01 1.26782293e-03]
 [1.76152075e-03 1.22830493e-03 9.97010112e-01]
 [9.97957110e-01 7.25972757e-04 1.31685042e-03]
 [2.32859189e-03 1.10589340e-03 9.96565521e-01]
 [2.85434187e-04 9.98185813e-01 1.52872584e-03]
 [9.96466041e-01 1.04290445e-03 2.49109627e-03]
 [4.34269130e-01 1.30234882e-01 4.35496032e-01]
 [3.06196464e-03 9.95557845e-01 1.38024357e-03]
 [2.85434187e-04 9.98185813e-01 1.52872584e-03]
 [2.85508781e-04 9.98446643e-01 1.26782293e-03]
 [4.57739551e-03 9.93469357e-01 1.95323327e-03]
 [9.94353771e-01 2.30302149e-03 3.34319193e-03]
 [5.05249575e-03 1.57654732e-02 9.79182005e-01]
 [1.50978491e-02 9.76226866e-01 8.67525861e-03]
 [9.98771727e-01 2.44380382e-04 9.83823207e-04]
 [5.94037818e-04 9.97444868e-01 1.96104939e-03]
 [9.98958945e-01 2.44426192e-04 7.96573469e-04]
 [9.98930156e-01 4.14722337e-04 6.55167445e-04]
 [9.96222615e-01 1.45938213e-03
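
Each row of this array is a probability vector over the three classes, and a predicted label is typically obtained by taking the most probable class. The sketch below illustrates this with a few rows copied from the printed output; the variable names are assumptions, and it does not call xgboost itself.

```python
import numpy as np

# First four rows of the printed probabilities (values copied from the output).
pprobs_head = np.array([
    [2.85434187e-04, 9.98185813e-01, 1.52872584e-03],
    [9.96466041e-01, 1.04290445e-03, 2.49109627e-03],
    [2.85508781e-04, 9.98446643e-01, 1.26782293e-03],
    [1.76152075e-03, 1.22830493e-03, 9.97010112e-01],
])
# Ninth printed row, where classes 0 and 2 are nearly tied.
uncertain = np.array([4.34269130e-01, 1.30234882e-01, 4.35496032e-01])

labels = pprobs_head.argmax(axis=1)   # most probable class in each row
print(labels)
print(pprobs_head.sum(axis=1))        # each row sums to 1 up to rounding
print(uncertain.argmax(), uncertain.max())
```

Most rows put nearly all of their mass on one class, but the ninth row shows that the probabilities also flag cases where the classifier is close to undecided.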

## AdaBoost

There are various ways that a classification method can be iteratively updated to produce improved performance. One way is via *boosting*. **AdaBoost** was introduced in 1997 by Freund and Schapire. It has been extended to the case of regression as well.

The basic idea is as follows. Assume, we have training data $(x^{(i)},Y^{(i)}),i=1,\ldots,N$ where 

- each $x^{(i)}$ is a $k$-vector,
- $Y^{(i)} =\pm 1$

We also assume we have some training algorithm that, given such a data set and, importantly, weights $w_i,i=1,\ldots,N$ on the observations, produces a classifier, which we denote by
$\hat{Y}_w(x).$

The algorithm starts by giving an initial weight of $w_i = 1/N$ to each observation and computes the classifier. Then, in successive iterations, after building the classifier, the weights are modified to give more influence to the difficult-to-classify observations.

Here is the algorithm. 

At step 0 start with $w_i^{(0)} = 1/N$ for $i=1,\ldots,N$

For $j=0,\ldots,J-1$

- build a classifier $\hat{Y}_j$ on the training data using the current weights

- compute the weighted error rate 

$$
e_j =\frac{ \sum_{i=1}^N w_i^{(j)} I(\hat{Y}_j(x^{(i)}) \neq Y^{(i)})}{\sum_{i=1}^N w_i^{(j)}}
$$

- define $\alpha_j = \log\left(\frac{1-e_j}{e_j}\right)$

- Update the weights by taking

$$
w_i^{(j+1)} = w_i^{(j)} \exp \left( \alpha_j I(\hat{Y}_j(x^{(i)}) \neq Y^{(i)})\right)
$$

Finally, after $J$ iterations we take as our classifier 

$$
\hat{Y}(x) = \text{sign} \left\{ \sum_{j=0}^{J-1} \alpha_j \hat{Y}_j(x) \right\}
$$


**Note:** Typically we have an error rate $e_j \in (0,1),$ so $\alpha_j$ is well-defined. If each classifier does better than random guessing, then $e_j < \frac{1}{2},$ so $\frac{1-e_j}{e_j}>1$ and $\alpha_j>0.$ The factor by which we multiply the weights of 
the misclassified observations ($e^{\alpha_j}$) therefore exceeds 1, while the weights of the correctly classified observations 
are unchanged in a particular iteration.
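
The steps of the algorithm can be sketched in NumPy. This is an illustration under stated assumptions: the weighted base learner is a decision stump found by exhaustive search, the error rate is clipped away from 0 and 1 so that $\alpha_j$ stays finite (a safeguard added here, not part of the description above), and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def fit_weighted_stump(X, y, w):
    """Stump with the smallest weighted error; labels y are +1/-1.

    Returns (variable, threshold, s): predict s if x[variable] <= threshold,
    and -s otherwise.
    """
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best

def predict_stump(stump, X):
    j, t, s = stump
    return np.where(X[:, j] <= t, s, -s)

def adaboost(X, y, J=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # step 0: equal weights
    stumps, alphas = [], []
    for _ in range(J):
        stump = fit_weighted_stump(X, y, w)
        miss = predict_stump(stump, X) != y
        e = np.sum(w * miss) / np.sum(w)       # weighted error rate e_j
        e = np.clip(e, 1e-10, 1 - 1e-10)       # added guard against log(0)
        alpha = np.log((1 - e) / e)
        w = w * np.exp(alpha * miss)           # only mistakes are up-weighted
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * predict_stump(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# Simulated data with a diagonal class boundary, which no single stump can match.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

stumps, alphas = adaboost(X, y, J=30)
train_acc = np.mean(adaboost_predict(stumps, alphas, X) == y)
print(train_acc)
```

Because the search considers both orientations of every split, the chosen stump never has weighted error above one half, so each $\alpha_j$ is nonnegative, in line with the note above.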


