# Gradient Boosting



## Learning Objective
- Explain the intuition behind gradient boosting.
- Compute the gradient for different loss functions and use it to train a weak learner.
- List different loss functions for regression and their pros and cons.
- Understand the gradient boosting algorithm for binary and multi-class classification.
- Recall the logistic loss function and the sigmoid function, and extend them to derive the cross-entropy loss function and the softmax function.


## Introduction

In the last few chapters, we studied boosting and AdaBoost. Recall that the idea of boosting is to __iteratively add__ weak learners to compensate for the __errors or shortcomings__ of an existing model. This iterative addition can be expressed as:
$$F(\mathbf{x}) = \sum_t\alpha_t h_t(\mathbf{x})$$
where $\alpha_t$ is a shrinkage factor, or learning rate, used to dampen the contribution of each successively added model $h_t(\mathbf{x})$, and $F(\mathbf{x})$ is the final ensemble model obtained by boosting the weak learners $h_t(\mathbf{x})$.

__Note:__ In gradient boosting, we use the same value of $\alpha$ for all weak learners.

To perform boosting, we need some way to identify the errors or shortcomings of an existing model. There are two popular ways to do this: the __weights of data points__ and the __gradient__. In AdaBoost, the shortcomings of an existing model are identified by high-weight data points. The weight of a data point corresponds to the difficulty of classifying it: the higher the weight, the harder the point is to classify. So the successively added model should pay much more attention to these points.


In gradient boosting, however, the __errors or shortcomings__ of the existing model are identified by the gradient. Gradient boosting combines two main techniques: gradient descent and boosting. That is,

$$\text{Gradient Boosting} = \text{Gradient Descent} + \text{Boosting}$$

Gradient boosting performs boosting (iterative addition) based on the gradient of the loss function with respect to the predictions of the existing model.

In the linear regression module, we talked about gradient descent. Let's briefly recall this technique.
In the gradient descent algorithm, we take steps proportional to the _gradient's magnitude in the negative direction_ to minimize the loss function. That means we iteratively update our parameter $(\theta)$ until we reach the optimal point, as shown in the figure below.

<figure>
    <div align = "center">
 <img src="https://i.postimg.cc/3NbjG0x3/image.png" width="30%">
    </div>
    <div align = "center">
<figcaption>Figure 1: Gradient Descent</figcaption>
    </div>
 </figure>


Gradient boosting follows an approach similar to gradient descent. It computes the gradient of the loss with respect to the predictions of the existing model and iteratively updates the function/model $F(\mathbf{x})$:

$$F(\mathbf{x}_i):=F(\mathbf{x}_i) - \alpha\frac{\partial J}{\partial F(\mathbf{x}_i)}$$


The only difference is that gradient boosting updates the function $F(\mathbf{x})$ instead of the parameter $\theta$. We will see later how the model update in gradient boosting is similar to that in gradient descent.


### Numerical Intuition of Gradient Boosting

Let's understand gradient boosting with the help of a small numerical example.
Suppose we have a dataset $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$ and we fit a model $F(\mathbf{x})$ to minimize the squared loss.

Suppose the model $F$ makes the following predictions $F(\mathbf{x}_i)$.

|     |      $y_i$  |   Prediction $F(\mathbf{x}_i)$ |
|----------|:-----------:|:----------:|
|    1     |  0.8        | 0.6         |
|    2     | 0.6         | 0.8         |
|    3     | 1.3         | 1.4         |
|    4     | 0.5         | 0.2         |

We can see that the model $F(\mathbf{x})$ has some error; we want our predictions to be much closer to the actual labels.


In gradient boosting, we add another model $h(\mathbf{x})$ to compensate for the error of the model $F(\mathbf{x})$ such that,

$$F({\mathbf{x}}) + h({\mathbf{x}}) = \mathbf{y}$$
 $$h({\mathbf{x}}) = \mathbf{y} - F({\mathbf{x}}) $$

The above equation means the model $h(\mathbf{x})$ is trained on data $\big\{(\mathbf{x}_1, y_1-F(\mathbf{x}_1)), (\mathbf{x}_2, y_2-F(\mathbf{x}_2)), \ldots, (\mathbf{x}_N, y_N-F(\mathbf{x}_N))\big\}$.


The term $\mathbf{y} - F({\mathbf{x}})$ is the residual or error of the existing model $F$.





Computing the residual on the above data we have,

|   |      $y_i$  |   Prediction $F(\mathbf{x}_i)$     | Residual |
|:----------|-----------:|:----------:|:------:|
|    1     |  0.8        | 0.6         | 0.2   |
|    2     | 0.6         | 0.8         |-0.2   |
|    3     | 1.3         | 1.4         |-0.1  |
|    4     | 0.5         | 0.2         |0.3  |

We can now train another learner $h(\textbf{x})$ on the residuals. The role of $h(\textbf{x})$ is to compensate for the shortcomings of the existing model $F(\textbf{x})$. Suppose the model $h(\textbf{x})$ makes the following predictions.

| |      $y_i$  |   Prediction $F(\mathbf{x}_i)$ | Residual  | Prediction $h(\mathbf{x}_i)$ |
|----------|:-----------:|:----------:|:------:|:-----:|
|    1     |  0.8        | 0.6         | 0.2   |   0.1 |
|    2     | 0.6         | 0.8         |-0.2   |   -0.2   |
|    3     | 1.3         | 1.4         |-0.1  |   0.0        |
|    4     | 0.5         | 0.2         |0.3  |   0.2     |








We now have two different models, $F$ and $h$, working in unison to predict the actual label $y$. Their predictions can be combined as:

$$\hat{y}_i = F({\mathbf{x}_i}) + h(\mathbf{x}_i)$$


|   |      $y_i$  |  Prediction $F(\mathbf{x}_i)$ | Residual |  Prediction $h(\mathbf{x}_i)$ |  $\hat{y}_i$ |
|:----------|:-----------:|:----------: |------:|:-----:|---- |
|    1     |  0.8        | 0.6         | 0.2   |   0.1 |  0.7  |
|    2     | 0.6         | 0.8         |-0.2   |   -0.2|  0.6  |
|    3     | 1.3         | 1.4         |-0.1  |   0.0  |  1.4  |
|    4    | 0.5         | 0.2         |0.3  |   0.2   |  0.4  |

From the above table, we can see that the new prediction $\hat{y}_i$ made jointly by $F$ and $h$ is, overall, closer to the actual label $y$ than the prediction made by model $F$ alone. If the prediction made by the model $F + h$ is still not satisfactory, we can add another model $h_t(\textbf{x})$ in the same way.
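
The arithmetic in the tables above can be reproduced with a few lines of Python. This is only an illustrative sketch: the lists `y`, `F`, and `h` copy the values from the residual tables (with a prediction of 0.2 for the fourth point), and the helper `squared_error` is a name introduced here for illustration.

```python
# Values copied from the tables above (four training points).
y = [0.8, 0.6, 1.3, 0.5]     # actual labels
F = [0.6, 0.8, 1.4, 0.2]     # predictions of the existing model F
h = [0.1, -0.2, 0.0, 0.2]    # predictions assumed for the new model h

# Residuals y_i - F(x_i): these are the targets the model h is trained on.
residuals = [round(yi - Fi, 2) for yi, Fi in zip(y, F)]

# Combined prediction F(x_i) + h(x_i).
y_hat = [round(Fi + hi, 2) for Fi, hi in zip(F, h)]


def squared_error(targets, preds):
    """Total squared loss: sum of 0.5 * (y - prediction)^2."""
    return sum(0.5 * (t - p) ** 2 for t, p in zip(targets, preds))


loss_before = squared_error(y, F)      # loss of F alone
loss_after = squared_error(y, y_hat)   # loss of F + h

print("residuals:", residuals)
print("combined prediction:", y_hat)
print("loss before / after:", round(loss_before, 3), round(loss_after, 3))
```

The total squared loss falls from about 0.09 to about 0.015 once $h$ is added, which is the sense in which $F + h$ is closer to the labels than $F$ alone.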

__Note:__ While performing boosting, we add only a fraction of the prediction made by $h$, i.e., we perform model addition as $F:= F+\alpha h$. Here $\alpha$ is the learning rate, or shrinkage, used to dampen the contribution of the currently added model $h$. The learning rate is a hyperparameter and is usually set to a small value, say 0.1, while a large number of models are added, each making a small contribution. This shrinkage helps to avoid overfitting in boosting.

In the above example, we used the residuals to train the subsequent model $h_t(\textbf{x})$. The successive addition of models relates to boosting, while the use of residuals relates to gradient descent. But how does the use of residuals relate to gradient descent? For the squared loss, the residual is exactly the negative gradient. Let's derive the expression for the gradient of the squared loss and see how it relates to the residual.




 <!-- If the model $h$ alone couldn't improve the prediction, we can add another model $h_i$ too. The general convention uses many weak learners $h$ such that each model's contribution is shrunk by imposing the learning rate $ \alpha $.  -->

### How is the residual related to the gradient?
The term gradient is more general than the term residual. For the squared loss function, the residual is equal to the negative gradient. Let's see how the residual is related to the gradient.

As discussed above, the residual of a model is
$$\textbf{y}- F(\textbf{x})$$



Suppose we have a squared loss function:
$$ J = L(y, F(\mathbf{x})) = \frac{1}{2}(y - F(\mathbf{x}))^2 $$

We want to minimize the total loss $J = \sum_i L(y_i, F(\mathbf{x}_i))$ by adjusting $F(\mathbf{x}_1)$, $F(\mathbf{x}_2)$, ..., $F(\mathbf{x}_N)$. As seen in the above table, $F(\mathbf{x}_1)$, $F(\mathbf{x}_2)$, ..., $F(\mathbf{x}_N)$ are just real numbers. We can therefore treat each $F(\textbf{x}_i)$ as a parameter and take the derivative:


$$
\frac{\partial J}{\partial F(\mathbf{x}_i)} = \frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} = F(\mathbf{x}_i) - y_i = -\big(y_i - F(\mathbf{x}_i)\big)
$$

$$\underbrace{y_i-F(\mathbf{x}_i) }_{\text{residual}} = \underbrace{-\frac{\partial J}{\partial F(\mathbf{x_i})}}_{\text{negative gradient}}$$



So for squared loss, we can interpret residuals as negative gradients.
$$\text{residual} =  \text{negative gradient}$$

The above relation shows that, for the squared loss, the residual is equal to the negative gradient. In other words, the residual is a special case of the negative gradient. For other loss functions, the residual and the negative gradient generally differ, so we work with the gradient because it applies to any differentiable loss function.
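
As a quick sanity check of this relation, the sketch below compares the residual with a finite-difference estimate of the negative gradient of the squared loss. The helper names `squared_loss` and `numerical_gradient` and the step size `eps` are choices made for this illustration only.

```python
def squared_loss(y, f):
    """Squared loss 0.5 * (y - f)^2 for one label y and one prediction f."""
    return 0.5 * (y - f) ** 2


def numerical_gradient(loss, y, f, eps=1e-6):
    """Central finite-difference estimate of d loss / d f."""
    return (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y = [0.8, 0.6, 1.3, 0.5]   # labels from the numerical example
F = [0.6, 0.8, 1.4, 0.2]   # predictions of the existing model

for yi, fi in zip(y, F):
    residual = yi - fi
    negative_gradient = -numerical_gradient(squared_loss, yi, fi)
    print(f"residual = {residual:+.3f}   negative gradient = {negative_gradient:+.3f}")
```

Both columns agree up to numerical precision, as the derivation predicts.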



### How is gradient boosting related to gradient descent?

In the above section, we saw how the residual is related to the negative gradient.
In this section, let's see how gradient boosting and gradient descent are closely related.

In gradient boosting, we perform iterative addition,
$$F(\mathbf{x}_i):=F(\mathbf{x}_i)+ h(\mathbf{x}_i)$$
$$F(\mathbf{x}_i):=F(\mathbf{x}_i)+ y_i - F(\mathbf{x}_i) \quad[ \because h({\mathbf{x}_i}) = y_i - F({\mathbf{x}_i}) ]$$
$$F(\mathbf{x}_i):=F(\mathbf{x}_i) - 1\frac{\partial J}{\partial F(\mathbf{x_i})}$$

For gradient descent, we have,

$$\theta_i:=\theta_i- \alpha\frac{\partial{J}}{\partial{\theta_i}}$$

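The above two equations show how gradient boosting is related to gradient descent: both take a step along the negative gradient of the loss. Gradient descent updates the parameters $\theta_i$, whereas gradient boosting updates the function values $F(\mathbf{x}_i)$ (with a step size of 1 in the derivation above).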
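
To make the analogy concrete, here is a small sketch of gradient descent carried out directly on the function values $F(\mathbf{x}_i)$ for the squared loss. It is a simplification: no weak learner is trained, the predictions themselves are treated as the parameters, and the starting values, `alpha = 0.1`, and 50 steps are arbitrary choices for illustration.

```python
y = [0.8, 0.6, 1.3, 0.5]   # labels from the numerical example
F = [0.0, 0.0, 0.0, 0.0]   # start from a model that predicts 0 everywhere
alpha = 0.1                # step size (learning rate), chosen arbitrarily

for step in range(50):
    # For the squared loss 0.5 * (y_i - F_i)^2 the gradient w.r.t. F_i is F_i - y_i.
    gradients = [fi - yi for yi, fi in zip(y, F)]
    # Same update rule as gradient descent, applied to F(x_i) instead of theta.
    F = [fi - alpha * g for fi, g in zip(F, gradients)]

max_gap = max(abs(yi - fi) for yi, fi in zip(y, F))
print([round(fi, 3) for fi in F])
print("largest remaining error:", round(max_gap, 4))
```

Each step shrinks every residual by a factor of $1 - \alpha$, so the predictions move steadily towards the labels. In actual gradient boosting the step is taken through a weak learner fitted to the negative gradients, which is what lets the model make predictions for unseen $\mathbf{x}$.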


<!-- We can rewrite the equation $h({\mathbf{x}}) = \mathbf{y} - F({\mathbf{x}}) $ as:

$$h({\mathbf{x}}) = - \frac{\partial J}{\partial F(\mathbf{x_i})}$$

For any general loss function $J$ we can iteratively add model h trained on $(\mathbf{x}_i - \frac{\partial J}{\partial F(\mathbf{x_i})})_{i=1}^N$ -->

### Visualize the working of Gradient Boosting
So far, we have discussed the working of gradient boosting both theoretically and with a numerical example. We also saw how gradient boosting is related to gradient descent.
In this section, we will visualize the steps involved in the gradient boosting algorithm with an animation.

<figure>
    <div align = "center">

 <!-- <img src="https://doc.google.com/a/fusemachines.com/uc?export=download&id=13KNRfhoqOZuVzbLulMjZ5bZl1gzPehOS" width="80%"> -->
  <img src="https://i.postimg.cc/NGrRCFF2/image.png" width="80%">
    </div>
    <div align = "center">
<figcaption>Figure 2: An animation showing smaller iterative steps taken towards the minimum (target).</figcaption>
    </div>
 </figure>


As discussed above, the negative gradient points in the direction of the target (the direction in which our loss function decreases). So the negative gradient shows us the direction to follow to minimize the loss function. We now use a base learner that moves our existing model along the negative gradient. That is, we train a base learner on the training data, taking the negative gradient as the new label. Note that in boosting, the base learner should be weak, with high bias and low variance.

The above animation shows two iterations in detail. Each iteration represents one whole step of boosting. As shown, we start with a very simple model $F_0(\mathbf{x})$ that returns 0. The first iteration starts by computing the gradient of our loss function with respect to the existing model. The negative gradient points towards the target point $y$. Using this negative gradient, we train a weak learner $h_1(\mathbf{x})$. The weak learner can be any machine learning algorithm, such as a linear regressor or a decision tree. Being weak, such a learner may not point exactly in the direction of the negative gradient. We therefore take only a small step in the weak learner's direction, since it indicates the direction of the target only approximately. Moreover, taking smaller steps helps to avoid overfitting. Finally, we add this new model to our existing model, i.e., $F_1({\mathbf{x}}) = F_0(\mathbf{x}) + \alpha h_1({\mathbf{x}})$.

With this, the second iteration starts. It proceeds in the same way as the first: we compute the negative gradient, train a weak learner on it, take a small step in the weak learner's direction, and update our function.

If we repeat this for several iterations, we will eventually get very close to the target point.

<!-- ### $\textbf{Gradient descent in function space}$
Let's understand how this term negative gradient comes into play while minimizing a function.
At any $t^{th}$ iteration we want to find the model $h$ such that

$$F_t(\mathbf{x})=  F_{t-1}(\mathbf{x})+\alpha h(\mathbf{x})$$

Where, $h(\mathbf{x})$ is choosen in such a way to minimize a loss function.
$$F_t(\mathbf{x})=  F_{t-1}(\mathbf{x})+ argmin_h\sum_{i=1}^{N}L(y_i, F_{t-1}(\mathbf{x}_i)+\alpha h(\mathbf{x}_i))$$

i.e., $h = argmin_h\sum_{i=1}^{N}L(y_i, F_{t-1}(\mathbf{x}_i)+\alpha h(\mathbf{x}_i))$

We can approximate $L(y, F+\alpha h)$ using [Taylor Aproxim2ation](https://en.wikipedia.org/wiki/Taylor_series). Since $y$ is constant we can rewrite it as $L(F+\alpha h)$ -->

<!-- Using Taylor Approximation we have,
$$L(F+\alpha h) \approx L(F) + \alpha  L^{'}(F)h$$

This approximation of $L$ as a linear function only holds for smaller region around L(F). To ensure this the value of $\alpha$ is set to a smaller value.

So we find an optimal h as:

$$argmin_h L(F+\alpha h) \approx argmin_h L^{'}(F)h = argmin_h \sum_{i=1}^N \frac{\partial L}{\partial F(\mathbf{x_i})} h(\mathbf{x_i})$$

Here, removing the constants $L(F)$ and $\alpha$ doesn't affect the function we are going to minimize.

Here terms $\frac{\partial L}{\partial F(\mathbf{x}_i)}$ and $h(\mathbf{x}_i)$ can be taken as vector. The dot product between two vector is minimum when the angle between two vector is zero. This implies that for minimum, the two vectors should be same. i.e., $$h(\mathbf{x}_i)=-\frac{\partial L}{\partial F(\mathbf{x}_i)}$$.

The above equation implies that $h(\mathbf{x})$ is a model trained on dataset $\big{(}\mathbf{x_i}, -\frac{\partial{L}}{\partial F(\mathbf{x_i})}\big{)}_{i=1}^N$ -->

## Gradient boosting for regression

We have already discussed gradient boosting for squared loss with a numerical example. The gradient boosting algorithm can be used with any arbitrary differentiable loss function $J$. Let's summarize the working of the gradient boosting algorithm for regression.

For the gradient boosting algorithm, the inputs are the training data, the number of boosting iterations to perform, the loss function, the base-learning algorithm, and the learning rate.
The number of boosting iterations refers to the number of times we add a new model. In the above numerical example, we added only one model, so the number of iterations was 1.

Similarly, the learning rate or shrinkage is used to shrink the contribution of successive models. Without shrinkage, the model is unstable and is likely to overfit. So using shrinkage, we take small steps in the direction of the negative gradient. Please refer to gradient descent in the linear regression module to see the effect of a higher learning rate.



```
Inputs:
```
> - input data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$
> - number of iterations $M$
> - loss function $J$; for squared loss $J = L(y, F(\mathbf{x}))= \frac{1}{2}(y - F(\mathbf{x}))^2$
> - choice of base-learner model $h(\mathbf{x})$
> - learning rate or shrinkage $\alpha$

```
Algorithm:
```
> 1. initialize $F_0= \frac{1}{N}\sum_{i=1}^N y_i$
> 2. for $t=1$ to $M$ do
> 3. &nbsp;&nbsp;&nbsp;calculate negative gradients $-g(\mathbf{x}_i)$; where $-g(\mathbf{x}_i)= -\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\bigg|_{F = F_{t-1}}$
> 4. &nbsp;&nbsp;&nbsp;fit a base-learner model $h$ to the negative gradients $-g(\mathbf{x}_i)$
> 5. &nbsp;&nbsp;&nbsp;update the function: $F_t=F_{t-1} + \alpha h$; where $\alpha$ is the shrinkage
> 6. end for



The first step in gradient boosting is to create a model, $F_0$, which returns the average of the target variable. This constant model is a good starting point. We then go through the boosting iterations.
In each boosting iteration, we first compute the negative gradient of the loss function. Using this negative gradient as the target, we train a base model $h$. We can use any machine learning algorithm, such as linear regression or a decision tree, for the base model. We then update our model $F$ as $F_t = F_{t-1}+\alpha h$, where the shrinkage parameter $\alpha$ scales down the contribution of the base learner $h$. This downscaling makes the boosting operation stable and helps to avoid overfitting. The shrinkage $\alpha$ is one of the hyperparameters in gradient boosting.
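
The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not a reference implementation: the base learner is a one-split regression stump on a single feature (`fit_stump`), the loss is the squared loss, and the toy data, `n_iter=100`, and `alpha=0.1` are invented for this example.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on one feature: one threshold, two leaf means."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x <= thr else right_mean


def gradient_boost_regression(xs, ys, n_iter=100, alpha=0.1):
    f0 = sum(ys) / len(ys)                                     # step 1: F_0 = mean of y
    preds = [f0] * len(ys)
    learners = []
    for _ in range(n_iter):                                    # step 2
        neg_grad = [y - p for y, p in zip(ys, preds)]          # step 3 (squared loss)
        h = fit_stump(xs, neg_grad)                            # step 4
        preds = [p + alpha * h(x) for p, x in zip(preds, xs)]  # step 5
        learners.append(h)
    # Final model: F_M(x) = F_0 + alpha * sum of all fitted base learners.
    return lambda x: f0 + alpha * sum(h(x) for h in learners)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # toy feature values (invented)
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]   # toy targets (invented)

model = gradient_boost_regression(xs, ys)
mse_start = mse(ys, [sum(ys) / len(ys)] * len(ys))   # error of F_0 alone
mse_end = mse(ys, [model(x) for x in xs])            # error after boosting
print("training MSE before / after:", round(mse_start, 4), round(mse_end, 4))
```

For a different loss function, only the line that computes `neg_grad` would need to change; the rest of the loop stays the same.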

## Loss functions
We discussed the iterative approach for optimization in the Linear Regression module. In the iterative approach, we update the model's parameter to minimize the function, called loss or cost function. The loss or cost function maps an event onto a real number representing the error associated with the event.  The gradient of the loss function gives the steepest direction towards the minimum point. It is useful in deciding how to update an existing model to reduce the model's overall cost/error.


The gradient boosting algorithm allows us to use any differentiable convex loss function. Even with different loss functions, the main flow of the algorithm remains the same. We have already discussed the squared loss in the above section. The problem with the squared loss is that it is highly affected by outliers. Let's see this with the help of an example.

|  |      $y_i$  |  Prediction $F(\mathbf{x}_i)$  | Squared loss|
|:----------|-----------:|:----------:|:-----:|
|    1     |  0.8        | 0.6         | 0.02  |
|    2     | 0.6         | 0.8         | 0.02 |
|    3     | 1.3         | 1.4         | 0.005 |
|    4     | 0.5         | 0.3         | 0.02 |
|    5     | __5.0__        | 0.9        | __8.405__ |

Here __5.0__ is an outlier. With the squared loss function, the outlier is heavily penalized. As a consequence, the new model that we are going to add will focus disproportionately on this outlier and hence degrade the overall performance.

To tackle this problem, we have other loss functions that are more robust to outliers.

### 1. Absolute loss

The absolute loss function is defined as:
$$ J = L(y,F(\mathbf{x})) = |y-F(\mathbf{x})|$$

Squared loss takes the square of the difference between the actual and the predicted value. In contrast, absolute loss takes the absolute value of this difference, so large errors are not amplified. Absolute loss is therefore more robust to outliers than squared loss.


### 2. Huber loss
The Huber loss function is defined as:
\begin{equation}
  J = L(y, F(\mathbf{x})) =
    \begin{cases}
      \frac{1}{2}(y-F(\mathbf{x}))^2 & |y-F(\mathbf{x})|\le \delta \\
      \delta (|y-F(\mathbf{x})|-\frac{\delta}{2}) & |y-F(\mathbf{x})| \gt \delta
    \end{cases}       
\end{equation}

Huber loss combines the squared loss and the absolute loss to create a new loss function that is both differentiable and robust to outliers. For errors no larger than $\delta$, the Huber loss acts like the squared loss, and for errors larger than $\delta$, it grows linearly like the absolute loss. The parameter $\delta$ is the transition point between the two.

Let's see how absolute loss and Huber loss are robust to outliers with an example.

|     |      $y_i$  |   Prediction $F(\mathbf{x}_i)$ | Squared loss| Absolute loss | Huber loss ($\delta=0.2$) |
|----------|:-----------:|:----------:|:-----: |:-----:|:-----:|
|    1     |  0.8        | 0.6         | 0.02  |0.2 | 0.02  |
|    2     | 0.6         | 0.8         | 0.02 |0.2 |0.02     |
|    3     | 1.3         | 1.4         | 0.005 |0.1   |0.005  |
|    4     | 0.5         | 0.3         | 0.02 |0.2  | 0.02  |
|    __5__     | __5.0__        | __0.9__      | __8.405__ |__4.1__  |__0.8__   |

Here, $\delta = 0.2$ is chosen arbitrarily.

In the above table, the $5^{th}$ data point is an outlier. The outlier is heavily penalized by the squared loss function but much less so by the other loss functions. This shows that the squared loss function is highly sensitive to the presence of outliers. Absolute loss and Huber loss, on the other hand, are less sensitive and hence more robust to outliers.
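
The table values, and the targets each loss would hand to the next base learner, can be checked with the sketch below. The function names are introduced here for illustration. The negative gradients follow from differentiating each loss with respect to the prediction; for the absolute loss the value at a zero residual is a convention (a subgradient), since the loss is not differentiable there.

```python
def squared_loss(y, f):
    return 0.5 * (y - f) ** 2


def absolute_loss(y, f):
    return abs(y - f)


def huber_loss(y, f, delta=0.2):
    r = abs(y - f)
    return 0.5 * r ** 2 if r <= delta else delta * (r - delta / 2)


# Negative gradients with respect to the prediction f. These are the
# targets a base learner would be fitted to under each loss.
def squared_neg_grad(y, f):
    return y - f                              # the residual itself


def absolute_neg_grad(y, f):
    return (y > f) - (y < f)                  # sign of the residual: -1, 0 or +1


def huber_neg_grad(y, f, delta=0.2):
    return max(-delta, min(delta, y - f))     # residual clipped to [-delta, delta]


y = [0.8, 0.6, 1.3, 0.5, 5.0]   # labels from the table (the last one is the outlier)
F = [0.6, 0.8, 1.4, 0.3, 0.9]   # predictions from the table

for yi, fi in zip(y, F):
    losses = (squared_loss(yi, fi), absolute_loss(yi, fi), huber_loss(yi, fi))
    targets = (squared_neg_grad(yi, fi), absolute_neg_grad(yi, fi), huber_neg_grad(yi, fi))
    print([round(v, 3) for v in losses], [round(v, 3) for v in targets])
```

For the outlier, the squared loss asks the next learner to fit a target of about 4.1, whereas the absolute and Huber losses cap the target at 1 and $\delta = 0.2$ respectively. This is why the latter two are less influenced by outliers.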

## Gradient Boosting for Classification

So far, we have talked about gradient boosting for regression. The gradient boosting algorithm is slightly different for classification, but most of the algorithm remains the same as for regression. In this section, we will discuss the gradient boosting algorithm for binary classification, and in the next section, we will generalize it to multi-class classification.

### Gradient Boosting for binary classification


The general algorithm for classification also follows the same approach we used for regression. In classification, we compute the gradient of the loss function, train a new base learner on the computed negative gradient, and then add the base learner to the existing model.

We have already talked about the logistic loss function in the Logistic Regression module. Recall that the logistic loss function is used for binary classification. We can't use the squared loss for classification as we did for the regression task. Let's discuss the logistic loss function.

The logistic loss can be represented as:

$$J = L(y, \hat{p}) = -y\log(\hat{p})-(1-y)\log(1-\hat{p})$$
where $\hat{p}$ is the probability, predicted by our model, that $y = 1$.

The above function shows that it requires the probability of positive and negative class to compute the loss.

A regression tree (CART), on the other hand, outputs a real value, which we interpret as a logit. We need some way to convert logits into probabilities and vice versa. As already discussed in Logistic Regression, we can use the sigmoid function:
$$\hat{p} = \frac{1}{1+e^{-\hat{y}}}$$

To convert a probability back to a logit, we have
$$\hat{y} = \log\frac{\hat{p}}{1-\hat{p}}$$

The term $\hat{y}$ is a logit. It is also called log-odds, as it is a log of odds.

The term $(\frac{\hat{p}}{1-\hat{p}})$, called odds, is a ratio of probability of $(y=1)$ to probability of $(y=0)$.
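
The two conversions are inverses of each other, which the following sketch checks numerically. The function names `sigmoid` and `logit` and the sample values are chosen for this illustration.

```python
import math


def sigmoid(y_hat):
    """Map a logit (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y_hat))


def logit(p):
    """Map a probability in (0, 1) back to its log-odds."""
    return math.log(p / (1.0 - p))


for y_hat in [-2.0, 0.0, 1.7]:
    p = sigmoid(y_hat)
    odds = p / (1.0 - p)
    print(f"logit = {y_hat:+.2f}  probability = {p:.4f}  odds = {odds:.4f}  back to logit = {logit(p):+.2f}")
```

A logit of 0 corresponds to a probability of 0.5 (odds of 1), positive logits to probabilities above 0.5, and negative logits to probabilities below 0.5.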


<!-- However, the regression tree (CART) works with real value $\hat{y}$ which may lie in the range $(-\inf, \inf)$. We know how to convert the real value into probability as discussed in Logistic Regression chapter. i.e, We have,
$$\hat{p} = \frac{1}{1+e^{-\hat{y}}}$$

In logistic regression, $\hat{y}$ was just the output of the linear regressor.

The above equation can be expressed in term of probability as:

$$\hat{y} = log\frac{\hat{p}}{1-\hat{p}}$$
Here, the term $\hat{y}$ is simply called a logit or log odds.

The term $(\frac{\hat{p}}{1-\hat{p}})$, called odds is simply the ratio of probability of $(y=1)$ to probability of $(y=0)$. -->

While performing boosting (iterative addition), we can't add models directly in terms of $\hat{p}$, because it is a probability that must stay between 0 and 1. The logit $\hat{y}$, however, can take any real value, so expressing our loss function in terms of $\hat{y}$ allows us to use boosting for classification problems. So we express our loss function in terms of logits.

<!-- So, we express our final prediction in term of probability$\hat{p}$ and intermediate steps during model construction is expressed in term of logits $\hat{y}$ -->

Now, expressing the logistic loss in terms of the logit, we have

$$J_i = -y_i \log \frac{\hat{p}_i}{1-\hat{p}_i}-\log(1-\hat{p}_i)$$

Substituting $\log\frac{\hat{p}_i}{1-\hat{p}_i} = \hat{y}_i$ and $1-\hat{p}_i = \frac{1}{1+e^{\hat{y}_i}}$ gives

$$J_i = \log(1+e^{\hat{y}_i})-y_i\hat{y}_i$$

Since $\hat{y}_i= F({\mathbf{x}_i})$, we can write the above equation in terms of $F({\mathbf{x}_i})$ as

$$J_i = \log\big(1+e^{F(\mathbf{x}_i)}\big)-y_i F(\mathbf{x}_i)$$

Now we compute the gradient of the loss function with respect to $F(\mathbf{x}_i)$, as we did in gradient boosting for regression.

The gradient of the loss function is:

$$
\frac{\partial J}{\partial F(\mathbf{x}_i)} = \frac{e^{F({\mathbf{x}}_i)}}{1+e^{F({\mathbf{x}}_i)}}-y_i = -\left(y_i-\frac{e^{F(\mathbf{x}_i)}}{1+e^{F(\mathbf{x}_i)}}\right)
$$

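The negative gradient is therefore $y_i - \hat{p}_i$, the difference between the label and the predicted probability. So we train successive regression trees on the data $\left(\mathbf{x}_i,\, y_i-\frac{e^{F(\mathbf{x}_i)}}{1+e^{F(\mathbf{x}_i)}}\right)_{i=1}^N$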

<!-- Once we perform all the boosting operation in term of log(odds), then we convert this into probability using $$\hat{p}_i = \frac{1}{1+e^{-{F({\mathbf{x}_i})}}}$$ -->
In this way, we perform the boosting operation in terms of logits. For the final prediction, however, we require the probability of the class. We can convert logits into probabilities as:
$$\hat{p}_i = \frac{1}{1+e^{-{F({\mathbf{x}_i})}}}$$
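
The derivation above can be verified numerically. The sketch below checks that the logit form of the loss agrees with the probability form, and that the negative gradient equals $y_i - \hat{p}_i$. All function names and the test points in `cases` are assumptions made for this illustration.

```python
import math


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def logistic_loss(y, f):
    """Logistic loss written in terms of the logit f = F(x)."""
    return math.log(1.0 + math.exp(f)) - y * f


def logistic_loss_prob(y, p):
    """Logistic loss written in terms of the probability p."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


def negative_gradient(y, f):
    """Analytic negative gradient: label minus predicted probability."""
    return y - sigmoid(f)


def numerical_gradient(y, f, eps=1e-6):
    """Central finite-difference estimate of d loss / d f."""
    return (logistic_loss(y, f + eps) - logistic_loss(y, f - eps)) / (2 * eps)


cases = [(1, -1.5), (1, 0.0), (0, 0.7), (0, 2.0)]   # (label, logit) pairs, arbitrary
for y, f in cases:
    same_loss = abs(logistic_loss(y, f) - logistic_loss_prob(y, sigmoid(f))) < 1e-9
    print(y, f, round(negative_gradient(y, f), 4), round(-numerical_gradient(y, f), 4), same_loss)
```

The analytic and numerical negative gradients agree, and the two forms of the loss give the same value for each test point.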

So far, we have expressed the logistic loss function in terms of logits. We also discussed how to compute the gradient of the logistic loss function and use it to train successive trees. Now let's summarize gradient boosting for binary classification.

```
Inputs:
```
> - input data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$
> - number of iterations $M$
> - loss function $J$; logistic loss $J = L(y, \hat{p})= -y\log(\hat{p})-(1-y)\log(1-\hat{p})$
> - choice of base-learner model $h(\mathbf{x})$
> - learning rate or shrinkage $\alpha$

```
Algorithm:
```
> 1. initialize $F_0= \log\frac{N_1}{N_0}$, the log-odds of the positive class, where $N_1$ and $N_0$ are the numbers of training samples with $y=1$ and $y=0$
> 2. for $t=1$ to $M$ do
> 3. &nbsp;&nbsp;&nbsp;calculate negative gradients $-g(\mathbf{x}_i)$; where $-g(\mathbf{x}_i)= -\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\bigg|_{F = F_{t-1}}$
> 4. &nbsp;&nbsp;&nbsp;fit a base-learner model $h$ to the negative gradients $-g(\mathbf{x}_i)$
> 5. &nbsp;&nbsp;&nbsp;update the function: $F_t=F_{t-1} + \alpha h$; where $\alpha$ is the shrinkage
> 6. end for
> 7. calculate the probability, $\hat{p}_i=\frac{e^{F_M(\mathbf{x}_i)}}{1+e^{F_M(\mathbf{x}_i)}}$
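
The algorithm above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the base learner is a one-split regression stump on a single feature (`fit_stump`), the toy data are invented, and the stumps are fitted to the negative gradients by least squares exactly as in the steps above. Library implementations typically refine the leaf values further, which this sketch does not attempt.

```python
import math


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def fit_stump(xs, targets):
    """Least-squares regression stump on one feature: one threshold, two leaf means."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x <= thr else right_mean


def gradient_boost_binary(xs, ys, n_iter=100, alpha=0.1):
    n_pos = sum(ys)
    f0 = math.log(n_pos / (len(ys) - n_pos))       # step 1: log-odds of the positive class
    scores = [f0] * len(ys)                        # current logits F(x_i)
    learners = []
    for _ in range(n_iter):                        # step 2
        neg_grad = [y - sigmoid(s) for y, s in zip(ys, scores)]   # step 3: y_i - p_i
        h = fit_stump(xs, neg_grad)                               # step 4
        scores = [s + alpha * h(x) for s, x in zip(scores, xs)]   # step 5
        learners.append(h)
    # Step 7: convert the final logit F_M(x) into a probability.
    return lambda x: sigmoid(f0 + alpha * sum(h(x) for h in learners))


def log_loss(ys, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, probs)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # toy feature values (invented)
ys = [0, 0, 0, 1, 0, 1, 1, 1]                   # toy binary labels (invented)

model = gradient_boost_binary(xs, ys)
p0 = sum(ys) / len(ys)
loss_start = log_loss(ys, [p0] * len(ys))        # loss of the initial constant model
loss_end = log_loss(ys, [model(x) for x in xs])  # loss after boosting
print("log loss before / after:", round(loss_start, 4), round(loss_end, 4))
print("P(y=1 | x=1) =", round(model(1.0), 3), "  P(y=1 | x=8) =", round(model(8.0), 3))
```

Note that all additions happen on the logits `scores`; the sigmoid is applied only when a probability is needed.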

## Gradient Boosting for Multi-Class Classification
Gradient boosting for multi-class classification is a bit different from that for binary classification. First, we can't use the same logistic loss function and sigmoid function used in binary classification. We need to use the cross-entropy function as the loss function and the softmax function to map logits into probabilities. Second, the boosting approach itself is also a bit different compared to binary classification.

Before discussing the gradient boosting algorithm for multi-class classification, let's first discuss the cross-entropy loss function and softmax function.

### Cross-Entropy

We are already familiar with the logistic loss function. Recall that the logistic loss function is used for binary classification where we compute log loss for both positive and negative classes and add them. The expression for logistic loss is:

$$J = L(y, \hat{p}) = -\underbrace{y\log(\hat{p})}_{\text{for positive class}}-\underbrace{(1-y)\log(1-\hat{p})}_{\text{for negative class}}$$
__Note:__ In a binary classification problem, the label $y$ of any data point $\mathbf{x}$ is either one or zero. So only one term on the right-hand side of the above equation is non-zero.

We can extend the concept of log loss to more than two classes. For multi-class classification with $C\gt2$ classes, we calculate a separate loss term for each class and then add them.



The cross-entropy loss function can be defined as:

$$J = H(Q,P) = -\sum_{c=1}^C Q_c\log(P_c)$$
$$J = H(Q,P) = - \underbrace{ Q_1\log(P_1)}_{\text{for first class}}    - \underbrace{ Q_2\log(P_2)}_{\text{for second class}} - \cdots -\underbrace{ Q_C\log(P_C)}_{\text{for $C^{th}$ class}}  $$

where $C$ is the total number of classes, $Q_c$ is the true label for class $c$ (1 if the data point belongs to class $c$ and 0 otherwise), and $P_c$ is the predicted probability that the data point belongs to class $c$.


Suppose we have the iris dataset, which contains three classes $(C=3)$, and consider three samples $\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3$. Their labels can be represented as:

|Data | Setosa $(Q_1)$ | Virginica $(Q_2)$| Versicolor $(Q_3)$|
|------|--------|-----|-----|
|  $\mathbf{x}_1$    |   1     |   0   |   0   |
|  $\mathbf{x}_2$    |    0    |   1   |     0 |
|  $\mathbf{x}_3$    |   0     |   0   |  1    |

Here, sample $\mathbf{x_1}$ belongs to class Setosa. So it is labeled 1 in the Setosa column and 0 for other columns. Similarly, sample $\mathbf{x_2}$ belongs to class Virginica as represented by label 1 in column Virginica, and sample $\mathbf{x_3}$ belongs to class Versicolor as represented by label 1 in the Versicolor column in the above table. The label for sample $\mathbf{x}_1$ is $[Q_1, Q_2, Q_3]= [1, 0, 0]$ . Such representation of the label of the data points is called one-hot encoding.

This shows that, for each sample $\mathbf{x}_i$, only one term on the right-hand side of the cross-entropy equation is non-zero, since only one value of $Q_c$ is non-zero.





The cross-entropy loss function is used to measure the difference between two probability distributions: actual probability distribution $Q$ and the predicted probability distribution $P$.
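
As a small illustration, the sketch below evaluates the cross-entropy for the one-hot label of $\mathbf{x}_1$ against two predicted distributions. The predicted probabilities are made up for this example, and `cross_entropy` is a helper name introduced here.

```python
import math


def cross_entropy(q, p):
    """H(Q, P) = -sum_c Q_c * log(P_c); terms with Q_c = 0 contribute nothing."""
    return -sum(qc * math.log(pc) for qc, pc in zip(q, p) if qc > 0)


q = [1, 0, 0]                 # one-hot label of x_1 (Setosa)
p_good = [0.7, 0.2, 0.1]      # hypothetical prediction that favours the true class
p_bad = [0.1, 0.2, 0.7]       # hypothetical prediction that favours a wrong class

print("loss for good prediction:", round(cross_entropy(q, p_good), 4))
print("loss for bad prediction: ", round(cross_entropy(q, p_bad), 4))
```

With a one-hot label, the loss reduces to $-\log(P_c)$ for the true class $c$, so it is small when the model assigns high probability to the correct class and large otherwise.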



### Softmax function

In the Logistic Regression module, we used the sigmoid function to convert the logits into probability.

Recall that the expression for the sigmoid function is
$$\hat{p} = \frac{1}{1+e^{-\hat{y}}} = \frac{e^{\hat{y}}}{1+e^{\hat{y}}}$$
The above equation gives the probability of a data point belonging to the positive class, i.e., $P(y=1)$. The probability of the negative class is computed as: $$P(y=0) = 1- P(y=1)$$

We can extend the concept of the sigmoid function to more than two classes. This extension gives a new function called the softmax function.

The softmax function maps the logits of multiple classes into probabilities. It is defined as:

$$\hat{p}_i=  f(a_i) = \frac{e^{a_i}}{\sum_{j=1}^C e^{a_j}}$$
where $C$ is the number of classes and $a_i$ is the logit for class $i$.

For example, suppose that while solving a classification problem the logits for the three classes are (2.3, 0.4, 1.2). We can map these logits into their corresponding probabilities using the softmax function. The probability of the first class is:
$$\hat{p}_1 = f(2.3) = \frac{e^{2.3}}{e^{2.3}+e^{0.4}+e^{1.2}} = 0.67$$
Similarly, we can calculate the probabilities of the other classes.
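
The worked example can be reproduced with the short sketch below. Subtracting the largest logit before exponentiating is a common numerical-stability choice made here; it does not change the result, because softmax is unaffected by adding the same constant to every logit.

```python
import math


def softmax(logits):
    """Map a list of logits to probabilities that sum to 1."""
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.3, 0.4, 1.2])                 # logits from the example above
print([round(p, 2) for p in probs])              # first entry is about 0.67
print("sum of probabilities:", round(sum(probs), 6))
```

The remaining two probabilities come out at roughly 0.10 and 0.22, so the three values sum to 1 as a probability distribution should.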

So far, we have discussed the loss function and the logit-to-probability mapping function for multi-class classification. Let's now understand the gradient boosting algorithm for multi-class classification with the help of a simple example.

For a $C$-class classification problem, we define $C$ base predictors, one per class, each giving a score for its class. For example, let's consider the iris dataset, which consists of three classes: Setosa, Virginica, and Versicolor. To solve this problem, we need to define three models $F_{Se}, F_{Vi}, F_{Ve}$, which give a score for the classes Setosa, Virginica, and Versicolor, respectively.

Let's see each of the steps involved:

1. Number of classes ($C$) = 3; i.e., classes = Setosa ($Se$), Virginica ($Vi$), Versicolor ($Ve$).
2. Define three scoring models: $F_{Se}, F_{Vi}, F_{Ve}$.
3. We can use any algorithm for the models. The score is just the output of the model; for example, $F_{Se}(\mathbf{x})$ gives the score for class Setosa.
4. Convert these scores into probabilities using the softmax function:
$$P_{Se}(\mathbf{x}) = \frac{e^{F_{Se}(\mathbf{x})}}{e^{F_{Se}(\mathbf{x})}+e^{F_{Vi}(\mathbf{x})}+e^{F_{Ve}(\mathbf{x})}}$$
$$P_{Vi}(\mathbf{x}) = \frac{e^{F_{Vi}(\mathbf{x})}}{e^{F_{Se}(\mathbf{x})}+e^{F_{Vi}(\mathbf{x})}+e^{F_{Ve}(\mathbf{x})}}$$
$$P_{Ve}(\mathbf{x}) = \frac{e^{F_{Ve}(\mathbf{x})}}{e^{F_{Se}(\mathbf{x})}+e^{F_{Vi}(\mathbf{x})}+e^{F_{Ve}(\mathbf{x})}}$$
5. Use the cross-entropy function to compare the predicted probability distribution $P$ with the actual probability distribution $Q$. To reduce this loss, we need its gradient.
6. Compute the negative gradient of the loss function with respect to each scoring model, i.e.,
$$-g_{Se}(\mathbf{x}) = -\frac{\partial H(Q, P)}{\partial F_{Se}(\mathbf{x})}$$
$$-g_{Vi}(\mathbf{x}) = -\frac{\partial H(Q, P)}{\partial F_{Vi}(\mathbf{x})}$$
$$-g_{Ve}(\mathbf{x}) = -\frac{\partial H(Q, P)}{\partial F_{Ve}(\mathbf{x})}$$

7. Using these negative gradients as targets, train the models $h_{Se}, h_{Vi}, h_{Ve}$.
8. Finally, perform iterative addition as:
$$F_{Se, t} = F_{Se, t-1} +\alpha h_{Se}$$
$$F_{Vi, t} = F_{Vi, t-1} +\alpha h_{Vi}$$
$$F_{Ve, t} = F_{Ve, t-1} +\alpha h_{Ve}$$
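
For the softmax followed by cross-entropy, a standard result is that the negative gradient with respect to the score of class $c$ is simply $Q_c - P_c$. The sketch below checks this against a finite-difference estimate; the scores reuse the logits (2.3, 0.4, 1.2) from the softmax example, and all function names are introduced here for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(q, p):
    return -sum(qc * math.log(pc) for qc, pc in zip(q, p))


def negative_gradient(q, scores):
    """Analytic negative gradient of H(Q, softmax(scores)): Q_c - P_c for each class."""
    p = softmax(scores)
    return [qc - pc for qc, pc in zip(q, p)]


def numerical_negative_gradient(q, scores, eps=1e-6):
    """Central finite-difference estimate, one class score at a time."""
    result = []
    for c in range(len(scores)):
        up = list(scores)
        down = list(scores)
        up[c] += eps
        down[c] -= eps
        slope = (cross_entropy(q, softmax(up)) - cross_entropy(q, softmax(down))) / (2 * eps)
        result.append(-slope)
    return result


q = [1, 0, 0]                # one-hot label: the sample belongs to the first class
scores = [2.3, 0.4, 1.2]     # scores F_Se(x), F_Vi(x), F_Ve(x), reused from the softmax example

analytic = negative_gradient(q, scores)
numeric = numerical_negative_gradient(q, scores)
print([round(v, 4) for v in analytic])
print([round(v, 4) for v in numeric])
```

The target for the true class is positive (its score should go up) and the targets for the other classes are negative (their scores should go down), mirroring the $y_i - \hat{p}_i$ targets of the binary case.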

In the above section, we talked about the gradient boosting algorithm for a three-class classification problem. Let's summarize the gradient boosting algorithm for multi-class classification with any number of classes $C$.


```
Inputs:
```
> - input data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$
> - number of classes $C$; Classes = $\{c1, c2, \ldots, cC\}$
> - number of iterations $M$
> - loss function $J$; cross-entropy loss $J = H(Q,P) = -\sum_{c=1}^C Q_c\log(P_c)$
> - choice of base-learner model $h_{c1}(\mathbf{x}),h_{c2}(\mathbf{x}),\ldots,h_{cC}(\mathbf{x})$
> - learning rate or shrinkage $\alpha$

```
Algorithm:
```
> 1. initialize $F_{c1, 0} = \log\frac{N_{c1}}{N - N_{c1}},\ F_{c2, 0}= \log\frac{N_{c2}}{N - N_{c2}},\ \ldots,\ F_{cC,0}= \log\frac{N_{cC}}{N - N_{cC}}$, where $N_{c}$ is the number of training samples with label $c$
> 2. for $t=1$ to $M$ do
> 3. &nbsp;&nbsp;&nbsp;calculate the negative gradient for classes $c1, c2, \ldots, cC$, i.e.,
> $$-g_{c1}(\mathbf{x}_i)= -\frac{\partial J}{\partial F_{c1}(\mathbf{x}_i)}\bigg|_{F_{c1} = F_{c1, t-1}}$$
> $$-g_{c2}(\mathbf{x}_i)= -\frac{\partial J}{\partial F_{c2}(\mathbf{x}_i)}\bigg|_{F_{c2} = F_{c2, t-1}}$$
> $$\cdots$$
> $$-g_{cC}(\mathbf{x}_i)= -\frac{\partial J}{\partial F_{cC}(\mathbf{x}_i)}\bigg|_{F_{cC} = F_{cC, t-1}}$$
> 4. &nbsp;&nbsp;&nbsp;fit base-learner models $h_{c1}$, $h_{c2}$, ..., $h_{cC}$ to the negative gradients $-g_{c1}(\mathbf{x}_i)$, $-g_{c2}(\mathbf{x}_i)$, ..., $-g_{cC}(\mathbf{x}_i)$ respectively
> 5. &nbsp;&nbsp;&nbsp;update the functions using $\alpha$ as the shrinkage factor:
> $$F_{c1, t}=F_{c1,t-1} + \alpha h_{c1}$$
> $$F_{c2, t}=F_{c2,t-1} + \alpha h_{c2}$$
> $$\cdots$$
> $$F_{cC, t}=F_{cC,t-1} + \alpha h_{cC}$$
> 6. end for
> 7. identify the class as: Class $= \arg\max_{ci} F_{ci,M}(\mathbf{x})$; $i = 1, 2, \ldots, C$
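
A compact sketch of this algorithm in plain Python is given below. It rests on several assumptions made for illustration: classes are encoded as integers `0..C-1`, every class occurs in the training data, the base learner is a one-split regression stump on a single feature (`fit_stump`), the negative gradient for class $c$ is taken as $Q_c - P_c$ (the standard result for softmax with cross-entropy), and the toy data are invented.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_stump(xs, targets):
    """Least-squares regression stump on one feature: one threshold, two leaf means."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x <= thr else right_mean


def gradient_boost_multiclass(xs, labels, n_classes, n_iter=100, alpha=0.1):
    n = len(xs)
    onehot = [[1.0 if lab == c else 0.0 for c in range(n_classes)] for lab in labels]
    # Step 1: initial score of class c is log(count(y == c) / count(y != c)).
    init = []
    for c in range(n_classes):
        n_c = sum(1 for lab in labels if lab == c)
        init.append(math.log(n_c / (n - n_c)))
    scores = [list(init) for _ in range(n)]        # scores[i][c] = F_c(x_i)
    rounds = []                                    # rounds[t][c] = stump for class c at round t
    for _ in range(n_iter):                        # step 2
        probs = [softmax(s) for s in scores]
        stumps = []
        for c in range(n_classes):
            neg_grad = [onehot[i][c] - probs[i][c] for i in range(n)]   # step 3: Q_c - P_c
            stumps.append(fit_stump(xs, neg_grad))                      # step 4
        for i, x in enumerate(xs):                                      # step 5
            for c in range(n_classes):
                scores[i][c] += alpha * stumps[c](x)
        rounds.append(stumps)

    def predict(x):
        final = [init[c] + alpha * sum(stumps[c](x) for stumps in rounds)
                 for c in range(n_classes)]
        return max(range(n_classes), key=lambda c: final[c])            # step 7: argmax
    return predict


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]   # toy feature values (invented)
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]                 # toy class labels (invented)

model = gradient_boost_multiclass(xs, labels, n_classes=3)
predictions = [model(x) for x in xs]
print(predictions)
```

One stump is trained per class in every round, so $M$ boosting iterations produce $C \times M$ base learners in total.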

This brings us to the end of this chapter. In gradient boosting, the shortcomings/errors of an existing model are identified by the gradient, which is then used to add models successively. We discussed the gradient boosting algorithm for regression with the help of a numerical example. We also introduced different loss functions for regression: squared loss, absolute loss, and Huber loss. Absolute loss and Huber loss are more robust to outliers than squared loss. Finally, we talked about the gradient boosting algorithm for binary and multi-class classification.


### Key Takeaways
- In gradient boosting, the shortcoming/error of an existing model is identified by the gradient.
- Gradient Boosting provides a general framework for solving both classification and regression problems.
- The basic idea of gradient boosting is to compute the gradient of the loss function and use it to train a base learner.
- Popular loss functions for regression are squared loss, absolute loss, and Huber loss. Absolute loss and Huber loss are more robust to outliers than squared loss.
- For binary classification, the logistic loss function is expressed in terms of logits.
- Cross-entropy loss function is used for multi-class classification.
