In [1]:
import pandas as pd
import numpy as np

In [19]:
# toy data
df = pd.DataFrame({
    'height': [1.6, 1.6, 1.5, 1.8, 1.5, 1.4],
    'fav_color': ['blue', 'green', 'blue', 'red', 'green', 'blue'],
    'gender': ['male', 'female', 'female', 'male', 'male', 'female'],
    'weight': [88, 76, 56, 73, 77, 57]
})

# encode the categorical columns
df['gender'] = df['gender'].astype('category')
df['fav_color'] = df['fav_color'].astype('category')

df = pd.get_dummies(df, columns=['fav_color', 'gender'])
df.drop(columns=['gender_male'], inplace=True)

df.head()

| | height | weight | fav_color_blue | fav_color_green | fav_color_red | gender_female |
|---|---|---|---|---|---|---|
| 0 | 1.6 | 88 | True | False | False | False |
| 1 | 1.6 | 76 | False | True | False | True |
| 2 | 1.5 | 56 | True | False | False | True |
| 3 | 1.8 | 73 | False | False | True | False |
| 4 | 1.5 | 77 | False | True | False | False |


## Toy Data Example

In [25]:
target = 'weight'
features = df.columns[df.columns != target]

# 1. Take the average of the target variable (weight) as the baseline prediction 
avg_wt = df[target].mean()
print(f"Baseline prediction (average weight): {avg_wt:.2f}")

# 2. Fit a regression tree on the ERRORS of the baseline prediction - these errors are called Pseudo-Residuals
from sklearn.tree import DecisionTreeRegressor
pseudo_residuals = df[target] - avg_wt
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=42)
tree.fit(df[features], pseudo_residuals)
p_residuals = tree.predict(df[features])

# 3. Add the tree's predicted pseudo-residuals, scaled by a learning rate, to the baseline prediction to update the weight predictions
learning_rate = 0.1
df['pred_weight'] = avg_wt + p_residuals * learning_rate

# 4. Compute new pseudo-residuals using the updated weight predictions
pseudo_residuals = df[target] - df['pred_weight']

# 5. Fit a new regression tree on the new pseudo-residuals
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=42)
tree.fit(df[features], pseudo_residuals)
p_residuals = tree.predict(df[features])

# 6. Add the new predicted pseudo-residuals, scaled by the learning rate, to the previous predictions to update the weight predictions again
df['pred_weight'] += p_residuals * learning_rate

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(df[target], df['pred_weight'])
print(f"MSE: {mse:.2f}")

# Repeat steps 4-6 until convergence or a set number of iterations

i = 0
while (i < 10) and (mse > 10):
    # Compute new pseudo-residuals using the updated weight predictions
    pseudo_residuals = df[target] - df['pred_weight']

    # Fit a new regression tree on the new pseudo-residuals
    tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=42)
    tree.fit(df[features], pseudo_residuals)
    p_residuals = tree.predict(df[features])

    # Add the new predicted pseudo-residuals, scaled by the learning rate, to the previous predictions
    df['pred_weight'] += p_residuals * learning_rate

    mse = mean_squared_error(df[target], df['pred_weight'])
    # print(f"Iteration {i+1}, MSE: {mse:.2f}")
    i += 1

print(f"Final MSE: {mse:.2f}")
df.head()

Baseline prediction (average weight): 71.17
MSE: 84.79
Final MSE: 10.44


| | height | weight | fav_color_blue | fav_color_green | fav_color_red | gender_female | pred_weight |
|---|---|---|---|---|---|---|---|
| 0 | 1.6 | 88 | True | False | False | False | 83.245769 |
| 1 | 1.6 | 76 | False | True | False | True | 74.821458 |
| 2 | 1.5 | 56 | True | False | False | True | 60.6423 |
| 3 | 1.8 | 73 | False | False | True | False | 72.555398 |
| 4 | 1.5 | 77 | False | True | False | False | 75.092775 |


## GBM Regression Algorithm

### Input: Data $\{x_i, y_i\}^n_{i=1}$ and a differentiable loss function $L(y_i, F(x))$

In the toy data example above, we never wrote down a loss function explicitly; we simply computed the pseudo-residuals directly and fit trees to them. A commonly used loss function for regression GBM is the squared error, scaled by one half so that its derivative is tidy: $$L(y_i, F(x_i)) = \frac{1}{2} (y_i - F(x_i))^2$$

### Step 1: Initialize the model with a *constant value*

The constant value used for initialization is denoted $F_0(x)$ and is chosen to minimize the sum of the loss function for all observations:
$$F_0(x) = \argmin_\gamma \sum_{i = 1}^n L(y_i, \gamma)$$

For the loss function proposed above, this initial value may be found conveniently by differentiating the loss for each observation with respect to the prediction:
$$\frac{d}{dF_0(x)} \bigg[ \frac{1}{2} (y_i - F_0(x))^2 \bigg] = -(y_i - F_0(x))$$
Setting the sum of these derivatives over all observations equal to zero and solving yields the optimal initialization:
$$-\sum_{i=1}^n y_i + nF_0(x) = 0 \ \implies \ F_0(x) = \frac{1}{n}\sum_{i=1}^n y_i$$
This is simply the average of the target values. For other loss functions, however, the optimal initial prediction may differ from the average; for the absolute error loss, for example, it is the median.
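As a quick numerical check of this result, the sketch below evaluates the summed loss over a grid of candidate constants for the toy weights. The helper `best_constant` and the grid resolution are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

y = np.array([88, 76, 56, 73, 77, 57], dtype=float)  # toy weights from above


def squared_loss(y, gamma):
    return 0.5 * (y - gamma) ** 2


def absolute_loss(y, gamma):
    return np.abs(y - gamma)


def best_constant(y, loss, grid):
    """Return the grid value gamma that minimizes sum_i loss(y_i, gamma)."""
    totals = np.array([loss(y, gamma).sum() for gamma in grid])
    return grid[np.argmin(totals)]


grid = np.linspace(y.min(), y.max(), 3201)  # candidate constants, step 0.01

f0_squared = best_constant(y, squared_loss, grid)
f0_absolute = best_constant(y, absolute_loss, grid)

print(f"squared error:  grid minimizer {f0_squared:.2f}, mean {y.mean():.2f}")
# With an even number of observations, every value between the two middle
# observations minimizes the absolute error; np.median reports their midpoint.
print(f"absolute error: grid minimizer {f0_absolute:.2f}, median {np.median(y):.2f}")
```

The grid search is only there to make the argmin visible; in practice the closed-form value is used directly.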

### Step 2: Loop

#### **A**: Compute the gradient of the loss function w.r.t. the predicted target values 
$$r_{im} = -\bigg[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg]_{F(x) = F_{m-1}(x)}, \ i=1,...,n$$

When the squared error loss is used, the negative gradient is simply the residual $y_i - F_{m-1}(x_i)$, i.e. the pseudo-residuals computed in the toy data example. So, in that example we were implicitly boosting with the squared error loss. However, it is not always the case that the negative gradient of the loss function equals the residual. Defining the pseudo-residuals through the gradient rather than the raw residual lets gradient boosting machines work with any differentiable loss function. Additionally, fitting to the negative gradient instead of the residuals *does not change* the error-minimizing properties of boosting machines, since each tree still moves the predictions in a direction that reduces the chosen loss function.
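To make the link between gradients and residuals concrete, the following sketch estimates the negative gradient by central finite differences and compares it with the closed forms. The function names and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np


def squared_error(y, f):
    return 0.5 * (y - f) ** 2


def absolute_error(y, f):
    return np.abs(y - f)


def negative_gradient(loss, y, f, eps=1e-6):
    """Central finite-difference estimate of -dL/dF at the current predictions f."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y = np.array([88, 76, 56, 73, 77, 57], dtype=float)  # toy weights
f_prev = np.full_like(y, y.mean())                   # F_0: the baseline prediction

r_squared = negative_gradient(squared_error, y, f_prev)
r_absolute = negative_gradient(absolute_error, y, f_prev)

# Squared error: the pseudo-residuals are the ordinary residuals.
print(np.allclose(r_squared, y - f_prev, atol=1e-4))
# Absolute error: the pseudo-residuals are only the signs of the residuals.
print(np.allclose(r_absolute, np.sign(y - f_prev), atol=1e-4))
```

Real implementations use the analytic gradient; the finite-difference version is only a way to check it.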

#### **B**: Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}, j=1, ..., J_m$
In other words, fit a regression tree to the pseudo-residuals $r_{im}$, which were computed from the previous iteration's predictions $F_{m-1}(x)$. The "terminal regions" are simply the leaves of this tree.

#### **C**: Compute the output values (predictions) for each terminal region
The output value of a leaf is not necessarily the average of the pseudo-residuals that fall into it. Rather, it is the value that minimizes the sum of the loss function over the observations within the leaf:
$$\gamma_{jm} = \argmin_\gamma \sum_{x_i\in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma), \ j=1,...,J_m$$
As in Step 1, these values may often be found by setting the first derivative of the summation to zero. For more complex loss functions, numerical methods such as gradient descent may be used instead. For the squared error loss, the optimal output value happens to be the average of the residuals $y_i - F_{m-1}(x_i)$ in each leaf, which is exactly what an ordinary regression tree fit to the pseudo-residuals predicts.
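The squared-error case can be checked directly. This minimal sketch, assuming scikit-learn's `DecisionTreeRegressor` with its default squared-error criterion and using only the toy heights as a feature, computes the leaf-wise mean residual by hand and compares it with the tree's own predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.6], [1.6], [1.5], [1.8], [1.5], [1.4]])  # toy heights
y = np.array([88, 76, 56, 73, 77, 57], dtype=float)       # toy weights

f_prev = np.full_like(y, y.mean())  # F_0
residuals = y - f_prev              # pseudo-residuals under squared error

tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=42)
tree.fit(X, residuals)

# apply() returns the index of the leaf (terminal region) each row falls into.
leaf_ids = tree.apply(X)

# gamma_j = mean residual of the observations in leaf j.
gammas = {leaf: residuals[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}
gamma_per_row = np.array([gammas[leaf] for leaf in leaf_ids])

print(np.allclose(gamma_per_row, tree.predict(X)))
```

For other losses the hand-computed `gammas` would replace the tree's stored leaf values rather than match them.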

#### **D**: Update
Update the predictions by adding the output value of the leaf that each observation falls into, scaled by the learning rate:
$$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}I(x\in R_{jm})$$
where $\nu$ is the learning rate and $I(\cdot)$ is the indicator function.

### Step 3: Output $F_M(x)$
Simple as that.
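Putting Steps 1 to 3 together for the squared error loss, a minimal sketch might look like the following. The function names `fit_gbm_regressor` and `predict_gbm_regressor`, the choice of 12 trees, and the two toy features are assumptions made for illustration; this is not a substitute for a library implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbm_regressor(X, y, n_trees=12, learning_rate=0.1, max_leaf_nodes=4):
    """Gradient boosting with squared error loss; returns F_0 and the fitted trees."""
    f0 = y.mean()                    # Step 1: constant that minimizes squared error
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):         # Step 2: loop over m = 1..M
        residuals = y - pred         # A: negative gradient of the squared error
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
        tree.fit(X, residuals)       # B and C: leaf outputs are mean residuals
        pred += learning_rate * tree.predict(X)  # D: update
        trees.append(tree)
    return f0, trees


def predict_gbm_regressor(X, f0, trees, learning_rate=0.1):
    """Step 3: F_M(x) = F_0 + nu * (sum of the tree outputs)."""
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)


# Toy data: columns are height and gender_female.
X = np.array([[1.6, 0], [1.6, 1], [1.5, 1], [1.8, 0], [1.5, 0], [1.4, 1]])
y = np.array([88, 76, 56, 73, 77, 57], dtype=float)

f0, trees = fit_gbm_regressor(X, y)
pred = predict_gbm_regressor(X, f0, trees)

baseline_mse = np.mean((y - y.mean()) ** 2)
final_mse = np.mean((y - pred) ** 2)
print(f"Baseline MSE: {baseline_mse:.2f}, boosted MSE: {final_mse:.2f}")
```

The learning rate passed to `predict_gbm_regressor` must match the one used during fitting, which is why a real implementation would store it alongside the trees.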

## GBM Classification Algorithm

### Input: Data $\{x_i, y_i\}^n_{i=1}$ and a differentiable loss function $L(y_i, F(x))$
These are the same inputs as in the regression context.\
A common loss function is the cross-entropy, also known as the negative log-likelihood, where $p$ is the predicted probability that $y_i = 1$:
$$-y_i \ln\Big(\frac{p}{1-p}\Big) - \ln(1 - p)$$
Note that this is the binary cross-entropy, so it is not suitable for multi-class predictions.
This loss function may easily be expressed in terms of the log-odds rather than the probability:
$$-y_i \ln(\text{odds}) + \ln\big(1 + e^{\ln(\text{odds})}\big)$$
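The equivalence of the two forms, and of both with the textbook binary cross-entropy, can be verified numerically. The sketch below uses arbitrary example labels and probabilities; the function names are illustrative.

```python
import numpy as np


def cross_entropy(y, p):
    """Textbook binary cross-entropy."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def loss_in_probability(y, p):
    """The form above written in terms of the probability p."""
    return -y * np.log(p / (1 - p)) - np.log(1 - p)


def loss_in_log_odds(y, log_odds):
    """The form above written in terms of the log-odds.

    np.logaddexp(0, z) computes ln(1 + e^z) in a numerically stable way.
    """
    return -y * log_odds + np.logaddexp(0.0, log_odds)


y = np.array([0.0, 1.0, 1.0, 0.0])  # arbitrary example labels
p = np.array([0.1, 0.5, 0.9, 0.7])  # arbitrary example probabilities
log_odds = np.log(p / (1 - p))

print(np.allclose(cross_entropy(y, p), loss_in_probability(y, p)))
print(np.allclose(cross_entropy(y, p), loss_in_log_odds(y, log_odds)))
```

Working in log-odds is convenient because the model output $F(x)$ is then unconstrained, while the probability must stay between 0 and 1.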

### **Step 1:** Initialize the model with a constant value
As in the regression context, the initial value is chosen to minimize the sum of the loss function over all observations:
$$F_0(x) = \argmin_\gamma \sum_{i=1}^n L(y_i, \gamma)$$
For the log-loss this may be solved analytically, and the solution is simply the unconditional log-odds of the target variable, i.e.:
$$F_0(x) = \ln\Big(\frac{p}{1-p}\Big)$$
where $p = \frac{1}{n}\sum_{i=1}^n y_i$ is the observed proportion of positive labels.
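A small numerical check of this initialization, using hypothetical labels (they are not taken from the toy data) and a grid search over candidate log-odds values:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1], dtype=float)  # hypothetical binary labels

p = y.mean()                # unconditional probability of the positive class
f0 = np.log(p / (1 - p))    # unconditional log-odds


def total_log_loss(y, log_odds):
    """Sum of the log-loss when every observation gets the same log-odds."""
    return (-y * log_odds + np.logaddexp(0.0, log_odds)).sum()


grid = np.linspace(-3, 3, 6001)  # candidate constants, step 0.001
totals = np.array([total_log_loss(y, g) for g in grid])
best = grid[np.argmin(totals)]

print(f"analytic F_0: {f0:.3f}, grid minimizer: {best:.3f}")
```

The two values agree up to the grid resolution, which is all this sketch is meant to show.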

### **Step 2**: Loop for $m=1$ to $M$

#### **A**: Compute the gradient of the loss-function (pseudo-residuals)
$$r_{im} = -\bigg[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg]_{F(x) = F_{m-1}(x)}, \ i=1,...,n$$
For the log-loss, the gradient is:
$$r_{i} = y_i - \frac{e^{\ln(\frac{p}{1-p})}}{1 + e^{\ln(\frac{p}{1-p})}} = y_i - p$$
To be more precise, $p$ is now the predicted conditional probability for observation $i$, so the pseudo-residual is: $$r_{im} = y_i - P(y_i = 1 \mid x_i)$$
At $m=1$, the conditional probability equals the unconditional probability because the initial prediction $F_0(x)$ is the unconditional log-odds, which is the same for every observation.
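The gradient formula can be checked against a finite-difference estimate at the initial prediction. The labels below are hypothetical and the step size `eps` is an assumption.

```python
import numpy as np


def sigmoid(f):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-f))


def log_loss(y, f):
    """Binary log-loss written in terms of the log-odds f."""
    return -y * f + np.logaddexp(0.0, f)


y = np.array([1, 0, 1, 1, 0, 1], dtype=float)  # hypothetical binary labels
p0 = y.mean()
f_prev = np.full_like(y, np.log(p0 / (1 - p0)))  # F_0 for every observation

eps = 1e-6
numeric = -(log_loss(y, f_prev + eps) - log_loss(y, f_prev - eps)) / (2 * eps)
analytic = y - sigmoid(f_prev)  # y_i minus the predicted probability

print(np.allclose(numeric, analytic, atol=1e-6))
print(np.allclose(sigmoid(f_prev), p0))  # at m = 1 every probability equals p
```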

#### **B**: Fit a *regression* tree to the gradient (pseudo-residuals)
Fit a regression tree to the $r_{im}$ values and create terminal regions (leaves) $R_{jm}, j=1,...,J_m$

#### **C**: Compute output values for each terminal region
The output values for each region are again found by minimizing the loss function within the region:
$$\gamma_{jm} = \argmin_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma), \ j=1,...,J_m$$
Sometimes these are analytically solvable; other times they are approximated numerically.\
For the log-loss there is no closed-form solution, and the output values are commonly approximated with a single Newton-Raphson step: $$\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} r_{im}}{\sum_{x_i \in R_{jm}} p_i (1 - p_i)}$$
where $r_{im}$ is the pseudo-residual and $p_i = \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}$ is the previous predicted probability for observation $i$ in leaf $j$.\
At $m=1$, $p_i = \frac{e^{F_0(x)}}{1 + e^{F_0(x)}} = p, \ \forall i$
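For the log-loss, a widely used approximation to the leaf output is a single Newton-Raphson step, $\sum r / \sum p(1-p)$, taken over the observations in the leaf. The sketch below compares that step with a grid-search minimizer for one hypothetical leaf containing three observations; all values are made up for illustration.

```python
import numpy as np


def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))


def leaf_loss(y, f_prev, gamma):
    """Summed log-loss in a leaf after shifting the log-odds by gamma."""
    f = f_prev + gamma
    return (-y * f + np.logaddexp(0.0, f)).sum()


# One hypothetical leaf: two positives and one negative, all starting at log-odds 0.
y_leaf = np.array([1.0, 1.0, 0.0])
f_prev = np.zeros(3)

p = sigmoid(f_prev)   # previous predicted probabilities (all 0.5)
r = y_leaf - p        # pseudo-residuals

gamma_newton = r.sum() / (p * (1 - p)).sum()  # 0.5 / 0.75, about 0.667

grid = np.linspace(-3, 3, 6001)
losses = np.array([leaf_loss(y_leaf, f_prev, g) for g in grid])
gamma_exact = grid[np.argmin(losses)]         # close to ln(2), about 0.693

print(f"Newton step: {gamma_newton:.3f}, grid minimizer: {gamma_exact:.3f}")
```

The Newton step is close to, but not identical with, the exact minimizer; since the update is further shrunk by the learning rate, the approximation is usually considered good enough.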

#### **D**: Update predictions
Update the predicted log-odds (the predicted probabilities follow by applying the logistic function to $F_m(x)$):
$$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}I(x\in R_{jm})$$
Where $\nu$ is the learning rate and $I(\cdot)$ is the indicator function.
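The loop A to D for the log-loss can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the names `fit_gbm_classifier` and `predict_proba` are hypothetical, the leaf outputs use the common single Newton-Raphson step $\sum r / \sum p(1-p)$, and the example predicts `gender_female` from the toy heights and weights. It does not guard against leaves where $p(1-p)$ underflows to zero.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))


def fit_gbm_classifier(X, y, n_trees=20, learning_rate=0.1, max_leaf_nodes=4):
    """Binary gradient boosting on the log-loss; returns F_0 and (tree, gammas) stages."""
    p0 = y.mean()
    f0 = np.log(p0 / (1 - p0))            # Step 1: unconditional log-odds
    f = np.full(len(y), f0)
    stages = []
    for _ in range(n_trees):              # Step 2: loop over m = 1..M
        p = sigmoid(f)
        residuals = y - p                 # A: pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
        tree.fit(X, residuals)            # B: regression tree on the residuals
        leaf_ids = tree.apply(X)
        gammas = {}
        for leaf in np.unique(leaf_ids):  # C: one Newton step per leaf
            in_leaf = leaf_ids == leaf
            gammas[leaf] = residuals[in_leaf].sum() / (p[in_leaf] * (1 - p[in_leaf])).sum()
        f += learning_rate * np.array([gammas[leaf] for leaf in leaf_ids])  # D: update
        stages.append((tree, gammas))
    return f0, stages


def predict_proba(X, f0, stages, learning_rate=0.1):
    """Accumulate the log-odds F_M(x), then map them to probabilities."""
    f = np.full(X.shape[0], f0)
    for tree, gammas in stages:
        f += learning_rate * np.array([gammas[leaf] for leaf in tree.apply(X)])
    return sigmoid(f)


# Toy data: columns are height and weight; the label is gender_female.
X = np.array([[1.6, 88], [1.6, 76], [1.5, 56], [1.8, 73], [1.5, 77], [1.4, 57]])
y = np.array([0, 1, 1, 0, 0, 1], dtype=float)

f0, stages = fit_gbm_classifier(X, y)
proba = predict_proba(X, f0, stages)


def mean_log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


baseline_loss = mean_log_loss(y, np.full(len(y), y.mean()))
final_loss = mean_log_loss(y, proba)
print(f"Baseline log-loss: {baseline_loss:.3f}, boosted log-loss: {final_loss:.3f}")
```

Note that the model accumulates log-odds throughout and only converts to probabilities at prediction time.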

### **Step 3:** Output predictions $F_M(x)$
Simple as