1. Initialize the Model: 
Start with a simple model, typically just the mean of the target variable.

2. Calculate Residuals:
For each training example, calculate the difference between the actual target value and the predicted value (i.e., the residuals).

3. Train a Weak Learner:
A new weak learner (typically a shallow decision tree) is trained to predict the residuals from the previous step.

4. Update the Model:
The new learner's predictions, scaled by a learning rate, are added to the ensemble's current predictions. Because the residuals point in the direction that reduces the loss, this update acts as a gradient descent step.

5. Repeat:
Steps 2 to 4 are repeated for a fixed number of rounds, with each new learner fitted to the error that remains.

6. Final Model:
The final prediction is the initial prediction plus the sum of the contributions from all the weak learners.
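The six steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: it assumes squared-error loss, one-dimensional inputs, and a hand-written decision stump as the weak learner. The helper names (`fit_stump`, `fit_boosting`, `predict`, `training_mse`) and the toy data are made up for this example.

```python
def fit_stump(xs, targets):
    """Least-squares decision stump: returns (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on each side
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_rounds, learning_rate=0.1):
    initial = sum(ys) / len(ys)                    # step 1: start from the mean
    predictions = [initial] * len(ys)
    stumps = []
    for _ in range(n_rounds):                      # step 5: repeat
        residuals = [y - p for y, p in zip(ys, predictions)]   # step 2
        stump = fit_stump(xs, residuals)           # step 3: weak learner on residuals
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]       # step 4: update
    return initial, learning_rate, stumps


def predict(model, x):
    initial, learning_rate, stumps = model
    # step 6: initial prediction plus the scaled contributions of all stumps
    return initial + learning_rate * sum(predict_stump(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data, invented for illustration
xs = [float(i) for i in range(10)]
ys = [1.2, 1.0, 1.4, 3.1, 2.9, 3.3, 6.0, 6.2, 5.8, 6.1]

for rounds in (0, 5, 50):
    model = fit_boosting(xs, ys, n_rounds=rounds)
    print(rounds, "rounds -> training MSE:", round(training_mse(model, xs, ys), 4))
```

With zero rounds the model is just the mean. Each additional round should lower the training error, which is also why the number of rounds needs to be limited on real data.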

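Error: the total loss over the training set, where $L$ is typically the squared error (MSE):

$$L(f) = \sum_{i=1}^{N} L(y_i, f(x_i))$$

We want to find the function $f(x)$ that minimizes the loss function across the dataset:

$$\hat{f} = \arg\min_{f} \sum_{i=1}^{N} L(y_i, f(x_i))$$

The initial model is typically a simple constant prediction $\gamma$. For squared error, the best constant is the mean of the target values:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma) = \frac{1}{N} \sum_{i=1}^{N} y_i$$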
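As a quick check of why the mean is the natural starting point under squared error, take $L(y, \gamma) = \tfrac{1}{2}(y - \gamma)^2$ (the factor $\tfrac{1}{2}$ is an assumed scaling that does not change the minimizer) and set the derivative with respect to the constant $\gamma$ to zero:

$$\frac{d}{d\gamma} \sum_{i=1}^{N} \tfrac{1}{2}(y_i - \gamma)^2 = -\sum_{i=1}^{N} (y_i - \gamma) = 0 \quad\Longrightarrow\quad \gamma = \frac{1}{N} \sum_{i=1}^{N} y_i$$

For other losses the best constant differs. For absolute error, for example, it is the median of the targets.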
​


### Boosting with trees 


To improve the prediction, the algorithm iteratively adds new models (often decision trees), each trained on the errors made by the ensemble built so far.

The model at stage $m$ is represented as:

$$F_m(x_i) = F_{m-1}(x_i) + h_m(x_i)$$

Here:

* $F_{m-1}(x_i)$ is the prediction from the previous stage.
* $h_m(x_i)$ is the new "correction" model added at stage $m$, which is trained to predict the residuals (the errors) from the previous model.

### Steepest Descent 

The key idea of gradient boosting is that it applies gradient descent to minimize the loss function, treating the model's predictions $F(x_i)$ themselves as the quantities being optimized.

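* At each stage $m$, the new model $h_m(x_i)$ should try to reduce the residual errors from the previous stage.
* The residual error is the negative gradient of the loss function with respect to the predictions. For squared error $L = \tfrac{1}{2}(y_i - F(x_i))^2$, the negative gradient is exactly $y_i - F(x_i)$.

The gradient for sample $i$ at stage $m$ is:

$$g_{im} = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x_i) = F_{m-1}(x_i)}$$

This means that the gradient $g_{im}$ measures how quickly the loss changes as the prediction at $x_i$ changes, so moving the prediction in the direction $-g_{im}$ decreases the loss.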
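To make the link between residuals and gradients concrete, the sketch below computes the derivative of the loss with respect to the prediction for two common losses and compares its negative with the plain residual. The function names and the three sample values are invented for illustration, and squared error is assumed to carry a factor of one half.

```python
def squared_error(y, f):
    return 0.5 * (y - f) ** 2


def squared_error_gradient(y, f):
    # d/df of 0.5 * (y - f)^2
    return f - y


def absolute_error(y, f):
    return abs(y - f)


def absolute_error_gradient(y, f):
    # d/df of |y - f| (not defined at y == f; 0.0 is returned there by convention)
    if f > y:
        return 1.0
    if f < y:
        return -1.0
    return 0.0


def numerical_gradient(loss, y, f, eps=1e-6):
    # Central finite difference, used only to sanity-check the formulas above
    return (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y_true = [3.0, -1.0, 2.5]
current = [2.0, 0.0, 4.0]  # predictions of the previous stage

residuals = [y - f for y, f in zip(y_true, current)]
neg_grad_squared = [-squared_error_gradient(y, f) for y, f in zip(y_true, current)]
neg_grad_absolute = [-absolute_error_gradient(y, f) for y, f in zip(y_true, current)]

print("residuals:               ", residuals)
print("negative gradient (MSE): ", neg_grad_squared)
print("negative gradient (MAE): ", neg_grad_absolute)
```

For squared error the negative gradient coincides with the residual. For absolute error it keeps only the sign of the residual, which is why the targets fitted at each stage are often called pseudo-residuals in the general case.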

##### New Model h𝑚(𝑥)
We want the new model $h_m(x)$ to be proportional to the negative gradient (steepest descent):

$$h_m(x_i) = -\rho_m \, g_{im}$$

where $\rho_m$ is the step size (or learning rate) that determines how much of the correction (negative gradient) should be applied.
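The effect of the step size can be seen by taking a single steepest-descent step directly on the training predictions. The sketch below assumes squared-error loss with a factor of one half. The names `total_loss` and `steepest_descent_step` and the data are hypothetical. Real gradient boosting replaces the raw step with a tree that approximates it, so that the correction also applies to unseen inputs.

```python
def total_loss(ys, preds):
    return sum(0.5 * (y - p) ** 2 for y, p in zip(ys, preds))


def steepest_descent_step(ys, preds, step_size):
    # Gradient of 0.5 * (y - p)^2 with respect to p is (p - y)
    gradients = [p - y for y, p in zip(ys, preds)]
    # Move each prediction against its gradient, scaled by the step size
    return [p - step_size * g for p, g in zip(preds, gradients)]


ys = [3.0, -1.0, 2.5]
preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean

print("loss before:", total_loss(ys, preds))
for step_size in (0.1, 0.5, 1.0):
    updated = steepest_descent_step(ys, preds, step_size)
    print("step size", step_size, "-> loss after one step:", total_loss(ys, updated))
```

A step size of 1 reproduces the training targets exactly in this setting, which is the kind of aggressive fit that invites overfitting. Smaller step sizes move only part of the way and rely on later stages to close the gap.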

### Minimize Loss with Each New Model

At each iteration, we aim to find the new model $h_m(x)$ that reduces the overall loss the most:

$$h_m = \arg\min_{h} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + h(x_i))$$

Solving this exactly over all functions $h$ is impractical, so the minimization is approximated by fitting $h_m(x)$ to the residuals (the negative gradients) from the previous model. In practice, $h_m(x)$ is often a shallow decision tree trained on these residuals.
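For squared error, the link between minimizing the loss and fitting the residuals is exact, and a short worked equation shows why. Write $r_i = y_i - F_{m-1}(x_i)$ for the residual (a symbol introduced here only for illustration) and take $L(y, F) = \tfrac{1}{2}(y - F)^2$:

$$\sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + h(x_i)) = \sum_{i=1}^{N} \tfrac{1}{2}\bigl(y_i - F_{m-1}(x_i) - h(x_i)\bigr)^2 = \sum_{i=1}^{N} \tfrac{1}{2}\bigl(r_i - h(x_i)\bigr)^2$$

So under this loss, choosing $h$ to minimize the total loss is the same as a least-squares fit of $h$ to the residuals. For other losses the tree is fitted to the negative gradients as an approximation.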

### Step 6: Final Model

After $M$ stages, the final model is the initial model plus the sum of all the weak learners (trees) added at each stage:

$$F_M(x) = F_0(x) + \sum_{m=1}^{M} h_m(x)$$

Each $h_m(x)$ here already includes its step size $\rho_m$.


In each iteration, we are adding a new tree that tries to correct the mistakes made by the previous trees.

In [1]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes
 
# Setting SEED for reproducibility
SEED = 23
 
# Importing the dataset 
X, y = load_diabetes(return_X_y=True)
 
# Splitting dataset
train_X, test_X, train_y, test_y = train_test_split(X, y, 
                                                    test_size = 0.25, 
                                                    random_state = SEED)
 
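# Instantiate Gradient Boosting Regressor.
# Note: loss='absolute_error' optimizes the L1 loss; use loss='squared_error'
# for the MSE setup described above.
gbr = GradientBoostingRegressor(loss='absolute_error',
                                learning_rate=0.1,
                                n_estimators=300,
                                max_depth=1,       # decision stumps as weak learners
                                random_state=SEED,
                                max_features=5)    # features considered per split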
 
# Fit to training set
gbr.fit(train_X, train_y)
 
# Predict on test set
pred_y = gbr.predict(test_X)
 
# test set RMSE
test_rmse = mean_squared_error(test_y, pred_y) ** (1 / 2)
 
# Print rmse
print('Root mean Square error: {:.2f}'.format(test_rmse))

Root mean Square error: 56.39


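This sequential nature allows gradient-boosted trees (GBT) to learn complex relationships in the data, but it also makes them prone to overfitting if not properly regularized, for example with a small learning rate, shallow trees, or a limit on the number of trees.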