Gradient boosting is one of the variants of ensemble methods where you create multiple weak models and combine them to get better performance as a whole.
Boosting learns from the mistakes of individual trees. The general idea is to adjust new trees based on the errors of previous trees.

Gradient boosting uses a different approach than AdaBoost. While gradient boosting also adjusts based on incorrect predictions, it takes this idea one step further: gradient boosting fits each new tree entirely based on the errors of the previous tree's predictions.

That is, for each new tree, gradient boosting looks at the mistakes and then builds a new tree completely around these mistakes. The new tree doesn't care about the predictions that are already correct.

---

#### Gradient boosting computes the residuals of each tree's predictions, fits the next tree to those residuals, and sums the predictions of all the trees to produce the final prediction.

#### Residuals - These are the differences between the actual target values and the predictions of a given model.
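As a small illustration, a residual is the actual target value minus the model's prediction. The numbers below are made up for this sketch and do not come from the bike rentals data.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([10.0, 12.0, 15.0])  # actual target values
y_hat = np.array([9.0, 13.5, 15.0])    # a model's predictions

# Residual = actual value - predicted value
residuals = y_true - y_hat
print(residuals)
```

A positive residual means the model predicted too low, a negative residual means it predicted too high, and a residual of zero means the prediction was already correct.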

## Build a gradient boosting model from scratch

In [1]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

In [2]:
df_bikes = pd.read_csv('bike_rentals_cleaned.csv')

df_bikes.head()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,985
1,2,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,801
2,3,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,1349
3,4,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,1562
4,5,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,1600


In [3]:
X_bikes = df_bikes.iloc[:,:-1]

y_bikes = df_bikes.iloc[:,-1]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_bikes, y_bikes, random_state=2)

## Steps for building the model from scratch

### 1. Fit the data into a decision tree 

You may use a decision tree stump, which has a max_depth value of 1, or a decision tree with a max_depth value of 2 or 3. The initial decision tree, called a base learner, should not be fine-tuned for accuracy. We want a model that focuses on learning from errors, not a model that relies heavily on the base learner.

In [4]:
from sklearn.tree import DecisionTreeRegressor

tree_1 = DecisionTreeRegressor(max_depth=2, random_state=2)

tree_1.fit(X_train, y_train)

### 2. Make predictions with the training set
Instead of making predictions with the test set, predictions in gradient boosting are initially made with the training set. Why? To compute the residuals, we need to compare the predictions while still in the training phase. The test phase of the model build comes at the end, after all the trees have been constructed.

In [6]:
y_train_pred=tree_1.predict(X_train)

### 3. Compute the residuals
The residuals are the differences between the target column and the predictions. The predictions of X_train, defined here as y_train_pred, are subtracted from y_train to give the residuals:

In [7]:
y2_train = y_train - y_train_pred

The residuals are defined as y2_train because they are the new target column for the next tree.

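### 4. Fit the new tree on the residuals
Fitting a new tree on the residuals is different from fitting a model on the training set. The primary difference is in the predictions: the features, X_train, stay the same, but the target is now the residuals, y2_train, so the new tree predicts the errors left by the previous tree rather than the original target values.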

In [8]:
tree_2 = DecisionTreeRegressor(max_depth=2, random_state=2)

tree_2.fit(X_train, y2_train)

In [9]:
# Repeat the process for a third tree:

y2_train_pred = tree_2.predict(X_train)

y3_train = y2_train - y2_train_pred

tree_3 = DecisionTreeRegressor(max_depth=2, random_state=2)

tree_3.fit(X_train, y3_train)

This process may continue for dozens, hundreds, or thousands of trees. Under normal circumstances, you would certainly keep going. It will take more than a few trees to transform a weak learner into a strong learner.
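The repetition above can be written as a loop. The following is a minimal sketch, not a definitive implementation: it uses synthetic data from `make_regression` as a stand-in for the bike rentals data, and the helper names `fit_boosted_trees` and `predict_boosted_trees` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the bike rentals dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=2)

def fit_boosted_trees(X, y, n_trees=10, max_depth=2):
    """Fit each tree on the residuals left by the trees before it."""
    trees, train_rmse = [], []
    residuals = y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=2)
        tree.fit(X, residuals)
        # The remaining error becomes the target for the next tree
        residuals = residuals - tree.predict(X)
        trees.append(tree)
        train_rmse.append(float(np.sqrt(np.mean(residuals ** 2))))
    return trees, train_rmse

def predict_boosted_trees(trees, X):
    """Sum the predictions of every tree."""
    return np.sum([tree.predict(X) for tree in trees], axis=0)

trees, train_rmse = fit_boosted_trees(X, y)
print(train_rmse)
```

On the training set, the RMSE recorded after each round should not increase, because every tree removes part of the remaining error. That says nothing about the test set, which is why the number of trees still needs tuning.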

### 6. Sum the results
Summing the results requires making predictions for each tree with the test set as follows:

In [10]:
y1_pred = tree_1.predict(X_test)

y2_pred = tree_2.predict(X_test)

y3_pred = tree_3.predict(X_test)

In [11]:
y_pred = y1_pred + y2_pred + y3_pred

In [12]:
y_pred

array([4710.19173509, 4307.72376923, 3719.78786666, 3888.39768662,
       1865.38520272, 3888.39768662, 7047.24472853, 4307.72376923,
       3060.33145621, 3060.33145621, 3060.33145621, 1865.38520272,
       3060.33145621, 7047.24472853, 1865.38520272, 4307.72376923,
       4307.72376923, 3060.33145621, 7047.24472853, 6020.90829501,
       7047.24472853, 4307.72376923, 4914.73412014, 1865.38520272,
       1865.38520272, 7047.24472853, 3060.33145621, 7047.24472853,
       4914.73412014, 6440.23437762, 4710.19173509, 1865.38520272,
       4307.72376923, 3060.33145621, 4710.19173509, 3888.39768662,
       1865.38520272, 3060.33145621, 6440.23437762, 4914.73412014,
       4914.73412014, 7047.24472853, 1865.38520272, 6440.23437762,
       7047.24472853, 3060.33145621, 6020.90829501, 7047.24472853,
       3060.33145621, 3888.39768662, 4914.73412014, 2033.99502269,
       3060.33145621, 1865.38520272, 4914.73412014, 6440.23437762,
       1865.38520272, 7047.24472853, 7047.24472853, 7047.24472

In [18]:
from sklearn.metrics import mean_squared_error as MSE

MSE(y_test, y_pred)**0.5

911.0479538776444

## Building a gradient boosting model in scikit-learn

In [13]:
from sklearn.ensemble import GradientBoostingRegressor

When initializing GradientBoostingRegressor, there are several important hyperparameters. To obtain the same results, it's essential to match: 
- max_depth=2 and random_state=2. 
- Furthermore, since there are only three trees, we must have n_estimators=3. 
- Finally, we must set the learning_rate=1.0 hyperparameter.
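To see why these settings matter, the sketch below compares a three-tree model built from scratch with GradientBoostingRegressor using the matching hyperparameters. It runs on synthetic data from `make_regression` (an assumption, standing in for the bike rentals data), so the numbers differ from the ones in this notebook, but the two sets of predictions should agree up to floating-point error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the bike rentals dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=2)

# From scratch: three trees, each fit on the residuals of the previous ones
target = y.copy()
scratch_pred = np.zeros(len(y))
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=2).fit(X, target)
    scratch_pred += tree.predict(X)
    target = target - tree.predict(X)

# scikit-learn with the matching hyperparameters
gbr = GradientBoostingRegressor(max_depth=2, random_state=2,
                                n_estimators=3, learning_rate=1.0)
gbr_pred = gbr.fit(X, y).predict(X)

print(np.allclose(scratch_pred, gbr_pred))
```

With a learning_rate below 1.0, or a different n_estimators or max_depth, the two models would no longer be built the same way and the predictions would diverge.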

In [14]:
gbr = GradientBoostingRegressor(max_depth=2, random_state=2, n_estimators=3, learning_rate=1.0)

Now that the model has been initialized, it can be fit on the training data and scored against the test data:

In [15]:
gbr.fit(X_train,y_train)

In [16]:
gbr.predict(X_test)

array([4710.19173509, 4307.72376923, 3719.78786666, 3888.39768662,
       1865.38520272, 3888.39768662, 7047.24472853, 4307.72376923,
       3060.33145621, 3060.33145621, 3060.33145621, 1865.38520272,
       3060.33145621, 7047.24472853, 1865.38520272, 4307.72376923,
       4307.72376923, 3060.33145621, 7047.24472853, 6020.90829501,
       7047.24472853, 4307.72376923, 4914.73412014, 1865.38520272,
       1865.38520272, 7047.24472853, 3060.33145621, 7047.24472853,
       4914.73412014, 6440.23437762, 4710.19173509, 1865.38520272,
       4307.72376923, 3060.33145621, 4710.19173509, 3888.39768662,
       1865.38520272, 3060.33145621, 6440.23437762, 4914.73412014,
       4914.73412014, 7047.24472853, 1865.38520272, 6440.23437762,
       7047.24472853, 3060.33145621, 6020.90829501, 7047.24472853,
       3060.33145621, 3888.39768662, 4914.73412014, 2033.99502269,
       3060.33145621, 1865.38520272, 4914.73412014, 6440.23437762,
       1865.38520272, 7047.24472853, 7047.24472853, 7047.24472

In [19]:
# Score the predictions of the scikit-learn model rather than the from-scratch sum
MSE(y_test, gbr.predict(X_test))**0.5

911.0479538776444

## Hyperparameter Tuning

## 1. learning_rate, also known as the shrinkage

A problem with gradient boosted decision trees is that they are quick to learn and overfit training data. One effective way to slow down learning in the gradient boosting model is to use a learning rate, also called shrinkage (or eta in XGBoost).

learning_rate shrinks the contribution of individual trees so that no tree has too much influence when building the model. If an entire ensemble is built from the errors of one base learner, without careful adjustment of hyperparameters, early trees in the model can have too much influence on subsequent development. learning_rate limits the influence of individual trees. Generally speaking, as n_estimators, the number of trees, goes up, learning_rate should go down.



### learning_rate is typically set between 0 and 1. A learning_rate value of 1 means that no adjustments are made. The default value of 0.1 means that each tree's influence is weighted at 10%.
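The arithmetic behind shrinkage can be sketched under an idealized assumption: every tree predicts the current residual exactly, so each round removes a learning_rate share of whatever error remains. Real trees are weaker than this, and the function name `remaining_error_fraction` is hypothetical, so treat the numbers as intuition only.

```python
def remaining_error_fraction(learning_rate, n_trees):
    """Fraction of the initial error left after n_trees rounds.

    Idealized assumption: each tree predicts the current residual exactly,
    so one round leaves (1 - learning_rate) of the error that was there.
    """
    return (1.0 - learning_rate) ** n_trees

for lr in [1.0, 0.5, 0.1]:
    for n in [3, 30]:
        print('Learning Rate:', lr, ', Trees:', n,
              ', Error left:', remaining_error_fraction(lr, n))
```

In this idealization, a learning_rate of 1.0 removes all the error in a single round, while a learning_rate of 0.1 needs many more rounds to get close. This is one way to see why a smaller learning_rate usually goes together with a larger n_estimators.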

Here is a reasonable range to start with:

learning_rate_values = [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 1.0]

Next, we will loop through the values by building and scoring a new GradientBoostingRegressor to see how the scores compare:

In [21]:
learning_rate_values = [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 1.0]

In [22]:
for value in learning_rate_values:

    gbr = GradientBoostingRegressor(max_depth=2, n_estimators=300, random_state=2, learning_rate=value)

    gbr.fit(X_train, y_train)

    y_pred = gbr.predict(X_test)

    rmse = MSE(y_test, y_pred)**0.5

    print('Learning Rate:', value, ', Score:', rmse)

Learning Rate: 0.001 , Score: 1633.0261400367258
Learning Rate: 0.01 , Score: 831.5430182728547
Learning Rate: 0.05 , Score: 685.0192988749717
Learning Rate: 0.1 , Score: 653.7456840231495
Learning Rate: 0.15 , Score: 687.666134269379
Learning Rate: 0.2 , Score: 664.312804425697
Learning Rate: 0.3 , Score: 689.4190385930236
Learning Rate: 0.5 , Score: 693.8856905068778
Learning Rate: 1.0 , Score: 936.3617413678853


As you can see from the output, the default learning_rate value of 0.1 gives the best score for 300 trees.

## NB: Always tune learning_rate and n_estimators together.
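One way to look at both hyperparameters at once is `staged_predict`, which yields the predictions after each successive tree, so a single fit shows how the score changes with the number of trees. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption), and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the bike rentals dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

best_n_trees = {}
for lr in [0.05, 0.5]:
    gbr = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                    learning_rate=lr, random_state=2)
    gbr.fit(X_tr, y_tr)
    # staged_predict yields held-out predictions after 1, 2, ..., 200 trees
    rmse_per_stage = [np.sqrt(np.mean((y_te - pred) ** 2))
                      for pred in gbr.staged_predict(X_te)]
    best_n_trees[lr] = int(np.argmin(rmse_per_stage)) + 1
    print('Learning Rate:', lr, ', Best number of trees:', best_n_trees[lr])
```

The best number of trees generally differs from one learning_rate to another, which is why tuning either value in isolation can be misleading.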

## 2. Base learner

The initial decision tree in the gradient boosting regressor is called the base learner because it's at the base of the ensemble. It's the first learner in the process. The term learner here is indicative of a weak learner transforming into a strong learner.

Although base learners need not be fine-tuned for accuracy, it's certainly possible to tune base learners for gains in accuracy.

In [23]:
depths = [None, 1, 2, 3, 4]

for depth in depths:

    gbr = GradientBoostingRegressor(max_depth=depth, n_estimators=300, random_state=2)

    gbr.fit(X_train, y_train)

    y_pred = gbr.predict(X_test)

    rmse = MSE(y_test, y_pred)**0.5

    print('Max Depth:', depth, ', Score:', rmse)

Max Depth: None , Score: 869.2788645118395
Max Depth: 1 , Score: 707.8261886858736
Max Depth: 2 , Score: 653.7456840231495
Max Depth: 3 , Score: 646.4045923317708
Max Depth: 4 , Score: 663.048387855927


A max_depth value of 3 gives the best results.



## 3. Subsample

subsample is the fraction of samples used to build each tree. Since samples are the rows, a subset of rows means that all rows may not be included when building each tree. By changing subsample from 1.0 to a smaller decimal, each tree is built on only that fraction of the samples. For example, subsample=0.8 would select 80% of samples for each tree.
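Conceptually, subsample=0.8 means that each tree sees a random 80% of the rows. The sketch below illustrates that idea with NumPy on ten hypothetical row indices. It is not scikit-learn's internal code, only a picture of the sampling step.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows = 10       # hypothetical number of training rows
subsample = 0.8

# Draw 80% of the row indices, without replacement, for one tree
n_selected = int(subsample * n_rows)
rows_for_tree = rng.choice(n_rows, size=n_selected, replace=False)
print(sorted(rows_for_tree.tolist()))
```

A fresh draw is made for every tree, so different trees are built on different subsets of the rows.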

In [25]:
samples = [1, 0.9, 0.8, 0.7, 0.6, 0.5]

for sample in samples:

    gbr = GradientBoostingRegressor(max_depth=3, n_estimators=300, subsample=sample, random_state=2)

    gbr.fit(X_train, y_train)

    y_pred = gbr.predict(X_test)

    rmse = MSE(y_test, y_pred)**0.5

    print('Subsample:', sample, ', Score:', rmse)

Subsample: 1 , Score: 646.4045923317708
Subsample: 0.9 , Score: 620.1819001443569
Subsample: 0.8 , Score: 617.2355650565677
Subsample: 0.7 , Score: 612.9879156983139
Subsample: 0.6 , Score: 622.6385116402317
Subsample: 0.5 , Score: 626.9974073227554


When subsample is less than 1.0, the model is classified as stochastic gradient boosting, where stochastic indicates that some randomness is inherent in the model.

### Randomized Search

In [39]:
# Specify the parameters to search over

params = {
    'n_estimators': [300, 500, 1000],
    'subsample': [0.65, 0.7, 0.75],
    'learning_rate': [0.05, 0.075, 0.1]
}

In [40]:
from sklearn.model_selection import RandomizedSearchCV

gbr = GradientBoostingRegressor(max_depth=3, random_state=2)

In [41]:
rand_reg = RandomizedSearchCV(gbr, params, n_iter=10, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, random_state=2)

In [42]:
rand_reg.fit(X_train, y_train)

best_model = rand_reg.best_estimator_

best_params = rand_reg.best_params_

print("Best params:", best_params)

best_score = np.sqrt(-rand_reg.best_score_)

print("Training score: {:.3f}".format(best_score))

y_pred = best_model.predict(X_test)

rmse_test = MSE(y_test, y_pred)**0.5

print('Test set score: {:.3f}'.format(rmse_test))

Best params: {'subsample': 0.65, 'n_estimators': 300, 'learning_rate': 0.05}
Training score: 636.200
Test set score: 625.985
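Note that `best_score_` is the mean cross-validated score over the folds of the training data, and with scoring='neg_mean_squared_error' it is a negative MSE, which is why it is negated before taking the square root. The sketch below applies the same conversion to every candidate through `cv_results_`. It uses synthetic data and a deliberately small search space (both assumptions) so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the bike rentals dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=2)

params = {'n_estimators': [20, 50], 'learning_rate': [0.05, 0.1, 0.3]}
search = RandomizedSearchCV(GradientBoostingRegressor(max_depth=2, random_state=2),
                            params, n_iter=4, scoring='neg_mean_squared_error',
                            cv=3, random_state=2)
search.fit(X, y)

# Scores are negative MSE, so negate before taking the square root
cv_rmse = np.sqrt(-search.cv_results_['mean_test_score'])
for candidate, rmse in zip(search.cv_results_['params'], cv_rmse):
    print(candidate, round(float(rmse), 3))
```

The candidate with the lowest cross-validated RMSE is the one reported by `best_params_`.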
