
# Module 3: Machine Learning

## Sprint 2: Intermediate Machine Learning

## Subproject 4: Gradient boosted trees models

Previously we learned about random forest models. The main idea there is to fit many decision trees, each built from a random subset of the training samples and/or the features. While a single decision tree is likely to be a weak model that does not generalize well, the collective decision of all of them usually generalizes better. One important aspect of random forest learning is that each tree learns independently: all of them try to be accurate on the whole training set. In other words, all of them try to minimize the same objective loss.

It is also quite popular to use a different training strategy: have each tree learn with a different objective, one that essentially says "focus on the samples that the trees trained before still get wrong". In gradient boosting specifically, each new tree is fit to the errors left by the trees before it (for squared loss, these are the residuals). A model that uses this strategy, gradient boosted trees, is the topic of this notebook.

## Learning outcomes

- Gradient boosting
- Gradient boosted tree models in scikit-learn and xgboost packages

---

## Gradient boosted trees introduction

Imagine a team-based competition that tests each team on several subjects: math, literature and computer science. One potential strategy would be for each team member to learn the subjects independently, maybe randomly picking the books to learn from, so that two different members would learn math from different books and end up with different knowledge. Given a competition question, each member would vote on the answer, and the team's answer would be the most common one. This is similar to random forest learning, where each member is a decision tree. This kind of independent, parallel learning is also called bagging.
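The team analogy above can be sketched in a few lines. This is only an illustration under assumed settings: the synthetic dataset, the 25 "members" and the tree depth are made up for the example, and a real random forest in scikit-learn has more details than this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration only).
X, y = make_classification(n_samples=400, n_features=8, random_state=7)
rng = np.random.default_rng(7)

members = []
for _ in range(25):  # an odd number of members avoids tied votes
    # Each member "reads different books": a bootstrap sample of the rows.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(
        max_depth=3,
        max_features="sqrt",  # random subset of features at each split
        random_state=int(rng.integers(1_000_000)),
    )
    members.append(tree.fit(X[idx], y[idx]))

# Every member votes independently; the team answer is the majority class.
votes = np.stack([member.predict(X) for member in members])
team_answer = (votes.mean(axis=0) > 0.5).astype(int)

print("Team accuracy on the training data:", (team_answer == y).mean())
```

Note that no member ever looks at what the others got wrong, which is exactly what boosting changes.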

An intuitive way to improve upon that strategy would be for each member to focus on the subjects in which the other team members are currently weakest, which might improve the team's overall performance. This is the essence of the boosting idea: each model puts more value on learning the samples that the models before it get wrong. This kind of dependent, sequential learning is called boosting.
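To make the sequential idea concrete, here is a minimal hand-written sketch of gradient boosting for squared loss, where "what the previous models get wrong" is simply the residual. The synthetic data, learning rate and number of trees are assumptions chosen for illustration; library implementations add many refinements on top of this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption for illustration only).
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    # For squared loss, the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=7)
    tree.fit(X, residuals)
    # Take a small step in the direction the new tree suggests.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

Each tree on its own is a poor model of the target, but because every tree corrects the current ensemble, the training error shrinks step by step.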

To continue learning about boosting algorithms, read the article below:

- https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning

Again, look up the documentation for this type of model in scikit-learn:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

However, even though scikit-learn has this type of model, a very popular package for gradient boosting is called xgboost. Start by reading about it here:

- https://xgboost.readthedocs.io/en/latest/tutorials/model.html

The next lesson in the "Intermediate Machine Learning" course is about it too, so it will help to read and do the exercise:

- https://www.kaggle.com/alexisbcook/xgboost

## Gradient boosted trees in practice

In [1]:
!pip install scikit-learn --upgrade

Collecting scikit-learn
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.2.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.2 threadpoolctl-2.2.0


As usual, begin by importing required modules and setting the random state:

In [2]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

RANDOM_STATE = 7

We will use the California housing dataset again. Because we are using the same random state value, this also allows us to compare the results with the previously trained models.

In [3]:
x, y = datasets.fetch_california_housing(
    return_X_y=True,
    as_frame=True
)

x_train, x_val, y_train, y_val = train_test_split(
    x,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE
)
x.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


Let's build a baseline gradient boosted trees regressor model:

In [4]:
%%time
gbr = GradientBoostingRegressor(
    learning_rate=0.1,
    n_estimators=100,
    subsample=1.0,
    max_depth=3,
    max_features=1.0,
    verbose=1,
    random_state=RANDOM_STATE
)

gbr.fit(x_train, y_train)

      Iter       Train Loss   Remaining Time 
         1           1.1920            4.23s
         2           1.0770            4.13s
         3           0.9843            4.04s
         4           0.9079            3.99s
         5           0.8402            4.04s
         6           0.7856            3.96s
         7           0.7401            3.90s
         8           0.6973            3.90s
         9           0.6634            3.88s
        10           0.6349            3.85s
        20           0.4717            3.37s
        30           0.3812            2.98s
        40           0.3337            2.54s
        50           0.3079            2.11s
        60           0.2925            1.67s
        70           0.2823            1.25s
        80           0.2751            0.83s
        90           0.2678            0.41s
       100           0.2617            0.00s
CPU times: user 4.14 s, sys: 16.2 ms, total: 4.16 s
Wall time: 4.15 s


If any of the parameters used above are unclear, consult the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

Based on the output above, we can see how the training loss is reduced after each of the training iterations. Remaining time is also shown for our convenience. The training time is fast too, at least compared to the heavier neural network models we trained earlier.
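The per-iteration loss printed by verbose=1 is also available after fitting, and the validation loss can be tracked in the same way. The sketch below uses a small synthetic dataset and its own variable names (X_demo, model and so on are assumptions for this example) so that it runs without the cells above; train_score_ and staged_predict are attributes of the fitted scikit-learn estimator.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=7
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

model = GradientBoostingRegressor(n_estimators=100, random_state=7)
model.fit(X_tr, y_tr)

# train_score_[i] holds the training loss after iteration i + 1.
train_loss = model.train_score_

# staged_predict yields the ensemble's predictions after each iteration,
# which lets us compute the validation error tree by tree.
val_loss = [mean_squared_error(y_va, pred) for pred in model.staged_predict(X_va)]

best_iter = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print("Iteration with the lowest validation MSE:", best_iter)
```

Comparing the two curves is a quick way to see whether additional trees still help on unseen data or only lower the training loss.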

We can check the model's performance:

In [5]:
%%time
y_pred_train = gbr.predict(x_train)
print('Training mean squared error is', mean_squared_error(y_train, y_pred_train))

y_pred = gbr.predict(x_val)
print('Validation mean squared error is', mean_squared_error(y_val, y_pred))

Training mean squared error is 0.2617311992553173
Validation mean squared error is 0.30095959244265574
CPU times: user 46.1 ms, sys: 1.93 ms, total: 48 ms
Wall time: 47.4 ms


The performance seems good, at least compared to the models we trained previously (the neural network with hyper-parameters found via optimization got around 0.31 validation error).

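Let's build a model with a larger number of deeper trees. Additionally, let's fit each tree on a random subsample of the training data (subsample=0.8). The training samples left out of a given iteration's subsample (the out-of-bag, or OOB, samples) allow the training procedure to compute the so-called OOB Improve score: the improvement in loss on those left-out samples after adding the current tree. Note that these samples are held out from the training set at each iteration; they are not the validation set we split off earlier. You can read more about it here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_oob.html (optional).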
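As a rough sketch of how the OOB improvement values could be used, the snippet below reads them from the fitted model and accumulates them to get a heuristic suggestion for the number of trees. The synthetic data and variable names are assumptions for this example; oob_improvement_ is only available when subsample is below 1.0, and the resulting estimate should be treated as a hint rather than a replacement for a proper validation set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic dataset (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=7
)

model = GradientBoostingRegressor(
    n_estimators=100,
    subsample=0.8,  # must be < 1.0 for OOB estimates to exist
    random_state=7,
)
model.fit(X_demo, y_demo)

# One value per iteration: the loss improvement on that iteration's
# out-of-bag samples, relative to the previous iteration.
oob_gain = model.oob_improvement_

# The cumulative sum peaks where additional trees stop helping OOB samples.
cumulative_gain = np.cumsum(oob_gain)
suggested_n_trees = int(np.argmax(cumulative_gain)) + 1
print("Heuristic number of trees from OOB improvement:", suggested_n_trees)
```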

In [6]:
%%time
gbr = GradientBoostingRegressor(
    learning_rate=0.1,
    n_estimators=300,
    subsample=0.8,
    max_depth=5,
    max_features=1.0,
    verbose=1,
    random_state=RANDOM_STATE
)

gbr.fit(x_train, y_train)

      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.1642           0.1580           17.79s
         2           1.0266           0.1355           17.19s
         3           0.9223           0.1103           17.64s
         4           0.8308           0.0929           17.28s
         5           0.7576           0.0770           16.92s
         6           0.6847           0.0652           16.74s
         7           0.6294           0.0530           16.75s
         8           0.5774           0.0449           16.60s
         9           0.5409           0.0424           16.65s
        10           0.5112           0.0337           16.59s
        20           0.3427           0.0096           15.71s
        30           0.2686           0.0020           14.96s
        40           0.2343           0.0022           14.29s
        50           0.2186           0.0003           13.69s
        60           0.2013           0.0000           13.14s
       

The training took a bit longer this time, but it was still fast. Let's see how the model performs:

In [7]:
%%time
y_pred_train = gbr.predict(x_train)
print('Training mean squared error is', mean_squared_error(y_train, y_pred_train))

y_pred = gbr.predict(x_val)
print('Validation mean squared error is', mean_squared_error(y_val, y_pred))

Training mean squared error is 0.10213083736924554
Validation mean squared error is 0.2240998504181583
CPU times: user 159 ms, sys: 4.93 ms, total: 164 ms
Wall time: 164 ms


We might say that the model overfitted the training data a bit, as the validation error (about 0.22) is much higher than the training error (about 0.10). However, the validation error is still clearly better than that of the baseline model (about 0.30).
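One option for limiting this kind of overfitting, not used in this notebook, is the early stopping built into GradientBoostingRegressor. The sketch below is an assumption-laden illustration on synthetic data: it sets aside a fraction of the training data internally and stops adding trees once the score on that fraction has not improved for a number of iterations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic dataset (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=7
)

model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.1,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 iterations without improvement
    tol=1e-4,
    random_state=7,
)
model.fit(X_demo, y_demo)

# n_estimators_ is the number of trees actually kept after early stopping.
print("Trees requested:", model.n_estimators)
print("Trees fitted:", model.n_estimators_)
```

The values of validation_fraction, n_iter_no_change and tol here are arbitrary starting points and would normally be tuned.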

We can also look at the feature importances of our model:

In [8]:
feature_importance_df = pd.DataFrame(
    {
        'feature': x_train.columns,
        'importance': gbr.feature_importances_
    }
)
feature_importance_df

| | feature | importance |
|---|---|---|
| 0 | MedInc | 0.534399 |
| 1 | HouseAge | 0.042794 |
| 2 | AveRooms | 0.035974 |
| 3 | AveBedrms | 0.017305 |
| 4 | Population | 0.016042 |
| 5 | AveOccup | 0.132981 |
| 6 | Latitude | 0.108937 |
| 7 | Longitude | 0.111569 |


We can see that the most important features (MedInc, followed by AveOccup, Longitude and Latitude) are similar to the important features found by the linear regression model in a previous notebook.
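The importances above are impurity-based, and they are easier to read when sorted. As an optional cross-check that is not part of the original notebook, permutation importance measures how much the score drops when a feature's values are shuffled. The sketch below uses synthetic data and hypothetical feature names (feature_0 and so on) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic data with hypothetical feature names (illustration only).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, n_informative=2, noise=5.0, random_state=7
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

model = GradientBoostingRegressor(random_state=7).fit(X_demo, y_demo)

# Shuffle each feature several times and record the average score drop.
perm = permutation_importance(
    model, X_demo, y_demo, n_repeats=5, random_state=7
)

importance_df = (
    pd.DataFrame(
        {
            "feature": feature_names,
            "impurity_importance": model.feature_importances_,
            "permutation_importance": perm.importances_mean,
        }
    )
    .sort_values("impurity_importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

If the two columns rank the features very differently, that is a hint to look more closely before drawing conclusions from either one.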

## Exercise

Go through the notebook for the Titanic competition: https://www.kaggle.com/datacanary/xgboost-example-python

Do this task:
- Improve the public score by changing something in the training procedure

It might be useful to consult the xgboost documentation (https://xgboost.readthedocs.io/en/latest/parameter.html) to see if there are parameters you could try to optimize.

---

## Summary

In this notebook we learned about a popular type of machine learning model: gradient boosted trees. This model quite commonly wins Kaggle competitions when the input is tabular data (not images or free text). While the underlying training process can seem complex, it is enough to understand the main idea to use it well: train the learners one after another, and have each new learner focus on the errors made by the learners before it.

## Further research

- https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
- https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
- https://arxiv.org/pdf/1603.02754.pdf