# 6. Boosting algorithms

This notebook shows applications of boosting algorithms to the [Rain in Australia](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) dataset, available on Kaggle. It contains about 10 years of daily weather observations from many locations across Australia.

### Index:
1. [Packages required](#1.-Packages-required)
2. [Loading data](#2.-Loading-data)
3. [AdaBoost](#3.-AdaBoost)
4. [Gradient Boosting](#4.-Gradient-Boosting)
5. [XGBoost](#5.-XGBoost)
6. [LightGBM](#6.-LightGBM)
7. [CatBoost](#7.-CatBoost)

# 1. Packages required

In [None]:
!pip install xgboost
!pip install lightgbm
!pip install catboost

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

# 2. Loading data

In [2]:
weather = pd.read_parquet('../data/04_model_input/master.parquet')
weather.head()

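| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Temp9am | Temp3pm | RainToday | RainTomorrow | Date_month | Date_day | Location_encoded | WindGustDir_encoded | WindDir9am_encoded | WindDir3pm_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 16.9 | 21.8 | 0 | 0.0 | 12 | 1 | 2 | 12.0 | 12.0 | 13.0 |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 17.2 | 24.3 | 0 | 0.0 | 12 | 2 | 2 | 13.0 | 15.0 | 11.0 |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 21.0 | 23.2 | 0 | 0.0 | 12 | 3 | 2 | 11.0 | 12.0 | 11.0 |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 18.1 | 26.5 | 0 | 0.0 | 12 | 4 | 2 | 2.0 | 6.0 | 4.0 |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 17.8 | 29.7 | 0 | 0.0 | 12 | 5 | 2 | 12.0 | 3.0 | 14.0 |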


In [3]:
# Set the date used to split the data and select the model features
# (all numeric columns except the target, RainTomorrow):
test_date = '2015-01-01'

model_columns = list(set(weather.select_dtypes(include='number').columns) - set(['RainTomorrow']))

In [4]:
# Split into train/test sets by date and replace NaN values with -1:
train = weather[weather.Date < test_date].fillna(-1)
test = weather[weather.Date >= test_date].fillna(-1)

# 3. AdaBoost

AdaBoost is a boosting algorithm that reduces the prediction error by sequentially building trees with only two leaf nodes (stumps). After each tree is fitted, the sample weights are updated according to its errors, so that misclassified samples get more weight, and the next tree is built taking these new weights into account.

The Python implementation allows you to change the base estimator, but we keep the original one ("base_estimator = None", called "estimator" in recent scikit-learn releases, is a decision tree with max_depth = 1). Also, to reduce overfitting, we will use a learning rate $\nu = 0.1$.

Then, we will train several models with different numbers of boosting iterations to see how performance evolves.
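To make the reweighting step concrete, here is a minimal sketch of one round of discrete AdaBoost on a toy sample. The labels, the stump predictions and the variable names are assumptions made up for illustration; scikit-learn's implementation differs in details such as how the learning rate enters the update.

```python
import numpy as np

# Toy labels in {-1, +1} and the predictions of a hypothetical stump
# (both arrays are made up; the stump is wrong at positions 2 and 7).
y = np.array([1, 1, 1, -1, -1, -1, 1, -1])
stump_pred = np.array([1, 1, -1, -1, -1, -1, 1, 1])

# Start from uniform sample weights.
w = np.full(len(y), 1 / len(y))

# Weighted error of the stump and its weight in the final vote.
miss = stump_pred != y
err = w[miss].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Misclassified samples gain weight, correctly classified ones lose weight.
w_new = w * np.exp(-alpha * y * stump_pred)
w_new /= w_new.sum()

print(f"weighted error = {err:.3f}, alpha = {alpha:.3f}")
print("new weights:", np.round(w_new, 3))
```

After normalization the two misclassified samples carry half of the total weight, so the next stump is pushed to get them right.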

In [9]:
#We generate our AdaBoost algorithms:
metrics = {}
for n_estimators in [1, 5, 10, 20, 50, 100, 200, 500, 1000]:
    model = AdaBoostClassifier(n_estimators = n_estimators, learning_rate = 0.1)
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['AdaB_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1
    }

metrics_AdaB = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_AdaB['delta%'] = 100*(metrics_AdaB.Test_Gini - metrics_AdaB.Train_Gini) / metrics_AdaB.Train_Gini
metrics_AdaB

| Model | Train_Gini | Test_Gini | delta% |
|---|---|---|---|
| AdaB_1 | 0.35681 | 0.324403 | -9.082368 |
| AdaB_5 | 0.540856 | 0.518804 | -4.077312 |
| AdaB_10 | 0.569703 | 0.540461 | -5.132848 |
| AdaB_20 | 0.644189 | 0.607569 | -5.68461 |
| AdaB_50 | 0.69059 | 0.660897 | -4.299631 |
| AdaB_100 | 0.709561 | 0.690381 | -2.703051 |
| AdaB_200 | 0.725582 | 0.711206 | -1.981235 |
| AdaB_500 | 0.738811 | 0.723546 | -2.066107 |
| AdaB_1000 | 0.745362 | 0.727811 | -2.354736 |
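The Gini values reported in this notebook are a rescaling of the ROC AUC, Gini = 2·AUC − 1, so 0 corresponds to a random ranking and 1 to a perfect one. A minimal sketch on made-up labels and scores (the helper name `gini` and the toy arrays are assumptions for illustration only):

```python
from sklearn.metrics import roc_auc_score

def gini(y_true, y_score):
    """Gini coefficient as computed in this notebook: 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_score) - 1

# Four toy observations: three of the four (positive, negative) pairs
# are ranked correctly, so the AUC is 0.75.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(gini(y_true, y_score))  # AUC = 0.75, so the Gini is 0.5
```

The delta% column is then the relative change from the train Gini to the test Gini, so larger negative values indicate a wider gap between train and test performance.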


# 4. Gradient Boosting

Gradient Boosting is another boosting algorithm. While AdaBoost modifies the sample weights to build the trees, Gradient Boosting computes the residuals of the current model and fits each new tree to them. In this way, the model starts with a large error, and each additional iteration reduces the training error. Another difference is that AdaBoost (originally) builds trees with only two leaf nodes, whereas Gradient Boosting does not fix the number of leaves in advance.

The parameters that we will use to build our model are:
* Loss function: loss = log_loss (default), the same loss that we have studied in the project.
* Learning rate: learning_rate = 0.1 (default), a common value that shrinks the contribution of each tree to limit overfitting.
* Split quality measure: criterion = mse (the mean squared error, called squared_error in recent scikit-learn releases), the same measure that we have studied.

We will also train models with different numbers of iterations (n_estimators) to compare them and see the evolution clearly.
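The residual-fitting idea can be sketched by hand in a few lines. The example below is a simplified illustration on synthetic data, not scikit-learn's implementation: it uses plain regression trees on the residuals and skips the per-leaf value adjustment that `GradientBoostingClassifier` performs. The data, the helper names and the tree depth are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary classification problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mean_log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

learning_rate = 0.1

# Initial prediction: the log-odds of the positive class.
p0 = y.mean()
raw = np.full(len(y), np.log(p0 / (1 - p0)))
losses = [mean_log_loss(y, sigmoid(raw))]

for _ in range(50):
    # Residuals = negative gradient of the log loss w.r.t. the raw score.
    residuals = y - sigmoid(raw)
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)
    # Shrunken update: each tree only corrects a fraction of the error.
    raw += learning_rate * tree.predict(X)
    losses.append(mean_log_loss(y, sigmoid(raw)))

print(f"training log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The training loss starts at the value of the constant model and goes down as trees are added, which is the behaviour described above.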

In [8]:
#We generate our Gradient Boosting algorithms:
metrics = {}
for n_estimators in [1, 5, 10, 20, 50, 100, 200, 500, 1000]:
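    # 'squared_error' is the name used since scikit-learn 1.0 for the criterion formerly called 'mse'
    model = GradientBoostingClassifier(n_estimators = n_estimators, criterion = 'squared_error')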
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['GB_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1
    }

metrics_GB = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_GB['delta%'] = 100*(metrics_GB.Test_Gini - metrics_GB.Train_Gini) / metrics_GB.Train_Gini
metrics_GB

| Model | Train_Gini | Test_Gini | delta% |
|---|---|---|---|
| GB_1 | 0.576891 | 0.564835 | -2.089953 |
| GB_5 | 0.652841 | 0.632492 | -3.117064 |
| GB_10 | 0.673436 | 0.653937 | -2.895338 |
| GB_20 | 0.702074 | 0.685878 | -2.306969 |
| GB_50 | 0.736219 | 0.718145 | -2.45496 |
| GB_100 | 0.756702 | 0.735576 | -2.791743 |
| GB_200 | 0.775952 | 0.749864 | -3.362149 |
| GB_500 | 0.803554 | 0.766653 | -4.592306 |
| GB_1000 | 0.825757 | 0.771226 | -6.603767 |


# 5. XGBoost

XGBoost stands for 'e**X**treme **G**radient **Boost**ing' and is a boosting method based on Gradient Boosting. XGBoost models apply Gradient Boosting with their own type of trees, which are built using the gradient and the Hessian of the loss function together with some regularization parameters. XGBoost also includes many computational optimizations that make training faster.

XGBoost is an algorithm with many parameters: regularization parameters, maximum tree depth, number of iterations, minimum number of samples in a node to split it... This is very useful because it lets you tune the algorithm to your needs. However, to show a basic example, we will keep most of the default values and use:
* Learning rate: eta = 0.1, a typical value for the learning rate and the same one used in the other models.
* $\gamma$: gamma = 0 (default), one of the regularization parameters studied in the project.
* $\lambda$: lambda = 1 (default), the other regularization parameter studied.
* Build method: tree_method = auto (default), which chooses the method according to the size of the data. The available methods are based on the optimizations discussed in the project.

We will also train different models by changing the number of boosting iterations.
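To illustrate how the gradient, the Hessian, $\lambda$ and $\gamma$ enter the tree construction, the sketch below evaluates the usual XGBoost split gain, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$, and the optimal leaf weight $-G/(H+\lambda)$ on a toy node. The function names and the data are assumptions for illustration; this is not the library's internal code.

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf value: -G / (H + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Regularized gain of splitting a node into left/right children."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    left = score(g[left_mask], h[left_mask])
    right = score(g[~left_mask], h[~left_mask])
    return 0.5 * (left + right - score(g, h)) - gamma

# Toy node: six samples, current predicted probability 0.5 for all of them.
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
p = np.full(6, 0.5)
g = p - y          # gradient of the log loss w.r.t. the raw score
h = p * (1 - p)    # Hessian of the log loss w.r.t. the raw score

good_split = np.array([True, True, True, False, False, False])
bad_split = np.array([True, False, True, False, True, False])

print("gain (good split):", split_gain(g, h, good_split))
print("gain (bad split): ", split_gain(g, h, bad_split))
print("leaf weights:", leaf_weight(g[good_split], h[good_split]),
      leaf_weight(g[~good_split], h[~good_split]))
```

The split that separates the two classes gets the larger gain, and a larger $\lambda$ or $\gamma$ would shrink the leaf weights or make the split less likely to be accepted.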

In [11]:
#We generate our XGBoost algorithms:
metrics = {}
for n_estimators in [1, 5, 10, 20, 50, 100, 200, 500, 1000]:
    model = XGBClassifier(max_depth = 5, eta = 0.1, n_estimators = n_estimators )
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['XGB_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1
    }

metrics_XGB = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_XGB['delta%'] = 100*(metrics_XGB.Test_Gini - metrics_XGB.Train_Gini) / metrics_XGB.Train_Gini
metrics_XGB

| Model | Train_Gini | Test_Gini | delta% |
|---|---|---|---|
| XGB_1 | 0.655905 | 0.62642 | -4.495371 |
| XGB_5 | 0.699824 | 0.672093 | -3.962666 |
| XGB_10 | 0.712149 | 0.682763 | -4.126433 |
| XGB_20 | 0.735552 | 0.709822 | -3.498148 |
| XGB_50 | 0.773221 | 0.740962 | -4.17207 |
| XGB_100 | 0.799161 | 0.757099 | -5.263301 |
| XGB_200 | 0.829289 | 0.768893 | -7.282859 |
| XGB_500 | 0.875273 | 0.772569 | -11.733893 |
| XGB_1000 | 0.91987 | 0.770986 | -16.185363 |


# 6. LightGBM

LightGBM stands for '**Light** **G**radient **B**oosting **M**achine' and is an algorithm developed by Microsoft. It shares its main characteristics with XGBoost, but it grows each tree by splitting the leaf that maximizes the gain ('leaf-wise tree growth'). This results in asymmetric trees in which some branches are more developed than others. In addition, LightGBM applies a set of computational optimizations to make training faster.

The parameters that we will use are equivalent to those defined for XGBoost. We will also compare models with different numbers of estimators.
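The difference between leaf-wise and level-wise growth can be seen in a toy example. The sketch below only simulates the order in which nodes are split, using made-up gains for each candidate node; real LightGBM computes the gains from the data, so the node numbering, the gains and the function names are all assumptions.

```python
import heapq

# Hypothetical split gains, with nodes indexed like a binary heap:
# node i has children 2*i + 1 and 2*i + 2. Nodes without an entry
# cannot be split further.
gains = {0: 10.0, 1: 8.0, 2: 1.0, 3: 6.0, 4: 0.5, 5: 0.4, 6: 0.3}

def depth(node):
    """Depth of a node in heap numbering (the root has depth 0)."""
    return (node + 1).bit_length() - 1

def leaf_wise(gains, num_leaves):
    """Best-first growth: always split the leaf with the largest gain."""
    candidates = [(-gains[0], 0)]  # max-heap through negated gains
    final_leaves = []
    n_leaves = 1
    while candidates and n_leaves < num_leaves:
        _, node = heapq.heappop(candidates)
        n_leaves += 1  # one leaf is replaced by two children
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                heapq.heappush(candidates, (-gains[child], child))
            else:
                final_leaves.append(child)
    final_leaves.extend(node for _, node in candidates)
    return sorted(final_leaves)

def level_wise(max_depth):
    """Level-by-level growth: every leaf at a level is split."""
    first = 2 ** max_depth - 1
    return list(range(first, first + 2 ** max_depth))

leaf_depths = [depth(n) for n in leaf_wise(gains, num_leaves=4)]
level_depths = [depth(n) for n in level_wise(max_depth=2)]
print(leaf_depths)   # [1, 2, 3, 3]
print(level_depths)  # [2, 2, 2, 2]
```

Both trees have four leaves, but the leaf-wise tree keeps splitting along the branch with the highest gains and ends up deeper on one side.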

In [13]:
#We generate our LightGBM algorithms:
metrics = {}
for n_estimators in [1, 5, 10, 20, 50, 100, 200, 500, 1000]:
    model = LGBMClassifier(max_depth = 5, n_estimators = n_estimators )
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['LGBM_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1
    }

metrics_LGBM = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_LGBM['delta%'] = 100*(metrics_LGBM.Test_Gini - metrics_LGBM.Train_Gini) / metrics_LGBM.Train_Gini
metrics_LGBM

| Model | Train_Gini | Test_Gini | delta% |
|---|---|---|---|
| LGBM_1 | 0.655849 | 0.632815 | -3.512204 |
| LGBM_5 | 0.704649 | 0.679644 | -3.548661 |
| LGBM_10 | 0.719746 | 0.695993 | -3.300265 |
| LGBM_20 | 0.739624 | 0.714137 | -3.446011 |
| LGBM_50 | 0.776222 | 0.74321 | -4.252954 |
| LGBM_100 | 0.801383 | 0.758471 | -5.354684 |
| LGBM_200 | 0.83101 | 0.76815 | -7.564197 |
| LGBM_500 | 0.881065 | 0.771816 | -12.399557 |
| LGBM_1000 | 0.927716 | 0.769263 | -17.079838 |


# 7. CatBoost

CatBoost is the last boosting algorithm studied in the project. Its name is short for '**Cat**egorical **Boost**ing', and it is known for the way it deals with categorical variables. This special method allows us to use categorical features directly, without encoding them first, which saves preprocessing time.
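As a rough intuition for how a categorical column can be turned into numbers without a manual encoding step, the sketch below computes a simplified "ordered target statistic": each row is encoded using only the targets of the rows that come before it. This is a simplification of the idea behind CatBoost's handling of categorical features, not its actual implementation (which, among other things, relies on random permutations of the data). The function name, the `prior` and `a` values and the toy data are assumptions.

```python
def ordered_target_stat(categories, targets, prior=0.5, a=1.0):
    """Encode each category value using only the preceding rows.

    encoded = (sum of earlier targets for this category + a * prior)
              / (number of earlier rows with this category + a)
    """
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + a * prior) / (n + a))
        # Only now add the current row, so it never sees its own target.
        sums[cat] = s + target
        counts[cat] = n + 1
    return encoded

# Toy wind directions and rain labels (made up for illustration).
wind_dir = ["W", "W", "NE", "W", "NE"]
rain = [1, 0, 0, 1, 1]

print(ordered_target_stat(wind_dir, rain))
# [0.5, 0.75, 0.5, 0.5, 0.25]
```

Because a row never uses its own target, the encoding avoids leaking the label into the feature.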

To take advantage of this feature, we will use an earlier, less preprocessed version of the dataset than in the other examples.