# 6. Boosting algorithms

This notebook applies boosting algorithms to the [Rain in Australia](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) dataset, available on Kaggle. It contains about 10 years of daily weather observations from many locations across Australia.

### Index:
1. [Packages required](#1.-Packages-required)
2. [Loading data](#2.-Loading-data)
3. [AdaBoost](#3.-AdaBoost)
4. [Gradient Boosting](#4.-Gradient-Boosting)
5. [XGBoost](#5.-XGBoost)
6. [LightGBM](#6.-LightGBM)
7. [CatBoost](#7.-CatBoost)
8. [Conclusions](#8.-Conclusions)

# 1. Packages required

In [None]:
!pip install xgboost
!pip install lightgbm
!pip install catboost

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
import time

# 2. Loading data

In [2]:
weather = pd.read_parquet('../data/04_model_input/master.parquet')
weather.head()

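| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Temp9am | Temp3pm | RainToday | RainTomorrow | Date_month | Date_day | Location_encoded | WindGustDir_encoded | WindDir9am_encoded | WindDir3pm_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 16.9 | 21.8 | 0 | 0.0 | 12 | 1 | 2 | 12.0 | 12.0 | 13.0 |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 17.2 | 24.3 | 0 | 0.0 | 12 | 2 | 2 | 13.0 | 15.0 | 11.0 |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 21.0 | 23.2 | 0 | 0.0 | 12 | 3 | 2 | 11.0 | 12.0 | 11.0 |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 18.1 | 26.5 | 0 | 0.0 | 12 | 4 | 2 | 2.0 | 6.0 | 4.0 |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 17.8 | 29.7 | 0 | 0.0 | 12 | 5 | 2 | 12.0 | 3.0 | 14.0 |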


In [3]:
# Fix the cutoff date used to split the data and the model features (all numeric columns except the target):
test_date = '2015-01-01'

model_columns = list(set(weather.select_dtypes(include='number').columns) - set(['RainTomorrow']))

In [4]:
# Split into train/test sets by date and replace NaN values with -1:
train = weather[weather.Date < test_date].fillna(-1)
test = weather[weather.Date >= test_date].fillna(-1)

# 3. AdaBoost

AdaBoost is a boosting algorithm that reduces the prediction error by sequentially building trees with only two leaf nodes. After each tree is fitted, the sample weights are updated according to the error of that estimator, and the next tree is built taking the new weights into account.

The Python implementation allows you to change the base estimator, but we keep the default so that we use the original one ("base_estimator = None" is a decision tree with max_depth = 1; scikit-learn 1.2 and later name this parameter "estimator"). Also, to limit overfitting, we will use a learning rate $\nu = 0.1$.
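
The reweighting step described above can be sketched in plain Python. This is a minimal illustration of the textbook discrete AdaBoost update for labels in {-1, +1}, not necessarily what scikit-learn does internally; the function name `adaboost_round` and the toy labels are assumptions made for the example.

```python
import math

def adaboost_round(y_true, y_pred, weights, learning_rate=1.0):
    """One reweighting step of discrete AdaBoost (labels in {-1, +1}).

    Hypothetical helper for illustration only.
    """
    total = sum(weights)
    # Weighted error of the current weak learner (e.g. a stump)
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p) / total
    # Keep the error away from 0 and 1 so the logarithm stays finite
    err = min(max(err, 1e-10), 1 - 1e-10)
    # Weight of this learner in the final vote, shrunk by the learning rate
    alpha = learning_rate * 0.5 * math.log((1 - err) / err)
    # Misclassified samples (y * p = -1) gain weight, correct ones lose weight
    new_weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, y_true, y_pred)]
    norm = sum(new_weights)
    return alpha, [w / norm for w in new_weights]

# Toy data: the weak learner misclassifies only the last sample
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(y_true, y_pred, weights)
print(alpha, new_weights)
```

With a learning rate below 1, as in this notebook, each learner's vote and each weight update are smaller, so more iterations are needed to reach the same fit.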

Then, we will train models with different numbers of boosting iterations to see how the performance evolves.
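
The training loops in this notebook report a Gini coefficient computed as 2*AUC - 1, plus a delta% column with the relative gap between test and train Gini. The following pure-Python sketch shows both quantities on toy data; the helper names `roc_auc`, `gini` and `delta_pct` are hypothetical, and the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the sample size, so only for small examples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient derived from the AUC: 0 is random, 1 is a perfect ranking."""
    return 2 * roc_auc(y_true, scores) - 1

def delta_pct(train_gini, test_gini):
    """Relative change from train to test Gini, in percent (negative means a drop)."""
    return 100 * (test_gini - train_gini) / train_gini

# Toy labels and predicted probabilities
y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc(y, scores)
gini_value = gini(y, scores)
print(auc_value, gini_value)
```

A strongly negative delta% indicates that the model ranks the training data much better than the test data, which is a sign of overfitting.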

In [6]:
# Train AdaBoost models with an increasing number of boosting iterations:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    start_time = time.time()
    model = AdaBoostClassifier(n_estimators = n_estimators, learning_rate = 0.1)
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['AdaB_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1,
        'Run_Time': time.time() - start_time
    }

metrics_AdaB = pd.DataFrame.from_dict(metrics, orient='index',columns=['Run_Time', 'Train_Gini', 'Test_Gini'])
metrics_AdaB['delta%'] = 100*(metrics_AdaB.Test_Gini - metrics_AdaB.Train_Gini) / metrics_AdaB.Train_Gini
metrics_AdaB

| | Run_Time | Train_Gini | Test_Gini | delta% |
|---|---|---|---|---|
| AdaB_1 | 0.267484 | 0.35681 | 0.324403 | -9.082368 |
| AdaB_3 | 0.474498 | 0.435319 | 0.394087 | -9.471638 |
| AdaB_5 | 0.803782 | 0.540856 | 0.518804 | -4.077312 |
| AdaB_10 | 1.406464 | 0.569703 | 0.540461 | -5.132848 |
| AdaB_15 | 2.117057 | 0.616769 | 0.573879 | -6.953917 |
| AdaB_20 | 2.762799 | 0.644189 | 0.607569 | -5.68461 |
| AdaB_30 | 4.07494 | 0.66349 | 0.625806 | -5.679689 |
| AdaB_50 | 7.374036 | 0.69059 | 0.660897 | -4.299631 |
| AdaB_100 | 15.036206 | 0.709561 | 0.690381 | -2.703051 |
| AdaB_200 | 28.34707 | 0.725582 | 0.711206 | -1.981235 |


In [7]:
metrics_AdaB.to_parquet('../data/models/adab.parquet')

# 4. Gradient Boosting

Gradient Boosting is another boosting algorithm. While AdaBoost modifies the sample weights to build the trees, Gradient Boosting computes the residuals of the current model and fits each new tree to them. The model therefore starts with a large error, which shrinks as more iterations are added. Another difference is that AdaBoost (originally) builds trees with only two leaf nodes, whereas Gradient Boosting has no predetermined number of leaves.
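
The residual-fitting idea can be sketched with a single feature and regression stumps. This is a simplified illustration under stated assumptions: the toy data, the helper names (`fit_stump`, `log_loss`) and the use of plain mean residuals as leaf values are choices made for the example, and a library implementation may compute leaf values differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(x, residuals):
    """Fit a two-leaf regression tree to the residuals by minimizing squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the largest value so the right leaf is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]  # (threshold, left value, right value)

def log_loss(y, scores):
    """Mean log loss of raw scores (log-odds) against binary labels."""
    return -sum(yi * math.log(sigmoid(s)) + (1 - yi) * math.log(1 - sigmoid(s))
                for yi, s in zip(y, scores)) / len(y)

# Toy data: one feature, binary target
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 1, 0, 1, 1]

# Start from the log-odds of the base rate, then add shrunken stumps
p0 = sum(y) / len(y)
scores = [math.log(p0 / (1 - p0))] * len(y)
learning_rate = 0.1
losses = [log_loss(y, scores)]

for _ in range(20):
    # For the log loss, the residual is the label minus the predicted probability
    residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
    t, lv, rv = fit_stump(x, residuals)
    scores = [s + learning_rate * (lv if xi <= t else rv) for xi, s in zip(x, scores)]
    losses.append(log_loss(y, scores))

print(losses[0], losses[-1])
```

The list `losses` records the training loss after each iteration, which mirrors the way the error shrinks as `n_estimators` grows.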

The parameters we choose to build our model are:
* Loss function: loss = log_loss (default), which is the same loss studied in the project.
* Learning rate: learning_rate = 0.1 (default), a common value that helps limit overfitting.
* Error measure: criterion = mse, which is the same measure studied in the project (scikit-learn 1.0 and later name it squared_error).
* Tree depth: max_depth = 5, which is the same value we fixed for bagging. We fix it so that later comparisons are fair.

We will also train models with different numbers of iterations (n_estimators) to compare them and see the evolution clearly.

In [8]:
# Train Gradient Boosting models with an increasing number of boosting iterations:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    start_time = time.time()
    # Note: scikit-learn 1.0 renamed this criterion to 'squared_error', and 'mse' was removed in 1.2
    model = GradientBoostingClassifier(max_depth = 5, n_estimators = n_estimators, criterion = 'mse' )
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['GB_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1,
        'Run_Time': time.time() - start_time
    }

metrics_GB = pd.DataFrame.from_dict(metrics, orient='index',columns=['Run_Time', 'Train_Gini', 'Test_Gini'])
metrics_GB['delta%'] = 100*(metrics_GB.Test_Gini - metrics_GB.Train_Gini) / metrics_GB.Train_Gini
metrics_GB

| | Run_Time | Train_Gini | Test_Gini | delta% |
|---|---|---|---|---|
| GB_1 | 0.703866 | 0.655905 | 0.626416 | -4.496019 |
| GB_3 | 1.446274 | 0.68936 | 0.659995 | -4.259822 |
| GB_5 | 2.159501 | 0.699321 | 0.670895 | -4.064779 |
| GB_10 | 4.117382 | 0.715642 | 0.688369 | -3.811052 |
| GB_15 | 6.1148 | 0.728085 | 0.702854 | -3.465366 |
| GB_20 | 7.992234 | 0.739203 | 0.714574 | -3.331857 |
| GB_30 | 12.185609 | 0.7533 | 0.726915 | -3.502579 |
| GB_50 | 20.227505 | 0.774052 | 0.743346 | -3.966943 |
| GB_100 | 39.237735 | 0.799155 | 0.759341 | -4.981933 |
| GB_200 | 78.730275 | 0.827495 | 0.768533 | -7.125317 |


In [9]:
metrics_GB.to_parquet('../data/models/gb.parquet')

# 5. XGBoost

XGBoost stands for 'e**X**treme **G**radient **Boost**ing' and is a boosting method that builds on the previous one. XGBoost models apply Gradient Boosting with their own type of trees, which are built using the gradient and the Hessian of the loss function together with some regularization parameters. XGBoost also includes many computational optimizations that make training faster.
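
To make the role of the gradient, the Hessian and the regularization terms concrete, the sketch below evaluates the leaf weight and split gain formulas from the XGBoost paper on toy values. The function names are hypothetical, and the example assumes the log loss with every current prediction at p = 0.5, so that each sample has gradient p - y and Hessian p(1 - p).

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf value given the summed gradients G and Hessians H in the leaf."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction obtained by splitting a node into left and right children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Toy node: labels that would fall in each child of a candidate split
y_left = [0, 0, 0]
y_right = [1, 1, 0]
p = 0.5  # current predicted probability for every sample (assumption)

# Log loss: gradient = p - y, Hessian = p * (1 - p)
GL = sum(p - y for y in y_left)
HL = len(y_left) * p * (1 - p)
GR = sum(p - y for y in y_right)
HR = len(y_right) * p * (1 - p)

gain = split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0)
w_left = leaf_weight(GL, HL)
w_right = leaf_weight(GR, HR)
print(gain, w_left, w_right)
```

A larger lambda shrinks the leaf weights toward zero, and a larger gamma raises the gain a split must reach before it is worth making.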

XGBoost is an algorithm with many parameters: regularization parameters, maximum depth of the trees, number of iterations, minimum number of samples in a node required to split it, and so on. That is very useful because it lets you tailor the algorithm to your needs. However, to show a basic example, we will keep most of the default values and use:
* Learning rate: eta = 0.1, which is a typical value for the learning rate and the same one we used in the other models.
* $\gamma$: gamma = 0 (default), which is one of the regularization parameters studied in the project.
* $\lambda$: lambda = 1 (default), which is the other regularization parameter studied.
* Build method: tree_method = auto (default), which chooses the method according to the size of the data. The available methods are based on the optimizations discussed in the project.
* Tree depth: max_depth = 5, which is the same value we fixed for bagging. We fix it so that later comparisons are fair.

We will also train different models by changing the number of iterations.

In [10]:
# Train XGBoost models with an increasing number of boosting iterations:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    start_time = time.time()
    model = XGBClassifier(max_depth = 5, eta = 0.1, n_estimators = n_estimators )
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['XGB_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1,
        'Run_Time': time.time() - start_time
    }

metrics_XGB = pd.DataFrame.from_dict(metrics, orient='index',columns=['Run_Time', 'Train_Gini', 'Test_Gini'])
metrics_XGB['delta%'] = 100*(metrics_XGB.Test_Gini - metrics_XGB.Train_Gini) / metrics_XGB.Train_Gini
metrics_XGB

| | Run_Time | Train_Gini | Test_Gini | delta% |
|---|---|---|---|---|
| XGB_1 | 0.360338 | 0.655905 | 0.628052 | -4.246574 |
| XGB_3 | 0.25865 | 0.689313 | 0.660337 | -4.203672 |
| XGB_5 | 0.346698 | 0.699824 | 0.672692 | -3.877051 |
| XGB_10 | 0.401839 | 0.712149 | 0.683318 | -4.048513 |
| XGB_15 | 0.514247 | 0.726794 | 0.70257 | -3.333092 |
| XGB_20 | 0.631665 | 0.735552 | 0.710096 | -3.460835 |
| XGB_30 | 0.840167 | 0.750958 | 0.723927 | -3.599428 |
| XGB_50 | 1.27238 | 0.773221 | 0.741087 | -4.155891 |
| XGB_100 | 2.348921 | 0.799161 | 0.757171 | -5.254256 |
| XGB_200 | 4.557895 | 0.829289 | 0.768926 | -7.278898 |


In [11]:
metrics_XGB.to_parquet('../data/models/xgb.parquet')

# 6. LightGBM

LightGBM stands for '**Light** **G**radient **B**oosting **M**achine' and is an algorithm developed by Microsoft. It shares its main characteristics with XGBoost, but it grows the trees by always splitting the node that maximizes the gain (it uses 'leaf-wise tree growth'). This produces asymmetric trees in which some branches are more developed than others. In addition, LightGBM applies a set of computational optimizations that make it faster.
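
Leaf-wise growth can be illustrated with a toy regression tree that always splits the leaf with the largest gain until a leaf budget is reached. This is only a sketch of the growth strategy: the helpers (`sse`, `best_split`, `grow_leaf_wise`) and the data are made up for the example, and the gain here is a plain reduction in squared error rather than the gradient-based statistics a real library would use.

```python
def sse(ys):
    """Sum of squared errors around the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (gain, threshold) of the best split on the single feature, or None."""
    best = None
    for t in sorted(set(xs))[:-1]:  # both children are non-empty for these thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = sse(ys) - sse(left) - sse(right)
        if best is None or gain > best[0]:
            best = (gain, t)
    return best

def grow_leaf_wise(xs, ys, num_leaves):
    """Grow a tree by repeatedly splitting the leaf with the highest gain."""
    leaves = [(xs, ys, 0)]  # each leaf: (feature values, targets, depth)
    while len(leaves) < num_leaves:
        scored = []
        for i, (lx, ly, _) in enumerate(leaves):
            split = best_split(lx, ly)
            if split is not None and split[0] > 0:
                scored.append((split[0], split[1], i))
        if not scored:
            break  # no leaf can be improved any further
        gain, t, i = max(scored)
        lx, ly, depth = leaves.pop(i)
        left = [(x, y) for x, y in zip(lx, ly) if x <= t]
        right = [(x, y) for x, y in zip(lx, ly) if x > t]
        for part in (left, right):
            leaves.append(([x for x, _ in part], [y for _, y in part], depth + 1))
    return leaves

# Toy data: flat targets on the left, varying targets on the right
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 10, 12, 16, 22]

leaves = grow_leaf_wise(xs, ys, num_leaves=4)
depths = sorted(depth for _, _, depth in leaves)
print(depths)
```

On this data the flat left region stays a single shallow leaf while the right branch keeps splitting, whereas a level-wise tree with the same four leaves would place all of them at depth 2.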

The parameters we will use are equivalent to those defined for XGBoost. We will also compare models with different numbers of estimators.

In [12]:
# Train LightGBM models with an increasing number of boosting iterations:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    start_time = time.time()
    model = LGBMClassifier(max_depth = 5, n_estimators = n_estimators )
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['LGBM_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1,
        'Run_Time': time.time() - start_time
    }

metrics_LGBM = pd.DataFrame.from_dict(metrics, orient='index',columns=['Run_Time', 'Train_Gini', 'Test_Gini'])
metrics_LGBM['delta%'] = 100*(metrics_LGBM.Test_Gini - metrics_LGBM.Train_Gini) / metrics_LGBM.Train_Gini
metrics_LGBM

| | Run_Time | Train_Gini | Test_Gini | delta% |
|---|---|---|---|---|
| LGBM_1 | 0.335185 | 0.655849 | 0.632815 | -3.512204 |
| LGBM_3 | 0.270756 | 0.692153 | 0.663234 | -4.178068 |
| LGBM_5 | 0.302234 | 0.704649 | 0.679644 | -3.548661 |
| LGBM_10 | 0.329781 | 0.719746 | 0.695993 | -3.300265 |
| LGBM_15 | 0.330081 | 0.730714 | 0.707332 | -3.199937 |
| LGBM_20 | 0.376667 | 0.739624 | 0.714137 | -3.446011 |
| LGBM_30 | 0.418544 | 0.756404 | 0.728042 | -3.749613 |
| LGBM_50 | 0.516237 | 0.776222 | 0.743217 | -4.252088 |
| LGBM_100 | 0.697214 | 0.801383 | 0.758477 | -5.353936 |
| LGBM_200 | 1.09844 | 0.83101 | 0.768152 | -7.564044 |


In [13]:
metrics_LGBM.to_parquet('../data/models/lgbm.parquet')

# 7. CatBoost

CatBoost is the last boosting algorithm studied in the project. Its name is short for '**Cat**egorical **Boost**ing', and it is known for the way it deals with categorical variables. This special method allows us to use them directly without encoding the categorical values, so we save preprocessing time.
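
One idea behind this handling is an ordered target statistic: each categorical value is replaced by a smoothed average of the targets seen earlier for the same category, so a row never uses its own label. The sketch below is a simplified illustration and not CatBoost's actual implementation, which among other things works over random permutations of the data; the function name, the prior value and the toy data are assumptions.

```python
def ordered_target_stats(categories, targets, prior=0.5):
    """Encode each category using only the targets of earlier rows.

    Simplified sketch: encoding = (sum of earlier targets + prior) / (earlier count + 1).
    """
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # The current row's own target is not used for its own encoding
        encoded.append((s + prior) / (n + 1))
        # Only after encoding is the row added to the running statistics
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

# Toy data: wind directions and a binary rain target
wind_dir = ["W", "NE", "W", "W", "NE"]
rain = [1, 0, 0, 1, 1]

encoded = ordered_target_stats(wind_dir, rain)
print(encoded)
```

Because each encoding depends only on earlier rows, the same category can receive different values at different positions, which limits the leakage of a row's own target into its feature.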

To take advantage of this feature, we will use an earlier, less preprocessed version of the dataset than in the other examples.

In [19]:
# Select the columns used by the model, dropping the target, the date and the encoded variables:
model_columns = list(set(weather.columns) - set(['RainTomorrow', 'Date', 'Location_encoded', 'WindDir3pm_encoded',
                                                 'WindDir9am_encoded', 'WindGustDir_encoded']))

After removing the encoded variables, we apply the CatBoost algorithm:

In [23]:
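# Train CatBoost models with an increasing number of boosting iterations.
# Categorical features are restricted to the columns actually passed to the model.
cat_features = [col for col in model_columns if weather[col].dtype == 'object']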
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    start_time = time.time()
    model = CatBoostClassifier(n_estimators = n_estimators, cat_features = cat_features, silent = True)
    model.fit(train[model_columns],train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['CatB_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1,
        'Run_Time': time.time() - start_time
    }

metrics_CatB = pd.DataFrame.from_dict(metrics, orient='index',columns=['Run_Time', 'Train_Gini', 'Test_Gini'])
metrics_CatB['delta%'] = 100*(metrics_CatB.Test_Gini - metrics_CatB.Train_Gini) / metrics_CatB.Train_Gini
metrics_CatB

| | Run_Time | Train_Gini | Test_Gini | delta% |
|---|---|---|---|---|
| CatB_1 | 0.698275 | 0.627189 | 0.623608 | -0.570824 |
| CatB_3 | 0.893615 | 0.70173 | 0.688385 | -1.901694 |
| CatB_5 | 0.975 | 0.721483 | 0.705266 | -2.247694 |
| CatB_10 | 1.293467 | 0.740927 | 0.718397 | -3.040822 |
| CatB_15 | 1.569221 | 0.752073 | 0.726358 | -3.419266 |
| CatB_20 | 1.819914 | 0.760728 | 0.733936 | -3.521918 |
| CatB_30 | 2.529816 | 0.774914 | 0.740177 | -4.482624 |
| CatB_50 | 3.712721 | 0.793528 | 0.747507 | -5.799525 |
| CatB_100 | 6.377156 | 0.820623 | 0.741832 | -9.601308 |
| CatB_200 | 18.785368 | 0.850018 | 0.775901 | -8.719408 |


In [24]:
metrics_CatB.to_parquet('../data/models/catb.parquet')

# 8. Conclusions