# 4. Bagging algorithms

This notebook shows examples of three bagging-based algorithms: Bagging, Random Forest and Extra-Trees.
The dataset used is the [Rain in Australia](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) dataset from Kaggle. It contains about 10 years of daily weather observations from many locations across Australia.

### Index:
1. [Packages required](#1.-Packages-required)
2. [Loading data](#2.-Loading-data)
3. [Bagging](#3.-Bagging)
4. [Random Forest](#4.-Random-Forest)
5. [Extra-Trees](#5.-Extra-Trees)
6. [Conclusions](#6.-Conclusions)

# 1. Packages required

In [31]:
import os
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

# 2. Loading data

In [3]:
weather = pd.read_parquet('../data/04_model_input/master.parquet')
weather.head()

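| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Temp9am | Temp3pm | RainToday | RainTomorrow | Date_month | Date_day | Location_encoded | WindGustDir_encoded | WindDir9am_encoded | WindDir3pm_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 16.9 | 21.8 | 0 | 0.0 | 12 | 1 | 2 | 12.0 | 12.0 | 13.0 |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 17.2 | 24.3 | 0 | 0.0 | 12 | 2 | 2 | 13.0 | 15.0 | 11.0 |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 21.0 | 23.2 | 0 | 0.0 | 12 | 3 | 2 | 11.0 | 12.0 | 11.0 |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 18.1 | 26.5 | 0 | 0.0 | 12 | 4 | 2 | 2.0 | 6.0 | 4.0 |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 17.8 | 29.7 | 0 | 0.0 | 12 | 5 | 2 | 12.0 | 3.0 | 14.0 |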


# 3. Bagging

The Bagging (bootstrap aggregating) algorithm draws samples of the training data with replacement and fits a separate decision tree to each sample. Its prediction is the mean of the trees' predictions (regression) or the most voted class (classification).
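To make the idea concrete, here is a minimal hand-written sketch of bagging. It is not the scikit-learn implementation: the synthetic data from `make_classification`, the choice of 11 trees and the variable names are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for the weather table (illustrative assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.default_rng(0)
n_trees = 11  # odd, so a majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[idx], y[idx]))

# Averaging: mean of the per-tree probabilities for class 1.
bagged_proba = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)

# Voting: the class predicted by more than half of the trees.
votes = np.stack([t.predict(X) for t in trees])
bagged_class = (votes.mean(axis=0) > 0.5).astype(int)
```

Each tree sees a slightly different version of the training set, which is what lets the averaged prediction be more stable than that of any single tree.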

We want to run the Bagging algorithm on the current dataset and show the results. We will split the data into train and test sets using Out-Of-Time validation, so that we can see how well the model predicts future 'RainTomorrow' values.
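As a small illustration of Out-Of-Time validation, the sketch below splits a toy table at a cutoff date. The helper `out_of_time_split` and the toy values are hypothetical; only the column names `Date` and `RainTomorrow` come from the dataset.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-12-30", "2014-12-31", "2015-01-01", "2015-01-02"]),
    "RainTomorrow": [0, 1, 0, 1],
})

def out_of_time_split(df, cutoff, date_col="Date"):
    """Hypothetical helper: rows before `cutoff` go to train, the rest to test."""
    cutoff = pd.Timestamp(cutoff)
    return df[df[date_col] < cutoff], df[df[date_col] >= cutoff]

toy_train, toy_test = out_of_time_split(toy, "2015-01-01")
print(len(toy_train), len(toy_test))
```

Unlike a random split, every test row is later than every training row, so the evaluation mimics predicting the future from the past.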

We will also try different numbers of trees (`n_estimators`), using decision trees with max_depth = 5, the most efficient value according to the previous notebook.

In [22]:
# Fix the cutoff date for the split and the features to use:
test_date = '2015-01-01'

# All numeric columns except the target, kept in their original order so runs are reproducible.
model_columns = [col for col in weather.select_dtypes(include='number').columns if col != 'RainTomorrow']

In [23]:
# Split into train/test sets and replace NaN values with -1:
train = weather[weather.Date < test_date].fillna(-1)
test = weather[weather.Date >= test_date].fillna(-1)

In [36]:
# Fit Bagging models with an increasing number of trees:
tree = DecisionTreeClassifier(max_depth = 5)
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100]:
    # The base tree is passed positionally because the keyword was renamed from
    # `base_estimator` to `estimator` in scikit-learn 1.2.
    model = BaggingClassifier(tree, n_estimators = n_estimators)
    model.fit(train[model_columns], train.RainTomorrow)
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['Bag_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_pd['delta%'] = 100*(metrics_pd.Test_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

| Model | Train_Gini | Test_Gini | delta% |
|---|---|---|---|
| Bag_1 | 0.655142 | 0.626805 | -4.325325 |
| Bag_3 | 0.684884 | 0.661847 | -3.363586 |
| Bag_5 | 0.689336 | 0.664487 | -3.604835 |
| Bag_10 | 0.690995 | 0.66743 | -3.410305 |
| Bag_15 | 0.691203 | 0.666594 | -3.56019 |
| Bag_20 | 0.691793 | 0.66653 | -3.651814 |
| Bag_30 | 0.692361 | 0.667293 | -3.620697 |
| Bag_50 | 0.69509 | 0.669338 | -3.704813 |
| Bag_100 | 0.69496 | 0.671174 | -3.422611 |
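The Gini values reported here are a rescaling of the ROC AUC, `2 * AUC - 1`, and `delta%` is the relative change from train to test. The sketch below checks both on tiny made-up scores; the helper name `gini` and the toy arrays are assumptions for illustration, while the two figures used for `delta_bag_1` are the rounded `Bag_1` values from the table.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini(y_true, scores):
    """Gini coefficient derived from the ROC AUC."""
    return 2 * roc_auc_score(y_true, scores) - 1

y_true = np.array([0, 0, 1, 1])

perfect = gini(y_true, np.array([0.1, 0.2, 0.8, 0.9]))   # perfect ranking
constant = gini(y_true, np.array([0.5, 0.5, 0.5, 0.5]))  # no ranking information
inverted = gini(y_true, np.array([0.9, 0.8, 0.2, 0.1]))  # ranking exactly reversed

# Relative train-to-test change, using the rounded Bag_1 row.
train_gini, test_gini = 0.655142, 0.626805
delta_bag_1 = 100 * (test_gini - train_gini) / train_gini

print(perfect, constant, inverted, round(delta_bag_1, 2))
```

A Gini of 1 means a perfect ranking, 0 means no better than chance, and a negative `delta%` means the model ranks the test period worse than the training period.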


# 4. Random Forest

The Random Forest algorithm works like Bagging but also samples the features: each split considers only a random subset of them. The most common subset size is $\sqrt{p}$, where $p$ is the total number of features. So Random Forest builds trees from different rows and different columns, and the prediction is the mean of the trees' predictions (regression) or the most voted class (classification).
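As a quick check of the $\sqrt{p}$ rule, the sketch below fits a small forest on synthetic data with $p = 16$ features and reads back how many features each tree evaluates per split. The data and the parameter values are illustrative assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with p = 16 features (illustrative assumption).
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

forest = RandomForestClassifier(n_estimators=5, max_depth=5, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Each fitted tree stores the number of features it considers at every split.
features_per_split = forest.estimators_[0].max_features_
print(features_per_split, int(np.sqrt(X.shape[1])))

# What one such draw could look like: 4 of the 16 column indices, without replacement.
rng = np.random.default_rng(0)
candidate_columns = rng.choice(X.shape[1], size=features_per_split, replace=False)
```

With 16 features, each split chooses among 4 randomly drawn columns, which decorrelates the trees compared with plain Bagging.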

Now we want to run the Random Forest algorithm on the current dataset to predict 'RainTomorrow'. We will use the same train/test split as in the previous example. We will also compare its effectiveness for different numbers of trees, again using decision trees with max_depth = 5.

In [35]:
# Fit Random Forest models with an increasing number of trees:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100]:
    model = RandomForestClassifier(n_estimators = n_estimators, max_depth = 5, max_features = 'sqrt')
    model.fit(train[model_columns], train.RainTomorrow)
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['RF_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_pd['delta%'] = 100*(metrics_pd.Test_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

| Model | Train_Gini | Test_Gini | delta% |
|---|---|---|---|
| RF_1 | 0.582202 | 0.526342 | -9.594697 |
| RF_3 | 0.650651 | 0.626843 | -3.659199 |
| RF_5 | 0.677006 | 0.656661 | -3.005106 |
| RF_10 | 0.685607 | 0.659984 | -3.737349 |
| RF_15 | 0.688043 | 0.663882 | -3.511555 |
| RF_20 | 0.701299 | 0.677132 | -3.446018 |
| RF_30 | 0.699606 | 0.675405 | -3.459186 |
| RF_50 | 0.699707 | 0.672597 | -3.874468 |
| RF_100 | 0.704017 | 0.67975 | -3.446925 |


# 5. Extra-Trees

Extra-Trees (Extremely Randomized Trees) is based on Random Forest. The difference lies in how the trees are built: Random Forest splits each node at the most efficient partition, whereas Extra-Trees draws a random threshold for each candidate variable and splits the node using those random values. This gives the algorithm more randomness.
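The difference between the two split rules can be sketched on a single feature. This is a simplified illustration, not scikit-learn's code: the helper functions, the synthetic data and the restriction to one feature are all assumptions.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of binary labels (0 for an empty set)."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def split_impurity(x, y, threshold):
    """Size-weighted impurity of the two children produced by `threshold`."""
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Random Forest style: scan every midpoint between sorted values, keep the best one.
x_sorted = np.sort(x)
candidates = (x_sorted[:-1] + x_sorted[1:]) / 2
best_threshold = min(candidates, key=lambda t: split_impurity(x, y, t))

# Extra-Trees style: draw one threshold uniformly between the feature's min and max.
random_threshold = rng.uniform(x.min(), x.max())

best_imp = split_impurity(x, y, best_threshold)
rand_imp = split_impurity(x, y, random_threshold)
print(best_imp, rand_imp)
```

The random threshold can never beat the exhaustive search on impurity, but it is cheaper to compute and makes the individual trees less alike.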

The objective is the same: evaluate its 'RainTomorrow' predictions using Out-Of-Time validation and trees with max_depth = 5.

In [34]:
# Fit Extra-Trees models with an increasing number of trees:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100]:
    model = ExtraTreesClassifier(n_estimators = n_estimators, max_depth = 5, max_features = 'sqrt')
    model.fit(train[model_columns], train.RainTomorrow)
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred = model.predict_proba(test[model_columns])[:, 1]

    metrics['ET_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(test.RainTomorrow, test_pred)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_pd['delta%'] = 100*(metrics_pd.Test_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

| Model | Train_Gini | Test_Gini | delta% |
|---|---|---|---|
| ET_1 | 0.460539 | 0.44476 | -3.426191 |
| ET_3 | 0.549762 | 0.530726 | -3.462568 |
| ET_5 | 0.618219 | 0.582691 | -5.746791 |
| ET_10 | 0.636469 | 0.615789 | -3.249288 |
| ET_15 | 0.654553 | 0.633127 | -3.273476 |
| ET_20 | 0.651349 | 0.627458 | -3.667953 |
| ET_30 | 0.652095 | 0.628526 | -3.614463 |
| ET_50 | 0.661596 | 0.639917 | -3.276777 |
| ET_100 | 0.660416 | 0.638042 | -3.387768 |


# 6. Conclusions