
# Stacked Ensemble 

## Installation and Imports

In [1]:
# Install a Java runtime, which H2O needs in order to run on Colab
! apt-get install default-jre
!java -version

Reading package lists... Done
Building dependency tree       
Reading state information... Done
default-jre is already the newest version (2:1.11-68ubuntu1~18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-410
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)


In [2]:
# Install H2O, then start and connect to a local H2O cluster
! pip install h2o
import h2o
h2o.init(nthreads = -1)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime: 1 hour 9 mins
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.1
H2O cluster version age: 2 days
H2O cluster name: H2O_from_python_unknownUser_8y79te
H2O cluster total nodes: 1
H2O cluster free memory: 2.937 Gb
H2O cluster total cores: 2
H2O cluster allowed cores: 2


In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
# Importing the data file
data = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/airlines/allyears2k_headers.zip")

Parse progress: |█████████████████████████████████████████████████████████| 100%


## Modelling / Analysis

In [0]:
# Splitting the data to train, validation & test
train, valid, test = data.split_frame([0.8, 0.1] , seed = 69)

In [6]:
print("%d/%d/%d" % (train.nrows, valid.nrows, test.nrows))

35255/4272/4451
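
The ratios `[0.8, 0.1]` name only the first two splits; the third frame receives the remaining rows. H2O's `split_frame` assigns rows probabilistically, so the resulting sizes are close to, but not exactly, 80/10/10. A minimal check using only the row counts printed above (no H2O call is made here):

```python
# Row counts printed by the notebook for train / valid / test.
counts = {"train": 35255, "valid": 4272, "test": 4451}
total = sum(counts.values())

# Fraction of all rows that landed in each split.
fractions = {name: n / total for name, n in counts.items()}

for name, frac in fractions.items():
    print(f"{name}: {counts[name]} rows ({frac:.4f} of {total})")
```

The three splits come out at roughly 80.2%, 9.7% and 10.1% of the 43978 rows.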


In [0]:
# Response column and the columns to exclude from the predictors
y = "IsArrDelayed"
ignoreFields = [
    "ArrDelay" , "DepDelay", 
    "CarrierDelay" , "WeatherDelay", 
    "NASDelay" , "SecurityDelay", 
    "LateAircraftDelay" , 
    "IsDepDelayed" , "IsArrDelayed" , 
    "ActualElapsedTime" ,  # known only after the flight; the scheduled CRSElapsedTime is fine
    "ArrTime"
]

x = [i for i in train.names if i not in ignoreFields]

In [0]:
# Number of cross-validation folds; with cross-validation in place of a
# validation frame, train and valid are merged into one training frame
nfolds = 5
train2 = train.rbind(valid)

In [9]:
train2.nrows

39527

In [0]:
# Importing the estimators for the base models and the stacked ensemble
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator



### All Models

In [14]:
# GLM Model
m_GLM = H2OGeneralizedLinearEstimator(
    family = "binomial", 
    model_id = "glm_def", 
    nfolds = nfolds,
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = True
  )

m_GLM.train(x, y, train2)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [15]:
# GBM Model
m_GBM = H2OGradientBoostingEstimator( 
    model_id = "gbm_def", 
    nfolds = nfolds,
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = True
  )

m_GBM.train(x, y, train2)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [16]:
# Random Forest Model
m_RF = H2ORandomForestEstimator( 
    model_id = "rf_def", 
    nfolds = nfolds,
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = True
  )

m_RF.train(x, y, train2)

drf Model Build progress: |███████████████████████████████████████████████| 100%
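
All three base models use the same `nfolds`, the same `fold_assignment = "Modulo"` and `keep_cross_validation_predictions = True`. H2O's stacked ensemble requires the base models to share identical folds, and modulo assignment gives that without depending on a seed. The sketch below assumes the rule is simply "row index modulo `nfolds`"; the helper name `modulo_folds` is made up for illustration and is not part of H2O.

```python
from collections import Counter

def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds (deterministic, no seed involved)."""
    return [i % nfolds for i in range(n_rows)]

# 39527 is the number of rows in train2 reported above.
folds = modulo_folds(39527, nfolds=5)
fold_sizes = Counter(folds)
print(sorted(fold_sizes.items()))
```

Because the assignment depends only on the row position, every model trained on `train2` holds out exactly the same rows in each fold, and the fold sizes differ by at most one row.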


### Model Comparison

In [0]:
# we create a list of 3 model ids
models = [m_GLM.model_id, m_GBM.model_id, m_RF.model_id]

In [21]:

# Stacked ensemble that combines the three base models
m_SE = H2OStackedEnsembleEstimator(model_id = "SE_models",
                                   base_models = models)
m_SE.train(x, y, train2)

stackedensemble Model Build progress: |███████████████████████████████████| 100%


In [0]:
import pandas as pd

In [0]:
all_models = [m_GLM, m_GBM, m_RF, m_SE]
names = ["GLM", "GBM", "RF", "SE"]

In [24]:
pd.Series(map(lambda x: x.logloss(), all_models), names)

GLM    0.618522
GBM    0.508488
RF     0.515886
SE     0.234050
dtype: float64

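Lower logloss is better. The stacked ensemble looks far better at 0.23, but these values are computed on the training data, so they overstate how well it will generalize.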
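
For reference, binary logloss is the mean negative log-likelihood of the true labels under the predicted probabilities. The following is a minimal sketch of that formula in plain Python, not H2O's implementation; the function name `binary_logloss` and the toy labels and probabilities are illustrative assumptions.

```python
import math

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) is never evaluated
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 1, 0]
confident = binary_logloss(labels, [0.9, 0.1, 0.8, 0.95, 0.2])
uninformed = binary_logloss(labels, [0.5] * len(labels))
print(confident, uninformed)
```

A model that always predicts 0.5 scores ln 2 (about 0.693), which is a useful baseline when reading the table above: the GLM's 0.6185 is not far below it.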

In [25]:
pd.Series(map(lambda x: x.auc(), all_models), names)

GLM    0.702621
GBM    0.849723
RF     0.831402
SE     0.991942
dtype: float64

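Ensembles usually give a few percentage points of improvement. The much larger jump here (0.99 versus 0.85 for the GBM) is measured on the training data and should not be taken at face value.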
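
AUC can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counting as half. A small pairwise sketch of that definition follows; it is quadratic in the number of rows and is meant only as an illustration, not as the way H2O computes AUC. The name `pairwise_auc` and the toy data are assumptions.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.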

In [26]:
pd.Series(map(lambda x: x.auc(xval = True), all_models), names)

GLM    0.698574
GBM    0.805115
RF     0.835802
SE          NaN
dtype: float64

So far we have been looking at results on the data the models were trained on. The stacked ensemble has no cross-validated result (NaN): it was built from the base models' cross-validation predictions on all of the training data, so there is no separate dataset left to evaluate it on.
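
The base models were trained with `keep_cross_validation_predictions = True` because the ensemble is fitted on those held-out predictions: each training row is scored by a model that did not see it. As a rough illustration only, the sketch below builds such out-of-fold predictions with modulo folds. The toy "model" (it predicts the mean label of its training rows), the function name `out_of_fold_predictions` and the labels are made up for this sketch and are not H2O code.

```python
from statistics import mean

def out_of_fold_predictions(labels, nfolds):
    """Predict each row with a toy model fitted only on the other folds."""
    preds = [None] * len(labels)
    for fold in range(nfolds):
        # "Train" on every row that is NOT in the held-out fold.
        train_labels = [y for i, y in enumerate(labels) if i % nfolds != fold]
        fold_prediction = mean(train_labels)
        # Score only the held-out rows of this fold.
        for i in range(fold, len(labels), nfolds):
            preds[i] = fold_prediction
    return preds

labels = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
oof = out_of_fold_predictions(labels, nfolds=5)
print(oof)
```

Stacking one such column per base model gives the data the ensemble learns from. Once every training row has been used this way, no untouched rows remain for validating the ensemble itself, which is why a separate test frame is needed.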

In [0]:
# Evaluating every model on the held-out test data
test_perf = list(map(lambda x: x.model_performance(test), all_models))

In [28]:
pd.Series(map(lambda p: p.logloss(), test_perf), names)

GLM    0.623406
GBM    0.544773
RF     0.492564
SE     0.489151
dtype: float64

In [30]:
pd.Series(map(lambda p: p.auc(), test_perf), names)

GLM    0.692037
GBM    0.802151
RF     0.839630
SE     0.840522
dtype: float64

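Always compare the models on held-out test data at the end, and do not apply the ensemble mechanically: here the stacked ensemble is only marginally better than the random forest on the test set, far less than the training metrics suggested.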
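
One way to make that final comparison concrete is to measure the gap between the ensemble and the best single base model. The sketch below uses only the test metrics reported above, copied in by hand; the variable names are illustrative.

```python
# Test-set metrics reported above, copied by hand.
test_logloss = {"GLM": 0.623406, "GBM": 0.544773, "RF": 0.492564, "SE": 0.489151}
test_auc = {"GLM": 0.692037, "GBM": 0.802151, "RF": 0.839630, "SE": 0.840522}

base_models = [name for name in test_auc if name != "SE"]
best_auc_base = max(base_models, key=test_auc.get)          # higher AUC is better
best_logloss_base = min(base_models, key=test_logloss.get)  # lower logloss is better

auc_gain = test_auc["SE"] - test_auc[best_auc_base]
logloss_gain = test_logloss[best_logloss_base] - test_logloss["SE"]

print(f"AUC: SE beats {best_auc_base} by {auc_gain:.6f}")
print(f"logloss: SE beats {best_logloss_base} by {logloss_gain:.6f}")
```

On this data the ensemble improves on the random forest by less than 0.001 in AUC and about 0.003 in logloss, so whether the extra model is worth its cost is a judgment call rather than a given.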
