
# H2O ML

H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

The speed, quality, ease of use, and model deployment support for cutting-edge supervised and unsupervised algorithms such as Deep Learning, Tree Ensembles, and GLRM make H2O a highly sought-after API for data science on big data.

Install H2O

Load the H2O Python module

In [5]:
import h2o

Start up the H2O Cluster

In [6]:
# Number of threads, nthreads = -1, means use all cores on your machine
# max_mem_size is the maximum memory (in GB) to allocate to H2O
h2o.init(nthreads = -1, max_mem_size = 8)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)
  Starting server from C:\Users\shiva\Anaconda2\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: c:\users\shiva\appdata\local\temp\tmp8zsg6s
  JVM stdout: c:\users\shiva\appdata\local\temp\tmp8zsg6s\h2o_shiva_started_from_python.out
  JVM stderr: c:\users\shiva\appdata\local\temp\tmp8zsg6s\h2o_shiva_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 02 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 25 days
H2O cluster name: H2O_from_python_shiva_8791f9
H2O cluster total nodes: 1
H2O cluster free memory: 8 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


In [7]:
h2o.init(max_mem_size = 2)            #uses all cores by default
h2o.remove_all()                          #clean slate, in case cluster was already running

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime: 23 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 25 days
H2O cluster name: H2O_from_python_shiva_8791f9
H2O cluster total nodes: 1
H2O cluster free memory: 8 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


To learn more about the h2o package itself, we can use Python's built-in help() function.

In [8]:
help(h2o)

Help on package h2o:

NAME
    h2o - :mod:`h2o` -- module for using H2O services.

FILE
    c:\users\shiva\anaconda2\lib\site-packages\h2o\__init__.py

DESCRIPTION
    (please add description).

PACKAGE CONTENTS
    assembly
    astfun
    automl (package)
    backend (package)
    cross_validation
    demos
    display
    estimators (package)
    exceptions
    expr
    expr_optimizer
    frame
    grid (package)
    group_by
    h2o
    job
    model (package)
    schemas (package)
    targetencoder
    transforms (package)
    tree (package)
    two_dim_table
    utils (package)

SUBMODULES
    __init__

FUNCTIONS
    api(endpoint, data=None, json=None, filename=None, save_to=None)
        Perform a REST API request to a previously connected server.
        
        This function is mostly for internal purposes, but may occasionally be useful for direct access to
        the backend H2O server. It has same parameters as :meth:`H2OConnection.request <h2o.backend.H2OConnection.reques




In [9]:
h2o.init(ip="127.0.0.1", port=54321)

Checking whether there is an H2O instance running at http://127.0.0.1:54321. connected.


H2O cluster uptime: 1 min 16 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 25 days
H2O cluster name: H2O_from_python_shiva_8791f9
H2O cluster total nodes: 1
H2O cluster free memory: 8 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


# H2O Deep Learning

H2O’s Deep Learning is based on a multi-layer feedforward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier, and maxout activation functions. 
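
To make the three activation functions concrete, here is a minimal pure-Python sketch of one fully connected layer. It is only an illustration of the general definitions, not H2O's implementation; the helper names (`rectifier`, `maxout_unit`, `dense_layer`) and the tiny weights are made up for this example.

```python
import math

def rectifier(z):
    # Rectifier (ReLU): pass positive values through, clamp negatives to 0
    return max(0.0, z)

def maxout_unit(pieces):
    # Maxout: a unit outputs the largest of several linear pre-activations
    return max(pieces)

def dense_layer(inputs, weights, biases, activation):
    # One feedforward layer: activation(w . x + b) for every neuron
    outputs = []
    for w_row, b in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(pre_activation))
    return outputs

# Hypothetical 2-input, 2-neuron layer (values chosen only for illustration)
x_in = [1.0, -2.0]
weights = [[0.5, -0.25], [-1.0, 0.5]]
biases = [0.0, 0.5]

hidden_relu = dense_layer(x_in, weights, biases, rectifier)
hidden_tanh = dense_layer(x_in, weights, biases, math.tanh)
```

The pre-activations of the two neurons here are 1.0 and -1.5, so the rectifier keeps the first and zeroes the second, while tanh squashes both into the range (-1, 1).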

In [10]:
%matplotlib inline                         

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from h2o.estimators.deeplearning import H2OAutoEncoderEstimator, H2ODeepLearningEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator

Data Preprocessing

Import data

In [11]:
loan_csv = "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
data_loan = h2o.import_file(loan_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [12]:
data_loan.shape

(163987, 15)

Since we want to train a binary classification model, we must ensure that the response is coded as a factor. If the response is 0/1, H2O will assume it's numeric, which means that H2O will train a regression model instead.

In [13]:
data_loan['bad_loan'] = data_loan['bad_loan'].asfactor()  #encode the binary response as a factor
data_loan['bad_loan'].levels()  #optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

Splitting the data

Next, we partition the data into training, validation and test sets.

In [15]:
splits = data_loan.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

Notice that split_frame() uses approximate splitting rather than exact splitting (for efficiency), so the resulting frames are not exactly 70%, 15% and 15% of the total rows.
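
One way to see why a probabilistic split is only approximate is the sketch below, which assigns each row to a partition by drawing a uniform random number. This is an illustration of the general idea in plain Python, not H2O's actual code; the function name `probabilistic_split` and the row count of 10000 are assumptions made for the example.

```python
import random

def probabilistic_split(n_rows, ratios, seed):
    # Assign each row independently, so partition sizes are only approximate
    rng = random.Random(seed)

    # Cumulative thresholds, e.g. [0.7, 0.85] for ratios [0.7, 0.15]
    thresholds = []
    running = 0.0
    for r in ratios:
        running += r
        thresholds.append(running)

    # One extra partition collects the remaining probability mass
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        for i, t in enumerate(thresholds):
            if u < t:
                parts[i].append(row)
                break
        else:
            parts[-1].append(row)
    return parts

parts = probabilistic_split(10000, [0.7, 0.15], seed=1)
sizes = [len(p) for p in parts]
print(sizes)
```

Every row lands in exactly one partition, but the sizes are close to, and generally not exactly, 7000, 1500 and 1500. No pass over the data is needed to count rows first, which is what makes this style of split cheap on large datasets.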

In [16]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

114908
24498
24581


Identify response and predictor variables

In H2O, we use y to designate the response variable and x to designate the list of predictor columns.

In [19]:
y = 'bad_loan'
x = list(data_loan.columns)

In [20]:
x.remove(y)  #remove the response
x.remove('int_rate')  #remove the interest rate column because it's correlated with the outcome
# List of predictor columns
x

[u'loan_amnt',
 u'term',
 u'emp_length',
 u'home_ownership',
 u'annual_inc',
 u'purpose',
 u'addr_state',
 u'dti',
 u'delinq_2yrs',
 u'revol_util',
 u'total_acc',
 u'longest_credit_length',
 u'verification_status']

H2O Machine Learning

Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algorithms.

Random Forest (RF)


H2O's Random Forest (RF) implements a distributed version of the standard Random Forest algorithm and variable importance measures.

In [21]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

Train a default RF

First we will train a basic Random Forest model with default parameters. Random Forest will infer the response distribution from the response encoding. A seed is required for reproducibility.

In [23]:
# Initialize the RF estimator:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1)

In [24]:
rf_fit1.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


Train an RF with more trees

Next we will increase the number of trees used in the forest by setting ntrees = 100. The default number of trees in an H2O Random Forest is 50, so this RF will be twice as big as the default. Increasing the number of trees in an RF usually improves performance as well. Unlike Gradient Boosting Machines (GBMs), Random Forests are fairly resistant to (although not free from) overfitting as the number of trees increases. See the GBM example below for additional guidance on preventing overfitting using H2O's early stopping functionality.
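
As a rough intuition for why more trees tend to help, the sketch below computes the accuracy of a majority vote over independent voters that are each correct with probability 0.6. This is a deliberately simplified model and an assumption of this example: real trees in a forest are correlated, so the gains in practice are smaller and level off. The function names are hypothetical.

```python
from math import factorial

def n_choose_k(n, k):
    # Binomial coefficient computed with exact integer arithmetic
    return factorial(n) // (factorial(k) * factorial(n - k))

def majority_vote_accuracy(n_voters, p_correct):
    # Probability that strictly more than half of n independent voters are correct
    needed = n_voters // 2 + 1
    return sum(
        n_choose_k(n_voters, j) * p_correct ** j * (1 - p_correct) ** (n_voters - j)
        for j in range(needed, n_voters + 1)
    )

# Odd voter counts avoid ties; 51 and 101 stand in for "about 50" and "about 100" trees
acc_1 = majority_vote_accuracy(1, 0.6)
acc_51 = majority_vote_accuracy(51, 0.6)
acc_101 = majority_vote_accuracy(101, 0.6)
print(acc_1, acc_51, acc_101)
```

Under these assumptions the vote becomes more reliable as voters are added, with diminishing returns, which is consistent with the modest AUC improvement typically seen when going from 50 to 100 trees.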

In [25]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1)
rf_fit2.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


Compare model performance

In [27]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)
print(rf_perf1.auc())
print(rf_perf2.auc())

0.663460943045
0.669285050822
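
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on a toy dataset; the function name and data are made up for illustration, and H2O's own computation may differ in detail.

```python
def auc_by_ranking(labels, scores):
    # Fraction of (positive, negative) pairs where the positive scores higher;
    # ties count as half a win
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_by_ranking(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the values of roughly 0.66 and 0.67 above indicate a model that ranks bad loans above good ones about two times out of three.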


Cross-validate performance

Rather than using a held-out test set to evaluate model performance, a user may wish to estimate model performance using cross-validation. Using the RF algorithm (with default model parameters) as an example, we demonstrate how to perform k-fold cross-validation using H2O. No custom code or loops are required; you simply specify the number of desired folds in the nfolds argument.

Since we are not going to use a test set here, we can use the original (full) dataset, which we called data_loan, rather than the subsampled train dataset. Note that this will take approximately k (nfolds) times longer than training a single RF model, since it trains k models in the cross-validation process (each on n(k-1)/k rows), in addition to the final model trained on the full training_frame dataset with n rows.
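
The row counts in k-fold cross-validation can be checked with a small pure-Python sketch. It uses a simple modulo rule to assign rows to folds, chosen here only for clarity; the fold assignment H2O actually uses may differ, and the function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_rows, k):
    # Yield (train_rows, holdout_rows) for each fold, assigning row i to fold i % k
    for fold in range(k):
        holdout_rows = [i for i in range(n_rows) if i % k == fold]
        train_rows = [i for i in range(n_rows) if i % k != fold]
        yield train_rows, holdout_rows

# With n = 10 rows and k = 5 folds, each model trains on n(k-1)/k = 8 rows
folds = list(kfold_indices(10, 5))
for train_rows, holdout_rows in folds:
    print(len(train_rows), len(holdout_rows))
```

Each row is held out exactly once, so the cross-validated metric is computed from predictions on data the corresponding fold model never saw.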

In [29]:

rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=data_loan)

drf Model Build progress: |███████████████████████████████████████████████| 100%


To evaluate the cross-validated AUC

In [30]:
print(rf_fit3.auc(xval=True))

0.66361117109


Note that the cross-validated AUC is slightly higher than the test set performance we estimated for rf_fit1. This is likely because the cross-validation models were each trained on more data (0.8*n rows with nfolds=5) than rf_fit1, which used train as the training set (about 0.7*n rows).