# Introduction
This tutorial shows how H2O [Gradient Boosted Methods](https://en.wikipedia.org/wiki/Gradient_boosting) and [Random Forest](https://en.wikipedia.org/wiki/Random_forest) models can be used for supervised classification and regression. It covers usage of H2O from Python; an R version will be available in a separate document. This file is available in plain R, R markdown, regular markdown, plain Python and IPython Notebook formats. More examples and explanations can be found in our [H2O GBM booklet](http://h2o.ai/resources/) and on our [H2O Github Repository](http://github.com/h2oai/h2o-3/).


##Task: Predicting forest cover type from cartographic variables only

The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) data. We are using the UC Irvine Covertype dataset.

### H2O Python Module

Load the H2O Python module.

In [1]:
import h2o


### Start H2O
Start up a 1-node H2O cloud on your local machine, and allow it to use all CPU cores and up to 2GB of memory:

In [2]:
h2o.init(max_mem_size_GB = 2)            #uses all cores by default
h2o.remove_all()                          #clean slate, in case cluster was already running



No instance found at ip and port: localhost:54321. Trying to start local jar...


JVM stdout: c:\users\kevin\appdata\local\temp\tmpvlbjd1\h2o_Kevin_started_from_python.out
JVM stderr: c:\users\kevin\appdata\local\temp\tmp_0zyer\h2o_Kevin_started_from_python.err
Using ice_root: c:\users\kevin\appdata\local\temp\tmpxwtyyn


Java Version: java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)


Starting H2O JVM and connecting: . Connection successful!


0,1
H2O cluster uptime:,1 seconds 821 milliseconds
H2O cluster version:,3.7.0.3248
H2O cluster name:,H2O_started_from_python
H2O cluster total nodes:,1
H2O cluster total memory:,1.78 GB
H2O cluster total cores:,8
H2O cluster allowed cores:,8
H2O cluster healthy:,True
H2O Connection ip:,127.0.0.1
H2O Connection port:,54321


To learn more about the h2o package itself, we can use Python's builtin help() function.

In [3]:
help(h2o)

Help on package h2o:

NAME
    h2o

FILE
    d:\anaconda\lib\site-packages\h2o\__init__.py

DESCRIPTION
    The H2O Python Module
    
    This module provides access to the H2O JVM, as well as its extensions, objects,
    machine-learning algorithms, and modeling support capabilities, such as basic
    munging and feature generation.
    
    The H2O JVM uses a web server so that all communication occurs on a socket (specified
    by an IP address and a port) via a series of REST calls (see connection.py for the REST
    layer implementation and details). There is a single active connection to the H2O JVM at
    any time, and this handle is stashed out of sight in a singleton instance of
    :class:`H2OConnection` (this is the global  :envvar:`__H2OConn__`). In other words,
    this package does not rely on Jython, and there is no direct manipulation of the JVM.
    
    The H2O python module is not intended as a replacement for other popular machine learning
    frameworks such as sc

help() can also be used on H2O functions and models. Jupyter's built-in Shift-Tab functionality works as well.

In [4]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
help(H2OGradientBoostingEstimator)
help(h2o.import_file)

Help on class H2OGradientBoostingEstimator in module h2o.estimators.gbm:

class H2OGradientBoostingEstimator(h2o.estimators.estimator_base.H2OEstimator)
 |  Builds gradient boosted classification trees, and gradient boosted regression trees on
 |  a parsed data set. The default distribution function will guess the model type based on
 |  the response column type run properly the response column must be an numeric for
 |  "gaussian" or an enum for "bernoulli" or "multinomial".
 |  
 |  Parameters
 |  ----------
 |  model_id : str, optional
 |    The unique id assigned to the resulting model. If none is given, an id will
 |    automatically be generated.
 |  distribution : str
 |     The distribution function of the response. Must be "AUTO", "bernoulli",
 |     "multinomial", "poisson", "gamma", "tweedie" or "gaussian"
 |  tweedie_power : float
 |    Tweedie power (only for Tweedie distribution, must be between 1 and 2)
 |  ntrees : int
 |    A non-negative integer that determines the nu

##H2O GBM and RF

While H2O Gradient Boosting Methods and H2O Random Forest have many flexible parameter options, they were designed to be just as easy to use as the other supervised training methods in H2O. Early stopping and automatic handling of categorical variables and missing values reduce the number of parameters the user has to specify. Often, it's just the number of trees, the maximum tree depth and, for GBM, the learning rate. 

###Getting started

We begin by importing our data into H2OFrames, which operate similarly in function to pandas DataFrames but exist on the H2O cloud itself.  

In this case, the H2O cluster is running on our laptops. Data files are imported by their relative locations to this notebook.

In [5]:
covtype_df = h2o.import_file("../data/covtype.full.csv")


Parse Progress: [##################################################] 100%


We import the full covertype dataset (581k rows, 13 columns, 10 numerical, 3 categorical) and then split the data 3 ways:  
  
60% for training  
20% for validation (hyper parameter tuning)  
20% for final testing  

We will train a model on one set and use the others to test the validity of the model by ensuring that it can predict accurately on data the model has not been shown.  

The second set will be used for validation most of the time.  

The third set will be withheld until the end, to ensure that our validation accuracy is consistent with data we have never seen during the iterative process. 

In [6]:
#split the data as described above
train, valid, test = covtype_df.split_frame([0.6, 0.2], seed=1234)

#Prepare predictors and response columns
covtype_X = covtype_df.col_names[:-1]     #last column is Cover_Type, our desired response variable 
covtype_y = covtype_df.col_names[-1]    
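
To make the 60/20/20 split concrete, here is a minimal pure-Python sketch of a random three-way split. It assumes each row is assigned by one uniform random draw, which is why the resulting sizes are only approximately 60/20/20. The helper `split_rows` is hypothetical and is not part of the H2O API.

```python
import random

def split_rows(n_rows, ratios, seed):
    """Assign each row index to a split by drawing one uniform number per row.

    `ratios` lists the fractions for all splits except the last, which
    receives the remainder (0.6 and 0.2 leave 0.2 for the test set).
    """
    rng = random.Random(seed)
    # Cumulative cut points, e.g. roughly [0.6, 0.8] for ratios [0.6, 0.2].
    cuts = []
    total = 0.0
    for r in ratios:
        total += r
        cuts.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        # The number of cut points at or below u is the index of the split.
        k = sum(1 for c in cuts if u >= c)
        splits[k].append(i)
    return splits

train_idx, valid_idx, test_idx = split_rows(10_000, [0.6, 0.2], seed=1234)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Fixing the seed, as we do in `split_frame` above, makes the assignment reproducible from run to run.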

###The First Random Forest
We build our first model with the following parameters:

**model_id:** Not required, but allows us to easily find our model in the [Flow](http://localhost:54321/) interface  
**ntrees:** Maximum number of trees used by the random forest. Default value is 50. We can afford to increase this, as our early-stopping criterion will decide when the random forest is sufficiently accurate.  
**stopping_rounds:** Stopping criterion described above. Stops fitting new trees when 2-tree rolling average is within 0.001 (default) of the two prior rolling averages. Can be thought of as a convergence setting.  
**score_each_iteration:** predict against training and validation for each tree. Default will skip several.  
**seed:** set the randomization seed so we can reproduce results


In [7]:
rf_v1 = H2ORandomForestEstimator(
    model_id="rf_covType_v1",
    ntrees=200,
    stopping_rounds=2,
    score_each_iteration=True,
    seed=1000000)
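
The stopping criterion described above can be sketched in a few lines of plain Python. This is a simplified illustration and not H2O's actual implementation: the function names are hypothetical, the metric is assumed to be one where lower is better, and the tolerance is assumed to be applied as a relative improvement.

```python
def moving_averages(values, window):
    """Simple moving averages of `values` over `window` consecutive entries."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def should_stop(metric_history, stopping_rounds=2, stopping_tolerance=0.001):
    """Return True once the metric has stopped improving (lower is better).

    Simplified rule: compare the best of the last `stopping_rounds` moving
    averages with the moving average just before them, and stop when the
    relative improvement is below `stopping_tolerance`.
    """
    averages = moving_averages(metric_history, stopping_rounds)
    if len(averages) < stopping_rounds + 1:
        return False  # not enough scoring events yet
    reference = averages[-(stopping_rounds + 1)]
    best_recent = min(averages[-stopping_rounds:])
    return best_recent >= reference * (1 - stopping_tolerance)

# A metric that is still falling quickly, then one that has flattened out.
# Both series are invented for illustration only.
print(should_stop([1.0, 0.8, 0.6, 0.5, 0.45]))  # False: still improving
print(should_stop([0.5, 0.4, 0.4, 0.4, 0.4]))   # True: no further gain
```

Because the rule looks at averages rather than single scores, one noisy tree is less likely to end training prematurely.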

###Model Construction
H2O in Python is designed to be very similar in look and feel to scikit-learn. Models are initialized individually with desired or default parameters and then trained on data.  

**Note that the example below uses model.train() as opposed to the traditional model.fit()**  
This is because h2o-py takes the names (or indices) of the feature and response columns AND the whole data frame, while scikit-learn takes in a feature frame and a response frame.

H2O supports model.fit() so that it can be incorporated into a scikit-learn pipeline, but we advise using train() in all other cases.

In [8]:
rf_v1.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)


drf Model Build Progress: [##################################################] 100%


Note that the progress bar does not behave linearly. H2O initially estimates completion time based on the number of trees specified. However, early stopping can end training once the model has converged, in which case the bar jumps to 100%.

We can view information about the model in [Flow](http://localhost:54321/) or within Python. To find more information in Flow, enter `getModel "rf_covType_v1"` into a cell and run it in place by pressing Ctrl-Enter. Alternatively, you can click on the Models tab, select List All Models, and click on the model named "rf_covType_v1" as specified in our model construction above.

In Python, we can run *rf_v1.summary()* to get some basic stats

In [9]:
rf_v1.summary()


Model Summary:


0,1,2,3,4,5,6,7,8
,number_of_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,154.0,8685760.0,18.0,20.0,19.948051,481.0,14114.0,4856.169


To look at validation statistics, we can use the scoring history function.

In [10]:
rf_v1.score_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_MSE,training_logloss,training_classification_error,validation_MSE,validation_logloss,validation_classification_error
0,,2015-11-07 00:54:28,2.167 sec,1,0.157191,3.773954,0.160639,0.157481,3.773075,0.165623
1,,2015-11-07 00:54:30,3.628 sec,2,0.144252,3.218432,0.150622,0.094394,1.061510,0.113763
2,,2015-11-07 00:54:31,4.893 sec,3,0.133442,2.870879,0.136207,0.077862,0.535113,0.091070
3,,2015-11-07 00:54:32,6.256 sec,4,0.119160,2.313463,0.126116,0.072011,0.379240,0.082469
4,,2015-11-07 00:54:34,7.687 sec,5,0.109254,1.930477,0.117807,0.068963,0.316427,0.077975
5,,2015-11-07 00:54:35,9.200 sec,6,0.100706,1.556173,0.110796,0.068007,0.289077,0.076465
6,,2015-11-07 00:54:37,10.778 sec,7,0.094270,1.267866,0.105162,0.067546,0.275304,0.075556
7,,2015-11-07 00:54:38,12.436 sec,8,0.089282,1.057529,0.100625,0.067060,0.267718,0.074733
8,,2015-11-07 00:54:40,14.195 sec,9,0.084504,0.888560,0.095604,0.065910,0.257667,0.072966
9,,2015-11-07 00:54:42,16.027 sec,10,0.080621,0.757980,0.091451,0.064802,0.250031,0.071877


Here we can see the hit ratio table.

In [11]:
rf_v1.hit_ratio_table(valid=True)


Top-7 Hit Ratios:


0,1
k,hit_ratio
1,0.9315878
2,1.0
3,0.9995283
4,0.9997427
5,1.0
6,1.0
7,1.0
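
For readers unfamiliar with hit ratios, the following is a minimal sketch of how a top-k hit ratio can be computed from per-class probabilities. The function `hit_ratio` and the toy data are hypothetical and only illustrate the definition; H2O computes its table internally.

```python
def hit_ratio(probabilities, actual, k):
    """Fraction of rows whose actual class is among the k most probable classes.

    `probabilities` is a list of dicts mapping class label -> predicted
    probability; `actual` is the list of true labels.
    """
    hits = 0
    for row, truth in zip(probabilities, actual):
        # Rank class labels from most to least probable.
        ranked = sorted(row, key=row.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(actual)

# Toy data, invented for illustration only.
probs = [
    {"class_1": 0.7, "class_2": 0.2, "class_3": 0.1},
    {"class_1": 0.4, "class_2": 0.5, "class_3": 0.1},
    {"class_1": 0.2, "class_2": 0.3, "class_3": 0.5},
    {"class_1": 0.1, "class_2": 0.6, "class_3": 0.3},
]
truth = ["class_1", "class_1", "class_2", "class_3"]
for k in (1, 2, 3):
    print(k, hit_ratio(probs, truth, k))
```

With k=1 the hit ratio is the ordinary classification accuracy, and by this definition it can never decrease as k grows.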




###Now for GBM

First we will use all default settings, then make some changes to improve our predictions.

In [12]:
gbm_v1 = H2OGradientBoostingEstimator(
    model_id="gbm_covType_v1",
    seed=2000000
)
gbm_v1.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)


gbm Model Build Progress: [##################################################] 100%


In [13]:
gbm_v1.score_history()


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_MSE,training_logloss,training_classification_error,validation_MSE,validation_logloss,validation_classification_error
0,,2015-11-07 00:55:22,1.454 sec,1,0.646936,1.640322,0.264529,0.647347,1.641737,0.268151
1,,2015-11-07 00:55:23,2.584 sec,2,0.57581,1.441639,0.259892,0.576546,1.443982,0.263589
2,,2015-11-07 00:55:24,3.624 sec,3,0.516175,1.295177,0.258498,0.517202,1.298256,0.26244
3,,2015-11-07 00:55:29,8.088 sec,8,0.331017,0.897133,0.247464,0.332998,0.902404,0.251797
4,,2015-11-07 00:55:34,13.860 sec,14,0.240376,0.703644,0.239162,0.242904,0.710344,0.2431
5,,2015-11-07 00:55:43,22.080 sec,23,0.190174,0.583091,0.224983,0.193166,0.59127,0.229979
6,,2015-11-07 00:55:55,34.799 sec,37,0.161117,0.505595,0.205317,0.164692,0.515542,0.211883
7,,2015-11-07 00:56:08,47.474 sec,50,0.148453,0.470053,0.192291,0.15237,0.481191,0.199439


In [14]:
gbm_v1.hit_ratio_table(valid=True)


Top-7 Hit Ratios:


0,1
k,hit_ratio
1,0.8005609
2,0.982719
3,0.9975644
4,0.9995369
5,0.9999914
6,1.0
7,1.0




This default GBM is much worse than our original random forest.  


The GBM is far from converging, so there are three primary knobs to adjust to get our performance up if we want to keep a similar run time.  

1: Adding trees will help. The default is 50.  
2: Increasing the learning rate will also help. The contribution of each tree will be stronger, so the model will move further away from the overall mean.  
3: Increasing the depth will help. This is the parameter that is the least straightforward. Tuning trees and learning rate both have direct impact that is easy to understand. Changing the depth means you are adjusting the "weakness" of each learner. Adding depth makes each tree fit the data more closely.  
  
The first configuration will attack depth the most, since we've seen the random forest focus on a continuous variable (elevation) and 40-class factor (soil type) the most.  

Also we will take a look at how to review a model while it is running.  
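
To see why the first two knobs matter for a model that has not yet converged, here is a toy gradient boosting loop in plain Python. It is a sketch under strong assumptions (squared-error loss, depth-1 trees, a tiny invented dataset) and is not how H2O builds its trees. The names `fit_stump` and `boosted_mse` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the depth-1 tree (one split) that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def boosted_mse(xs, ys, ntrees, learn_rate):
    """Training MSE after `ntrees` rounds of squared-error boosting."""
    mean = sum(ys) / len(ys)
    preds = [mean] * len(ys)  # start from the overall mean
    for _ in range(ntrees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Each tree's contribution is scaled by the learning rate.
        preds = [p + learn_rate * stump(x) for x, p in zip(xs, preds)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy step-shaped data, invented for illustration only.
xs = list(range(8))
ys = [0.0] * 4 + [1.0] * 4
for ntrees, learn_rate in [(10, 0.1), (20, 0.1), (10, 0.2)]:
    print(ntrees, learn_rate, boosted_mse(xs, ys, ntrees, learn_rate))
```

On this toy data, both doubling the number of trees and doubling the learning rate lower the training error compared with 10 trees at a rate of 0.1. Training error says nothing about generalization, which is why we keep judging the real models on the validation frame.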

###GBM Round 2

Let's do the following:

1. decrease the number of trees to speed up runtime (from default 50 to 20)
2. increase the learning rate (from default 0.1 to 0.2)
3. increase the depth (from default 5 to 10)

In [15]:
gbm_v2 = H2OGradientBoostingEstimator(
    ntrees=20,
    learn_rate=0.2,
    max_depth=10,
    stopping_tolerance=0.01, #10-fold increase in threshold as defined in rf_v1
    stopping_rounds=2,
    score_each_iteration=True,
    model_id="gbm_covType_v2",
    seed=2000000
)
gbm_v2.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)


gbm Model Build Progress: [##################################################] 100%


###Live Performance Monitoring

While this is running, we can actually look at the model. To do this we simply need a new connection to H2O. 

This Python notebook will run the model, so we need either another notebook or the web browser (or R, etc.). In this demo, we will use [Flow](http://localhost:54321) in our web browser and focus on model performance, since we are using Python to control H2O. 

In [16]:
gbm_v2.hit_ratio_table(valid=True)


Top-7 Hit Ratios:


0,1
k,hit_ratio
1,0.9
2,0.995326
3,1.0
4,1.0
5,0.9999914
6,1.0
7,1.0




This has moved us in the right direction, but accuracy is still lower than the random forest's.  

It still has yet to converge, so we can make it more aggressive.  

We can now add the stochastic nature of random forest into the GBM using some of the new H2O settings. This will help generalize and also provide a quicker runtime, so we can add a few more trees.

### GBM: Third Time is the Charm

1. Add a few trees (from 20 to 30)
2. Increase learning rate (to 0.3)
3. Use a random 70% of rows to fit each tree
4. Use a random 70% of columns to fit each tree
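
As a rough picture of what row and column sampling means, the sketch below draws a fresh 70% subset of rows and of columns for each tree, following the description in the list above. It assumes sampling without replacement. H2O's internal sampling may differ in detail, and `sample_for_tree` and the column names are hypothetical.

```python
import random

def sample_for_tree(n_rows, columns, sample_rate, col_sample_rate, rng):
    """Pick the rows and columns that one tree is allowed to see."""
    n_sampled_rows = round(n_rows * sample_rate)
    n_sampled_cols = max(1, round(len(columns) * col_sample_rate))
    rows = sorted(rng.sample(range(n_rows), n_sampled_rows))
    cols = rng.sample(columns, n_sampled_cols)
    return rows, cols

rng = random.Random(2000000)
columns = [f"x{i}" for i in range(12)]  # hypothetical predictor names
for tree in range(3):
    rows, cols = sample_for_tree(1000, columns, 0.7, 0.7, rng)
    print(tree, len(rows), sorted(cols))
```

Each tree sees a different subset, which is the source of the extra randomness, and each tree also has less data to process.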

In [17]:
gbm_v3 = H2OGradientBoostingEstimator(
    ntrees=30,
    learn_rate=0.3,
    max_depth=10,
    sample_rate=0.7,
    col_sample_rate=0.7,
    stopping_rounds=2,
    stopping_tolerance=0.01, #10-fold increase in threshold as defined in rf_v1
    score_each_iteration=True,
    model_id="gbm_covType_v3",
    seed=2000000
)
gbm_v3.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)


gbm Model Build Progress: [##################################################] 100%


In [18]:
gbm_v3.hit_ratio_table(valid=True)


Top-7 Hit Ratios:


0,1
k,hit_ratio
1,0.9366477
2,0.9970841
3,0.9996055
4,0.9999142
5,1.0
6,1.0
7,1.0




###Parity

Now the GBM is close to the initial random forest.

However, we used a default random forest. Random forest's primary strength is how well it runs with standard parameters, and while there are only a few parameters to tune, we can experiment with those to see if it will make a difference.  

The main parameters to tune are the tree depth and the mtries, which is the number of predictors to use.  

The default depth of trees is 20. It is common to increase this number, to the point that in some implementations, the depth is unlimited. We will increase ours from 20 to 30.  

Note that the default mtries depends on whether classification or regression is being run. The default for classification is the square root of the number of columns. The default for regression is one-third of the columns.  
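
As a quick worked example, the sketch below applies the usual random forest convention (square root of the number of predictors for classification, one third for regression) to our 12 predictors. Rounding down is an assumption on our part, and `default_mtries` is a hypothetical helper, not an H2O function.

```python
import math

def default_mtries(n_predictors, classification):
    """Default number of predictors tried at each split (rounded down)."""
    if classification:
        return max(1, math.isqrt(n_predictors))
    return max(1, n_predictors // 3)

# The covertype frame has 13 columns, 12 of which are predictors.
print(default_mtries(12, classification=True))   # 3
print(default_mtries(12, classification=False))  # 4
```

Since Cover_Type is categorical, our forests fall in the classification case.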

###Random Forest #2

In [19]:
rf_v2 = H2ORandomForestEstimator(
    model_id="rf_covType_v2",
    ntrees=200,
    max_depth=30,
    stopping_rounds=2,
    stopping_tolerance=0.01,
    score_each_iteration=True,
    seed=3000000)
rf_v2.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)


drf Model Build Progress: [##################################################] 100%


In [20]:
rf_v2.hit_ratio_table(valid=True)


Top-7 Hit Ratios:


0,1
k,hit_ratio
1,0.9532255
2,1.0
3,0.9988937
4,1.0
5,1.0
6,1.0
7,1.0




###Final Predictions

Now that we have our validation accuracy up beyond 95%, we can start considering our test data.  
We have withheld an extra test set to ensure that after all the parameter tuning we have repeatedly applied with the validation data, we still have a completely pristine data set upon which to test the predictive capacity of our model.

In [21]:
#Excludes the "Cover_Type" column from the features provided
final_rf_predictions = rf_v2.predict(test[:-1])

Technically, our model won't look at the ["Cover_Type"] column within the test data, as it is trained on a set of features not including "Cover_Type". It is up to the user whether to include it in the test frame provided for predictions, as it has no effect whatsoever.

Let's take a peek at the first few rows of predictions returned by our model.

In [22]:
final_rf_predictions

predict,class_1,class_2,class_3,class_4,class_5,class_6,class_7
class_2,0.0,1.0,0.0,0,0.0,0.0,0.0
class_1,0.999849,4.31602e-05,0.0,0,0.0,0.000107533,0.0
class_1,0.637334,0.362666,0.0,0,0.0,0.0,0.0
class_1,0.999184,0.0,0.0,0,0.0,0.000815898,0.0
class_2,0.0,1.0,0.0,0,0.0,0.0,0.0
class_2,0.0369924,0.961802,0.0,0,0.00120583,0.0,0.0
class_2,0.0247397,0.975204,0.0,0,0.0,5.60849e-05,0.0
class_7,0.132196,0.0,0.0,0,0.0,0.0,0.867804
class_2,0.289029,0.692452,0.0145395,0,0.0039251,5.44396e-05,0.0
class_2,0.0333318,0.93329,0.0,0,0.0333318,4.62299e-05,0.0




Let's compare the accuracy of these predictions to the accuracy we got from our experimentation.

In [23]:
#validation set accuracy
rf_v2.hit_ratio_table(valid=True)


Top-7 Hit Ratios:


0,1
k,hit_ratio
1,0.9532255
2,1.0
3,0.9988937
4,1.0
5,1.0
6,1.0
7,1.0




In [24]:
#test set accuracy
(final_rf_predictions['predict']==test['Cover_Type']).as_data_frame(use_pandas=True).mean()

predict    0.952841
dtype: float64
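
The test-set accuracy above is simply the share of rows where the predicted label matches the actual label. A minimal pure-Python equivalent, using invented labels and a hypothetical `accuracy` helper, looks like this:

```python
def accuracy(predicted, actual):
    """Share of rows where the predicted label equals the actual label."""
    matches = [p == a for p, a in zip(predicted, actual)]
    return sum(matches) / len(matches)

# Toy labels, invented for illustration only.
predicted = ["class_2", "class_1", "class_1", "class_7", "class_2"]
actual = ["class_2", "class_1", "class_2", "class_7", "class_2"]
print(accuracy(predicted, actual))  # 0.8
```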

Our final error rates are very similar between validation and test sets. This suggests that we did not overfit the validation set during our experimentation. This concludes our demo of H2O GBM and H2O Random Forests.


###Shut down the cluster
Shut down the cluster now that we are done using it.

In [25]:
h2o.shutdown(prompt=False)

###Possible Further Steps

Model-agnostic gains can be found in improving handling of categorical features. We could experiment with the nbins and nbins_cats settings to control the H2O splitting. The general guidance is to lower the number to increase generalization (avoid overfitting), and to raise it to fit the distribution more closely.  
 
A good example of adjusting this value is for nbins_cats to be increased to match the number of values in a category. Though usually unnecessary, this can improve performance if a problem has a very important categorical predictor.  


With regard to our Random Forest, we could further experiment with deeper trees or a higher percentage of columns used (mtries).  

The GBM can be set to converge more slowly for optimal accuracy. If we were to relax our runtime requirements a little bit, we could balance the learn rate and number of trees used.  

In a production setting where fine-grained accuracy is beneficial, it is common to set the learn rate to a very small number, such as 0.01 or smaller, and add trees to match.  

Early stopping is very powerful here: it allows setting a low learning rate and building as many trees as needed until the desired convergence is met.