# Introduction
This tutorial shows how H2O [Gradient Boosted Models](https://en.wikipedia.org/wiki/Gradient_boosting) and [Random Forest](https://en.wikipedia.org/wiki/Random_forest) models can be used for supervised classification and regression. It covers usage of H2O from Python; an R version will be available in a separate document. This file is available in plain R, R markdown, regular markdown, plain Python and iPython Notebook formats. More examples and explanations can be found in our [H2O GBM booklet](http://h2o.ai/resources/) and on our [H2O GitHub Repository](http://github.com/h2oai/h2o-3/).


## Task: Predicting forest cover type from cartographic variables only

The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) data. We are using the UC Irvine Covertype dataset.

### H2O Python Module

Load the H2O Python module.

In [1]:
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import pandas as pd
import logging
import csv
import optparse
import time
import json
import psutil
import numpy as np

In [2]:
pct_memory=0.5
virtual_memory=psutil.virtual_memory()
min_mem_size=int(round(int(pct_memory*virtual_memory.available)/1073741824,0))
print(min_mem_size)

2


In [3]:
# Pick a random port for the local H2O server (65535 is the highest valid port number)
port_no=random.randint(5555,55555)
h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no) # start h2o

Checking whether there is an H2O instance running at http://localhost:32448..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_121"; OpenJDK Runtime Environment (Zulu 8.20.0.5-macosx) (build 1.8.0_121-b15); OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-macosx) (build 25.121-b15, mixed mode)
  Starting server from /Users/bear/anaconda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpw8b_7nb8
  JVM stdout: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpw8b_7nb8/h2o_bear_started_from_python.out
  JVM stderr: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpw8b_7nb8/h2o_bear_started_from_python.err
  Server is running at http://127.0.0.1:32448
Connecting to H2O server at http://127.0.0.1:32448... successful.


0,1
H2O cluster uptime:,01 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.1.3
H2O cluster version age:,"14 days, 19 hours and 19 minutes"
H2O cluster name:,H2O_from_python_bear_bqt8w0
H2O cluster total nodes:,1
H2O cluster free memory:,3.556 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


To learn more about the h2o package itself, we can use Python's built-in help() function.

In [4]:
help(h2o)

Help on package h2o:

NAME
    h2o - :mod:`h2o` -- module for using H2O services.

DESCRIPTION
    (please add description).

PACKAGE CONTENTS
    assembly
    astfun
    auth
    automl (package)
    backend (package)
    cross_validation
    demos
    display
    estimators (package)
    exceptions
    expr
    expr_optimizer
    frame
    grid (package)
    group_by
    h2o
    job
    model (package)
    schemas (package)
    targetencoder
    transforms (package)
    tree (package)
    two_dim_table
    utils (package)

FUNCTIONS
    api(endpoint, data=None, json=None, filename=None, save_to=None)
        Perform a REST API request to a previously connected server.
        
        This function is mostly for internal purposes, but may occasionally be useful for direct access to
        the backend H2O server. It has same parameters as :meth:`H2OConnection.request <h2o.backend.H2OConnection.request>`.
    
    as_list(data, use_pandas=True, header=True)
        Convert an H2O data




help() can also be used on H2O functions and models. Jupyter's built-in shift-tab functionality works as well.

In [5]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
help(H2OGradientBoostingEstimator)
help(h2o.import_file)

Help on class H2OGradientBoostingEstimator in module h2o.estimators.gbm:

class H2OGradientBoostingEstimator(h2o.estimators.estimator_base.H2OEstimator)
 |  Gradient Boosting Machine
 |  
 |  Builds gradient boosted trees on a parsed data set, for regression or classification.
 |  The default distribution function will guess the model type based on the response column type.
 |  Otherwise, the response column must be an enum for "bernoulli" or "multinomial", and numeric
 |  for all other distributions.
 |  
 |  Method resolution order:
 |      H2OGradientBoostingEstimator
 |      h2o.estimators.estimator_base.H2OEstimator
 |      h2o.model.model_base.ModelBase
 |      h2o.utils.backward_compatibility.BackwardsCompatibleBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, **kwargs)
 |      Construct a new model instance.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  balance_classe


Help on function import_file in module h2o.h2o:

import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None, skipped_columns=None)
    Import a dataset that is already on the cluster.
    
    The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster
    cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed
    multi-threaded pull of the data. The main difference between this method and :func:`upload_file` is that
    the latter works with local files, whereas this method imports remote files (i.e. files local to the server).
    If you running H2O server on your own maching, then both methods behave the same.
    
    :param path: path(s) specifying the location of the data to import or a path to a directory of files to import
    :param destination_frame: The unique hex key assigned to the imported file. I

## H2O GBM and RF

While H2O Gradient Boosting Models and H2O Random Forest have many flexible parameter options, they were designed to be just as easy to use as the other supervised training methods in H2O. Early stopping and automatic handling of categorical variables and missing values reduce the number of parameters the user has to specify. Often, it's just the number of trees, the tree depth and, for GBM, the learning rate and maybe some row or column sampling.

### Getting started

We begin by importing our data into H2OFrames, which behave much like pandas DataFrames but live on the H2O cluster itself.  

In this case, the H2O cluster is running on our laptops. Data files are imported by their locations relative to this notebook.

In [6]:
import os
covtype_df = h2o.import_file(os.path.realpath("data/covtype.full.csv"))

Parse progress: |█████████████████████████████████████████████████████████| 100%


We import the full covertype dataset (581k rows, 13 columns, 10 numerical, 3 categorical) and then split the data 3 ways:  
  
60% for training  
20% for validation (hyper parameter tuning)  
20% for final testing  

We will train a model on the first set and use the others to check the validity of the model by ensuring that it can predict accurately on data it has not been shown.  

The second set will be used for validation most of the time.  

The third set will be withheld until the end, to ensure that our validation accuracy is consistent with data we have never seen during the iterative process.

In [7]:
#split the data as described above
train, valid, test = covtype_df.split_frame([0.6, 0.2], seed=1234)

#Prepare predictors and response columns
covtype_X = covtype_df.col_names[:-1]     #last column is Cover_Type, our desired response variable 
covtype_y = covtype_df.col_names[-1]    
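
H2O's documentation notes that `split_frame` uses a probabilistic method, so the resulting partitions are close to, but not exactly, the requested ratios. The following is a minimal pure-Python sketch of that idea, not H2O's implementation; `split_indices` is a hypothetical helper that assigns each row to a partition by comparing a uniform random draw against cumulative thresholds.

```python
import random

def split_indices(n_rows, ratios, seed):
    """Assign row indices to len(ratios) + 1 partitions at random.

    Each row draws a uniform number in [0, 1) and lands in the first
    partition whose cumulative ratio exceeds it; rows past the last
    threshold fall into the final (remainder) partition.
    """
    rng = random.Random(seed)
    thresholds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        thresholds.append(total)
    parts = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for j, threshold in enumerate(thresholds):
            if u < threshold:
                parts[j].append(i)
                break
        else:
            parts[-1].append(i)
    return parts

train_idx, valid_idx, test_idx = split_indices(100000, [0.6, 0.2], seed=1234)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because each row is assigned independently, the partition sizes fluctuate slightly around 60/20/20, which is why the row counts in the model reports are not exact multiples of the requested ratios.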

### The First Random Forest
We build our first model with the following parameters

**model_id:** Not required, but allows us to easily find our model in the [Flow](http://localhost:54321/) interface  
**ntrees:** Maximum number of trees used by the random forest. Default value is 50. We can afford to increase this, as our early-stopping criterion will decide when the random forest is sufficiently accurate.  
**stopping_rounds:** Stopping criterion described above. Stops fitting new trees when 2-tree rolling average is within 0.001 (default) of the two prior rolling averages. Can be thought of as a convergence setting.  
**score_each_iteration:** Score against the training and validation data after each tree. By default, H2O skips several scoring events.  
**seed:** set the randomization seed so we can reproduce results


In [8]:
rf_v1 = H2ORandomForestEstimator(
    model_id="rf_covType_v1",
    ntrees=200,
    stopping_rounds=2,
    score_each_iteration=True,
    seed=1000000)
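
To make the stopping criterion more concrete, here is a simplified sketch of a moving-average convergence check. It is an illustration under stated assumptions, not H2O's actual code: the metric is assumed to be one where lower is better (such as logloss), and `moving_averages` and `should_stop` are hypothetical helper names.

```python
def moving_averages(values, k):
    """Simple moving averages of window length k over a list of values."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]

def should_stop(history, k=2, tolerance=0.001):
    """Return True when the metric has stopped improving.

    history   -- metric value per scoring event, lower is better
    k         -- plays the role of stopping_rounds
    tolerance -- plays the role of stopping_tolerance (relative improvement)
    """
    if len(history) < 2 * k:
        return False  # not enough scoring events yet
    averages = moving_averages(history[-2 * k:], k)  # k + 1 averages
    reference = averages[0]          # oldest moving average
    best_recent = min(averages[1:])  # best of the k most recent averages
    # Stop unless the best recent average improved on the reference
    # by more than the relative tolerance.
    return best_recent >= (1.0 - tolerance) * reference

print(should_stop([2.44, 0.79, 0.46, 0.34]))  # still improving quickly
print(should_stop([0.2, 0.2, 0.2, 0.2]))      # flat, so converged
```

Averaging over a window rather than comparing single scores makes the check less sensitive to one noisy scoring event.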

### Model Construction
H2O in Python is designed to be very similar in look and feel to scikit-learn. Models are initialized individually with desired or default parameters and then trained on data.  

**Note that the example below uses model.train() as opposed to the traditional model.fit()**  
This is because h2o-py takes the column names (or indices) of the feature and response columns AND the whole data frame, while scikit-learn takes in a feature frame and a response frame.

H2O supports model.fit() so that it can be incorporated into a scikit-learn pipeline, but we advise using train() in all other cases.

In [9]:
rf_v1.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


Note that the progress bar does not behave linearly. H2O initially estimates completion time based on the number of trees specified. However, convergence can allow for early stops, in which case the bar jumps to 100%.

We can view information about the model in [Flow](http://localhost:54321/) or within Python. To find more information in Flow, enter `getModel "rf_covType_v1"` into a cell and run it in place by pressing Ctrl-Enter. Alternatively, you can click on the Models tab, select List All Models, and click on the model named "rf_covType_v1" as specified in our model construction above.

In Python, we can call the model itself to get an overview of its stats.

In [10]:
rf_v1

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  rf_covType_v1


ModelMetricsMultinomial: drf
** Reported on train data. **

MSE: 0.056076839859494784
RMSE: 0.23680548950456107
LogLoss: 0.2384329935890688
Mean Per-Class Error: 0.11102189346483596
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



0,1,2,3,4,5,6,7,8
class_1,class_2,class_3,class_4,class_5,class_6,class_7,Error,Rate
117176.0,9534.0,5.0,0.0,53.0,11.0,338.0,0.0782035,"9,941 / 127,117"
5414.0,164066.0,321.0,3.0,240.0,244.0,50.0,0.0368209,"6,272 / 170,338"
32.0,413.0,20370.0,93.0,22.0,512.0,0.0,0.0499953,"1,072 / 21,442"
0.0,32.0,178.0,1390.0,0.0,58.0,0.0,0.1616405,"268 / 1,658"
93.0,1386.0,63.0,0.0,4161.0,17.0,0.0,0.2725524,"1,559 / 5,720"
38.0,368.0,739.0,42.0,7.0,9239.0,0.0,0.1144446,"1,194 / 10,433"
709.0,70.0,0.0,0.0,2.0,0.0,11519.0,0.0634959,"781 / 12,300"
123462.0,175869.0,21676.0,1528.0,4485.0,10081.0,11907.0,0.0604198,"21,087 / 349,008"


Top-7 Hit Ratios: 


0,1
k,hit_ratio
1,0.9395802
2,0.9962408
3,0.9982007
4,0.9982522
5,0.9982522
6,0.9982522
7,1.0



ModelMetricsMultinomial: drf
** Reported on validation data. **

MSE: 0.053141408943200595
RMSE: 0.23052420467968346
LogLoss: 0.20030408050020734
Mean Per-Class Error: 0.10251306792333845
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



0,1,2,3,4,5,6,7,8
class_1,class_2,class_3,class_4,class_5,class_6,class_7,Error,Rate
39403.0,2998.0,0.0,0.0,15.0,2.0,82.0,0.0728706,"3,097 / 42,500"
1589.0,54529.0,104.0,0.0,83.0,60.0,15.0,0.0328308,"1,851 / 56,380"
0.0,131.0,6844.0,30.0,3.0,135.0,0.0,0.0418592,"299 / 7,143"
1.0,1.0,61.0,479.0,0.0,20.0,0.0,0.1476868,83 / 562
29.0,432.0,24.0,0.0,1377.0,8.0,0.0,0.2636364,"493 / 1,870"
0.0,129.0,212.0,19.0,3.0,3101.0,0.0,0.1047921,"363 / 3,464"
204.0,16.0,0.0,0.0,1.0,0.0,3878.0,0.0539156,"221 / 4,099"
41226.0,58236.0,7245.0,528.0,1482.0,3326.0,3975.0,0.0552242,"6,407 / 116,018"


Top-7 Hit Ratios: 


0,1
k,hit_ratio
1,0.9447758
2,0.9978452
3,0.9996811
4,0.9997845
5,0.9997932
6,0.9997932
7,1.0


Scoring History: 


0,1,2,3,4,5,6,7,8,9
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,validation_rmse,validation_logloss,validation_classification_error
,2019-02-09 15:08:03,0.033 sec,0.0,,,,,,
,2019-02-09 15:08:05,2.084 sec,1.0,0.3356442,2.4967403,0.1239456,0.3342398,2.4447915,0.1272906
,2019-02-09 15:08:07,3.945 sec,2.0,0.3192478,2.1652719,0.1135361,0.2663761,0.7915252,0.0850127
,2019-02-09 15:08:08,5.272 sec,3.0,0.3061849,1.8191096,0.1055723,0.2506499,0.4602991,0.0723336
,2019-02-09 15:08:10,6.947 sec,4.0,0.2953465,1.5013414,0.0992534,0.2449232,0.3418711,0.0674033
---,---,---,---,---,---,---,---,---,---
,2019-02-09 15:08:53,50.229 sec,20.0,0.2385726,0.2613526,0.0620141,0.2307102,0.2016582,0.0560861
,2019-02-09 15:08:56,53.486 sec,21.0,0.2380320,0.2537838,0.0615603,0.2306266,0.2008930,0.0556638
,2019-02-09 15:09:00,57.085 sec,22.0,0.2373729,0.2481413,0.0611638,0.2303363,0.2003093,0.0554483



See the whole table with table.as_data_frame()
Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
Soil_Type,784024.0625000,1.0,0.2524228
Elevation,738730.6250000,0.9422295,0.2378402
Horizontal_Distance_To_Roadways,327317.3125000,0.4174838,0.1053824
Horizontal_Distance_To_Fire_Points,317908.1875000,0.4054827,0.1023531
Wilderness_Area,178298.4375000,0.2274145,0.0574046
Horizontal_Distance_To_Hydrology,159165.6250000,0.2030111,0.0512446
Vertical_Distance_To_Hydrology,134396.0781250,0.1714183,0.0432699
Aspect,106049.9843750,0.1352637,0.0341436
Hillshade_Noon,99759.4375000,0.1272403,0.0321183
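
The Mean Per-Class Error in the validation report above can be reproduced from its confusion matrix. The sketch below copies the validation counts from that report; the helper names `per_class_errors` and `overall_error` are ours, not part of the H2O API.

```python
# Validation confusion matrix copied from the report above
# (rows: actual class, columns: predicted class).
confusion = [
    [39403, 2998, 0, 0, 15, 2, 82],
    [1589, 54529, 104, 0, 83, 60, 15],
    [0, 131, 6844, 30, 3, 135, 0],
    [1, 1, 61, 479, 0, 20, 0],
    [29, 432, 24, 0, 1377, 8, 0],
    [0, 129, 212, 19, 3, 3101, 0],
    [204, 16, 0, 0, 1, 0, 3878],
]

def per_class_errors(matrix):
    """Fraction of each actual class (row) that was misclassified."""
    return [1.0 - row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_error(matrix):
    """Fraction of all rows that were misclassified."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return 1.0 - correct / total

errors = per_class_errors(confusion)
mean_per_class_error = sum(errors) / len(errors)
print(round(mean_per_class_error, 7))
print(round(overall_error(confusion), 7))
```

The mean per-class error (about 0.103) is noticeably higher than the overall error (about 0.055) because rare classes such as class_5, with an error near 0.26, count as much as the large classes in the unweighted mean.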




To look at validation statistics, we can use the scoring history function.

In [11]:
rf_v1.scoring_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,validation_rmse,validation_logloss,validation_classification_error
0,,2019-02-09 15:08:03,0.033 sec,0.0,,,,,,
1,,2019-02-09 15:08:05,2.084 sec,1.0,0.335644,2.49674,0.123946,0.33424,2.444792,0.127291
2,,2019-02-09 15:08:07,3.945 sec,2.0,0.319248,2.165272,0.113536,0.266376,0.791525,0.085013
3,,2019-02-09 15:08:08,5.272 sec,3.0,0.306185,1.81911,0.105572,0.25065,0.460299,0.072334
4,,2019-02-09 15:08:10,6.947 sec,4.0,0.295347,1.501341,0.099253,0.244923,0.341871,0.067403
5,,2019-02-09 15:08:11,8.449 sec,5.0,0.285458,1.259114,0.093016,0.240228,0.284286,0.063619
6,,2019-02-09 15:08:13,9.949 sec,6.0,0.276749,1.059652,0.087339,0.237109,0.253836,0.061215
7,,2019-02-09 15:08:15,11.637 sec,7.0,0.271068,0.889017,0.084058,0.237045,0.242014,0.060637
8,,2019-02-09 15:08:18,14.708 sec,8.0,0.264125,0.748301,0.079793,0.234717,0.228079,0.059456
9,,2019-02-09 15:08:19,16.463 sec,9.0,0.259892,0.640781,0.07688,0.234854,0.223989,0.058715


Here we can see the hit ratio table.

In [12]:
rf_v1.hit_ratio_table(valid=True)

Top-7 Hit Ratios: 


0,1
k,hit_ratio
1,0.9447758
2,0.9978452
3,0.9996811
4,0.9997845
5,0.9997932
6,0.9997932
7,1.0
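
The hit ratio for a given k is the fraction of rows whose actual class is among the k classes the model rates most probable, so k=1 is plain accuracy. Here is a small self-contained sketch of that calculation on made-up probabilities; `hit_ratio` and the toy data are illustrative only and do not come from the model above.

```python
def hit_ratio(probabilities, actual, k):
    """Fraction of rows whose actual class index is in the top-k predictions."""
    hits = 0
    for row, label in zip(probabilities, actual):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(actual)

# Toy example: four rows, three classes (values invented for illustration).
toy_probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
toy_actual = [0, 2, 0, 1]

for k in (1, 2, 3):
    print(k, hit_ratio(toy_probabilities, toy_actual, k))
```

With k equal to the number of classes the ratio is always 1.0, which matches the last row of the table above.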




### Now for GBM

First we will use all default settings, then make some changes to improve our predictions.

In [13]:
gbm_v1 = H2OGradientBoostingEstimator(
    model_id="gbm_covType_v1",
    seed=2000000
)
gbm_v1.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [14]:
gbm_v1.scoring_history()


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,validation_rmse,validation_logloss,validation_classification_error
0,,2019-02-09 15:09:12,0.039 sec,0.0,0.857143,1.94591,0.622472,0.857143,1.94591,0.621429
1,,2019-02-09 15:09:14,1.457 sec,1.0,0.80295,1.633685,0.259722,0.803065,1.634374,0.26233
2,,2019-02-09 15:09:14,2.130 sec,2.0,0.756429,1.432764,0.256757,0.756669,1.433968,0.258555
3,,2019-02-09 15:09:15,2.769 sec,3.0,0.715237,1.284885,0.252385,0.715665,1.286751,0.254857
4,,2019-02-09 15:09:15,3.398 sec,4.0,0.678832,1.170576,0.252316,0.67945,1.173003,0.25378
5,,2019-02-09 15:09:16,4.022 sec,5.0,0.646422,1.078163,0.249665,0.647176,1.080859,0.251159
6,,2019-02-09 15:09:20,8.169 sec,13.0,0.495028,0.713972,0.235509,0.496759,0.71839,0.237661
7,,2019-02-09 15:09:25,13.333 sec,22.0,0.435449,0.581374,0.222369,0.437986,0.587285,0.225612
8,,2019-02-09 15:09:31,19.307 sec,31.0,0.407285,0.518258,0.207879,0.410528,0.525352,0.211743
9,,2019-02-09 15:09:37,25.355 sec,40.0,0.390641,0.481536,0.195628,0.394511,0.489773,0.199633


In [15]:
gbm_v1.hit_ratio_table(valid=True)

Top-7 Hit Ratios: 


0,1
k,hit_ratio
1,0.8097192
2,0.983244
3,0.9980606
4,0.9996983
5,0.9999828
6,0.9999914
7,1.0




This default GBM is much worse than our original random forest.  


The GBM is far from converging, so there are three primary knobs to adjust to get our performance up if we want to keep a similar run time.  

1: Adding trees will help. The default is 50.  
2: Increasing the learning rate will also help. The contribution of each tree will be stronger, so the model will move further away from the overall mean.  
3: Increasing the depth will help. This is the parameter that is the least straightforward. Tuning trees and learning rate both have direct impact that is easy to understand. Changing the depth means you are adjusting the "weakness" of each learner. Adding depth makes each tree fit the data closer.  
  
The first configuration will attack depth the most, since we've seen the random forest focus on a continuous variable (elevation) and 40-class factor (soil type) the most.  

Also we will take a look at how to review a model while it is running.  

### GBM Round 2

Let's do the following:

1. decrease the number of trees to speed up runtime (from default 50 to 20)
2. increase the learning rate (from default 0.1 to 0.2)
3. increase the depth (from default 5 to 10)
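
To get a feel for how tree count and learning rate trade off, consider an idealized model in which every tree fits the current residual perfectly and then contributes only a `learn_rate` fraction of that fit. Real trees are far from perfect, so this is an intuition aid rather than a description of H2O's GBM; `remaining_fraction` is a hypothetical helper.

```python
def remaining_fraction(learn_rate, ntrees):
    """Fraction of the initial residual left after ntrees idealized boosting steps.

    Each step removes learn_rate times the current residual, so the
    result equals (1 - learn_rate) ** ntrees.
    """
    residual = 1.0
    for _ in range(ntrees):
        residual -= learn_rate * residual
    return residual

for learn_rate, ntrees in [(0.1, 50), (0.2, 20), (0.3, 30)]:
    print(learn_rate, ntrees, remaining_fraction(learn_rate, ntrees))
```

In this idealized setting, 20 trees at a learning rate of 0.2 leave more residual than 50 trees at 0.1, which is one reason the configuration above also increases the depth so that each tree does more work.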

In [16]:
gbm_v2 = H2OGradientBoostingEstimator(
    ntrees=20,
    learn_rate=0.2,
    max_depth=10,
    stopping_tolerance=0.01,  # 10-fold increase over the default threshold (0.001) used in rf_v1
    stopping_rounds=2,
    score_each_iteration=True,
    model_id="gbm_covType_v2",
    seed=2000000
)
gbm_v2.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


### Live Performance Monitoring

While this is running, we can actually look at the model. To do this we simply need a new connection to H2O. 

This Python notebook is busy running the model, so we need either another notebook or the web browser (or R, etc.). In this demo, we will use [Flow](http://localhost:54321) in our web browser and focus on model performance, since we are using Python to control H2O. If the cluster was started on a different port, as with the random port chosen in `h2o.init` above, use that port in the URL instead.

In [None]:
gbm_v2.hit_ratio_table(valid=True)

Top-7 Hit Ratios: 


0,1
k,hit_ratio
1,0.9176852
2,0.9969315
3,0.9998535
4,0.9999828
5,1.0
6,1.0
7,1.0




This has moved us in the right direction, but accuracy is still lower than the random forest's.  

The model has yet to converge, so we can make it more aggressive.  

We can now add the stochastic nature of random forest into the GBM using some of the new H2O settings. This will help generalize and also provide a quicker runtime, so we can add a few more trees.

### GBM: Third Time is the Charm

1. Add a few trees (from 20 to 30)
2. Increase learning rate (to 0.3)
3. Use a random 70% of rows to fit each tree
4. Use a random 70% of columns as candidates for each split
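
The sampling idea can be sketched in a few lines of plain Python. This is an illustration only: it draws a fixed-size subset without replacement, which is an assumption made for simplicity, and `sample_fraction` is a hypothetical helper. How often a fresh column subset is drawn depends on the parameter used (H2O also offers `col_sample_rate_per_tree`). The column names are taken from the variable importance table earlier in this tutorial.

```python
import random

def sample_fraction(items, rate, rng):
    """Return a random subset of about rate * len(items) items, without replacement."""
    k = max(1, round(rate * len(items)))
    return rng.sample(items, k)

rng = random.Random(2000000)

row_ids = list(range(10))
columns = [
    "Soil_Type", "Elevation", "Horizontal_Distance_To_Roadways",
    "Horizontal_Distance_To_Fire_Points", "Wilderness_Area",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Aspect", "Hillshade_Noon",
]

sampled_rows = sample_fraction(row_ids, 0.7, rng)
sampled_columns = sample_fraction(columns, 0.7, rng)
print(sorted(sampled_rows))
print(sampled_columns)
```

Because every tree sees a different subset, the trees are less correlated with one another, and each one is cheaper to build.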

In [None]:
gbm_v3 = H2OGradientBoostingEstimator(
    ntrees=30,
    learn_rate=0.3,
    max_depth=10,
    sample_rate=0.7,
    col_sample_rate=0.7,
    stopping_rounds=2,
    stopping_tolerance=0.01,  # 10-fold increase over the default threshold (0.001) used in rf_v1
    score_each_iteration=True,
    model_id="gbm_covType_v3",
    seed=2000000
)
gbm_v3.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |████

In [None]:
gbm_v3.hit_ratio_table(valid=True)

### Parity

Now the GBM is close to the initial random forest.

However, we used a default random forest. Random forest's primary strength is how well it runs with standard parameters, and while there are only a few parameters to tune, we can experiment with those to see if it will make a difference.  

The main parameters to tune are the tree depth and mtries, the number of predictors randomly sampled as split candidates.  

The default depth of trees is 20. It is common to increase this number, to the point that in some implementations, the depth is unlimited. We will increase ours from 20 to 30.  

Note that the default mtries depends on whether classification or regression is being run. The default for classification is the square root of the number of columns. The default for regression is one-third of the columns.  
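
H2O's documentation gives the default mtries as the square root of the number of predictors for classification and one-third of the predictors for regression. The sketch below works that out for our 12 predictor columns; rounding down and the floor of 1 are assumptions made here, and `default_mtries` is a hypothetical helper rather than an H2O function.

```python
import math

def default_mtries(n_predictors, classification):
    """Approximate default number of columns sampled per split.

    Classification: floor(sqrt(p)); regression: floor(p / 3).
    The rounding rule is an assumption for illustration.
    """
    if classification:
        return max(1, int(math.floor(math.sqrt(n_predictors))))
    return max(1, n_predictors // 3)

n_predictors = 12  # 13 columns minus the Cover_Type response
print(default_mtries(n_predictors, classification=True))
print(default_mtries(n_predictors, classification=False))
```

With so few columns considered per split by default, raising mtries is a natural experiment when one or two predictors dominate the variable importances.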

### Random Forest #2

In [None]:
rf_v2 = H2ORandomForestEstimator(
    model_id="rf_covType_v2",
    ntrees=200,
    max_depth=30,
    stopping_rounds=2,
    stopping_tolerance=0.01,
    score_each_iteration=True,
    seed=3000000)
rf_v2.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)

In [None]:
rf_v2.hit_ratio_table(valid=True)

### Final Predictions

Now that we have our validation accuracy up beyond 95%, we can start considering our test data.  
We have withheld an extra test set to ensure that after all the parameter tuning we have repeatedly applied with the validation data, we still have a completely pristine data set upon which to test the predictive capacity of our model.

In [None]:
#Excludes the "Cover_Type" column from the features provided
final_rf_predictions = rf_v2.predict(test[:-1])

Technically, our model won't look at the "Cover_Type" column within the test data, as it was trained on a set of features that does not include "Cover_Type". It is up to the user whether to include it in the test frame provided for predictions, as it has no effect whatsoever.

Let's take a peek at the first few rows of predictions returned by our model.

In [None]:
final_rf_predictions

Let's compare the accuracy of these predictions to the accuracy we got from our experimentation.

In [None]:
#validation set accuracy
rf_v2.hit_ratio_table(valid=True)

In [None]:
#test set accuracy
(final_rf_predictions['predict']==test['Cover_Type']).as_data_frame(use_pandas=True).mean()

Our final error rates are very similar between validation and test sets. This suggests that we did not overfit the validation set during our experimentation. This concludes our demo of H2O GBM and H2O Random Forests.


### Shut down the cluster
Shut down the cluster now that we are done using it.

In [None]:
h2o.cluster().shutdown(prompt=False)

### Possible Further Steps

Model-agnostic gains can be found in improving handling of categorical features. We could experiment with the nbins and nbins_cats settings to control the H2O splitting. The general guidance is to lower the number to increase generalization (avoid overfitting) and to raise it to better fit the distribution.  
 
A good example of adjusting this value is for nbins_cats to be increased to match the number of values in a category. Though usually unnecessary, this can improve performance if a problem has a very important categorical predictor.  


With regards to our Random Forest, we could further experiment with deeper trees or a higher percentage of columns used (mtries).  

The GBM can be set to converge more slowly for optimal accuracy. If we were to relax our runtime requirements a little bit, we could balance the learn rate and number of trees used.  

In a production setting where fine-grain accuracy is beneficial, it is common to set the learn rate to a very small number, such as 0.01 or smaller, and add trees to match.  

Early stopping is very powerful here: it allows us to set a low learning rate and build as many trees as needed until the desired convergence is met.

### More information

More information can be found in the [H2O Gradient Boosted Models booklet](http://h2o.ai/resources/), in our [H2O SlideShare Presentations](http://www.slideshare.net/0xdata/presentations), our [H2O YouTube channel](https://www.youtube.com/user/0xdata/), as well as on our [H2O GitHub Repository](https://github.com/h2oai/h2o-3/), especially in our [H2O GBM R tests](https://github.com/h2oai/h2o-3/tree/master/h2o-r/tests/testdir_algos/gbm), and [H2O GBM Python tests](https://github.com/h2oai/h2o-3/tree/master/h2o-py/tests/testdir_algos/gbm).