## H2O gradient boosting and grid search

In this tutorial I show how to build and tune an H2O GBM model. I first build a baseline GBM model with the default hyperparameters. Then I tune the model through the hyperparameters of the h2o.gbm function. Finally I use h2o.grid to conduct an extensive grid search over the hyperparameters. Model performance, measured here by the area under the ROC curve (AUC) on the validation data, improves from 0.569 for the baseline to about 0.95 after tuning and 0.966 after the grid search.

The dataset is an anonymized credit card [dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud/data) from Kaggle. I chose this anonymized dataset so that the focus is on the details of h2o.gbm rather than on feature engineering.

H2O's GBM offers the following functionality:
* supervised learning for regression and classification tasks
* distributed and parallelized computation on either a single node or a multi-node cluster
* fast and memory-efficient Java implementations of the underlying algorithms
* user-friendly web interface to mirror the model building and scoring process running in R or Python
* grid search for hyperparameter optimization and model selection
* model export in plain Java code for deployment in production environments
* additional parameters for model tuning.


We use h2o.init(nthreads=-1) to initialize an H2O environment. nthreads is the number of threads H2O may use, which on a laptop is in practice the number of CPU cores. -1 means use all CPUs on the host (the default). A positive integer specifies the number of CPUs directly.

In [2]:
library(dplyr)
library(h2o)
h2o.init(nthreads=-1)

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         3 seconds 988 milliseconds 
    H2O cluster version:        3.16.0.2 
    H2O cluster version age:    4 months and 30 days !!! 
    H2O cluster name:           H2O_started_from_R_chriskuo_hhp100 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.78 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.3 (2017-11-30) 


“
Your H2O cluster version is too old (4 months and 30 days)!
Please download and install the latest version from http://h2o.ai/download/”




In [3]:
df <- h2o.importFile(path = "/Users/chriskuo/Downloads/creditcard.csv")
dim(df)
summary(df,exact_quantiles=TRUE)

# Specify the response variable
response <- "Class"

# Make the response variable a categorical variable
df[[response]] <- as.factor(df[[response]])           

## Exclude the variable 'Time' 
predictors <- setdiff(names(df), c(response, "Time")) 



 Time             V1                   V2                  
 Min.   :     0   Min.   :-5.641e+01   Min.   :-7.272e+01  
 1st Qu.: 54202   1st Qu.:-9.204e-01   1st Qu.:-5.985e-01  
 Median : 84692   Median : 1.811e-02   Median : 6.549e-02  
 Mean   : 94814   Mean   : 1.328e-15   Mean   : 4.088e-16  
 3rd Qu.:139320   3rd Qu.: 1.316e+00   3rd Qu.: 8.037e-01  
 Max.   :172792   Max.   : 2.455e+00   Max.   : 2.206e+01  
 V3                   V4                   V5                  
 Min.   :-4.833e+01   Min.   :-5.683e+00   Min.   :-1.137e+02  
 1st Qu.:-8.904e-01   1st Qu.:-8.486e-01   1st Qu.:-6.916e-01  
 Median : 1.798e-01   Median :-1.985e-02   Median :-5.434e-02  
 Mean   :-1.584e-15   Mean   : 2.210e-15   Mean   : 1.073e-15  
 3rd Qu.: 1.027e+00   3rd Qu.: 7.433e-01   3rd Qu.: 6.119e-01  
 Max.   : 9.383e+00   Max.   : 1.688e+01   Max.   : 3.480e+01  
 V6                   V7                   V8                  
 Min.   :-2.616e+01   Min.   :-4.356e+01   Min.   :-7.322e+01  
 1st

### Split the data

Below is the standard H2O syntax for splitting the dataset into training, validation and test frames. In order to run and test on small samples, I use 10% for training and 10% for validation. H2O requires only two ratios; the third one is implied. So the test dataset is the remaining 80% (but I will not use it).
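The arithmetic behind the implied third ratio can be sketched in base R. The row count of 284,807 is an assumption about creditcard.csv (it is not printed in this tutorial's output), and the counts are only expected values: h2o.splitFrame assigns rows at random, which is why the training frame in the output further down has 28,461 rows rather than exactly 10%.

```r
# Assumed total number of rows in creditcard.csv (not shown in this tutorial).
n_rows <- 284807

# The two ratios passed to h2o.splitFrame(); the remainder goes to the last frame.
ratios <- c(train = 0.1, valid = 0.1)
all_ratios <- c(ratios, test = 1 - sum(ratios))

# Expected rows per frame. The real split is random, so actual counts differ slightly.
expected_rows <- round(all_ratios * n_rows)

print(all_ratios)
print(expected_rows)
```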

In [6]:
splits <- h2o.splitFrame(
  data = df, 
  ratios = c(0.1, 0.1),   # the ratios must sum to less than 1.0
  destination_frames = c("train", "valid", "test"), seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]

### Build a baseline gbm model without hyper-parameter tuning

Below I use the default values for all hyperparameters. The AUC on the validation data is 0.569.
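For readers less familiar with the metric, here is a minimal base R sketch of what AUC measures: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. It uses the rank-based (Mann-Whitney) formula. The function name auc_rank and the toy labels and scores are made up for illustration; this is not how h2o.auc() is implemented.

```r
# Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.
# labels: 0/1 vector, scores: predicted probabilities for class 1.
auc_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (hypothetical): one positive case is scored below one negative case.
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
toy_auc <- auc_rank(labels, scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a validation AUC of 0.569 means the baseline model ranks fraud cases only slightly better than chance.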

In [7]:
gbm <- h2o.gbm(x = predictors, y = response, training_frame = train)
gbm

## Get the AUC on the validation set
h2o.auc(h2o.performance(gbm, newdata = valid)) 



Model Details:

H2OBinomialModel: gbm
Model ID:  GBM_model_R_1525121768299_63 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50                8421         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         14     8.44000


H2OBinomialMetrics: gbm
** Reported on training data. **

MSE:  0.000581082
RMSE:  0.02410564
LogLoss:  0.01538777
Mean Per-Class Error:  0.1342519
AUC:  0.7444208
Gini:  0.4888416

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           0  1    Error       Rate
0      28414  6 0.000211   =6/28420
1         11 30 0.268293     =11/41
Totals 28425 36 0.000597  =17/28461

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.281354 0.779221   8
2                       max f2  0.281354 0.750000   8
3        

### Fine-tune hyper-parameters in h2o.gbm

The overall strategy is to test more trees with a smaller learning rate. The hyperparameters to tune are the following (the h2o.gbm argument names are in parentheses):

* Learning rate, or shrinkage (learn_rate)
* Number of trees (ntrees)
* Interaction depth (max_depth)
* Minimum number of observations in a node (min_rows)
* Bag fraction, the fraction of randomly selected observations (sample_rate)


The learning rate, a value between 0 and 1, controls how much each new tree corrects the error left by the previous trees. A small learning rate results in long computation time, and a large learning rate can keep the model from settling down. It is therefore efficient to let the learning rate decay over time, and h2o.gbm has a hyperparameter for this called "learn_rate_annealing". "Annealing", in materials science, describes a process that heats a material up in the beginning and then cools it down slowly. A common decay schedule in machine learning is "step decay", which reduces the learning rate by some factor every few iterations or epochs, for example by half every 5 epochs. In h2o.gbm the learning rate is instead multiplied by learn_rate_annealing after every tree, so with a value of 0.99 it shrinks by 1% per tree. Because we use learn_rate_annealing, we can start with a relatively large learn_rate of 0.05.
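To get a feel for how quickly the learning rate decays, the following base R sketch computes the effective rate after a given number of trees. It assumes the rate is multiplied by learn_rate_annealing once per tree, as the H2O documentation describes; the tree counts are arbitrary values picked for illustration.

```r
learn_rate <- 0.05
learn_rate_annealing <- 0.99

# Number of trees already built (illustrative values).
trees_built <- c(0, 10, 50, 100, 160)

# Effective learning rate applied to the next tree.
effective_rate <- learn_rate * learn_rate_annealing^trees_built
print(data.frame(trees_built, effective_rate = signif(effective_rate, 3)))

# Number of trees after which the rate has halved.
halving_trees <- log(0.5) / log(learn_rate_annealing)
print(halving_trees)
```

With an annealing factor of 0.99 the rate halves roughly every 69 trees, which is far gentler than halving every 5 epochs.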

* learn_rate = 0.05.
* learn_rate_annealing = 0.99.
* ntrees = 1000.
* max_runtime_secs = 1200. Early stopping based on a timeout, in this case no more than 1200 seconds. The h2o.gbm call below does not set it, so no time limit applies there.
* stopping_rounds = 5.
* stopping_tolerance = 1e-4.
* stopping_metric = "AUC". The above three hyperparameters together stop training early when the AUC does not improve by at least 0.01% for 5 consecutive scoring events.
* score_tree_interval = 10. Score every 10 trees to make early stopping reproducible. It is set in the h2o.grid call later in this tutorial but not in the h2o.gbm call below.
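The early stopping rule described above can be sketched in a few lines of base R. This is a simplified illustration of the idea, not H2O's exact logic: H2O documents its rule in terms of a moving average of the metric. The function name should_stop_at and the AUC history are hypothetical.

```r
# Return the index of the scoring event at which training would stop, or NA.
# A scoring event counts as an improvement only if it beats the best value
# so far by a relative margin of `tolerance`.
should_stop_at <- function(auc_history, rounds = 5, tolerance = 1e-4) {
  best <- -Inf
  stale <- 0
  for (i in seq_along(auc_history)) {
    if (auc_history[i] > best * (1 + tolerance)) {
      best <- auc_history[i]
      stale <- 0
    } else {
      stale <- stale + 1
    }
    if (stale >= rounds) return(i)
  }
  NA_integer_
}

# Hypothetical validation AUC, one value per scoring event.
auc_history <- c(0.90, 0.93, 0.95, 0.95, 0.95, 0.95001, 0.9499, 0.95, 0.96)
stop_event <- should_stop_at(auc_history)
print(stop_event)
```

With score_tree_interval = 10 each scoring event corresponds to 10 trees, so stopping at scoring event k means stopping after roughly 10 * k trees, well before the ntrees limit.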

In [9]:
gbm <- h2o.gbm(x = predictors, y = response, 
               training_frame = train, 
               validation_frame = valid,
               learn_rate = .05, learn_rate_annealing =.99,
               ntrees=1000,
               stopping_rounds = 5,
               stopping_tolerance = 1e-4,
               stopping_metric = "AUC", 
               seed = 1234)

# print the auc for the validation data
print(h2o.auc(gbm, valid = TRUE))

[1] 0.9517869


### Fine-tune the hyper-parameters using h2o.grid

You can type "?h2o.grid" to read about the grid search options. h2o.grid() takes the following arguments:

* hyper_params: A named list of hyperparameters, each with the values to try.
* search_criteria: The default strategy 'Cartesian' covers the entire space of hyperparameter combinations. For example, if you have three hyperparameters with 2, 4 and 6 values respectively, the Cartesian search will result in $2 \times 4 \times 6 = 48$ models. The alternative is the 'RandomDiscrete' strategy, which randomly samples from all the combinations of your hyperparameters.
* algorithm: Which algorithm to use, here "gbm".
* grid_id: An id with which we can retrieve the grid later. In this example it is "my_grid".
* ntrees: The number of trees. It is a model parameter rather than a grid option, and here it is searched over through hyper_params.
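Before launching a Cartesian search it helps to count how many models it will train. The base R sketch below uses expand.grid to enumerate the combinations, first for the 2, 4, 6 example and then for the ntrees and max_depth values used in this tutorial. The random sample at the end only mimics the idea behind 'RandomDiscrete'; the max_models value of 10 is an arbitrary choice for illustration.

```r
# The 2 x 4 x 6 example from the text (placeholder values).
toy <- expand.grid(a = 1:2, b = 1:4, c = 1:6)
print(nrow(toy))

# The hyperparameter values searched in this tutorial.
hyper_params <- list(ntrees = seq(100, 3000, 200),
                     max_depth = seq(2, 12, 3))
combos <- expand.grid(hyper_params)
n_cartesian <- nrow(combos)
print(n_cartesian)

# Random search idea: evaluate only a random subset of the combinations.
set.seed(1234)
sampled <- combos[sample(nrow(combos), 10), ]
print(sampled)

# A search_criteria list of the kind h2o.grid() accepts for random search.
search_criteria <- list(strategy = "RandomDiscrete", max_models = 10, seed = 1234)
```

These values give 15 x 4 = 60 combinations. The grid output in this tutorial lists 172 models, and its failed model has a max_depth of 14, which is not among the values searched. A likely explanation, not confirmed by the source, is that earlier runs reused the same grid_id, since H2O adds new models to an existing grid with that id.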

In [36]:
hyper_params <- list(ntrees = seq(100, 3000, 200),
                     max_depth = seq(2, 12, 3))

grid <- h2o.grid(
  hyper_params = hyper_params,
  search_criteria = list(strategy = "Cartesian"),
  algorithm = "gbm",
  grid_id = "my_grid",

  # The arguments below are the same as in h2o.gbm()
  x = predictors,
  y = response,
  training_frame = train,
  validation_frame = valid,
  learn_rate = 0.05,
  learn_rate_annealing = 0.99,
  sample_rate = 0.8,
  col_sample_rate = 0.8,
  seed = 1234,
  stopping_rounds = 5,
  stopping_tolerance = 1e-4,
  stopping_metric = "AUC",
  score_tree_interval = 10
)

grid

Hyper-parameter: max_depth, 14
Hyper-parameter: ntrees, 1000
[2018-04-30 19:04:50] failure_details: NA 
[2018-04-30 19:04:50] failure_stack_traces: water.Job$JobCancelledException
	at hex.tree.SharedTree$Driver.scoreAndBuildTrees(SharedTree.java:437)
	at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:352)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
	at hex.ModelBuilder.trainModelNested(ModelBuilder.java:262)
	at hex.grid.GridSearch.startBuildModel(GridSearch.java:332)
	at hex.grid.GridSearch.buildModel(GridSearch.java:314)
	at hex.grid.GridSearch.gridSearch(GridSearch.java:213)
	at hex.grid.GridSearch.access$000(GridSearch.java:68)
	at hex.grid.GridSearch$1.compute2(GridSearch.java:135)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJo

H2O Grid Details

Grid ID: my_grid 
Used hyper parameters: 
  -  max_depth 
  -  ntrees 
Number of models: 172 
Number of failed models: 1 

Hyper-Parameter Search Summary: ordered by increasing logloss
  max_depth ntrees         model_ids              logloss
1         2   1700 my_grid_model_144 0.020718249226475475
2         2    300 my_grid_model_116 0.020718249226475475
3         2   1000   my_grid_model_0 0.020718249226475475
4         2   3000  my_grid_model_30 0.020718249226475475
5         2   1800  my_grid_model_12 0.020718249226475475

---
    max_depth ntrees        model_ids              logloss
167         2   2400 my_grid_model_64 0.051342688610633055
168         2   3000 my_grid_model_76 0.051342688610633055
169         2   1400 my_grid_model_44 0.051342688610633055
170         2   1000 my_grid_model_36 0.051342688610633055
171         2   2200 my_grid_model_60 0.051342688610633055
172         2   1800 my_grid_model_52 0.051342688610633055
Failed models
-------------
 ma

In [37]:
## sort the grid models by decreasing AUC
sortedGrid <- h2o.getGrid("my_grid", sort_by="auc", decreasing = TRUE)    
print(sortedGrid)

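You can print out the AUC of the top 10 models from the grid search. The best validation AUC has increased to 0.966. All ten models report the same AUC, which is what one would expect if early stopping ends training before the ntrees limit is reached, so that models differing only in ntrees end up identical.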

In [38]:
for (i in 1:10) {
  gbm <- h2o.getModel(sortedGrid@model_ids[[i]])
  print(h2o.auc(h2o.performance(gbm, valid = TRUE)))
}

[1] 0.966033
[1] 0.966033
[1] 0.966033
[1] 0.966033
[1] 0.966033
[1] 0.966033
[1] 0.966033
[1] 0.966033
[1] 0.966033
[1] 0.966033


You can also inspect the details of the best model.

In [39]:
best_model <- h2o.getModel(sortedGrid@model_ids[[1]])
summary(best_model)

scoring_history <- as.data.frame(best_model@model$scoring_history)
#plot(scoring_history$number_of_trees, scoring_history$training_MSE, type="p") #training mse
#points(scoring_history$number_of_trees, scoring_history$validation_MSE, type="l") #validation mse

## get the actual number of trees
ntrees <- best_model@model$model_summary$number_of_trees

Model Details:

H2OBinomialModel: gbm
Model Key:  my_grid_model_37 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             160                      160               34446         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          6         16    12.16250

H2OBinomialMetrics: gbm
** Reported on training data. **

MSE:  0.0004033557
RMSE:  0.02008372
LogLoss:  0.01243093
Mean Per-Class Error:  0.0001759324
AUC:  0.9997854
Gini:  0.9995709

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           0  1    Error       Rate
0      28410 10 0.000352  =10/28420
1          0 41 0.000000      =0/41
Totals 28410 51 0.000351  =10/28461

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.123504 0.891304   8
2                       max f2  0.123504 0.953488   8
3                