## H2O random forest

The data come from an anonymized credit card fraud [dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud/data) on Kaggle. I chose an anonymized dataset so that the focus stays on the details of h2o.randomForest rather than on feature engineering.

We use h2o.init(nthreads=-1) to initialize an H2O environment. The nthreads argument sets the size of H2O's thread pool, which on a laptop corresponds closely to the number of CPU cores used. A value of -1 (the default) uses all cores on the host; a positive integer sets the number of cores directly.
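
To get a feel for what -1 resolves to on a given machine, the core count can be checked before starting H2O. This is a minimal sketch using the parallel package that ships with R; the variable name n_cores is only illustrative, and detectCores() may return NA on some platforms.

```r
library(parallel)

# Number of logical cores R can see; nthreads = -1 would typically use all of them.
n_cores <- detectCores()
print(n_cores)

# To cap H2O at a fixed number of cores, pass a positive integer instead,
# for example: h2o.init(nthreads = 2)
```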

In [1]:
library(dplyr)
library(h2o)
h2o.init(nthreads=-1)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc




H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/jw/wyhtzlf94zbgtpf9g_n5tryr0000gn/T//RtmptdYAaX/h2o_chriskuo_started_from_r.out
    /var/folders/jw/wyhtzlf94zbgtpf9g_n5tryr0000gn/T//RtmptdYAaX/h2o_chriskuo_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 609 milliseconds 
    H2O cluster version:        3.16.0.2 
    H2O cluster version age:    5 months !!! 
    H2O cluster name:           H2O_started_from_R_chriskuo_rbo106 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.78 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         XGBoost, Algos, 

“
Your H2O cluster version is too old (5 months)!
Please download and install the latest version from http://h2o.ai/download/”




In [2]:
df <- h2o.importFile(path = "/Users/chriskuo/Downloads/creditcard.csv")
dim(df)
summary(df,exact_quantiles=TRUE)

# Specify the response variable
response <- "Class"

# Make the response variable a categorical variable
df[[response]] <- as.factor(df[[response]])           

## Exclude the variable 'Time' 
predictors <- setdiff(names(df), c(response, "Time")) 



 Time             V1                   V2                  
 Min.   :     0   Min.   :-5.641e+01   Min.   :-7.272e+01  
 1st Qu.: 54202   1st Qu.:-9.204e-01   1st Qu.:-5.985e-01  
 Median : 84692   Median : 1.811e-02   Median : 6.549e-02  
 Mean   : 94814   Mean   : 1.328e-15   Mean   : 4.088e-16  
 3rd Qu.:139320   3rd Qu.: 1.316e+00   3rd Qu.: 8.037e-01  
 Max.   :172792   Max.   : 2.455e+00   Max.   : 2.206e+01  
 V3                   V4                   V5                  
 Min.   :-4.833e+01   Min.   :-5.683e+00   Min.   :-1.137e+02  
 1st Qu.:-8.904e-01   1st Qu.:-8.486e-01   1st Qu.:-6.916e-01  
 Median : 1.798e-01   Median :-1.985e-02   Median :-5.434e-02  
 Mean   :-1.584e-15   Mean   : 2.210e-15   Mean   : 1.073e-15  
 3rd Qu.: 1.027e+00   3rd Qu.: 7.433e-01   3rd Qu.: 6.119e-01  
 Max.   : 9.383e+00   Max.   : 1.688e+01   Max.   : 3.480e+01  
 V6                   V7                   V8                  
 Min.   :-2.616e+01   Min.   :-4.356e+01   Min.   :-7.322e+01  
 1st

### Split the data

Below is the standard H2O syntax for splitting a dataset into training, validation, and test sets. To run and test on small samples, I use 10% for training and 10% for validation. h2o.splitFrame needs only two ratios because the third is implied: the remaining 80% becomes the test set (which I will not use).

In [3]:
splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.1, 0.1),   # the ratios must sum to less than 1.0; the remainder goes to the last split
  destination_frames = c("train", "valid", "test"),
  seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]
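
The implied third ratio is simple arithmetic, sketched here in base R. The row count n_rows is a hypothetical figure used only for illustration, and h2o.splitFrame produces approximate rather than exact split sizes, so real counts will differ slightly.

```r
# Ratios passed to h2o.splitFrame(); the last split receives whatever is left.
ratios <- c(0.1, 0.1)
implied_test_ratio <- 1 - sum(ratios)
print(implied_test_ratio)

# Rough expected split sizes for a hypothetical frame of 100,000 rows.
n_rows <- 100000
expected_rows <- round(n_rows * c(train = ratios[1], valid = ratios[2], test = implied_test_ratio))
print(expected_rows)
```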

### Build a random forest model 

In [5]:
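rf_model <- h2o.randomForest(
      training_frame = train,
      validation_frame = valid,
      x = predictors,
      y = response,
      model_id = "rf_model",
      ntrees = 200,                   # upper bound on the number of trees
      max_depth = 30,                 # maximum depth of each tree
      stopping_rounds = 2,            # stop early when the moving average of the stopping metric stops improving over 2 scoring rounds
      stopping_tolerance = 1e-2,      # relative improvement needed to keep training
      score_each_iteration = TRUE,    # score after every tree
      seed = 1234)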



## Get the AUC on the validation set
h2o.auc(h2o.performance(rf_model, newdata = valid)) 

Rf_predictions <- h2o.predict(object = rf_model, newdata = valid)






In [10]:
hyper_params = list( ntrees = seq(100,1000,200), 
                    max_depth=seq(2,12,3)   )

grid <- h2o.grid(
  hyper_params = hyper_params,
  
  search_criteria = list(strategy = "Cartesian"),
  
  algorithm="randomForest",
  
  grid_id="rf_grid",
  
  # The arguments below are passed through to h2o.randomForest()
  x = predictors, 
  y = response, 
  training_frame = train, 
  validation_frame = valid,
  seed = 1234,                                                             
  stopping_rounds = 5,
  stopping_tolerance = 1e-4,
  stopping_metric = "AUC", 
  score_tree_interval = 10       
)

grid        



H2O Grid Details

Grid ID: rf_grid 
Used hyper parameters: 
  -  max_depth 
  -  ntrees 
Number of models: 20 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing logloss
   max_depth ntrees        model_ids               logloss
1          8    500 rf_grid_model_10 0.0034990039653087722
2          8    900 rf_grid_model_18 0.0034990039653087722
3          8    700 rf_grid_model_14 0.0034990039653087722
4          8    300  rf_grid_model_6  0.003502550559858588
5         11    100  rf_grid_model_3  0.003523577488917962
6         11    900 rf_grid_model_19 0.0035435883958802273
7         11    500 rf_grid_model_11 0.0035435883958802273
8         11    700 rf_grid_model_15 0.0035435883958802273
9         11    300  rf_grid_model_7 0.0035435883958802273
10         8    100  rf_grid_model_2  0.003572975067744199
11         5    300  rf_grid_model_5 0.0037900850733852513
12         5    900 rf_grid_model_17 0.0037900850733852513
13         5    500  rf_grid_mo
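
The "Number of models: 20" line follows from the Cartesian strategy, which trains one model per combination of hyperparameter values. A small base R sketch of that count, using expand.grid purely as an illustration (it is not part of the H2O API):

```r
hyper_params <- list(ntrees = seq(100, 1000, 200),
                     max_depth = seq(2, 12, 3))

# Values actually tried for each hyperparameter.
print(hyper_params$ntrees)     # 100 300 500 700 900
print(hyper_params$max_depth)  # 2 5 8 11

# A Cartesian search enumerates every combination: 5 x 4 = 20 models.
combos <- expand.grid(hyper_params)
print(nrow(combos))
```

Several rows in the grid summary share an identical logloss for different ntrees values at the same depth. That pattern is consistent with early stopping halting those models at the same number of trees, although the summary alone does not confirm it.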

In [11]:
## sort the grid models by decreasing AUC
sortedGrid <- h2o.getGrid("rf_grid", sort_by="auc", decreasing = TRUE)    
print(sortedGrid)

H2O Grid Details

Grid ID: rf_grid 
Used hyper parameters: 
  -  max_depth 
  -  ntrees 
Number of models: 20 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by decreasing auc
   max_depth ntrees        model_ids                auc
1         11    100  rf_grid_model_3 0.9911282197744096
2         11    900 rf_grid_model_19 0.9884858969959844
3         11    500 rf_grid_model_11 0.9884858969959844
4         11    700 rf_grid_model_15 0.9884858969959844
5         11    300  rf_grid_model_7 0.9884858969959844
6          8    500 rf_grid_model_10 0.9879157223684473
7          8    900 rf_grid_model_18 0.9879157223684473
8          8    700 rf_grid_model_14 0.9879157223684473
9          8    300  rf_grid_model_6 0.9874157494217778
10         5    300  rf_grid_model_5 0.9833659685537564
11         5    900 rf_grid_model_17 0.9833659685537564
12         5    500  rf_grid_model_9 0.9833659685537564
13         5    700 rf_grid_model_13 0.9833659685537564
14         8    10

You can print the validation AUC of the top 10 models from the grid search. The best model reaches a validation AUC of about 0.99.

In [12]:
for (i in 1:10) {
  topModels <- h2o.getModel(sortedGrid@model_ids[[i]])
  print(h2o.auc(h2o.performance(topModels, valid = TRUE)))
}

[1] 0.9911282
[1] 0.9884859
[1] 0.9884859
[1] 0.9884859
[1] 0.9884859
[1] 0.9879157
[1] 0.9879157
[1] 0.9879157
[1] 0.9874157
[1] 0.983366
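
For intuition about what these numbers measure, AUC can be read as the probability that a randomly chosen fraud case is scored higher than a randomly chosen legitimate one. The sketch below computes it with the rank-based (Mann-Whitney) formula on hypothetical toy labels and scores; it illustrates the metric and is not a description of how H2O computes AUC internally.

```r
# Hypothetical toy data: 1 = fraud, 0 = legitimate, with made-up scores.
labels <- c(0, 0, 1, 0, 1, 1, 0, 0)
scores <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.05, 0.50)

# Rank all scores (ties would receive average ranks), then compare the
# rank sum of the positives with the minimum rank sum they could have.
r <- rank(scores)
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)  # 13 of the 15 positive/negative pairs are ordered correctly
```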


You can also inspect the details of the best model.

In [13]:
best_model <- h2o.getModel(sortedGrid@model_ids[[1]])
summary(best_model)

scoring_history <- as.data.frame(best_model@model$scoring_history)
#plot(scoring_history$number_of_trees, scoring_history$training_MSE, type="p") #training mse
#points(scoring_history$number_of_trees, scoring_history$validation_MSE, type="l") #validation mse

## get the actual number of trees
ntrees <- best_model@model$model_summary$number_of_trees

Model Details:

H2OBinomialModel: drf
Model Key:  rf_grid_model_3 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             100                      100               41703         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        11   10.56000          9         41    28.16000

H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.0004903668
RMSE:  0.02214423
LogLoss:  0.005157531
Mean Per-Class Error:  0.09770172
AUC:  0.9587112
Gini:  0.9174225

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           0  1    Error       Rate
0      28412  8 0.000281   =8/28420
1          8 33 0.195122      =8/41
Totals 28420 41 0.000562  =16/28461

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.323529 0.804878  38
2                  
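
The max F1 value in the metrics table can be reproduced by hand from the confusion matrix above. A short base R check, with the counts copied from that matrix (class 1 treated as the positive class; the variable names are illustrative):

```r
# Counts read from the confusion matrix (rows: actual, columns: predicted).
tn <- 28412; fp <- 8
fn <- 8;     tp <- 33

precision <- tp / (tp + fp)   # share of predicted frauds that are real
recall    <- tp / (tp + fn)   # share of real frauds that are caught
f1 <- 2 * precision * recall / (precision + recall)
print(round(f1, 6))           # 0.804878, matching "max f1" in the table

# Overall error rate: 16 misclassified out of 28461 rows.
print(round((fp + fn) / (tn + fp + fn + tp), 6))  # 0.000562
```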

In [40]:
# All done. Shut down H2O.
h2o.shutdown(prompt=FALSE)