## H2O random forest

Install H2O by following the [H2O download instructions](https://h2o-release.s3.amazonaws.com/h2o/master/3888/docs-website/h2o-docs/downloading.html). We call `h2o.init(nthreads=-1)` to initialize an H2O environment. `nthreads` is the number of threads H2O may use, which on a laptop is effectively the number of CPUs. A value of -1 (the default) uses all CPUs on the host; a positive integer sets the number of CPUs directly.
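To see what a positive `nthreads` value might look like on your machine, the sketch below uses base R's `parallel::detectCores()` to count the CPUs and keeps one free. The "leave one core free" policy and the variable names `n_cores` and `n_threads` are illustrative assumptions, not an H2O recommendation.

```r
library(parallel)

# Number of logical CPUs reported by the operating system (can be NA on some platforms)
n_cores <- detectCores()

# Hypothetical policy: leave one core for the rest of the system, but use at least one
n_threads <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)
print(n_threads)

# With the h2o package and Java installed, this value could then be passed on:
# h2o.init(nthreads = n_threads)
```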

We will use the dataset [Gender recognition by voice](https://www.kaggle.com/primaryobjects/voicegender) on the Kaggle site.

### Learning Objectives:

1. Use H2O to build a random forest model
2. Use H2O grid search to find the optimal hyper-parameters
3. Plot the ROC

### Initialize H2O

In [1]:
#install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
library(dplyr)
library(h2o)
h2o.init(nthreads=-1)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc




H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/jw/wyhtzlf94zbgtpf9g_n5tryr0000gn/T//Rtmp7gkpNR/h2o_chriskuo_started_from_r.out
    /var/folders/jw/wyhtzlf94zbgtpf9g_n5tryr0000gn/T//Rtmp7gkpNR/h2o_chriskuo_started_from_r.err


Starting H2O JVM and connecting: ... Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         3 seconds 164 milliseconds 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.21.0.4353 
    H2O cluster version age:    6 days  
    H2O cluster name:           H2O_started_from_R_chriskuo_ljv918 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.78 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
 

In [2]:
df.hex <- h2o.importFile(path = "/Users/chriskuo/Downloads/voice.csv")
head(df.hex)
h2o.table(df.hex$label)



meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,⋯,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0.05978098,0.06424127,0.03202691,0.015071489,0.09019344,0.07512195,12.863462,274.402906,0.8933694,0.4919178,⋯,0.05978098,0.08427911,0.01570167,0.2758621,0.0078125,0.0078125,0.0078125,0.0,0.0,male
0.06600874,0.06731003,0.04022873,0.019413867,0.09266619,0.07325232,22.423285,634.613855,0.8921932,0.5137238,⋯,0.06600874,0.10793655,0.01582591,0.25,0.009014423,0.0078125,0.0546875,0.046875,0.05263158,male
0.0773155,0.08382942,0.03671846,0.008701057,0.13190802,0.12320696,30.757155,1024.927705,0.8463891,0.478905,⋯,0.0773155,0.09870626,0.01565558,0.2711864,0.007990057,0.0078125,0.015625,0.0078125,0.04651163,male
0.15122809,0.07211059,0.15801119,0.096581728,0.20795525,0.11137352,1.232831,4.177296,0.9633225,0.7272318,⋯,0.15122809,0.08896485,0.01779755,0.25,0.201497396,0.0078125,0.5625,0.5546875,0.24711908,male
0.13512039,0.0791461,0.12465623,0.078720218,0.20604493,0.12732471,1.101174,4.333713,0.9719551,0.7835681,⋯,0.13512039,0.10639784,0.01693122,0.2666667,0.7128125,0.0078125,5.484375,5.4765625,0.20827389,male
0.13278641,0.07955687,0.11908985,0.067957993,0.2095916,0.14163361,1.932562,8.308895,0.9631813,0.738307,⋯,0.13278641,0.11013192,0.0171123,0.2539683,0.298221983,0.0078125,2.7265625,2.71875,0.12515964,male


   label Count
1 female  1584
2   male  1584

[2 rows x 2 columns] 

### Split the data

Below is the standard H2O syntax for splitting a dataset into training, validation, and test sets. To keep the runs fast on a small sample, I use 20% for training and 20% for validation. H2O requires only two ratios; the third is implied, so the test set receives the remaining 60%. The test set is used later to evaluate the best model.

In [3]:
splits <- h2o.splitFrame(
  data = df.hex, 
  ratios = c(0.2,0.2),   # the ratios must sum to less than 1.0; the remainder goes to the last frame
  destination_frames = c("train", "valid", "test"), seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]
valid

    meanfreq         sd     median        Q25        Q75        IQR       skew
1 0.06600874 0.06731003 0.04022873 0.01941387 0.09266619 0.07325232 22.4232854
2 0.16051433 0.07676688 0.14433678 0.11053217 0.23196187 0.12142971  1.3971564
3 0.14223942 0.07801846 0.13858744 0.08820628 0.20858744 0.12038117  1.0997462
4 0.15303905 0.07403113 0.15806452 0.09273118 0.21436559 0.12163441  0.8857016
5 0.16764836 0.06721993 0.15723951 0.13602165 0.20952639 0.07350474  1.9360651
6 0.13789282 0.07242066 0.11985240 0.08653137 0.20070111 0.11416974  2.0109933
        kurt    sp.ent       sfm      mode   centroid    meanfun     minfun
1 634.613855 0.8921932 0.5137238 0.0000000 0.06600874 0.10793655 0.01582591
2   4.766611 0.9592546 0.7198579 0.1283241 0.16051433 0.09305243 0.01775805
3   4.070284 0.9707229 0.7709921 0.2191031 0.14223942 0.09672895 0.01795735
4   3.523982 0.9732178 0.8075517 0.2167742 0.15303905 0.06740750 0.01626016
5   6.334626 0.9039823 0.4828861 0.1345061 0.16764836 0.14274825 0.
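The arithmetic behind the implied third split can be checked in base R. This is a minimal sketch that assumes the `ratios = c(0.2, 0.2)` passed to `h2o.splitFrame` above and the 3,168 rows shown by `h2o.table`; the names `fractions` and `expected_rows` are made up for illustration.

```r
# Ratios given to h2o.splitFrame; the last fraction is whatever remains
ratios    <- c(train = 0.2, valid = 0.2)
implied   <- 1 - sum(ratios)
fractions <- c(ratios, test = implied)

# 1584 female + 1584 male rows in the voice dataset
n_rows <- 3168
expected_rows <- round(n_rows * fractions)

print(fractions)
print(expected_rows)
```

The actual frame sizes will differ slightly from these expected counts because `h2o.splitFrame` assigns rows probabilistically rather than cutting at exact positions. In this run the confusion matrices further down show 655 training rows and 1,893 test rows.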

### The target variable and the predictors

In [5]:
## Exclude the target variable 'label'
predictors <- setdiff(names(df.hex), 'label')
predictors

### Learning Objective 1: Build a random forest model 

* stopping_metric: the default is ['AUTO'](http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.randomForest.html), which resolves to the [logloss](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/stopping_metric.html) function for classification.
* A lower LogLoss value means better predictions.
* LogLoss is the negative log-likelihood of the labels under the predicted probabilities, averaged over the $N$ rows.
* LogLoss: $$-\frac{1}{N}\sum_i \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right]$$
* LogLikelihood: $$\sum_i \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right]$$
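To make the LogLoss definition concrete, here is a small base R sketch of the averaged binary log loss, $-\frac{1}{N}\sum_i [y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$. The helper `log_loss`, the clipping constant `eps`, and the toy vectors are illustrative assumptions; this is not H2O's internal implementation.

```r
# Averaged binary log loss; y holds 0/1 labels, p holds predicted P(y = 1)
log_loss <- function(y, p, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so that log() stays finite
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Toy labels with confident and hesitant predictions (made-up numbers)
y           <- c(1, 0, 1, 1)
p_confident <- c(0.9, 0.1, 0.8, 0.7)
p_hesitant  <- c(0.6, 0.4, 0.5, 0.5)

print(log_loss(y, p_confident))
print(log_loss(y, p_hesitant))
```

Predictions that put more probability on the true class give the lower value, which is why a lower LogLoss indicates a better model.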

In [6]:
rf_model <- h2o.randomForest(        
      training_frame = train,       
      validation_frame = valid,     
      x=predictors,                       
      y='label',                         
      model_id = "rf_model",      
      ntrees = 200, #2000 is recommended                
      max_depth = 10, #30 is recommended               
      stopping_rounds = 2,          
      stopping_tolerance = 1e-2,    
      score_each_iteration = T,     
      seed=1234)                 

## Get the AUC on the validation set
h2o.auc(h2o.performance(rf_model, newdata = valid)) 

Rf_predictions<-h2o.predict(object = rf_model,newdata = valid)






### Learning Objective 2: Use Grid-search to find the optimal hyper-parameters

In [7]:
hyper_params = list( ntrees = seq(100,1000,200), 
                    max_depth=seq(2,12,3)   )

grid <- h2o.grid(
  hyper_params = hyper_params,
  
  search_criteria = list(strategy = "Cartesian"),
  
  algorithm="randomForest",
  
  grid_id="rf_grid",
  
  # The arguments below are passed through to h2o.randomForest()
  x = predictors, 
  y = 'label', 
  training_frame = train, 
  validation_frame = valid,
  seed = 1234,                                                             
  stopping_rounds = 5,
  stopping_tolerance = 1e-8,
  stopping_metric = "AUC", 
  score_tree_interval = 10       
)

grid        



H2O Grid Details

Grid ID: rf_grid 
Used hyper parameters: 
  -  max_depth 
  -  ntrees 
Number of models: 20 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing logloss
   max_depth ntrees        model_ids             logloss
1         11    900 rf_grid_model_19  0.1024994452356641
2         11    300  rf_grid_model_7  0.1024994452356641
3         11    500 rf_grid_model_11  0.1024994452356641
4         11    700 rf_grid_model_15  0.1024994452356641
5         11    100  rf_grid_model_3 0.10323083345907151
6          8    300  rf_grid_model_6 0.10576761836574729
7          8    500 rf_grid_model_10 0.10576761836574729
8          8    900 rf_grid_model_18 0.10576761836574729
9          8    700 rf_grid_model_14 0.10576761836574729
10         8    100  rf_grid_model_2  0.1068203416301204
11         5    900 rf_grid_model_17 0.12680637040877746
12         5    500  rf_grid_model_9 0.12680637040877746
13         5    300  rf_grid_model_5 0.12680637040877746
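A Cartesian search trains one model per combination of hyper-parameter values, which explains the 20 models reported above. The base R sketch below enumerates those combinations with `expand.grid`, using the same `hyper_params` list as the grid call; the name `combos` is illustrative.

```r
# Same hyper-parameter lists as passed to h2o.grid()
hyper_params <- list(ntrees    = seq(100, 1000, 200),
                     max_depth = seq(2, 12, 3))

# Every ntrees value is paired with every max_depth value
combos <- expand.grid(hyper_params)

print(hyper_params$ntrees)     # 100 300 500 700 900
print(hyper_params$max_depth)  # 2 5 8 11
print(nrow(combos))            # 5 * 4 = 20
```

Because the number of models is the product of the list lengths, adding a third hyper-parameter with even a few values multiplies the training time accordingly.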


In [8]:
## sort the grid models by decreasing AUC
sortedGrid <- h2o.getGrid("rf_grid", sort_by="auc", decreasing = TRUE)    
print(sortedGrid)

H2O Grid Details

Grid ID: rf_grid 
Used hyper parameters: 
  -  max_depth 
  -  ntrees 
Number of models: 20 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by decreasing auc
   max_depth ntrees        model_ids                auc
1         11    900 rf_grid_model_19 0.9944427083333334
2         11    300  rf_grid_model_7 0.9944427083333334
3         11    500 rf_grid_model_11 0.9944427083333334
4         11    700 rf_grid_model_15 0.9944427083333334
5         11    100  rf_grid_model_3           0.994375
6          8    300  rf_grid_model_6 0.9939479166666667
7          8    500 rf_grid_model_10 0.9939479166666667
8          8    900 rf_grid_model_18 0.9939479166666667
9          8    700 rf_grid_model_14 0.9939479166666667
10         8    100  rf_grid_model_2          0.9938125
11         5    900 rf_grid_model_17 0.9932708333333334
12         5    500  rf_grid_model_9 0.9932708333333334
13         5    300  rf_grid_model_5 0.9932708333333334
14         5    70

### You can print out the top 10 models from the grid search. 

The loop below prints the validation AUC of each of the top 10 models, in the same order as the sorted grid.

In [9]:
for (i in 1:10) {
  topModels <- h2o.getModel(sortedGrid@model_ids[[i]])
  print(h2o.auc(h2o.performance(topModels, valid = TRUE)))
}

[1] 0.9944427
[1] 0.9944427
[1] 0.9944427
[1] 0.9944427
[1] 0.994375
[1] 0.9939479
[1] 0.9939479
[1] 0.9939479
[1] 0.9939479
[1] 0.9938125


You can also inspect the details of the best model.

In [10]:
best_model <- h2o.getModel(sortedGrid@model_ids[[1]])
summary(best_model)

scoring_history <- as.data.frame(best_model@model$scoring_history)
#plot(scoring_history$number_of_trees, scoring_history$training_MSE, type="p") #training mse
#points(scoring_history$number_of_trees, scoring_history$validation_MSE, type="l") #validation mse

## get the actual number of trees
ntrees <- best_model@model$model_summary$number_of_trees

Model Details:

H2OBinomialModel: drf
Model Key:  rf_grid_model_19 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             120                      120               46222         6
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        11    8.55000         12         47    25.67500

H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.02043548
RMSE:  0.1429527
LogLoss:  0.081188
Mean Per-Class Error:  0.02143525
AUC:  0.9973679
Gini:  0.9947359

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       female male    Error     Rate
female    303    7 0.022581   =7/310
male        7  338 0.020290   =7/345
Totals    310  345 0.021374  =14/655

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.472222 0.979710 106
2                   

### Learning Objective 3: Plotting ROC, Precision-recall

* Use the H2O metric functions available [here](https://rdrr.io/cran/h2o/man/h2o.metric.html).

In [41]:
# Score the test set with the best model and compute its performance metrics
my.pred = h2o.predict(best_model,test)
head(my.pred)
my.perf = h2o.performance(best_model, test)
my.perf 



predict,female,male
male,0.166666667,0.8333333
male,0.091666667,0.9083333
male,0.008333333,0.9916667
male,0.011111111,0.9888889
male,0.012777778,0.9872222
male,0.004444445,0.9955556


H2OBinomialMetrics: drf

MSE:  0.02054478
RMSE:  0.1433345
LogLoss:  0.1014877
Mean Per-Class Error:  0.02119041
AUC:  0.9967621
Gini:  0.9935243

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       female male    Error      Rate
female    941   13 0.013627   =13/954
male       27  912 0.028754   =27/939
Totals    968  925 0.021130  =40/1893

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.608333 0.978541  69
2                       max f2  0.333333 0.981407 104
3                 max f0point5  0.766667 0.984358  50
4                 max accuracy  0.616667 0.978870  68
5                max precision  1.000000 1.000000   0
6                   max recall  0.000000 1.000000 174
7              max specificity  1.000000 1.000000   0
8             max absolute_mcc  0.616667 0.957868  68
9   max min_per_class_accuracy  0.525000 0.976939  78
10 max mean_per
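The reported `max f1` can be reproduced by hand from the test-set confusion matrix above. This base R sketch takes `male` as the positive class, which is an assumption based on it being the second factor level; the variable names are illustrative.

```r
# Counts from the test-set confusion matrix (rows: actual, columns: predicted)
tp <- 912  # actual male,   predicted male
fn <- 27   # actual male,   predicted female
fp <- 13   # actual female, predicted male
tn <- 941  # actual female, predicted female

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
error     <- (fp + fn) / (tp + fn + fp + tn)

print(round(c(precision = precision, recall = recall, f1 = f1, error = error), 6))
```

The F1 value rounds to 0.978541 and the overall error rate to 0.021130, matching the `max f1` row and the `Totals` row of the output.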

### Plotting ROC

In [49]:
tpr=as.data.frame(h2o.tpr(my.perf))
fpr=as.data.frame(h2o.fpr(my.perf))
ROC_out<-merge(tpr,fpr,by='threshold')
head(ROC_out)

threshold,tpr,fpr
0.0,1.0,1.0
0.008333333,0.998935,0.6645702
0.016666667,0.998935,0.5555556
0.025,0.998935,0.4874214
0.027083333,0.998935,0.4318658
0.033333333,0.998935,0.4308176
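The area under the ROC curve can be approximated from such `fpr`/`tpr` pairs with the trapezoidal rule. The sketch below is plain base R; `trapezoid_auc` is a hypothetical helper and the points are toy values, not the ones printed above. H2O's own `h2o.auc` may differ slightly because it works on binned thresholds.

```r
# Trapezoidal area under a curve given as (fpr, tpr) points
trapezoid_auc <- function(fpr, tpr) {
  # Sort by fpr (ties broken by tpr) so the curve runs left to right
  ord <- order(fpr, tpr)
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  # Width of each step times the average height of its two endpoints
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Toy ROC points (made-up numbers)
toy_fpr <- c(0, 0, 0.5, 1)
toy_tpr <- c(0, 0.5, 1, 1)
print(trapezoid_auc(toy_fpr, toy_tpr))

# Sanity checks: the diagonal gives 0.5, a perfect classifier gives 1
print(trapezoid_auc(c(0, 1), c(0, 1)))
print(trapezoid_auc(c(0, 0, 1), c(0, 1, 1)))
```

With the real data, the same helper could be applied as `trapezoid_auc(ROC_out$fpr, ROC_out$tpr)`.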


In [50]:
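# Open a PDF device, draw the ROC curve, then close the device
library(ggplot2)
pdf(file="/Users/chriskuo/Downloads/my_ROC.pdf")
# print() makes sure the plot is drawn even when this runs inside a script or function
print(ggplot(ROC_out, aes(x = fpr, y = tpr)) +
  theme_bw() +
  geom_line() +
  ggtitle("ROC"))
dev.off()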

“Cannot open temporary file '/var/folders/jw/wyhtzlf94zbgtpf9g_n5tryr0000gn/T//Rtmp7gkpNR/pdf80843e17391' for compression (reason: No such file or directory); compression has been turned off for this device”

### Plotting Precision-Recall Curve

In [51]:
h2o.F1(my.perf)
precision=as.data.frame(h2o.precision(my.perf))
recall=as.data.frame(h2o.recall(my.perf))
PR_out<-merge(precision,recall,by='threshold')
head(PR_out)

threshold,f1
1.0000000,0.4867043
0.9983333,0.4939856
0.9972222,0.5035857
0.9955556,0.5047771
0.9916667,0.6450216
0.9900000,0.6479482
0.9888889,0.6498922
0.9872222,0.6508621
0.9833333,0.7327935
0.9805556,0.7353535


threshold,precision,tpr
0.0,0.496038,1.0
0.008333333,0.5966921,0.998935
0.016666667,0.6389646,0.998935
0.025,0.6685674,0.998935
0.027083333,0.6948148,0.998935
0.033333333,0.6953299,0.998935


In [52]:
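# Open a PDF device, draw the precision-recall curve, then close the device
library(ggplot2)
pdf(file="/Users/chriskuo/Downloads/my_PR.pdf")
# PR_out holds the merged precision and recall (tpr) columns from the previous cell
print(ggplot(PR_out, aes(x = tpr, y = precision)) +
  theme_bw() +
  geom_line() +
  ggtitle("Precision-Recall"))
dev.off()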

“Cannot open temporary file '/var/folders/jw/wyhtzlf94zbgtpf9g_n5tryr0000gn/T//Rtmp7gkpNR/pdf8085e27b98a' for compression (reason: No such file or directory); compression has been turned off for this device”

In [97]:
# All done. Shut down H2O.
h2o.shutdown(prompt=FALSE)