# H2O Use Case - Predictive Maintenance

- Source: https://archive.ics.uci.edu/ml/datasets/SECOM
- H2O Basics: train a default Gradient Boosting Machine (GBM) for binary classification.

In [1]:
# Load h2o library
suppressPackageStartupMessages(library(h2o))

In [2]:
# Start and connect to a local H2O cluster
h2o.init(nthreads = -1)


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpkdXBvI/h2o_joe_started_from_r.out
    /tmp/RtmpkdXBvI/h2o_joe_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 645 milliseconds 
    H2O cluster version:        3.10.4.4 
    H2O cluster version age:    3 days  
    H2O cluster name:           H2O_started_from_R_joe_xnu227 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.21 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 



In [3]:
# Import the SECOM data from a local CSV file into the H2O cluster
h_secom <- h2o.importFile(path = "secom.csv", destination_frame = "h_secom")



In [4]:
# Print out column names
colnames(h_secom)

In [5]:
# Look at "Classification"
summary(h_secom$Classification, exact_quantiles = TRUE)

 Classification   
 Min.   :-1.0000  
 1st Qu.:-1.0000  
 Median :-1.0000  
 Mean   :-0.8673  
 3rd Qu.:-1.0000  
 Max.   : 1.0000  

In [6]:
# "Classification" was imported as a numeric column (-1 / 1).
# Convert it to a factor so that H2O treats the task as binary classification
# rather than regression.
h_secom$Classification <- as.factor(h_secom$Classification)

In [7]:
# Look at "Classification" again
summary(h_secom$Classification, exact_quantiles = TRUE)

 Classification
 -1:1463       
 1 : 104       
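
The counts above show a strong class imbalance. As a quick sanity check, the following minimal base-R sketch recovers the positive-class share and the mean of -0.8673 reported by the earlier numeric summary. The counts are copied by hand from the output, and the variable names are illustrative only.

```r
# Class counts copied by hand from the summary output above
n_neg <- 1463  # label -1
n_pos <- 104   # label  1
n_total <- n_neg + n_pos

# Share of the minority (positive) class
pos_share <- n_pos / n_total

# Mean of the original -1/1 coded column, reconstructed from the counts
mean_label <- (n_pos * 1 + n_neg * -1) / n_total

cat(sprintf("rows: %d, positive share: %.4f, mean label: %.4f\n",
            n_total, pos_share, mean_label))
```

Only about 6.6% of the rows belong to class 1, so plain accuracy is a weak yardstick for any model trained on this target.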

In [8]:
# Define target (y) and features (x)
target <- "Classification"
features <- setdiff(colnames(h_secom), target)
print(features)

  [1] "Feature 001" "Feature 002" "Feature 003" "Feature 004" "Feature 005"
  [6] "Feature 006" "Feature 007" "Feature 008" "Feature 009" "Feature 010"
 [11] "Feature 011" "Feature 012" "Feature 013" "Feature 014" "Feature 015"
 [16] "Feature 016" "Feature 017" "Feature 018" "Feature 019" "Feature 020"
 [21] "Feature 021" "Feature 022" "Feature 023" "Feature 024" "Feature 025"
 [26] "Feature 026" "Feature 027" "Feature 028" "Feature 029" "Feature 030"
 [31] "Feature 031" "Feature 032" "Feature 033" "Feature 034" "Feature 035"
 [36] "Feature 036" "Feature 037" "Feature 038" "Feature 039" "Feature 040"
 [41] "Feature 041" "Feature 042" "Feature 043" "Feature 044" "Feature 045"
 [46] "Feature 046" "Feature 047" "Feature 048" "Feature 049" "Feature 050"
 [51] "Feature 051" "Feature 052" "Feature 053" "Feature 054" "Feature 055"
 [56] "Feature 056" "Feature 057" "Feature 058" "Feature 059" "Feature 060"
 [61] "Feature 061" "Feature 062" "Feature 063" "Feature 064" "Feature 065"
 [66] "Featu

In [9]:
# Split the dataset into training and test sets
h_split <- h2o.splitFrame(h_secom, ratios = 0.7, seed = 1234)
h_train <- h_split[[1]] # approx. 70% (splitFrame does not produce an exact split)
h_test  <- h_split[[2]] # approx. 30%
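
`h2o.splitFrame` splits probabilistically, so the resulting frames only approximate the requested ratio. The base-R sketch below illustrates that idea with a per-row random draw. It is an illustration under that assumption, not H2O's internal code; the row count of 1567 comes from the class counts shown earlier (1463 + 104), and the seed is arbitrary.

```r
# Illustration only: a per-row random split in base R, not H2O's internal code
set.seed(1234)                     # arbitrary seed for reproducibility
n_rows <- 1567                     # row count of the SECOM frame (1463 + 104)
in_train <- runif(n_rows) < 0.7    # each row lands in training with probability 0.7

n_train <- sum(in_train)
n_test  <- sum(!in_train)
cat("train rows:", n_train, " test rows:", n_test,
    " train fraction:", round(n_train / n_rows, 3), "\n")
```

The training fraction lands near 0.7 but rarely on it exactly, which is the behaviour to expect from the H2O split as well.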

In [10]:
# Check the dimensions (rows, columns) of each split
dim(h_train)
dim(h_test)

In [11]:
# Check the class distribution of "Classification" in each split
summary(h_train$Classification, exact_quantiles = TRUE)
summary(h_test$Classification, exact_quantiles = TRUE)

 Classification
 -1:1028       
 1 :  77       

 Classification
 -1:435        
 1 : 27        
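
To see how closely the split matches the requested 70/30 ratio, and whether both splits carry a similar share of positives, the counts above can be tabulated in base R. This is a small sketch with the counts copied by hand; the object names are illustrative.

```r
# Counts copied by hand from the two summaries above
counts <- rbind(train = c(neg = 1028, pos = 77),
                test  = c(neg = 435,  pos = 27))

n_rows    <- rowSums(counts)            # rows per split
row_share <- n_rows / sum(n_rows)       # share of all rows in each split
pos_rate  <- counts[, "pos"] / n_rows   # positive-class rate per split

print(round(cbind(n_rows, row_share, pos_rate), 4))
```

The training set holds about 70.5% of the rows, and the positive rate differs between the splits (about 7.0% in training versus 5.8% in test), which is consistent with a random split that is not stratified by class.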

In [12]:
# Train an H2O Gradient Boosting Machine (GBM) with default settings
model <- h2o.gbm(x = features, 
                 y = target, 
                 training_frame = h_train,
                 seed = 1234)

“Dropping constant columns: [Feature 516, Feature 234, Feature 233, Feature 236, Feature 235, Feature 510, Feature 238, Feature 513, Feature 237, Feature 479, Feature 515, Feature 514, Feature 193, Feature 192, Feature 195, Feature 194, Feature 075, Feature 230, Feature 232, Feature 231, Feature 529, Feature 244, Feature 365, Feature 401, Feature 400, Feature 006, Feature 403, Feature 402, Feature 405, Feature 404, Feature 241, Feature 482, Feature 243, Feature 242, Feature 180, Feature 179, Feature 459, Feature 050, Feature 053, Feature 450, Feature 210, Feature 331, Feature 452, Feature 330, Feature 451, Feature 191, Feature 070, Feature 190, Feature 506, Feature 505, Feature 508, Feature 507, Feature 509, Feature 465, Feature 343, Feature 464, Feature 467, Feature 466, Feature 227, Feature 348, Feature 502, Feature 504, Feature 503, Feature 463, Feature 187, Feature 462, Feature 399, Feature 277, Feature 398, Feature 315, Feature 314, Feature 316, Feature 150, Feature 395, Feature 3



In [13]:
# Print out model summary
summary(model)

Model Details:

H2OBinomialModel: gbm
Model Key:  GBM_model_R_1492511204569_3 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50               11653         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         18    12.56000

H2OBinomialMetrics: gbm
** Reported on training data. **

MSE:  0.004654337
RMSE:  0.0682227
LogLoss:  0.03489075
Mean Per-Class Error:  0
AUC:  1
Gini:  1

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         -1  1    Error     Rate
-1     1028  0 0.000000  =0/1028
1         0 77 0.000000    =0/77
Totals 1028 77 0.000000  =0/1105

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.510169 1.000000  75
2                       max f2  0.510169 1.000000  75
3                 max f0point5  0.510169 1.0000

In [14]:
# Check performance on test set
h2o.performance(model, h_test)

H2OBinomialMetrics: gbm

MSE:  0.05435156
RMSE:  0.2331342
LogLoss:  0.2154845
Mean Per-Class Error:  0.3260536
AUC:  0.7404427
Gini:  0.4808855

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
        -1  1    Error     Rate
-1     393 42 0.096552  =42/435
1       15 12 0.555556   =15/27
Totals 408 54 0.123377  =57/462

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.056665 0.296296  53
2                       max f2  0.038989 0.382653  86
3                 max f0point5  0.084880 0.259259  26
4                 max accuracy  0.584555 0.939394   0
5                max precision  0.084880 0.259259  26
6                   max recall  0.007542 1.000000 375
7              max specificity  0.584555 0.997701   0
8             max absolute_mcc  0.056665 0.254007  53
9   max min_per_class_accuracy  0.026387 0.703704 137
10 max mean_per_class_accuracy  0.023328
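
The headline test metrics can be reproduced from the test-set confusion matrix alone. The following base-R sketch copies the four cell counts by hand from the output above and derives precision, recall, F1, accuracy and mean per-class error; the variable names are illustrative.

```r
# Test-set confusion matrix copied by hand from the output above
# (rows: actual class, columns: predicted class)
cm <- matrix(c(393, 42,
               15,  12),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("-1", "1"), predicted = c("-1", "1")))

tp <- cm["1", "1"];  fn <- cm["1", "-1"]
fp <- cm["-1", "1"]; tn <- cm["-1", "-1"]

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / sum(cm)
mean_per_class_error <- (fp / (fp + tn) + fn / (fn + tp)) / 2

print(round(c(precision = precision, recall = recall, f1 = f1,
              accuracy = accuracy,
              mean_per_class_error = mean_per_class_error), 6))
```

The derived F1 (0.296296) and mean per-class error (0.326054) agree with the values H2O reports. At the F1-optimal threshold the model catches 12 of the 27 positive cases in the test set, and the gap between the training AUC of 1 and the test AUC of about 0.74 suggests that the default GBM overfits the training data.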