# Introductory H2O Machine Learning Tutorial



## Install H2O

The first step in this tutorial is to download and install the h2o Python module.  
The latest version is always here: http://www.h2o.ai/download/h2o/py

### Start up the H2O Cluster

Once the Python module is installed, we begin by starting up a local (on your laptop) H2O cluster.

In [1]:
# Load the H2O library and start up the H2O cluster locally on your machine
import h2o

# Number of threads, nthreads = -1, means use all cores on your machine
# max_mem_size is the maximum memory (in GB) to allocate to H2O
h2o.init(nthreads = -1, max_mem_size = 8)

ERROR:h2o:Key init.version_check is not a valid config key


Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime:,3 hours 19 mins
H2O cluster version:,3.10.5.3
H2O cluster version age:,11 days
H2O cluster name:,H2O_from_python_avkashchauhan_2m13v0
H2O cluster total nodes:,1
H2O cluster free memory:,6.350 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8
H2O cluster status:,"locked, healthy"
H2O connection url:,http://localhost:54321


## Data prep

### Import data
Next we will import a cleaned up version of the Lending Club "Bad Loans" dataset. The purpose here is to predict whether a loan will be bad (i.e. not repaid to the lender). The response column, `bad_loan`, is 1 if the loan was bad, and 0 otherwise.

In [2]:
loan_csv = "/Users/avkashchauhan/learn/workshop2017/loan-exercise/loan.csv"  # modify this for your machine
# Alternatively, you can import the data directly from a URL
#loan_csv = "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
data = h2o.import_file(loan_csv)  # 163,987 rows x 15 columns

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [3]:
data.shape

(163987, 15)

In [5]:
data.summary

loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,purpose,addr_state,dti,delinq_2yrs,revol_util,total_acc,bad_loan,longest_credit_length,verification_status
5000,36 months,10.65,10,RENT,24000,credit_card,AZ,27.65,0,83.7,9,0,26,verified
2500,60 months,15.27,0,RENT,30000,car,GA,1.0,0,9.4,4,1,12,verified
2400,36 months,15.96,10,RENT,12252,small_business,IL,8.72,0,98.5,10,0,10,not verified
10000,36 months,13.49,10,RENT,49200,other,CA,20.0,0,21.0,37,0,15,verified
5000,36 months,7.9,3,RENT,36000,wedding,AZ,11.2,0,28.3,12,0,7,verified
3000,36 months,18.64,9,RENT,48000,car,CA,5.35,0,87.5,4,0,4,verified
5600,60 months,21.28,4,OWN,40000,small_business,CA,5.55,0,32.6,13,1,7,verified
5375,60 months,12.69,0,RENT,15000,other,TX,18.08,0,36.5,3,1,7,verified
6500,60 months,14.65,5,OWN,72000,debt_consolidation,AZ,16.12,0,20.6,23,0,13,not verified
12000,36 months,12.69,10,OWN,75000,debt_consolidation,CA,10.78,0,67.1,34,0,22,verified


<bound method H2OFrame.summary of >

In [4]:
data.describe

loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,purpose,addr_state,dti,delinq_2yrs,revol_util,total_acc,bad_loan,longest_credit_length,verification_status
5000,36 months,10.65,10,RENT,24000,credit_card,AZ,27.65,0,83.7,9,0,26,verified
2500,60 months,15.27,0,RENT,30000,car,GA,1.0,0,9.4,4,1,12,verified
2400,36 months,15.96,10,RENT,12252,small_business,IL,8.72,0,98.5,10,0,10,not verified
10000,36 months,13.49,10,RENT,49200,other,CA,20.0,0,21.0,37,0,15,verified
5000,36 months,7.9,3,RENT,36000,wedding,AZ,11.2,0,28.3,12,0,7,verified
3000,36 months,18.64,9,RENT,48000,car,CA,5.35,0,87.5,4,0,4,verified
5600,60 months,21.28,4,OWN,40000,small_business,CA,5.55,0,32.6,13,1,7,verified
5375,60 months,12.69,0,RENT,15000,other,TX,18.08,0,36.5,3,1,7,verified
6500,60 months,14.65,5,OWN,72000,debt_consolidation,AZ,16.12,0,20.6,23,0,13,not verified
12000,36 months,12.69,10,OWN,75000,debt_consolidation,CA,10.78,0,67.1,34,0,22,verified


<bound method H2OFrame.describe of >

### Encode response variable
Since we want to train a binary classification model, we must ensure that the response is coded as a factor. If the response is 0/1, H2O will assume it's numeric, which means that H2O will train a regression model instead.

In [6]:
data['bad_loan'] = data['bad_loan'].asfactor()  #encode the binary response as a factor

In [7]:
data['bad_loan'].levels()  #optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

### Partition data

Next, we partition the data into training, validation and test sets.

In [8]:
# Partition data into 70%, 15%, 15% chunks
# Setting a seed will guarantee reproducibility
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

In [9]:
splits

loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,purpose,addr_state,dti,delinq_2yrs,revol_util,total_acc,bad_loan,longest_credit_length,verification_status
5000,36 months,10.65,10,RENT,24000,credit_card,AZ,27.65,0,83.7,9,0,26,verified
2500,60 months,15.27,0,RENT,30000,car,GA,1.0,0,9.4,4,1,12,verified
10000,36 months,13.49,10,RENT,49200,other,CA,20.0,0,21.0,37,0,15,verified
3000,36 months,18.64,9,RENT,48000,car,CA,5.35,0,87.5,4,0,4,verified
5375,60 months,12.69,0,RENT,15000,other,TX,18.08,0,36.5,3,1,7,verified
6500,60 months,14.65,5,OWN,72000,debt_consolidation,AZ,16.12,0,20.6,23,0,13,not verified
3000,36 months,9.91,3,RENT,15000,credit_card,IL,12.56,0,43.1,11,0,8,verified
1000,36 months,16.29,0,RENT,28000,debt_consolidation,MO,20.31,0,81.5,23,0,4,not verified
10000,36 months,15.27,4,RENT,42000,home_improvement,CA,18.6,0,70.2,28,0,13,not verified
3600,36 months,6.03,10,MORTGAGE,110000,major_purchase,CT,10.52,0,16.0,42,0,18,not verified


loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,purpose,addr_state,dti,delinq_2yrs,revol_util,total_acc,bad_loan,longest_credit_length,verification_status
2400,36 months,15.96,10,RENT,12252,small_business,IL,8.72,0,98.5,10,0,10,not verified
5000,36 months,7.9,3,RENT,36000,wedding,AZ,11.2,0,28.3,12,0,7,verified
5600,60 months,21.28,4,OWN,40000,small_business,CA,5.55,0,32.6,13,1,7,verified
9000,36 months,13.49,0,RENT,30000,debt_consolidation,VA,10.08,0,91.7,9,1,7,verified
10000,36 months,10.65,3,RENT,100000,other,CA,7.06,0,55.5,29,1,20,verified
3000,36 months,18.25,9,MORTGAGE,65000,other,PA,17.39,0,98.1,22,0,13,not verified
10000,36 months,10.65,2,RENT,51400,credit_card,TX,19.14,0,59.1,24,0,11,not verified
8000,36 months,16.77,0,RENT,62000,debt_consolidation,VA,21.64,0,66.9,20,0,5,not verified
16425,36 months,14.27,4,RENT,44544,debt_consolidation,CA,22.71,0,83.4,18,0,9,verified
3000,36 months,13.49,1,RENT,33600,debt_consolidation,AL,18.11,0,83.2,7,0,6,not verified


loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,purpose,addr_state,dti,delinq_2yrs,revol_util,total_acc,bad_loan,longest_credit_length,verification_status
12000,36 months,12.69,10,OWN,75000,debt_consolidation,CA,10.78,0,67.1,34,0,22,verified
10000,36 months,11.71,5,RENT,50000,debt_consolidation,CA,16.01,0,91.8,17,0,8,not verified
6000,36 months,6.03,10,MORTGAGE,45600,debt_consolidation,LA,5.34,0,32.5,28,0,16,not verified
16000,60 months,19.91,7,RENT,81000,credit_card,MA,20.52,0,75.1,21,0,13,verified
17675,60 months,14.65,0,RENT,50000,debt_consolidation,WI,16.46,0,57.4,14,0,9,verified
20975,60 months,17.58,5,MORTGAGE,44000,credit_card,GA,18.79,0,79.4,21,0,11,verified
6400,36 months,16.77,5,RENT,75000,debt_consolidation,CA,20.22,0,67.5,27,1,17,not verified
18000,60 months,19.91,10,MORTGAGE,65000,debt_consolidation,FL,6.81,0,77.8,40,0,22,not verified
35000,60 months,17.27,3,MORTGAGE,150000,home_improvement,NY,7.51,0,53.3,31,0,8,verified
7000,36 months,11.71,4,OWN,39120,debt_consolidation,FL,21.01,0,52.4,26,0,15,not verified


[, , ]

In [10]:
train = splits[0]
valid = splits[1]
test = splits[2]

In [13]:
train.shape
test.shape

(24581, 15)

In [12]:
print(train.shape)
print(valid.shape)
print(test.shape)

(114908, 15)
(24498, 15)
(24581, 15)


Notice that `split_frame()` uses approximate splitting not exact splitting (for efficiency), so these are not exactly 70%, 15% and 15% of the total rows.

In [14]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

114908
24498
24581
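
To see why the partition sizes above are close to, but not exactly, 70/15/15, here is a minimal pure-Python sketch of a probabilistic split. It assumes each row is assigned independently by a uniform random draw. This is an illustration of the idea only, not H2O's actual implementation, and `approximate_split` is a hypothetical helper name.

```python
import random

def approximate_split(n_rows, ratios, seed=1):
    """Assign each row index to a partition by an independent random draw."""
    rng = random.Random(seed)
    # Cumulative cut points; the last partition takes whatever is left over.
    cuts = []
    total = 0.0
    for r in ratios:
        total += r
        cuts.append(total)
    parts = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for k, cut in enumerate(cuts):
            if u < cut:
                parts[k].append(i)
                break
        else:
            # The draw fell past every cut point: remainder partition.
            parts[-1].append(i)
    return parts

# Same row count and ratios as the tutorial; the sizes will differ from H2O's.
parts = approximate_split(163987, [0.7, 0.15])
sizes = [len(p) for p in parts]
print(sizes)
```

Because every row is decided by its own draw, no sorting or exact counting is needed, which is the efficiency trade-off mentioned above.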


### Identify response and predictor variables
In H2O, we use `y` to designate the response variable and `x` to designate the list of predictor columns.

In [15]:
train['bad_loan']

bad_loan
0
1
0
0
1
0
0
0
0
0




#### List of columns in your dataframe

In [16]:
data.columns

['loan_amnt',
 'term',
 'int_rate',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'purpose',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'revol_util',
 'total_acc',
 'bad_loan',
 'longest_credit_length',
 'verification_status']

#### Define the response variable

In [21]:
y = 'bad_loan'

#### Get the list of all columns, including the response

In [22]:
x = list(data.columns)
print(x)

['loan_amnt', 'term', 'int_rate', 'emp_length', 'home_ownership', 'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'revol_util', 'total_acc', 'bad_loan', 'longest_credit_length', 'verification_status']


#### Remove the response from the list

In [23]:
x.remove(y)  #remove the response
print(x)

['loan_amnt', 'term', 'int_rate', 'emp_length', 'home_ownership', 'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'revol_util', 'total_acc', 'longest_credit_length', 'verification_status']


#### Now remove the interest rate column because it's correlated with the outcome

In [24]:
x.remove('int_rate')  

In [25]:
# List of predictor columns
x

['loan_amnt',
 'term',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'purpose',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'revol_util',
 'total_acc',
 'longest_credit_length',
 'verification_status']
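
The same predictor list can also be built in one step with a list comprehension. This is a sketch that hard-codes the column names printed above so that it runs without an H2O cluster; in the notebook you would iterate over `data.columns` instead. The `excluded` set is a name introduced here for illustration.

```python
# Column names as printed by data.columns above
columns = ['loan_amnt', 'term', 'int_rate', 'emp_length', 'home_ownership',
           'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs',
           'revol_util', 'total_acc', 'bad_loan', 'longest_credit_length',
           'verification_status']

y = 'bad_loan'
excluded = {y, 'int_rate'}  # the response plus the interest rate column

# Keep every column that is not excluded, preserving the original order
x = [col for col in columns if col not in excluded]
print(len(x))
print(x)
```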

## H2O Machine Learning

Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algos:

- Generalized Linear Model (GLM)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)
- Deep Learning (DL)
- Naive Bayes (NB)

## 1. Generalized Linear Model
Let's start with a basic binomial Generalized Linear Model (GLM).  By default, H2O's GLM uses a regularized, elastic net model.

In [26]:
# Import H2O GLM:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

### Train a default GLM
We first create an object of class `H2OGeneralizedLinearEstimator`.  This does not actually do any training; it just sets the model up for training by specifying model parameters.

In [27]:
# Initialize the GLM estimator:
# Similar to R's glm() and H2O's R GLM, H2O's GLM has the "family" argument

glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

Now that the `glm_fit1` object is initialized, we can train the model:

In [28]:
glm_fit1.train(x=x, y=y, training_frame=train)

glm Model Build progress: |███████████████████████████████████████████████| 100%


### Train a GLM with lambda search

Next we will do some automatic tuning by passing in a validation frame and setting `lambda_search = True`.  Since we are training a GLM with regularization, we should try to find the right amount of regularization (to avoid overfitting).  The model parameter, `lambda`, controls the amount of regularization in a GLM model and we can find the optimal value for `lambda` automatically by setting `lambda_search = True` and passing in a validation frame (which is used to evaluate model performance using a particular value of lambda).

In [29]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%
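
To make the role of `lambda` concrete, the sketch below computes an elastic net penalty of the form lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2), assuming this is the form H2O's GLM uses (it matches the GLM documentation as far as we know). The function name `elastic_net_penalty` and the coefficient values are made up for illustration. The point is only that a larger `lambda` adds a larger penalty for the same coefficients, which is the trade-off that lambda search explores.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical coefficients; alpha=0.5 mixes the L1 and L2 terms equally
beta = [1.0, -2.0]
for lam in (0.0, 0.1, 1.0):
    print(lam, elastic_net_penalty(beta, lam, alpha=0.5))
```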


#### Displaying the model

In [31]:
glm_fit1

Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_fit1


ModelMetricsBinomialGLM: glm
** Reported on train data. **

MSE: 0.13984803127954054
RMSE: 0.3739626067931666
LogLoss: 0.4457344746970694
Null degrees of freedom: 114907
Residual degrees of freedom: 114856
Null deviance: 108939.63716429766
Residual deviance: 102436.9140369817
AIC: 102540.9140369817
AUC: 0.67415072995296
Gini: 0.34830145990592
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.1891036422294594: 


,0.0,1.0,Error,Rate
0,62290.0,31734.0,0.3375,(31734.0/94024.0)
1,8583.0,12301.0,0.411,(8583.0/20884.0)
Total,70873.0,44035.0,0.3509,(40317.0/114908.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.1891036,0.3789646,226.0
max f2,0.1120643,0.5475462,315.0
max f0point5,0.2683107,0.3409905,153.0
max accuracy,0.5215806,0.8187332,24.0
max precision,0.7383309,1.0,0.0
max recall,0.0009766,1.0,399.0
max specificity,0.7383309,1.0,0.0
max absolute_mcc,0.2081632,0.2012044,207.0
max min_per_class_accuracy,0.1802906,0.6242555,235.0


Gains/Lift Table: Avg response rate: 18.17 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100080,0.4669396,2.7415323,2.7415323,0.4982609,0.4982609,0.0274373,0.0274373,174.1532274,174.1532274
,2,0.0200073,0.4282534,2.4565970,2.5991266,0.4464752,0.4723793,0.0245643,0.0520015,145.6597003,159.9126607
,3,0.0300066,0.4036249,2.1501210,2.4495015,0.3907746,0.4451856,0.0214997,0.0735012,115.0120963,144.9501467
,4,0.0400059,0.3839071,2.1213888,2.3674912,0.3855527,0.4302806,0.0212124,0.0947137,112.1388835,136.7491153
,5,0.0500052,0.3672320,2.1644870,2.3268974,0.3933856,0.4229029,0.0216434,0.1163570,116.4487028,132.6897393
,6,0.1000017,0.3109149,1.8790812,2.1030088,0.3415144,0.3822122,0.0939475,0.2103045,87.9081216,110.3008790
,7,0.1500070,0.2743605,1.6163798,1.9407897,0.2937696,0.3527296,0.0808274,0.2911320,61.6379753,94.0789701
,8,0.2000035,0.2483284,1.5084368,1.8327109,0.2741514,0.3330868,0.0754166,0.3665486,50.8436756,83.2710871
,9,0.3000052,0.2112755,1.2770320,1.6474846,0.2320947,0.2994227,0.1277054,0.4942540,27.7031977,64.7484573



Scoring History: 


,timestamp,duration,iteration,negative_log_likelihood,objective
,2017-07-12 01:21:33,0.000 sec,0,54469.8185821,0.4740298
,2017-07-12 01:21:33,0.038 sec,1,51474.8736259,0.4483405
,2017-07-12 01:21:33,0.060 sec,2,51227.1975515,0.4461995
,2017-07-12 01:21:33,0.080 sec,3,51218.4598670,0.4461301
,2017-07-12 01:21:33,0.098 sec,4,51218.4570185,0.4461301




### Evaluate model performance
Let's compare the performance of the two GLMs that were just trained.

In [32]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

In [33]:
# Print model performance
print(glm_perf1)
print(glm_perf2)


ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.14215509900200168
RMSE: 0.37703461247211995
LogLoss: 0.4510784560183179
Null degrees of freedom: 24580
Residual degrees of freedom: 24529
Null deviance: 23672.922265642344
Residual deviance: 22175.919054772545
AIC: 22279.919054772545
AUC: 0.6774770487678061
Gini: 0.35495409753561225
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.1933006386619594: 


,0.0,1.0,Error,Rate
0,13645.0,6346.0,0.3174,(6346.0/19991.0)
1,1939.0,2651.0,0.4224,(1939.0/4590.0)
Total,15584.0,8997.0,0.337,(8285.0/24581.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.1933006,0.3902260,221.0
max f2,0.1186092,0.5566546,303.0
max f0point5,0.2762009,0.3539777,151.0
max accuracy,0.4942441,0.8144095,33.0
max precision,0.7445000,1.0,0.0
max recall,0.0025743,1.0,398.0
max specificity,0.7445000,1.0,0.0
max absolute_mcc,0.1941702,0.2106102,220.0
max min_per_class_accuracy,0.1800094,0.6279326,234.0


Gains/Lift Table: Avg response rate: 18.67 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100077,0.4653404,2.8735958,2.8735958,0.5365854,0.5365854,0.0287582,0.0287582,187.3595834,187.3595834
,2,0.0200155,0.4286398,2.3946632,2.6341295,0.4471545,0.4918699,0.0239651,0.0527233,139.4663195,163.4129514
,3,0.0300232,0.4021231,2.2422755,2.5035115,0.4186992,0.4674797,0.0224401,0.0751634,124.2275537,150.3511522
,4,0.0400309,0.3853779,2.2858149,2.4490874,0.4268293,0.4573171,0.0228758,0.0980392,128.5814868,144.9087359
,5,0.0500386,0.3692731,2.0463485,2.3685396,0.3821138,0.4422764,0.0204793,0.1185185,104.6348548,136.8539597
,6,0.1000366,0.3097208,1.8911445,2.1299391,0.3531326,0.3977227,0.0945534,0.2130719,89.1144473,112.9939106
,7,0.1500346,0.2743521,1.6863431,1.9821139,0.3148902,0.3701193,0.0843137,0.2973856,68.6343113,98.2113869
,8,0.2000325,0.2482081,1.3943922,1.8352133,0.2603743,0.3426886,0.0697168,0.3671024,39.4392238,83.5213343
,9,0.3000285,0.2107601,1.2810979,1.6505332,0.2392189,0.3082034,0.1281046,0.4952070,28.1097869,65.0533230





ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.14221872366627972
RMSE: 0.37711897813061557
LogLoss: 0.4512741024580795
Null degrees of freedom: 24580
Residual degrees of freedom: 24533
Null deviance: 23672.922265642344
Residual deviance: 22185.537425044102
AIC: 22281.537425044102
AUC: 0.6769812537646298
Gini: 0.3539625075292596
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.19498552263693553: 


,0.0,1.0,Error,Rate
0,13774.0,6217.0,0.311,(6217.0/19991.0)
1,1984.0,2606.0,0.4322,(1984.0/4590.0)
Total,15758.0,8823.0,0.3336,(8201.0/24581.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.1949855,0.3885782,222.0
max f2,0.1186837,0.5562704,308.0
max f0point5,0.2588212,0.3518654,160.0
max accuracy,0.4757272,0.8145315,33.0
max precision,0.7361547,1.0,0.0
max recall,0.0026919,1.0,398.0
max specificity,0.7361547,1.0,0.0
max absolute_mcc,0.1949855,0.2085943,222.0
max min_per_class_accuracy,0.1803942,0.6270322,238.0


Gains/Lift Table: Avg response rate: 18.67 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100077,0.4620187,2.9606745,2.9606745,0.5528455,0.5528455,0.0296296,0.0296296,196.0674496,196.0674496
,2,0.0200155,0.4252075,2.2205059,2.5905902,0.4146341,0.4837398,0.0222222,0.0518519,122.0505872,159.0590184
,3,0.0300232,0.3988653,2.3075845,2.4962550,0.4308943,0.4661247,0.0230937,0.0749455,130.7584533,149.6254967
,4,0.0400309,0.3813612,2.3511239,2.4599722,0.4390244,0.4593496,0.0235294,0.0984749,135.1123864,145.9972191
,5,0.0500386,0.3665433,2.0681182,2.3816014,0.3861789,0.4447154,0.0206972,0.1191721,106.8118214,138.1601396
,6,0.1000366,0.3089697,1.8824295,2.1321170,0.3515053,0.3981293,0.0941176,0.2132898,88.2429522,113.2116958
,7,0.1500346,0.2734416,1.5991936,1.9545240,0.2986168,0.3649675,0.0799564,0.2932462,59.9193598,95.4524005
,8,0.2000325,0.2478303,1.5207590,1.8461048,0.2839707,0.3447224,0.0760349,0.3692810,52.0759035,84.6104817
,9,0.3000285,0.2103529,1.2614892,1.6512594,0.2355574,0.3083390,0.1261438,0.4954248,26.1489228,65.1259377






Instead of printing the entire model performance metrics object, it is probably easier to print just the metric that you are interested in comparing.

In [36]:
# Retrieve test set AUC
print(glm_perf1.auc())
print(glm_perf2.auc())

0.6774770487678061
0.6769812537646298
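
The printed metrics can also be cross-checked by hand. The sketch below uses numbers copied from the `glm_perf1` test-set output above to confirm that the reported Gini equals 2*AUC - 1 and that the max F1 value follows from the confusion matrix at that threshold. The variable names are introduced here for illustration.

```python
# Values copied from the glm_perf1 test-set output above
auc = 0.6774770487678061
tn, fp = 13645, 6346   # actual 0: predicted 0, predicted 1
fn, tp = 1939, 2651    # actual 1: predicted 0, predicted 1

# Gini coefficient as a rescaling of AUC
gini = 2 * auc - 1

# F1 is the harmonic mean of precision and recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Gini:", gini)
print("F1:  ", f1)
```

Both values should agree, up to rounding, with the `Gini` line and the `max f1` row in the output above.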


In [37]:
# Compare test AUC to the training AUC and validation AUC
print(glm_fit2.auc(train=True))
print(glm_fit2.auc(valid=True))

0.6735092338305698
0.6753405433388947


## 2. Random Forest
H2O's Random Forest (RF) implements a distributed version of the standard Random Forest algorithm and variable importance measures.

In [38]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

### Train a default RF
First we will train a basic Random Forest model with default parameters. Random Forest will infer the response distribution from the response encoding. A seed is required for reproducibility.

In [40]:
# Initialize the RF estimator:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1)

Now that the `rf_fit1` object is initialized, we can train the model:

In [41]:
rf_fit1.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


### Train an RF with more trees

Next we will increase the number of trees used in the forest by setting `ntrees = 100`.  The default number of trees in an H2O Random Forest is 50, so this RF will be twice as big as the default.  Usually increasing the number of trees in an RF will increase performance as well.  Unlike Gradient Boosting Machines (GBMs), Random Forests are fairly resistant to (although not free from) overfitting as the number of trees increases.  See the GBM example below for additional guidance on preventing overfitting using H2O's early stopping functionality.

In [None]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1)
rf_fit2.train(x=x, y=y, training_frame=train)

### Compare model performance
Let's compare the performance of the two RFs that were just trained.

In [None]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

In [None]:
# Retrieve test set AUC
print(rf_perf1.auc())
print(rf_perf2.auc())

### Cross-validate performance

Rather than using a held-out test set to evaluate model performance, a user may wish to estimate model performance using cross-validation.  Using the RF algorithm (with default model parameters) as an example, we demonstrate how to perform k-fold cross-validation using H2O.  No custom code or loops are required; you simply specify the number of desired folds in the `nfolds` argument.

Since we are not going to use a test set here, we can use the original (full) dataset, which we called `data` rather than the subsampled `train` dataset.  Note that this will take approximately k (`nfolds`) times longer than training a single RF model, since it will train k models in the cross-validation process (trained on n(k-1)/k rows), in addition to the final model trained on the full `training_frame` dataset with n rows. 

In [None]:
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=data)

To evaluate the cross-validated AUC, do the following:

In [None]:
print(rf_fit3.auc(xval=True))

Note that the cross-validated AUC is slightly higher than the test set performance we estimated for `rf_fit1`, and this is likely due to the fact that we trained on more data (n rows) than we did while using `train` as the training set (about 0.70*n rows) in `rf_fit1`.
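
The row counts involved in k-fold cross-validation can be sketched in plain Python. The example below assumes a simple modulo fold assignment for clarity, which may differ from the fold assignment H2O uses by default, and `kfold_indices` is a hypothetical helper name. It shows that each of the k cross-validation models trains on n(k-1)/k rows and is scored on the remaining n/k rows.

```python
def kfold_indices(n_rows, nfolds):
    """Yield (train_indices, holdout_indices) for each of the nfolds folds."""
    # Assumption for illustration: row i goes to fold i % nfolds
    folds = [[] for _ in range(nfolds)]
    for i in range(n_rows):
        folds[i % nfolds].append(i)
    for k in range(nfolds):
        holdout = folds[k]
        # Train on every row that is not in the current holdout fold
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, holdout

n, k = 1000, 5
sizes = [(len(train), len(holdout)) for train, holdout in kfold_indices(n, k)]
print(sizes)
```

With n = 1000 and k = 5, every cross-validation model sees 800 training rows and 200 holdout rows, and each row is held out exactly once.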

## 3. Gradient Boosting Machine
H2O's Gradient Boosting Machine (GBM) offers a Stochastic GBM, which can increase performance quite a bit compared to the original GBM implementation.

In [None]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

### Train a default GBM

First we will train a basic GBM model with default parameters. GBM will infer the response distribution from the response encoding if not specified explicitly through the `distribution` argument. A seed is required for reproducibility.

In [None]:
# Initialize and train the GBM estimator:

gbm_fit1 = H2OGradientBoostingEstimator(model_id='gbm_fit1', seed=1)
gbm_fit1.train(x=x, y=y, training_frame=train)

### Train a GBM with more trees

Next we will increase the number of trees used in the GBM by setting `ntrees=500`.  The default number of trees in an H2O GBM is 50, so this GBM will be trained using ten times the default.  Increasing the number of trees in a GBM is one way to increase performance of the model, however, you have to be careful not to overfit your model to the training data by using too many trees.  To automatically find the optimal number of trees, you must use H2O's early stopping functionality.  This example will not do that, however, the following example will.

In [None]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500, seed=1)
gbm_fit2.train(x=x, y=y, training_frame=train)

### Train a GBM with early stopping

We will again set `ntrees = 500`, however, this time we will use early stopping in order to prevent overfitting (from too many trees).  All of H2O's algorithms have early stopping available, however, with the exception of Deep Learning, it is not enabled by default.  

There are several parameters that should be used to control early stopping.  The three that are generic to all the algorithms are: `stopping_rounds`, `stopping_metric` and `stopping_tolerance`.  The stopping metric is the metric by which you'd like to measure performance, and so we will choose AUC here.  The `score_tree_interval` is a parameter specific to Random Forest and GBM.  Setting `score_tree_interval=5` will score the model after every five trees.  The parameters we have set below specify that the model will stop training after there have been three scoring intervals where the AUC has not increased more than 0.0005.  Since we have specified a validation frame, the stopping tolerance will be computed on validation AUC rather than training AUC. 

In [None]:
# Now let's use early stopping to find optimal ntrees

gbm_fit3 = H2OGradientBoostingEstimator(model_id='gbm_fit3', 
                                        ntrees=500, 
                                        score_tree_interval=5,     #used for early stopping
                                        stopping_rounds=3,         #used for early stopping
                                        stopping_metric='AUC',     #used for early stopping
                                        stopping_tolerance=0.0005, #used for early stopping
                                        seed=1)

# The use of a validation_frame is recommended when using early stopping
gbm_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)
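
The stopping rule described above can be sketched as a small function. This is a deliberately simplified version, not H2O's exact rule (H2O's own check may, for example, smooth the metric over several scoring events), and both `should_stop` and the AUC values in `history` are hypothetical. It stops once none of the last `stopping_rounds` scoring events improves on the earlier best by more than the relative tolerance.

```python
def should_stop(scores, stopping_rounds=3, stopping_tolerance=0.0005):
    """Simplified early-stopping check for a metric where higher is better."""
    if len(scores) <= stopping_rounds:
        return False  # not enough scoring events to compare yet
    best_before = max(scores[:-stopping_rounds])
    recent_best = max(scores[-stopping_rounds:])
    # Stop if the recent best does not beat the earlier best by the tolerance
    return recent_best <= best_before * (1 + stopping_tolerance)

# Hypothetical validation AUC at each scoring event (one every 5 trees)
history = [0.650, 0.662, 0.670, 0.6701, 0.6702, 0.6702]

stopped_at = None
for n_events in range(1, len(history) + 1):
    if should_stop(history[:n_events]):
        stopped_at = n_events
        break

print("stopped after scoring event:", stopped_at)
```

With `score_tree_interval=5`, a stop after the sixth scoring event would correspond to 30 trees in this made-up history.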

### Compare model performance

Let's compare the performance of the three GBMs that were just trained.

In [None]:
gbm_perf1 = gbm_fit1.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
gbm_perf3 = gbm_fit3.model_performance(test)

In [None]:
# Retrieve test set AUC
print(gbm_perf1.auc())
print(gbm_perf2.auc())
print(gbm_perf3.auc())

### Scoring History

To examine the scoring history, use the `scoring_history` method on a trained model.  If `score_tree_interval` is not specified, it will score at various intervals, as we can see for `gbm_fit2.scoring_history()` below.  However, regular 5-tree intervals are used for `gbm_fit3.scoring_history()`.  

The `gbm_fit2` was trained only using a training set (no validation set), so the scoring history is calculated for training set performance metrics only.

In [None]:
gbm_fit2.scoring_history()

When early stopping is used, we see that training stopped at 105 trees instead of the full 500.  Since we used a validation set in `gbm_fit3`, both training and validation performance metrics are stored in the scoring history object.  Take a look at the validation AUC to observe that the correct stopping tolerance was enforced.

In [None]:
gbm_fit3.scoring_history()

## 4. Deep Learning

H2O's Deep Learning algorithm is a multilayer feed-forward artificial neural network.  It can also be used to train an autoencoder, however, in the example below we will train a standard supervised prediction model.

In [None]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

### Train a default DL

First we will train a basic DL model with default parameters. DL will infer the response distribution from the response encoding if not specified explicitly through the `distribution` argument.  H2O's DL will not be reproducible if run on more than a single core, so in this example, the performance metrics below may vary slightly from what you see on your machine.

In H2O's DL, early stopping is enabled by default, so below, it will use the training set and default stopping parameters to perform early stopping.

In [None]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1)
dl_fit1.train(x=x, y=y, training_frame=train)

### Train a DL with new architecture and more epochs

Next we will increase the number of epochs used in the DL model by setting `epochs=20` (the default is 10), and change the architecture to two hidden layers of 10 units each by setting `hidden=[10,10]`.  Increasing the number of epochs in a deep neural net may increase performance of the model, however, you have to be careful not to overfit your model.  To automatically find the optimal number of epochs, you must use H2O's early stopping functionality.  Unlike the rest of the H2O algorithms, H2O's DL will use early stopping by default, so we will first turn it off in the next example by setting `stopping_rounds=0`, for comparison.

In [None]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=20, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1)
dl_fit2.train(x=x, y=y, training_frame=train)

### Train a DL with early stopping

This example will use the same model parameters as `dl_fit2`, however, we will turn on early stopping and specify the stopping criterion.  We will also pass a validation set, as is recommended for early stopping.

In [None]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=20, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=3,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

### Compare model performance

Again, we will compare the model performance of the three models using a test set and AUC.

In [None]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [None]:
# Retrieve test set AUC
print(dl_perf1.auc())
print(dl_perf2.auc())
print(dl_perf3.auc())

In [None]:
dl_fit3.scoring_history()

## 5. Naive Bayes

The Naive Bayes (NB) algorithm does not usually beat an algorithm like a Random Forest or GBM, however it is still a popular algorithm, especially in the text domain (when your input is text encoded as "Bag of Words", for example).  The Naive Bayes algorithm is for binary or multiclass classification problems only, not regression.  Therefore, your response must be a factor instead of numeric. 

In [None]:
# Import H2O NB:
from h2o.estimators.naive_bayes import H2ONaiveBayesEstimator

### Train a default NB

First we will train a basic NB model with default parameters. 

In [None]:
# Initialize and train the NB estimator:

nb_fit1 = H2ONaiveBayesEstimator(model_id='nb_fit1')
nb_fit1.train(x=x, y=y, training_frame=train)

### Train an NB model with Laplace smoothing

One of the few tunable model parameters for the Naive Bayes algorithm is the amount of Laplace smoothing.  The H2O Naive Bayes model will not use any Laplace smoothing by default.

In [None]:
nb_fit2 = H2ONaiveBayesEstimator(model_id='nb_fit2', laplace=6)
nb_fit2.train(x=x, y=y, training_frame=train)
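
For intuition, Laplace smoothing adds a constant to every category count so that a level never seen together with a given class does not get a conditional probability of zero. The sketch below uses the textbook additive-smoothing formula; whether H2O applies it in exactly this form is an assumption, and `smoothed_prob` and the counts are hypothetical.

```python
def smoothed_prob(count, class_total, n_levels, laplace=0.0):
    """Estimate P(level | class) with additive (Laplace) smoothing."""
    return (count + laplace) / (class_total + laplace * n_levels)

# Hypothetical counts: a 3-level categorical predictor within one class,
# where the first level was never observed for that class
counts = [0, 4, 6]
class_total = sum(counts)

no_smoothing = [smoothed_prob(c, class_total, len(counts)) for c in counts]
with_smoothing = [smoothed_prob(c, class_total, len(counts), laplace=6)
                  for c in counts]

print(no_smoothing)
print(with_smoothing)
```

Without smoothing the unseen level gets probability zero, which would zero out the whole product of conditional probabilities for that class; with smoothing every level keeps a nonzero probability and the estimates still sum to one.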

### Compare model performance

We will compare the model performance of the two NB models using test set AUC.

In [None]:
nb_perf1 = nb_fit1.model_performance(test)
nb_perf2 = nb_fit2.model_performance(test)

In [None]:
# Retrieve test set AUC
print(nb_perf1.auc())
print(nb_perf2.auc())

In [None]:
h2o.shutdown(prompt=False)