# ++++++++++++++++++++++++++++++
# Prudential Life Insurance Assessment
# ++++++++++++++++++++++++++++++

## Part 2: Train a GBM Model
## ----------------------------------------------

Author: <a href='mailto:rory_creedon@swissre.com'>Rory Creedon</a><br/>
Date: September 2017 <br/>
Purpose: To train gradient boosting machine model

### Introduction to Boosting

Boosting is a "forward learning" ensemble method that grows decision trees sequentially, using information from the trees grown previously. 

The idea behind boosting is that the predictions of a weak learner (that is, a decision tree whose predictions are only slightly better than chance) can be separated into correct predictions and residuals (errors). A new decision tree is then fit to these residuals and added to the previous tree, the residuals are updated, and the process is repeated on the new residuals. 

In the original implementation of this method, each tree was split only once and so had only 2 terminal nodes. More recent implementations allow deeper trees, although often only a few terminal nodes are sufficient. 
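
The residual-fitting loop described above can be sketched in a few lines. This is a toy illustration for squared-error regression on one feature, using single-split trees (stumps); the function names, the toy data and the `learn_rate` of 0.1 are all assumptions made for the example, and H2O's multinomial GBM is considerably more involved.

```python
# Toy gradient boosting with depth-1 trees (stumps) on a single feature.
# Illustrative only: not the H2O implementation.

def fit_stump(xs, residuals):
    """Find the single split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees=50, learn_rate=0.1):
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # current errors
        stump = fit_stump(xs, residuals)                 # fit a tree to the errors
        stumps.append(stump)
        # add the new tree to the ensemble, shrunk by the learning rate
        preds = [p + learn_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history

xs = [i / 10 for i in range(40)]
ys = [(x - 2) ** 2 for x in xs]       # toy target, not the Prudential data
base, stumps, mse_history = boost(xs, ys)
print("MSE after first tree:", mse_history[0])
print("MSE after last tree: ", mse_history[-1])
```

Each stump on its own is a poor model of the curve, but the training error falls as the stumps accumulate.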

### Parameters of the Model

I am using the <a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html"> H2O</a> GBM implementation. 

What follows is a list of parameters and notes about their applicability to the problem in hand:

|**Parameter** |**From the Docs** |**Notes**|
|---|---|---|
| `ntrees` | Specify the number of trees to build | In general a small learning rate and a large number of trees is preferable. In practice we will estimate the optimal number of trees with a cross-validated approach. As a rule of thumb the number of trees should not be less than 1000. |
| `max_depth` | Specify the maximum tree depth| Deeper trees perform better on a training set as they can overfit the data. Greater depth also increases the computation time, particularly for trees deeper than 10. An optimal range of depths will be estimated using a grid search.|
|`min_rows` | Specify the minimum number of observations for a leaf  | This option specifies the number of observations that must be in each of the leaves that result from a split if the split is to be allowed. The default value is 10. There seems to be no need to optimize this parameter.|
|`nbins`| (Numerical/real/int only) Specify the number of bins for the histogram to build, then split at the best point | When a split in the data is being considered, a histogram is constructed of the feature and the best split is determined. The splits are made at the boundaries of the bins. Increasing the number of bins increases the specificity of the model (and can lead to overfitting). If we observe overfitting in our models we will reduce the number of bins from the default 20.|
|`nbins_cats` | (Categorical/factor only) Specify the number of bins for the histogram to build, then split at the best point  | A small value for categorical variables with many levels introduces randomness to the split, whereas large values result in perfect splits and overfitting. If we observe overfitting in our models we will reduce the number of bins from the default 1024.|
| `learn_rate`|Specify the learning rate. | Lower rates are better but require more trees|
|`learn_rate_annealing` | Specifies to reduce the `learn_rate` by this factor after every tree. |  Instead of using a learn rate of 0.01, we can try a learn rate of 0.05 with an annealing of 0.99. The result should be generated more quickly without much accuracy being sacrificed. |
|`sample_rate_per_class` | Specifies that each tree in the ensemble should sample from the full training dataset using a per-class-specific sampling rate rather than a global sample factor | This is useful when working with unbalanced data sets like this one. It varies between 0 and 1: a value of 1 improves the training score, while values less than 1 improve test scores. I could not figure out how to tune this parameter correctly, so it does not appear below.|
| `col_sample_rate` | Specify the column sampling rate (y-axis) | Good general values for large datasets are around 0.7 to 0.8  |
| `max_abs_leafnode_pred` | Reduces overfitting by limiting the maximum absolute value of a leaf node prediction. | |
| `min_split_improvement` | Specifies the minimum relative improvement in squared error reduction in order for a split to happen. | When properly tuned, this option can help reduce overfitting. Optimal values would be in the 1e-10...1e-3 range.|
|`balance_classes` | Specify whether to oversample the minority classes to balance the class distribution. | This is useful when the response variable is highly imbalanced, as it is in this case. When used the model undersamples majority classes or oversamples minority classes. This option will be used in all models. |
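
To make the `learn_rate_annealing` note concrete, the sketch below compares a constant learn rate of 0.01 with a learn rate of 0.05 annealed by 0.99 per tree. It assumes the rate applied to tree `n` (counting from 0) is `learn_rate * annealing ** n`, which is my reading of the H2O docs; the helper name `annealed_rate` is made up for the example.

```python
import math

def annealed_rate(learn_rate, annealing, tree_index):
    """Learning rate applied to tree number `tree_index` (0-based),
    assuming the rate is multiplied by `annealing` after every tree."""
    return learn_rate * annealing ** tree_index

start_rate, annealing, constant_rate = 0.05, 0.99, 0.01

# First tree at which the annealed rate drops below the constant rate.
crossover = math.ceil(math.log(constant_rate / start_rate) / math.log(annealing))

for n in (0, 50, 100, crossover, 300):
    print(n, round(annealed_rate(start_rate, annealing, n), 5))
print("annealed rate falls below", constant_rate, "at tree", crossover)
```

Under this assumption the early trees take much larger steps than they would at 0.01, and the steps shrink to the 0.01 level only after a little over 150 trees, which is why the annealed model should converge sooner.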

---

### Dataset Preliminaries

In [1]:
#imports
import time
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

In [2]:
#initialize h2o cluster
h2o.init(nthreads=-1, strict_version_check=True)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.7.0_151"; OpenJDK Runtime Environment (amzn-2.6.11.0.74.amzn1-x86_64 u151-b00); OpenJDK 64-Bit Server VM (build 24.151-b00, mixed mode)
  Starting server from /home/ec2-user/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp3vpisxwn
  JVM stdout: /tmp/tmp3vpisxwn/h2o_ec2_user_started_from_python.out
  JVM stderr: /tmp/tmp3vpisxwn/h2o_ec2_user_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


0,1
H2O cluster uptime:,02 secs
H2O cluster version:,3.15.0.4050
H2O cluster version age:,1 day
H2O cluster name:,H2O_from_python_ec2_user_ibh5wr
H2O cluster total nodes:,1
H2O cluster free memory:,26.52 Gb
H2O cluster total cores:,64
H2O cluster allowed cores:,64
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://127.0.0.1:54321


This is being run on an AWS Linux virtual machine with 64 CPUs, as can be seen from the number of cores reported for the H2O cluster.  

In [3]:
#import the dataset
df = pd.read_csv('train.csv')

#split the Product_info2 variable
df['product_Letter'] = df['Product_Info_2'].str.slice(0, 1)\
                       .astype('category')\
                       .cat.codes

df['product_Number'] = df['Product_Info_2'].str.slice(1, 2)\
                       .astype(int)
    
df = df.drop('Product_Info_2', axis=1)

#create a sum of the medical data
df['health_sum'] = df.filter(regex='Medical_Key').sum(axis=1)

#convert to H2OFrame
h2oDF = h2o.H2OFrame(df)

#convert factors to factor types
factor_cols = ['Product_Info_1', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 
               'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 
               'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 
               'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 
               'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 
               'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 
               'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3',
               'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 
               'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 
               'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 
               'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 
               'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 
               'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 
               'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 
               'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 
               'Medical_History_40', 'Medical_History_41', 'Response', 'product_Letter', 
               'product_Number']

for col in factor_cols:
    h2oDF[col] = h2oDF[col].asfactor()
    
# determine the response
response = "Response"

#determine the predictors: every column except the ID (first column) and the response
predictors = [c for c in h2oDF.columns if c not in (h2oDF.columns[0], response)]

#split the data into three pieces:

#1. 60% Training data
#2. 20% Validation data
#3. 20% Test data

seed = 20170929 #nunu's birthday
train60, valid, test = h2oDF.split_frame(ratios = [0.6, 0.2], seed = seed)
train80 = train60.rbind(valid)
print('train60', len(train60))
print('train80', len(train80))
print('valid', len(valid))
print('test', len(test))

Parse progress: |█████████████████████████████████████████████████████████| 100%
train60 35818
train80 47612
valid 11794
test 11769
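
The row counts above are close to, but not exactly, 60/20/20. As far as I understand the H2O documentation, `split_frame` assigns rows probabilistically rather than cutting at exact counts. The sketch below mimics that idea with Python's `random` module; `probabilistic_split` is a hypothetical helper written for illustration and will not reproduce H2O's actual row assignment.

```python
import random

def probabilistic_split(n_rows, ratios, seed):
    """Assign each row to a split by drawing a uniform number and comparing
    it with the cumulative ratios; the last split takes the remainder."""
    rng = random.Random(seed)
    cutoffs, running = [], 0.0
    for ratio in ratios:
        running += ratio
        cutoffs.append(running)
    counts = [0] * (len(ratios) + 1)
    for _ in range(n_rows):
        u = rng.random()
        for i, cutoff in enumerate(cutoffs):
            if u < cutoff:
                counts[i] += 1
                break
        else:  # u was not below any cutoff: row goes to the final split
            counts[-1] += 1
    return counts

n_rows = 35818 + 11794 + 11769   # total rows reported above
counts = probabilistic_split(n_rows, [0.6, 0.2], seed=20170929)
for name, count in zip(["train60", "valid", "test"], counts):
    print(name, count, round(count / n_rows, 4))
```

The proportions land near the requested ratios without matching them exactly, which is the same pattern seen in the H2O output.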


I start by establishing baseline performance for the train set, by taking the default parameters for the GBM estimator. The only parameters that I add are the `distribution` (as this is a multinomial classification problem) and the `balance_classes` parameter as the response variable is highly imbalanced.  

In [10]:
start_time = time.time()
baseline = H2OGradientBoostingEstimator(distribution='multinomial', balance_classes=True)
baseline.train(x=predictors, y=response, training_frame=train60)
print("--- %s seconds ---" % (time.time() - start_time))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
--- 11.308806419372559 seconds ---


In [11]:
baseline.model_performance()


ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.34184425893878667
RMSE: 0.5846744897280766
LogLoss: 1.0088513799079613
Mean Per-Class Error: 0.3106870284628406
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



0,1,2,3,4,5,6,7,8,9
1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,Error,Rate
5900.0,543.0,90.0,195.0,591.0,1291.0,935.0,2264.0,0.5003811,"5,909 / 11,809"
282.0,5720.0,96.0,175.0,774.0,1601.0,896.0,2250.0,0.5150076,"6,074 / 11,794"
59.0,117.0,9593.0,698.0,229.0,560.0,174.0,387.0,0.1882034,"2,224 / 11,817"
55.0,0.0,40.0,10135.0,0.0,339.0,162.0,1075.0,0.1415382,"1,671 / 11,806"
106.0,430.0,0.0,0.0,8086.0,1403.0,587.0,1139.0,0.3118883,"3,665 / 11,751"
206.0,205.0,3.0,12.0,351.0,7690.0,1213.0,2176.0,0.3513833,"4,166 / 11,856"
57.0,27.0,0.0,0.0,47.0,1183.0,6818.0,3646.0,0.4211241,"4,960 / 11,778"
17.0,8.0,0.0,4.0,15.0,266.0,350.0,11132.0,0.0559701,"660 / 11,792"
6682.0,7050.0,9822.0,11219.0,10093.0,14333.0,11135.0,24069.0,0.3106787,"29,329 / 94,403"


Top-8 Hit Ratios: 


0,1
k,hit_ratio
1,0.6893213
2,0.835768
3,0.9031599
4,0.9468026
5,0.9714735
6,0.988867
7,0.9987713
8,1.0000001




Performance is good even with the default model. However, the extent of overfitting is not known, and so we perform the same exercise with 5-fold cross-validation. 

In [12]:
start_time = time.time()
CVbaseline = H2OGradientBoostingEstimator(distribution='multinomial', balance_classes=True, 
                                          nfolds=5, seed=seed)
CVbaseline.train(x=predictors, y=response, training_frame=train80)
print("--- %s seconds ---" % (time.time() - start_time))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
--- 100.31310510635376 seconds ---


In [13]:
print('Train mean per class error', CVbaseline.mean_per_class_error(train=True))

Train mean per class error 0.33224066114924866


In [14]:
#cross validation metrics
CVbaseline.cross_validation_metrics_summary().as_data_frame()

Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,accuracy,0.55355775,0.001520318,0.5506971,0.5546034,0.5556143,0.55124855,0.5556254
1,err,0.44644225,0.001520318,0.4493029,0.4453966,0.4443857,0.44875145,0.44437462
2,err_count,4251.2,21.09692,4254.0,4262.0,4203.0,4295.0,4242.0
3,logloss,1.2787966,0.0049447035,1.2873873,1.273314,1.2727906,1.2873297,1.2731613
4,max_per_class_error,0.8190659,0.0062207775,0.8301887,0.8094241,0.8263959,0.8208955,0.80842525
5,mean_per_class_accuracy,0.46130127,0.0033562302,0.45728013,0.46135736,0.45514297,0.46827403,0.46445188
6,mean_per_class_error,0.53869873,0.0033562302,0.54271984,0.53864264,0.544857,0.531726,0.5355481
7,mse,0.4301344,0.0007164079,0.4315152,0.4298728,0.42853418,0.4308653,0.42988443
8,r2,0.92853343,0.00039531346,0.9289005,0.92764693,0.9281074,0.9289242,0.9290883
9,rmse,0.6558458,0.00054623996,0.65689814,0.65564686,0.65462524,0.6564033,0.65565574


In [None]:
#training data model performance
CVbaseline.model_performance(train=True)

In [15]:
#cross validation model performance
CVbaseline.model_performance(xval=True)


ModelMetricsMultinomial: gbm
** Reported on cross-validation data. **

MSE: 0.4301353274612132
RMSE: 0.6558470305347225
LogLoss: 1.278795393015315
Mean Per-Class Error: 0.5386129410708305
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



0,1,2,3,4,5,6,7,8,9
1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,Error,Rate
895.0,674.0,94.0,136.0,449.0,948.0,590.0,1163.0,0.8191554,"4,054 / 4,949"
531.0,1144.0,84.0,111.0,650.0,1022.0,526.0,1179.0,0.7819706,"4,103 / 5,247"
122.0,57.0,284.0,191.0,30.0,62.0,13.0,38.0,0.6436637,513 / 797
76.0,16.0,79.0,761.0,2.0,79.0,24.0,133.0,0.3495726,"409 / 1,170"
171.0,442.0,2.0,0.0,2138.0,860.0,247.0,504.0,0.5100825,"2,226 / 4,364"
330.0,390.0,3.0,17.0,445.0,4473.0,1221.0,2163.0,0.5053086,"4,569 / 9,042"
127.0,100.0,0.0,2.0,51.0,1159.0,2555.0,2432.0,0.6023965,"3,871 / 6,426"
60.0,54.0,0.0,11.0,27.0,688.0,671.0,14106.0,0.0967535,"1,511 / 15,617"
2312.0,2877.0,546.0,1229.0,3792.0,9291.0,5847.0,21718.0,0.4464421,"21,256 / 47,612"


Top-8 Hit Ratios: 


0,1
k,hit_ratio
1,0.5535579
2,0.7377132
3,0.851088
4,0.9222045
5,0.9652818
6,0.9902335
7,0.9962404
8,1.0




There is a fairly large gap between the `mean_per_class_error` of the training set (0.33) and that of the cross-validation folds (0.54). To some extent this is expected, but the size of the gap suggests that the model is overfitting. The mean squared errors are more stable between the train and validation sets, which is encouraging. 
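
Since `mean_per_class_error` is the metric used throughout, it is worth seeing how it differs from the overall error rate. The sketch below recomputes both from the cross-validation confusion matrix printed above (counts copied from that output; rows are actual classes, columns are predicted classes). It should reproduce the reported Mean Per-Class Error of about 0.5386 and the overall error of about 0.4464.

```python
# Cross-validation confusion matrix copied from the output above.
confusion = [
    [895, 674, 94, 136, 449, 948, 590, 1163],
    [531, 1144, 84, 111, 650, 1022, 526, 1179],
    [122, 57, 284, 191, 30, 62, 13, 38],
    [76, 16, 79, 761, 2, 79, 24, 133],
    [171, 442, 2, 0, 2138, 860, 247, 504],
    [330, 390, 3, 17, 445, 4473, 1221, 2163],
    [127, 100, 0, 2, 51, 1159, 2555, 2432],
    [60, 54, 0, 11, 27, 688, 671, 14106],
]

# Error rate within each actual class: 1 - (diagonal / row total).
per_class_error = [1 - row[i] / sum(row) for i, row in enumerate(confusion)]

# Mean per-class error weights every class equally, however rare it is.
mean_per_class_error = sum(per_class_error) / len(per_class_error)

# Overall error weights every observation equally, so large classes dominate.
total = sum(sum(row) for row in confusion)
correct = sum(row[i] for i, row in enumerate(confusion))
overall_error = 1 - correct / total

for label, err in enumerate(per_class_error, start=1):
    print("class", label, round(err, 4))
print("mean per-class error:", round(mean_per_class_error, 4))
print("overall error:       ", round(overall_error, 4))
```

The overall error is lower because class 8, by far the largest class, is also the best predicted; the mean per-class error does not let that class mask the poor results on classes 1 and 2.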

### Tuning  `learn_rate` and `max_depth`

Most parameters will be estimated using a search over several hyper-parameters at once. However, `learn_rate` and `max_depth` will each be tuned individually first. These parameters greatly affect the speed at which convergence takes place, so narrowing the search range for them is worthwhile. This is especially true for `max_depth`: depths greater than 10 can massively slow down training. 

We will also use this exercise to understand how many trees should be specified for the `ntrees` parameter. Here `ntrees` is set to 1000 as an upper limit. We can observe how many trees are actually built before early stopping halts training, given the other parameters, and then set `ntrees` more specifically in the larger tuning exercise. 
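
The number of trees actually built is governed by the early-stopping settings used in the grid search cells of this section (`stopping_rounds=5`, `stopping_tolerance=1e-3`, `score_tree_interval=4`). The sketch below is a deliberately simplified stopping rule, not H2O's exact one (which, per its documentation, works on a moving average of the metric): training stops once the metric has failed to beat the best value so far by the relative tolerance for `stopping_rounds` consecutive scoring events. The function name and the score sequence are hypothetical.

```python
def trees_built(scores, stopping_rounds=5, stopping_tolerance=1e-3,
                score_tree_interval=4):
    """Return how many trees are built before a simplified early-stopping
    rule fires. scores[i] is the validation metric (lower is better) at the
    i-th scoring event, which happens every `score_tree_interval` trees."""
    best = float("inf")
    rounds_without_improvement = 0
    for i, score in enumerate(scores):
        if score < best * (1 - stopping_tolerance):
            best = score                      # meaningful improvement
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1   # flat or worse
        if rounds_without_improvement >= stopping_rounds:
            return (i + 1) * score_tree_interval
    return len(scores) * score_tree_interval  # never stopped early

# Hypothetical validation mean_per_class_error at successive scoring events.
scores = [0.60, 0.57, 0.55, 0.54, 0.536,
          0.5358, 0.5357, 0.5359, 0.5356, 0.5357,
          0.5350, 0.5340]
print("trees built:", trees_built(scores))
```

With these made-up scores the metric plateaus after the fifth scoring event, so training halts at 40 trees even though `ntrees` would have allowed many more.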

We start by performing a grid search for the `max_depth` parameter. The values searched are the odd numbers from 1 to 29. 

In [16]:
start_time = time.time()
hyper_params = {'max_depth' : list(range(1, 30, 2))}

depthModel = H2OGradientBoostingEstimator(
    distribution='multinomial',
    ntrees = 1000, #a standard value of ntrees
    learn_rate=0.05, #a fairly standard value of learn_rate
    learn_rate_annealing=0.99,
    sample_rate = 0.8, #a fairly standard value for a large data set
    col_sample_rate = 0.8, #a fairly standard value for a large data set
    seed=seed,
    score_tree_interval = 4,
    stopping_rounds = 5,
    stopping_metric = "mean_per_class_error",
    stopping_tolerance = 1e-3)

depthGrid = H2OGridSearch(depthModel, hyper_params)

depthGrid.train(x=predictors,
                y=response,
                training_frame = train60,
                validation_frame = valid)
print("--- %s seconds ---" % (time.time() - start_time))

gbm Grid Build progress: |████████████████████████████████████████████████| 100%
--- 810.5275394916534 seconds ---


In [60]:
sorted_depth_grid = depthGrid.get_grid(sort_by='mean_per_class_error')
sorted_depth_grid

     max_depth                                                      model_ids  \
0            3   Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_1   
1            5   Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_2   
2           17   Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_8   
3           19   Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_9   
4            7   Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_3   
5           23  Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_11   
6           21  Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_10   
7           13   Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_6   
8           27  Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_13   
9           29  Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_14   
10          25  Grid_GBM_py_65_sid_8734_model_python_1507061735713_5_model_12   
11          15   Grid_GBM_py



In [58]:
for x in range(5):
    print("Model {}".format(x))
    print("valid MPCE", sorted_depth_grid[x].mean_per_class_error(valid=True))
    print("train MPCE", sorted_depth_grid[x].mean_per_class_error(train=True))

Model 0
valid MPCE 0.5326005301577738
train MPCE 0.46508718616269923
Model 1
valid MPCE 0.5348217821483695
train MPCE 0.39472570578367006
Model 2
valid MPCE 0.5366781932755325
train MPCE 0.08340849485757895
Model 3
valid MPCE 0.5377181317006009
train MPCE 0.029438585992860297
Model 4
valid MPCE 0.5378934076518931
train MPCE 0.3193535729461138


In [25]:
for model in list(depthGrid.mean_per_class_error().keys())[:5]:
    print("number of trees", (depthGrid.scoring_history()[model])['number_of_trees'].iloc[-1])

number of trees 48.0
number of trees 48.0
number of trees 48.0
number of trees 68.0
number of trees 72.0


The sweet spot for the `max_depth` is between 3 and 19. However, it is worth noting that tuning this parameter has not brought significant gains to the validation set mean_per_class_error, and seems to have increased overfitting in some of the five best performing models. 
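
To put numbers on the overfitting remark, the sketch below tabulates the train/validation gap for the five best models, using the figures printed above. It assumes that "Model x" in that output corresponds to row x of the sorted grid, so that the depths are 3, 5, 17, 19 and 7 respectively.

```python
# (max_depth, valid MPCE, train MPCE), copied from the output above.
# The depth labels assume Model x is row x of the sorted grid.
top_depth_models = [
    (3, 0.5326005301577738, 0.46508718616269923),
    (5, 0.5348217821483695, 0.39472570578367006),
    (17, 0.5366781932755325, 0.08340849485757895),
    (19, 0.5377181317006009, 0.029438585992860297),
    (7, 0.5378934076518931, 0.3193535729461138),
]

by_depth = sorted(top_depth_models)          # order by max_depth
gaps = [valid - train for _, valid, train in by_depth]

for (depth, valid, train), gap in zip(by_depth, gaps):
    print(f"max_depth={depth:>2}  valid={valid:.4f}  train={train:.4f}  gap={gap:.4f}")

valid_scores = [valid for _, valid, _ in top_depth_models]
valid_spread = max(valid_scores) - min(valid_scores)
print(f"spread of validation MPCE across the five models: {valid_spread:.4f}")
```

The validation scores differ by only about half a percentage point, while the gap to the training score widens steadily with depth, so the shallow models give essentially the same validation performance with far less overfitting.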

Next we tune the `learn_rate`. In general the learn rate should be low. The learn rate scales the contribution that each new tree makes to the ensemble, so a small learn rate should help with overfitting. Standard learn rates are 0.05 and 0.01. However, in testing I found that higher learn rates performed better on this data (I am not sure why), and so I search a space from 0.005 to 0.595 in steps of 0.01, which is much larger than would typically be considered. 

I include the `learn_rate_annealing` parameter as this helps the model to converge by lowering the learning rate for each successive tree (thereby increasing the chances of early stopping).
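
As a quick check of the search space, the snippet below evaluates the same list comprehension that the next cell passes as the `learn_rate` grid and reports its size and end points.

```python
# Same expression as used for the learn_rate grid in the next cell.
learn_rate_grid = [x / 1000 for x in range(5, 601, 10)]

print("number of candidate learn rates:", len(learn_rate_grid))
print("smallest:", learn_rate_grid[0])
print("largest: ", learn_rate_grid[-1])
print("first five:", learn_rate_grid[:5])
```

Because `range` excludes its stop value and the step is 10, the grid ends at 0.595 rather than 0.6, and each candidate is a separate model to train.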

In [19]:
start_time = time.time()
hyper_params2 = {'learn_rate' : [x/1000 for x in range(5, 601, 10)]}

learnModel = H2OGradientBoostingEstimator(
    distribution='multinomial',
    learn_rate_annealing = 0.99,
    max_depth = 15,
    ntrees=1000,
    sample_rate = 0.8,
    col_sample_rate = 0.8,
    seed=seed,
    score_tree_interval = 4,
    stopping_rounds = 5,
    stopping_metric = "mean_per_class_error",
    stopping_tolerance = 1e-3)

learnGrid = H2OGridSearch(learnModel, hyper_params2)

learnGrid.train(x=predictors,
                y=response,
                training_frame = train60,
                validation_frame = valid)
print("--- %s seconds ---" % (time.time() - start_time))

gbm Grid Build progress: |████████████████████████████████████████████████| 100%
--- 2680.8130984306335 seconds ---


In [61]:
sorted_learn_grid = learnGrid.get_grid(sort_by='mean_per_class_error')
sorted_learn_grid 

     learn_rate  \
0         0.035   
1         0.025   
2         0.005   
3         0.015   
4         0.045   
5         0.055   
6         0.095   
7         0.105   
8         0.275   
9         0.075   
10        0.205   
11        0.085   
12        0.145   
13        0.255   
14        0.325   
15        0.065   
16        0.235   
17        0.225   
18        0.125   
19        0.115   
20        0.195   
21        0.165   
22        0.185   
23        0.175   
24        0.155   
25        0.285   
26        0.215   
27        0.335   
28        0.245   
29        0.305   
30        0.135   
31        0.365   
32        0.355   
33        0.345   
34        0.445   
35        0.315   
36        0.265   
37        0.425   
38        0.295   
39        0.375   
40        0.405   
41        0.415   
42        0.465   
43        0.435   
44        0.385   
45        0.595   
46        0.475   
47        0.505   
48        0.455   
49        0.495   
50        0.515   
51        0.



In [62]:
for x in range(5):
    print("Model {}".format(x))
    print("valid MPCE", sorted_learn_grid[x].mean_per_class_error(valid=True))
    print("train MPCE", sorted_learn_grid[x].mean_per_class_error(train=True))

Model 0
valid MPCE 0.5339439192825468
train MPCE 0.11534431492531119
Model 1
valid MPCE 0.5366625025848393
train MPCE 0.1441639214072638
Model 2
valid MPCE 0.5373379388786985
train MPCE 0.21926398067242622
Model 3
valid MPCE 0.5374451071104057
train MPCE 0.19240577794490843
Model 4
valid MPCE 0.5380081720825463
train MPCE 0.10411344307316037


The sweet spot for the `learn_rate` appears to be between 0.005 and 0.05. Again, these small learn rates do not appear to help with the overfitting problem, as the difference between train and valid scores has increased. 

For the remaining tuning parameters (including those that I hope will prevent overfitting), I will run the program in an IPython session on AWS to take advantage of the persistence offered by the Amazon servers.

In [None]:
#learnGrid.get_grid(sort_by='mean_per_class_error')

In [None]:
#learnGrid.scoring_history()

In [None]:
################
##THE FOLLOWING CODE WAS RUN IN AN IPYTHON TERMINAL TO PREVENT DATA LOSS IN THE EVENT OF SSH FAILURE
#
#
#learn_range = [0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
#start_time = time.time()
#hyper_params = {'learn_rate' : learn_range,
#                'max_depth' : list(range(3, 19, 2)),
#                'col_sample_rate' : [x/100 for x in range(70, 81, 5)],
#                'col_sample_rate_per_tree': [x/100. for x in range(20,101)],
#                'sample_rate' : [x/100 for x in range(70, 86, 5)],
#                'nbins' : [2**x for x in range(4,11)],
#                'nbins_cats': [2**x for x in range(4,11)],
#                'min_split_improvement' : [1/x for x in [1000, 10000, 100000, 1000000, 10000000, 
#                                                         10000000, 1000000000, 10000000000]] }
#tuneModel = H2OGradientBoostingEstimator(distribution = 'multinomial',
#                                         balance_classes=True,
#                                         learn_rate_annealing = 0.99,
#                                         ntrees = 500,
#                                         stopping_rounds = 3,
#                                         stopping_metric = "mean_per_class_error",
#                                         stopping_tolerance = 1e-3)
#
#search_params = {'strategy': "RandomDiscrete",
#                 'seed' : 20170929,
#                 'stopping_rounds' : 5,
#                 'stopping_metric' : "mean_per_class_error",
#                 'stopping_tolerance': 1e-3,
#                 'max_models' : 1000,
#                 'max_runtime_secs' : 7200}
#
#tuneModelGrid = H2OGridSearch(tuneModel, hyper_params=hyper_params, 
#                              search_criteria=search_params)
#
#tuneModelGrid.train(x=predictors,
#                    y=response,
#                    training_frame = train60,
#                    validation_frame = valid)
#
#print("--- %s seconds ---" % (time.time() - start_time))
##save some model outputs
#pickle.dump(tuneModelGrid.scoring_history(), open("tuneGridScoreHistory.pkl", "wb"))
#sorted_grid = sorted(tuneModelGrid.mean_per_class_error(valid=True).items(),key=lambda x: x[1])
#pickle.dump(sorted_grid, open("meanClassError.pkl", "wb"))
##save top 10 performing models
#for model in sorted_grid[:10]:
#    m = h2o.get_model(model[0])
#    model_path = h2o.save_model(model=m, path="", force=True)
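
The reason for using the `RandomDiscrete` strategy with a cap on models and runtime becomes clear once the size of the full grid is counted. The sketch below rebuilds the `hyper_params` dictionary from the commented code above (values copied as written, including the repeated entry in `min_split_improvement`) and multiplies the list lengths together.

```python
# Hyper-parameter lists copied from the commented-out tuning code above.
hyper_params = {
    'learn_rate': [0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
    'max_depth': list(range(3, 19, 2)),
    'col_sample_rate': [x / 100 for x in range(70, 81, 5)],
    'col_sample_rate_per_tree': [x / 100. for x in range(20, 101)],
    'sample_rate': [x / 100 for x in range(70, 86, 5)],
    'nbins': [2 ** x for x in range(4, 11)],
    'nbins_cats': [2 ** x for x in range(4, 11)],
    'min_split_improvement': [1 / x for x in [1000, 10000, 100000, 1000000, 10000000,
                                              10000000, 1000000000, 10000000000]],
}

# Size of the full Cartesian grid: product of the number of values per parameter.
grid_size = 1
for name, values in hyper_params.items():
    print(name, len(values))
    grid_size *= len(values)

max_models = 1000   # the 'max_models' cap in search_params
print("full grid size:", grid_size)
print("fraction explored at most:", max_models / grid_size)
```

An exhaustive search over roughly 21 million combinations is out of the question, so a random sample of at most 1000 models covers only a tiny fraction of the space; this is worth keeping in mind when judging how much the tuning achieved.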

In [4]:
cd ~/finalGaProject

/home/ec2-user/finalGaProject


In [5]:
#get top 5 models
top_models = pickle.load(open("meanClassError.pkl", "rb"))[:5]
top_models

[('Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_29',
  0.5251082117448248),
 ('Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_87',
  0.5251828923012749),
 ('Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_44',
  0.5260783132930181),
 ('Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_226',
  0.5283238682082384),
 ('Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_156',
  0.5284547573366978)]

The tuning was not hugely successful, as the validation `mean_per_class_error` improved by only about 0.01. The grid search took 5 hours of compute time on a 64-CPU virtual machine; in other words, it was a computationally heavy exercise. I will consider using LightGBM in future iterations. 

In [6]:
#get the scoring history
scoreHistory = pickle.load(open("tuneGridScoreHistory.pkl", "rb"))

In [10]:
#look at the scoring history for best performing model
topScore = scoreHistory['Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_29']
topScore

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,validation_rmse,validation_logloss,validation_classification_error
0,,2017-10-03 22:13:26,27 min 50.860 sec,0.0,0.879943,2.428884,0.874971,0.812499,1.809797,0.675683
1,,2017-10-03 22:13:26,27 min 51.067 sec,1.0,0.865411,2.253363,0.872023,0.795429,1.716133,0.673054
2,,2017-10-03 22:13:26,27 min 51.177 sec,2.0,0.856652,2.192316,0.868376,0.784767,1.669535,0.671189
3,,2017-10-03 22:13:26,27 min 51.305 sec,3.0,0.846979,2.098468,0.856999,0.776467,1.63082,0.665677
4,,2017-10-03 22:13:26,27 min 51.448 sec,4.0,0.833945,2.002353,0.823133,0.763019,1.575668,0.639732
5,,2017-10-03 22:13:26,27 min 51.598 sec,5.0,0.827643,1.967003,0.776502,0.755782,1.548408,0.598525
6,,2017-10-03 22:13:27,27 min 51.765 sec,6.0,0.822578,1.932854,0.755042,0.751599,1.532444,0.580549
7,,2017-10-03 22:13:27,27 min 51.925 sec,7.0,0.817469,1.905423,0.72424,0.74654,1.514487,0.556215
8,,2017-10-03 22:13:27,27 min 52.122 sec,8.0,0.806715,1.840441,0.686929,0.738324,1.484391,0.525945
9,,2017-10-03 22:13:27,27 min 52.304 sec,9.0,0.801,1.813373,0.668851,0.732866,1.466239,0.513481


In [16]:
#inspect the model
topMod = h2o.load_model('/home/ec2-user/finalGaProject/Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_29')

In [21]:
topMod

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  Grid_GBM_py_65_sid_be2c_model_python_1507067122023_1_model_29


ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.3229076198566854
RMSE: 0.5682496105204872
LogLoss: 0.9572627489489983
Mean Per-Class Error: 0.2992520893570521
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



0,1,2,3,4,5,6,7,8,9
1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,Error,Rate
6197.0,496.0,65.0,186.0,500.0,1407.0,790.0,2133.0,0.4736708,"5,577 / 11,774"
282.0,6049.0,87.0,171.0,723.0,1570.0,748.0,2160.0,0.4869381,"5,741 / 11,790"
58.0,194.0,9505.0,658.0,230.0,597.0,173.0,388.0,0.1946963,"2,298 / 11,803"
14.0,0.0,0.0,10006.0,0.0,391.0,187.0,1170.0,0.1497281,"1,762 / 11,768"
105.0,387.0,0.0,0.0,8180.0,1492.0,551.0,1077.0,0.3063094,"3,612 / 11,792"
142.0,220.0,4.0,9.0,287.0,7881.0,1171.0,2046.0,0.3298469,"3,879 / 11,760"
41.0,21.0,0.0,0.0,48.0,1070.0,7084.0,3571.0,0.4014364,"4,751 / 11,835"
14.0,9.0,0.0,5.0,14.0,233.0,331.0,11186.0,0.0513908,"606 / 11,792"
6853.0,7376.0,9661.0,11035.0,9982.0,14641.0,11035.0,23731.0,0.2992769,"28,226 / 94,314"


Top-8 Hit Ratios: 


0,1
k,hit_ratio
1,0.7007231
2,0.8503721
3,0.9114766
4,0.9534428
5,0.9759845
6,0.9907013
7,0.9992366
8,1.0



ModelMetricsMultinomial: gbm
** Reported on validation data. **

MSE: 0.4144929666013407
RMSE: 0.643811281822042
LogLoss: 1.2406756451508365
Mean Per-Class Error: 0.5251082117448248
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



0,1,2,3,4,5,6,7,8,9
1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,Error,Rate
251.0,167.0,26.0,20.0,107.0,250.0,128.0,275.0,0.7949346,"973 / 1,224"
122.0,308.0,17.0,35.0,160.0,274.0,130.0,276.0,0.7670197,"1,014 / 1,322"
17.0,16.0,66.0,55.0,12.0,13.0,1.0,7.0,0.6470588,121 / 187
18.0,6.0,16.0,204.0,0.0,26.0,5.0,27.0,0.3245033,98 / 302
51.0,114.0,1.0,0.0,539.0,203.0,45.0,112.0,0.4938967,"526 / 1,065"
96.0,109.0,1.0,5.0,104.0,1201.0,287.0,494.0,0.4771441,"1,096 / 2,297"
29.0,17.0,0.0,1.0,18.0,282.0,627.0,598.0,0.6011450,"945 / 1,572"
10.0,13.0,0.0,2.0,8.0,173.0,158.0,3461.0,0.0951634,"364 / 3,825"
594.0,750.0,127.0,322.0,948.0,2422.0,1381.0,5250.0,0.4355605,"5,137 / 11,794"


Top-8 Hit Ratios: 


0,1
k,hit_ratio
1,0.5644395
2,0.7502119
3,0.8634051
4,0.9302188
5,0.9727828
6,0.9922842
7,0.9972868
8,1.0


Scoring History: 


0,1,2,3,4,5,6,7,8,9
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,validation_rmse,validation_logloss,validation_classification_error
,2017-10-03 22:13:26,27 min 50.860 sec,0.0,0.8799430,2.4288837,0.8749708,0.8124991,1.8097968,0.6756826
,2017-10-03 22:13:26,27 min 51.067 sec,1.0,0.8654112,2.2533631,0.8720232,0.7954293,1.7161329,0.6730541
,2017-10-03 22:13:26,27 min 51.177 sec,2.0,0.8566521,2.1923159,0.8683759,0.7847667,1.6695349,0.6711887
,2017-10-03 22:13:26,27 min 51.305 sec,3.0,0.8469787,2.0984681,0.8569990,0.7764671,1.6308202,0.6656775
,2017-10-03 22:13:26,27 min 51.448 sec,4.0,0.8339450,2.0023528,0.8231334,0.7630193,1.5756677,0.6397321
---,---,---,---,---,---,---,---,---,---
,2017-10-03 22:13:29,27 min 54.386 sec,17.0,0.7507252,1.5728223,0.5348623,0.7022607,1.3673472,0.4749873
,2017-10-03 22:13:30,27 min 54.761 sec,18.0,0.7453550,1.5499964,0.5070191,0.7004060,1.3617261,0.4693064
,2017-10-03 22:13:34,27 min 58.767 sec,60.0,0.6377394,1.1844511,0.3706025,0.6593346,1.2633294,0.4411565



See the whole table with table.as_data_frame()
Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
Medical_History_15,61214.5039062,1.0,0.2581582
Family_Hist_3,25733.4296875,0.4203813,0.1085249
Family_Hist_5,22521.5449219,0.3679119,0.0949795
BMI,22199.1054688,0.3626445,0.0936197
Product_Info_4,8978.0693359,0.1466657,0.0378630
---,---,---,---
Medical_Keyword_27,36.2540741,0.0005922,0.0001529
Medical_History_22,34.4385834,0.0005626,0.0001452
Medical_Keyword_44,19.0572853,0.0003113,0.0000804



See the whole table with table.as_data_frame()




This is the end of the tuning. Next I inspect the model.

In [39]:
import os
h2o.export_file(train60, path = "/home/ec2-user/finalGaProject/h2oData/train60.h20", force=True)
h2o.export_file(train80, path='/home/ec2-user/finalGaProject/h2oData/train80.h20', force=True)
h2o.export_file(valid, path='/home/ec2-user/finalGaProject/h2oData/valid.h20', force=True)
h2o.export_file(test, path='/home/ec2-user/finalGaProject/h2oData/test.h20', force=True)

Export File progress: |███████████████████████████████████████████████████| 100%
Export File progress: |███████████████████████████████████████████████████| 100%
Export File progress: |███████████████████████████████████████████████████| 100%
Export File progress: |███████████████████████████████████████████████████| 100%
