# ABSTRACT

Hyperparameters are parameters specified before a machine learning algorithm is run, and they have a large effect on the predictive power of the resulting models. Knowing the relative importance of each hyperparameter to an algorithm, and its useful range of values, is crucial to hyperparameter tuning and to building effective models. For experts and non-experts alike, finding hyperparameters that optimize model performance can be tedious and difficult. We therefore develop a hyperparameter database that lets users visualize and understand how to choose hyperparameters that maximize the predictive power of their models.

The database is created by running millions of hyperparameter values over thousands of public datasets and calculating the individual conditional expectation of every hyperparameter with respect to model quality.

We analyze the **effect of hyperparameters** on algorithms such as Distributed Random Forest (DRF), Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), and several more. The database thus aims to be a one-stop platform where data scientists can identify the hyperparameters that most affect their models, which speeds up the development of effective predictive models. The public datasets will also be used to build models that predict hyperparameters without search, and to visualize and teach concepts such as statistical power and the bias/variance tradeoff. The raw data will also be publicly available to the research community.
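
As a rough illustration of individual conditional expectation (ICE) applied to a hyperparameter, the sketch below holds every other setting of a run fixed, sweeps one hyperparameter over a grid, and records the predicted model quality at each grid point. The `toy_quality` function, the `learn_rate` and `max_depth` names, and the grid values are assumptions made for this example; they are not taken from the database.

```python
# Minimal ICE sketch: one curve per observed run, sweeping a single hyperparameter.
# toy_quality is a hypothetical stand-in for a fitted surrogate that predicts
# model error from hyperparameter values.

def toy_quality(config):
    """Hypothetical surrogate: predicted RMSE for a hyperparameter config."""
    return (config["learn_rate"] - 0.1) ** 2 * 100 + 10.0 / config["max_depth"]


def ice_curves(predict, runs, name, grid):
    """Return one curve per run: predictions as `name` sweeps over `grid`."""
    curves = []
    for run in runs:
        curve = []
        for value in grid:
            probe = dict(run)      # copy so the observed run is not modified
            probe[name] = value    # vary only the hyperparameter of interest
            curve.append(predict(probe))
        curves.append(curve)
    return curves


runs = [
    {"learn_rate": 0.05, "max_depth": 5},
    {"learn_rate": 0.30, "max_depth": 10},
]
grid = [0.01, 0.1, 0.3]
curves = ice_curves(toy_quality, runs, "learn_rate", grid)

# Averaging the ICE curves pointwise gives the partial dependence curve.
partial_dependence = [sum(col) / len(col) for col in zip(*curves)]
for value, avg in zip(grid, partial_dependence):
    print(f"learn_rate={value}: mean predicted RMSE={avg:.3f}")
```

Each inner list in `curves` is one ICE curve; plotting them together shows whether a hyperparameter affects every run in the same way or only some of them.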


In [52]:
# import h2o package and specific estimator 
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import pandas as pd
import logging
import csv
import optparse
import time
import json
from distutils.util import strtobool
import psutil

import warnings
warnings.filterwarnings('ignore')

In [53]:
h2o.init(strict_version_check=False) # start h2o

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime:,2 mins 24 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.1
H2O cluster version age:,"7 days, 17 hours and 29 minutes"
H2O cluster name:,H2O_from_python_newzysharma_odm49t
H2O cluster total nodes:,1
H2O cluster free memory:,2.000 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [54]:
#importing data to the server
hp = h2o.import_file(path="hour.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [55]:
#Displaying the head
hp.head()

instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01 00:00:00,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01 00:00:00,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
3,2011-01-01 00:00:00,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
4,2011-01-01 00:00:00,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
5,2011-01-01 00:00:00,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1
6,2011-01-01 00:00:00,1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,0,1,1
7,2011-01-01 00:00:00,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2,0,2
8,2011-01-01 00:00:00,1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0.0,1,2,3
9,2011-01-01 00:00:00,1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
10,2011-01-01 00:00:00,1,0,1,9,0,6,0,1,0.32,0.3485,0.76,0.0,8,6,14




In [56]:
hp.describe()

Rows:17379
Cols:17




,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
type,int,time,int,int,int,int,int,int,int,int,real,real,real,real,int,int,int
mins,1.0,1293840000000.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
mean,8690.0,1325477314552.0461,2.501639910236492,0.5025605615973301,6.5377754761493785,11.546751826917548,0.028770355026181024,3.003682605443351,0.6827205247712756,1.425283387997008,0.4969871684216584,0.47577510213476026,0.6272288394038784,0.1900976063064618,35.67621842453536,153.78686920996606,189.4630876345014
maxs,17379.0,1356912000000.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0
sigma,5017.029499614288,18150225217.779854,1.1069181394480765,0.5000078290910197,3.438775713750168,6.914405095264493,0.16716527638437123,2.005771456110988,0.4654306335238829,0.6393568777542534,0.19255612124972193,0.17185021563535932,0.19292983406291514,0.1223402285727905,49.30503038705309,151.35728591258314,181.38759909186476
zeros,0,0,0,8645,0,726,16879,2502,5514,0,0,2,22,2180,1581,24,0
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0,3.0,13.0,16.0
1,2.0,2011-01-01 00:00:00,1.0,0.0,1.0,1.0,0.0,6.0,0.0,1.0,0.22,0.2727,0.8,0.0,8.0,32.0,40.0
2,3.0,2011-01-01 00:00:00,1.0,0.0,1.0,2.0,0.0,6.0,0.0,1.0,0.22,0.2727,0.8,0.0,5.0,27.0,32.0


# Model with runtime 500 seconds

In [6]:
# Assume the following are passed by the user from the web interface

'''
Need a user id and project id?

'''
target='cnt' 
data_file='hour.csv'
run_time=500
run_id='SOME_ID_20180617_221528' # Just some arbitrary ID
server_path='Users/newzysharma/Desktop/Desktop/Machine_Learning/Project'
classification=False
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="HyperparameterDB_Project"  # project_name = project

In [7]:
# assign target (y) and predictor columns (X) for regression
y = target
X = [name for name in hp.columns if name != y]
print(y)
print(X)

cnt
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']


In [8]:
# determine column types
ints, reals, enums = [], [], []
for key, val in hp.types.items():
    if key in X:
        if val == 'enum':
            enums.append(key)
        elif val == 'int':
            ints.append(key)            
        else: 
            reals.append(key)

print(ints)
print(enums)
print(reals)

['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'casual', 'registered']
[]
['dteday', 'temp', 'atemp', 'hum', 'windspeed']


In [9]:
# impute missing values
_ = hp[reals].impute(method='mean')
_ = hp[ints].impute(method='median')

if scale:
    hp[reals] = hp[reals].scale()  # was df[...]; df is never defined in this notebook
    hp[ints] = hp[ints].scale()

In [19]:
# set target to factor for classification by default or if user specifies classification
# if classification:
#     hp[y] = hp[y].asfactor()

In [10]:
hp[y].levels()

[]


## Cross-validate rather than use a train/test split (500-second run)

In [14]:
# automl
# runs for run_time seconds then builds a stacked ensemble

aml = H2OAutoML(max_runtime_secs=run_time,project_name = project) # init automl, run for 500 seconds
aml.train(x=X,  
           y=y,
           training_frame=hp)

AutoML progress: |████████████████████████████████████████████████████████| 100%
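
The parameter dumps later in this notebook report `nfolds` = 5 and, for the XGBoost models, `fold_assignment` = 'Modulo'. As a sketch of what that means, assuming Modulo assignment places row i in fold i mod nfolds, each row is held out exactly once while a model is trained on the remaining folds. The helper name `modulo_folds` is hypothetical and this is not H2O's implementation.

```python
def modulo_folds(n_rows, nfolds):
    """Yield (fold, train_idx, valid_idx) with row i assigned to fold i % nfolds."""
    for fold in range(nfolds):
        valid_idx = [i for i in range(n_rows) if i % nfolds == fold]
        train_idx = [i for i in range(n_rows) if i % nfolds != fold]
        yield fold, train_idx, valid_idx


# 17379 rows is the size of hour.csv reported by hp.describe().
fold_sizes = [len(valid) for _, _, valid in modulo_folds(17379, 5)]
print(fold_sizes)
```

Because every row is used for validation once, the cross-validation metrics on the leaderboard cover the whole dataset without setting aside a separate test split.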


## Leaderboard

In [19]:
# view leaderboard
lb = aml.leaderboard
lb.head(30)

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
XGBoost_1_AutoML_20190408_134347,33.6267,5.79885,33.6267,3.48832,0.0709085
XGBoost_2_AutoML_20190408_134347,34.8881,5.90662,34.8881,3.65338,
XGBoost_3_AutoML_20190408_134347,37.692,6.13938,37.692,4.24472,
DRF_1_AutoML_20190408_134347,199.056,14.1087,199.056,7.61016,0.116413
StackedEnsemble_AllModels_AutoML_20190408_134347,215.127,14.6672,215.127,11.1482,
StackedEnsemble_BestOfFamily_AutoML_20190408_134347,578.165,24.0451,578.165,18.4664,
GBM_1_AutoML_20190408_134347,19075.9,138.115,19075.9,106.817,1.38931
GLM_grid_1_AutoML_20190408_134347_model_1,32825.3,181.178,32825.3,142.229,1.56932




In [20]:
aml.leader

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_1_AutoML_20190408_134347


ModelMetricsRegression: xgboost
** Reported on train data. **

MSE: 1.8044970589585165
RMSE: 1.3433156959399069
MAE: 0.9567310729234085
RMSLE: 0.033470831388227845
Mean Residual Deviance: 1.8044970589585165

ModelMetricsRegression: xgboost
** Reported on cross-validation data. **

MSE: 33.62667667791591
RMSE: 5.798851324005118
MAE: 3.488319483920533
RMSLE: 0.07090846493182505
Mean Residual Deviance: 33.62667667791591
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,3.4882748,0.2780323,3.5737598,3.6770396,3.7553453,3.7236972,2.711532
mean_residual_deviance,33.626022,4.375016,32.716263,35.153805,39.735123,38.27176,22.253164
mse,33.626022,4.375016,32.716263,35.153805,39.735123,38.27176,22.253164
r2,0.9989782,0.0001325,0.9990069,0.9989397,0.9987854,0.9988390,0.9993199
residual_deviance,33.626022,4.375016,32.716263,35.153805,39.735123,38.27176,22.253164
rmse,5.77124,0.3992551,5.7198133,5.9290643,6.3035803,6.186417,4.7173257
rmsle,0.0701319,0.0073951,0.0780898,0.0747013,0.0710742,0.0770181,0.0497763


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-08 13:48:12,4 min 24.405 sec,0.0,261.9286476,188.9630876,68606.6164192
,2019-04-08 13:48:12,4 min 24.518 sec,5.0,203.9399399,146.5347758,41591.4990976
,2019-04-08 13:48:12,4 min 24.616 sec,10.0,158.8014455,113.6471378,25217.8991072
,2019-04-08 13:48:12,4 min 24.745 sec,15.0,124.0125718,88.2422080,15379.1179652
,2019-04-08 13:48:12,4 min 24.854 sec,20.0,96.5444771,68.4358050,9320.8360536
---,---,---,---,---,---,---
,2019-04-08 13:48:56,5 min 8.231 sec,380.0,1.3921634,0.9895238,1.9381188
,2019-04-08 13:48:57,5 min 9.193 sec,385.0,1.3736363,0.9777073,1.8868767
,2019-04-08 13:48:58,5 min 10.198 sec,390.0,1.3635291,0.9707541,1.8592116



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
registered,2302417408.0000000,1.0,0.6603332
casual,721023232.0000000,0.3131592,0.2067894
hr,307383072.0000000,0.1335045,0.0881574
instant,52000256.0000000,0.0225851,0.0149137
workingday,42638236.0000000,0.0185189,0.0122286
weekday,27811040.0000000,0.0120791,0.0079762
atemp,18675220.0000000,0.0081111,0.0053561
temp,3797814.2500000,0.0016495,0.0010892
dteday,3107042.2500000,0.0013495,0.0008911
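
`registered` and `casual` account for most of the importance above. In the ten rows printed by `hp.head()`, `casual + registered` equals `cnt` in every row, which suggests the target can be reconstructed from these two predictors and may explain the very high cross-validated r2 (about 0.999). The sketch below checks this on the head rows copied from that output. Whether it holds for the full file is an assumption to verify, and dropping the two columns from `X` is one possible follow-up rather than something this notebook does.

```python
# (casual, registered, cnt) for the ten rows shown by hp.head()
head_rows = [
    (3, 13, 16), (8, 32, 40), (5, 27, 32), (3, 10, 13), (0, 1, 1),
    (0, 1, 1), (2, 0, 2), (1, 2, 3), (1, 7, 8), (8, 6, 14),
]

leaks = all(casual + registered == cnt for casual, registered, cnt in head_rows)
print("cnt == casual + registered on head rows:", leaks)

# Hypothetical follow-up: exclude the leaking columns before training.
all_columns = ['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday',
               'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum',
               'windspeed', 'casual', 'registered', 'cnt']
target = 'cnt'
leaky = {'casual', 'registered'}
X_no_leak = [c for c in all_columns if c != target and c not in leaky]
print(X_no_leak)
```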




In [21]:
aml.leader.algo

'xgboost'

## Ensemble Exploration

In [22]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
0,XGBoost_1_AutoML_20190408_134347,33.626677,5.798851,33.626677,3.488319,0.070908
1,XGBoost_2_AutoML_20190408_134347,34.888147,5.906619,34.888147,3.653379,
2,XGBoost_3_AutoML_20190408_134347,37.692033,6.139384,37.692033,4.244718,
3,DRF_1_AutoML_20190408_134347,199.055553,14.108705,199.055553,7.610164,0.116413
4,StackedEnsemble_AllModels_AutoML_20190408_134347,215.126519,14.667192,215.126519,11.148237,
5,StackedEnsemble_BestOfFamily_AutoML_20190408_1...,578.165385,24.04507,578.165385,18.466439,
6,GBM_1_AutoML_20190408_134347,19075.853602,138.115363,19075.853602,106.817011,1.389313
7,GLM_grid_1_AutoML_20190408_134347_model_1,32825.32154,181.177597,32825.32154,142.22889,1.569322


## Getting Models

### Parameters for XGBoost_1_AutoML_20190408_134347

In [78]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [24]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'XGBoost_1_AutoML_20190408_134347',
   'type': 'Key<Model>',
   'URL': '/3/Models/XGBoost_1_AutoML_20190408_134347'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'fold_assignment': {'default': 'AUTO', 'actual': 'Modulo'},
 'fold_column': {'defau
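
Each entry of `mod_best.params` pairs a `default` with an `actual` value, so the settings AutoML changed can be pulled out by comparing the two. A minimal sketch on a hand-copied excerpt of the output above follows. The helper name `changed_params` is hypothetical; on a live model the same function could be applied to `mod_best.params` directly.

```python
def changed_params(params):
    """Return {name: (default, actual)} for entries where actual != default."""
    return {
        name: (spec['default'], spec['actual'])
        for name, spec in params.items()
        if spec['actual'] != spec['default']
    }


# Excerpt copied from the XGBoost_1 parameter dump above.
params_excerpt = {
    'validation_frame': {'default': None, 'actual': None},
    'nfolds': {'default': 0, 'actual': 5},
    'keep_cross_validation_models': {'default': True, 'actual': False},
    'keep_cross_validation_predictions': {'default': False, 'actual': True},
    'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
    'score_each_iteration': {'default': False, 'actual': False},
    'fold_assignment': {'default': 'AUTO', 'actual': 'Modulo'},
}

changed = changed_params(params_excerpt)
for name, (default, actual) in changed.items():
    print(f"{name}: default={default!r} -> actual={actual!r}")
```

Filtering to changed values keeps the focus on the hyperparameters that distinguish one model from another, which is the information a hyperparameter database needs to store.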

### Parameters for StackedEnsemble_AllModels_AutoML_20190408_134347

In [95]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[4])

In [96]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'StackedEnsemble_AllModels_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/StackedEnsemble_AllModels_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'response_column': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ColSpecifierV3',
    'schema_type': 'VecSpecifier'},
   'column_name': 'cnt',
   'is_member_of_frames': None}},
 'validation_frame': {'default': None, 'actual': None},
 'blending_frame': {'default': None, 'actual': None},
 'base_models': {'default': [],
  'actual': [{'__meta': {'schema_version': 3,
     'schema_name': 'ModelKey

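### BUG: requesting model 6 (GBM_1_AutoML_20190408_134347) returns GBM_4_AutoML_20190408_171007 instead

The cell counters (In [105], In [106]) indicate that these cells were executed after the 1000-second run, so `aml` held that run's leaderboard, where index 6 is GBM_4_AutoML_20190408_171007. The stacked ensemble cells above (In [95], In [96]) show the same effect.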

In [105]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[6])

In [106]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GBM_4_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/GBM_4_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'score_tree_interval': {'default': 0, 'actual': 5},
 'fold_assignment': {'default': 'AUTO',
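
The mismatch noted above is what positional indexing produces once `aml` has been reassigned by a later run. One way to avoid it, sketched here with a hypothetical helper `find_model_id`, is to select the model by algorithm prefix and run timestamp instead of by leaderboard position. The id lists are copied from the 500-second and 1000-second leaderboards in this notebook.

```python
def find_model_id(model_ids, prefix, run_stamp):
    """Return the single id that starts with `prefix` and contains `run_stamp`."""
    matches = [m for m in model_ids if m.startswith(prefix) and run_stamp in m]
    if len(matches) != 1:
        raise LookupError(f"expected one match, found {len(matches)}: {matches}")
    return matches[0]


run_500 = [
    'XGBoost_1_AutoML_20190408_134347', 'XGBoost_2_AutoML_20190408_134347',
    'XGBoost_3_AutoML_20190408_134347', 'DRF_1_AutoML_20190408_134347',
    'StackedEnsemble_AllModels_AutoML_20190408_134347',
    'StackedEnsemble_BestOfFamily_AutoML_20190408_134347',
    'GBM_1_AutoML_20190408_134347', 'GLM_grid_1_AutoML_20190408_134347_model_1',
]
run_1000 = [
    'GBM_2_AutoML_20190408_171007', 'GBM_1_AutoML_20190408_171007',
    'XGBoost_1_AutoML_20190408_171007', 'GBM_3_AutoML_20190408_171007',
    'StackedEnsemble_AllModels_AutoML_20190408_171007',
    'XGBoost_3_AutoML_20190408_171007', 'GBM_4_AutoML_20190408_171007',
    'XGBoost_2_AutoML_20190408_171007',
    'GLM_grid_1_AutoML_20190408_171007_model_1', 'DRF_1_AutoML_20190408_171007',
]

# Index 6 means different models depending on which leaderboard is current.
print(run_500[6], run_1000[6])

# Selecting by name is independent of leaderboard order.
wanted = find_model_id(run_500 + run_1000, 'GBM_1_', '134347')
print(wanted)
```

The returned id can then be passed to `h2o.get_model`, provided the model is still present on the H2O cluster.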

# Model with runtime 1000 seconds

In [57]:
# Assume the following are passed by the user from the web interface

'''
Need a user id and project id?

'''
target='cnt' 
data_file='hour.csv'
run_time=1000
run_id='SOME_ID_20180617_221529' # Just some arbitrary ID
server_path='Users/newzysharma/Desktop/Desktop/Machine_Learning/Project'
classification=False
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="HyperparameterDB_Project"  # project_name = project

In [58]:
# Use local data file or download from some type of bucket
import os

data_path=os.path.join(server_path,data_file)
data_path

'Users/newzysharma/Desktop/Desktop/Machine_Learning/Project/hour.csv'

In [59]:
# assign target (y) and predictor columns (X) for regression
y = target
X = [name for name in hp.columns if name != y]
print(y)
print(X)

cnt
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']


In [60]:
# determine column types
ints, reals, enums = [], [], []
for key, val in hp.types.items():
    if key in X:
        if val == 'enum':
            enums.append(key)
        elif val == 'int':
            ints.append(key)            
        else: 
            reals.append(key)

print(ints)
print(enums)
print(reals)

['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'casual', 'registered']
[]
['dteday', 'temp', 'atemp', 'hum', 'windspeed']


In [61]:
# impute missing values
_ = hp[reals].impute(method='mean')
_ = hp[ints].impute(method='median')

if scale:
    hp[reals] = hp[reals].scale()
    hp[ints] = hp[ints].scale()

In [20]:
# set target to factor for classification by default or if user specifies classification
#if classification:
    #hp[y] = hp[y].asfactor()

In [31]:
hp[y].levels()

[]

In [62]:
if classification:
    class_percentage = y_balance = hp[y].mean()[0] / (hp[y].max() - hp[y].min())  # was df[...]; df is undefined
    if class_percentage < balance_threshold:
        balance_y=True
        

print(run_time)
type(run_time)

1000


int
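
For a target coded as 0/1, the expression `mean / (max - min)` in the cell above reduces to the fraction of rows in the positive class, which is then compared with `balance_threshold`. A plain-Python sketch of that check follows, assuming a binary 0/1 target. The function name `needs_balancing` and the sample labels are made up for illustration.

```python
def needs_balancing(labels, threshold=0.2):
    """Return (positive_fraction, flag); flag is True when the fraction is below threshold."""
    mean = sum(labels) / len(labels)
    spread = max(labels) - min(labels)
    if spread == 0:
        raise ValueError("target has a single value; class balance is undefined")
    fraction = mean / spread
    return fraction, fraction < threshold


labels = [0] * 9 + [1]  # 10% positives
fraction, flag = needs_balancing(labels)
print(fraction, flag)
```

In this notebook `classification` is False, so the check is skipped and `balance_y` keeps its initial value.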


## Cross-validate rather than use a train/test split (1000-second run)

In [63]:
# automl
# runs for run_time seconds then builds a stacked ensemble

aml = H2OAutoML(max_runtime_secs=run_time,project_name = project) # init automl, run for 1000 seconds
aml.train(x=X,  
           y=y,
           training_frame=hp)

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Leaderboard

In [64]:
# view leaderboard
lb = aml.leaderboard
lb

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GBM_2_AutoML_20190408_171007,10.3989,3.22474,10.3989,1.88964,0.0353651
GBM_1_AutoML_20190408_171007,13.2483,3.63982,13.2483,2.30598,0.0573338
XGBoost_1_AutoML_20190408_171007,14.936,3.86471,14.936,2.25298,0.0426858
GBM_3_AutoML_20190408_171007,22.3503,4.72761,22.3503,2.93989,0.0688736
StackedEnsemble_AllModels_AutoML_20190408_171007,23.1046,4.80673,23.1046,3.22004,0.180671
XGBoost_3_AutoML_20190408_171007,23.4324,4.84071,23.4324,3.16966,0.0785482
GBM_4_AutoML_20190408_171007,25.8815,5.08739,25.8815,2.98719,
XGBoost_2_AutoML_20190408_171007,31.3904,5.60271,31.3904,3.26648,0.0692559
GLM_grid_1_AutoML_20190408_171007_model_1,38.387,6.19573,38.387,4.45197,
DRF_1_AutoML_20190408_171007,44.9979,6.70804,44.9979,3.64084,0.0762501




In [65]:
aml.leader

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_2_AutoML_20190408_171007


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2.7545139005515074
RMSE: 1.659672829370749
MAE: 1.1420248950973013
RMSLE: 0.026260655890205944
Mean Residual Deviance: 2.7545139005515074

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 10.398929973581447
RMSE: 3.2247371944984056
MAE: 1.8896396840192615
RMSLE: 0.03536506290104928
Mean Residual Deviance: 10.398929973581447
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,1.8896418,0.0331531,1.8888159,1.8440878,1.955546,1.8330745,1.9266849
mean_residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
mse,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
r2,0.9996837,0.0000345,0.9997121,0.9997374,0.9995934,0.9996870,0.9996888
residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
rmse,3.2161787,0.1659996,3.0797105,2.9505417,3.6471422,3.2123823,3.191117
rmsle,0.0353167,0.0013090,0.0337463,0.0375869,0.0355834,0.0327149,0.0369519


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-08 17:25:27,16.816 sec,0.0,181.3823804,142.3998489,32899.5679309
,2019-04-08 17:25:27,16.882 sec,5.0,107.4239264,84.2400360,11539.8999535
,2019-04-08 17:25:27,16.935 sec,10.0,63.8940838,50.0233089,4082.4539467
,2019-04-08 17:25:27,16.985 sec,15.0,38.4509700,30.0349704,1478.4770913
,2019-04-08 17:25:27,17.034 sec,20.0,25.6530386,19.7348606,658.0783895
---,---,---,---,---,---,---
,2019-04-08 17:25:30,19.344 sec,270.0,1.7050142,1.1700358,2.9070734
,2019-04-08 17:25:30,19.393 sec,275.0,1.6909674,1.1622558,2.8593708
,2019-04-08 17:25:30,19.451 sec,280.0,1.6770692,1.1544659,2.8125612



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
registered,2568583680.0000000,1.0,0.9182406
casual,153480320.0000000,0.0597529,0.0548675
hr,43721364.0000000,0.0170216,0.0156299
instant,9335018.0,0.0036343,0.0033372
workingday,7972549.0,0.0031039,0.0028501
dteday,5198927.5,0.0020240,0.0018586
temp,3354083.5,0.0013058,0.0011990
weekday,2513148.2500000,0.0009784,0.0008984
atemp,2094532.2500000,0.0008154,0.0007488




In [66]:
aml.leader.algo

'gbm'

## Ensemble Exploration

In [67]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
0,GBM_2_AutoML_20190408_171007,10.39893,3.224737,10.39893,1.88964,0.035365
1,GBM_1_AutoML_20190408_171007,13.248308,3.639823,13.248308,2.305975,0.057334
2,XGBoost_1_AutoML_20190408_171007,14.936004,3.864713,14.936004,2.252979,0.042686
3,GBM_3_AutoML_20190408_171007,22.350299,4.72761,22.350299,2.93989,0.068874
4,StackedEnsemble_AllModels_AutoML_20190408_171007,23.104649,4.80673,23.104649,3.220036,0.180671
5,XGBoost_3_AutoML_20190408_171007,23.432444,4.840707,23.432444,3.16966,0.078548
6,GBM_4_AutoML_20190408_171007,25.881513,5.087388,25.881513,2.987189,
7,XGBoost_2_AutoML_20190408_171007,31.390413,5.602715,31.390413,3.266475,0.069256
8,GLM_grid_1_AutoML_20190408_171007_model_1,38.387049,6.195728,38.387049,4.451969,
9,DRF_1_AutoML_20190408_171007,44.997852,6.708044,44.997852,3.640842,0.07625


Smaller RMSE and MSE values indicate a better model. GBM_2_AutoML_20190408_171007 has the smallest RMSE (3.224737) and MSE (10.398930) on the leaderboard, which is why it is the best model in this run.
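
RMSE is the square root of MSE, so the two metrics always rank models identically. The sketch below gives both definitions on a few values where the predictions are made up for illustration, and then checks that the leaderboard figures for GBM_2_AutoML_20190408_171007 are consistent with each other.

```python
import math


def mse(actual, predicted):
    """Mean squared error of paired observations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(actual, predicted))


# Predictions here are made up, for illustration only.
actual = [16, 40, 32, 13]
predicted = [14, 43, 32, 12]
print(mse(actual, predicted), rmse(actual, predicted))

# Leaderboard values for the leader: rmse squared should reproduce mse.
leader_rmse, leader_mse = 3.224737, 10.39893
consistent = abs(leader_rmse ** 2 - leader_mse) < 1e-4
print(consistent)
```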

## Getting Models

### Parameters for GBM_2_AutoML_20190408_171007

In [69]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [70]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GBM_2_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/GBM_2_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'score_tree_interval': {'default': 0, 'actual': 5},
 'fold_assignment': {'default': 'AUTO',

### Parameters for XGBoost_1_AutoML_20190408_171007

In [82]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[2])

In [83]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'XGBoost_1_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/XGBoost_1_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'fold_assignment': {'default': 'AUTO', 'actual': 'Modulo'},
 'fold_column': {'defau

### Parameters for StackedEnsemble_AllModels_AutoML_20190408_171007

In [86]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[4])

In [87]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'StackedEnsemble_AllModels_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/StackedEnsemble_AllModels_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'response_column': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ColSpecifierV3',
    'schema_type': 'VecSpecifier'},
   'column_name': 'cnt',
   'is_member_of_frames': None}},
 'validation_frame': {'default': None, 'actual': None},
 'blending_frame': {'default': None, 'actual': None},
 'base_models': {'default': [],
  'actual': [{'__meta': {'schema_version': 3,
     'schema_name': 'ModelKey

# Model with runtime 1350 seconds

In [128]:
# Assume the following are passed by the user from the web interface

'''
Need a user id and project id?

'''
target='cnt' 
data_file='hour.csv'
run_time=1350
run_id='SOME_ID_20180617_221530' # Just some arbitrary ID
server_path='Users/newzysharma/Desktop/Desktop/Machine_Learning/Project'
classification=False
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="HyperparameterDB_Project"  # project_name = project

In [109]:
# Use local data file or download from some type of bucket
import os

data_path=os.path.join(server_path,data_file)
data_path

'Users/newzysharma/Desktop/Desktop/Machine_Learning/Project/hour.csv'

In [110]:
# assign target (y) and predictor columns (X) for regression
y = target
X = [name for name in hp.columns if name != y]
print(y)
print(X)

cnt
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']


In [111]:
# determine column types
ints, reals, enums = [], [], []
for key, val in hp.types.items():
    if key in X:
        if val == 'enum':
            enums.append(key)
        elif val == 'int':
            ints.append(key)            
        else: 
            reals.append(key)

print(ints)
print(enums)
print(reals)

['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'casual', 'registered']
[]
['dteday', 'temp', 'atemp', 'hum', 'windspeed']


In [112]:
# impute missing values
_ = hp[reals].impute(method='mean')
_ = hp[ints].impute(method='median')

if scale:
    hp[reals] = hp[reals].scale()
    hp[ints] = hp[ints].scale()

In [113]:
if classification:
    class_percentage = y_balance = hp[y].mean()[0] / (hp[y].max() - hp[y].min())  # was df[...]; df is undefined
    if class_percentage < balance_threshold:
        balance_y=True
        

print(run_time)
type(run_time)

1350


int

## Cross-validate rather than use a train/test split (1350-second run)

In [114]:
# automl
# runs for run_time seconds then builds a stacked ensemble

aml = H2OAutoML(max_runtime_secs=run_time,project_name = project) # init automl, run for 1350 seconds
aml.train(x=X,  
           y=y,
           training_frame=hp)

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Leaderboard

In [115]:
# view leaderboard
lb = aml.leaderboard
lb

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GBM_2_AutoML_20190408_171007,10.3989,3.22474,10.3989,1.88964,0.0353651
GBM_1_AutoML_20190408_171007,13.2483,3.63982,13.2483,2.30598,0.0573338
GBM_4_AutoML_20190408_190707,14.2118,3.76985,14.2118,2.15209,0.0367956
XGBoost_1_AutoML_20190408_171007,14.936,3.86471,14.936,2.25298,0.0426858
GBM_3_AutoML_20190408_190707,15.2039,3.89922,15.2039,2.27726,0.0438891
GBM_1_AutoML_20190408_190707,19.1514,4.37623,19.1514,2.70842,
XGBoost_2_AutoML_20190408_190707,19.5511,4.42167,19.5511,2.52855,0.050435
XGBoost_1_AutoML_20190408_190707,20.9605,4.57826,20.9605,2.65315,0.0493793
GBM_2_AutoML_20190408_190707,22.3321,4.72569,22.3321,2.82382,
GBM_3_AutoML_20190408_171007,22.3503,4.72761,22.3503,2.93989,0.0688736




In [116]:
aml.leader

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_2_AutoML_20190408_171007


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2.7545139005515074
RMSE: 1.659672829370749
MAE: 1.1420248950973013
RMSLE: 0.026260655890205944
Mean Residual Deviance: 2.7545139005515074

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 10.398929973581447
RMSE: 3.2247371944984056
MAE: 1.8896396840192615
RMSLE: 0.03536506290104928
Mean Residual Deviance: 10.398929973581447
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,1.8896418,0.0331531,1.8888159,1.8440878,1.955546,1.8330745,1.9266849
mean_residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
mse,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
r2,0.9996837,0.0000345,0.9997121,0.9997374,0.9995934,0.9996870,0.9996888
residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
rmse,3.2161787,0.1659996,3.0797105,2.9505417,3.6471422,3.2123823,3.191117
rmsle,0.0353167,0.0013090,0.0337463,0.0375869,0.0355834,0.0327149,0.0369519


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-08 17:25:27,16.816 sec,0.0,181.3823804,142.3998489,32899.5679309
,2019-04-08 17:25:27,16.882 sec,5.0,107.4239264,84.2400360,11539.8999535
,2019-04-08 17:25:27,16.935 sec,10.0,63.8940838,50.0233089,4082.4539467
,2019-04-08 17:25:27,16.985 sec,15.0,38.4509700,30.0349704,1478.4770913
,2019-04-08 17:25:27,17.034 sec,20.0,25.6530386,19.7348606,658.0783895
---,---,---,---,---,---,---
,2019-04-08 17:25:30,19.344 sec,270.0,1.7050142,1.1700358,2.9070734
,2019-04-08 17:25:30,19.393 sec,275.0,1.6909674,1.1622558,2.8593708
,2019-04-08 17:25:30,19.451 sec,280.0,1.6770692,1.1544659,2.8125612



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
registered,2568583680.0000000,1.0,0.9182406
casual,153480320.0000000,0.0597529,0.0548675
hr,43721364.0000000,0.0170216,0.0156299
instant,9335018.0,0.0036343,0.0033372
workingday,7972549.0,0.0031039,0.0028501
dteday,5198927.5,0.0020240,0.0018586
temp,3354083.5,0.0013058,0.0011990
weekday,2513148.2500000,0.0009784,0.0008984
atemp,2094532.2500000,0.0008154,0.0007488




In [118]:
aml.leader.algo

'gbm'

## Ensemble Exploration

In [119]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
0,GBM_2_AutoML_20190408_171007,10.39893,3.224737,10.39893,1.88964,0.035365
1,GBM_1_AutoML_20190408_171007,13.248308,3.639823,13.248308,2.305975,0.057334
2,GBM_4_AutoML_20190408_190707,14.211758,3.769849,14.211758,2.152092,0.036796
3,XGBoost_1_AutoML_20190408_171007,14.936004,3.864713,14.936004,2.252979,0.042686
4,GBM_3_AutoML_20190408_190707,15.203899,3.899218,15.203899,2.277258,0.043889
5,GBM_1_AutoML_20190408_190707,19.15141,4.376232,19.15141,2.708416,
6,XGBoost_2_AutoML_20190408_190707,19.551146,4.421668,19.551146,2.528554,0.050435
7,XGBoost_1_AutoML_20190408_190707,20.960508,4.578265,20.960508,2.653145,0.049379
8,GBM_2_AutoML_20190408_190707,22.332141,4.725689,22.332141,2.823816,
9,GBM_3_AutoML_20190408_171007,22.350299,4.72761,22.350299,2.93989,0.068874
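
Two rows of this leaderboard have an empty `rmsle` cell. The notebook does not say why. One plausible explanation, stated here as an assumption, is that those models produced at least one negative prediction, in which case the logarithm used by RMSLE cannot be taken sensibly. The sketch below is a plain-Python version of the RMSLE formula that returns NaN in that situation. The function name and the sample values are made up for illustration, and this is not H2O's implementation.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error; NaN when any value is negative."""
    if any(v < 0 for v in list(actual) + list(predicted)):
        return float("nan")
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

counts = [16, 40, 32]                      # made-up hourly counts
print(rmsle(counts, [15.0, 42.0, 30.0]))   # a finite value
print(rmsle(counts, [15.0, -1.5, 30.0]))   # nan
```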


## Getting Models

### Parameters for GBM_2_AutoML_20190408_171007

In [121]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [122]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GBM_2_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/GBM_2_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'score_tree_interval': {'default': 0, 'actual': 5},
 'fold_assignment': {'default': 'AUTO',

### Parameters for XGBoost_1_AutoML_20190408_171007

In [124]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[3])

In [125]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'XGBoost_1_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/XGBoost_1_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'fold_assignment': {'default': 'AUTO', 'actual': 'Modulo'},
 'fold_column': {'defau
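
The cells above pick models by hard-coded position (`model_set[0]`, `model_set[3]`), and those positions shift whenever the leaderboard gains rows. A minimal sketch of looking a model up by its name prefix instead is shown below. The helper `first_with_prefix` is hypothetical, and the list holds the first five ids of the data frame printed earlier.

```python
# model ids in leaderboard order, copied from the data frame printed earlier
model_ids = [
    "GBM_2_AutoML_20190408_171007",
    "GBM_1_AutoML_20190408_171007",
    "GBM_4_AutoML_20190408_190707",
    "XGBoost_1_AutoML_20190408_171007",
    "GBM_3_AutoML_20190408_190707",
]

def first_with_prefix(ids, prefix):
    """Return (position, model_id) of the first id starting with prefix, or None."""
    for position, model_id in enumerate(ids):
        if model_id.startswith(prefix):
            return position, model_id
    return None

print(first_with_prefix(model_ids, "XGBoost"))
print(first_with_prefix(model_ids, "DRF"))  # no DRF among these five ids
```

In the notebook, `model_ids` could be taken from `list(aml_leaderboard_df['model_id'])`, and the returned id passed to `h2o.get_model`.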

### Parameters for DRF_1_AutoML_20190408_171007

In [126]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[20])

In [127]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'DRF_1_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/DRF_1_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'score_tree_interval': {'default': 0, 'actual': 0},
 'fold_assignment': {'default': 'AUTO',

## Model with runtime 1500 seconds

In [134]:
# Assume the following are passed by the user from the web interface

# Open question: are a user id and a project id needed as well?
target='cnt' 
data_file='hour.csv'
run_time=1500
run_id='SOME_ID_20180617_221531' # Just some arbitrary ID
server_path='Users/newzysharma/Desktop/Desktop/Machine_Learning/Project'
classification=False
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="HyperparameterDB_Project"  # project_name = project

In [135]:
# Use local data file or download from some type of bucket
import os

data_path=os.path.join(server_path,data_file)
data_path

'Users/newzysharma/Desktop/Desktop/Machine_Learning/Project/hour.csv'
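
The joined path printed above has no leading slash, so it is a relative path and will be resolved against the current working directory. The sketch below makes that visible with `posixpath`, chosen so the result does not depend on the host operating system. The absolute variant is only an assumption about what may have been intended.

```python
import posixpath  # POSIX path rules regardless of the host OS

server_path = "Users/newzysharma/Desktop/Desktop/Machine_Learning/Project"
data_file = "hour.csv"

relative_path = posixpath.join(server_path, data_file)
# hypothetical absolute variant: the leading slash is an assumption,
# it does not appear in the notebook
absolute_path = posixpath.join("/" + server_path, data_file)

print(relative_path, posixpath.isabs(relative_path))
print(absolute_path, posixpath.isabs(absolute_path))
```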

In [136]:
# assign the target and input columns (regression here, since classification=False)
y = target
X = [name for name in hp.columns if name != y]
print(y)
print(X)

cnt
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']


In [137]:
# impute missing values
_ = hp[reals].impute(method='mean')
_ = hp[ints].impute(method='median')

if scale:
    hp[reals] = hp[reals].scale()
    hp[ints] = hp[ints].scale()

In [138]:
if classification:
    # share of the positive class for 0/1 labels; hp is the frame used for training
    class_percentage = y_balance = hp[y].mean()[0] / (hp[y].max() - hp[y].min())
    if class_percentage < balance_threshold:
        balance_y = True
        

print(run_time)
type(run_time)

1500


int
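
The balance check in the cell above is skipped in this run because `classification` is `False`. To show what it computes, here is a plain-Python sketch of the same ratio, mean / (max - min), on made-up 0/1 labels. For labels coded as 0 and 1 the ratio is simply the share of positives. The function name `positive_fraction` is hypothetical.

```python
def positive_fraction(labels):
    """mean / (max - min); for 0/1 labels this is the share of positives."""
    spread = max(labels) - min(labels)
    if spread == 0:
        raise ValueError("only one class present")
    return sum(labels) / len(labels) / spread

balance_threshold = 0.2
labels = [0] * 9 + [1]   # made-up labels: 10% positives
balance_y = positive_fraction(labels) < balance_threshold
print(positive_fraction(labels), balance_y)
```

With 10% positives the ratio falls below the 0.2 threshold, so class balancing would be switched on.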

## Cross-validate rather than use a train/test split (1500-second run)

In [139]:
# AutoML: train for up to run_time seconds, then build stacked ensembles.
# Reusing the same project_name adds the new models to the existing leaderboard.

aml = H2OAutoML(max_runtime_secs=run_time, project_name=project)  # run for up to 1500 seconds
aml.train(x=X,
          y=y,
          training_frame=hp)

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Leaderboard

In [140]:
# view leaderboard
lb = aml.leaderboard
lb

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GBM_2_AutoML_20190408_171007,10.3989,3.22474,10.3989,1.88964,0.0353651
GBM_2_AutoML_20190408_194251,12.7928,3.57671,12.7928,2.13437,0.0450689
GBM_1_AutoML_20190408_171007,13.2483,3.63982,13.2483,2.30598,0.0573338
GBM_4_AutoML_20190408_190707,14.2118,3.76985,14.2118,2.15209,0.0367956
GBM_3_AutoML_20190408_194251,14.2999,3.78152,14.2999,2.17271,0.0358335
XGBoost_1_AutoML_20190408_171007,14.936,3.86471,14.936,2.25298,0.0426858
GBM_3_AutoML_20190408_190707,15.2039,3.89922,15.2039,2.27726,0.0438891
XGBoost_3_AutoML_20190408_194251,15.9838,3.99797,15.9838,2.48441,
GBM_1_AutoML_20190408_190707,19.1514,4.37623,19.1514,2.70842,
XGBoost_2_AutoML_20190408_190707,19.5511,4.42167,19.5511,2.52855,0.050435
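
The ids in this leaderboard end in three different suffixes, which suggests that models from earlier runs under the same `project_name` are listed together with the new ones. Reading the suffix as a run timestamp is an interpretation rather than something the notebook states. The sketch below groups the ten ids shown above by that suffix, and `run_stamp` is a hypothetical helper.

```python
from collections import Counter

# model ids copied from the leaderboard above
model_ids = [
    "GBM_2_AutoML_20190408_171007",
    "GBM_2_AutoML_20190408_194251",
    "GBM_1_AutoML_20190408_171007",
    "GBM_4_AutoML_20190408_190707",
    "GBM_3_AutoML_20190408_194251",
    "XGBoost_1_AutoML_20190408_171007",
    "GBM_3_AutoML_20190408_190707",
    "XGBoost_3_AutoML_20190408_194251",
    "GBM_1_AutoML_20190408_190707",
    "XGBoost_2_AutoML_20190408_190707",
]

def run_stamp(model_id):
    """Return the trailing date_time suffix, e.g. '20190408_171007'."""
    return "_".join(model_id.split("_")[-2:])

models_per_run = Counter(run_stamp(m) for m in model_ids)
for stamp, count in sorted(models_per_run.items()):
    print(stamp, count)
```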




In [141]:
aml.leader

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_2_AutoML_20190408_171007


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2.7545139005515074
RMSE: 1.659672829370749
MAE: 1.1420248950973013
RMSLE: 0.026260655890205944
Mean Residual Deviance: 2.7545139005515074

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 10.398929973581447
RMSE: 3.2247371944984056
MAE: 1.8896396840192615
RMSLE: 0.03536506290104928
Mean Residual Deviance: 10.398929973581447
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,1.8896418,0.0331531,1.8888159,1.8440878,1.955546,1.8330745,1.9266849
mean_residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
mse,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
r2,0.9996837,0.0000345,0.9997121,0.9997374,0.9995934,0.9996870,0.9996888
residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
rmse,3.2161787,0.1659996,3.0797105,2.9505417,3.6471422,3.2123823,3.191117
rmsle,0.0353167,0.0013090,0.0337463,0.0375869,0.0355834,0.0327149,0.0369519


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-08 17:25:27,16.816 sec,0.0,181.3823804,142.3998489,32899.5679309
,2019-04-08 17:25:27,16.882 sec,5.0,107.4239264,84.2400360,11539.8999535
,2019-04-08 17:25:27,16.935 sec,10.0,63.8940838,50.0233089,4082.4539467
,2019-04-08 17:25:27,16.985 sec,15.0,38.4509700,30.0349704,1478.4770913
,2019-04-08 17:25:27,17.034 sec,20.0,25.6530386,19.7348606,658.0783895
---,---,---,---,---,---,---
,2019-04-08 17:25:30,19.344 sec,270.0,1.7050142,1.1700358,2.9070734
,2019-04-08 17:25:30,19.393 sec,275.0,1.6909674,1.1622558,2.8593708
,2019-04-08 17:25:30,19.451 sec,280.0,1.6770692,1.1544659,2.8125612



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
registered,2568583680.0000000,1.0,0.9182406
casual,153480320.0000000,0.0597529,0.0548675
hr,43721364.0000000,0.0170216,0.0156299
instant,9335018.0,0.0036343,0.0033372
workingday,7972549.0,0.0031039,0.0028501
dteday,5198927.5,0.0020240,0.0018586
temp,3354083.5,0.0013058,0.0011990
weekday,2513148.2500000,0.0009784,0.0008984
atemp,2094532.2500000,0.0008154,0.0007488
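
The cross-validation summary in this output reports a mean `rmse` of 3.2161787, while the leaderboard lists 3.224737 for the same model. The two figures are consistent with each other: the first is the average of the five per-fold RMSE values, and the second matches the square root of the cross-validated MSE. A square root of a mean is never smaller than the mean of the square roots, which the sketch below checks with the fold values copied from the summary. The variable names are illustrative.

```python
import math

# per-fold values copied from the cross-validation summary above
fold_mse = [9.484616, 8.705696, 13.301646, 10.319401, 10.1832285]
fold_rmse = [3.0797105, 2.9505417, 3.6471422, 3.2123823, 3.191117]

mean_mse = sum(fold_mse) / len(fold_mse)
mean_of_rmse = sum(fold_rmse) / len(fold_rmse)   # what the summary's 'mean' column shows
rmse_of_mean = math.sqrt(mean_mse)               # close to the leaderboard rmse

print(round(mean_mse, 6), round(mean_of_rmse, 6), round(rmse_of_mean, 6))
```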




In [142]:
aml.leader.algo

'gbm'

## Ensemble Exploration

In [143]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

Unnamed: 0,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
0,GBM_2_AutoML_20190408_171007,10.39893,3.224737,10.39893,1.88964,0.035365
1,GBM_2_AutoML_20190408_194251,12.792848,3.576709,12.792848,2.134374,0.045069
2,GBM_1_AutoML_20190408_171007,13.248308,3.639823,13.248308,2.305975,0.057334
3,GBM_4_AutoML_20190408_190707,14.211758,3.769849,14.211758,2.152092,0.036796
4,GBM_3_AutoML_20190408_194251,14.299879,3.781518,14.299879,2.172706,0.035834
5,XGBoost_1_AutoML_20190408_171007,14.936004,3.864713,14.936004,2.252979,0.042686
6,GBM_3_AutoML_20190408_190707,15.203899,3.899218,15.203899,2.277258,0.043889
7,XGBoost_3_AutoML_20190408_194251,15.983771,3.997971,15.983771,2.484415,
8,GBM_1_AutoML_20190408_190707,19.15141,4.376232,19.15141,2.708416,
9,XGBoost_2_AutoML_20190408_190707,19.551146,4.421668,19.551146,2.528554,0.050435


## Getting Models

### Parameters for GBM_2_AutoML_20190408_171007

In [144]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [145]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GBM_2_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/GBM_2_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'score_tree_interval': {'default': 0, 'actual': 5},
 'fold_assignment': {'default': 'AUTO',

### Parameters for XGBoost_1_AutoML_20190408_171007

In [146]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[5])

In [147]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'XGBoost_1_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/XGBoost_1_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'fold_assignment': {'default': 'AUTO', 'actual': 'Modulo'},
 'fold_column': {'defau

### Parameters for DRF_1_AutoML_20190408_171007

In [148]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[29])

In [149]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'DRF_1_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/DRF_1_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'score_tree_interval': {'default': 0, 'actual': 0},
 'fold_assignment': {'default': 'AUTO',

## Model with runtime 1850 seconds

In [150]:
# Assume the following are passed by the user from the web interface

# Open question: are a user id and a project id needed as well?
target='cnt' 
data_file='hour.csv'
run_time=1850
run_id='SOME_ID_20180617_221531' # Just some arbitrary ID
server_path='Users/newzysharma/Desktop/Desktop/Machine_Learning/Project'
classification=False
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="HyperparameterDB_Project"  # project_name = project

In [151]:
# Use local data file or download from some type of bucket
import os

data_path=os.path.join(server_path,data_file)
data_path

'Users/newzysharma/Desktop/Desktop/Machine_Learning/Project/hour.csv'

In [152]:
# assign the target and input columns (regression here, since classification=False)
y = target
X = [name for name in hp.columns if name != y]
print(y)
print(X)

cnt
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']


In [153]:
# impute missing values
_ = hp[reals].impute(method='mean')
_ = hp[ints].impute(method='median')

if scale:
    hp[reals] = hp[reals].scale()
    hp[ints] = hp[ints].scale()

In [154]:
if classification:
    # share of the positive class for 0/1 labels; hp is the frame used for training
    class_percentage = y_balance = hp[y].mean()[0] / (hp[y].max() - hp[y].min())
    if class_percentage < balance_threshold:
        balance_y = True
        

print(run_time)
type(run_time)

1850


int

## Cross-validate rather than use a train/test split (1850-second run)

In [155]:
# AutoML: train for up to run_time seconds, then build stacked ensembles.
# Reusing the same project_name adds the new models to the existing leaderboard.

aml = H2OAutoML(max_runtime_secs=run_time, project_name=project)  # run for up to 1850 seconds
aml.train(x=X,
          y=y,
          training_frame=hp)

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Leaderboard

In [156]:
# view leaderboard
lb = aml.leaderboard
lb

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GBM_2_AutoML_20190408_171007,10.3989,3.22474,10.3989,1.88964,0.0353651
GBM_2_AutoML_20190408_194251,12.7928,3.57671,12.7928,2.13437,0.0450689
GBM_3_AutoML_20190408_203115,12.9605,3.60007,12.9605,2.11321,0.0405568
GBM_1_AutoML_20190408_171007,13.2483,3.63982,13.2483,2.30598,0.0573338
GBM_1_AutoML_20190408_203115,13.4612,3.66895,13.4612,2.19063,0.0428744
GBM_4_AutoML_20190408_190707,14.2118,3.76985,14.2118,2.15209,0.0367956
GBM_3_AutoML_20190408_194251,14.2999,3.78152,14.2999,2.17271,0.0358335
XGBoost_1_AutoML_20190408_171007,14.936,3.86471,14.936,2.25298,0.0426858
GBM_3_AutoML_20190408_190707,15.2039,3.89922,15.2039,2.27726,0.0438891
XGBoost_3_AutoML_20190408_194251,15.9838,3.99797,15.9838,2.48441,




In [157]:
aml.leader

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_2_AutoML_20190408_171007


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2.7545139005515074
RMSE: 1.659672829370749
MAE: 1.1420248950973013
RMSLE: 0.026260655890205944
Mean Residual Deviance: 2.7545139005515074

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 10.398929973581447
RMSE: 3.2247371944984056
MAE: 1.8896396840192615
RMSLE: 0.03536506290104928
Mean Residual Deviance: 10.398929973581447
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,1.8896418,0.0331531,1.8888159,1.8440878,1.955546,1.8330745,1.9266849
mean_residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
mse,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
r2,0.9996837,0.0000345,0.9997121,0.9997374,0.9995934,0.9996870,0.9996888
residual_deviance,10.398917,1.1037041,9.484616,8.705696,13.301646,10.319401,10.1832285
rmse,3.2161787,0.1659996,3.0797105,2.9505417,3.6471422,3.2123823,3.191117
rmsle,0.0353167,0.0013090,0.0337463,0.0375869,0.0355834,0.0327149,0.0369519


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-08 17:25:27,16.816 sec,0.0,181.3823804,142.3998489,32899.5679309
,2019-04-08 17:25:27,16.882 sec,5.0,107.4239264,84.2400360,11539.8999535
,2019-04-08 17:25:27,16.935 sec,10.0,63.8940838,50.0233089,4082.4539467
,2019-04-08 17:25:27,16.985 sec,15.0,38.4509700,30.0349704,1478.4770913
,2019-04-08 17:25:27,17.034 sec,20.0,25.6530386,19.7348606,658.0783895
---,---,---,---,---,---,---
,2019-04-08 17:25:30,19.344 sec,270.0,1.7050142,1.1700358,2.9070734
,2019-04-08 17:25:30,19.393 sec,275.0,1.6909674,1.1622558,2.8593708
,2019-04-08 17:25:30,19.451 sec,280.0,1.6770692,1.1544659,2.8125612



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
registered,2568583680.0000000,1.0,0.9182406
casual,153480320.0000000,0.0597529,0.0548675
hr,43721364.0000000,0.0170216,0.0156299
instant,9335018.0,0.0036343,0.0033372
workingday,7972549.0,0.0031039,0.0028501
dteday,5198927.5,0.0020240,0.0018586
temp,3354083.5,0.0013058,0.0011990
weekday,2513148.2500000,0.0009784,0.0008984
atemp,2094532.2500000,0.0008154,0.0007488




In [158]:
aml.leader.algo

'gbm'

## Ensemble Exploration

In [159]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

Unnamed: 0,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
0,GBM_2_AutoML_20190408_171007,10.398930,3.224737,10.398930,1.889640,0.035365
1,GBM_2_AutoML_20190408_194251,12.792848,3.576709,12.792848,2.134374,0.045069
2,GBM_3_AutoML_20190408_203115,12.960531,3.600074,12.960531,2.113213,0.040557
3,GBM_1_AutoML_20190408_171007,13.248308,3.639823,13.248308,2.305975,0.057334
4,GBM_1_AutoML_20190408_203115,13.461223,3.668954,13.461223,2.190633,0.042874
5,GBM_4_AutoML_20190408_190707,14.211758,3.769849,14.211758,2.152092,0.036796
6,GBM_3_AutoML_20190408_194251,14.299879,3.781518,14.299879,2.172706,0.035834
7,XGBoost_1_AutoML_20190408_171007,14.936004,3.864713,14.936004,2.252979,0.042686
8,GBM_3_AutoML_20190408_190707,15.203899,3.899218,15.203899,2.277258,0.043889
9,XGBoost_3_AutoML_20190408_194251,15.983771,3.997971,15.983771,2.484415,


## Getting Models

### Parameters for GBM_2_AutoML_20190408_171007

In [160]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [161]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GBM_2_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/GBM_2_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'score_tree_interval': {'default': 0, 'actual': 5},
 'fold_assignment': {'default': 'AUTO',

### Parameters for XGBoost_1_AutoML_20190408_190707

In [162]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[17])

In [163]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'XGBoost_1_AutoML_20190408_190707',
   'type': 'Key<Model>',
   'URL': '/3/Models/XGBoost_1_AutoML_20190408_190707'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'validation_frame': {'default': None, 'actual': None},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_models': {'default': True, 'actual': False},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment': {'default': False, 'actual': False},
 'score_each_iteration': {'default': False, 'actual': False},
 'fold_assignment': {'default': 'AUTO', 'actual': 'Modulo'},
 'fold_column': {'defau

### Parameters for StackedEnsemble_AllModels_AutoML_20190408_171007

In [164]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[20])

In [165]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'StackedEnsemble_AllModels_AutoML_20190408_171007',
   'type': 'Key<Model>',
   'URL': '/3/Models/StackedEnsemble_AllModels_AutoML_20190408_171007'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_hour.hex',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_hour.hex'}},
 'response_column': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ColSpecifierV3',
    'schema_type': 'VecSpecifier'},
   'column_name': 'cnt',
   'is_member_of_frames': None}},
 'validation_frame': {'default': None, 'actual': None},
 'blending_frame': {'default': None, 'actual': None},
 'base_models': {'default': [],
  'actual': [{'__meta': {'schema_version': 3,
     'schema_name': 'ModelKey

# CONCLUSION

<table style="width:50%">
  <tr>
      <th>Runtime (seconds)</th>
    <th>Models Generated</th> 
  </tr>
    
   <tr>
    <td>500</td>
    <td>7</td> 
  </tr>
    
  <tr>
    <td>1000</td>
    <td>13</td> 
  </tr>
  
  <tr>
    <td>1350</td>
    <td>33</td> 
  </tr>
  
   <tr>
    <td>1500</td>
    <td>56</td> 
  </tr>
  
   <tr>
    <td>1850</td>
    <td>84</td>
  </tr>
    
</table>
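
One way to read the table is to look at how many models each longer run adds over the previous one. The sketch below does that arithmetic on the five rows. Treating the counts as directly comparable across runs is an assumption, since the runs shared one project name and the notebook does not say how the counts were taken.

```python
# (runtime in seconds, models generated), copied from the table above
runs = [(500, 7), (1000, 13), (1350, 33), (1500, 56), (1850, 84)]

# difference in model count between each run and the previous one
added = [n - prev_n for (_, prev_n), (_, n) in zip(runs, runs[1:])]

for (prev_t, _), (t, _), extra in zip(runs, runs[1:], added):
    print(f"{prev_t:>4} -> {t:>4} s: {extra:+d} models")
```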

# CONTRIBUTION

# CITATIONS

https://github.com/nikbearbrown/CSYE_7245/blob/master/H2O/H2O_automl_model.ipynb

# LICENSE


Copyright 2019 Newzy Sharma 

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated 
documentation files (the "Software"), to deal in the Software without restriction, including without limitation the 
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the 
Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE 
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.