# ABSTRACT

Hyperparameters are parameters specified before a machine learning algorithm is run, and they have a large effect on the predictive power of the resulting models. Knowledge of the relative importance of a hyperparameter to an algorithm, and of its range of values, is crucial to hyperparameter tuning and to creating effective models. For experts and non-experts alike, determining the hyperparameters that optimize model performance can be a tedious and difficult task. We therefore develop a hyperparameter database that allows users to visualize and understand how to choose hyperparameters that maximize the predictive power of their models.

The database is created by running millions of hyperparameter values over thousands of public datasets and calculating the individual conditional expectation of every hyperparameter on the quality of a model.

We analyze the **effect of hyperparameters** on algorithms such as Distributed Random Forest (DRF), Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), and several more. The database thus aims to provide a one-stop platform for data scientists to identify the hyperparameters that have the most effect on their models, speeding up the development of effective predictive models. The database will also use these public datasets to build models that can predict hyperparameters without search, and to visualize and teach concepts such as statistical power and the bias/variance tradeoff. The raw data will also be publicly available to the research community.
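
To make the individual conditional expectation (ICE) idea concrete, the following is a minimal sketch and not the project's actual pipeline. It assumes a hypothetical `quality` function that maps a hyperparameter configuration to a model error; `ice_curves`, `quality`, and the toy configurations are illustrative names introduced here.

```python
from statistics import mean


def ice_curves(predict, rows, feature, grid):
    """Return one ICE curve per row: predictions as `feature` sweeps over `grid`.

    All other values in the row are held fixed.
    """
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)          # copy so the original row is untouched
            modified[feature] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def quality(config):
    """Hypothetical stand-in for model error given hyperparameters (lower is better)."""
    return (config["learn_rate"] - 0.1) ** 2 + 0.01 * config["max_depth"]


configs = [
    {"learn_rate": 0.05, "max_depth": 3},
    {"learn_rate": 0.30, "max_depth": 6},
]
grid = [0.01, 0.1, 0.5]

curves = ice_curves(quality, configs, "learn_rate", grid)
# Averaging the ICE curves point-wise gives the partial dependence curve.
partial_dependence = [mean(points) for points in zip(*curves)]
best_value = grid[partial_dependence.index(min(partial_dependence))]
print(best_value)
```

Each curve shows how the error of one configuration would change if only `learn_rate` were varied, which is the kind of per-hyperparameter view the database is meant to expose.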


In [1]:
# import h2o package and specific estimator 

from __future__ import print_function
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.api as sm

## Importing H2O

import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import logging
import csv
import optparse
import time
import json
from distutils.util import strtobool
import psutil

import warnings
warnings.filterwarnings('ignore')

In [2]:
h2o.init(strict_version_check=False) # start h2o

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime:,11 hours 50 mins
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.1
H2O cluster version age:,25 days
H2O cluster name:,H2O_from_python_newzysharma_g5g8o7
H2O cluster total nodes:,1
H2O cluster free memory:,2.000 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [3]:
#importing data to the server
hp = h2o.import_file(path="hour.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
#Displaying the head
hp.head()

instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01 00:00:00,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01 00:00:00,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
3,2011-01-01 00:00:00,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
4,2011-01-01 00:00:00,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
5,2011-01-01 00:00:00,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1
6,2011-01-01 00:00:00,1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,0,1,1
7,2011-01-01 00:00:00,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2,0,2
8,2011-01-01 00:00:00,1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0.0,1,2,3
9,2011-01-01 00:00:00,1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
10,2011-01-01 00:00:00,1,0,1,9,0,6,0,1,0.32,0.3485,0.76,0.0,8,6,14




In [5]:
hp.describe()

Rows:17379
Cols:17




Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
type,int,time,int,int,int,int,int,int,int,int,real,real,real,real,int,int,int
mins,1.0,1293840000000.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
mean,8690.0,1325477314552.0461,2.501639910236492,0.5025605615973301,6.5377754761493785,11.546751826917548,0.028770355026181024,3.003682605443351,0.6827205247712756,1.425283387997008,0.4969871684216584,0.47577510213476026,0.6272288394038784,0.1900976063064618,35.67621842453536,153.78686920996606,189.4630876345014
maxs,17379.0,1356912000000.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0
sigma,5017.029499614288,18150225217.779854,1.1069181394480765,0.5000078290910197,3.438775713750168,6.914405095264493,0.16716527638437123,2.005771456110988,0.4654306335238829,0.6393568777542534,0.19255612124972193,0.17185021563535932,0.19292983406291514,0.1223402285727905,49.30503038705309,151.35728591258314,181.38759909186476
zeros,0,0,0,8645,0,726,16879,2502,5514,0,0,2,22,2180,1581,24,0
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0,3.0,13.0,16.0
1,2.0,2011-01-01 00:00:00,1.0,0.0,1.0,1.0,0.0,6.0,0.0,1.0,0.22,0.2727,0.8,0.0,8.0,32.0,40.0
2,3.0,2011-01-01 00:00:00,1.0,0.0,1.0,2.0,0.0,6.0,0.0,1.0,0.22,0.2727,0.8,0.0,5.0,27.0,32.0


In [6]:
# Functions

def alphabet(n):
  # Return a random alphanumeric string of length n
  alpha='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
  return ''.join(random.choice(alpha) for _ in range(n))
  
  
def set_meta_data(analysis,run_id,server,data,test,model_path,target,run_time,classification,scale,model,balance,balance_threshold,name,path,nthreads,min_mem_size):
  m_data={}
  m_data['start_time'] = time.time()
  m_data['target']=target
  m_data['server_path']=server
  m_data['data_path']=data 
  m_data['test_path']=test
  m_data['max_models']=model
  m_data['run_time']=run_time
  m_data['run_id'] =run_id
  m_data['scale']=scale
  m_data['classification']=classification
  m_data['model_path']=model_path
  m_data['balance']=balance
  m_data['balance_threshold']=balance_threshold
  m_data['project'] =name
  m_data['end_time'] = time.time()
  m_data['execution_time'] = 0.0
  m_data['run_path'] =path
  m_data['nthreads'] = nthreads
  m_data['min_mem_size'] = min_mem_size
  m_data['analysis'] = analysis
  return m_data


def dict_to_json(dct,n):
  # Serialize dct as pretty-printed JSON and write it to the file named n
  j = json.dumps(dct, indent=4)
  with open(n, 'w') as f:
    print(j, file=f)
  
  
def stackedensemble(mod):
    coef_norm=None
    try:
      metalearner = h2o.get_model(mod.metalearner()['name'])
      coef_norm=metalearner.coef_norm()
    except:
      pass        
    return coef_norm

def stackedensemble_df(df):
    bm_algo={ 'GBM': None,'GLM': None,'DRF': None,'XRT': None,'Dee': None}
    for index, row in df.iterrows():
      if len(row['model_id'])>3:
        key=row['model_id'][0:3]
        if key in bm_algo:
          if bm_algo[key] is None:
                bm_algo[key]=row['model_id']
    # keep only the families for which a model was found
    bm=[m for m in bm_algo.values() if m is not None]
    return bm

def se_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['auc']=modl.auc()   
    d['roc']=modl.roc()
    d['mse']=modl.mse()   
    d['null_degrees_of_freedom']=modl.null_degrees_of_freedom()
    d['null_deviance']=modl.null_deviance()
    d['residual_degrees_of_freedom']=modl.residual_degrees_of_freedom()   
    d['residual_deviance']=modl.residual_deviance()
    d['rmse']=modl.rmse()
    return d

def get_model_by_algo(algo,models_dict):
    mod=None
    mod_id=None    
    for m in list(models_dict.keys()):
        if m[0:3]==algo:
            mod_id=m
            mod=h2o.get_model(m)      
    return mod,mod_id     
    
    
def gbm_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    return d
    
    
def dl_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    return d
    
    
def drf_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    d['roc']=modl.roc()      
    return d
    
def xrt_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    d['roc']=modl.roc()      
    return d
    
    
def glm_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['coef']=modl.coef()  
    d['coef_norm']=modl.coef_norm()      
    return d
    
def model_performance_stats(perf):
    d={}
    try:    
      d['mse']=perf.mse()
    except:
      pass      
    try:    
      d['rmse']=perf.rmse() 
    except:
      pass      
    try:    
      d['null_degrees_of_freedom']=perf.null_degrees_of_freedom()
    except:
      pass      
    try:    
      d['residual_degrees_of_freedom']=perf.residual_degrees_of_freedom()
    except:
      pass      
    try:    
      d['residual_deviance']=perf.residual_deviance() 
    except:
      pass      
    try:    
      d['null_deviance']=perf.null_deviance() 
    except:
      pass      
    try:    
      d['aic']=perf.aic() 
    except:
      pass      
    try:
      d['logloss']=perf.logloss() 
    except:
      pass    
    try:
      d['auc']=perf.auc()
    except:
      pass  
    try:
      d['gini']=perf.gini()
    except:
      pass    
    return d
    
def impute_missing_values(df, x, scal=False):
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in x:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    _ = df[reals].impute(method='mean')
    _ = df[ints].impute(method='median')
    if scal:
        df[reals] = df[reals].scale()
        df[ints] = df[ints].scale()    
    return


def get_independent_variables(df, targ):
    C = [name for name in df.columns if name != targ]
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x=ints+enums+reals
    return x
    
def get_all_variables_csv(i):
    ivd={}
    try:
      iv = pd.read_csv(i,header=None)
    except:
      sys.exit(1)    
    col=iv.values.tolist()[0]
    dt=iv.values.tolist()[1]
    i=0
    for c in col:
      ivd[c.strip()]=dt[i].strip()
      i+=1        
    return ivd
    
    

def check_all_variables(df,dct,y=None):     
    targ=list(dct.keys())     
    for key, val in df.types.items():
        if key in targ:
          if dct[key] not in ['real','int','enum']:                      
            targ.remove(key)  
    for key, val in df.types.items():
        if key in targ:            
          if dct[key] != val:
            print('convert ',key,' ',dct[key],' ',val)
            if dct[key]=='enum':
                try:
                  df[key] = df[key].asfactor() 
                except:
                  targ.remove(key)                 
            if dct[key]=='int': 
                try:                
                  df[key] = df[key].asnumeric() 
                except:
                  targ.remove(key)                  
            if dct[key]=='real':
                try:                
                  df[key] = df[key].asnumeric()  
                except:
                  targ.remove(key)                  
    if y is None:
      y=df.columns[-1] 
    if y in targ:
      targ.remove(y)
    else:
      y=targ.pop()            
    return targ    
    
def predictions(mod,data,run_id):
    # Score `mod` on the test file at `data`; save stats, confusion matrix, and predictions
    test = h2o.import_file(data)
    mod_perf=mod.model_performance(test)
              
    stats_test=model_performance_stats(mod_perf)

    n=run_id+'_test_stats.json'
    dict_to_json(stats_test,n) 

    try:    
      cf=mod_perf.confusion_matrix(metrics=["f1","f2","f0point5","accuracy","precision","recall","specificity","absolute_mcc","min_per_class_accuracy","mean_per_class_accuracy"])
      cf_df=cf[0].table.as_data_frame()
      cf_df.to_csv(run_id+'_test_confusion_matrix.csv')
    except Exception:
      # no confusion matrix is available, e.g. for regression models
      pass

    predictions = mod.predict(test)
    predictions_df=test.cbind(predictions).as_data_frame() 
    predictions_df.to_csv(run_id+'_predictions.csv')
    return

def predictions_test(mod,test,run_id):
    # Same as predictions(), but takes an already loaded test frame and returns the predictions
    mod_perf=mod.model_performance(test)          
    stats_test=model_performance_stats(mod_perf)
    n=run_id+'_test_stats.json'
    dict_to_json(stats_test,n) 
    try:
      cf=mod_perf.confusion_matrix()
#      cf=mod_perf.confusion_matrix(metrics=["f1","f2","f0point5","accuracy","precision","recall","specificity","absolute_mcc","min_per_class_accuracy","mean_per_class_accuracy"])
      cf_df=cf.table.as_data_frame()
      cf_df.to_csv(run_id+'_test_confusion_matrix.csv')
    except Exception:
      # no confusion matrix is available, e.g. for regression models
      pass
    predictions = mod.predict(test)    
    predictions_df=test.cbind(predictions).as_data_frame() 
    predictions_df.to_csv(run_id+'_predictions.csv')
    return predictions

def check_X(x,df):
    # Keep only names that are columns of df.
    # (Removing items from x while iterating over it would skip elements.)
    return [name for name in x if name in df.columns]
    
    
def get_stacked_ensemble(lst):
    # Prefer the BestOfFamily ensemble; fall back to the AllModels ensemble
    se=None
    for model in lst:
      if 'BestOfFamily' in model:
        se=model
    if se is None:     
      for model in lst:
        if 'AllModels' in model:
          se=model           
    return se       
    
def get_variables_types(df):
    d={}
    for key, val in df.types.items():
        d[key]=val           
    return d    
    
#  End Functions
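
The repeated `try`/`except` blocks in `model_performance_stats` can also be expressed as a loop over metric names. The following is a sketch of that design, under the assumption that each metric is a zero-argument method on the performance object; `collect_metrics` and the `FakePerf` stand-in are hypothetical names used only so the example runs without an H2O cluster.

```python
METRIC_NAMES = (
    "mse", "rmse", "null_degrees_of_freedom", "residual_degrees_of_freedom",
    "residual_deviance", "null_deviance", "aic", "logloss", "auc", "gini",
)


def collect_metrics(perf, names=METRIC_NAMES):
    """Collect every metric that `perf` supports, skipping the ones that fail."""
    stats = {}
    for name in names:
        method = getattr(perf, name, None)
        if method is None:
            continue                      # metric not defined on this object
        try:
            stats[name] = method()
        except Exception:                 # metric exists but is not applicable
            continue
    return stats


class FakePerf:
    """Hypothetical stand-in for a performance object of a regression model."""

    def mse(self):
        return 17.9275

    def rmse(self):
        return 4.23409

    def auc(self):
        raise ValueError("AUC is undefined for regression")


print(collect_metrics(FakePerf()))
```

Adding or dropping a metric then means editing one tuple rather than one `try` block per metric.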

In [7]:
all_variables=None

In [8]:
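# Note: cor is a method. Without parentheses this assigns the bound method itself
# (see the last line of the output below), not a correlation matrix.
corr = hp.cor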
corr

instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01 00:00:00,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01 00:00:00,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
3,2011-01-01 00:00:00,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
4,2011-01-01 00:00:00,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
5,2011-01-01 00:00:00,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1
6,2011-01-01 00:00:00,1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,0,1,1
7,2011-01-01 00:00:00,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2,0,2
8,2011-01-01 00:00:00,1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0.0,1,2,3
9,2011-01-01 00:00:00,1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
10,2011-01-01 00:00:00,1,0,1,9,0,6,0,1,0.32,0.3485,0.76,0.0,8,6,14


<bound method H2OFrame.cor of >

In [9]:
hp.shape

(17379, 17)

# Model with a 500-second runtime

In [10]:
# Assume the following are passed by the user from the web interface

'''
Need a user id and project id?

'''
target='cnt' 
data_file='hour.csv'
run_time=500
run_id='500_' # Just some arbitrary ID
server_path='/Users/newzysharma/Desktop/Desktop/info6105/INFO6105-FinalProject/HyperparametersDB/500'
classification=False
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="HyperparameterDB_Project"  # project_name = project

In [11]:
# assign target and inputs (cnt is numeric, so this is a regression problem)
y = target
X = [name for name in hp.columns if name != y]
print(y)
print(X)

cnt
['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']


In [12]:
# determine column types
ints, reals, enums = [], [], []
for key, val in hp.types.items():
    if key in X:
        if val == 'enum':
            enums.append(key)
        elif val == 'int':
            ints.append(key)            
        else: 
            reals.append(key)

print(ints)
print(enums)
print(reals)

['instant', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'casual', 'registered']
[]
['dteday', 'temp', 'atemp', 'hum', 'windspeed']


In [13]:
# impute missing values
_ = hp[reals].impute(method='mean')
_ = hp[ints].impute(method='median')

if scale:
    # scale the columns of hp (the original referenced an undefined frame `df`)
    hp[reals] = hp[reals].scale()
    hp[ints] = hp[ints].scale()
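
What the two `impute` calls above do can be illustrated on plain Python lists. This is a sketch of mean and median imputation in general, not H2O's implementation; `impute` and the sample columns are made up for illustration, with `None` marking a missing value.

```python
from statistics import mean, median


def impute(values, method="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)               # nothing to estimate from
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in values]


temp = [0.24, None, 0.22, 0.26]           # real-valued column, filled with the mean
hr = [0, 1, None, 3, 10]                  # integer column, filled with the median

print(impute(temp, "mean"))
print(impute(hr, "median"))
```

The `missing` row of the `describe()` output above reports zero missing values in every column, so on this dataset the imputation step should leave the data unchanged.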

In [14]:
# # set target to factor for classification by default or if user specifies classification
# if classification:
#     hp[y] = hp[y].asfactor()

In [15]:
hp[y].levels()

[]


## Cross-validate rather than use a train/test split (500-second run)

In [16]:
# automl
# runs for run_time seconds then builds a stacked ensemble

aml = H2OAutoML(max_runtime_secs=run_time,project_name = project) # init automl, run for 500 seconds
aml.train(x=X,  
           y=y,
           training_frame=hp)

AutoML progress: |████████████████████████████████████████████████████████| 100%
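
No separate validation or test frame is passed to `aml.train`, and the cross-validation summary further down reports five folds (`cv_1_valid` to `cv_5_valid`). As a rough illustration of what k-fold cross-validation means, here is a minimal sketch that splits row indices into five folds. The function `k_fold_indices` is a hypothetical helper, and contiguous folds are an assumption made for simplicity; H2O's own fold assignment may differ.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, valid_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


# 17379 is the number of rows reported by hp.shape above.
folds = list(k_fold_indices(n_rows=17379, k=5))
for i, (train, valid) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train)} valid={len(valid)}")
```

Every row is used for validation exactly once, which is why cross-validated metrics can stand in for a held-out test split.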


## Leaderboard

In [19]:
# view leaderboard
lb = aml.leaderboard
lb.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
XGBoost_1_AutoML_20190426_123514,17.9275,4.23409,17.9275,2.48923,0.051761
XGBoost_2_AutoML_20190426_123514,30.2733,5.50212,30.2733,3.10906,0.06022
XGBoost_3_AutoML_20190426_123514,33.4611,5.78455,33.4611,3.82409,0.109715
StackedEnsemble_AllModels_AutoML_20190426_123514,154.252,12.4198,154.252,9.1498,0.382846
DRF_1_AutoML_20190426_123514,435.833,20.8766,435.833,11.5742,0.189786
StackedEnsemble_BestOfFamily_AutoML_20190426_123514,506.632,22.5085,506.632,16.854,0.550865
GLM_grid_1_AutoML_20190426_123514_model_1,32871.0,181.304,32871.0,142.334,1.56985




In [20]:
aml.leader

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_1_AutoML_20190426_123514


ModelMetricsRegression: xgboost
** Reported on train data. **

MSE: 2.3311976605560263
RMSE: 1.5268260086060972
MAE: 1.1264397348603896
RMSLE: NaN
Mean Residual Deviance: 2.3311976605560263

ModelMetricsRegression: xgboost
** Reported on cross-validation data. **

MSE: 17.927493323148283
RMSE: 4.234087070803845
MAE: 2.4892300761015576
RMSLE: 0.051760963821947045
Mean Residual Deviance: 17.927493323148283
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,2.4892728,0.2642054,2.272543,2.2685363,2.3630378,2.3087745,3.2334728
mean_residual_deviance,17.928217,4.5184965,13.175345,13.903452,16.388021,15.6784,30.495863
mse,17.928217,4.5184965,13.175345,13.903452,16.388021,15.6784,30.495863
r2,0.9994544,0.0001390,0.9996000,0.9995806,0.9994991,0.9995243,0.999068
residual_deviance,17.928217,4.5184965,13.175345,13.903452,16.388021,15.6784,30.495863
rmse,4.1777267,0.4872456,3.629786,3.7287333,4.048212,3.959596,5.522306
rmsle,0.0515602,0.0032237,0.0498979,0.0547677,0.0462717,0.0481455,0.0587184


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-26 12:39:53,4 min 39.303 sec,0.0,261.9286476,188.9630876,68606.6164192
,2019-04-26 12:39:53,4 min 39.410 sec,5.0,204.0742245,146.5194116,41646.2890946
,2019-04-26 12:39:53,4 min 39.513 sec,10.0,160.9129461,113.9933905,25892.9762306
,2019-04-26 12:39:53,4 min 39.625 sec,15.0,125.7043313,88.4980068,15801.5789181
,2019-04-26 12:39:53,4 min 39.743 sec,20.0,98.8392505,68.8326343,9769.1974405
---,---,---,---,---,---,---
,2019-04-26 12:40:38,5 min 23.986 sec,400.0,1.5823116,1.1657351,2.5037101
,2019-04-26 12:40:39,5 min 24.939 sec,405.0,1.5676095,1.1553785,2.4573995
,2019-04-26 12:40:40,5 min 25.900 sec,410.0,1.5527340,1.1451441,2.4109830



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
registered,2491783424.0000000,1.0,0.7151880
hr,375320192.0000000,0.1506231,0.1077238
casual,300915392.0000000,0.1207631,0.0863683
instant,97092480.0000000,0.0389651,0.0278673
atemp,89090200.0000000,0.0357536,0.0255705
temp,42659012.0000000,0.0171199,0.0122439
workingday,31720598.0000000,0.0127301,0.0091044
weekday,22565672.0000000,0.0090560,0.0064768
dteday,11745990.0000000,0.0047139,0.0033713
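
The numbers in the output above are related in simple ways, and checking a few of them is a useful sanity test. The sketch below uses only values copied from that output; the variable names are introduced here for illustration.

```python
from math import isclose, sqrt

# Values copied from the cross-validation output above.
cv_mse = 17.927493323148283
cv_rmse = 4.234087070803845
fold_mse = [13.175345, 13.903452, 16.388021, 15.6784, 30.495863]
fold_rmse = [3.629786, 3.7287333, 4.048212, 3.959596, 5.522306]

# RMSE is the square root of MSE.
rmse_matches = isclose(sqrt(cv_mse), cv_rmse, rel_tol=1e-9)

# The 'mean' column of the summary is the plain average over the five folds.
mean_fold_mse = sum(fold_mse) / len(fold_mse)
mean_fold_rmse = sum(fold_rmse) / len(fold_rmse)

# Scaled importance is relative importance divided by the largest value.
relative = {"registered": 2491783424.0, "hr": 375320192.0, "casual": 300915392.0}
top = max(relative.values())
scaled = {name: value / top for name, value in relative.items()}

print(rmse_matches, round(mean_fold_mse, 4), round(mean_fold_rmse, 4))
print(round(scaled["hr"], 7), round(scaled["casual"], 7))
```

Note that the mean of the per-fold RMSE values (about 4.178) is lower than the overall cross-validation RMSE (about 4.234): averaging square roots is not the same as taking the square root of an average.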




In [21]:
aml.leader.algo

'xgboost'

## Ensemble Exploration

In [22]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

Unnamed: 0,model_id,mean_residual_deviance,rmse,mse,mae,rmsle
0,XGBoost_1_AutoML_20190426_123514,17.927493,4.234087,17.927493,2.48923,0.051761
1,XGBoost_2_AutoML_20190426_123514,30.27329,5.502117,30.27329,3.109063,0.06022
2,XGBoost_3_AutoML_20190426_123514,33.46107,5.784554,33.46107,3.824091,0.109715
3,StackedEnsemble_AllModels_AutoML_20190426_123514,154.251941,12.41982,154.251941,9.149797,0.382846
4,DRF_1_AutoML_20190426_123514,435.832512,20.876602,435.832512,11.57419,0.189786
5,StackedEnsemble_BestOfFamily_AutoML_20190426_1...,506.632105,22.50849,506.632105,16.854016,0.550865
6,GLM_grid_1_AutoML_20190426_123514_model_1,32870.983124,181.303566,32870.983124,142.334167,1.569853


In [51]:
aml_leaderboard_df.iloc[0]

model_id                  XGBoost_1_AutoML_20190426_123514
mean_residual_deviance                             17.9275
rmse                                               4.23409
mse                                                17.9275
mae                                                2.48923
rmsle                                             0.051761
Name: 0, dtype: object

In [81]:
aml_leaderboard_df['model_id'] = aml_leaderboard_df['model_id'].astype('str')
xgboost = pd.DataFrame(columns= aml_leaderboard_df.columns)
for i in range(aml_leaderboard_df.shape[0]):
    x = aml_leaderboard_df.iloc[i,:1].str.split('_')[0][0]
    if x == 'XGBoost':
        xgboost = xgboost.append(aml_leaderboard_df.iloc[i], ignore_index = True)  

In [82]:
xgboost.to_csv('XGBoost.csv')

In [83]:
aml_leaderboard_df['model_id'] = aml_leaderboard_df['model_id'].astype('str')
drf = pd.DataFrame(columns= aml_leaderboard_df.columns)
for i in range(aml_leaderboard_df.shape[0]):
    x = aml_leaderboard_df.iloc[i,:1].str.split('_')[0][0]
    if x == 'DRF':
        drf = drf.append(aml_leaderboard_df.iloc[i], ignore_index = True)    

In [84]:
drf.to_csv('DRF.csv')

In [85]:
aml_leaderboard_df['model_id'] = aml_leaderboard_df['model_id'].astype('str')
glm = pd.DataFrame(columns= aml_leaderboard_df.columns)
for i in range(aml_leaderboard_df.shape[0]):
    x = aml_leaderboard_df.iloc[i,:1].str.split('_')[0][0]
    if x == 'GLM':
        glm = glm.append(aml_leaderboard_df.iloc[i], ignore_index = True)  

In [86]:
glm.to_csv('GLM.csv')

In [87]:
aml_leaderboard_df['model_id'] = aml_leaderboard_df['model_id'].astype('str')
gbm = pd.DataFrame(columns= aml_leaderboard_df.columns)
for i in range(aml_leaderboard_df.shape[0]):
    x = aml_leaderboard_df.iloc[i,:1].str.split('_')[0][0]
    if x == 'GBM':
        gbm = gbm.append(aml_leaderboard_df.iloc[i], ignore_index = True)  

In [88]:
gbm.to_csv('GBM.csv')

In [89]:
aml_leaderboard_df['model_id'] = aml_leaderboard_df['model_id'].astype('str')
xrt = pd.DataFrame(columns= aml_leaderboard_df.columns)
for i in range(aml_leaderboard_df.shape[0]):
    x = aml_leaderboard_df.iloc[i,:1].str.split('_')[0][0]
    if x == 'XRT':
        xrt = xrt.append(aml_leaderboard_df.iloc[i], ignore_index = True)  

In [90]:
xrt.to_csv('XRT.csv')

In [91]:
aml_leaderboard_df['model_id'] = aml_leaderboard_df['model_id'].astype('str')
dl = pd.DataFrame(columns= aml_leaderboard_df.columns)
for i in range(aml_leaderboard_df.shape[0]):
    x = aml_leaderboard_df.iloc[i,:1].str.split('_')[0][0]
    if x == 'DeepLearning':
        dl = dl.append(aml_leaderboard_df.iloc[i], ignore_index = True)  

In [92]:
dl.to_csv('DeepLearning.csv')
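
The six cells above repeat the same filter with a different prefix each time. The grouping logic can be sketched once in plain Python, using the model IDs copied from the leaderboard above; `group_by_family` is a hypothetical helper, not part of H2O or pandas.

```python
from collections import defaultdict

# Model IDs copied from the leaderboard above.
model_ids = [
    "XGBoost_1_AutoML_20190426_123514",
    "XGBoost_2_AutoML_20190426_123514",
    "XGBoost_3_AutoML_20190426_123514",
    "StackedEnsemble_AllModels_AutoML_20190426_123514",
    "DRF_1_AutoML_20190426_123514",
    "StackedEnsemble_BestOfFamily_AutoML_20190426_123514",
    "GLM_grid_1_AutoML_20190426_123514_model_1",
]


def group_by_family(ids):
    """Group model IDs by the text before the first underscore."""
    groups = defaultdict(list)
    for model_id in ids:
        family = model_id.split("_", 1)[0]
        groups[family].append(model_id)
    return dict(groups)


families = group_by_family(model_ids)
for family, members in families.items():
    print(family, len(members))
```

For this run the leaderboard contains no GBM, XRT, or DeepLearning models, so those three families do not appear in the grouping.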

## Generating a JSON file for each model in a for loop

In [93]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])

In [94]:
aml_leaderboard_df.shape

(7, 6)

In [95]:
model_set.shape

(7,)

In [96]:
import os.path
from pathlib import Path

In [97]:
# Iterate over all model IDs and write each model's parameters to a uniquely named JSON file.
# If a file with that name already exists, it is not overwritten.

for i in range(model_set.shape[0]):
    mod_best = h2o.get_model(model_set[i])
    hy_parameter = mod_best.params
    n = run_id + str(i) +'_' + model_set[i] + '.json'
    file_name = Path(n)
    if not file_name.is_file():
        dict_to_json(hy_parameter, n)
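
The "do not overwrite" check above can also be delegated to the file system by opening the file in exclusive-creation mode. The following is a minimal sketch of that alternative; `write_json_once`, the file name, and the parameter dictionary are hypothetical, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def write_json_once(params, path):
    """Write `params` as JSON unless `path` already exists; return True if written."""
    try:
        # Mode "x" raises FileExistsError instead of overwriting an existing file.
        with open(path, "x") as handle:
            json.dump(params, handle, indent=4)
    except FileExistsError:
        return False
    return True


# Hypothetical parameters and file name, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "500_0_example_model.json"
    first = write_json_once({"ntrees": 50}, target)
    second = write_json_once({"ntrees": 999}, target)
    stored = json.loads(target.read_text())

print(first, second, stored)
```

Compared with checking `is_file()` first, this avoids the small window in which another process could create the file between the check and the write.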

# Conclusion
The aim of the study was to create a hyperparameter database of the different models developed by running the H2O AutoML function. The function was run with a maximum runtime of 500 seconds, and the resulting leaderboard ranks the models by how well they predict the count (`cnt`). The metrics used to assess predictive performance were MSE, RMSE, residual deviance, and RMSLE. With this runtime, the leaderboard contained seven entries (five base models and two stacked ensembles), and their parameters were exported. Out of these, we analyzed the hyperparameters.

# Contribution:

Code by Self: 30%

Code from other sources: 70%

# Citation
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

https://stats.stackexchange.com/questions/95495/guideline-to-select-the-hyperparameters-in-deep-learning

https://github.com/nikbearbrown/CSYE_7245/tree/master/H2O

https://medium.com/@elutins/grid-searching-in-machine-learning-quick-explanation-and-python-implementation-550552200596

http://docs.h2o.ai/

# Licence
Copyright 2019 Ashu Kapil, Newzy Sharma, Bhaghirathi Kundu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.