# Hyperparameter Database Project

# Abstract

Hyperparameters are the parameters of an algorithm that have to be set before a model is trained. Choosing good values for them can greatly improve a model's predictive performance.
Finding those values by hand is difficult, so this project uses H2O AutoML to search for hyperparameter values that give the best results for each algorithm.
I ran H2O on the mushroom classification dataset for five runtimes and obtained a leaderboard for each run, then exported the hyperparameter values of every model to JSON files.
For every model I tried to identify the important hyperparameters, find their ranges, and compare them across models.

# Importing Libraries

In [1]:
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import logging
import csv
import optparse
import time
import json
from distutils.util import strtobool
import psutil
import warnings
warnings.filterwarnings('ignore')
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection  import train_test_split 
from sklearn.metrics import accuracy_score, log_loss, mean_squared_error

# Connecting to H2O cluster

In [2]:

port_no = 54321
h2o.init(port=port_no, strict_version_check=False) # start h2o on the chosen port

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
  Starting server from C:\Users\deodh\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: c:\users\deodh\appdata\local\temp\tmprofzry
  JVM stdout: c:\users\deodh\appdata\local\temp\tmprofzry\h2o_deodh_started_from_python.out
  JVM stderr: c:\users\deodh\appdata\local\temp\tmprofzry\h2o_deodh_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 04 secs |
| H2O cluster timezone | America/New_York |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.24.0.1 |
| H2O cluster version age | 19 days |
| H2O cluster name | H2O_from_python_deodh_6k182v |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 3.535 Gb |
| H2O cluster total cores | 8 |
| H2O cluster allowed cores | 8 |


In [3]:
df=pd.read_csv("mushrooms.csv")

# Passing the data frame to H2O

In [4]:
df = h2o.H2OFrame(df)

Parse progress: |█████████████████████████████████████████████████████████| 100%


# Function for RunId
Here we define a function that generates a run ID for each run; its argument sets how many characters the ID has. The function returns a new random ID every time it is called.

In [5]:
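def run_id(n):
    # characters a run id may contain: digits, lowercase and uppercase letters
    letter='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    rid=''   # not named `str`, which would shadow the built-in type
    r=len(letter)-1
    while len(rid)<n:
        i=random.randint(0,r)   # randint includes both end points
        rid+=letter[i]
    return rid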
server_path=None
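
The same idea can be written more compactly with the standard library. The sketch below is an alternative, not the function used in this notebook: `make_run_id` and `ALPHABET` are hypothetical names, and the optional `rng` argument is only there so that a seeded generator can be passed in for reproducible IDs.

```python
import random
import string

# Same 62 characters as in run_id: digits, lowercase, uppercase
ALPHABET = string.digits + string.ascii_letters

def make_run_id(n, rng=random):
    """Return a random identifier made of n alphanumeric characters."""
    return ''.join(rng.choices(ALPHABET, k=n))

example_id = make_run_id(10)
print(example_id)  # a 10-character string; the value differs on every call
```

Passing `random.Random(0)` as `rng` would give the same ID on every call, which can help when a run has to be repeated.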

# Creating Metadata

In [17]:
data_path= r'C:\Users\deodh\Desktop\hyperparameter\mushrooms.csv'  # raw string so the backslashes are not read as escapes
all_variables=None
test_path=None
target=None
nthreads=1 
min_mem_size=6 
run_time=400
classification=True
scale=False
max_models=100   
model_path=None
balance_y=False 
balance_threshold=0.2
name=None 
server_path=None  

# Defining the metadata values

In [18]:

def meta_data(run_id,server,data,test,model_path,target,run_time,classification,model,balance,balance_threshold,name,path,nthreads,min_mem_size):
    # Collect the settings of one AutoML run in a dictionary
    m_data={}
    m_data['start_time'] = time.time()
    m_data['target']=target
    m_data['server_path']=server
    m_data['data_path']=data 
    m_data['test_path']=test
    m_data['max_models']=model
    m_data['run_time']=run_time
    m_data['run_id'] =run_id
    m_data['scale']=scale   # read from the notebook-level `scale` variable
    m_data['classification']=classification
    m_data['model_path']=model_path
    m_data['balance']=balance
    m_data['balance_threshold']=balance_threshold
    m_data['project'] =name
    m_data['end_time'] = time.time()
    m_data['execution_time'] = 0.0
    m_data['run_path'] =path
    m_data['nthreads'] = nthreads
    m_data['min_mem_size'] = min_mem_size
    return m_data
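
Note that `meta_data` records `start_time` and `end_time` at the same moment and leaves `execution_time` at 0.0. A minimal sketch of how the timing fields could be filled in once training has finished is shown below; `finalize_metadata` is a hypothetical helper that is not part of this notebook, and the sample values are made up.

```python
import time

def finalize_metadata(m_data, end_time=None):
    """Return a copy of m_data with end_time and execution_time (seconds) set."""
    if end_time is None:
        end_time = time.time()
    updated = dict(m_data)  # copy so the caller's dictionary is left untouched
    updated['end_time'] = end_time
    updated['execution_time'] = end_time - updated['start_time']
    return updated

# Made-up timestamps for illustration
sample = {'start_time': 100.0, 'end_time': 100.0, 'execution_time': 0.0}
finished = finalize_metadata(sample, end_time=500.0)
print(finished['execution_time'])  # 400.0
```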

# Target and Independent variables
Here we define the target as y and the independent variables as X.

In [21]:
target = 'class'

def get_independent_variables(train_data, targ):
    C = [name for name in train_data.columns if name != targ]
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in train_data.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x = ints + enums + reals
    return x

X = get_independent_variables(df, target) 
print(X)
y = target
print(y)

[u'veil-color', u'cap-surface', u'habitat', u'odor', u'stalk-root', u'cap-shape', u'cap-color', u'stalk-color-above-ring', u'spore-print-color', u'gill-color', u'population', u'stalk-color-below-ring', u'ring-type', u'stalk-shape', u'bruises', u'stalk-surface-above-ring', u'veil-type', u'gill-attachment', u'gill-spacing', u'ring-number', u'stalk-surface-below-ring', u'gill-size']
class
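
To see the grouping logic without an H2O cluster, the sketch below applies the same rule to a plain dictionary standing in for `train_data.types`. The function name `split_columns_by_type` and the columns `weight` and `count` are hypothetical; only `class`, `odor` and `habitat` come from the mushroom data.

```python
def split_columns_by_type(column_types, target):
    """Group the non-target column names by their declared type."""
    groups = {'int': [], 'enum': [], 'real': []}
    for name, col_type in column_types.items():
        if name == target:
            continue
        # anything that is neither 'enum' nor 'int' is treated as real,
        # mirroring get_independent_variables
        key = col_type if col_type in ('int', 'enum') else 'real'
        groups[key].append(name)
    return groups

# Stand-in for an H2OFrame's column-name -> type mapping
types = {'class': 'enum', 'odor': 'enum', 'habitat': 'enum',
         'weight': 'real', 'count': 'int'}
groups = split_columns_by_type(types, 'class')
predictors = groups['int'] + groups['enum'] + groups['real']
print(predictors)  # ['count', 'odor', 'habitat', 'weight']
```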


# Specifying the runtime for H2O to run

In [19]:
run_time=400
aml1 = H2OAutoML(max_runtime_secs=run_time)

In [11]:
model_start_time = time.time()
aml1.train(x=X,y=y,training_frame=df)  # Change training_frame=train

AutoML progress: |████████████████████████████████████████████████████████| 100%


# Creating a run_id
Here we call the run_id function with the number of characters we want. A folder named after the run ID is then created in the current directory (unless server_path is set) and made the working directory.

In [12]:

runid=run_id(10)
if server_path is None:
    server_path=os.path.abspath(os.curdir)
os.chdir(server_path) 
run_dir = os.path.join(server_path,runid)
os.mkdir(run_dir)
os.chdir(run_dir)    

print (runid)

2Rr6eu3vHP
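
Changing the working directory affects every later relative path in the notebook. One alternative, sketched below under the assumption that output files are written with explicit paths, is to create the run folder and keep its absolute path in a variable. `make_run_dir` is a hypothetical helper, and the demonstration uses a temporary directory and a made-up run ID.

```python
import os
import tempfile

def make_run_dir(base_dir, run_id):
    """Create base_dir/run_id if needed and return its absolute path."""
    run_dir = os.path.join(os.path.abspath(base_dir), run_id)
    os.makedirs(run_dir, exist_ok=True)  # no error if the folder already exists
    return run_dir

# Demonstrate inside a throwaway temporary directory
with tempfile.TemporaryDirectory() as base:
    path = make_run_dir(base, 'demoRunId1')
    created = os.path.isdir(path)
    leaf = os.path.basename(path)
print(created, leaf)  # True demoRunId1
```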


# Printing the leaderboard generated by H2O

In [13]:
lb1 = aml1.leaderboard
lb1.head(500)

| model_id | auc | logloss | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|
| GBM_grid_1_AutoML_20190420_034740_model_6 | 1.0 | 0.384723 | 0.0 | 0.319508 | 0.102086 |
| GBM_2_AutoML_20190420_034740 | 1.0 | 9.880490000000001e-18 | 0.0 | 5.43808e-16 | 2.95727e-31 |
| GBM_grid_1_AutoML_20190420_034740_model_4 | 1.0 | 0.00223732 | 0.0 | 0.00480168 | 2.30561e-05 |
| GBM_4_AutoML_20190420_034740 | 1.0 | 1.1463e-16 | 0.0 | 7.13105e-15 | 5.08518e-29 |
| GBM_1_AutoML_20190420_034740 | 1.0 | 8.636890000000001e-18 | 0.0 | 7.17082e-16 | 5.14206e-31 |
| StackedEnsemble_AllModels_AutoML_20190420_034740 | 1.0 | 0.000885639 | 0.0 | 0.000908697 | 8.25731e-07 |
| GBM_grid_1_AutoML_20190420_034740_model_1 | 1.0 | 4.86686e-16 | 0.0 | 4.02041e-14 | 1.61637e-27 |
| GBM_3_AutoML_20190420_034740 | 1.0 | 3.64471e-17 | 0.0 | 2.38155e-15 | 5.6718000000000005e-30 |
| GLM_grid_1_AutoML_20190420_034740_model_1 | 1.0 | 0.00199809 | 0.0 | 0.00783831 | 6.1439e-05 |
| StackedEnsemble_BestOfFamily_AutoML_20190420_034740 | 1.0 | 0.00150672 | 0.0 | 0.001699 | 2.88662e-06 |




In [14]:
aml1_leaderboard_df=aml1.leaderboard.as_data_frame()
model_set=aml1_leaderboard_df['model_id']
model_set

0             GBM_grid_1_AutoML_20190420_034740_model_6
1                          GBM_2_AutoML_20190420_034740
2             GBM_grid_1_AutoML_20190420_034740_model_4
3                          GBM_4_AutoML_20190420_034740
4                          GBM_1_AutoML_20190420_034740
5      StackedEnsemble_AllModels_AutoML_20190420_034740
6             GBM_grid_1_AutoML_20190420_034740_model_1
7                          GBM_3_AutoML_20190420_034740
8             GLM_grid_1_AutoML_20190420_034740_model_1
9     StackedEnsemble_BestOfFamily_AutoML_20190420_0...
10            GBM_grid_1_AutoML_20190420_034740_model_5
11            GBM_grid_1_AutoML_20190420_034740_model_2
12            GBM_grid_1_AutoML_20190420_034740_model_3
13                         DRF_1_AutoML_20190420_034740
14                         GBM_5_AutoML_20190420_034740
15                DeepLearning_1_AutoML_20190420_034740
16    DeepLearning_grid_1_AutoML_20190420_034740_mod...
17    DeepLearning_grid_1_AutoML_20190420_034740
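
The model IDs can also be grouped by algorithm to see which family produced the most models. The sketch below assumes that every ID starts with the algorithm name followed by an underscore, as in the IDs listed above; `model_family` is a hypothetical helper, and only a handful of IDs are copied in to keep the example short.

```python
from collections import Counter

def model_family(model_id):
    """Return the algorithm prefix of an AutoML model id, e.g. 'GBM'."""
    return model_id.split('_', 1)[0]

# A few ids copied from the leaderboard above
sample_ids = [
    'GBM_grid_1_AutoML_20190420_034740_model_6',
    'GBM_2_AutoML_20190420_034740',
    'StackedEnsemble_AllModels_AutoML_20190420_034740',
    'GLM_grid_1_AutoML_20190420_034740_model_1',
    'DRF_1_AutoML_20190420_034740',
    'DeepLearning_1_AutoML_20190420_034740',
]
family_counts = Counter(model_family(m) for m in sample_ids)
print(family_counts.most_common(1))  # [('GBM', 2)]
```

In the notebook the same function could be applied to `model_set` instead of the hand-copied list.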

Printing the metadata and exporting it to a JSON file.

In [22]:
metadata = meta_data(runid,server_path,data_path,test_path,model_path,target,run_time,classification,max_models,balance_y,balance_threshold,name,run_dir,nthreads,min_mem_size)
print(metadata)

{'run_id': '2Rr6eu3vHP', 'min_mem_size': 6, 'server_path': None, 'scale': False, 'target': 'class', 'classification': True, 'test_path': None, 'execution_time': 0.0, 'start_time': 1555747273.322, 'data_path': 'C:\\Users\\deodh\\Desktop\\hyperparameter\\mushrooms.csv', 'run_path': 'C:\\Users\\deodh\\Desktop\\hyperparameter\\2Rr6eu3vHP', 'project': None, 'end_time': 1555747273.322, 'nthreads': 1, 'run_time': 400, 'max_models': 100, 'balance': False, 'balance_threshold': 0.2, 'model_path': None}


# Exporting the metadata and hyperparameters to JSON files and the leaderboard to CSV

In [23]:
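metadata_json = json.dumps(metadata, indent=2)  # serialize the dictionary once

In [24]:
with open('metadata.json', 'w') as fp:
    fp.write(metadata_json)  # write the JSON text as-is; json.dump here would encode it a second time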
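
When exporting with the `json` module it is worth checking that the result reads back as a dictionary. The sketch below uses a made-up metadata dictionary to show the difference: serializing once round-trips to a dict, while serializing an already-serialized string produces a JSON string literal that loads back as `str`.

```python
import json

# Made-up values for illustration
sample_meta = {'run_id': 'demoRunId1', 'run_time': 400, 'classification': True}

once = json.dumps(sample_meta)   # dict -> JSON text
twice = json.dumps(once)         # JSON text -> JSON string literal (double-encoded)

loaded_once = json.loads(once)
loaded_twice = json.loads(twice)
print(type(loaded_once).__name__, type(loaded_twice).__name__)  # dict str
```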

In [25]:
df1 = lb1.as_data_frame()
df1.to_csv("400leaderboard.csv")

In [26]:
model1 = h2o.get_model(lb1[0,'model_id'])
model1 = model1.params
with open('model1-GBM_1.json', 'w') as fp:
    json.dump(model1, fp)

In [27]:
model2 = h2o.get_model(lb1[1,'model_id'])
model2 = model2.params
with open('model2-GBM_4.json', 'w') as fp:
    json.dump(model2, fp)

In [28]:
model3 = h2o.get_model(lb1[2,'model_id'])
model3 = model3.params
with open('model3-GBM_3.json', 'w') as fp:
    json.dump(model3, fp)

In [29]:
model4 = h2o.get_model(lb1[3,'model_id'])
model4 = model4.params
with open('model4-GBM_grid_1_2.json', 'w') as fp:
    json.dump(model4, fp)

In [30]:
model5 = h2o.get_model(lb1[4,'model_id'])
model5 = model5.params
with open('model5-GLM_grid_1_1.json', 'w') as fp:
    json.dump(model5, fp)

In [31]:
model6 = h2o.get_model(lb1[5,'model_id'])
model6 = model6.params
with open('model6-GLM_grid_1_4.json', 'w') as fp:
    json.dump(model6, fp)

In [32]:
model7 = h2o.get_model(lb1[6,'model_id'])
model7 = model7.params
with open('model7-GBM_5.json', 'w') as fp:
    json.dump(model7, fp)

In [33]:
model8 = h2o.get_model(lb1[7,'model_id'])
model8 = model8.params
with open('model8-GBM_grid_1_3.json', 'w') as fp:
    json.dump(model8, fp)

In [34]:
model9 = h2o.get_model(lb1[8,'model_id'])
model9 = model9.params
with open('model9-StackedEnsemble_AllModels.json', 'w') as fp:
    json.dump(model9, fp)

In [35]:
model10 = h2o.get_model(lb1[9,'model_id'])
model10 = model10.params
with open('model10-DeepLearning_1.json', 'w') as fp:
    json.dump(model10, fp)

In [36]:
model11 = h2o.get_model(lb1[10,'model_id'])
model11 = model11.params
with open('model11-DRF_1.json', 'w') as fp:
    json.dump(model11, fp)

In [37]:
model12 = h2o.get_model(lb1[11,'model_id'])
model12 = model12.params
with open('model12-GBM_2.json', 'w') as fp:
    json.dump(model12, fp)

In [38]:
model13 = h2o.get_model(lb1[12,'model_id'])
model13 = model13.params
with open('model13-GBM_grid_1_1.json', 'w') as fp:
    json.dump(model13, fp)

In [39]:
model14 = h2o.get_model(lb1[13,'model_id'])
model14 = model14.params
with open('model14-StackedEnsemble_BestOfFamily.json', 'w') as fp:
    json.dump(model14, fp)

In [40]:
model15 = h2o.get_model(lb1[14,'model_id'])
model15 = model15.params
with open('model15-DeepLearning_grid_1_2.json', 'w') as fp:
    json.dump(model15, fp)

In [41]:
model16 = h2o.get_model(lb1[15,'model_id'])
model16 = model16.params
with open('model16-GBM_grid_1_5.json', 'w') as fp:
    json.dump(model16, fp)

In [42]:
model17 = h2o.get_model(lb1[16,'model_id'])
model17 = model17.params
with open('model17-DeepLearning_grid_1_1.json', 'w') as fp:
    json.dump(model17, fp)

In [43]:
model18 = h2o.get_model(lb1[17,'model_id'])
model18 = model18.params
with open('model18-XRT_1.json', 'w') as fp:
    json.dump(model18, fp)

In [44]:
model19 = h2o.get_model(lb1[18,'model_id'])
model19 = model19.params
with open('model19-DeepLearning_grid_1_3.json', 'w') as fp:
    json.dump(model19, fp)

In [45]:
model20 = h2o.get_model(lb1[19,'model_id'])
model20 = model20.params
with open('model20.json', 'w') as fp:
    json.dump(model20, fp)
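
The export cells above repeat the same three steps for each leaderboard row. A loop over the model IDs is one way to avoid hard-coded row indices and file names. The sketch below is an assumption about how that could look, not the code that produced the files in this project: `export_params` is a hypothetical helper, and `fake_get_params` stands in for `h2o.get_model(model_id).params` so the example runs without a cluster.

```python
import json
import os
import tempfile

def export_params(model_ids, get_params, out_dir):
    """Write one model<rank>-<model_id>.json file per model; return the paths."""
    paths = []
    for rank, model_id in enumerate(model_ids, start=1):
        path = os.path.join(out_dir, 'model{}-{}.json'.format(rank, model_id))
        with open(path, 'w') as fp:
            json.dump(get_params(model_id), fp, indent=2)
        paths.append(path)
    return paths

# Stand-in for h2o.get_model(model_id).params; the values are made up
def fake_get_params(model_id):
    return {'ntrees': {'default': 50, 'actual': 100}}

with tempfile.TemporaryDirectory() as out_dir:
    written = export_params(['GBM_1', 'DRF_1'], fake_get_params, out_dir)
    names = [os.path.basename(p) for p in written]
    with open(written[0]) as fp:
        first = json.load(fp)
print(names)  # ['model1-GBM_1.json', 'model2-DRF_1.json']
```

In the notebook, `model_ids` could be `model_set` and `get_params` could be `lambda mid: h2o.get_model(mid).params`, so each file name always matches the model it contains.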

# Conclusion

In this project I ran H2O for five runtimes (including 200, 400, 600, and 1000 seconds) and obtained the best-performing models for each run. I stored the leaderboards in CSV files and the models' hyperparameters in JSON files, and found the ranges of the hyperparameters for each algorithm.
GBM was the algorithm for which the most models were generated.

1. For GBM, ntrees, learning rate and stopping tolerance are the most important hyperparameters.
2. For XRT, sample rate and stopping tolerance are the most important hyperparameters.
3. For DRF, sample rate, stopping tolerance, seed and ntrees are the most important hyperparameters.
4. For Deep Learning, stopping tolerance, rate, max_w2, huber_alpha, elastic averaging moving rate and elastic averaging regularization are the most important hyperparameters.

I have compared the hyperparameters of all the models.
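
The range comparison can be sketched as follows. The code assumes each exported file holds a dictionary of the form `{name: {'default': ..., 'actual': ...}}`, which is how H2O's `params` attribute was laid out in these exports as far as I can tell; `hyperparameter_ranges` is a hypothetical helper and the numbers are made up for illustration, not results from the runs above.

```python
def hyperparameter_ranges(param_dicts, names):
    """Return {name: (min, max)} over the numeric 'actual' values of each name."""
    ranges = {}
    for name in names:
        values = [
            p[name]['actual'] for p in param_dicts
            if name in p
            and isinstance(p[name].get('actual'), (int, float))
            and not isinstance(p[name]['actual'], bool)  # bool is a subclass of int
        ]
        if values:
            ranges[name] = (min(values), max(values))
    return ranges

# Made-up values for illustration only
gbm_models = [
    {'ntrees': {'default': 50, 'actual': 60},
     'learn_rate': {'default': 0.1, 'actual': 0.05}},
    {'ntrees': {'default': 50, 'actual': 120},
     'learn_rate': {'default': 0.1, 'actual': 0.1}},
]
gbm_ranges = hyperparameter_ranges(gbm_models, ['ntrees', 'learn_rate', 'sample_rate'])
print(gbm_ranges)  # {'ntrees': (60, 120), 'learn_rate': (0.05, 0.1)}
```

Hyperparameters that are missing or non-numeric in every model, such as `sample_rate` in this example, are simply left out of the result.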

# Contribution
40% by me and 
60% from external resources.

# Citation

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html

https://github.com/prabhuSub/Hyperparamter-Samples/tree/master/Hyperparamet

https://github.com/nikbearbrown/CSYE_7245/tree/master/H2O

https://www.jeremyjordan.me/hyperparameter-tuning/

https://towardsdatascience.com/understanding-hyperparameters-and-its-optimisation-techniques-f0debba07568

# License
Copyright 2019 Mayuresh Deodhar

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.