# Abstract

Hyperparameters matter in data science because they directly control the behaviour of the training algorithm and have a significant impact on model performance. Finding good hyperparameter values is a strenuous task. The aim of this project is to make that process easier and to determine which hyperparameters are important for this dataset. H2O AutoML is used to achieve this. Models are generated for runtimes of 300, 500, 800, 1000 and 1200 seconds.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import random, os, sys
from datetime import datetime
import time

In [2]:
# Loading the data set using pandas
df=pd.read_csv("indian_liver_patient.csv", sep=',')

In [3]:
df.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


In [4]:
df.describe()

Unnamed: 0,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
count,583.0,583.0,583.0,583.0,583.0,583.0,583.0,583.0,579.0,583.0
mean,44.746141,3.298799,1.486106,290.576329,80.713551,109.910806,6.48319,3.141852,0.947064,1.286449
std,16.189833,6.209522,2.808498,242.937989,182.620356,288.918529,1.085451,0.795519,0.319592,0.45249
min,4.0,0.4,0.1,63.0,10.0,10.0,2.7,0.9,0.3,1.0
25%,33.0,0.8,0.2,175.5,23.0,25.0,5.8,2.6,0.7,1.0
50%,45.0,1.0,0.3,208.0,35.0,42.0,6.6,3.1,0.93,1.0
75%,58.0,2.6,1.3,298.0,60.5,87.0,7.2,3.8,1.1,2.0
max,90.0,75.0,19.7,2110.0,2000.0,4929.0,9.6,5.5,2.8,2.0


In [5]:
df.shape

(583, 11)

# Data Cleaning

In [6]:
#To check the data types
df.dtypes

Age                             int64
Gender                         object
Total_Bilirubin               float64
Direct_Bilirubin              float64
Alkaline_Phosphotase            int64
Alamine_Aminotransferase        int64
Aspartate_Aminotransferase      int64
Total_Protiens                float64
Albumin                       float64
Albumin_and_Globulin_Ratio    float64
Dataset                         int64
dtype: object

In [7]:
# Count the missing (NULL) values in each column
df.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64

Filling the null values with the column median

In [8]:
# fillna returns a new Series; assign it back so the DataFrame is actually updated
ratio = df['Albumin_and_Globulin_Ratio']
df['Albumin_and_Globulin_Ratio'] = ratio.fillna(ratio.median())
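
As a minimal sketch of what median imputation does, the plain-Python example below fills the missing entries of a short list with the median of the observed values. The sample values and the helper name `fill_with_median` are made up for illustration; on the real DataFrame, pandas' `fillna` combined with `median` plays the same role.

```python
import math
import statistics

def fill_with_median(values):
    """Replace NaN entries with the median of the non-missing entries."""
    observed = [v for v in values if not math.isnan(v)]
    med = statistics.median(observed)
    return [med if math.isnan(v) else v for v in values]

# Hypothetical ratios with two missing entries (not taken from the dataset)
ratios = [0.9, float('nan'), 0.74, 1.0, float('nan'), 0.4]
filled = fill_with_median(ratios)
print(filled)
```

The median is less sensitive than the mean to extreme values such as the maximum of 2.8 reported for this column by `describe()`.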

# H2O

In [9]:
import h2o
from h2o.automl import H2OAutoML
import json
import psutil  # used below to size the H2O memory setting from the available RAM

import warnings
warnings.filterwarnings('ignore')

In [10]:
port_no=random.randint(5555,55555)
h2o.init(strict_version_check=False,min_mem_size_GB=5,port=port_no)

Checking whether there is an H2O instance running at http://localhost:27586 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
  Starting server from C:\Users\Manvi\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\Manvi\AppData\Local\Temp\tmpwqj9xz_a
  JVM stdout: C:\Users\Manvi\AppData\Local\Temp\tmpwqj9xz_a\h2o_Manvi_started_from_python.out
  JVM stderr: C:\Users\Manvi\AppData\Local\Temp\tmpwqj9xz_a\h2o_Manvi_started_from_python.err
  Server is running at http://127.0.0.1:27586
Connecting to H2O server at http://127.0.0.1:27586 ... successful.


H2O cluster uptime:,02 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.1
H2O cluster version age:,20 days
H2O cluster name:,H2O_from_python_Manvi_blojro
H2O cluster total nodes:,1
H2O cluster free memory:,4.792 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [11]:
#importing data to the server
df = h2o.import_file(path="indian_liver_patient.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [12]:
df.head()

Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1
46,Male,1.8,0.7,208,19,14,7.6,4.4,1.3,1
26,Female,0.9,0.2,154,16,12,7.0,3.5,1.0,1
29,Female,0.9,0.3,202,14,11,6.7,3.6,1.1,1
17,Male,0.9,0.3,202,22,19,7.4,4.1,1.2,2
55,Male,0.7,0.2,290,53,58,6.8,3.4,1.0,1




In [13]:
df.isna()

isNA(Age),isNA(Gender),isNA(Total_Bilirubin),isNA(Direct_Bilirubin),isNA(Alkaline_Phosphotase),isNA(Alamine_Aminotransferase),isNA(Aspartate_Aminotransferase),isNA(Total_Protiens),isNA(Albumin),isNA(Albumin_and_Globulin_Ratio),isNA(Dataset)
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0




In [34]:
target = 'Dataset'
run_time=1200
pct_memory=0.5
server_path=None 
data_path=None
all_variables=None
test_path=None
model_path=None
nthreads=1 
name=None 
virtual_memory=psutil.virtual_memory()
min_mem_size=int(round(int(pct_memory*virtual_memory.available)/1073741824,0))
run_id='SOME_ID_20180617_221529' # Just some arbitrary ID
classification=True
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="automl_test"
analysis=0
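
The `min_mem_size` expression above converts a fraction of the available RAM from bytes to whole gibibytes (1073741824 bytes = 1 GiB). A small sketch of the same arithmetic follows, with a hypothetical helper name `mem_size_gb` and made-up byte counts standing in for `psutil.virtual_memory().available`.

```python
BYTES_PER_GIB = 1024 ** 3  # 1073741824, the constant used in the configuration cell

def mem_size_gb(available_bytes, pct_memory=0.5):
    """Whole gibibytes corresponding to a fraction of the available memory."""
    return int(round(int(pct_memory * available_bytes) / BYTES_PER_GIB, 0))

# Hypothetical machines with 2.4 GiB and 16 GiB of free memory
print(mem_size_gb(int(2.4 * BYTES_PER_GIB)))  # half of 2.4 GiB rounds to 1
print(mem_size_gb(16 * BYTES_PER_GIB))        # half of 16 GiB is 8
```

Note that in this notebook the computed value is only recorded in the metadata; the `h2o.init` call shown earlier passes a fixed 5 GB instead.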

Defining helper functions

In [35]:
def alphabet(n):
  # Return a random alphanumeric string of length n
  alpha='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
  return ''.join(random.choice(alpha) for _ in range(n))


def set_meta_data(analysis,run_id,server,data,test,model_path,target,run_time,classification,scale,model,balance,balance_threshold,name,path,nthreads,min_mem_size):
  m_data={}
  m_data['start_time'] = time.time()
  m_data['target']=target
  m_data['server_path']=server
  m_data['data_path']=data 
  m_data['test_path']=test
  m_data['max_models']=model
  m_data['run_time']=run_time
  m_data['run_id'] =run_id
  m_data['scale']=scale
  m_data['classification']=classification
  m_data['model_path']=model_path
  m_data['balance']=balance
  m_data['balance_threshold']=balance_threshold
  m_data['project'] =name
  m_data['end_time'] = time.time()
  m_data['execution_time'] = 0.0
  m_data['run_path'] =path
  m_data['nthreads'] = nthreads
  m_data['min_mem_size'] = min_mem_size
  m_data['analysis'] = analysis
  return m_data


def automl(maxruntime,X,Y,df):
    # Train AutoML on predictors X and response Y, skipping deep learning and stacked ensembles
    aml = H2OAutoML(max_runtime_secs=maxruntime,exclude_algos = ['DeepLearning', 'StackedEnsemble'])
    aml.train(x=X,y=Y,training_frame=df)
    return aml


def dict_to_json(dct,n):
    # Write dct to the file n as indented JSON; the with-block closes the file even on error
    with open(n, 'w') as f:
        print(json.dumps(dct, indent=4), file=f)
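
To illustrate how the JSON helper is meant to be used, the sketch below writes a small dictionary to a temporary directory and reads it back. The function is restated so that the example is self-contained; the metadata values and the file name `example_meta_data.json` are placeholders.

```python
import json
import os
import tempfile

def dict_to_json(dct, n):
    """Write dct to the file n as indented JSON."""
    with open(n, 'w') as f:
        json.dump(dct, f, indent=4)

# Placeholder metadata, not the values from an actual run
example = {'target': 'Dataset', 'run_time': 1200, 'classification': True}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example_meta_data.json')
    dict_to_json(example, path)
    with open(path) as f:
        restored = json.load(f)

print(restored == example)
```

Only JSON-serialisable values (strings, numbers, booleans, `None`, lists and dictionaries) survive this round trip unchanged.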

Generating a unique random ID for every run

In [54]:
run_id=alphabet(9)
if server_path is None:
  server_path=os.path.abspath(os.curdir)
os.chdir(server_path) 
run_dir = os.path.join(server_path,run_id)
os.mkdir(run_dir)
os.chdir(run_dir)    

# run_id to std out
print (run_id)

RHNRdOUmX
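
The run-directory step can also be written without changing the working directory, so that relative paths used elsewhere in the notebook keep working. The following is a sketch under that assumption; `make_run_dir` is a hypothetical helper and a temporary directory stands in for `server_path`.

```python
import os
import random
import string
import tempfile

def make_run_dir(base_dir, id_length=9):
    """Create base_dir/<random id> and return (run_id, run_dir)."""
    alphabet = string.digits + string.ascii_letters
    run_id = ''.join(random.choice(alphabet) for _ in range(id_length))
    run_dir = os.path.join(base_dir, run_id)
    os.makedirs(run_dir)  # raises FileExistsError if the ID was already used
    return run_id, run_dir

with tempfile.TemporaryDirectory() as base:
    run_id, run_dir = make_run_dir(base)
    created = os.path.isdir(run_dir)
    print(run_id, created)
```

In the metadata printed below, `server_path` already ends in an earlier run ID, which is presumably a side effect of calling `os.chdir` each time the cell is re-run.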


In [55]:
# meta data
meta_data = set_meta_data(analysis, run_id,server_path,data_path,test_path,model_path,target,run_time,classification,scale,max_models,balance_y,balance_threshold,name,run_dir,nthreads,min_mem_size)
print(meta_data)

{'start_time': 1555886773.8907285, 'target': 'Dataset', 'server_path': 'C:\\Users\\Manvi\\Anaconda3\\indian-liver-patient-records\\uh5qmPqcn', 'data_path': None, 'test_path': None, 'max_models': None, 'run_time': 1200, 'run_id': 'RHNRdOUmX', 'scale': False, 'classification': True, 'model_path': None, 'balance': False, 'balance_threshold': 0.2, 'project': None, 'end_time': 1555886773.8907285, 'execution_time': 0.0, 'run_path': 'C:\\Users\\Manvi\\Anaconda3\\indian-liver-patient-records\\uh5qmPqcn\\RHNRdOUmX', 'nthreads': 1, 'min_mem_size': 1, 'analysis': 0}


In [56]:
y = target
X=[name for name in df.columns if name != y]
print(X)
print(y)

['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio']
Dataset


In [57]:
meta_data['X']=X  
model_start_time = time.time()

In [58]:
if analysis == 3:
  classification=False
elif analysis == 2:
  classification=True
elif analysis == 1:
  classification=True

The dependent variable is categorical, so it is converted to a factor and the task is treated as classification

In [59]:
if classification:
    df[y] = df[y].asfactor()

In [60]:
classification=True
if classification:
    print(df[y].levels())

[['1', '2']]


# Runtime: 1200sec

In [61]:
aml5 = automl(1200,X,y,df)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [62]:
meta_data['run_time'] = 1200
meta_data['end_time'] = time.time()
meta_data['execution_time'] = meta_data['end_time'] - meta_data['start_time']

Generating a leaderboard of the best models

In [63]:
aml5.leaderboard

model_id,auc,logloss,mean_per_class_error,rmse,mse
GBM_1_AutoML_20190421_184623,0.763315,0.518868,0.282106,0.417282,0.174124
GBM_grid_1_AutoML_20190421_184623_model_3,0.75529,0.511873,0.287526,0.416346,0.173344
GBM_grid_1_AutoML_20190421_184623_model_17,0.751821,0.520431,0.295349,0.420088,0.176474
GLM_grid_1_AutoML_20190421_184623_model_1,0.751699,0.503103,0.2955,0.413658,0.171113
GBM_5_AutoML_20190421_184623,0.74756,0.513657,0.291787,0.416341,0.17334
GBM_grid_1_AutoML_20190421_184623_model_10,0.738341,0.517955,0.3002,0.419934,0.176345
GBM_grid_1_AutoML_20190421_184623_model_19,0.737506,0.563755,0.306606,0.436159,0.190235
XRT_1_AutoML_20190421_184623,0.737455,0.522825,0.302582,0.426277,0.181712
GBM_grid_1_AutoML_20190421_184623_model_13,0.736016,0.567106,0.294716,0.437618,0.19151
GBM_4_AutoML_20190421_184623,0.735541,0.539759,0.292881,0.430512,0.18534




In [64]:
aml5_leaderboard_df=aml5.leaderboard.as_data_frame()
aml5_leaderboard_df

Unnamed: 0,model_id,auc,logloss,mean_per_class_error,rmse,mse
0,GBM_1_AutoML_20190421_184623,0.763315,0.518868,0.282106,0.417282,0.174124
1,GBM_grid_1_AutoML_20190421_184623_model_3,0.75529,0.511873,0.287526,0.416346,0.173344
2,GBM_grid_1_AutoML_20190421_184623_model_17,0.751821,0.520431,0.295349,0.420088,0.176474
3,GLM_grid_1_AutoML_20190421_184623_model_1,0.751699,0.503103,0.2955,0.413658,0.171113
4,GBM_5_AutoML_20190421_184623,0.74756,0.513657,0.291787,0.416341,0.17334
5,GBM_grid_1_AutoML_20190421_184623_model_10,0.738341,0.517955,0.3002,0.419934,0.176345
6,GBM_grid_1_AutoML_20190421_184623_model_19,0.737506,0.563755,0.306606,0.436159,0.190235
7,XRT_1_AutoML_20190421_184623,0.737455,0.522825,0.302582,0.426277,0.181712
8,GBM_grid_1_AutoML_20190421_184623_model_13,0.736016,0.567106,0.294716,0.437618,0.19151
9,GBM_4_AutoML_20190421_184623,0.735541,0.539759,0.292881,0.430512,0.18534
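
The leaderboard is sorted by AUC, but the ordering can differ under other metrics. The sketch below re-ranks the first five rows of the table above with plain Python; the numbers are copied from the leaderboard, while the variable names are illustrative.

```python
import csv
import io

# First five leaderboard rows, copied from the output above
leaderboard_csv = """model_id,auc,logloss,mean_per_class_error,rmse,mse
GBM_1_AutoML_20190421_184623,0.763315,0.518868,0.282106,0.417282,0.174124
GBM_grid_1_AutoML_20190421_184623_model_3,0.75529,0.511873,0.287526,0.416346,0.173344
GBM_grid_1_AutoML_20190421_184623_model_17,0.751821,0.520431,0.295349,0.420088,0.176474
GLM_grid_1_AutoML_20190421_184623_model_1,0.751699,0.503103,0.2955,0.413658,0.171113
GBM_5_AutoML_20190421_184623,0.74756,0.513657,0.291787,0.416341,0.17334
"""

rows = list(csv.DictReader(io.StringIO(leaderboard_csv)))

# Higher AUC is better; lower logloss is better
best_by_auc = max(rows, key=lambda r: float(r['auc']))['model_id']
best_by_logloss = min(rows, key=lambda r: float(r['logloss']))['model_id']

print(best_by_auc)      # GBM_1_AutoML_20190421_184623
print(best_by_logloss)  # GLM_grid_1_AutoML_20190421_184623_model_1
```

Which model counts as "best" therefore depends on the metric: the GBM leads on AUC and mean per-class error, while the GLM has the lowest logloss, RMSE and MSE among these rows.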


In [65]:
length = len(aml5_leaderboard_df)
length
meta_data["models_generated"] = length

Storing the leaderboard in a CSV file

In [66]:
# save leaderboard
leaderboard_stats=run_id+ '1200sec'+ '_leaderboard.csv'
aml5_leaderboard_df.to_csv(leaderboard_stats)

In [67]:
aml5_leaderboard_df=aml5.leaderboard.as_data_frame()
model5_set=aml5_leaderboard_df['model_id']
model5_set

0                   GBM_1_AutoML_20190421_184623
1      GBM_grid_1_AutoML_20190421_184623_model_3
2     GBM_grid_1_AutoML_20190421_184623_model_17
3      GLM_grid_1_AutoML_20190421_184623_model_1
4                   GBM_5_AutoML_20190421_184623
5     GBM_grid_1_AutoML_20190421_184623_model_10
6     GBM_grid_1_AutoML_20190421_184623_model_19
7                   XRT_1_AutoML_20190421_184623
8     GBM_grid_1_AutoML_20190421_184623_model_13
9                   GBM_4_AutoML_20190421_184623
10    GBM_grid_1_AutoML_20190421_184623_model_14
11     GBM_grid_1_AutoML_20190421_184623_model_6
12    GBM_grid_1_AutoML_20190421_184623_model_22
13     GBM_grid_1_AutoML_20190421_184623_model_7
14     GBM_grid_1_AutoML_20190421_184623_model_2
15                  GBM_2_AutoML_20190421_184623
16     GBM_grid_1_AutoML_20190421_184623_model_1
17                  GBM_3_AutoML_20190421_184623
18                  DRF_1_AutoML_20190421_184623
19     GBM_grid_1_AutoML_20190421_184623_model_9
20    GBM_grid_1_Aut

Getting the parameters of each leaderboard model and storing them in JSON files

In [68]:
# Save the parameters of every model on the leaderboard
for model_id in model5_set:
    model = h2o.get_model(model_id)
    dict_to_json(model.params, str(model_id)+'__1200')
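
Each saved file maps a parameter name to its `default` and `actual` value, which is the layout returned by the model's `params` property in the H2O Python API. One way to surface the hyperparameters that AutoML actually changed is to keep only the entries where the two differ. The sketch below does this on a hand-written dictionary; the parameter values are invented for illustration and `changed_params` is a hypothetical helper.

```python
def changed_params(params):
    """Return {name: actual} for parameters whose actual value differs from the default."""
    return {
        name: values['actual']
        for name, values in params.items()
        if values['actual'] != values['default']
    }

# Invented example in the {'default': ..., 'actual': ...} layout; not output from a real model
example_params = {
    'ntrees': {'default': 50, 'actual': 120},
    'max_depth': {'default': 5, 'actual': 7},
    'learn_rate': {'default': 0.1, 'actual': 0.1},
    'sample_rate': {'default': 1.0, 'actual': 0.8},
}

tuned = changed_params(example_params)
print(tuned)  # {'ntrees': 120, 'max_depth': 7, 'sample_rate': 0.8}
```

Applied to the JSON files written above (loaded with `json.load`), this would list the settings that distinguish each leaderboard model from the algorithm defaults.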

In [69]:
# Update and save meta data
n=run_id+'_meta_data.json'
dict_to_json(meta_data,n)

In [70]:
meta_data

{'start_time': 1555886773.8907285,
 'target': 'Dataset',
 'server_path': 'C:\\Users\\Manvi\\Anaconda3\\indian-liver-patient-records\\uh5qmPqcn',
 'data_path': None,
 'test_path': None,
 'max_models': None,
 'run_time': 1200,
 'run_id': 'RHNRdOUmX',
 'scale': False,
 'classification': True,
 'model_path': None,
 'balance': False,
 'balance_threshold': 0.2,
 'project': None,
 'end_time': 1555886798.8437598,
 'execution_time': 24.953031301498413,
 'run_path': 'C:\\Users\\Manvi\\Anaconda3\\indian-liver-patient-records\\uh5qmPqcn\\RHNRdOUmX',
 'nthreads': 1,
 'min_mem_size': 1,
 'analysis': 0,
 'X': ['Age',
  'Gender',
  'Total_Bilirubin',
  'Direct_Bilirubin',
  'Alkaline_Phosphotase',
  'Alamine_Aminotransferase',
  'Aspartate_Aminotransferase',
  'Total_Protiens',
  'Albumin',
  'Albumin_and_Globulin_Ratio'],
 'models_generated': 30}

# Conclusion

1. Models were generated with H2OAutoML for a runtime of 1200 seconds.

2. A leaderboard listing the best models was obtained.

3. The leaderboard is sorted by AUC and also reports logloss, mean per-class error, RMSE and MSE.

4. The top-ranked model, with an AUC of 0.763, is a GBM.

# Contribution

Selected a dataset and ran H2O AutoML on it to generate a leaderboard of the best models.

# Citations

https://github.com/prabhuSub/Hyperparamter-Samples

https://machinelearningmastery.com/vector-norms-machine-learning/

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html?highlight=hyperparameters#supported-grid-search-hyperparameters


# License

Copyright 2019 Manogjna Potluri 

Copyright 2019 Manvitha Jagadam


Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
