## HPO Training using Accelerator API

In this notebook you will learn how to submit a model and dataset to the Watson Machine Learning Accelerator (WMLA) API to run Hyper Parameter Optimization (HPO).  

For this notebook you will use a model and dataset that have already been set up to work with the API. For details on the API, see the documentation in the Knowledge Center (KC):

[API Documentation](https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1/cm/deeplearning.html)

## Imports

To use the WMLA API, we will use the Python `requests` library.

In [1]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

import json
import time
import urllib
import pandas as pd
import os
import sys
import tarfile
import tempfile
# display is imported explicitly because a later cell calls it
from IPython.display import clear_output, display
import pprint

# utility print function
def nprint(mystring) :
    print("**{}** : {}".format(sys._getframe(1).f_code.co_name,mystring))

# utility makedir
def makeDirIfNotExist(directory) :
    if not os.path.exists(directory):  
        nprint("Making directory {}".format(directory))
        os.makedirs(directory) 
    else :
        nprint("Directory {} already exists .. ".format(directory))


## Environment details and Project Config:

In [2]:
def getconfig(cfg_in={}):
    cfg = {}
    cfg["master_host"] = 'https://p10a117.pbm.ihost.com'
    cfg["dli_rest_port"] = '9243'
    cfg["sc_rest_port"] = '8643'
    cfg["num_images"] = {"train":200,"valid":20,"test":20}
    # ==== CLASS ENTER User login details below =====
    cfg["wmla_user"] = 'vanstee'  
    cfg["wmla_pwd"] = 'pwvanstee'
    # ==== CLASS ENTER User login details above =====
    cfg["sig_name"]  = 'b0p036a-dliauto'
    cfg["code_dir"] = "/gpfs/software/wmla-p10a117/dli_data_fs/models/pytorch_hpo"

    # overwrite configs if passed
    for (k,v) in (cfg_in or {}).items() :
        # cfg.get avoids a KeyError when cfg_in introduces a key that has no default
        nprint("Overriding Config {}:{} with {}".format(k,cfg.get(k),v))
        cfg[k] = v
    return cfg

# cfg is used as a global variable throughout this notebook
cfg=getconfig()

In [3]:
# REST call variables
commonHeaders = {'Accept': 'application/json'}

# Use closures for cfg for now ..
def get_tmp_dir() :
    return "/gpfs/home/s4s004/"+cfg["wmla_user"]+"/2020-05-wmla/tmp"

def get_tar_file() :
    return get_tmp_dir() + "/" + cfg["wmla_user"]+".modelDir.tar"

#get api endpoint
def get_ep(mode="sc") :
    if mode=="sc" :
        sc_rest_url =  cfg["master_host"] +':'+ cfg["sc_rest_port"] +'/platform/rest/conductor/v1'
        return sc_rest_url
    elif(mode=="dl") :
        dl_rest_url = cfg["master_host"] +':'+cfg["dli_rest_port"] +'/platform/rest/deeplearning/v1'
        return dl_rest_url
    else :
        nprint("Error mode : {} not supported".format(mode))

def myauth():
    return(cfg["wmla_user"],cfg["wmla_pwd"])

print ("SC API Endpoints : {}".format(get_ep("sc")))
print ("DL API Endpoints : {}".format(get_ep("dl")))
print (myauth())
print (get_tar_file())
#myauth = (wmla_user, wmla_pwd)

# Setup Requests session
req = requests.Session()

SC API Endpoints : https://p10a117.pbm.ihost.com:8643/platform/rest/conductor/v1
DL API Endpoints : https://p10a117.pbm.ihost.com:9243/platform/rest/deeplearning/v1
('vanstee', 'pwvanstee')
/gpfs/home/s4s004/vanstee/2020-05-wmla/tmp/vanstee.modelDir.tar


## Health Check

Check whether any HPO tasks already exist, and verify the platform health.

REST API: **GET platform/rest/deeplearning/v1/hypersearch**
- Description: Get all the HPO tasks that the logged-in user can access.
- OUTPUT: A list of HPO tasks, each in the format described in the API documentation.

In [4]:
def hpo_health_check():
    getTuneStatusUrl = get_ep("dl") + '/hypersearch'
    nprint ('getTuneStatusUrl: %s' %getTuneStatusUrl)
    r = req.get(getTuneStatusUrl, headers=commonHeaders, verify=False, auth=myauth())
    
    if not r.ok:
        nprint('check hpo task status failed: code=%s, %s'%(r.status_code, r.content))
    else:
        if len(r.json()) == 0:
            nprint('No HPO tasks have been created')
        for item in r.json():
            nprint('Hpo task: %s, State: %s'%(item['hpoName'], item['state']))
            #print('Best:%s'%json.dumps(item.get('best'), sort_keys=True, indent=4))

hpo_health_check()


**hpo_health_check** : getTuneStatusUrl: https://p10a117.pbm.ihost.com:9243/platform/rest/deeplearning/v1/hypersearch
**hpo_health_check** : Hpo task: kelvinl-hpo-254313202512616, State: FINISHED
**hpo_health_check** : Hpo task: kelvinl-hpo-254758923243594, State: FINISHED
**hpo_health_check** : Hpo task: kelvinl-hpo-266345130757669, State: FAILED
**hpo_health_check** : Hpo task: kelvinl-hpo-267216444279712, State: FINISHED
**hpo_health_check** : Hpo task: kelvinl-hpo-403697921838806, State: FAILED
**hpo_health_check** : Hpo task: kelvinl-hpo-404372545448297, State: FAILED
**hpo_health_check** : Hpo task: kelvinl-hpo-404624765236135, State: FAILED
**hpo_health_check** : Hpo task: kelvinl-hpo-405300918575393, State: FAILED
**hpo_health_check** : Hpo task: kelvinl-hpo-405544795318004, State: FINISHED
**hpo_health_check** : Hpo task: kelvinl-hpo-406348140701279, State: FINISHED
**hpo_health_check** : Hpo task: kelvinl-hpo-410153859499985, State: FINISHED
**hpo_health_check** : Hpo task: v
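
With a long task list, a per-state count can be easier to read than one line per task. The following is a minimal sketch, assuming each task is a dict with the `hpoName` and `state` keys used in the cell above; `summarize_states` is a hypothetical helper, and the `tasks` list holds three entries copied from the output above in place of a live `r.json()` response.

```python
from collections import Counter

# Three entries copied from the output above; in the notebook this
# list would be the parsed response, r.json().
tasks = [
    {"hpoName": "kelvinl-hpo-254313202512616", "state": "FINISHED"},
    {"hpoName": "kelvinl-hpo-266345130757669", "state": "FAILED"},
    {"hpoName": "kelvinl-hpo-267216444279712", "state": "FINISHED"},
]

def summarize_states(tasks):
    """Count HPO tasks per state (hypothetical helper, not part of WMLA)."""
    return Counter(task["state"] for task in tasks)

summary = summarize_states(tasks)
for state, count in sorted(summary.items()):
    print("{}: {}".format(state, count))
```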

## Launch an HPO task using API


This part of the lab steps through training a model on the simple MNIST (greyscale digits) dataset using WMLA HPO through the API.

We will use the **PyTorch** deep learning framework for this example. Let's take a look at the code.

In [5]:
print("Code directory : {}".format(cfg["code_dir"]))
!ls {cfg["code_dir"]}

Code directory : /gpfs/software/wmla-p10a117/dli_data_fs/models/pytorch_hpo
pytorch_mnist_HPO.py  pytorch_mnist_original.py


### Model file update to Run HPO

**Developer Note**
The WMLA framework requires two changes to your code to support the HPO API:
- Inject hyper-parameters for the sub-training during search
- Retrieve the sub-training result metric

We will cover these details in the next two sections.

For an even more detailed review, see the [documentation in GitHub](https://github.com/IBM/wmla-learning-path/blob/master/02_hpo_for_developers.md).

#### Model update part 1 - Inject hyper-parameters

The hyper-parameters are supplied in a JSON file called **config.json**, located in the current working directory. It can be read directly, as in the following example snippet.

<pre>
hyper_params = json.loads(open("<b>config.json</b>").read())
learning_rate = float(hyper_params.get("<b>learning_rate</b>", "0.01"))
</pre>

After this, you can use these hyper-parameters during model training. The **hyper-parameter name** and **value** type are defined in the search space part of the REST call body when launching a new HPO task.
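
As a minimal sketch of this pattern, the helper below reads `config.json` and layers it over a set of defaults. `load_hyper_params` is a hypothetical name, not part of WMLA, and falling back to the defaults when the file is missing is an assumption made here so that the same script can also run outside an HPO task.

```python
import json
import os

def load_hyper_params(path="config.json", defaults=None):
    """Return hyper-parameters from `path`, layered over `defaults`.

    Hypothetical helper: if the file does not exist, only the
    defaults are returned.
    """
    params = dict(defaults or {})
    if os.path.exists(path):
        with open(path) as f:
            params.update(json.load(f))
    return params

hyper_params = load_hyper_params(defaults={"learning_rate": 0.01})
# Values may arrive as strings, so convert explicitly before use
learning_rate = float(hyper_params["learning_rate"])
```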

#### Model update part 2 - Retrieve sub-training result metric

At the end of your training run, your code will need to create a file called **val_dict_list.json** with test metrics generated during training. These metrics will be used by the search algorithm to propose new sets of hyper-parameters. Please note that **val_dict_list.json** should be created under the result directory which can be retrieved through the environment variable **RESULT_DIR**.

<pre>
with open('{}/val_dict_list.json'.format(os.environ['<b>RESULT_DIR</b>']), 'w') as f:
    json.dump(test_metrics, f)
</pre>

The content of **val_dict_list.json** will look something like the example below. **step** is optional and denotes the training iteration or epoch. Either **loss** or **accuracy** can be the name of the target metric to optimize; at least one metric must be included. The specific metric to optimize (minimize or maximize) is named in the body of the REST call when launching a new HPO task.

```
[
{"step": 1, "loss": 0.2487, "accuracy": 0.4523},
{"step": 2, "loss": 0.1487, "accuracy": 0.5523},
{"step": 3, "loss": 0.1087, "accuracy": 0.6523},
...
]
```

**We have added this to the pytorch_mnist_HPO.py file already for you!**
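
For reference, here is a self-contained sketch of the metric dump described above. `write_metrics` is a hypothetical helper name, the metric values are copied from the example list, and the temporary directory stands in for **RESULT_DIR**, which is assumed to be set only when the script runs under WMLA.

```python
import json
import os
import tempfile

def write_metrics(metrics, result_dir=None):
    """Write metrics to val_dict_list.json and return the file path.

    Hypothetical helper: uses RESULT_DIR from the environment unless
    an explicit directory is passed.
    """
    result_dir = result_dir or os.environ["RESULT_DIR"]
    path = os.path.join(result_dir, "val_dict_list.json")
    with open(path, "w") as f:
        json.dump(metrics, f)
    return path

# Stand-in values taken from the example list above
test_metrics = [
    {"step": 1, "loss": 0.2487, "accuracy": 0.4523},
    {"step": 2, "loss": 0.1487, "accuracy": 0.5523},
]

# Outside WMLA there is no RESULT_DIR, so a temporary directory is used
out_path = write_metrics(test_metrics, result_dir=tempfile.mkdtemp())
print(out_path)
```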

In [6]:
!grep -A1 -B2 val_dict {cfg["code_dir"]}/pytorch_mnist_HPO.py



    # HPO - dump metric values to val_dict_list.json start
    training_out =[]
--
            out[metric] = value
        training_out.append(out)
    with open('{}/val_dict_list.json'.format(os.environ['RESULT_DIR']), 'w') as f:
        json.dump(training_out, f)
# HPO - dump metric values to val_dict_list.json end
if __name__ == '__main__':


### Launch HPO task

Here we package up our model to send to the API for HPO. Let's see how this works.



REST API: **POST /platform/rest/deeplearning/v1/hypersearch**
- Description: Start a new HPO task
- Content-type: Multi-Form (multipart/form-data)
- Multi-Form Data:
  - files: Model files tar package, ending with `.modelDir.tar`
  - form-field: {'data': 'String format of the input parameters to start the HPO task; we call it **hpo_input** and show its specification later'}


#### Package model files for training

Package the updated model files into a tar file ending with `.modelDir.tar`. The REST API expects this archive to contain the model code.


In [7]:
# Tar up 
def make_tarfile():
    makeDirIfNotExist(get_tmp_dir())
    tar_archive_base=os.path.basename(cfg["code_dir"])
    nprint("Tarring up {} to {}".format(cfg["code_dir"],get_tmp_dir()))
    nprint("Adding base directory to archive : {}".format(tar_archive_base))
    with tarfile.open(get_tar_file(), "w:gz") as tar:
        tar.add(cfg["code_dir"], arcname=tar_archive_base)
#MODEL_DIR_SUFFIX = ".modelDir.tar"
#tempFile = tempfile.mktemp(MODEL_DIR_SUFFIX)

make_tarfile()

files = {'file': open(get_tar_file(), 'rb')}
print("Files : {}".format(files))

**makeDirIfNotExist** : Directory /gpfs/home/s4s004/vanstee/2020-05-wmla/tmp already exists .. 
**make_tarfile** : Tarring up /gpfs/software/wmla-p10a117/dli_data_fs/models/pytorch_hpo to /gpfs/home/s4s004/vanstee/2020-05-wmla/tmp
**make_tarfile** : Adding base directory to archive : pytorch_hpo
Files : {'file': <_io.BufferedReader name='/gpfs/home/s4s004/vanstee/2020-05-wmla/tmp/vanstee.modelDir.tar'>}
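
Before uploading, it can be worth confirming that the archive has the intended layout, with the model files under a single top-level directory. The sketch below is an illustration under stated assumptions: it builds a throwaway archive in a temporary directory (the paths and the placeholder file content are hypothetical stand-ins for `cfg["code_dir"]`) and lists its members with the standard `tarfile` module.

```python
import os
import tarfile
import tempfile

# Hypothetical stand-in for cfg["code_dir"]: a temp directory with one file
work_dir = tempfile.mkdtemp()
code_dir = os.path.join(work_dir, "pytorch_hpo")
os.makedirs(code_dir)
with open(os.path.join(code_dir, "pytorch_mnist_HPO.py"), "w") as f:
    f.write("# placeholder model file\n")

# Same packaging approach as make_tarfile(): gzip-compressed archive with
# the directory's base name as the top-level entry
tar_path = os.path.join(work_dir, "demo.modelDir.tar")
with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(code_dir, arcname=os.path.basename(code_dir))

# "r:*" lets tarfile detect the compression when reading the archive back
with tarfile.open(tar_path, "r:*") as tar:
    members = sorted(tar.getnames())

print(members)
```

In the notebook itself, the same listing could be run against `get_tar_file()` to check that the top-level directory matches the `--model-dir` value used in the request arguments.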


#### Construct POST request data

**hpo_input** is a Python dict (or the equivalent JSON) as shown below; convert it to a string when calling the REST API.

In [8]:
# Note, this 
data =  {
        'modelSpec': # Define the model training related parameters
        {
            # Spark instance group which will be used to run the HPO sub-trainings. The Spark instance group selected
            # here should match the sub-training args; for example, if the sub-training args try to run an EDT job,
            # then we should specify a Spark instance group capable of running EDT jobs here.
            'sigName': cfg["sig_name"],

            # These are the arguments we'll pass to the execution engine; they follow the same conventions
            # of the dlicmd.py command line launcher
            #
            # See:
            #   https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1/cm/dlicmd.html
            # In this example, args after --model-dir are all the required parameter for the original model itself.
            #
            'args': '--exec-start PyTorch --cs-datastore-meta type=fs --python-version 3.6\
                     --gpuPerWorker 1 --model-main pytorch_mnist_HPO.py --model-dir pytorch_hpo\
                     --debug-level debug'
                
        },
    
        'algoDef': # Define the parameters for search algorithms
        {
            # Name of the search algorithm, one of Random, Bayesian, Tpe, Hyperband, ExperimentGridSearch
            'algorithm': 'Random', 
            # Max running time of the hpo task in minutes, -1 means unlimited
            'maxRunTime': 60,  
            # Max number of training jobs to submit for the HPO task, -1 means unlimited
            'maxJobNum': 4,
            # Max number of training jobs to run in parallel, default 1. It depends on both the
            # available resources and whether the search algorithm supports running in parallel. Currently only
            # Random fully supports running in parallel; Hyperband and Tpe support running in parallel in some
            # phases; Bayesian currently runs sequentially.
            'maxParalleJobNum': 2, 
            # Name of the target metric that we are trying to optimize when searching hyper-parameters.
            # It is the same metric name that the code from model update part 2 writes out.
            'objectiveMetric' : 'loss',
            # Strategy as how to optimize the hyper-parameters, minimize means to find better hyper-parameters to
            # make the above objectiveMetric as small as possible, maximize means the opposite.
            'objective' : 'minimize',
        },
    
        # Define the hyper-parameters to search and the corresponding search space.
        'hyperParams':
        [
             {
                 # Hyperparameter name, which will be the hyper-parameter key in config.json
                 'name': 'learning_rate',
                 # One of Range, Discrete
                 'type': 'Range',
                 # one of int, double, str
                 'dataType': 'DOUBLE',
                 # lower bound and upper bound when type=range and dataType=double
                 'minDbVal': 0.001,
                 'maxDbVal': 0.1,
                 # lower bound and upper bound when type=range and dataType=int
                 'minIntVal': 0,
                 'maxIntVal': 0,
                 # Discrete value list when type=discrete
                 'discreteDbVal': [],
                 'discreteIntVal': [],
                 'discreateStrVal': []
                 #step size to split the Range space. ONLY valid when type is Range
                 #'step': '0.002',
             }
         ]
    }
mydata={'data':json.dumps(data)}
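
When several hyper-parameters are searched, writing each entry by hand becomes repetitive. The sketch below shows one way to generate entries from a template that carries the same keys as the cell above, including the `discreateStrVal` spelling used there. `hyper_param` is a hypothetical helper, `batch_size` is an illustrative parameter not used by this notebook's model, and the `'INT'` casing is an assumption based on the `'DOUBLE'` value shown above.

```python
import json

def hyper_param(name, type_, data_type, **overrides):
    """Build one hyperParams entry (hypothetical helper, not a WMLA API).

    Every field from the notebook's example is filled with a neutral
    default; only the fields named in `overrides` are changed.
    """
    entry = {
        'name': name,
        'type': type_,
        'dataType': data_type,
        'minDbVal': 0.0, 'maxDbVal': 0.0,
        'minIntVal': 0, 'maxIntVal': 0,
        'discreteDbVal': [], 'discreteIntVal': [], 'discreateStrVal': [],
    }
    unknown = set(overrides) - set(entry)
    if unknown:
        # Catch misspelled field names before they reach the REST call
        raise KeyError("Unknown hyper-parameter fields: {}".format(sorted(unknown)))
    entry.update(overrides)
    return entry

hyper_params = [
    hyper_param('learning_rate', 'Range', 'DOUBLE', minDbVal=0.001, maxDbVal=0.1),
    # Illustrative only: 'batch_size' and the 'INT' casing are assumptions
    hyper_param('batch_size', 'Discrete', 'INT', discreteIntVal=[32, 64, 128]),
]

# The request body carries the whole specification as a JSON string
payload = {'data': json.dumps({'hyperParams': hyper_params})}
```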

#### Submit the Post request

Submit the HPO task through the POST call; the HPO name/ID is returned as a string.

**Note**: The same request cannot be submitted twice as-is. Rerun the packaging cell above, which rebuilds the tar file and reopens it, before resubmitting.

In [9]:
def submit_job():
    startTuneUrl=get_ep('dl') + '/hypersearch'
    nprint("startTuneUrl : {}".format(startTuneUrl))
    nprint("files : {}".format(files))
    nprint("myauth() : {}".format(myauth()))
    #print("hpo_job_id : {}".format(hpo_job_id))
    r = req.post(startTuneUrl, headers=commonHeaders, data=mydata, files=files, verify=False, auth=myauth())
    hpo_name=None
    if r.ok:
        hpo_name = r.json()
        print ('\nModel submitted successfully: {}'.format(hpo_name))
        
    else:
        print('\nModel submission failed with code={}, {}'. format(r.status_code, r.content))
    return hpo_name

hpo_job_id = submit_job()
print("hpo_job_id : {}".format(hpo_job_id))

**submit_job** : startTuneUrl : https://p10a117.pbm.ihost.com:9243/platform/rest/deeplearning/v1/hypersearch
**submit_job** : files : {'file': <_io.BufferedReader name='/gpfs/home/s4s004/vanstee/2020-05-wmla/tmp/vanstee.modelDir.tar'>}
**submit_job** : myauth() : ('vanstee', 'pwvanstee')

Model submitted successfully: vanstee-hpo-831009472195092
hpo_job_id : vanstee-hpo-831009472195092


In [10]:
getHpoUrl = get_ep('dl') +'/hypersearch/'+ hpo_job_id
pp = pprint.PrettyPrinter(indent=2)

keep_running=True
rr=10
while(keep_running):
    res = req.get(getHpoUrl, headers=commonHeaders, verify=False, auth=myauth())
    experiments=res.json()['experiments']
    experiments = pd.DataFrame.from_dict(experiments)
    pd.set_option('max_colwidth', 120)
    clear_output()
    print("Refreshing every {} seconds".format(rr))
    display(experiments)
    pp.pprint(res.json())
    if(res.json()['state'] not in ['SUBMITTED','RUNNING']) :
        keep_running=False
    time.sleep(rr)

Refreshing every 10 seconds


Unnamed: 0,id,hyperParams,state,metricVal,maxiteration,appId,driverId,startTime,endTime
0,0,"[{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.07176378069813329'}]",FINISHED,0.045087,0,vanstee-831011048458738-1807318326,driver-20200526105345-0106-1c900f6f-d4eb-4d86-885e-caaf54b2cba3,2020-05-26 10:53:45,2020-05-26 10:58:00
1,1,"[{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.07950452443574245'}]",FINISHED,0.043343,0,vanstee-831016981293908-1256126962,driver-20200526105351-0107-6fb2c545-0336-473c-998f-c5a644468c7d,2020-05-26 10:53:51,2020-05-26 10:59:01
2,2,"[{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.08983621328711912'}]",FINISHED,0.041888,0,vanstee-831328606570062-1802464889,driver-20200526105902-0108-984f08db-9681-4ca2-8b0b-df4e1a93e4bf,2020-05-26 10:59:03,2020-05-26 11:04:19
3,3,"[{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.0693363050432297'}]",FINISHED,0.045947,0,vanstee-831334628394027-477840709,driver-20200526105908-0109-b5e0bc6f-5be8-4cf9-91da-db9d16f6bdf4,2020-05-26 10:59:08,2020-05-26 11:04:19


{ 'best': { 'appId': 'vanstee-831328606570062-1802464889',
            'driverId': 'driver-20200526105902-0108-984f08db-9681-4ca2-8b0b-df4e1a93e4bf',
            'endTime': '2020-05-26 11:04:19',
            'hyperParams': [ { 'dataType': 'double',
                               'fixedVal': '0.08983621328711912',
                               'name': 'learning_rate',
                               'userDefined': False}],
            'id': 2,
            'maxiteration': 0,
            'metricVal': 0.04188754959106445,
            'startTime': '2020-05-26 10:59:03',
            'state': 'FINISHED'},
  'complete': 4,
  'createtime': '2020-05-26 10:53:43',
  'creator': 'vanstee',
  'duration': '00:10:36',
  'experiments': [ { 'appId': 'vanstee-831011048458738-1807318326',
                     'driverId': 'driver-20200526105345-0106-1c900f6f-d4eb-4d86-885e-caaf54b2cba3',
                     'endTime': '2020-05-26 10:58:00',
                     'hyperParams': [ { 'dataType': 'double',
   

In [None]:

if not res.ok:
    print('get hpo task failed: code=%s, %s'%(res.status_code, res.content))
else:
    json_out=res.json()
    
    while json_out['state'] in ['SUBMITTED','RUNNING']:
        clear_output()
        print('Hpo task %s state %s progress %s%%'%(hpo_job_id, json_out['state'], json_out['progress']))
        time.sleep(10)
        res = req.get(getHpoUrl, headers=commonHeaders, verify=False, auth=myauth())
        if not res.ok:
            print('get hpo task failed: code=%s, %s'%(res.status_code, res.content))
            break
        json_out=res.json()
        
        ####
        ## Collect the appId of every sub-training reported so far
        ###
        app_ids = [experiment['appId'] for experiment in json_out['experiments']]
 
        ####
        ## Query the state of each sub-training
        ###
        for count, app_id in enumerate(app_ids):
            print('appID: %s,' %app_id)
            print('count: %d' %count)
            # Keep the response object so its status can be checked before parsing
            r = req.get(get_ep('dl')+'/execs/'+app_id, auth=myauth(), headers=commonHeaders, verify=False)
            if not r.ok:
                print('get experiment %s failed: code=%s, %s'%(app_id, r.status_code, r.content))
            else:
                print('Experiment %s state: %s' %(app_id, r.json()['state']))

    # Printed inside the else branch so json_out is always defined here
    print('Hpo task %s completes with state %s'%(hpo_job_id, json_out['state']))
    print(json.dumps(json_out, indent=4, sort_keys=True))
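
The response already reports a `best` entry, but the selection can be cross-checked locally from the `experiments` list. The sketch below is a minimal illustration, assuming the `id`, `metricVal` and `state` fields shown in the output above; `best_experiment` is a hypothetical helper, and the four records reuse the values from the experiments table rather than a live response.

```python
# Values copied from the experiments table shown above
experiments = [
    {"id": 0, "metricVal": 0.045087, "state": "FINISHED"},
    {"id": 1, "metricVal": 0.043343, "state": "FINISHED"},
    {"id": 2, "metricVal": 0.041888, "state": "FINISHED"},
    {"id": 3, "metricVal": 0.045947, "state": "FINISHED"},
]

def best_experiment(experiments, objective="minimize"):
    """Return the finished experiment with the best metric, or None.

    Hypothetical helper: `objective` mirrors the 'objective' field of
    the request ('minimize' or 'maximize').
    """
    finished = [
        e for e in experiments
        if e["state"] == "FINISHED" and e.get("metricVal") is not None
    ]
    if not finished:
        return None
    pick = min if objective == "minimize" else max
    return pick(finished, key=lambda e: e["metricVal"])

best = best_experiment(experiments)
print("Best experiment id: {}, metric: {}".format(best["id"], best["metricVal"]))
```

With the values above and `objective="minimize"`, the helper selects experiment 2, which agrees with the `best` entry in the response.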
 
mya