## HPO Training using Accelerator API

![SpectrumComputeFamily_Conductor-HorizontalColorWhite.png](https://raw.githubusercontent.com/IBM/wmla-learning-path/master/shared-images/hpo.png)

In this notebook you will learn how to submit a model and dataset to the Watson Machine Learning Accelerator (WMLA) API to run Hyperparameter Optimization (HPO).

For this notebook you will use a model and dataset that have already been set up to work with the API. For details on the API, see this link in the Knowledge Center (KC).

[API Documentation](https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1/cm/deeplearning.html)

## Imports

To use the WMLA API, we will use the Python `requests` library.

In [None]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

import json
import time
import urllib
import pandas as pd
import os
import sys
import tarfile
import tempfile
from IPython.display import clear_output
import pprint

# utility print function
def nprint(mystring) :
    print("**{}** : {}".format(sys._getframe(1).f_code.co_name,mystring))

# utility makedir
def makeDirIfNotExist(directory) :
    if not os.path.exists(directory):  
        nprint("Making directory {}".format(directory))
        os.makedirs(directory) 
    else :
        nprint("Directory {} already exists .. ".format(directory))


## Environment details and Project Config:

In [None]:
print("This is your userid...")
!whoami

In [None]:
def getconfig(cfg_in={}):
    cfg = {}
    cfg["master_host"] = 'https://p10a117.pbm.ihost.com'
    cfg["dli_rest_port"] = '9243'
    cfg["sc_rest_port"] = '8643'
    cfg["num_images"] = {"train":200,"valid":20,"test":20}
    # ==== CLASS ENTER User login details below =====
    cfg["wmla_user"] = 'b0p036aX'  # <= enter your id here
    cfg["wmla_pwd"] = 'pwb0p036aX'  # <= enter your pwd here
    # ==== CLASS ENTER User login details above =====
    cfg["sig_name"]  = 'b0p036a-dliauto'
    cfg["code_dir"] = "./code_samples/pytorch_hpo"

    # overwrite configs if passed
    for (k,v) in cfg_in.items() :
        # cfg.get avoids a KeyError when cfg_in introduces a key that has no default
        nprint("Overriding Config {}:{} with {}".format(k,cfg.get(k),v))
        cfg[k] = v
    return cfg

# cfg is used as a global variable throughout this notebook
cfg=getconfig()

In [None]:
# REST call variables
commonHeaders = {'Accept': 'application/json'}

# Use closures for cfg for now ..
def get_tmp_dir() :
    return "./tmp"

def get_tar_file() :
    return get_tmp_dir() + "/" + cfg["wmla_user"]+".modelDir.tar"

#get api endpoint
def get_ep(mode="sc") :
    if mode=="sc" :
        sc_rest_url =  cfg["master_host"] +':'+ cfg["sc_rest_port"] +'/platform/rest/conductor/v1'
        return sc_rest_url
    elif(mode=="dl") :
        dl_rest_url = cfg["master_host"] +':'+cfg["dli_rest_port"] +'/platform/rest/deeplearning/v1'
        return dl_rest_url
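    else :
        # fail fast: returning None here would only surface later as a TypeError when the URL is built
        raise ValueError("Error mode : {} not supported".format(mode))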

def myauth():
    return(cfg["wmla_user"],cfg["wmla_pwd"])

print ("SC API Endpoints : {}".format(get_ep("sc")))
print ("DL API Endpoints : {}".format(get_ep("dl")))
print (myauth())
print (get_tar_file())
#myauth = (wmla_user, wmla_pwd)

# Setup Requests session
req = requests.Session()

## Health Check

Check whether any HPO tasks already exist and verify the platform health.

REST API: **GET platform/rest/deeplearning/v1/hypersearch**
- Description: Get all the HPO tasks that the logged-in user can access.
- Output: A list of HPO tasks, each in the same format, which is described in the API documentation.
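
As a minimal sketch of working with that response, the snippet below counts tasks per state from an already-parsed list. The `hpoName` and `state` fields are the ones this notebook reads in its health-check cell; the sample task names and the helper `summarize_tasks` are hypothetical and only serve as illustration.

```python
from collections import Counter

def summarize_tasks(tasks):
    """Count HPO tasks per state (hypothetical helper, not part of the WMLA API)."""
    return Counter(task.get("state", "UNKNOWN") for task in tasks)

# Made-up sample shaped like the parsed JSON list; real responses carry more fields.
sample_tasks = [
    {"hpoName": "demo-hpo-1", "state": "RUNNING"},
    {"hpoName": "demo-hpo-2", "state": "SUBMITTED"},
    {"hpoName": "demo-hpo-3", "state": "RUNNING"},
]

for state, count in sorted(summarize_tasks(sample_tasks).items()):
    print("{}: {}".format(state, count))
```

With a live cluster, the same helper could be fed the parsed body of the GET call instead of the sample list.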

In [None]:
def hpo_health_check():
    getTuneStatusUrl = get_ep("dl") + '/hypersearch'
    nprint ('getTuneStatusUrl: %s' %getTuneStatusUrl)
    r = req.get(getTuneStatusUrl, headers=commonHeaders, verify=False, auth=myauth())
    
    if not r.ok:
        nprint('check hpo task status failed: code=%s, %s'%(r.status_code, r.content))
    else:
        if len(r.json()) == 0:
            nprint('No hpo task has been created yet')
        for item in r.json():
            nprint('Hpo task: %s, State: %s'%(item['hpoName'], item['state']))
            #print('Best:%s'%json.dumps(item.get('best'), sort_keys=True, indent=4))

hpo_health_check()


## Launch an HPO task using API


This part of the lab walks through the steps required to train a simple model on the MNIST (greyscale digits) dataset using WMLA HPO through the API.

We will be using the **PyTorch** deep learning framework for this example. Let's take a look at the code.

* pytorch_mnist_original.py : original sample code from pytorch.org
* pytorch_mnist_HPO.py      : Modified code that implements the required HPO hooks for the WMLA framework
    * Learning rate controlled by HPO
* pytorch_mnist_HPO_advanced.py : Similar example just showing multiple settings 
    * Learning rate controlled by HPO
    * Dropout controlled by HPO

In [None]:
print("Code directory : {}".format(cfg["code_dir"]))
!ls {cfg["code_dir"]}

### Run mnist using command line interface
In the following cells we will run a deep learning training job on the MNIST dataset. Typically you develop your initial model and submit single jobs until you are happy with the result, and only then submit many jobs. Since our MNIST program is written in Python, let's submit it via the command line using the sample below.

1. Open a terminal in our Jupyter notebook and go to the code_samples directory
```
cd code_samples
```

2. Now source our conda environment manually
```
source /gpfs/software/wmla-p10a117/wmla_anaconda/b0p036a/anaconda/envs/powerai162/etc/profile.d/conda.sh
```
3. Export some required environment variables
```
export DLI_DATA_FS=/gpfs/software/wmla-p10a117/dli_data_fs/
export RESULT_DIR=./code_samples/
export EGO_TOP=/gpfs/software/wmla-p10a117/wmla
```

4. Authenticate to WMLA
```
python $EGO_TOP/dli/1.2.3/dlpd/bin/dlicmd.py  --logon  --master-host p10a117.pbm.ihost.com --username b0p036aX --password pwb0p036aX
```

5. Submit a single job

```
python $EGO_TOP/dli/1.2.3/dlpd/bin/dlicmd.py --exec-start PyTorch --cs-datastore-meta type=fs \
    --python-version 3.6  --master-host p10a117.pbm.ihost.com --ig b0p036a-dliauto \
    --gpuPerWorker 1 --model-main pytorch_mnist_HPO.py --model-dir pytorch_hpo \
    --epochs 2 --debug-level debug
```
   
    
6. View job results
```
JOB_ID=vanstee-1987246869690900-700353922
python  $EGO_TOP/dli/1.2.3/dlpd/bin/dlicmd.py  --exec-errlogs   $JOB_ID --master-host p10a117.pbm.ihost.com  | grep -v " INFO " | grep -v "^+" | grep -v "^Spark" | more
```

### Model file update to Run HPO

**Developer Note** 
The WMLA framework requires two changes to your code to support the HPO API:
- Inject hyper-parameters for the sub-training during search
- Retrieve the sub-training result metric

We will cover these details in the next two sections.

For an even more detailed review, see the [documentation in GitHub](https://github.com/IBM/wmla-learning-path/blob/master/02_hpo_for_developers.md)

#### Model update part 1 - Inject hyper-parameters

The hyper-parameters are supplied in a JSON file called **config.json**, located in the current working directory. It can be read directly, as in the following example snippet.

<pre>
hyper_params = json.loads(open("<b>config.json</b>").read())
learning_rate = float(hyper_params.get("<b>learning_rate</b>", "0.01"))
</pre>

After this, you can use these hyper-parameters during model training. The **hyper-parameter name** and **value** type are defined in the search space part of the REST call body when launching a new HPO task.
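
A slightly more defensive way of reading the file could look like the sketch below. It is only an illustration: the helper `load_hyper_params`, the default values, and the temporary directory that simulates the working directory are assumptions; only the file name **config.json** and the `learning_rate` and `dropout_rate` keys come from this notebook.

```python
import json
import os
import tempfile

def load_hyper_params(path="config.json", defaults=None):
    """Read hyper-parameters from a JSON file, falling back to defaults.

    Hypothetical helper for illustration, not part of the WMLA framework.
    """
    params = dict(defaults or {})
    if os.path.exists(path):
        with open(path) as f:  # the context manager closes the file handle
            params.update(json.load(f))
    return params

# Simulate the config.json that an HPO sub-training would find in its working directory.
workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "config.json")
with open(config_path, "w") as f:
    json.dump({"learning_rate": 0.05}, f)

# Default values are placeholders chosen for this sketch.
hyper_params = load_hyper_params(config_path, defaults={"learning_rate": 0.01, "dropout_rate": 0.2})
learning_rate = float(hyper_params["learning_rate"])
dropout_rate = float(hyper_params["dropout_rate"])
print(learning_rate, dropout_rate)
```

Keeping defaults in one place lets the same script run both standalone (no config.json present) and as an HPO sub-training.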

#### Model update part 2 - Retrieve sub-training result metric

At the end of your training run, your code will need to create a file called **val_dict_list.json** with test metrics generated during training. These metrics will be used by the search algorithm to propose new sets of hyper-parameters. Please note that **val_dict_list.json** should be created under the result directory which can be retrieved through the environment variable **RESULT_DIR**.

<pre>
with open('{}/val_dict_list.json'.format(os.environ['<b>RESULT_DIR</b>']), 'w') as f:
    json.dump(test_metrics, f)
</pre>

The content of **val_dict_list.json** looks like the example below. **step** is optional and denotes the training iteration or epoch. Either **loss** or **accuracy** can be the name of the target metric to optimize; at least one metric must be included. The specific metric name to optimize (minimize or maximize) is defined in the body of the REST call when launching a new HPO task.

```
[
{"step": 1, "loss": 0.2487, "accuracy": 0.4523},
{"step": 2, "loss": 0.1487, "accuracy": 0.5523},
{"step": 3, "loss": 0.1087, "accuracy": 0.6523},
...
]
```
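
The sketch below shows one way the metric list could be collected per epoch and written out in that format. The helper `write_metrics` and the placeholder metric values are hypothetical. Under WMLA the target directory comes from **RESULT_DIR**; a temporary directory is used here so that the sketch runs anywhere.

```python
import json
import os
import tempfile

def write_metrics(test_metrics, result_dir):
    """Write the metric list to val_dict_list.json (hypothetical helper)."""
    os.makedirs(result_dir, exist_ok=True)
    out_path = os.path.join(result_dir, "val_dict_list.json")
    with open(out_path, "w") as f:
        json.dump(test_metrics, f)
    return out_path

# Placeholder values standing in for metrics measured once per epoch.
test_metrics = []
for epoch, (loss, accuracy) in enumerate([(0.2487, 0.4523), (0.1487, 0.5523)], start=1):
    test_metrics.append({"step": epoch, "loss": loss, "accuracy": accuracy})

# In a real sub-training this would be os.environ["RESULT_DIR"].
result_dir = tempfile.mkdtemp()
out_path = write_metrics(test_metrics, result_dir)

with open(out_path) as f:
    print(json.load(f))
```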

**We have added this to the pytorch_mnist_HPO.py file already for you in the ./code_samples/pytorch_hpo directory!**

In [None]:
print(cfg["code_dir"])
!ls {cfg["code_dir"]}

### Launch HPO task

Here we package up our model to send to the API for HPO. Let's see how this works.



REST API: **POST /platform/rest/deeplearning/v1/hypersearch**
- Description: Start a new HPO task
- Content-type: Multi-Form
- Multi-Form Data:
  - files: Model files tar package, ending with `.modelDir.tar`
  - form field: `data`, the string form of the input parameters for starting the HPO task. We call it **hpo_input** and give its specification later.


#### Package model files for training

Package the updated model files into a tar file whose name ends with `.modelDir.tar`.

The REST API expects a modelDir.tar with the model code inside.
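
Before uploading, it can help to check what actually ends up inside the archive. The sketch below builds a throwaway model directory, packages it with the base directory name as the archive root, and lists the members. The placeholder file content, the `demo` file-name prefix, and the temporary paths are assumptions made for illustration.

```python
import os
import tarfile
import tempfile

# Build a throwaway model directory so the sketch runs without the lab files.
work = tempfile.mkdtemp()
code_dir = os.path.join(work, "pytorch_hpo")
os.makedirs(code_dir)
with open(os.path.join(code_dir, "pytorch_mnist_HPO.py"), "w") as f:
    f.write("print('placeholder model code')\n")

# The archive name ends with .modelDir.tar; "demo" is a made-up prefix.
tar_path = os.path.join(work, "demo.modelDir.tar")
with tarfile.open(tar_path, "w:gz") as tar:
    # arcname keeps only the base directory name inside the archive
    tar.add(code_dir, arcname=os.path.basename(code_dir))

# Reopen the archive (compression is auto-detected) and list its members.
with tarfile.open(tar_path) as tar:
    members = sorted(tar.getnames())
print(members)
```

The listed paths should start with the directory name that is later passed to `--model-dir`.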


In [None]:
# Tar up 
def make_tarfile():
    makeDirIfNotExist(get_tmp_dir())
    tar_archive_base=os.path.basename(cfg["code_dir"])
    nprint("Tarring up {} to {}".format(cfg["code_dir"],get_tmp_dir()))
    nprint("Adding base directory to archive : {}".format(tar_archive_base))
    with tarfile.open(get_tar_file(), "w:gz") as tar:
        tar.add(cfg["code_dir"], arcname=tar_archive_base)
#MODEL_DIR_SUFFIX = ".modelDir.tar"
#tempFile = tempfile.mktemp(MODEL_DIR_SUFFIX)

make_tarfile()

files = {'file': open(get_tar_file(), 'rb')}
print("Files : {}".format(files))

#### Construct POST request data

**hpo_input** is a Python dict (or JSON) in the format below; convert it to a string when calling the REST API.
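
The conversion to a string, plus a quick client-side sanity check of the search space, can be sketched as follows. The helper `check_range_params` is hypothetical and not part of the WMLA API, and the `hpo_input` shown is trimmed to a single hyper-parameter, so it is not a complete request body.

```python
import json

def check_range_params(hyper_params):
    """Return names of Range hyper-parameters whose bounds are missing or not increasing.

    Hypothetical client-side check for illustration only.
    """
    bad = []
    for hp in hyper_params:
        if hp.get("type") != "Range":
            continue
        if hp.get("dataType") == "INT":
            lo, hi = hp.get("minIntVal"), hp.get("maxIntVal")
        else:
            lo, hi = hp.get("minDbVal"), hp.get("maxDbVal")
        if lo is None or hi is None or lo >= hi:
            bad.append(hp.get("name"))
    return bad

# Trimmed example: only the search space part, with one hyper-parameter.
hpo_input = {
    "hyperParams": [
        {"name": "learning_rate", "type": "Range", "dataType": "DOUBLE",
         "minDbVal": 0.001, "maxDbVal": 0.1},
    ]
}

print("Suspicious ranges:", check_range_params(hpo_input["hyperParams"]))

# The form field carries a string, so the dict is serialized with json.dumps.
form_data = {"data": json.dumps(hpo_input)}
print(type(form_data["data"]).__name__)
```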

In [None]:
# hpo_input: input parameters for the HPO task
data =  {
        'modelSpec': # Define the model training related parameters
        {
            # Spark instance group which will be used to run the HPO sub-trainings. The Spark instance group selected
            # here should match the sub-training args. For example, if the sub-training args try to run an EDT job,
            # then we should put a Spark instance group capable of running EDT jobs here.
            'sigName': cfg["sig_name"],

            # These are the arguments we'll pass to the execution engine; they follow the same conventions
            # of the dlicmd.py command line launcher
            #
            # See:
            #   https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1/cm/dlicmd.html
            # In this example, args after --model-dir are all the required parameters for the original model itself.
            #
            'args': '--exec-start PyTorch --cs-datastore-meta type=fs --python-version 3.6\
                     --gpuPerWorker 1 --model-main pytorch_mnist_HPO.py --model-dir pytorch_hpo\
                     --epochs 3 --debug-level debug'
                
        },
    
        'algoDef': # Define the parameters for search algorithms
        {
            # Name of the search algorithm, one of Random, Bayesian, Tpe, Hyperband, ExperimentGridSearch
            'algorithm': 'Random', 
            # Max running time of the hpo task in minutes, -1 means unlimited
            'maxRunTime': 60,  
            # Max number of training jobs to submit for the hpo task, -1 means unlimited
            'maxJobNum': 4,            
            # Max number of training jobs to run in parallel, default 1. It depends on both the
            # available resources and whether the search algorithm supports running in parallel. Currently only Random
            # fully supports running in parallel; Hyperband and Tpe support running in parallel in some phases,
            # and Bayesian runs sequentially.
            'maxParalleJobNum': 2, 
            # Name of the target metric that we are trying to optimize when searching hyper-parameters.
            # It is the same metric name that model update part 2 writes out.
            'objectiveMetric' : 'loss',
            # Strategy for optimizing the hyper-parameters: minimize means finding hyper-parameters that
            # make the above objectiveMetric as small as possible, maximize means the opposite.
            'objective' : 'minimize',
        },
    
        # Define the hyper-parameters to search and the corresponding search space.
        'hyperParams':
        [
             {
                 # Hyperparameter name, which will be the hyper-parameter key in config.json
                 'name': 'learning_rate',
                 # One of Range, Discrete
                 'type': 'Range',
                 # one of int, double, str
                 'dataType': 'DOUBLE',
                 # lower bound and upper bound when type=range and dataType=double
                 'minDbVal': 0.001,
                 'maxDbVal': 0.1,
                 # lower bound and upper bound when type=range and dataType=int
                 'minIntVal': 0,
                 'maxIntVal': 0,
                 # Discrete value list when type=discrete
                 'discreteDbVal': [],
                 'discreteIntVal': [],
                 'discreateStrVal': []
                 #step size to split the Range space. ONLY valid when type is Range
                 #'step': '0.002',
             }
         ]
    }
mydata={'data':json.dumps(data)}

#### Submit the Post request

Submit the HPO task through the POST call; the HPO name/id is returned as a string.

**Note**: This cannot be submitted twice. You need to rebuild the tar file prior to resubmitting.

In [None]:
def submit_job(job_dict,job_files,job_auth):
    startTuneUrl=get_ep('dl') + '/hypersearch'
    nprint("startTuneUrl : {}".format(startTuneUrl))
    nprint("files : {}".format(job_files))
    nprint("myauth() : {}".format(job_auth))
    #print("hpo_job_id : {}".format(hpo_job_id))
    r = req.post(startTuneUrl, headers=commonHeaders, data=job_dict, files=job_files, verify=False, auth=job_auth)
    hpo_name=None
    if r.ok:
        hpo_name = r.json()
        print ('\nModel submitted successfully: {}'.format(hpo_name))
        
    else:
        print('\nModel submission failed with code={}, {}'.format(r.status_code, r.content))
    return hpo_name

hpo_job_id = submit_job(mydata,files,myauth())
print("hpo_job_id : {}".format(hpo_job_id))

### Query Status until complete

In [None]:
def query_job_status(job_id,refresh_rate=3) :

    getHpoUrl = get_ep('dl') +'/hypersearch/'+ job_id
    pp = pprint.PrettyPrinter(indent=2)

    keep_running=True
    res=None
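    while(keep_running):
        res = req.get(getHpoUrl, headers=commonHeaders, verify=False, auth=myauth())
        if not res.ok:
            # stop polling instead of failing on a body that has no 'experiments' key
            nprint('query hpo task failed: code=%s, %s'%(res.status_code, res.content))
            break
        status = res.json()  # parse the response body once per poll
        experiments = pd.DataFrame.from_dict(status['experiments'])
        pd.set_option('max_colwidth', 120)
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(experiments)
        pp.pprint(status)
        if(status['state'] not in ['SUBMITTED','RUNNING']) :
            keep_running=False
        else:
            # only wait when another poll will follow
            time.sleep(refresh_rate)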
    return res
job_status = query_job_status(hpo_job_id,refresh_rate=10)

### Show the Best Result of HPO Job

In [None]:
# Let's query our result to see what happened during HPO training!

#res.ok
#res.json()
#print(type(res))
#print(dir(res))
#print(json.dumps(res.json(), indent=4, sort_keys=True))
        
print('Hpo task %s completes with state %s'%(hpo_job_id, job_status.json()['state']))
print("Best HPO result ...")
job_status.json()["best"]


#### Notebook Complete 
Congratulations, you have completed our demonstration of using WMLA for distributed hyperparameter optimization search.

## [Optional Advanced Example] : Random Optimization using Multiple Parameters

In [None]:
# hpo_input: input parameters for the advanced HPO task
data =  {
        'modelSpec': # Define the model training related parameters
        {
            # Spark instance group which will be used to run the HPO sub-trainings. The Spark instance group selected
            # here should match the sub-training args. For example, if the sub-training args try to run an EDT job,
            # then we should put a Spark instance group capable of running EDT jobs here.
            'sigName': cfg["sig_name"],

            # These are the arguments we'll pass to the execution engine; they follow the same conventions
            # of the dlicmd.py command line launcher
            #
            # See:
            #   https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1/cm/dlicmd.html
            # In this example, args after --model-dir are all the required parameters for the original model itself.
            #
            'args': '--exec-start PyTorch --cs-datastore-meta type=fs --python-version 3.6\
                     --gpuPerWorker 1 --model-main pytorch_mnist_HPO_advanced.py --model-dir pytorch_hpo\
                     --epochs 3 --debug-level debug'
                
        },
    
        'algoDef': # Define the parameters for search algorithms
        {
            # Name of the search algorithm, one of Random, Bayesian, Tpe, Hyperband, ExperimentGridSearch
            'algorithm': 'Random', 
            # Max running time of the hpo task in minutes, -1 means unlimited
            'maxRunTime': 60,  
            # Max number of training jobs to submit for the hpo task, -1 means unlimited
            'maxJobNum': 30,            
            # Max number of training jobs to run in parallel, default 1. It depends on both the
            # available resources and whether the search algorithm supports running in parallel. Currently only Random
            # fully supports running in parallel; Hyperband and Tpe support running in parallel in some phases,
            # and Bayesian runs sequentially.
            'maxParalleJobNum': 30, 
            # Name of the target metric that we are trying to optimize when searching hyper-parameters.
            # It is the same metric name that model update part 2 writes out.
            'objectiveMetric' : 'loss',
            # Strategy for optimizing the hyper-parameters: minimize means finding hyper-parameters that
            # make the above objectiveMetric as small as possible, maximize means the opposite.
            'objective' : 'minimize',
        },
    
        # Define the hyper-parameters to search and the corresponding search space.
        'hyperParams':
        [
             {
                 # Hyperparameter name, which will be the hyper-parameter key in config.json
                 'name': 'learning_rate',
                 # One of Range, Discrete
                 'type': 'Range',
                 # one of int, double, str
                 'dataType': 'DOUBLE',
                 # lower bound and upper bound when type=range and dataType=double
                 'minDbVal': 0.0001,
                 'maxDbVal': 0.1,
                 # lower bound and upper bound when type=range and dataType=int
                 'minIntVal': 0,
                 'maxIntVal': 0,
                 # Discrete value list when type=discrete
                 'discreteDbVal': [],
                 'discreteIntVal': [],
                 'discreateStrVal': []
                 #step size to split the Range space. ONLY valid when type is Range
                 #'step': '0.002',
             },{
                 # Hyperparameter name, which will be the hyper-parameter key in config.json
                 'name': 'num_hidden_layers',
                 # One of Range, Discrete
                 'type': 'Range',
                 # one of int, double, str
                 'dataType': 'INT',
                 # lower bound and upper bound when type=range and dataType=double
                 'minDbVal': 0.0,
                 'maxDbVal': 0.0,
                 # lower bound and upper bound when type=range and dataType=int
                 'minIntVal': 25,
                 'maxIntVal': 200,
                 # Discrete value list when type=discrete
                 'discreteDbVal': [],
                 'discreteIntVal': [],
                 'discreateStrVal': []
                 #step size to split the Range space. ONLY valid when type is Range
                 #'step': '0.002',
             },             {
                 # Hyperparameter name, which will be the hyper-parameter key in config.json
                 'name': 'dropout_rate',
                 # One of Range, Discrete
                 'type': 'Range',
                 # one of int, double, str
                 'dataType': 'DOUBLE',
                 # lower bound and upper bound when type=range and dataType=double
                 'minDbVal': 0.00,
                 'maxDbVal': 0.40,
                 # lower bound and upper bound when type=range and dataType=int
                 'minIntVal': 0,
                 'maxIntVal': 0,
                 # Discrete value list when type=discrete
                 'discreteDbVal': [],
                 'discreteIntVal': [],
                 'discreateStrVal': []
                 #step size to split the Range space. ONLY valid when type is Range
                 #'step': '0.002',
             }
         ]
    }
mydata={'data':json.dumps(data)}

### Submit Advanced Job

In [None]:
files = {'file': open(get_tar_file(), 'rb')}
print("Files : {}".format(files))
hpo_job_id = submit_job(mydata,files,myauth())


In [None]:
job_status = query_job_status(hpo_job_id,refresh_rate=15)

### Best Result 

In [None]:
print('Hpo task %s completes with state %s'%(hpo_job_id, job_status.json()['state']))
print("Best HPO result ...")
job_status.json()["best"]