# Automated Hyperparameter Optimization Training using WMLA API


In this notebook, you will learn how to submit a model and dataset to the Watson Machine Learning Accelerator (WMLA) API to run hyperparameter optimization (HPO). This example uses the PyTorch MNIST HPO model as the training model. You will inject hyperparameters into each sub-training run during the search, report a tuning metric that the search optimizes, and then query for the best job results. This notebook runs on Python 3.6.


![options](https://github.com/IBM/wmla-learning-path/raw/master/shared-images/WMLA-RestAPI-Demo.png)



![SpectrumComputeFamily_Conductor-HorizontalColorWhite.png](https://raw.githubusercontent.com/IBM/wmla-learning-path/master/shared-images/hpo.png)


For this notebook you will use a model and dataset that have already been set up to leverage the API.  For details on the API see [API Documentation](https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1/cm/deeplearning.html) in the Knowledge Center (KC).

## Table of contents

1. [Setup](#setup)<br>

2. [Configuring environment and project details](#configure)<br>

3. [Health Check](#health)<br>

4. [Training with the HPO API](#train)<br>

5. [Deploy the HPO task](#deploy)<br>

6. [See best job results](#best)<br>

<a id = "setup"></a>
## Step 1: Setup

First, download the PyTorch MNIST HPO model package that will be submitted for tuning.

In [2]:
!wget https://github.com/IBM/wmla-learning-path/raw/master/datasets/pytorch-mnist-hpo.modelDir.tar

--2020-07-20 21:57:31--  https://github.com/IBM/wmla-learning-path/raw/master/datasets/pytorch-mnist-hpo.modelDir.tar
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/IBM/wmla-learning-path/master/datasets/pytorch-mnist-hpo.modelDir.tar [following]
--2020-07-20 21:57:31--  https://raw.githubusercontent.com/IBM/wmla-learning-path/master/datasets/pytorch-mnist-hpo.modelDir.tar
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.36.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.36.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2587 (2.5K) [application/octet-stream]
Saving to: ‘pytorch-mnist-hpo.modelDir.tar’


2020-07-20 21:57:32 (20.5 MB/s) - ‘pytorch-mnist-hpo.modelDir.tar’ saved [2587/2587]
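
Before submitting the archive, it can help to confirm what it contains. The following is a minimal sketch that uses only the standard `tarfile` module; the helper name `list_model_archive` is an illustration and is not part of the WMLA tooling.

```python
import tarfile


def list_model_archive(tar_path):
    """Return the member names stored in a model tar archive."""
    with tarfile.open(tar_path, "r") as tar:
        return tar.getnames()


# Example call, assuming the file downloaded above is in the current directory:
# print(list_model_archive("pytorch-mnist-hpo.modelDir.tar"))
```

The listing should show the training script named in the `--model-main` argument used later in this notebook.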



To use the WMLA API, we will be using the Python requests library.

In [2]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

import json
import time
import urllib
import pandas as pd
import os,sys
import tarfile
import tempfile
from IPython.display import clear_output, display
import pprint

# utility print function
def nprint(mystring) :
    print("**{}** : {}".format(sys._getframe(1).f_code.co_name,mystring))

# utility makedir
def makeDirIfNotExist(directory) :
    if not os.path.exists(directory):  
        nprint("Making directory {}".format(directory))
        os.makedirs(directory) 
    else :
        nprint("Directory {} already exists .. ".format(directory))


<a id = "configure"></a>
## Step 2: Configuring environment and project details

Provide your credentials in this cell, including your cluster URL, username, password, and instance group.

In [3]:
def getconfig(cfg_in={}):
    cfg = {}
    cfg["master_host"] = '' # <=enter your host url here
    cfg["dli_rest_port"] = ''
    cfg["sc_rest_port"] = ''
    cfg["num_images"] = {"train":200,"valid":20,"test":20}
    # ==== CLASS ENTER User login details below =====
    cfg["wmla_user"] = ''  # <=enter your id here
    cfg["wmla_pwd"] = ''  # <=enter your pwd here
    # ==== CLASS ENTER User login details above =====
    cfg["sig_name"]  = ''   # <=enter instance group here
    cfg["code_dir"] = "/home/wsuser/works/pytorch_hpo"

    # overwrite configs if passed
    for (k,v) in cfg_in.items() :
        nprint("Overriding Config {}:{} with {}".format(k,cfg.get(k),v))
        cfg[k] = v
    return cfg

# cfg is used as a global variable throughout this notebook
cfg=getconfig()
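
If you prefer not to type credentials into the notebook, one option is to pass them to `getconfig` through its `cfg_in` argument. The sketch below collects overrides from environment variables; the variable names (`WMLA_HOST`, `WMLA_USER`, `WMLA_PWD`, `WMLA_SIG`) are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical environment variable names mapped to the cfg keys used above.
ENV_TO_CFG = {
    "WMLA_HOST": "master_host",
    "WMLA_USER": "wmla_user",
    "WMLA_PWD": "wmla_pwd",
    "WMLA_SIG": "sig_name",
}


def overrides_from_env(environ=None):
    """Collect config overrides from environment variables that are set and non-empty."""
    environ = os.environ if environ is None else environ
    return {
        cfg_key: environ[env_key]
        for env_key, cfg_key in ENV_TO_CFG.items()
        if environ.get(env_key)
    }


# Usage with the function defined above:
# cfg = getconfig(overrides_from_env())
```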

Here we get and print the API endpoints and set up a requests session. The following sections use the Watson ML Accelerator API to complete the various tasks required. We give examples of a number of tasks, but you should refer to the documentation below for more details on what is possible and the output you can expect:

- https://www.ibm.com/support/knowledgecenter/SSFHA8_1.2.2/cm/deeplearning.html
- https://www.ibm.com/support/knowledgecenter/SSZU2E_2.4.1/reference_s/api_references.html



In [4]:
# REST call variables
commonHeaders = {'Accept': 'application/json'}

# Use closures for cfg for now ..
def get_tmp_dir() :
    #return "/gpfs/home/s4s004/"+cfg["wmla_user"]+"/2020-05-wmla/tmp"
    return "/home/wsuser/work"
def get_tar_file() :
    #return get_tmp_dir() + "/" + cfg["wmla_user"]+".modelDir.tar"
    return get_tmp_dir() + "/pytorch-mnist-hpo.modelDir.tar"

#get api endpoint
def get_ep(mode="sc") :
    if mode=="sc" :
        sc_rest_url =  cfg["master_host"] +':'+ cfg["sc_rest_port"] +'/platform/rest/conductor/v1'
        return sc_rest_url
    elif(mode=="dl") :
        dl_rest_url = cfg["master_host"] +':'+cfg["dli_rest_port"] +'/platform/rest/deeplearning/v1'
        return dl_rest_url
    else :
        nprint("Error mode : {} not supported".format(mode))

def myauth():
    return(cfg["wmla_user"],cfg["wmla_pwd"])

print ("SC API Endpoints : {}".format(get_ep("sc")))
print ("DL API Endpoints : {}".format(get_ep("dl")))
print (myauth())
print (get_tar_file())
#myauth = (wmla_user, wmla_pwd)

# Setup Requests session
req = requests.Session()

SC API Endpoints : https://dse-ac922h.cpolab.ibm.com:8643/platform/rest/conductor/v1
DL API Endpoints : https://dse-ac922h.cpolab.ibm.com:9243/platform/rest/deeplearning/v1
('Admin', 'Admin')
/home/wsuser/work/pytorch-mnist-hpo.modelDir.tar


<a id = "health"></a>
## Step 3: Health Check

In this step, we will check if there are any existing HPO tasks and also verify the platform health.

REST API: `GET platform/rest/deeplearning/v1/hypersearch`
- `Description`: Get all the HPO tasks that the logged-in user can access.
- `OUTPUT`: A list of HPO tasks, each in the format described in the API documentation.

In [5]:
def hpo_health_check():
    getTuneStatusUrl = get_ep("dl") + '/hypersearch'
    nprint ('getTuneStatusUrl: %s' %getTuneStatusUrl)
    r = req.get(getTuneStatusUrl, headers=commonHeaders, verify=False, auth=myauth())
    
    if not r.ok:
        nprint('check hpo task status failed: code=%s, %s'%(r.status_code, r.content))
    else:
        if len(r.json()) == 0:
            nprint('No HPO task has been created')
        for item in r.json():
            nprint('Hpo task: %s, State: %s'%(item['hpoName'], item['state']))
            #print('Best:%s'%json.dumps(item.get('best'), sort_keys=True, indent=4))

hpo_health_check()


**hpo_health_check** : getTuneStatusUrl: https://dse-ac922h.cpolab.ibm.com:9243/platform/rest/deeplearning/v1/hypersearch
**hpo_health_check** : Hpo task: Admin-hpo-16650377032674085, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-16652311451466531, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-16709092768652198, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-16783740796460860, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-16797156948888025, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-16899927798480429, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-16901184356247510, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-17306812964985375, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-17310278053357251, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-17550982147791472, State: FINISHED
**hpo_health_check** : Hpo task: Admin-hpo-17555313438664566, State: FINISHED
**hpo_health_check**
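
When many tasks are listed, a count per state is often easier to read than one line per task. The sketch below works on the parsed JSON list returned by the `/hypersearch` endpoint and relies only on the `state` field shown in the output above; the sample task names are made up for illustration.

```python
from collections import Counter


def summarize_states(tasks):
    """Count HPO tasks per state from the parsed /hypersearch response list."""
    return Counter(item["state"] for item in tasks)


# Illustrative data only; in the notebook you could pass r.json() instead.
sample_tasks = [
    {"hpoName": "example-hpo-1", "state": "FINISHED"},
    {"hpoName": "example-hpo-2", "state": "FINISHED"},
    {"hpoName": "example-hpo-3", "state": "RUNNING"},
]
print(dict(summarize_states(sample_tasks)))
```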

<a id = "train"></a>
## Step 4: Training with the HPO API


The WMLA framework requires 2 changes to your code to support the HPO API, and these are:

* Inject hyperparameters for the sub-training during search
* Retrieve sub-training result metric

Note that the code sections below show a comparison between the "before" and "HPO enabled" versions of the code by using `diff`.


1. Import the dependent libraries:

&nbsp;
&nbsp;
![image1](https://github.com/IBM/wmla-learning-path/raw/dev/shared-images/hpo_update_model_0.png)
&nbsp;
&nbsp;

2. Get the WMLA cluster `DLI_DATA_FS`, `RESULT_DIR` and `LOG_DIR` for the HPO training job. The `DLI_DATA_FS` can be used for shared data placement, the `RESULT_DIR` can be used for final model saving, and the `LOG_DIR` can be used for user logs and monitoring.

&nbsp;
**Note**: `DLI_DATA_FS` is set when installing the DLI cluster; `RESULT_DIR` and `LOG_DIR` are generated by WMLA for each HPO experiment.

&nbsp;
&nbsp;
![image1](https://github.com/IBM/wmla-learning-path/raw/dev/shared-images/hpo_update_model_1.png)
&nbsp;
&nbsp;

3. Replace the hyperparameter definition code with code that reads hyperparameters from the `config.json` file. The `config.json` file is generated by WMLA HPO and contains a set of hyperparameter candidates for each tuning job. The hyperparameters and the search space are defined when submitting the HPO task. For example, here the hyperparameter `learning_rate` is the one being tuned:

&nbsp;
&nbsp;
![image2](https://github.com/IBM/wmla-learning-path/raw/dev/shared-images/hpo_update_model_2.png)

&nbsp;
You can then use the hyperparameter read from `config.json` wherever it is needed:
&nbsp;
![image2](https://github.com/IBM/wmla-learning-path/raw/dev/shared-images/hpo_update_model_2_2.png)
&nbsp;
&nbsp;

4.  Write the tuning result into `val_dict_list.json` under `RESULT_DIR`. WMLA HPO reads this file for each tuning job to get the metric values. Define a `test_metrics` list to store all metric values and pass the epoch parameter to the test function; you can then append the metric values to `test_metrics` during the test phase of training. Note that the metric names must be specified when submitting the HPO task and must be consistent with the code here.
&nbsp;
For example, in the HPO task submit request below, `loss` is the objective metric, and the tuning tries to minimize it:

```
'algoDef': # Define the parameters for search algorithms  
{
    # Name of the search algorithm, one of Random, Bayesian, Tpe, Hyperband  
    'algorithm': 'Random',   
    # Name of the target metric to optimize when searching hyperparameters.
    # It must match the metric name that the training code writes to val_dict_list.json.
    'objectiveMetric' : 'loss',
    # How to optimize the objective metric: minimize searches for hyperparameters that make
    # objectiveMetric as small as possible; maximize does the opposite.
    'objective' : 'minimize',
    ...
}
```
&nbsp;
The code change:

&nbsp;
&nbsp;
![image2](https://github.com/IBM/wmla-learning-path/raw/dev/shared-images/hpo_update_model_3.png)
&nbsp;
&nbsp;

5. After the training completes, write the metric list into the `val_dict_list.json` file. 
&nbsp;
&nbsp;
![image2](https://github.com/IBM/wmla-learning-path/raw/dev/shared-images/hpo_update_model_5.png)
&nbsp;
&nbsp;
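
Because the code changes above are shown only as images, here is a minimal text sketch of steps 2 to 5. It is not the exact code from the screenshots. It assumes that `config.json` is a flat JSON object keyed by hyperparameter name and readable from the working directory, that falling back to `"."` is acceptable when a variable is unset, and that each record in `val_dict_list.json` carries a `step` key next to the metric values. Check all three assumptions against the API documentation.

```python
import json
import os


def get_wmla_dirs(environ=None):
    """Read the directories WMLA exposes to the training job via environment variables."""
    environ = os.environ if environ is None else environ
    return {
        "data_dir": environ.get("DLI_DATA_FS", "."),   # shared data placement
        "result_dir": environ.get("RESULT_DIR", "."),  # final model and metric output
        "log_dir": environ.get("LOG_DIR", "."),        # user logs and monitoring
    }


def load_learning_rate(config_path="config.json", default=0.01):
    """Read learning_rate from the config.json generated for this tuning job.

    The default value is an assumption used only when the key is missing.
    """
    with open(config_path) as f:
        hyper_params = json.load(f)
    return float(hyper_params.get("learning_rate", default))


def write_metrics(test_metrics, result_dir):
    """Write (epoch, {metric_name: value}) pairs to val_dict_list.json.

    The "step" key is an assumed record layout, not confirmed by this notebook.
    """
    records = []
    for epoch, metrics in test_metrics:
        record = {"step": epoch}
        record.update(metrics)
        records.append(record)
    out_path = os.path.join(result_dir, "val_dict_list.json")
    with open(out_path, "w") as f:
        json.dump(records, f)
    return out_path


# Inside a training script this could be used as:
# dirs = get_wmla_dirs()
# lr = load_learning_rate()
# test_metrics = []
# ... after each epoch: test_metrics.append((epoch, {"loss": float(test_loss)}))
# write_metrics(test_metrics, dirs["result_dir"])
```

The metric name used in the records (`loss` here) has to match the `objectiveMetric` given in the submit request.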



<a id = "deploy"></a>
## Step 5: Deploy the HPO task

Here we package up our model to send to the API for HPO.  



REST API: `POST /platform/rest/deeplearning/v1/hypersearch`

- Description: Start a new HPO task
- Content-type: multipart/form-data
- Form data:
  - files: Model files tar package, ending with `.modelDir.tar`
  - form field: `{'data': '<hpo_input as a string>'}`, where **hpo_input** holds the input parameters for the HPO task; its specification is shown later


#### Package model files for training

Package the updated model files into a tar file whose name ends with `.modelDir.tar`. The REST API expects this archive to contain the model code.
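
If you change the model code, the archive has to be rebuilt before submitting. The following is a minimal sketch using the standard `tarfile` module. It assumes the archive should contain a single top-level folder named after the model directory (for example `pytorch_hpo`, matching the `--model-dir` argument used below); that layout is an assumption, so compare it with the archive you downloaded. The helper name `build_model_tar` is illustrative.

```python
import os
import tarfile


def build_model_tar(model_dir, tar_path):
    """Pack model_dir into tar_path, keeping the directory's base name as the top-level folder."""
    if not tar_path.endswith(".modelDir.tar"):
        raise ValueError("archive name must end with .modelDir.tar")
    arcname = os.path.basename(os.path.normpath(model_dir))
    with tarfile.open(tar_path, "w") as tar:
        tar.add(model_dir, arcname=arcname)
    return tar_path


# Example with the paths used in this notebook:
# build_model_tar(cfg["code_dir"], get_tar_file())
```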


In [6]:
files = {'file': open(get_tar_file(), 'rb')}
print("Files : {}".format(files))

Files : {'file': <_io.BufferedReader name='/home/wsuser/work/pytorch-mnist-hpo.modelDir.tar'>}


#### Construct POST request data

**hpo_input** is a Python `dict` (JSON-compatible), as shown below, and is converted to a string with `json.dumps` when calling the REST API.

In [7]:
# Note, this 
data =  {
        'modelSpec': # Define the model training related parameters
        {
            # Spark instance group used to run the HPO sub-trainings. The Spark instance group selected
            # here should match the sub-training args; for example, if the sub-training args run an EDT job,
            # then specify a Spark instance group that is capable of running EDT jobs.
            'sigName': cfg["sig_name"],

            # These are the arguments we'll pass to the execution engine; they follow the same conventions
            # of the dlicmd.py command line launcher
            #
            # See:
            #   https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1/cm/dlicmd.html
            # In this example, args after --model-dir are all the required parameter for the original model itself.
            #
            'args': '--exec-start PyTorch --cs-datastore-meta type=fs --python-version 3.6\
                     --gpuPerWorker 1 --model-main pytorch_mnist_HPO.py --model-dir pytorch_hpo\
                     --debug-level debug'
                
        },
    
        'algoDef': # Define the parameters for search algorithms
        {
            # Name of the search algorithm, one of Random, Bayesian, Tpe, Hyperband, ExperimentGridSearch
            'algorithm': 'Random', 
            # Maximum running time of the HPO task in minutes; -1 means unlimited.
            'maxRunTime': 60,
            # Maximum number of training jobs to submit for the HPO task; -1 means unlimited.
            'maxJobNum': 4,
            # Maximum number of training jobs to run in parallel (default 1). It depends on both the
            # available resources and whether the search algorithm supports parallel runs. Currently only
            # Random fully supports running in parallel; Hyperband and Tpe support it in some phases,
            # and Bayesian runs sequentially.
            'maxParalleJobNum': 4,
            # Name of the target metric to optimize when searching hyperparameters.
            # It must match the metric name that the training code writes to val_dict_list.json.
            'objectiveMetric' : 'loss',
            # How to optimize the objective metric: minimize searches for hyperparameters that make
            # objectiveMetric as small as possible; maximize does the opposite.
            'objective' : 'minimize',
        },
    
        # Define the hyperparameters to search and the corresponding search space.
        'hyperParams':
        [
             {
                 # Hyperparameter name, which will be the hyper-parameter key in config.json
                 'name': 'learning_rate',
                 # One of Range, Discrete
                 'type': 'Range',
                 # one of int, double, str
                 'dataType': 'DOUBLE',
                 # lower bound and upper bound when type=range and dataType=double
                 'minDbVal': 0.001,
                 'maxDbVal': 0.1,
                 # lower bound and upper bound when type=range and dataType=int
                 'minIntVal': 0,
                 'maxIntVal': 0,
                 # Discrete value list when type=discrete
                 'discreteDbVal': [],
                 'discreteIntVal': [],
                 'discreateStrVal': []
                 #step size to split the Range space. ONLY valid when type is Range
                 #'step': '0.002',
             }
         ]
    }
mydata={'data':json.dumps(data)}
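
Writing each `hyperParams` entry by hand is repetitive, so a small helper can fill in the unused fields. The sketch below reuses exactly the keys from the cell above (including the `discreateStrVal` spelling). The `Discrete` type with `dataType` set to `'INT'` follows the comments in that cell but is not exercised in this notebook, so treat it as an assumption; `batch_size` is a hypothetical hyperparameter.

```python
def _base_param(name, param_type, data_type):
    """Return a hyperParams entry with every field from the request above set to a neutral value."""
    return {
        "name": name,
        "type": param_type,
        "dataType": data_type,
        "minDbVal": 0,
        "maxDbVal": 0,
        "minIntVal": 0,
        "maxIntVal": 0,
        "discreteDbVal": [],
        "discreteIntVal": [],
        "discreateStrVal": [],
    }


def range_double_param(name, low, high):
    """Build a Range/DOUBLE entry such as the learning_rate example above."""
    if low >= high:
        raise ValueError("low must be smaller than high")
    param = _base_param(name, "Range", "DOUBLE")
    param.update({"minDbVal": low, "maxDbVal": high})
    return param


def discrete_int_param(name, values):
    """Build a Discrete/INT entry (assumed spelling of the type and dataType values)."""
    param = _base_param(name, "Discrete", "INT")
    param["discreteIntVal"] = list(values)
    return param


hyper_params = [
    range_double_param("learning_rate", 0.001, 0.1),
    discrete_int_param("batch_size", [32, 64, 128]),  # hypothetical hyperparameter
]
```

The resulting list could be assigned to `data['hyperParams']` before calling `json.dumps(data)`.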

#### Submit the POST request

Submit the HPO task through the POST call; the HPO name/ID is returned as a string.

<div class="alert alert-block alert-warning">Note: This cannot be submitted twice. You need to rebuild the tar file prior to resubmitting.</div>

In [8]:
def submit_job():
    startTuneUrl=get_ep('dl') + '/hypersearch'
    nprint("startTuneUrl : {}".format(startTuneUrl))
    nprint("files : {}".format(files))
    nprint("myauth() : {}".format(myauth()))
    #print("hpo_job_id : {}".format(hpo_job_id))
    r = req.post(startTuneUrl, headers=commonHeaders, data=mydata, files=files, verify=False, auth=myauth())
    hpo_name=None
    if r.ok:
        hpo_name = r.json()
        print ('\nModel submitted successfully: {}'.format(hpo_name))
        
    else:
        print('\nModel submission failed with code={}, {}'. format(r.status_code, r.content))
    return hpo_name

hpo_job_id = submit_job()
print("hpo_job_id : {}".format(hpo_job_id))

**submit_job** : startTuneUrl : https://dse-ac922h.cpolab.ibm.com:9243/platform/rest/deeplearning/v1/hypersearch
**submit_job** : files : {'file': <_io.BufferedReader name='/home/wsuser/work/pytorch-mnist-hpo.modelDir.tar'>}
**submit_job** : myauth() : ('Admin', 'Admin')

Model submitted successfully: Admin-hpo-20018747584473295
hpo_job_id : Admin-hpo-20018747584473295


Poll the task until it leaves the `SUBMITTED` and `RUNNING` states, printing the task details on each refresh.

In [9]:
getHpoUrl = get_ep('dl') +'/hypersearch/'+ hpo_job_id
pp = pprint.PrettyPrinter(indent=2)

keep_running=True
rr=10
res=None
while(keep_running):
    res = req.get(getHpoUrl, headers=commonHeaders, verify=False, auth=myauth())
    experiments=res.json()['experiments']
    experiments = pd.DataFrame.from_dict(experiments)
    pd.set_option('max_colwidth', 120)
    clear_output()
    print("Refreshing every {} seconds".format(rr))
    display(experiments)
    pp.pprint(res.json())
    if(res.json()['state'] not in ['SUBMITTED','RUNNING']) :
        keep_running=False
    time.sleep(rr)

Refreshing every 10 seconds


| | appId | driverId | endTime | hyperParams | id | maxiteration | metricVal | startTime | state |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Admin-20018749948078432-1163435929 | driver-20200709235741-0089-30bceca0-f9d1-481c-970f-cd94f4a3dd86 | 2020-07-10 00:02:08 | [{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.014312814975199907'}] | 0 | 0 | 0.12367 | 2020-07-09 23:57:41 | FINISHED |
| 1 | Admin-20018755727618776-628308033 | driver-20200709235746-0090-2d2eff7b-73f3-4c9e-8f89-f920339820c1 | 2020-07-10 00:02:07 | [{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.04215576043618912'}] | 1 | 0 | 0.059649 | 2020-07-09 23:57:46 | FINISHED |
| 2 | Admin-20018761570535505-1510995259 | driver-20200709235752-0091-5552a582-936c-430f-9cff-a880c77bf997 | 2020-07-10 00:02:07 | [{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.032785699293517295'}] | 2 | 0 | 0.06887 | 2020-07-09 23:57:52 | FINISHED |
| 3 | Admin-20018767398045674-2139572274 | driver-20200709235758-0092-d1f7f277-6e79-4505-99ad-efe228c7a14b | 2020-07-10 00:02:08 | [{'name': 'learning_rate', 'dataType': 'double', 'userDefined': False, 'fixedVal': '0.0324267291993964'}] | 3 | 0 | 0.06981 | 2020-07-09 23:57:58 | FINISHED |


{ 'best': { 'appId': 'Admin-20018755727618776-628308033',
            'driverId': 'driver-20200709235746-0090-2d2eff7b-73f3-4c9e-8f89-f920339820c1',
            'endTime': '2020-07-10 00:02:07',
            'hyperParams': [ { 'dataType': 'double',
                               'fixedVal': '0.04215576043618912',
                               'name': 'learning_rate',
                               'userDefined': False}],
            'id': 1,
            'maxiteration': 0,
            'metricVal': 0.05964925270080566,
            'startTime': '2020-07-09 23:57:46',
            'state': 'FINISHED'},
  'complete': 4,
  'createtime': '2020-07-09 23:57:38',
  'creator': 'Admin',
  'duration': '00:04:31',
  'experiments': [ { 'appId': 'Admin-20018749948078432-1163435929',
                     'driverId': 'driver-20200709235741-0089-30bceca0-f9d1-481c-970f-cd94f4a3dd86',
                     'endTime': '2020-07-10 00:02:08',
                     'hyperParams': [ { 'dataType': 'double',
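
The polling loop in the cell above runs until the task leaves the `SUBMITTED` and `RUNNING` states, with no upper bound on the wait. The sketch below shows one way to add a timeout. It takes any callable that returns the parsed task JSON, so it does not depend on the network; `wait_for_hpo` and its parameters are illustrative names, not part of the WMLA API.

```python
import time

ACTIVE_STATES = ("SUBMITTED", "RUNNING")


def wait_for_hpo(fetch_status, interval=10, timeout=3600,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the task leaves the active states or the timeout expires.

    fetch_status must return a dict with a 'state' key, like the task JSON above.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["state"] not in ACTIVE_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError("HPO task still {} after {} seconds".format(
                status["state"], timeout))
        sleep(interval)


# In this notebook, fetch_status could be:
# lambda: req.get(getHpoUrl, headers=commonHeaders, verify=False, auth=myauth()).json()
```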
      

<a id = "best"></a>
## Step 6: See best job results

In [10]:
# Let's query our result to see what happened during HPO training!

#res.ok
#res.json()
#print(type(res))
#print(dir(res))
#print(json.dumps(res.json(), indent=4, sort_keys=True))
        
print('Hpo task %s completes with state %s'%(hpo_job_id, res.json()['state']))
print("Best HPO result ...")
res.json()["best"]


Hpo task Admin-hpo-20018747584473295 completes with state FINISHED
Best HPO result ...


{'id': 1,
 'hyperParams': [{'name': 'learning_rate',
   'dataType': 'double',
   'userDefined': False,
   'fixedVal': '0.04215576043618912'}],
 'state': 'FINISHED',
 'metricVal': 0.05964925270080566,
 'maxiteration': 0,
 'appId': 'Admin-20018755727618776-628308033',
 'driverId': 'driver-20200709235746-0090-2d2eff7b-73f3-4c9e-8f89-f920339820c1',
 'startTime': '2020-07-09 23:57:46',
 'endTime': '2020-07-10 00:02:07'}
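
The `best` entry gives only the winning experiment. To compare all runs, the `experiments` list from the same response can be ranked by `metricVal`. The sketch below uses only fields that appear in the output above; the two sample records are abbreviated copies of experiments 0 and 1, and the helper names are illustrative.

```python
def rank_experiments(experiments, minimize=True):
    """Return FINISHED experiments sorted from best to worst by metricVal."""
    finished = [e for e in experiments if e.get("state") == "FINISHED"]
    return sorted(finished, key=lambda e: e["metricVal"], reverse=not minimize)


def hyperparams_as_dict(experiment):
    """Flatten the hyperParams list of one experiment into {name: fixedVal}."""
    return {hp["name"]: hp["fixedVal"] for hp in experiment["hyperParams"]}


# Abbreviated copies of two experiments from the output above.
sample_experiments = [
    {"id": 0, "state": "FINISHED", "metricVal": 0.12367,
     "hyperParams": [{"name": "learning_rate", "fixedVal": "0.014312814975199907"}]},
    {"id": 1, "state": "FINISHED", "metricVal": 0.059649,
     "hyperParams": [{"name": "learning_rate", "fixedVal": "0.04215576043618912"}]},
]

for exp in rank_experiments(sample_experiments):
    print(exp["id"], exp["metricVal"], hyperparams_as_dict(exp))

# In the notebook: rank_experiments(res.json()["experiments"])
```

Pass `minimize=False` when the submit request uses `'objective': 'maximize'`.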

#### Notebook Complete 
Congratulations, you have completed our demonstration of using WMLA for distributed hyperparameter optimization.

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
