# GPU Accelerated Elastic Deep Learning Service in Cloud Pak for Data

#### Notebook created by Kelvin Lui, Xue Yin Zhuang, Xue Zhou Yuan (January 2021)

Watson Machine Learning Accelerator in Cloud Pak for Data offers the GPU Accelerated Elastic Deep Learning service. This service lets multiple data scientists accelerate deep learning model training across multiple GPUs and servers and share GPUs dynamically, which improves both data scientist productivity and overall GPU utilization.

In this notebook, you will learn how to scale a PyTorch model across multiple GPUs with the GPU Accelerated Elastic Deep Learning service, monitor the running job, and debug any issues that arise.

This notebook uses Watson Machine Learning Accelerator 2.2 with Cloud Pak for Data 3.5.

### Contents

- [The big picture](#The-big-picture)
- [Changes to your code](#Changes-to-your-code)
- [Set up API end point and log on](#Set-up-API-end-point-and-log-on)
- [Submit job via API](#Submit-job-via-API)
- [Monitor running job](#Monitor-running-job)
- [Training metrics and logs](#Training-metrics-and-logs)
- [Download trained model](#Download-trained-model-from-Watson-Machine-Learning-Accelerator)
- [Further information and useful links](#Further-information-and-useful-links)
- [Appendix](#Appendix)




## The big picture
[Back to top](#Contents)

This notebook details the process of taking your PyTorch model and making the changes required to train it using the [IBM Watson Machine Learning GPU Accelerated Elastic Deep Learning service](https://developer.ibm.com/series/learning-path-get-started-with-watson-machine-learning-accelerator/) (WML Accelerator).


The image below shows the various elements required to use Elastic Deep Learning Service. In this notebook we will step through each of these elements in more detail. Through this process you will offload your code to a WML Accelerator cluster, monitor the running job, retrieve the output and debug any issues seen. A [static version](https://github.com/IBM/wmla-assets/raw/master/WMLA-learning-journey/shared-images/5_running_job.png) is also available.

![overall](https://github.com/IBM/wmla-assets/raw/master/WMLA-learning-journey/shared-images/5_running_job.gif)

## Changes to your code
[Back to top](#Contents)

In this section we take the PyTorch ResNet50 model and make the changes needed to use it with the elastic distributed training engine (EDT). An overview of these changes can be seen in the diagram below. A [static version](https://github.com/IBM/wmla-assets/raw/master/WMLA-learning-journey/shared-images/2_code_adaptations.png) is also available.

![code](https://github.com/IBM/wmla-assets/raw/master/WMLA-learning-journey/shared-images/2_code_adaptations.gif)



The key changes to your code in order to use elastic distributed training are the following:
- Importing libraries and setting up environment variables
- Data loading function for elastic distributed training
- Extract parameters for training
- Replace training and testing loops with the loop equivalents for elastic distributed training
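
The "extract parameters for training" step can be sketched in isolation. The snippet below is a minimal illustration, assuming the job passes `--numWorker` and optionally `--gpuPerWorker` on the command line alongside other flags that the training script does not know about. The `argv` parameter is added here only so the function can be exercised without a real job.

```python
import argparse


def get_max_worker(argv=None):
    """Return numWorker * gpuPerWorker, ignoring any unrelated flags."""
    parser = argparse.ArgumentParser(description="EDT argument sketch")
    parser.add_argument("--numWorker", type=int, default=16)
    parser.add_argument("--gpuPerWorker", type=int, default=1)
    # parse_known_args tolerates flags that belong to the job launcher
    args, _unknown = parser.parse_known_args(argv)
    return args.numWorker * args.gpuPerWorker


# Flags such as --model-main are not declared above and are simply skipped
print(get_max_worker(["--numWorker", "2", "--model-main", "elastic-main.py"]))
```

Using `parse_known_args` rather than `parse_args` matters here because the same command line carries options meant for the scheduler, which would otherwise cause the parser to exit with an error.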

For the purpose of this tutorial we train a ResNet50 model with Elastic Distributed Training (EDT).

For a more detailed explanation of the above changes, see the blog associated with this notebook:
https://developer.ibm.com/articles/elastic-distributed-training-edt-in-watson-machine-learning-accelerator/

For more information about the Elastic Distributed Training API, see the [IBM Documentation](https://www.ibm.com/docs/en/wmla/2.2.0?topic=SSFHA8_2.2.0/wmla_workloads_elastic_distributed_training.html).



Your modified code should be made available in a directory which also contains the EDT helper scripts: `edtcallback.py`, `emetrics.py` and `elog.py`. 



## Define helper methods
Define the required helper methods. 



In [17]:
# import tarfile
import tempfile
import os
import json
import pprint
import pandas as pd
from IPython.display import display, FileLink, clear_output

import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

from matplotlib import pyplot as plt
%pylab inline

import base64
import json
import time
import urllib
import tarfile


def query_job_status(job_id,refresh_rate=3) :

    execURL = dl_rest_url  +'/execs/'+ job_id['id']
    pp = pprint.PrettyPrinter(indent=2)

    keep_running=True
    res=None
    while(keep_running):
        res = req.get(execURL, headers=commonHeaders, verify=False)
        monitoring = pd.DataFrame(res.json(), index=[0])
        pd.set_option('max_colwidth', 120)
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(res.json())
        if(res.json()['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED','RUNNING']) :
            keep_running=False
        time.sleep(refresh_rate)
    return res

def query_executor_stdout_log(job_id) :

    # Fetch the last 1000 lines of stdout from executor 1 of the job
    execURL = dl_rest_url  +'/scheduler/applications/'+ job_id['id'] + '/executor/1/logs/stdout?lastlines=1000'
    commonHeaders2={'accept': 'text/plain', 'X-Auth-Token': access_token}
    print (execURL)
    res = req.get(execURL, headers=commonHeaders2, verify=False)
    print(res.text)
    
    
def query_train_metric(job_id) :

    # Fetch the training log of the job
    execURL = dl_rest_url  +'/execs/'+ job_id['id'] + '/log'
    commonHeaders2={'accept': 'text/plain', 'X-Auth-Token': access_token}
    print (execURL)
    res = req.get(execURL, headers=commonHeaders2, verify=False)
    print(res.text)

def download_trained_model(job_id) :

    # Request the result archive of the given job as a binary stream
    commonHeaders3={'accept': 'application/octet-stream', 'X-Auth-Token': access_token}
    execURL = dl_rest_url  +'/execs/'+ job_id['id'] + '/result'
    res = req.get(execURL, headers=commonHeaders3, verify=False, stream=True)
    print (execURL)

    # Save the archive next to the model files; the with block closes the file
    tmpfile = model_dir + '/' + job_id['id'] +'.zip'
    print ('Save model: ', tmpfile )
    with open(tmpfile,'wb') as f:
        f.write(res.content)

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

Populating the interactive namespace from numpy and matplotlib
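
The polling pattern used by `query_job_status` can be sketched independently of the REST API. This is a minimal illustration rather than the notebook's implementation: `fetch_state` is a hypothetical callable standing in for the `GET /execs/<id>` request, and the `max_polls` and `sleep` parameters are additions that make the loop bounded and testable.

```python
import time

# States in which the job is still in progress (taken from query_job_status)
ACTIVE_STATES = {"PENDING_CRD_SCHEDULER", "SUBMITTED", "RUNNING"}


def poll_until_done(fetch_state, refresh_rate=3.0, max_polls=None, sleep=time.sleep):
    """Call fetch_state() until it returns a state outside ACTIVE_STATES."""
    polls = 0
    while True:
        state = fetch_state()
        polls += 1
        if state not in ACTIVE_STATES:
            return state, polls
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("job still {} after {} polls".format(state, polls))
        sleep(refresh_rate)


# Simulated sequence of states instead of a live cluster
states = iter(["SUBMITTED", "RUNNING", "RUNNING", "FINISHED"])
print(poll_until_done(lambda: next(states), refresh_rate=0, sleep=lambda _s: None))
```

Returning as soon as a terminal state is seen avoids one unnecessary sleep at the end of the loop, and the optional poll limit guards against a job that never leaves the queue.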


In [2]:
import os
model_dir = './resnet-wmla'
model_main = 'elastic-main.py'
model_callback = 'edtcallback.py'
model_elog = 'elog.py'

os.makedirs(model_dir, exist_ok=True)

### ResNet50 model: elastic-main.py
This is the main file required by the elastic distributed training engine. It serves as the program's entry point.



In [3]:
%%writefile {model_dir}/{model_main}
#!/usr/bin/env python

from __future__ import print_function
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import transforms, models
from callbacks import Callback
from fabric_model import FabricModel
from edtcallback import EDTLoggerCallback
import torch
import os


## Define model and extract training parameters
def get_max_worker():
    import argparse
    parser = argparse.ArgumentParser(description='EDT Example')
    parser.add_argument('--numWorker', type=int, default=16, help='maximum number of workers')
    parser.add_argument('--gpuPerWorker', type=int, default=1, help='number of GPUs per worker')
    # parse_known_args ignores the other flags passed on the job command line
    args, _ = parser.parse_known_args()
    num_worker = args.numWorker * args.gpuPerWorker
    print ('args.numWorker: ', args.numWorker , 'args.gpuPerWorker: ', args.gpuPerWorker)
    return num_worker

BATCH_SIZE_PER_DEVICE = 64
NUM_EPOCHS = 3
MAX_NUM_WORKERS = get_max_worker()
START_LEARNING_RATE = 0.4
LR_STEP_SIZE = 30
LR_GAMMA = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4

## Define dataset location 
DATA_DIR = os.getenv("DATA_DIR")
if DATA_DIR is None:
    DATA_DIR = '/tmp'
print("DATA_DIR: " + DATA_DIR)
TRAIN_DATA = DATA_DIR + "/cifar10"
TEST_DATA = DATA_DIR + "/cifar10"


## Callback that multiplies the learning rate by gamma every step_size epochs
class LRScheduleCallback(Callback):
    def __init__(self, step_size, gamma):
        super(LRScheduleCallback, self).__init__()
        self.step_size = step_size
        self.gamma = gamma

    def on_epoch_begin(self, epoch):
        if (epoch != 0) and (epoch % self.step_size == 0):
            for param_group in self.params['optimizer'].param_groups:
                param_group['lr'] *= self.gamma

        print("LRScheduleCallback epoch={}, learning_rate={}".format(epoch,
              self.params['optimizer'].param_groups[0]['lr']))

## Data loading function for EDT
def getDatasets():
    transform_train = transforms.Compose([
        transforms.Resize(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    transform_test = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    return (torchvision.datasets.CIFAR10(root=TRAIN_DATA, train=True, download=True, transform=transform_train),
            torchvision.datasets.CIFAR10(root=TEST_DATA, train=False, download=True, transform=transform_test))

def custom_train(model, data, eva, train_loader, fn_args):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
    opt = model.get_optimizer()
    opt.zero_grad()
    outputs = model(inputs)
    cri = model.get_loss_function()
    loss = cri(outputs, labels)
    loss.backward()
    acc = eva(outputs, labels)
    return acc, loss

def custom_test(model, test_iter, fn_args):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    cri = model.get_loss_function()
    valid_loss = 0.0
    counter = 0
    for(inputs, labels) in test_iter:
        inputs, labels = inputs.to(device), labels.to(device)
        output = model(inputs)
        loss = cri(output, labels)
        valid_loss += loss.item()
        counter += 1
    valid_loss /= counter
    return valid_loss

def main(model_type):
    print('==> Building model..' + str(model_type))
    model = models.__dict__[model_type]()
    optimizer = optim.SGD(model.parameters(), lr=START_LEARNING_RATE, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
    loss_function = F.cross_entropy
    
    edt_m = FabricModel(model, getDatasets, loss_function, optimizer, enable_onnx=True, fn_step_train=custom_train, fn_test=custom_test, user_callback=[LRScheduleCallback(LR_STEP_SIZE, LR_GAMMA)],  driver_logger=EDTLoggerCallback())
    print('==> epochs:' + str(NUM_EPOCHS) + ', batchsize:' + str(BATCH_SIZE_PER_DEVICE) + ', engines_number:' + str(MAX_NUM_WORKERS))
    edt_m.train(NUM_EPOCHS, BATCH_SIZE_PER_DEVICE, MAX_NUM_WORKERS, num_dataloader_threads=4, validation_freq=10, checkpoint_freq=0)

if __name__ == '__main__':
    main("resnet50")


Overwriting ./resnet-wmla/elastic-main.py
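
To see what `LRScheduleCallback` does to the learning rate over time, the update rule can be replayed outside the training loop. The function below is a standalone sketch with a hypothetical name (`lr_at_epoch`); it applies the same condition as `on_epoch_begin`, multiplying by `gamma` at every non-zero epoch that is a multiple of `step_size`.

```python
import math


def lr_at_epoch(epoch, start_lr, step_size, gamma):
    """Learning rate in effect during `epoch`, replaying the callback's rule."""
    lr = start_lr
    for e in range(epoch + 1):
        if e != 0 and e % step_size == 0:
            lr *= gamma
    return lr


# Values from elastic-main.py: START_LEARNING_RATE, LR_STEP_SIZE, LR_GAMMA
for epoch in (0, 2, 30, 60):
    print(epoch, lr_at_epoch(epoch, 0.4, 30, 0.1))

# Floating-point products are compared with a tolerance
print(math.isclose(lr_at_epoch(30, 0.4, 30, 0.1), 0.04))
```

With `NUM_EPOCHS = 3` and `LR_STEP_SIZE = 30`, the decay never triggers in this notebook's run, so the learning rate stays at its starting value for all three epochs.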


### EDT helper scripts: edtcallback.py
The `edtcallback.py` script collects the model loss and accuracy and writes them to the driver log.


In [4]:
%%writefile {model_dir}/{model_callback}
#! /usr/bin/env python

from __future__ import print_function

import sys
import os

from callbacks import LoggerCallback
from emetrics import EMetrics
from elog import ELog

'''
    EDTLoggerCallback extends LoggerCallback to record training and test
    metrics through EMetrics and ELog.
'''

class EDTLoggerCallback(LoggerCallback):
    def __init__(self):
        self.gs =0

    def log_train_metrics(self, loss, acc, completed_batch,  worker=0):
        acc = acc/100.0
        self.gs += 1
        with EMetrics.open() as em:
            em.record(EMetrics.TEST_GROUP,completed_batch,{'loss': loss, 'accuracy': acc})
        with ELog.open() as log:
            log.recordTrain("Train", completed_batch, self.gs, loss, acc, worker)

    def log_test_metrics(self, loss, acc, completed_batch, worker=0):
        acc = acc/100.0
        with ELog.open() as log:
            log.recordTest("Test", loss, acc, worker)

Overwriting ./resnet-wmla/edtcallback.py


### EDT helper scripts: elog.py
The elog.py script defines the path and content of the training and test log. 


In [5]:
%%writefile {model_dir}/{model_elog}
import time
import os

'''
    The ELog class defines the path and content of the training and test logs.
'''

class ELog(object):

    def __init__(self,subId,f):
        if "TRAINING_ID" in os.environ:
            self.trainingId = os.environ["TRAINING_ID"]
        elif "DLI_EXECID" in os.environ:
            self.trainingId = os.environ["DLI_EXECID"]
        else:
            self.trainingId = ""
        self.subId = subId
        self.f = f

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        self.close()

    @staticmethod
    def open(subId=None):
        if "LOG_DIR" in os.environ:
            folder = os.environ["LOG_DIR"]
        elif "JOB_STATE_DIR" in os.environ:
            folder = os.path.join(os.environ["JOB_STATE_DIR"],"logs")
        else:
            folder = "/tmp"

        if subId is not None:
            folder = os.path.join(folder, subId)

        if not os.path.exists(folder):
            os.makedirs(folder)

        f = open(os.path.join(folder, "stdout"), "a")
        return ELog(subId,f)

    def recordText(self,text):
        timestr = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        timestr = "["+ timestr + "]"
        if self.f:
            self.f.write(timestr + " " + text + "\n")
            self.f.flush()

    def recordTrain(self,title,iteration,global_steps,loss,accuracy,worker):
        text = title
        text = text + ",	Timestamp: " + str(int(round(time.time() * 1000)))
        text = text + ",	Global steps: " + str(global_steps)
        text = text + ",	Iteration: " + str(iteration)
        text = text + ",	Loss: " + str(float('%.5f' % loss) )
        text = text + ",	Accuracy: " + str(float('%.5f' % accuracy) )
        self.recordText(text)

    def recordTest(self,title,loss,accuracy,worker):
        text = title
        text = text + ",	Timestamp: " + str(int(round(time.time() * 1000)))
        text = text + ",	Loss: " + str(float('%.5f' % loss) )
        text = text + ",	Accuracy: " + str(float('%.5f' % accuracy) )
        self.recordText(text)

    def close(self):
        if self.f:
            self.f.close()

Overwriting ./resnet-wmla/elog.py
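
The lines written by `ELog.recordTrain` and `ELog.recordTest` follow a simple `[time] Title, Key: value, ...` layout, so they can be read back into a dictionary. The parser below is a hypothetical helper, not part of the EDT scripts, and the sample line is constructed by hand to match the format in `elog.py`.

```python
import re

# "[timestamp] Title,<fields>" as produced by ELog.recordText
LINE_RE = re.compile(r"^\[(?P<time>[^\]]+)\]\s+(?P<title>[^,]+),(?P<rest>.*)$")


def parse_elog_line(line):
    """Return a dict for one ELog line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    record = {"time": m.group("time"), "title": m.group("title")}
    for field in m.group("rest").split(","):
        key, _, value = field.partition(":")
        record[key.strip()] = float(value)
    return record


sample = ("[2021-03-08 03:24:40] Train,\tTimestamp: 1615174255910,"
          "\tGlobal steps: 5,\tIteration: 5,\tLoss: 2.31841,\tAccuracy: 0.09954\n")
print(parse_elog_line(sample))
```

All numeric fields are returned as floats for simplicity; a caller that needs integer step counts can convert them afterwards.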


In [6]:
### Package model files for training
import requests, json
import pandas as pd
import datetime
# from IPython.display import display

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
%matplotlib inline
# plt.rcParams['figure.figsize'] = [24, 8.0]
#import seaborn as sns

pd.set_option('display.max_columns', 999)
pd.set_option('max_colwidth', 300)

import tarfile
import tempfile
import os
#Package the updated model files into a tar file ending with `.modelDir.tar`

In [7]:
def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))


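MODEL_DIR_SUFFIX = ".modelDir.tar"
# mkstemp creates the file safely; tempfile.mktemp is deprecated
fd, tempFile = tempfile.mkstemp(suffix=MODEL_DIR_SUFFIX)
os.close(fd)

make_tarfile(tempFile, model_dir)

print(" tempFile: " + tempFile)
# Keep the file handle open; it is uploaded when the job is submitted
files = {'file': open(tempFile, 'rb')}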

 tempFile: /var/folders/5n/bsvbwc4x2pv391y0zqg1b22c0000gn/T/tmpgft46na3.modelDir.tar
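
Before uploading, it can be useful to confirm what the archive actually contains. The sketch below repeats `make_tarfile` so that it is self-contained and packages a throwaway directory of placeholder files; the `list_archive` helper and the placeholder contents are illustrative only.

```python
import os
import tarfile
import tempfile


def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))


def list_archive(path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(tar.getnames())


# Build a stand-in model directory with empty placeholder scripts
work = tempfile.mkdtemp()
demo_dir = os.path.join(work, "resnet-wmla")
os.makedirs(demo_dir)
for name in ("elastic-main.py", "edtcallback.py", "elog.py"):
    with open(os.path.join(demo_dir, name), "w") as fh:
        fh.write("# placeholder\n")

archive = os.path.join(work, "demo.modelDir.tar")
make_tarfile(archive, demo_dir)
names = list_archive(archive)
print(names)
```

Because `arcname` is the base name of the source directory, every member is stored under `resnet-wmla/`, which matches the `--model-dir resnet-wmla` argument used at submission time.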


## Set up API end point and log on
[Back to top](#Contents)

In this section we set up the API endpoint that is used throughout this notebook.

The following sections use the Watson ML Accelerator API to complete the various tasks required. 
We give examples of a number of tasks, but you should refer to the documentation below for more details
on what is possible and the sample output you might expect.

- https://www.ibm.com/support/knowledgecenter/SSFHA8_2.2.0/cm/deeplearning.html
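
The logon request in this notebook authenticates with an HTTP Basic `Authorization` header. As a minimal sketch of how that header is formed, the hypothetical helper below encodes a `user:password` pair; the credentials shown are placeholders, not real accounts.

```python
import base64


def basic_auth_header(user, password):
    """Build the HTTP Basic Authorization header for the given credentials."""
    raw = "{}:{}".format(user, password).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


header = basic_auth_header("user", "password")
print(header)
```

Base64 is an encoding, not encryption, so the header exposes the password to anyone who can read it. Avoid printing it or storing it in a shared notebook.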

In [8]:
import requests, json
import pandas as pd
import datetime
# from IPython.display import display

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
%matplotlib inline
# plt.rcParams['figure.figsize'] = [24, 8.0]
#import seaborn as sns

pd.set_option('display.max_columns', 999)
pd.set_option('max_colwidth', 300)

import tarfile
import tempfile
import os
import base64
import urllib



In [9]:
hostname = 'wmla-console-wmla.apps.cpd35-beta.cpolab.ibm.com'  # please enter your Watson Machine Learning Accelerator host name
login='dse_user:cpd4ever' # please enter the login and password as user:password

es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
print(es)
commonHeaders={'Authorization': 'Basic '+es}
req = requests.Session()
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)
a=requests.get(auth_url,headers=commonHeaders, verify=False)
access_token=a.json()['accessToken']
print(access_token)

ZHNlX3VzZXI6Y3BkNGV2ZXI=
https://wmla-console-wmla.apps.cpd35-beta.cpolab.ibm.com/auth/v1/logon
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6ImRzZV91c2VyIiwicm9sZSI6IlVzZXIiLCJwZXJtaXNzaW9ucyI6WyJhY2Nlc3NfY2F0YWxvZyIsImNhbl9wcm92aXNpb24iXSwiZ3JvdXBzIjpbMTAwMDBdLCJzdWIiOiJkc2VfdXNlciIsImlzcyI6IktOT1hTU08iLCJhdWQiOiJEU1giLCJ1aWQiOiIxMDAwMzMxMDAxIiwiYXV0aGVudGljYXRvciI6ImRlZmF1bHQiLCJpYXQiOjE2MTUxNzM4NzQsImV4cCI6MTYxNTIxNzAzOH0.DF3bdqAcuFVomGZ1lYMg4AqLPGmJRY1T0sXZwcK1urpVCV6gqIIKeqPmhrp3mUABkJ5R8M4h3oJi4-ul5EjPs10IJg4hm3dJDFFtgekK2jVBhesTuVMK7dEzh0SFm755YcRVtUrvHyA2s702pOWpNswddZVjG15BJVASGXFDz0sXZWjNHchRxjOztGvqvv2YkDSGQh6sKraLlQL2NThDI5ZdfBQ0Lub-fvDYon9lFoWEVEUW8cg0EbCXyDdt7xdgKZ8ar7hOPDdjSpe93YEltMlMwH1RiPr4bGLcggn0VrKC223I-Dys3UmJ2xoXpVIFBcnb0KgyOMqchmwMzmSuoA


### Log on


Obtain login session tokens to be used for session authentication within the RESTful API. Tokens are valid for 8 hours.
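
Since tokens expire, a long-running notebook session may need to check how much time is left before it must log on again. The sketch below assumes the access token is a JSON Web Token carrying an `exp` claim, which the token printed above appears to be. It decodes the payload without verifying the signature, so it is suitable only for inspecting expiry, never for trusting the contents. The demo token is built locally for illustration.

```python
import base64
import json
import time


def jwt_claims(token):
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def seconds_until_expiry(token, now=None):
    now = time.time() if now is None else now
    return jwt_claims(token)["exp"] - now


def _b64url(obj):
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hypothetical token issued at t=1000 and valid for 8 hours
demo_token = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"username": "demo", "iat": 1000, "exp": 1000 + 8 * 3600}),
    "signature",
])
print(seconds_until_expiry(demo_token, now=1000))
```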

In [10]:
dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()

### Check deep learning framework details

Check what framework plugins are available and see example execution commands.  In this demonstration we will use **edtPyTorch**
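
The framework listing is a JSON array of objects with a `name` field, so checking that a framework is available before submitting a job takes only a few lines. The sample data below is a trimmed, hand-written stand-in for the real response; in particular, the `desc` text of the `edtPyTorch` entry is a placeholder, and `require_framework` is a hypothetical helper.

```python
import json

# Hand-written stand-in for the /execs/frameworks response
sample_response = json.loads("""
[
  {"name": "PyTorch", "description": "", "desc": ["PyTorch", "Examples:"]},
  {"name": "edtPyTorch", "description": "", "desc": ["placeholder"]}
]
""")


def framework_names(frameworks):
    return [fw["name"] for fw in frameworks]


def require_framework(frameworks, wanted):
    """Return `wanted` if it is available, otherwise raise ValueError."""
    names = framework_names(frameworks)
    if wanted not in names:
        raise ValueError("{} not available; found {}".format(wanted, names))
    return wanted


print(framework_names(sample_response))
print(require_framework(sample_response, "edtPyTorch"))
```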

In [11]:
r = requests.get(dl_rest_url+'/execs/frameworks', headers=commonHeaders, verify=False).json()
# Using the raw json, easier to see the examples given
print(json.dumps(r, indent=4))

[
    {
        "name": "PyTorch",
        "description": "",
        "desc": [
            "PyTorch",
            "Examples:",
            "$ python dlicmd.py --exec-start PyTorch <connection-options> --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py",
            "",
            "Prebuilt Models:",
            "  $ python dlicmd.py --exec-start PyTorch <connection-options> <prebuilt-model-params>",
            "",
            "  where:",
            "    <prebuilt-model-params>:",
            "      --pbmodel-cmd <command>: <command> is 'train'",
            "        Specify 'train' to train the prebuilt models below with fake data",
            "      --pbmodel-name <name>:",
            "        For 'train', <name> can be either 'AlexNet', 'VGG16', or 'ResNet18'",
            "      --epochs. Optional for 'train'. Default 10",
            "      --batch-size. Optional for 'train'. Default 20",
            "      --lr. Learning rate. Optional for 'train'. Default 0.0

## Submit job via API
[Back to top](#Contents)

Now we need to structure our API job submission. There are various elements to this process as seen in the diagram below. Note that **this** Jupyter notebook is the one referred to below. A [static version](https://github.com/IBM/wmla-assets/raw/master/WMLA-learning-journey/shared-images/4_api_setup.png) is also available.

![code](https://github.com/IBM/wmla-assets/raw/master/WMLA-learning-journey/shared-images/4_api_setup.gif)


framework_name = 'edtPyTorch' # DL Framework to use, from list given above
local_dir_containing_your_code = 'resnet-wmla'
number_of_GPU = '2' # number of GPUs for elastic distribution
name_of_your_code_file = 'elastic-main.py' # Main model file as opened locally above


args = '--exec-start {} \
        --cs-datastore-meta type=fs\
        --model-dir {} \
        --numWorker={} \
        --model-main {} \
        '.format(framework_name, local_dir_containing_your_code, number_of_GPU, name_of_your_code_file)

print ("args: " + args)

In [12]:
framework_name = 'edtPyTorch' # DL Framework to use, from list given above
#dataset_location = 'pytorch-mnist' # relative path of your data set under $DLI_DATA_FS
local_dir_containing_your_code = 'resnet-wmla'
number_of_GPU = '2' # number of GPUs for elastic distribution
name_of_your_code_file = 'elastic-main.py' # Main model file as opened locally above

args = '--exec-start edtPyTorch --cs-datastore-meta type=fs  --numWorker 2 \
                     --model-main elastic-main.py --model-dir resnet-wmla'

    
print ("args: " + args)

args: --exec-start edtPyTorch --cs-datastore-meta type=fs  --numWorker 2                      --model-main elastic-main.py --model-dir resnet-wmla
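
The argument string is later appended to the request URL as a query parameter. As a sketch of how it could be assembled from the variables defined above instead of being typed by hand, the hypothetical `build_args` helper below joins the flags with single spaces, and `urllib.parse.quote` shows what the string looks like once percent-encoded.

```python
from urllib.parse import quote, unquote


def build_args(framework, model_dir, num_worker, model_main):
    """Assemble the job argument string used in this notebook."""
    parts = [
        "--exec-start", framework,
        "--cs-datastore-meta", "type=fs",
        "--numWorker", str(num_worker),
        "--model-main", model_main,
        "--model-dir", model_dir,
    ]
    return " ".join(parts)


args = build_args("edtPyTorch", "resnet-wmla", 2, "elastic-main.py")
encoded = quote(args)
print(args)
print(encoded)
```

Building the string from a list avoids the long runs of whitespace that a backslash-continued string literal leaves inside the arguments.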


## Monitor running job
[Back to top](#Contents)

The cell below submits the job and, once the submission succeeds, polls its status until the job is no longer pending or running.


In [13]:
r = requests.post(dl_rest_url+'/execs?args='+args, files=files, 
                  headers=commonHeaders, verify=False)


# Stop here if the submission failed; the response would not contain a job id
if not r.ok:
    raise RuntimeError('submit job failed: code=%s, %s'%(r.status_code, r.content))


job_status = query_job_status(r.json(),refresh_rate=5)


Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-398,--exec-start edtPyTorch --cs-datastore-meta type=fs --numWorker 2 --model-main elastic-main.py...,wmla-398,dse_user,FINISHED,wmla-398,https://wmla-mss:9080,wmla,/gpfs/myresultfs/dse_user/batchworkdir/wmla-398/_submitted_code/resnet-wmla,ElasticPyTorchTrain,2021-03-08T03:24:40Z,True,wmla,2,edtPyTorch


{ 'appId': 'wmla-398',
  'appName': 'ElasticPyTorchTrain',
  'args': '--exec-start edtPyTorch --cs-datastore-meta type=fs  --numWorker '
          '2                      --model-main elastic-main.py --model-dir '
          'resnet-wmla ',
  'createTime': '2021-03-08T03:24:40Z',
  'creator': 'dse_user',
  'elastic': True,
  'framework': 'edtPyTorch',
  'id': 'wmla-398',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 2,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-398',
  'workDir': '/gpfs/myresultfs/dse_user/batchworkdir/wmla-398/_submitted_code/resnet-wmla'}


## Training metrics and logs

#### Retrieve and display the model training metrics:
[Back to top](#Contents)

After the job completes, we can retrieve the output, logs, and saved models.



In [14]:
query_executor_stdout_log(r.json())

https://wmla-console-wmla.apps.cpd35-beta.cpolab.ibm.com/platform/rest/deeplearning/v1/scheduler/applications/wmla-398/executor/1/logs/stdout?lastlines=1000
Iteration 1939: tag train_loss, simple_value 2.31841
Timestamp 1615174255910, Iteration 975
batches :975 2.2616612911224365
Iteration 1941: tag train_accuracy, simple_value 0.09954
Iteration 1941: tag train_loss, simple_value 2.26166
Timestamp 1615174256121, Iteration 976
batches :976 2.2919421195983887
Iteration 1943: tag train_accuracy, simple_value 0.09955
Iteration 1943: tag train_loss, simple_value 2.29194
Timestamp 1615174256336, Iteration 977
batches :977 2.340242385864258
Iteration 1945: tag train_accuracy, simple_value 0.09932
Iteration 1945: tag train_loss, simple_value 2.34024
Timestamp 1615174256603, Iteration 978
batches :978 2.310180425643921
Iteration 1947: tag train_accuracy, simple_value 0.09913
Iteration 1947: tag train_loss, simple_value 2.31018
intput worker ptr is: 0x55b07e32e070
eseAllReduce is called.
allredu
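
The executor log mixes metric lines with other output, so extracting the metrics takes a small amount of parsing. The sketch below is a hypothetical helper that collects lines of the form `Iteration N: tag NAME, simple_value V` into one series per tag; the sample text is copied from the output above.

```python
import re
from collections import defaultdict

METRIC_RE = re.compile(r"^Iteration (\d+): tag (\w+), simple_value ([-+0-9.eE]+)$")


def parse_metrics(log_text):
    """Map each metric tag to a list of (iteration, value) pairs."""
    series = defaultdict(list)
    for line in log_text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            series[m.group(2)].append((int(m.group(1)), float(m.group(3))))
    return dict(series)


sample_log = """Iteration 1939: tag train_loss, simple_value 2.31841
Timestamp 1615174255910, Iteration 975
batches :975 2.2616612911224365
Iteration 1941: tag train_accuracy, simple_value 0.09954
Iteration 1941: tag train_loss, simple_value 2.26166
"""
metrics = parse_metrics(sample_log)
print(metrics)
```

The resulting series can be passed to `pandas` or `matplotlib`, both already imported in this notebook, to plot loss and accuracy against the iteration number.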

## Download trained model from Watson Machine Learning Accelerator 



In [18]:
download_trained_model(r.json())

https://wmla-console-wmla.apps.cpd35-beta.cpolab.ibm.com/platform/rest/deeplearning/v1/execs/wmla-398/result
Save model:  ./resnet-wmla/wmla-398.zip
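
`download_trained_model` reads the whole response into memory through `res.content` before writing it out. For large model archives, writing the body chunk by chunk is an alternative. The hypothetical `save_stream` helper below accepts any iterable of byte chunks, so it can be tried here with in-memory data.

```python
import os
import tempfile


def save_stream(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    total = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                fh.write(chunk)
                total += len(chunk)
    return total


demo_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
written = save_stream([b"PK", b"", b"demo-bytes"], demo_path)
print(written, os.path.getsize(demo_path))
```

With the `requests` library, the chunks could come from `res.iter_content(chunk_size=8192)` on a response obtained with `stream=True`.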


## Further information and useful links
[Back to top](#Contents)

**WML Accelerator Introduction videos:**
- WML Accelerator overview video (1 minute): http://ibm.biz/wmla-video
- Overview of adapting your code for Elastic Distributed Training via API: [video](https://youtu.be/RnZtYNX6meM) | [PDF](docs/wmla_api_pieces.pdf)

**Further WML Accelerator information & documentation**
- [Learning path: Get started with Watson Machine Learning Accelerator](http://ibm.biz/wmla-learning-path)
- [IBM Documentation on Watson Machine Learning Accelerator](https://www.ibm.com/docs/en/wmla/2.2.0)
- [Blog: Expert Q&A: Accelerate deep learning on IBM Cloud Pak for Data](https://www.ibm.com/blogs/journey-to-ai/2020/10/expert-qa-accelerate-deep-learning-on-ibm-cloud-pak-for-data)




## Appendix
[Back to top](#Contents)


#### This is version 1.0 and its content is copyright of IBM.   All rights reserved.   


