# Accelerate Deep Learning Model training with Watson Machine Learning Accelerator


### Notebook created by Kelvin Lui and Xue Yin Zhuang in Jan 2021

### In this notebook, you will learn how to use the Watson Machine Learning Accelerator (WML-A) API and accelerate deep learning model training on GPU with Watson Machine Learning Accelerator.

This notebook uses the PyTorch ResNet18 model to run a basic computer vision image classification example. The model is trained on both CPU and GPU to demonstrate that training on GPU hardware delivers results faster.


This notebook covers the following sections:

1. [Setting up required packages](#setup)<br>

2. [Configuring your environment and project details](#configure)<br>

3. [Training the model on CPU](#cpu)<br>

4. [Training the model on GPU with Watson Machine Learning Accelerator](#gpu)<br>

<a id = "setup"></a>
## Step 1: Setting up required packages


#### First, install torchvision, which is required to train the PyTorch ResNet18 model on CPU.
Note: You need to create a custom environment with 16 vCPU and 32 GB of memory.

In [1]:
! pip install torchvision



In [2]:
import torchvision

#### Next, define helper methods:

In [14]:
import base64
import json
import os
import pprint
import tarfile
import tempfile
import time
import urllib

import pandas as pd
import requests
import urllib3
from IPython.display import display, FileLink, clear_output
from matplotlib import pyplot as plt
%pylab inline

# The REST calls below use verify=False, so silence the resulting TLS warnings.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def query_job_status(job_id, refresh_rate=3):
    """Poll the job status until it leaves the pending, submitted, and running states.

    Relies on dl_rest_url, req, and commonHeaders, which are defined in Step 2.
    """
    execURL = dl_rest_url + '/execs/' + job_id['id']
    pp = pprint.PrettyPrinter(indent=2)
    pd.set_option('max_colwidth', 120)

    while True:
        res = req.get(execURL, headers=commonHeaders, verify=False)
        status = res.json()
        monitoring = pd.DataFrame(status, index=[0])
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(status)
        # Stop as soon as the job is no longer pending, submitted, or running.
        if status['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED', 'RUNNING']:
            return res
        time.sleep(refresh_rate)

def query_executor_stdout_log(job_id):
    """Print the last 1000 lines of stdout from executor 1 of the job."""
    execURL = dl_rest_url + '/scheduler/applications/' + job_id['id'] + '/executor/1/logs/stdout?lastlines=1000'
    #'https://{}/platform/rest/deeplearning/v1/scheduler/applications/wmla-267/driver/logs/stderr?lastlines=10'.format(hostname)
    text_headers = {'accept': 'text/plain', 'X-Auth-Token': access_token}
    print(execURL)
    res = req.get(execURL, headers=text_headers, verify=False)
    print(res.text)


def query_train_metric(job_id):
    """Print the training log of the job."""
    execURL = dl_rest_url + '/execs/' + job_id['id'] + '/log'
    text_headers = {'accept': 'text/plain', 'X-Auth-Token': access_token}
    print(execURL)
    res = req.get(execURL, headers=text_headers, verify=False)
    print(res.text)

def download_trained_model(job_id):
    """Download the result archive of the job into the project data assets."""
    result_headers = {'accept': 'application/octet-stream', 'X-Auth-Token': access_token}
    # Use the job_id argument, like the other helpers, rather than a global response object.
    execURL = dl_rest_url + '/execs/' + job_id['id'] + '/result'
    res = req.get(execURL, headers=result_headers, verify=False, stream=True)
    print(execURL)

    tmpfile = '/project_data/data_asset/' + job_id['id'] + '.zip'
    print('Save model: ', tmpfile)
    # Write the streamed response in chunks; the with block closes the file.
    with open(tmpfile, 'wb') as f:
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

Populating the interactive namespace from numpy and matplotlib
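
The polling logic in `query_job_status` can be tried without a cluster. The following is a minimal sketch, not part of the WML-A API: `poll_until_done` and `fetch_state` are hypothetical names, the fake state sequence stands in for the `state` field returned by the REST endpoint, and `FINISHED` is only a placeholder for a terminal state. The non-terminal states are the ones listed in the helper above.

```python
import time

# Non-terminal states, copied from query_job_status above.
ACTIVE_STATES = {"PENDING_CRD_SCHEDULER", "SUBMITTED", "RUNNING"}


def poll_until_done(fetch_state, refresh_rate=0.0, max_polls=100):
    """Call fetch_state() until it returns a state outside ACTIVE_STATES.

    fetch_state is a hypothetical zero-argument callable; in the notebook it
    would wrap the REST request and return the 'state' field of the response.
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state not in ACTIVE_STATES:
            return state
        time.sleep(refresh_rate)
    raise TimeoutError("job still active after {} polls".format(max_polls))


# Fake job that moves through a made-up sequence of states.
fake_states = iter(["SUBMITTED", "RUNNING", "RUNNING", "FINISHED"])
final_state = poll_until_done(lambda: next(fake_states))
print(final_state)
```

Unlike the helper above, this sketch caps the number of polls so that a job that never leaves an active state cannot block the notebook indefinitely.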


<a id = "configure"></a>
## Step 2: Configuring your environment and project details

To set up your project details, provide your credentials in the following cell. You must include your cluster host name, username, and password.
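
As a sketch of one way to keep the password out of the notebook, the Basic authorization header can be built from environment variables. The variable names `WMLA_USER` and `WMLA_PASSWORD` and the helper `basic_auth_header` are assumptions made for illustration; they are not defined by WML-A.

```python
import base64
import os


def basic_auth_header(username, password):
    """Return the HTTP Basic Authorization header for the given credentials."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("utf-8")}


# WMLA_USER and WMLA_PASSWORD are hypothetical variable names; the fallbacks
# mirror the placeholder login used in this notebook.
username = os.environ.get("WMLA_USER", "admin")
password = os.environ.get("WMLA_PASSWORD", "password")
headers = basic_auth_header(username, password)
```

The resulting dictionary has the same shape as the `commonHeaders` value that is sent to the logon URL.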

In [2]:
hostname='wmla-console-xwmla.apps.wml1x180.ma.platformlab.ibm.com'  # enter your Watson Machine Learning Accelerator host name
login='admin:password'  # enter your credentials in the form username:password
es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
print(es)
commonHeaders={'Authorization': 'Basic '+es}
req = requests.Session()
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)
a=requests.get(auth_url,headers=commonHeaders, verify=False)
access_token=a.json()['accessToken']
print(access_token)

YWRtaW46cGFzc3dvcmQ=
https://wmla-console-xwmla.apps.wml1x180.ma.platformlab.ibm.com/auth/v1/logon
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6ImFkbWluIiwicm9sZSI6IkFkbWluIiwicGVybWlzc2lvbnMiOlsiYWRtaW5pc3RyYXRvciIsImNhbl9wcm92aXNpb24iLCJtYW5hZ2VfY2F0YWxvZyIsImFjY2Vzc19jYXRhbG9nIl0sImdyb3VwcyI6WzEwMDAwXSwic3ViIjoiYWRtaW4iLCJpc3MiOiJLTk9YU1NPIiwiYXVkIjoiRFNYIiwidWlkIjoiMTAwMDMzMDk5OSIsImF1dGhlbnRpY2F0b3IiOiJkZWZhdWx0IiwiaWF0IjoxNjEyODE4MDgyLCJleHAiOjE2MTI4NjEyNDZ9.XZoM-xN15afxDjLOLt6K2GnKqoXuJexV7rBB3jI9unwO01xmBd4VXX0z-3JuAry_QIX8o2w3kk3tUVVkvCzipOJHVLctooLxN7NN76E0kit9Gf0IsFCqFbigF6cSWUD6aa5fkdMRU_LWyUp40gpXQJ6OA6oteHT93i4Lc03tDRSkk0VU7UjpDSXPCXdEbT0fOB8Wbtoyr6jjdc2XrvJho3t6R8KYAf63VzlNHL_op_5kPWwgNaAKEXtzJNM0IgYMXBcYcIsmWZdROgr9in6wP_stIwpZza-ehhRICSJn5o5Ko72RbS-RELdNo6lZECK24ZRA_maUSL5CURIrHoaM6g


In [3]:
dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()
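
The helper functions from Step 1 build their request URLs from `dl_rest_url` and the job ID. A small sketch of that pattern follows; `exec_url` is a hypothetical helper, and the host name and job ID are placeholders. The `/log` and `/result` suffixes are the ones used by `query_train_metric` and `download_trained_model`.

```python
def exec_url(base_url, job_id, suffix=""):
    """Join the REST base URL, the /execs/ path, a job ID, and an optional suffix."""
    return "{}/execs/{}{}".format(base_url.rstrip("/"), job_id, suffix)


# "wmla.example.com" and "job-123" are placeholders, not real values.
base = "https://{}/platform/rest/deeplearning/v1".format("wmla.example.com")
print(exec_url(base, "job-123"))             # job status
print(exec_url(base, "job-123", "/log"))     # training log
print(exec_url(base, "job-123", "/result"))  # trained model archive
```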

<a id = "cpu"></a>
## Step 3: Training the model on CPU

#### Prepare the model files for running on CPU:

In [4]:
import os

model_dir = '/project_data/data_asset/pytorch-resnet/resnet'
model_main = 'main.py'
model_resnet = 'resnet.py'

os.makedirs(model_dir, exist_ok=True)

In [5]:
%%writefile {model_dir}/{model_main}
#!/usr/bin/env python
# coding: utf-8

# # Image Classification Using PyTorch ResNet with Watson Machine Learning Accelerator Notebook
# This asset walks through a basic computer vision image classification example using the notebook functionality within Watson Machine Learning Accelerator. In this asset, you will learn how to accelerate training of the PyTorch ResNet model on the CIFAR-10 dataset.
#
# Please refer to [Resnet Introduction](https://arxiv.org/abs/1512.03385) for more details.



from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import torchvision.models as models
#from resnet import resnet18
import time
import numpy

import sys
import os
import glob

log_interval = 10

seed = 1
use_cuda = False
completed_batch =0
completed_test_batch =0
criterion = nn.CrossEntropyLoss()


parser = argparse.ArgumentParser(description='PyTorch ResNet18 CIFAR-10 Example')
parser.add_argument('--batch-size', type=int, default=32, metavar='N',
                    help='input batch size for training (default: 32)')
parser.add_argument('--epochs', type=int, default=5, metavar='N',
                    help='number of epochs to train (default: 5)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
# This CPU script always trains on the CPU (use_cuda is fixed to False above),
# so the flag is accepted but has no effect.
parser.add_argument('--cuda', action='store_true', default=False,
                    help='enables CUDA training (ignored by this CPU script)')
args = parser.parse_args()
print(args)


# ## Create the Resnet18 model
print("Use cuda: ", use_cuda)

# ## Download the CIFAR-10 dataset
# The code below downloads the CIFAR-10 dataset automatically to DATA_DIR/cifar10.
# You can also download the [CIFAR-10 python version](https://www.cs.toronto.edu/~kriz/cifar.html) and upload it manually.

model_dir = '/project_data/data_asset/pytorch-resnet/resnet'
# For the CPU run the data directory is set explicitly instead of being read
# from the DATA_DIR environment variable.
DATA_DIR = '/project_data/data_asset/pytorch-resnet/data'
print("DATA_DIR: ", DATA_DIR)

def getDatasets():
    train_data_dir = DATA_DIR + '/cifar10'
    test_data_dir = DATA_DIR + '/cifar10'

    transform_train = transforms.Compose([
        transforms.Resize(224),
        #transforms.RandomCrop(self.resolution, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    transform_test = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    return (torchvision.datasets.CIFAR10(root=train_data_dir, train=True, download=True, transform = transform_train),
            torchvision.datasets.CIFAR10(root=test_data_dir, train=False, download=True, transform = transform_test)
            )

torch.manual_seed(seed)
device = torch.device("cuda" if use_cuda else "cpu")
print ('device:', device)

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

train_dataset, test_dataset = getDatasets()

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)


# ## Implement the customized train and test loop


def train(model, device, train_loader, optimizer, epoch):
    global completed_batch
    train_loss = 0
    correct = 0
    total = 0
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

        completed_batch += 1

        print ('Train - batches : {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)'.format(
           completed_batch, train_loss/(batch_idx+1), correct, total, 100.*correct/total))


def test(model, device, test_loader, epoch):
    global completed_test_batch
    global completed_batch
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    completed_test_batch = completed_batch -  len(test_loader)
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(test_loader):
            data, target = data.to(device), target.to(device)
            output = model(data)

            loss = criterion(output, target)

            test_loss += loss.item() # sum up batch loss
            _, pred = output.max(1) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += target.size(0)

            completed_test_batch += 1

    # criterion returns the mean loss of each batch, so average over the number of batches.
    test_loss /= len(test_loader)
    test_acc = 100. * correct / len(test_loader.dataset)
    # Output test info for each epoch
    print('Test - batches: {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)\n'.format(
        completed_batch, test_loss, correct, len(test_loader.dataset), test_acc))


# ## Create the Resnet18 model
#use_cuda = not args.no_cuda
print("Use cuda: ", use_cuda)


model_type = "resnet18"
print("=> using pytorch build-in model '{}'".format(model_type))

model = models.resnet18()
#model = models.resnet50()


# models.resnet18() builds the torchvision ResNet18 architecture with randomly initialized weights.
# The architecture targets the ImageNet dataset, which has 1000 classes. To use it with the
# CIFAR-10 dataset, change the output size of the last fully-connected layer to 10.

for param in model.parameters():
    param.requires_grad = True  # set to False to train only the last layer of a pretrained model

# Replace the last fully-connected layer.
# Parameters of newly constructed modules have requires_grad=True by default.
model.fc = nn.Linear(512, 10)


# (Optional) To use wmla pretrained resnet18 model for cifar10, load the model weight file. The pretrained model weight file can be downloaded [here](https://?).

weightfile = DATA_DIR + "/checkpoint/model_epoch_final.pth"
if os.path.exists(weightfile):
    print ("Initial weight file is " + weightfile)
    model.load_state_dict(torch.load(weightfile, map_location=lambda storage, loc: storage))


# ## Run the model trainings
#print(model)
model.to(device)
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)
epochs = args.epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, 30, 0.1, last_epoch=-1)

# Output total iterations info for deep learning insights
print("Total iterations: %s" % (len(train_loader) * epochs))

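# RESULT_DIR is where the trained weights are written. Use the RESULT_DIR
# environment variable if it is set; otherwise fall back to a folder under
# model_dir (assumed location, adjust as needed).
RESULT_DIR = os.getenv("RESULT_DIR", model_dir + "/result")
print("RESULT_DIR: ", RESULT_DIR)
os.makedirs(RESULT_DIR, exist_ok=True)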

for epoch in range(1, epochs+1):
    print("\nRunning epoch %s ... It might take several minutes for each epoch to run." % epoch)
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader, epoch)
    scheduler.step()

    torch.save(model.state_dict(),  RESULT_DIR + "/model_epoch_%d.pth"%(epoch))

torch.save(model.state_dict(), RESULT_DIR + "/model_epoch_final.pth")


Overwriting /project_data/data_asset/pytorch-resnet/resnet/main.py
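
The command-line flags of `main.py` can be checked in isolation before starting a long run. This sketch rebuilds only the parser with the defaults used in the script above and parses the `--epochs 1` argument used for the CPU run; `build_parser` and the description string are introduced here for illustration.

```python
import argparse


def build_parser():
    """Recreate the command-line interface of main.py with the same defaults."""
    parser = argparse.ArgumentParser(description="PyTorch ResNet18 CIFAR-10 example")
    parser.add_argument("--batch-size", type=int, default=32, metavar="N")
    parser.add_argument("--epochs", type=int, default=5, metavar="N")
    parser.add_argument("--lr", type=float, default=0.01, metavar="LR")
    parser.add_argument("--cuda", action="store_true", default=False)
    return parser


# Dashes in flag names become underscores in the parsed namespace.
args = build_parser().parse_args(["--epochs", "1"])
print(args.batch_size, args.epochs, args.lr, args.cuda)
```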


## Training results on CPU

#### Training was run from a Cloud Pak for Data notebook using a CPU kernel.


In the custom environment created with **16 vCPU** and **32 GB** of memory, one training epoch took **1560 seconds** (approximately **26 minutes**).
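
As a rough sanity check on these numbers: the CIFAR-10 training split has 50,000 images, so a batch size of 32 gives 1563 iterations per epoch, which matches the `Total iterations` line in the run output, and 1560 seconds works out to about one second per batch. The sketch below only reproduces this arithmetic; it does not measure anything.

```python
import math

train_images = 50_000   # size of the CIFAR-10 training split
batch_size = 32         # default batch size of the CPU script
epoch_seconds = 1560    # CPU time for one epoch reported above

iterations = math.ceil(train_images / batch_size)
print("iterations per epoch:", iterations)
print("minutes per epoch: {:.0f}".format(epoch_seconds / 60))
print("seconds per batch: {:.2f}".format(epoch_seconds / iterations))
```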


In [8]:
import datetime
starttime = datetime.datetime.now()

! python /project_data/data_asset/pytorch-resnet/resnet/main.py --epochs 1 

endtime = datetime.datetime.now()
print("Training cost: ", (endtime - starttime).seconds, " seconds.")

Namespace(batch_size=32, cuda=False, epochs=1, lr=0.01)
Use cuda:  False
DATA_DIR:  /project_data/data_asset/pytorch-resnet/data
device: cpu
Files already downloaded and verified
Files already downloaded and verified
Use cuda:  False
=> using pytorch build-in model 'resnet18'
Total iterations: 1563

Running epoch 1 ... It might take several minutes for each epoch to run.
  allow_unreachable=True)  # allow_unreachable flag
Train - batches : 1, average loss: 2.4301, accuracy: 2/32 (6%)
Train - batches : 2, average loss: 2.4638, accuracy: 6/64 (9%)
Train - batches : 3, average loss: 2.4309, accuracy: 9/96 (9%)
Train - batches : 4, average loss: 2.4093, accuracy: 13/128 (10%)
Train - batches : 5, average loss: 2.3974, accuracy: 15/160 (9%)
Train - batches : 6, average loss: 2.3684, accuracy: 22/192 (11%)
Train - batches : 7, average loss: 2.3450, accuracy: 30/224 (13%)
Train - batches : 8, average loss: 2.3538, accuracy: 33/256 (13%)
Train - batches : 9, average loss: 2.3480, accuracy: 37/ ... [output truncated]

Train - batches : 110, average loss: 2.0734, accuracy: 792/3520 (22%)
Train - batches : 111, average loss: 2.0729, accuracy: 798/3552 (22%)
Train - batches : 112, average loss: 2.0733, accuracy: 804/3584 (22%)
Train - batches : 113, average loss: 2.0718, accuracy: 817/3616 (23%)
Train - batches : 114, average loss: 2.0725, accuracy: 824/3648 (23%)
Train - batches : 115, average loss: 2.0705, accuracy: 835/3680 (23%)
Train - batches : 116, average loss: 2.0694, accuracy: 845/3712 (23%)
Train - batches : 117, average loss: 2.0700, accuracy: 851/3744 (23%)
Train - batches : 118, average loss: 2.0680, accuracy: 862/3776 (23%)
Train - batches : 119, average loss: 2.0669, accuracy: 873/3808 (23%)
Train - batches : 120, average loss: 2.0678, accuracy: 880/3840 (23%)
Train - batches : 121, average loss: 2.0653, accuracy: 887/3872 (23%)
Train - batches : 122, average loss: 2.0643, accuracy: 896/3904 (23%)
Train - batches : 123, average loss: 2.0622, accuracy: 904/3936 (23%)
Train - batches : 12 ... [output truncated]

Train - batches : 226, average loss: 1.9700, accuracy: 1884/7232 (26%)
Train - batches : 227, average loss: 1.9697, accuracy: 1894/7264 (26%)
Train - batches : 228, average loss: 1.9684, accuracy: 1907/7296 (26%)
Train - batches : 229, average loss: 1.9691, accuracy: 1918/7328 (26%)
Train - batches : 230, average loss: 1.9682, accuracy: 1927/7360 (26%)
Train - batches : 231, average loss: 1.9685, accuracy: 1936/7392 (26%)
Train - batches : 232, average loss: 1.9682, accuracy: 1947/7424 (26%)
Train - batches : 233, average loss: 1.9668, accuracy: 1955/7456 (26%)
Train - batches : 234, average loss: 1.9667, accuracy: 1960/7488 (26%)
Train - batches : 235, average loss: 1.9654, accuracy: 1969/7520 (26%)
Train - batches : 236, average loss: 1.9648, accuracy: 1979/7552 (26%)
Train - batches : 237, average loss: 1.9634, accuracy: 1992/7584 (26%)
Train - batches : 238, average loss: 1.9629, accuracy: 2001/7616 (26%)
Train - batches : 239, average loss: 1.9622, accuracy: 2013/7648 (26%)
Train ... [output truncated]

Train - batches : 341, average loss: 1.8997, accuracy: 3142/10912 (29%)
Train - batches : 342, average loss: 1.8992, accuracy: 3150/10944 (29%)
Train - batches : 343, average loss: 1.8986, accuracy: 3159/10976 (29%)
Train - batches : 344, average loss: 1.8984, accuracy: 3172/11008 (29%)
Train - batches : 345, average loss: 1.8981, accuracy: 3189/11040 (29%)
Train - batches : 346, average loss: 1.8978, accuracy: 3199/11072 (29%)
Train - batches : 347, average loss: 1.8973, accuracy: 3214/11104 (29%)
Train - batches : 348, average loss: 1.8963, accuracy: 3223/11136 (29%)
Train - batches : 349, average loss: 1.8957, accuracy: 3231/11168 (29%)
Train - batches : 350, average loss: 1.8960, accuracy: 3243/11200 (29%)
Train - batches : 351, average loss: 1.8956, accuracy: 3251/11232 (29%)
Train - batches : 352, average loss: 1.8952, accuracy: 3260/11264 (29%)
Train - batches : 353, average loss: 1.8951, accuracy: 3266/11296 (29%)
Train - batches : 354, average loss: 1.8942, accuracy: 3276/1132 ... [output truncated]

Train - batches : 455, average loss: 1.8440, accuracy: 4452/14560 (31%)
Train - batches : 456, average loss: 1.8437, accuracy: 4461/14592 (31%)
Train - batches : 457, average loss: 1.8435, accuracy: 4472/14624 (31%)
Train - batches : 458, average loss: 1.8429, accuracy: 4485/14656 (31%)
Train - batches : 459, average loss: 1.8429, accuracy: 4494/14688 (31%)
Train - batches : 460, average loss: 1.8418, accuracy: 4512/14720 (31%)
Train - batches : 461, average loss: 1.8416, accuracy: 4525/14752 (31%)
Train - batches : 462, average loss: 1.8419, accuracy: 4536/14784 (31%)
Train - batches : 463, average loss: 1.8421, accuracy: 4542/14816 (31%)
Train - batches : 464, average loss: 1.8415, accuracy: 4554/14848 (31%)
Train - batches : 465, average loss: 1.8406, accuracy: 4567/14880 (31%)
Train - batches : 466, average loss: 1.8403, accuracy: 4577/14912 (31%)
Train - batches : 467, average loss: 1.8404, accuracy: 4587/14944 (31%)
Train - batches : 468, average loss: 1.8398, accuracy: 4600/1497 ... [output truncated]

Train - batches : 569, average loss: 1.8035, accuracy: 5851/18208 (32%)
Train - batches : 570, average loss: 1.8030, accuracy: 5862/18240 (32%)
Train - batches : 571, average loss: 1.8029, accuracy: 5874/18272 (32%)
Train - batches : 572, average loss: 1.8023, accuracy: 5888/18304 (32%)
Train - batches : 573, average loss: 1.8025, accuracy: 5898/18336 (32%)
Train - batches : 574, average loss: 1.8017, accuracy: 5913/18368 (32%)
Train - batches : 575, average loss: 1.8017, accuracy: 5925/18400 (32%)
Train - batches : 576, average loss: 1.8011, accuracy: 5942/18432 (32%)
Train - batches : 577, average loss: 1.8004, accuracy: 5957/18464 (32%)
Train - batches : 578, average loss: 1.8003, accuracy: 5966/18496 (32%)
Train - batches : 579, average loss: 1.7996, accuracy: 5981/18528 (32%)
Train - batches : 580, average loss: 1.7991, accuracy: 5996/18560 (32%)
Train - batches : 581, average loss: 1.7991, accuracy: 6009/18592 (32%)
Train - batches : 582, average loss: 1.7985, accuracy: 6023/1862 ... [output truncated]

Train - batches : 683, average loss: 1.7609, accuracy: 7390/21856 (34%)
Train - batches : 684, average loss: 1.7605, accuracy: 7401/21888 (34%)
Train - batches : 685, average loss: 1.7607, accuracy: 7412/21920 (34%)
Train - batches : 686, average loss: 1.7602, accuracy: 7429/21952 (34%)
Train - batches : 687, average loss: 1.7599, accuracy: 7441/21984 (34%)
Train - batches : 688, average loss: 1.7595, accuracy: 7457/22016 (34%)
Train - batches : 689, average loss: 1.7592, accuracy: 7468/22048 (34%)
Train - batches : 690, average loss: 1.7591, accuracy: 7480/22080 (34%)
Train - batches : 691, average loss: 1.7593, accuracy: 7490/22112 (34%)
Train - batches : 692, average loss: 1.7584, accuracy: 7506/22144 (34%)
Train - batches : 693, average loss: 1.7579, accuracy: 7522/22176 (34%)
Train - batches : 694, average loss: 1.7574, accuracy: 7537/22208 (34%)
Train - batches : 695, average loss: 1.7569, accuracy: 7554/22240 (34%)
Train - batches : 696, average loss: 1.7571, accuracy: 7565/2227 ... [output truncated]

Train - batches : 797, average loss: 1.7314, accuracy: 8924/25504 (35%)
Train - batches : 798, average loss: 1.7312, accuracy: 8936/25536 (35%)
Train - batches : 799, average loss: 1.7313, accuracy: 8947/25568 (35%)
Train - batches : 800, average loss: 1.7310, accuracy: 8962/25600 (35%)
Train - batches : 801, average loss: 1.7308, accuracy: 8978/25632 (35%)
Train - batches : 802, average loss: 1.7306, accuracy: 8993/25664 (35%)
Train - batches : 803, average loss: 1.7305, accuracy: 9006/25696 (35%)
Train - batches : 804, average loss: 1.7305, accuracy: 9016/25728 (35%)
Train - batches : 805, average loss: 1.7301, accuracy: 9030/25760 (35%)
Train - batches : 806, average loss: 1.7298, accuracy: 9046/25792 (35%)
Train - batches : 807, average loss: 1.7296, accuracy: 9056/25824 (35%)
Train - batches : 808, average loss: 1.7292, accuracy: 9073/25856 (35%)
Train - batches : 809, average loss: 1.7291, accuracy: 9086/25888 (35%)
Train - batches : 810, average loss: 1.7291, accuracy: 9099/2592 ... [output truncated]

Train - batches : 911, average loss: 1.7051, accuracy: 10524/29152 (36%)
Train - batches : 912, average loss: 1.7046, accuracy: 10542/29184 (36%)
Train - batches : 913, average loss: 1.7047, accuracy: 10551/29216 (36%)
Train - batches : 914, average loss: 1.7045, accuracy: 10565/29248 (36%)
Train - batches : 915, average loss: 1.7044, accuracy: 10574/29280 (36%)
Train - batches : 916, average loss: 1.7041, accuracy: 10592/29312 (36%)
Train - batches : 917, average loss: 1.7038, accuracy: 10609/29344 (36%)
Train - batches : 918, average loss: 1.7038, accuracy: 10625/29376 (36%)
Train - batches : 919, average loss: 1.7035, accuracy: 10641/29408 (36%)
Train - batches : 920, average loss: 1.7035, accuracy: 10652/29440 (36%)
Train - batches : 921, average loss: 1.7033, accuracy: 10668/29472 (36%)
Train - batches : 922, average loss: 1.7030, accuracy: 10683/29504 (36%)
Train - batches : 923, average loss: 1.7023, accuracy: 10702/29536 (36%)
Train - batches : 924, average loss: 1.7022, accura ... [output truncated]

Train - batches : 1023, average loss: 1.6801, accuracy: 12183/32736 (37%)
Train - batches : 1024, average loss: 1.6800, accuracy: 12194/32768 (37%)
Train - batches : 1025, average loss: 1.6796, accuracy: 12214/32800 (37%)
Train - batches : 1026, average loss: 1.6795, accuracy: 12226/32832 (37%)
Train - batches : 1027, average loss: 1.6792, accuracy: 12245/32864 (37%)
Train - batches : 1028, average loss: 1.6790, accuracy: 12262/32896 (37%)
Train - batches : 1029, average loss: 1.6791, accuracy: 12272/32928 (37%)
Train - batches : 1030, average loss: 1.6790, accuracy: 12283/32960 (37%)
Train - batches : 1031, average loss: 1.6787, accuracy: 12298/32992 (37%)
Train - batches : 1032, average loss: 1.6784, accuracy: 12314/33024 (37%)
Train - batches : 1033, average loss: 1.6780, accuracy: 12332/33056 (37%)
Train - batches : 1034, average loss: 1.6777, accuracy: 12349/33088 (37%)
Train - batches : 1035, average loss: 1.6779, accuracy: 12359/33120 (37%)
Train - batches : 1036, average loss: ... [output truncated]

Train - batches : 1134, average loss: 1.6577, accuracy: 13832/36288 (38%)
Train - batches : 1135, average loss: 1.6575, accuracy: 13847/36320 (38%)
Train - batches : 1136, average loss: 1.6571, accuracy: 13864/36352 (38%)
Train - batches : 1137, average loss: 1.6571, accuracy: 13881/36384 (38%)
Train - batches : 1138, average loss: 1.6571, accuracy: 13892/36416 (38%)
Train - batches : 1139, average loss: 1.6569, accuracy: 13909/36448 (38%)
Train - batches : 1140, average loss: 1.6565, accuracy: 13928/36480 (38%)
Train - batches : 1141, average loss: 1.6562, accuracy: 13944/36512 (38%)
Train - batches : 1142, average loss: 1.6559, accuracy: 13962/36544 (38%)
Train - batches : 1143, average loss: 1.6555, accuracy: 13979/36576 (38%)
Train - batches : 1144, average loss: 1.6556, accuracy: 13995/36608 (38%)
Train - batches : 1145, average loss: 1.6553, accuracy: 14010/36640 (38%)
Train - batches : 1146, average loss: 1.6551, accuracy: 14025/36672 (38%)
Train - batches : 1147, average loss: ... [output truncated]

Train - batches : 1245, average loss: 1.6363, accuracy: 15518/39840 (39%)
Train - batches : 1246, average loss: 1.6362, accuracy: 15535/39872 (39%)
Train - batches : 1247, average loss: 1.6358, accuracy: 15558/39904 (39%)
Train - batches : 1248, average loss: 1.6357, accuracy: 15572/39936 (39%)
Train - batches : 1249, average loss: 1.6357, accuracy: 15585/39968 (39%)
Train - batches : 1250, average loss: 1.6354, accuracy: 15602/40000 (39%)
Train - batches : 1251, average loss: 1.6351, accuracy: 15620/40032 (39%)
Train - batches : 1252, average loss: 1.6351, accuracy: 15631/40064 (39%)
Train - batches : 1253, average loss: 1.6350, accuracy: 15646/40096 (39%)
Train - batches : 1254, average loss: 1.6348, accuracy: 15660/40128 (39%)
Train - batches : 1255, average loss: 1.6347, accuracy: 15677/40160 (39%)
Train - batches : 1256, average loss: 1.6346, accuracy: 15693/40192 (39%)
Train - batches : 1257, average loss: 1.6349, accuracy: 15706/40224 (39%)
Train - batches : 1258, average loss: ... [output truncated]

Train - batches : 1356, average loss: 1.6170, accuracy: 17269/43392 (40%)
Train - batches : 1357, average loss: 1.6168, accuracy: 17282/43424 (40%)
Train - batches : 1358, average loss: 1.6167, accuracy: 17298/43456 (40%)
Train - batches : 1359, average loss: 1.6162, accuracy: 17319/43488 (40%)
Train - batches : 1360, average loss: 1.6161, accuracy: 17336/43520 (40%)
Train - batches : 1361, average loss: 1.6158, accuracy: 17355/43552 (40%)
Train - batches : 1362, average loss: 1.6155, accuracy: 17373/43584 (40%)
Train - batches : 1363, average loss: 1.6154, accuracy: 17389/43616 (40%)
Train - batches : 1364, average loss: 1.6152, accuracy: 17407/43648 (40%)
Train - batches : 1365, average loss: 1.6151, accuracy: 17420/43680 (40%)
Train - batches : 1366, average loss: 1.6149, accuracy: 17433/43712 (40%)
Train - batches : 1367, average loss: 1.6147, accuracy: 17452/43744 (40%)
Train - batches : 1368, average loss: 1.6146, accuracy: 17469/43776 (40%)
Train - batches : 1369, average loss: ... [output truncated]

Train - batches : 1467, average loss: 1.5961, accuracy: 19081/46944 (41%)
Train - batches : 1468, average loss: 1.5962, accuracy: 19093/46976 (41%)
Train - batches : 1469, average loss: 1.5961, accuracy: 19106/47008 (41%)
Train - batches : 1470, average loss: 1.5959, accuracy: 19121/47040 (41%)
Train - batches : 1471, average loss: 1.5958, accuracy: 19138/47072 (41%)
Train - batches : 1472, average loss: 1.5957, accuracy: 19159/47104 (41%)
Train - batches : 1473, average loss: 1.5953, accuracy: 19180/47136 (41%)
Train - batches : 1474, average loss: 1.5951, accuracy: 19196/47168 (41%)
Train - batches : 1475, average loss: 1.5948, accuracy: 19216/47200 (41%)
Train - batches : 1476, average loss: 1.5946, accuracy: 19234/47232 (41%)
Train - batches : 1477, average loss: 1.5947, accuracy: 19246/47264 (41%)
Train - batches : 1478, average loss: 1.5944, accuracy: 19263/47296 (41%)
Train - batches : 1479, average loss: 1.5941, accuracy: 19281/47328 (41%)
Train - batches : 1480, average loss: ... [output truncated]

<a id = "gpu"></a>
## Step 4: Training the model on GPU with Watson Machine Learning Accelerator

#### Prepare the model files for running on GPU:

In [6]:
import os
model_dir = '/project_data/data_asset/pytorch-resnet/resnet-wmla'
model_main = 'main.py'

os.makedirs(model_dir, exist_ok=True)
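
The GPU version of `main.py` reads its data location from the `DATA_DIR` environment variable. A minimal sketch of reading such a variable defensively, so that a missing value fails with a clear message rather than with a `TypeError` during string concatenation, could look like this. `require_env` is a hypothetical helper and `/tmp/data` is only a stand-in value.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError("environment variable {} is not set".format(name))
    return value


# Stand-in value for illustration; the training script assumes that the
# variable is already set when it runs.
os.environ.setdefault("DATA_DIR", "/tmp/data")
data_dir = require_env("DATA_DIR")
print(os.path.join(data_dir, "cifar10"))
```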

In [16]:
%%writefile {model_dir}/{model_main}
#!/usr/bin/env python
# coding: utf-8

# # Image Classification Using PyTorch Resnet with Watson Machine Learning Accelerator Notebook
# This asset walks through a basic computer vision image classification example using the notebook functionality within Watson Machine Learning Accelerator. In this asset, you will learn how to accelerate training of the PyTorch ResNet model on the CIFAR-10 dataset.
#
# Please refer to [Resnet Introduction](https://arxiv.org/abs/1512.03385) for more details.



from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import torchvision.models as models
import time

import sys
import os
import glob

log_interval = 10

seed = 1
use_cuda = False
completed_batch =0
completed_test_batch =0
criterion = nn.CrossEntropyLoss()


parser = argparse.ArgumentParser(description='PyTorch ResNet18 CIFAR-10 Example')
parser.add_argument('--batch-size', type=int, default=128, metavar='N',
                    help='input batch size for training (default: 128)')
parser.add_argument('--epochs', type=int, default=1, metavar='N',
                    help='number of epochs to train (default: 1)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
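# An optional flag replaces the positional 'cuda' argument, which could not be switched off.
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
args = parser.parse_args()
print(args)


# ## Create the Resnet18 model
# Train on the GPU unless --no-cuda is passed or no CUDA device is available.
use_cuda = not args.no_cuda and torch.cuda.is_available()
print("Use cuda: ", use_cuda)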

# ## Download the CIFAR-10 dataset
# The code below downloads the CIFAR-10 dataset automatically to $DATA_DIR/cifar10.
# You can also download the [CIFAR-10 python version](https://www.cs.toronto.edu/~kriz/cifar.html) and upload it manually.

# Read DATA_DIR before printing it so that an unset variable does not raise a TypeError here.
DATA_DIR = os.getenv("DATA_DIR")
print("DATA_DIR: ", DATA_DIR)

def getDatasets():
    train_data_dir = DATA_DIR + "/cifar10"
    test_data_dir = DATA_DIR + "/cifar10"

    transform_train = transforms.Compose([
        transforms.Resize(224),
        #transforms.RandomCrop(self.resolution, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    transform_test = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    return (torchvision.datasets.CIFAR10(root=train_data_dir, train=True, download=True, transform = transform_train),
            torchvision.datasets.CIFAR10(root=test_data_dir, train=False, download=True, transform = transform_test)
            )

torch.manual_seed(seed)
device = torch.device("cuda" if use_cuda else "cpu")

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

train_dataset, test_dataset = getDatasets()

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.batch_size, shuffle=True, **kwargs)


# ## Implement the customized train and test loop


def train(model, device, train_loader, optimizer, epoch):
    global completed_batch
    train_loss = 0
    correct = 0
    total = 0
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

        completed_batch += 1

        print ('Train - batches : {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)'.format(
           completed_batch, train_loss/(batch_idx+1), correct, total, 100.*correct/total))


def test(model, device, test_loader, epoch):
    global completed_test_batch
    global completed_batch
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    completed_test_batch = completed_batch -  len(test_loader)
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(test_loader):
            data, target = data.to(device), target.to(device)
            output = model(data)

            loss = criterion(output, target)

            test_loss += loss.item() # sum up batch loss
            _, pred = output.max(1) # index of the largest output score, i.e. the predicted class
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += target.size(0)

            completed_test_batch += 1

    # test_loss sums one loss value per batch, so average over the number of batches,
    # as train() does (this assumes criterion returns the mean loss of each batch)
    test_loss /= len(test_loader)
    test_acc = 100. * correct / len(test_loader.dataset)
    # Output test info for each epoch
    print('Test - batches: {}, average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)\n'.format(
        completed_batch, test_loss, correct, len(test_loader.dataset),
        test_acc))


# ## Create the Resnet18 model

model_type = "resnet18"
#model_type = "resnet50"
print("=> using pytorch build-in model '{}'".format(model_type))

model = models.resnet18()


# The torchvision built-in resnet18 architecture targets the ImageNet dataset, which has 1000 classes.
# To use it for CIFAR-10, replace the last fully-connected layer so that it outputs 10 classes.
# Note: models.resnet18() as called above starts from randomly initialized weights;
# ImageNet-pretrained weights are loaded only when explicitly requested.

for param in model.parameters():
    param.requires_grad = True  # set False if you only want to train the last layer using a pretrained model

# Replace the last fully-connected layer (once, outside the loop)
# Parameters of newly constructed modules have requires_grad=True by default
model.fc = nn.Linear(512, 10)


# (Optional) To start from the WMLA pretrained resnet18 model for CIFAR-10, load its weight file. The pretrained model weight file can be downloaded [here](https://?).

weightfile = DATA_DIR + "/checkpoint/model_epoch_final.pth"
if os.path.exists(weightfile):
    print ("Initial weight file is " + weightfile)
    model.load_state_dict(torch.load(weightfile, map_location=lambda storage, loc: storage))


# ## Run the model training
model.to(device)
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)
epochs = args.epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, 30, 0.1, last_epoch=-1)

# Output total iterations info for deep learning insights
print("Total iterations: %s" % (len(train_loader) * epochs))

print("RESULT_DIR: " + os.getenv("RESULT_DIR"))
RESULT_DIR = os.getenv("RESULT_DIR")
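# Checkpoints are written to RESULT_DIR/model, so make sure that directory exists
os.makedirs(os.path.join(RESULT_DIR, "model"), exist_ok=True)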

for epoch in range(1, epochs+1):
    print("\nRunning epoch %s ... It might take several minutes for each epoch to run." % epoch)
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader, epoch)
    scheduler.step()

    torch.save(model.state_dict(),  RESULT_DIR + "/model/model_epoch_%d.pth"%(epoch))

torch.save(model.state_dict(), RESULT_DIR + "/model/model_epoch_final.pth")

Overwriting /project_data/data_asset/pytorch-resnet/resnet-wmla/main.py
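
Two numbers in `main.py` can be checked by hand: the `Total iterations` value that the script prints, and the learning rate produced by `StepLR(optimizer, 30, 0.1)`. The sketch below reproduces both in plain Python. It assumes the standard CIFAR-10 training split of 50,000 images, the batch size of 128 and learning rate of 0.01 that appear in the job log later in this notebook, and a `DataLoader` that keeps the last partial batch (its default). The helper names `batches_per_epoch` and `step_lr` are illustrative and are not part of PyTorch.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches in one epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# 50,000 training images in batches of 128
print(batches_per_epoch(50_000, 128))  # 391

# The rate stays at its base value until 30 scheduler steps have been taken
for epoch in (0, 29, 30, 60):
    print(epoch, step_lr(0.01, epoch))
```

With a single epoch, as in this notebook, the learning rate never decays: the first reduction would only take effect after 30 epochs.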


## Training results on GPU

#### Training was run from a Cloud Pak for Data notebook using a GPU kernel.


In the custom environment that was created with **16 vCPU** and **32 GB**, it took **147 seconds** (approximately **2.5 minutes**) to complete one epoch of training. The run recorded below reports 142 seconds.


In [17]:
files = {'file': open('/project_data/data_asset/pytorch-resnet/resnet-wmla/main.py', 'rb')}

args = '--exec-start PyTorch --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --model-main main.py --epochs 1'
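
The `args` string above uses backslash line continuations inside the quotes, so the indentation spaces become part of the string (they are visible in the `args` field of the job details further down). Here is a minimal sketch of an alternative that joins the options with single spaces and percent-encodes the result for a query string. The names `arg_parts`, `clean_args` and `encoded_args` are illustrative. Whether the service needs the value pre-encoded is an assumption, since the original call passes the raw string.

```python
from urllib.parse import quote, unquote

# One option per list entry keeps the command readable without stray whitespace
arg_parts = [
    '--exec-start PyTorch',
    '--cs-datastore-meta type=fs',
    '--workerDeviceNum 1',
    '--model-main main.py',
    '--epochs 1',
]

clean_args = ' '.join(arg_parts)

# Percent-encode spaces and '=' so the value is safe inside a URL query string
encoded_args = quote(clean_args)

print(clean_args)
print(encoded_args)
print(unquote(encoded_args) == clean_args)
```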


In [18]:
starttime = datetime.datetime.now()

r = requests.post(dl_rest_url+'/execs?args='+args, files=files,
                  headers=commonHeaders, verify=False)
if not r.ok:
    # Stop here: without a job id, the status query below cannot run
    raise RuntimeError('submit job failed: code=%s, %s' % (r.status_code, r.content))

job_status = query_job_status(r.json(),refresh_rate=5)

endtime = datetime.datetime.now()

print("\nTraining cost: ", (endtime - starttime).seconds, " seconds.")

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,xwmla-1882,--exec-start PyTorch --cs-datastore-meta type=fs --workerDeviceNum 1 --mod...,xwmla-1882,admin,FINISHED,xwmla-1882,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/xwmla-1882/_submitted_code,SingleNodePytorchTrain,2021-02-08T21:17:27Z,False,xwmla,1,PyTorch


{ 'appId': 'xwmla-1882',
  'appName': 'SingleNodePytorchTrain',
  'args': '--exec-start PyTorch --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --model-main main.py --epochs 1 ',
  'createTime': '2021-02-08T21:17:27Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'PyTorch',
  'id': 'xwmla-1882',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'xwmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'xwmla-1882',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/xwmla-1882/_submitted_code'}

Training cost:  142  seconds.
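
The training cost above is derived from a `datetime.timedelta`. For runs shorter than a day the `.seconds` attribute is sufficient, but it excludes whole days and fractions of a second, whereas `total_seconds()` returns the full duration. A small sketch with a made-up duration (the one-day offset and half second are purely illustrative):

```python
import datetime

# Hypothetical duration: 1 day, 142 seconds and half a second
elapsed = datetime.timedelta(days=1, seconds=142, microseconds=500_000)

print(elapsed.seconds)          # 142 (days and microseconds are not included)
print(elapsed.total_seconds())  # 86542.5 (the full duration in seconds)
```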


## Training metrics and logs

#### Retrieve and display the model training metrics:

In [19]:
query_train_metric(r.json())

https://wmla-console-xwmla.apps.wml1x180.ma.platformlab.ibm.com/platform/rest/deeplearning/v1/execs/xwmla-1882/log
Namespace(batch_size=128, cuda=True, epochs=1, lr=0.01)
Use cuda:  True
DATA_DIR: /gpfs/mydatafs
Files already downloaded and verified
Files already downloaded and verified
=> using pytorch build-in model 'resnet18'
Total iterations: 391
RESULT_DIR: /gpfs/myresultfs/admin/batchworkdir/xwmla-1882

Running epoch 1 ... It might take several minutes for each epoch to run.
Train - batches : 1, average loss: 2.4147, accuracy: 15/128 (12%)
Train - batches : 2, average loss: 2.3836, accuracy: 27/256 (11%)
Train - batches : 3, average loss: 2.3745, accuracy: 40/384 (10%)
Train - batches : 4, average loss: 2.3524, accuracy: 58/512 (11%)
Train - batches : 5, average loss: 2.3373, accuracy: 77/640 (12%)
Train - batches : 6, average loss: 2.3258, accuracy: 105/768 (14%)
Train - batches : 7, average loss: 2.3177, accuracy: 121/896 (14%)
Train - batches : 8, average loss: 2.3115, accurac
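
The `Train - batches` lines in this log follow a fixed format, so they can be turned into numbers for plotting or comparison. The sketch below uses a regular expression for that. `parse_train_metrics` is an illustrative helper, not part of the WML-A API, and the sample text is copied from the output above, including the truncated last line, which the parser skips because it is incomplete.

```python
import re

# Matches lines such as:
# Train - batches : 1, average loss: 2.4147, accuracy: 15/128 (12%)
TRAIN_LINE = re.compile(
    r"Train - batches : (\d+), average loss: ([\d.]+), accuracy: (\d+)/(\d+)"
)


def parse_train_metrics(log_text):
    """Return one dict per complete 'Train - batches' line in log_text."""
    rows = []
    for match in TRAIN_LINE.finditer(log_text):
        batch, loss, correct, total = match.groups()
        rows.append({
            "batch": int(batch),
            "loss": float(loss),
            "accuracy": int(correct) / int(total),
        })
    return rows


sample = """Train - batches : 1, average loss: 2.4147, accuracy: 15/128 (12%)
Train - batches : 2, average loss: 2.3836, accuracy: 27/256 (11%)
Train - batches : 8, average loss: 2.3115, accurac"""

for row in parse_train_metrics(sample):
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` to plot the loss curve with the libraries already imported in this notebook.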

#### Retrieve and display the model training logs:

In [20]:
query_executor_stdout_log(r.json())

https://wmla-console-xwmla.apps.wml1x180.ma.platformlab.ibm.com/platform/rest/deeplearning/v1/scheduler/applications/xwmla-1882/executor/1/logs/stdout?lastlines=1000
*Task <1> SubProcess*: 2021-02-08 21:17:37.197660 37 INFO Create log direcotry /wmla-logging/dli/xwmla-1882/dli/./app.xwmla-1882-task12n-nxl8r
*Task <1> SubProcess*: 2021-02-08 21:17:37.202192 37 INFO Running on kubernetes.
*Task <1> SubProcess*: 2021-02-08 21:17:37.208179 37 INFO List GPUs
*Task <1> SubProcess*: Mon Feb  8 21:17:37 2021       
*Task <1> SubProcess*: +-----------------------------------------------------------------------------+
*Task <1> SubProcess*: | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
*Task <1> SubProcess*: |-------------------------------+----------------------+----------------------+
*Task <1> SubProcess*: | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
*Task <1> SubProcess*: | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usa

## Download trained model from Watson Machine Learning Accelerator 

In [21]:
download_trained_model(r.json())

https://wmla-console-xwmla.apps.wml1x180.ma.platformlab.ibm.com/platform/rest/deeplearning/v1/execs/xwmla-1882/result
Save model:  /project_data/data_asset/xwmla-1882.zip
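
The download step saves the job result as a zip archive. The sketch below shows one way to inspect such an archive with the standard `zipfile` module before extracting it. The archive layout is an assumption, because the actual contents of `xwmla-1882.zip` are not shown here, so the example builds a small stand-in file instead of reading the real one. `list_archive` is an illustrative helper and the member name is hypothetical.

```python
import os
import tempfile
import zipfile


def list_archive(zip_path):
    """Return the member names stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


# Build a stand-in archive so the sketch runs anywhere (hypothetical member name)
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo-result.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("model/model_epoch_final.pth", b"placeholder")

print(list_archive(demo_zip))
```

To inspect the real download, call `list_archive` with the path printed by the cell above.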
