# Using WMLA Elastic Distributed Training via API - a sample notebook

<div class="alert alert-block alert-info">

### Contents

- [The big picture](#The-big-picture)
- [Changes to your code](#Changes-to-your-code)
- [Making dataset available](#Making-dataset-available)
- [Set up API end point and log on](#Set-up-API-end-point-and-log-on)
- [Submit job via API](#Submit-job-via-API)
- [Monitor running job](#Monitor-running-job)
- [Retrieve output and saved models](#Retrieve-output-and-saved-models)
  - [Output](#Output)
  - [Saved models](#Saved-models)
- [Debugging any issues](#Debugging-any-issues)
- [Further information and useful links](#Further-information-and-useful-links)

</div>

## The big picture
[Back to top](#Contents)

This notebook walks through taking your existing model training code and making the changes required to run it on [IBM Watson Machine Learning Accelerator](https://developer.ibm.com/linuxonpower/deep-learning-powerai/powerai-enterprise/) (WMLA) using Elastic Distributed Training (EDT).

<span style='color:deeppink'>**TODO:** Link to blog with more details</span>

The image below shows the various elements required to use EDT. In this notebook we will step through each of these elements in more detail. Through this process you will offload your code to a WMLA cluster, monitor the running job, retrieve the output and debug any issues seen. A [static version](https://github.com/mandieq/shared_images/raw/master/wmla_api_pieces/5_running_job.png) is also available.

![overall](https://github.com/mandieq/shared_images/raw/master/wmla_api_pieces/5_running_job.gif)

## Changes to your code
[Back to top](#Contents)

In this section we will take existing sample code and make the relevant changes required for use with EDT. An overview of these changes can be seen in the diagram below. A [static version](https://github.com/mandieq/shared_images/raw/master/wmla_api_pieces/2_code_adaptations.png) is also available.

![code](https://github.com/mandieq/shared_images/raw/master/wmla_api_pieces/2_code_adaptations.gif)

<span style='color:deeppink'>**TODO:** Add Kelvin's existing code here re changes to be made.</span>

<span style='color:deeppink'>**TODO:** Add links to original code and updated code package which can be downloaded for testing / inspection.</span>

## Original ResNet18 model: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html



In [4]:
# %load '/Users/Kelvin/Documents/WorkInProgress/WellsFargo/CSSC_EDT/pytorch_edt_cssc2/pytorch_mnist_EDT.py'
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
from torch.optim import lr_scheduler
from pathlib import Path

import sys
import os
from os import environ
import json


#  Importing libraries and setting up environment variables
path=os.path.join(os.getenv("FABRIC_HOME"), "libs", "fabric.zip")
print(path)
sys.path.insert(0,path)
from fabric_model import FabricModel
from edtcallback import EDTLoggerCallback

dataDir = environ.get("DATA_DIR")
if dataDir is not None:
    print("dataDir is: %s"%dataDir)
else:
    print("Warning: not found DATA_DIR from os env!")


model_path = os.environ["RESULT_DIR"]+"/model/saved_model"
tb_directory = os.environ["LOG_DIR"]+"/tb"
print ("model_path: %s" %model_path)
print ("tb_directory: %s" %tb_directory)

# Data Loading function for EDT

def getDatasets():
    data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    }

    return (datasets.ImageFolder(os.path.join(dataDir, 'train'), data_transforms['train']),
            datasets.ImageFolder(os.path.join(dataDir, 'val'), data_transforms['val']))


def main():

  
    # Extract parameters for training
    parser = argparse.ArgumentParser(description='PyTorch ResNet18 transfer learning example for EDT')
    parser.add_argument('--batchsize', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--numWorker', type=int, default=100, metavar='N',
                        help='maximum number of workers (default: 100)')

    parser.add_argument('--epochs', type=int, default=5, metavar='N',
                        help='number of epochs for training (default: 5)')

    args, unknown = parser.parse_known_args()
 
    print('args: ', args)

    #Define Model
    model_conv = torchvision.models.resnet18(pretrained=True)
    for param in model_conv.parameters():
        param.requires_grad = False

    num_ftrs = model_conv.fc.in_features
    model_conv.fc = nn.Linear(num_ftrs, 2)
    criterion = nn.CrossEntropyLoss()
    
    #inject learning_rate from HPO API and do the search
    optimizer_conv = optim.SGD(model_conv.fc.parameters(), lr=0.1, momentum=0.9)
    exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)

  
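    # Replace the training and testing loops with EDT equivalents.
    # F.cross_entropy is used because the final nn.Linear layer outputs raw logits;
    # F.nll_loss expects log-probabilities and reports negative losses on raw logits.
    edt_m = FabricModel(model_conv, getDatasets, F.cross_entropy, optimizer_conv, driver_logger=EDTLoggerCallback())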
    edt_m.train(args.epochs, args.batchsize, args.numWorker)

if __name__ == '__main__':
    print('sys.argv: ', sys.argv)
    main()



ModuleNotFoundError: No module named 'torch'
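
The adapted script reads three environment variables (`DATA_DIR`, `RESULT_DIR` and `LOG_DIR`), so it cannot be smoke-tested on a laptop as written. The sketch below shows one possible way to resolve the same paths with local fallbacks. The helper name `resolve_edt_paths` and the `./edt_local` fallback directory are assumptions made for illustration; they are not part of WMLA.

```python
import os


def resolve_edt_paths(env=None, local_root="./edt_local"):
    """Resolve the data, model and TensorBoard paths used by the training script.

    `local_root` is a hypothetical fallback for local runs; on the cluster the
    three environment variables are expected to be set already.
    """
    env = os.environ if env is None else env
    data_dir = env.get("DATA_DIR", os.path.join(local_root, "data"))
    result_dir = env.get("RESULT_DIR", os.path.join(local_root, "result"))
    log_dir = env.get("LOG_DIR", os.path.join(local_root, "log"))
    return {
        "data_dir": data_dir,
        "model_path": os.path.join(result_dir, "model", "saved_model"),
        "tb_directory": os.path.join(log_dir, "tb"),
    }


# Example with explicit values instead of the real environment
paths = resolve_edt_paths({
    "DATA_DIR": "/dlidata/hymenoptera_data",
    "RESULT_DIR": "/dliresult/demo",
    "LOG_DIR": "/dliresult/demo/log",
})
print(paths)

# With nothing set, every path falls back under local_root
local_paths = resolve_edt_paths({})
print(local_paths)
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.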

## Making dataset available
[Back to top](#Contents)

Next we will make our dataset available to the WMLA cluster as seen in the diagram below. 

![data](https://github.com/mandieq/shared_images/raw/master/wmla_api_pieces/3_dataset.png)

<span style='color:deeppink'>**TODO:** Add details of where this should go and with what permissions. Details of ssh commands needed here?</span>

1. Log on to the cluster and change to the directory configured as `DLI_DATA_FS` (`/dlidata` in this example)

2. Download dataset

```
wget https://download.pytorch.org/tutorial/hymenoptera_data.zip

Resolving download.pytorch.org... 99.86.230.63, 99.86.230.94, 99.86.230.13, ...
Connecting to download.pytorch.org|99.86.230.63|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47286322 (45M) [application/zip]
Saving to: 'hymenoptera_data.zip'

hymenoptera_data.zip          100%[================================================>]  45.10M  11.0MB/s    in 4.5s    

2020-02-19 17:01:52 (10.1 MB/s) - 'hymenoptera_data.zip' saved [47286322/47286322]

```

3. Unzip the archive and change the owner and group of the extracted files to the instance group's execution user (`egoadmin` in this example)
```
[root@colonia04 dlidata]# chown -R egoadmin:egoadmin hymenopteradata/
[root@colonia04 hymenopteradata]# pwd
/dlidata/hymenopteradata
[root@colonia04 hymenopteradata]# ls -lt
total 0
drwxr-x--- 4 egoadmin egoadmin 34 Jan  7 23:54 MNIST
drwxr-xr-x 4 egoadmin egoadmin 30 Jan  7 23:08 val
drwxr-xr-x 4 egoadmin egoadmin 30 Jan  7 23:08 train
```
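
Before submitting a job it can help to confirm that the dataset has the layout the training script expects. `datasets.ImageFolder` needs one sub-directory per class inside each split directory (for this dataset, `ants` and `bees` under `train` and `val`). The following is a minimal sketch of such a check; the function name `check_image_folder_layout` is hypothetical and the demo runs against a throwaway directory rather than the real `/dlidata` location.

```python
import os
import tempfile


def check_image_folder_layout(data_dir, splits=("train", "val")):
    """Return {split: [class names]}; raise if a split or its class folders are missing."""
    layout = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError("missing split directory: " + split_dir)
        classes = sorted(
            entry for entry in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, entry))
        )
        if not classes:
            raise ValueError("no class sub-directories in " + split_dir)
        layout[split] = classes
    return layout


# Demo on a temporary directory that mimics the hymenoptera layout
demo_root = tempfile.mkdtemp()
for split in ("train", "val"):
    for class_name in ("ants", "bees"):
        os.makedirs(os.path.join(demo_root, split, class_name))

layout = check_image_folder_layout(demo_root)
print(layout)
```

On the cluster, the same function could be pointed at the directory that will later be passed as `data_path`.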



## Set up API end point and log on
[Back to top](#Contents)

In this section we set up the API endpoint which will be used in this notebook.

<span style='color:deeppink'>**TODO:** Add existing details as to how to find this, what it should look it and where to find documentation.</span>

1. Source the environment

```
. $EGO_TOP/profile.platform

```
2. Log on

```
egosh user logon -u Admin -x Admin
Logged on successfully

```

3. Retrieve the Conductor REST API port

```
egosh client view |grep -A 3 ASCD_REST_BASE_URL_1
CLIENT NAME: ASCD_REST_BASE_URL_1
DESCRIPTION: http://colonia04.platform:8280/platform/rest/

```

4. Retrieve the DLI (Deep Learning Impact) REST API port

```
egosh client view |grep -A 3 DLPD_REST_BASE_URL_1
CLIENT NAME: DLPD_REST_BASE_URL_1
DESCRIPTION: http://colonia04.platform:9280/platform/rest/

```
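
The host and port needed later in the notebook can be read off the `DESCRIPTION` line by eye, but they can also be extracted programmatically. The sketch below parses text in the format shown in steps 3 and 4. It assumes the `CLIENT NAME:` / `DESCRIPTION:` layout seen above and uses a hypothetical helper name, `parse_rest_endpoint`.

```python
from urllib.parse import urlparse


def parse_rest_endpoint(egosh_output, client_name):
    """Return (host, port) from the DESCRIPTION line that follows the given CLIENT NAME."""
    lines = [line.strip() for line in egosh_output.splitlines()]
    for index, line in enumerate(lines):
        if line != "CLIENT NAME: " + client_name:
            continue
        for following in lines[index + 1:]:
            if following.startswith("CLIENT NAME:"):
                break  # reached the next client without finding a URL
            if following.startswith("DESCRIPTION:"):
                url = following[len("DESCRIPTION:"):].strip()
                parsed = urlparse(url)
                return parsed.hostname, parsed.port
    raise ValueError("no REST URL found for " + client_name)


# Sample text copied from the egosh output in step 4
sample = """CLIENT NAME: DLPD_REST_BASE_URL_1
DESCRIPTION: http://colonia04.platform:9280/platform/rest/
"""

host, port = parse_rest_endpoint(sample, "DLPD_REST_BASE_URL_1")
print(host, port)
```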



## Submit job via API
[Back to top](#Contents)

Now we need to structure our API job submission. There are various elements to this process as seen in the diagram below. Note that **this** Jupyter notebook is the one referred to below. A [static version](https://github.com/mandieq/shared_images/raw/master/wmla_api_pieces/4_api_setup.png) is also available.

![code](https://github.com/mandieq/shared_images/raw/master/wmla_api_pieces/4_api_setup.gif)

<span style='color:deeppink'>**TODO:** Add existing code here re changes to be made. Include the first step as packaging the code we created in the last section.</span>

<span style='color:deeppink'>**TODO:** Add existing code here to actually get the workload running! Should we add anything about watching GPU usage via various means? Or leave that to the blog?</span>

In [5]:
import requests, json
import pandas as pd
import datetime
# from IPython.display import display

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
%matplotlib inline
# plt.rcParams['figure.figsize'] = [24, 8.0]
import seaborn as sns

pd.set_option('display.max_columns', 999)
pd.set_option('max_colwidth', 300)

import tarfile
import tempfile
import os


In [6]:
#Construct API call

master_host = 'colonia05.platform'
dli_rest_port = '9280'  #Deep Learning Impact Rest API Port
sc_rest_port = '8280' #Conductor Rest API Port

sc_rest_url = 'http://'+master_host+':'+sc_rest_port+'/platform/rest/conductor/v1'
dl_rest_url = 'http://'+master_host+':'+dli_rest_port+'/platform/rest/deeplearning/v1'

# User login details
#wmla_user = '**** ADD HERE ****'
#wmla_pwd = '**** ADD HERE ****'
wmla_user = 'Admin'
wmla_pwd = 'Admin'

myauth = (wmla_user, wmla_pwd)

# Instance Group to be used
#sig_name =  '***ADD Here***'
sig_name = 'dli-edt'

# REST call variables
headers = {'Accept': 'application/json'}
print (sc_rest_url)
print (dl_rest_url)

# Model Path
model_path = '/Users/Kelvin/Documents/WorkInProgress/WellsFargo/CSSC_EDT/pytorch_edt_cssc2'

http://colonia05.platform:8280/platform/rest/conductor/v1
http://colonia05.platform:9280/platform/rest/deeplearning/v1


## Package model files for training
Package the updated model files into a tar file whose name ends with `.modelDir.tar`.

In [49]:
def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))


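MODEL_DIR_SUFFIX = ".modelDir.tar"
# mkstemp creates the file securely (mktemp is deprecated); close the handle so tarfile can reopen the path
fd, tempFile = tempfile.mkstemp(suffix=MODEL_DIR_SUFFIX)
os.close(fd)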
#make_tarfile(tempFile, '/path/to/your/file/')
#make_tarfile(tempFile, '/Users/Kelvin/Github/dli-1.2.3-wmla-1.2.1-samples/dli-1.2.3-wmla-1.2.1-samples/pytorch_edt/')
make_tarfile(tempFile, '/Users/Kelvin/Documents/WorkInProgress/WellsFargo/CSSC_EDT/pytorch_edt_cssc2')
print(" tempFile: " + tempFile)
files = {'file': open(tempFile, 'rb')}

 tempFile: /var/folders/l8/5dhpt4mn5zs6rjblzhlxp5300000gp/T/tmp1xgvo6vx.modelDir.tar
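
It is worth checking what actually ended up in the archive before uploading it. The sketch below repeats the packaging step on a throwaway directory and lists the members. The directory and file names are placeholders, and `os.path.normpath` is added here as a precaution because `os.path.basename` returns an empty string for a path with a trailing slash.

```python
import os
import tarfile
import tempfile


def make_tarfile(output_filename, source_dir):
    """Package source_dir into a gzip-compressed tar with the directory name as top level."""
    # normpath strips a trailing slash so that basename is never empty
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=top_level)


# Build a small placeholder model directory
work_dir = tempfile.mkdtemp()
source_dir = os.path.join(work_dir, "pytorch_edt_demo")
os.makedirs(source_dir)
with open(os.path.join(source_dir, "main.py"), "w") as handle:
    handle.write("print('hello')\n")

fd, archive_path = tempfile.mkstemp(suffix=".modelDir.tar")
os.close(fd)
make_tarfile(archive_path, source_dir + "/")  # trailing slash on purpose

# List the members to confirm the main file sits under the top-level directory
with tarfile.open(archive_path, "r:*") as tar:
    members = sorted(tar.getnames())
print(members)
```

The top-level directory name in the archive should match the value later passed to `--model-dir`.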


## Log on


Obtain login session tokens to be used for session authentication within the RESTful API. Tokens are valid for 8 hours.

In [8]:
r = requests.get(sc_rest_url+'/auth/logon', verify=False, auth=myauth, headers=headers)

if r.ok:
    print('\nLogon succeeded')
else:
    print('\nLogon failed with code={}, {}'.format(r.status_code, r.content))


Logon succeeded


## Check DL Frameworks details

Check which framework plugins are available and see example execution commands. In this demonstration we will use **edtPyTorch**.

In [9]:
r = requests.get(dl_rest_url+'/execs/frameworks', auth=myauth, headers=headers, verify=False).json()
# Using the raw json, easier to see the examples given
print(json.dumps(r, indent=4))

[
    {
        "name": "edtKeras",
        "description": "",
        "desc": [
            "Keras - IBM Elastic Distributed Training (EDT)",
            "Examples:",
            "$ python dlicmd.py --exec-start edtKeras <connection-options> --ig <ig> --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py"
        ]
    },
    {
        "name": "edtPyTorch",
        "description": "",
        "desc": [
            "PyTorch - IBM Elastic Distributed Training (EDT)",
            "Examples:",
            "$ python dlicmd.py --exec-start edtPyTorch <connection-options> --ig <ig> --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py"
        ]
    },
    {
        "name": "tensorflow1100",
        "description": "",
        "desc": [
            "Single-node TensorFlow. Tested for Tensorflow 1.10.0.",
            "NOTES:",
            "- Since DLI manages GPU allocation, if you explicitly assign devices using",
            "  calls such as `tf.device`, you should use

## Arguments for API call
The arguments below are the equivalent of the flags you would use when running the command directly on the WMLA CLI.


In [50]:
framework_name = 'edtPyTorch' # DL Framework to use, from list given above
dataset_location = 'hymenoptera_data' # relative path of your data set 
local_dir_containing_your_code = 'pytorch_edt_cssc2'
number_of_GPU = '2' # number of GPUs for elastic distribution
name_of_your_code_file = 'pytorch_mnist_EDT.py' # Main model file as opened locally above


args = '--exec-start {} \
        --cs-datastore-meta type=fs,data_path={} \
        --model-dir {} \
        --edt-options maxWorkers={} \
        --model-main {} \
        --epochs 50  \
        '.format(framework_name, dataset_location, local_dir_containing_your_code, number_of_GPU, name_of_your_code_file)

print ("args: " + args)

args: --exec-start edtPyTorch         --cs-datastore-meta type=fs,data_path=hymenoptera_data         --model-dir pytorch_edt_cssc2         --edt-options maxWorkers=2         --model-main pytorch_mnist_EDT.py         --epochs 50          
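
The backslash continuations inside the string literal leave long runs of spaces in `args`, as the printed value shows. One alternative, sketched below, is to assemble the flags from a list and percent-encode the query string explicitly. The helper name `build_exec_args` is hypothetical, and it is an assumption that the API accepts a fully percent-encoded `args` parameter in the same way as the hand-built URL.

```python
from urllib.parse import parse_qs, quote, urlencode


def build_exec_args(framework, data_path, model_dir, max_workers, model_main, extra=()):
    """Join the submission flags with single spaces."""
    parts = [
        "--exec-start", framework,
        "--cs-datastore-meta", "type=fs,data_path={}".format(data_path),
        "--model-dir", model_dir,
        "--edt-options", "maxWorkers={}".format(max_workers),
        "--model-main", model_main,
    ]
    parts.extend(extra)
    return " ".join(parts)


exec_args = build_exec_args(
    "edtPyTorch",
    "hymenoptera_data",
    "pytorch_edt_cssc2",
    2,
    "pytorch_mnist_EDT.py",
    extra=["--epochs", "50"],
)
print(exec_args)

# Encode spaces as %20 rather than '+' so the query string is unambiguous
query = urlencode({"sigName": "dli-edt", "args": exec_args}, quote_via=quote)
print(query)
```

With `requests`, the same effect can be had by passing the two values through the `params` argument instead of concatenating them into the URL.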


## Submit Job via API call

In [51]:
r = requests.post(dl_rest_url+'/execs?sigName='+sig_name+'&args='+args, files=files,
                  auth=myauth, headers=headers, verify=False)

if r.ok:
    exec_id = r.json()['id']
    sig_id = r.json()['sigId']
    driver_id = r.json()['submissionId']
    print('\nModel submitted successfully\nDriver ID: {}'.format(driver_id))
    print('Exec ID: {}'.format(exec_id))
    print('SIG ID: {}'.format(sig_id))
else:
    print('\nModel submission failed with code={}, {}'.format(r.status_code, r.content))


Model submitted successfully
Driver ID: driver-20200221234711-0084-e43f7b87-017a-4024-a890-8a6a8a7ba055
Exec ID: Admin-2809611603258194-1703562770
SIG ID: 63027c69-543b-451c-9a03-9c850784725f


## Monitor running job
[Back to top](#Contents)

Once the job is submitted successfully we can monitor the running job. 

<span style='color:deeppink'>**TODO:** Add existing code here re how to do this.</span>

In [54]:
# Check the status of all RUNNING jobs in the instance group (rerun this cell to refresh)

monitor_output = pd.DataFrame()
executors = []

r = requests.get(sc_rest_url+'/instances/'+sig_id+'/applications?state=RUNNING',
                 auth=myauth, headers=headers, verify=False).json()

if len(r) == 0:
    print('No jobs running')

else:

    # Keep only the relevant fields, one row per running job
    monitor_output = pd.DataFrame(
        [(
            job['driver']['id'],
            job['driver']['state'],
            job['apprunduration'],
            job['gpuslots'],
            job['gpumemused']['total'],
            job['gpudevutil']['total'],
        ) for job in r],
        columns=[
            'Driver ID',
            'State',
            'Run duration (mins)',
            'GPU slots',
            'Total GPU memory used',
            'Total GPU utilisation (%)',
        ])

    # Keep the executor list of the last job returned; it is used later to fetch the executor logs
    for job in r:
        executors = job['executors']

monitor_output

| | Driver ID | State | Run duration (mins) | GPU slots | Total GPU memory used | Total GPU utilisation (%) |
|---|---|---|---|---|---|---|
| 0 | driver-20200221234711-0084-e43f7b87-017a-4024-a890-8a6a8a7ba055 | RUNNING | 0.143417 | 2 | 0 | 0.0 |


In [46]:
# Check the status of the submitted job

r = requests.get(dl_rest_url+'/execs/'+exec_id, auth=myauth, headers=headers, verify=False).json()
# Show the returned fields as rows; avoids round-tripping through a JSON string
pd.Series(r).to_frame()

| | 0 |
|---|---|
| appId | app-20200221160530-0043-c1d7e3c6-1fd9-4294-b17c-47a5c3e46323 |
| args | --exec-start edtPyTorch --cs-datastore-meta type=fs,data_path=hymenoptera_data --model-dir pytorch_edt_cssc2 --edt-options maxWorkers=2 --model-main pytorch_mnist_EDT.py --epochs 50 |
| executionUserName | egoadmin |
| id | Admin-2781904014777372-695105367 |
| masterURL | spark://colonia05.platform:6072 |
| sigId | 63027c69-543b-451c-9a03-9c850784725f |
| sigName | dli-edt |
| state | RUNNING |
| submissionId | driver-20200221160523-0083-913ec8b0-3b74-4779-8b20-bd6f9ee021f4 |
| userName | Admin |
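
Rerunning the status cell by hand works, but a small polling loop can wait for the job to leave the `RUNNING` state. The sketch below is deliberately independent of the REST call: it takes any function that returns the current state. Only `RUNNING` appears in the output above, so the terminal state names used in the demo (`FINISHED`, `ERROR`, `KILLED`) are assumptions, as is the helper name `wait_for_state`.

```python
import time


def wait_for_state(fetch_state, done_states, interval=10.0, timeout=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Call fetch_state() until it returns one of done_states or the timeout expires."""
    deadline = clock() + timeout
    while True:
        state = fetch_state()
        if state in done_states:
            return state
        if clock() >= deadline:
            raise TimeoutError("job still in state {!r} after {} s".format(state, timeout))
        sleep(interval)


# Demo with a fake state source instead of the REST API
fake_states = iter(["RUNNING", "RUNNING", "FINISHED"])
final_state = wait_for_state(
    lambda: next(fake_states),
    done_states={"FINISHED", "ERROR", "KILLED"},  # assumed terminal states
    interval=0.0,
)
print(final_state)
```

In this notebook, `fetch_state` could be a lambda that repeats the `requests.get(dl_rest_url+'/execs/'+exec_id, ...)` call and returns the `state` field of the response.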


## Retrieve output and saved models
[Back to top](#Contents)

After the job completes, we can retrieve the output, logs and saved models.

### Output

<span style='color:deeppink'>**TODO:** Add existing code.</span>

### Saved models

<span style='color:deeppink'>**TODO:** Add existing code regarding changes to get a model saved. And then how to get the saved version back.</span>

## Save Model

In [None]:
# Get model from training job - downloads zip file (with progress bar) of saved model to directory local to this notebook
# (note that you need to save model in your code using the environment variable for location)

import requests, zipfile, io
from tqdm.notebook import tqdm

r = requests.get(dl_rest_url+'/execs/'+exec_id+'/result', auth=myauth, stream=True)

total_size = int(r.headers.get('Content-Disposition').split('size=')[1])
block_size = 1024 #1 Kibibyte
t=tqdm(total=total_size, unit='iB', unit_scale=True)

with open('model.zip', 'wb') as f:
    for data in r.iter_content(block_size):
        t.update(len(data))
        f.write(data)
t.close()
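
Once `model.zip` has been downloaded it still needs to be unpacked. The sketch below shows one way to do that with the standard library, refusing archive entries that would land outside the target directory. The function name `extract_model_archive` and the archive contents used in the demo are placeholders; the real archive layout depends on what your training code saved.

```python
import os
import tempfile
import zipfile


def extract_model_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted names of the extracted files."""
    target_root = os.path.realpath(target_dir)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            destination = os.path.realpath(os.path.join(target_root, name))
            # Reject entries such as '../x' that would escape the target directory
            if os.path.commonpath([target_root, destination]) != target_root:
                raise ValueError("unsafe path in archive: " + name)
        archive.extractall(target_root)
        return sorted(n for n in archive.namelist() if not n.endswith("/"))


# Demo with a small placeholder archive
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "model.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("model/saved_model", b"weights placeholder")

output_dir = os.path.join(work_dir, "model_out")
extracted = extract_model_archive(demo_zip, output_dir)
print(extracted)
```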

## Retrieve Training Metric

<span style='color:deeppink'>**TODO:** Plot the graph of metric of Loss and Accuracy.</span>

In [47]:


# Retrieve the training log for the submitted job (includes loss and accuracy per step)

r = requests.get(dl_rest_url+'/execs/'+exec_id+'/log', auth=myauth, headers=headers, verify=False).json()
print(r)


[2020-02-21 16:05:46] Train,	Timestamp: 1582297546490,	Global steps: 1,	Iteration: 2,	Loss: 0.00124,	Accuracy: 0.00469
[2020-02-21 16:05:46] Train,	Timestamp: 1582297546592,	Global steps: 2,	Iteration: 3,	Loss: 0.00105,	Accuracy: 0.00508
[2020-02-21 16:05:47] Train,	Timestamp: 1582297547028,	Global steps: 3,	Iteration: 4,	Loss: -0.10088,	Accuracy: 0.00542
[2020-02-21 16:05:47] Train,	Timestamp: 1582297547486,	Global steps: 4,	Iteration: 0,	Loss: -0.37705,	Accuracy: 0.00577
[2020-02-21 16:05:49] Train,	Timestamp: 1582297549693,	Global steps: 5,	Iteration: 2,	Loss: -0.65836,	Accuracy: 0.0075
[2020-02-21 16:05:50] Train,	Timestamp: 1582297550209,	Global steps: 6,	Iteration: 3,	Loss: -0.74346,	Accuracy: 0.00811
[2020-02-21 16:05:50] Train,	Timestamp: 1582297550785,	Global steps: 7,	Iteration: 4,	Loss: -1.00607,	Accuracy: 0.00803
[2020-02-21 16:05:51] Train,	Timestamp: 1582297551185,	Global steps: 8,	Iteration: 0,	Loss: -2.13375,	Accuracy: 0.00654
[2020-02-21 16:05:53] Train,	Timestamp: 158
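
To plot loss and accuracy, the log text first has to be turned into numbers. The sketch below parses lines in the format shown above with a regular expression and skips anything that does not match, such as a truncated final line. It assumes the log is available as a single string and that the field order is as printed; the name `parse_training_log` is hypothetical.

```python
import re

# One training line: timestamp, global step, iteration, loss and accuracy
LOG_LINE = re.compile(
    r"\[(?P<time>[^\]]+)\]\s+Train,\s+Timestamp:\s+(?P<timestamp>\d+),"
    r"\s+Global steps:\s+(?P<step>\d+),\s+Iteration:\s+(?P<iteration>\d+),"
    r"\s+Loss:\s+(?P<loss>-?\d+(?:\.\d+)?),\s+Accuracy:\s+(?P<accuracy>-?\d+(?:\.\d+)?)"
)


def parse_training_log(text):
    """Return one dict per complete training line found in text."""
    records = []
    for match in LOG_LINE.finditer(text):
        records.append({
            "step": int(match.group("step")),
            "iteration": int(match.group("iteration")),
            "loss": float(match.group("loss")),
            "accuracy": float(match.group("accuracy")),
        })
    return records


# Two complete lines from the output above plus a truncated one
sample_log = (
    "[2020-02-21 16:05:46] Train,\tTimestamp: 1582297546490,\tGlobal steps: 1,"
    "\tIteration: 2,\tLoss: 0.00124,\tAccuracy: 0.00469\n"
    "[2020-02-21 16:05:47] Train,\tTimestamp: 1582297547028,\tGlobal steps: 3,"
    "\tIteration: 4,\tLoss: -0.10088,\tAccuracy: 0.00542\n"
    "[2020-02-21 16:05:53] Train,\tTimestamp: 158"
)

records = parse_training_log(sample_log)
for record in records:
    print(record)
```

The resulting list can be passed straight to `pd.DataFrame(records)` and plotted against the `step` column.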

## Debugging any issues
[Back to top](#Contents)

If you run into issues during the process detailed above, there are several detailed logs that you can view to understand what is happening on the WMLA cluster.

<span style='color:deeppink'>**TODO:** Overview of the various logs available and then API calls.</span>

## Retrieve Training Driver Stdout Log

In [48]:
# Get Spectrum Conductor logs for training run - shows various information including environment variables

r = requests.get(sc_rest_url+'/instances/'+sig_id+'/applications/'+driver_id+'/logs/stdout/download',
                 auth=myauth, headers={'Accept': 'application/octet-stream'}, verify=False)

print(r.text)


load extra config from : /dlishared//conf
Setting up spark environment on node colonia05.platform
DLI_EXTRA_CONF=export DLI_LOGGER_LEVEL=debug;export DLI_SHARED_FS=/dlishared/;export DLI_WORK_DIR=/dliresult/egoadmin/batchworkdir/Admin-2781904014777372-695105367/_submitted_code/pytorch_edt_cssc2;export LD_LIBRARY_PATH=/opt/anaconda3/envs/dlipy2/lib:/opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/tensorflow:/dlishared//fabric/1.2.3/native:/dlishared//fabric/1.2.3/third-native/libs/ppc64le:$LD_LIBRARY_PATH;export PYTHONPATH=/dlishared//fabric/1.2.3/libs/fabric.zip:/dlishared//tools/dli_utils:/dlishared//fabric/1.2.3/tools:/dliresult/egoadmin/batchworkdir/Admin-2781904014777372-695105367/_submitted_code/pytorch_edt_cssc2:$PYTHONPATH;export DLI_IS_ELASTIC=true;export MODEL_PATH=/dliresult/egoadmin/batchworkdir/Admin-2781904014777372-695105367/_submitted_code/pytorch_edt_cssc2;export FABRIC_HOME=/dlishared//fabric/1.2.3;export RESUME_WEIGHT_PATH=null;export DLI_WORK_DIR=/dliresult/ego

## Retrieve Training Driver Stderr Log

In [61]:
# Get Spectrum Conductor stderr log for the training driver - shows the shell trace of the job startup, including exported variables

r = requests.get(sc_rest_url+'/instances/'+sig_id+'/applications/'+driver_id+'/logs/stderr/download',
                 auth=myauth, headers={'Accept': 'application/octet-stream'}, verify=False)

print(r.text)

+++ export SPARK_EGO_PYTHON_KILL_WORKER_PGROUP=true
+++ SPARK_EGO_PYTHON_KILL_WORKER_PGROUP=true
+++ '[' -z '' ']'
+++ export GLOG_logtostderr=true
+++ GLOG_logtostderr=true
+++ '[' -z '' -a -n '' -a -f '' ']'
+++ '[' -n '' -a -f '' ']'
+++ '[' true = true ']'
+++ checkGPUMemory
+++ echo 'Check GPU memory utilization'
+++ echo 'CUDA_VISIBLE_DEVICES is set to '
+++ utilization_high=1
+++ timeout_secs=30
+++ time_sleep=0
+++ '[' 0 -lt 30 -a 1 -eq 1 ']'
+++ utilization_high=0
++++ echo
++++ tr , ' '
+++ '[' 0 -eq 1 ']'
+++ '[' 0 -lt 30 -a 0 -eq 1 ']'
+++ '[' 0 -eq 1 ']'
+++ export NCCL_P2P_DISABLE=1
+++ NCCL_P2P_DISABLE=1
+++ handle_BYOF_EDT_Job
+++ '[' true = true -a Admin-2809611603258194-1703562770x '!=' x ']'
+++ '[' /dlidata/ '!=' '' ']'
+++ export DATA_DIR=/dlidata//hymenoptera_data
+++ DATA_DIR=/dlidata//hymenoptera_data
+++ local batch_path=/dliresult//egoadmin/batchworkdir/Admin-2809611603258194-1703562770
+++ export RESULT_DIR=/dliresult//egoadmin/batchworkdir/Admin-280961160325
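
The stderr trace is long, and the lines that matter most when debugging are usually the `export` statements, since they show the values of `DATA_DIR`, `RESULT_DIR` and similar variables that the job actually ran with. The sketch below pulls those out of a trace in the format shown above. The function name `exported_variables` is hypothetical, and a value taken from a truncated final line would itself be truncated.

```python
import re

# Matches shell trace lines such as '+++ export DATA_DIR=/dlidata//hymenoptera_data'
EXPORT_LINE = re.compile(r"^\++ export (?P<name>[A-Za-z_][A-Za-z0-9_]*)=(?P<value>.*)$")


def exported_variables(trace_text):
    """Return {name: value} for every 'export NAME=value' line in the trace."""
    variables = {}
    for line in trace_text.splitlines():
        match = EXPORT_LINE.match(line.strip())
        if match:
            variables[match.group("name")] = match.group("value")
    return variables


# Lines copied from the stderr output above
sample_trace = """+++ export NCCL_P2P_DISABLE=1
+++ NCCL_P2P_DISABLE=1
+++ handle_BYOF_EDT_Job
+++ export DATA_DIR=/dlidata//hymenoptera_data
+++ DATA_DIR=/dlidata//hymenoptera_data
"""

variables = exported_variables(sample_trace)
print(variables)
```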

## Retrieve Training Executor Log


## Retrieve Executor ID

The deep learning training log for each GPU is written to the executor log. Run the following code to list the executor IDs.

In [55]:
for key in executors:    
    print ('executors: ' + key['id'])

executors: 0-ea6cdd44-67f6-4e84-bafd-0b7ae2577542
executors: 1-6ce73046-91f8-4df0-ad30-339969f0e1c9


## Retrieve Executor Stdout log
Set `executor_id` to one of the executor IDs listed above.

In [59]:
executor_id = '0-ea6cdd44-67f6-4e84-bafd-0b7ae2577542'

r = requests.get(sc_rest_url+'/instances/'+sig_id+'/applications/'+driver_id +'/'+executor_id+'/logs/stdout/download',
                 auth=myauth, headers={'Accept': 'application/octet-stream'}, verify=False)

print(r.text)



load extra config from : /dlishared//conf
Setting up spark environment on node colonia05.platform
DLI_EXTRA_CONF=export DLI_LOGGER_LEVEL=debug;export DLI_SHARED_FS=/dlishared/;export DLI_WORK_DIR=/dliresult/egoadmin/batchworkdir/Admin-2809611603258194-1703562770/_submitted_code/pytorch_edt_cssc2;export LD_LIBRARY_PATH=/opt/anaconda3/envs/dlipy2/lib:/opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/tensorflow:/dlishared//fabric/1.2.3/native:/dlishared//fabric/1.2.3/third-native/libs/ppc64le:$LD_LIBRARY_PATH;export PYTHONPATH=/dlishared//fabric/1.2.3/libs/fabric.zip:/dlishared//tools/dli_utils:/dlishared//fabric/1.2.3/tools:/dliresult/egoadmin/batchworkdir/Admin-2809611603258194-1703562770/_submitted_code/pytorch_edt_cssc2:$PYTHONPATH;export DLI_IS_ELASTIC=true;export MODEL_PATH=/dliresult/egoadmin/batchworkdir/Admin-2809611603258194-1703562770/_submitted_code/pytorch_edt_cssc2;export FABRIC_HOME=/dlishared//fabric/1.2.3;export RESUME_WEIGHT_PATH=null;export DLI_WORK_DIR=/dliresult/

## Retrieve Executor Stderr log
Set `executor_id` to one of the executor IDs listed above.

In [60]:
executor_id = '0-ea6cdd44-67f6-4e84-bafd-0b7ae2577542'

r = requests.get(sc_rest_url+'/instances/'+sig_id+'/applications/'+driver_id +'/'+executor_id+'/logs/stderr/download',
                 auth=myauth, headers={'Accept': 'application/octet-stream'}, verify=False)

print(r.text)

+++ export SPARK_EGO_PYTHON_KILL_WORKER_PGROUP=true
+++ SPARK_EGO_PYTHON_KILL_WORKER_PGROUP=true
+++ '[' -z '' ']'
+++ export GLOG_logtostderr=true
+++ GLOG_logtostderr=true
+++ '[' -z '' -a -n '' -a -f '' ']'
+++ '[' -n '' -a -f '' ']'
+++ '[' true = true ']'
+++ checkGPUMemory
+++ echo 'Check GPU memory utilization'
+++ echo 'CUDA_VISIBLE_DEVICES is set to 3'
+++ utilization_high=1
+++ timeout_secs=30
+++ time_sleep=0
+++ '[' 0 -lt 30 -a 1 -eq 1 ']'
+++ utilization_high=0
++++ echo 3
++++ tr , ' '
+++ for gpuid in '$(echo $CUDA_VISIBLE_DEVICES | tr '\'','\'' '\'' '\'')'
++++ nvidia-smi -i 3 --query-gpu=utilization.memory --format=csv,nounits,noheader
+++ utilization=0
+++ '[' 0 -gt 90 ']'
+++ '[' 0 -eq 1 ']'
+++ '[' 0 -lt 30 -a 0 -eq 1 ']'
+++ '[' 0 -eq 1 ']'
+++ export NCCL_P2P_DISABLE=1
+++ NCCL_P2P_DISABLE=1
+++ handle_BYOF_EDT_Job
+++ '[' true = true -a Admin-2809611603258194-1703562770x '!=' x ']'
+++ '[' /dlidata/ '!=' '' ']'
+++ export DATA_DIR=/dlidata//hymenoptera_data
+++ D
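
The two cells above fetch the logs of a single, hard-coded executor. To cover every executor of a job, the download URLs can be generated in one place. The sketch below only builds the URLs, following the path pattern used in the cells above; the function name `executor_log_urls` is hypothetical and the IDs in the demo are the ones printed earlier in this notebook.

```python
def executor_log_urls(base_url, sig_id, driver_id, executor_ids, streams=("stdout", "stderr")):
    """Return {(executor_id, stream): download URL} for every executor and stream."""
    urls = {}
    for executor_id in executor_ids:
        for stream in streams:
            urls[(executor_id, stream)] = (
                "{}/instances/{}/applications/{}/{}/logs/{}/download".format(
                    base_url.rstrip("/"), sig_id, driver_id, executor_id, stream
                )
            )
    return urls


urls = executor_log_urls(
    "http://colonia05.platform:8280/platform/rest/conductor/v1",
    "63027c69-543b-451c-9a03-9c850784725f",
    "driver-20200221234711-0084-e43f7b87-017a-4024-a890-8a6a8a7ba055",
    [
        "0-ea6cdd44-67f6-4e84-bafd-0b7ae2577542",
        "1-6ce73046-91f8-4df0-ad30-339969f0e1c9",
    ],
)
for key, url in urls.items():
    print(key, url)
```

Each URL can then be requested with the same `auth`, `headers` and `verify` arguments as in the cells above.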

## Output

## Further information and useful links
[Back to top](#Contents)

