# Create Docker Image for PyTorch
In this notebook we create the Docker image that our PyTorch script will run in. We build the image and test it locally to make sure it runs before submitting it to the cluster. Testing locally first is recommended because debugging on a cluster is much more difficult and time-consuming.

In [1]:
import sys
sys.path.append("../common") 

from dotenv import dotenv_values, set_key, find_dotenv, get_key
from getpass import getpass
import os
import json
from utils import get_password, write_json_to_file, dotenv_for
import docker

Below are the variables that describe our experiment. By default we use NC24rs_v3 (Standard_NC24rs_v3) VMs, which have V100 GPUs and InfiniBand. The default is 2 nodes with 4 GPUs each, for a total of 8 GPUs. Feel free to increase the number of nodes, but be aware of the limits your subscription may have.

Set USE_FAKE to True if you want to use fake data rather than the ImageNet dataset. This is often a good way to debug your models and to check how much IO overhead there is.

In [50]:
USE_FAKE      = False
DOCKERHUB     = "masalvar"  # "<YOUR DOCKERHUB>"
NUM_PROCESSES = 2

In [29]:
dc = docker.from_env()

In [30]:
image, log_iter = dc.images.build(path='Docker', 
                          tag='{}/caia-horovod-pytorch'.format(DOCKERHUB))
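
The `log_iter` returned by the build is not used in this notebook. As a minimal sketch, assuming each entry is a dict in which build output text appears under a `stream` key (this matches recent versions of the Docker SDK for Python, but check your installed version), a small helper can turn it into readable text. The name `format_build_log` is hypothetical, and the sample entries are hand-written placeholders rather than real build output.

```python
def format_build_log(chunks):
    """Join the text of all 'stream' entries from an iterable of build-log dicts."""
    lines = []
    for chunk in chunks:
        text = chunk.get("stream")  # entries without 'stream' (e.g. 'aux') are skipped
        if text:
            lines.append(text)
    return "".join(lines)


# Hand-written placeholder entries; in the notebook you would pass log_iter instead.
sample = [
    {"stream": "first line\n"},
    {"aux": {}},
    {"stream": "second line\n"},
]
print(format_build_log(sample), end="")
```

In the notebook this would be called as `print(format_build_log(log_iter))` right after the build.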

In [42]:
image.tags[0]

'masalvar/caia-horovod-pytorch:latest'

In [31]:
container_labels = {'containerName': 'pytorchgpu'}
environment = {
    "DISTRIBUTED":True,
    "PYTHONPATH":'/workspace/common/',
}

volumes = {
    os.getenv('EXT_PWD'): {
                                'bind': '/workspace', 
                                'mode': 'rw'
                               }
}

if USE_FAKE:
    environment['FAKE'] = True
else:
    environment['FAKE'] = False
    volumes[os.getenv('EXT_DATA')]={'bind': '/mnt/input', 'mode': 'rw'}
    environment['AZ_BATCHAI_INPUT_TRAIN'] = '/mnt/input/train'
    environment['AZ_BATCHAI_INPUT_TEST'] = '/mnt/input/validation'
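
`os.getenv` returns `None` when a variable is unset, which would put a `None` key into `volumes`. Below is a minimal sketch of a guard, assuming `EXT_PWD` and `EXT_DATA` are meant to hold host paths. The helper name `require_env` and the example path are hypothetical and not part of the original notebook.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise KeyError(f"Environment variable {name} must be set to a host path")
    return value


# Example with an explicit mapping; in the notebook you would omit `env`
# so that the real process environment is used.
example_env = {"EXT_PWD": "/home/user/project"}  # hypothetical path
example_volumes = {
    require_env("EXT_PWD", example_env): {"bind": "/workspace", "mode": "rw"}
}
print(example_volumes)
```

Failing early with a clear message is easier to diagnose than an error raised later, when the container is started with an invalid mount.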

In [32]:
cmd=f'mpirun -np {NUM_PROCESSES} -H localhost:{NUM_PROCESSES} '\
     'python -u /workspace/HorovodPytorch/src/imagenet_pytorch_horovod.py'
container = dc.containers.run(image.tags[0], 
                              command=cmd,
                              detach=True, 
                              labels=container_labels,
                              runtime='nvidia',
                              volumes=volumes,
                              environment=environment,
                              shm_size='8G',
                              privileged=True)
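
The command string above can also be assembled from a list of arguments, which keeps the process count and script path in one place. This is a sketch only: `build_mpirun_command` is a hypothetical helper, and it uses just the `-np` and `-H` options already present in the original command.

```python
import shlex


def build_mpirun_command(num_processes, script):
    """Build the single-host mpirun command line used to launch the training script."""
    args = [
        "mpirun",
        "-np", str(num_processes),
        "-H", f"localhost:{num_processes}",
        "python", "-u", script,
    ]
    # Quote each argument so that paths containing spaces survive as one token
    return " ".join(shlex.quote(arg) for arg in args)


example_cmd = build_mpirun_command(
    2, "/workspace/HorovodPytorch/src/imagenet_pytorch_horovod.py"
)
print(example_cmd)
```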

In [33]:
for line in container.logs(stderr=True, stream=True):
    print(line.decode("utf-8"), end="")

INFO:__main__:0:  Runnin Distributed
INFO:__main__:1:  Runnin Distributed
INFO:__main__:0:  PyTorch version 0.4.0
INFO:__main__:0:  Setting up loaders
INFO:__main__:1:  PyTorch version 0.4.0
INFO:__main__:1:  Setting up loaders
INFO:__main__:0:  Loading model
INFO:__main__:1:  Loading model
INFO:__main__:0:  Training ...
INFO:__main__:1:  Training ...

4f92c5ed3cd3:13:68 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
4f92c5ed3cd3:13:68 [0] INFO Using internal Network Socket
4f92c5ed3cd3:13:68 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
4f92c5ed3cd3:13:68 [0] INFO NET : Using interface eth0:172.17.0.3<0>
4f92c5ed3cd3:13:68 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.13+cuda9.0

4f92c5ed3cd3:14:65 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
4f92c5ed3cd3:14:65 [1] INFO Using internal Network Socket
4f92c5ed3cd3:14:65 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
Unexpected end of /proc/mounts line `overlay 

KeyboardInterrupt: 

In [44]:
container.reload()  # Refresh state
# Compare strings with ==; `is` tests object identity and is not reliable here
if container.status == 'running':
    container.kill()

In [46]:
# Prompt for the password rather than hardcoding it in the notebook
for line in dc.images.push(image.tags[0], 
                           stream=True,
                           auth_config={"username": DOCKERHUB,
                                        "password": getpass("Docker Hub password: ")}):
    print(line)

b'{"status":"The push refers to a repository [docker.io/masalvar/caia-horovod-pytorch]"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"7e6c8b5d5783"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"eeb659df3cc8"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"27aab996f8cd"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"d9038574f55a"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"4936625d6fff"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"9be10ccfe4da"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"12302f8bd2e6"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"879c4ef3d9fb"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"4246124ac3fb"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"a917bc2d0f96"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"c7cfa177d51a"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"9b68e6935e56"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"6e8ce585c22b"}\r
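
Each line printed above is a bytes object holding one or more JSON documents separated by `\r\n`. Below is a minimal sketch for decoding them and surfacing failures, assuming the registry reports errors under an `error` key. `parse_push_line` and `first_error` are hypothetical helpers. The SDK's `push` also accepts `decode=True` together with `stream=True`, which yields dicts directly.

```python
import json


def parse_push_line(raw):
    """Decode one bytes chunk from the push stream into a list of dicts."""
    messages = []
    for part in raw.decode("utf-8").splitlines():
        part = part.strip()
        if part:  # skip empty fragments between separators
            messages.append(json.loads(part))
    return messages


def first_error(messages):
    """Return the first 'error' message found, or None if there is none."""
    for msg in messages:
        if "error" in msg:
            return msg["error"]
    return None


# One line copied from the push output above
sample = b'{"status":"Preparing","progressDetail":{},"id":"7e6c8b5d5783"}\r\n'
parsed = parse_push_line(sample)
print(parsed[0]["status"], parsed[0]["id"])
```

Checking `first_error` on each decoded chunk lets the notebook stop with a clear message if the push is rejected, instead of leaving the failure buried in the printed stream.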