# Create Docker Image for PyTorch
In this notebook we create the Docker image that our PyTorch script will run in. We build the image and test it locally to make sure it runs before submitting it to the cluster. Testing locally first is recommended because debugging on a cluster is usually much more difficult and time-consuming.
 
**You need a GPU-enabled VM to run this notebook.**

In [2]:
import sys
sys.path.append("../common") 

from dotenv import get_key
import os
from utils import dotenv_for
import docker

We use fake data here so that we don't have to download the dataset. Fake data is also a good way to debug your model and to check how much overhead IO adds. We set the number of processes (NUM_PROCESSES) to 2 because the VM we are testing on has 2 GPUs. If your machine has 1 GPU, set NUM_PROCESSES to 1.

In [22]:
dotenv_path = dotenv_for()
USE_FAKE               = True
DOCKERHUB              = os.getenv('DOCKER_REPOSITORY', "masalvar")
NUM_PROCESSES          = 2
DOCKER_PWD             = get_key(dotenv_path, 'DOCKER_PWD')
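
NUM_PROCESSES is set by hand above. As a minimal sketch, assuming `nvidia-smi` is on the PATH of the VM, the value could instead be derived from the number of GPUs that `nvidia-smi -L` lists (one `GPU <n>: ...` line per device). The helper names `count_gpus` and `detect_num_processes` are hypothetical and not part of this repository.

```python
import subprocess


def count_gpus(listing):
    """Count devices in the text printed by `nvidia-smi -L`.

    Each device is reported on its own line starting with 'GPU '.
    """
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def detect_num_processes(default=1):
    """Return one process per visible GPU, or `default` if detection fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or reported an error: fall back to the default.
        return default
    return count_gpus(result.stdout) or default
```

With this in place, `NUM_PROCESSES = detect_num_processes()` would replace the hard-coded value. Setting it by hand remains the simpler choice when the hardware is known.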

In [13]:
dc = docker.from_env()

In [14]:
image, log_iter = dc.images.build(path='Docker', 
                          tag='{}/caia-horovod-pytorch'.format(DOCKERHUB))
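
The build call returns the image together with an iterator over the build log, which the notebook does not display. A minimal sketch for turning that log into readable text follows, assuming each entry is a decoded dictionary and that the Dockerfile step output sits under a `'stream'` key. The helper name `build_log_text` and the sample entries are hypothetical.

```python
def build_log_text(entries):
    """Join the 'stream' fields of decoded build-log entries into one string.

    Entries without a 'stream' key (for example metadata entries) are skipped.
    """
    return "".join(entry["stream"] for entry in entries if "stream" in entry)


# Hypothetical sample entries, shaped like decoded build-log output.
sample_log = [
    {"stream": "Step 1/2 : FROM ubuntu:16.04\n"},
    {"aux": {"ID": "sha256:0000"}},
    {"stream": "Successfully built 000000000000\n"},
]
print(build_log_text(sample_log), end="")
```

In the notebook, `print(build_log_text(log_iter))` would show the build steps after the image has been built.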

In [16]:
container_labels = {'containerName': 'pytorchgpu'}
environment = {
    "DISTRIBUTED": True,
    "PYTHONPATH": '/workspace/common/',
}

volumes = {
    os.getenv('EXT_PWD'): {
        'bind': '/workspace',
        'mode': 'rw'
    }
}

if USE_FAKE:
    environment['FAKE'] = True
else:
    environment['FAKE'] = False
    volumes[os.getenv('EXT_DATA')]={'bind': '/mnt/input', 'mode': 'rw'}
    environment['AZ_BATCHAI_INPUT_TRAIN'] = '/mnt/input/train'
    environment['AZ_BATCHAI_INPUT_TEST'] = '/mnt/input/validation'
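
The cell above reads EXT_PWD and EXT_DATA with `os.getenv`, which returns `None` when a variable is unset, so a missing variable would end up as a `None` key in `volumes`. The sketch below shows one way to fail early instead. It mirrors the settings in the cell above; the function name `build_run_config` and its parameters are hypothetical.

```python
def build_run_config(workspace, data_dir=None, use_fake=True):
    """Return (volumes, environment) for the container.

    Raises ValueError if a required host path is missing, rather than
    passing None on to Docker as a volume source.
    """
    if not workspace:
        raise ValueError("workspace path is not set (is EXT_PWD defined?)")

    volumes = {workspace: {"bind": "/workspace", "mode": "rw"}}
    environment = {
        "DISTRIBUTED": True,
        "PYTHONPATH": "/workspace/common/",
        "FAKE": use_fake,
    }

    if not use_fake:
        if not data_dir:
            raise ValueError("data path is not set (is EXT_DATA defined?)")
        volumes[data_dir] = {"bind": "/mnt/input", "mode": "rw"}
        environment["AZ_BATCHAI_INPUT_TRAIN"] = "/mnt/input/train"
        environment["AZ_BATCHAI_INPUT_TEST"] = "/mnt/input/validation"

    return volumes, environment
```

A call such as `volumes, environment = build_run_config(os.getenv('EXT_PWD'), os.getenv('EXT_DATA'), USE_FAKE)` would produce the same dictionaries as the cell above when both variables are set.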

In [17]:
cmd=f'mpirun -np {NUM_PROCESSES} -H localhost:{NUM_PROCESSES} '\
     'python -u /workspace/HorovodPytorch/src/imagenet_pytorch_horovod.py'
container = dc.containers.run(image.tags[0], 
                              command=cmd,
                              detach=True, 
                              labels=container_labels,
                              runtime='nvidia',
                              volumes=volumes,
                              environment=environment,
                              shm_size='8G',
                              privileged=True)

The code below streams the container's logs so we can monitor what is happening inside it. Feel free to interrupt the cell once you are happy that everything is working.

In [18]:
for line in container.logs(stderr=True, stream=True):
    print(line.decode("utf-8"), end="")

INFO:__main__:0:  Runnin Distributed
INFO:__main__:1:  Runnin Distributed
INFO:__main__:0:  PyTorch version 0.4.0
INFO:__main__:0:  Setting up fake loaders
INFO:__main__:1:  PyTorch version 0.4.0
INFO:__main__:1:  Setting up fake loaders
INFO:__main__:1:  Creating fake data 1000 labels and 640 images
INFO:__main__:1:  Loading model
INFO:__main__:0:  Creating fake data 1000 labels and 640 images
INFO:__main__:0:  Loading model
INFO:__main__:1:  Training ...
INFO:__main__:0:  Training ...

41afbf31e948:13:65 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
41afbf31e948:13:65 [0] INFO Using internal Network Socket
41afbf31e948:13:65 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
41afbf31e948:13:65 [0] INFO NET : Using interface eth0:172.17.0.3<0>
41afbf31e948:13:65 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.13+cuda9.0

41afbf31e948:14:66 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
41afbf31e948:14:66 [1] INFO Using internal Netwo

KeyboardInterrupt: 
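
The log stream above was stopped by interrupting the cell by hand. As an alternative, the sketch below stops reading once a chosen marker appears or a line limit is reached. It assumes the stream yields `bytes` chunks, as the loop above does; the helper name `follow_until` and the `max_lines` default are hypothetical.

```python
def follow_until(lines, marker, max_lines=1000):
    """Print decoded log lines until `marker` is seen or `max_lines` are read.

    Returns True if the marker appeared, False otherwise.
    """
    for count, raw in enumerate(lines, start=1):
        text = raw.decode("utf-8", errors="replace")
        print(text, end="")
        if marker in text:
            return True
        if count >= max_lines:
            break
    return False
```

For example, `follow_until(container.logs(stderr=True, stream=True), "Training ...")` would return as soon as the first worker logs that training has started.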

In [19]:
container.reload()  # Refresh the cached container state from the Docker daemon
if container.status == 'running':  # Compare by value; `is` tests object identity
    container.kill()

In [24]:
for line in dc.images.push(image.tags[0], 
                           stream=True,
                           auth_config={"username": DOCKERHUB,
                                        "password": DOCKER_PWD}):
    print(line)

b'{"status":"The push refers to a repository [docker.io/masalvar/caia-horovod-pytorch]"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"7e6c8b5d5783"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"eeb659df3cc8"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"27aab996f8cd"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"d9038574f55a"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"4936625d6fff"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"9be10ccfe4da"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"12302f8bd2e6"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"879c4ef3d9fb"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"4246124ac3fb"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"a917bc2d0f96"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"c7cfa177d51a"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"9b68e6935e56"}\r\n'
b'{"status":"Preparing","progressDetail":{},"id":"6e8ce585c22b"}\r