## Distributed training with Amazon SageMaker

In this notebook we use the SageMaker Python SDK to set up and run a distributed training job.
SageMaker makes it easy to train models across a cluster of many machines without having to manage those resources explicitly.

**Step 1:** Import the essential packages, start a SageMaker session, and specify the name of the bucket you created in the prerequisites section of this workshop.

In [1]:
import os
import time
import numpy as np
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket_name = sagemaker_session.default_bucket()

**Step 2:** Specify the hyperparameters, the instance type, and the number of instances to distribute training across. `hvd_processes_per_host` corresponds to the number of GPUs per instance.
For example, if you choose:
```
hvd_instance_type = 'ml.p3.2xlarge'
hvd_instance_count = 2
hvd_processes_per_host = 1
```

With these settings, training runs as 2 Horovod processes, one per GPU, spread across 2 instances (or nodes). SageMaker automatically takes care of spinning up these instances and making sure they can communicate with each other.

In [2]:
hyperparameters = {'epochs': 5, 
                   'learning-rate': 0.001,
                   'momentum': 0.9,
                   'weight-decay': 2e-4,
                   'optimizer': 'adam',
                   'batch-size' : 256}

hvd_instance_type = 'ml.p3.2xlarge'
hvd_instance_count = 2
hvd_processes_per_host = 1

print('Distributed training with a total of {} workers'.format(hvd_processes_per_host*hvd_instance_count))
print('{} x {} instances with {} processes per instance'.format(hvd_instance_count, hvd_instance_type, hvd_processes_per_host))

Distributed training with a total of 2 workers
2 x ml.p3.2xlarge instances with 1 processes per instance
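
The arithmetic behind these settings can be sketched in plain Python. The helper name `cluster_summary` is hypothetical, and both the per-worker reading of `batch-size` and the linear learning-rate scaling are assumptions about the training script, which is not shown in this notebook. The scaled rate of 0.002 does match the warmup target reported in the training log further down.

```python
def cluster_summary(instance_count, processes_per_host, batch_size, base_lr):
    """Derive the worker count and effective training settings (illustrative only)."""
    workers = instance_count * processes_per_host
    return {
        'workers': workers,
        # Assumption: each worker processes `batch_size` samples per step.
        'global_batch_size': batch_size * workers,
        # Assumption: the script scales the learning rate linearly with workers.
        'scaled_lr': base_lr * workers,
    }


summary = cluster_summary(instance_count=2, processes_per_host=1,
                          batch_size=256, base_lr=0.001)
print(summary)
```

Switching to a multi-GPU instance type only changes `processes_per_host`; for example, an `ml.p3.8xlarge` has 4 GPUs, so 2 instances would give 8 workers.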


**Step 3:** In this cell we create a SageMaker estimator and provide it with all the information it needs to launch instances and run training on them.

Since we're using Horovod for distributed training, we set `distributions` to `mpi`, which Horovod uses to launch and coordinate its processes.

In the TensorFlow estimator call, we specify the training script under `entry_point` and the directory that holds it and its dependencies under `source_dir` (here, `code`). SageMaker automatically copies these files into a TensorFlow container behind the scenes and executes the script on the training instances.
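
In script mode, SageMaker passes each entry of the `hyperparameters` dictionary to the entry-point script as a command-line argument (for example `--epochs 5`), along with `--model_dir`. The contents of `cifar10-multi-gpu-horovod-sagemaker.py` are not shown here, so the parser below is only a hypothetical sketch of how such a script could receive the values defined in Step 2; the default values are placeholders.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters the way a script-mode entry point might (hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--momentum', type=float, default=0.9)
    parser.add_argument('--weight-decay', type=float, default=2e-4)
    parser.add_argument('--optimizer', type=str, default='sgd')
    parser.add_argument('--batch-size', type=int, default=128)
    parser.add_argument('--model_dir', type=str, default=None)
    # Ignore any extra arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments derived from the notebook's hyperparameters.
args = parse_args(['--epochs', '5', '--learning-rate', '0.001',
                   '--optimizer', 'adam', '--batch-size', '256'])
print(args.epochs, args.learning_rate, args.optimizer, args.batch_size)
```

Note that `argparse` converts the hyphens in names such as `--learning-rate` to underscores, so the value is read as `args.learning_rate`.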

In [3]:
from sagemaker.tensorflow import TensorFlow

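# S3 location where SageMaker stores the job's output
output_path = 's3://{}/'.format(bucket_name)
# Unique job name: cluster shape plus a UTC timestamp (%j appends the day of the year)
job_name = 'sm-dist-{}x{}-workers-'.format(hvd_instance_count, hvd_processes_per_host) + time.strftime('%Y-%m-%d-%H-%M-%S-%j', time.gmtime())
# S3 prefix for this job's TensorBoard logs, passed to the estimator as model_dir
model_dir = output_path + 'tensorboard_logs/' + job_name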

distributions = {'mpi': {
                    'enabled': True,
                    'processes_per_host': hvd_processes_per_host,
                    'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
                        }
                }

estimator_hvd = TensorFlow(base_job_name='hvd-cifar10-tf',
                       source_dir='code',
                       entry_point='cifar10-multi-gpu-horovod-sagemaker.py', 
                       role=role,
                       framework_version='1.14',
                       py_version='py3',
                       hyperparameters=hyperparameters,
                       train_instance_count=hvd_instance_count, 
                       train_instance_type=hvd_instance_type,
                       output_path=output_path,
                       model_dir=model_dir,
                       tags = [{'Key' : 'Project', 'Value' : 'cifar10'},{'Key' : 'TensorBoard', 'Value' : 'dist'}],
                       metric_definitions=[{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}],
                       distributions=distributions)
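
The `metric_definitions` entry tells SageMaker to scan the job's log output with the given regular expression and record the captured number as the `val_acc` metric. The sketch below applies the same pattern locally; the sample log line is hypothetical (a Keras-style progress line) and is not output from this job.

```python
import re

# Same pattern as in metric_definitions above.
VAL_ACC_PATTERN = re.compile('val_acc: ([0-9\\.]+)')

# Hypothetical Keras-style log line, for illustration only.
sample_line = '[1,0]<stdout>: - 12s - loss: 1.41 - acc: 0.49 - val_loss: 1.32 - val_acc: 0.5321'

match = VAL_ACC_PATTERN.search(sample_line)
val_acc = float(match.group(1)) if match else None
print(val_acc)
```

Testing the pattern against a line your script actually prints is a cheap way to catch a metric that would otherwise silently stay empty.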

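**Step 4:** Specify the dataset locations in Amazon S3 and then call the `fit` function. Because `wait=False`, the call returns as soon as the job is submitted; the following cell attaches to the job to stream its logs.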

In [4]:
train_path = 's3://{}/cifar10-dataset/train'.format(bucket_name)
val_path = 's3://{}/cifar10-dataset/validation'.format(bucket_name)
eval_path = 's3://{}/cifar10-dataset/eval/'.format(bucket_name)

estimator_hvd.fit({'train': train_path,'validation': val_path,'eval': eval_path}, 
                  job_name=job_name, wait=False)
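
Each key in the dictionary passed to `fit` becomes an input channel. Inside the training container, SageMaker places each channel's data under `/opt/ml/input/data/<channel>` and exposes that path through an `SM_CHANNEL_<CHANNEL>` environment variable. The sketch below shows how a training script could resolve those directories; the function `channel_dirs` is hypothetical, and the environment is simulated with a plain dictionary rather than read from a real job.

```python
import os


def channel_dirs(channels, environ=None):
    """Map channel names to local data directories inside the container."""
    environ = os.environ if environ is None else environ
    dirs = {}
    for name in channels:
        env_key = 'SM_CHANNEL_{}'.format(name.upper())
        # Fall back to the conventional path if the variable is not set.
        dirs[name] = environ.get(env_key, '/opt/ml/input/data/{}'.format(name))
    return dirs


# Simulated container environment; SageMaker sets these in a real job.
fake_env = {'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
            'SM_CHANNEL_VALIDATION': '/opt/ml/input/data/validation'}
dirs = channel_dirs(['train', 'validation', 'eval'], environ=fake_env)
print(dirs)
```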

In [5]:
estimator_hvd.attach(training_job_name=job_name)

2020-02-22 17:28:55 Starting - Launching requested ML instances......
2020-02-22 17:29:56 Starting - Preparing the instances for training......
2020-02-22 17:31:03 Downloading - Downloading input data...
2020-02-22 17:31:28 Training - Downloading the training image...
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])

[1,1]<stderr>:2020-02-22 17:32:15.252480: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
[1,0]<stdout>:algo-1:44:55 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.137.247<0>
[1,0]<stdout>:algo-1:44:55 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
[1,0]<stdout>:
[1,0]<stdout>:algo-1:44:55 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
[1,0]<stdout>:NCCL version 2.4.7+cuda10.0
[1,1]<stdout>:algo-2:50:61 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.186.193<0>
[1,1]<stdout>:algo-2:50:61 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
[1,1]<stdout>:
[1,1]<stdout>:algo-2:50:61 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
[1,0]<stdout>:algo-1:44:55 [0] NCCL INFO Setting affinity for GPU 0 to ff
[1,1]<stdout>:algo-2:50:61 [0] NCCL INFO Setting affinity for GP



[1,1]<stdout>:Epoch 3/5
[1,0]<stdout>:upload: ../output/data/20200222-173208/events.out.tfevents.1582392730.algo-1 to s3://sagemaker-us-west-2-173153674984/tensorboard_logs/sm-dist-2x1-workers-2020-02-22-17-28-51-053/events.out.tfevents.1582392730.algo-1
[1,0]<stdout>:Epoch 3/5


[1,1]<stdout>:Epoch 4/5
[1,0]<stdout>:upload: ../output/data/20200222-173208/events.out.tfevents.1582392730.algo-1 to s3://sagemaker-us-west-2-173153674984/tensorboard_logs/sm-dist-2x1-workers-2020-02-22-17-28-51-053/events.out.tfevents.1582392730.algo-1
[1,0]<stdout>:Epoch 4/5


[1,1]<stdout>:Epoch 5/5
[1,0]<stdout>:upload: ../output/data/20200222-173208/events.out.tfevents.1582392730.algo-1 to s3://sagemaker-us-west-2-173153674984/tensorboard_logs/sm-dist-2x1-workers-2020-02-22-17-28-51-053/events.out.tfevents.1582392730.algo-1
[1,0]<stdout>:Epoch 5/5


[1,0]<stdout>:Epoch 5: finished gradual learning rate warmup to 0.002.
[1,0]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:Epoch 5: finished gradual learning rate warmup to 0.002.
[1,0]<stdout>:upload: ../output/data/20200222-173208/events.out.tfevents.1582392730.algo-1 to s3://sagemaker-us-west-2-173153674984/tensorboard_logs/sm-dist-2x1-workers-2020-02-22-17-28-51-053/events.out.tfevents.1582392730.algo-1
[1,1]<stdout>:Test loss    : 1.2979332911662567
[1,1]<stdout>:Test accuracy: 0.53215146


[1,0]<stdout>:Test loss    : 1.3000682531258998
[1,0]<stdout>:Test accuracy: 0.5354567
W0222 17:33:25.574396 140656805435136 training.py:181] No model artifact is saved under path /opt/ml/model. Your training job will not save any model files to S3.
For details of how to construct your training script see:
https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#adapting-your-local-tensorflow-script

2020-02-22 17:34:07 Uploading - Uploading generated training model
2020-02-22 17:34:07 Completed - Training job completed
W0222 17:33:55.614902 140610589783808 training.py:181] No model artifact is saved under path /opt/ml/model. Your training job will not save any model files to S3.
For details of how to construct your training script see:
https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#adapting-your-local-tensorflow-script
Training seconds: 368
Billable seconds

<sagemaker.tensorflow.estimator.TensorFlow at 0x7f292ddc3908>
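
Every Horovod process prefixes its output with a tag such as `[1,0]<stdout>:`, so per-worker results can be pulled out of the log with a regular expression. The sketch below parses the test loss and accuracy lines copied from the log above. The helper name `per_rank_results` is hypothetical, and reading the second number in the tag as the MPI rank is an assumption about the tagging format.

```python
import re

LOG_LINES = [
    '[1,1]<stdout>:Test loss    : 1.2979332911662567',
    '[1,1]<stdout>:Test accuracy: 0.53215146',
    '[1,0]<stdout>:Test loss    : 1.3000682531258998',
    '[1,0]<stdout>:Test accuracy: 0.5354567',
]

# Captures the rank (assumed), the metric name, and its value.
LINE_PATTERN = re.compile(r'\[\d+,(\d+)\]<stdout>:Test (loss|accuracy)\s*: ([0-9.]+)')


def per_rank_results(lines):
    """Group test loss and accuracy by worker rank."""
    results = {}
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match:
            rank, metric, value = match.groups()
            results.setdefault(int(rank), {})[metric] = float(value)
    return results


results = per_rank_results(LOG_LINES)
for rank in sorted(results):
    print(rank, results[rank])
```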