# Train and Host a Keras Model with Pipe Mode and Horovod on Amazon SageMaker

Amazon SageMaker is a fully managed service that lets developers and data scientists build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK makes it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks, including TensorFlow and Keras.

In this notebook, we train and host a [Keras Sequential model](https://keras.io/getting-started/sequential-model-guide) on SageMaker. The model used for this notebook is a simple deep convolutional neural network (CNN) that was extracted from [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py).

For training our model, we also demonstrate distributed training with [Horovod](https://horovod.readthedocs.io) and Pipe Mode. Amazon SageMaker's Pipe Mode streams your dataset directly to your training instances instead of downloading it first, which means training jobs start sooner, finish quicker, and need less disk space.

## Setup

First, we define a few variables that are needed later in the example.

In [8]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

## The CIFAR-10 dataset

The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:

![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)

### Prepare the dataset for training

To use the CIFAR-10 dataset, we first download it and convert it to TFRecords. This step takes around 5 minutes.

In [9]:
!python generate_cifar10_tfrecords.py --data-dir ./data


Download from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz and extract.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Successfully downloaded cifar-10-python.tar.gz 170498071 bytes.
Generating ./data/train/train.tfrecords
Generating ./data/validation/validation.tfrecords
Generating ./data/eval/eval.tfrecords
Done!


Next, we upload the data to Amazon S3:

In [10]:
from sagemaker.s3 import S3Uploader

bucket = sagemaker_session.default_bucket()
dataset_uri = S3Uploader.upload('data', 's3://{}/tf-cifar10-example/data'.format(bucket))

display(dataset_uri)

's3://sagemaker-ap-southeast-2-987959606453/tf-cifar10-example/data'

## Train the model

In this tutorial, we train a deep CNN to learn a classification task with the CIFAR-10 dataset. We compare three different training jobs: a baseline training job, training with Pipe Mode, and distributed training with Horovod.

### Run a baseline training job on SageMaker

The SageMaker Python SDK's `sagemaker.tensorflow.TensorFlow` estimator class makes it easy for us to interact with SageMaker. We create one for each of the different training jobs we run in this example. A couple of parameters worth noting:

* `entry_point`: our training script (adapted from [this Keras example](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py)).
* `train_instance_count`: the number of training instances. Here, we set it to 1 for our baseline training job.

As we run each of our training jobs, we change different parameters to configure our different training jobs.

For more details about the TensorFlow estimator class, see the [API documentation](https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html).

### Verify the training code

Before running the baseline training job, we first use [the SageMaker Python SDK's Local Mode feature](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode) to check that our code works with SageMaker's TensorFlow environment. Local Mode downloads the [prebuilt Docker image for TensorFlow](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html) and runs a Docker container locally for a training job. In other words, it simulates the SageMaker environment for a quicker development cycle, so we use it here just to test out our code.

We create a TensorFlow estimator and set `train_instance_type` to `'local'` or `'local_gpu'`, depending on whether the local machine has a GPU. This tells the estimator to run our training job locally (as opposed to on SageMaker). We also have our training code run for only one epoch because our intent here is to verify the code, not to train an accurate model.

In [11]:
import subprocess

from sagemaker.tensorflow import TensorFlow

instance_type = 'local'

if subprocess.call('nvidia-smi') == 0:
    # Set instance type to GPU if one is present
    instance_type = 'local_gpu'
    
local_hyperparameters = {'epochs': 1, 'batch-size' : 64}

estimator = TensorFlow(entry_point='cifar10_keras_main.py',
                       source_dir='source_dir',
                       role=role,
                       framework_version='1.15.2',
                       py_version='py3',
                       hyperparameters=local_hyperparameters,
                       train_instance_count=1,
                       train_instance_type=instance_type)

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
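The warnings above come from version 2 of the SageMaker Python SDK, which renamed several estimator arguments. As a rough sketch, the mapping below translates the v1-style names used in this notebook into their v2 equivalents. `to_v2_kwargs` is a hypothetical helper written for illustration, not part of the SDK, and the mapping covers only the arguments this notebook passes.

```python
# v1 -> v2 names for the estimator arguments used in this notebook.
V2_RENAMES = {
    'train_instance_count': 'instance_count',
    'train_instance_type': 'instance_type',
    'distributions': 'distribution',
}


def to_v2_kwargs(kwargs):
    """Return a copy of kwargs with v1 argument names replaced by v2 names."""
    return {V2_RENAMES.get(name, name): value for name, value in kwargs.items()}


# Illustrative subset of the arguments passed to the estimator above.
v1_kwargs = {
    'entry_point': 'cifar10_keras_main.py',
    'train_instance_count': 1,
    'train_instance_type': 'local',
}
print(to_v2_kwargs(v1_kwargs))
```

Arguments that were not renamed, such as `entry_point`, pass through unchanged.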


Once we have our estimator, we call `fit()` to start the training job and pass the inputs that we downloaded earlier. We pass the inputs as a dictionary to define different data channels for training.

In [12]:
import os

data_path = os.path.join(os.getcwd(), 'data')

local_inputs = {
    'train': 'file://{}/train'.format(data_path),
    'validation': 'file://{}/validation'.format(data_path),
    'eval': 'file://{}/eval'.format(data_path),
}
estimator.fit(local_inputs)

Creating 7828rhrjz9-algo-1-9g2mh ... 
Creating 7828rhrjz9-algo-1-9g2mh ... done
Attaching to 7828rhrjz9-algo-1-9g2mh
7828rhrjz9-algo-1-9g2mh | 2021-05-14 10:19:51,755 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
7828rhrjz9-algo-1-9g2mh | 2021-05-14 10:19:51,771 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
7828rhrjz9-algo-1-9g2mh | 2021-05-14 10:19:52,276 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
7828rhrjz9-algo-1-9g2mh | 2021-05-14 10:19:52,333 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
7828rhrjz9-algo-1-9g2mh | 2021-05-14 10:19:52,408 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
7828rhrjz9-algo-1-9g2mh | 2021-05-14 10:19:52,453 sagemaker-containers INFO     Invoking user script

7828rhrjz9-algo-1-9g2mh | Using TensorFlow backend.
7828rhrjz9-algo-1-9g2mh | INFO:root:Writing TensorBoard logs to s3://sagemaker-ap-southeast-2-987959606453/tensorflow-training-2021-05-14-10-19-37-516/model
7828rhrjz9-algo-1-9g2mh | INFO:root:Running with MPI=False
7828rhrjz9-algo-1-9g2mh | INFO:root:getting data
7828rhrjz9-algo-1-9g2mh | INFO:root:Running train in File mode
7828rhrjz9-algo-1-9g2mh | INFO:root:Running eval in File mode
7828rhrjz9-algo-1-9g2mh | INFO:root:Running validation in File mode

7828rhrjz9-algo-1-9g2mh | INFO:tensorflow:SavedModel written to: /opt/ml/model/1/saved_model.pb
7828rhrjz9-algo-1-9g2mh | INFO:tensorflow:SavedModel written to: /opt/ml/model/1/saved_model.pb
7828rhrjz9-algo-1-9g2mh | INFO:root:Model successfully saved at: /opt/ml/model
7828rhrjz9-algo-1-9g2mh | 2021-05-14 10:29:48,019 sagemaker-containers INFO     Reporting training SUCCESS
7828rhrjz9-algo-1-9g2mh exited with code 0
Aborting on container exit...
===== Job Complete =====


### Run a baseline training job on SageMaker

Now we run training jobs on SageMaker, starting with our baseline training job.

### Configure metrics

In addition to running the training job, Amazon SageMaker can retrieve training metrics directly from the logs and send them to CloudWatch metrics. Here, we define metrics we would like to observe:

In [13]:
metric_definitions = [
    {'Name': 'train:loss', 'Regex': '.*loss: ([0-9\\.]+) - accuracy: [0-9\\.]+.*'},
    {'Name': 'train:accuracy', 'Regex': '.*loss: [0-9\\.]+ - accuracy: ([0-9\\.]+).*'},
    {'Name': 'validation:accuracy', 'Regex': '.*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: ([0-9\\.]+).*'},
    {'Name': 'validation:loss', 'Regex': '.*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_accuracy: [0-9\\.]+.*'},
    {'Name': 'sec/steps', 'Regex': '.* - \\d+s (\\d+)[mu]s/step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: [0-9\\.]+'}
]
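To get a feel for what these patterns capture, the sketch below applies them to a single log line with Python's `re` module. The sample line is made up in the Keras end-of-epoch format, and `extract_metrics` is a hypothetical helper; how SageMaker itself applies the patterns to the log stream is not shown here.

```python
import re

# Same patterns as metric_definitions above, written as raw strings.
patterns = {
    'train:loss': r'.*loss: ([0-9\.]+) - accuracy: [0-9\.]+.*',
    'train:accuracy': r'.*loss: [0-9\.]+ - accuracy: ([0-9\.]+).*',
    'validation:accuracy': r'.*step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: [0-9\.]+ - val_accuracy: ([0-9\.]+).*',
    'validation:loss': r'.*step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: ([0-9\.]+) - val_accuracy: [0-9\.]+.*',
    'sec/steps': r'.* - \d+s (\d+)[mu]s/step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: [0-9\.]+ - val_accuracy: [0-9\.]+',
}

# Hypothetical end-of-epoch line in the Keras progress-bar format.
sample_line = ('156/156 [==============================] - 137s 879ms/step'
               ' - loss: 0.0012 - accuracy: 0.9996'
               ' - val_loss: 1.2345 - val_accuracy: 0.7500')


def extract_metrics(line):
    """Return {metric name: captured text} for every pattern that matches."""
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, line)
        if match:
            found[name] = match.group(1)
    return found


print(extract_metrics(sample_line))
```

One thing worth checking against real logs: because `[0-9\.]+` stops at the letter `e`, a loss printed in scientific notation (for example `4.9345e-05`) does not match these patterns.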

Once again, we create a TensorFlow estimator, with a couple of key modifications from last time:

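* `train_instance_type`: the instance type for training. Because we are now training on SageMaker, this must be a SageMaker ML instance type rather than `'local'`. The code below uses `ml.m4.xlarge` (a CPU instance) to stay within Free Tier allowances; a GPU instance such as `ml.p2.xlarge` trains faster. For a list of available instance types, see [the AWS documentation](https://aws.amazon.com/sagemaker/pricing/instance-types).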
* `metric_definitions`: the metrics (defined above) that we want sent to CloudWatch.

In [17]:
from sagemaker.tensorflow import TensorFlow

hyperparameters = {'epochs': 10, 'batch-size': 256}
tags = [{'Key': 'Project', 'Value': 'cifar10'}, {'Key': 'TensorBoard', 'Value': 'file'}]

estimator = TensorFlow(entry_point='cifar10_keras_main.py',
                       source_dir='source_dir',
                       metric_definitions=metric_definitions,
                       hyperparameters=hyperparameters,
                       role=role,
                       framework_version='1.15.2',
                       py_version='py3',
                       train_instance_count=1,
                       train_instance_type='ml.m4.xlarge', # changed from ml.p2.xlarge to ml.m4.xlarge due to 
                                                           # Free Tier allowances
                       base_job_name='cifar10-tf',
                       tags=tags)



Like before, we call `fit()` to start the SageMaker training job and pass the inputs in a dictionary to define different data channels for training. This time, we use the S3 URI from uploading our data.

In [None]:
inputs = {
    'train': '{}/train'.format(dataset_uri),
    'validation': '{}/validation'.format(dataset_uri),
    'eval': '{}/eval'.format(dataset_uri),
}

estimator.fit(inputs)

2021-05-14 10:37:03 Starting - Starting the training job...
2021-05-14 10:37:27 Starting - Launching requested ML instancesProfilerReport-1620988623: InProgress
......
2021-05-14 10:38:29 Starting - Preparing the instances for training.........
2021-05-14 10:39:55 Downloading - Downloading input data
2021-05-14 10:39:55 Training - Downloading the training image...
2021-05-14 10:40:21,562 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2021-05-14 10:40:21,570 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-05-14 10:40:22,032 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-05-14 10:40:22,053 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-05-14 10:40:22,074 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-05-14 10:40:22,087 sagemaker-containers INFO     Invokin

  1/156 [..............................] - ETA: 25:45 - loss: 4.5213 - accuracy: 0.0859
  2/156 [..............................] - ETA: 13:49 - loss: 2.9989 - accuracy: 0.2734
  3/156 [..............................] - ETA: 9:50 - loss: 2.2958 - accuracy: 0.4258
  4/156 [..............................] - ETA: 7:52 - loss: 1.8657 - accuracy: 0.5068
  5/156 [..............................] - ETA: 6:43 - loss: 1.5480 - accuracy: 0.5836
  6/156 [>.............................] - ETA: 5:55 - loss: 1.3169 - accuracy: 0.6432
  7/156 [>.............................] - ETA: 5:22 - loss: 1.1484 - accuracy: 0.6864
  8/156 [>.............................] - ETA: 4:59 - loss: 1.0127 - accuracy: 0.7236
  9/156 [>.............................] - ETA: 4:38 - loss: 0.9032 - accuracy: 0.7543
 10/156 [>.............................] - ETA: 4:22 - loss: 0.8153 - accuracy: 0.7781
 11/156 [=>............................]

Epoch 2/10
  1/156 [..............................] - ETA: 2:25 - loss: 4.9345e-05 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:20 - loss: 4.9841e-05 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:15 - loss: 4.8553e-05 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:12 - loss: 4.4458e-05 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:16 - loss: 4.0583e-05 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:15 - loss: 4.9679e-05 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:15 - loss: 5.1140e-05 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:13 - loss: 5.1411e-05 - accuracy: 1.0000
  9/156 [>.............................] - ETA: 2:13 - loss: 5.1401e-05 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:11 - loss: 4.8669e-05 -

 20/156 [==>...........................] - ETA: 2:00 - loss: 5.2773e-05 - accuracy: 1.0000
 21/156 [===>..........................] - ETA: 1:58 - loss: 5.1198e-05 - accuracy: 1.0000
 22/156 [===>..........................] - ETA: 1:58 - loss: 5.2401e-05 - accuracy: 1.0000
 23/156 [===>..........................] - ETA: 1:56 - loss: 5.1504e-05 - accuracy: 1.0000
 24/156 [===>..........................] - ETA: 1:56 - loss: 5.0386e-05 - accuracy: 1.0000
 25/156 [===>..........................] - ETA: 1:55 - loss: 4.9855e-05 - accuracy: 1.0000
 26/156 [====>.........................] - ETA: 1:56 - loss: 4.9602e-05 - accuracy: 1.0000
 27/156 [====>.........................] - ETA: 1:54 - loss: 4.9014e-05 - accuracy: 1.0000
 28/156 [====>.........................] - ETA: 1:53 - loss: 4.9985e-05 - accuracy: 1.0000
 29/156 [====>.........................] - ETA: 1:52 - loss: 5.0469e-05 - accuracy: 1.0000
 30/156 [====

Epoch 3/10
  1/156 [..............................] - ETA: 2:21 - loss: 2.3888e-05 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:11 - loss: 2.5921e-05 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:08 - loss: 2.2594e-05 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:11 - loss: 2.1763e-05 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:11 - loss: 2.3870e-05 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:08 - loss: 2.7783e-05 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:07 - loss: 2.8058e-05 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:06 - loss: 2.8079e-05 - accuracy: 1.0000
  9/156 [>.............................] - ETA: 2:05 - loss: 2.7899e-05 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:06 - loss: 2.6538e-05 - accuracy:

Epoch 4/10
  1/156 [..............................] - ETA: 2:02 - loss: 2.5433e-05 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:11 - loss: 2.1057e-05 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:09 - loss: 2.1246e-05 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:08 - loss: 3.2144e-05 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:07 - loss: 2.7147e-05 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:09 - loss: 2.6644e-05 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:11 - loss: 2.7001e-05 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:12 - loss: 2.6438e-05 - accuracy: 1.0000
  9/156 [>.............................] - ETA: 2:11 - loss: 2.6732e-05 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:10 - loss: 2.5769e-05 - accuracy:

Epoch 5/10
  1/156 [..............................] - ETA: 2:26 - loss: 1.1096e-05 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:49 - loss: 1.0562e-05 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:46 - loss: 9.7605e-06 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:43 - loss: 1.1195e-05 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:35 - loss: 9.8050e-06 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:30 - loss: 9.4079e-06 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:25 - loss: 8.5465e-06 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:21 - loss: 9.8969e-06 - accuracy: 1.0000

  9/156 [>.............................] - ETA: 2:20 - loss: 9.3393e-06 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:18 - loss: 9.0368e-06 - accuracy: 1.0000
 11/156 [=>............................] - ETA: 2:16 - loss: 8.8927e-06 - accuracy: 1.0000
 12/156 [=>............................] - ETA: 2:14 - loss: 1.3007e-05 - accuracy: 1.0000
 13/156 [=>............................] - ETA: 2:14 - loss: 1.3340e-05 - accuracy: 1.0000
 14/156 [=>............................] - ETA: 2:12 - loss: 1.3074e-05 - accuracy: 1.0000
 15/156 [=>............................] - ETA: 2:11 - loss: 1.3218e-05 - accuracy: 1.0000
 16/156 [==>...........................] - ETA: 2:09 - loss: 1.3608e-05 - accuracy: 1.0000
 17/156 [==>...........................] - ETA: 2:08 - loss: 1.5399e-05 - accuracy: 1.0000
 18/156 [==>...........................] - ETA: 2:08 - loss: 1.5069e-05 - accuracy: 1.0000
 19/156 [==>.

Epoch 6/10
  1/156 [..............................] - ETA: 2:11 - loss: 4.5633e-05 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:12 - loss: 2.5245e-05 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:11 - loss: 1.9757e-05 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:10 - loss: 1.5991e-05 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:07 - loss: 1.4212e-05 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:08 - loss: 1.2687e-05 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:09 - loss: 1.1781e-05 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:08 - loss: 1.1097e-05 - accuracy: 1.0000
  9/156 [>.............................] - ETA: 2:07 - loss: 1.0393e-05 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:06 - loss: 1.1095e-05 - accuracy: 1.0000

 36/156 [=====>........................] - ETA: 1:49 - loss: 8.9106e-06 - accuracy: 1.0000

Epoch 7/10
  1/156 [..............................] - ETA: 2:20 - loss: 6.9935e-06 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:14 - loss: 7.0234e-06 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:13 - loss: 1.1176e-05 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:13 - loss: 9.5472e-06 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:13 - loss: 9.0039e-06 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:12 - loss: 9.5153e-06 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:10 - loss: 9.4579e-06 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:08 - loss: 8.9058e-06 - accuracy: 1.0000
  9/156 [>.............................] - ETA: 2:08 - loss: 9.8761e-06 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:07 - loss: 1.0002e-05 - accuracy:

Epoch 8/10
  1/156 [..............................] - ETA: 2:06 - loss: 2.7943e-05 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:13 - loss: 1.4666e-05 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:11 - loss: 1.1164e-05 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:11 - loss: 9.0284e-06 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:11 - loss: 8.9779e-06 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:09 - loss: 7.9908e-06 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:09 - loss: 9.5108e-06 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:09 - loss: 1.0184e-05 - accuracy: 1.0000
  9/156 [>.............................] - ETA: 2:08 - loss: 9.4133e-06 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:08 - loss: 9.2895e-06 - accuracy:

Epoch 9/10
  1/156 [..............................] - ETA: 2:18 - loss: 6.4401e-06 - accuracy: 1.0000
  2/156 [..............................] - ETA: 2:09 - loss: 5.2160e-06 - accuracy: 1.0000
  3/156 [..............................] - ETA: 2:11 - loss: 5.1886e-06 - accuracy: 1.0000
  4/156 [..............................] - ETA: 2:11 - loss: 4.7952e-06 - accuracy: 1.0000
  5/156 [..............................] - ETA: 2:10 - loss: 4.1685e-06 - accuracy: 1.0000
  6/156 [>.............................] - ETA: 2:09 - loss: 3.7234e-06 - accuracy: 1.0000
  7/156 [>.............................] - ETA: 2:10 - loss: 3.4059e-06 - accuracy: 1.0000
  8/156 [>.............................] - ETA: 2:11 - loss: 5.3979e-06 - accuracy: 1.0000
  9/156 [>.............................] - ETA: 2:11 - loss: 5.0239e-06 - accuracy: 1.0000
 10/156 [>.............................] - ETA: 2:11 - loss: 5.3181e-06 - accuracy:



### View the job training metrics

We can now view the metrics from the training job directly in the SageMaker console.  

Log in to the [SageMaker console](https://console.aws.amazon.com/sagemaker/home), choose the latest training job, and scroll down to the monitor section. Alternatively, the code below uses the region and training job name to generate a URL to CloudWatch metrics.

Using CloudWatch metrics, you can change the period and configure the statistics.

In [None]:
from urllib import parse

from IPython.display import Markdown, display

region = sagemaker_session.boto_region_name
cw_url = parse.urlunparse((
    'https',
    '{}.console.aws.amazon.com'.format(region),
    '/cloudwatch/home',
    '',
    'region={}'.format(region),
    'metricsV2:namespace=/aws/sagemaker/TrainingJobs;dimensions=TrainingJobName;search={}'.format(estimator.latest_training_job.name),
))

display(Markdown('CloudWatch metrics: [link]({}). After you choose a metric, '
                 'change the period to 1 Minute (Graphed Metrics -> Period).'.format(cw_url)))

### Train on SageMaker with Pipe Mode

Here we train our model using Pipe Mode. With Pipe Mode, SageMaker uses [Linux named pipes](https://www.linuxjournal.com/article/2156) to stream the training data directly from S3 instead of downloading the data first.

In our script, we enable Pipe Mode using the following code:

```python
from sagemaker_tensorflow import PipeModeDataset

dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord')
```
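To illustrate the named-pipe idea on its own, the sketch below pushes a few byte records through a FIFO created with `os.mkfifo` while a reader consumes them as they arrive. It is an analogy only: the pipe name `train_0`, the helper `stream_through_fifo`, and the newline-delimited records are assumptions made for this example, and it says nothing about how SageMaker or `PipeModeDataset` are implemented. It needs a Unix-like system.

```python
import os
import tempfile
import threading


def stream_through_fifo(records):
    """Send byte records through a named pipe and return what the reader sees."""
    received = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        fifo_path = os.path.join(tmp_dir, 'train_0')  # hypothetical pipe name
        os.mkfifo(fifo_path)

        def producer():
            # Opening for write blocks until a reader opens the other end.
            with open(fifo_path, 'wb') as pipe:
                for record in records:
                    pipe.write(record + b'\n')

        writer = threading.Thread(target=producer)
        writer.start()
        # The reader consumes records as they arrive; nothing is stored on disk.
        with open(fifo_path, 'rb') as pipe:
            for line in pipe:
                received.append(line.rstrip(b'\n'))
        writer.join()
    return received


streamed = stream_through_fifo([b'record-1', b'record-2', b'record-3'])
print(streamed)
```

The reader never sees a regular file, only a stream, which is why a streaming input mode needs no local copy of the dataset.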

When we create our estimator, the only difference from before is that we also specify `input_mode='Pipe'`:

In [None]:
pipe_mode_estimator = TensorFlow(entry_point='cifar10_keras_main.py',
                                 source_dir='source_dir',
                                 metric_definitions=metric_definitions,
                                 hyperparameters=hyperparameters,
                                 role=role,
                                 framework_version='1.15.2',
                                 py_version='py3',
                                 train_instance_count=1,
                                 train_instance_type='ml.p2.xlarge',
                                 input_mode='Pipe',
                                 base_job_name='cifar10-tf-pipe',
                                 tags=tags)

Here we call `fit()` with `wait=True`, which streams the training logs into the notebook. To start the job and return immediately, set `wait=False`.

In [None]:
pipe_mode_estimator.fit(inputs, wait=True)

### Distributed training with Horovod

[Horovod](https://horovod.readthedocs.io) is a distributed training framework based on MPI. To use Horovod, we make the following changes to our training script:

1. Enable Horovod:

```python
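import horovod.keras as hvd
import tensorflow as tf
from keras import backend as K

# Start Horovod and pin each process to a single GPU, chosen by its local rank
hvd.init()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))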
```

2. Add these callbacks:

```python
hvd.callbacks.BroadcastGlobalVariablesCallback(0)
hvd.callbacks.MetricAverageCallback()
```

3. Configure the optimizer:

```python
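size = hvd.size()  # total number of Horovod processes
opt = Adam(lr=learning_rate * size, decay=weight_decay)  # scale the learning rate by the process count
opt = hvd.DistributedOptimizer(opt)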
```

4. Choose to save checkpoints and send TensorBoard logs only from the master node:

```python
if hvd.rank() == 0:
    save_model(model, args.model_output_dir)
```
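Two of the steps above are plain arithmetic: the distributed optimizer averages gradients across workers, and the learning rate is multiplied by the number of workers. The sketch below shows both in pure Python, with no Horovod involved. The function names, the gradient values, and the base learning rate of 0.001 are made up for illustration.

```python
def allreduce_average(per_worker_gradients):
    """Average gradient vectors element-wise across workers."""
    num_workers = len(per_worker_gradients)
    return [sum(values) / num_workers for values in zip(*per_worker_gradients)]


def scaled_learning_rate(base_learning_rate, num_workers):
    """Scale the learning rate linearly with the number of workers."""
    return base_learning_rate * num_workers


# Made-up gradients for a three-parameter model on two workers.
gradients = [
    [0.25, -0.5, 1.0],  # worker 0
    [0.75, 0.0, 3.0],   # worker 1
]
averaged = allreduce_average(gradients)
print(averaged)
print(scaled_learning_rate(0.001, 2))
```

Every worker applies the same averaged gradient, so the model replicas stay in sync after each step.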

To configure the training job, we specify the following for the distribution:

In [None]:
distribution = {
    'mpi': {
        'enabled': True,
        'processes_per_host': 1,  # Number of Horovod processes per host
    }
}
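As a quick sanity check on this configuration, the total number of Horovod processes is the instance count multiplied by `processes_per_host`. The sketch below works through that arithmetic for the two-instance job in this section. The helper names are hypothetical, and the effective batch size assumes that each process draws its own batch of `batch-size` examples per step.

```python
def horovod_world_size(instance_count, processes_per_host):
    """Total number of Horovod processes across all training instances."""
    return instance_count * processes_per_host


def effective_batch_size(per_process_batch_size, world_size):
    """Examples consumed per step, assuming one batch per process."""
    return per_process_batch_size * world_size


world_size = horovod_world_size(instance_count=2, processes_per_host=1)
print(world_size)
print(effective_batch_size(256, world_size))
```

One process per host fits an instance with a single GPU, such as `ml.p3.2xlarge`; on multi-GPU instances `processes_per_host` is usually set to the number of GPUs.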

This is then passed to our estimator, in addition to setting `train_instance_count` to 2:

In [None]:
dist_estimator = TensorFlow(entry_point='cifar10_keras_main.py',
                            source_dir='source_dir',
                            metric_definitions=metric_definitions,
                            hyperparameters=hyperparameters,
                            distributions=distribution,
                            role=role,
                            framework_version='1.15.2',
                            py_version='py3',
                            train_instance_count=2,
                            train_instance_type='ml.p3.2xlarge',
                            base_job_name='cifar10-tf-dist',
                            tags=tags)

Like before, we call `fit()` on our estimator. If you want to see the training job logs in the notebook output, set `wait=True`.

In [None]:
dist_estimator.fit(inputs, wait=False)

### Compare the training jobs with TensorBoard

Using the visualization tool [TensorBoard](https://www.tensorflow.org/tensorboard), we can compare our training jobs.

In a local setting, install TensorBoard with `pip install tensorboard`. Then run the command generated by the following code:

In [None]:
!python generate_tensorboard_command.py

After running that command, we can access TensorBoard locally at http://localhost:6006.

Based on the TensorBoard metrics, we can see that:
1. All jobs run for 10 epochs (0 - 9).
1. Both File Mode and Pipe Mode run for ~1 minute - Pipe Mode doesn't affect training performance.
1. Distributed training runs for only 45 seconds.
1. All of the training jobs resulted in similar validation accuracy.

This example uses a relatively small dataset (179 MB). For larger datasets, Pipe Mode can significantly reduce training time because it does not download the entire dataset to the training instance's local storage before training starts.

## Deploy the trained model

After we train our model, we can deploy it to a SageMaker Endpoint, which serves prediction requests in real-time. To do so, we simply call `deploy()` on our estimator, passing in the desired number of instances and instance type for the endpoint.

Because we're using TensorFlow Serving for deployment, our training script saves the model in TensorFlow's SavedModel format. For more details, see [this blog post on deploying Keras and TF models in SageMaker](https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker).

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

### Invoke the endpoint

To verify that the endpoint is in service, we generate some random data in the correct shape and get a prediction.

In [None]:
import numpy as np

data = np.random.randn(1, 32, 32, 3)
print('Predicted class: {}'.format(np.argmax(predictor.predict(data)['predictions'])))
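The predicted class is an index from 0 to 9. A small sketch of turning a response into a label follows. The response payload is made up, `top_class` is a hypothetical helper, and the mapping assumes the model was trained with the standard CIFAR-10 label order.

```python
# Standard CIFAR-10 label order (index 0 through 9).
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']


def top_class(response):
    """Return (index, label) of the highest score for the first prediction."""
    scores = response['predictions'][0]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, CIFAR10_CLASSES[best_index]


# Made-up response with the same shape as the endpoint's output.
fake_response = {
    'predictions': [[0.01, 0.02, 0.05, 0.70, 0.02, 0.10, 0.04, 0.03, 0.02, 0.01]]
}
print(top_class(fake_response))
```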

Now let's use the test dataset for predictions.

In [None]:
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

With the data loaded, we can use it for predictions:

In [None]:
from keras.preprocessing.image import ImageDataGenerator

def predict(data):
    predictions = predictor.predict(data)['predictions']
    return predictions


predicted = []
actual = []
batches = 0
batch_size = 128

datagen = ImageDataGenerator()
for data in datagen.flow(x_test, y_test, batch_size=batch_size):
    for i, prediction in enumerate(predict(data[0])):
        predicted.append(np.argmax(prediction))
        actual.append(data[1][i][0])

    batches += 1
    if batches >= len(x_test) / batch_size:
        break
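The explicit `break` is needed because `datagen.flow` keeps yielding batches indefinitely. The sketch below shows the same batching arithmetic without Keras: it computes the index ranges that cover the 10,000 CIFAR-10 test images once with a batch size of 128. `batch_slices` is a hypothetical helper for illustration.

```python
import math


def batch_slices(num_examples, batch_size):
    """Yield (start, stop) index pairs that cover num_examples exactly once."""
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)


slices = list(batch_slices(10000, 128))
print(len(slices), math.ceil(10000 / 128))
print(slices[-1])
```

The last batch is smaller than the others, which is why the loop above compares against `len(x_test) / batch_size` instead of assuming an exact multiple.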

With the predictions, we calculate our model accuracy and create a confusion matrix.

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_pred=predicted, y_true=actual)
display('Average accuracy: {}%'.format(round(accuracy * 100, 2)))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_pred=predicted, y_true=actual)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sn.set(rc={'figure.figsize': (11.7,8.27)})
sn.set(font_scale=1.4)  # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 10})  # font size

Aided by the colors of the heatmap, we can use this confusion matrix to understand how well the model performed for each label.
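The normalization step divides each row by its total, so every row sums to 1 and the diagonal holds the fraction of each true class that was predicted correctly (per-class recall). A minimal pure-Python sketch with a made-up three-class matrix, where `normalize_rows` is a hypothetical helper:

```python
def normalize_rows(matrix):
    """Divide each row by its sum; rows are true classes, columns are predictions."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Made-up counts: rows are true labels, columns are predicted labels.
counts = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 10],
]
rates = normalize_rows(counts)
per_class_recall = [rates[i][i] for i in range(len(rates))]
print(per_class_recall)
```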

## Cleanup

To avoid incurring extra charges to your AWS account, let's delete the endpoint we created:

In [None]:
predictor.delete_endpoint()