# Distributed training  

### Import packages 
* os - provides a portable way of using operating system dependent functionality.

In [1]:
import os

### Importing helper and scientific Python packages
* utils - helper module used below for its `convert_to` function, which writes each dataset split to a TFRecord file.
* numpy - package for scientific computing with Python.

In [2]:
import utils
import numpy as np

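### Importing TensorFlow packages
* tensorflow - library for dataflow programming across a range of tasks.
* tensorflow.examples.tutorials.mnist.input_data - helper for downloading and reading the MNIST dataset, used later to load test data for validation.
* tensorflow.contrib.learn.python.learn.datasets - contains dataset utilities and synthetic/reference datasets; its `mnist` module is used here to get the MNIST dataset.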

In [3]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.contrib.learn.python.learn.datasets import mnist

### Importing Amazon packages
* sagemaker - Python SDK for training and deploying machine learning models on Amazon SageMaker.
* get_execution_role - returns the role ARN whose credentials are used to call the API.
* sagemaker.tensorflow - provides the `TensorFlow` estimator for running custom TensorFlow code on Amazon SageMaker.

In [4]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

## Getting and preprocessing the dataset
* Read the MNIST dataset.
* Split it into three sets: train, validation, and test.
* Convert each set to a TFRecord file.
* Instantiate a SageMaker session.
* Upload the datasets to an S3 location.
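The split step can be pictured with plain Python lists. This is a minimal sketch, assuming the validation examples are held out from the front of the training set, which we believe matches how `read_data_sets` treats `validation_size`. `split_train_validation` is a hypothetical helper, not part of TensorFlow.

```python
def split_train_validation(examples, validation_size):
    """Hold out the first `validation_size` examples for validation (assumed behaviour)."""
    if not 0 <= validation_size <= len(examples):
        raise ValueError("validation_size must be between 0 and len(examples)")
    validation = examples[:validation_size]
    train = examples[validation_size:]
    return train, validation


# MNIST ships 60,000 training examples; integers stand in for (image, label) pairs here.
examples = list(range(60000))
train, validation = split_train_validation(examples, validation_size=5000)
print(len(train), len(validation))
```

With `validation_size=5000`, this leaves 55,000 examples for training; the 10,000 test examples come from the separate `t10k` files.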

In [5]:
data_sets = mnist.read_data_sets('data', dtype=tf.uint8, reshape=False, validation_size=5000)

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/t10k-labels-idx1-ubyte.gz


In [6]:
utils.convert_to(data_sets.train, 'train', 'data')
utils.convert_to(data_sets.validation, 'validation', 'data')
utils.convert_to(data_sets.test, 'test', 'data')

('Writing', 'data/train.tfrecords')
('Writing', 'data/validation.tfrecords')
('Writing', 'data/test.tfrecords')


In [7]:
sagemaker_session = sagemaker.Session()

In [8]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/mnist')

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-2-324118574079
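The value stored in `inputs` is, as far as we can tell, an S3 URI that points at the uploaded prefix. The sketch below only illustrates how such a URI is put together from the bucket name in the log line above and the `key_prefix` argument. Both helper functions are hypothetical and do not call the SageMaker SDK, and the bucket naming pattern is inferred from that single log line.

```python
def default_bucket_name(region, account_id):
    """Build a bucket name in the pattern seen in the log line above (inferred, not an SDK call)."""
    return "sagemaker-{}-{}".format(region, account_id)


def uploaded_prefix_uri(bucket, key_prefix):
    """Return the S3 URI that names everything stored under `key_prefix`."""
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


bucket = default_bucket_name("us-east-2", "324118574079")
inputs_uri = uploaded_prefix_uri(bucket, "data/mnist")
print(inputs_uri)
```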


## Training the model
* You will need a predefined training script (`mnist.py` in this example).
* Define a TensorFlow estimator object and pass the Python script as the `entry_point` parameter.
* Your TensorFlow training script must be a Python 2.7 source file that defines the following functions:
    * **model_fn**: Defines the model that will be trained.
    * **train_input_fn**: Preprocesses and loads training data.
    * **eval_input_fn**: Preprocesses and loads evaluation data.
    * **serving_input_fn**: Defines the features to be passed to the model during prediction.
* Get the role ARN whose credentials are used to call the API.
* To perform distributed training, set the instance count to 2.
* Invoke the `fit` method to train the model. It creates a training job on two ml.c4.xlarge instances. The logs show the instances doing training, evaluation, and incrementing the number of training steps.
* Invoke the `deploy` method to create an endpoint.
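Before launching a training job, it can help to confirm that the script defines the four functions listed above. The following is a minimal sketch of such a check; `REQUIRED_FUNCTIONS` and `missing_functions` are hypothetical names introduced for illustration and are not part of the SageMaker SDK.

```python
REQUIRED_FUNCTIONS = ("model_fn", "train_input_fn", "eval_input_fn", "serving_input_fn")


def missing_functions(namespace):
    """Return the required function names that `namespace` does not define as callables."""
    return [name for name in REQUIRED_FUNCTIONS if not callable(namespace.get(name))]


# A stand-in for a training script's namespace; for a real module, pass vars(module).
script_namespace = {
    "model_fn": lambda features, labels, mode, params: None,
    "train_input_fn": lambda *args: None,
    "eval_input_fn": lambda *args: None,
}
print(missing_functions(script_namespace))
```

Here the stand-in namespace omits `serving_input_fn`, so the check reports that name; an empty list means all four functions are present.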

In [9]:
!cat 'mnist.py'

import os
import tensorflow as tf
from tensorflow.python.estimator.model_fn import ModeKeys as Modes

INPUT_TENSOR_NAME = 'inputs'
SIGNATURE_NAME = 'predictions'

LEARNING_RATE = 0.001


def model_fn(features, labels, mode, params):
    # Input Layer
    input_layer = tf.reshape(features[INPUT_TENSOR_NAME], [-1, 28, 28, 1])

    # Convolutional Layer #1
    conv1 = tf.layers.conv2d(
        inputs=input_layer,
        filters=32,
        kernel_size=[5, 5],
        padding='same',
        activation=tf.nn.relu)

    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

    # Convolutional Layer #2 and Pooling Layer #2
    conv2 = tf.layers.conv2d(
        inputs=pool1,
        filters=64,
        kernel_size=[5, 5],
        padding='same',
        activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

    # Dense Layer
    pool2_flat = tf.reshape(pool2, [-1, 7

In [10]:
role = get_execution_role()
role

'arn:aws:iam::324118574079:role/service-role/AmazonSageMaker-ExecutionRole-20180209T192191'

In [11]:
mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             training_steps=1000, 
                             evaluation_steps=100,
                             train_instance_count=2,
                             train_instance_type='ml.c4.xlarge')

In [12]:
mnist_estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: sagemaker-tensorflow-py2-cpu-2018-03-10-12-19-50-341


...................................................................
executing startup script (first run)
2018-03-10 12:25:18,122 INFO - root - running container entrypoint
2018-03-10 12:25:18,122 INFO - root - starting train task
executing startup script (first run)
2018-03-10 12:25:16,712 INFO - root - running container entrypoint
2018-03-10 12:25:16,713 INFO - root - starting train task
2018-03-10 12:25:18,318 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-03-10 12:25:19,219 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
2018-03-10 12:25:19,303 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.us-east-2.amazonaws.com
2018-03-10 12:25:19,391 INFO - botocore.vendored.requests.packages.urllib3.connectio

INFO:tensorflow:loss = 2.3031068, step = 0
2018-03-10 12:25:51.531754: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 9f44844d8f8b80a0 with config: gpu_options { per_process_gpu_memory_fraction: 1 } allow_soft_placement: true
INFO:tensorflow:loss = 0.14459099, step = 88
INFO:tensorflow:global_step/sec: 3.62032
INFO:tensorflow:loss = 0.06488793, step = 110 (29.435 sec)
INFO:tensorflow:global_step/sec: 6.96771
INFO:tensorflow:loss = 0.040450353, step = 281 (28.099 sec)
INFO:tensorflow:global_step/sec: 6.71306
INFO:tensorflow:loss = 0.04918912, step = 318 (30.389 sec)
INFO:tensorflow:global_step/sec: 6.77391
INFO:tensorflow:loss = 0.05305976, step = 477 (29.033 sec)
INFO:tensorflow:global_step/sec: 6.73164
INFO:tensorflow:loss = 0.06597349, step = 522 (30.229 sec)
INFO:tensorflow:global_step/sec: 6.86542
INFO:tensorflow:loss

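Because both instances write to the same log stream, their loss lines are interleaved. A small parser can collect them into one sequence ordered by global step. This is a sketch that assumes the `loss = ..., step = ...` line format shown in the log above; `LOSS_LINE` and `parse_loss_lines` are hypothetical names.

```python
import re

# Matches lines such as "INFO:tensorflow:loss = 0.14459099, step = 88".
LOSS_LINE = re.compile(r"loss = ([0-9.eE+-]+), step = (\d+)")


def parse_loss_lines(log_lines):
    """Return (step, loss) pairs for every line that reports a loss, sorted by step."""
    points = []
    for line in log_lines:
        match = LOSS_LINE.search(line)
        if match:
            points.append((int(match.group(2)), float(match.group(1))))
    return sorted(points)


# A few lines copied from the training log, deliberately out of step order.
sample = [
    "INFO:tensorflow:loss = 2.3031068, step = 0",
    "INFO:tensorflow:global_step/sec: 3.62032",
    "INFO:tensorflow:loss = 0.06488793, step = 110 (29.435 sec)",
    "INFO:tensorflow:loss = 0.14459099, step = 88",
]
for step, loss in parse_loss_lines(sample):
    print(step, loss)
```

Lines without a loss value, such as the `global_step/sec` reports, are skipped.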
In [20]:
mnist_predictor = mnist_estimator.deploy(initial_instance_count=1,
                                         instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-tensorflow-py2-cpu-2018-03-08-11-07-26-995
INFO:sagemaker:Creating endpoint with name sagemaker-tensorflow-py2-cpu-2018-03-08-11-07-26-995


-------------------------------------------------------------------------------------------------------------!

## Validating the model
* Get some data for testing.
* Call the predictor and compare the predicted labels with the labels from the test data.

In [23]:
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


Datasets(train=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x7fd9faa09e50>, validation=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x7fd9e3fa5c10>, test=<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x7fd9e3fa5510>)

In [24]:
for i in range(10):
    data = mnist.test.images[i].tolist()
    tensor_proto = tf.make_tensor_proto(values=np.asarray(data), shape=[1, len(data)], dtype=tf.float32)
    predict_response = mnist_predictor.predict(tensor_proto)
    
    print("========================================")
    label = np.argmax(mnist.test.labels[i])
    print("label is {}".format(label))
    prediction = predict_response['outputs']['classes']['int64Val'][0]
    print("prediction is {}".format(prediction))

label is 7
prediction is 7
label is 2
prediction is 2
label is 1
prediction is 1
label is 0
prediction is 0
label is 4
prediction is 4
label is 1
prediction is 1
label is 4
prediction is 4
label is 9
prediction is 9
label is 5
prediction is 5
label is 9
prediction is 9
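The comparison above can be condensed into a single accuracy figure. The sketch below uses plain Python so that it runs without TensorFlow or an endpoint; `argmax` mimics what `np.argmax` does for a one-hot label, and `accuracy` is a hypothetical helper. The labels and predictions are the ten values printed above.

```python
def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(labels, predictions):
    """Return the fraction of predictions that match their labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    matches = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return matches / float(len(labels))


# Labels and predictions copied from the output above.
labels = [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
predictions = [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
print(accuracy(labels, predictions))

# A one-hot label decodes to its class index.
one_hot_seven = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(argmax(one_hot_seven))
```

Ten examples are too few to estimate the model's real accuracy; running the same comparison over the whole test set would give a more reliable figure.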


## Delete the endpoint

In [25]:
sagemaker_session.delete_endpoint(mnist_predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: sagemaker-tensorflow-py2-cpu-2018-03-08-11-07-26-995
