# TensorFlow BYOM: Train with Custom Training Script, Compile with Neo, and Deploy on SageMaker

This notebook is functionally comparable to the [TensorFlow MNIST distributed training notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb). We perform the same classification task, but this time we compile the trained model with the Neo API backend to optimize it for our choice of hardware. Finally, we set up a real-time hosted endpoint in SageMaker for the compiled model using the Neo Deep Learning Runtime.

### Set up the environment

In [16]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

### Download the MNIST dataset

In [54]:
import utils
from tensorflow.contrib.learn.python.learn.datasets import mnist
import tensorflow as tf

data_sets = mnist.read_data_sets('data', dtype=tf.uint8, reshape=False, validation_size=5000)
print(data_sets.test)

x = data_sets.train.images[0].shape[0]
y = tf.train.Feature(int64_list=tf.train.Int64List(value=[x]))
print(y)

z = data_sets.train.images[0]
a_raw = z.tobytes()  # raw pixel bytes; tostring() is a deprecated alias of tobytes()

b = tf.train.Feature(bytes_list=tf.train.BytesList(value=[a_raw]))
print(b)
# The label must be assigned before it is printed
p = data_sets.train.labels[0]
print('p', p)

newdir = os.path.join('data', 'trial' + '.tfrecords')
writernew = tf.python_io.TFRecordWriter(newdir)

newexample = tf.train.Example(features=tf.train.Features(feature={
            'somefeature': y,
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[p])),
            'a_raw': b}))
writernew.write(newexample.SerializeToString())
writernew.close()


#print(data_sets.train.images[0][1])

utils.convert_to(data_sets.train, 'train', 'data')
utils.convert_to(data_sets.validation, 'validation', 'data')
utils.convert_to(data_sets.test, 'test', 'data')


Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x7f3159960278>
int64_list {
  value: 28
}

bytes_list {
  value: "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\00

### Upload the data
We use the ```sagemaker.Session.upload_data``` function to upload our datasets to an S3 location. The return value, ```inputs```, identifies that location; we will use it later when we start the training job.

In [18]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-mnist')
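
As a rough illustration of what `inputs` holds: for a directory upload, the SageMaker Python SDK documents the returned URI as `s3://{bucket}/{key_prefix}`. The helper below, `expected_upload_uri`, is hypothetical and only mirrors that format; the bucket name is a made-up placeholder, not the session's real default bucket.

```python
def expected_upload_uri(bucket, key_prefix):
    """Hypothetical helper: mirror the documented URI format for a directory upload."""
    return "s3://{}/{}".format(bucket, key_prefix)


# 'my-example-bucket' is a placeholder; the real bucket comes from the SageMaker session.
example_uri = expected_upload_uri("my-example-bucket", "data/DEMO-mnist")
print(example_uri)
```

Printing the real `inputs` value after the upload is a quick way to confirm where the TFRecord files ended up.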

# Construct a script for distributed training 
Here is the full code for the network model:

In [55]:
!cat 'mnist.py'

import os
import tensorflow as tf
from tensorflow.python.estimator.model_fn import ModeKeys as Modes

INPUT_TENSOR_NAME = 'inputs'
SIGNATURE_NAME = 'predictions'

LEARNING_RATE = 0.001


def model_fn(features, labels, mode, params):
    # Input Layer
    print("===>",features, labels)
    input_layer = tf.reshape(features[INPUT_TENSOR_NAME], [-1, 28, 28, 1])

    # Convolutional Layer #1
    conv1 = tf.layers.conv2d(
        inputs=input_layer,
        filters=32,
        kernel_size=[5, 5],
        padding='same',
        activation=tf.nn.relu)

    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

    # Convolutional Layer #2 and Pooling Layer #2
    conv2 = tf.layers.conv2d(
        inputs=pool1,
        filters=64,
        kernel_size=[5, 5],
        padding='same',
        activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

    # Dense Layer
    p

The script here is an adaptation of the [TensorFlow MNIST example](https://github.com/tensorflow/models/tree/master/official/mnist). It provides a ```model_fn(features, labels, mode, params)``` function, which is used for training, evaluation, and inference. See the [TensorFlow MNIST distributed training notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb) for more details about the training script.

At the end of the training script, there are two additional functions, to be used with Neo Deep Learning Runtime:
* `neo_preprocess(payload, content_type)`: Function that takes in the payload and Content-Type of each incoming request and returns a NumPy array
* `neo_postprocess(result)`: Function that takes the prediction results produced by the Deep Learning Runtime and returns the response body
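
The truncated listing above does not show these two functions, so the following is only a sketch of what they might look like. It assumes the request body is a `.npy`-serialized array (content type `application/vnd+python.numpy+binary`, as set on the predictor later in this notebook) and that the runtime returns raw scores for a batch of one. The softmax step, the JSON response format, and the `(body, content_type)` return pair are assumptions, not necessarily what `mnist.py` does.

```python
import io
import json

import numpy as np

NUMPY_CONTENT_TYPE = 'application/vnd+python.numpy+binary'


def neo_preprocess(payload, content_type):
    """Turn the raw request body into a NumPy array for the compiled model."""
    if content_type != NUMPY_CONTENT_TYPE:
        raise RuntimeError('Unsupported content type: {}'.format(content_type))
    # Assumption: the payload is exactly the bytes written by np.save
    return np.load(io.BytesIO(payload))


def neo_postprocess(result):
    """Turn the model output into a (response_body, content_type) pair."""
    # Assumption: batch size 1 and unnormalized scores; apply a numerically stable softmax
    scores = np.squeeze(np.asarray(result, dtype=np.float64))
    exp_scores = np.exp(scores - np.max(scores))
    probabilities = exp_scores / np.sum(exp_scores)
    return json.dumps(probabilities.tolist()), 'application/json'


# Local round trip with a dummy flattened 28x28 image (no endpoint involved)
buffer = io.BytesIO()
np.save(buffer, np.zeros((1, 784), dtype=np.float32))
image = neo_preprocess(buffer.getvalue(), NUMPY_CONTENT_TYPE)
body, response_type = neo_postprocess(np.arange(10.0))
print(image.shape, response_type, int(np.argmax(json.loads(body))))
```

Running the pair locally like this is a cheap way to check the serialization format before a compiled model is deployed.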

## Create a training job using the sagemaker.TensorFlow estimator

In [56]:
from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             framework_version='1.11.0',
                             training_steps=1000, 
                             evaluation_steps=100,
                             train_instance_count=2,
                             train_instance_type='ml.c4.xlarge')

mnist_estimator.fit(inputs)

2.1.0 is the latest version of tensorflow that supports Python 2. Newer versions of tensorflow will only be available for Python 3.Please set the argument "py_version='py3'" to use the Python 3 tensorflow image.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-10-05 14:03:35 Starting - Starting the training job...
2020-10-05 14:03:37 Starting - Launching requested ML instances......
2020-10-05 14:04:55 Starting - Preparing the instances for training......
2020-10-05 14:05:51 Downloading - Downloading input data...
2020-10-05 14:06:26 Training - Training image download completed. Training in progress.
2020-10-05 14:06:26,061 INFO - root - running container entrypoint
2020-10-05 14:06:26,061 INFO - root - starting train task
2020-10-05 14:06:26,075 INFO - container_support.training - Training starting
Downloading s3://sagemaker-ap-southeast-1-018166606076/sagemaker-tensorflow-2020-10-05-14-03-33-954/source/sourcedir.tar.gz to /tmp/script.tar.gz
2020-10-05 14:06:28,798 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2020-10-05 14:06:28,798 INFO - tf_container - {"environment": "cloud", "cluster": {"worker": ["algo-2:2222"], "ps": ["algo-1:2223", "algo-2:2223"]

2020-10-05 14:06:50,089 INFO - tensorflow - global_step/sec: 7.01743
2020-10-05 14:07:01,862 INFO - tensorflow - loss = 0.0615984, step = 190 (24.190 sec)
2020-10-05 14:07:03,809 INFO - tensorflow - global_step/sec: 7.43396
2020-10-05 14:07:04,869 INFO - tensorflow - loss = 0.05768078, step = 212 (29.457 sec)
2020-10-05 14:07:17,044 INFO - tensorflow - global_step/sec: 7.63152
2020-10-05 14:07:26,395 INFO - tensorflow - loss = 0.04406964, step = 374 (24.533 sec)
2020-10-05 14:07:30,720 INFO - tensorflow - global_step/sec: 7.45856
2020-10-05 14:07:33,856 INFO - tensorflow - loss = 0.05710558, step = 429 (28.987 sec)
2020-10-05 14:07:44,442 INFO - tensorflow - global_step/sec: 7.5062
2020-10-05 14:07:51,493 INFO - tensorflow - loss = 0.05259962, step = 561 (25.098 sec)
2020-10-05 14:07:57,993 INFO - tensorflow - global_step/sec: 7.45303
2020-10-05 14:08:02,526 INFO - tensorflow - loss 

2020-10-05 14:10:17,065 INFO - tf_container - master algo-1 is down, stopping parameter server

2020-10-05 14:10:22 Uploading - Uploading generated training model
2020-10-05 14:10:56 Completed - Training job completed
Training seconds: 610
Billable seconds: 610


The **```fit```** method creates a training job on two **ml.c4.xlarge** instances. The logs above show the instances training, evaluating, and incrementing the number of **training steps**.

At the end of training, the job produces a saved model for TensorFlow Serving.
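
To follow the loss across steps without reading the raw logs, the loss lines can be parsed with a regular expression. This is a minimal sketch that assumes the `loss = ..., step = ...` format shown in the log excerpt above; `parse_loss_lines` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# Matches e.g. "loss = 0.0615984, step = 190 (24.190 sec)"
LOSS_PATTERN = re.compile(r"loss = ([0-9.eE+-]+), step = (\d+)")


def parse_loss_lines(log_lines):
    """Return (step, loss) pairs for every line that reports a loss."""
    pairs = []
    for line in log_lines:
        match = LOSS_PATTERN.search(line)
        if match:
            pairs.append((int(match.group(2)), float(match.group(1))))
    return pairs


# Three lines copied from the training log above
sample_log = [
    "2020-10-05 14:06:50,089 INFO - tensorflow - global_step/sec: 7.01743",
    "2020-10-05 14:07:01,862 INFO - tensorflow - loss = 0.0615984, step = 190 (24.190 sec)",
    "2020-10-05 14:07:04,869 INFO - tensorflow - loss = 0.05768078, step = 212 (29.457 sec)",
]
loss_by_step = parse_loss_lines(sample_log)
print(loss_by_step)
```

With two instances, both hosts report losses, so the steps in a real log interleave rather than increase strictly line by line.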

# Deploy the trained model to prepare for predictions (the old way)

The `deploy()` method creates an endpoint that serves prediction requests in real time.

In [21]:
mnist_predictor = mnist_estimator.deploy(initial_instance_count=1,
                                         instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-ap-southeast-1-018166606076/tensorflow-training-2020-09-26-11-19-40-794/output/model.tar.gz.

## Invoking the endpoint

In [None]:
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

for i in range(10):
    data = mnist.test.images[i].tolist()
    tensor_proto = tf.make_tensor_proto(values=np.asarray(data), shape=[1, len(data)], dtype=tf.float32)
    predict_response = mnist_predictor.predict(tensor_proto)
    
    print("========================================")
    label = np.argmax(mnist.test.labels[i])
    print("label is {}".format(label))
    prediction = np.argmax(predict_response['outputs']['probabilities']['float_val'])
    print("prediction is {}".format(prediction))

## Deleting the endpoint

In [None]:
sagemaker.Session().delete_endpoint(mnist_predictor.endpoint)

# Deploy the trained model using Neo

Now the model is ready to be compiled by Neo and optimized for our hardware of choice. We use the ``TensorFlowEstimator.compile_model`` method to do this. For this example, our target hardware is ``'ml_c5'``. You can change this to any other supported target hardware if you prefer.

## Compiling the model
The ``input_shape`` argument defines the model's input tensor, and ``output_path`` is where the compiled model will be stored in S3. **Important: if the following command results in a permission error, scroll up and locate the execution role returned by `get_execution_role()`. The role must have access to the S3 bucket specified in ``output_path``.**
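
To make the two arguments concrete, here is a small sketch in plain Python. The `'inputs'` key matches `INPUT_TENSOR_NAME` in `mnist.py`, and the shape is one flattened 28x28 image. `parent_s3_prefix` is a hypothetical helper that mirrors the string manipulation applied to `mnist_estimator.output_path` in the next cell; the bucket name is a placeholder.

```python
# Each MNIST image is 28x28 pixels, flattened to a single row of 784 values
IMAGE_SIDE = 28
input_shape = {'inputs': [1, IMAGE_SIDE * IMAGE_SIDE]}  # batch size 1


def parent_s3_prefix(s3_uri):
    """Drop the last '/'-separated component of an S3 URI."""
    return '/'.join(s3_uri.split('/')[:-1])


# Placeholder URI; the real value comes from mnist_estimator.output_path
example_output_path = parent_s3_prefix('s3://my-example-bucket/')
print(input_shape, example_output_path)
```

Note that a URI with a trailing slash only loses that slash, which is consistent with the bucket-level path printed by the compile cell.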

In [21]:
output_path = '/'.join(mnist_estimator.output_path.split('/')[:-1])

print(output_path)


# 'rasp3b' is the target used in this recorded run; use 'ml_c5' to deploy on an ml.c5 endpoint
optimized_estimator = mnist_estimator.compile_model(target_instance_family='rasp3b',
                              output_path=output_path,
                              input_shape={'inputs': [1, 784]},  # Batch size 1, 784 flattened pixels (28x28 image)
                              framework='tensorflow', framework_version='1.11.0')


Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
2.1.0 is the latest version of tensorflow that supports Python 2. Newer versions of tensorflow will only be available for Python 3.Please set the argument "py_version='py3'" to use the Python 3 tensorflow image.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


s3://sagemaker-ap-southeast-1-018166606076
??.....!

The instance type rasp3b is not supported for deployment via SageMaker.Please deploy the model manually.


## Deploying the compiled model

In [None]:
optimized_predictor = optimized_estimator.deploy(initial_instance_count = 1,
                                                 instance_type = 'ml.c5.4xlarge')

In [None]:
import io
import numpy as np

def numpy_bytes_serializer(data):
    # Serialize a NumPy array to .npy bytes for the request body
    f = io.BytesIO()
    np.save(f, data)
    f.seek(0)
    return f.read()

optimized_predictor.content_type = 'application/vnd+python.numpy+binary'
optimized_predictor.serializer = numpy_bytes_serializer

## Invoking the endpoint

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
from IPython import display
import PIL.Image
import io

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

for i in range(10):
    data = mnist.test.images[i]
    # Display image
    im = PIL.Image.fromarray(data.reshape((28,28))*255).convert('L')
    display.display(im)
    # Invoke endpoint with image
    predict_response = optimized_predictor.predict(data)
    
    print("========================================")
    label = np.argmax(mnist.test.labels[i])
    print("label is {}".format(label))
    prediction = predict_response
    print("prediction is {}".format(np.argmax(prediction)))

## Deleting the endpoint

In [None]:
sagemaker.Session().delete_endpoint(optimized_predictor.endpoint)