# Train and Host a Keras Model on Amazon SageMaker

Amazon SageMaker is a fully-managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK makes it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks, including TensorFlow and Keras.

In this notebook, we train and host a [Keras Sequential model](https://keras.io/getting-started/sequential-model-guide) on SageMaker. The model used for this notebook is a simple multi-layer perceptron, referred to here as a vanilla neural network (VNN).

## Setup

First, check the directory structure. If a `lost+found` folder is present with root as its owner and/or group, change its owner and group to `ec2-user`.

In [None]:
%%sh
ls -l

In [None]:
%%sh
sudo chown ec2-user lost+found

In [None]:
%%sh
ls -l

In [None]:
%%sh
sudo chgrp ec2-user lost+found

In [None]:
%%sh
ls -l

Next, define a few variables that will be needed later. Don't forget to change the kernel to **conda_tensorflow_p36**.

In [None]:
import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()

role = get_execution_role()

## The MNIST dataset

The [MNIST dataset](https://deepai.org/dataset/mnist) is a low-complexity collection of handwritten digits used to train and test supervised machine learning algorithms. It is often considered the "Hello, World!" of machine learning. The database contains 70,000 28x28 grayscale images representing the digits zero through nine. It is split into two subsets: 60,000 images in the training set and 10,000 images in the test set. Keeping the test images separate makes it possible to check that a trained model can accurately classify images it has not seen before.

### Prepare the dataset for training

In [None]:
# Import os, keras, numpy, pyplot and the MNIST data 
import os
import keras
import numpy as np
from keras.datasets import mnist
from matplotlib import pyplot

# mnist = tf.keras.datasets.mnist # get mnist from keras

(x_train, y_train), (x_val, y_val) = mnist.load_data()

In [None]:
# Take a quick look at the data

# Each image is represented as a 28x28 pixel grayscale image
# View the shape and type of the data
xtr = x_train.shape, x_train.dtype
ytr = y_train.shape, y_train.dtype

print("x_train_shape & data type:", xtr)
print("y_train_shape & data type:", ytr)

# plot some raw pixel data
for i in range(9):  
    pyplot.subplot(330 + 1 + i)
    pyplot.imshow(x_train[i], cmap=pyplot.get_cmap('gray'))

In [None]:
# Create local directory for the data and save the training and test data there
os.makedirs("./data", exist_ok=True)
np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/test', image=x_val, label=y_val)

In [None]:
%%sh 
ls -l data ## Check that the directories have been created and the files have been saved successfully
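
Because `np.savez` stores each array under the keyword used when saving, a training script has to read the arrays back with the same keys (`image` and `label` here). The following is a small sketch of that round trip. It uses random stand-in arrays and a temporary directory rather than the real MNIST files, so the array sizes are assumptions chosen only for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in data with MNIST-like shapes (assumption: 8 fake images instead of 60,000).
images = np.random.randint(0, 256, size=(8, 28, 28), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(8,), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    # np.savez appends the .npz extension when it is missing.
    np.savez(os.path.join(tmp_dir, "training"), image=images, label=labels)

    # Read the arrays back using the same keys they were saved under.
    with np.load(os.path.join(tmp_dir, "training.npz")) as archive:
        loaded_images = archive["image"]
        loaded_labels = archive["label"]
        keys = sorted(archive.files)

print(keys, loaded_images.shape, loaded_labels.shape)
```

If the keys used when loading do not match the ones used when saving, `np.load` raises a `KeyError`, which is an easy mistake to catch locally before launching a training job.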

### Verify the training code

Next, train the model on the local instance. This step is optional; its purpose is to check that the code will run on AWS. The `TensorFlow()` constructor is used to create a `tf_estimator` object, which then trains the model.

In more detail, before running the baseline training job, [the SageMaker Python SDK's Local Mode feature](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode) is used to check that the code works with SageMaker's TensorFlow environment. Local Mode downloads the [prebuilt Docker image for TensorFlow](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html) and runs a Docker container locally for the training job. A TensorFlow estimator is created with `instance_type` set to `'local'` or `'local_gpu'`, depending on whether the local instance has a GPU. This tells the estimator to run the training job locally (as opposed to on SageMaker). The training code is run for only one epoch because the intent is to verify the code, not to train an accurate model.

**Don't forget to upload the Python script (`mnist_vnn_tf2.py`) to the same notebook instance.**
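
The entry-point script itself is not shown in this notebook. As a rough sketch of how a script-mode entry point typically receives its inputs: SageMaker passes each hyperparameter as a command-line argument (`--epochs`, `--batch-size`) and exposes the data channel directories through environment variables such as `SM_CHANNEL_TRAINING`, `SM_CHANNEL_VALIDATION`, and `SM_MODEL_DIR`. The function name `parse_args`, the `--sm-model-dir` flag, and the fallback defaults below are assumptions for illustration, not the contents of `mnist_vnn_tf2.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths (illustrative sketch)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as CLI flags named after the estimator's dict keys.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=64)

    # Channel names passed to fit() become SM_CHANNEL_<NAME> variables in the container.
    # The fallback paths are assumptions so the sketch also runs outside SageMaker.
    parser.add_argument("--training", default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "data"))
    parser.add_argument("--sm-model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))

    # parse_known_args ignores any extra flags the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args


# Example: the same values as the local hyperparameters used in this notebook.
args = parse_args(["--epochs", "1", "--batch-size", "64"])
print(args.epochs, args.batch_size)
```

The keys `training` and `validation` used in the `fit()` call are what determine the channel names, so the script and the notebook need to agree on them.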

In [None]:
# Import tensorflow from sagemaker
from sagemaker.tensorflow import TensorFlow

import shutil
import subprocess

# File paths to the local data and the output location
local_training_input_path = 'file://data/training.npz'
local_test_input_path = 'file://data/test.npz'
output = 'file:///output'

instance_type = "local"

# Switch to GPU Local Mode only if nvidia-smi exists and runs successfully
if shutil.which("nvidia-smi") and subprocess.call("nvidia-smi") == 0:
    instance_type = "local_gpu"

local_hyperparameters = {"epochs": 1, "batch-size": 64}

tf_estimator = TensorFlow(entry_point='mnist_vnn_tf2.py', # path to the local Python source file to be executed
                          role=role, # the IAM role ARN used by the training job
                          source_dir='.', # directory containing the entry point and any other dependencies
                          instance_count=1, # the number of EC2 instances to use
                          instance_type=instance_type, # 'local' or 'local_gpu' runs the job on this notebook instance
                          framework_version='2.1.0', # TensorFlow version used to run the training code
                          py_version='py3',
                          script_mode=True,
                          hyperparameters=local_hyperparameters,
                          output_path=output) # where to save the results; if omitted, the default S3 bucket is used

In [None]:
# fit() trains the model defined by the estimator object. The local file paths to the
# training and test data are passed in as named channels.
tf_estimator.fit({'training': local_training_input_path, 'validation': local_test_input_path})

### Train the model in AWS

Now that the code has been verified to work in SageMaker's TensorFlow environment, the model can be trained on a larger instance. Note that the local check was practical only because the dataset is small and the neural network is shallow; with large datasets or deep neural networks, training on the notebook instance is unlikely to be feasible.

1. Upload the dataset to S3. Amazon S3 is AWS's object storage service; the SageMaker session's default bucket is used here to store the data and the model output.
2. Select the [EC2 instance type](https://aws.amazon.com/ec2/instance-types/) for the model. MA5852 will mainly use *ml.m4.xlarge*. EC2 stands for Elastic Compute Cloud, a web service through which AWS subscribers can request and provision compute capacity in the AWS cloud. Usage is charged per hour at a rate that depends on the instance chosen. Don't forget to terminate the instance when done to avoid unnecessary charges.

In [None]:
prefix = 'keras-mnist'

# upload_data() copies each local file to the session's default S3 bucket and returns its S3 URI
training_input_path = sess.upload_data('data/training.npz', key_prefix=prefix + '/training')
test_input_path = sess.upload_data('data/test.npz', key_prefix=prefix + '/validation')

print(training_input_path)
print(test_input_path)
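
For a single file, `upload_data` returns the S3 URI of the uploaded object, built from the bucket, the key prefix, and the file name. The helper below is a hypothetical illustration of that layout (it does not call AWS), and the bucket name is a made-up example following the `sagemaker-<region>-<account-id>` pattern of the session's default bucket.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Return the URI a single-file upload is expected to produce (illustrative only)."""
    file_name = posixpath.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix, file_name)


# Hypothetical bucket name; the real one comes from sess.default_bucket().
example_bucket = "sagemaker-us-east-1-123456789012"
example_prefix = "keras-mnist"

print(expected_s3_uri(example_bucket, example_prefix + "/training", "data/training.npz"))
print(expected_s3_uri(example_bucket, example_prefix + "/validation", "data/test.npz"))
```

Comparing the printed paths from the upload cell against this pattern is a quick way to confirm the files landed under the intended prefix.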

In [None]:
from sagemaker.tensorflow import TensorFlow

hyperparameters = {"epochs": 10, "batch-size": 256}

# Reuse the name tf_estimator so that the fit() and deploy() cells that follow act on
# this SageMaker training job rather than on the earlier local one.
tf_estimator = TensorFlow(
    entry_point="mnist_vnn_tf2.py",
    role=role,
    source_dir=".",
    hyperparameters=hyperparameters,
    framework_version="2.1.0",
    py_version="py3",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    script_mode=True,
)

In [None]:
tf_estimator.fit({'training': training_input_path, 'validation': test_input_path})

## Deploy the trained model

After the model is trained, it can be deployed to a SageMaker Endpoint, which serves prediction requests in real-time. To do so, simply call `deploy()` on the estimator, passing in the desired number of instances and instance type for the endpoint.

In [None]:
import time

# Give the endpoint a unique name by appending the current UTC date and time
tf_endpoint_name = 'keras-tf-mnist-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

# deploy() deploys the model to an endpoint and returns a predictor
tf_predictor = tf_estimator.deploy(initial_instance_count=1, # initial number of instances to run behind the endpoint
                                   instance_type='ml.m4.xlarge', # EC2 instance type to deploy the model to
                                   endpoint_name=tf_endpoint_name) # name of the endpoint to create

Now use the test dataset to make predictions.

In [None]:
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
import numpy as np


def predict(data):
    # The endpoint returns a dict; the class scores are under "predictions"
    predictions = tf_predictor.predict(data)["predictions"]
    return predictions


predicted = []
actual = []
batch_size = 128

# Send the test images to the endpoint in batches.
# Assumption: the model accepts raw batches shaped like x_test; if the training
# script rescales or reshapes its inputs, apply the same preprocessing here.
for start in range(0, len(x_test), batch_size):
    batch_images = x_test[start:start + batch_size]
    batch_labels = y_test[start:start + batch_size]
    for i, prediction in enumerate(predict(batch_images)):
        predicted.append(np.argmax(prediction))
        actual.append(batch_labels[i])
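
The prediction loop relies on the endpoint returning a dictionary whose `predictions` entry holds one list of class scores per input image; the predicted digit is the index of the largest score. The following standalone sketch shows that decoding step with a mocked response. The scores are invented for illustration, and three classes are used instead of ten to keep it short.

```python
def decode_predictions(response):
    """Map each row of class scores to the index of its highest score."""
    labels = []
    for scores in response["predictions"]:
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Mocked endpoint response; a real one would come from the predictor.
mock_response = {
    "predictions": [
        [0.1, 0.7, 0.2],
        [0.8, 0.1, 0.1],
        [0.2, 0.3, 0.5],
    ]
}

print(decode_predictions(mock_response))  # [1, 0, 2]
```

This is the same operation `np.argmax` performs on each prediction in the notebook, written out in plain Python.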

Use the predictions to calculate model accuracy and create a confusion matrix.

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_pred=predicted, y_true=actual)
display("Average accuracy: {}%".format(round(accuracy * 100, 2)))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_pred=predicted, y_true=actual)
cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
sn.set(rc={"figure.figsize": (11.7, 8.27)})
sn.set(font_scale=1.4)  # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 10})  # font size
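
Dividing each row of the confusion matrix by its row sum turns raw counts into per-class fractions, so the diagonal shows the share of each true digit that was classified correctly. The following is a small pure-Python sketch of the same normalization, using invented counts for three classes; the helper name `normalize_rows` is hypothetical.

```python
def normalize_rows(matrix):
    """Divide every row by its sum so each row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against classes with no samples to avoid division by zero.
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Invented counts: rows are true classes, columns are predicted classes.
counts = [
    [8, 1, 1],
    [0, 5, 5],
    [2, 0, 18],
]

for row in normalize_rows(counts):
    print(row)
```

In this toy example the second class is recognized only half of the time, which would show up as a pale diagonal cell in the heatmap.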

### Clean-up - Delete the endpoint

Remember to delete the endpoint to avoid unnecessary charges from AWS.

In [None]:
tf_predictor.delete_endpoint()