# Train and Host a Keras Model on Amazon SageMaker

Amazon SageMaker is a fully-managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK makes it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks, including TensorFlow and Keras.

In this notebook, we train and host a [Keras Sequential model](https://keras.io/getting-started/sequential-model-guide) on SageMaker. The model is a simple multi-layer perceptron, that is, a vanilla neural network (VNN), based on [the Keras examples](https://www.tensorflow.org/tutorials/images/cnn).

## Setup

We first check the directory structure. If a `lost+found` folder is present and is owned by the root user or group, we change its owner and group to `ec2-user`.

In [None]:
%%sh
ls -l

In [None]:
%%sh
sudo chown ec2-user lost+found

In [None]:
%%sh
ls -l

In [None]:
%%sh
sudo chgrp ec2-user lost+found

In [None]:
%%sh
ls -l

Next, we define a few variables that will be needed later. Don't forget to change the kernel to **conda_tensorflow_p36**.

In [None]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

## The MNIST dataset

The [MNIST dataset](https://deepai.org/dataset/mnist) is a low-complexity collection of handwritten digits used to train and test supervised machine learning algorithms. It is often considered the "Hello, World!" of machine learning. The database contains 70,000 28x28 grayscale images of the digits zero through nine, split into a training set of 60,000 images and a test set of 10,000 images. Holding out a separate test set lets us check that a trained model can accurately classify images it has not seen before.

#### Instructions

1. In the Jupyter notebook, download the MNIST data (or load it from `tf.keras.datasets`).
2. Modify the S3 upload step for the new data, then upload the data to S3.
3. In the Python file containing the Keras model:
   - Change the Keras model to the simple multi-layer perceptron model used in Week 1.
   - Modify `train_input_fn` and `eval_input_fn` to use the MNIST data.
   - Modify `_input` to remove the padding, trimming, and flipping effects (for now).

### Prepare the dataset for training

In [None]:
import os
import tensorflow as tf
import keras
import numpy as np
from matplotlib import pyplot

mnist = tf.keras.datasets.mnist # get mnist from keras

(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
# create local directories for the data
os.makedirs("./data", exist_ok=True)  # exist_ok=True means no error if the directory already exists
os.makedirs("./data/training", exist_ok=True)
os.makedirs("./data/test", exist_ok=True)
os.makedirs("./output", exist_ok=True)

In [None]:
# save the training and test data inside the directories created above
np.savez("./data/training/training.npz", image=x_train, label=y_train)
np.savez("./data/test/test.npz", image=x_test, label=y_test)
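
As a quick sanity check, a saved archive can be read back with `np.load`. The sketch below is illustrative only: it uses small random arrays as stand-ins for MNIST and writes to a temporary directory (both are assumptions, so nothing is downloaded or overwritten), and it reuses the `image` and `label` keys from the cell above. It also shows that `np.savez` appends `.npz` when the file name lacks that extension.

```python
import os
import tempfile

import numpy as np

# Stand-in data with MNIST-like shapes (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=(8,), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    # No extension is given, so NumPy writes "<tmp_dir>/sample.npz".
    np.savez(os.path.join(tmp_dir, "sample"), image=images, label=labels)
    saved_files = sorted(os.listdir(tmp_dir))

    # np.load returns a dict-like archive keyed by the keyword names passed to savez.
    with np.load(os.path.join(tmp_dir, "sample.npz")) as archive:
        loaded_images = archive["image"]
        loaded_labels = archive["label"]

print(saved_files)
print(loaded_images.shape, loaded_labels.shape)
```

A training script could read the real archive the same way once it knows the channel directory.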

In [None]:
%%sh
ls -l data # check they have been created

Now we upload the data to Amazon S3.

In [None]:
from sagemaker.s3 import S3Uploader

bucket = sagemaker_session.default_bucket()
dataset_uri = S3Uploader.upload("data", "s3://{}/Wk3Activity6-mnist/data".format(bucket))

display(dataset_uri)

## Train the model

In this activity, we train a VNN on a classification task using the MNIST dataset.

### Run a baseline training job on SageMaker

The SageMaker Python SDK's `sagemaker.tensorflow.TensorFlow` estimator class makes it easy for us to interact with SageMaker. We create one estimator for each training job we run in this example. A couple of parameters worth noting:

* `entry_point`: our training script (adapted from [this Keras example](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py)).
* `train_instance_count`: the number of training instances. Here, we set it to 1 for our baseline training job.

We change these and other parameters from one training job to the next.

For more details about the TensorFlow estimator class, see the [API documentation](https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html).

### Verify the training code

Before running the baseline training job, we first use [the SageMaker Python SDK's Local Mode feature](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode) to check that our code works with SageMaker's TensorFlow environment. Local Mode downloads the [prebuilt Docker image for TensorFlow](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html) and runs a Docker container locally for a training job. In other words, it simulates the SageMaker environment for a quicker development cycle, so we use it here just to test out our code.

We create a TensorFlow estimator and set `train_instance_type` to `'local'` or `'local_gpu'`, depending on our local instance type. This tells the estimator to run our training job locally (as opposed to on SageMaker). We also run our training code for only one epoch, because our intent here is to verify the code, not to train an accurate model.

In [None]:
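import shutil
import subprocess

from sagemaker.tensorflow import TensorFlow

instance_type = "local"

# Switch to the GPU variant only if nvidia-smi exists and runs successfully;
# calling a missing executable would otherwise raise FileNotFoundError.
if shutil.which("nvidia-smi") and subprocess.call("nvidia-smi") == 0:
    instance_type = "local_gpu"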

local_hyperparameters = {"epochs": 1, "batch-size": 64}

estimator = TensorFlow(
    entry_point="cifar10_keras_main.py",
    source_dir="source_dir", # other option is "."
    role=role,
    framework_version="2.1.0", # updated from 1.15.2
    py_version="py3",
    script_mode=True,
    hyperparameters=local_hyperparameters,
    train_instance_count=1,
    train_instance_type=instance_type,
)
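
In script mode, SageMaker passes each entry in `hyperparameters` to the entry point as a command-line argument (for example `--epochs 1 --batch-size 64`). The contents of `cifar10_keras_main.py` are not shown in this notebook, so the following is only a minimal sketch of how such a script might parse them. The function name `parse_hyperparameters` and the default values are assumptions made for illustration.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters that arrive as command-line arguments."""
    parser = argparse.ArgumentParser()
    # argparse exposes "--batch-size" as the attribute "batch_size".
    parser.add_argument("--epochs", type=int, default=1)  # default is an assumption
    parser.add_argument("--batch-size", type=int, default=128)  # default is an assumption
    # Ignore any additional arguments that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--epochs", "1", "--batch-size", "64"])
print(args.epochs, args.batch_size)
```

Using `parse_known_args` rather than `parse_args` keeps the sketch tolerant of arguments it does not know about.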

Once we have our estimator, we call `fit()` to start the training job and pass the data that we saved locally earlier. We pass the inputs as a dictionary, in which each key defines a data channel for training.

In [None]:
import os

data_path = os.path.join(os.getcwd(), "data")

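local_inputs = {
    # channel paths match the directories created under ./data earlier
    "train": "file://{}/training".format(data_path),
    "validation": "file://{}/test".format(data_path),
    # no separate eval split was created above, so the test set is reused here (assumption)
    "eval": "file://{}/test".format(data_path),
}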
estimator.fit(local_inputs)
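
Local Mode reads each `file://` channel from disk, so it can help to confirm that every channel directory exists before calling `fit()`. The helper below is a hypothetical convenience and not part of the SageMaker SDK; the name `missing_channels` and the temporary directory layout are assumptions.

```python
import os
import tempfile
from urllib.parse import urlparse


def missing_channels(inputs):
    """Return the names of file:// channels whose directory does not exist."""
    missing = []
    for name, uri in inputs.items():
        parsed = urlparse(uri)
        if parsed.scheme == "file" and not os.path.isdir(parsed.path):
            missing.append(name)
    return missing


with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "training"))
    example_inputs = {
        "train": "file://{}/training".format(data_dir),
        "eval": "file://{}/eval".format(data_dir),  # deliberately never created
    }
    result = missing_channels(example_inputs)

print(result)  # ['eval']
```

An empty list means every local channel points at an existing directory.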

### Run a baseline training job on SageMaker

Now we run training jobs on SageMaker, starting with our baseline training job.

### Configure metrics

In addition to running the training job, Amazon SageMaker can retrieve training metrics directly from the logs and send them to CloudWatch metrics. Here, we define metrics we would like to observe:

In [None]:
metric_definitions = [
    {"Name": "train:loss", "Regex": ".*loss: ([0-9\\.]+) - accuracy: [0-9\\.]+.*"},
    {"Name": "train:accuracy", "Regex": ".*loss: [0-9\\.]+ - accuracy: ([0-9\\.]+).*"},
    {
        "Name": "validation:accuracy",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: ([0-9\\.]+).*",
    },
    {
        "Name": "validation:loss",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_accuracy: [0-9\\.]+.*",
    },
    {
        "Name": "sec/steps",
        "Regex": ".* - \\d+s (\\d+)[mu]s/step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: [0-9\\.]+",
    },
]
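
To see what these expressions capture, they can be tried against a sample log line with Python's `re` module. This is only a sketch: the log line is a made-up example shaped like Keras progress output, and using `re.search` here is an assumption about matching behavior rather than a description of how SageMaker applies the patterns internally.

```python
import re

# Hypothetical Keras progress line, shaped like the output the regexes target.
log_line = (
    "195/195 [==============================] - 12s 61ms/step - loss: 0.3521 "
    "- accuracy: 0.8987 - val_loss: 0.2101 - val_accuracy: 0.9402"
)

# Same patterns as in metric_definitions, written as raw strings.
patterns = {
    "train:loss": r".*loss: ([0-9\.]+) - accuracy: [0-9\.]+.*",
    "train:accuracy": r".*loss: [0-9\.]+ - accuracy: ([0-9\.]+).*",
    "validation:accuracy": r".*step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: [0-9\.]+ - val_accuracy: ([0-9\.]+).*",
    "validation:loss": r".*step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: ([0-9\.]+) - val_accuracy: [0-9\.]+.*",
    "sec/steps": r".* - \d+s (\d+)[mu]s/step - loss: [0-9\.]+ - accuracy: [0-9\.]+ - val_loss: [0-9\.]+ - val_accuracy: [0-9\.]+",
}

# The single capture group in each pattern is the value reported for that metric.
extracted = {name: re.search(pattern, log_line).group(1) for name, pattern in patterns.items()}

for name, value in extracted.items():
    print(name, value)
```

Each pattern has exactly one capture group, and that group is the number reported for the metric.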

Once again, we create a TensorFlow estimator, with a couple of key modifications from last time:

* `train_instance_type`: the instance type for training. We set this to `ml.p2.xlarge` because we are training on SageMaker now. For a list of available instance types, see [the AWS documentation](https://aws.amazon.com/sagemaker/pricing/instance-types).
* `metric_definitions`: the metrics (defined above) that we want sent to CloudWatch.

In [None]:
from sagemaker.tensorflow import TensorFlow

hyperparameters = {"epochs": 10, "batch-size": 256}
tags = [{"Key": "Project", "Value": "cifar10"}, {"Key": "TensorBoard", "Value": "file"}]

estimator = TensorFlow(
    entry_point="cifar10_keras_main.py",
    source_dir="source_dir",
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    role=role,
    framework_version="1.15.2",
    py_version="py3",
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    base_job_name="cifar10-tf",
    tags=tags,
)

Like before, we call `fit()` to start the SageMaker training job and pass the inputs in a dictionary to define different data channels for training. This time, we use the S3 URI from uploading our data.

In [None]:
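inputs = {
    # S3 prefixes mirror the local ./data/training and ./data/test directories that were uploaded
    "train": "{}/training".format(dataset_uri),
    "validation": "{}/test".format(dataset_uri),
    # no separate eval split was uploaded, so the test set is reused here (assumption)
    "eval": "{}/test".format(dataset_uri),
}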

estimator.fit(inputs)

### View the job training metrics

We can now view the metrics from the training job directly in the SageMaker console.

Log in to the [SageMaker console](https://console.aws.amazon.com/sagemaker/home), choose the latest training job, and scroll down to the Monitor section. Alternatively, the code below uses the region and training job name to generate a URL to CloudWatch metrics.

Using CloudWatch metrics, you can change the period and configure the statistics.

In [None]:
from urllib import parse

from IPython.display import Markdown, display

region = sagemaker_session.boto_region_name
cw_url = parse.urlunparse(
    (
        "https",
        "{}.console.aws.amazon.com".format(region),
        "/cloudwatch/home",
        "",
        "region={}".format(region),
        "metricsV2:namespace=/aws/sagemaker/TrainingJobs;dimensions=TrainingJobName;search={}".format(
            estimator.latest_training_job.name
        ),
    )
)

display(
    Markdown(
        "CloudWatch metrics: [link]({}). After you choose a metric, "
        "change the period to 1 Minute (Graphed Metrics -> Period).".format(cw_url)
    )
)
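
To make the URL structure concrete, the same `urlunparse` call can be run with fixed values. This is a sketch only: the region `us-east-1` and the job name `cifar10-tf-example` are hypothetical placeholders standing in for `sagemaker_session.boto_region_name` and `estimator.latest_training_job.name`.

```python
from urllib import parse

# Hypothetical values standing in for the session region and the training job name.
example_region = "us-east-1"
example_job_name = "cifar10-tf-example"

example_url = parse.urlunparse(
    (
        "https",  # scheme
        "{}.console.aws.amazon.com".format(example_region),  # network location
        "/cloudwatch/home",  # path
        "",  # params (unused)
        "region={}".format(example_region),  # query string, placed after "?"
        "metricsV2:namespace=/aws/sagemaker/TrainingJobs;dimensions=TrainingJobName;search={}".format(
            example_job_name
        ),  # fragment, placed after "#"
    )
)

print(example_url)
```

The region appears twice, once in the host name and once in the query string, while the metric search itself is carried in the fragment.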