# Starter Code for Submitting Jobs with AWS SageMaker

**This code is adapted from** [AWS lab code](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/tensorflow_script_mode_training_and_serving.ipynb)


Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The SageMaker Python SDK handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and obtain the trained model.

Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the MNIST dataset, and show how easily you can train a model on SageMaker using TensorFlow 2.0 scripts with the SageMaker Python SDK.

First, let's import the necessary packages and set up the environment.

In [1]:
import os
import sagemaker
from sagemaker import get_execution_role
import boto3
from sagemaker.tensorflow import TensorFlow

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name
print(role, region)

arn:aws:iam::986428434207:role/analysis-1594681887916-sagemaker-notebook-role ap-southeast-2


## Training Data
The MNIST dataset has been loaded to the public S3 buckets sagemaker-sample-data-<REGION> under the prefix tensorflow/mnist. There are four .npy files under this prefix:
    
- train_data.npy
- eval_data.npy
- train_labels.npy
- eval_labels.npy
    
You can download this data and upload it to the folder `../Data/`, or run the following code to download the data automatically.

**This is an optional step**

The following code downloads the MNIST data from the public S3 bucket to the local SageMaker instance.
(You could train the model directly from this bucket, but to illustrate how to work with your own data we first copy it to the local instance and then upload it to our own S3 bucket.)

In [2]:
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_data.npy ../Data/train_data.npy
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_labels.npy ../Data/train_labels.npy
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/eval_data.npy ../Data/eval_data.npy
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/eval_labels.npy ../Data/eval_labels.npy

download: s3://sagemaker-sample-data-ap-southeast-2/tensorflow/mnist/train_data.npy to ../Data/train_data.npy
download: s3://sagemaker-sample-data-ap-southeast-2/tensorflow/mnist/train_labels.npy to ../Data/train_labels.npy
download: s3://sagemaker-sample-data-ap-southeast-2/tensorflow/mnist/eval_data.npy to ../Data/eval_data.npy
download: s3://sagemaker-sample-data-ap-southeast-2/tensorflow/mnist/eval_labels.npy to ../Data/eval_labels.npy


Script mode requires your data to be in an S3 bucket, so let's copy the downloaded data to S3.

In the RaaS environment you can only write to a folder named with your student ID in a specific bucket. The bucket you can use is: `'sagemaker-ap-southeast-2-986428434207'`

In [3]:
bucket = 'sagemaker-ap-southeast-2-986428434207'
prefix = 's9999999'     #THIS SHOULD BE YOUR STUDENT NUMBER

def write_to_s3(file, bucket, key):
    # Upload a local file to s3://<bucket>/<key>; `key` is the full object key, not just the prefix.
    return boto3.Session(region_name=region).resource('s3').Bucket(bucket).Object(key).upload_file(file)

write_to_s3('../Data/train_data.npy', bucket, f'{prefix}/mnist_data/train_data.npy')
write_to_s3('../Data/train_labels.npy', bucket, f'{prefix}/mnist_data/train_labels.npy')
write_to_s3('../Data/eval_data.npy', bucket, f'{prefix}/mnist_data/eval_data.npy')
write_to_s3('../Data/eval_labels.npy', bucket, f'{prefix}/mnist_data/eval_labels.npy')

training_input_path = os.path.join("s3://", bucket, prefix, "mnist_data")
print(training_input_path)

s3://sagemaker-ap-southeast-2-986428434207/s9999999/mnist_data
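
The cell above builds the S3 URI with `os.path.join`, which produces the expected result on the Linux notebook instance but would insert backslashes on Windows. A minimal sketch of a platform-independent alternative follows; `s3_key` and `s3_uri` are hypothetical helper names introduced here for illustration and are not part of boto3 or the SageMaker SDK.

```python
def s3_key(*parts):
    """Join path parts into an S3 object key using forward slashes only."""
    # Strip stray slashes so that 'a/' + '/b' does not become 'a//b'.
    cleaned = [p.strip('/') for p in parts if p and p.strip('/')]
    return '/'.join(cleaned)


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and optional key parts."""
    key = s3_key(*parts)
    return f's3://{bucket}/{key}' if key else f's3://{bucket}'


# Values mirroring the cell above (the student number is a placeholder).
print(s3_uri('sagemaker-ap-southeast-2-986428434207', 's9999999', 'mnist_data'))
```

The same `s3_key` helper could be used to build the object keys passed to `write_to_s3`, so keys and URIs are assembled in one consistent way.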


## TensorFlow training script
This tutorial's training script was adapted from TensorFlow's official CNN MNIST example. We have modified it to handle the model_dir parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable SM_MODEL_DIR, which always points to /opt/ml/model. This is critical because SageMaker uploads all the model artefacts in this folder to S3 at the end of training.

The script should be in `mnist_2.py`.
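
To make the argument handling concrete, here is a minimal sketch of what an argument-parsing function in a script-mode training script might look like. It is not the actual contents of `mnist_2.py`: the `--epochs` and `--batch-size` hyperparameters and their defaults are assumptions chosen for illustration, and only `model_dir` and `SM_MODEL_DIR` come from the description above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a script-mode training job might receive."""
    parser = argparse.ArgumentParser()
    # model_dir is passed in by SageMaker as a script argument (an S3 path).
    parser.add_argument('--model_dir', type=str, default=None)
    # Local directory whose contents SageMaker uploads to S3 after training.
    parser.add_argument('--sm-model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # Hypothetical hyperparameters, for illustration only.
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--batch-size', type=int, default=64)
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the command line of a training job.
args = parse_args(['--model_dir', 's3://example-bucket/job/model', '--epochs', '3'])
print(args.model_dir, args.sm_model_dir, args.epochs)
```

Reading `SM_MODEL_DIR` through a default value keeps the script runnable outside SageMaker, where the variable is not set.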

## Create a training job using the TensorFlow estimator
The sagemaker.tensorflow.TensorFlow estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Let's call out a couple of important parameters here:

py_version is set to 'py3' to indicate that we are using script mode since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting py_version to 'py2' and script_mode to True.

<!-- distributions is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with Horovod. You can find the full documentation on how to configure distributions here. -->

In [4]:
mnist_estimator2 = TensorFlow(entry_point='mnist_2.py',
                             role=role,
                             train_instance_count=2,
                             train_instance_type='ml.p2.xlarge',
                             framework_version='2.1.0',
                             py_version='py3',
                             # distributions={'parameter_server': {'enabled': True}},
                             output_path='s3://sagemaker-ap-southeast-2-986428434207',
                             base_job_name='s9999999')

## Calling fit
To start a training job, we call estimator.fit(training_data_uri).

An S3 location is used here as the input. fit creates a default channel named 'training', which points to this S3 location. In the training script we can then access the training data from the location stored in SM_CHANNEL_TRAINING. fit accepts a couple of other types of input as well. See the API documentation for details.
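
As a rough illustration of how a training script can locate the channel data, the sketch below resolves the channel directory from the environment. The helper names `channel_dir` and `list_npy_files` are hypothetical, and the fallback path is an assumption based on the `/opt/ml/input/data/training` directory reported in the training log; it only matters when the script runs outside SageMaker.

```python
import os


def channel_dir(channel='training', environ=None):
    """Return the local directory SageMaker mounted for an input channel."""
    environ = os.environ if environ is None else environ
    # Each channel is exposed as SM_CHANNEL_<channel name in upper case>.
    var = 'SM_CHANNEL_' + channel.upper()
    # Fall back to the conventional container path when the variable is absent.
    return environ.get(var, os.path.join('/opt/ml/input/data', channel))


def list_npy_files(directory):
    """List the .npy files in a directory, sorted by name."""
    if not os.path.isdir(directory):
        return []
    return sorted(f for f in os.listdir(directory) if f.endswith('.npy'))


train_dir = channel_dir('training')
print(train_dir, list_npy_files(train_dir))
```

Passing the environment as a parameter makes the lookup easy to exercise locally with a plain dictionary.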

When training starts, the TensorFlow container executes mnist_2.py, passing hyperparameters and model_dir from the estimator as script arguments. 

When training is complete, the training job will upload the saved model for TensorFlow serving.

In [5]:
mnist_estimator2.fit(training_input_path)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-07-14 02:58:07 Starting - Starting the training job...
2020-07-14 02:58:10 Starting - Launching requested ML instances......
2020-07-14 02:59:18 Starting - Preparing the instances for training.........
2020-07-14 03:00:45 Downloading - Downloading input data...
2020-07-14 03:01:38 Training - Downloading the training image...
2020-07-14 03:02:02 Training - Training image download completed. Training in progress..
2020-07-14 03:02:04,061 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-07-14 03:02:05,115 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-2",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1",
        "algo-2"
    ],
    "hyperparameters": {
        "model_dir": "s3://sag

## Retrieving the model

The fit function trains the model and saves the output to an S3 location. Let's retrieve the trained model to the local instance.

In [6]:
import os.path
trained_model_output_path = os.path.dirname(mnist_estimator2.model_dir)
trained_model_output_path = os.path.join(trained_model_output_path, "output/")
print(trained_model_output_path)

s3://sagemaker-ap-southeast-2-986428434207/s9999999-2020-07-14-02-58-07-503/output/


In [7]:
!aws s3 ls $trained_model_output_path
output_file_path = trained_model_output_path + 'model.tar.gz'
!aws s3 cp $output_file_path '../Outputs/model.tar.gz'

2020-07-14 03:02:28   16255858 model.tar.gz
download: s3://sagemaker-ap-southeast-2-986428434207/s9999999-2020-07-14-02-58-07-503/output/model.tar.gz to ../Outputs/model.tar.gz


The downloaded archive needs to be extracted.

In [8]:
!tar -zxvf ../Outputs/model.tar.gz -C ../Outputs/

my_model.h5
my_model.h5
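
If you prefer to stay in Python rather than shelling out to `tar`, the standard-library `tarfile` module can do the same extraction. This is a minimal sketch; `extract_model_archive` is a hypothetical helper name, and the path check is an extra precaution added here, not something the original cell does.

```python
import os
import tarfile


def extract_model_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, 'r:gz') as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse entries that would be written outside dest_dir.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f'unsafe path in archive: {member.name}')
        tar.extractall(dest_dir, members=members)
    return [m.name for m in members]


# Only runs when the archive from the previous cell is present.
if os.path.exists('../Outputs/model.tar.gz'):
    print(extract_model_archive('../Outputs/model.tar.gz', '../Outputs/'))
```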


## Evaluating

Now let's evaluate the model using the test data we have.

In [10]:
from mnist_2 import _load_testing_data

x_test, y_test = _load_testing_data('../Data/')

In [11]:
from mnist_2 import evalaute_model

evalaute_model(x_test, y_test, '../Outputs/my_model.h5')

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 1024)              803840    
_________________________________________________________________
dropout (Dropout)            (None, 1024)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                10250     
Total params: 814,090
Trainable params: 814,090
Non-trainable params: 0
_________________________________________________________________


<tensorflow.python.keras.engine.sequential.Sequential at 0x7ffa5d7f45c0>

## Deploying the model

**Not Tested**

In [None]:
predictor2 = mnist_estimator2.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

In [None]:
import numpy as np

train_data = np.load('../Data/train_data.npy')
train_labels = np.load('../Data/train_labels.npy')

In [None]:
predictions2 = predictor2.predict(train_data[:50])
for i in range(0, 50):
    # The model's last layer has 10 outputs, so take the index of the largest score as the predicted class.
    prediction = np.argmax(predictions2['predictions'][i])
    label = train_labels[i]
    print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))
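
A single accuracy figure is often easier to read than one line per sample. The sketch below is untested against a live endpoint, like the rest of this section. It assumes the endpoint returns a dict with a `'predictions'` list holding one list of class scores per sample (the model summary suggests 10 scores each); `predicted_class` and `accuracy` are hypothetical helper names, and the response shown is made up.

```python
def predicted_class(scores):
    """Return the index of the largest score (a pure-Python argmax)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(response, labels):
    """Fraction of predictions whose highest-scoring class equals the label."""
    predictions = response['predictions']
    hits = sum(1 for scores, label in zip(predictions, labels)
               if predicted_class(scores) == int(label))
    return hits / len(predictions)


# Made-up response with three classes, for illustration only.
fake_response = {'predictions': [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]}
print(accuracy(fake_response, [1, 2]))
```

With a real endpoint, the response from `predictor2.predict(...)` and the matching slice of `train_labels` would take the place of the made-up values.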

In [None]:
sagemaker.Session().delete_endpoint(predictor2.endpoint)