# Sweeping TensorFlow Script Mode Containers 
### Featuring Weights and Biases & Amazon ECR
---
#### Script Mode: Training and Deployment
Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) and deploy it as an HTTPS Endpoint. 

#### Amazon ECR: Registering a Container
By Dockerizing our training, we can train our model on any AWS instance in either a single-instance or distributed setting. 



#### Weights and Biases: Sweeping and Monitoring
In addition, this notebook demonstrates how to run hyperparameter sweeps and monitor training in real time with [Weights and Biases](https://wandb.com).


In [1]:
!pip install wandb -q



# Set up the environment

Let's start by setting up the environment:

In [2]:
import os
import wandb
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator

sagemaker_session = sagemaker.Session()
role = get_execution_role()
region = sagemaker_session.boto_session.region_name

### Weights and Biases
Copy and paste your API key from https://app.wandb.ai/authorize

In [3]:
wandb_api_key = '<YOUR_WANDB_API_KEY>'  # paste your own key; never commit a real key
wandb_entity  = 'rosenblatt'
wandb_project = 'satellite-model-and-orientation'

!wandb login $wandb_api_key

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ec2-user/.netrc
[32mSuccessfully logged in to Weights & Biases![0m
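
Pasting the key into a cell makes it easy to leak through version control. As a hedged alternative, the sketch below reads the key from the `WANDB_API_KEY` environment variable (the same variable `wandb_setup.sh` exports inside the container) and falls back to a hidden prompt. The helper name `resolve_wandb_api_key` is hypothetical and not part of wandb.

```python
import os
from getpass import getpass


def resolve_wandb_api_key(environ=os.environ, prompt=getpass):
    """Return the W&B API key without storing it in the notebook.

    `resolve_wandb_api_key` is a hypothetical helper for illustration.
    """
    key = environ.get("WANDB_API_KEY", "").strip()
    if not key:
        # Fall back to a hidden prompt so the key never appears in cell output
        key = prompt("W&B API key: ").strip()
    if not key:
        raise ValueError("No W&B API key provided")
    return key
```

In the notebook this could be used as `wandb_api_key = resolve_wandb_api_key()` before the login command.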


### Training Data

The MNIST dataset has been loaded to the public S3 buckets ``sagemaker-sample-data-<REGION>`` under the prefix ``tensorflow/mnist``. There are four ``.npy`` files under this prefix:
* ``train_data.npy``
* ``eval_data.npy``
* ``train_labels.npy``
* ``eval_labels.npy``

In [19]:
dataset_uri = 's3://ssa-data/dataset'
dataset_channel = sagemaker.session.s3_input(dataset_uri, content_type='image/png')

In [66]:
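import boto3
# Bucket() is only available on the resource API, not on the low-level client
s3 = boto3.resource("s3")
# A bucket name cannot contain slashes, so filter on the key prefix instead
all_objects = s3.Bucket('ssa-data').objects.filter(Prefix='dataset/')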

# Construct a script for distributed training

This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which **can be used for data sharing during distributed training and checkpointing and/or model persistence**. We have also added an argument-parsing function to handle processing training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.

Here is the entire script:

In [5]:
cd dockerfile

/home/ec2-user/SageMaker/satellite-model-and-orientation/ResNet Model/dockerfile


In [6]:
# TensorFlow 2.0 script
!pygmentize 'model.py'

[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mwandb[39;49;00m
[34mimport[39;49;00m [04m[36mpathlib[39;49;00m
[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mcallbacks[39;49;00m
[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mimport[39;49;00m [04m[36mtensorflow[39;49;00m [34mas[39;49;00m [04m[36mtf[39;49;00m
[34mimport[39;49;00m [04m[36mmatplotlib.pyplot[39;49;00m [34mas[39;49;00m [04m[36mplt[39;49;00m


[34mdef[39;49;00m [32mmodel[39;49;00m():
    [33m"""Generate a simple model using the Keras API for Tensorflow"""[39;49;00m
    args, unknown = _parse_args()
    
    config = wandb.config [37m# When Sweeping, wandb.config will be updated every session[39;49;00m
    
    train_ds, train_steps, val_ds, val_steps, labels = _load_dataset(args.train, config[[33m'[39;49;00m[33mbatch_size

# Building a Customized Docker Image
To install specific libraries or run commands before executing your script, you need to write a custom Dockerfile.

### Dockerfile

In [7]:
!pygmentize 'Dockerfile'

# Downloads the TensorFlow library used to run the Python 3 script
FROM tensorflow/tensorflow:2.0.0-gpu-py3

# Contains the common functionality necessary to create a container compatible with SageMaker
RUN pip install sagemaker-containers -q 

# Wandb allows us to customize and centralize logging
RUN pip install wandb -q --upgrade

# Copies the training code inside the container according to the SageMaker Estimator design pattern 
COPY model.py /opt/ml/code/model.py
COPY callbacks.py /opt/ml/code/callbacks.py
COPY wandb_setup.sh /opt/ml/code/wandb_setup.sh

# Set the entry point as the setup script
ENV SAGEMAKER_PROGRAM wandb_setup.sh


### wandb_setup.sh

In [8]:
!pygmentize 'wandb_setup.sh'

#!/bin/bash
 
# Argument options
LONG=api_key:,sweep_id:,config:,entity:,project:

# Parse through arguments
function args()
{
    options=$(getopt --long $LONG --long name: -- "$@")
    [ $? -eq 0 ] || {
        echo "wandb_setup: Incorrect option provided"
        exit 1
    }
    eval set -- "$options"
    while true; do
        case "$1" in
        --api_key)
            shift
            export WANDB_API_KEY="$1"
            ;;
   

Verifying that our Dockerfile works as expected is a two-step process. First, we build the image locally. 

In [9]:
image_name = 'ssa-model'
!docker build -t $image_name .

Sending build context to Docker daemon  28.16kB
Step 1/7 : FROM tensorflow/tensorflow:2.0.0-gpu-py3
2.0.0-gpu-py3: Pulling from tensorflow/tensorflow

02085707: Pulling fs layer 
5509d51d: Pulling fs layer 
9fe70a46: Pulling fs layer 
e1789921: Pulling fs layer 
21d58e5d: Pulling fs layer 
fcda1e6e: Pulling fs layer 
a76e3193: Pulling fs layer 
9f7e28e6: Pulling fs layer 
e7aaea7e: Pulling fs layer 
a82d62e6: Pulling fs layer 
420b0759: Pulling fs layer 
0f532378: Pulling fs layer 
8a6dc949: Pulling fs layer 
1bda3d6d: Pulling fs layer 
0e4900cb: Pulling fs layer 
Digest: sha256:2089bcbf3a7b9e41d7c4be3971874598d04fd0b9190aca924d634053adca41c7

# Create a local training job using the SageMaker estimator

After the docker image is built, it is automatically accessible to the local instance. To verify the job will execute as expected we create a local training job.

In [20]:
test_estimator = Estimator(image_name,
                           role,
                           train_instance_count=1,
                           train_instance_type='local',
                           hyperparameters={'api_key' : wandb_api_key,
                                            'entity'  : wandb_entity,
                                            'project' : wandb_project}
                          )

test_estimator.fit(dataset_channel)

KeyboardInterrupt: 
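
The `hyperparameters` dictionary reaches the entry point as command-line flags, which is why `wandb_setup.sh` parses `--api_key`, `--entity`, and `--project` with `getopt`. The sketch below illustrates that flattening under the assumption that each pair becomes `--key value`. The function name `to_cli_args` is hypothetical, and the real conversion happens inside the SageMaker container tooling.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...]."""
    args = []
    for key, value in hyperparameters.items():
        # Values are stringified because they travel as process arguments
        args.extend(["--{}".format(key), str(value)])
    return args


# Placeholder values, not the ones used in this notebook
example = to_cli_args({"entity": "my-team", "project": "my-project"})
print(example)
```

Any key added to `hyperparameters` therefore also needs a matching entry in the `LONG` option list of the setup script, otherwise `getopt` rejects it.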

# Registering the container with Amazon Elastic Container Registry (ECR)

Once a training job has run successfully on the local instance, we are ready to push the image to Amazon Elastic Container Registry and run a training job remotely. **You may need to adjust your role's permissions. Run the cell below and click the link.**

In [11]:
from IPython.display import Markdown as md
url = 'https://console.aws.amazon.com/iam/home?#/roles/' + role.split('/')[-1]
md('Click this link: {}'.format(url))

Click this link: https://console.aws.amazon.com/iam/home?#/roles/AmazonSageMaker-ExecutionRole-20200214T172599

### Adjusting Permissions
Click on the policy with the corresponding role name, and view the JSON. If the JSON doesn't already contain the statements below, add them to the "Statement" array:
```
{
            "Sid": "ECRPermissions",
            "Effect": "Allow",
            "Action": [
                "ecr:PutLifecyclePolicy",
                "ecr:GetLifecyclePolicyPreview",
                "ecr:GetDownloadUrlForLayer",
                "ecr:ListTagsForResource",
                "ecr:BatchDeleteImage",
                "ecr:UploadLayerPart",
                "ecr:ListImages",
                "ecr:DeleteLifecyclePolicy",
                "ecr:DeleteRepository",
                "ecr:PutImage",
                "ecr:UntagResource",
                "ecr:SetRepositoryPolicy",
                "ecr:BatchGetImage",
                "ecr:CompleteLayerUpload",
                "ecr:DescribeImages",
                "ecr:TagResource",
                "ecr:DescribeRepositories",
                "ecr:StartLifecyclePolicyPreview",
                "ecr:DeleteRepositoryPolicy",
                "ecr:InitiateLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetLifecyclePolicy",
                "ecr:GetRepositoryPolicy",
                "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
        },
        {
            "Sid": "ECRFullAccess",
            "Effect": "Allow",
            "Action": "ecr:*",
            "Resource": "*"
        }
```
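
Note that the second statement (`ecr:*`) already covers every action listed in the first. To check an existing policy before editing it, a rough sketch such as the following can report which required actions no `Allow` statement matches. The helper `missing_actions` is hypothetical, and it deliberately ignores `Deny` statements, `Resource` scoping, and conditions, so treat its answer as a hint rather than a full IAM evaluation.

```python
import fnmatch


def missing_actions(policy_document, required_actions):
    """Return required actions not covered by any Allow statement."""
    allowed_patterns = []
    for statement in policy_document.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        # "Action" may be a single string or a list of strings
        if isinstance(actions, str):
            actions = [actions]
        allowed_patterns.extend(actions)
    # Compare case-insensitively and honour wildcards such as "ecr:*"
    return [
        action for action in required_actions
        if not any(fnmatch.fnmatchcase(action.lower(), pattern.lower())
                   for pattern in allowed_patterns)
    ]
```

Passing the policy JSON loaded with `json.load` and a list such as `["ecr:PutImage", "ecr:InitiateLayerUpload"]` returns an empty list when nothing needs to be added.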

In [12]:
%%sh -s "$image_name"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

export fullname="${account}.dkr.ecr.${region}.amazonaws.com/$1:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "$1" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "$1" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Tag the locally built image with its full ECR name, then push it to ECR.
# The full name is echoed first so it can be copied into the next cell.
BLUE="\033[0;34m"
NC="\033[0m"
echo ""
echo -e "COPY THIS -> ${BLUE}" $fullname " ${NC}"
echo ""

docker tag $1 ${fullname}

docker push ${fullname}

Login Succeeded

COPY THIS ->  751398683966.dkr.ecr.us-east-2.amazonaws.com/mnist-2:latest

The push refers to repository [751398683966.dkr.ecr.us-east-2.amazonaws.com/mnist-2]
f07f612567d7: Preparing
823e33761ca8: Preparing
6d9a87209903: Preparing
6ffb4d81d748: Preparing
d42640b1c1a1: Preparing
dedee7f64028: Preparing
529a3f87b032: Preparing
0c2b1f7aa7ff: Preparing
881a26fc96a8: Preparing
b623593089c7: Preparing
e8bc6712038a: Preparing
9c564e8f33e8: Preparing
75287790905f: Preparing
0bb97e92ee41: Preparing
718bbdc0b45f: Preparing
4a78de7ea906: Preparing
0bfa7a55184c: Preparing
122be11ab4a2: Preparing
7beb13bce073: Preparing
f7eae43028b3: Preparing
6cebf3abed5f: Preparing
dedee7f64028: Waiting
529a3f87b032: Waiting
0c2b1f7aa7ff: Waiting
881a26fc96a8: Waiting
b623593089c7: Waiting
e8bc6712038a: Waiting
9c564e8f33e8: Waiting
75287790905f: Waiting
0bb97e92ee41: Waiting
718bbdc0b45f: Waiting
4a78de7ea906: Waiting
0bfa7a55184c: Waiting
122be11ab4a2: Waiting
7beb13bce073: Waitin

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



In [13]:
image_url = '751398683966.dkr.ecr.us-east-2.amazonaws.com/mnist-2:latest' # <- PASTE HERE
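
Instead of copying the URI by hand, it can be assembled in Python using the same layout as `fullname` in the shell cell. This is a minimal sketch: the function name `ecr_image_uri` is hypothetical, and the account ID and region in the example call are placeholders.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account_id, region, repository, tag)


# Placeholder account ID and region, for illustration only
print(ecr_image_uri("123456789012", "us-west-2", "ssa-model"))
```

In the notebook, the `region` variable defined earlier could be passed in directly, with the account ID taken from the same STS identity call the shell cell uses.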

# Create a remote training job using the SageMaker Estimator 

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location and creating a SageMaker training job. Let's call out a couple of important parameters here:

* `py_version` is set to `'py3'` to indicate that we are using script mode since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.

* `distributions` is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with [Horovod](https://github.com/horovod/horovod). You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training). 

You can also instantiate an estimator to train with a TensorFlow 2.0 script; the only things you need to change are the script name and ``framework_version``. In this notebook we instead pass the ECR image to the generic `Estimator` and run the job on a remote `ml.m4.xlarge` instance.



In [14]:
bucket = 'rosenblatt-tutorial-data' # Replace with your s3 bucket name
prefix = 'sagemaker/tf-image-mnist' # Used as part of the path in the bucket where you store data
s3_output_location = 's3://{}/{}/{}'.format(bucket, prefix, 'mnist-2')

In [15]:
mnist_estimator = Estimator(image_url,
                            role,
                            train_instance_count=1,
                            train_instance_type='ml.m4.xlarge',
                            output_path=s3_output_location,
                            hyperparameters={'api_key' : wandb_api_key}
                           )

## Calling ``fit``

To start a training job, we call `mnist_estimator.fit(training_data_uri)`.

An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a couple other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.
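
The channel name determines both the environment variable and the folder the training script sees. A small sketch of that naming convention follows; the helper `channel_env` is hypothetical and only formats strings, it does not query SageMaker.

```python
def channel_env(channel_name, base_dir="/opt/ml/input/data"):
    """Return the (environment variable, container path) pair for a channel."""
    variable = "SM_CHANNEL_{}".format(channel_name.upper())
    path = "{}/{}".format(base_dir, channel_name)
    return variable, path


print(channel_env("training"))
```

Inside the training script, the data folder for the default channel can then be read with `os.environ["SM_CHANNEL_TRAINING"]`.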

When training starts, the TensorFlow container executes mnist.py, passing `hyperparameters` and `model_dir` from the estimator as script arguments. Because we didn't define either in this example, no hyperparameters are passed, and `model_dir` defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`, so the script execution is as follows:
```bash
python mnist.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>
```
When training is complete, the training job will upload the saved model for TensorFlow serving.

In [17]:
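training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)  # assumed: the public MNIST prefix described under Training Data
mnist_estimator.fit(training_data_uri, job_name='RemoteTest-2')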

2020-03-21 20:41:23 Starting - Starting the training job...
2020-03-21 20:41:24 Starting - Launching requested ML instances...
2020-03-21 20:42:23 Starting - Preparing the instances for training......
2020-03-21 20:43:22 Downloading - Downloading input data...
2020-03-21 20:43:33 Training - Downloading the training image.........
2020-03-21 20:45:18 Training - Training image download completed. Training in progress..
2020-03-21 20:45:19,275 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-03-21 20:45:19,275 sagemaker-containers INFO     Failed to parse hyperparameter api_key value d04fa6489b323f4a650aac3e80f17a194fe363be to Json.
Returning the value itself
2020-03-21 20:45:25,533 sagemaker-containers INFO     Failed to parse hyperparameter api_key value d04fa6489b323f4a650aac3e80f17a194fe363be to Json.
Returning the value itself
2020-03-21 20:45:25,536 sagemaker-containers INFO     No GPUs detected (norma

# Creating a Sweep with Weights and Biases
Here we will perform a hyperparameter Sweep but use a SageMaker distributed training session to speed things up. 

### Setting up the Sweep
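We minimize validation loss rather than maximize accuracy. Accuracy only records whether the top prediction is correct, so optimizing it alone can select a network whose predictions are right but made with low confidence.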
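
To make the point concrete, here is a toy comparison with made-up probabilities (not results from this notebook): two hypothetical models make the same correct decisions, so their accuracy is identical, but cross-entropy loss separates the confident one from the hesitant one.

```python
import math


def accuracy(probs_for_true_class, threshold=0.5):
    """Fraction of binary predictions whose true-class probability exceeds the threshold."""
    return sum(p > threshold for p in probs_for_true_class) / len(probs_for_true_class)


def cross_entropy(probs_for_true_class):
    """Mean negative log-likelihood of the true class."""
    return -sum(math.log(p) for p in probs_for_true_class) / len(probs_for_true_class)


confident = [0.95, 0.90, 0.92]  # made-up probabilities assigned to the correct class
hesitant = [0.55, 0.51, 0.60]   # same decisions, far less certainty

print(accuracy(confident), accuracy(hesitant))
print(cross_entropy(confident), cross_entropy(hesitant))
```

A sweep that ranks runs by accuracy cannot tell these two models apart, while one that ranks by `val_loss` prefers the confident model.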

In [18]:
sweep_config = {
  'name' : 'ChimChimCheree',
  'description' : 'A sweep is as lucky, as lucky can be',
  'method' : 'grid',
  'metric' : {
      'name' : 'val_loss',
      'goal' : 'minimize'
  },
  'parameters': {
        'dropout': {
            'values': [0.1, 0.2, 0.3, 0.4]
        },
        'activation': {
            'values': ['relu', 'tanh']
        },
  }
}

sweep_id = wandb.sweep(sweep_config, entity=wandb_entity, project=wandb_project)

Create sweep with ID: vmsf08wu
Sweep URL: https://app.wandb.ai/intermx/sagemaker-integration/sweeps/vmsf08wu
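
With `'method' : 'grid'`, the sweep tries every combination of the listed values. The sketch below enumerates those combinations locally with `itertools.product` to show how many runs to expect; it only mirrors the `parameters` block above and does not call wandb.

```python
from itertools import product

# Same values as the 'parameters' block of sweep_config
parameters = {
    "dropout": [0.1, 0.2, 0.3, 0.4],
    "activation": ["relu", "tanh"],
}

names = list(parameters)
# One dict per run: the Cartesian product of all value lists
grid = [dict(zip(names, combo)) for combo in product(*(parameters[n] for n in names))]

print(len(grid))
print(grid[0])
```

Four dropout values times two activations gives eight runs, so adding a third parameter with several values multiplies the cost accordingly.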


### Performing the Sweep Locally

In [19]:
mnist_estimator_sweep = Estimator(image_name,
                                  role,
                                  train_instance_count=1,
                                  train_instance_type='local',
                                  hyperparameters={'api_key' : wandb_api_key,
                                                   'sweep_id': sweep_id,
                                                   'entity'  : wandb_entity,
                                                   'project' : wandb_project}
                                  )

mnist_estimator_sweep.fit(training_data_uri)

Creating tmpxrswv92a_algo-1-ou7pc_1 ... done
Attaching to tmpxrswv92a_algo-1-ou7pc_1
algo-1-ou7pc_1  | 2020-03-21 20:51:28,377 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-ou7pc_1  | 2020-03-21 20:51:28,377 sagemaker-containers INFO     Failed to parse hyperparameter api_key value d04fa6489b323f4a650aac3e80f17a194fe363be to Json.
algo-1-ou7pc_1  | Returning the value itself
algo-1-ou7pc_1  | 2020-03-21 20:51:28,377 sagemaker-containers INFO     Failed to parse hyperparameter sweep_id value vmsf08wu to Json.
algo-1-ou7pc_1  | Returning the value itself
algo-1-ou7pc_1  | 2020-03-21 20:51:28,378 sagemaker-containers INFO     Failed to parse hyperparameter entity value intermx to Json.
algo-1-ou7pc_1  | Returning the value itself
algo-1-ou7pc_1  | 2020-03-21 20:51:28,378 sagemaker-containers INFO     Failed to parse hyperparameter project value sagemaker-integration to J

KeyboardInterrupt: 

# Deploy the trained model to an endpoint

The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. We will use the TensorFlow Serving container for the endpoint, because we trained with script mode. This serving container runs an implementation of a web server that is compatible with the SageMaker hosting protocol. The AWS documentation page "Using your own inference code" explains how SageMaker runs inference containers.

Deploy the trained TensorFlow 2.0 model to an endpoint.

In [18]:
predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

-----------!

# Invoke the endpoint

Let's download the training data and use that as input for inference.

In [20]:
import numpy as np

!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_data.npy train_data.npy
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_labels.npy train_labels.npy

train_data = np.load('train_data.npy')
train_labels = np.load('train_labels.npy')

download: s3://sagemaker-sample-data-us-east-2/tensorflow/mnist/train_data.npy to ./train_data.npy
download: s3://sagemaker-sample-data-us-east-2/tensorflow/mnist/train_labels.npy to ./train_labels.npy


The formats of the input and the output data correspond directly to the request and response formats of the `Predict` method in the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest). SageMaker's TensorFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow REST API, including the simplified JSON format, line-delimited JSON objects ("jsons" or "jsonlines"), and CSV data.

In this example we are using a `numpy` array as input, which will be serialized into the simplified JSON format. In addition, TensorFlow Serving can process multiple items at once, as you can see in the following code. You can find the complete documentation on how to make predictions against a TensorFlow Serving SageMaker endpoint [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint).
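
As a rough illustration of the two JSON shapes mentioned above, the sketch below builds a request in the TensorFlow Serving row format and the equivalent simplified JSON body, using a tiny made-up batch rather than MNIST images. The response shown is a toy stand-in, not output from the endpoint.

```python
import json

batch = [[0.0, 0.5], [1.0, 0.25]]  # toy stand-in for two flattened images

# Request body in the TensorFlow Serving REST "row" format
tfs_request = json.dumps({"instances": batch})

# Simplified JSON: the bare batch without the "instances" wrapper
simplified_request = json.dumps(batch)

# A Predict response carries one entry per input under "predictions"
toy_response = json.loads('{"predictions": [[0.1, 0.9], [0.8, 0.2]]}')

print(tfs_request)
print(simplified_request)
print(len(toy_response["predictions"]))
```

Because each input row maps to one entry under `"predictions"`, sending a batch costs a single request while still returning a result per item.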

Examine the prediction result from the TensorFlow 2.0 model.

In [25]:
predictions = predictor.predict(train_data[:50])
for i in range(0, 50):
    prediction = np.argmax(predictions['predictions'][i])
    label = train_labels[i]
    print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))

prediction is 7, label is 7, matched: True
prediction is 3, label is 3, matched: True
prediction is 9, label is 4, matched: False
prediction is 6, label is 6, matched: True
prediction is 1, label is 1, matched: True
prediction is 8, label is 8, matched: True
prediction is 1, label is 1, matched: True
prediction is 0, label is 0, matched: True
prediction is 9, label is 9, matched: True
prediction is 8, label is 8, matched: True
prediction is 0, label is 0, matched: True
prediction is 3, label is 3, matched: True
prediction is 1, label is 1, matched: True
prediction is 3, label is 2, matched: False
prediction is 7, label is 7, matched: True
prediction is 0, label is 0, matched: True
prediction is 2, label is 2, matched: True
prediction is 9, label is 9, matched: True
prediction is 6, label is 6, matched: True
prediction is 0, label is 0, matched: True
prediction is 1, label is 1, matched: True
prediction is 6, label is 6, matched: True
prediction is 7, label is 7, matched: True
predictio
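
Reading fifty lines of output makes it hard to judge overall quality. A minimal sketch for summarizing the match rate from a response of the same shape follows; the scores and labels are made up for illustration, and `match_rate` is a hypothetical helper.

```python
def argmax(values):
    """Index of the largest value (pure-Python stand-in for np.argmax)."""
    return max(range(len(values)), key=lambda i: values[i])


def match_rate(response, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predicted = [argmax(row) for row in response["predictions"]]
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)


toy_response = {"predictions": [[0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]}  # made-up scores
toy_labels = [1, 0, 0]
print(match_rate(toy_response, toy_labels))
```

With the real endpoint, `match_rate(predictions, train_labels[:50])` would condense the loop output above into a single number.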

# Delete the endpoint

Let's delete the endpoint we just created to avoid incurring any extra costs.

In [26]:
sagemaker.Session().delete_endpoint(predictor.endpoint)