# Let's Train Our Model

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Specifying input Dataset](#Specify-Asset-locations-to-prepare-for-training)
4. [Training](#Set-Hyperparameters-and-Train-Model)

## Introduction

Object detection is the process of identifying and localizing objects in an image. A typical object detection solution takes an image as input and returns a bounding box around each object of interest, along with a label for what the box contains. Before we have this solution, we need to process a training dataset, create and set up a training job so that the algorithm can learn from the dataset, and then host the trained model as an endpoint to which we can send query images.

This notebook focuses on using the built-in SageMaker Single Shot MultiBox Detector ([SSD](https://arxiv.org/abs/1512.02325)) object detection algorithm to train a model on your custom dataset. For dataset preparation or using the model for inference, please see the other scripts in [this folder](./).

## Setup

To train the object detection algorithm on Amazon SageMaker, we need to set up and authenticate the use of AWS services. To begin with, we need an AWS account role with SageMaker access. This role gives SageMaker access to your data in S3. In this example, we will use the same role that was used to start this SageMaker notebook.

In [1]:
%%time
import sagemaker
import boto3
from sagemaker import get_execution_role
from time import gmtime, strftime

role = get_execution_role()
sess = sagemaker.Session()

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name

CPU times: user 743 ms, sys: 63.8 ms, total: 807 ms
Wall time: 880 ms


We also need the S3 bucket that holds the pre-training artifacts and stores the final model.

In [2]:
bucket = 'cvml-sagemaker-repo'
prefix = 'wakeboarder-detection'
TMP_FOLDER_NAME = 'tmp' # - Reference temp directory for model artifacts

## Load the base training model

Because we are using transfer learning, we are relying ...

In [3]:
#set a path to hold the checkpoint model artifacts
s3_checkpoint_path = 's3://{}/checkpoint/'.format(bucket)

#Download the base model and extract locally
!mkdir $TMP_FOLDER_NAME/checkpoint
!wget -O $TMP_FOLDER_NAME/ssd_resnet50.tar.gz http://download.tensorflow.org/models/object_detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
!tar -vxzf $TMP_FOLDER_NAME/ssd_resnet50.tar.gz --strip-components 1 --directory $TMP_FOLDER_NAME/checkpoint

#Upload checkpoint files to S3
!aws s3 cp $TMP_FOLDER_NAME/checkpoint $s3_checkpoint_path --recursive

mkdir: cannot create directory ‘tmp/checkpoint’: File exists
--2020-03-18 18:31:08--  http://download.tensorflow.org/models/object_detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.13.240, 2607:f8b0:4004:807::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.13.240|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 366947246 (350M) [application/x-tar]
Saving to: ‘tmp/ssd_resnet50.tar.gz’


2020-03-18 18:31:15 (53.6 MB/s) - ‘tmp/ssd_resnet50.tar.gz’ saved [366947246/366947246]

ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/model.ckpt.meta
ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/checkpoint
ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/frozen_inference_graph.pb
ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/saved_model/
ssd_r
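
The `tar --strip-components 1` step above can also be done from Python. This is a minimal sketch rather than the notebook's own method: the helper name `extract_stripped` is made up for illustration, and it uses only the standard-library `tarfile` module. It also creates the target folder without failing when it already exists, which is the case behind the `mkdir` message in the output above.

```python
import shutil
import tarfile
from pathlib import Path


def extract_stripped(archive_path, dest_dir, strip=1):
    """Extract a .tar.gz archive, dropping the first `strip` path components.

    Hypothetical helper: mirrors `tar -xzf ... --strip-components 1 --directory ...`.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts[strip:]
            if not parts:
                continue  # this is the top-level folder itself
            target = dest.joinpath(*parts).resolve()
            if dest not in target.parents:
                continue  # skip entries that would land outside dest
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
            elif member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)  # stream, do not load into memory
                extracted.append(target)
    return extracted
```

Calling `extract_stripped('tmp/ssd_resnet50.tar.gz', 'tmp/checkpoint')` should leave the checkpoint files directly under `tmp/checkpoint`, ready for the `aws s3 cp` upload.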

## Create/Load our custom (TensorFlow) Container

Commentary here about the purpose of a custom container over inherent Sagemaker algorithm.

__Note: Cell execution may take a few minutes__

In [9]:
%%sh

# The name of our algorithm
algorithm_name=tensorflow-detection
account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)

path_to_container="/home/ec2-user/SageMaker/amazon-sagemaker-aws-greengrass-custom-object-detection-model/container"
echo ${path_to_container}
chmod +x ${path_to_container}/container/resources/train
chmod +x ${path_to_container}/container/resources/test


fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

echo ${fullname}
echo LOGGING INTO AMAZON ECR...

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)



echo "BUILDING IMAGE WITH NAME ${algorithm_name}"
cd ${path_to_container}/container/
docker build  --no-cache -t ${algorithm_name} -f Dockerfile .
docker tag ${algorithm_name} ${fullname}

echo BUILD COMPLETE
echo "PUSHING IMAGE TO ${fullname}"
docker push ${fullname}

/home/ec2-user/SageMaker/amazon-sagemaker-aws-greengrass-custom-object-detection-model/container
745043861688.dkr.ecr.us-east-1.amazonaws.com/tensorflow-detection:latest
LOGGING INTO AMAZON ECR...
Login Succeeded
BUILDING IMAGE WITH NAME tensorflow-detection
Sending build context to Docker daemon  22.85MB
Step 1/26 : FROM tensorflow/tensorflow:1.14.0-gpu-py3
1.14.0-gpu-py3: Pulling from tensorflow/tensorflow
6abc03819f3e: Pulling fs layer
05731e63f211: Pulling fs layer
0bd67c50d6be: Pulling fs layer
d5c73556cc1e: Pulling fs layer
e059dd98ac7c: Pulling fs layer
e4732fdd9b39: Pulling fs layer
cbeb255d6ab1: Pulling fs layer
0809e577f6d6: Pulling fs layer
421e23cecfe8: Pulling fs layer
a5abf0996067: Pulling fs layer
d718e4299c08: Pulling fs layer
f401bdaa92ad: Pulling fs layer
6669e38ab1ba: Pulling fs layer
5b6ac7f35d3d: Pulling fs layer
d5c73556cc1e: Waiting
e059dd98ac7c: Waiting
e4732fdd9b39: Waiting
cbeb255d6ab1: Waiting
0809e577f6d6: Waiting
421e23cecfe8: Waiting
a5abf0996067: Waitin

https://docs.docker.com/engine/reference/commandline/login/#credentials-store
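
The image name pushed above follows the pattern used for `fullname` in the shell cell. As a small sketch (the function name `ecr_image_uri` is hypothetical, and it assumes the standard AWS partition, where the registry domain ends in `amazonaws.com`), the same URI can be assembled in Python:

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build an ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>.

    Assumes the standard AWS partition; other partitions use a different domain suffix.
    """
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Values taken from the build output above.
print(ecr_image_uri("745043861688", "us-east-1", "tensorflow-detection"))
```

This is the same string the notebook later stores in `container` when it points the estimator at the custom image.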



## Specify Asset locations to prepare for training

This notebook assumes you have already prepared two [Augmented Manifest Files](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html) as training and validation input data for the object detection model.

There are several advantages to using **augmented manifest files** for your training input:

* No format conversion is required if you are using SageMaker Ground Truth to generate the data labels.
* Unlike the traditional approach of providing paths to the input images separately from their labels, an augmented manifest file combines both into one entry per input image, which reduces the complexity of the algorithm code that matches each image with its labels. (Read this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-train-models-using-datasets-labeled-by-amazon-sagemaker-ground-truth/) for more explanation.)
* When splitting your dataset for train/validation/test, you don't need to rearrange and re-upload image files to different S3 prefixes for train vs. validation. Once you upload your image files to S3, you never need to move them again. You can just place pointers to these images in your augmented manifest files for training and validation. More on the train/validation data split later in this post.
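
To make the last two points concrete, here is a rough sketch of what augmented manifest entries look like and how a train/validation split can be done by moving pointers only. An augmented manifest is a JSON Lines file with one object per image and a `source-ref` key holding the image's S3 URI. The label attribute name `bounding-box`, the bucket `example-bucket`, and the helper `split_manifest` are assumptions made for illustration; your Ground Truth job will use its own attribute name.

```python
import json
import random

# Ten illustrative entries; bucket, keys and box values are made up.
manifest_text = "\n".join(
    json.dumps({
        "source-ref": f"s3://example-bucket/images/img_{i}.jpg",
        "bounding-box": {  # assumed label attribute name
            "image_size": [{"width": 640, "height": 480, "depth": 3}],
            "annotations": [
                {"class_id": 0, "left": 10 * i, "top": 20, "width": 50, "height": 60}
            ],
        },
    })
    for i in range(10)
)


def split_manifest(text, validation_fraction=0.2, seed=0):
    """Shuffle manifest entries and split them; images in S3 are never moved."""
    entries = [json.loads(line) for line in text.splitlines() if line.strip()]
    random.Random(seed).shuffle(entries)  # seeded so the split is repeatable
    n_val = max(1, int(len(entries) * validation_fraction))
    return entries[n_val:], entries[:n_val]


train, validation = split_manifest(manifest_text)
train_manifest = "\n".join(json.dumps(entry) for entry in train)
validation_manifest = "\n".join(json.dumps(entry) for entry in validation)
print(len(train), "training entries,", len(validation), "validation entries")
```

Only the two small manifest files differ between the splits; every entry still points at the same uploaded image.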



In [3]:
# Here we specify the locations from all the prior activities as we ready the assets to serve the model training.
container = '{}.dkr.ecr.{}.amazonaws.com/tensorflow-detection:latest'.format(account, region)
s3_train_data = f's3://{bucket}/tfrecords/train.records'
s3_validation_data = f's3://{bucket}/tfrecords/validation.records'
s3_checkpoint_data = f's3://{bucket}/checkpoint/'
s3_label_data = f's3://{bucket}/tfrecords/label_map.pbtxt'
s3_output_location = f's3://{bucket}/output-model'


## Create Data Channels

In [4]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='application/x-image', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', content_type='application/x-image', s3_data_type='S3Prefix')
label_data = sagemaker.session.s3_input(s3_label_data, distribution='FullyReplicated', content_type='application/json', s3_data_type='S3Prefix')
checkpoint_data = sagemaker.session.s3_input(s3_checkpoint_data, distribution='FullyReplicated', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data, 'label': label_data, 'checkpoint': checkpoint_data}
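
With `input_mode='File'`, SageMaker copies each channel into the training container under `/opt/ml/input/data/<channel name>`. The sketch below shows that mapping for the four channels defined above. The second helper rests on an assumption about this custom container: that relative hyperparameter paths such as `train/train.records`, used later in this notebook, are resolved against that same root. Both helper names are hypothetical.

```python
import posixpath

# Root folder where SageMaker places File-mode input channels in the container.
INPUT_ROOT = "/opt/ml/input/data"


def channel_dir(channel_name):
    """Local folder inside the training container for a given channel."""
    return posixpath.join(INPUT_ROOT, channel_name)


def resolve_hyperparameter_path(relative_path):
    """ASSUMPTION: the container resolves relative paths against INPUT_ROOT."""
    return posixpath.join(INPUT_ROOT, relative_path)


for name in ["train", "validation", "label", "checkpoint"]:
    print(name, "->", channel_dir(name))

print(resolve_hyperparameter_path("train/train.records"))
```

Read this way, the dictionary keys in `data_channels` decide the folder names, which is why the path-style hyperparameters start with the channel name.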

## Set Hyperparameters and Train Model

The object detection algorithm at its core is the [Single-Shot Multi-Box detection algorithm (SSD)](https://arxiv.org/abs/1512.02325). This algorithm uses a `base_network`, which is typically a [VGG](https://arxiv.org/abs/1409.1556) or a [ResNet](https://arxiv.org/abs/1512.03385). ResNet is typically faster, so for edge inference I'd recommend using it as the base network. The Amazon SageMaker object detection algorithm currently supports VGG-16 and ResNet-50. It also has many hyperparameters that help configure the training job. The next step in our training is to set up these hyperparameters and data channels for training the model. See the SageMaker Object Detection [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html) for more details on the hyperparameters.

To figure out which values work best for your data, run a hyperparameter tuning job. There are some example notebooks at [https://github.com/awslabs/amazon-sagemaker-examples](https://github.com/awslabs/amazon-sagemaker-examples) that you can use for reference.

In [5]:
# This is where transfer learning happens. We use the pre-trained model and nuke the output layer by specifying
# the num_classes value. You can also run a hyperparameter tuning job to figure out which values work the best. 
hyperparameters = {
            "model.ssd.num_classes": 1,
            "base_config_name": "ssd_resnet_50_fpn_coco.config",
            "train_input_path": "train/train.records",
            "eval_input_path": "validation/validation.records",
            "train_input_config.label_map_path": "label/label_map.pbtxt",
            "label_map_path": "label/label_map.pbtxt",
            "train_config.fine_tune_checkpoint": "checkpoint/model.ckpt",
            "eval_config.num_examples": 100,
            "train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.learning_rate_base": 0.02,
            "train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.total_steps": 1000,
            "train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.warmup_learning_rate": 0.003,
            "train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.warmup_steps": 100,
            "momentum_optimizer_value": 0.9,
            "train_config.batch_size": 8,
            "train_config.batch_queue_capacity": 64,
            "train_config.num_batch_queue_threads": 4,
            "train_config.prefetch_queue_capacity": 32,
            "train_config.num_steps": 1000
        }
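
The four `cosine_decay_learning_rate` values above describe a learning-rate schedule. As I read those fields, the rate climbs linearly from `warmup_learning_rate` to `learning_rate_base` over `warmup_steps`, then follows a half cosine down to zero at `total_steps`. The sketch below implements that reading in plain Python. It is an approximation for building intuition, not the TensorFlow Object Detection code itself, and the function name is chosen for illustration.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base=0.02, total_steps=1000,
                             warmup_learning_rate=0.003, warmup_steps=100):
    """Linear warmup followed by cosine decay (defaults match the cell above)."""
    if step < warmup_steps:
        # Straight line from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Fraction of the decay phase completed, capped at 1.0.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, round(cosine_decay_with_warmup(step), 5))
```

Note that `total_steps` matches `train_config.num_steps` in the cell above, so the rate reaches zero exactly when training ends.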

In [7]:
# DEFINE METRICS
metric_definitions=[{'Name': 'validation-map', 'Regex': 'mAP@.50IOU = ([0-9\\.]+)'},
                   {'Name': 'loss', 'Regex': ' loss = ([0-9\\.]+)'},
                   {'Name': 'training-steps', 'Regex': ' step = ([0-9\\.]+)'}]

# CREATE MODEL
od_model = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.p3.2xlarge',
                                    train_volume_size = 50,
                                    train_max_run = 3600,
                                    input_mode= 'File',
                                    output_path=s3_output_location,
                                    hyperparameters=hyperparameters,
                                    metric_definitions=metric_definitions,
                                    sagemaker_session=sess)
# TRAIN MODEL
od_model.fit(inputs=data_channels)

2020-03-18 19:04:18 Starting - Starting the training job...
2020-03-18 19:04:19 Starting - Launching requested ML instances......
2020-03-18 19:05:22 Starting - Preparing the instances for training......
2020-03-18 19:06:43 Downloading - Downloading input data
2020-03-18 19:06:43 Training - Downloading the training image.........
2020-03-18 19:08:18 Training - Training image download completed. Training in progress...
EXTRACTED CHECKPOINT FILES: ['saved_model', 'frozen_inference_graph.pb', 'pipeline.config', 'model.ckpt.meta', 'checkpoint', 'model.ckpt.data-00000-of-00001', 'model.ckpt.index']
USING BASE CONFIG: /opt/ml/code/tensorflow-models/research/object_detection/samples/configs/ssd_resnet_50_fpn_coco.config
LOADED TRAINING PARAMETERS: {'model.ssd.num_classes': 1, 'train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.total_steps': 1000, 'train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.learning_r

W0318 19:08:40.872127 140011957266240 deprecation_wrapper.py:119] From /opt/ml/code/tensorflow-models/research/object_detection/model_lib.py:515: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
I0318 19:08:41.566937 140011957266240 estimator.py:1147] Done calling model_fn.
I0318 19:08:41.568591 140011957266240 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0318 19:08:44.669642 140011957266240 monitored_session.py:240] Graph was finalized.
2020-03-18 19:08:44.669988: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-18 19:08:44.678221: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-03-18 19:08:45.096637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had ne

I0318 19:14:36.163409 140011957266240 basic_session_run_hooks.py:692] global_step/sec: 2.86824
I0318 19:14:36.164339 140011957266240 basic_session_run_hooks.py:260] loss = 0.72547203, step = 900 (34.865 sec)
I0318 19:15:10.772154 140011957266240 basic_session_run_hooks.py:606] Saving checkpoints for 1000 into /opt/ml/model/model.ckpt.
I0318 19:15:12.801973 140011957266240 estimator.py:1145] Calling model_fn.
W0318 19:15:17.843768 140011957266240 deprecation.py:323] From /opt/ml/code/tensorflow-models/research/object_detection/eval_util.py:796: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0318 19:15:18.049504 140011957266240 deprecation.py:323] From /opt/ml/code/tensorflow-models/research/object_detection/utils/visualization_utils.py:498: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be re

EXPORTING FROZEN GRAPH
W0318 19:15:39.218902 140119322634048 deprecation_wrapper.py:119] From /opt/ml/code/tensorflow-models/research/slim/nets/inception_resnet_v2.py:373: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.
W0318 19:15:39.230486 140119322634048 deprecation_wrapper.py:119] From /opt/ml/code/tensorflow-models/research/slim/nets/mobilenet/mobilenet.py:397: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.
W0318 19:15:39.243411 140119322634048 deprecation_wrapper.py:119] From /opt/ml/code/tensorflow-models/research/object_detection/export_inference_graph.py:162: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
W0318 19:15:39.244006 140119322634048 deprecation_wrapper.py:119] From /opt/ml/code/tensorflow-models/research/object_detection/export_inference_graph.py:145: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
[3

2020-03-18 19:15:51.816437: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-18 19:15:51.816870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-03-18 19:15:51.816927: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-03-18 19:15:51.816946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-03-18 19:15:51.816979: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-03-18 19:15:51.817014: I tensorflow/stream_executor/platf


2020-03-18 19:15:57 Uploading - Uploading generated training model
2020-03-18 19:17:59 Completed - Training job completed
Training seconds: 692
Billable seconds: 692
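
The `metric_definitions` regexes passed to the estimator are matched against log lines like the ones in the training output, and the first capture group becomes the metric value. Here is a quick local check of that idea. The `loss`/`step` line is copied from the training log above; the mAP line is a hypothetical example, since no evaluation line survives in the output shown, and `extract_metrics` is a helper written just for this check.

```python
import re

# Same patterns as the training cell, written as raw strings.
metric_definitions = [
    {"Name": "validation-map", "Regex": r"mAP@.50IOU = ([0-9\.]+)"},
    {"Name": "loss", "Regex": r" loss = ([0-9\.]+)"},
    {"Name": "training-steps", "Regex": r" step = ([0-9\.]+)"},
]

# Copied from the training log above.
log_line = ("I0318 19:14:36.164339 140011957266240 basic_session_run_hooks.py:260] "
            "loss = 0.72547203, step = 900 (34.865 sec)")
# Hypothetical evaluation line, for illustration only.
eval_line = "DetectionBoxes_Precision/mAP@.50IOU = 0.5"


def extract_metrics(line):
    """Return {metric name: value} for every definition that matches the line."""
    found = {}
    for definition in metric_definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_line))
print(extract_metrics(eval_line))
```

Testing patterns locally like this is cheaper than discovering after a full training run that a metric never showed up.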


## Hyperparameter Optimization

Text here on the intent of optimizing w/ metrics between ranges

In [8]:
hyperparameters = {
            "model.ssd.num_classes": 1,
            "base_config_name": "ssd_resnet_50_fpn_coco.config",
            "train_input_path": "train/train.records",
            "eval_input_path": "validation/validation.records",
            "train_input_config.label_map_path": "label/label_map.pbtxt",
            "label_map_path": "label/label_map.pbtxt",
            "train_config.fine_tune_checkpoint": "checkpoint/model.ckpt",
            "eval_config.num_examples": 100,
            "train_config.batch_size": 8,
            "train_config.batch_queue_capacity": 64,
            "train_config.num_batch_queue_threads": 4,
            "train_config.prefetch_queue_capacity": 32,
            "train_config.num_steps": 1000,
            "train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.total_steps": 1000,
        }

In [9]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# DEFINE METRICS
metric_definitions=[{'Name': 'validation-map', 'Regex': 'mAP@.50IOU = ([0-9\\.]+)'},
                   {'Name': 'loss', 'Regex': ' loss = ([0-9\\.]+)'},
                   {'Name': 'training-steps', 'Regex': ' step = ([0-9\\.]+)'}]

# CREATE MODEL
od_model = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.p3.2xlarge',
                                    train_volume_size = 40,
                                    train_max_run = 3600,
                                    input_mode= 'File',
                                    output_path=s3_output_location,
                                    hyperparameters=hyperparameters,
                                    metric_definitions=metric_definitions,
                                    sagemaker_session=sess)

# SET HYPERPARAMETERS RANGES 
hyperparameter_ranges = {'train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.learning_rate_base': ContinuousParameter(0.01, 0.1),
                         'train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.warmup_learning_rate': ContinuousParameter(0.001, 0.009),
                         'train_config.optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.warmup_steps': IntegerParameter(50, 300),
                         'momentum_optimizer_value': ContinuousParameter(0.3, 0.99)
                        }
# SET OBJECTIVE METRIC
objective_metric_name = 'validation-map'
objective_type = 'Maximize'

# CREATE TUNER
tuner = HyperparameterTuner(od_model,
                            objective_metric_name=objective_metric_name,
                            objective_type=objective_type,
                            metric_definitions=metric_definitions,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=30,
                            max_parallel_jobs=2)

## Launch HPO Jobs

In [10]:
tuner.fit(inputs=data_channels)

__Note:__ I'm going to add more background on why job tuning is needed, along with more detailed instructions. For now, after executing the cell above, head back to the SageMaker console to view *Hyperparameter Tuning Jobs*. This shows the collection of tuning jobs that are running to optimize model accuracy within the hyperparameter ranges given above. Each tuning run is launched as a *training job*, where it can be viewed individually. The best training job is identified for you as each new job reports its results. We'll take the model artifacts from this job (their S3 location) and use them to test inference in the next notebook.
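
If you would rather find the best job from code than from the console, the sketch below pulls it out of the response returned by the SageMaker `DescribeHyperParameterTuningJob` API. The helper `summarize_best_job` and the sample values in `example_response` are made up for illustration; the commented lines at the end show how the real calls would likely look, and they need AWS credentials and a finished tuning job to run.

```python
def summarize_best_job(tuning_description):
    """Extract the best training job from a DescribeHyperParameterTuningJob response."""
    best = tuning_description.get("BestTrainingJob")
    if not best:
        return None  # no training job has reported the objective metric yet
    metric = best["FinalHyperParameterTuningJobObjectiveMetric"]
    return {
        "job_name": best["TrainingJobName"],
        "metric_name": metric["MetricName"],
        "metric_value": metric["Value"],
        "hyperparameters": best.get("TunedHyperParameters", {}),
    }


# Illustrative response fragment; the job name and values are invented.
example_response = {
    "BestTrainingJob": {
        "TrainingJobName": "example-tuning-job-007",
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation-map",
            "Value": 0.5,
        },
        "TunedHyperParameters": {"momentum_optimizer_value": "0.9"},
    }
}
print(summarize_best_job(example_response))

# With real credentials, something along these lines should work:
# import boto3
# sm = boto3.client("sagemaker")
# desc = sm.describe_hyper_parameter_tuning_job(
#     HyperParameterTuningJobName=tuner.latest_tuning_job.name)
# summary = summarize_best_job(desc)
# model_s3 = sm.describe_training_job(
#     TrainingJobName=summary["job_name"])["ModelArtifacts"]["S3ModelArtifacts"]
```

The `model_s3` value is the S3 location of the model artifacts that the next notebook needs.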

## Next step

Once the training job completes, move on to the [next notebook](../validation/4_Test_Tensorflow_Model.ipynb) to run inferences against your model with new images.