# TensorFlow Object Detection API and AWS SageMaker

In this notebook, you will train and evaluate different models using the [TensorFlow Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/) and [AWS SageMaker](https://aws.amazon.com/sagemaker/).

If you ever feel stuck, you can refer to this [tutorial](https://aws.amazon.com/blogs/machine-learning/training-and-deploying-models-using-tensorflow-2-with-the-object-detection-api-on-amazon-sagemaker/).

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported to the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved at a resolution of 640x640.

In [1]:
%%capture
%pip install tensorflow_io sagemaker -U

In [4]:
import os
import sagemaker
from sagemaker.estimator import Estimator
from framework import CustomFramework

Save the IAM role in a variable called `role`. It is needed later when launching the training job.

In [5]:
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::789453636658:role/service-role/AmazonSageMaker-ExecutionRole-20230817T141326


In [6]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
        'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://object-detection-project-1/logs/'
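
Before launching a job, it can help to check that each channel path is a well-formed S3 URI. The helper below is a minimal sketch using only the standard library. `split_s3_uri` is a hypothetical name, not part of the SageMaker SDK, and it does not check that the bucket exists or is readable.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into (bucket, prefix).

    Hypothetical helper for illustration; raises ValueError on other schemes.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Same channel paths as in the cell above.
inputs = {
    "train": "s3://cd2688-object-detection-tf2/train/",
    "val": "s3://cd2688-object-detection-tf2/val/",
}
channels = {name: split_s3_uri(uri) for name, uri in inputs.items()}
print(channels)
```

The same check can be applied to `tensorboard_s3_prefix` to catch a missing `s3://` scheme before the training job starts.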

## Container

To train the model, you will first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clones the TensorFlow models repository
* gets the exporter and training scripts from the repository
* builds the Docker image and pushes it
* prints the container name

In [15]:
%%bash

# clone the repo and get the scripts
git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
cp docker/models/research/object_detection/exporter_main_v2.py source_dir 
cp docker/models/research/object_detection/model_main_tf2.py source_dir

fatal: destination path 'docker/models' already exists and is not an empty directory.
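
The `fatal: destination path ... already exists` message above appears when the cell is rerun after the repository has already been cloned. One way to make the step safe to rerun is to skip the clone when the destination already contains files. The sketch below assumes that approach. `clone_if_missing` is a hypothetical helper, not part of the project scripts.

```python
import os
import subprocess


def clone_if_missing(repo_url, dest, run=subprocess.run):
    """Clone repo_url into dest unless dest already contains files.

    Returns True when a clone was started and False when it was skipped.
    The `run` parameter exists only so the command runner can be swapped
    out (for example in a dry run); it defaults to subprocess.run.
    """
    if os.path.isdir(dest) and os.listdir(dest):
        return False
    run(["git", "clone", repo_url, dest], check=True)
    return True


# Example call, using the repository and destination from the cell above:
# clone_if_missing("https://github.com/tensorflow/models.git", "docker/models")
```

Skipping the clone does not update an existing checkout, so a stale `docker/models` folder would still need to be refreshed by hand.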


In [7]:
# Build and push the docker image. This line can be commented out after it has been run once.
# This will take around 10 mins.
image_name = 'tf2-object-detection'
#!sh ./docker/build_and_push.sh $image_name

To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look it up in the AWS console. The example screenshot below shows three different images pushed to ECR. You should see only one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [8]:
# display the container name
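with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    # strip() drops the trailing newline without cutting off a character
    # when the file does not end with one
    container = f.readline().strip()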

print(container)

789453636658.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230818012136
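
The container name printed above follows the ECR image URI layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a quick sanity check, the URI can be split into its parts and the repository name compared with `image_name`. This is a minimal sketch: `parse_ecr_image` is a hypothetical helper, and the pattern only covers tagged URIs like the one shown here.

```python
import re

# Matches URIs such as <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ECR_IMAGE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image(uri):
    """Return account, region, repository and tag of a tagged ECR image URI."""
    match = ECR_IMAGE.match(uri.strip())
    if match is None:
        raise ValueError(f"not a tagged ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_ecr_image(
    "789453636658.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230818012136"
)
print(parts["repository"], parts["region"], parts["tag"])
```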


## Pre-trained model from model zoo

As is common practice, we are not training from scratch. Instead, we will use a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)
* Faster R-CNN ResNet50 V1 640x640
* EfficientDet D1 640x640
* Faster R-CNN ResNet152 V1 640x640

In the code below, the SSD MobileNet V2 FPNLite 640x640 model is downloaded and extracted. This code should be adjusted if you experiment with other architectures.

In [9]:
%%bash
mkdir /tmp/checkpoint
mkdir source_dir/checkpoint
wget -O /tmp/mobilenet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
tar -zxvf /tmp/mobilenet.tar.gz --strip-components 2 --directory source_dir/checkpoint ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint

mkdir: cannot create directory ‘/tmp/checkpoint’: File exists
mkdir: cannot create directory ‘source_dir/checkpoint’: File exists
--2023-08-18 19:27:34--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.251.167.128, 2607:f8b0:4004:c08::80
Connecting to download.tensorflow.org (download.tensorflow.org)|142.251.167.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20518283 (20M) [application/x-tar]
Saving to: ‘/tmp/mobilenet.tar.gz’

ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.index
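
When switching architectures, only the archive name in the download and extraction commands changes. The sketch below builds the three shell commands from a model name. It assumes that the other archives live under the same dated URL directory and contain a `<model_name>/checkpoint` folder, which should be verified against the model zoo page. `checkpoint_commands` is a hypothetical helper and only returns the commands; it does not run them.

```python
import shlex

# Base URL taken from the download cell above; assumed to hold the other archives too.
BASE_URL = "http://download.tensorflow.org/models/object_detection/tf2/20200711"


def checkpoint_commands(model_name, dest="source_dir/checkpoint", tmp_dir="/tmp"):
    """Return the shell commands that download and extract a checkpoint."""
    archive = f"{tmp_dir}/{model_name}.tar.gz"
    url = f"{BASE_URL}/{model_name}.tar.gz"
    return [
        # -p avoids the "File exists" error when the cell is rerun
        f"mkdir -p {shlex.quote(dest)}",
        f"wget -O {shlex.quote(archive)} {shlex.quote(url)}",
        f"tar -zxvf {shlex.quote(archive)} --strip-components 2 "
        f"--directory {shlex.quote(dest)} {shlex.quote(model_name + '/checkpoint')}",
    ]


for command in checkpoint_commands("ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8"):
    print(command)
```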


## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) in the `source_dir` folder should be updated when you experiment with different models. The different config files are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).
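
After copying a config file from the repository, a few fields usually need to be adapted, such as `num_classes`, `batch_size`, `fine_tune_checkpoint`, and `fine_tune_checkpoint_type`. The sketch below edits these fields in the config text with a regular expression. `set_field` is a hypothetical helper, and the sample text and the values used (3 classes, batch size 8, checkpoint path `checkpoint/ckpt-0`) are assumptions for illustration, not the settings of this project.

```python
import re


def set_field(config_text, field, value):
    """Replace the value of every `field: <old>` line with `value`."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(field)}[ \t]*:[ \t]*).*$", re.MULTILINE
    )
    # A function replacement avoids backslash handling in the new value.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        raise KeyError(f"field not found: {field}")
    return new_text


# Shortened, illustrative excerpt of a pipeline.config file.
sample = """model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 64
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
"""

updated = set_field(sample, "num_classes", "3")
updated = set_field(updated, "batch_size", "8")
updated = set_field(updated, "fine_tune_checkpoint", '"checkpoint/ckpt-0"')
updated = set_field(updated, "fine_tune_checkpoint_type", '"detection"')
print(updated)
```

For larger changes, parsing the file with the protobuf tooling that ships with the Object Detection API is more robust than text substitution.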

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You would need to modify it when working with other models.

## Launch Training Job

Now that we have a dataset, a Docker image, and pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html) object, in which we specify the container name, the name of the config file, the number of training steps, and other settings.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model

Different metrics are displayed during the evaluation phase, including the mean average precision. Use these metrics to quantify your model's performance and to compare results across experiments.
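
Mean average precision depends on intersection over union (IoU), which decides whether a predicted box matches a ground-truth box. COCO-style metrics average precision over several IoU thresholds between 0.5 and 0.95. The function below is a minimal sketch of IoU for two boxes in `(ymin, xmin, ymax, xmax)` order. It is an illustration only, not the evaluation code used by the training container.

```python
def iou(box_a, box_b):
    """Intersection over union of two (ymin, xmin, ymax, xmax) boxes."""
    # Corners of the overlapping rectangle, if any.
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])

    intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```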

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the Web UI.
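
The `hyperparameters` dictionary passed to the estimator below is how settings such as `num_train_steps` reach the training script. The sketch assumes the usual script-mode behaviour, where each entry is forwarded to the entry point as a `--name value` pair; the exact handling inside `run_training.sh` is not shown in this notebook. `to_cli_args` is a hypothetical helper for illustration.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--name', 'value', ...]."""
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


hyperparameters = {
    "model_dir": "/opt/training",
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": "2000",
    "sample_1_of_n_eval_examples": "1",
}
print(" ".join(to_cli_args(hyperparameters)))
```

Values are kept as strings here, matching the dictionary in the cell below.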

In [10]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.m5.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

Using provided s3_resource


INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-08-18-19-27-50-557


2023-08-18 19:27:52 Starting - Starting the training job...
2023-08-18 19:28:07 Starting - Preparing the instances for training......
2023-08-18 19:29:09 Downloading - Downloading input data...
2023-08-18 19:29:34 Training - Downloading the training image.........
2023-08-18 19:31:10 Training - Training image download completed. Training in progress...
2023-08-18 19:31:36,782 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-08-18 19:31:36,784 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-08-18 19:31:36,798 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-08-18 19:31:36,800 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-08-18 19:31:36,814 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-08-18 19:31:36,816 sagemaker-training-toolki

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0818 19:31:42.876619 140349932992320 mirrored_strategy.py:419] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 2000
I0818 19:31:42.899416 140349932992320 config_util.py:552] Maybe overwriting train_steps: 2000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0818 19:31:42.899588 140349932992320 config_util.py:552] Maybe overwriting use_bfloat16: False
Instructions for updating:
rename to distribute_datasets_from_function
W0818 19:31:42.924484 140349932992320 deprecation.py:364] From /usr/local/lib/python3.8/dist-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version

INFO:tensorflow:Step 200 per-step time 3.320s
I0818 19:43:51.921965 140349932992320 model_lib_v2.py:705] Step 200 per-step time 3.320s
INFO:tensorflow:{'Loss/classification_loss': 0.43705904,
 'Loss/localization_loss': 0.51569265,
 'Loss/regularization_loss': 0.12530816,
 'Loss/total_loss': 1.0780599,
 'learning_rate': 0.004}
I0818 19:43:51.922238 140349932992320 model_lib_v2.py:708] {'Loss/classification_loss': 0.43705904,
 'Loss/localization_loss': 0.51569265,
 'Loss/regularization_loss': 0.12530816,
 'Loss/total_loss': 1.0780599,
 'learning_rate': 0.004}
INFO:tensorflow:Step 300 per-step time 3.303s
I0818 19:49:22.174231 140349932992320 model_lib_v2.py:705] Step 300 per-step time 3.303s
INFO:tensorflow:{'Loss/classification_loss': 0.41568053,
 'Loss/localization_loss': 0.41608602,
 'Loss/regularization_loss': 0.11611957,
 'Loss/total_loss': 0.9478861,
 'learning_rate': 0.004}
I0818 19:49:22.174499 140349932992320 mo

INFO:tensorflow:Step 1600 per-step time 3.324s
I0818 21:01:13.689249 140349932992320 model_lib_v2.py:705] Step 1600 per-step time 3.324s
INFO:tensorflow:{'Loss/classification_loss': 0.3043723,
 'Loss/localization_loss': 0.33449095,
 'Loss/regularization_loss': 0.10215101,
 'Loss/total_loss': 0.74101424,
 'learning_rate': 0.004}
I0818 21:01:13.689541 140349932992320 model_lib_v2.py:708] {'Loss/classification_loss': 0.3043723,
 'Loss/localization_loss': 0.33449095,
 'Loss/regularization_loss': 0.10215101,
 'Loss/total_loss': 0.74101424,
 'learning_rate': 0.004}
INFO:tensorflow:Step 1700 per-step time 3.323s
I0818 21:06:46.032000 140349932992320 model_lib_v2.py:705] Step 1700 per-step time 3.323s
INFO:tensorflow:{'Loss/classification_loss': 0.31164846,
 'Loss/localization_loss': 0.32135108,
 'Loss/regularization_loss': 0.10078069,
 'Loss/total_loss': 0.7337802,
 'learning_rate': 0.004}
I0818 21:06:46.032265 14034993299232

I0818 21:24:25.868683 140450831300416 api.py:460] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
I0818 21:24:37.421303 140450831300416 api.py:460] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
Instructions for updating:
Use `tf.cast` instead.
W0818 21:24:41.672977 140450831300416 deprecation.py:364] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1176: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Finished eval step 0
I0818 21:24:41.691991 140450831300416 model_lib_v2.py:966] Finished eval step 0
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    ten

I0818 21:24:50.273458 140272193849152 api.py:460] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
I0818 21:24:58.757875 140272193849152 api.py:460] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
I0818 21:25:01.368822 140272193849152 signature_serialization.py:148] Function `call_func` contains input name(s) resource with unsupported characters which will be renamed to weightsharedconvolutionalboxpredictor_predictiontower_conv2d_3_batchnorm_feature_4_fusedbatchnormv3_readvariableop_1_resource in the SavedModel.
I0818 21:25:02.526404 140272193849152 api.py:460] feature_map_spatial_dims: [(80, 80), (40, 40), (20, 20), (10, 10), (5, 5)]
W0818 21:25:04.736757 140272193849152 save_impl.py:66] Skipping full serialization of Keras layer <object_detection.meta_architectures.ssd_meta_arch.SSDMetaArch object at 0x7f932827b040>, because it is not built.
W0818 21:25:05.001085 140272193849152 sav

I0818 21:25:16.103998 140272193849152 save.py:274] Found untraced functions such as WeightSharedConvolutionalBoxPredictor_layer_call_fn, WeightSharedConvolutionalBoxPredictor_layer_call_and_return_conditional_losses, WeightSharedConvolutionalBoxHead_layer_call_fn, WeightSharedConvolutionalBoxHead_layer_call_and_return_conditional_losses, WeightSharedConvolutionalClassHead_layer_call_fn while saving (showing 5 of 173). These functions will not be directly callable after loading.

2023-08-18 21:25:27 Uploading - Uploading generated training model
INFO:tensorflow:Assets written to: /tmp/exported/saved_model/assets
I0818 21:25:21.288776 140272193849152 builder_impl.py:804] Assets written to: /tmp/exported/saved_model/assets
I0818 21:25:21.601941 140272193849152 fingerprinting_utils.py:48] Writing fingerprint to /tmp/exported/saved_model/fingerprint.pb
INFO:tensorflow:Writing pipeline config file to /tmp/exported/pipeline.config
I0818 21:25:2
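
The training log above reports a per-step time and a loss dictionary every 100 steps. To compare runs, these values can be pulled out of the saved log text. The sketch below is a minimal example using two entries copied from the log. `summarize` is a hypothetical helper, and it assumes that every step line is followed by its loss dictionary, which may not hold for a truncated log.

```python
import re

STEP_LINE = re.compile(r"Step (\d+) per-step time ([\d.]+)s")
TOTAL_LOSS = re.compile(r"'Loss/total_loss': ([\d.]+)")


def summarize(log_text, num_train_steps):
    """Return ({step: total_loss}, estimated training hours)."""
    steps = STEP_LINE.findall(log_text)
    losses = TOTAL_LOSS.findall(log_text)
    if not steps:
        raise ValueError("no step lines found in the log")
    # Pair each reported step with the total loss that follows it.
    losses_by_step = {int(step): float(loss) for (step, _), loss in zip(steps, losses)}
    mean_step_time = sum(float(seconds) for _, seconds in steps) / len(steps)
    # Covers the training steps only, not startup, evaluation or export.
    estimated_hours = mean_step_time * num_train_steps / 3600
    return losses_by_step, estimated_hours


log_text = """\
INFO:tensorflow:Step 200 per-step time 3.320s
INFO:tensorflow:{'Loss/classification_loss': 0.43705904,
 'Loss/localization_loss': 0.51569265,
 'Loss/regularization_loss': 0.12530816,
 'Loss/total_loss': 1.0780599,
 'learning_rate': 0.004}
INFO:tensorflow:Step 1700 per-step time 3.323s
INFO:tensorflow:{'Loss/classification_loss': 0.31164846,
 'Loss/localization_loss': 0.32135108,
 'Loss/regularization_loss': 0.10078069,
 'Loss/total_loss': 0.7337802,
 'learning_rate': 0.004}
"""

losses_by_step, estimated_hours = summarize(log_text, num_train_steps=2000)
print(losses_by_step)
print(f"estimated training time: {estimated_hours:.2f} h")
```

At roughly 3.3 seconds per step on the CPU-only `ml.m5.2xlarge` instance, the step count is the main driver of job duration, which is worth keeping in mind when choosing `num_train_steps`.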

You should be able to see your training job in the AWS console, as shown below:
![Training jobs example](../data/example_trainings.png)
