# TensorFlow Object Detection API and AWS SageMaker

In this notebook, you will train and evaluate different models using the [TensorFlow Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/) and [AWS SageMaker](https://aws.amazon.com/sagemaker/).

If you ever feel stuck, you can refer to this [tutorial](https://aws.amazon.com/blogs/machine-learning/training-and-deploying-models-using-tensorflow-2-with-the-object-detection-api-on-amazon-sagemaker/).

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported to the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved with a resolution of 640x640.

In [11]:
%%capture
%pip install tensorflow_io "sagemaker<3" "numpy<2.0" python-dotenv awscli -U

In [2]:
import os
import sys

print(f"Using Python: {sys.executable}")

try:
    import sagemaker
    print(f"SageMaker imported from: {sagemaker.__file__}")
    from sagemaker.estimator import Estimator
    from framework import CustomFramework
    print("Imports successful!")
except ImportError as e:
    print(f"Import failed: {e}")
    print("Attempting to force reinstall sagemaker...")
    !{sys.executable} -m pip install "sagemaker<3" "numpy<2.0" --force-reinstall
    print("\nDONE. Please RESTART the kernel (circular arrow icon) and run this cell again.")

Using Python: /home/wcho/miniconda3/envs/tf2_od/bin/python
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/xdg-ubuntu/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/wcho/.config/sagemaker/config.yaml
SageMaker imported from: /home/wcho/miniconda3/envs/tf2_od/lib/python3.10/site-packages/sagemaker/__init__.py
Imports successful!


Save the IAM role in a variable called `role`. It is passed to the estimator when the training job is launched.

In [3]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Set the AWS region if it is not configured locally
if os.environ.get('AWS_DEFAULT_REGION') is None:
    os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

# Get the SageMaker Execution Role from .env (Best for local execution)
role = os.environ.get('SAGEMAKER_ROLE_ARN')

if role is None:
    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        print("Could not retrieve execution role via SDK. Please set SAGEMAKER_ROLE_ARN in .env")

print(role)

arn:aws:iam::305502288700:role/service-role/AmazonSageMaker-ExecutionRole-20200629T181730
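
A malformed role ARN usually only surfaces once the training job is created, so it can help to check its shape first. The sketch below is a convenience under stated assumptions, not part of the SageMaker SDK: `looks_like_role_arn` is a hypothetical helper, and the pattern only checks the general `arn:<partition>:iam::<12-digit account>:role/<name>` layout, not whether the role exists or has the right permissions.

```python
import re

# Hypothetical helper (not part of the SageMaker SDK): it checks only the
# general shape of an IAM role ARN, not whether the role actually exists.
_ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/\S+$")


def looks_like_role_arn(value):
    """Return True if `value` has the layout of an IAM role ARN."""
    return isinstance(value, str) and _ROLE_ARN_PATTERN.match(value) is not None


# ARN copied from the cell output above.
example_arn = (
    "arn:aws:iam::305502288700:role/service-role/"
    "AmazonSageMaker-ExecutionRole-20200629T181730"
)
print(looks_like_role_arn(example_arn))
print(looks_like_role_arn(None))  # what `role` holds if nothing was found
```

Calling `looks_like_role_arn(role)` right after the cell above would flag the case where neither `.env` nor the SDK provided a role.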


In [4]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
          'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://object-detection-project/logs/'
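
Since a typo in an S3 path is easy to miss until the job fails, the following sketch splits each URI into bucket and prefix using only the standard library. `split_s3_uri` is a hypothetical helper introduced for illustration; it validates the syntax of the URI only and does not contact AWS.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into (bucket, prefix).

    Hypothetical helper: raises ValueError if the URI is not of the s3:// form.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Same URIs as in the cell above, repeated so the example is self-contained.
uris_to_check = {
    "train": "s3://cd2688-object-detection-tf2/train/",
    "val": "s3://cd2688-object-detection-tf2/val/",
    "tensorboard": "s3://object-detection-project/logs/",
}
for name, uri in uris_to_check.items():
    bucket, prefix = split_s3_uri(uri)
    print(f"{name}: bucket={bucket} prefix={prefix}")
```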

## Container

To train the model, you first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clones the TensorFlow models repository
* copies the exporter and training scripts from the repository
* builds the Docker image and pushes it
* prints the container name

In [5]:
%%bash

# clone the repo and get the scripts
git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
cp docker/models/research/object_detection/exporter_main_v2.py source_dir 
cp docker/models/research/object_detection/model_main_tf2.py source_dir

fatal: destination path 'docker/models' already exists and is not an empty directory.


In [55]:
# Build the Docker image from the ROCm Dockerfile (this command only builds; it does not push).
# This cell can be commented out after it has run once. The build takes around 10 minutes.
image_name = 'tf2-object-detection'
!docker build -t $image_name -f docker/Dockerfile.rocm docker

DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  788.2MB
Step 1/24 : FROM rocm/tensorflow:latest
 ---> dfbd655d23fc
Step 2/24 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> f12672ff6d05
Step 3/24 : RUN apt-get update && apt-get install -y     git     gpg-agent     python3-cairocffi     protobuf-compiler     python3-pil     python3-lxml     python3-tk     wget
 ---> Using cache
 ---> 722c50dbabd3
Step 4/24 : RUN pip install --upgrade pip &&     pip install     cython     contextlib2     pillow     lxml     jupyter     matplotlib
 ---> Using cache
 ---> c3d63e859345
Step 5/24 : COPY models /opt/models
 ---> Using cache
 ---> 6232483a25e1
Step 6/24 : WORKDIR /opt/models/research
 ---> Using cache
 ---> 74fc99039678
Step 7/24 : RUN protoc object_detection/protos/*.proto --python_out

To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look it up in the AWS console. For example, the screenshot below shows three different images pushed to ECR. You should see only one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [31]:
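# Display the container name.
# If an ECR image name was recorded in docker/ecr_image_fullname.txt, read it.
ecr_name_file = os.path.join('docker', 'ecr_image_fullname.txt')
if os.path.exists(ecr_name_file):
    with open(ecr_name_file, 'r') as f:
        container = f.readline().strip()
# Note: this assignment always overrides the ECR name with the locally built image.
# Remove it to use the name read from the file instead.
container = image_name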

print(container)

tf2-object-detection


In [54]:
!docker run --rm --device=/dev/kfd --device=/dev/dri --group-add=video -e HSA_OVERRIDE_GFX_VERSION=11.0.0 $container python -c "import tensorflow as tf; print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU'))); import object_detection; print('Object Detection API imported successfully')"

2026-01-18 18:02:13.332289: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Num GPUs Available:  1
Object Detection API imported successfully


## Pre-trained model from model zoo

As is common practice, we do not train from scratch; instead, we use a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)
* Faster R-CNN ResNet50 V1 640x640
* EfficientDet D1 640x640
* Faster R-CNN ResNet152 V1 640x640

In the code below, the EfficientDet D1 model is downloaded and extracted. Adjust the archive name and paths if you experiment with other architectures.

In [9]:
%%bash
mkdir -p /tmp/checkpoint
mkdir -p source_dir/checkpoint
wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint

--2026-01-18 15:40:53--  http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.250.129.207, 172.217.76.207, 142.250.140.207, ...
Connecting to download.tensorflow.org (download.tensorflow.org)|142.250.129.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51839363 (49M) [application/x-tar]
Saving to: ‘/tmp/efficientdet.tar.gz’

     0K .......... .......... .......... .......... ..........  0%  519K 97s
    50K .......... .......... .......... .......... ..........  0%  525K 97s
   100K .......... .......... .......... .......... ..........  0% 1.65M 74s
   150K .......... .......... .......... .......... ..........  0% 1.43M 64s
   200K .......... .......... .......... .......... ..........  0% 1.64M 57s
   250K .......... .......... .......... .......... ..........  0% 1.79M 52s
   300K .......... .......... .......... .......... ..........  

efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001
efficientdet_d1_coco17_tpu-32/checkpoint/checkpoint
efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.index
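
When switching architectures, the download and extraction steps can be parameterized by the archive name. The sketch below mirrors the `wget` and `tar --strip-components 2` commands above in Python. The base URL is taken from that command; `checkpoint_url` and `extract_checkpoint` are hypothetical helpers, and whether other models live under the same dated folder is an assumption to verify against the model zoo page.

```python
import os
import shutil
import tarfile

# Base URL copied from the wget command above. Other archives may sit under a
# different dated folder; check the model zoo page before relying on this.
ZOO_BASE_URL = "http://download.tensorflow.org/models/object_detection/tf2/20200711"


def checkpoint_url(model_name):
    """Build the archive URL for a zoo entry such as 'efficientdet_d1_coco17_tpu-32'."""
    return f"{ZOO_BASE_URL}/{model_name}.tar.gz"


def extract_checkpoint(archive_path, model_name, dest_dir):
    """Copy the files under '<model_name>/checkpoint/' into dest_dir.

    This has the same effect as `tar --strip-components 2` on that folder:
    the two leading path components are dropped.
    """
    prefix = f"{model_name}/checkpoint/"
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.startswith(prefix)):
                continue
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return sorted(extracted)


print(checkpoint_url("efficientdet_d1_coco17_tpu-32"))
```

After downloading an archive to, say, `/tmp/model.tar.gz`, a call such as `extract_checkpoint('/tmp/model.tar.gz', 'efficientdet_d1_coco17_tpu-32', 'source_dir/checkpoint')` would populate the checkpoint folder used by the training job.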


## Manual GPU Training Test

Before running the full SageMaker job, we download the data locally and run the training script manually using `docker run` to verify GPU usage and performance.

In [12]:
%%bash
# Download data for local training test
mkdir -p data/train data/val
aws s3 cp s3://cd2688-object-detection-tf2/train/ data/train/ --recursive --quiet
aws s3 cp s3://cd2688-object-detection-tf2/val/ data/val/ --recursive --quiet
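
Before starting the container, it may be worth confirming that the folder mounted at `/opt/ml/code` contains everything the training script expects. The following sketch is a pre-flight check under assumptions: the file names come from the cells in this notebook, the `ckpt-*` pattern reflects the files listed by the extraction step, and `missing_training_files` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the cells in this notebook. The checkpoint glob is an
# assumption about what the extraction step leaves in source_dir/checkpoint.
REQUIRED_FILES = [
    "run_training.sh",
    "pipeline.config",
    "model_main_tf2.py",
    "exporter_main_v2.py",
]


def missing_training_files(source_dir="source_dir"):
    """Return the expected files or folders that are absent from source_dir."""
    root = Path(source_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / "checkpoint").glob("ckpt-*")):
        missing.append("checkpoint/ckpt-*")
    return missing


print(missing_training_files() or "All expected files are present.")
```

This only checks that files are present; it does not validate the contents of `pipeline.config` or the Python environment inside the image.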

In [56]:
%%bash
# Run training locally on GPU using docker run
mkdir -p local_training_output

docker run --rm \
    --device=/dev/kfd --device=/dev/dri --group-add=video \
    -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
    -v $(pwd)/source_dir:/opt/ml/code \
    -v $(pwd)/data/train:/opt/ml/input/data/train \
    -v $(pwd)/data/val:/opt/ml/input/data/val \
    -v $(pwd)/local_training_output:/opt/training \
    -w /opt/ml/code \
    tf2-object-detection \
    /bin/bash run_training.sh \
    --model_dir /opt/training \
    --pipeline_config_path pipeline.config \
    --num_train_steps 100 \
    --sample_1_of_n_eval_examples 1

===TRAINING THE MODEL==


2026-01-18 18:03:17.440777: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/opt/ml/code/model_main_tf2.py", line 31, in <module>
    from object_detection import model_lib_v2
  File "/opt/models/research/object_detection/model_lib_v2.py", line 30, in <module>
    from object_detection import inputs
  File "/opt/models/research/object_detection/inputs.py", line 24, in <module>
    import tensorflow_estimator as tf_estimator
  File "/usr/local/lib/python3.12/site-packages/tensorflow_estimator/__init__.py", line 8, in <module>
    from tensorflow_estimator._api.v1 import estimator
  File "/usr/local/lib/python3.12/site-packages/tensorflow_estimator/_a

==EVALUATING THE MODEL==


2026-01-18 18:03:19.943634: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/opt/ml/code/model_main_tf2.py", line 31, in <module>
    from object_detection import model_lib_v2
  File "/opt/models/research/object_detection/model_lib_v2.py", line 30, in <module>
    from object_detection import inputs
  File "/opt/models/research/object_detection/inputs.py", line 24, in <module>
    import tensorflow_estimator as tf_estimator
  File "/usr/local/lib/python3.12/site-packages/tensorflow_estimator/__init__.py", line 8, in <module>
    from tensorflow_estimator._api.v1 import estimator
  File "/usr/local/lib/python3.12/site-packages/tensorflow_estimator/_a

==EXPORTING THE MODEL==


2026-01-18 18:03:22.205519: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  import pkg_resources
caused by: ['/usr/local/lib/python3.12/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl5mutex6unlockEv']
caused by: ['/usr/local/lib/python3.12/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZNK10tensorflow4data11DatasetBase8FinalizeEPNS_15OpKernelContextESt8functionIFN4absl12lts_202308028StatusOrIN3tsl4core11RefCountPtrIS1_EEEEvEE']
FATAL Flags parsing error: flag --pipeline_config_path=None: Flag --pipeline_config_path must have a value other than None.
Pass --helpshort or --helpfull to see help on flags.
mv: cannot 

## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) in the `source_dir` folder should be updated when you experiment with different models. The different config files are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You would need to modify it when working with other models.
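
When adapting a config file from the repository, a few scalar fields typically need new values. The sketch below edits them with a regular expression rather than the Object Detection API itself, so it runs without TensorFlow. `set_config_field` is a hypothetical helper, the sample text is a trimmed stand-in for a real `pipeline.config`, and the values written (`3`, `8`, the checkpoint path) are placeholders, not recommendations.

```python
import re


def set_config_field(config_text, field, value):
    """Replace the value on the first line of the form `field: ...`.

    Only the first occurrence is changed, so a field that appears in several
    blocks (for example batch_size in train_config and eval_config) keeps its
    later values. Raises KeyError if the field is not found.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(field)}[ \t]*:[ \t]*).*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text, count=1)
    if count == 0:
        raise KeyError(field)
    return new_text


# Trimmed stand-in for a real pipeline.config (assumption, for illustration only).
sample = """model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 128
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
}
"""

updated = set_config_field(sample, "num_classes", "3")
updated = set_config_field(updated, "batch_size", "8")
updated = set_config_field(updated, "fine_tune_checkpoint", '"checkpoint/ckpt-0"')
print(updated)
```

A text-level edit like this is fragile for nested or repeated fields; for anything beyond a few scalars, editing the file by hand and reviewing the diff is probably safer.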

## Launch Training Job

Now that we have a dataset, a Docker image, and pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html), in which we specify the container name, the name of the config file, the number of training steps, and so on.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model

Different metrics are displayed during the evaluation phase, including the mean average precision (mAP). Use these metrics to quantify your model's performance and to compare it across iterations.
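
Mean average precision rests on matching each predicted box to a ground-truth box by their intersection over union (IoU): a prediction counts as correct only if its IoU with a ground-truth box exceeds a threshold. The sketch below shows the IoU computation alone, as a worked example and not the evaluation code used by the API. The `(ymin, xmin, ymax, xmax)` ordering is chosen here for illustration, and the two boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (ymin, xmin, ymax, xmax)."""
    # Corners of the overlapping rectangle (may be empty).
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 4 + 4 - 1 = 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

With an IoU threshold of 0.5, the prediction in this example would not count as a match, since 1/7 is well below the threshold.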

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the web UI.

In [10]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir": "/opt/training",        
        "pipeline_config_path": "pipeline.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='local',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: tf2-object-detection-2026-01-18-15-41-08-589
INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.local.image:'Docker Compose' is not installed. Proceeding to check for 'docker-compose' CLI.
INFO:sagemaker.local.image:'Docker Compose' found using Docker Compose CLI.
INFO:sagemaker.local.local_session:Starting training job
INFO:sagemaker.local.image:Using the long-lived AWS credentials found in session
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-cm77g:
    command: train
    cont

 Network sagemaker-local Creating 
 Network sagemaker-local Created 
 Container 2qy2g7fkqv-algo-1-cm77g Creating 
 Container 2qy2g7fkqv-algo-1-cm77g Created 
Attaching to 2qy2g7fkqv-algo-1-cm77g
 Container 2qy2g7fkqv-algo-1-cm77g Starting 
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]


ERROR:sagemaker:Please check the troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html#sagemaker-python-sdk-troubleshooting-create-training-job


You should be able to see your training job in the AWS console, as shown below:
![Training jobs example](../data/example_trainings.png)


## Improve on the initial model

Most likely, this initial experiment did not yield optimal results. However, you can make multiple changes to the `pipeline.config` file to improve the model. One obvious change is to improve the data augmentation strategy. The [`preprocessor.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) file lists the data augmentation methods available in the TF Object Detection API. Justify your choice of augmentations in the write-up.

The following options are also available:
* experiment with the optimizer: type of optimizer, learning rate, scheduler, etc.
* experiment with the architecture. The TF Object Detection API model zoo offers many architectures. Keep in mind that the `pipeline.config` file is specific to each architecture, so you will have to edit it.
* visualize results on the test frames using the `2_deploy_model` notebook available in this repository.
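
To reason about an augmentation, it helps to see what it does to the labels as well as to the pixels. The sketch below illustrates the box transformation behind a horizontal flip (`random_horizontal_flip` is one of the options in `preprocessor.proto`). It is a hand-written illustration, not the API's implementation, and it assumes boxes in normalized `(ymin, xmin, ymax, xmax)` coordinates; `flip_boxes_horizontally` is a hypothetical helper.

```python
def flip_boxes_horizontally(boxes):
    """Mirror normalized (ymin, xmin, ymax, xmax) boxes around the vertical axis.

    The y coordinates are unchanged. The x coordinates are reflected, and the
    left and right edges swap roles so that xmin stays smaller than xmax.
    """
    return [
        (ymin, 1.0 - xmax, ymax, 1.0 - xmin)
        for ymin, xmin, ymax, xmax in boxes
    ]


# A made-up box in the left half of the image ends up in the right half.
original = [(0.1, 0.2, 0.5, 0.4)]
flipped = flip_boxes_horizontally(original)
print(flipped)
```

A flip of this kind is usually reasonable for driving scenes, whereas a vertical flip would produce images that never occur in the data, which is the sort of argument the write-up can make for each augmentation.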

In the cell below, write down the different approaches you have experimented with, why you chose them, and what you would have done with more time and resources. Justify your choices using the TensorBoard visualizations (take screenshots and insert them in your write-up), the metrics on the evaluation set, and the animation you generated with [this tool](../2_run_inference/2_deploy_model.ipynb).

In [None]:
# your write-up goes here.