# Transfer Learning of Faster R-CNN on KITTI dataset

This notebook is a step-by-step tutorial on transfer learning of a [Faster R-CNN](https://arxiv.org/abs/1506.01497) model on the [Kitti](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d) 2-D object dataset, starting from a pre-trained Mask R-CNN model trained on the [COCO 2017 dataset](http://cocodataset.org/#download).

The outline of steps is as follows:

1. Stage Kitti dataset in COCO format on S3
2. Copy Kitti dataset from Amazon S3 to Amazon EFS file-system mounted on this notebook instance
3. Build Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)
4. Configure data input channels
5. Configure hyper-parameters
6. Define training metrics
7. Define training job and start training

Before we get started, let us initialize two python variables ```aws_region``` and ```s3_bucket``` that we will use throughout the notebook:

In [None]:
aws_region = # <aws-region>
s3_bucket  = # <your-s3_bucket>

## Stage Kitti dataset in COCO format in Amazon S3

Below are steps to convert Kitti dataset to COCO format and stage it in S3:

1. Download [Kitti object dataset left camera color images](http://www.cvlibs.net/download.php?file=data_object_image_2.zip). 
2. Download [Kitti object dataset training labels](http://www.cvlibs.net/download.php?file=data_object_label_2.zip)
3. `unzip` the downloaded zip files, which should create a folder structure as shown below:

```
   |-- training
   |   |-- image_2
   |   |__ label_2
   |
   |-- testing
   |   |__ image_2
```
4. Split the Kitti `training` dataset `image_2` and `label_2` files randomly into `train` and `val` datasets using a 90-10 split. We do not provide any tool for doing this split.
5. Use [convert-dataset](https://github.com/eweill/convert-datasets/blob/master/convert-dataset.py) or any other tool to convert `train` and `val` datasets from `Kitti` to `Voc` format.
6. Use [voc2coco](https://github.com/yukkyo/voc2coco) or any other tool to convert data from `Voc` to `Coco` format.
    * Name the converted COCO annotations files for `train` and `val` as `instances_trainKitti.json` and `instances_valKitti.json` respectively
7. Copy the converted Kitti dataset and [pretrained models](http://models.tensorpack.com/#FasterRCNN) to your S3 bucket. The required prefix structure in your S3 bucket is shown below:
   
```
|-- faster-rcnn-kitti
|   |-- testing
|   |   |__ ...
|   |
|   |-- sagemaker
|   |   |-- input
|   |   |   |-- train
|   |   |   |   |-- trainKitti
|   |   |   |   |   |__ ...
|   |   |   |   |
|   |   |   |   |-- valKitti
|   |   |   |   |   |__ ...
|   |   |   |   |
|   |   |   |   |-- annotations
|   |   |   |   |   |-- instances_trainKitti.json
|   |   |   |   |   |__ instances_valKitti.json
|   |   |   |   |
|   |   |   |   |-- pretrained-models
|   |   |   |   |   |-- COCO-MaskRCNN-R101FPN1x.npz
|   |   |   |   |   |-- COCO-MaskRCNN-R50FPN2x.npz
|   |   |   |   |   |-- ImageNet-R101-AlignPadding.npz
|   |   |   |   |   |__ ImageNet-R50-AlignPadding.npz
```
The split `train` and `val` dataset images are copied under `faster-rcnn-kitti/sagemaker/input/train/trainKitti` and `faster-rcnn-kitti/sagemaker/input/train/valKitti` prefixes in S3, respectively. The `testing/image_2` images are copied under `faster-rcnn-kitti/testing` prefix in S3.
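
The random 90-10 split described above can be done in a few lines of Python. The following is a minimal sketch, not a tool shipped with this repository: the function name `split_train_val`, the fixed seed, and the example file stems are assumptions made for illustration.

```python
import random

def split_train_val(stems, val_fraction=0.1, seed=0):
    """Randomly split file stems (for example '000000') into train and val lists."""
    shuffled = sorted(stems)                  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)     # seeded shuffle, independent of global state
    n_val = max(1, round(len(shuffled) * val_fraction))
    return sorted(shuffled[n_val:]), sorted(shuffled[:n_val])

# Hypothetical stems; in practice, list the file names under training/image_2.
stems = [f"{i:06d}" for i in range(100)]
train_stems, val_stems = split_train_val(stems)
print(len(train_stems), len(val_stems))
```

Splitting by file stem rather than by full file name keeps each image and its label file in the same subset, assuming the image and label share a stem.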

## Copy Kitti dataset from S3 to Amazon EFS

Next, we copy Kitti dataset from S3 to EFS file-system.  The ```prepare-kitti-efs.sh``` script executes this step.

In [None]:
!cat ./prepare-kitti-efs.sh

If you have already copied the Kitti dataset from S3 to your EFS file-system, skip this step.

In [None]:
%%time
!./prepare-kitti-efs.sh {s3_bucket}

## Build and push SageMaker training images

For this step, the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) attached to this notebook instance needs full access to the Amazon ECR service. If you created this notebook instance using the ```./stack-sm.sh``` script in this repository, the IAM Role attached to this notebook instance is already set up with full access to the ECR service.

Below, we will build an image for [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) implementation and push it to Amazon ECR.

### TensorPack Faster-RCNN/Mask-RCNN

Use ```./container-kitti/build_tools/build_and_push.sh``` script to build and push the TensorPack Faster-RCNN/Mask-RCNN  training image to Amazon ECR.

In [None]:
!cat ./container-kitti/build_tools/build_and_push.sh

Using your *AWS region* as argument, run the cell below. 

In [None]:
%%time
! ./container-kitti/build_tools/build_and_push.sh {aws_region}

Set ```tensorpack_image``` below to Amazon ECR URI of the image you pushed above.

In [None]:
tensorpack_image = #<amazon-ecr-uri>

## SageMaker Initialization 
We have staged the data, and we have built and pushed the training Docker image to Amazon ECR. Now we are ready to start using Amazon SageMaker.

In [None]:
%%time
import os
import time
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator

role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role
print(f'SageMaker Execution Role:{role}')

client = boto3.client('sts')
account = client.get_caller_identity()['Account']
print(f'AWS account:{account}')

session = boto3.session.Session()
region = session.region_name
print(f'AWS region:{region}')

Next, we set the Amazon ECR image URI used for training. You saved this URI in a previous step.

In [None]:
training_image = tensorpack_image
print(f'Training image: {training_image}')
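
For reference, ECR image URIs in the standard commercial AWS regions follow the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below composes such a URI; the helper name `ecr_image_uri` and the repository and tag values are placeholders for illustration, not the names used by `build_and_push.sh`.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Compose an Amazon ECR image URI from its parts."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# Hypothetical values; substitute the repository and tag that you actually pushed.
print(ecr_image_uri("123456789012", "us-west-2", "my-training-repo", "my-tag"))
```

Comparing the composed value with `training_image` is a quick way to catch a URI copied from the wrong account or region.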

## Define SageMaker Data Channels

Next, we define the *train* and *log* data channels using EFS file-system. To do so, we need to specify the EFS file-system id, which is shown in the output of the command below.

In [None]:
!df -kh | grep 'fs-' | sed 's/\(fs-[0-9a-z]*\).*/\1/'

Set the EFS ```file_system_id``` below to the output of the command shown above. In the cell below, we define the `train` data input channel.

In [None]:
from sagemaker.inputs import FileSystemInput

# Specify EFS file system id.
file_system_id = # 'fs-xxxxxxxx'
print(f"EFS file-system-id: {file_system_id}")

# Specify directory path for input data on the file system. 
# You need to provide normalized and absolute path below.
file_system_directory_path = '/faster-rcnn-kitti/sagemaker/input/train'
print(f'EFS file-system data input path: {file_system_directory_path}')

# Specify the access mode of the mount of the directory associated with the file system. 
# Directory must be mounted  'ro'(read-only).
file_system_access_mode = 'ro'

# Specify your file system type
file_system_type = 'EFS'

train = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)
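
As an optional sanity check, the file-system id and directory path used above can be validated before the training job is submitted. This is a minimal sketch under two assumptions: that EFS ids consist of `fs-` followed by hexadecimal characters, and that a "normalized and absolute" path is one that starts with `/` and is unchanged by `posixpath.normpath`. The helper name `check_efs_input` is hypothetical.

```python
import posixpath
import re

def check_efs_input(file_system_id, directory_path):
    """Return True if the id looks like an EFS id and the path is absolute and normalized."""
    # Assumption: EFS ids are 'fs-' followed by lowercase hexadecimal characters.
    id_ok = re.fullmatch(r"fs-[0-9a-f]+", file_system_id) is not None
    # A normalized path has no '..', '.', doubled or trailing separators.
    path_ok = (directory_path.startswith("/")
               and posixpath.normpath(directory_path) == directory_path)
    return id_ok and path_ok

print(check_efs_input("fs-0123abcd", "/faster-rcnn-kitti/sagemaker/input/train"))
```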

Below we create the log output directory and define the `log` data output channel.

In [None]:
# Specify directory path for log output on the EFS file system.
# You need to provide normalized and absolute path below.
# For example, '/faster-rcnn-kitti/sagemaker/output/log'
# Log output directory must not exist
file_system_directory_path = f'/faster-rcnn-kitti/sagemaker/output/log-{int(time.time())}'

# Create the log output directory. 
# EFS file-system is mounted on '$HOME/efs' mount point for this notebook.
home_dir=os.environ['HOME']
local_efs_path = os.path.join(home_dir,'efs', file_system_directory_path[1:])
print(f"Creating log directory on EFS: {local_efs_path}")

assert not os.path.isdir(local_efs_path)
! sudo mkdir -p -m a=rw {local_efs_path}
assert os.path.isdir(local_efs_path)

# Specify the access mode of the mount of the directory associated with the file system. 
# Directory must be mounted 'rw'(read-write).
file_system_access_mode = 'rw'


log = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)

data_channels = {'train': train, 'log': log}

Next, we define the model output location in S3. Set ```s3_bucket``` to your S3 bucket name prior to running the cell below. 

The model checkpoints, logs and Tensorboard events will be written to the log output directory on the EFS file system you created above. At the end of the model training, they will be copied from the log output directory to the `s3_output_location` defined below.

In [None]:
prefix = "faster-rcnn-kitti/sagemaker" #prefix in your bucket
s3_output_location = f's3://{s3_bucket}/{prefix}/output'
print(f'S3 model output location: {s3_output_location}')

## Configure Hyper-parameters
Next we define the hyper-parameters. 

Note that some hyper-parameters differ between the TensorPack Faster-RCNN/Mask-RCNN implementation used in this notebook and the AWS Samples Mask-RCNN implementation. The batch size per GPU in TensorPack Faster-RCNN/Mask-RCNN is fixed at 1, but is configurable in AWS Samples Mask-RCNN. The learning rate schedule is specified in units of steps in TensorPack Faster-RCNN/Mask-RCNN, but in epochs in AWS Samples Mask-RCNN.

The default learning rate schedule values shown below correspond to training for a total of 24 epochs, at 120,000 images per epoch.

<table align='left'>
    <caption>TensorPack Faster-RCNN/Mask-RCNN  Hyper-parameters</caption>
    <tr>
    <th style="text-align:center">Hyper-parameter</th>
    <th style="text-align:center">Description</th>
    <th style="text-align:center">Default</th>
    </tr>
    <tr>
        <td style="text-align:center">mode_fpn</td>
        <td style="text-align:left">Flag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone</td>
        <td style="text-align:center">"True"</td>
    </tr>
     <tr>
        <td style="text-align:center">mode_mask</td>
        <td style="text-align:left">A value of "False" means Faster-RCNN model, "True" means Mask R-CNN model</td>
        <td style="text-align:center">"True"</td>
    </tr>
     <tr>
        <td style="text-align:center">eval_period</td>
        <td style="text-align:left">Evaluation period during training, in epochs</td>
        <td style="text-align:center">1</td>
    </tr>
    <tr>
        <td style="text-align:center">lr_schedule</td>
        <td style="text-align:left">Learning rate schedule in training steps</td>
        <td style="text-align:center">'[240000, 320000, 360000]'</td>
    </tr>
    <tr>
        <td style="text-align:center">batch_norm</td>
        <td style="text-align:left">Batch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') </td>
        <td style="text-align:center">'FreezeBN'</td>
    </tr>
    <tr>
        <td style="text-align:center">images_per_epoch</td>
        <td style="text-align:left">Images per epoch </td>
        <td style="text-align:center">120000</td>
    </tr>
    <tr>
        <td style="text-align:center">data_train</td>
        <td style="text-align:left">Training data under data directory</td>
        <td style="text-align:center">'coco_train2017'</td>
    </tr>
    <tr>
        <td style="text-align:center">data_val</td>
        <td style="text-align:left">Validation data under data directory</td>
        <td style="text-align:center">'coco_val2017'</td>
    </tr>
     <tr>
        <td style="text-align:center">resnet_arch</td>
        <td style="text-align:left">Must be 'resnet50' or 'resnet101'</td>
        <td style="text-align:center">'resnet50'</td>
    </tr>
    <tr>
        <td style="text-align:center">backbone_weights</td>
        <td style="text-align:left">Pretrained ResNet backbone weights</td>
        <td style="text-align:center">'ImageNet-R50-AlignPadding.npz'</td>
    </tr>
     <tr>
        <td style="text-align:center">load_model</td>
        <td style="text-align:left">Load pretrained model weights</td>
        <td style="text-align:center"></td>
    </tr>
</table>


In [None]:
hyperparameters = {
                    "mode_fpn": "True",
                    "mode_mask": "False",
                    "eval_period": 1,
                    "batch_norm": "FreezeBN",
                    "backbone_weights": 'ImageNet-R50-AlignPadding.npz',
                    "load_model": "COCO-MaskRCNN-R50FPN2x.npz",
                    "data_train": "coco_trainKitti",
                    "data_val": "coco_valKitti",
                    "lr_schedule": '[14400, 18000, 21600]',
                    "images_per_epoch": 7200,
                    "resnet_arch": 'resnet50'
                  }
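
The relationship between `lr_schedule` (in steps) and epochs can be checked with a short calculation. This sketch assumes that the schedule steps are counted at a total batch size of 8 images per step, which is our reading of the TensorPack FasterRCNN configuration and is not stated in this notebook; the helper name `schedule_in_epochs` is hypothetical.

```python
import ast

def schedule_in_epochs(lr_schedule, images_per_epoch, images_per_step=8):
    """Convert a step-based schedule string such as '[14400, 18000, 21600]' to epochs."""
    steps = ast.literal_eval(lr_schedule)   # safely parse the list literal
    # Assumption: each schedule step corresponds to images_per_step images.
    return [step * images_per_step / images_per_epoch for step in steps]

# Default values from the table above, then the Kitti values used in this notebook.
print(schedule_in_epochs('[240000, 320000, 360000]', 120000))
print(schedule_in_epochs('[14400, 18000, 21600]', 7200))
```

Under that assumption, both schedules end at 24 epochs, which matches the total stated above for the default values.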

## Define Training Metrics
Next, we define the regular expressions that SageMaker uses to extract algorithm metrics from the training logs and send them to [Amazon CloudWatch metrics](https://docs.aws.amazon.com/en_pv/AmazonCloudWatch/latest/monitoring/working_with_metrics.html). These algorithm metrics are visualized in the SageMaker console.

In [None]:
metric_definitions=[
             {
                "Name": "fastrcnn_losses/box_loss",
                "Regex": ".*fastrcnn_losses/box_loss:\\s*(\\S+).*"
            },
            {
                "Name": "fastrcnn_losses/label_loss",
                "Regex": ".*fastrcnn_losses/label_loss:\\s*(\\S+).*"
            },
            {
                "Name": "fastrcnn_losses/label_metrics/accuracy",
                "Regex": ".*fastrcnn_losses/label_metrics/accuracy:\\s*(\\S+).*"
            },
            {
                "Name": "fastrcnn_losses/label_metrics/false_negative",
                "Regex": ".*fastrcnn_losses/label_metrics/false_negative:\\s*(\\S+).*"
            },
            {
                "Name": "fastrcnn_losses/label_metrics/fg_accuracy",
                "Regex": ".*fastrcnn_losses/label_metrics/fg_accuracy:\\s*(\\S+).*"
            },
            {
                "Name": "fastrcnn_losses/num_fg_label",
                "Regex": ".*fastrcnn_losses/num_fg_label:\\s*(\\S+).*"
            },
            {
                "Name": "mAP(bbox)/IoU=0.5",
                "Regex": ".*mAP\\(bbox\\)/IoU=0\\.5:\\s*(\\S+).*"
            },
            {
                "Name": "mAP(bbox)/IoU=0.5:0.95",
                "Regex": ".*mAP\\(bbox\\)/IoU=0\\.5:0\\.95:\\s*(\\S+).*"
            },
            {
                "Name": "mAP(bbox)/IoU=0.75",
                "Regex": ".*mAP\\(bbox\\)/IoU=0\\.75:\\s*(\\S+).*"
            },
            {
                "Name": "mAP(bbox)/large",
                "Regex": ".*mAP\\(bbox\\)/large:\\s*(\\S+).*"
            },
            {
                "Name": "mAP(bbox)/medium",
                "Regex": ".*mAP\\(bbox\\)/medium:\\s*(\\S+).*"
            },
            {
                "Name": "mAP(bbox)/small",
                "Regex": ".*mAP\\(bbox\\)/small:\\s*(\\S+).*"
            }
            
    ]
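
A metric regular expression can be tried locally against a sample log line before the job is launched. The sketch below uses Python's `re` module as a stand-in for the matching that SageMaker performs, and the log line is a made-up example; the actual TensorPack log format may differ.

```python
import re

# Same pattern as the "fastrcnn_losses/box_loss" entry above.
regex = ".*fastrcnn_losses/box_loss:\\s*(\\S+).*"

# Hypothetical log line, for illustration only.
line = "[monitor] fastrcnn_losses/box_loss: 0.1234"

match = re.match(regex, line)
value = float(match.group(1)) if match else None  # the first capture group is the metric value
print(value)
```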

## Define SageMaker Training Job

Next, we use SageMaker [Estimator](https://sagemaker.readthedocs.io/en/stable/estimators.html) API to define a SageMaker Training Job. 

The cell below trains on a single ```ml.p3.16xlarge``` instance, which has 8 Tesla V100 GPUs, so it sets ```instance_count=1``` and ```instance_type='ml.p3.16xlarge'```. To train on 32 GPUs instead, set ```instance_count=4```. We recommend using a 100 GB [Amazon EBS](https://aws.amazon.com/ebs/) storage volume with each training instance, so we set ```volume_size=100```.

We run the training job in your private VPC, so we need to set the ```subnets``` and ```security_group_ids``` prior to running the cell below. You may specify multiple subnet ids in the ```subnets``` list. The subnets included in the ```subnets``` list must be part of the output of the ```./stack-sm.sh``` CloudFormation stack script used to create this notebook instance. Specify only one security group id in the ```security_group_ids``` list. The security group id must be part of the output of the ```./stack-sm.sh``` script.

For ```instance_type``` below, you have the option to use ```ml.p3.16xlarge``` with 16 GB per-GPU memory and 25 Gbps network interconnectivity, or ```ml.p3dn.24xlarge``` with 32 GB per-GPU memory and 100 Gbps network interconnectivity. The ```ml.p3dn.24xlarge``` instance type offers significantly better performance than ```ml.p3.16xlarge``` for Mask R-CNN distributed TensorFlow training.

In [None]:
# Give Amazon SageMaker Training Jobs Access to FileSystem Resources in Your Amazon VPC.
security_group_ids = # ['sg-xxxxxxxx'] 
subnets =  # [ 'subnet-xxxxxxx', ]
sagemaker_session = sagemaker.session.Session(boto_session=session)

faster_rcnn_kitti_estimator = Estimator(training_image,
                                         role, 
                                         instance_count=1, 
                                         instance_type='ml.p3.16xlarge',
                                         volume_size = 100,
                                         max_run = 400000,
                                         output_path=s3_output_location,
                                         sagemaker_session=sagemaker_session, 
                                         hyperparameters = hyperparameters,
                                         metric_definitions = metric_definitions,
                                         base_job_name="faster-rcnn-kitti-efs",
                                         subnets=subnets,
                                         security_group_ids=security_group_ids)



Finally, we launch the SageMaker training job. 

The time to complete the training depends on type and number of training instances, and the training image used for training. 

In [None]:
faster_rcnn_kitti_estimator.fit(inputs=data_channels, logs="All", wait=False)
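
Because the job is launched with `wait=False`, the call above returns immediately. One way to follow the job from the notebook is to poll its status. This is a sketch rather than part of the original workflow: the helper name `wait_for_training_job` is hypothetical, and it takes the describe function as an argument so that it can be exercised without AWS credentials.

```python
import time

def wait_for_training_job(describe_fn, job_name, poll_seconds=60):
    """Poll a training job until it reaches a terminal status and return that status."""
    while True:
        status = describe_fn(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)  # job is still 'InProgress' or 'Stopping'

# Usage in this notebook (requires AWS credentials, so it is left commented out):
# sm_client = boto3.client("sagemaker", region_name=region)
# job_name = faster_rcnn_kitti_estimator.latest_training_job.name
# print(wait_for_training_job(sm_client.describe_training_job, job_name))
```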