# Object Recognition using TAO Metric Learning Recognition

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is re-trained for use on a different task.

Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with users' own data.

<img align="center" src="https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png" width="1080">


## Learning Objectives

In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Train a model for object recognition on an [ImageNet](https://www.image-net.org/)-format classification dataset.
* Evaluate the trained model and save the results.
* Run inference with the trained model.
* Export the trained model to an ONNX (`.onnx`) file for deployment to DeepStream or TensorRT.

At the end of this notebook, you will have generated a trained `MLRecog` model which you may deploy via [DeepStream](https://developer.nvidia.com/deepstream-sdk).

## Table of Contents

This notebook shows an example use case of MLRecogNet using the Train Adapt Optimize (TAO) Toolkit.

0. [Set up env variables and map drives](#head-0)
1. [Installing the TAO launcher](#head-1)
2. [Prepare dataset](#head-2)
3. [Provide training specification](#head-3)
4. [Run TAO training](#head-4)
5. [Evaluate trained models](#head-5)
6. [Inferences](#head-6)
7. [Deploy](#head-7)


## 0. Set up env variables and map drives <a class="anchor" id="head-0"></a>

The TAO launcher uses Docker containers under the hood, so **our data and results directories must be mapped into the container to be visible to it**. The launcher is configured through the file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options such as environment variables and the amount of shared memory available to the TAO launcher. <br>

`IMPORTANT NOTE:` The code below creates a sample `~/.tao_mounts.json` file that maps the directories in which we save the data, models, specs, and results. Adjust it for your setup so that these directories are correctly visible to the Docker container.


In [None]:
import os

# Define the local project directory to be mapped into the TAO Docker container.
%env LOCAL_PROJECT_DIR=/path/to/local/tao-experiments

os.environ["HOST_DATA_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "data")
os.environ["HOST_RESULTS_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "ml_recognition", "results")
os.environ["HOST_MODEL_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "ml_recognition", "models")

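# Set this path if you don't run the notebook from the samples directory.
# %env NOTEBOOK_ROOT=/path/to/local/tao-experiments/metric_learning_recognition

# Default NOTEBOOK_ROOT to the current directory so that later cells
# (e.g. the dataset processing script) can rely on it being set.
os.environ.setdefault("NOTEBOOK_ROOT", os.getcwd())

# The sample spec files are in the same directory as the downloaded samples.
os.environ["HOST_SPECS_DIR"] = os.path.join(os.environ["NOTEBOOK_ROOT"], "specs")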


In [None]:
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR
! mkdir -p $HOST_MODEL_DIR

In [None]:
# Map the local directories into the TAO Docker container.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tao_configs = {
   "Mounts":[
       # Map the local project directory
       {
           "source": os.environ["LOCAL_PROJECT_DIR"],
           "destination": "/workspace/tao-experiments"
       },
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_MODEL_DIR"],
           "destination": "/model"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tao_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
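Conceptually, a mount makes everything under a host `source` directory appear under its container `destination`, which is why the later commands refer to `/data`, `/specs`, and `/results`. The hypothetical helper `to_container_path` below sketches that translation for the `Mounts` list format above; it is an illustration, not the launcher's own code. Because the data directory is nested inside the project directory, it prefers the longest matching source.

```python
import os


def to_container_path(host_path, mounts):
    """Translate a host path into the matching path inside the container.

    Hypothetical helper for illustration only. `mounts` has the same
    {"source": ..., "destination": ...} shape as the "Mounts" list above.
    Returns None if the path is not under any mounted directory.
    """
    host_path = os.path.abspath(host_path)
    best_source, best_destination = None, None
    for mount in mounts:
        source = os.path.abspath(mount["source"])
        # The mount applies if the host path is the source or lies beneath it.
        if os.path.commonpath([host_path, source]) == source:
            # Prefer the most specific (longest) matching source directory.
            if best_source is None or len(source) > len(best_source):
                best_source, best_destination = source, mount["destination"]
    if best_source is None:
        return None
    relative = os.path.relpath(host_path, best_source)
    if relative == os.curdir:
        return best_destination
    return best_destination.rstrip("/") + "/" + relative.replace(os.sep, "/")


# Placeholder host paths that mirror the layout configured above.
example_mounts = [
    {"source": "/home/user/tao-experiments", "destination": "/workspace/tao-experiments"},
    {"source": "/home/user/tao-experiments/data", "destination": "/data"},
]
print(to_container_path("/home/user/tao-experiments/data/metric_learning_recognition", example_mounts))
```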

## 1. Installing the TAO launcher <a class="anchor" id="head-1"></a>
The TAO launcher is a Python package distributed as a wheel on PyPI. You may install the launcher by executing the following cell.

Please note that TAO Toolkit recommends running the TAO launcher in a virtual environment with Python 3.6.9 or later. You may follow the instructions on this [page](https://virtualenvwrapper.readthedocs.io/en/latest/install.html) to set up a Python virtual environment using the `virtualenv` and `virtualenvwrapper` packages. Once you have set up virtualenvwrapper, set the version of Python used in the virtual environment with the `VIRTUALENVWRAPPER_PYTHON` variable. You may do so by running

```sh
export VIRTUALENVWRAPPER_PYTHON=/path/to/bin/python3.x
```
where x >= 6 and <= 8

We recommend performing this step first and then launching the notebook from the virtual environment. In addition to installing the TAO Python package, make sure the following software requirements are met:
* python >=3.6.9 < 3.8.x
* docker-ce > 19.03.5
* docker-API 1.40
* nvidia-container-toolkit > 1.3.0-1
* nvidia-container-runtime > 3.4.0-1
* nvidia-docker2 > 2.5.0-1
* nvidia-driver >= 455
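To check version constraints like these from Python, the sketch below compares dotted version strings numerically (so `19.03.5` is not compared as text). The helpers `version_tuple` and `at_least` are hypothetical, they only test inclusive lower bounds, and the sample `docker --version` string is a made-up example rather than output from your machine.

```python
import re
import sys


def version_tuple(text):
    """Extract the first dotted version number in `text`, e.g. '19.03.5' -> (19, 3, 5)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"No version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(found, minimum):
    """Return True if the version in `found` is >= the version in `minimum`."""
    a, b = version_tuple(found), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "3.8" and "3.8.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Check the running interpreter against the lower bound listed above.
python_version = ".".join(str(part) for part in sys.version_info[:3])
print("Python", python_version, "meets 3.6.9:", at_least(python_version, "3.6.9"))

# The same helper can parse tool output, e.g. the text printed by `docker --version`.
print(at_least("Docker version 20.10.21, build baeda1f", "19.03.5"))
```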

Once you have installed the prerequisites, log in to the Docker registry nvcr.io by running the command below:

```sh
docker login nvcr.io
```

You will be prompted to enter a username and password. The username is `$oauthtoken` and the password is the API key generated from `ngc.nvidia.com`. Follow the instructions in the [NGC setup guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#generating-api-key) to generate your own API key.


In [None]:
# SKIP this step IF you have already installed the TAO launcher.
!pip3 install nvidia-tao

In [None]:
# View the versions of the TAO launcher
!tao info --verbose

## 2. Prepare dataset <a class="anchor" id="head-2"></a>

Here we use the [Retail Product Checkout dataset](https://www.kaggle.com/datasets/diyer22/retail-product-checkout-dataset) to illustrate how to train the metric learning recognition model for retail item recognition.

In [None]:
# [Action required] Download the dataset manually.
# [Action required] Put your downloaded .zip dataset file at $HOST_DATA_DIR/retail-product-checkout-dataset.zip
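Before extracting, you can confirm that the archive is in place and readable. This sketch uses only the standard `zipfile` module; `inspect_archive` and `top_level_entries` are hypothetical helpers, and the printed entries depend on how the downloaded archive is laid out.

```python
import os
import zipfile


def top_level_entries(member_names):
    """Return the sorted, de-duplicated top-level names from a zip member list."""
    # Zip member names always use "/" as the separator.
    return sorted({name.split("/", 1)[0] for name in member_names if name})


def inspect_archive(zip_path):
    """Return the archive's top-level entries, or None if it is missing or not a zip."""
    if not os.path.isfile(zip_path) or not zipfile.is_zipfile(zip_path):
        return None
    with zipfile.ZipFile(zip_path) as archive:
        return top_level_entries(archive.namelist())


zip_path = os.path.join(os.environ.get("HOST_DATA_DIR", "."), "retail-product-checkout-dataset.zip")
entries = inspect_archive(zip_path)
print(entries if entries is not None else f"Archive missing or invalid: {zip_path}")
```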

In [None]:
# Extract the files (requires `unzip`, e.g. `apt-get install unzip`)
!mkdir -p $HOST_DATA_DIR/metric_learning_recognition
!unzip $HOST_DATA_DIR/retail-product-checkout-dataset.zip -d $HOST_DATA_DIR/metric_learning_recognition

# Set the dataset root folder path (relative to the data directory)
%env DATA_FOLDER=metric_learning_recognition/retail-product-checkout-dataset_classification_demo

# Run the data processing script, which:
# 1. crops the images and saves them as a classification dataset
# 2. splits the dataset into train/val/test/reference sets
# 3. separates the classes into known and unknown classes

# Install the packages required by the processing script, if needed
!pip install opencv-python pycocotools tqdm
# Now run the processing script
!python $NOTEBOOK_ROOT/process_retail_product_checkout_dataset.py
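The split steps above can be illustrated with a small, self-contained sketch. It is not the logic of `process_retail_product_checkout_dataset.py`: the helpers `split_classes` and `split_images`, the 20% unknown-class fraction, and the per-subset ratios are assumptions chosen only to show how known/unknown classes and train/reference/val/test subsets are kept disjoint. A fixed seed makes the split reproducible.

```python
import random


def split_classes(class_names, unknown_fraction, seed=0):
    """Randomly partition class names into (known, unknown) lists."""
    names = sorted(class_names)
    random.Random(seed).shuffle(names)
    num_unknown = round(len(names) * unknown_fraction)
    return sorted(names[num_unknown:]), sorted(names[:num_unknown])


def split_images(image_paths, fractions, seed=0):
    """Randomly split one class's images into named subsets by fraction.

    `fractions` maps subset name -> fraction; the last subset receives the
    remainder so that every image lands in exactly one subset.
    """
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    subsets, start = {}, 0
    names = list(fractions)
    for index, name in enumerate(names):
        if index == len(names) - 1:
            end = len(paths)
        else:
            end = min(len(paths), start + round(len(paths) * fractions[name]))
        subsets[name] = paths[start:end]
        start = end
    return subsets


# Hypothetical class names, image names, and ratios, for illustration only.
known, unknown = split_classes([f"class_{i:02d}" for i in range(10)], unknown_fraction=0.2)
subsets = split_images(
    [f"img_{i:03d}.jpg" for i in range(20)],
    {"train": 0.6, "reference": 0.1, "val": 0.1, "test": 0.2},
)
print(len(known), len(unknown), {name: len(paths) for name, paths in subsets.items()})
```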

In [None]:
# Verify
!ls -l $HOST_DATA_DIR/$DATA_FOLDER/known_classes

In [None]:
!ls -l $HOST_DATA_DIR/$DATA_FOLDER/unknown_classes
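Beyond `ls`, a per-class image count helps spot empty or very small classes after processing. This sketch assumes the ImageNet-style layout used in this notebook (each split folder holds one subfolder per class); `count_images_per_class`, `is_image`, and the extension list are hypothetical.

```python
import os

# Assumed set of image extensions; adjust if your cropped images use others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def is_image(filename):
    """Return True if the file name has one of the assumed image extensions."""
    return os.path.splitext(filename)[1].lower() in IMAGE_EXTENSIONS


def count_images_per_class(split_dir):
    """Map each class subfolder of an ImageNet-style split directory to its image count."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = sum(1 for name in os.listdir(class_dir) if is_image(name))
    return counts


dataset_root = os.path.join(os.environ.get("HOST_DATA_DIR", "."), os.environ.get("DATA_FOLDER", ""))
for group in ("known_classes", "unknown_classes"):
    group_dir = os.path.join(dataset_root, group)
    if not os.path.isdir(group_dir):
        continue
    for split in sorted(os.listdir(group_dir)):
        split_dir = os.path.join(group_dir, split)
        if os.path.isdir(split_dir):
            counts = count_images_per_class(split_dir)
            print(f"{group}/{split}: {len(counts)} classes, {sum(counts.values())} images")
```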

## 3. Provide training specification <a class="anchor" id="head-3"></a>

The specification files configure the parameters of each subtask, including:

* results_dir: the global output directory. Train/evaluation/inference/export subdirectories are created under it for the corresponding subtasks. It can be overridden by a subtask's own `results_dir` field.

* model: configures the model
  * backbone: backbone architecture; currently only `resnet_50` and `resnet_101` are supported
  * pretrain_choice: source of the pretrained weights (`imagenet`)
  * pretrained_model_path: path to the pretrained model weights
  * input_width: width of an input image
  * input_height: height of an input image
  * input_channels: number of color channels in the input images (always channel-first format)
  * feat_dim: size of the output embedding

* train: configures the training hyperparameters
  * optim: optimizer configuration
  * num_epochs: number of training epochs
  * checkpoint_interval: how often to save model checkpoints
  * grad_clip: gradient clipping setting
  * smooth_loss: whether to enable label smoothing (True/False)
  * batch_size: number of images per training batch
  * val_batch_size: number of images per validation batch
  * resume_training_checkpoint_path: path to a saved `.pth` checkpoint from which to resume training
  * report_accuracy_per_class: if True, report per-class accuracy instead of the average class accuracy
  
* dataset: configures the dataset and augmentation methods
  * train_dataset: path to the training dataset directory
  * val_dataset: the validation or test dataset, given as a map with `reference` and `query` directory paths
  * workers: number of data-loading workers
  * pixel_mean: per-channel pixel mean (3 values) for normalization
  * pixel_std: per-channel pixel standard deviation (3 values) for normalization
  * prob: probability of randomly flipping images horizontally
  * re_prob: probability of applying random erasing
  * gaussian_blur: Gaussian blur augmentation settings
  * color_augmentation: color augmentation settings
  * num_instance: number of image instances of the same class in a batch
  * class_map: path to a YAML file that maps dataset class names to new class names

* evaluate: configures the evaluate subtask
  * checkpoint: the `.pth` model to evaluate
  * trt_engine: path to the TensorRT engine to evaluate
  * report_accuracy_per_class: if True, report per-class accuracy instead of the average class accuracy
  * gpu_id: index of the single GPU used for evaluation; defaults to 0
  * topk: number of nearest neighbors (k) used to make predictions
  * batch_size: batch size for evaluation
  * results_dir: the evaluation output directory; takes priority over the global `results_dir`

* inference: configures the inference subtask
  * inference_input_type: format of the query input (`image`, `image_folder`, or `classification_folder`)
  * checkpoint: the `.pth` model used for inference
  * trt_engine: path to the TensorRT engine used for inference
  * input_path: path to the inference image, image folder, or classification dataset folder
  * topk: number of nearest neighbors (k) used to make predictions
  * gpu_id: index of the single GPU used for inference; defaults to 0
  * batch_size: batch size for inference
  * results_dir: the inference output directory; takes priority over the global `results_dir`
 
* export: configures the export subtask
  * checkpoint: the `.pth` model to export to ONNX
  * onnx_file: path of the exported ONNX model; takes priority over the default ONNX file name derived from `export.results_dir`
  * gpu_id: index of the single GPU used for export; defaults to 0
 
 
* gen_trt_engine: configures the TensorRT engine generation subtask
  * gpu_id: index of the single GPU used for TensorRT engine generation; defaults to 0
  * onnx_file: path to the ONNX file to convert into a TensorRT engine
  * trt_engine: path of the TensorRT engine to generate
  * batch_size: batch size of the TensorRT engine; `batch_size=-1` generates an engine with a dynamic batch size
  * verbose: if True, print verbose TensorRT engine generation logs
  * tensorrt: TensorRT engine build settings, such as the data type and INT8 calibration options
  * results_dir: the TensorRT engine generation output folder; takes priority over the global `results_dir`

Refer to the TAO documentation for MLRecogNet for the full list of configurable parameters.
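The `results_dir` precedence described above can be summarized as: a subtask's own `results_dir` wins, otherwise the subtask writes to a subdirectory of the global `results_dir`. The hypothetical `resolve_results_dir` below is a sketch of that rule, not TAO's implementation. It assumes the default subdirectory is named after the subtask, as the `train`, `inference`, and `export` folders listed later in this notebook suggest; pass `subdir` where the name differs.

```python
import os


def resolve_results_dir(spec, subtask, subdir=None):
    """Return the output directory for `subtask` following the precedence above.

    A subtask-level `results_dir` wins; otherwise a subdirectory named after
    the subtask (or `subdir`, if given) is used under the global `results_dir`.
    """
    subtask_dir = spec.get(subtask, {}).get("results_dir")
    if subtask_dir:
        return subtask_dir
    return os.path.join(spec["results_dir"], subdir or subtask)


# Placeholder spec, mirroring the "evaluate on unknown classes" cell.
spec = {
    "results_dir": "/results",
    "train": {},
    "evaluate": {"results_dir": "/results/eval_unknown"},
}
print(resolve_results_dir(spec, "train"))     # falls back to the global directory
print(resolve_results_dir(spec, "evaluate"))  # uses the subtask override
```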

In [None]:
!cat $HOST_SPECS_DIR/train.yaml

## 4. Run TAO training <a class="anchor" id="head-4"></a>
* Provide the sample spec file and the output directory location for models.
* WARNING: Training can take from several hours to a day to complete, depending on your hardware.

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Docker container.
# They match the mount destinations defined in ~/.tao_mounts.json.
%env DATA_DIR=/data
%env MODEL_DIR=/model
%env SPECS_DIR=/specs
%env RESULTS_DIR=/results


### 4.1 Train Metric Learning Recognition model

We will train an MLRecog model with a ResNet101 backbone and a 2048-dimensional output embedding. The backbone is initialized with weights pretrained on NVImageNetV2 (the same classes as [ImageNet](https://www.image-net.org/), but built from licensed datasets).

In [None]:
# Epoch number of the checkpoint to keep; it selects the checkpoint file in the rename step below.
%env EPOCH=149

In [None]:
print("Train model")
! tao model ml_recog train \
              -e $SPECS_DIR/train.yaml \
              -r $RESULTS_DIR \
              dataset.train_dataset=$DATA_DIR/$DATA_FOLDER/known_classes/train \
              dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
              dataset.val_dataset.query=$DATA_DIR/$DATA_FOLDER/known_classes/val
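The trailing `key=value` arguments in the command above override fields of `train.yaml` by their dotted path; for example, `dataset.val_dataset.reference` sets the `reference` entry inside `dataset.val_dataset`. The hypothetical `apply_overrides` below is a simplified sketch of that idea, not the parser TAO uses, and it keeps every value as a string.

```python
def apply_overrides(spec, overrides):
    """Apply "a.b.c=value" strings to a nested dict, creating levels as needed.

    Simplified illustration: values stay strings, whereas a real config
    system would also parse numbers, booleans, and lists.
    """
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {item!r}")
        *parents, leaf = key.strip().split(".")
        node = spec
        for part in parents:
            # Walk (or create) one nested level per dotted key component.
            node = node.setdefault(part, {})
        node[leaf] = value.strip()
    return spec


# Placeholder spec and paths, for illustration only.
spec = {"dataset": {"train_dataset": "/data/placeholder", "val_dataset": {}}}
apply_overrides(spec, [
    "dataset.train_dataset=/data/demo/known_classes/train",
    "dataset.val_dataset.reference=/data/demo/known_classes/reference",
    "dataset.val_dataset.query=/data/demo/known_classes/val",
])
print(spec)
```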

In [None]:

## Training command for multi-GPU training. Set the `train.gpu_ids` parameter to choose how many GPUs, and which ones, to use.
## The following command triggers multi-GPU training on GPU 0 and GPU 1.
# ! tao model ml_recog train \
#               -e $SPECS_DIR/train.yaml \
#               -r $RESULTS_DIR \
#               dataset.train_dataset=$DATA_DIR/$DATA_FOLDER/known_classes/train \
#               dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
#               dataset.val_dataset.query=$DATA_DIR/$DATA_FOLDER/known_classes/val \
#               train.gpu_ids=[0,1]

In [None]:
print('checkpoints:')
print('---------------------')
!ls -ltrh $HOST_RESULTS_DIR/train

In [None]:
print('Rename a model:')
print('---------------------')
# NOTE: The following command may require `sudo`. You can run the command outside the notebook.
!find $HOST_RESULTS_DIR/train -name "*epoch=$EPOCH*" | xargs realpath | xargs -I {} mv {} $HOST_RESULTS_DIR/train/resnet101_model.pth
!ls -ltrh $HOST_RESULTS_DIR/train/resnet101_model.pth
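The rename cell relies on a shell pipeline. A Python alternative is sketched below; `matches_epoch` and `rename_checkpoint` are hypothetical helpers, and the checkpoint file names used to exercise them are placeholders. Unlike a plain `*epoch=14*` glob, the pattern does not match `epoch=149` when you ask for epoch 14, and the function stops if it finds zero or several candidates instead of moving them blindly.

```python
import os
import re
import shutil


def matches_epoch(filename, epoch):
    """True if `filename` contains exactly `epoch=<epoch>` (so 14 does not match 149)."""
    return re.search(rf"epoch={epoch}(?!\d)", filename) is not None


def rename_checkpoint(train_dir, epoch, new_name="resnet101_model.pth"):
    """Move the single checkpoint for `epoch` under `train_dir` to `train_dir/new_name`."""
    matches = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(train_dir)
        for name in files
        if matches_epoch(name, epoch)
    ]
    # Fail loudly instead of silently moving zero or several files.
    if len(matches) != 1:
        raise RuntimeError(f"Expected one checkpoint for epoch {epoch}, found: {matches}")
    destination = os.path.join(train_dir, new_name)
    shutil.move(matches[0], destination)
    return destination


# Example (commented out so that pasting this cell does not move files):
# rename_checkpoint(os.path.join(os.environ["HOST_RESULTS_DIR"], "train"), int(os.environ["EPOCH"]))
```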

## 5. Evaluate trained models <a class="anchor" id="head-5"></a>
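Evaluate the trained model on the test images of both the known classes (seen during training) and the unknown classes (not seen during training), matching them against the corresponding reference set.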
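Because the model outputs embeddings rather than class scores, evaluation matches each query image against the reference set. The sketch below computes nearest-neighbor top-1 accuracy with cosine similarity on toy 2-D embeddings; the helpers and data are hypothetical, and TAO's evaluator may use a different distance, `topk` value, and set of metrics.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_accuracy(query, reference):
    """Fraction of queries whose most similar reference embedding has the same label.

    `query` and `reference` are lists of (embedding, label) pairs.
    """
    correct = 0
    for embedding, label in query:
        _, nearest_label = max(reference, key=lambda ref: cosine_similarity(embedding, ref[0]))
        correct += nearest_label == label
    return correct / len(query)


# Toy 2-D embeddings stand in for the model's 2048-D outputs; labels are made up.
reference = [([1.0, 0.0], "cola"), ([0.0, 1.0], "chips")]
query = [([0.9, 0.1], "cola"), ([0.2, 0.8], "chips"), ([0.7, 0.3], "chips")]
print(top1_accuracy(query, reference))
```

The third query is closer to the "cola" reference, so only two of the three queries count as correct.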

In [None]:
# evaluate on known classes
! tao model ml_recog evaluate \
            -e $SPECS_DIR/evaluate.yaml \
            -r $RESULTS_DIR \
            evaluate.checkpoint=$RESULTS_DIR/train/resnet101_model.pth \
            dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
            dataset.val_dataset.query=$DATA_DIR/$DATA_FOLDER/known_classes/test 


In [None]:
# evaluate on unknown classes
! tao model ml_recog evaluate \
            -e $SPECS_DIR/evaluate.yaml \
            evaluate.results_dir=$RESULTS_DIR/eval_unknown \
            evaluate.checkpoint=$RESULTS_DIR/train/resnet101_model.pth \
            dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/unknown_classes/reference \
            dataset.val_dataset.query=$DATA_DIR/$DATA_FOLDER/unknown_classes/test 


## 6. Inferences <a class="anchor" id="head-6"></a>
In this section, we run `tao model ml_recog inference` with the trained model on the test images of the known and unknown classes and save the results under `$RESULTS_DIR`.
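The `topk` parameter controls how many nearest reference embeddings contribute to a prediction. The sketch below ranks references by cosine similarity and takes a majority vote over the top k labels; `predict_topk` and the toy embeddings are hypothetical, and the exact voting rule TAO applies is not described in this notebook.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_topk(query_embedding, reference, k=3):
    """Return (majority label, neighbor labels) among the k most similar references."""
    ranked = sorted(reference, key=lambda ref: cosine_similarity(query_embedding, ref[0]), reverse=True)
    neighbor_labels = [label for _embedding, label in ranked[:k]]
    # most_common keeps first-seen order for ties, so the closer neighbor wins a tie.
    majority = Counter(neighbor_labels).most_common(1)[0][0]
    return majority, neighbor_labels


# Toy reference set: 2-D embeddings with made-up labels.
reference = [
    ([1.0, 0.0], "cola"), ([0.8, 0.3], "cola"),
    ([0.0, 1.0], "chips"), ([0.5, 0.9], "chips"),
]
print(predict_topk([0.9, 0.1], reference, k=3))
```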

In [None]:
# run inference on known classes
! tao model ml_recog inference \
                    -e $SPECS_DIR/infer.yaml \
                    -r $RESULTS_DIR \
                    inference.checkpoint=$RESULTS_DIR/train/resnet101_model.pth \
                    dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
                    inference.input_path=$DATA_DIR/$DATA_FOLDER/known_classes/test 

In [None]:
# run inference on unknown classes
! tao model ml_recog inference \
                    -e $SPECS_DIR/infer.yaml \
                    inference.results_dir=$RESULTS_DIR/inference_unknown \
                    inference.checkpoint=$RESULTS_DIR/train/resnet101_model.pth \
                    dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/unknown_classes/reference \
                    inference.input_path=$DATA_DIR/$DATA_FOLDER/unknown_classes/test 

In [None]:
print('Inference results:')
print('------------')
!ls -lth $HOST_RESULTS_DIR/inference
!ls -lth $HOST_RESULTS_DIR/inference_unknown

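## 7. Deploy <a class="anchor" id="head-7"></a>

Export the trained model to ONNX, then use TAO Deploy to generate TensorRT engines (including an INT8 engine) and to evaluate and run inference with them.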

In [None]:
# Export the model to ONNX model.
! tao model ml_recog export \
                   -e $SPECS_DIR/export.yaml \
                   -r $RESULTS_DIR \
                   export.checkpoint=$RESULTS_DIR/train/resnet101_model.pth \
                   export.onnx_file=$RESULTS_DIR/export/resnet101_model.onnx
               

In [None]:
print('Exported model:')
print('------------')
!ls -lth $HOST_RESULTS_DIR/export

In [None]:
# Generate TensorRT engine using tao deploy
!tao deploy ml_recog gen_trt_engine -e $SPECS_DIR/gen_trt_engine.yaml \
                                       gen_trt_engine.onnx_file=$RESULTS_DIR/export/resnet101_model.onnx \
                                       gen_trt_engine.trt_engine=$RESULTS_DIR/gen_trt_engine/resnet101_model.engine \
                                       results_dir=$RESULTS_DIR


In [None]:
# Generate an INT8 TensorRT engine using tao deploy
!tao deploy ml_recog gen_trt_engine -e $SPECS_DIR/gen_trt_engine.yaml \
                                       gen_trt_engine.onnx_file=$RESULTS_DIR/export/resnet101_model.onnx \
                                       gen_trt_engine.trt_engine=$RESULTS_DIR/gen_trt_engine/resnet101_model.int8.engine \
                                       results_dir=$RESULTS_DIR \
                                       gen_trt_engine.tensorrt.data_type=int8 \
                                       gen_trt_engine.tensorrt.calibration.cal_image_dir=[$DATA_DIR/$DATA_FOLDER/known_classes/test] \
                                       gen_trt_engine.tensorrt.calibration.cal_cache_file=$RESULTS_DIR/gen_trt_engine/cal_resnet101_model.int8.bin
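The two `gen_trt_engine` cells differ only in the INT8 overrides. If you prefer to build these commands from Python, the hypothetical `gen_trt_engine_args` helper below assembles the same arguments; it only uses overrides that appear in the cells above, and the calibration directory path in the example is a placeholder.

```python
def gen_trt_engine_args(onnx_file, engine_file, results_dir,
                        spec_file="/specs/gen_trt_engine.yaml",
                        cal_image_dir=None, cal_cache_file=None):
    """Build the argument list for `tao deploy ml_recog gen_trt_engine`.

    Uses only the overrides shown in the cells above. Passing `cal_image_dir`
    switches to an INT8 build with calibration settings.
    """
    args = [
        "tao", "deploy", "ml_recog", "gen_trt_engine", "-e", spec_file,
        f"gen_trt_engine.onnx_file={onnx_file}",
        f"gen_trt_engine.trt_engine={engine_file}",
        f"results_dir={results_dir}",
    ]
    if cal_image_dir is not None:
        args.append("gen_trt_engine.tensorrt.data_type=int8")
        # The calibration image directory is passed as a one-element list, as above.
        args.append(f"gen_trt_engine.tensorrt.calibration.cal_image_dir=[{cal_image_dir}]")
        if cal_cache_file is not None:
            args.append(f"gen_trt_engine.tensorrt.calibration.cal_cache_file={cal_cache_file}")
    return args


int8_args = gen_trt_engine_args(
    "/results/export/resnet101_model.onnx",
    "/results/gen_trt_engine/resnet101_model.int8.engine",
    "/results",
    cal_image_dir="/data/demo/known_classes/test",
    cal_cache_file="/results/gen_trt_engine/cal_resnet101_model.int8.bin",
)
print(" ".join(int8_args))
# To run it from Python instead of a "!" cell: subprocess.run(int8_args, check=True)
```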


In [None]:
print('Generated TensorRT engines and calibration files:')
print('------------')
!ls -lth $HOST_RESULTS_DIR/gen_trt_engine

In [None]:
# Evaluate with generated TensorRT engine
!tao deploy ml_recog evaluate -e $SPECS_DIR/evaluate.yaml \
                                 evaluate.trt_engine=$RESULTS_DIR/gen_trt_engine/resnet101_model.engine \
                                 results_dir=$RESULTS_DIR \
                                 dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
                                 dataset.val_dataset.query=$DATA_DIR/$DATA_FOLDER/known_classes/test


In [None]:
# Inference with generated TensorRT engine
!tao deploy ml_recog inference -e $SPECS_DIR/infer.yaml \
                                  inference.trt_engine=$RESULTS_DIR/gen_trt_engine/resnet101_model.engine \
                                  results_dir=$RESULTS_DIR \
                                  dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
                                  inference.input_path=$DATA_DIR/$DATA_FOLDER/known_classes/test


In [None]:
print('TensorRT Inference results:')
print('------------')
!ls -lth $HOST_RESULTS_DIR/trt_inference