<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Optimizing a Video AI Application #
The effectiveness of a video AI application depends largely on the inference performance of its video AI model(s). Thus far we have trained a video AI model with the TAO Toolkit, but we have not considered its inference performance. Inference performance determines whether the DeepStream pipeline runs smoothly and without delays. Furthermore, an optimized model can be deployed on edge devices that have fewer computational resources. A complete model training workflow therefore includes an optimization step after training, using features such as pruning and quantization, before deployment. 

<img src='images/optimized_pre-trained_model_workflow.png' width=1080>

## Learning Objectives ##
In this notebook, you will learn how to use the TAO Toolkit to optimize a model for inference performance, including: 
* Building a Multi-source DeepStream Pipeline
* Fine-Tuning a Video AI Model for Deployment to DeepStream
* Pruning a Trained Detectnet_v2 Model
* Using Quantization-Aware Training

**Table of Contents** 
<br>
This notebook covers the below sections: 
1. [Multi-source DeepStream Pipeline](#s1)
    * [Exercise #1 - Build a DeepStream Pipeline with Multiple Sources](#e1)
2. [Optimizing Video AI Model for Inference](#s2)
    * [Model Pruning](#s2.1)
    * [Evaluate Pruned Model](#s2.2)
    * [Exercise #2 - Model Comparison](#e2)
    * [Retrain Pruned Model with Quantization-Aware Training](#s2.3)
    * [Exercise #3 - Convert Pruned Model to QAT and Retrain](#e3)
3. [Evaluate Retrained Model](#s3)
4. [Export with Calibration Cache](#s4)
5. [Deployment to DeepStream](#s5)

Execute the below cell to set directories for the TAO Toolkit. 

In [4]:
# DO NOT CHANGE THIS CELL
# Set and create directories for the TAO Toolkit experiment
import os

#!mkdir logs
os.environ['PROJECT_DIR']='/dli/task/tao_project'
os.environ['SOURCE_DATA_DIR']='/dli/task/data'
os.environ['DATA_DIR']='/dli/task/tao_project/data'
os.environ['MODELS_DIR']='/dli/task/tao_project/models'
os.environ['SPEC_FILES_DIR']='/dli/task/spec_files'

<a name='s1'></a>
## Multi-source DeepStream Pipeline ##
The DeepStream SDK enables building a pipeline with multiple input video streams. When there are multiple input sources, each source must have its own decoder and be linked to the `Gst-nvstreammux`. The `Gst-nvstreammux` plugin, referred to as the **muxer**, forms a batch of frames from multiple input sources. When connecting a source to the muxer, a new pad must be requested from the muxer using `get_request_pad()` with the pad template `sink_%u`. The muxer forms a batched buffer with `<batch-size>` frames, a property that is set using `set_property()`. If the muxer’s output format and input format are the same, the muxer forwards the frames from that source as a part of the muxer’s output batched buffer. If the resolutions are not the same, the muxer scales frames from the input into the batched buffer. The muxer ensures that all frames in the batch have the same resolution when it pushes the batch downstream. 

<a name='e1'></a>
#### Exercise #1 - Build a DeepStream Pipeline with Multiple Sources ####
To demonstrate a DeepStream pipeline with multiple inputs, we created a sample application [app_04.py](sample_apps/app_04.py) with the below architecture. This pipeline is very similar to the pipelines we've built so far with a few modifications: 
1. It takes _one_ video file and uses it for an arbitrary number of file sources (`filesrc`). 
2. It uses a tiler (`Gst-nvmultistreamtiler`) to composite a 2D tile from batched buffers, which needs to have the `rows`, `columns`, `width`, and `height` properties set. 
3. It uses the Object Detection model we had built in the previous notebook. 
4. The probe callback function is attached to the source pad of the tiler. 
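
The tiler properties listed in item 2 depend on the number of sources. The sketch below shows one common way to derive a near-square grid from the source count. The helper name `tiler_layout` and the default output size of 1280x720 are assumptions made for illustration; `app_04.py` may compute or hard-code these values differently.

```python
import math


def tiler_layout(number_sources, output_width=1280, output_height=720):
    """Return values for the tiler's rows, columns, width, and height properties.

    Illustrative helper only: the name and the default output size are
    assumptions, not taken from app_04.py.
    """
    if number_sources < 1:
        raise ValueError("number_sources must be at least 1")
    # Use the integer square root for the rows, then add enough columns
    # to hold every source.
    rows = int(math.sqrt(number_sources))
    columns = int(math.ceil(number_sources / rows))
    return {
        "rows": rows,
        "columns": columns,
        "width": output_width,
        "height": output_height,
    }


for n in (1, 4, 8, 16):
    print(n, tiler_layout(n))
```

In a pipeline, each entry of the returned dictionary would be applied to the tiler element with `set_property()`, in the same way the muxer's `batch-size` is set.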

We can run the pipeline by executing the script and passing 4 arguments as: <br> `python sample_apps/app_04.py <path to input h264 video> <path to nvinfer config file> <number of file sources> <name of output file>`. 

<p><img src='images/multi_input_pipeline.png' width=1080></p>

**Instructions**:<br>
* Review the code for [app_04.py](sample_apps/app_04.py). 
* Modify the `<FIXME>`s only to create the necessary elements that will connect to the `Gst-nvstreammux`, iteratively based on the arguments passed. Please **save changes** to the file. 
* Execute the below cells to review the nvinfer config file, run the DeepStream pipeline, and view the `nvidia-smi` log. 

In [5]:
# DO NOT CHANGE THIS CELL
# Read the nvinfer config file
!cat $SPEC_FILES_DIR/pgie_config_trafficcamnet_retrained.txt

cat: /dli/task/spec_files/pgie_config_trafficcamnet_retrained.txt: No such file or directory


In [6]:
# DO NOT CHANGE THIS CELL
# Run the app_04.py DeepStream pipeline w/ the custom ResNet18 model
!nvidia-smi dmon -i 0 \
                 -s ucmt \
                 -c 20 > '/dli/task/logs/smi.log' & \
python sample_apps/app_04.py /dli/task/data/sample_30.h264 \
                            /dli/task/spec_files/pgie_config_resnet18_detector_unpruned.txt \
                            8 \
                            output_tiled.mp4

Creating Pipeline
Adding elements to Pipeline
Linking elements in the Pipeline
Now playing...
1 :  /dli/task/data/sample_30.h264
2 :  /dli/task/data/sample_30.h264
3 :  /dli/task/data/sample_30.h264
4 :  /dli/task/data/sample_30.h264
5 :  /dli/task/data/sample_30.h264
6 :  /dli/task/data/sample_30.h264
7 :  /dli/task/data/sample_30.h264
8 :  /dli/task/data/sample_30.h264
Starting pipeline
0:00:00.279386693  1918      0x350ed30 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-inference> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 1]: Trying to create engine from model files
0:00:18.241011005  1918      0x350ed30 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-inference> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1947> [UID = 1]: serialize cuda engine t

In [7]:
# DO NOT CHANGE THIS CELL
# Convert the output video to a format that is compatible with Jupyter Lab
!ffmpeg -i output_tiled.mp4 output_tiled_conv.mp4 \
        -y \
        -loglevel quiet

In [8]:
# DO NOT CHANGE THIS CELL
# Show video
from IPython.display import Video
Video('output_tiled_conv.mp4', width=720)

In [9]:
# DO NOT CHANGE THIS CELL
# Read the smi.log
!cat logs/smi.log

# gpu    sm   mem   enc   dec  mclk  pclk    fb  bar1 rxpci txpci
# Idx     %     %     %     %   MHz   MHz    MB    MB  MB/s  MB/s
    0     0     0     0     0   405   300     0     2     0     0
    0     1     0     0     0  5000   585   256     5     0     0
    0     5     0     0     0  5000   585   406     5   326    43
    0     6     0     0     0  5000   585  1068     5   334    54
    0    96    49     0     0  5000  1470  1374     5    36    12
    0    96    90     0     0  5000  1545  1374     5    25     7
    0    93    37     0     0  5000  1275  1374     5    22     2
    0    76    55     0     0  5000  1440  1374     5    41     8
    0    88    34     0     0  5000  1365  1374     5   144    25
    0    90    48     0     0  5000  1410  1374     5   264    16
    0    90    12     0     0  5000  1440  1374     5     3    59
    0    90    19     0     0  5000  1365  1374     5     5    59
    0    87    38     0     0  5000  1320  1374     5     3    58
    0    8

In [None]:
for i in range(number_sources):
    print('Creating source_bin ', i, end='\r')
    # Each input needs its own file source, parser, and decoder
    source = Gst.ElementFactory.make('filesrc', 'file-source_%u' % i)
    source.set_property('location', args[1])
    h264parser = Gst.ElementFactory.make('h264parse', 'h264-parser_%u' % i)
    decoder = Gst.ElementFactory.make('nvv4l2decoder', 'nvv4l2-decoder_%u' % i)
    pipeline.add(source)
    pipeline.add(h264parser)
    pipeline.add(decoder)
    source.link(h264parser)
    h264parser.link(decoder)
    # Request a new sink pad from the muxer and link the decoder's source pad to it
    padname = 'sink_%u' % i
    sinkpad = streammux.get_request_pad(padname)
    srcpad = decoder.get_static_pad('src')
    srcpad.link(sinkpad)

Click ... to show **solution**. 

**Observations**:<br>
When we process multiple input streams using our current unpruned model, the DeepStream pipeline begins to suffer in performance. 
1. At the bottom of the output from the pipeline run, we see that it took a while to process the 24-second clip, significantly longer than it took for a single input. The pipeline processed fewer than 30 frames per second, which is the frame rate of the input streams. This would result in a significant delay if the streams were live. See [GStreamer's Design Document on Blocking Probe](https://gstreamer.freedesktop.org/documentation/additional/design/probes.html?gi-language=c#blocking-probes) to find out more about why a delay will occur. 
2. We also saw in the `nvidia-smi` log that the Streaming Multiprocessor is at very high utilization for the duration of the pipeline. 
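
To put a number on the second observation, the `nvidia-smi dmon` log can be summarized programmatically. The following is a minimal sketch, not part of the course material: the helper name `sm_samples` and the 50% "busy" threshold are assumptions, and the sample text is a shortened copy of the log shown above, including its truncated last row.

```python
SAMPLE_LOG = """\
# gpu    sm   mem   enc   dec  mclk  pclk    fb  bar1 rxpci txpci
# Idx     %     %     %     %   MHz   MHz    MB    MB  MB/s  MB/s
    0     0     0     0     0   405   300     0     2     0     0
    0     5     0     0     0  5000   585   406     5   326    43
    0    96    49     0     0  5000  1470  1374     5    36    12
    0    96    90     0     0  5000  1545  1374     5    25     7
    0    93    37     0     0  5000  1275  1374     5    22     2
    0    8
"""


def sm_samples(log_text):
    """Return the SM utilization (%) of every complete row in a dmon log."""
    columns = None
    samples = []
    for line in log_text.splitlines():
        if line.startswith("# gpu"):
            # The first header line names the columns.
            columns = line.lstrip("#").split()
            continue
        if line.startswith("#") or not line.strip() or columns is None:
            continue
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip truncated or malformed rows
        value = dict(zip(columns, fields))["sm"]
        if value.isdigit():  # ignore placeholder values such as "-"
            samples.append(int(value))
    return samples


samples = sm_samples(SAMPLE_LOG)
# Treat samples at or above an assumed 50% threshold as "pipeline running".
busy = [s for s in samples if s >= 50]
mean_busy = sum(busy) / len(busy) if busy else 0.0
print("SM samples:", samples)
print("Mean SM utilization while busy: %.1f%%" % mean_busy)
```

To analyze a real run, the contents of `logs/smi.log` can be read with `open(...).read()` and passed to `sm_samples()` in place of `SAMPLE_LOG`.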

<a name='s2'></a>
## Optimizing Video AI Model for Inference ##
The TAO Toolkit offers several features to optimize a model for inference performance, including **pruning** and **quantization**. 

<a name='s2.1'></a>
### Model Pruning ###
Pruning is one way to optimize a model for better inference performance. It is one of the key differentiators of the TAO Toolkit and involves algorithmically removing neurons from the neural network that do not contribute significantly to the overall accuracy. Pruning reduces the overall size of the model significantly, resulting in a much lower memory footprint and higher inference throughput, which are very important for edge deployment. As a side effect, the pruning step typically reduces the accuracy of the model. So after pruning, the next step is to retrain the model on the same data set to recover the lost accuracy. 

<p><img src='images/pruning.svg' width=540></p>

More information about pruning can be found in this [NVIDIA Developer Blog](https://developer.nvidia.com/blog/transfer-learning-toolkit-pruning-intelligent-video-analytics/). 

When using the `prune` subtask, the `-m` argument indicates the path to the pre-trained model, the `-o` argument indicates the path to the output file, and the `-k` argument indicates the key to _load_ the model. Some optional arguments include: 
* `-eq, --equalization_criterion`: Criteria _(arithmetic_mean, geometric_mean, union (default), and intersection)_ to equalize the states of inputs to an element-wise op layer or depth-wise convolutional layer. This parameter is useful for _ResNets_ and _MobileNets_. 
* `-pg, --pruning_granularity`: Number of filters to remove at a time _(default=8)_. 
* `-pth`: Threshold to compare the normalized norm against _(default=0.1)_.
* `-nf, --min_num_filters`: Minimum number of filters to keep per layer _(default=16)_. 
* `-el, --excluded_layers`: List of excluded_layers _(default=[])_. 

Usually, we only need to adjust `-pth` (the threshold) to trade off accuracy against model size. A higher `pth` gives a smaller model (and thus higher inference speed) but worse accuracy. The threshold to use depends on the data set; a `pth` value of _0.1_ is just a starting point. If the accuracy after retraining is good, we can increase this value to get a smaller model. Otherwise, we can lower this value to get better accuracy.
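
To build intuition for how the threshold, granularity, and minimum filter count interact, here is a simplified sketch of magnitude-based filter selection for a single layer. It is **not** the TAO Toolkit's pruning algorithm, which also handles equalization across connected layers. The function name `select_filters`, the rounding rule, and the example norms are assumptions chosen for illustration.

```python
def select_filters(norms, threshold=0.1, granularity=8, min_num_filters=16):
    """Return the sorted indices of the filters to keep in one layer.

    Simplified illustration: a filter survives if its norm, normalized by
    the largest norm in the layer, exceeds the threshold. The kept count is
    then raised to at least min_num_filters and rounded up to a multiple of
    granularity (both assumptions about how such constraints could apply).
    """
    max_norm = max(norms)
    normalized = [n / max_norm if max_norm > 0 else 0.0 for n in norms]
    keep = sum(1 for n in normalized if n > threshold)
    keep = max(keep, min_num_filters)
    keep = -(-keep // granularity) * granularity  # round up to a multiple
    keep = min(keep, len(norms))
    # Keep the filters with the largest normalized norms.
    ranked = sorted(range(len(norms)), key=lambda i: normalized[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical per-filter norms for a layer with 8 filters
norms = [1.0, 0.05, 0.5, 0.02, 0.8, 0.01, 0.3, 0.04]
for pth in (0.1, 0.6):
    kept = select_filters(norms, threshold=pth, granularity=2, min_num_filters=2)
    print("pth=%.1f keeps filters %s" % (pth, kept))
```

Raising the threshold from 0.1 to 0.6 in this toy layer halves the number of retained filters, which mirrors the size versus accuracy trade-off described above.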

In [10]:
# DO NOT CHANGE THIS CELL
# View prune usage
!detectnet_v2 prune --help

Using TensorFlow backend.
usage: detectnet_v2 prune [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                          [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                          [--log_file LOG_FILE] -m MODEL -o OUTPUT_FILE -k KEY
                          [-n NORMALIZER] [-eq EQUALIZATION_CRITERION]
                          [-pg PRUNING_GRANULARITY] [-pth PRUNING_THRESHOLD]
                          [-nf MIN_NUM_FILTERS]
                          [-el [EXCLUDED_LAYERS [EXCLUDED_LAYERS ...]]] [-v]
                          {calibration_tensorfile,dataset_convert,evaluate,export,inference,prune,train}
                          ...

optional arguments:
  -h, --help            show this help message and exit
  --num_processes NUM_PROCESSES, -np NUM_PROCESSES
                        The number of horovod child processes to be spawned.
                        Default is -1(equal to --gpus).
  --gpus GPUS           The number of GPUs to be used for the job.
  --

In [11]:
# DO NOT CHANGE THIS CELL
# Create a new ResNet model folder and prune the resnet18_detector model
!rm -rf $MODELS_DIR/resnet18_detector_pruned
!mkdir -p $MODELS_DIR/resnet18_detector_pruned

!detectnet_v2 prune -m $MODELS_DIR/resnet18_detector/weights/resnet18_detector.tlt \
                    -o $MODELS_DIR/resnet18_detector_pruned/resnet18_detector_pruned.tlt \
                    -k tlt_encode

Using TensorFlow backend.
Using TensorFlow backend.
2022-12-23 02:55:30,250 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2022-12-23 02:55:30,837 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2022-12-23 02:55:46,729 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 0.15538467817115237


In [None]:
# DO NOT CHANGE THIS CELL
# List the model and sizes
!ls -rlt $MODELS_DIR/resnet18_detector/weights

!ls -rlt $MODELS_DIR/resnet18_detector_pruned

<a name='s2.2'></a>
### Evaluate Pruned Model ###
Once the model has been pruned, there can be a decrease in accuracy because some previously useful weights may have been removed. 

Execute the below cells to compare unpruned model evaluation with that of the pruned model. 

In [12]:
# DO NOT CHANGE THIS CELL
# Evaluate the unpruned model
!detectnet_v2 evaluate -e $SPEC_FILES_DIR/combined_training_config.txt \
                       -m $MODELS_DIR/resnet18_detector/weights/resnet18_detector.tlt \
                       -k tlt_encode

Using TensorFlow backend.
Using TensorFlow backend.

2022-12-23 02:56:05,279 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /dli/task/spec_files/combined_training_config.txt

2022-12-23 02:56:07,748 [INFO] iva.detectnet_v2.objectives.bbox_objective: Default L1 loss function will be used.
2022-12-23 02:56:08,023 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2022-12-23 02:56:08,023 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2022-12-23 02:56:08,023 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2022-12-23 02:56:08,023 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2022-12-23 02:56:08,023 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset siz

In [13]:
# DO NOT CHANGE THIS CELL
# Evaluate the pruned model
!detectnet_v2 evaluate -e $SPEC_FILES_DIR/combined_training_config.txt \
                       -m $MODELS_DIR/resnet18_detector_pruned/resnet18_detector_pruned.tlt \
                       -k tlt_encode

Using TensorFlow backend.
Using TensorFlow backend.

2022-12-23 02:57:12,621 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /dli/task/spec_files/combined_training_config.txt

2022-12-23 02:57:14,761 [INFO] iva.detectnet_v2.objectives.bbox_objective: Default L1 loss function will be used.
2022-12-23 02:57:15,054 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2022-12-23 02:57:15,054 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2022-12-23 02:57:15,054 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2022-12-23 02:57:15,054 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2022-12-23 02:57:15,054 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset siz

<a name='e2'></a>
#### Exercise #2 - Model Comparison ####
**Instructions**: 
* Study the outputs regarding the size and mean average precision (mAP) of the unpruned and pruned model. 
* Note down how the two models compare. 

In [None]:
##### WRITE ANSWERS HERE #####
#
# 
#
#
##############################

In [None]:
##### WRITE ANSWERS HERE #####
#
# The pruned model is significantly smaller in size but has a lower mean average precision. 
#
#
##############################

Click ... to show **solution**. 

<a name='s2.3'></a>
### Retrain Pruned Model with Quantization-Aware Training ###
To regain the accuracy, we recommend retraining the pruned model on the same data set using the `train` subtask with an updated spec file that points to the newly pruned model as the pre-trained model file. There are several things to consider when retraining: 
* The `regularizer` option should be turned off in the `training_config` for DetectNet_v2 to recover the accuracy when retraining a pruned model. It can be done by setting the regularizer type to `NO_REG`. All other parameters may be retained in the spec file from the previous training.
* The `load_graph` option should be set to `true` in the `model_config` to load the pruned model graph. 
* If after retraining, the model shows some decrease in mAP, it could be that the originally trained model was pruned a little too much. Please try reducing the pruning threshold (thereby reducing the pruning ratio) and use the new model to retrain.
* _Optionally_, DetectNet_v2 supports **Quantization-Aware Training** to help with optimizing the model. 

Deep neural network (DNN) models, such as those routinely used in video AI applications, are typically trained on servers with high-end GPUs available in data centers or private/public clouds. Such systems often use **floating-point 32-bit** arithmetic to take advantage of the wider dynamic range for the weights. After a model is trained, however, it often must be deployed at the edge on hardware that has fewer computational resources and a smaller power budget. Running DNN inference using the full 32-bit representation is not practical for real-time analysis given the compute, memory, and power constraints of the edge. 

To help reduce the compute budget, while not compromising on the structure and number of parameters in the model, we can run inference at a lower precision. It is advantageous in many cases to use **8-bit integer numbers** for weights. The challenge is that simply rounding the weights after training may result in a lower accuracy model, especially if the weights have a wide dynamic range. While 8-bit **quantization** is appealing to save compute and memory budgets, it is a lossy process. During quantization, a small range of floating-point numbers is squeezed into a fixed number of information buckets. This results in loss of information. In other words, the minute differences that could originally be resolved using 32-bit representations are lost because they get quantized to the same bucket in 8-bit representations. This is like the rounding errors that one encounters when representing fractional numbers as integers. 

To maintain accuracy during inference at lower precision, it is important to mitigate the errors arising from this loss of information with Quantization-Aware Training (QAT). QAT is used to train DNNs for lower precision INT8 deployment without compromising on accuracy. It emulates the inference-time quantization when training a model that may then be used by downstream inference platforms to generate actual quantized models. The error from quantizing weights and tensors to INT8 is modeled during training, allowing the model to adapt and mitigate the error. Technically, during QAT the model constructed in the training graph is modified to: 
1. Replace existing nodes with nodes that support fake quantization of their weights. 
2. Convert existing activation to ReLU-6 (except the output nodes). 
3. Add Quantize and De-Quantize (QDQ) nodes to compute the dynamic ranges of the intermediate tensors.  

The dynamic ranges computed during training are serialized to a **cache file** that is used at inference. 
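
The loss of information described above can be reproduced with a few lines of arithmetic. The sketch below applies a symmetric quantize and de-quantize round trip to individual values, which is the general idea behind a QDQ node. The function name `fake_quantize`, the symmetric scheme, and the example dynamic range of 1.27 are assumptions for illustration and are not taken from the TAO Toolkit implementation.

```python
def fake_quantize(x, dynamic_range, num_bits=8):
    """Quantize x to a signed integer bucket and map it back to a float.

    Assumes symmetric quantization: values in [-dynamic_range, dynamic_range]
    are mapped onto the integers [-127, 127] for 8 bits. Returns the pair
    (integer bucket, de-quantized float).
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    scale = dynamic_range / qmax            # float step between adjacent buckets
    q = round(x / scale)                    # nearest bucket (Python rounds ties to even)
    q = max(-qmax, min(qmax, q))            # clamp values outside the dynamic range
    return q, q * scale


# Two nearby FP32 values fall into the same INT8 bucket.
for value in (0.501, 0.504, 2.0):
    bucket, restored = fake_quantize(value, dynamic_range=1.27)
    print("x=%.3f -> bucket %d -> %.4f" % (value, bucket, restored))
```

With a dynamic range of 1.27, the step between buckets is about 0.01, so 0.501 and 0.504 become indistinguishable after quantization, and 2.0 is clamped to the largest bucket. During QAT, the network sees this rounded behavior in the forward pass and can adapt its weights to it.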

<p><img src='images/qat_training.png' width=720></p>

More information about Quantization-Aware Training can be found [here](https://developer.nvidia.com/blog/improving-int8-accuracy-using-quantization-aware-training-and-tao-toolkit/). 

<a name='e3'></a>
#### Exercise #3 - Convert Pruned Model to QAT and Retrain ####
Supported models can be converted to QAT models by setting the `enable_qat` parameter in the `training_config` component of the spec file to `true`. When creating a training configuration file for retraining, only the `enable_qat` and `regularizer` from the `training_config` component, and `pretrained_model_file` and `load_graph` from the `model_config` component are updated. 
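
Before launching a long retraining job, it can help to confirm that the spec actually contains the updated settings. The following is a small, hypothetical pre-flight check (the helper `check_qat_spec` is not part of the TAO Toolkit). It uses regular expressions on the spec text rather than a real protobuf parser, so it is only a rough sanity check.

```python
import re


def check_qat_spec(spec_text):
    """Report the QAT-related settings found in a DetectNet_v2 spec string.

    Hypothetical helper for illustration; it does not validate the spec.
    """
    def is_true(name):
        match = re.search(r"\b%s\s*:\s*(\w+)" % re.escape(name), spec_text)
        return match is not None and match.group(1).lower() == "true"

    model = re.search(r'pretrained_model_file\s*:\s*"([^"]*)"', spec_text)
    return {
        "enable_qat": is_true("enable_qat"),
        "load_graph": is_true("load_graph"),
        "pretrained_model_file": model.group(1) if model else None,
    }


# Shortened example spec; the path matches the pruned model created earlier.
example_spec = '''
model_config {
  pretrained_model_file: "/dli/task/tao_project/models/resnet18_detector_pruned/resnet18_detector_pruned.tlt"
  load_graph: true
}
training_config: {
  enable_qat: true
}
'''
print(check_qat_spec(example_spec))
```

The same function could be applied to the text of the combined spec file once it has been written.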

**Instructions**:<br>
* Modify the `model_config`[(separate qat version here)](spec_files/model_config_qat.txt) section and the `training_config`[(separate qat version here)](spec_files/training_config_qat.txt) of the training configuration file by changing the `<FIXME>`s into acceptable values. Please **save changes** to the files.
* Execute the below cells to retrain the pruned model with QAT. 

In [None]:
# DO NOT CHANGE THIS CELL
# Read the config file
!cat $SPEC_FILES_DIR/model_config_qat.txt

In [None]:
# DO NOT CHANGE THIS CELL
# Read the config file
!cat $SPEC_FILES_DIR/training_config_qat.txt

In [None]:
 model_config {
   arch: "resnet"
   pretrained_model_file: "/dli/task/tao_project/models/resnet18_detector_pruned/resnet18_detector_pruned.tlt"
   load_graph: true
   freeze_blocks: 0
   freeze_blocks: 1
   num_layers: 18
   use_pooling: false
   use_batch_norm: true
   dropout_rate: 0.0
   objective_set: {
     cov: {}
     bbox: {
       scale: 35.0
       offset: 0.5
     }
   }
 }

 training_config: {
   batch_size_per_gpu: 16
   num_epochs: 10
   enable_qat: true
   learning_rate: {
     soft_start_annealing_schedule: {
       min_learning_rate: 5e-6
       max_learning_rate: 5e-4
       soft_start: 0.1
       annealing: 0.7
     }
   }
   regularizer: {
     type: L1
     weight: 3e-9
   }
   optimizer: {
     adam: {
       epsilon: 1e-08
       beta1: 0.9
       beta2: 0.999
     }
   }
   cost_scaling: {
     enabled: false
     initial_exponent: 20.0
     increment: 0.005
     decrement: 1.0
   }
   checkpoint_interval: 5
 }

Click ... to show **solution**. 

In [14]:
# DO NOT CHANGE THIS CELL
# UPDATED enable_qat and regularizer from training_config
# UPDATED pretrained_model_file and load_graph from model_config
# Combining configuration components in separate files and writing into one
!cat $SPEC_FILES_DIR/dataset_config.txt \
     $SPEC_FILES_DIR/augmentation_config.txt \
     $SPEC_FILES_DIR/model_config_qat.txt \
     $SPEC_FILES_DIR/bbox_rasterizer_config.txt \
     $SPEC_FILES_DIR/postprocessing_config.txt \
     $SPEC_FILES_DIR/training_config_qat.txt \
     $SPEC_FILES_DIR/cost_function_config.txt \
     $SPEC_FILES_DIR/evaluation_config.txt \
     > $SPEC_FILES_DIR/combined_training_config_qat.txt
!cat $SPEC_FILES_DIR/combined_training_config_qat.txt

dataset_config: {
  data_sources: {
    tfrecords_path: "/dli/task/tao_project/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/dli/task/tao_project/data/training"
  }
  image_extension: "png"
  target_class_mapping: {
       key: "car"
       value: "car"
   }
   validation_fold: 0
 }
########## LEAVE NEW LINE BELOW
augmentation_config: {
   preprocessing: {
     output_image_width: 960
     output_image_height: 544
     output_image_channel: 3
     min_bbox_width: 1.0
     min_bbox_height: 1.0
   }
   spatial_augmentation: {
     hflip_probability: 0.5
     vflip_probability: 0.5
     zoom_min: 1.0
     zoom_max: 1.0
     translate_max_x: 8.0
     translate_max_y: 8.0
   }
   color_augmentation: {
     color_shift_stddev: 0.0
     hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
     contrast_center: 0.5
   }
 }
########## LEAVE NEW LINE BELOW
 model_config {
   arch: "resnet"
   pretrained_model_file: "/dli/task/tao_project/models/resnet18

In [15]:
# DO NOT CHANGE THIS CELL
# Initiate the training process
!detectnet_v2 train -e $SPEC_FILES_DIR/combined_training_config_qat.txt \
                    -r $MODELS_DIR/resnet18_detector_pruned_retrained_qat \
                    -k tlt_encode \
                    -n resnet18_detector_pruned_retrained_qat

Using TensorFlow backend.
Using TensorFlow backend.

2022-12-23 03:05:58,827 [INFO] __main__: Loading experiment spec at /dli/task/spec_files/combined_training_config_qat.txt.
2022-12-23 03:05:58,829 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /dli/task/spec_files/combined_training_config_qat.txt
2022-12-23 03:05:59,138 [INFO] __main__: Cannot iterate over exactly 2315 samples with a batch size of 16; each epoch will therefore take one extra step.

2022-12-23 03:06:21,070 [INFO] iva.detectnet_v2.objectives.bbox_objective: Default L1 loss function will be used.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 3, 544, 960)  0                                            
_______________________________________________________________________________

In [16]:
# DO NOT CHANGE THIS CELL
# List the newly retrained model
!ls -rlt $MODELS_DIR/resnet18_detector_pruned_retrained_qat/weights

total 7112
-rw-r--r-- 1 root root 7280552 Dec 23 03:26 resnet18_detector_pruned_retrained_qat.tlt


<a name='s3'></a>
## Evaluate Retrained Model ##
Once retraining is complete, we can evaluate the pruned model that was retrained with QAT. The mAP (mean average precision) of this model should be comparable to that of the unpruned model (without QAT). However, quantization can sometimes cause a drop in the mAP value. Pruning and retraining can be an iterative process, but the TAO Toolkit makes it easy to rapidly prototype different versions of the video AI model. 

In [17]:
# DO NOT CHANGE THIS CELL
# Evaluate the model using the same validation set as training
!detectnet_v2 evaluate -e $SPEC_FILES_DIR/combined_training_config_qat.txt \
                       -m $MODELS_DIR/resnet18_detector_pruned_retrained_qat/weights/resnet18_detector_pruned_retrained_qat.tlt \
                       -k tlt_encode

Using TensorFlow backend.
Using TensorFlow backend.

2022-12-23 03:31:16,093 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /dli/task/spec_files/combined_training_config_qat.txt

2022-12-23 03:31:18,544 [INFO] iva.detectnet_v2.objectives.bbox_objective: Default L1 loss function will be used.
2022-12-23 03:31:18,846 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2022-12-23 03:31:18,847 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2022-12-23 03:31:18,847 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2022-12-23 03:31:18,847 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2022-12-23 03:31:18,847 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset

<a name='s4'></a>
## Export Model with Calibration Cache ##
When we feel confident in the model's accuracy as well as inference performance, it can be exported to integrate into DeepStream. To enable inference at lower precision for better performance, the **TensorRT engine** needs to be generated in INT8 mode. This process requires an additional **cache file** that contains scale factors to help combat quantization errors, which may arise due to low-precision arithmetic. The calibration cache can optionally be created with the `export` subtask. This is referred to as exporting in **INT8 mode**. When using the `export` subtask, we can include the `--cal_cache_file` argument to indicate the path to save the calibration cache file to and the `--data_type int8` argument to indicate the desired data type. The options for the `--data_type` argument are `fp32`, `fp16`, and `int8`. The default value is `fp32` if inference in INT8 mode is not required. 
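
The relationship between `--data_type` and `--cal_cache_file` can be summarized in a short sketch that assembles the argument list for the `export` subtask without running it. The helper `build_export_command` is hypothetical and uses only flags that appear in this notebook. Requiring a calibration cache path for `int8` is a design choice of this sketch, not a documented rule of the tool.

```python
VALID_DATA_TYPES = ("fp32", "fp16", "int8")


def build_export_command(model, spec, output, key,
                         data_type="fp32", cal_cache_file=None):
    """Assemble (but do not execute) a detectnet_v2 export command line."""
    if data_type not in VALID_DATA_TYPES:
        raise ValueError("data_type must be one of %s" % (VALID_DATA_TYPES,))
    command = [
        "detectnet_v2", "export",
        "-m", model,
        "-e", spec,
        "-o", output,
        "-k", key,
        "--data_type", data_type,
    ]
    if data_type == "int8":
        # This sketch insists on an explicit cache path for INT8 exports.
        if cal_cache_file is None:
            raise ValueError("int8 export needs a calibration cache path")
        command += ["--cal_cache_file", cal_cache_file]
    return command


command = build_export_command(
    model="model.tlt",          # placeholder paths for illustration
    spec="spec.txt",
    output="model.etlt",
    key="tlt_encode",
    data_type="int8",
    cal_cache_file="cal.bin",
)
print(" ".join(command))
```

Returning a list rather than a single string keeps each argument separate, which is the form expected by `subprocess.run()` if the command were to be executed from Python.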

Execute the below cell to export the QAT trained model. This command generates an `.etlt` file from the trained model and serializes the corresponding INT8 scales as a TensorRT readable calibration cache file.

In [18]:
# DO NOT CHANGE THIS CELL
# Delete duplicate copies
!rm -rf $MODELS_DIR/resnet18_detector_final/resnet18_detector_pruned_retrained_qat.etlt
!rm -rf $MODELS_DIR/resnet18_detector_final/cal.bin

# Export the QAT trained model
!detectnet_v2 export -m $MODELS_DIR/resnet18_detector_pruned_retrained_qat/weights/resnet18_detector_pruned_retrained_qat.tlt \
                     -e $SPEC_FILES_DIR/combined_training_config_qat.txt \
                     -o $MODELS_DIR/resnet18_detector_final/resnet18_detector_pruned_retrained_qat.etlt \
                     -k tlt_encode \
                     --cal_cache_file $MODELS_DIR/resnet18_detector_final/cal.bin \
                     --data_type int8 \
                     --gen_ds_config

Using TensorFlow backend.
Using TensorFlow backend.
2022-12-23 03:35:07,683 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /dli/task/spec_files/combined_training_config_qat.txt
2022-12-23 03:35:09,713 [INFO] iva.common.export.keras_exporter: Using input nodes: ['input_1']
2022-12-23 03:35:09,713 [INFO] iva.common.export.keras_exporter: Using output nodes: ['output_cov/Sigmoid', 'output_bbox/BiasAdd']
NOTE: UFF has been tested with TensorFlow 1.14.0.
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs


<a name='s5'></a>
## Deployment to DeepStream ##
The pruned, QAT retrained model is ready to be deployed to DeepStream. We are now able to use `network-mode=1` for INT8 mode in the configuration file for `Gst-nvinfer`. 
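
As a rough illustration of how the precision setting ties the exported files together, the sketch below generates the precision-related entries of a `[property]` group for `Gst-nvinfer`. In `Gst-nvinfer`, `network-mode` takes 0 for FP32, 1 for INT8, and 2 for FP16. The helper name is hypothetical, the paths are the ones written by the export cell above, and the actual contents of `pgie_config_resnet18_detector_optimized.txt` are not shown in this notebook, so this is not a copy of that file. A complete config needs further properties.

```python
import configparser
import io

# network-mode values understood by Gst-nvinfer
NETWORK_MODES = {"fp32": 0, "int8": 1, "fp16": 2}


def nvinfer_precision_properties(precision, model_path, model_key, calib_path=None):
    """Return the precision-related nvinfer properties as a dict of strings."""
    properties = {
        "tlt-encoded-model": model_path,
        "tlt-model-key": model_key,
        "network-mode": str(NETWORK_MODES[precision]),
    }
    if precision == "int8":
        if calib_path is None:
            raise ValueError("INT8 mode needs a calibration cache file")
        properties["int8-calib-file"] = calib_path
    return properties


config = configparser.ConfigParser()
config["property"] = nvinfer_precision_properties(
    "int8",
    "/dli/task/tao_project/models/resnet18_detector_final/resnet18_detector_pruned_retrained_qat.etlt",
    "tlt_encode",
    "/dli/task/tao_project/models/resnet18_detector_final/cal.bin",
)
buffer = io.StringIO()
config.write(buffer, space_around_delimiters=False)
print(buffer.getvalue())
```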

Execute the below cells to read the modified `Gst-nvinfer config file` and pass it to `app_04.py` to run the DeepStream pipeline. 

In [19]:
# DO NOT CHANGE THIS CELL
# Run the app_04.py DeepStream pipeline w/ the pruned ResNet18 model
!nvidia-smi dmon -i 0 \
                 -s ucmt \
                 -c 20 > '/dli/task/logs/smi.log' & \
python sample_apps/app_04.py /dli/task/data/sample_30.h264 \
                            spec_files/pgie_config_resnet18_detector_optimized.txt \
                            16 \
                            output_tiled_optimized.mp4

Creating Pipeline
Adding elements to Pipeline
Linking elements in the Pipeline
Now playing...
1 :  /dli/task/data/sample_30.h264
2 :  /dli/task/data/sample_30.h264
3 :  /dli/task/data/sample_30.h264
4 :  /dli/task/data/sample_30.h264
5 :  /dli/task/data/sample_30.h264
6 :  /dli/task/data/sample_30.h264
7 :  /dli/task/data/sample_30.h264
8 :  /dli/task/data/sample_30.h264
9 :  /dli/task/data/sample_30.h264
10 :  /dli/task/data/sample_30.h264
11 :  /dli/task/data/sample_30.h264
12 :  /dli/task/data/sample_30.h264
13 :  /dli/task/data/sample_30.h264
14 :  /dli/task/data/sample_30.h264
15 :  /dli/task/data/sample_30.h264
16 :  /dli/task/data/sample_30.h264
Starting pipeline
0:00:00.274806158  2349      0x34458d0 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<primary-inference> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 1]: Trying to create engine from model files
0:00:49.

In [20]:
# DO NOT CHANGE THIS CELL
# Read the smi.log
!cat logs/smi.log

# gpu    sm   mem   enc   dec  mclk  pclk    fb  bar1 rxpci txpci
# Idx     %     %     %     %   MHz   MHz    MB    MB  MB/s  MB/s
    0     0     0     0     0   405   300     0     2     0     0
    0     0     0     0     0  5000   585   280     5   522    31
    0     5     0     0     0  5000   585   850     5     1     6
    0    92    39     0     0  5000  1440  1114     5  2405   269
    0    96    17     0     0  5000  1395  1114     5     3     3
    0   100     2     0     0  5000  1500  1114     5     0     0
    0    99     5     0     0  5000  1500  1114     5     7     4
    0    99     6     0     0  5000  1425  1114     5    10     2
    0    99     4     0     0  5000  1485  1114     5    12     5
    0    94   100     0     0  5000  1410  1224     5    23     7
    0    93    49     0     0  5000  1305  1224     5    16     5
    0    90    37     0     0  5000  1365  1252     5    11     5
    0    80    76     0     0  5000  1530  1252     5    27     9
    0    9

In [21]:
# DO NOT CHANGE THIS CELL
# Convert the output video to a format that is compatible with Jupyter Lab
!ffmpeg -i output_tiled_optimized.mp4 output_tiled_optimized_conv.mp4 \
        -y \
        -loglevel quiet

In [22]:
# DO NOT CHANGE THIS CELL
# Show video
Video('output_tiled_optimized_conv.mp4', width=720)

**Observations**:<br>
The pipeline runs smoothly with the pruned model. It is memory and hardware efficient, allowing it to perform accurate, real-time video AI inference from multiple sources without noticeable latency. 
