# Model Frameworks and Inference Latency

Not all frameworks for ML models are cut from the same cloth. While frameworks like PyTorch or TensorFlow are designed for both training and inference, frameworks like ONNX or TensorRT are designed with inference in mind.

It can be more difficult (or impossible without custom C++ kernels) to get a model into an optimized format. In this notebook some common formats will be explored and the conversion process and benefits will be demonstrated.

*(Note: this section uses a fair amount of device memory. If you run out of memory, close or stop other Jupyter Notebook kernels and, if necessary, run only a few models at a time.)* Some portions of this notebook proactively stop the kernel to reclaim device memory.

In [None]:
import os
import cv2
import numpy as np

# create data input
image_file_path = "sample_images/group-photo.jpg"

target_input_height = 480
target_input_width = 640

os.environ['TARGET_HEIGHT'] = str(target_input_height)
os.environ['TARGET_WIDTH'] = str(target_input_width)

original_image = cv2.imread(image_file_path)

# pre processing
def pre_process_image(original_image):
    resized_image = cv2.resize(original_image, (target_input_width,
                                 target_input_height))
    image_rgb = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)
    image = np.float32(image_rgb)

    image = image/255
    image = np.moveaxis(image, -1, 0)  # HWC to CHW

    image = image[np.newaxis, :] # add batch dimension
    image = np.float32(image)
    
    return image

image = pre_process_image(original_image)
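
As a quick sanity check of the layout produced above, the sketch below repeats the same NumPy steps on a synthetic image and confirms the resulting NCHW shape and value range. The random array stands in for a resized photo, and the variable names are assumptions made for illustration, so no image file or OpenCV call is needed.

```python
import numpy as np

# Synthetic 8-bit RGB image in HWC layout, standing in for a resized photo.
height, width = 480, 640
fake_rgb = np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8)

# Same steps as pre_process_image, minus the OpenCV resize and colour conversion.
tensor = np.float32(fake_rgb) / 255      # scale pixel values to [0, 1]
tensor = np.moveaxis(tensor, -1, 0)      # HWC -> CHW
tensor = tensor[np.newaxis, :]           # add batch dimension -> NCHW

print(tensor.shape, tensor.dtype)
```

The batch dimension comes first, which is why the later benchmarks can build larger batches with `np.concatenate` along axis 0.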

## PyTorch

PyTorch is a great framework for training models and can be used for inference as well.

In [None]:
import torch
import gdown
import sys
import os

sys.path.insert(0, os.path.join(os.getcwd(), 'DM-Count/'))

import models

model_path = "model.pth"
url = "https://drive.google.com/uc?id=1nnIHPaV9RGqK8JHL645zmRvkNrahD9ru"
gdown.download(url, model_path, quiet=False)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# load model
model = models.vgg19() # DM-Count is VGG19 based
model.load_state_dict(torch.load(model_path, device))
model.eval() 

## TorchScript

[TorchScript](https://pytorch.org/docs/stable/jit.html) is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.

Typically, PyTorch models are converted to TorchScript for deployment.

In [None]:
dummy_input = torch.rand(1, 3, target_input_height, target_input_width)

traced_model = torch.jit.trace(model, dummy_input).eval()
scripted_model = torch.jit.script(traced_model).eval()
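
To illustrate the "serializable" part, here is a minimal sketch of saving a traced module to disk and loading it back with `torch.jit.save` and `torch.jit.load`. It uses a tiny stand-in network rather than DM-Count, and the file name `tiny_model.pt` is an assumption made for the example.

```python
import os
import tempfile

import torch

# Tiny stand-in network; the notebook's DM-Count model would take its place.
tiny_model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 4, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()

example_input = torch.rand(1, 3, 32, 32)
traced = torch.jit.trace(tiny_model, example_input)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_model.pt")  # hypothetical file name
    torch.jit.save(traced, path)      # serialize code and weights together
    reloaded = torch.jit.load(path)   # the Python class definition is not needed

with torch.no_grad():
    outputs_match = torch.allclose(tiny_model(example_input), reloaded(example_input))
print(outputs_match)
```

Comparing the reloaded module against the original on the same input is a cheap way to confirm that tracing captured the intended computation.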

## ONNX

ONNX is an open format built to represent machine learning models. Read more about it [here](https://onnx.ai/).

In [None]:
# save model in ONNX format
dummy_input = torch.rand(1, 3, target_input_height, target_input_width)

torch.onnx.export(model,  # model being run
                  dummy_input,  # model test input
                  "model.onnx",  # where to save the model (can be a file or file-like object)
                  opset_version=16,  # the ONNX opset version to target
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['input'],  # the model's input names
                  output_names=['output_0', 'output_1'],  # the model's output names
                  dynamic_axes={'input': {0: 'batch_size'},  # variable length axes
                                'output_0': {0: 'batch_size'}, 
                                'output_1': {0: 'batch_size'}
                               }
                  )

Polygraphy is an excellent toolkit designed to assist in running and debugging deep learning models in various frameworks. It is preinstalled with all containers from NVIDIA's [NGC](https://catalog.ngc.nvidia.com/).

Here we use it to "sanitize" the model by folding constants in the model graph into other nodes. This can simplify the network and potentially give some speedup. Another tool that can do this is [ONNX Simplifier](https://github.com/daquexian/onnx-simplifier).

In [None]:
!POLYGRAPHY_AUTOINSTALL_DEPS=1 polygraphy surgeon sanitize model.onnx --fold-constants -o model_simplified_folded.onnx 

## TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for inference applications on NVIDIA hardware.

In [None]:
!POLYGRAPHY_AUTOINSTALL_DEPS=1 polygraphy convert model_simplified_folded.onnx --convert-to trt --output model.engine --trt-min-shapes input:[1,3,$TARGET_HEIGHT,$TARGET_WIDTH] --trt-opt-shapes input:[1,3,$TARGET_HEIGHT,$TARGET_WIDTH] --trt-max-shapes input:[8,3,$TARGET_HEIGHT,$TARGET_WIDTH]
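
The min, opt and max shape flags in the command above describe the range of input shapes the engine is built to accept, here batch sizes 1 through 8 at a fixed resolution. The sketch below only illustrates that range check in plain Python. `fits_profile` is a hypothetical helper written for this example, not part of TensorRT or Polygraphy.

```python
# Hypothetical helper, not a TensorRT or Polygraphy API.
def fits_profile(shape, min_shape, max_shape):
    """Return True if every dimension of shape lies within [min, max]."""
    return len(shape) == len(min_shape) == len(max_shape) and all(
        lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape)
    )

# Same bounds as the command above, with 480x640 inputs.
min_shape = (1, 3, 480, 640)
max_shape = (8, 3, 480, 640)

for batch in (1, 4, 8, 16):
    print(batch, fits_profile((batch, 3, 480, 640), min_shape, max_shape))
```

A batch of 16 falls outside the range, so supporting it would mean rebuilding the engine with a larger maximum shape.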

## Measuring the latency difference
Now that the model has been converted to each format, let's do some simple benchmarking.

Note: the `min()` is taken to avoid the warmup values and any spikes that could be caused from other processes.
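
The timing pattern used in the cells below can be sketched with the standard library alone. `fake_infer` is a stand-in workload invented for this example. The point is how `timeit.repeat` returns one total per run, and how that total becomes a per-item latency.

```python
import timeit

def fake_infer(batch):
    # Stand-in workload; a real benchmark would call the model here.
    return [x * 2 for x in batch]

batch = list(range(4))   # pretend batch of 4 "images"
number = 25              # calls per timing run
runs = timeit.repeat(lambda: fake_infer(batch), number=number, repeat=5)

# Each entry of runs is the total time for `number` calls, so divide by
# number for per-call latency and by the batch size for per-item latency.
per_item_latency = min(runs) / number / len(batch)
print(f"{per_item_latency:.3e} s per item")
```

Taking the minimum rather than the mean discards runs inflated by warmup or by other processes competing for the device.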

In [None]:
import timeit

raw_torch_latency = []
torchscript_latency = []
onnx_latency = []
trt_latency = []

number_of_iterations = 25
batch_size_range = [1, 2, 4, 8]

In [None]:
# send PyTorch model to compute device
model = model.to(device)

def raw_torch_infer(image):
    if not torch.is_tensor(image):
        image = torch.tensor(image)
    image_tensor = image.to(device)  # move image data to compute device
    with torch.no_grad():  # inference only, so skip autograd bookkeeping
        output = model(image_tensor)[0].cpu()  # inference, then copy result to host
    return output

In [None]:
for n in batch_size_range:
    image_batch = np.concatenate([image for _ in range(n)])
    raw_torch_latency.append(min(timeit.repeat(lambda: raw_torch_infer(image_batch), number=number_of_iterations)) / number_of_iterations / n)
    torch.cuda.empty_cache()

In [None]:
# unload model from GPU to save some memory
model.cpu()
torch.cuda.empty_cache()

In [None]:
scripted_model = scripted_model.to(device)

def torchscript_infer(image):
    if not torch.is_tensor(image):
        image = torch.tensor(image)
    image_tensor = image.to(device)  # move image data to compute device
    with torch.no_grad():  # inference only, so skip autograd bookkeeping
        output = scripted_model(image_tensor)[0].cpu()  # inference, then copy result to host
    return output

In [None]:
for n in batch_size_range:
    image_batch = np.concatenate([image for _ in range(n)])
    torchscript_latency.append(min(timeit.repeat(lambda: torchscript_infer(image_batch), number=number_of_iterations)) / number_of_iterations / n)
    torch.cuda.empty_cache()

In [None]:
# unload model from GPU to save some memory
scripted_model.cpu()
torch.cuda.empty_cache()

In [None]:
import onnxruntime as ort

# instantiate model
ort_sess = ort.InferenceSession('model_simplified_folded.onnx', providers=['CUDAExecutionProvider'])

def onnx_infer(image):
    output = ort_sess.run(None, {"input": image})
    return output

In [None]:
for n in batch_size_range:
    image_batch = np.concatenate([image for _ in range(n)])
    onnx_latency.append(min(timeit.repeat(lambda: onnx_infer(image_batch), number=number_of_iterations)) / number_of_iterations / n)

In [None]:
# unload model from GPU to save some memory
del ort_sess

In [None]:
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

# instantiate model
load_engine = EngineFromBytes(BytesFromPath("model.engine"))
trt_runner = TrtRunner(load_engine)
trt_runner.activate()

def trt_infer(image):
    output = trt_runner.infer(feed_dict={"input": image})
    return output

In [None]:
for n in batch_size_range:
    image_batch = np.concatenate([image for _ in range(n)])
    trt_latency.append(min(timeit.repeat(lambda: trt_infer(image_batch), number=number_of_iterations)) / number_of_iterations / n)

In [None]:
# unload model from GPU to save some memory
trt_runner.deactivate()

In [None]:
import matplotlib.pyplot as plt

plt.rcdefaults()

objects = ('Raw PyTorch', 'TorchScript', 'ONNX', 'TensorRT')
performance = [raw_torch_latency, torchscript_latency, onnx_latency, trt_latency]

for p, o in zip(performance, objects):
    plt.plot(batch_size_range, p, label=o)
plt.xlabel('Batch Size')
plt.ylabel('Latency per image (s)')
plt.title('Latency of DM-Count')
plt.legend(loc="upper right")
plt.grid(True)

plt.show()

## Results

Look! Using an inference-oriented framework like ONNX or TensorRT can lower your inference latency quite significantly. If you watch the memory consumed as each model is loaded, you can see differences between the frameworks there as well.

Batching can be very useful for increasing throughput, as shown in this graph.
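
Because the benchmark divides each timing by the batch size, the plotted values are per-image latencies, and throughput is simply their reciprocal. The sketch below shows the conversion using hypothetical latency values chosen for illustration. They are not measurements from this notebook.

```python
# Hypothetical per-image latencies in seconds, one per batch size.
# These are illustrative values, not results from this notebook.
batch_sizes = [1, 2, 4, 8]
per_image_latency = [0.040, 0.025, 0.018, 0.015]

throughputs = [1.0 / latency for latency in per_image_latency]  # images per second
batch_latencies = [latency * n for latency, n in zip(per_image_latency, batch_sizes)]

for n, tput, total in zip(batch_sizes, throughputs, batch_latencies):
    print(f"batch={n}: {tput:.1f} img/s, {total * 1000:.0f} ms until the batch returns")
```

Throughput rises with batch size in this example, but so does the time until a whole batch returns, which matters for applications that need each individual result quickly.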

## Quantization 

"Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of the operations on tensors with reduced precision rather than full precision (floating point) values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms." Here is a good [site](https://pytorch.org/docs/stable/quantization.html) to start learning more.

Let's start by quantizing models to FP16 in TensorRT and TorchScript!

In [None]:
!POLYGRAPHY_AUTOINSTALL_DEPS=1 polygraphy convert model_simplified_folded.onnx --convert-to trt --output model_fp16.engine --fp16 --trt-min-shapes input:[1,3,$TARGET_HEIGHT,$TARGET_WIDTH] --trt-opt-shapes input:[1,3,$TARGET_HEIGHT,$TARGET_WIDTH] --trt-max-shapes input:[8,3,$TARGET_HEIGHT,$TARGET_WIDTH]

In [None]:
# instantiate model
load_engine = EngineFromBytes(BytesFromPath("model_fp16.engine"))
trt_runner_fp16 = TrtRunner(load_engine)
trt_runner_fp16.activate()

def trt_infer_fp16(image):
    output = trt_runner_fp16.infer(feed_dict={"input": image})
    return output

In [None]:
# benchmark

trt_fp16_latency = []

for n in batch_size_range:
    image_batch = np.concatenate([image for _ in range(n)])
    trt_fp16_latency.append(min(timeit.repeat(lambda: trt_infer_fp16(image_batch), number=number_of_iterations)) / number_of_iterations / n)

In [None]:
# unload model from GPU to save some memory
trt_runner_fp16.deactivate()

In [None]:
scripted_model_fp16 = scripted_model.half().to(device)

def torchscript_infer_fp16(image):
    if not torch.is_tensor(image):
        image = torch.tensor(image)
    image_tensor = image.to(device).half()  # move image data to compute device and cast to FP16
    with torch.no_grad():  # inference only, so skip autograd bookkeeping
        output = scripted_model_fp16(image_tensor)[0].cpu()  # inference, then copy result to host
    return output
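
The `.half()` call above casts weights and inputs from 32-bit to 16-bit floats. As a rough illustration of what that cast costs and saves, the sketch below applies the same conversion to a random NumPy array. The array is a stand-in for model weights, not data taken from DM-Count.

```python
import numpy as np

# Random stand-in for a block of FP32 model weights.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal(1_000_000).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

# Half precision stores each value in 2 bytes instead of 4.
print(weights_fp32.nbytes, weights_fp16.nbytes)

# Rounding error introduced by the cast, measured back in FP32.
max_abs_error = np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
print(max_abs_error)
```

Storage halves, while each value keeps only about three significant decimal digits, which is why accuracy should be re-checked after the conversion.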

In [None]:
# benchmark

torchscript_fp16_latency = []

for n in batch_size_range:
    image_batch = np.concatenate([image for _ in range(n)])
    torchscript_fp16_latency.append(min(timeit.repeat(lambda: torchscript_infer_fp16(image_batch), number=number_of_iterations)) / number_of_iterations / n)
    torch.cuda.empty_cache()

In [None]:
# unload model from GPU to save some memory
scripted_model_fp16.cpu()
torch.cuda.empty_cache()

In [None]:
objects = ('TorchScript FP32', 'TorchScript FP16', 'TensorRT FP32', 'TensorRT FP16')
performance = [torchscript_latency, torchscript_fp16_latency, trt_latency, trt_fp16_latency]

for p, o in zip(performance, objects):
    plt.plot(batch_size_range, p, label=o)

plt.legend(loc="upper right")
plt.grid(True)
plt.xlabel('Batch Size')
plt.ylabel('Latency per image (s)')
plt.title('Quantization Lowers Latency')

plt.show()

Quantization gives a significant improvement in inference latency as well!

Just ensure that accuracy remains within your target after quantizing. For example, an accuracy drop of roughly 0.1% would be expected for FP16 quantization in a computer vision task.
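
One way to check this is to compare the outputs of the full-precision and reduced-precision models on the same inputs. The sketch below assumes the quantity of interest is the sum over an output map, as is common for crowd-counting density maps. `relative_count_error`, the random stand-in data and the tolerance are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical helper for comparing two runs of the same input.
def relative_count_error(reference, candidate):
    """Relative difference between the summed outputs of two runs."""
    ref_count = float(np.sum(reference, dtype=np.float64))
    cand_count = float(np.sum(candidate, dtype=np.float64))
    return abs(cand_count - ref_count) / abs(ref_count)

# Stand-in data: a random non-negative "density map" and its FP16 round trip.
rng = np.random.default_rng(0)
density_fp32 = rng.random((1, 1, 60, 80), dtype=np.float32)
density_fp16 = density_fp32.astype(np.float16)

error = relative_count_error(density_fp32, density_fp16)
tolerance = 1e-3  # example target; choose one that suits the application
print(error <= tolerance)
```

In practice the two arrays would come from the FP32 and FP16 engines run on a held-out set of images, rather than from a cast of the same array.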

INT8, FP8 and other numerical representations can be used as well, but quantizing to them can be more challenging. The PyTorch [quantization documentation](https://pytorch.org/docs/stable/quantization.html) is a good place to start learning more.