## ONNX + TensorRT

There are several ways to take a model from development to production. One of the most widely used options on NVIDIA GPUs is TensorRT, NVIDIA's model optimization framework and inference engine generator. In simple terms, TensorRT takes a trained PyTorch or TensorFlow model and converts it into an "engine" that makes the most of the available hardware resources. We will first look at a simple example of converting a classification model into a TensorRT engine.

In [1]:
from torchvision import models
import cv2
import torch
from torchvision.transforms import Resize, Compose, ToTensor, Normalize
import onnx

In [2]:
def preprocess_image(img_path):
    # transformations for the input data
    transforms = Compose([
        ToTensor(),
        Resize(224),
        Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

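    # read input image (note: cv2.imread returns channels in BGR order,
    # while the ImageNet mean/std above are listed in RGB order)
    input_img = cv2.imread(img_path)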
    # do transformations
    input_data = transforms(input_img)
    batch_data = torch.unsqueeze(input_data, 0)
    return batch_data

def postprocess(output_data):
    # get class names
    with open("../data/imagenet_classes.txt") as f:
        classes = [line.strip() for line in f.readlines()]
    # calculate human-readable value by softmax
    confidences = torch.nn.functional.softmax(output_data, dim=1)[0] * 100
    # find top predicted classes
    _, indices = torch.sort(output_data, descending=True)
    i = 0
    # print the top classes predicted by the model
    while confidences[indices[0][i]] > 0.5:
        class_idx = indices[0][i]
        print(
            "class:",
            classes[class_idx],
            ", confidence:",
            confidences[class_idx].item(),
            "%, index:",
            class_idx.item(),
        )
        i += 1


### Step 1 : Just Torch-based inference
Here, we use a pretrained ResNet50 model to classify an input image. The inference is done purely in PyTorch. The model's prediction is passed through a post-processing function to get the final prediction.

In [3]:
input = preprocess_image("../data/turkish_coffee.jpg").cuda()
model = models.resnet50(pretrained=True)
model.eval()
model.cuda()
output = model(input)

postprocess(output)




class: cup , confidence: 94.97859191894531 %, index: 968
class: espresso , confidence: 3.951244831085205 %, index: 967
class: coffee mug , confidence: 0.6196929216384888 %, index: 504


### Step 2 : Convert Model to ONNX
Here, we first convert the model to its ONNX representation, which will later be converted to a TensorRT engine. There are several ways to convert a model to TensorRT, but the most common one is to go through ONNX. To export a torch network to ONNX, we pass the model instance and a dummy input of the same shape as the expected inputs to the export function.

In [4]:
ONNX_FILE_PATH = '../onnx_files/resnet50.onnx'
torch.onnx.export(model, input, ONNX_FILE_PATH, input_names=['input'],
                  output_names=['output'], export_params=True)

### Step 3: Convert ONNX model to TensorRT engine
The generated ONNX file is saved in the [onnx_files](../onnx_files) folder.

Now, we finally convert this ONNX model to a TensorRT engine. This process involves several steps, which we will look at below:

In [5]:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import tensorrt as trt


#### 1. Create Builder
To create a builder, you must first create a logger, then use the logger to create the builder. The builder allows the creation of an optimized engine from a network definition. It allows the application to specify the maximum batch and workspace size, the minimum acceptable level of precision, and timing iteration counts for autotuning, and it provides an interface for quantizing networks to run in 8-bit precision.

In [10]:
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(TRT_LOGGER)

[04/14/2023-17:00:50] [TRT] [I] The logger passed into createInferBuilder differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.

[04/14/2023-17:00:50] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2563, GPU 2148 (MiB)


#### 2. Create Network
After the builder has been created, the first step in optimizing a model is to create a network definition. The EXPLICIT_BATCH flag is required in order to import models using the ONNX parser. The network definition provides methods for the application to specify the definition of a network: input and output tensors can be specified, layers can be added, and there is an interface for configuring each supported layer type.
Supported layer types include convolutional and recurrent layers, and a Plugin layer type allows the application to implement functionality not natively supported by TensorRT.

In [11]:
# Bitmask with the EXPLICIT_BATCH creation flag set
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(EXPLICIT_BATCH)
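
The `1 << int(flag)` expression turns an enum value into a bit position, so that several creation flags could be combined into a single integer mask. A small sketch of that idea follows; `CreationFlag` is a hypothetical stand-in for `trt.NetworkDefinitionCreationFlag`, its numeric values are assumptions, and `SOME_OTHER_FLAG` is invented purely to show how flags combine:

```python
from enum import IntEnum

class CreationFlag(IntEnum):
    """Hypothetical stand-in for trt.NetworkDefinitionCreationFlag (values assumed)."""
    EXPLICIT_BATCH = 0
    SOME_OTHER_FLAG = 1  # invented flag, only here to demonstrate combining

def flags_to_mask(*flags):
    """OR together one bit per flag, using the flag's value as the bit position."""
    mask = 0
    for flag in flags:
        mask |= 1 << int(flag)
    return mask

def has_flag(mask, flag):
    """Check whether the bit for `flag` is set in `mask`."""
    return bool(mask & (1 << int(flag)))

explicit_only = flags_to_mask(CreationFlag.EXPLICIT_BATCH)
combined = flags_to_mask(CreationFlag.EXPLICIT_BATCH, CreationFlag.SOME_OTHER_FLAG)
print(bin(explicit_only), bin(combined))
```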

In [14]:
network

<tensorrt.tensorrt.INetworkDefinition at 0x7fee02fb0fb0>

#### 3. Import model using ONNX Parser
Now, the network definition must be populated from the ONNX representation. You can create an ONNX parser to populate the network as follows:


In [12]:
parser = trt.OnnxParser(network, TRT_LOGGER)
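success = parser.parse_from_file(ONNX_FILE_PATH)
# Print any errors collected by the parser, then stop if parsing failed
for idx in range(parser.num_errors):
    print(parser.get_error(idx))
assert success, "ONNX parsing failed"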

[04/14/2023-17:00:53] [TRT] [I] ----------------------------------------------------------------
[04/14/2023-17:00:53] [TRT] [I] Input filename:   ../onnx_files/resnet50.onnx
[04/14/2023-17:00:53] [TRT] [I] ONNX IR version:  0.0.7
[04/14/2023-17:00:53] [TRT] [I] Opset version:    13
[04/14/2023-17:00:53] [TRT] [I] Producer name:    pytorch
[04/14/2023-17:00:53] [TRT] [I] Producer version: 1.12.1
[04/14/2023-17:00:53] [TRT] [I] Domain:           
[04/14/2023-17:00:53] [TRT] [I] Model version:    0
[04/14/2023-17:00:53] [TRT] [I] Doc string:       
[04/14/2023-17:00:53] [TRT] [I] ----------------------------------------------------------------


#### 4. Building an engine
The next step is to create a build configuration specifying how TensorRT should optimize the model. This interface has many properties that you can set in order to control how TensorRT optimizes the network. The engine produced by the build allows the application to execute inference:
   - It supports synchronous and asynchronous execution, profiling, and enumeration and querying of the bindings for the engine inputs and outputs.
   - A single engine can have multiple execution contexts, allowing a single set of trained parameters to be used for the simultaneous execution of multiple batches.

One important property is the maximum workspace size. Layer implementations often require a temporary workspace, and this parameter limits the maximum size that any layer in the network can use. If insufficient workspace is provided, it is possible that TensorRT will not be able to find an implementation for a layer:

In [15]:
config = builder.create_builder_config()


In [17]:
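# set_memory_pool_limit only exists in newer TensorRT releases (8.4+);
# older releases expose the max_workspace_size attribute instead.
if hasattr(config, "set_memory_pool_limit"):
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
else:
    config.max_workspace_size = 1 << 20  # 1 MiB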
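
For readers less used to shift notation, `1 << 20` is 2^20 bytes, that is, 1 MiB. The helpers below (`mib`, `gib` and `format_bytes` are hypothetical names, not TensorRT functions) are one possible way to make workspace sizes easier to read and to sanity-check:

```python
def mib(n):
    """Number of bytes in n mebibytes (n * 2**20)."""
    return n << 20

def gib(n):
    """Number of bytes in n gibibytes (n * 2**30)."""
    return n << 30

def format_bytes(num_bytes):
    """Render a byte count using binary units, e.g. 1048576 -> '1 MiB'."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if num_bytes < 1024 or unit == "GiB":
            return f"{num_bytes:g} {unit}"
        num_bytes /= 1024

for size in (1 << 20, mib(256), gib(1)):
    print(size, "bytes =", format_bytes(size))
```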

After the configuration has been specified, the engine can be built and serialized with:

In [None]:
serialized_engine = builder.build_serialized_network(network, config)

It may be useful to save the engine to a file for future use. You can do that like so:

In [None]:
with open("../trt_engines/sample.engine", "wb") as f:
    f.write(serialized_engine)
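
Because building an engine can take a while, a common pattern is to build only when no saved engine exists yet. The sketch below illustrates that caching idea without touching TensorRT: `load_or_build` is a hypothetical helper, and `build_fn` stands for any zero-argument callable assumed to return a bytes-like object (in this notebook it could wrap the `build_serialized_network` call). The demo uses a fake builder and a temporary directory:

```python
import tempfile
from pathlib import Path

def load_or_build(engine_path, build_fn):
    """Return serialized engine bytes, calling build_fn only if no file is cached."""
    path = Path(engine_path)
    if path.exists():
        return path.read_bytes()
    data = bytes(build_fn())  # assumes build_fn returns a bytes-like object
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data

# Demo with a fake builder that records how often it is invoked.
build_calls = []

def fake_build():
    build_calls.append(1)
    return b"fake-engine-bytes"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "trt_engines" / "sample.engine"
    first = load_or_build(target, fake_build)   # builds and writes the file
    second = load_or_build(target, fake_build)  # reads the cached file

print(len(build_calls), first == second)
```

A serialized engine is generally tied to the TensorRT version and GPU it was built with, so a real cache would likely need to include those in the file name or rebuild when they change.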

#### 5. Deserialize an Engine
To perform inference, deserialize the engine using the Runtime interface. Like the builder, the runtime requires an instance of the logger.

In [None]:
runtime = trt.Runtime(TRT_LOGGER)

First load the engine from a file. Then deserialize the engine from a memory buffer:

In [None]:
with open("../trt_engines/sample.engine", "rb") as f:
    serialized_engine = f.read()

In [None]:
engine = runtime.deserialize_cuda_engine(serialized_engine)

#### 6. Performing Inference

The engine holds the optimized model, but performing inference requires additional state for intermediate activations. This state is held by an execution context. An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks.

In [None]:
context = engine.create_execution_context()

Allocate some host and device buffers for inputs and outputs:

In [None]:
# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=np.float32)
h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
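
The buffer sizes above follow directly from the binding shapes: the number of elements (what `trt.volume` computes) times the size of one element. As a worked example, assuming a `(1, 3, 224, 224)` float32 input and a `(1, 1000)` float32 output (the real input shape depends on the image, since `Resize(224)` only fixes the shorter side), the arithmetic can be sketched as follows; `buffer_nbytes` is a hypothetical helper:

```python
import math

def buffer_nbytes(shape, itemsize=4):
    """Bytes needed for a dense tensor of `shape`; itemsize=4 corresponds to float32."""
    return math.prod(shape) * itemsize

# Shapes assumed for illustration, not read from a real engine.
input_shape = (1, 3, 224, 224)
output_shape = (1, 1000)

print("input elements:", math.prod(input_shape))
print("input bytes:", buffer_nbytes(input_shape))
print("output bytes:", buffer_nbytes(output_shape))
```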

In [None]:
host_input = np.array(preprocess_image("../data/turkish_coffee.jpg").numpy(), dtype=np.float32, order='C')
# Transfer input data to the GPU.
cuda.memcpy_htod_async(d_input, host_input, stream)
# Run inference.
context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
# Transfer predictions back from the GPU.
cuda.memcpy_dtoh_async(h_output, d_output, stream)
# Synchronize the stream
stream.synchronize()

Finally, wrap the flat host output buffer in a tensor and add a batch dimension, so that it has the shape the postprocess function expects:

In [None]:
output_data = torch.Tensor(h_output).unsqueeze(0)

In [None]:
postprocess(output_data)
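
Rather than comparing the printed classes by eye, the two sets of logits can also be compared numerically. The sketch below is one possible check, using plain Python lists and made-up numbers; `outputs_agree` and the tolerance of `1e-2` are assumptions chosen for illustration, not values from TensorRT:

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(r - c) for r, c in zip(reference, candidate))

def outputs_agree(reference, candidate, atol=1e-2):
    """True if both outputs pick the same top class and differ by at most atol."""
    if len(reference) != len(candidate):
        return False
    return (argmax(reference) == argmax(candidate)
            and max_abs_diff(reference, candidate) <= atol)

# Made-up logits standing in for the PyTorch and TensorRT outputs.
torch_logits = [8.91, 5.73, 3.88, -1.20]
trt_logits = [8.908, 5.733, 3.879, -1.201]
print(outputs_agree(torch_logits, trt_logits))
```

In this notebook, the inputs would presumably be something like `output.detach().cpu().numpy().ravel().tolist()` and `h_output.tolist()`. Small differences are expected, since TensorRT may select different kernels than PyTorch.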

Finally, we are able to reproduce the same results that we obtained with the pure PyTorch model. In conclusion, we first converted the model to its ONNX representation, then used this ONNX representation to generate a TensorRT engine, and finally used the saved engine for inference.