# Using PyTorch with TensorRT through ONNX:

TensorRT is a great way to take a trained PyTorch model and optimize it to run more efficiently during inference on an NVIDIA GPU.

One way to convert a PyTorch model to TensorRT is to export it to ONNX (an open exchange format for deep learning models) and then convert the ONNX file into a TensorRT engine. Essentially, we will follow this path to convert and deploy our model:

![PyTorch+ONNX](./images/pytorch_onnx.png)

Both TensorFlow and PyTorch models can be exported to ONNX, as can models from many other frameworks. This allows models created in any of these frameworks to flow into common downstream pipelines.

To get started, let's take a well-known computer vision model and follow five key steps to deploy it to the TensorRT Python runtime:

1. __What format should I save my model in?__
2. __What batch size(s) am I running inference at?__
3. __What precision am I running inference at?__
4. __What TensorRT path am I using to convert my model?__
5. __What runtime am I targeting?__

## 1. What format should I save my model in?

We are going to use ResNet50, a widely used CNN architecture first described in <a href="https://arxiv.org/abs/1512.03385">this paper</a>.

Let's start by loading dependencies and downloading the model:

In [1]:
import torchvision.models as models
import torch
import torch.onnx

# load the pretrained model
resnet50 = models.resnet50(pretrained=True, progress=False)

Next, we will select our batch size and export the model:

In [2]:
# set up a dummy input tensor and export the model to ONNX
BATCH_SIZE = 32
dummy_input=torch.randn(BATCH_SIZE, 3, 224, 224)
torch.onnx.export(resnet50, dummy_input, "resnet50_pytorch.onnx", verbose=False)

Note that we are picking a BATCH_SIZE of 32 in this example.

Let's use a benchmarking function included in this guide to time this model:

In [3]:
from benchmark import benchmark

resnet50.to("cuda").eval()
benchmark(resnet50)

Warm up ...
Start timing ...
Iteration 1000/1000, ave batch time 10.19 ms
Input shape: torch.Size([1, 3, 224, 224])
Output features size: torch.Size([1, 1000])
Average batch time: 10.19 ms


Now, let's restart our Jupyter Kernel so PyTorch doesn't collide with TensorRT: 

In [None]:
import os

os._exit(0) # Shut down the current kernel so TRT doesn't fight with PyTorch for GPU memory

## 2. What batch size(s) am I running inference at?

We are going to run with a fixed batch size of 32 for this example. Note that above we set BATCH_SIZE to 32 when saving our model to ONNX. We need to create another dummy batch of the same size (this time it will need to be in our target precision) to test out our engine.

First, as before, we will set our BATCH_SIZE to 32. Note that the trtexec command we run in step 4 below includes the '--explicitBatch' flag to signal to TensorRT that we will be using a fixed batch size at runtime.

In [1]:
BATCH_SIZE = 32

Importantly, by default TensorRT will use the input precision you give the runtime as the default precision for the rest of the network. So before we create our new dummy batch, we also need to choose a precision as in the next section:

## 3. What precision am I running inference at?

Remember that precisions lower than FP32 tend to run faster. There are two common reduced-precision modes: FP16 and INT8. Graphics cards that are designed to do inference well often have an affinity for one of these two types. This guide was developed on an NVIDIA V100, which favors FP16, so we will use that here by default. Using INT8 is more involved because it requires a calibration step.

In [2]:
import numpy as np

USE_FP16 = True

target_dtype = np.float16 if USE_FP16 else np.float32
# the model was exported from PyTorch with NCHW inputs, so the dummy batch uses the same layout
dummy_input_batch = np.zeros((BATCH_SIZE, 3, 224, 224), dtype = np.float32)
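
To make the trade-off between FP32 and FP16 a little more concrete, here is a minimal sketch that uses only Python's standard `struct` module. The helper names `to_fp16_and_back` and `bytes_per_element` are hypothetical and are not part of TensorRT or of this guide's code. They only illustrate that FP16 halves the storage per element at the cost of some rounding error.

```python
import struct


def to_fp16_and_back(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    # "e" is the struct format code for a 2-byte half-precision float
    return struct.unpack("<e", struct.pack("<e", value))[0]


def bytes_per_element(use_fp16: bool) -> int:
    """Storage size of one element: half precision or single precision."""
    return struct.calcsize("<e") if use_fp16 else struct.calcsize("<f")


if __name__ == "__main__":
    # 0.1 cannot be represented exactly in FP16, so the round trip changes it slightly
    print("0.1 stored as FP16:", to_fp16_and_back(0.1))
    print("bytes per FP16 element:", bytes_per_element(True))
    print("bytes per FP32 element:", bytes_per_element(False))
```

For classification models such as ResNet50 this rounding error is usually small relative to the gap between class scores, but it is worth comparing predictions against the FP32 model when you switch precision.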

## 4. What TensorRT path am I using to convert my model?

We can use trtexec, a command line tool for working with TensorRT, in order to convert an ONNX model originally from PyTorch to an engine file.

Let's make sure we have TensorRT installed (this comes with trtexec):

In [3]:
import tensorrt

To convert the model we saved in the previous step, we need to point trtexec to the ONNX file, give it a name to save the engine as, and finally specify that we want to use a fixed batch size instead of a dynamic one.

In [4]:
# step out of Python for a moment to convert the ONNX model to a TRT engine using trtexec
if USE_FP16:
    !trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt  --explicitBatch --fp16
else:
    !trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt  --explicitBatch

&&&& RUNNING TensorRT.trtexec # trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt --explicitBatch --fp16
[01/30/2021-02:11:40] [I] === Model Options ===
[01/30/2021-02:11:40] [I] Format: ONNX
[01/30/2021-02:11:40] [I] Model: resnet50_pytorch.onnx
[01/30/2021-02:11:40] [I] Output:
[01/30/2021-02:11:40] [I] === Build Options ===
[01/30/2021-02:11:40] [I] Max batch: explicit
[01/30/2021-02:11:40] [I] Workspace: 16 MiB
[01/30/2021-02:11:40] [I] minTiming: 1
[01/30/2021-02:11:40] [I] avgTiming: 8
[01/30/2021-02:11:40] [I] Precision: FP32+FP16
[01/30/2021-02:11:40] [I] Calibration: 
[01/30/2021-02:11:40] [I] Refit: Disabled
[01/30/2021-02:11:40] [I] Safe mode: Disabled
[01/30/2021-02:11:40] [I] Save engine: resnet_engine_pytorch.trt
[01/30/2021-02:11:40] [I] Load engine: 
[01/30/2021-02:11:40] [I] Builder Cache: Enabled
[01/30/2021-02:11:40] [I] NVTX verbosity: 0
[01/30/2021-02:11:40] [I] Tactic sources: Using default tactic sources
[01/30/2021-02:11:40] [I] Input(s

This will save our model as 'resnet_engine_pytorch.trt'.
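
Outside a notebook, where the `!` shell escape is not available, the same conversion could be launched from plain Python. The sketch below is one possible way to do that, assuming `trtexec` is on your `PATH`. `build_trtexec_command` and `convert_onnx_to_engine` are hypothetical helper names, and only the flags already used in the cell above are passed.

```python
import shutil
import subprocess
from typing import List


def build_trtexec_command(onnx_path: str, engine_path: str, use_fp16: bool) -> List[str]:
    """Assemble the trtexec argument list used in this guide."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--explicitBatch",
    ]
    if use_fp16:
        # allow TensorRT to pick FP16 kernels where it considers them faster
        cmd.append("--fp16")
    return cmd


def convert_onnx_to_engine(onnx_path: str, engine_path: str, use_fp16: bool) -> None:
    """Run trtexec and raise if it is missing or exits with an error."""
    if shutil.which("trtexec") is None:
        raise FileNotFoundError("trtexec was not found on PATH")
    subprocess.run(build_trtexec_command(onnx_path, engine_path, use_fp16), check=True)


if __name__ == "__main__":
    # print the command instead of running it, so this is safe without TensorRT installed
    print(" ".join(build_trtexec_command(
        "resnet50_pytorch.onnx", "resnet_engine_pytorch.trt", use_fp16=True)))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` turns a failed conversion into a Python exception instead of a silently missing engine file.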

## 5. What runtime am I targeting?

We have now converted our model to a TensorRT engine. Great! That means we are ready to load it into the native Python TensorRT runtime. This runtime strikes a balance between the ease of use of the high-level Python APIs used in frameworks and the fast, low-level C++ runtimes available in TensorRT.

In [5]:
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))

# read the serialized engine from disk; the with-block closes the file afterwards
with open("resnet_engine_pytorch.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

Now we allocate input and output memory on the GPU and give TensorRT pointers (bindings) to it:

In [6]:
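# size the host output buffer with the dtype the engine reports for its output binding
# (binding 0 is the input, binding 1 is the output); this may still be FP32 for an engine built with --fp16
output = np.empty([BATCH_SIZE, 1000], dtype = trt.nptype(engine.get_binding_dtype(1)))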

# allocate device memory
d_input = cuda.mem_alloc(1 * dummy_input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)

bindings = [int(d_input), int(d_output)]

stream = cuda.Stream()
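
The sizes passed to `cuda.mem_alloc` above are the number of elements in each array multiplied by the size of one element, which is what NumPy's `nbytes` reports. As a rough illustration, the sketch below reproduces that arithmetic with the standard library only. `buffer_nbytes` is a hypothetical helper, and the shapes assume 3 x 224 x 224 values per input image and 1000 output values per image, as in this example.

```python
from math import prod
from typing import Sequence


def buffer_nbytes(shape: Sequence[int], itemsize: int) -> int:
    """Bytes needed for a dense array: element count times bytes per element."""
    return prod(shape) * itemsize


BATCH_SIZE = 32
FP32_BYTES = 4
FP16_BYTES = 2

# input batch stored as FP32
input_nbytes = buffer_nbytes((BATCH_SIZE, 3, 224, 224), FP32_BYTES)
# output buffer size for both candidate precisions
output_nbytes_fp32 = buffer_nbytes((BATCH_SIZE, 1000), FP32_BYTES)
output_nbytes_fp16 = buffer_nbytes((BATCH_SIZE, 1000), FP16_BYTES)

if __name__ == "__main__":
    print("input bytes:", input_nbytes)
    print("output bytes (FP32):", output_nbytes_fp32)
    print("output bytes (FP16):", output_nbytes_fp16)
```

A device buffer that is smaller than what the engine reads or writes will not necessarily raise an error, so it is worth checking these sizes against the engine's binding shapes and dtypes.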

Next, set up the prediction function.

This involves a copy from CPU RAM to GPU VRAM, executing the model, then copying the results back from GPU VRAM to CPU RAM:

In [7]:
def predict(batch): # result gets copied into output
    # transfer input data to device
    cuda.memcpy_htod_async(d_input, batch, stream)
    # execute model
    context.execute_async_v2(bindings, stream.handle, None)
    # transfer predictions back
    cuda.memcpy_dtoh_async(output, d_output, stream)
    # synchronize the stream
    stream.synchronize()
    
    return output

Finally, let's time the function!

Note that we're going to include the extra CPU-GPU copy time in this evaluation, so it won't be directly comparable with our TRTorch model performance as it also includes additional overhead.

In [8]:
print("Warming up...")

predict(dummy_input_batch)

print("Done warming up!")

Warming up...
Done warming up!


In [9]:
%%timeit

pred = predict(dummy_input_batch)

7.15 ms ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


However, even with the CPU-GPU copy, this is still faster than our raw PyTorch model!
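
`%%timeit` is only available inside IPython and Jupyter. In a plain Python script, a comparable measurement could be sketched with `time.perf_counter`. `time_call` is a hypothetical helper, not the `benchmark` function used earlier, and it assumes the callable blocks until its work is finished. That holds for `predict`, because it synchronizes the stream before returning.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], warmup: int = 1, iterations: int = 100) -> float:
    """Return the mean wall-clock time per call in milliseconds."""
    # warm-up calls are executed but excluded from the measurement
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iterations


if __name__ == "__main__":
    # stand-in workload; with the engine loaded you would pass
    # lambda: predict(dummy_input_batch) instead
    mean_ms = time_call(lambda: sum(range(10_000)), warmup=2, iterations=50)
    print(f"mean time per call: {mean_ms:.3f} ms")
```

Unlike `%%timeit`, this reports a single mean rather than a mean and standard deviation over several runs, so repeat the measurement a few times before drawing conclusions.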

## Next Steps:

<h4> Profiling </h4>

This is a great next step for further optimizing and debugging models you are working on productionizing.

You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html

<h4>  TRT Dev Docs </h4>

Main documentation page for the ONNX, layer builder, C++, and legacy APIs

You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html

<h4>  TRT OSS GitHub </h4>

Contains OSS TRT components, sample applications, and plugin examples

You can find it here: https://github.com/NVIDIA/TensorRT


#### TRT Supported Layers:

https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#layers-precision-matrix

#### TRT ONNX Plugin Example:

https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/samplePlugin