# Using Tensorflow through ONNX:

The ONNX path to getting a TensorRT engine is a high-performance approach to TensorRT conversion that works with a variety of frameworks - including Tensorflow and Tensorflow 2.

TensorRT's ONNX parser is an all-or-nothing parser for ONNX models that ensures an optimal, single TensorRT engine and is great for exporting to the TensorRT API runtimes. ONNX models can be easily generated from Tensorflow models using the ONNX project's keras2onnx and tf2onnx tools.

In this notebook we will take a look at how ONNX models can be generated from a Keras/TF2 ResNet50 model, how we can convert those ONNX models to TensorRT engines using trtexec, and finally how we can use the native Python TensorRT runtime to feed a batch of data into the TRT engine at inference time.

Essentially, we will follow this path to convert and deploy our model:

![Tensorflow+ONNX](./images/tf_onnx.png)

__Use this when:__
- You want the most efficient runtime performance possible out of an automatic parser
- You have a network consisting of mostly supported operations, including operations and layers that the ONNX parser uniquely supports (such as RNNs/LSTMs/GRUs)
- You are willing to write custom C++ plugins for any unsupported operations (if your network has any)
- You do not want to use the manual layer builder API

__Checking your GPU status:__

Let's see what GPU hardware we are working with. Our hardware can matter a lot because different cards have different performance profiles and precisions they tend to operate best in. For example, a V100 is relatively strong at FP16 processing compared with a T4, which tends to operate best in INT8 mode.

In [1]:
!nvidia-smi

Sat Jan 30 01:21:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    36W / 300W |    126MiB / 16155MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   39C    P0    37W / 300W |      6MiB / 16158MiB |      0%      Default |
|       
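
If you would rather collect this information in a script than read it off the table, `nvidia-smi` also offers a CSV query mode. The following is a minimal sketch, not part of the original notebook: the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the code simply returns an empty list when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory' CSV lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first comma only; the name itself contains no comma
        name, _, memory = line.partition(",")
        gpus.append({"name": name.strip(), "memory_total": memory.strip()})
    return gpus


def query_gpus():
    """Return GPU info from nvidia-smi, or [] if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


for gpu in query_gpus():
    print(gpu["name"], gpu["memory_total"])
```

Knowing the card name up front makes it easier to decide which precision to try first later in the notebook.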

Remember: to successfully deploy a TensorRT model, you have to make __five key decisions__:

1. __What format should I save my model in?__
2. __What batch size(s) am I running inference at?__
3. __What precision am I running inference at?__
4. __What TensorRT path am I using to convert my model?__
5. __What runtime am I targeting?__

## 1. What format should I save my model in?

Our first step is to load up a pretrained ResNet50 model. This can be done easily using keras.applications - a collection of pretrained image model classifiers that can additionally be used as backbones for detection and other deep learning problems.

We can load up a pretrained classifier with batch size 32 as follows:

In [2]:
from tensorflow.keras.applications import ResNet50

BATCH_SIZE = 32

In [3]:
model = ResNet50(weights='imagenet')

For the purposes of checking our non-optimized model, we can use a dummy batch of data to verify our performance and the consistency of our results across precisions. 224x224 RGB images are a common format, so let's generate a batch of them.

Once we generate a batch of them, we will feed it through the model using .predict() to "warm up" the model. The first batch you feed through a deep learning model often takes a lot longer as just-in-time compilation and other runtime optimizations are performed. Once you get that first batch through, further performance tends to be more consistent.

In [4]:
import numpy as np

dummy_input_batch = np.zeros((BATCH_SIZE, 224, 224, 3))

model.predict(dummy_input_batch) # warm up

array([[1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       ...,
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04]], dtype=float32)

__Baseline Timing:__

Once we have warmed up our non-optimized model, we can get a rough timing estimate of our model using %%timeit, which runs the cell several times and reports timing information.

Let's take a look at how long our model takes to run at baseline before doing any TensorRT optimization:

In [5]:
%%timeit

result = model.predict_on_batch(dummy_input_batch) # Check default performance

51.2 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
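
`%%timeit` is only available inside IPython/Jupyter. As a rough sketch of the same warm-up-then-time pattern in plain Python, the hypothetical `benchmark` helper below (not part of the original notebook) runs a callable a few times untimed and then reports the mean and standard deviation of the timed calls. The workload here is a stand-in; in this notebook it would be something like `lambda: model.predict_on_batch(dummy_input_batch)`.

```python
import statistics
import time


def benchmark(fn, warmup=1, runs=7):
    """Call fn() `warmup` times untimed, then time `runs` calls.

    Returns (mean_seconds, stdev_seconds).
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return statistics.mean(timings), stdev


# Stand-in workload so the sketch runs without a GPU or a model
mean_s, stdev_s = benchmark(lambda: sum(range(10_000)))
print(f"{mean_s * 1e3:.3f} ms ± {stdev_s * 1e3:.3f} ms per call")
```

Keeping the warm-up calls out of the timed loop matters for the reason described earlier: the first batch often includes one-time compilation and setup costs.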


We can now take a look at the resulting batch:

In [6]:
result = model.predict_on_batch(dummy_input_batch)
result[:10] # The class probabilities for the first ten samples in the batch

array([[1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       ...,
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04],
       [1.6964252e-04, 3.3007501e-04, 6.1350627e-05, ..., 1.4622418e-05,
        1.4449919e-04, 6.6087063e-04]], dtype=float32)
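
Each row of `result` holds one probability per ImageNet class, so the full array is hard to read at a glance. As an illustration only, the hypothetical `top_k` helper below ranks the entries of a single row; it uses just the standard library and a toy six-class vector rather than the real 1000-class output.

```python
import heapq


def top_k(probabilities, k=5):
    """Return the k (class_index, probability) pairs with the highest probability."""
    return heapq.nlargest(k, enumerate(probabilities), key=lambda pair: pair[1])


# Toy 6-class probability vector standing in for one row of `result`
row = [0.05, 0.40, 0.10, 0.25, 0.15, 0.05]
for class_index, probability in top_k(row, k=3):
    print(class_index, probability)
```

Because the dummy batch is all zeros, every row of the real output is identical, so ranking any one row tells you about all of them.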

Okay - now that we have a baseline model, let's convert it to the format TensorRT understands best: ONNX.

__Convert Keras model to ONNX intermediate model and save:__

The ONNX format is a framework-agnostic way of describing and saving the structure and state of deep learning models. We can convert Tensorflow 2 Keras models to ONNX using the keras2onnx tool provided by the ONNX project. (You can find the ONNX project here: https://onnx.ai or on GitHub here: https://github.com/onnx/onnx)

In [7]:
import onnx, keras2onnx

Converting a model with default parameters to an ONNX model is fairly straightforward:

In [8]:
onnx_model = keras2onnx.convert_keras(model, model.name)

tf executing eager_mode: True
tf.keras model eager_mode: False
The ONNX operator number change on the optimization: 458 -> 127


That said, we do need to make one change for our model to work with TensorRT. Keras by default uses a dynamic input shape in its networks - where it can handle arbitrary batch sizes at every update. While TensorRT can do this, it requires extra configuration. 

Instead, we will just set the input size to be fixed to our batch size. This will work with TensorRT out of the box!

__Configure ONNX File Batch Size:__

__Note:__ We need to do two things to set our batch size with ONNX. The first is to modify our ONNX file to change its default batch size to our target batch size. The second is setting our converter to use the __explicit batch__ mode, which will use this default batch size as our final batch size.

In [9]:
# Fix the batch (first) dimension of every graph input to BATCH_SIZE
for graph_input in onnx_model.graph.input:
    batch_dim = graph_input.type.tensor_type.shape.dim[0]
    batch_dim.dim_value = BATCH_SIZE
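
Before saving, it can be worth confirming that the batch dimension really was fixed. The sketch below is an assumption-based helper (the name `input_shapes` is hypothetical) that walks the same `graph.input` fields used in the cell above and reports each input's dimensions. It treats a `dim_value` of 0 as "not fixed", which is how an unset value reads back.

```python
def input_shapes(onnx_model):
    """Map each graph input name to its list of dimensions.

    Fixed dimensions are reported as ints; symbolic or unset ones as strings.
    """
    shapes = {}
    for tensor in onnx_model.graph.input:
        dims = []
        for dim in tensor.type.tensor_type.shape.dim:
            if dim.dim_value > 0:
                dims.append(dim.dim_value)
            else:
                # dim_param holds the symbolic name (for example "N") when one is set
                dims.append(dim.dim_param or "?")
        shapes[tensor.name] = dims
    return shapes
```

Calling `input_shapes(onnx_model)` after the loop above should show `BATCH_SIZE` as the first entry for every input; a string in that position would mean the batch dimension is still dynamic.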

__Save Model:__

In [10]:
model_name = "resnet50_onnx_model.onnx"
onnx.save_model(onnx_model, model_name)
print("Done saving!")

Done saving!


Once we get our model into ONNX format, we can convert it efficiently using TensorRT. For this, TensorRT needs exclusive access to your GPU. If you so much as import Tensorflow, it will generally consume all of your GPU memory. To get around this, before moving on go ahead and shut down this notebook and restart it. (You can do this in the menu: Kernel -> Restart Kernel)

Make sure not to import Tensorflow at any point after restarting the runtime! 

(The following cell is a quick shortcut to make your notebook restart:)

In [None]:
import os, time
print("Restarting kernel  in three seconds...")
time.sleep(3)
print("Restarting kernel now")
os._exit(0) # Shut down all kernels so TRT doesn't fight with Tensorflow for GPU memory - TF monopolizes all GPU memory by default
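
After the restart, a quick way to confirm that Tensorflow has not crept back into the kernel is to look for it in `sys.modules`. This is a small sketch with a hypothetical helper name, `tensorflow_loaded`; note that it only inspects the current Python process and cannot see other notebooks that may still be holding GPU memory.

```python
import sys


def tensorflow_loaded(modules=None):
    """Return True if TensorFlow has been imported into this Python process."""
    modules = sys.modules if modules is None else modules
    return "tensorflow" in modules


if tensorflow_loaded():
    print("Tensorflow is loaded in this kernel; restart before building the engine.")
else:
    print("Tensorflow is not loaded; OK to continue.")
```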

## 2. What batch size(s) am I running inference at?

We have actually already set our inference batch size - see the note above in section 1!

We are going to set our target batch size to a fixed size of 32.

In [1]:
BATCH_SIZE = 32

We need to do two things to set our batch size to a fixed batch size with ONNX: 

1. Modify our ONNX file to change its default batch size to our target batch size, which we did above.
2. Use the trtexec --explicitBatch flag, which we do in section 4 below.

## 3. What precision am I running inference at?

Once we have a converted TensorRT engine (section 4 below), we will load it into the native Python TensorRT runtime. This runtime strikes a balance between the ease of use of the high-level Python runtimes and the low-level C++ runtimes.

First, as before, let's create a dummy batch. Importantly, by default TensorRT will use the input precision you give it as the default precision for the rest of the network.

Remember that lower precisions than FP32 tend to run faster. There are two common reduced precision modes - FP16 and INT8. Graphics cards that are designed to do inference well often have an affinity for one of these two types. This guide was developed on an NVIDIA V100, which favors FP16, so we will use that here by default. INT8 is a more complicated process that requires a calibration step.

In [2]:
import numpy as np

USE_FP16 = True

target_dtype = np.float16 if USE_FP16 else np.float32
dummy_input_batch = np.zeros((BATCH_SIZE, 224, 224, 3), dtype = np.float32) 
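
To get a feel for what FP16 costs numerically, the sketch below rounds a few values to half precision and back using only the standard library (`struct` format code `"e"`). The helper name `to_fp16_and_back` is hypothetical, and the sample values are simply probabilities of the same magnitude as those in the Keras output earlier, not results from the TensorRT engine.

```python
import struct


def to_fp16_and_back(value):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Probabilities of the magnitude seen in the baseline output
for original in (1.6964252e-04, 3.3007501e-04, 6.6087063e-04):
    rounded = to_fp16_and_back(original)
    rel_error = abs(rounded - original) / original
    print(f"{original:.7e} -> {rounded:.7e} (relative error {rel_error:.1e})")
```

Half precision keeps roughly three significant decimal digits, which is usually enough for picking the most likely class but is worth keeping in mind when comparing FP16 outputs against an FP32 baseline.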

## 4. What TensorRT path am I using to convert my model?

TensorRT is able to take ONNX models and convert them entirely into a single, efficient TensorRT engine. If you have not yet restarted your Jupyter kernel since section 1, do so now and rerun the batch size and precision cells above before continuing.

We can use trtexec, a command line tool for working with TensorRT, in order to convert an ONNX model to an engine file.

To convert the model we saved in the previous steps, we need to point to the ONNX file, give trtexec a name to save the engine as, and last specify that we want to use a fixed batch size instead of a dynamic one.

__Remember to shut down all Jupyter notebooks and restart your Jupyter kernel after "1. What format should I save my model in?" - otherwise this cell will crash as TensorRT competes with Tensorflow for GPU memory:__

In [3]:
# May need to shut down all kernels and restart before this - otherwise you might get cuDNN initialization errors:
if USE_FP16:
    !trtexec --onnx=resnet50_onnx_model.onnx --saveEngine=resnet_engine.trt  --explicitBatch --fp16
else:
    !trtexec --onnx=resnet50_onnx_model.onnx --saveEngine=resnet_engine.trt  --explicitBatch

&&&& RUNNING TensorRT.trtexec # trtexec --onnx=resnet50_onnx_model.onnx --saveEngine=resnet_engine.trt --explicitBatch --fp16
[01/30/2021-01:47:03] [I] === Model Options ===
[01/30/2021-01:47:03] [I] Format: ONNX
[01/30/2021-01:47:03] [I] Model: resnet50_onnx_model.onnx
[01/30/2021-01:47:03] [I] Output:
[01/30/2021-01:47:03] [I] === Build Options ===
[01/30/2021-01:47:03] [I] Max batch: explicit
[01/30/2021-01:47:03] [I] Workspace: 16 MiB
[01/30/2021-01:47:03] [I] minTiming: 1
[01/30/2021-01:47:03] [I] avgTiming: 8
[01/30/2021-01:47:03] [I] Precision: FP32+FP16
[01/30/2021-01:47:03] [I] Calibration: 
[01/30/2021-01:47:03] [I] Refit: Disabled
[01/30/2021-01:47:03] [I] Safe mode: Disabled
[01/30/2021-01:47:03] [I] Save engine: resnet_engine.trt
[01/30/2021-01:47:03] [I] Load engine: 
[01/30/2021-01:47:03] [I] Builder Cache: Enabled
[01/30/2021-01:47:03] [I] NVTX verbosity: 0
[01/30/2021-01:47:03] [I] Tactic sources: Using default tactic sources
[01/30/2021-01:47:03] [I] Input(s)s format:



__The trtexec Logs:__

Above, trtexec does a lot of things! Some important things to note:

__First__, _"PASSED"_ is what you want to see in the last line of the log above. We can see our conversion was successful!

__Second__, we can see the resnet_engine.trt engine file has indeed been successfully created:

In [4]:
!ls -la

total 2547292
drwxrwxr-x  8   1000  1000       4096 Jan 30 01:46  .
drwxrwxr-x  5   1000  1000       4096 Jan 14 22:29  ..
drwxr-xr-x  2   1000  1000       4096 Jan 29 23:39  .ipynb_checkpoints
-rw-rw-r--  1   1000  1000       6570 Jan 30 01:10 '0. Running This Guide.ipynb'
-rw-r--r--  1 root   root      502649 Jan 30 01:06 '1. Introduction.ipynb'
-rw-rw-r--  1   1000  1000      23645 Jan 29 23:47 '2. Using the Tensorflow TensorRT Integration.ipynb'
-rw-rw-r--  1   1000  1000      38440 Jan 30 01:46 '3. Using Tensorflow 2 through ONNX.ipynb'
-rw-rw-r--  1   1000  1000      11961 Jan 30 01:46 '4. Using PyTorch through ONNX.ipynb'
-rw-rw-r--  1   1000  1000       7052 Jan 29 23:41 '5. Understanding TensorRT Runtimes.ipynb'
drwxrwxr-x  5   1000  1000       4096 Jan 29 23:41 'Additional Examples'
drwxr-xr-x  2 root   root        4096 Jan 30 00:58  __pycache__
-rw-rw-r--  1   1000  1000       1091 Jan 14 22:29  benchmark.py
-rw-------  1 root   root  2147479552 Jan 27 08:24  core
-rw-rw-r--

__Third__, you can see timing details above using trtexec - these are in the ideal case with no overhead. Depending on how you run your model, a considerable amount of overhead can be added to this. We can do timing in our Python runtime below - but keep in mind performing C++ inference would likely be faster.
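
The same conversion can be scripted outside Jupyter. The sketch below assembles the trtexec arguments used in this notebook and checks the last non-empty log line for "PASSED", as described above. The helper names `build_trtexec_command` and `conversion_passed` are hypothetical, the flags mirror this notebook's TensorRT version and may differ in other releases, and the conversion is only attempted when `trtexec` and the ONNX file are actually present.

```python
import os
import shutil
import subprocess


def build_trtexec_command(onnx_path, engine_path, use_fp16=True):
    """Assemble the trtexec arguments used in this notebook."""
    command = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--explicitBatch",
    ]
    if use_fp16:
        command.append("--fp16")
    return command


def conversion_passed(log_text):
    """Return True if the last non-empty log line contains 'PASSED'."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    return bool(lines) and "PASSED" in lines[-1]


command = build_trtexec_command("resnet50_onnx_model.onnx", "resnet_engine.trt")
print(" ".join(command))

# Only attempt the conversion when trtexec is installed and the model exists
if shutil.which("trtexec") and os.path.exists("resnet50_onnx_model.onnx"):
    completed = subprocess.run(command, capture_output=True, text=True)
    print("Conversion passed:", conversion_passed(completed.stdout))
```

Passing the arguments as a list avoids shell quoting issues if the model or engine paths contain spaces.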

## 5. What TensorRT runtime am I targeting?

We want to run our TensorRT inference in Python - so the TensorRT Python API is a great way of testing our model out in Jupyter, and is still quite performant.

To use it, we need to do a few steps:

__Load our engine into a tensorrt.Runtime:__

In [5]:
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))

# Read the serialized engine; the file is closed when the with-block exits
with open("resnet_engine.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

Note: if this cell has issues, restarting all Jupyter kernels and rerunning only the batch size and precision cells above before trying again often helps.

__Allocate input and output memory, give TRT pointers (bindings) to it:__

d_input and d_output refer to the memory regions on our 'device' (aka GPU) - as opposed to memory on our normal RAM, where Python holds its variables (such as 'output' below).

In [6]:
output = np.empty([BATCH_SIZE, 1000], dtype = target_dtype) # Need to set output dtype to FP16 to enable FP16

# Allocate device memory
d_input = cuda.mem_alloc(1 * dummy_input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)

bindings = [int(d_input), int(d_output)]

stream = cuda.Stream()
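
The `nbytes` values passed to `cuda.mem_alloc` are just the element count multiplied by the element size. The sketch below works through that arithmetic for the shapes used in this notebook without needing a GPU; the helper name `buffer_nbytes` and the `ITEM_SIZE_BYTES` table are illustrative assumptions, not part of pycuda or TensorRT.

```python
from math import prod

# Bytes per element for the precisions discussed in this notebook
ITEM_SIZE_BYTES = {"float32": 4, "float16": 2, "int8": 1}


def buffer_nbytes(shape, dtype):
    """Bytes needed for a dense tensor of the given shape and dtype name."""
    return prod(shape) * ITEM_SIZE_BYTES[dtype]


BATCH_SIZE = 32
input_bytes = buffer_nbytes((BATCH_SIZE, 224, 224, 3), "float32")
output_fp16_bytes = buffer_nbytes((BATCH_SIZE, 1000), "float16")
output_fp32_bytes = buffer_nbytes((BATCH_SIZE, 1000), "float32")

print(f"input  (FP32): {input_bytes:,} bytes")
print(f"output (FP16): {output_fp16_bytes:,} bytes")
print(f"output (FP32): {output_fp32_bytes:,} bytes")
```

The device buffers need to be at least as large as what the engine's bindings expect. If your engine's output binding turned out to be FP32 rather than FP16 (worth verifying on your own engine), the output buffer would need the larger size.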

__Set up prediction function:__

This involves a copy from CPU RAM to GPU VRAM, executing the model, then copying the results back from GPU VRAM to CPU RAM:

In [7]:
def predict(batch): # result gets copied into output
    # Transfer input data to device
    cuda.memcpy_htod_async(d_input, batch, stream)
    # Execute model
    context.execute_async_v2(bindings, stream.handle, None)
    # Transfer predictions back
    cuda.memcpy_dtoh_async(output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()
    
    return output

This is all we need to run predictions using our TensorRT engine in a Python runtime!

## Performance Comparison:

Last, we can see how quickly we can feed a single batch to TensorRT, which we can compare to our original Tensorflow experiment from earlier.

In [8]:
print("Warming up...")

predict(dummy_input_batch)

print("Done warming up!")

Warming up...
Done warming up!


We use the %%timeit Jupyter magic again. Note that %%timeit is fairly rough, and for any actual benchmarking better controlled testing is required - preferably outside of Jupyter.

In [9]:
%%timeit

pred = predict(dummy_input_batch) # Check TRT performance

7.23 ms ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
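
To put the two `%%timeit` results side by side, the sketch below converts the reported per-batch latencies into throughput and a speedup ratio. The numbers are the ones printed in this notebook (51.2 ms for Keras, 7.23 ms for TensorRT); the helper name `images_per_second` is hypothetical, and the result is only as reliable as the rough timings it is based on.

```python
def images_per_second(batch_size, latency_ms):
    """Convert a per-batch latency in milliseconds to images per second."""
    return batch_size / (latency_ms / 1000.0)


BATCH_SIZE = 32
tf_latency_ms = 51.2   # Keras predict_on_batch, from the baseline %%timeit output
trt_latency_ms = 7.23  # TensorRT predict, from the %%timeit output above

tf_throughput = images_per_second(BATCH_SIZE, tf_latency_ms)
trt_throughput = images_per_second(BATCH_SIZE, trt_latency_ms)
speedup = tf_latency_ms / trt_latency_ms

print(f"Keras:    {tf_throughput:.0f} images/s")
print(f"TensorRT: {trt_throughput:.0f} images/s")
print(f"Speedup:  {speedup:.2f}x")
```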


In [10]:
print("Prediction: " + str(np.argmax(output[0]))) # Predicted class index for the first sample in the batch

Prediction: 74


In [11]:
pred = predict(dummy_input_batch)

pred.shape

(32, 1000)

## Next Steps:

<h4> Profiling </h4>

This is a great next step for further optimizing and debugging models you are working on productionizing.

You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html

<h4>  TRT Dev Docs </h4>

Main documentation page for the ONNX, layer builder, C++, and legacy APIs

You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html

<h4>  TRT OSS GitHub </h4>

Contains OSS TRT components, sample applications, and plugin examples

You can find it here: https://github.com/NVIDIA/TensorRT


#### TRT Supported Layers:

https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#layers-precision-matrix

#### TRT ONNX Plugin Example:

https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/samplePlugin
