# Exporting to ONNX and converting to TRT

## Export to ONNX

We will use the same model as in part 1, a Vision Transformer pretrained with the DINOv2 
framework.

If we naively try to export the model to ONNX, the export fails, because 
one of the operations used in the model (`upsample_bicubic2d_aa`) is not yet 
supported by ONNX, as shown in this [PR](https://github.com/microsoft/onnxscript/pull/1208).

We first need to fix the function that causes the problem.

In [None]:
import torch
from onnx_utils import fix_dinov2_for_onnx_export

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg_lc')
model = fix_dinov2_for_onnx_export(model)
model.eval()
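
The helper `fix_dinov2_for_onnx_export` lives in `onnx_utils` and its code is not shown here. As a rough idea of what such a fix could look like, the sketch below assumes that the unsupported op comes from antialiased bicubic interpolation, and that the DINOv2 modules control it through a boolean attribute named `interpolate_antialias`. Both points are assumptions, the function name `disable_antialias` is hypothetical, and the actual helper may work differently.

```python
def disable_antialias(model):
    """Switch off antialiased interpolation on every submodule exposing the flag.

    Assumption: the flag is called `interpolate_antialias`. Modules that do
    not have the attribute (or already have it set to False) are left alone.
    Returns the number of modules that were patched.
    """
    patched = 0
    # nn.Module.modules() yields the model itself and all of its submodules
    for module in model.modules():
        if getattr(module, "interpolate_antialias", False):
            module.interpolate_antialias = False
            patched += 1
    return patched
```

Turning antialiasing off can slightly change the interpolated positional embeddings, so it is worth comparing the outputs before and after the patch on a few inputs.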

In [None]:
from onnx_utils import export_to_onnx

onnx_program = export_to_onnx(model)

We can do a quick check to verify that the outputs of the Torch model and the ONNX 
model are close.

In [None]:
import onnxruntime
import numpy as np


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()


x_torch = torch.randn(1, 3, 518, 518)
y_hat_torch = model(x_torch)

ort_session = onnxruntime.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# compute ONNX Runtime output prediction
x_ort = {ort_session.get_inputs()[0].name: to_numpy(x_torch)}
y_hat_ort = ort_session.run(None, x_ort)

# compare ONNX Runtime and PyTorch results
np.testing.assert_allclose(to_numpy(y_hat_torch), y_hat_ort[0], rtol=1e-02, atol=1e-05)

print("Exported model has been tested with ONNXRuntime, and the result looks good!")
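
To make the tolerances used above more concrete, here is a small pure-Python sketch of the element-wise criterion that `np.testing.assert_allclose` applies, namely `|actual - desired| <= atol + rtol * |desired|`. The helper name `within_tolerance` and the sample values are made up for illustration.

```python
def within_tolerance(actual, desired, rtol=1e-2, atol=1e-5):
    """Element-wise criterion used by np.testing.assert_allclose."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# A value around 2.0 may deviate by about 0.02 (1% of the reference value).
print(within_tolerance(2.015, 2.0))   # difference 0.015 is allowed
print(within_tolerance(2.03, 2.0))    # difference 0.03 is too large

# Near zero the relative term vanishes, so only atol (1e-5) is left.
print(within_tolerance(1e-4, 0.0))    # difference 1e-4 is too large
```

With these settings, outputs close to zero are held to a much stricter absolute bound than large outputs, which is worth keeping in mind if the check fails on a few small values.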

Let's run a quick inference time benchmark on the Torch and ONNX models.

In [None]:
x_torch = torch.randn(1, 3, 518, 518)

In [None]:
%%timeit

model(x_torch)

In [None]:
import onnxruntime

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

In [None]:
ort_session = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

x_ort = {ort_session.get_inputs()[0].name: to_numpy(x_torch)}

In [None]:
%%timeit

y_hat_ort = ort_session.run(None, x_ort)

The ONNX model seems to be slightly faster than the Torch model.

Since ONNX models can also run on a GPU, we can run the same small benchmark with both 
models on the GPU.

In [None]:
%%capture

x_torch = x_torch.to("cuda")

model.cuda()

In [None]:
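%%timeit

model(x_torch)
# CUDA kernels are launched asynchronously: wait for them to finish
# so that the measured time covers the actual computation
torch.cuda.synchronize()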

In [None]:
ort_session = onnxruntime.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

x_ort = {ort_session.get_inputs()[0].name: to_numpy(x_torch)}

In [None]:
%%timeit

y_hat_ort = ort_session.run(None, x_ort)

Again we notice a small improvement in timing when using the ONNX model.
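
Outside of a notebook `%%timeit` is not available, so the same kind of measurement can be sketched with a small helper based on `time.perf_counter`. The function `benchmark` and its parameters are hypothetical names for this example. The optional `sync` callback is there for GPU runs, where one would typically pass `torch.cuda.synchronize`.

```python
import statistics
import time


def benchmark(fn, warmup=3, repeats=10, sync=None):
    """Return (mean, standard deviation) of the run time of fn() in seconds."""
    # Warm-up runs are not measured (lazy initialisation, caches, ...)
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # make sure asynchronous work is finished before stopping the clock
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


# Possible usage with the objects defined above (not executed here):
# benchmark(lambda: model(x_torch), sync=torch.cuda.synchronize)
# benchmark(lambda: ort_session.run(None, x_ort))
```

Note that `x_ort` holds a NumPy array on the host, so the ONNX Runtime timing on GPU also includes the transfers between host and device.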

## To TRT and beyond

Now that we have a working ONNX program, it is time to try to export it to TensorRT.

Of course we can't do it straight away (this would be too easy), because if we do,
we get the following error:
```
[11/14/2024-10:20:05] [E] [TRT] ModelImporter.cpp:948: --- End node ---
[11/14/2024-10:20:05] [E] [TRT] ModelImporter.cpp:951: ERROR: onnxOpCheckers.cpp:151 In function emptyOutputChecker:
[8] This version of TensorRT doesn't support mode than 1 outputs for LayerNormalization nodes!
[11/14/2024-10:20:05] [E] [TRT] ModelImporter.cpp:946: While parsing node number XXX [LayerNormalization -> "getitem_xx"]:
[11/14/2024-10:20:05] [E] [TRT] ModelImporter.cpp:947: --- Begin node ---
```

This basically tells us that our `LayerNormalization` layers have too many outputs.

A quick look at the ONNX graph with `netron` will give us a clue of what's happening.

Our `LayerNormalization` nodes have 3 outputs: the normalized tensor, plus the mean and 
the inverse standard deviation computed by the layer normalization. These two 
supplementary outputs are used to speed up training.
They are probably an artifact from the pretrained model.

For inference we do not use these values, so we can safely remove them.

We can do that using the `onnx_graphsurgeon` module.

In [None]:
from trt_utils import fix_graph_for_trt_export

model = fix_graph_for_trt_export("model.onnx")
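
The code of `fix_graph_for_trt_export` is not shown here. A minimal sketch of the idea with `onnx_graphsurgeon` could look like the following: keep only the first output of every `LayerNormalization` node, then let the graph cleanup drop the tensors that are no longer produced. The function names `strip_extra_layernorm_outputs` and `strip_layernorm_file` are hypothetical, and the sketch assumes the extra outputs are not consumed by any other node.

```python
def strip_extra_layernorm_outputs(nodes):
    """Keep only the first output (the normalized tensor) of LayerNormalization nodes.

    Works on any objects exposing `.op` and `.outputs`, such as
    onnx_graphsurgeon nodes. Returns the number of modified nodes.
    """
    changed = 0
    for node in nodes:
        if node.op == "LayerNormalization" and len(node.outputs) > 1:
            node.outputs = [node.outputs[0]]
            changed += 1
    return changed


def strip_layernorm_file(src_path, dst_path):
    """Load an ONNX file, strip the extra outputs and save the result."""
    # Imported lazily so the helper above stays usable without these packages
    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load(src_path))
    strip_extra_layernorm_outputs(graph.nodes)
    graph.cleanup().toposort()  # remove dangling tensors, restore a valid node order
    onnx.save(gs.export_onnx(graph), dst_path)
```

Opening the resulting file in `netron` is a quick way to confirm that each `LayerNormalization` node is left with a single output.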

Now we can convert it to a TensorRT engine and serialize it.

In [None]:
from trt_utils import build_engine

build_engine("model.onnx")

Although the builder parses the file correctly, the hardware I use is unfortunately too 
old, and I can't convert the model to a TensorRT engine with it. I get the following error:

`IBuilder::buildSerializedNetwork: Error Code 9: API Usage Error (Target GPU SM 61 is not supported by this TensorRT release.`
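
The `SM 61` in this message refers to the compute capability of the GPU (6.1 here). To check which value your own GPU reports before attempting a build, a small sketch along these lines can help. The helper names `sm_tag` and `current_gpu_sm` are hypothetical, and which SM versions are accepted depends on the TensorRT release you have installed, so check its support matrix.

```python
def sm_tag(capability):
    """Format a (major, minor) compute capability the way TensorRT prints it."""
    major, minor = capability
    return f"SM {major}{minor}"


def current_gpu_sm():
    """Return the SM tag of the current CUDA device (requires PyTorch with CUDA)."""
    import torch  # imported lazily so sm_tag stays usable without PyTorch

    return sm_tag(torch.cuda.get_device_capability())


print(sm_tag((6, 1)))  # the GPU from the error message above
```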