Survey TensorRT for inference #8492

Closed
sidgoyal78 opened this issue Feb 22, 2018 · 3 comments
Assignees
Labels
Prediction (formerly "Inference"; includes C-API inference issues, etc.)

Comments

@sidgoyal78
Contributor

We basically want to see how we can integrate TensorRT to support inference on models trained with Fluid.

@sidgoyal78 sidgoyal78 self-assigned this Feb 22, 2018
@sidgoyal78 sidgoyal78 added the Prediction (formerly "Inference"; includes C-API inference issues, etc.) label Feb 22, 2018
@sidgoyal78
Contributor Author

sidgoyal78 commented Feb 24, 2018

Problem description

The main goal is to come up with an approach for integrating TensorRT with PaddlePaddle's inference library (which is in C++). We want to do this in order to use TensorRT for performing inference on a model saved using Fluid.

To address this, we will first briefly discuss TensorRT and the functionality it offers, and then propose our first attempt.

TensorRT

Introduction

TensorRT is a deep learning inference optimizer and runtime from NVIDIA, aimed at deploying trained deep networks for inference on a variety of production platforms. Using TensorRT involves two phases:

  • Build phase: In the build phase, TensorRT performs optimizations on the configuration of the neural network and generates an optimized plan for the forward pass computation.
  • Deployment phase: In this phase, TensorRT performs inference by executing the plan on the input data (which comes from a service or a user application).

Build phase

This step is performed only once, prior to deployment. A model trained with any popular deep learning framework first has to be parsed by TensorRT and imported into the TensorRT Optimizer module. The TensorRT Optimizer performs several optimizations (briefly discussed below) and outputs an optimized inference execution engine. This execution engine, when serialized to a file on disk, is known as a plan file.

The crucial part here is importing a trained model. For Caffe and TensorFlow, TensorRT provides simple Python and C++ APIs to import the models directly. However, for other frameworks, we need to use TensorRT's Network Definition API to specify the network description (either in C++ or Python) before loading it into TensorRT.

An image summarizing this phase is: https://devblogs.nvidia.com/wp-content/uploads/2017/12/pasted-image-0-4-768x656.png
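
To make the Network Definition API route concrete, here is a minimal sketch of the build phase using TensorRT's C++ API (class and method names as in the TensorRT 3.x/4.x headers; the toy single-layer network, its constant weights, and the tensor names are placeholders for illustration, not an actual Fluid import):

```cpp
#include <fstream>
#include <iostream>
#include <vector>
#include "NvInfer.h"

// Minimal logger required by the TensorRT C++ API.
class Logger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) override {
    if (severity != Severity::kINFO) std::cerr << msg << std::endl;
  }
};

int main() {
  Logger logger;

  // 1. Create a builder and an empty network definition.
  nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
  nvinfer1::INetworkDefinition* network = builder->createNetwork();

  // 2. Describe the network layer by layer. As a placeholder we add a single
  //    fully-connected layer on a 1x28x28 input with constant weights.
  auto* input = network->addInput("data", nvinfer1::DataType::kFLOAT,
                                  nvinfer1::DimsCHW{1, 28, 28});
  std::vector<float> w_data(10 * 28 * 28, 0.01f), b_data(10, 0.0f);
  nvinfer1::Weights w{nvinfer1::DataType::kFLOAT, w_data.data(),
                      static_cast<int64_t>(w_data.size())};
  nvinfer1::Weights b{nvinfer1::DataType::kFLOAT, b_data.data(),
                      static_cast<int64_t>(b_data.size())};
  auto* fc = network->addFullyConnected(*input, /*nbOutputs=*/10, w, b);
  fc->getOutput(0)->setName("prob");
  network->markOutput(*fc->getOutput(0));

  // 3. Build the optimized engine and serialize it to a plan file on disk.
  builder->setMaxBatchSize(1);
  builder->setMaxWorkspaceSize(1 << 20);
  nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);
  nvinfer1::IHostMemory* plan = engine->serialize();
  std::ofstream out("model.plan", std::ios::binary);
  out.write(static_cast<const char*>(plan->data()), plan->size());

  plan->destroy();
  engine->destroy();
  network->destroy();
  builder->destroy();
  return 0;
}
```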

The various optimizations performed by the TensorRT Optimizer are:

  • Graph optimizations are performed to restructure the graph by doing layer and tensor fusion.
  • FP16 and INT8 precision calibration is supported to convert FP32 to lower precision.
  • Kernel auto tuning is performed to choose the best implementation of kernels from a library of kernels, for the given input data size, layout, etc.
  • Memory footprint is reduced by reusing memory for tensors.
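
For reference, a sketch of where these knobs surface on the builder at engine-build time (method names as in the TensorRT 3.x IBuilder; later releases renamed some of them, e.g. setFp16Mode, and the calibrator is a user-supplied object shown here only as a parameter):

```cpp
#include "NvInfer.h"

// Sketch: optimizer-related settings on the builder (TensorRT 3.x-era names).
void ConfigureBuilder(nvinfer1::IBuilder* builder,
                      nvinfer1::IInt8Calibrator* calibrator) {
  builder->setMaxBatchSize(16);           // batch size the kernels are tuned for
  builder->setMaxWorkspaceSize(1 << 28);  // scratch memory kernel tactics may use

  if (builder->platformHasFastFp16()) {
    builder->setHalf2Mode(true);          // allow FP16 kernels
  }
  if (calibrator != nullptr) {
    builder->setInt8Mode(true);              // allow INT8 kernels
    builder->setInt8Calibrator(calibrator);  // calibration data for FP32 -> INT8
  }
  // Layer/tensor fusion and memory reuse require no flags; they are applied
  // automatically when buildCudaEngine() is called.
}
```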

Deploy phase

In this phase, the saved plan file is loaded and deserialized to create a TensorRT runtime engine object, which is then used to perform inference on new data.
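
A corresponding sketch of the deploy phase in C++ (buffer allocation and CUDA stream handling are elided; the binding names "data" and "prob" are the placeholder names from the build-phase sketch above):

```cpp
#include <fstream>
#include <iterator>
#include <vector>
#include "NvInfer.h"

// Load a serialized plan file and run one batch of inference.
// d_input / d_output are device pointers prepared by the caller.
void RunPlan(const char* plan_path, nvinfer1::ILogger& logger,
             void* d_input, void* d_output) {
  // 1. Read the serialized engine from disk.
  std::ifstream in(plan_path, std::ios::binary);
  std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

  // 2. Deserialize it into a runtime engine and create an execution context.
  nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
  nvinfer1::ICudaEngine* engine =
      runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
  nvinfer1::IExecutionContext* context = engine->createExecutionContext();

  // 3. Bindings are device pointers ordered by the engine's binding indices.
  void* bindings[2];
  bindings[engine->getBindingIndex("data")] = d_input;
  bindings[engine->getBindingIndex("prob")] = d_output;
  context->execute(/*batchSize=*/1, bindings);

  context->destroy();
  engine->destroy();
  runtime->destroy();
}
```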

Our approach

As discussed in the "Build phase" subsection, the most important step for our use case is importing a model trained with PaddlePaddle Fluid into TensorRT.

From the documentation, we find that networks from other frameworks (besides Caffe and TensorFlow) can be imported via the UFF (Universal Framework Format), a data format that describes an execution graph for a deep network. The format consists of a serialization syntax and a definition of each operator (as a protobuf schema and Python descriptors, respectively).

The documentation contains an example of using TensorRT's Python API to convert a model from PyTorch into a TensorRT engine. However, there is no example demonstrating how to use TensorRT's C++ API to convert a model from another framework. So the first task is to come up with an example where we use TensorRT's C++ API to convert a model into the required format.
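
As a starting point for that task, below is a rough, hypothetical skeleton of what such a converter could look like: it walks the operators of a saved Fluid ProgramDesc and dispatches each one to a function that adds the corresponding TensorRT layer(s) through the Network Definition API. The converter registry, GetConverterRegistry, and the per-operator converter signature are assumptions for illustration only, and the Paddle header path may differ by version:

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include "NvInfer.h"
#include "paddle/fluid/framework/program_desc.h"

// Hypothetical: a per-operator converter adds the TensorRT layer(s)
// equivalent to one Fluid operator via the Network Definition API.
using OpConverter = std::function<void(const paddle::framework::OpDesc&,
                                       nvinfer1::INetworkDefinition*)>;

// Hypothetical registry of converters keyed by Fluid op type
// (e.g. "mul", "conv2d", "relu"). Filling this table is the bulk of the work.
std::unordered_map<std::string, OpConverter>* GetConverterRegistry();

void ConvertProgramToTensorRT(const paddle::framework::ProgramDesc& program,
                              nvinfer1::INetworkDefinition* network) {
  auto* registry = GetConverterRegistry();
  // Fluid stores the inference graph as a list of operators in block 0.
  for (auto* op : program.Block(0).AllOps()) {
    auto it = registry->find(op->Type());
    if (it == registry->end()) {
      throw std::runtime_error("No TensorRT converter for op: " + op->Type());
    }
    it->second(*op, network);  // add the equivalent TensorRT layer(s)
  }
}
```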

Regarding current support of ONNX with TensorRT:

Thus, we think it is reasonable to come up with a custom converter that imports fluid's model into TensorRT (using the C++ API).


@kexinzhao kexinzhao added this to Basic Usage (DOING) in Inference Framework Feb 27, 2018
@Xreki
Contributor

Xreki commented Mar 9, 2018

You can refer to the implementation in tensorflow: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensorrt

@HoJianBo

Integrating TensorRT into PaddlePaddle as a third-party inference platform would be a good serving option.

@Xreki Xreki moved this from Basic Usage (DOING) to Performance Tuning (DOING) in Inference Framework Apr 3, 2018