# TensorRT Inference

After training the deep learning network, the next step is usually to deploy the model to production. The most straightforward way is to put the PyTorch model in inference mode. The code below loads the trained weights from the PyTorch checkpoint file and sets the weights of the deep learning model. Inference is then a single forward pass from the input to the output. It runs fairly quickly and gives accurate results in less than 1 ms once the GPU is warmed up. The single timed call below reports about 0.18 s, presumably because it is the first forward pass in the notebook and the timing also covers the host-to-device copy. Here is an example from the last notebook:-

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
class Net(nn.Module):

    def __init__(self, hidden=512):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(6, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.fc4 = nn.Linear(hidden, hidden)
        self.fc5 = nn.Linear(hidden, hidden)
        self.fc6 = nn.Linear(hidden, 1)
        self.register_buffer('norm',
                             torch.tensor([200.0,
                                           198.0,
                                           200.0,
                                           0.4,
                                           0.2,
                                           0.2]))

    def forward(self, x):
        x = x / self.norm
        x = F.elu(self.fc1(x))
        x = F.elu(self.fc2(x))
        x = F.elu(self.fc3(x))
        x = F.elu(self.fc4(x))
        x = F.elu(self.fc5(x))
        return self.fc6(x)

In [2]:
! ((test ! -f './check_points/model_best.pth.tar' ||  test ! -f './check_points/512/model_best.pth.tar') && \
  bash ./download_data.sh) || echo "Dataset is already present. No need to re-download it."

Dataset is already present. No need to re-download it.


In [3]:
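checkpoint = torch.load('check_points/512/model_best.pth.tar')
model = Net().cuda()
model.load_state_dict(checkpoint['state_dict'])
model.eval()  # switch the model to inference mode
inputs = torch.tensor([[110.0, 100.0, 120.0, 0.35, 0.1, 0.05]])
start = time.time()
with torch.no_grad():  # gradients are not needed for inference
    inputs = inputs.cuda()
    result = model(inputs)
torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
end = time.time()
print('result %.4f inference time %.6f' % (result.item(), end - start))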

result 18.6810 inference time 0.184153


However, we can do much better. NVIDIA provides a powerful inference optimization tool, [TensorRT](https://developer.nvidia.com/tensorrt), which includes a deep learning inference optimizer and a runtime that deliver low latency and high throughput for deep learning inference applications. It helped NVIDIA win the [MLPerf Inference benchmark](https://devblogs.nvidia.com/nvidia-mlperf-v05-ai-inference/). In this [blog post](https://devblogs.nvidia.com/nlu-with-tensorrt-bert/#disqus_thread), TensorRT accelerates BERT natural language understanding inference to 2.2 ms on a T4 GPU.

In this notebook, inspired by the [BERT inference blog](https://devblogs.nvidia.com/nlu-with-tensorrt-bert/#disqus_thread), we demonstrate step by step how to convert the trained Asian Barrier Option model into a TensorRT inference engine to get significant acceleration.

Our network is a simple feed-forward, fully connected network with the `Elu` activation function. `Elu` is not directly supported by TensorRT yet, so we will show how to implement the activation function as a custom CUDA kernel.

From the PyTorch documentation, we can find the mathematical formula of the `ELU` activation function:
```
ELU(x)=max(0,x)+min(0,α∗(exp(x)−1))
```

This can be translated into CUDA code as:-
```c++
template <typename T, unsigned TPB>
__global__ void eluKernel(const T a, const T b, int n, const T* input, T* output)
{

    const int idx = blockIdx.x * TPB + threadIdx.x;

    if (idx < n)
    {
        const T in = input[idx];
        const T tmp = exp(in) - b;
        const T result = (a > in ? a : in) + (a < tmp ? a : tmp);
        output[idx] = result;
    }
}

```

where `a` is the constant 0 and `b` is the constant 1. The kernel fixes α = 1, which is the default used by `F.elu` in the PyTorch model above. We pass `a` and `b` as values of type `T` so that the same kernel can serve single-precision or half-precision inference in TensorRT. Following the example described in the [BERT inference blog](https://devblogs.nvidia.com/nlu-with-tensorrt-bert/#disqus_thread), we wrap the CUDA kernel in `EluPluginDynamic`, which is a subclass of `nvinfer1::IPluginV2DynamicExt`.
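
As a quick sanity check of the translation, the per-element arithmetic of the kernel can be mirrored in plain Python and compared with the formula. This is only a sketch: the function names `elu_reference` and `elu_kernel_body` are made up for illustration, and it assumes α = 1, `a` = 0 and `b` = 1 as described above.

```python
import math

def elu_reference(x, alpha=1.0):
    """ELU written exactly as in the PyTorch formula."""
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))

def elu_kernel_body(x, a=0.0, b=1.0):
    """Mirror of the arithmetic done for one element inside eluKernel."""
    tmp = math.exp(x) - b
    # (a > in ? a : in) is max(0, x); (a < tmp ? a : tmp) is min(0, exp(x) - 1)
    return (a if a > x else x) + (a if a < tmp else tmp)

# The two versions should agree on negative, zero and positive inputs.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(elu_reference(x), elu_kernel_body(x), abs_tol=1e-12)
```

The two ternary expressions in the kernel are simply `max(0, x)` and `min(0, exp(x) - 1)`, so the kernel matches the formula only for α = 1.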

Run the following commands to build the plugins into dynamic libraries:-

In [4]:
!mkdir -p elu_activation/build

In [8]:
cd elu_activation/build

/Projects/gQuant/notebooks/asian_barrier_option/elu_activation/build


In [9]:
!cmake ../

-- The CXX compiler identification is GNU 7.4.0
-- The CUDA compiler identification is NVIDIA 10.1.243
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Configuring done
-- Generating done
-- Build files have been written to: /Projects/gQuant/notebooks/asian_barrier_option/elu_activation/build


In [10]:
!make -j

Scanning dependencies of target common
[ 20%] Building CXX object CMakeFiles/common.dir/log/logger.cpp.o
[ 40%] Linking CXX shared library libcommon.so
[ 40%] Built target common
Scanning dependencies of target my_plugins
[ 60%] Building CUDA object CMakeFiles/my_plugins.dir/plugins/eluPlugin.cu.o
[ 80%] Linking CUDA device code CMakeFiles/my_plugins.dir/cmake_device_link.o
[100%] Linking CUDA shared library libmy_plugins.so
[100%] Built target my_plugins


In [11]:
cd ../../

/Projects/gQuant/notebooks/asian_barrier_option


Now we can use ctypes to load those dynamic libraries and register them in TensorRT:-

In [12]:
import tensorrt as trt
import ctypes
import numpy as np
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
ctypes.CDLL("libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("elu_activation/build/libcommon.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("elu_activation/build/libmy_plugins.so", mode=ctypes.RTLD_GLOBAL)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")
plg_registry = trt.get_plugin_registry()
elu_plg_creator = plg_registry.get_plugin_creator("CustomEluPluginDynamic", "1", "")

The next step is to convert the PyTorch checkpoint weights into TensorRT weights:-

In [13]:
def get_trt_weights(model_dict):
    # wrap every entry of the state dict (weights, biases and the 'norm' buffer)
    # as a trt.Weights object backed by a host NumPy array
    return {k: trt.Weights(v.cpu().numpy()) for k, v in model_dict.items()}

weights = get_trt_weights(checkpoint['state_dict'])

We can check that the weights have the following weight keys corresponding to each of the layers in the model.

In [14]:
print(weights.keys())

dict_keys(['norm', 'fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias', 'fc4.weight', 'fc4.bias', 'fc5.weight', 'fc5.bias', 'fc6.weight', 'fc6.bias'])
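
The keys alone do not show the tensor shapes, which matter when the layers are defined by hand later on. The sketch below derives the shapes one would expect from the `Net` class defined at the top of this notebook. The helper name `expected_shapes` is hypothetical, and the layer count of six is read off the model definition.

```python
def expected_shapes(hidden=512, n_inputs=6):
    """Expected shape of each state-dict entry for the Net class above."""
    shapes = {'norm': (n_inputs,)}
    in_features = n_inputs
    for lid in range(1, 7):
        # fc1..fc5 project to `hidden`, fc6 projects to a single price
        out_features = 1 if lid == 6 else hidden
        # nn.Linear stores its weight as (out_features, in_features)
        shapes['fc%d.weight' % lid] = (out_features, in_features)
        shapes['fc%d.bias' % lid] = (out_features,)
        in_features = out_features
    return shapes

shapes = expected_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

With the checkpoint loaded, each `tuple(v.shape)` in `checkpoint['state_dict']` could be compared against this table before building the network.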


To build the TensorRT engine, we need the network to be defined. There are two ways of doing this. We can either use a network parser, which converts a TensorFlow static graph or an ONNX graph into a TensorRT network directly, or we can use the Network API to define the network layer by layer. In this example, we will show the latter approach.

From the PyTorch model, we see that the first step is to normalize the input to the range [0, 1]. In TensorRT, it can be done by:-

In [15]:
def normalize_layer(network, weights, inputs):
    # the constant layer to load the normalization factor
    const = network.add_constant((1, 6, 1, 1), weights['norm'])
    output = network.add_elementwise(inputs, const.get_output(0), trt.ElementWiseOperation.DIV)  
    out_tensor = output.get_output(0)
    return out_tensor
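
To see what the constant layer and the element-wise `DIV` layer compute together, here is a plain-Python sketch of the same arithmetic on the sample input used in this notebook. The `normalize` helper is illustrative only and handles one sample; in the TensorRT network the `(1, 6, 1, 1)` constant is divided into every sample of the `(B, 6, 1, 1)` input.

```python
# Normalization factors copied from the `norm` buffer of the PyTorch model.
NORM = [200.0, 198.0, 200.0, 0.4, 0.2, 0.2]

def normalize(option_params, norm=NORM):
    """Element-wise division of one input sample by the normalization factors."""
    return [x / n for x, n in zip(option_params, norm)]

sample = [110.0, 100.0, 120.0, 0.35, 0.1, 0.05]
scaled = normalize(sample)
print(scaled)

# Every normalized entry of this sample falls inside [0, 1].
assert all(0.0 <= v <= 1.0 for v in scaled)
```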

After the normalization, the input is projected to the `hidden` dimension and passed through the `Elu` activation. This can be done by:

In [16]:
def projection_activation(network, weights, inputs, lid):
    layer = network.add_fully_connected(inputs, hidden, weights['fc'+str(lid)+'.weight'], weights['fc'+str(lid)+'.bias'])    
    pfc = trt.PluginFieldCollection()
    plug = elu_plg_creator.create_plugin("elu", pfc)
    elu_layer = network.add_plugin_v2([layer.get_output(0)], plug)
    out_tensor = elu_layer.get_output(0)
    out_tensor.name = 'l'+str(lid)+'elu'
    return out_tensor
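
For reference, the computation that one such block performs can be written in a few lines of plain Python. This is a sketch with made-up 2x2 weights chosen to keep the arithmetic easy to follow; the names `fully_connected` and `projection_activation_ref` are hypothetical, and the weight layout follows `nn.Linear`, which stores `(out_features, in_features)`.

```python
import math

def elu(x, alpha=1.0):
    """ELU activation for a single value."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def fully_connected(x, weight, bias):
    """y = W x + b, with one row of `weight` per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def projection_activation_ref(x, weight, bias):
    """Fully connected projection followed by ELU, for one sample."""
    return [elu(v) for v in fully_connected(x, weight, bias)]

# Hypothetical weights, not taken from the trained model.
W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, -2.0]
out = projection_activation_ref([1.0, 2.0], W, b)

# Pre-activations are -1.0 and -0.5, so ELU maps them to exp(v) - 1.
assert math.isclose(out[0], math.exp(-1.0) - 1.0)
assert math.isclose(out[1], math.exp(-0.5) - 1.0)
```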

The following code builds the full network, runs the optimization to get the TensorRT engine, and serializes it to the file `opt.engine`:

In [17]:
hidden=512
with trt.Builder(TRT_LOGGER) as builder:
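    # request an explicit batch dimension; this expression evaluates to 1
    explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)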
    with builder.create_network(explicit_batch_flag) as network, builder.create_builder_config() as builder_config:
        builder_config.max_workspace_size = 5000 * (1024 * 1024)
        builder_config.set_flag(trt.BuilderFlag.FP16)
        # inputs has to be of shape (B, C, H, W) so we can use fully connected layer
        inputs = network.add_input(name="option_para", dtype=trt.float32, shape=(-1, 6, 1, 1))
        # create one profile that handles batch size 1
        bs1_profile = builder.create_optimization_profile()
        shape = (1, 6, 1, 1)
        bs1_profile.set_shape("option_para", min=shape, opt=shape, max=shape)
        # create another profile that handles batch size 8
        bs8_profile = builder.create_optimization_profile()
        shape = (8, 6, 1, 1)
        bs8_profile.set_shape("option_para", min=shape, opt=shape, max=shape)        
        builder_config.add_optimization_profile(bs1_profile)
        builder_config.add_optimization_profile(bs8_profile)
        
        # normalize the input to range 0-1
        out_tensor = normalize_layer(network, weights, inputs) 
        
        # project it to hidden dimension 512 and apply Elu activation 5 times
        out_tensor = projection_activation(network, weights, out_tensor, 1)
        out_tensor = projection_activation(network, weights, out_tensor, 2)
        out_tensor = projection_activation(network, weights, out_tensor, 3)
        out_tensor = projection_activation(network, weights, out_tensor, 4)
        out_tensor = projection_activation(network, weights, out_tensor, 5)
        
        # project it to dimension 1 to get the price
        layer = network.add_fully_connected(out_tensor, 1, weights['fc6.weight'], weights['fc6.bias'])
        out_tensor = layer.get_output(0)
        out_tensor.name = 'output'
        # mark the output tensor
        network.mark_output(out_tensor)
        
        # run optimization to find the best plan
        engine = builder.build_engine(network, builder_config)
        # serialize the model into file
        serialized_engine = engine.serialize()
        with open('opt.engine', 'wb') as fout:
            fout.write(serialized_engine)
        TRT_LOGGER.log(TRT_LOGGER.INFO, "Done.")
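
Because the engine is built with two fixed-size profiles, a request has to be matched to the profile that covers its batch size. The sketch below shows that lookup in plain Python. The table `PROFILE_BATCH_RANGES` and the function `pick_profile` are hypothetical helpers that only mirror the order in which the profiles are added above. How the chosen index is then applied to the execution context depends on the TensorRT version and is not shown here.

```python
# (min_batch, max_batch) for each profile, in the order they were added
# to the builder config: index 0 is batch size 1, index 1 is batch size 8.
PROFILE_BATCH_RANGES = [(1, 1), (8, 8)]

def pick_profile(batch_size, ranges=PROFILE_BATCH_RANGES):
    """Return the index of the first profile whose range covers batch_size."""
    for index, (lo, hi) in enumerate(ranges):
        if lo <= batch_size <= hi:
            return index
    raise ValueError("no optimization profile covers batch size %d" % batch_size)

print(pick_profile(1), pick_profile(8))
```

With `min`, `opt` and `max` all set to the same shape, any other batch size (for example 4) is not covered by either profile; widening the `min`/`max` range of a profile would be one way to accept it.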
            
            

Once we have the TensorRT engine file ready, it is easy to use it for inference. We need to:-
1. Load the serialized engine file
2. Allocate the CUDA device arrays for the input and the output
3. Async copy the input from host to device
4. Set the input binding shape, since the engine was built with a dynamic batch dimension
5. Launch the TensorRT engine to compute the result
6. Async copy the output from device to host

In [18]:
import tensorrt as trt
import time
import numpy as np
import pycuda
import pycuda.driver as cuda
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("opt.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

h_input = cuda.pagelocked_empty((1,6,1,1), dtype=np.float32)
h_input[0, 0, 0, 0] = 110.0
h_input[0, 1, 0, 0] = 100.0
h_input[0, 2, 0, 0] = 120.0
h_input[0, 3, 0, 0] = 0.35
h_input[0, 4, 0, 0] = 0.1
h_input[0, 5, 0, 0] = 0.05
h_output = cuda.pagelocked_empty((1,1,1,1), dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
stream = cuda.Stream()
with engine.create_execution_context() as context:
    start = time.time()
    cuda.memcpy_htod_async(d_input, h_input, stream)
    input_shape = (1, 6, 1, 1)
    context.set_binding_shape(0, input_shape)
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()
    end = time.time()
print('result %.4f inference time %.6f' % (h_output,end- start))

result 18.6810 inference time 0.000201


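The TensorRT engine produces the same result (18.6810) as the PyTorch model in about 0.2 ms, roughly half of the sub-millisecond inference time of the non-TensorRT approach. The 0.18 s printed for the PyTorch run above comes from a single call that is the first forward pass in the notebook, so it is not directly comparable with this number.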
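
Single-shot timings like the ones printed in this notebook are sensitive to one-time costs. A fairer comparison would warm up first and then average over many calls. The helper below is a minimal sketch of such a harness using only the standard library; the name `benchmark` and its default arguments are assumptions, and for GPU work the callable passed in would need to include its own synchronization so that the clock covers the full computation.

```python
import time

def benchmark(fn, warmup=10, iters=100):
    """Return the mean wall-clock time in seconds of one call to fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Usage with a stand-in CPU workload; replace the lambda with the
# PyTorch or TensorRT inference call to be measured.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(1000))), warmup=2, iters=5)
print('mean time per call: %.6f s' % mean_seconds)
```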