<img src=http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png style="width: 90px; float: right;">

# Jasper inference using TensorRT Inference Server
This Jupyter notebook provides scripts to deploy high-performance Jasper inference in NVIDIA TensorRT Inference Server, which supports several model backends, including NVIDIA TensorRT.
Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference.
NVIDIA TensorRT Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for deep learning inference applications.
## 1. Overview

The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment. The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.

The original paper is [Jasper: An End-to-End Convolutional Neural Acoustic Model](https://arxiv.org/pdf/1904.03288.pdf).

### 1.1 Model architecture
By default, the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
In the original paper, Jasper is trained with masked convolutions, which mask out the padded part of each input sequence in a batch before the 1D-Convolution.
For inference, masking is not used: on the test and development datasets, the mask operation does not improve accuracy, while omitting it improves inference performance, especially after TensorRT optimization.
More information on the model architecture can be found in the [Jasper PyTorch README](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper).
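
The BxR naming can be made concrete with a small structural sketch. The snippet below is illustrative only: it lists operation names rather than building real layers, it ignores the dense residual connections, and the names `SUB_BLOCK_OPS` and `jasper_layout` are hypothetical helpers, not part of the Jasper code base.

```python
# Operations applied in sequence inside one sub-block, as described above.
SUB_BLOCK_OPS = ("1D-Convolution", "Batch Normalization", "ReLU", "Dropout")


def jasper_layout(num_blocks, num_sub_blocks):
    """Return a nested list: blocks -> sub-blocks -> operation names."""
    return [
        [list(SUB_BLOCK_OPS) for _ in range(num_sub_blocks)]
        for _ in range(num_blocks)
    ]


# Jasper 10x5: B = 10 blocks, each with R = 5 sub-blocks.
layout = jasper_layout(10, 5)
total_sub_blocks = sum(len(block) for block in layout)
total_ops = sum(len(sub_block) for block in layout for sub_block in block)
print(f"blocks={len(layout)} sub-blocks={total_sub_blocks} operations={total_ops}")
```

For the default 10x5 configuration this counts 50 sub-blocks inside the blocks, each with the four operations listed above.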

### 1.2 TensorRT Inference Server Overview

A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
4. Request is placed on the queue (Server Queue)
5. Request is removed from the queue and computed (Server Compute)
6. Completed request is serialized in a message and sent back to the client (Server Send)
7. Completed message travels over network from the server to the client (Network)
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)

Generally, for local clients, steps 1-4 and 6-8 occupy only a small fraction of the time compared to step 5. Because backend deep learning systems like Jasper are rarely exposed directly to end users and instead interface only with local front-end servers, we can treat all clients as local for the purposes of Jasper.
In this section, we go over how to launch the TensorRT Inference Server and client and how to find the best-performing solution for your specific application needs.
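
To reason about where time goes in the eight steps listed above, it can help to tabulate per-stage timings and compute the share spent in Server Compute. The sketch below assumes you have collected such timings yourself; the numbers are made-up placeholders, not measurements, and `compute_share` is a hypothetical helper.

```python
# Stage names follow the eight pipeline steps listed above.
STAGES = [
    "Client Send",
    "Network (request)",
    "Server Receive",
    "Server Queue",
    "Server Compute",
    "Server Send",
    "Network (response)",
    "Client Receive",
]


def compute_share(timings_us):
    """Return the fraction of total request time spent in Server Compute."""
    missing = [stage for stage in STAGES if stage not in timings_us]
    if missing:
        raise ValueError(f"missing timings for: {missing}")
    total = sum(timings_us[stage] for stage in STAGES)
    return timings_us["Server Compute"] / total


# Placeholder values in microseconds, chosen only to illustrate the arithmetic.
example_timings_us = dict(zip(STAGES, [100, 50, 100, 500, 9000, 100, 50, 100]))
print(f"Server Compute share: {compute_share(example_timings_us):.0%}")
```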

Note: The following instructions are run from outside the container and call `docker run` commands as required.

### 1.3 Inference Pipeline in TensorRT Inference Server
The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend: 

**Data preprocessor**

The data preprocessor transforms an input raw audio file into a spectrogram. By default, the pipeline uses mel filter banks as spectrogram features. This part does not have any learnable weights.

**Acoustic model**

The acoustic model takes in the spectrogram and outputs a probability distribution over a list of characters. This part is the most compute intensive, accounting for more than 90% of the end-to-end pipeline time. The acoustic model is the only component with learnable parameters and is what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (more details in the [Jasper PyTorch README](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/README.md)). We do not use masking for inference.

**Greedy decoder**

The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability. 
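
A minimal sketch of greedy decoding in plain Python follows. It picks the highest-probability character at each time step and then, as an assumption not spelled out in the text above, applies the usual CTC-style post-processing of collapsing repeated symbols and dropping a blank symbol. The function name, label list, and probabilities are illustrative and are not taken from the Jasper scripts.

```python
def greedy_decode(probs, labels, blank_index):
    """Decode a sequence of per-frame probability lists into text.

    probs: list of frames, each a list of probabilities (one per label).
    labels: list of characters, where labels[blank_index] is the blank symbol.
    """
    # Step 1: choose the most probable label index in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]

    # Step 2 (assumed CTC convention): collapse repeats, then drop blanks.
    chars = []
    previous = None
    for index in best:
        if index != previous and index != blank_index:
            chars.append(labels[index])
        previous = index
    return "".join(chars)


# Toy example with three labels; "_" plays the role of the blank symbol.
toy_labels = ["a", "b", "_"]
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
]
print(greedy_decode(toy_probs, toy_labels, blank_index=2))
```

Because each frame is decided independently, the cost is linear in the number of frames, which is why greedy decoding is fast compared to search-based decoders.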

To run a model with TensorRT, we first construct the model in PyTorch, which is then exported into an ONNX static graph. Finally, a TensorRT engine is constructed from the ONNX file and can be launched to do inference. The following table shows which backends are supported for each part along the model pipeline.

|Backend\Pipeline component|Data preprocessor|Acoustic Model|Decoder|
|---|---|---|---|
|PyTorch JIT|x|x|x|
|ONNX|-|x|-|
|TensorRT|-|x|-|
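
The support table above can also be expressed as a small lookup, which is handy when scripting over pipeline variants. This is only a restatement of the table; `SUPPORTED` and `backends_for` are hypothetical names.

```python
# Backend -> pipeline components it supports, transcribed from the table above.
SUPPORTED = {
    "PyTorch JIT": {"data preprocessor", "acoustic model", "decoder"},
    "ONNX": {"acoustic model"},
    "TensorRT": {"acoustic model"},
}


def backends_for(component):
    """Return the sorted list of backends that support a pipeline component."""
    return sorted(name for name, parts in SUPPORTED.items() if component in parts)


for part in ("data preprocessor", "acoustic model", "decoder"):
    print(f"{part}: {', '.join(backends_for(part))}")
```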

In order to run inference with TensorRT outside of the inference server, refer to the [Jasper TensorRT README](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/trt/README.md).

### 1.4 Learning objectives

This notebook demonstrates how to:
- Speed up Jasper inference with TensorRT in TensorRT Inference Server
- Use mixed precision for inference

## 2. Requirements

Please refer to the Jasper TensorRT Inference Server README.md.

## 3. Jasper Inference



### 3.1  Prepare Working Directory

In [None]:
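import os

# Use the parent of the notebook's directory as the working directory,
# unless workbookDir was already set earlier in this session.
if 'workbookDir' not in globals():
    workbookDir = os.getcwd() + "/../"
print('workbookDir: ' + workbookDir)
os.chdir(workbookDir)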

### 3.2  Generate TRTIS Model Checkpoints
Use the PyTorch model checkpoint to generate all 3 model backends. You can find a pretrained checkpoint at https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16.

Set the following parameters:

* `ARCH`: hardware architecture. Use 70 for Volta, 75 for Turing.
* `CHECKPOINT_DIR`: absolute path to the model checkpoint directory.
* `CHECKPOINT`: model checkpoint name. (default: jasper10x5dr.pt)
* `PRECISION`: model precision. The default is mixed precision.
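
Before launching the export, it may be worth sanity-checking these settings from Python. The helper below is a hypothetical sketch, not part of the Jasper scripts: it only knows the two `ARCH` values named above and checks that the checkpoint file exists under an absolute `CHECKPOINT_DIR`.

```python
import os

# Only the values named in the parameter list above; other GPUs are outside this sketch.
KNOWN_ARCHS = {"70": "Volta", "75": "Turing"}


def check_export_settings(arch, checkpoint_dir, checkpoint):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if str(arch) not in KNOWN_ARCHS:
        problems.append(f"ARCH={arch} is not one of {sorted(KNOWN_ARCHS)}")
    if not os.path.isabs(checkpoint_dir):
        problems.append(f"CHECKPOINT_DIR={checkpoint_dir} is not an absolute path")
    elif not os.path.isfile(os.path.join(checkpoint_dir, checkpoint)):
        problems.append(f"{checkpoint} not found in {checkpoint_dir}")
    return problems
```

For example, `check_export_settings(os.environ["ARCH"], os.environ["CHECKPOINT_DIR"], os.environ["CHECKPOINT"])` could be called after the `%env` lines in the next cell have been run.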


In [None]:
%env ARCH=70
# Replace <CHECKPOINT_DIR> with the absolute path to the checkpoint directory, which must contain the CHECKPOINT file
%env CHECKPOINT_DIR=<CHECKPOINT_DIR>
# CHECKPOINT file name
%env CHECKPOINT=jasper_fp16.pt
%env PRECISION=fp16
# Print the export command, then run it
!echo "ARCH=${ARCH} CHECKPOINT_DIR=${CHECKPOINT_DIR} CHECKPOINT=${CHECKPOINT} PRECISION=${PRECISION} trtis/scripts/export_model.sh"
!ARCH=${ARCH} CHECKPOINT_DIR=${CHECKPOINT_DIR} CHECKPOINT=${CHECKPOINT} PRECISION=${PRECISION} trtis/scripts/export_model.sh

In [None]:
!bash trtis/scripts/prepare_model_repository.sh

### 3.3  Start the TensorRT Inference Server using Docker

In [None]:
!bash trtis/scripts/run_server.sh
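
The server may need some time to load the models before it can answer requests. One way to handle this is to poll a readiness check before starting the client. The sketch below is a generic polling helper, not part of the Jasper scripts; the readiness URL shown in the comment is an assumption about how the server is exposed and should be adapted to your setup.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ready(url, timeout_s=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_ready(probe, timeout_s=60.0, interval_s=1.0):
    """Call `probe()` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Hypothetical usage; host, port, and path are assumptions, not taken from this notebook:
# wait_until_ready(lambda: http_ready("http://localhost:8000/api/health/ready"))
```

Passing the probe as a callable keeps the waiting logic independent of how readiness is actually checked.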

### 3.4  Start inference prediction in TRTIS

Use the following script to run inference with TensorRT Inference Server.
Set the following parameters:


* `MODEL_TYPE`: model pipeline type. Choose from [pyt, onnx, trt] for the PyTorch JIT, ONNX, or TensorRT model pipeline.
* `DATA_DIR`: absolute path to the directory with audio files.
* `FILE`: path of the audio file, relative to `DATA_DIR`.


In [None]:
MODEL_TYPE="trt"
DATA_DIR=os.path.join(workbookDir, "notebooks/")
FILE="example1.wav"

In [None]:
!bash trtis/scripts/run_client.sh $MODEL_TYPE $DATA_DIR $FILE

You can try other examples from the `notebooks` directory. You can also add your own audio files and generate the output text files in the same way.
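
To transcribe several files in one go, the client call can be repeated for every `.wav` file in a directory. The helper below is a hypothetical sketch: it only builds the argument lists for `trtis/scripts/run_client.sh` in the order used above (`MODEL_TYPE`, `DATA_DIR`, `FILE`) and leaves running them to you, for example with `subprocess.run`.

```python
import os


def client_commands(model_type, data_dir, script="trtis/scripts/run_client.sh"):
    """Build one run_client.sh command per .wav file found directly in data_dir."""
    wav_files = sorted(
        name for name in os.listdir(data_dir) if name.lower().endswith(".wav")
    )
    return [["bash", script, model_type, data_dir, name] for name in wav_files]


# Hypothetical usage from the notebook, after MODEL_TYPE and DATA_DIR are set:
# import subprocess
# for command in client_commands(MODEL_TYPE, DATA_DIR):
#     subprocess.run(command, check=True)
```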

### 3.5  Stop the container when you are done

In [None]:
!docker stop jasper-trtis