In [0]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Jasper Inference Demo with NVIDIA TensorRT on Google Colab

## Overview


In this notebook, we demonstrate how to run inference on a new audio segment with TensorRT (TRT), using a pre-trained PyTorch Jasper model downloaded from the NVIDIA NGC model registry. NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for deep learning inference applications. After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch.

The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment. The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.

The original paper is [Jasper: An End-to-End Convolutional Neural Acoustic Model](https://arxiv.org/pdf/1904.03288.pdf).

### Model architecture
By default, the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
In the original paper, Jasper is trained with masked convolutions, which mask out the padded part of each input sequence in a batch before the 1D-Convolution.
For inference, masking is not used: on the test and development datasets, the mask operation does not improve accuracy, while omitting it gives better inference performance, especially after TensorRT optimization.
More information on the model architecture can be found in the [root folder](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper).
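
As a rough illustration of the BxR layout and of what masking does to a padded batch, here is a plain-Python sketch. The helper names `jasper_layout` and `mask_padding` are made up for this notebook and are not part of the Jasper code base; the real model works on PyTorch tensors and the numbers below are toy values.

```python
# Illustrative sketch only: plain Python, no PyTorch.
# `jasper_layout` and `mask_padding` are hypothetical helper names.

SUB_BLOCK_OPS = ["conv1d", "batch_norm", "relu", "dropout"]


def jasper_layout(num_blocks, num_repeats):
    """Return a nested list: B blocks, each holding R sub-blocks of four ops."""
    return [[list(SUB_BLOCK_OPS) for _ in range(num_repeats)]
            for _ in range(num_blocks)]


def mask_padding(batch, lengths):
    """Zero every time step at or beyond each sequence's true length."""
    return [
        [value if t < length else 0.0 for t, value in enumerate(sequence)]
        for sequence, length in zip(batch, lengths)
    ]


layout = jasper_layout(10, 5)          # Jasper 10x5
print(len(layout), len(layout[0]))     # blocks, sub-blocks per block

# Two sequences padded to length 4; the second has only 2 valid steps.
padded = [[0.5, 0.1, 0.3, 0.2], [0.4, 0.6, 9.9, 9.9]]
print(mask_padding(padded, [4, 2]))
```

Without the mask, the padding values (9.9 here) would leak into the convolution output near the end of the shorter sequence.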

### TensorRT Inference pipeline
The Jasper inference pipeline consists of 3 components: data preprocessor, acoustic model and greedy decoder. The acoustic model is the most compute-intensive component, accounting for more than 90% of the entire end-to-end pipeline. It is also the only component with learnable parameters and is what differentiates Jasper from the competition, so we focus on the acoustic model for the most part.
For the non-TRT Jasper inference pipeline, all 3 components are implemented and run with native PyTorch. For the TensorRT inference pipeline, we show the speedup of running the acoustic model with TensorRT, while preprocessing and decoding are reused from the native PyTorch pipeline.
To run a model with TensorRT, we first construct the model in PyTorch and export it to an ONNX file. A TensorRT engine is then constructed from the ONNX file, serialized to a TRT plan file, and launched to run inference.
Note that the TensorRT engine is optimized at build time, before serialization. TRT tries a vast set of options to find the strategy that performs best on the user's GPU, so building takes a few minutes. After the TRT plan file is created, it can be reused.
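
To give a feel for the last pipeline component, here is a minimal sketch of greedy decoding. It assumes the acoustic model emits one score vector per audio frame over a character vocabulary plus a blank symbol (CTC-style output); the vocabulary, the scores and the function name `greedy_decode` are toy assumptions for illustration, not the repository's implementation.

```python
# Sketch of greedy decoding under a CTC-style assumption.
# The vocabulary and scores are toy values, not Jasper's.

def greedy_decode(frame_scores, labels, blank_index):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    decoded = []
    previous = None
    for index in best:
        if index != previous and index != blank_index:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


labels = ["a", "b", "_"]            # "_" stands for the blank symbol
scores = [
    [0.9, 0.05, 0.05],  # a
    [0.8, 0.1, 0.1],    # a (repeat, merged with the previous frame)
    [0.1, 0.1, 0.8],    # blank
    [0.7, 0.2, 0.1],    # a (new character, because a blank came between)
    [0.1, 0.8, 0.1],    # b
]
print(greedy_decode(scores, labels, blank_index=2))
```

Greedy decoding needs no learnable parameters, which is consistent with the acoustic model being the only trained component.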


### Requirement
1. Before running this notebook, please set the Colab runtime environment to GPU via the menu *Runtime => Change runtime type => GPU*.

For TRT FP16 and INT8 inference, an NVIDIA Volta, Turing or newer generation GPU is required. On Google Colab, this normally means a T4 GPU.

In [3]:
!nvidia-smi

Wed Oct  2 02:42:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    70W / 149W |     69MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

The code below checks whether a Tensor Core GPU is present.

In [4]:
from tensorflow.python.client import device_lib

def check_tensor_core_gpu_present():
    """Return True if any local GPU has compute capability 7.0 or higher."""
    local_device_protos = device_lib.list_local_devices()
    for line in local_device_protos:
        if "compute capability" in str(line):
            compute_capability = float(line.physical_device_desc.split("compute capability: ")[-1])
            if compute_capability>=7.0:
                return True
    # No Tensor Core GPU found: return False explicitly instead of None.
    return False

tensor_core_gpu = check_tensor_core_gpu_present()
print("Tensor Core GPU Present:", tensor_core_gpu)

Tensor Core GPU Present: False
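
The same compute-capability threshold can be used to decide whether to request FP16 later in the notebook. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the two device description strings are hand-written examples in the format TensorFlow reports, not captured output. It compares (major, minor) tuples rather than floats, so a version such as 7.10 would not be misread as 7.1.

```python
import re

# Tensor Cores first appeared with compute capability 7.0 (Volta).
TENSOR_CORE_MIN_CAPABILITY = (7, 0)


def parse_compute_capability(device_desc):
    """Extract (major, minor) from a device description string, or None."""
    match = re.search(r"compute capability: (\d+)\.(\d+)", device_desc)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def precision_flags(device_desc):
    """Return extra perf.py flags: FP16 only when Tensor Cores are present."""
    capability = parse_compute_capability(device_desc)
    if capability is not None and capability >= TENSOR_CORE_MIN_CAPABILITY:
        return ["--trt_fp16"]
    return []


# Hand-written example descriptions (assumed format).
k80 = "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
t4 = "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
print(precision_flags(k80))
print(precision_flags(t4))
```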


2. Next, we clone the NVIDIA Deep Learning Examples GitHub repository and set up the workspace.

In [5]:
!git clone https://github.com/NVIDIA/DeepLearningExamples

Cloning into 'DeepLearningExamples'...
remote: Enumerating objects: 110, done.
remote: Counting objects: 100% (110/110), done.
remote: Compressing objects: 100% (90/90), done.
remote: Total 4049 (delta 65), reused 35 (delta 17), pack-reused 3939
Receiving objects: 100% (4049/4049), 32.29 MiB | 26.48 MiB/s, done.
Resolving deltas: 100% (1875/1875), done.


In [7]:
import os

WORKSPACE_DIR='/content/DeepLearningExamples/PyTorch/SpeechRecognition/Jasper/notebooks'
os.chdir(WORKSPACE_DIR)
print (os.getcwd())

/content/DeepLearningExamples/PyTorch/SpeechRecognition/Jasper/notebooks


## Install NVIDIA TensorRT

We need to install the NVIDIA TensorRT 6.0 runtime environment on Colab. First, check which CUDA version is installed on Colab. As of 2nd Oct 2019, `cuda-10.0` is the CUDA version on Google Colab.

In [8]:
!ls /usr/local/

bin   cuda-10.0  games	  lib	       man   setup.cfg	src
cuda  etc	 include  LICENSE.txt  sbin  share	xgboost
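
If you prefer to detect the CUDA version programmatically rather than reading the listing by eye, a small sketch like the following could work. The function name `cuda_versions` is hypothetical, and the list of names is copied from the output above; on a live runtime you would pass `os.listdir("/usr/local")` instead.

```python
import re


def cuda_versions(entries):
    """Return versions found in names such as 'cuda-10.0', sorted ascending."""
    versions = []
    for name in entries:
        match = re.fullmatch(r"cuda-(\d+)\.(\d+)", name)
        if match:
            versions.append((int(match.group(1)), int(match.group(2))))
    return sorted(versions)


# Directory names copied from the `ls /usr/local/` output above.
listing = ["bin", "cuda-10.0", "games", "lib", "man", "setup.cfg", "src",
           "cuda", "etc", "include", "LICENSE.txt", "sbin", "share", "xgboost"]
major, minor = cuda_versions(listing)[-1]
print(f"cuda{major}.{minor}")   # suffix used in the TensorRT package version
```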


Next, we install the NVIDIA TensorRT version that matches the current Colab CUDA version, following the instructions at https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html#maclearn-net-repo-install.

In [None]:
%%bash
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

dpkg -i nvidia-machine-learning-repo-*.deb
apt-get update

When using the NVIDIA Machine Learning network repository, Ubuntu will by default install TensorRT for the latest CUDA version. The following commands install libnvinfer6 for an older CUDA version and hold the libnvinfer6 package at this version. Replace 6.0.1 with your version of TensorRT and cuda10.0 with the CUDA version of your Colab environment.

In [None]:
%%bash
version="6.0.1-1+cuda10.0"
sudo apt-get install \
    libnvinfer6=${version} libnvonnxparsers6=${version} libnvparsers6=${version} libnvinfer-plugin6=${version} \
    libnvinfer-dev=${version} libnvonnxparsers-dev=${version} libnvparsers-dev=${version} libnvinfer-plugin-dev=${version} \
    python-libnvinfer=${version} python3-libnvinfer=${version}
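
When adapting the install command to another TensorRT or CUDA version, it can help to generate the pinned `package=version` strings instead of editing them by hand. The sketch below is only an illustration: `pinned_specs` is a hypothetical helper, the package names are copied from the cell above, and it assumes the `-1` package revision stays the same for other versions, which may not hold.

```python
# Package names are taken from the install cell above.
TENSORRT_PACKAGES = [
    "libnvinfer6", "libnvonnxparsers6", "libnvparsers6", "libnvinfer-plugin6",
    "libnvinfer-dev", "libnvonnxparsers-dev", "libnvparsers-dev",
    "libnvinfer-plugin-dev", "python-libnvinfer", "python3-libnvinfer",
]


def pinned_specs(trt_version, cuda_version):
    """Build 'package=version' strings such as 'libnvinfer6=6.0.1-1+cuda10.0'."""
    # Assumption: the Debian revision suffix is always "-1".
    version = f"{trt_version}-1+cuda{cuda_version}"
    return [f"{name}={version}" for name in TENSORRT_PACKAGES]


specs = pinned_specs("6.0.1", "10.0")
print("apt-get install " + " ".join(specs))
```

Note that the library package names themselves carry the TensorRT major version (the trailing `6`), so a different major release would also need a different name list.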



In [11]:
!sudo apt-mark hold libnvinfer6 libnvonnxparsers6 libnvparsers6 libnvinfer-plugin6 libnvinfer-dev libnvonnxparsers-dev libnvparsers-dev libnvinfer-plugin-dev python-libnvinfer python3-libnvinfer

libnvinfer6 set on hold.
libnvonnxparsers6 set on hold.
libnvparsers6 set on hold.
libnvinfer-plugin6 set on hold.
libnvinfer-dev set on hold.
libnvonnxparsers-dev set on hold.
libnvparsers-dev set on hold.
libnvinfer-plugin-dev set on hold.
python-libnvinfer set on hold.
python3-libnvinfer set on hold.
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/nvidia-machine-learning.list:1 and /etc/apt/sources.list.d/nvidia-ml.list:1


In [12]:
!dpkg -l | grep TensorRT

hi  libnvinfer-dev                          6.0.1-1+cuda10.0                                  amd64        TensorRT development libraries and headers
hi  libnvinfer-plugin-dev                   6.0.1-1+cuda10.0                                  amd64        TensorRT plugin libraries
hi  libnvinfer-plugin6                      6.0.1-1+cuda10.0                                  amd64        TensorRT plugin libraries
hi  libnvinfer6                             6.0.1-1+cuda10.0                                  amd64        TensorRT runtime libraries
hi  libnvonnxparsers-dev                    6.0.1-1+cuda10.0                                  amd64        TensorRT ONNX libraries
hi  libnvonnxparsers6                       6.0.1-1+cuda10.0                                  amd64        TensorRT ONNX libraries
hi  libnvparsers-dev                        6.0.1-1+cuda10.0                                  amd64        TensorRT parsers libraries
hi  libnvparsers6                           6.0.1-1+cu

A successful TensorRT installation should look like:

```
hi  libnvinfer-dev                          6.0.1-1+cuda10.0                                  amd64        TensorRT development libraries and headers
hi  libnvinfer-plugin-dev                   6.0.1-1+cuda10.0                                  amd64        TensorRT plugin libraries
hi  libnvinfer-plugin6                      6.0.1-1+cuda10.0                                  amd64        TensorRT plugin libraries
hi  libnvinfer6                             6.0.1-1+cuda10.0                                  amd64        TensorRT runtime libraries
hi  libnvonnxparsers-dev                    6.0.1-1+cuda10.0                                  amd64        TensorRT ONNX libraries
hi  libnvonnxparsers6                       6.0.1-1+cuda10.0                                  amd64        TensorRT ONNX libraries
hi  libnvparsers-dev                        6.0.1-1+cuda10.0                                  amd64        TensorRT parsers libraries
hi  libnvparsers6                           6.0.1-1+cuda10.0                                  amd64        TensorRT parsers libraries
hi  python-libnvinfer                       6.0.1-1+cuda10.0                                  amd64        Python bindings for TensorRT
hi  python3-libnvinfer                      6.0.1-1+cuda10.0                                  amd64        Python 3 bindings for TensorRT
```
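
The comparison against the expected listing can also be automated. The following sketch parses `dpkg -l` style lines and reports packages that are missing, at the wrong version, or not on hold. The function name `check_tensorrt_packages` and the two sample lines are illustrative assumptions; in the notebook you would feed it the real command output.

```python
def check_tensorrt_packages(dpkg_output, expected_version, expected_names):
    """Return a dict of package name -> problem for anything missing or off."""
    found = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            status, name, version = fields[0], fields[1], fields[2]
            found[name] = (status, version)

    problems = {}
    for name in expected_names:
        if name not in found:
            problems[name] = "not installed"
            continue
        status, version = found[name]
        if version != expected_version:
            problems[name] = f"version {version}"
        elif status != "hi":          # 'h' = hold, 'i' = installed
            problems[name] = f"status {status}"
    return problems


# Hand-written sample: one held package, one installed but not held.
sample = """\
hi  libnvinfer6         6.0.1-1+cuda10.0  amd64  TensorRT runtime libraries
ii  python3-libnvinfer  6.0.1-1+cuda10.0  amd64  Python 3 bindings for TensorRT
"""
print(check_tensorrt_packages(
    sample, "6.0.1-1+cuda10.0",
    ["libnvinfer6", "python3-libnvinfer", "libnvparsers6"]))
```

An empty result means every expected package is installed at the pinned version and held.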

## Download pretrained Jasper model from NVIDIA GPU Cloud model repository

NVIDIA provides pretrained Jasper models along with many other deep learning models, such as ResNet, BERT, Transformer and SSD, at https://ngc.nvidia.com/catalog/models. Here, we download and unzip the pretrained Jasper PyTorch model.

In [13]:
%%bash 
wget -nc -q --show-progress -O jasper_model.zip \
https://api.ngc.nvidia.com/v2/models/nvidia/jasperpyt_fp16/versions/1/zip




In [14]:
!unzip -o ./jasper_model.zip

Archive:  ./jasper_model.zip
  inflating: jasper_fp16.pt          


After a successful download, a PyTorch checkpoint named `jasper_fp16.pt` should exist in the current notebooks directory.

In [16]:
!ls -l jasper_fp16.pt  

-rw-r--r-- 1 root root 2661855989 Sep 10 00:33 jasper_fp16.pt


## Install extra dependencies

Before creating the TensorRT execution engine from the PyTorch checkpoint, we install some extra dependencies needed to load and convert the PyTorch model and to process input audio files.

- [Apex](https://nvidia.github.io/apex/): NVIDIA's library for automatic mixed precision training in PyTorch
- [ONNX](https://github.com/onnx/onnx): for processing ONNX models
- unidecode, soundfile, toml, pycuda: miscellaneous helper libraries



In [None]:
%%bash 
pip uninstall -y apex
git clone https://www.github.com/nvidia/apex
cd apex
python setup.py install


In [None]:
!pip install unidecode soundfile toml pycuda

In [22]:
!pip install onnx

Collecting onnx
  Downloading https://files.pythonhosted.org/packages/f5/f4/e126b60d109ad1e80020071484b935980b7cce1e4796073aab086a2d6902/onnx-1.6.0-cp36-cp36m-manylinux1_x86_64.whl (4.8MB)
     |████████████████████████████████| 4.8MB 27kB/s 
Collecting typing-extensions>=3.6.2.1 (from onnx)
  Downloading https://files.pythonhosted.org/packages/27/aa/bd1442cfb0224da1b671ab334d3b0a4302e4161ea916e28904ff9618d471/typing_extensions-3.7.4-py3-none-any.whl
Installing collected packages: typing-extensions, onnx
Successfully installed onnx-1.6.0 typing-extensions-3.7.4


## Play with audio examples

You can perform inference using pre-trained checkpoints, which take an audio file (in .wav format) as input and produce the corresponding text. You can customize the content of the input .wav file. For example, there are several example input files in the "notebooks" directory, and we can listen to example1.wav:

In [19]:
import IPython.display as ipd
ipd.Audio('./example1.wav', rate=22050)

You can also download your own audio sample to Colab with

```!wget <link-to-.wav-file>```
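
Before feeding your own file to the model, it can be useful to check its basic properties. The sketch below uses Python's standard `wave` module, which only reads uncompressed PCM WAV files. The helper name `wav_info` is hypothetical, and the generated demo file (one second of silence at 16 kHz) uses arbitrary values chosen only to make the example self-contained.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


# Demo on a generated file: one second of 16-bit mono silence at 16 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(wav_info(demo_path))
```

In the notebook, `wav_info('./example1.wav')` would report the same properties for the bundled example.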

## FP32 Inference with TensorRT


### Creating TensorRT FP32 execution plan

You can run inference using the `trt/perf.py` script:
* `--ckpt_path` specifies the path to the checkpoint
* `--wav` specifies the input audio file
* `--model_toml` specifies the path to the network configuration file (see examples in the "configs" directory)
* `--make_onnx` exports the model to an ONNX file at the path given by `--onnx_path`
* `--engine_path` specifies where the engine file (*.plan) is saved

To create a new engine file (jasper.plan) for TensorRT and run it using fp32 (building the engine for the first time can take several minutes):

In [23]:
%%bash
# Export so the Python process started below can import the Jasper modules.
export PYTHONPATH=/content/DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
python ../trt/perf.py \
--ckpt_path ./jasper_fp16.pt --wav=example1.wav \
--model_toml=../configs/jasper10x5dr_nomask.toml \
--make_onnx --onnx_path jasper.onnx \
--engine_path jasper.plan

tcmalloc: large alloc 1331142656 bytes == 0x15c680000 @  0x7f5e9070c887 0x7f5e8f002bf9 0x7f5e8f003acb 0x7f5e8f003b84 0x7f5e8f003f6c 0x7f5e4a95216f 0x7f5e4a9523f4 0x7f5e4053a411 0x7f5e8684837d 0x7f5e8657def4 0x56204c 0x4f88ba 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f7a28 0x4f876d
tcmalloc: large alloc 1331142656 bytes == 0x1abbfa000 @  0x7f5e9070a1e7 0x5a1c5c 0x7f5e868486da 0x7f5e8657def4 0x56204c 0x4f88ba 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f6128 0x4f9023
tcmalloc: large alloc 1800912896 bytes == 0x7f5d74a84000 @  0x7f5e9070c887 0x7f5e11f173ea 0x7f5e11f0a632 0x7f5e120df6d4 0x7f5e11ef638f 0x7f5e1ebca86a 0x7f5e1ec2194a 0x56204c 0x4f88ba 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f

### Inference from existing TensorRT FP32 plan
Inference with an existing plan can be launched with the `--use_existing_engine` flag.

In [26]:
%%bash
# Export so the Python process started below can import the Jasper modules.
export PYTHONPATH=/content/DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
python ../trt/perf.py \
--wav=./example1.wav \
--model_toml=../configs/jasper10x5dr_nomask.toml \
--use_existing_engine --engine_path jasper.plan

INTERENCE TIME: 289.92610499994953 ms
TRANSCRIPT:  when these two souls perceived each other they recognized each other as necessary to each other and embraced each other closely


tcmalloc: large alloc 1331036160 bytes == 0x62440000 @  0x7fd170b6f1e7 0x5a1c5c 0x578954 0x561fca 0x57c961 0x57e6ae 0x4bb666 0x4f858d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f6128 0x4f9023 0x6415b2 0x64166a 0x643730 0x62b26e 0x4b4cb0 0x7fd17076cb97 0x5bdf6a
tcmalloc: large alloc 1330552832 bytes == 0x1010f0000 @  0x7fd170b71887 0x7fd0f255dce7 0x7fd0f254d05f 0x7fd0f2364ee3 0x7fd0f236efd8 0x7fd0ff01f82e 0x7fd0ff08694a 0x56204c 0x4f88ba 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f6128 0x4f9023 0x6415b2 0x64166a 0x643730 0x62b26e 0x4b4cb0 0x7fd17076cb97 0x5bdf6a
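
The build and reuse invocations in this section differ only in a few flags, so the choice between them can be made automatically by checking whether the plan file already exists. The sketch below only assembles the argument list; `perf_command` is a hypothetical helper, and it uses nothing beyond the flags that appear in this notebook's cells.

```python
import os


def perf_command(engine_path, wav, model_toml, ckpt_path=None,
                 onnx_path=None, fp16=False):
    """Assemble trt/perf.py arguments, reusing the plan file if it exists."""
    cmd = ["python", "../trt/perf.py", f"--wav={wav}",
           f"--model_toml={model_toml}", "--engine_path", engine_path]
    if os.path.exists(engine_path):
        cmd.append("--use_existing_engine")
    else:
        # No plan yet: export to ONNX and build the engine from the checkpoint.
        if ckpt_path is None or onnx_path is None:
            raise ValueError("ckpt_path and onnx_path are needed to build")
        cmd += ["--ckpt_path", ckpt_path, "--make_onnx",
                "--onnx_path", onnx_path]
    if fp16:
        cmd.append("--trt_fp16")
    return cmd


print(" ".join(perf_command(
    "jasper.plan", "./example1.wav", "../configs/jasper10x5dr_nomask.toml",
    ckpt_path="./jasper_fp16.pt", onnx_path="jasper.onnx")))
```

To actually launch it, the list could be passed to `subprocess.run(cmd, env={**os.environ, "PYTHONPATH": ...}, check=True)` with `PYTHONPATH` pointing at the Jasper folder, as in the cells above.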


## FP16 Inference with TensorRT
### Creating TensorRT FP16 execution plan

We will next create an FP16 TRT inference plan. 

To run inference on the input audio file using automatic mixed precision, add the argument `--trt_fp16`. With automatic mixed precision, inference time can be reduced compared to fp32 (building the engine for the first time can take several minutes).

**Important Note:** Efficient FP16 inference requires a Volta, Turing or newer generation GPU. On Google Colab, this normally means a T4 GPU. On the older K80 GPUs, FP16 performance might actually be worse than that of an FP32 TRT model.

In [27]:
%%bash
# Export so the Python process started below can import the Jasper modules.
export PYTHONPATH=/content/DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
python ../trt/perf.py \
--ckpt_path ./jasper_fp16.pt --wav=example1.wav \
--model_toml=../configs/jasper10x5dr_nomask.toml \
--make_onnx --onnx_path jasper.onnx \
--engine_path jasper_fp16.plan \
--trt_fp16

INTERENCE TIME: 334.61581900019155 ms
TRANSCRIPT:  when these two souls perceived each other they recognized each other as necessary to each other and embraced each other closely


tcmalloc: large alloc 1331142656 bytes == 0x15bbec000 @  0x7f1bf09e6887 0x7f1bef2dcbf9 0x7f1bef2ddacb 0x7f1bef2ddb84 0x7f1bef2ddf6c 0x7f1baac2c16f 0x7f1baac2c3f4 0x7f1ba0814411 0x7f1be6b2237d 0x7f1be6857ef4 0x56204c 0x4f88ba 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f7a28 0x4f876d
tcmalloc: large alloc 1331142656 bytes == 0x106ce2000 @  0x7f1bf09e41e7 0x5a1c5c 0x7f1be6b226da 0x7f1be6857ef4 0x56204c 0x4f88ba 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4f98c7 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f6128 0x4f7d60 0x4f876d 0x4fa6c0 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f6128 0x4f9023
tcmalloc: large alloc 1817280512 bytes == 0x7f1ad3ae8000 @  0x7f1bf09e6887 0x7f1b721f13ea 0x7f1b721e4632 0x7f1b723b96d4 0x7f1b721d038f 0x7f1b7eea486a 0x7f1b7eefb94a 0x56204c 0x4f88ba 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f

### Inference from existing TensorRT FP16 plan
Inference with an existing plan can be launched with the `--use_existing_engine` flag.

In [29]:
%%bash
# Export so the Python process started below can import the Jasper modules.
export PYTHONPATH=/content/DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
python ../trt/perf.py \
--wav=./example1.wav \
--model_toml=../configs/jasper10x5dr_nomask.toml \
--use_existing_engine --engine_path jasper_fp16.plan \
--trt_fp16

INTERENCE TIME: 301.42106899984356 ms
TRANSCRIPT:  when these two souls perceived each other they recognized each other as necessary to each other and embraced each other closely


tcmalloc: large alloc 1116463104 bytes == 0xb1754000 @  0x7f0533d0e1e7 0x5a1c5c 0x578954 0x561fca 0x57c961 0x57e6ae 0x4bb666 0x4f858d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f7a28 0x4f876d 0x4f98c7 0x4f6128 0x4f9023 0x6415b2 0x64166a 0x643730 0x62b26e 0x4b4cb0 0x7f053390bb97 0x5bdf6a
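
To compare the FP32 and FP16 runs, the reported times can be pulled out of the script output and turned into a ratio. The sketch below uses the two "existing plan" timings recorded in this notebook, which ran on a K80; `inference_ms` is a hypothetical helper, and a single run per precision is only a rough indication, not a benchmark.

```python
import re

# Lines copied from the cell outputs in this notebook (K80 runtime).
# The spelling "INTERENCE" is how the script prints it.
FP32_LOG = "INTERENCE TIME: 289.92610499994953 ms"
FP16_LOG = "INTERENCE TIME: 301.42106899984356 ms"


def inference_ms(log_text):
    """Extract the reported inference time in milliseconds from the output."""
    match = re.search(r"TIME:\s*([0-9.]+)\s*ms", log_text)
    if match is None:
        raise ValueError("no inference time found")
    return float(match.group(1))


fp32_ms = inference_ms(FP32_LOG)
fp16_ms = inference_ms(FP16_LOG)
speedup = fp32_ms / fp16_ms      # > 1 means FP16 is faster
print(f"FP32 {fp32_ms:.1f} ms, FP16 {fp16_ms:.1f} ms, speedup {speedup:.2f}x")
```

On this K80 run the ratio comes out below 1, which matches the note above that FP16 may not help on GPUs without Tensor Cores.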


## Conclusion

In this notebook, we have walked through the complete process of carrying out inference with a pretrained Jasper PyTorch model using NVIDIA TensorRT on Google Colab.
### What's next
Now that you are familiar with running Jasper inference with TensorRT using full and automatic mixed precision, you may want to play with your own audio samples.

For information on training a Jasper model using your own data, please check out our GitHub repo: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper