In [None]:
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Accelerating HuggingFace BART Inference with TensorRT

BART is an encoder-decoder (sequence-to-sequence) model that pairs a bidirectional encoder with an autoregressive decoder and is pretrained as a denoising autoencoder: the input text is corrupted and the model learns to reconstruct the original. It works particularly well when fine-tuned for text generation tasks such as summarization and translation, and it can also be used for comprehension tasks such as classification and Q&A.

This notebook shows how to convert a [HuggingFace PyTorch BART model](https://huggingface.co/docs/transformers/model_doc/bart) to a TensorRT engine for high-performance inference, and compares the performance of PyTorch and TensorRT inference.


## Prerequisites

Follow the instructions at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.

Next, we install some extra dependencies.

In [None]:
#%%capture
!pip3 install -r ../requirements.txt

**Note:** After this step, you should restart the Jupyter kernel for the change to take effect.

In [None]:
import os
import sys
ROOT_DIR = os.path.abspath("../")
sys.path.append(ROOT_DIR)

# disable warning in notebook
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# notebook widgets
import ipywidgets as widgets
widget_style = {'description_width': 'initial'}
widget_layout = widgets.Layout(width='auto')

import torch
import time

from BART.frameworks import BARTHF
from BART.trt import BARTTRT


## API usage

We have wrapped the process of importing models from PyTorch, exporting them to ONNX files and building TRT engines into a single class. We introduce new `BARTHF` and `BARTTRT` classes that both expose `generate` as the main entry point to run BART. `BARTTRT` automatically performs all three steps based on the user's inputs. Here is an example:


### Specify model arguments

You pick your favorite model and configuration, and TRT will run it for you! The main choices you need to make are:
- `variant`: which model variant you want to run
- `use_cache`: whether to use the decoder KV cache to speed up decoding
- `num_beams`: the number of beams for beam search, which can improve output quality
- `fp16`: whether to use float16 precision to speed up inference

The BART variants that are supported by TensorRT are: facebook/bart-base (139M), facebook/bart-large (406M), facebook/bart-large-cnn (406M), facebook/mbart-large-50 (680M), and more!
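To make the `use_cache` option more concrete, here is a minimal, framework-free sketch of the idea behind a decoder KV cache. It is a toy single-head attention in plain Python, not the BART or TensorRT implementation: `project`, `decode_without_cache` and `decode_with_cache` are hypothetical names, and the "projection" is only a scaling stand-in for learned weights.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[d] for w, value in zip(weights, values))
        for d in range(len(values[0]))
    ]

def project(vec, factor):
    # Hypothetical stand-in for a learned K or V projection: just scales the input.
    return [factor * x for x in vec]

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        # Every step recomputes K and V for all tokens seen so far.
        keys = [project(tok, 0.5) for tok in tokens[:t]]
        values = [project(tok, 2.0) for tok in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    keys, values = [], []  # the KV cache, extended by one entry per step
    for tok in tokens:
        keys.append(project(tok, 0.5))
        values.append(project(tok, 2.0))
        projections += 2
        outputs.append(attend(tok, keys, values))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
no_cache_out, no_cache_cost = decode_without_cache(tokens)
cache_out, cache_cost = decode_with_cache(tokens)
print(no_cache_cost, cache_cost)
```

For these four toy tokens, the uncached loop performs 20 projections and the cached loop performs 8, while both produce the same attention outputs. The gap grows quadratically with the output length, which is why the cache matters most for long generations.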

### Model and Inference Configuration

In [None]:
args = {
    "variant": "facebook/bart-base", # A HuggingFace model variant name. Required.
    "fp16": True, # Default: False
    "use_cache": True, # We support decoder kv cache in generation. Default: False
    "num_beams": 1, # We support beam search in generation. Default: 1
    "batch_size": 1, # Default: 1
    # Folder name. Required. All the PyTorch, ONNX and TRT Engines will be stored in the folder.
    "working_dir": "models",
    # Log level.
    "info": True,
    # Benchmarking args
    "iterations": 10,
    "number": 1,
    "warmup": 3,
    "duration": 0,
    "percentile": 50,
}


### Initialize the models
Calling the API is just this easy...

In [None]:
framework_model = BARTHF(**args)
trt_model = BARTTRT(**args)

### Try your sentence!
Both `BARTHF` and `BARTTRT` expose `setup_tokenizer_and_model` and `generate`. If `setup_tokenizer_and_model` has not been called before `generate`, `generate` calls it first.
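The "set up on first use" behavior described above is a common lazy-initialization pattern. The sketch below illustrates it with a hypothetical `LazyRunner` class; it is not the `BARTHF`/`BARTTRT` code, and its "tokenizer" and "model" are trivial stand-ins (`str.split` and `len`). The keyword name `input` mirrors the notebook's calls.

```python
class LazyRunner:
    """Hypothetical wrapper that builds its models the first time they are needed."""

    def __init__(self):
        self.models = None
        self.setup_calls = 0

    def setup_tokenizer_and_model(self):
        # A real implementation would load a tokenizer and build models or engines.
        self.setup_calls += 1
        return {"tokenizer": str.split, "model": len}

    def generate(self, input):
        # Run the expensive setup only if nobody has done it yet.
        if self.models is None:
            self.models = self.setup_tokenizer_and_model()
        tokens = self.models["tokenizer"](input)
        return self.models["model"](tokens)

runner = LazyRunner()
first = runner.generate(input="TensorRT speeds up inference")
second = runner.generate(input="hello world")
print(first, second, runner.setup_calls)
```

Setup runs once no matter how many times `generate` is called, which is why the explicit `setup_tokenizer_and_model()` calls in the cells below are optional but convenient for separating build time from generation time.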

In [None]:
input_str = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT, built on the NVIDIA CUDA parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost."

In [None]:
framework_model.models = framework_model.setup_tokenizer_and_model()

In [None]:
framework_model.generate(input=input_str)

In [None]:
trt_model.models = trt_model.setup_tokenizer_and_model()

In [None]:
trt_model.generate(input=input_str)


### Performance benchmark
You can see that TRT and PyTorch generate the same result, which is expected. To measure their performance, both `BARTHF` and `BARTTRT` expose `execute_inference`, `full_inference`, `encoder_inference` and `decoder_inference`, which report inference time. Let's take a look at how our latest TRT performs.
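For intuition about what a "p50" timing means, here is a minimal sketch of a percentile latency measurement with warmup runs. It reuses the names `iterations`, `number`, `warmup` and `percentile` from `args`, but how the library actually interprets them is an assumption here, `duration` is not modeled, and `percentile_latency` and `nearest_rank` are hypothetical helpers rather than part of the `BARTHF`/`BARTTRT` API.

```python
import math
import time

def nearest_rank(samples, percentile):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def percentile_latency(fn, iterations=10, number=1, warmup=3, percentile=50):
    # Warmup calls are executed but not timed (lazy initialization, caches, ...).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        for _ in range(number):
            fn()
        # Average over `number` back-to-back calls to get one sample.
        samples.append((time.perf_counter() - start) / number)
    return nearest_rank(samples, percentile)

# Usage with a cheap stand-in workload instead of a real model call.
latency = percentile_latency(lambda: sum(range(1000)), iterations=10, warmup=3)
print(f"p50 latency: {latency:.6f} s")
```

Reporting a median rather than a mean makes the number less sensitive to occasional slow iterations, which is useful when comparing two backends on a shared machine.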

In [None]:
from tabulate import tabulate

data = [
    ['full p50(s)', 'decoder p50(s)', 'encoder p50(s)'],
]

def format_result(result):
    entry = []
    for segment in result.runtime:
        entry.append('{:.4f}'.format(segment.runtime))
    
    return entry

In [None]:
framework_result = framework_model.execute_inference(input=input_str)
data.append(format_result(framework_result))

In [None]:
trt_result = trt_model.execute_inference(input_str)
data.append(format_result(trt_result))

In [None]:
print(tabulate(data, headers='firstrow', tablefmt='github'))

We can now compare the HuggingFace model and the TensorRT engine, both per component (encoder and decoder) and end to end. For the bart-base variant on an NVIDIA Titan V GPU with input/output sequence lengths of around 130, FP16 inference with KV cache yields roughly a 4x speedup over PyTorch.

### Variable Input/Output Length Performance Benchmarking

We can run more tests by varying input/output length, while using the same engines.

Note that TensorRT performance depends on the kernels selected when the engine is built. The variable-length test here reuses a single engine whose maximum input/output length profile is set to `max_length` from the HuggingFace config, which represents the best use of the model. If you want to change this length, change that field before calling `setup_tokenizer_and_model`.

In [None]:
from tabulate import tabulate

input_output_len_list = [
    (64, 128), # generation task
    (128, 64), # summarization task
    (32, 32), # translation task
    (128, 128),
]

data = [
    ['(input_len, output_len)', 'HF p50 (s)', 'TRT p50 (s)'],
]

for (in_len, out_len) in input_output_len_list:

    input_ids = torch.randint(0, framework_model.config.vocab_size, (framework_model.config.batch_size, in_len))
    
    framework_model.config.min_output_length = out_len
    framework_model.config.max_output_length = out_len
    trt_model.config.min_output_length = out_len
    trt_model.config.max_output_length = out_len
    
    _, framework_e2e = framework_model.full_inference(input_ids)
    _, trt_e2e = trt_model.full_inference(input_ids)

    data.append([(in_len, out_len), framework_e2e, trt_e2e])

print(tabulate(data, headers='firstrow', tablefmt='github'))
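A table of two timing columns is easier to read with an explicit ratio. The sketch below adds a speedup column to rows shaped like `data` above (a header row followed by `[label, HF seconds, TRT seconds]` rows). `add_speedup` is a hypothetical helper, and the timings in `example` are made-up placeholders chosen for easy arithmetic, not measurements.

```python
def add_speedup(rows):
    """Return a copy of `rows` with an extra HF/TRT speedup column."""
    header, *body = rows
    table = [list(header) + ["speedup (x)"]]
    for label, hf_seconds, trt_seconds in body:
        table.append([label, hf_seconds, trt_seconds, round(hf_seconds / trt_seconds, 2)])
    return table

# Placeholder numbers for illustration only; replace with the measured `data`.
example = [
    ["(input_len, output_len)", "HF p50 (s)", "TRT p50 (s)"],
    [(64, 128), 1.0, 0.25],
    [(32, 32), 0.5, 0.25],
]

for row in add_speedup(example):
    print(row)
```

In the notebook, `tabulate(add_speedup(data), headers='firstrow', tablefmt='github')` would render the extended table in the same style as the cell above.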

Does TRT performance amaze you?

## Conclusion and where-to next?

Is this the end? If the API seems too simple, if you are used to the previous version that walked through each step, or if you want to know more about the conversion process, just look in the working directory: the PyTorch model, ONNX files and TRT engines are all there. Feel free to investigate them. We have wrapped the entire model conversion process in `setup_tokenizer_and_model`. The TensorRT inference engine can be conveniently used as a drop-in replacement for the original HuggingFace BART model while providing a significant speedup. If you are interested in further details of the conversion process, check out [BART](../BART) and [Seq2Seq/trt.py](../Seq2Seq/trt.py). You will find that all the Seq2Seq models can be treated in a similar way!
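As a starting point for exploring those artifacts, the sketch below walks a working directory and lists files by extension. `list_artifacts` is a hypothetical helper, and both the directory layout and the `.onnx`/`.engine` suffixes are assumptions; adjust them to whatever you find on disk.

```python
from pathlib import Path

def list_artifacts(working_dir, suffixes=(".onnx", ".engine")):
    """Return (relative path, size in bytes) for matching files under working_dir."""
    root = Path(working_dir)
    if not root.is_dir():
        return []
    found = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            found.append((path.relative_to(root).as_posix(), path.stat().st_size))
    return found

# "models" is the `working_dir` value used in `args` earlier in this notebook.
for name, size in list_artifacts("models"):
    print(f"{size:>12d}  {name}")
```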