In [None]:
# Copyright 2023 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Accelerating HuggingFace GPT-2 Inference with TensorRT

GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. The model was pretrained on the raw texts to guess the next word in sentences. As no human labeling was required, GPT-2 pretraining can use lots of publicly available data with an automatic process to generate inputs and labels from those data.

This notebook shows how to convert a [HuggingFace PyTorch GPT-2 model](https://huggingface.co/gpt2) to a TensorRT engine for high-performance inference in a few lines of code.

## Prerequisite

Follow the instructions at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS Docker container required to run this notebook.

Next, we install some extra dependencies.

In [None]:
%%capture
!pip3 install -r ../requirements.txt

**Note:** After this step, you should restart the Jupyter kernel for the change to take effect.

In [None]:
import os
import sys
ROOT_DIR = os.path.abspath("../")
sys.path.append(ROOT_DIR)

import torch

from GPT2.frameworks import GPT2HF
from GPT2.trt import GPT2TRT


## API usage

We have wrapped the process of importing models from PyTorch, exporting them to ONNX files, and building TRT engines into a single class. The `GPT2HF` and `GPT2TRT` classes both expose `generate` as the main entry point for running GPT-2. `GPT2TRT` performs all three steps automatically, based on the arguments you pass. Here is an example:


### Specify model arguments

Pick your favorite model and configuration, and TRT will run it for you! The main choices you need to make are:
- `use_cache`: use a KV cache to speed up decoding
- `num_beams`: use beam search for better results
- `fp16`: use float16 to speed up decoding

In [None]:
args = {
    "variant": "gpt2", # A HuggingFace model variant name. Required.
    "use_cache": False, # We support decoder kv cache in generation. Default: False
    "fp16": True, # Default: False
    "num_beams": 1, # We support beam search in generation. Default: 1
    "batch_size": 1, # Default: 1
    # Folder name. Required. All the PyTorch, ONNX and TRT Engines will be stored in the folder.
    "working_dir": "models",
    # Log level.
    "info": True,
    # Benchmarking args
    "iterations": 10,
    "number": 1,
    "warmup": 3,
    "duration": 0,
    "percentile": 50,
}
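The benchmarking arguments above are not explained in this notebook. As a rough sketch of how settings like `iterations`, `number`, `warmup` and `percentile` are commonly used, the hypothetical helper below runs a few untimed warm-up calls, collects timed samples with the standard-library `timeit` module, and reports a nearest-rank percentile. The `measure` function and its exact semantics are assumptions for illustration, not the wrapper's actual implementation, and `duration` is not modelled here.

```python
import math
import timeit


def measure(stmt, iterations=10, number=1, warmup=3, percentile=50):
    """Hypothetical helper: time a zero-argument callable, return seconds per call."""
    for _ in range(warmup):
        stmt()  # warm-up runs are executed but not timed

    # Each sample times `number` back-to-back calls; collect `iterations` samples.
    raw = timeit.repeat(stmt, number=number, repeat=iterations)
    samples = sorted(t / number for t in raw)

    # Nearest-rank percentile: percentile=50 selects a median-like sample.
    rank = max(1, math.ceil(percentile * len(samples) / 100))
    return samples[rank - 1]


p50 = measure(lambda: sum(range(1000)))
print("p50 seconds per call: {:.6f}".format(p50))
```

Under this reading, `percentile: 50` corresponds to the median runtime that the benchmark section later reports as p50.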


### Initialize the models
Calling the API is just this easy...

In [None]:
framework_model = GPT2HF(**args)
trt_model = GPT2TRT(**args)

### Try your sentence!
Both `GPT2HF` and `GPT2TRT` expose `setup_tokenizer_and_model` and `generate`. If `setup_tokenizer_and_model` has not been called before `generate`, `generate` calls it first.
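The "set up on first use" behaviour described above can be illustrated with a toy class. This is only a minimal sketch: `LazyModelWrapper`, its placeholder tokenizer and its placeholder model are hypothetical stand-ins, not the real `GPT2HF` or `GPT2TRT` code.

```python
class LazyModelWrapper:
    """Hypothetical stand-in for the wrapper classes (not the real implementation)."""

    def __init__(self):
        self.models = None    # filled in by setup_tokenizer_and_model()
        self.setup_calls = 0  # counts how often the expensive setup ran

    def setup_tokenizer_and_model(self):
        # The real classes load or build the model here; this is a placeholder.
        self.setup_calls += 1
        return {"tokenizer": str.split, "model": lambda tokens: tokens[::-1]}

    def generate(self, input_str):
        # Run setup only if the caller has not already done so.
        if self.models is None:
            self.models = self.setup_tokenizer_and_model()
        tokens = self.models["tokenizer"](input_str)
        return " ".join(self.models["model"](tokens))


wrapper = LazyModelWrapper()
first = wrapper.generate("a b c")  # triggers setup
second = wrapper.generate("d e")   # reuses the stored models
print(first, "|", second, "|", wrapper.setup_calls)
```

Calling `setup_tokenizer_and_model` explicitly, as the cells below do, separates the one-time setup cost from the generation call itself.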

In [None]:
input_str = "TensorRT is a deep learning accelerator software developed by NVIDIA. It can run "

In [None]:
framework_model.models = framework_model.setup_tokenizer_and_model()

In [None]:
framework_model.generate(input_str = input_str)

In [None]:
trt_model.models = trt_model.setup_tokenizer_and_model()

In [None]:
trt_model.generate(input_str = input_str)


### Performance benchmark
You can see that both TRT and PyTorch generate reasonable results, as expected. To measure inference time, both `GPT2HF` and `GPT2TRT` expose `execute_inference`, `full_inference`, `encoder_inference` and `decoder_inference`. Let's take a look at how our latest TRT performs.

In [None]:
from tabulate import tabulate

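# Table of median (p50) runtimes; the first row is the header.
data = [
    ['full p50(s)', 'decoder p50(s)'],
]

def format_result(result):
    # Format each segment's median runtime in seconds with 4 decimal places.
    return ['{:.4f}'.format(segment.runtime) for segment in result.median_runtime]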

In [None]:
framework_result = framework_model.execute_inference(input_str)
data.append(format_result(framework_result))

In [None]:
trt_result = trt_model.execute_inference(input_str)
data.append(format_result(trt_result))

In [None]:
print(tabulate(data, headers='firstrow', tablefmt='github'))

Did TensorRT's performance amaze you?
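To turn the two table rows into a speedup figure, one option is to divide the framework time by the TRT time for each column. The sketch below assumes the same layout as `data` (a header row, then the framework row, then the TRT row). `speedup_row` is a hypothetical helper, and the numbers in `example` are placeholders for illustration, not measured results.

```python
def speedup_row(table):
    """Return per-column speedups: first data row time / second data row time."""
    header, baseline, accelerated = table[0], table[1], table[2]
    return {
        name: float(base) / float(fast)
        for name, base, fast in zip(header, baseline, accelerated)
    }


# Placeholder timings for illustration only; these are not measured results.
example = [
    ['full p50(s)', 'decoder p50(s)'],
    ['1.0000', '0.0400'],
    ['0.2500', '0.0100'],
]
speedups = speedup_row(example)
for name, ratio in speedups.items():
    print('{}: {:.2f}x'.format(name, ratio))
```

Once both results have been appended, passing `data` instead of `example` gives the ratios for your own run.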

## Conclusion and where to go next

Is this the end? If the API seems too simple, if you are used to the previous version that walked you through each step, or if you want to know more about the conversion process, look in the working directory: the PyTorch model, ONNX files and TRT engines are all there. Feel free to investigate them. We have wrapped the entire model conversion process in `setup_tokenizer_and_model`. The TensorRT inference engine can conveniently be used as a drop-in replacement for the original HuggingFace GPT-2 model while providing a significant speedup. If you are interested in further details of the conversion process, check out [GPT2](../GPT2) and [Seq2Seq/trt.py](../Seq2Seq/trt.py). You will find that all the Seq2Seq models can be treated in a similar way!
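As a starting point for exploring the generated files, the sketch below walks a directory and groups what it finds by file extension. `list_artifacts` is a hypothetical helper. The `"models"` path matches the `working_dir` value used earlier, but the layout and file extensions inside it are not specified in this notebook, so treat the output as something to inspect rather than a fixed structure.

```python
import os
from collections import defaultdict


def list_artifacts(working_dir):
    """Group all files under working_dir by extension."""
    by_ext = defaultdict(list)
    for root, _dirs, files in os.walk(working_dir):
        for name in files:
            ext = os.path.splitext(name)[1] or "(no extension)"
            by_ext[ext].append(os.path.join(root, name))
    return {ext: sorted(paths) for ext, paths in by_ext.items()}


# Prints nothing if the directory does not exist yet.
for ext, paths in sorted(list_artifacts("models").items()):
    print(ext, len(paths))
```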