In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Accelerating HuggingFace GPT-2 Inference with TensorRT

GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. The model was pretrained on the raw texts to guess the next word in sentences. As no human labeling was required, GPT-2 pretraining can use lots of publicly available data with an automatic process to generate inputs and labels from those data.
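To make the self-supervised setup concrete, here is a minimal sketch of how input/label pairs can be derived from raw text without human annotation. It assumes whitespace splitting as a stand-in for GPT-2's actual byte-pair tokenizer, and `make_next_token_pairs` is a hypothetical helper written for this illustration, not part of any library.

```python
from typing import List, Tuple


def make_next_token_pairs(text: str) -> List[Tuple[List[str], str]]:
    """Build (context, next_token) training pairs from raw text.

    Assumption: whitespace splitting stands in for GPT-2's real
    byte-pair-encoding tokenizer, purely for illustration.
    """
    tokens = text.split()
    # Each prefix of the sentence is an input; the token that follows is its label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = make_next_token_pairs("Hello, my dog is cute")
for context, label in pairs:
    print(context, "->", label)
```

Every sentence of length n yields n - 1 labeled examples this way, which is why large unlabeled corpora are sufficient for pretraining.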

This notebook shows 3 easy steps to convert a [HuggingFace PyTorch GPT-2 model](https://huggingface.co/gpt2) to a TensorRT engine for high-performance inference.

1. [Download HuggingFace GPT-2 model](#1)
1. [Convert to ONNX format](#2)
1. [Convert to TensorRT engine](#3)

## Prerequisite

Follow the instructions at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS Docker container required to run this notebook.

Next, we add the repository root to the Python path and import the libraries used throughout this notebook.

In [2]:
import os
import sys
ROOT_DIR = os.path.abspath("../")
sys.path.append(ROOT_DIR)

import torch 

# huggingface
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    GPT2Config,
)

<a id="1"></a>

## 1. Download HuggingFace GPT-2 model 

First, we download the original HuggingFace PyTorch GPT-2 model from the HuggingFace model hub, together with its associated tokenizer.

The GPT-2 variants supported by TensorRT 8 are gpt2 (117M) and gpt2-large (774M).

In [3]:
# download model and tokenizer
GPT2_VARIANT = 'gpt2' # choices: gpt2 | gpt2-large

model: GPT2LMHeadModel = GPT2LMHeadModel.from_pretrained(GPT2_VARIANT)

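# load the configuration of the chosen variant (GPT2Config's first positional argument is vocab_size, not a model name)
config = GPT2Config.from_pretrained(GPT2_VARIANT)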
tokenizer = GPT2Tokenizer.from_pretrained(GPT2_VARIANT)

In [4]:
# save model locally
pytorch_model_dir = './models/{}/pytorch'.format(GPT2_VARIANT)
!mkdir -p $pytorch_model_dir

model.save_pretrained(pytorch_model_dir)
print("Pytorch Model saved to {}".format(pytorch_model_dir))

Pytorch Model saved to ./models/gpt2/pytorch


### Inference with PyTorch model

#### Single example inference

In [5]:
# carry out inference with a single sample
inputs = tokenizer("Hello, my dog is ", return_tensors="pt")
print(inputs)
print("----")
model.eval()
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

logits = outputs.logits
print(logits)

{'input_ids': tensor([[15496,    11,   616,  3290,   318,   220]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
----
tensor([[[ -35.2362,  -35.3266,  -38.9753,  ...,  -44.4645,  -43.9974,
           -36.4580],
         [-112.6171, -114.5831, -116.5724,  ..., -119.0128, -118.8059,
          -111.6917],
         [ -88.7435,  -89.8643,  -93.1977,  ...,  -92.3839,  -96.1782,
           -92.1273],
         [ -85.1646,  -88.3379,  -92.8703,  ...,  -99.8017,  -94.7657,
           -90.9330],
         [-116.7280, -119.3950, -121.7259,  ..., -129.1003, -124.6102,
          -121.6092],
         [ -61.9847,  -63.7082,  -65.6898,  ...,  -76.0924,  -71.7898,
           -66.1154]]])


For benchmarking, we use the helper function `gpt2_inference`, which runs inference on a single batch repeatedly and measures the end-to-end execution time. Take note of this time for later comparison with TensorRT.

`TimingProfile` is a named tuple that specifies the number of experiments (`iterations`), the number of calls to the function per experiment (`number`), and the number of warm-up calls (`warmup`), although the warm-up setting is not used here.
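The measurement idea can be sketched with the standard library alone. This is not the implementation of `gpt2_inference`; `TimingProfileSketch` and `measure_median_time` are hypothetical names that only mirror the three fields used in this notebook, and unlike the helper described above, this sketch does run the warm-up calls.

```python
import statistics
import timeit
from collections import namedtuple

# Assumption: a stand-in with the same three fields used in this notebook;
# the real TimingProfile lives in HuggingFace.NNDF.networks.
TimingProfileSketch = namedtuple("TimingProfileSketch", ["iterations", "number", "warmup"])


def measure_median_time(fn, profile):
    """Return the median time in seconds of one experiment (`number` calls to `fn`)."""
    # Warm-up calls are executed but not timed.
    for _ in range(profile.warmup):
        fn()
    # timeit.repeat returns one total time per experiment.
    timings = timeit.repeat(fn, repeat=profile.iterations, number=profile.number)
    return statistics.median(timings)


median_s = measure_median_time(
    lambda: sum(range(10_000)),
    TimingProfileSketch(iterations=10, number=1, warmup=1),
)
print(f"median time: {median_s:.6f} s")
```

Reporting the median rather than the mean makes the figure less sensitive to a single slow run, for example one delayed by GPU initialization.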

In [6]:
from HuggingFace.GPT2.measurements import gpt2_inference
from HuggingFace.NNDF.networks import TimingProfile

# Benchmarking PyTorch performance on single batch
output, decoder_e2e_median_time = gpt2_inference(
            model.to('cuda:0'), inputs.input_ids.to('cuda:0'), TimingProfile(iterations=10, number=1, warmup=1)
        )
decoder_e2e_median_time

0.010863043999052024

#### Open-end text generation
Next, we use the PyTorch model for open-end text generation, a task GPT-2 is particularly good at.

In [7]:
from HuggingFace.GPT2.GPT2ModelConfig import GPT2ModelTRTConfig

sample_output = model.to('cuda:0').generate(
    inputs.input_ids.to('cuda:0'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT],
    num_beams=5, num_return_sequences=3, do_sample=True,
)

# de-tokenize model output to raw text
for s in sample_output:
    print(tokenizer.decode(s, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  next_indices = next_tokens // vocab_size


Hello, my dog is icky, but I'm going to get rid of him."

"Oh, he's not icky. He's icky, but I'm going to get rid of him."

"Oh, he's not icky. He's icky, but I'm going
Hello, my dog is __________, and I am not sure if he is or not. He is __________, and I am not sure if he is or not. He is __________, and I am not sure if he is or not. He is __________, and I am not
Hello, my dog is icky, but I love him so much. He's my best friend. I love him so much. He's my best friend. I love him so much. He's my best friend. I love him so much. He's my best friend. I love him so much. He


In [8]:
type(model)

transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel

For benchmarking, we use the helper function `full_inference_greedy`, which runs the full text generation repeatedly and measures the end-to-end execution time. Take note of this time for later comparison with TensorRT.

`TimingProfile` is a named tuple that specifies the number of experiments (`iterations`), the number of calls to the function per experiment (`number`), and the number of warm-up calls (`warmup`), although the warm-up setting is not used here.
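The helper's name suggests greedy decoding, where the highest-scoring token is appended at every step. The loop below is a minimal sketch of that strategy, not the code of `full_inference_greedy`; `greedy_decode` and the four-token `toy_scores` model are hypothetical and stand in for a real language model.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[Sequence[int]], List[float]],
    prompt_ids: Sequence[int],
    max_length: int,
    eos_token_id: int,
) -> List[int]:
    """Repeatedly append the highest-scoring next token until EOS or max_length."""
    ids = list(prompt_ids)
    while len(ids) < max_length:
        scores = next_token_scores(ids)
        # Greedy choice: index of the largest score (the argmax).
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids


def toy_scores(ids: Sequence[int]) -> List[float]:
    """Hypothetical 4-token 'model': always favors (last token + 1) mod 4."""
    favored = (ids[-1] + 1) % 4
    return [1.0 if token == favored else 0.0 for token in range(4)]


generated = greedy_decode(toy_scores, prompt_ids=[0], max_length=8, eos_token_id=3)
print(generated)  # [0, 1, 2, 3]
```

Because each step depends on the tokens produced so far, the model is called once per generated token, which is why full generation takes far longer than a single forward pass.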

In [9]:
from HuggingFace.GPT2.measurements import full_inference_greedy

# get complete decoder inference result and its timing profile
sample_output, full_e2e_median_runtime = full_inference_greedy(
    model.to('cuda:0'), inputs.input_ids, TimingProfile(iterations=10, number=1, warmup=1),
    max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT]
)
full_e2e_median_runtime

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

0.5969887245000791

<a id="2"></a>

## 2. Convert to ONNX format

Prior to converting the model to a TensorRT engine, we will first convert the PyTorch model to an intermediate universal format: ONNX.

ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.

At a high level, the steps to convert a PyTorch model to TensorRT are as follows:
- Convert the pretrained GPT-2 PyTorch model into ONNX.
- Import the ONNX model into TensorRT.
- Apply optimizations and generate an engine.
- Perform inference on the GPU with the TensorRT engine.

In [10]:
import torch
from torch.nn import Module

tokenizer = GPT2Tokenizer.from_pretrained(GPT2_VARIANT)


In [11]:
from transformers import BatchEncoding

input_ids: BatchEncoding = tokenizer("Here is some text to encode Hello World", add_special_tokens=True, return_tensors="pt")
print(type(input_ids))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


In [12]:
model.to("cpu")
model.eval()
with torch.no_grad():
    print(model(**input_ids))

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[ -34.3027,  -33.9891,  -37.5683,  ...,  -42.6734,  -42.0399,
           -34.6136],
         [ -83.3065,  -82.9769,  -86.1204,  ...,  -89.8062,  -89.4546,
           -83.6084],
         [ -91.4901,  -92.5655,  -95.6423,  ...,  -96.6183,  -98.1545,
           -91.5266],
         ...,
         [ -92.8820,  -94.8433,  -98.9224,  ..., -101.4426, -103.2702,
           -95.7642],
         [ -72.6140,  -76.3407,  -79.7973,  ...,  -87.3300,  -85.7930,
           -77.7521],
         [-103.6147, -108.7898, -109.6276,  ..., -116.8557, -116.5565,
          -107.4467]]]), past_key_values=((tensor([[[[-1.2580,  1.5852,  1.0896,  ..., -1.5187, -0.0358,  1.1204],
          [-1.8348,  2.4955,  1.7497,  ..., -1.5397, -2.3685,  2.4482],
          [-2.3188,  2.1258,  1.6742,  ..., -0.6896, -1.4082,  1.8576],
          ...,
          [-1.7020,  2.4332,  1.0700,  ..., -1.6933, -0.7572,  0.9417],
          [-2.1612,  1.8802,  0.7015,  ..., -0.2824,

In [45]:
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
from transformers.generation_utils import GenerationMixin


class GPTWrapper(Module, GenerationMixin):

    def __init__(self, model: GPT2LMHeadModel):
        super().__init__()
        self.transformer = model.transformer
        self.lm_head = model.lm_head
        self.config = model.config
        self.device = model.device

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {
            "input_ids": input_ids,
        }

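    def forward(self, input_ids, **_):
        # Run the transformer stack on the token ids only (no attention mask, no cache).
        transformer_outputs = self.transformer(input_ids=input_ids)
        hidden_states = transformer_outputs[0]
        # Project the hidden states to vocabulary logits.
        logits = self.lm_head(hidden_states)
        # Return only the logits so the exported graph has a single output.
        return CausalLMOutputWithCrossAttentions(logits=logits)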

In [46]:
gpt2_model = GPTWrapper(model=model)
model.eval()
with torch.no_grad():
    sample_output = gpt2_model.generate(inputs.input_ids.to('cpu'), max_length=64)
    print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, my dog is icky. I'm not sure if he's a dog or not, but he's a dog. I'm not sure if he's a dog or not, but he's a dog.

I'm not sure if he's a dog or not, but he's a dog.


In [50]:
from torch.onnx import TrainingMode
from collections import OrderedDict

input_ids: BatchEncoding = tokenizer("Here is some text to encode Hello World", add_special_tokens=True, return_tensors="pt")
# Name the axes that may vary in size so the exported graph accepts variable-sized inputs.
dynamic_axis = OrderedDict()
dynamic_axis["input_ids"] = {0: "batch_size", 1: "sequence"}
dynamic_axis["output"] = {0: "batch_size", 1: "sequence", 2: "vocabulary_size"}
model.eval()
with torch.no_grad():
    torch.onnx.export(
        model=gpt2_model,
        args=input_ids["input_ids"],
        f="test-gpt2.onnx",
        opset_version=13,
        do_constant_folding=True,
        input_names=["input_ids"],
        output_names=["output"],
        dynamic_axes=dynamic_axis,
        training=TrainingMode.EVAL,
    )

In [48]:
from transformer_deploy.backends.ort_utils import create_model_for_provider
from typing import Dict
import numpy as np
input_ids: Dict[str, np.ndarray] = dict(tokenizer("Here is some text to encode Hello World", add_special_tokens=True, return_attention_mask=False, return_tensors="np"))

model_onnx = create_model_for_provider(path="test-gpt2.onnx", provider_to_use="CPUExecutionProvider")
output = model_onnx.run(None, input_ids)

In [49]:
# https://github.com/Ki6an/fastT5/blob/2f73bd57ca3bab226952679b4381049eb09721a4/fastT5/onnx_models.py#L110
output

[array([[[ -34.302658,  -33.98911 ,  -37.568275, ...,  -42.6734  ,
           -42.0399  ,  -34.613556],
         [ -83.306496,  -82.9769  ,  -86.120415, ...,  -89.806244,
           -89.4546  ,  -83.60838 ],
         [ -91.49007 ,  -92.565544,  -95.64229 , ...,  -96.618324,
           -98.154526,  -91.52658 ],
         ...,
         [ -92.88199 ,  -94.84328 ,  -98.922386, ..., -101.44257 ,
          -103.27019 ,  -95.764175],
         [ -72.61406 ,  -76.34074 ,  -79.79736 , ...,  -87.33001 ,
           -85.793   ,  -77.75212 ],
         [-103.61468 , -108.78979 , -109.62762 , ..., -116.8557  ,
          -116.55652 , -107.44668 ]]], dtype=float32)]

## 3. Convert to TensorRT engine

Now we are ready to parse the ONNX model and convert it to an optimized TensorRT model.

Note: As TensorRT carries out many optimizations, this conversion process might take a while for the larger model.
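The `build_engine` call in this section passes `min_shape`, `optimal_shape`, and `max_shape`, all set to `(1, 8)`. A TensorRT engine built with such a profile only accepts inputs whose dimensions fall between the minimum and maximum shapes. The check below is a plain-Python sketch of that rule; `shape_in_profile` is a hypothetical helper and not part of TensorRT or `transformer_deploy`.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def shape_in_profile(shape: Shape, min_shape: Shape, max_shape: Shape) -> bool:
    """Check that every dimension lies within [min, max] of the profile."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


# The profile used in this notebook: batch size 1, exactly 8 tokens.
fits_fixed = shape_in_profile((1, 8), min_shape=(1, 8), max_shape=(1, 8))
# The 6-token prompt from the PyTorch section does not fit that profile.
prompt_fits_fixed = shape_in_profile((1, 6), min_shape=(1, 8), max_shape=(1, 8))
# A hypothetical wider profile (1 to 64 tokens) would admit it.
prompt_fits_wide = shape_in_profile((1, 6), min_shape=(1, 1), max_shape=(1, 64))

print(fits_fixed, prompt_fits_fixed, prompt_fits_wide)
```

For text generation, where the sequence grows with every step, a wider range between the minimum and maximum shapes would be needed; the 1 to 64 range above is only an illustrative assumption.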

In [38]:
from typing import Callable
import tensorrt as trt
from tensorrt.tensorrt import ICudaEngine, Logger, Runtime

from transformer_deploy.backends.trt_utils import build_engine, load_engine, save_engine

trt_logger: Logger = trt.Logger(trt.Logger.INFO)
runtime: Runtime = trt.Runtime(trt_logger)
engine: ICudaEngine = build_engine(
    runtime=runtime,
    onnx_file_path="test-gpt2.onnx",
    logger=trt_logger,
    min_shape=(1, 8),
    optimal_shape=(1, 8),
    max_shape=(1, 8),
    workspace_size=10 * 1024 * 1024,
    fp16=False,
    int8=False,
)
save_engine(engine, "test-gpt2.plan")

[01/14/2022-11:49:45] [TRT] [I] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.

[01/14/2022-11:49:45] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 7605, GPU 5775 (MiB)
[01/14/2022-11:49:45] [TRT] [I] The logger passed into createInferBuilder differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.

[01/14/2022-11:49:45] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 7605, GPU 5775 (MiB)
[01/14/2022-11:49:46] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 7605 MiB, GPU 5775 MiB
[01/14/2022-11:49:46] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 7759 MiB, GPU 5817 MiB




[01/14/2022-11:49:47] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/14/2022-11:49:47] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[01/14/2022-11:49:47] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[01/14/2022-11:49:47] [TRT] [W] ShapedWeights.cpp:173: Weights transformer.h.0.attn.c_attn.weight has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[01/14/2022-11:49:47] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[01/14/2022-11:49:47] [TRT] [W] ShapedWeights.cpp:173: Weights transformer.h.0.attn.c_proj.weight has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the ne

In [39]:
tensorrt_model: Callable[[Dict[str, np.ndarray]], np.ndarray] = load_engine(engine_file_path="test-gpt2.plan", runtime=runtime)

input_ids: Dict[str, np.ndarray] = dict(tokenizer("Here is some text to encode Hello World", add_special_tokens=True, return_tensors="np"))
tensorrt_model(input_ids)

# engine_name = "TensorRT (FP16)"
# tensorrt_output, time_buffer = launch_inference(infer=tensorrt_model, inputs=inputs_onnx, nb_measures=commands.nb_measures)


[01/14/2022-11:52:19] [TRT] [I] Loaded engine size: 1244 MiB
[01/14/2022-11:52:20] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 12672, GPU 7711 (MiB)
[01/14/2022-11:52:20] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +622, now: CPU 0, GPU 1243 (MiB)
[01/14/2022-11:52:20] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 11427, GPU 7711 (MiB)
[01/14/2022-11:52:20] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 0, GPU 1244 (MiB)


[array([[[ -34.302776,  -33.989223,  -37.568394, ...,  -42.67351 ,
           -42.040012,  -34.61367 ],
         [ -83.30649 ,  -82.97688 ,  -86.12041 , ...,  -89.80624 ,
           -89.45458 ,  -83.608376],
         [ -91.49008 ,  -92.56556 ,  -95.64229 , ...,  -96.61833 ,
           -98.15453 ,  -91.52657 ],
         ...,
         [ -92.88197 ,  -94.84328 ,  -98.92238 , ..., -101.44255 ,
          -103.27018 ,  -95.764175],
         [ -72.61402 ,  -76.3407  ,  -79.79732 , ...,  -87.32998 ,
           -85.79297 ,  -77.75209 ],
         [-103.61464 , -108.78976 , -109.62758 , ..., -116.85567 ,
          -116.55647 , -107.446625]]], dtype=float32)]
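The PyTorch, ONNX Runtime, and TensorRT outputs printed in this notebook agree to several decimal places. The sketch below checks this on the six visible logits of the first token, copied from those outputs. The tolerance of `1e-3` is an assumption chosen for illustration; with the full arrays, `np.allclose` would do the same job.

```python
import math

# First three and last three logits of the first token, copied from the printed outputs.
pytorch_logits = [-34.3027, -33.9891, -37.5683, -42.6734, -42.0399, -34.6136]
onnx_logits = [-34.302658, -33.98911, -37.568275, -42.6734, -42.0399, -34.613556]
tensorrt_logits = [-34.302776, -33.989223, -37.568394, -42.67351, -42.040012, -34.61367]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


def all_close(a, b, atol=1e-3):
    """True if every pair of elements differs by at most `atol` (assumed tolerance)."""
    return all(math.isclose(x, y, rel_tol=0.0, abs_tol=atol) for x, y in zip(a, b))


print("PyTorch vs ONNX:    ", max_abs_diff(pytorch_logits, onnx_logits))
print("ONNX vs TensorRT:   ", max_abs_diff(onnx_logits, tensorrt_logits))
print("PyTorch vs TensorRT:", all_close(pytorch_logits, tensorrt_logits))
```

Small discrepancies of this size are expected when the same FP32 computation is executed by different kernels; a much larger gap would point to a conversion problem.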

### Inference with TensorRT engine

Great, if you have reached this stage, it means we now have an optimized TensorRT engine for the GPT-2 model, ready for us to carry out inference. 

The GPT-2 model with TensorRT backend can now be employed in place of the original HuggingFace GPT-2 model.

#### Single batch inference


In [None]:
from HuggingFace.GPT2.trt import GPT2TRTDecoder

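# NOTE: `gpt2_engine` and `metadata` are not defined earlier in this notebook;
# they must be created before this cell can run.
gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config)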

outputs = gpt2_trt(inputs.input_ids)
logits = outputs.logits

In [None]:
# Benchmarking TensorRT performance on single batch
output, decoder_e2e_median_time = gpt2_inference(
            gpt2_trt, inputs.input_ids, TimingProfile(iterations=10, number=1, warmup=1)
        )
decoder_e2e_median_time

#### Open-end text generation

In [None]:
sample_output = gpt2_trt.generate(
    inputs.input_ids.to('cuda:0'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT]
)

# de-tokenize model output to raw text
tokenizer.decode(sample_output[0], skip_special_tokens=True)

In [None]:
# get complete decoder inference result and its timing profile
sample_output, full_e2e_median_runtime = full_inference_greedy(
    gpt2_trt, inputs.input_ids, TimingProfile(iterations=10, number=1, warmup=1),
    max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT]
)
full_e2e_median_runtime

You can now compare the output of the original PyTorch model with that of the TensorRT engine and note the speed difference. On an NVIDIA V100 32GB GPU, TensorRT delivers roughly a 5x performance improvement for the GPT-2 small model (from an average of 0.704s to 0.134s).
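The speedup follows directly from the two average runtimes quoted above. The short calculation below uses only those two figures; the variable names are illustrative.

```python
# Average end-to-end runtimes quoted above for the GPT-2 small model (seconds).
pytorch_time_s = 0.704
tensorrt_time_s = 0.134

# Speedup is the ratio of the old runtime to the new one.
speedup = pytorch_time_s / tensorrt_time_s
# Latency reduction is the fraction of the original runtime that was saved.
latency_reduction = 1.0 - tensorrt_time_s / pytorch_time_s

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {latency_reduction:.1%}")
```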

## Conclusion and where to next?

This notebook has walked you through the process of converting a HuggingFace PyTorch GPT-2 model to an optimized TensorRT engine for inference in 3 easy steps. The TensorRT inference engine can be conveniently used as a drop-in replacement for the original HuggingFace GPT-2 model while providing a significant speedup.

If you are interested in further details of the conversion process, check out [GPT2/trt.py](../GPT2/trt.py).