# Accelerating GPT-2 model

### *(and any decoder-based transformer model)*

In this notebook we will see how to accelerate generative (decoder-only) models like GPT-2.
The main thing we will learn is that, when a generative model is executed on GPU, you need to pay close attention to memory transfers between GPU and host.

## GPT-2 loading

As a reminder:

* `gpt2`: 117M parameters
* `gpt2-large`: 774M parameters

In [2]:
import logging
import time
from typing import Callable, Dict

import numpy as np
import tensorrt as trt
import torch
from tensorrt import ICudaEngine
from tensorrt.tensorrt import Logger, Runtime
from transformers import AutoTokenizer, BatchEncoding, GPT2LMHeadModel, AutoModelForCausalLM
from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions
from transformer_deploy.utils.generative_model import GPTModelWrapper
import inspect
from transformers import TensorType

from transformer_deploy.backends.ort_utils import create_model_for_provider, inference_onnx_binding, optimize_onnx
from transformer_deploy.backends.pytorch_utils import convert_to_onnx, get_model_size
from transformer_deploy.backends.trt_utils import build_engine, load_engine, save_engine

In [3]:
model_name = "gpt2"  # choices: gpt2 | gpt2-large

# use GPT2LMHeadModel and not AutoModel to export raw outputs to predict next token
model: GPT2LMHeadModel = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
# to avoid error message or passing some args to each generate call
model.config.pad_token_id = tokenizer.eos_token_id

### Model output

Below we output predictions for the next token.
Those values will be used by the decoding algorithm.
Output shape looks like: [batch size, nb tokens, vocabulary size]

In [4]:
inputs = tokenizer("Hello, my dog is ", return_tensors="pt")
print(inputs)
print(inputs["input_ids"].size())
print("----")

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
print(logits)
print(f"shape: {logits.shape}")
print(f"tensor size: {np.prod(logits.shape)*32/8/1024**2:.2f} Mb")  # same as sys.getsizeof(logits.storage())/1024**2

{'input_ids': tensor([[15496,    11,   616,  3290,   318,   220]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
torch.Size([1, 6])
----
tensor([[[ -35.2362,  -35.3266,  -38.9753,  ...,  -44.4645,  -43.9974,
           -36.4580],
         [-112.6171, -114.5831, -116.5724,  ..., -119.0128, -118.8059,
          -111.6917],
         [ -88.7435,  -89.8643,  -93.1977,  ...,  -92.3839,  -96.1782,
           -92.1273],
         [ -85.1646,  -88.3379,  -92.8703,  ...,  -99.8017,  -94.7657,
           -90.9330],
         [-116.7280, -119.3950, -121.7259,  ..., -129.1003, -124.6102,
          -121.6092],
         [ -61.9847,  -63.7082,  -65.6898,  ...,  -76.0924,  -71.7898,
           -66.1154]]])
shape: torch.Size([1, 6, 50257])
tensor size: 1.15 Mb


### Total tensor size

GPT-2 generates a sequence 1 token at a time.
So to generate a sequence of 256 tokens from a 6-token prompt, it performs 250 inferences.

To simplify things, we assume that we are using a greedy decoding algorithm.
Let's compute the total size of the output tensor.

In [5]:
size = 0
for i in range(6, 256, 1):
    # input sequence (input_ids) made of int-32 (4 bytes)
    size += np.prod([1, i]) * 4
    # output tensor made of float-32 (4 bytes)
    size += np.prod([1, i, 50257]) * 4
print(f"total size (input+output): {size / 1024**3:.2f} Gb")

total size (input+output): 6.11 Gb


It's important to keep this order of magnitude in mind when we try to optimize GPT-2 inference.
In particular, we will try to limit tensor movement from GPU memory to host.
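
To make the order of magnitude concrete, the sketch below redoes the computation above as two small functions and compares it with what a greedy decoder strictly needs, namely the logits of the last position only. The function names are illustrative, and the "last position only" variant is an assumption about what could be transferred, not something measured in this notebook.

```python
VOCAB_SIZE = 50257     # GPT-2 vocabulary size (last dimension of the logits)
BYTES_PER_VALUE = 4    # float32 logits and int32 token ids


def full_logits_bytes(prompt_len: int, max_len: int, batch: int = 1) -> int:
    """Bytes moved when the input ids and the whole logits tensor are copied at each step."""
    total = 0
    for seq_len in range(prompt_len, max_len):
        total += batch * seq_len * BYTES_PER_VALUE               # input_ids
        total += batch * seq_len * VOCAB_SIZE * BYTES_PER_VALUE  # logits
    return total


def last_position_bytes(prompt_len: int, max_len: int, batch: int = 1) -> int:
    """Bytes moved if only the logits of the last position were copied at each step."""
    steps = max_len - prompt_len
    return steps * batch * VOCAB_SIZE * BYTES_PER_VALUE


full = full_logits_bytes(prompt_len=6, max_len=256)
last = last_position_bytes(prompt_len=6, max_len=256)
print(f"full logits:        {full / 1024**3:.2f} GiB")
print(f"last position only: {last / 1024**2:.2f} MiB")
```

The gap between the two figures is roughly two orders of magnitude, which is why keeping the logits on the GPU (or at least not copying all of them) matters so much for generation.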

## Build ONNX graph

Performant inference engines tend to consume a graph instead of imperative Pytorch code. We use ONNX for that purpose.

> ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.
> https://onnx.ai/

In [6]:
input_ids: BatchEncoding = tokenizer(
    "Here is some text to encode Hello World", add_special_tokens=True, return_attention_mask=False, return_tensors="pt"
)
# some inference engines don't support int64 tensor as inputs, we convert all input tensors to int32 type
for k, v in input_ids.items():  # type: str, torch.Tensor
    input_ids[k] = v.type(dtype=torch.int32)

convert_to_onnx(
    model_pytorch=model,
    output_path="test-gpt2.onnx",
    inputs_pytorch=dict(input_ids),
    quantization=False,
    var_output_seq=True,  # we inform ONNX export tool that the output shape will vary with the input shape
)
# model may switch to train mode for some unknown reasons, we force the eval mode.
_ = model.eval()

  attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)


### Optimize ONNX graph

In [7]:
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)
num_attention_heads, hidden_size = get_model_size(path=model_name)
optimize_onnx(
    onnx_path="test-gpt2.onnx",
    onnx_optim_model_path="test-gpt2-opt.onnx",
    fp16=True,
    use_cuda=True,
    num_attention_heads=num_attention_heads,
    hidden_size=hidden_size,
    architecture="gpt2",
)

INFO:fusion_base:Fused LayerNormalization count: 25
INFO:fusion_base:Fused FastGelu count: 12
INFO:fusion_utils:Remove reshape node Reshape_9 since its input shape is same as output: ['batch_size', 'sequence']
INFO:fusion_utils:Remove reshape node Reshape_19 since its input shape is same as output: [1, 'sequence']
INFO:fusion_utils:Remove reshape node Reshape_2700 since its input shape is same as output: ['batch_size', 'sequence', 768]
INFO:onnx_model:Graph pruned: 0 inputs, 0 outputs and 23 nodes are removed
INFO:onnx_model:Graph pruned: 0 inputs, 0 outputs and 864 nodes are removed
INFO:onnx_model_gpt2:postprocess: remove Reshape count:72
INFO:fusion_base:Fused FastGelu(add bias) count: 12
INFO:onnx_model_bert:opset verion: 13
INFO:onnx_model_bert:Optimized operators:{'EmbedLayerNormalization': 0, 'Attention': 0, 'Gelu': 0, 'FastGelu': 12, 'BiasGelu': 0, 'LayerNormalization': 25, 'SkipLayerNormalization': 0}
INFO:root:optimizations applied: {'EmbedLayerNormalization': 0, 'Attention':

## Build TensorRT engine

In [8]:
from pathlib import Path

trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
runtime: Runtime = trt.Runtime(trt_logger)
trt_model_name = "test-gpt2.plan"

# create it only if it does not exist, because it's slow to build...
if not Path(trt_model_name).exists():
    engine: ICudaEngine = build_engine(
        runtime=runtime,
        onnx_file_path="test-gpt2.onnx",
        logger=trt_logger,
        min_shape=(1, 1),
        optimal_shape=(1, 128),  # num beam -> batch size
        max_shape=(1, 384),  # num beam -> batch size
        workspace_size=12000 * 1024 * 1024,
        fp16=True,
        int8=False,
    )
    save_engine(engine, trt_model_name)
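
The engine above is built for a range of input shapes, from `min_shape` to `max_shape`, and an input outside that range cannot be run by the engine. As a minimal sketch, the check can be written as below. `fits_profile` is a hypothetical helper written for illustration; it is not part of TensorRT or `transformer_deploy`.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def fits_profile(shape: Shape, min_shape: Shape, max_shape: Shape) -> bool:
    """True when every dimension of `shape` lies within [min_shape, max_shape]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(low <= dim <= high for dim, low, high in zip(shape, min_shape, max_shape))


# (batch size, sequence length), same values as in the cell above
MIN_SHAPE, MAX_SHAPE = (1, 1), (1, 384)

print(fits_profile((1, 256), MIN_SHAPE, MAX_SHAPE))  # 256 tokens, batch of 1
print(fits_profile((1, 512), MIN_SHAPE, MAX_SHAPE))  # sequence longer than max_shape
print(fits_profile((4, 64), MIN_SHAPE, MAX_SHAPE))   # batch of 4 (e.g. 4 beams)
```

With this profile, only the first shape is accepted: generating with several beams would require rebuilding the engine with a larger batch dimension.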

## Benchmarks

We will benchmark 3 inference engines, in 4 setups:
- *Pytorch* (Hugging Face implementation)
- *ONNX Runtime* with optimized ONNX graph and standard API (tensors stored as numpy objects, on host RAM)
- *ONNX Runtime* with optimized ONNX graph and binding IO API (tensors stored on CUDA, to limit IO)
- *Nvidia TensorRT* (tensors stored on CUDA, to limit IO)

For each of them, we print the generated text to check that they all produce the same string.
Then we run 2 generations for the warmup and 10 to measure the latency.
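
The warmup / measure pattern described above can be factored into a small helper. This is only a sketch: `benchmark` is a hypothetical function name, and the cells below inline the same loops instead of calling it.

```python
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> float:
    """Return the mean latency of `fn` in seconds, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


calls = []
latency = benchmark(lambda: calls.append(1))
print(f"{len(calls)} calls, {latency * 1e6:.2f} us/call")
```

On GPU, kernels are launched asynchronously, so a real measurement also has to wait for the device (for instance with `torch.cuda.synchronize()`, as done later in this notebook) before reading the clock.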

### Generative model wrapper

The most interesting thing in the class below is that we inherit from `GenerationMixin` (from the Hugging Face transformers library), which gives our class some super powers, like a method to generate sequences (i.e. to run a decoding algorithm on top of the model output).
Note that the actual model inference is done by a function provided through the constructor.

In [9]:
print(inspect.getsource(GPTModelWrapper))

class GPTModelWrapper(Module, GenerationMixin):
    def __init__(
        self, config: PretrainedConfig, device: torch.device, inference: Callable[[torch.Tensor], torch.Tensor]
    ):
        super().__init__()
        self.config: PretrainedConfig = config
        self.device: torch.device = device
        self.inference: Callable[[torch.Tensor], torch.Tensor] = inference
        self.main_input_name = "input_ids"  # https://github.com/huggingface/transformers/pull/14803

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {
            self.main_input_name: input_ids,
        }

    def forward(self, input_ids, **_):
        logits = self.inference(input_ids)
        return CausalLMOutputWithCrossAttentions(logits=logits)
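
The idea of injecting the inference function can be illustrated without any deep learning library. The sketch below is a toy greedy decoding loop over plain Python lists. It is not how `GenerationMixin` is implemented, and `greedy_generate` and `toy_inference` are hypothetical names; it only shows that the decoding loop does not need to know which engine produces the logits.

```python
from typing import Callable, List

Logits = List[List[float]]  # [nb tokens, vocabulary size] for a single sequence


def greedy_generate(inference: Callable[[List[int]], Logits], input_ids: List[int], max_length: int) -> List[int]:
    """Append the most likely next token until `max_length` tokens are reached."""
    tokens = list(input_ids)
    while len(tokens) < max_length:
        logits = inference(tokens)  # any backend can be plugged in here
        last = logits[-1]           # only the last position predicts the next token
        tokens.append(max(range(len(last)), key=last.__getitem__))
    return tokens


def toy_inference(tokens: List[int], vocab_size: int = 5) -> Logits:
    """Stand-in for a model: at each position, favour token (current + 1) % vocab_size."""
    return [[1.0 if v == (t + 1) % vocab_size else 0.0 for v in range(vocab_size)] for t in tokens]


print(greedy_generate(toy_inference, [0], max_length=7))
```

Swapping `toy_inference` for another callable changes the engine without touching the loop, which is what the benchmarks below do with Pytorch, ONNX Runtime and TensorRT.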



In [10]:
inputs = tokenizer(
    "Here is some text to encode Hello World",  # Nvidia example prompt
    add_special_tokens=True,
    return_attention_mask=False,  # Not used
    return_tensors=TensorType.PYTORCH,
)

### Pytorch inference

We use the vanilla Hugging Face implementation with the Pytorch backend.

In [11]:
def inference_torch(input_ids: torch.Tensor) -> torch.Tensor:
    transformer_outputs: BaseModelOutputWithPastAndCrossAttentions = model.transformer(input_ids=input_ids)
    return model.lm_head(transformer_outputs.last_hidden_state)


model.cuda()
model.eval()
inputs.to("cuda")
with torch.inference_mode():
    gpt2_model = GPTModelWrapper(config=model.config, device=model.device, inference=inference_torch)
    sample_output = gpt2_model.generate(inputs.input_ids, max_length=64)
    print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
    for _ in range(2):
        _ = gpt2_model.generate(inputs.input_ids, max_length=64)
    start = time.time()
    for _ in range(10):
        _ = gpt2_model.generate(inputs.input_ids, max_length=256, use_cache=True)
    print(f"----\nPytorch: {(time.time() - start)/10:.2f}s/sequence")
_ = model.cpu()

Here is some text to encode Hello World.

Hello World

Hello World is a simple program that takes a string and returns a string.

The program is written in C.

The program is written in C. The program is written in C. The program is written in C. The program
----
Pytorch: 2.17s/sequence


### Naive ONNX Runtime inference

Below we use `ONNX Runtime` and its standard API (CUDA provider).
It takes `numpy` tensors as input and output, meaning they are stored in host memory (RAM).
We then convert the `numpy` output to a `Pytorch` tensor.
It means that for each generated token the whole output tensor is moved from GPU memory to host.
It represents more than 6 Gb of memory transfer per generated sequence.

This move is not really useful, as the decoding algorithm is coded in Pytorch and would also work with Pytorch tensors stored on GPU.

In [12]:
model_onnx = create_model_for_provider(path="test-gpt2-opt.onnx", provider_to_use="CUDAExecutionProvider")


def inference_onnx_naive(input_ids: torch.Tensor) -> torch.Tensor:
    data = {"input_ids": input_ids.detach().cpu().numpy().astype(np.int32)}
    logit = model_onnx.run(None, data)
    np_logit = np.array(logit)  # convert list of numpy arrays to a numpy array
    # we convert numpy tensor to Pytorch tensor as it's the type expected by HF decoding algorithm
    return torch.squeeze(torch.from_numpy(np_logit), dim=0)


gpt2_model = GPTModelWrapper(config=model.config, device=torch.device("cpu"), inference=inference_onnx_naive)
inputs.to("cpu")
sample_output = gpt2_model.generate(inputs.input_ids, max_length=64)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
for _ in range(2):
    _ = gpt2_model.generate(inputs.input_ids, max_length=64)
start = time.time()
for _ in range(10):
    _ = gpt2_model.generate(inputs.input_ids, max_length=256, use_cache=False)
print(f"----\nONNX Runtime (standard API): {(time.time() - start)/10:.2f}s/sequence")

del model_onnx

Here is some text to encode Hello World.

Hello World

Hello World is a simple program that takes a string and returns a string.

The program is written in C.

The program is written in C. The program is written in C. The program is written in C. The program
----
ONNX Runtime (standard API): 4.05s/sequence


### Optimized ONNX Runtime inference

Here we use ONNX Runtime and its binding IO API.
The main difference with the previous benchmark is that we keep everything on GPU.
Just by removing memory movement, we reduce the inference time by a large margin (from 4.05 s to 0.88 s per sequence in the runs recorded here).

In [13]:
model_onnx = create_model_for_provider(path="test-gpt2-opt.onnx", provider_to_use="CUDAExecutionProvider")


def inference_onnx_optimized(input_ids: torch.Tensor) -> torch.Tensor:
    data = {"input_ids": input_ids}
    return inference_onnx_binding(model_onnx=model_onnx, inputs=data, device="cuda")["output"]


gpt2_model = GPTModelWrapper(config=model.config, device=torch.device("cuda"), inference=inference_onnx_optimized)
inputs.to("cuda")
sample_output = gpt2_model.generate(inputs.input_ids, max_length=64)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
for _ in range(2):
    _ = gpt2_model.generate(inputs.input_ids, max_length=64)
start = time.time()
for _ in range(10):
    _ = gpt2_model.generate(inputs.input_ids, max_length=256, use_cache=False)
print(f"----\nONNX Runtime (binding io API): {(time.time() - start)/10:.2f}/sequence")
del model_onnx

Here is some text to encode Hello World.

Hello World

Hello World is a simple program that takes a string and returns a string.

The program is written in C.

The program is written in C. The program is written in C. The program is written in C. The program
----
ONNX Runtime (binding io API): 0.88/sequence


### TensorRT Inference

To conclude, we use the Nvidia engine, with all tensors stored on CUDA.
Unlike ONNX Runtime, each optimization applied has been checked on the target GPU with the expected tensor shapes.
It explains most of its performance advantage over the optimized ONNX Runtime setup.

In [14]:
tensorrt_model: Callable[[Dict[str, torch.Tensor]], torch.Tensor] = load_engine(
    engine_file_path="test-gpt2.plan", runtime=runtime
)


def inference_tensorrt(input_ids: torch.Tensor) -> torch.Tensor:
    data = {"input_ids": input_ids}
    return tensorrt_model(data)[0]


gpt2_model = GPTModelWrapper(config=model.config, device=torch.device("cuda"), inference=inference_tensorrt)
inputs.to("cuda")
sample_output = gpt2_model.generate(inputs.input_ids, max_length=64)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
for _ in range(2):
    _ = gpt2_model.generate(inputs.input_ids, max_length=64)
start = time.time()
for _ in range(10):
    _ = gpt2_model.generate(inputs.input_ids, max_length=256, use_cache=False)
print(f"----\nTensorRT + CUDA tensors: {(time.time() - start)/10:.2f}/sequence")

del tensorrt_model

Here is some text to encode Hello World.

Hello World

Hello World is a simple program that takes a string and returns a string.

The program is written in C.

The program is written in C. The program is written in C. The program is written in C. The program
----
TensorRT + CUDA tensors: 0.52/sequence


## Is caching of Key / Values (self attention) a good strategy on GPU?

In the self-attention block, the first step is to compute the key, query and value (known as K, Q and V) representations of each input token.
This computation is done in each self-attention block (as each of them has its own weights).
In a generative model without cache, those values are recomputed for all tokens each time a new token is generated.
Because, for a given input token, the keys and values won't change from one inference to the next, it may be interesting to cache and reuse them instead of recomputing them.
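
The following toy example illustrates the idea on a single attention head with random weights, in plain Python. It is a sketch under simplifying assumptions (one layer, one head, no layer normalization, hypothetical helper names), not GPT-2 itself: it only checks that reusing cached keys and values gives the same output as recomputing them, with far fewer projections.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product, used for the Q, K and V projections."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys / values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w / norm * value[d] for w, value in zip(weights, values)) for d in range(len(values[0]))]


random.seed(0)
dim, nb_tokens = 4, 6
w_q, w_k, w_v = ([[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)] for _ in range(3))
tokens = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(nb_tokens)]

# without cache: K and V of every token are recomputed at each step
no_cache_out, no_cache_projections = [], 0
for step in range(1, nb_tokens + 1):
    keys = [project(w_k, t) for t in tokens[:step]]
    values = [project(w_v, t) for t in tokens[:step]]
    no_cache_projections += 2 * step
    no_cache_out.append(attend(project(w_q, tokens[step - 1]), keys, values))

# with cache: only K and V of the newest token are computed, the rest is reused
cache_out, cache_projections = [], 0
cached_keys, cached_values = [], []
for token in tokens:
    cached_keys.append(project(w_k, token))
    cached_values.append(project(w_v, token))
    cache_projections += 2
    cache_out.append(attend(project(w_q, token), cached_keys, cached_values))

same = all(math.isclose(a, b, abs_tol=1e-12) for x, y in zip(no_cache_out, cache_out) for a, b in zip(x, y))
print(f"same outputs: {same}")
print(f"K/V projections without cache: {no_cache_projections}, with cache: {cache_projections}")
```

The number of projections grows quadratically with the sequence length without cache and linearly with it, but this count ignores the cost of storing and moving the cached tensors, which is what the benchmarks measure.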

However, this won't come for free as we would need 2 ONNX / TensorRT models, one generating the values for the first time (to boot the sequence generation), and one able to reuse the cached values.

But first, let's measure the gain in performance, if any.

### Export a model able to reuse cache

To simplify the code, we just use the ONNX exporter tool from the Hugging Face library.

In [15]:
from itertools import chain
from torch.onnx import export
from transformers.models.gpt2 import GPT2OnnxConfig
from transformers.onnx.features import FeaturesManager

model_name = "gpt2"
feature = "causal-lm-with-past"
seq_len = 256
atol = 0.2

tokenizer = AutoTokenizer.from_pretrained(model_name)
model: GPT2LMHeadModel = FeaturesManager.get_model_from_feature(feature, model_name)
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
onnx_config: GPT2OnnxConfig = model_onnx_config(model.config)

with torch.no_grad():
    model.config.return_dict = True
    model.eval()

    # Check if we need to override certain configuration item
    if onnx_config.values_override is not None:
        for override_config_key, override_config_value in onnx_config.values_override.items():
            setattr(model.config, override_config_key, override_config_value)

    # Ensure inputs match
    model_inputs = onnx_config.generate_dummy_inputs(tokenizer, framework=TensorType.PYTORCH)
    for k, v in model_inputs.items():
        if isinstance(v, torch.Tensor):
            model_inputs[k] = model_inputs[k].type(torch.int32)
    onnx_outputs = list(onnx_config.outputs.keys())

    onnx_config.patch_ops()

    # export works with named args, but the dict containing the named args has to be the last element of the args tuple
    export(
        model,
        (model_inputs,),
        f="model-support-cache.onnx",
        input_names=list(onnx_config.inputs.keys()),
        output_names=onnx_outputs,
        dynamic_axes={name: axes for name, axes in chain(onnx_config.inputs.items(), onnx_config.outputs.items())},
        do_constant_folding=True,
        use_external_data_format=onnx_config.use_external_data_format(model.num_parameters()),
        enable_onnx_checker=True,
        opset_version=13,
    )

    onnx_config.restore_ops()

  if batch_size <= 0:


The ONNX graph with cache support expects as input:
* 1 tensor named `input_ids` of shape [batch, sequence]
* 12 tensors named `past_key_values.X.key` (X replaced by layer ID) of shape [batch, 12, "past_sequence + sequence", 64]
* 12 tensors named `past_key_values.X.value` (X replaced by layer ID) of shape [batch, 12, "past_sequence + sequence", 64]
* 1 tensor named `attention_mask` of shape [batch, "past_sequence + sequence"]
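
As a quick sanity check of the list above, the sketch below builds the input names and shapes for one decoding step (1 new token) and computes the size of the cache. `cache_input_shapes` is a hypothetical helper; the defaults (12 layers, 12 heads, 64 dimensions per head) are the `gpt2` values visible in the shapes above.

```python
from typing import Dict, Tuple


def cache_input_shapes(
    batch: int, past_sequence: int, nb_layers: int = 12, nb_heads: int = 12, head_dim: int = 64
) -> Dict[str, Tuple[int, ...]]:
    """Shapes of the graph inputs for one decoding step, given `past_sequence` cached tokens."""
    shapes: Dict[str, Tuple[int, ...]] = {"input_ids": (batch, 1)}
    for layer in range(nb_layers):
        shapes[f"past_key_values.{layer}.key"] = (batch, nb_heads, past_sequence, head_dim)
        shapes[f"past_key_values.{layer}.value"] = (batch, nb_heads, past_sequence, head_dim)
    shapes["attention_mask"] = (batch, past_sequence + 1)
    return shapes


shapes = cache_input_shapes(batch=1, past_sequence=256)
cache_values = sum(
    dims[0] * dims[1] * dims[2] * dims[3] for name, dims in shapes.items() if name.startswith("past_key_values")
)
print(f"{len(shapes)} input tensors")
print(f"cache size in FP32: {cache_values * 4 / 1024**2:.0f} MiB")
```

For 256 cached tokens, the cache represents 24 tensors that have to be passed to the model at each step, in addition to `input_ids` and `attention_mask`.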

To list model inputs (according to their ONNX graph):

In [16]:
import onnx

model_cache = onnx.load("model-support-cache.onnx")
# first 80 lines
text = "\n".join(str(model_cache.graph.input).split("\n")[:80])
print(text)
del model_cache

[name: "input_ids"
type {
  tensor_type {
    elem_type: 6
    shape {
      dim {
        dim_param: "batch"
      }
      dim {
        dim_param: "sequence"
      }
    }
  }
}
, name: "past_key_values.0.key"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch"
      }
      dim {
        dim_value: 12
      }
      dim {
        dim_param: "past_sequence + sequence"
      }
      dim {
        dim_value: 64
      }
    }
  }
}
, name: "past_key_values.0.value"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch"
      }
      dim {
        dim_value: 12
      }
      dim {
        dim_param: "past_sequence + sequence"
      }
      dim {
        dim_value: 64
      }
    }
  }
}
, name: "past_key_values.1.key"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "batch"
      }
      dim {
        dim_value: 12
      }
      dim {
        dim_param: "past_sequence + sequence"
 

### Benchmark ONNX Runtime, CUDA tensors, with cache

When using the cache, we only test the non-optimized ONNX model (no kernel fusion), as kernel fusion fails in this setup.
It's probably because ONNX Runtime doesn't find the patterns it expects in this graph.

In [17]:
batch = 1
sequence = 256

# dummy inputs (the cache tensors are left uninitialized)
input_cache = dict()
input_cache["input_ids"] = torch.ones((batch, 1), dtype=torch.int32)
for i in range(12):  # 12 layers
    input_cache[f"past_key_values.{i}.key"] = torch.empty((batch, 12, sequence, 64), dtype=torch.float32)
    input_cache[f"past_key_values.{i}.value"] = torch.empty((batch, 12, sequence, 64), dtype=torch.float32)
input_cache["attention_mask"] = torch.ones((batch, sequence + 1), dtype=torch.int32)

print(f"for {sequence} tokens WITH cache:")
for nb_inference, provider, device in [(10, "CPUExecutionProvider", "cpu"), (100, "CUDAExecutionProvider", "cuda")]:
    model_onnx = create_model_for_provider(path="model-support-cache.onnx", provider_to_use=provider)

    # Pytorch
    model = model.to(device=device, non_blocking=False)
    output_pytorch = model(
        input_ids=torch.ones((batch, sequence), dtype=torch.int32, device=device), past_key_values=None
    )
    pytorch_input = torch.ones((batch, 1), dtype=torch.int32, device=device)
    for i in range(nb_inference):
        _ = model(input_ids=pytorch_input, past_key_values=output_pytorch.past_key_values)
        torch.cuda.synchronize()
    start = time.time()
    for i in range(nb_inference):
        _ = model(input_ids=pytorch_input, past_key_values=output_pytorch.past_key_values)
        torch.cuda.synchronize()
    print(f"[Pytorch / {device.upper()}] {1e3*(time.time() - start)/nb_inference:.2f}ms")

    # naive implementation (tensor copies)
    inputs_ort_np = {k: v.cpu().numpy() for k, v in input_cache.items()}
    # warmup
    for _ in range(nb_inference):
        _ = model_onnx.run(None, inputs_ort_np)
    start = time.time()
    for _ in range(nb_inference):
        _ = model_onnx.run(None, inputs_ort_np)
    print(f"[ONNX Runtime / {device.upper()} - with copy (naive)] {1e3*(time.time() - start)/nb_inference:.2f}ms")

    # ONNX Runtime optimized (no tensor copy)
    inputs = {k: v.to(device) for k, v in input_cache.items()}
    for _ in range(nb_inference):
        inference_onnx_binding(model_onnx=model_onnx, inputs=inputs, device=device)
    start = time.time()
    for _ in range(nb_inference):
        inference_onnx_binding(model_onnx=model_onnx, inputs=inputs, device=device)
    print(f"[ONNX Runtime / {device.upper()} - no copy] {1e3*(time.time() - start)/nb_inference:.2f}ms")

for 256 tokens WITH cache:
[Pytorch / CPU] 23.15ms
[ONNX Runtime / CPU - with copy (naive)] 27.90ms
[ONNX Runtime / CPU - no copy] 24.58ms
[Pytorch / CUDA] 10.56ms
[ONNX Runtime / CUDA - with copy (naive)] 14.15ms
[ONNX Runtime / CUDA - no copy] 4.29ms


First we note that, on GPU, ONNX Runtime with tensor copies (14.15 ms) is in the same range as, though slower than, Pytorch (10.56 ms), which keeps all its tensors on GPU.
When we remove the overhead of tensor copies, ONNX Runtime is about 2.5 times faster than Pytorch (4.29 ms vs 10.56 ms).
On CPU, the no-copy latency is similar to the tensor copy implementation, probably because tensors are all on host RAM (no GPU memory transfer).
On GPU, the no-copy latency is more than 3 times smaller than with tensor copies, which shows that IO is crucial.

We compare outputs of Pytorch and ONNX Runtime, with and without cache:

In [23]:
device = "cuda"
model = model.to(device)
a = model(input_ids=torch.ones((batch, sequence + 1), dtype=torch.int32, device=device), past_key_values=None)
print("pytorch - do not use cache output")
print(a.logits[:, -1, :])
b = model(input_ids=pytorch_input, past_key_values=output_pytorch.past_key_values)
print("pytorch - use cache output")
print(b.logits[:, -1, :])
input_cache = dict()
input_cache["input_ids"] = torch.ones((batch, 1), dtype=torch.int32, device=device)
input_cache["attention_mask"] = torch.ones((batch, sequence + 1), dtype=torch.int32, device=device)
for index, (k, v) in enumerate(output_pytorch.past_key_values):  # type: int, (torch.Tensor, torch.Tensor)
    input_cache[f"past_key_values.{index}.key"] = k
    input_cache[f"past_key_values.{index}.value"] = v
model_onnx = create_model_for_provider(path="model-support-cache.onnx", provider_to_use="CUDAExecutionProvider")
print("ONNX Runtime - use cache output")
print(inference_onnx_binding(model_onnx=model_onnx, inputs=input_cache, device=device)["logits"])

pytorch - do not use cache output
tensor([[-252.7095, -233.3065, -248.4932,  ..., -274.1753, -281.5232,
         -249.9053]], device='cuda:0', grad_fn=<SliceBackward0>)
pytorch - use cache output
tensor([[-252.6975, -233.2973, -248.4842,  ..., -274.1697, -281.5168,
         -249.8984]], device='cuda:0', grad_fn=<SliceBackward0>)
ONNX Runtime - use cache output
tensor([[[-252.6790, -233.2568, -248.4554,  ..., -274.1343, -281.4962,
          -249.8820]]], device='cuda:0')


### Benchmark ONNX Runtime, CUDA tensors, without cache

We compare the following setups, on both CPU and GPU:

- Pytorch
- ONNX Runtime (FP32): `test-gpt2.onnx`
- ONNX Runtime (FP16): `test-gpt2-opt.onnx`

For ONNX Runtime, we test 2 models: one without kernel fusion (FP32), and one with kernel fusion and FP16 precision.

In [26]:
input_cache = dict()
input_cache["input_ids"] = torch.ones((batch, sequence), dtype=torch.int32, device="cuda")

print(f"for {sequence} tokens WITHOUT cache:")
for nb_inference, provider, device in [(10, "CPUExecutionProvider", "cpu"), (100, "CUDAExecutionProvider", "cuda")]:
    inputs = {k: v.to(device) for k, v in input_cache.items()}

    for _ in range(nb_inference):
        _ = model(**input_cache)
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(nb_inference):
        _ = model(**input_cache)
        torch.cuda.synchronize()
    print(f"[Pytorch / {device.upper()}] {1e3*(time.time() - start)/nb_inference:.2f}ms")

    for model_path in ["test-gpt2.onnx", "test-gpt2-opt.onnx"]:
        model_onnx = create_model_for_provider(path=model_path, provider_to_use=provider)

        # naive implementation
        inputs_ort_numpy = {k: v.cpu().numpy() for k, v in input_cache.items()}
        # warmup
        for _ in range(nb_inference):
            _ = model_onnx.run(None, inputs_ort_numpy)
        start = time.time()
        for _ in range(nb_inference):
            _ = model_onnx.run(None, inputs_ort_numpy)
        print(
            f"[ONNX Runtime / {device.upper()} - with copy (naive) - {model_path}] {1e3*(time.time() - start)/nb_inference:.2f}ms"
        )

        # no copy
        for _ in range(nb_inference):
            inference_onnx_binding(model_onnx=model_onnx, inputs=inputs, device=device)
        start = time.time()
        for _ in range(nb_inference):
            inference_onnx_binding(model_onnx=model_onnx, inputs=inputs, device=device)
        print(
            f"[ONNX Runtime / {device.upper()} - no copy - {model_path}] {1e3*(time.time() - start)/nb_inference:.2f}ms"
        )

for 256 tokens WITHOUT cache:
[Pytorch / CPU] 12.17ms
[ONNX Runtime / CPU - with copy (naive) - test-gpt2.onnx] 203.05ms
[ONNX Runtime / CPU - no copy - test-gpt2.onnx] 173.47ms
[ONNX Runtime / CPU - with copy (naive) - test-gpt2-opt.onnx] 387.59ms
[ONNX Runtime / CPU - no copy - test-gpt2-opt.onnx] 385.84ms
[Pytorch / CUDA] 11.27ms
[ONNX Runtime / CUDA - with copy (naive) - test-gpt2.onnx] 13.74ms
[ONNX Runtime / CUDA - no copy - test-gpt2.onnx] 4.93ms
[ONNX Runtime / CUDA - with copy (naive) - test-gpt2-opt.onnx] 13.67ms
[ONNX Runtime / CUDA - no copy - test-gpt2-opt.onnx] 3.51ms


In [27]:
trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
runtime: Runtime = trt.Runtime(trt_logger)
trt_model_name = "test-gpt2.plan"

tensorrt_model: Callable[[Dict[str, torch.Tensor]], torch.Tensor] = load_engine(
    engine_file_path="test-gpt2.plan", runtime=runtime
)
nb_inference = 100
for _ in range(nb_inference):
    tensorrt_model(input_cache)

start = time.time()
for _ in range(nb_inference):
    tensorrt_model(input_cache)
print(f"[TensorRT / {device.upper()} - no copy] {1e3*(time.time() - start)/nb_inference:.2f}ms")

[TensorRT / CUDA - no copy] 2.20ms


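The most interesting thing to note is that, on ONNX Runtime, the model without cache and without kernel fusion has a latency similar to the model with cache (4.93 ms vs 4.29 ms).
It means that the cache overhead is about as large as the computation it saves, and with kernel fusion and FP16, which fail on the cache model, the model without cache is even faster (3.51 ms). It's probably the most unexpected result of this experiment.

As usual, FP16 models on CPU are slower than FP32.
Note that the Pytorch loop above calls `model(**input_cache)` with tensors stored on CUDA in both iterations, so the `[Pytorch / CPU]` figure is most likely a GPU measurement.

And to conclude, without much surprise, TensorRT is the fastest option by a large margin.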
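
To put these figures side by side, the short script below expresses each GPU "no copy" latency relative to the ONNX Runtime model with cache. The numbers are copied from the outputs recorded above; the labels are ours, and the ratios only hold for this model, sequence length and hardware.

```python
# latencies in ms for one inference of 256 tokens (GPU, no tensor copy), copied from the outputs above
latencies_ms = {
    "ONNX Runtime, with cache, no fusion": 4.29,
    "ONNX Runtime, no cache, no fusion (test-gpt2.onnx)": 4.93,
    "ONNX Runtime, no cache, fusion + FP16 (test-gpt2-opt.onnx)": 3.51,
    "TensorRT, no cache, FP16 (test-gpt2.plan)": 2.20,
}

baseline = latencies_ms["ONNX Runtime, with cache, no fusion"]
relative = {name: value / baseline for name, value in latencies_ms.items()}
for name, ratio in relative.items():
    print(f"{name}: {latencies_ms[name]:.2f} ms ({ratio:.2f}x the with-cache latency)")
```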