# Recipes to perform Nvidia GPU INT-8 quantization on most transformer models (encoder based)

Quantization is one of the most effective and generic approaches to make model inference faster.
Basically, it replaces high precision float numbers in model tensors encoded in 32 or 16 bits by lower precision ones encoded in 8 bits or less:

* it takes less memory
* computation is easier / faster

It can be applied to any model in theory, and, if done well, it should not decrease model accuracy.

The purpose of this tutorial is to show 2 processes to perform quantization on most `transformer` architectures.

**TL;DR, inference is 5 times faster on a `Roberta-base` model** with a batch of size 32 / seq len 256, benchmarked on the MNLI dataset (bold -> **quantization**):

| Framework                  | Precision | Latency (ms) | Accuracy | Speedup   | Hardware |
|:---------------------------|-----------|--------------|----------|:----------|:--------:|
| Pytorch                    | FP32      | 4407         | 86.8 %   | X 0.02    | CPU      |
| Pytorch                    | FP16      | 4255         | 86.8 %   | X 0.02    | CPU      |
| Pytorch                    | FP32      | 77           | 86.8 %   | X 1       | GPU      |
| Pytorch                    | FP16      | 58           | 86.8 %   | X 1.3     | GPU      |
| TensorRT                   | FP16      | 30           | 86.8 %   | X 2.6     | GPU      |
| TensorRT (transplantation) | **INT-8** | 15           | 84.8 %   | **X 5.1** | GPU      |
| TensorRT (custom QDQ code) | **INT-8** | 15           | 85.6 %   | **X 5.1** | GPU      |

> * measurements done on an Nvidia RTX 3090 GPU + 12 cores i7 Intel CPU
> * accuracy obtained after a single epoch, no LR search or any hyper parameter optimization
> * CPU measurements are unfair but still indicative of what kind of perf to expect from a Pytorch+CPU deployment
> * the same kind of acceleration is observed on all seq len / batch sizes


## A (very) short intro to INT-8 quantization

The basic idea behind model quantization is to replace tensors made of float numbers (usually encoded on 32 bits) by a lower precision representation (encoded on 8 bits for Nvidia GPUs), in general integers.
Computation is therefore faster and the model memory footprint is lower. Making tensor storage smaller also makes memory transfers faster, which is another source of acceleration.
This technique is very interesting for its trade-off: you reduce inference time significantly, and in most scenarios it costs close to nothing in accuracy.

Replacing float numbers by integers is done through a mapping.
This step is called `calibration`, and its purpose is to compute for each tensor or each channel of a tensor (one of its dimensions) a range of all possible values and then define a scale and a distribution center to map float numbers to 8 bits integers.
The process is well described in this [Nvidia presentation](https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf).
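
To make the mapping concrete, here is a minimal sketch in plain Python of a calibration followed by a quantize / dequantize round trip. It assumes a simple min / max calibration and an affine mapping (scale + zero point); the function names are illustrative and do not come from any library used in this tutorial.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # range of a signed 8-bit integer


def compute_mapping(values: List[float]) -> Tuple[float, int]:
    """Calibration: derive a scale and a zero point from the observed value range."""
    low, high = min(min(values), 0.0), max(max(values), 0.0)  # keep 0.0 representable
    scale = (high - low) / (QMAX - QMIN) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(QMIN - low / scale)
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to 8-bit integers, clamping what falls outside the range."""
    return [max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 8-bit integers back to (approximate) floats."""
    return [(q - zero_point) * scale for q in q_values]


tensor = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = compute_mapping(tensor)
quantized = quantize(tensor, scale, zero_point)
restored = dequantize(quantized, scale, zero_point)
print(quantized)
print([round(v, 3) for v in restored])
```

The restored values differ from the original ones by at most about one quantization step (`scale`): this rounding error is what the calibration tries to keep small.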

There are several ways to perform quantization, depending on how and when the `calibration` is performed:

* dynamically: the mapping is done during the inference, there is some overhead but it's easy to put in place and usually the accuracy is preserved,
* statically, after training (`post training quantization` or `PTQ`): this way is efficient, but it may have a significant accuracy cost,
* statically, before training (`quantization aware training` or `QAT`): this way is efficient and has a low accuracy cost, as the weights are adjusted during training to compensate for the quantization error.

In this guide we will focus on the third option: QAT.

During a quantization aware training:

* on the inside, Pytorch will work with high precision float numbers,
* on the outside, Pytorch will simulate that a quantization has already been applied and output results accordingly (for loss computation for instance),
* it will also refine the quantization mapping (scale, range, distribution center, etc.).
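
The first two points can be illustrated with a toy example in plain Python. This is only a sketch under strong assumptions: a single weight, a fixed scale supposed to come from calibration, and a hand-written gradient. The weight is stored as a float, the forward pass sees it through a quantize / dequantize step ("fake quantization"), and the gradient is applied to the float value as if the rounding were the identity (a straight-through estimator).

```python
def fake_quantize(value: float, scale: float, num_bits: int = 8) -> float:
    """Quantize then immediately dequantize: a float restricted to the INT-8 grid."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax, min(qmax, round(value / scale)))
    return q * scale


# toy model: prediction = w * x, the best float weight being 0.5
weight = 0.8        # high precision weight, the one updated during training
scale = 1.0 / 127   # quantization step, assumed to come from calibration
x, target = 2.0, 1.0
learning_rate = 0.05

for _ in range(200):
    w_q = fake_quantize(weight, scale)  # the forward pass uses the quantized view
    error = w_q * x - target
    # straight-through estimator: rounding is treated as the identity in the
    # backward pass, so the gradient computed for w_q updates the float weight
    gradient = 2 * error * x
    weight -= learning_rate * gradient

final_loss = (fake_quantize(weight, scale) * x - target) ** 2
print(weight, fake_quantize(weight, scale), final_loss)
```

The float weight ends up next to 0.5, and the loss is computed on the value the quantized model would actually use.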

You can check this [high quality blog post](https://leimao.github.io/article/Neural-Networks-Quantization/) for more information.

## Why a dedicated tutorial?

CPU quantization is supported out of the box by `Pytorch` or `ONNX Runtime`.
**GPU quantization, on the other hand, requires specific tools and processes to be applied**.

In the specific case of `transformer` models, right now (December 2021), the only way shown by Nvidia is to manually build the graph of our models in `TensorRT`. This is a low level approach, based on GPU capacity knowledge (which operators are supported, etc.). It's certainly out of reach of most NLP practitioners and is very time consuming to update/adapt to new architectures.

Fortunately, Nvidia recently added a new model called `QDQBert` to the Hugging Face `transformers` library.
Basically, it's a vanilla `Bert` architecture which supports INT-8 quantization.
It doesn't support any other architecture out of the box, like `Albert`, `Roberta`, or `Electra`.
Nvidia also provides a demo dedicated to the SQuAD task.

To be both simple and cover most use cases, in this tutorial we will see:

* how to perform GPU quantization on **any** transformer model (not just Bert) using a simple trick, a `transplantation`
* how to perform GPU quantization on `QDQRoberta`, a custom model similar to `QDQBert` and supported by the `transformer-deploy` library
* how to apply quantization to a common task like classification (which is easier to understand than question answering)
* how to measure the performance gain (latency)


## Project setup

### Dependencies installation

We install the `master` branch of the `transformers` library to use a new model, **QDQBert**, and `transformer-deploy` to leverage `TensorRT` models (the TensorRT API is not simple to master, so using a wrapper is highly advised).

In [1]:
#! pip install git+https://github.com/huggingface/transformers
#! pip install git+https://github.com/ELS-RD/transformer-deploy
#! pip install scikit-learn datasets
#! pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com
# or install pytorch-quantization from https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization

Check the GPU is enabled and usable.

In [2]:
! nvidia-smi

Wed Dec  8 07:41:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 30%   40C    P8    37W / 350W |    499MiB / 24267MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [3]:
import numpy as np
from tqdm.notebook import tqdm
import transformers
import datasets
from typing import OrderedDict as OD, List, Dict, Union
import torch
from torch import Tensor
from transformers import (
    AutoModelForSequenceClassification,
    PreTrainedModel,
    QDQBertForSequenceClassification,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    IntervalStrategy,
    AutoTokenizer,
    PreTrainedTokenizer,
)
from datasets import load_dataset, load_metric
from transformer_deploy.QDQModels.QDQRoberta import QDQRobertaForSequenceClassification
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization import calib
import logging
from datasets import DatasetDict
from transformer_deploy.backends.trt_utils import build_engine, get_binding_idxs, infer_tensorrt, load_engine
from transformer_deploy.backends.ort_utils import convert_to_onnx
from collections import OrderedDict
from transformer_deploy.benchmarks.utils import track_infer_time, print_timings
from pycuda._driver import Stream
import tensorrt as trt
from tensorrt.tensorrt import IExecutionContext, Logger, Runtime
import pycuda.autoinit

Set logging to `error` to make the `notebook` more readable on Github.

In [4]:
log_level = logging.ERROR
logging.getLogger().setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
transformers.logging.set_verbosity_error()

### Download data

This part is inspired by an [official notebook from Hugging Face](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb).

In [5]:
task = "mnli"
num_labels = 3
model_checkpoint = "roberta-base"
batch_size = 32
max_seq_len = 256
validation_key = "validation_matched"
timings: Dict[str, List[float]] = dict()

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark).

In [6]:
dataset = load_dataset("glue", task)
metric = load_metric("glue", task)
dataset


DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

### Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [7]:
tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the arguments `truncation=True` and `padding="max_length"`. This will ensure that all sequences have the same size.

In [8]:
def preprocess_function(examples):
    return tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True, padding="max_length", max_length=max_seq_len
    )

In [9]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

Some functions required for training and exporting the model:

In [10]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)


def calibrate(model: PreTrainedModel, encoded_dataset: DatasetDict, nb_sample: int = 128) -> PreTrainedModel:
    # Find the TensorQuantizer and enable calibration
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    with torch.no_grad():
        for start_index in tqdm(range(0, nb_sample, batch_size)):
            end_index = start_index + batch_size
            data = encoded_dataset["train"][start_index:end_index]
            input_torch = {
                k: torch.tensor(v, dtype=torch.long, device="cpu")
                for k, v in data.items()
                if k in ["input_ids", "attention_mask", "token_type_ids"]
            }
            model(**input_torch)

    # Finalize calibration
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax("percentile", percentile=99.99)
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

    model.cuda()
    return model


def convert_tensor(data: OD[str, List[List[int]]], output: str) -> OD[str, Union[np.ndarray, torch.Tensor]]:
    input: OD[str, Union[np.ndarray, torch.Tensor]] = OrderedDict()
    for k in ["input_ids", "attention_mask", "token_type_ids"]:
        if k in data:
            v = data[k]
            if output == "torch":
                value = torch.tensor(v, dtype=torch.long, device="cuda")
            elif output == "np":
                value = np.asarray(v, dtype=np.int32)
            else:
                raise Exception(f"unknown output type: {output}")
            input[k] = value
    return input

Some variables reused by the `TensorRT` steps:

In [11]:
runtime: Runtime = trt.Runtime(trt_logger)
profile_index = 0

## Fine-tuning model

Now that our data are ready, we can download the pretrained model and fine-tune it.

Default parameters to be used for the training:

In [12]:
nb_step = 1000
strategy = IntervalStrategy.STEPS
args = TrainingArguments(
    f"{model_checkpoint}-{task}",
    evaluation_strategy=strategy,
    eval_steps=nb_step,
    logging_steps=nb_step,
    save_steps=nb_step,
    save_strategy=strategy,
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size * 2,
    num_train_epochs=1,
    fp16=True,
    group_by_length=True,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to=[],
)

## Method 1: `Transplantation` of weights from a source model to an optimized architecture

The idea of transplantation is to export weights from one model and use them in another one.
In our case, the source is the `Roberta` weights and the target is the `Bert` architecture, which is highly optimized on `TensorRT` for GPU quantization.

Indeed, not all models are quantization compliant. The optimization engine (`TensorRT`) searches for some patterns and will fail to optimize the model if it doesn't find them. It requires the Pytorch code to be written in a certain way and to use certain operations. For that reason, it's a good idea to reuse a highly optimized architecture.

We will leverage the fact that since `Bert` has been released, very few improvements have been brought to the transformer architecture (at least for encoder only models).
Better models appeared, and most of the work has been done to improve the pretraining step (aka the weights).
So the idea will be to take the weights from those new models and put them inside `Bert` architecture.

The process described below should work for most users.

**steps**:

* load the `Bert` model
* retrieve its layer/weight names
* load the source model (here `Roberta`)
* rename the `Roberta` weights/layers with the names retrieved from `Bert`
* override the architecture name in the model configuration

If there is no 1 to 1 correspondence (it happens), try to keep at least the embeddings and the self attention. Of course, if a model is very different, the transplant may cost some accuracy. In our experience, if your trainset is big enough it should not happen.
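
The renaming step can be sketched on a toy example, independently of any model. The snippet below is a simplified illustration: `rename_weights` is a hypothetical helper, the keys are a heavily shortened, made-up excerpt of what a state dict looks like, and strings stand in for the tensors. It also adds a guard for the case where there is no 1 to 1 correspondence.

```python
from collections import OrderedDict
from typing import Any, List, Mapping


def rename_weights(source_weights: Mapping[str, Any], target_keys: List[str]) -> "OrderedDict[str, Any]":
    """Return the source weights, in the same order, under the target layer names."""
    if len(source_weights) != len(target_keys):
        raise ValueError(
            f"no 1 to 1 mapping: {len(source_weights)} source weights vs {len(target_keys)} target names"
        )
    return OrderedDict(zip(target_keys, source_weights.values()))


# hypothetical, shortened state dicts (real ones hold tensors, not strings)
roberta_like = OrderedDict(
    [
        ("roberta.embeddings.word_embeddings.weight", "tensor A"),
        ("roberta.encoder.layer.0.attention.self.query.weight", "tensor B"),
        ("classifier.out_proj.weight", "tensor C"),
    ]
)
bert_like_keys = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "classifier.weight",
]

transplanted = rename_weights(roberta_like, bert_like_keys)
for key, value in transplanted.items():
    print(key, "<-", value)
```

The mapping is purely positional, so on real models it is worth also checking that each tensor has the shape expected by the layer that receives it.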


In [13]:
model_bert: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_labels
)
bert_keys = list(model_bert.state_dict().keys())
del model_bert

model_roberta: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels
)
model_roberta.save_pretrained("roberta-in-bert")
del model_roberta
model_weights: OD[str, Tensor] = torch.load("roberta-in-bert/pytorch_model.bin")

# Roberta -> Bert: there is a 1 to 1 correspondence; for other models, you may need to create your own mapping.
for bert_key in bert_keys:
    # popitem(last=False) removes the first (oldest) weight from the OrderedDict ...
    _, weight = model_weights.popitem(last=False)
    # ... and we re-insert it at the end, under the matching Bert key
    model_weights[bert_key] = weight

# we re-export the weights
torch.save(model_weights, "roberta-in-bert/pytorch_model.bin")
del model_weights

We override the architecture name to make `transformers` believe it is `Bert`...

In [14]:
# =====> change architecture to bert base <======
import json

with open("roberta-in-bert/config.json") as f:
    content = json.load(f)
    content["architectures"] = ["bert"]

with open("roberta-in-bert/config.json", mode="w") as f:
    json.dump(content, f)

### Model training


When you create a classification model from a pretrained one, the last layers are randomly initialized.
We don't want to take these totally random values to compute the calibration of tensors.
Moreover, our trainset is a bit small, and it's easy to overfit.

Therefore, we train our `Roberta into Bert` model on 1/6 of the train set.
The goal is to slightly update the weights to the new architecture, not to get the best score.

> another approach is to fully train your model, perform calibration, and then retrain it on a small part of the data with a low learning rate (usually 1/10 of the original one).


In [15]:
transformers.logging.set_verbosity_error()
model_bert = BertForSequenceClassification.from_pretrained("roberta-in-bert", num_labels=num_labels)
model_bert = model_bert.cuda()

trainer = Trainer(
    model_bert,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
transformers.logging.set_verbosity_error()
trainer.train()
print(trainer.evaluate())
model_bert.save_pretrained("roberta-in-bert-trained")
del trainer
del model_bert

[INFO|trainer.py:437] 2021-12-08 07:41:50,834 >> Using amp half precision backend


{'loss': 0.7658, 'learning_rate': 9.1875814863103e-06, 'epoch': 0.08}
{'eval_loss': 0.5338948369026184, 'eval_accuracy': 0.7948038716250637, 'eval_runtime': 18.3625, 'eval_samples_per_second': 534.514, 'eval_steps_per_second': 8.387, 'epoch': 0.08}
{'loss': 0.5566, 'learning_rate': 8.372718383311604e-06, 'epoch': 0.16}
{'eval_loss': 0.4757803678512573, 'eval_accuracy': 0.8167091186958737, 'eval_runtime': 18.39, 'eval_samples_per_second': 533.713, 'eval_steps_per_second': 8.374, 'epoch': 0.16}
{'loss': 0.5135, 'learning_rate': 7.557855280312908e-06, 'epoch': 0.24}
{'eval_loss': 0.46861791610717773, 'eval_accuracy': 0.8164034640855833, 'eval_runtime': 18.4131, 'eval_samples_per_second': 533.044, 'eval_steps_per_second': 8.364, 'epoch': 0.24}
{'loss': 0.4868, 'learning_rate': 6.743807040417211e-06, 'epoch': 0.33}
{'eval_loss': 0.4253948926925659, 'eval_accuracy': 0.8351502801833928, 'eval_runtime': 18.4305, 'eval_samples_per_second': 532.543, 'eval_steps_per_second': 8.356, 'epoch': 0.33}

### Quantization

Below we will start the quantization process.
It follows these steps:

* perform the calibration
* perform a quantization aware training

By passing representative samples to the model (here, taken from the train set), we will calibrate it, meaning it will get the right range / scale to convert FP32 weights to INT-8 ones.

### Calibration

### Activate histogram calibration

There are several kinds of calibrators. Below we use the percentile one (99.99p), selected with `histogram`: its purpose is to remove the most extreme values before computing the range / scale.
The other option is `max`: it's much faster, but expect lower accuracy.

The second calibration option is the granularity: calibration can be done at the tensor level or per channel (more fine grained, slower).
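
To see why removing extreme values matters, here is a simplified sketch in plain Python. It is not how `pytorch-quantization` implements its calibrators (the `histogram` one works on a histogram of the values); it only compares the range obtained from the maximum with the one obtained from a percentile of the absolute values, on made-up activations containing a single outlier.

```python
import math
from typing import List


def amax_max(values: List[float]) -> float:
    """`max` style calibration: the range is set by the largest absolute value."""
    return max(abs(v) for v in values)


def amax_percentile(values: List[float], percentile: float) -> float:
    """Percentile style calibration: ignore the few largest absolute values (nearest rank)."""
    ordered = sorted(abs(v) for v in values)
    rank = math.ceil(percentile / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


# hypothetical activations: 9,999 values in [-1, 1) and a single large outlier
activations = [((i % 200) - 100) / 100 for i in range(9999)] + [50.0]

for name, amax in [("max", amax_max(activations)), ("percentile", amax_percentile(activations, 99.9))]:
    print(f"{name:>10}: amax={amax:.2f}, scale={amax / 127:.4f}")
```

With the maximum, a single outlier stretches the range, so almost all the values share only a few of the 256 available integer levels. With the percentile, the outlier is clipped and the remaining values get a much finer scale.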

In [16]:
# you can also use "max" instead of "histogram"
input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
# below we do per-channel quantization for weights, set axis to None to get a per tensor calibration
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
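
The difference between per tensor and per channel calibration can be illustrated on a tiny, made-up weight matrix. The sketch below assumes a symmetric mapping (no zero point) and one row per output channel, which is what `axis=(0,)` selects for a linear layer weight; the helper name is illustrative.

```python
from typing import List

# hypothetical weight matrix of a linear layer: one row per output channel
weights: List[List[float]] = [
    [0.02, -0.01, 0.03],  # channel with small values
    [1.50, -2.00, 0.75],  # channel with large values
]
QMAX = 127  # largest magnitude of a signed 8-bit integer in a symmetric scheme


def quantization_error(row: List[float], scale: float) -> float:
    """Largest absolute error after a quantize / dequantize round trip."""
    return max(abs(v - round(v / scale) * scale) for v in row)


# per tensor (axis=None): a single scale shared by all the rows
tensor_scale = max(abs(v) for row in weights for v in row) / QMAX
# per channel (axis=(0,)): one scale per row
channel_scales = [max(abs(v) for v in row) / QMAX for row in weights]

for row, channel_scale in zip(weights, channel_scales):
    per_tensor_error = quantization_error(row, tensor_scale)
    per_channel_error = quantization_error(row, channel_scale)
    print(f"per tensor error={per_tensor_error:.5f}, per channel error={per_channel_error:.5f}")
```

The channel with small values is the one that benefits: with a shared scale it is rounded on a grid designed for the large channel, while its own scale is much finer.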

### Perform calibration

During this step we will enable the calibration nodes, and pass some representative data to the model.
It will then be used to compute the scale/range.

The official recommendation from Nvidia is to calibrate over thousands of examples from the validation set.
Here we use 128 examples from the train set (4 batches of 32, the default `nb_sample` of `calibrate`), because it's a slow process. It's enough to be close to the original accuracy; on your use case, follow the Nvidia process.

In [19]:
# keep it on CPU
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained", num_labels=num_labels)
model_q = calibrate(model=model_q, encoded_dataset=encoded_dataset)
model_q.save_pretrained("roberta-in-bert-trained-quantized")

  0%|          | 0/4 [00:00<?, ?it/s]

### Quantization Aware Training (QAT)

Quantization aware training is not a mandatory step, but it is **highly** recommended to get the best accuracy. Basically we will redo the training with the quantization enabled and a low learning rate to avoid overfitting.

In [20]:
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained-quantized", num_labels=num_labels)
model_q = model_q.cuda()

args.learning_rate = 1e-6
trainer = Trainer(
    model_q,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
transformers.logging.set_verbosity_error()
print(trainer.evaluate())
trainer.train()
print(trainer.evaluate())
model_q.save_pretrained("roberta-in-bert-trained-quantized-bis")
del model_q
del trainer

[INFO|trainer.py:437] 2021-12-08 08:48:40,176 >> Using amp half precision backend


{'eval_loss': 0.43096092343330383, 'eval_accuracy': 0.8348446255731024, 'eval_runtime': 46.2449, 'eval_samples_per_second': 212.24, 'eval_steps_per_second': 3.33}
{'loss': 0.4542, 'learning_rate': 9.187581486310299e-07, 'epoch': 0.08}
{'eval_loss': 0.4320202171802521, 'eval_accuracy': 0.8392256749872644, 'eval_runtime': 47.5223, 'eval_samples_per_second': 206.535, 'eval_steps_per_second': 3.241, 'epoch': 0.08}
{'loss': 0.4439, 'learning_rate': 8.372718383311604e-07, 'epoch': 0.16}
{'eval_loss': 0.4244120717048645, 'eval_accuracy': 0.8415690269994905, 'eval_runtime': 46.9517, 'eval_samples_per_second': 209.045, 'eval_steps_per_second': 3.28, 'epoch': 0.16}
{'loss': 0.4323, 'learning_rate': 7.557855280312907e-07, 'epoch': 0.24}
{'eval_loss': 0.4180322289466858, 'eval_accuracy': 0.8435048395313296, 'eval_runtime': 46.8629, 'eval

### Benchmark

#### Export a `QDQ Pytorch` model to `ONNX`

We need to enable fake quantization mode from Pytorch.

In [24]:
data = encoded_dataset["train"][0:3]
input_torch = convert_tensor(data, output="torch")

model_q = QDQBertForSequenceClassification.from_pretrained(
    "roberta-in-bert-trained-quantized-bis", num_labels=num_labels
)
model_q = model_q.cuda()
from pytorch_quantization.nn import TensorQuantizer

TensorQuantizer.use_fb_fake_quant = True
convert_to_onnx(model_q, output_path="model_q.onnx", inputs_pytorch=input_torch, opset=13)
TensorQuantizer.use_fb_fake_quant = False
# del model_q

#### Convert `ONNX` graph to `TensorRT` engine

In [25]:
engine = build_engine(
    runtime=runtime,
    onnx_file_path="model_q.onnx",
    logger=trt_logger,
    min_shape=(batch_size, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=False,
    int8=True,
)

In [26]:
# same thing from command line
# !/usr/src/tensorrt/bin/trtexec --onnx=model_q.onnx --shapes=input_ids:32x256,attention_mask:32x256 --int8 --workspace=10000 --saveEngine="test.plan"

#### Prepare input and output buffer

In [27]:
stream: Stream = pycuda.driver.Stream()
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(profile_index=profile_index, stream_handle=stream.handle)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]

In [28]:
data = encoded_dataset["train"][0:batch_size]
input_np: Dict[str, np.ndarray] = convert_tensor(data, output="np")

#### Inference on `TensorRT`

We first check that inference is working correctly:

In [29]:
tensorrt_output = infer_tensorrt(
    context=context,
    host_inputs=input_np,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
    stream=stream,
)
print(tensorrt_output)

[array([[ 0.34206298,  1.5652132 , -2.3528326 ],
       [ 2.5013878 , -0.81571996, -1.6251811 ],
       [ 1.8918471 , -0.76798105, -1.0148249 ],
       [ 2.0562491 , -0.22451262, -1.8686965 ],
       [ 2.586117  , -0.09310705, -2.4128742 ],
       [ 3.1871881 , -0.38016185, -2.5407064 ],
       [-3.4681158 ,  2.25822   ,  0.37315404],
       [ 3.5095093 , -0.8846639 , -2.5989952 ],
       [-0.17400724, -1.6495969 ,  1.7838944 ],
       [-2.966234  , -1.4364657 ,  4.0166936 ],
       [ 3.275045  , -0.9761375 , -2.1260378 ],
       [-1.35331   , -0.42718923,  1.3907498 ],
       [-2.6201942 ,  2.9925148 , -1.0296444 ],
       [-2.8947299 ,  2.072019  ,  0.1730565 ],
       [ 0.10867599, -0.7385151 ,  0.35388532],
       [ 3.0392425 , -0.94136757, -1.9179116 ],
       [ 3.5692515 , -0.6002568 , -2.7545912 ],
       [-2.6759057 , -1.738315  ,  4.1253285 ],
       [-3.2203894 , -1.2297541 ,  4.019567  ],
       [-2.4096491 ,  3.5356538 , -1.7411288 ],
       [ 3.8419678 , -0.9140588 , -2.81

We warm up the GPU with a few inferences and then start the measurements:

In [30]:
for _ in range(30):
    _ = infer_tensorrt(
        context=context,
        host_inputs=input_np,
        input_binding_idxs=input_binding_idxs,
        output_binding_idxs=output_binding_idxs,
        stream=stream,
    )
time_buffer = list()
for _ in range(100):
    with track_infer_time(time_buffer):
        _ = infer_tensorrt(
            context=context,
            host_inputs=input_np,
            input_binding_idxs=input_binding_idxs,
            output_binding_idxs=output_binding_idxs,
            stream=stream,
        )

print_timings(name="TensorRT (INT-8)", timings=time_buffer)
del engine, context  # delete all tensorrt objects

[TensorRT (INT-8)] mean=15.42ms, sd=1.35ms, min=14.16ms, max=18.86ms, median=14.58ms, 95p=17.79ms, 99p=18.25ms
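
The warm up / measure pattern used above can be written as a small generic helper. This is a sketch in plain Python, not the implementation of `track_infer_time` or `print_timings`: the function names are illustrative, and the workload is a stand-in to be replaced by the real inference call. For GPU inference, it assumes that the timed function only returns once the result is available.

```python
import statistics
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], nb_warmup: int = 30, nb_measures: int = 100) -> List[float]:
    """Run `fn` a few times without timing it, then return per-call latencies in seconds."""
    for _ in range(nb_warmup):
        fn()
    latencies: List[float] = []
    for _ in range(nb_measures):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> str:
    """Format a few statistics, in milliseconds."""
    ms = sorted(1000 * t for t in latencies)
    p95 = ms[min(len(ms) - 1, int(0.95 * len(ms)))]
    return f"mean={statistics.mean(ms):.2f}ms, median={statistics.median(ms):.2f}ms, 95p={p95:.2f}ms"


# stand-in workload, to be replaced by the real inference call
timings_demo = benchmark(lambda: sum(range(10_000)))
print(summarize(timings_demo))
```

Reporting the median and a high percentile next to the mean helps to spot a few slow calls that would otherwise be hidden in the average.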


## Method 2: use a dedicated QDQ model

In method 2, the idea is to take the source code of a specific model and manually add `QDQ` nodes to it. That way, quantization will work out of the box for this architecture.
We have started with `QDQRoberta`, a quantization compliant `Roberta` model.

To adapt another architecture, one needs to:

* replace linear layers with their quantized version
* replace operations not supported out of the box by `TensorRT` by a similar code supporting the operation.

> concrete example on the `Roberta` architecture: in the HF library, there is a `cumsum` in the position embedding generation. Something very simple: it takes an integer tensor as input and outputs an integer tensor. It happens that the `cumsum` operator from TensorRT supports float but not integer (https://github.com/onnx/onnx-tensorrt/blob/master/docs/operators.md). It leads to a crash during the model conversion with a strange error message. Converting the input to a float tensor fixes the issue. Not complex, but it requires some knowledge.
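
The `cumsum` workaround can be illustrated without any deep learning library. The sketch below is a plain Python stand-in for the tensor code, not the code of the HF library or of `QDQRoberta`: it computes position ids from a padding mask with a cumulative sum, once on integers and once on floats converted back to integers, and checks that both give the same result. The token ids are made up; the padding id is assumed to be 1, as in `roberta-base`.

```python
from itertools import accumulate
from typing import List

PADDING_IDX = 1  # assumed pad token id (the one used by `roberta-base`)


def position_ids(input_ids: List[int], cast_to_float: bool) -> List[int]:
    """Position of each non padding token, computed with a cumulative sum over a mask."""
    mask = [int(token != PADDING_IDX) for token in input_ids]
    if cast_to_float:
        # workaround: run the cumulative sum on floats, then go back to integers
        cumulative = [int(v) for v in accumulate(float(m) for m in mask)]
    else:
        cumulative = list(accumulate(mask))
    return [c * m + PADDING_IDX for c, m in zip(cumulative, mask)]


token_ids = [0, 31414, 232, 2, 1, 1]  # made-up sentence followed by 2 padding tokens
assert position_ids(token_ids, cast_to_float=False) == position_ids(token_ids, cast_to_float=True)
print(position_ids(token_ids, cast_to_float=True))  # [2, 3, 4, 5, 1, 1]
```

The counts involved are small integers, which floats represent exactly, so the detour through floats does not change the result.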

The process below is a bit simpler than method 1:

<!-- * finetune the QDQ model on the task (Quantization Aware Training) -->
* calibrate
* Quantization Aware training (QAT)


### Fine tuning the model

### Calibration

In [32]:
input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
# below we do per-channel quantization for weights, set axis to None to get a per tensor calibration
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)

# keep it on CPU
model_roberta: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels
)
model_roberta.save_pretrained("roberta-untrained-quantized")
del model_roberta

model_roberta_q: PreTrainedModel = QDQRobertaForSequenceClassification.from_pretrained("roberta-untrained-quantized")
model_roberta_q = calibrate(model=model_roberta_q, encoded_dataset=encoded_dataset)
model_roberta_q.save_pretrained("roberta-untrained-quantized")
del model_roberta_q

  0%|          | 0/4 [00:00<?, ?it/s]

### Quantization Aware Training (QAT)

In [33]:
model_roberta_q: PreTrainedModel = QDQRobertaForSequenceClassification.from_pretrained(
    "roberta-untrained-quantized", num_labels=num_labels
)
model_roberta_q = model_roberta_q.cuda()

args.learning_rate = 1e-5
trainer = Trainer(
    model_roberta_q,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
transformers.logging.set_verbosity_error()
trainer.train()
print(trainer.evaluate())
model_roberta_q.save_pretrained("roberta-trained-quantized")
del model_roberta_q

[INFO|trainer.py:437] 2021-12-08 11:40:25,911 >> Using amp half precision backend


{'loss': 0.7745, 'learning_rate': 9.1875814863103e-06, 'epoch': 0.08}
{'eval_loss': 0.5123801827430725, 'eval_accuracy': 0.8002037697401936, 'eval_runtime': 46.8364, 'eval_samples_per_second': 209.559, 'eval_steps_per_second': 3.288, 'epoch': 0.08}
{'loss': 0.5453, 'learning_rate': 8.372718383311604e-06, 'epoch': 0.16}
{'eval_loss': 0.4548088014125824, 'eval_accuracy': 0.8248599083036169, 'eval_runtime': 50.0504, 'eval_samples_per_second': 196.102, 'eval_steps_per_second': 3.077, 'epoch': 0.16}
{'loss': 0.5076, 'learning_rate': 7.558670143415907e-06, 'epoch': 0.24}
{'eval_loss': 0.4582265615463257, 'eval_accuracy': 0.82190524707081, 'eval_runtime': 48.9017, 'eval_samples_per_second': 200.709, 'eval_steps_per_second': 3.149, 'epoch': 0.24}
{'loss': 0.4843, 'learning_rate': 6.743807040417211e-06, 'epoch': 0.33}
{'eval_loss': 0.41166964173316956, 'eval_accuracy': 0.8402445236882323, 'eval_runtime': 47.7718, 'eval_samples_per_second': 205.456, 'eval_steps_per_second': 3.224, 'epoch': 0.33}

### Benchmark

#### Export a `QDQ Pytorch` model to `ONNX`

We need to enable fake quantization mode from Pytorch.

In [34]:
model_roberta_q: PreTrainedModel = QDQRobertaForSequenceClassification.from_pretrained(
    "roberta-trained-quantized", num_labels=num_labels
)
model_roberta_q = model_roberta_q.cuda()

data = encoded_dataset["train"][1:3]
input_torch = convert_tensor(data, output="torch")

from pytorch_quantization.nn import TensorQuantizer

TensorQuantizer.use_fb_fake_quant = True
convert_to_onnx(model_pytorch=model_roberta_q, output_path="roberta_q.onnx", inputs_pytorch=input_torch, opset=13)
TensorQuantizer.use_fb_fake_quant = False
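
Fake quantization means that a tensor is rounded to the INT-8 grid and immediately mapped back to float, so the exported graph carries the quantization error while still using float tensors. The sketch below illustrates that round trip in plain Python for symmetric quantization. The function name `fake_quantize` and the `amax` value are illustrative assumptions; this is not the `pytorch_quantization` implementation.

```python
from typing import List


def fake_quantize(values: List[float], amax: float, num_bits: int = 8) -> List[float]:
    """Symmetric quantize-then-dequantize round trip (illustrative sketch only)."""
    bound = 2 ** (num_bits - 1) - 1  # 127 for INT-8
    scale = amax / bound  # float distance between two consecutive integer levels
    result = []
    for v in values:
        q = round(v / scale)  # quantize: snap to the nearest integer level
        q = max(-bound, min(bound, q))  # clip to the representable range
        result.append(q * scale)  # dequantize: map the integer back to float
    return result


# hypothetical input values and dynamic range (amax), chosen for illustration
fake_quantized = fake_quantize([0.0, 0.1234, 3.0], amax=1.0)
print(fake_quantized)
```

Values larger than `amax` are clipped, which is why the choice of the dynamic range matters for accuracy.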



#### Convert `ONNX` graph to `TensorRT` engine

In [35]:
engine = build_engine(
    runtime=runtime,
    onnx_file_path="roberta_q.onnx",
    logger=trt_logger,
    min_shape=(batch_size, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=False,
    int8=True,
)

In [36]:
# same conversion from the terminal
#!/usr/src/tensorrt/bin/trtexec --onnx=roberta_q.onnx --shapes=input_ids:32x256,attention_mask:32x256 --int8 --workspace=10000 --saveEngine="test.plan"

#### Prepare input and output buffer

In [37]:
stream: Stream = pycuda.driver.Stream()
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(profile_index=profile_index, stream_handle=stream.handle)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]

In [38]:
data = encoded_dataset["train"][0:batch_size]
input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")
input_np: OD[str, np.ndarray] = convert_tensor(data=data, output="np")

#### Inference on `TensorRT`

We first check that inference is working correctly:

In [39]:
tensorrt_output = infer_tensorrt(
    context=context,
    host_inputs=input_np,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
    stream=stream,
)
print(tensorrt_output)

[array([[ 0.00858257,  1.5917815 , -1.8337398 ],
       [ 2.432996  , -1.3068045 , -1.9821789 ],
       [ 1.1561737 , -0.86323494, -1.0034285 ],
       [ 1.5863879 , -0.49799222, -1.7219063 ],
       [ 1.7697937 , -0.11104879, -2.3511643 ],
       [ 3.5160832 , -1.3530374 , -3.0601408 ],
       [-3.4769394 ,  2.0265098 ,  1.874698  ],
       [ 3.3827643 , -1.2117878 , -2.8793433 ],
       [-0.17693216, -1.1394652 ,  0.9083401 ],
       [-2.8701797 , -0.7220555 ,  4.0437098 ],
       [ 3.2363806 , -1.5264729 , -2.39297   ],
       [-2.4144251 , -0.68517655,  3.2756474 ],
       [-2.5281413 ,  2.697305  , -0.10096363],
       [-2.4246836 ,  2.7231753 , -0.41800928],
       [ 0.01045033, -0.68109804,  0.3442644 ],
       [ 2.307869  , -1.3556942 , -1.7211589 ],
       [ 3.5693195 , -1.0019355 , -3.1455066 ],
       [-2.253701  , -1.5583014 ,  4.6081343 ],
       [-2.986448  , -0.8324479 ,  4.4171877 ],
       [-2.3470848 ,  3.5537364 , -1.2475395 ],
       [ 3.5942395 , -1.2296011 , -3.00
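
One way to make such a check explicit is to compare the engine logits with reference logits (for instance from the Pytorch model) within a tolerance, and to verify that the predicted classes agree. The sketch below does this in plain Python. The helper names, the tolerance and the reference values are hypothetical; only the first three candidate rows are copied from the output above.

```python
from typing import List, Sequence, Tuple


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(row)), key=lambda i: row[i])


def compare_logits(
    candidate: List[List[float]], reference: List[List[float]], atol: float
) -> Tuple[bool, float]:
    """Return (check passed, max absolute difference) for two logit matrices."""
    max_abs_diff = max(
        abs(c - r) for c_row, r_row in zip(candidate, reference) for c, r in zip(c_row, r_row)
    )
    same_classes = all(argmax(c) == argmax(r) for c, r in zip(candidate, reference))
    return (max_abs_diff <= atol and same_classes), max_abs_diff


# first three rows of the TensorRT output printed above
trt_logits = [
    [0.00858257, 1.5917815, -1.8337398],
    [2.432996, -1.3068045, -1.9821789],
    [1.1561737, -0.86323494, -1.0034285],
]
# hypothetical reference values, made up for this illustration
reference_logits = [
    [0.01, 1.60, -1.83],
    [2.43, -1.30, -1.98],
    [1.15, -0.86, -1.00],
]
check_ok, max_diff = compare_logits(trt_logits, reference_logits, atol=0.1)
print(check_ok, max_diff)
```

With a quantized engine, some difference in the logits is expected, so the tolerance should be looser than for an FP32 comparison.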

We warm up the GPU with a few inferences and then start the measurements:

In [40]:
for _ in range(30):
    _ = infer_tensorrt(
        context=context,
        host_inputs=input_np,
        input_binding_idxs=input_binding_idxs,
        output_binding_idxs=output_binding_idxs,
        stream=stream,
    )
time_buffer = list()
for _ in range(100):
    with track_infer_time(time_buffer):
        _ = infer_tensorrt(
            context=context,
            host_inputs=input_np,
            input_binding_idxs=input_binding_idxs,
            output_binding_idxs=output_binding_idxs,
            stream=stream,
        )

print_timings(name="TensorRT (INT-8)", timings=time_buffer)
del engine, context

[TensorRT (INT-8)] mean=15.77ms, sd=0.58ms, min=14.85ms, max=17.66ms, median=15.81ms, 95p=16.61ms, 99p=17.50ms
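
The timing pattern used above (a context manager that appends elapsed times to a buffer, then summary statistics in milliseconds) can be sketched with the standard library only. The names `track_time_sketch` and `summarize_timings_ms` are hypothetical stand-ins written for this illustration, not the helpers used in the cells above, and the timed workload is a dummy computation.

```python
import statistics
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def track_time_sketch(buffer: List[float]) -> Iterator[None]:
    """Append the wall-clock duration (in seconds) of the wrapped block to buffer."""
    start = time.perf_counter()
    yield
    buffer.append(time.perf_counter() - start)


def summarize_timings_ms(timings: List[float]) -> Dict[str, float]:
    """Convert seconds to milliseconds and compute a few summary statistics."""
    ms = sorted(t * 1000 for t in timings)
    return {
        "mean": statistics.mean(ms),
        "sd": statistics.pstdev(ms),
        "min": ms[0],
        "max": ms[-1],
        "median": statistics.median(ms),
    }


demo_buffer: List[float] = []
for _ in range(5):
    with track_time_sketch(demo_buffer):
        sum(range(10_000))  # dummy workload standing in for an inference call
summary = summarize_timings_ms(demo_buffer)
print(", ".join(f"{key}={value:.2f}ms" for key, value in summary.items()))
```

When timing GPU work, the timed block has to wait for the device to finish, which is why the Pytorch cells in this tutorial call `torch.cuda.synchronize()` inside the timed section.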


## Pytorch baseline

### Finetuning

In [50]:
model_roberta: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels
)
model_roberta = model_roberta.cuda()

args.learning_rate = 1e-5
trainer = Trainer(
    model_roberta,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
transformers.logging.set_verbosity_error()
trainer.train()
print(trainer.evaluate())
# {'eval_loss': 0.3559744358062744, 'eval_accuracy': 0.8655119714722364, 'eval_runtime': 19.6678, 'eval_samples_per_second': 499.04, 'eval_steps_per_second': 7.83, 'epoch': 0.98}
trainer.save_model("roberta-baseline")
del model_roberta
del trainer

[INFO|trainer.py:437] 2021-12-08 13:17:01,492 >> Using amp half precision backend


{'loss': 0.65, 'learning_rate': 9.1875814863103e-06, 'epoch': 0.08}
{'eval_loss': 0.4644513428211212, 'eval_accuracy': 0.8212939378502292, 'eval_runtime': 18.8325, 'eval_samples_per_second': 521.174, 'eval_steps_per_second': 8.177, 'epoch': 0.08}
{'loss': 0.4912, 'learning_rate': 8.372718383311604e-06, 'epoch': 0.16}
{'eval_loss': 0.4196386933326721, 'eval_accuracy': 0.8379011716760061, 'eval_runtime': 19.1574, 'eval_samples_per_second': 512.335, 'eval_steps_per_second': 8.039, 'epoch': 0.16}
{'loss': 0.4631, 'learning_rate': 7.558670143415907e-06, 'epoch': 0.24}
{'eval_loss': 0.42019498348236084, 'eval_accuracy': 0.8382068262862965, 'eval_runtime': 18.5971, 'eval_samples_per_second': 527.772, 'eval_steps_per_second': 8.281, 'epoch': 0.24}
{'loss': 0.4455, 'learning_rate': 6.743807040417211e-06, 'epoch': 0.33}
{'eval_loss': 0.3791417181491852, 'eval_accuracy': 0.8584819154355579, 'eval_runtime': 18.955, 'eval_samples_per_second': 517.804, 'eval_steps_per_second': 8.124, 'epoch': 0.33}


### GPU execution

To finish, we measure vanilla Pytorch inference in both FP32 and FP16 precision; these measurements will be our baseline:

In [51]:
baseline_model = AutoModelForSequenceClassification.from_pretrained("roberta-baseline", num_labels=num_labels)
baseline_model = baseline_model.cuda()
baseline_model = baseline_model.eval()

data = encoded_dataset["train"][0:batch_size]
input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

with torch.inference_mode():
    for _ in range(30):
        _ = baseline_model(**input_torch)
        torch.cuda.synchronize()
    time_buffer = list()
    for _ in range(100):
        with track_infer_time(time_buffer):
            _ = baseline_model(**input_torch)
            torch.cuda.synchronize()
print_timings(name="Pytorch (FP32)", timings=time_buffer)

[Pytorch (FP32)] mean=83.53ms, sd=3.69ms, min=79.18ms, max=91.07ms, median=84.09ms, 95p=89.34ms, 99p=90.44ms


In [52]:
from torch.cuda.amp import autocast

with torch.inference_mode():
    with autocast():
        for _ in range(30):
            _ = baseline_model(**input_torch)
            torch.cuda.synchronize()
        time_buffer = []
        for _ in range(100):
            with track_infer_time(time_buffer):
                _ = baseline_model(**input_torch)
                torch.cuda.synchronize()
print_timings(name="Pytorch (FP16)", timings=time_buffer)
del baseline_model

[Pytorch (FP16)] mean=58.78ms, sd=1.59ms, min=57.74ms, max=64.04ms, median=58.15ms, 95p=62.80ms, 99p=63.88ms


### CPU execution

In [53]:
baseline_model = AutoModelForSequenceClassification.from_pretrained("roberta-baseline", num_labels=num_labels)
baseline_model = baseline_model.eval()
input_torch_cpu = {k: v.to("cpu") for k, v in input_torch.items()}


with torch.inference_mode():
    for _ in range(3):
        _ = baseline_model(**input_torch_cpu)
        torch.cuda.synchronize()
    time_buffer = list()
    for _ in range(10):
        with track_infer_time(time_buffer):
            _ = baseline_model(**input_torch_cpu)
            torch.cuda.synchronize()
print_timings(name="Pytorch (FP32) - CPU", timings=time_buffer)

[Pytorch (FP32) - CPU] mean=4406.68ms, sd=290.44ms, min=3908.02ms, max=4794.74ms, median=4486.10ms, 95p=4725.07ms, 99p=4780.80ms


In [54]:
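with torch.inference_mode():
    # note: `torch.cuda.amp.autocast` targets CUDA ops; with CPU tensors the model
    # is expected to keep running in FP32, which would explain the similar latency
    with autocast():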
        for _ in range(3):
            _ = baseline_model(**input_torch_cpu)
            torch.cuda.synchronize()
        time_buffer = []
        for _ in range(10):
            with track_infer_time(time_buffer):
                _ = baseline_model(**input_torch_cpu)
                torch.cuda.synchronize()
print_timings(name="Pytorch (FP16) - CPU", timings=time_buffer)
del baseline_model

[Pytorch (FP16) - CPU] mean=4255.15ms, sd=123.93ms, min=4103.51ms, max=4527.69ms, median=4206.06ms, 95p=4469.24ms, 99p=4516.00ms


### TensorRT baseline

Below we export the fine-tuned baseline `Roberta` model (without any quantization); the only purpose is to measure the performance in mixed precision (FP16).

In [55]:
baseline_model = AutoModelForSequenceClassification.from_pretrained("roberta-baseline", num_labels=num_labels)
baseline_model = baseline_model.cuda()
convert_to_onnx(baseline_model, output_path="baseline.onnx", inputs_pytorch=input_torch, opset=12)
del baseline_model

In [56]:
engine = build_engine(
    runtime=runtime,
    onnx_file_path="baseline.onnx",
    logger=trt_logger,
    min_shape=(batch_size, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=True,
    int8=False,
)
stream: Stream = pycuda.driver.Stream()
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(profile_index=profile_index, stream_handle=stream.handle)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]
for _ in range(30):
    _ = infer_tensorrt(
        context=context,
        host_inputs=input_np,
        input_binding_idxs=input_binding_idxs,
        output_binding_idxs=output_binding_idxs,
        stream=stream,
    )
time_buffer = list()
for _ in range(100):
    with track_infer_time(time_buffer):
        _ = infer_tensorrt(
            context=context,
            host_inputs=input_np,
            input_binding_idxs=input_binding_idxs,
            output_binding_idxs=output_binding_idxs,
            stream=stream,
        )

print_timings(name="TensorRT (FP16)", timings=time_buffer)
del engine, context

[TensorRT (FP16)] mean=30.23ms, sd=0.25ms, min=29.92ms, max=31.51ms, median=30.14ms, 95p=30.74ms, 99p=30.95ms
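
To relate these measurements to each other, the speedup of each GPU setup can be computed as the ratio between the Pytorch FP32 mean latency and its own mean latency. The short sketch below uses the mean values printed in the logs of this section; since they come from these specific runs, the ratios can differ slightly from the rounded values of the summary table.

```python
# mean latencies (ms) copied from the timing logs printed in this section
mean_latency_ms = {
    "Pytorch (FP32)": 83.53,
    "Pytorch (FP16)": 58.78,
    "TensorRT (FP16)": 30.23,
    "TensorRT (INT-8)": 15.77,
}

# speedup = baseline latency / latency of the setup under test
baseline_ms = mean_latency_ms["Pytorch (FP32)"]
speedups = {name: baseline_ms / latency for name, latency in mean_latency_ms.items()}

for name, speedup in speedups.items():
    print(f"{name}: X {speedup:.1f}")
```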
