# Nvidia GPU INT-8 quantization on any transformers model (encoder based)

Quantization is one of the most effective and generic approaches to make model inference faster.
Basically, it replaces high precision float numbers in model tensors encoded in 32 or 16 bits by lower precision ones encoded in 8 bits or less:

* it takes less memory
* computation is easier / faster

It can be applied to any model in theory, and, if done well, it should maintain accuracy.

The purpose of this notebook is to show a process to perform quantization on any `transformer` architecture.

Moreover, the library used here (`transformer-deploy`) is designed to offer a simple API and still let advanced users tweak the algorithm.

**TL;DR: we benchmarked Pytorch, ONNX Runtime and Nvidia TensorRT, on both CPU and GPU, with and without quantization; our method provides the fastest inference by a large margin**.

| Framework                 | Precision | Latency (ms) | Accuracy | Speedup    | Hardware |
|:--------------------------|-----------|--------------|----------|:-----------|:--------:|
| Pytorch                   | FP32      | 4267         | 86.6 %   | X 0.02     |   CPU    |
| Pytorch                   | FP16      | 4428         | 86.6 %   | X 0.02     |   CPU    |
| Pytorch                   | INT-8     | 3300         | 85.9 %   | X 0.02     |   CPU    |
| Pytorch                   | FP32      | 77           | 86.6 %   | X 1        |   GPU    |
| Pytorch                   | FP16      | 56           | 86.6 %   | X 1.38     |   GPU    |
| ONNX Runtime              | FP32      | 76           | 86.6 %   | X 1.01     |   GPU    |
| ONNX Runtime              | FP16      | 34           | 86.6 %   | X 2.26     |   GPU    |
| ONNX Runtime              | FP32      | 4023         | 86.6 %   | X 0.02     |   CPU    |
| ONNX Runtime              | FP16      | 3957         | 86.6 %   | X 0.02     |   CPU    |
| ONNX Runtime              | INT-8     | 3336         | 86.5 %   | X 0.02     |   CPU    |
| TensorRT                  | FP16      | 30           | 86.6 %   | X 2.57     |   GPU    |
| TensorRT (**our method**) | **INT-8** | **17**       | 86.2 %   | **X 4.53** | **GPU**  |

> measurements made on a Nvidia RTX 3090 GPU and a 12-core Intel i7 CPU (with AVX-2 support)
>
> `base` architecture flavor with a batch of size 32 and a sequence length of 256; similar results were obtained for other batch sizes / sequence lengths not included in the table.
>
> accuracy obtained after a single epoch, with no learning rate search or other hyperparameter optimization
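
The `Speedup` column reads as a ratio against the Pytorch FP32 GPU row (77 ms, listed as `X 1`). A minimal sketch of that arithmetic, using latencies copied from the table; the helper name `speedup` is ours, for illustration only:

```python
# Latencies in milliseconds, copied from the table above.
BASELINE_MS = 77.0  # Pytorch FP32 on GPU, the row listed as "X 1"


def speedup(latency_ms: float, baseline_ms: float = BASELINE_MS) -> float:
    """Return how many times faster a setup is than the baseline."""
    return baseline_ms / latency_ms


rows = [
    ("Pytorch FP16 (GPU)", 56.0),
    ("TensorRT FP16 (GPU)", 30.0),
    ("TensorRT INT-8 (GPU)", 17.0),
]
for name, latency in rows:
    print(f"{name}: X {speedup(latency):.2f}")
```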


## A (very) short intro to INT-8 quantization

The basic idea behind model quantization is to replace tensors made of float numbers (usually encoded on 32 bits) by a lower precision representation (integers encoded on 8 bits for Nvidia GPUs).
As a result, computation is faster and the model memory footprint is lower. Smaller tensors are also faster to move through memory, which is a second source of acceleration.
This approach is interesting for its trade-off: inference time drops significantly, and the cost in accuracy is small (0.4 point in the table above).

Replacing float numbers by integers is done through a mapping.
This step is called `calibration`, and its purpose is to compute for each tensor or each channel of a tensor (one of its dimensions) a range covering most weights and then define a scale and a distribution center to map float numbers to 8 bits integers.
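
As an illustration of the mapping, here is a minimal pure-Python sketch of symmetric quantization (the distribution center is fixed at zero and the integer range is kept symmetric for simplicity). The function names and the `amax` value are assumptions made for the example, not the API of any library used in this notebook:

```python
from typing import List


def compute_scale(amax: float, num_bits: int = 8) -> float:
    """Scale mapping the float range [-amax, +amax] onto signed integers."""
    max_int = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    return amax / max_int


def quantize(values: List[float], scale: float, num_bits: int = 8) -> List[int]:
    """Map floats to integers: divide by the scale, round, then saturate."""
    max_int = 2 ** (num_bits - 1) - 1
    return [max(-max_int, min(max_int, round(v / scale))) for v in values]


def dequantize(q_values: List[int], scale: float) -> List[float]:
    """Map integers back to (approximate) floats."""
    return [q * scale for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0, 4.0]  # 4.0 is an outlier
scale = compute_scale(amax=1.0)  # the range would normally come from calibration
q = quantize(weights, scale)
print(q)
print(dequantize(q, scale))
```

The outlier `4.0` falls outside the chosen range, so it saturates to the largest integer and comes back as roughly `1.0` after dequantization.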

There are several ways to perform quantization, depending on how and when the `calibration` is performed:

* dynamically: the mapping is computed online, during inference. There is some overhead, but it's usually the easiest to use, as the end user has very little configuration to set,
* statically, after training (`post training quantization` or `PTQ`): this way is efficient because quantization is done offline, before inference, but it may have an accuracy cost,
* statically, with additional training (`quantization aware training` or `QAT`): like a PTQ followed by a second fine-tuning. Same efficiency but usually slightly better accuracy.
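
The difference between the dynamic and static options comes down to when the range (and so the scale) is computed. A hedged pure-Python sketch, where `calibration_batches` and all function names are hypothetical:

```python
from typing import Iterable, List


def scale_from_amax(amax: float) -> float:
    # Symmetric 8-bit mapping: [-amax, +amax] -> [-127, 127]
    return amax / 127 if amax > 0 else 1.0


def dynamic_scale(activations: List[float]) -> float:
    """Dynamic: the range is measured on each input, at inference time."""
    return scale_from_amax(max(abs(v) for v in activations))


def static_scale(calibration_batches: Iterable[List[float]]) -> float:
    """Static: the range is measured once, offline, on calibration data."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return scale_from_amax(amax)


calibration_batches = [[0.1, -0.8, 0.3], [0.5, -0.2, 1.2]]
fixed = static_scale(calibration_batches)  # computed before deployment

new_input = [0.05, -0.4, 0.2]
print("static scale :", fixed)  # reused as-is for every input
print("dynamic scale:", dynamic_scale(new_input))  # recomputed for each input
```

The static scale adds no work at inference time, at the price of fitting a given input less tightly than a scale computed on that input.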

Nvidia GPUs don't support dynamic quantization; CPUs support all types of quantization.  
Compared to `PTQ`, `QAT` better preserves accuracy and should be preferred in most cases.


During the quantization aware *training*:

* on the inside, Pytorch will train with high precision float numbers,
* on the outside, Pytorch will simulate that a quantization has already been applied and output results accordingly (for loss computation for instance)

The simulation is done by adding quantization / dequantization nodes, most often called `QDQ`, an abbreviation you will see often in the quantization world.
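
To make the QDQ idea concrete, here is a minimal sketch of what a quantize / dequantize pair computes in the forward pass. It is plain Python for illustration; the function name `fake_quantize` is ours, and this is not the implementation used by Pytorch or by Nvidia's `pytorch-quantization`:

```python
def fake_quantize(x: float, amax: float, num_bits: int = 8) -> float:
    """Quantize then immediately dequantize.

    The output is still a float, but it can only take values that sit on
    the integer grid multiplied by the scale.
    """
    max_int = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = amax / max_int
    q = max(-max_int, min(max_int, round(x / scale)))  # Q: float -> int
    return q * scale  # DQ: int -> float


amax = 2.0
for x in [0.0, 0.013, 1.0, 3.5]:
    print(x, "->", fake_quantize(x, amax))
```

Values inside the range move to the closest grid point, and values beyond `amax` saturate, so the rest of the network sees the same rounding and clipping it will face once real INT-8 kernels are used.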



> Want to learn more about quantization?
> 
> * You can check this [high quality blog post](https://leimao.github.io/article/Neural-Networks-Quantization/) for more information.
> * The process is well described in this [Nvidia presentation](https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf)

## Why this notebook?

CPU quantization is supported out of the box by `Pytorch` and `ONNX Runtime`.
**GPU quantization, on the other hand, requires specific tools and a specific process**.

In the specific case of `transformer` models, a few demos from Nvidia and Microsoft exist, but they all target the old vanilla Bert architecture.

They don't support modern architectures like `Albert`, `Roberta`, `Deberta` or `Electra` out of the box.



## Project setup

### Dependencies installation

Your machine should have Nvidia CUDA 11.X, TensorRT 8.2.1 and cuBLAS installed. The installation is said to be tricky; in my experience, if you follow the instructions from the Nvidia download page **and nothing else**, it should work out of the box. The Nvidia Docker image could be a good choice too.

In [1]:
#! pip3 install transformers datasets sklearn
#! pip3 install git+ssh://git@github.com/ELS-RD/transformer-deploy
#! pip3 install git+ssh://git@github.com/NVIDIA/TensorRT#egg=pytorch-quantization\&subdirectory=tools/pytorch-quantization/

Check the GPU is enabled and usable.

In [2]:
! nvidia-smi

Tue Dec 28 21:46:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0  On |                  N/A |
| 35%   42C    P8    40W / 350W |    263MiB / 24267MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [3]:
import logging
import os
from collections import OrderedDict
from typing import Dict, List
from typing import OrderedDict as OD
from typing import Union

import datasets
import numpy as np
import pycuda.autoinit
import tensorrt as trt
import torch
import transformers
from datasets import load_dataset, load_metric
from pycuda._driver import Stream
from tensorrt.tensorrt import IExecutionContext, Logger, Runtime
from pytorch_quantization import nn as quant_nn

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    IntervalStrategy,
    PreTrainedModel,
    PreTrainedTokenizer,
    Trainer,
    TrainingArguments,
)

from transformer_deploy.backends.ort_utils import (
    convert_to_onnx,
    convert_to_quant_onnx,
    cpu_quantization,
    create_model_for_provider,
    optimize_onnx,
)
from transformer_deploy.backends.trt_utils import build_engine, get_binding_idxs, infer_tensorrt
from transformer_deploy.benchmarks.utils import print_timings, track_infer_time
from transformer_deploy.QDQModels.calibration_utils import QATCalibrate

Set logging to `error` level to ease readability of this `notebook` on Github.

In [4]:
log_level = logging.ERROR
logging.getLogger().setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
transformers.logging.set_verbosity_error()

### Preprocess data

This part is inspired by an [official notebook from Hugging Face](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb).

There is nothing special to do. Define the task:

In [5]:
model_name = "roberta-base"
task = "mnli"
num_labels = 3
batch_size = 32
max_seq_len = 256
validation_key = "validation_matched"
timings: Dict[str, List[float]] = dict()
runtime: Runtime = trt.Runtime(trt_logger)
profile_index = 0

Preprocess data (task specific):

In [6]:
def preprocess_function(examples):
    return tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True, padding="max_length", max_length=max_seq_len
    )


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)


def convert_tensor(data: OD[str, List[List[int]]], output: str) -> OD[str, Union[np.ndarray, torch.Tensor]]:
    input: OD[str, Union[np.ndarray, torch.Tensor]] = OrderedDict()
    for k in ["input_ids", "attention_mask", "token_type_ids"]:
        if k in data:
            v = data[k]
            if output == "torch":
                value = torch.tensor(v, dtype=torch.long, device="cuda")
            elif output == "np":
                value = np.asarray(v, dtype=np.int32)
            else:
                raise Exception(f"unknown output type: {output}")
            input[k] = value
    return input


def measure_accuracy(infer, int64: bool) -> float:
    outputs = list()
    for start_index in range(0, len(encoded_dataset[validation_key]), batch_size):
        end_index = start_index + batch_size
        data = encoded_dataset[validation_key][start_index:end_index]
        inputs: OD[str, np.ndarray] = convert_tensor(data=data, output="np")
        if int64:
            for k, v in inputs.items():
                inputs[k] = v.astype(np.int64)
        output = infer(inputs)
        output = np.argmax(output[0], axis=1).astype(int).tolist()
        outputs.extend(output)
    return np.mean(np.array(outputs) == np.array(validation_labels))


def get_trainer(model: PreTrainedModel) -> Trainer:
    trainer = Trainer(
        model,
        args,
        train_dataset=encoded_dataset["train"],
        eval_dataset=encoded_dataset[validation_key],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    transformers.logging.set_verbosity_error()
    return trainer

In [7]:
tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

dataset = load_dataset("glue", task)
metric = load_metric("glue", task)

encoded_dataset = dataset.map(preprocess_function, batched=True)
validation_labels = [item["label"] for item in encoded_dataset[validation_key]]

nb_step = 1000
strategy = IntervalStrategy.STEPS
args = TrainingArguments(
    f"{model_name}-{task}",
    evaluation_strategy=strategy,
    eval_steps=nb_step,
    logging_steps=nb_step,
    save_steps=nb_step,
    save_strategy=strategy,
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size * 2,
    num_train_epochs=1,
    fp16=True,
    group_by_length=True,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to=[],
)


## (Standard) fine-tuning model

Now that our data are ready, we can download/fine tune the pretrained model.

In [9]:
model_fp16: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
trainer = get_trainer(model_fp16)
transformers.logging.set_verbosity_error()
trainer.train()
print(trainer.evaluate())

model_fp16.save_pretrained("model_trained_fp16")

[INFO|trainer.py:439] 2021-12-27 09:19:51,063 >> Using amp half precision backend


{'loss': 0.6605, 'learning_rate': 9.1875814863103e-06, 'epoch': 0.08}
{'eval_loss': 0.4653007388114929, 'eval_accuracy': 0.8183392766174223, 'eval_runtime': 18.2981, 'eval_samples_per_second': 536.393, 'eval_steps_per_second': 8.416, 'epoch': 0.08}
{'loss': 0.4956, 'learning_rate': 8.372718383311604e-06, 'epoch': 0.16}
{'eval_loss': 0.4208127558231354, 'eval_accuracy': 0.8346408558329088, 'eval_runtime': 18.3709, 'eval_samples_per_second': 534.268, 'eval_steps_per_second': 8.383, 'epoch': 0.16}
{'loss': 0.4662, 'learning_rate': 7.557855280312908e-06, 'epoch': 0.24}
{'eval_loss': 0.42171549797058105, 'eval_accuracy': 0.8358634742740703, 'eval_runtime': 18.3642, 'eval_samples_per_second': 534.464, 'eval_steps_per_second': 8.386, 'epoch': 0.24}
{'loss': 0.4458, 'learning_rate': 6.7429921773142115e-06, 'epoch': 0.33}
{'eval_loss': 0.3808833658695221, 'eval_accuracy': 0.8527763627101376, 'eval_runtime': 18.3578, 'eval_samples_per_second': 534.649, 'eval_steps_per_second': 8.389, 'epoch': 0.

## Add quantization support to any model

The idea is to take the source code of a specific model and automatically add `QDQ` nodes. QDQ nodes are placed before and after each operation we want to quantize; the information needed to map between high precision and low precision numbers is stored inside these nodes.

That way, quantization will work out of the box for the final user.

The process is based on Python AST modification: we parse the model source code in RAM, convert it to a tree, patch the tree to add the QDQ nodes, and then replace, still in RAM, the original module source code. Our library also offers the option to restore the original behavior.
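
A toy illustration of the AST approach, using only the standard `ast` module. Everything here (the `forward` function, the `fake_quant` helper, the `AddQDQ` transformer) is hypothetical and far simpler than the library's actual patching code; it only shows the parse, patch, recompile cycle on a two-line function:

```python
import ast

source = """
def forward(a, b):
    return matmul(a, b)
"""


class AddQDQ(ast.NodeTransformer):
    """Wrap every argument of a `matmul(...)` call in `fake_quant(...)`."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # handle nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "matmul":
            node.args = [
                ast.Call(func=ast.Name(id="fake_quant", ctx=ast.Load()), args=[arg], keywords=[])
                for arg in node.args
            ]
        return node


tree = ast.parse(source)  # source code -> tree
tree = ast.fix_missing_locations(AddQDQ().visit(tree))  # patch the tree
print(ast.unparse(tree))  # requires Python 3.9+


# Stand-ins so that the patched code can actually run
def matmul(a, b):
    return a * b


def fake_quant(x):
    return round(x)


namespace = {"matmul": matmul, "fake_quant": fake_quant}
exec(compile(tree, filename="<patched>", mode="exec"), namespace)  # replace in RAM
print(namespace["forward"](1.4, 2.6))
```

Because the patch is applied to the tree in memory, the original source file is never modified, which is what makes a restore option possible.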

In theory it works for any model. However, independently of quantization, some models are not fully compliant with `TensorRT` (unsupported operators, etc.).
For those models, we rewrite some parts of the source code. These patches are written manually but are applied to the model at run time (like the AST manipulation).

> a concrete example on the `Roberta` architecture: in the HF library, a `cumsum` operator is used during position embedding generation. It is very simple: it takes an integer tensor as input and outputs an integer tensor. It happens that the `cumsum` operator from TensorRT supports float but not integer (https://github.com/onnx/onnx-tensorrt/blob/master/docs/operators.md). This leads to a crash during model conversion, with a strange error message. Converting the input to a float tensor fixes the issue.

The process below is:

* Calibrate
* Quantization Aware training (QAT)

> there are many ways to get a QDQ model: you can modify the Pytorch source code (including doing it at run time like here), patch the ONNX graph (this approach is used at Microsoft for instance, but it only supports PTQ, not QAT, as an ONNX file can't be trained in Pytorch for now) or leverage the new Pytorch FX interface (it's a bit experimental and it seems to miss some features needed to support the Nvidia QAT library). Modifying the source code is the most straightforward, and doing it through the AST is the least intrusive (no need to duplicate the work of HF).

### Post Training Quantization (PTQ)
A PTQ is basically a fine-tuned model to which we add quantization nodes and that we then calibrate.

Calibration is a key step in the static quantization process. The final accuracy depends on its quality (the inference speed stays the same).  
Moreover, a good PTQ is a good basis for a good Quantization Aware Training (QAT).

By calling `with QATCalibrate(...) as qat:`, the lib will patch transformer model AST (source code) in RAM, basically adding quantization support to each model.

#### Calibration percentile grid search

One of the things we try to estimate during calibration is the range of tensor values that captures most of the information stored in the tensor. Indeed, an FP32 tensor can store very large and very small values at the same time; we obviously can't do the same with an 8-bit integer tensor and a scale. An 8-bit integer can only encode 256 distinct values, so we need to fix some limits and decide that a value outside those limits takes the closest limit instead of its real value. For instance, if our range is -1000 to +1000 and a tensor contains the value +4000, it will be replaced by the maximum value, +1000.

We use the histogram method to find a good range. We also need to choose a percentile. Usually, you will choose something very close to 100.

If the percentile is too small, too many values fall outside the covered range. Values outside the range are replaced by a single maximum value, and you lose some granularity in the model weights.

If the percentile is too big, the range is very large, and because 8-bit signed integers can only encode values between -128 and +127, you lose granularity even with a scale.

Therefore, we launch a grid search on percentile hyper parameter.
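
The trade-off can be reproduced on toy data. The sketch below is plain Python with made-up numbers; `percentile_amax` is a hypothetical helper that picks the range from sorted absolute values (a simplification of the histogram method), and the error is measured after a quantize / dequantize round trip:

```python
import math
import random
from typing import List


def percentile_amax(values: List[float], percentile: float) -> float:
    """Smallest bound covering `percentile` % of the absolute values."""
    ordered = sorted(abs(v) for v in values)
    index = min(len(ordered) - 1, math.ceil(percentile / 100 * len(ordered)) - 1)
    return ordered[max(index, 0)]


def round_trip_error(values: List[float], amax: float) -> float:
    """Mean absolute error after INT-8 quantize / dequantize with range amax."""
    scale = amax / 127
    restored = [max(-127, min(127, round(v / scale))) * scale for v in values]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


random.seed(0)
# Toy tensor: values around zero plus one huge outlier
values = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [500.0]

for percentile in [90.0, 99.9, 100.0]:
    amax = percentile_amax(values, percentile)
    error = round_trip_error(values, amax)
    print(f"percentile={percentile}: amax={amax:.3f}, mean abs error={error:.4f}")
```

On this toy tensor, a percentile of 100 lets the single outlier stretch the range so much that most values round to zero, while a low percentile clips a large share of the values; the intermediate setting gives the smallest error.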


In [19]:
for percentile in [99.9, 99.99, 99.999, 99.9999]:
    with QATCalibrate(method="histogram", percentile=percentile) as qat:
        model_q: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
            "model_trained_fp16", num_labels=num_labels
        )
        model_q = model_q.cuda()
        qat.setup_model_qat(model_q)  # prepare quantizer to any model

        with torch.no_grad():
            for start_index in range(0, 128, batch_size):
                end_index = start_index + batch_size
                data = encoded_dataset["train"][start_index:end_index]
                input_torch = {
                    k: torch.tensor(v, dtype=torch.long, device="cuda")
                    for k, v in data.items()
                    if k in ["input_ids", "attention_mask", "token_type_ids"]
                }
                model_q(**input_torch)
    trainer = get_trainer(model_q)
    print(f"percentile: {percentile}")
    print(trainer.evaluate())

[INFO|trainer.py:439] 2021-12-27 17:25:51,070 >> Using amp half precision backend


percentile: 99.9
{'eval_loss': 0.47421666979789734, 'eval_accuracy': 0.8121242995415181, 'eval_runtime': 47.9158, 'eval_samples_per_second': 204.839, 'eval_steps_per_second': 3.214}


[INFO|trainer.py:439] 2021-12-27 17:30:13,795 >> Using amp half precision backend


percentile: 99.99
{'eval_loss': 0.3841923773288727, 'eval_accuracy': 0.8487009679062659, 'eval_runtime': 46.6715, 'eval_samples_per_second': 210.3, 'eval_steps_per_second': 3.3}


[INFO|trainer.py:439] 2021-12-27 17:34:34,280 >> Using amp half precision backend


percentile: 99.999
{'eval_loss': 0.3939284086227417, 'eval_accuracy': 0.850636780438105, 'eval_runtime': 49.1138, 'eval_samples_per_second': 199.842, 'eval_steps_per_second': 3.136}


[INFO|trainer.py:439] 2021-12-27 17:38:54,289 >> Using amp half precision backend


percentile: 99.9999
{'eval_loss': 1.0285985469818115, 'eval_accuracy': 0.4956698930208864, 'eval_runtime': 48.0849, 'eval_samples_per_second': 204.118, 'eval_steps_per_second': 3.203}


As you can see, the chosen percentile value has a high impact on the final accuracy.

For the rest of the notebook, we apply the `99.999` percentile.

In [8]:
with QATCalibrate(method="histogram", percentile=99.999) as qat:
    model_q: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
        "model_trained_fp16", num_labels=num_labels
    )
    model_q = model_q.cuda()
    qat.setup_model_qat(model_q)  # prepare quantizer to any model

    with torch.no_grad():
        for start_index in range(0, 128, batch_size):
            end_index = start_index + batch_size
            data = encoded_dataset["train"][start_index:end_index]
            input_torch = {
                k: torch.tensor(v, dtype=torch.long, device="cuda")
                for k, v in data.items()
                if k in ["input_ids", "attention_mask", "token_type_ids"]
            }
            model_q(**input_torch)
trainer = get_trainer(model_q)
print(trainer.evaluate())

[INFO|trainer.py:439] 2021-12-28 13:52:09,215 >> Using amp half precision backend


{'eval_loss': 0.3939284086227417, 'eval_accuracy': 0.850636780438105, 'eval_runtime': 46.5572, 'eval_samples_per_second': 210.816, 'eval_steps_per_second': 3.308}


#### Per layer quantization analysis

Below we will run a sensitivity analysis, by enabling quantization of one layer at a time and measuring the accuracy. That way we will be able to detect if the quantization of a specific layer has a larger cost on accuracy than other layers.

In [10]:
from pytorch_quantization import nn as quant_nn

for i in range(12):
    layer_name = f"layer.{i}"
    print(layer_name)
    for name, module in model_q.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if layer_name in name:
                module.enable_quant()
            else:
                module.disable_quant()
    trainer.evaluate()
    print("----")

layer.0
{'eval_loss': 0.35163024067878723, 'eval_accuracy': 0.8663270504330107, 'eval_runtime': 20.695, 'eval_samples_per_second': 474.27, 'eval_steps_per_second': 7.441}
----
layer.1
{'eval_loss': 0.3527306318283081, 'eval_accuracy': 0.8661232806928171, 'eval_runtime': 26.1334, 'eval_samples_per_second': 375.573, 'eval_steps_per_second': 5.893}
----
layer.2
{'eval_loss': 0.3557673394680023, 'eval_accuracy': 0.8629648497198166, 'eval_runtime': 21.1364, 'eval_samples_per_second': 464.366, 'eval_steps_per_second': 7.286}
----
layer.3
{'eval_loss': 0.3551430106163025, 'eval_accuracy': 0.8649006622516556, 'eval_runtime': 20.9252, 'eval_samples_per_second': 469.051, 'eval_steps_per_second': 7.36}
----
layer.4
{'eval_loss': 0.35053929686546326, 'eval_accuracy': 0.8649006622516556, 'eval_runtime': 21.05, 'eval_samples_per_second': 466.271, 'eval_steps_per_second': 7.316}
----
layer.5
{'eval_loss': 0.35701483488082886, 'eval_accuracy': 0.865206316861946, 'eval_runtime': 20.9236, 'eval_samples_

It seems that quantization of layers 2 to 6 has the largest accuracy impact.


#### Operator quantization analysis


Below we will run a sensitivity analysis, by enabling quantization of one operator type at a time and measuring the accuracy. That way we will be able to detect if a specific operator has a larger cost on accuracy. On Roberta we only quantize `matmul` and `LayerNorm`, so we test both candidates.

In [11]:
for op in ["matmul", "layernorm"]:
    for name, module in model_q.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if op in name:
                module.enable_quant()
            else:
                module.disable_quant()
    print(op)
    trainer.evaluate()
    print("----")

matmul
{'eval_loss': 0.35049352049827576, 'eval_accuracy': 0.8658176260825268, 'eval_runtime': 26.1972, 'eval_samples_per_second': 374.659, 'eval_steps_per_second': 5.878}
----
layernorm
{'eval_loss': 0.35847699642181396, 'eval_accuracy': 0.8597045338767193, 'eval_runtime': 24.3004, 'eval_samples_per_second': 403.903, 'eval_steps_per_second': 6.337}
----


It appears that `LayerNorm` quantization has a significant accuracy cost.

Our goal is to disable quantization for as few operations as possible while preserving accuracy as much as possible. Therefore we will try to disable quantization only for `LayerNorm`, on layers 2, 3, 4 and 6 (the list used in the cell below).

In [11]:
disable_layer_names = ["layer.2", "layer.3", "layer.4", "layer.6"]

for name, module in model_q.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if any([f"{l}.output.layernorm" in name for l in disable_layer_names]):
            print(f"disable {name}")
            module.disable_quant()
        else:
            module.enable_quant()
trainer.evaluate()

disable roberta.encoder.layer.2.output.layernorm_quantizer_0
disable roberta.encoder.layer.2.output.layernorm_quantizer_1
disable roberta.encoder.layer.3.output.layernorm_quantizer_0
disable roberta.encoder.layer.3.output.layernorm_quantizer_1
disable roberta.encoder.layer.4.output.layernorm_quantizer_0
disable roberta.encoder.layer.4.output.layernorm_quantizer_1
disable roberta.encoder.layer.6.output.layernorm_quantizer_0
disable roberta.encoder.layer.6.output.layernorm_quantizer_1
{'eval_loss': 0.3660135269165039, 'eval_accuracy': 0.8618441161487519, 'eval_runtime': 45.9324, 'eval_samples_per_second': 213.684, 'eval_steps_per_second': 3.353}


{'eval_loss': 0.3660135269165039,
 'eval_accuracy': 0.8618441161487519,
 'eval_runtime': 45.9324,
 'eval_samples_per_second': 213.684,
 'eval_steps_per_second': 3.353}

By just disabling quantization for a single operator on a few layers, we keep most of the performance boost from quantization and recover more than 1 point of accuracy (from 85.1 % to 86.2 %). It's also possible to run the analysis per quantizer to get a finer granularity, but it's a bit slow to run.

If we stop here, it's called a Post Training Quantization (PTQ). Below, we will try to retrieve even more accuracy.

### Quantization Aware Training (QAT)

We retrain the model with 1/10 or 1/100 of the original learning rate (here `1e-7`, which is 1/100 of the `1e-5` used for fine-tuning). Our goal is to recover most of the original accuracy.

In [12]:
args.learning_rate = 1e-7
trainer = get_trainer(model_q)
trainer.train()
print(trainer.evaluate())
model_q.save_pretrained("model-qat")

[INFO|trainer.py:439] 2021-12-28 13:54:41,146 >> Using amp half precision backend


{'loss': 0.3591, 'learning_rate': 9.188396349413298e-08, 'epoch': 0.08}
{'eval_loss': 0.3738575875759125, 'eval_accuracy': 0.8596026490066225, 'eval_runtime': 46.992, 'eval_samples_per_second': 208.865, 'eval_steps_per_second': 3.277, 'epoch': 0.08}
{'loss': 0.3182, 'learning_rate': 8.373533246414603e-08, 'epoch': 0.16}
{'eval_loss': 0.38133203983306885, 'eval_accuracy': 0.8586856851757514, 'eval_runtime': 45.7335, 'eval_samples_per_second': 214.613, 'eval_steps_per_second': 3.367, 'epoch': 0.16}
{'loss': 0.3062, 'learning_rate': 7.558670143415906e-08, 'epoch': 0.24}
{'eval_loss': 0.3903615176677704, 'eval_accuracy': 0.8592969943963321, 'eval_runtime': 45.6544, 'eval_samples_per_second': 214.985, 'eval_steps_per_second': 3.373, 'epoch': 0.24}
{'loss': 0.2986, 'learning_rate': 6.74380704041721e-08, 'epoch': 0.33}
{'eval_loss': 0.39669597148895264, 'eval_accuracy': 0.8577687213448802, 'eval_runtime': 45.6583, 'eval_samples_per_second': 214.966, 'eval_steps_per_second': 3.373, 'epoch': 0.

#### Export a `QDQ Pytorch` model to `ONNX`

We need to enable fake quantization mode from Pytorch.

In [13]:
data = encoded_dataset["train"][1:3]
input_torch = convert_tensor(data, output="torch")
convert_to_quant_onnx(model_pytorch=model_q, output_path="model_qat.onnx", inputs_pytorch=input_torch)



In [14]:
del model_q
QATCalibrate.restore()

### Benchmark

#### Convert `ONNX` graph to `TensorRT` engine

In [15]:
engine = build_engine(
    runtime=runtime,
    onnx_file_path="model_qat.onnx",
    logger=trt_logger,
    min_shape=(1, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=True,
    int8=True,
)

In [16]:
# same as above, but from the terminal
# !/usr/src/tensorrt/bin/trtexec --onnx=model_qat.onnx --shapes=input_ids:32x256,attention_mask:32x256 --best --workspace=10000 --saveEngine="test.plan"

#### Prepare input and output buffer

In [17]:
stream: Stream = pycuda.driver.Stream()
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(profile_index=profile_index, stream_handle=stream.handle)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]

In [18]:
data = encoded_dataset["train"][0:batch_size]
input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")
input_np: OD[str, np.ndarray] = convert_tensor(data=data, output="np")

#### Inference on `TensorRT`

We first check that inference is working correctly:

In [19]:
tensorrt_output = infer_tensorrt(
    context=context,
    host_inputs=input_np,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
    stream=stream,
)
print(tensorrt_output)

[array([[ 0.11111109,  2.9936233 , -2.5243347 ],
       [ 3.2135723 , -0.4374885 , -2.4485767 ],
       [ 2.1678474 , -1.1477091 , -0.7798154 ],
       [ 1.8148003 , -0.2093072 , -1.416711  ],
       [ 2.3070638 ,  0.27601779, -2.2818418 ],
       [ 4.1799006 , -0.83163625, -2.8492923 ],
       [-3.695277  ,  2.3409832 ,  1.4314314 ],
       [ 4.1796045 , -1.0709951 , -2.6119678 ],
       [-0.44781622, -1.4288648 ,  1.888488  ],
       [-2.9845483 , -1.5895646 ,  4.117529  ],
       [ 3.9293122 , -0.68528754, -2.9477124 ],
       [-2.516609  ,  0.34680495,  2.2793124 ],
       [-3.0710464 ,  3.3439813 ,  0.08079423],
       [-2.2859852 ,  1.9546673 ,  0.37908432],
       [ 0.3999826 , -1.0603418 ,  0.5099453 ],
       [ 2.9247677 , -0.6867883 , -1.7499886 ],
       [ 4.1125493 , -0.7771612 , -2.986419  ],
       [-2.58058   , -2.3291597 ,  4.553415  ],
       [-3.215447  , -1.3902456 ,  4.2499046 ],
       [-2.014185  ,  4.117433  , -1.634403  ],
       [ 4.051285  , -0.64716065, -2.90

Measure of the accuracy:

In [20]:
infer_trt = lambda inputs: infer_tensorrt(
    context=context,
    host_inputs=inputs,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
    stream=stream,
)

measure_accuracy(infer=infer_trt, int64=False)

0.8629648497198166

Latency measures:

In [22]:
time_buffer = list()
for _ in range(100):
    with track_infer_time(time_buffer):
        _ = infer_tensorrt(
            context=context,
            host_inputs=input_np,
            input_binding_idxs=input_binding_idxs,
            output_binding_idxs=output_binding_idxs,
            stream=stream,
        )

print_timings(name="TensorRT (INT-8)", timings=time_buffer)

[TensorRT (INT-8)] mean=17.01ms, sd=1.14ms, min=16.64ms, max=27.17ms, median=16.76ms, 95p=18.16ms, 99p=20.06ms
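
For readers who want to produce similar statistics without the helper functions, here is a minimal sketch of a timing context manager and a summary formatter built on the standard library. `track_time` and `summarize` are hypothetical names; they are not the `track_infer_time` / `print_timings` implementations from `transformer_deploy`:

```python
import statistics
import time
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def track_time(buffer: List[float]) -> Iterator[None]:
    """Append the wall-clock duration (in seconds) of the `with` body to buffer."""
    start = time.perf_counter()
    try:
        yield
    finally:
        buffer.append(time.perf_counter() - start)


def summarize(name: str, timings: List[float]) -> str:
    """Format mean, standard deviation, min, max and median in milliseconds."""
    ms = [t * 1000 for t in timings]
    return (
        f"[{name}] mean={statistics.mean(ms):.2f}ms, sd={statistics.pstdev(ms):.2f}ms, "
        f"min={min(ms):.2f}ms, max={max(ms):.2f}ms, median={statistics.median(ms):.2f}ms"
    )


toy_buffer: List[float] = []
for _ in range(20):
    with track_time(toy_buffer):
        sum(i * i for i in range(10_000))  # stand-in for an inference call

print(summarize("toy workload", toy_buffer))
```

When the measured call launches asynchronous GPU work, the clock should only be stopped once that work has finished, which is why the Pytorch cells in this notebook call `torch.cuda.synchronize()` inside the timed block.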


In [23]:
del engine, context

## Pytorch baseline

Time to get some numbers to compare with.

### GPU execution

We will measure vanilla Pytorch inference in both FP32 and FP16 precision on GPU; it will be our baseline:

In [8]:
baseline_model = AutoModelForSequenceClassification.from_pretrained("model_trained_fp16", num_labels=num_labels)
baseline_model = baseline_model.cuda()
baseline_model = baseline_model.eval()

data = encoded_dataset["train"][0:batch_size]
input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

with torch.inference_mode():
    for _ in range(30):
        _ = baseline_model(**input_torch)
        torch.cuda.synchronize()
    time_buffer = list()
    for _ in range(100):
        with track_infer_time(time_buffer):
            _ = baseline_model(**input_torch)
            torch.cuda.synchronize()
print_timings(name="Pytorch (FP32)", timings=time_buffer)

[Pytorch (FP32)] mean=76.87ms, sd=1.24ms, min=75.68ms, max=82.38ms, median=76.52ms, 95p=79.29ms, 99p=81.90ms


In [9]:
with torch.inference_mode():
    with torch.cuda.amp.autocast():
        for _ in range(30):
            _ = baseline_model(**input_torch)
            torch.cuda.synchronize()
        time_buffer = []
        for _ in range(100):
            with track_infer_time(time_buffer):
                _ = baseline_model(**input_torch)
                torch.cuda.synchronize()
print_timings(name="Pytorch (FP16)", timings=time_buffer)
del baseline_model

[Pytorch (FP16)] mean=56.24ms, sd=0.67ms, min=55.53ms, max=59.61ms, median=56.05ms, 95p=57.80ms, 99p=58.18ms


### CPU execution

In [10]:
baseline_model = AutoModelForSequenceClassification.from_pretrained("model_trained_fp16", num_labels=num_labels)
baseline_model = baseline_model.eval()
data = encoded_dataset["train"][0:batch_size]
input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")
input_torch_cpu = {k: v.to("cpu") for k, v in input_torch.items()}

torch.set_num_threads(os.cpu_count())

with torch.inference_mode():
    for _ in range(3):
        _ = baseline_model(**input_torch_cpu)
        torch.cuda.synchronize()
    time_buffer = list()
    for _ in range(10):
        with track_infer_time(time_buffer):
            _ = baseline_model(**input_torch_cpu)
            torch.cuda.synchronize()
print_timings(name="Pytorch (FP32) - CPU", timings=time_buffer)

[Pytorch (FP32) - CPU] mean=4267.96ms, sd=249.08ms, min=3959.59ms, max=4697.79ms, median=4299.22ms, 95p=4632.12ms, 99p=4684.66ms


In [11]:
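with torch.inference_mode():
    # note: torch.cuda.amp.autocast() only affects CUDA ops, so with CPU tensors
    # the model most likely still runs in FP32 here
    with torch.cuda.amp.autocast():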
        for _ in range(3):
            _ = baseline_model(**input_torch_cpu)
            torch.cuda.synchronize()
        time_buffer = []
        for _ in range(10):
            with track_infer_time(time_buffer):
                _ = baseline_model(**input_torch_cpu)
                torch.cuda.synchronize()
print_timings(name="Pytorch (FP16) - CPU", timings=time_buffer)
del baseline_model

[Pytorch (FP16) - CPU] mean=4428.94ms, sd=225.39ms, min=4148.26ms, max=4871.84ms, median=4404.70ms, 95p=4781.81ms, 99p=4853.83ms


Below, we will perform dynamic quantization on CPU.

In [12]:
quantized_baseline_model = AutoModelForSequenceClassification.from_pretrained(
    "model_trained_fp16", num_labels=num_labels
)
quantized_baseline_model = quantized_baseline_model.eval()
quantized_baseline_model = torch.quantization.quantize_dynamic(
    quantized_baseline_model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    for _ in range(3):
        _ = quantized_baseline_model(**input_torch_cpu)
        torch.cuda.synchronize()
    time_buffer = list()
    for _ in range(10):
        with track_infer_time(time_buffer):
            _ = quantized_baseline_model(**input_torch_cpu)
            torch.cuda.synchronize()
print_timings(name="Pytorch (INT-8) - CPU", timings=time_buffer)

[Pytorch (INT-8) - CPU] mean=3299.66ms, sd=37.76ms, min=3274.33ms, max=3405.91ms, median=3285.20ms, 95p=3366.88ms, 99p=3398.10ms
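
To give an intuition of what INT-8 quantization does to a tensor, here is a small pure Python sketch of symmetric per-tensor quantization: a single scale maps floats to integers in [-127, 127], and dequantization multiplies back by that scale. This is a simplified illustration under our own assumptions (function names and toy weights are hypothetical), not the exact scheme applied by `torch.quantization.quantize_dynamic`.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], num_bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers with a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT-8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer, then clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Toy weights, made up for the example.
weights = [0.42, -1.27, 0.03, 0.88, -0.5]
q_weights, scale = quantize_symmetric(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q_weights)
print("scale:", scale)
print("max rounding error:", max_error)
```

The rounding error of each value is bounded by half the scale, which is why a tensor with a few large outliers (large scale) loses more precision than a tensor with a narrow range. In dynamic quantization, the scale of the activations is computed on the fly at inference time instead of being calibrated ahead of time.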


## TensorRT baseline

Below we export our fine-tuned model. The only purpose is to check the performance in mixed precision (FP16, no quantization).

In [13]:
baseline_model = AutoModelForSequenceClassification.from_pretrained("model_trained_fp16", num_labels=num_labels)
baseline_model = baseline_model.cuda()
convert_to_onnx(baseline_model, output_path="baseline.onnx", inputs_pytorch=input_torch, opset=12)
del baseline_model

In [15]:
engine = build_engine(
    runtime=runtime,
    onnx_file_path="baseline.onnx",
    logger=trt_logger,
    min_shape=(batch_size, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=True,
    int8=False,
)
input_np: OD[str, np.ndarray] = convert_tensor(data=data, output="np")
stream: Stream = pycuda.driver.Stream()
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(profile_index=profile_index, stream_handle=stream.handle)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]
for _ in range(30):
    _ = infer_tensorrt(
        context=context,
        host_inputs=input_np,
        input_binding_idxs=input_binding_idxs,
        output_binding_idxs=output_binding_idxs,
        stream=stream,
    )
time_buffer = list()
for _ in range(100):
    with track_infer_time(time_buffer):
        _ = infer_tensorrt(
            context=context,
            host_inputs=input_np,
            input_binding_idxs=input_binding_idxs,
            output_binding_idxs=output_binding_idxs,
            stream=stream,
        )

print_timings(name="TensorRT (FP16)", timings=time_buffer)
del engine, context

[TensorRT (FP16)] mean=29.90ms, sd=0.82ms, min=29.30ms, max=33.41ms, median=29.69ms, 95p=31.85ms, 99p=32.79ms


## ONNX Runtime baseline

ONNX Runtime is the go-to inference solution from Microsoft.

The recent 1.10 version of ONNX Runtime (with TensorRT support) is still a bit buggy on transformer models, which is why we use version 1.9.0 for the measures below.

As before, CPU quantization is dynamic.
The function that creates the ONNX Runtime model sets it to use all available cores and enables every possible optimization.
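
As an illustration of the session setup described above, an ONNX Runtime session using all cores and all graph optimizations might be created as follows. `create_session_sketch` is a hypothetical helper written for this example, not the function used in the cells below; it only relies on the public `onnxruntime` options `graph_optimization_level` and `intra_op_num_threads`.

```python
import os


def create_session_sketch(path: str, provider: str):
    """Hypothetical helper: build an ONNX Runtime session with all optimizations on."""
    # Imported lazily so the sketch can be loaded without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    # Enable every graph optimization level supported by ONNX Runtime.
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    if provider == "CPUExecutionProvider":
        # Use every available core for intra-operator parallelism.
        options.intra_op_num_threads = os.cpu_count() or 1
    return ort.InferenceSession(path, options, providers=[provider])
```

A call such as `create_session_sketch("baseline-optimized.onnx", "CPUExecutionProvider")` would then return a session whose `run(None, inputs)` method performs inference, assuming the ONNX file exists and the requested provider is available in the installed build.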

In [8]:
optimize_onnx(
    onnx_path="baseline.onnx",
    onnx_optim_model_path="baseline-optimized.onnx",
    fp16=True,
    use_cuda=True,
)

In [9]:
cpu_quantization(input_model_path="baseline-optimized.onnx", output_model_path="baseline-quantized.onnx")



Ignore MatMul due to non constant B: /[MatMul_113]
Ignore MatMul due to non constant B: /[MatMul_127]
Ignore MatMul due to non constant B: /[MatMul_137]
Ignore MatMul due to non constant B: /[MatMul_207]
Ignore MatMul due to non constant B: /[MatMul_221]
Ignore MatMul due to non constant B: /[MatMul_231]
Ignore MatMul due to non constant B: /[MatMul_301]
Ignore MatMul due to non constant B: /[MatMul_315]
Ignore MatMul due to non constant B: /[MatMul_325]
Ignore MatMul due to non constant B: /[MatMul_395]
Ignore MatMul due to non constant B: /[MatMul_409]
Ignore MatMul due to non constant B: /[MatMul_419]
Ignore MatMul due to non constant B: /[MatMul_489]
Ignore MatMul due to non constant B: /[MatMul_503]
Ignore MatMul due to non constant B: /[MatMul_513]
Ignore MatMul due to non constant B: /[MatMul_583]
Ignore MatMul due to non constant B: /[MatMul_597]
Ignore MatMul due to non constant B: /[MatMul_607]
Ignore MatMul due to non constant B: /[MatMul_677]
Ignore MatMul due to non consta



In [10]:
labels = [item["label"] for item in encoded_dataset[validation_key]]
data = encoded_dataset[validation_key][0:batch_size]
inputs_onnx: OD[str, np.ndarray] = convert_tensor(data=data, output="np")
for k, v in inputs_onnx.items():
    inputs_onnx[k] = v.astype(np.int64)

model = create_model_for_provider(path="baseline-optimized.onnx", provider_to_use="CUDAExecutionProvider")
output = model.run(None, inputs_onnx)

In [11]:
data = encoded_dataset["train"][0:batch_size]
inputs_onnx: OD[str, np.ndarray] = convert_tensor(data=data, output="np")
for k, v in inputs_onnx.items():
    inputs_onnx[k] = v.astype(np.int64)

for provider, model_path, benchmark_name, warmup, nb_inference in [
    ("CUDAExecutionProvider", "baseline.onnx", "ONNX Runtime GPU (FP32)", 10, 100),
    ("CUDAExecutionProvider", "baseline-optimized.onnx", "ONNX Runtime GPU (FP16)", 10, 100),
    ("CPUExecutionProvider", "baseline.onnx", "ONNX Runtime CPU (FP32)", 3, 10),
    ("CPUExecutionProvider", "baseline-optimized.onnx", "ONNX Runtime CPU (FP16)", 3, 10),
    ("CPUExecutionProvider", "baseline-quantized.onnx", "ONNX Runtime CPU (INT-8)", 3, 10),
]:
    model = create_model_for_provider(path=model_path, provider_to_use=provider)
    for _ in range(warmup):
        _ = model.run(None, inputs_onnx)
    time_buffer = []
    for _ in range(nb_inference):
        with track_infer_time(time_buffer):
            _ = model.run(None, inputs_onnx)
    print_timings(name=benchmark_name, timings=time_buffer)
    del model

[ONNX Runtime GPU (FP32)] mean=76.38ms, sd=4.99ms, min=73.10ms, max=91.05ms, median=73.91ms, 95p=88.30ms, 99p=89.42ms
[ONNX Runtime GPU (FP16)] mean=34.21ms, sd=1.68ms, min=33.23ms, max=41.80ms, median=33.70ms, 95p=38.87ms, 99p=40.63ms
[ONNX Runtime CPU (FP32)] mean=4023.32ms, sd=92.76ms, min=3895.51ms, max=4267.63ms, median=4013.27ms, 95p=4170.44ms, 99p=4248.19ms
[ONNX Runtime CPU (FP16)] mean=3956.61ms, sd=167.65ms, min=3709.88ms, max=4188.62ms, median=3914.53ms, 95p=4180.81ms, 99p=4187.06ms
[ONNX Runtime CPU (INT-8)] mean=3336.29ms, sd=168.96ms, min=3170.64ms, max=3765.07ms, median=3299.52ms, 95p=3641.01ms, 99p=3740.26ms
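
The speedup of each GPU configuration can be derived from the mean latencies printed in this notebook, taking vanilla Pytorch FP32 on GPU as the reference. The short sketch below does that arithmetic; the dictionary keys reuse the benchmark names from the outputs, and small differences with a summary table are expected if that table was computed from rounded latencies.

```python
# Mean latencies (ms) copied from the benchmark outputs of this notebook.
mean_latency_ms = {
    "Pytorch (FP32)": 76.87,
    "Pytorch (FP16)": 56.24,
    "ONNX Runtime GPU (FP32)": 76.38,
    "ONNX Runtime GPU (FP16)": 34.21,
    "TensorRT (FP16)": 29.90,
    "TensorRT (INT-8)": 17.01,
}

# Reference: vanilla Pytorch FP32 on GPU.
baseline = mean_latency_ms["Pytorch (FP32)"]

# speedup = baseline latency / measured latency (higher is better).
speedup = {name: baseline / latency for name, latency in mean_latency_ms.items()}

for name, value in sorted(speedup.items(), key=lambda item: item[1]):
    print(f"{name}: X {value:.2f}")
```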


Measure of the accuracy with the ONNX Runtime engine (CUDA provider for the FP32 and FP16 models, CPU provider for the INT-8 model):

In [12]:
model = create_model_for_provider(path="baseline.onnx", provider_to_use="CUDAExecutionProvider")
infer_ort = lambda tokens: model.run(None, tokens)
measure_accuracy(infer=infer_ort, int64=True)

0.8663270504330107

In [13]:
model = create_model_for_provider(path="baseline-optimized.onnx", provider_to_use="CUDAExecutionProvider")
infer_ort = lambda tokens: model.run(None, tokens)
measure_accuracy(infer=infer_ort, int64=True)

0.8663270504330107

In [14]:
model = create_model_for_provider(path="baseline-quantized.onnx", provider_to_use="CPUExecutionProvider")
infer_ort = lambda tokens: model.run(None, tokens)
measure_accuracy(infer=infer_ort, int64=True)

0.8650025471217524

In [15]:
del model