# A recipe to perform Nvidia GPU int-8 quantization on most transformer models

Quantization is one of the most effective and generic approaches to make model inference faster.
Basically, it replaces float numbers, generally encoded in 16 or 32 bits, with integers encoded in 8 bits or less:

* it takes less memory
* computation is easier / faster

**GPU quantization is a way to double the inference speed of your GPU**.
It can be applied to any model in theory, and unlike distillation, if done well, it should not decrease your model accuracy.

The purpose of this tutorial is to show 2 processes to perform quantization on most `transformer` architectures.

## What is int-8 quantization?

The basic idea behind int-8 quantization is that instead of doing deep learning computations with `float` numbers (usually encoded on 32 bits), you use integers (encoded on 8 bits). On a large matrix multiplication, it has 2 effects:

* it reduces the size in memory by a large margin, making **memory transfer faster** (on GPU, many operations are very fast to compute and memory transfer is the main bottleneck; such operations are called memory bound)
* it also makes **computation faster** by accelerating the slowest operations (in transformers, mainly the big matrix multiplications during the self attention computation)
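To make the memory argument concrete, here is a small back-of-the-envelope computation. It is only a sketch: the 768 x 768 shape is an assumption (the size of a linear layer in a `roberta-base` sized model), and `matrix_size_in_bytes` is a hypothetical helper, not part of any library.

```python
def matrix_size_in_bytes(rows: int, cols: int, bits_per_value: int) -> int:
    """Storage needed for a dense matrix, ignoring any metadata (scales, headers...)."""
    return rows * cols * bits_per_value // 8


# assumed shape: one 768 x 768 linear layer
fp32_size = matrix_size_in_bytes(768, 768, bits_per_value=32)
fp16_size = matrix_size_in_bytes(768, 768, bits_per_value=16)
int8_size = matrix_size_in_bytes(768, 768, bits_per_value=8)

print(f"FP32: {fp32_size} bytes, FP16: {fp16_size} bytes, int-8: {int8_size} bytes")
print(f"int-8 is {fp32_size // int8_size}x smaller than FP32")
```

The same ratio applies to the amount of data moved between GPU memory and compute units, which is why memory bound operations benefit from it.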

An 8-bit integer can encode values from -128 to +127, with no decimal part (as it's an integer).
So an 8-bit integer can't encode values like `1280.872654`.

However, we can use our integer if it's associated with a scale (a FP32 scale). For instance, for a scale of 20, I can set my integer to 64 (64*20=1280); it's not exactly `1280.872654`, but it's close enough.
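The scale-of-20 example can be reproduced in a few lines. This is a minimal sketch assuming the convention "real value ≈ integer × scale"; `quantize` and `dequantize` are hypothetical helper names, not functions from `pytorch-quantization` or `TensorRT`.

```python
INT8_MIN, INT8_MAX = -128, 127


def quantize(value: float, scale: float) -> int:
    """Map a real value to the closest int-8 value for the given scale."""
    q = round(value / scale)
    # values outside the representable range are clipped
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q: int, scale: float) -> float:
    """Recover an approximation of the original real value."""
    return q * scale


q = quantize(1280.872654, scale=20.0)
approx = dequantize(q, scale=20.0)
error = abs(1280.872654 - approx)
print(f"int-8 value: {q}, dequantized: {approx}, error: {error:.6f}")
```

The error is the information lost by the conversion; a badly chosen scale makes it larger, either by rounding too coarsely or by clipping large values.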

That's why we need to perform a step called `calibration` during which the range of values and the scale (encoded as a FP32 float) will be computed.

Basically, we know that by converting a FP32 to an int-8 and its scale, we will lose some information, and the goal of the calibration is to minimize this loss.

If the values in a matrix go from -1.5 to +2, they may be encoded as integers taking values from -127 to +127, associated with a scale of 2/127 ≈ 0.0157 (127 * 0.0157 ≈ 2).
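As an illustration, the simplest calibration strategy (keep the largest absolute value) could look like the sketch below. It assumes the same convention as the scale-of-20 example (real value ≈ integer × scale) and a symmetric range of -127 to +127; `calibrate_max` and `quantize` are hypothetical helpers, and the sample values are made up.

```python
from typing import List


def calibrate_max(values: List[float], num_bits: int = 8) -> float:
    """Compute a scale such that the largest absolute value maps to the int bound."""
    bound = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    amax = max(abs(v) for v in values)
    return amax / bound


def quantize(value: float, scale: float, bound: int = 127) -> int:
    """Round to the closest integer and clip to the symmetric range."""
    return max(-bound, min(bound, round(value / scale)))


values = [-1.5, -0.3, 0.0, 0.7, 2.0]  # made-up tensor content
scale = calibrate_max(values)
quantized = [quantize(v, scale) for v in values]
print(f"scale: {scale:.4f}, quantized values: {quantized}")
```

With this convention, the value +2 lands on +127 and -1.5 lands around -95, so part of the negative range stays unused.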


[A good documentation on quantization](https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf)


## Why a dedicated tutorial?

CPU quantization is supported out of the box by `Pytorch` or ONNX Runtime.
GPU quantization, on the other hand, requires specific tools and processes.

In the specific case of `transformer` models, right now (December 2021), the only way shown by Nvidia is to manually build the graph of your model in `TensorRT`. This is a low level approach, based on GPU capacity knowledge (which operators are supported, etc.). It's certainly out of reach of most NLP practitioners.

Fortunately, Nvidia recently added to the Hugging Face `transformers` library a new model called `QDQBert`.
Basically, it's a vanilla `Bert` architecture which supports int-8 quantization.
It doesn't support any other architecture out of the box, like `Albert`, `Roberta`, or `Electra`.
The Nvidia demo is dedicated to the SQuAD task; it's cool, but it makes the code a lot less clear than needed.

To be both simple and cover most use cases, in this tutorial we will see:

* how to perform GPU quantization on **any** transformer model (not just Bert) using a simple trick
* how to apply quantization to a common task like classification (which is easier to understand than question answering)
* how to measure the performance gain (latency)

## ToC

### [Dependencies](#Dependencies-installation)

## Project setup

### Dependencies installation

We install the `master` branch of the `transformers` library to use a new model, **QDQBert**, and `transformer-deploy` to leverage `TensorRT` models (the TensorRT API is not something simple to master, it's highly advised to use a wrapper).

In [1]:
#! pip install git+https://github.com/huggingface/transformers
#! pip install git+https://github.com/ELS-RD/transformer-deploy
#! pip install scikit-learn datasets -U
#! pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com

Check the GPU is enabled and usable.

In [2]:
! nvidia-smi

Mon Dec  6 17:39:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 70%   55C    P8    47W / 350W |    304MiB / 24267MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [3]:
import numpy as np
from tqdm.notebook import tqdm

from typing import Dict, OrderedDict, List
import torch
from torch import Tensor
from transformers import (
    AutoModelForSequenceClassification,
    PreTrainedModel,
    QDQBertForSequenceClassification,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    IntervalStrategy,
)
from transformer_deploy.QDQModels.QDQRoberta import QDQRobertaForSequenceClassification
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization import calib
import logging
import transformers
import datasets
from transformer_deploy.backends.trt_utils import build_engine, get_binding_idxs, infer_tensorrt

In [4]:
from pycuda._driver import Stream
import tensorrt as trt
from tensorrt.tensorrt import IExecutionContext, Logger, Runtime
import pycuda.autoinit


Set logging to `error` to make the `notebook` more readable on Github.

In [5]:
log_level = logging.ERROR
logging.getLogger().setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
transformers.logging.set_verbosity_error()

### Download data

This part is inspired by an [official notebook from Hugging Face](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb).

In [6]:
task = "mnli"
num_labels = 3
model_checkpoint = "roberta-base"
batch_size = 32
max_seq_len = 256
validation_key = "validation_matched"

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.

In [7]:
from datasets import load_dataset, load_metric

dataset = load_dataset("glue", task)
metric = load_metric('glue', task)
dataset

  0%|          | 0/5 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

### Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the arguments `truncation=True`, `padding="max_length"` and `max_length=max_seq_len`. This ensures that an input longer than `max_seq_len` is truncated, and that a shorter one is padded, so every sample ends up with the same length.
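To picture what truncation and padding do to a single sample, here is a toy sketch working on plain lists of token ids. It is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the pad id of 1 is assumed, and a real tokenizer additionally handles sentence pairs and keeps special tokens when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int) -> Dict[str, List[int]]:
    """Force a sequence to exactly max_length tokens and build the matching mask."""
    kept = token_ids[:max_length]  # truncation
    nb_pad = max_length - len(kept)  # padding needed (0 if the input was long enough)
    return {
        "input_ids": kept + [pad_id] * nb_pad,
        # 1 for real tokens, 0 for padding tokens
        "attention_mask": [1] * len(kept) + [0] * nb_pad,
    }


short = pad_or_truncate([0, 42, 2], max_length=5, pad_id=1)
long = pad_or_truncate(list(range(10)), max_length=4, pad_id=1)
print(short)
print(long)
```

Fixed-length inputs make it possible to build the `TensorRT` engine for a single input shape later on.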

In [9]:
def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding="max_length", max_length=max_seq_len)

In [10]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

## Fine-tuning model

Now that our data are ready, we can download the pretrained model and fine-tune it.

We will also prepare some export, metric and calibration functions right now.

In [11]:
def convert_to_onnx(model_pytorch: PreTrainedModel, output_path: str, inputs_pytorch: Dict[str, torch.Tensor]) -> None:
    with torch.no_grad():
        torch.onnx.export(
            model_pytorch,  # model to optimize
            args=(inputs_pytorch["input_ids"], inputs_pytorch["attention_mask"]),  # tuple of multiple inputs , inputs_pytorch["token_type_ids"]
            f=output_path,  # output path / file object
            opset_version=13,  # the ONNX version to use, 13 is the first to support QDQ nodes
            do_constant_folding=True,  # simplify model (replace constant expressions)
            input_names=["input_ids", "attention_mask"],  # input names "token_type_ids"
            output_names=["model_output"],  # output name
            dynamic_axes={  # declare dynamic axes for each input / output (dynamic axis == variable length axis)
                "input_ids": {0: "batch_size", 1: "sequence"},
                "attention_mask": {0: "batch_size", 1: "sequence"},
                #"token_type_ids": {0: "batch_size", 1: "sequence"},
                "model_output": {0: "batch_size"},
            },
            verbose=False,
        )


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)


def calibrate(model: PreTrainedModel, encoded_dataset, nb_sample: int=128) -> None:
    # Find the TensorQuantizer and enable calibration
    for name, module in tqdm(model.named_modules()):
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    with torch.no_grad():
        for start_index in tqdm(range(0, nb_sample, batch_size)):
            end_index = start_index + batch_size
            data = encoded_dataset["train"][start_index:end_index]
            input_torch = {k: torch.tensor(list(v), dtype=torch.long, device="cpu")
                           for k, v in data.items() if k in ["input_ids", "attention_mask", "token_type_ids"]}
            model(**input_torch)


    # Finalize calibration
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax("percentile", percentile=99.99)
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

    model.cuda()


In [12]:
runtime: Runtime = trt.Runtime(trt_logger)
profile_index = 0

Default parameters to be used for the training:

In [13]:
nb_step = 1000
strategy = IntervalStrategy.STEPS
args = TrainingArguments(
    f"{model_checkpoint}-finetuned-{task}",
    evaluation_strategy = strategy,
    eval_steps=nb_step,
    logging_steps=nb_step,
    save_steps=nb_step,
    save_strategy = strategy,
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size*2,
    num_train_epochs=1,
    fp16=True,
    group_by_length=True,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to=[],
)

## Method 1: `Transplantation` of weights from a source model to an optimized architecture

The idea of transplantation is to export weights from one model and use them in another one.
In our case, the source is the `Roberta` weights and the target is the `Bert` architecture, which is highly optimized on `TensorRT` for GPU quantization.

Indeed, not all models are quantization compliant. The optimization engine (`TensorRT`) searches for some patterns and will fail to optimize the model if it doesn't find them. It requires the Pytorch code to be written in a certain way and to use certain operations. For that reason, it's a good idea to reuse a highly optimized architecture.

We will leverage the fact that since `Bert` has been released, very few improvements have been brought to the transformer architecture (at least for encoder only models).
Better models appeared, and most of the work has been done to improve the pretraining step (aka the weights).
So the idea will be to take the weights from those new models and put them inside `Bert` architecture.

The process described below should work for most users.

**steps**:

* load `Bert` model
* retrieve its layer/weight names
* load the source model (here `Roberta`)
* replace its weight/layer names with those from `Bert`
* override the architecture name in the model configuration

If there is no 1 to 1 correspondence (it happens), try to keep at least the embeddings and self attention weights. Of course, if a model is very different, the transplant may cost some accuracy. In our experience, if your train set is big enough, it should not happen.
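The renaming step can be sketched on plain dictionaries, without loading any real model. The key names and values below are toy placeholders (real checkpoints have many more entries and store tensors), and `transplant` is a hypothetical helper. Unlike a blind rename, it first checks that both sides have the same number of entries.

```python
from collections import OrderedDict
from typing import Any, List


def transplant(source_weights: "OrderedDict[str, Any]", target_keys: List[str]) -> "OrderedDict[str, Any]":
    """Give each source weight, in order, the name used by the target architecture."""
    if len(source_weights) != len(target_keys):
        raise ValueError(
            f"no 1 to 1 correspondence: {len(source_weights)} source weights "
            f"vs {len(target_keys)} target names"
        )
    return OrderedDict(zip(target_keys, source_weights.values()))


# toy placeholders standing for a state dict
source = OrderedDict(
    [
        ("roberta.embeddings.weight", [0.1, 0.2]),
        ("roberta.encoder.layer.0.weight", [0.3]),
        ("classifier.weight", [0.4]),
    ]
)
target_keys = ["bert.embeddings.weight", "bert.encoder.layer.0.weight", "classifier.weight"]

renamed = transplant(source, target_keys)
print(list(renamed.keys()))
```

This only works when both architectures list their weights in the same order; otherwise an explicit name mapping is needed.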


In [14]:
model_bert: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
bert_keys = list(model_bert.state_dict().keys())
del model_bert

model_roberta: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
model_roberta.save_pretrained("roberta-in-bert")
del model_roberta
model_weights: OrderedDict[str, Tensor] = torch.load("roberta-in-bert/pytorch_model.bin")

# Roberta -> Bert, there is a 1 to 1 correspondence; for other models, you may need to create your own mapping.
for bert_key in bert_keys:
    # popitem(last=False) removes the first weights from the ordered dict ...
    _, weight = model_weights.popitem(last=False)
    # ... and we re-insert them, in order, with a new key
    model_weights[bert_key] = weight

# we re-export the weights
torch.save(model_weights, "roberta-in-bert/pytorch_model.bin")
del model_weights

We override the architecture name to make `transformers` believe it is `Bert`...

In [15]:
# =====> change architecture to bert base <======
import json

with open("roberta-in-bert/config.json") as f:
    content = json.load(f)
    content['architectures'] = ["bert"]

with open("roberta-in-bert/config.json", mode="w") as f:
    json.dump(content, f)

## Model training


When you create a classification model from a pretrained one, the last layers are randomly initialized.
We don't want to take these totally random values to compute the calibration of tensors.
Moreover, our trainset is a bit small, and it's easy to overfit.

Therefore, we train our `Roberta into Bert` model on 1/6 of the train set.
The goal is to slightly update the weights to the new architecture, not to get the best score.

> another approach is to fully train your model, perform calibration, and then retrain it on a small part of the data with a low learning rate (usually 1/10 of the original one).


In [16]:
transformers.logging.set_verbosity_error()
model_bert = BertForSequenceClassification.from_pretrained("roberta-in-bert", num_labels=num_labels)
model_bert = model_bert.cuda()

trainer = Trainer(
    model_bert,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
transformers.logging.set_verbosity_error()
trainer.train()
print(trainer.evaluate())
model_bert.save_pretrained("roberta-in-bert-trained")
del trainer
del model_bert

[INFO|trainer.py:437] 2021-12-06 17:39:49,638 >> Using amp half precision backend


{'loss': 0.7303, 'learning_rate': 9.1875814863103e-06, 'epoch': 0.08}
{'eval_loss': 0.5143516659736633, 'eval_accuracy': 0.8018339276617422, 'eval_runtime': 18.9153, 'eval_samples_per_second': 518.892, 'eval_steps_per_second': 8.142, 'epoch': 0.08}
{'loss': 0.5419, 'learning_rate': 8.373533246414604e-06, 'epoch': 0.16}
{'eval_loss': 0.4696938693523407, 'eval_accuracy': 0.8183392766174223, 'eval_runtime': 19.0652, 'eval_samples_per_second': 514.813, 'eval_steps_per_second': 8.078, 'epoch': 0.16}
{'loss': 0.5056, 'learning_rate': 7.558670143415907e-06, 'epoch': 0.24}
{'eval_loss': 0.4684630036354065, 'eval_accuracy': 0.819969434538971, 'eval_runtime': 18.5425, 'eval_samples_per_second': 529.326, 'eval_steps_per_second': 8.305, 'epoch': 0.24}
{'loss': 0.4806, 'learning_rate': 6.744621903520209e-06, 'epoch': 0.33}
{'eval_loss': 0.42402705550193787, 'eval_accuracy': 0.8364747834946511, 'eval_runtime': 18.5925, 'eval_samples_per_second': 527.901, 'eval_steps_per_second': 8.283, 'epoch': 0.33

### Quantization

Below we will start the quantization process.
It follows these steps:

* perform the calibration
* perform a quantization aware training

By passing representative samples (here taken from the train set) to the model, we will calibrate it, meaning it will get the right range / scale to convert FP32 values to int-8 ones.

## Calibration

### Activate histogram calibration

There are several kinds of calibrators. Below we use the percentile one (99.99th percentile, `histogram` method); basically, its purpose is to remove the most extreme values before computing the range / scale.
The other option is `max`; it's much faster, but expect lower accuracy.

The second calibration option is to choose between calibration done at the tensor level or per channel (more fine-grained, slower).
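To see why removing extreme values matters, the sketch below compares the range found by a max strategy and by a percentile strategy on made-up data containing one outlier. It is a simplification: the nearest-rank percentile used here is an assumption chosen for readability, not the way `pytorch-quantization` computes it (that library works on histograms), and both helpers are hypothetical.

```python
import math
from typing import List


def amax_max(values: List[float]) -> float:
    """Range used by a max strategy: the largest absolute value seen."""
    return max(abs(v) for v in values)


def amax_percentile(values: List[float], percentile: float) -> float:
    """Range used by a percentile strategy (nearest-rank method on absolute values)."""
    ordered = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


# made-up activations: 2001 values in [-1, 1] plus a single outlier
values = [i / 1000 for i in range(-1000, 1001)] + [50.0]

range_max = amax_max(values)
range_percentile = amax_percentile(values, percentile=99.9)
print(f"max: {range_max}, percentile: {range_percentile}")
print(f"int-8 step with max: {range_max / 127:.4f}, with percentile: {range_percentile / 127:.4f}")
```

With the max strategy, a single outlier stretches the range and almost all values share a handful of integer levels; the percentile strategy clips the outlier and keeps a fine resolution for the bulk of the values.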

In [17]:
# you can also use "max" instead of "histogram"
input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
# below we do per-channel quantization for weights, set axis to None to get a per tensor calibration
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
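The difference between per-tensor and per-channel ranges can be shown on a tiny weight matrix. This is a sketch with made-up numbers and hypothetical helpers; it assumes that each row is one output channel, which is how a Pytorch linear layer stores its weights (shape `(out_features, in_features)`), hence `axis=(0,)`.

```python
from typing import List


def amax_per_tensor(weights: List[List[float]]) -> float:
    """A single range shared by the whole matrix."""
    return max(abs(w) for row in weights for w in row)


def amax_per_channel(weights: List[List[float]]) -> List[float]:
    """One range per row, i.e. per output channel (axis 0)."""
    return [max(abs(w) for w in row) for row in weights]


# made-up weights: the first channel has much smaller values than the second one
weights = [
    [0.1, -0.2, 0.05],
    [1.5, -3.0, 0.7],
]

tensor_range = amax_per_tensor(weights)
channel_ranges = amax_per_channel(weights)
print(f"per tensor: {tensor_range}, per channel: {channel_ranges}")
```

With a single range of 3.0, the small values of the first channel are rounded coarsely; with one range per channel, each row uses the full int-8 resolution.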

### Perform calibration

During this step we will enable the calibration nodes, and pass some representative data to the model.
It will then be used to compute the scale/range.

The official recommendation from Nvidia is to calibrate over thousands of examples from the validation set.
Here we use 4*32 = 128 examples from the train set, because it's a slow process. It's enough to be close to the original accuracy; for your use case, follow the Nvidia process.

In [20]:
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained", num_labels=num_labels)

calibrate(model=model_q, encoded_dataset=encoded_dataset)

# uncomment to list the quantizers found in the model
# count = 0
# for name, mod in model_q.named_modules():
#     if isinstance(mod, quant_nn.TensorQuantizer):
#         print(f"{name:80} {mod}")
#         count += 1
# print(f"{count} TensorQuantizers found in model")
# save the calibrated model, it is reloaded in the QAT step
model_q.save_pretrained("roberta-in-bert-trained-quantized")

0it [00:00, ?it/s]

  0%|          | 0/4 [00:00<?, ?it/s]

QDQBertForSequenceClassification(
  (bert): QDQBertModel(
    (embeddings): QDQBertEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): QDQBertEncoder(
      (layer): ModuleList(
        (0): QDQBertLayer(
          (attention): QDQBertAttention(
            (self): QDQBertSelfAttention(
              (query): QuantLinear(
                in_features=768, out_features=768, bias=True
                (_input_quantizer): TensorQuantizer(8bit fake per-tensor amax=4.3825 calibrator=HistogramCalibrator scale=1.0 quant)
                (_weight_quantizer): TensorQuantizer(8bit fake axis=(0,) amax=[0.2278, 0.7138](768) calibrator=MaxCalibrator scale=1.0 quant)
              )
              (key): QuantLinear(
                in_features=7

### Quantization Aware Training (QAT)

Quantization aware training is not a mandatory step, but it is **highly** recommended to get the best accuracy. Basically, we will redo the training with quantization enabled and a low learning rate.
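During this training, values typically stay in floating point but are rounded as if they had been converted to int-8 and back, an operation often called fake quantization. The sketch below shows the idea on a single number; `fake_quantize` is a hypothetical helper and the `amax` of 2.0 is a made-up calibration result.

```python
def fake_quantize(value: float, amax: float, num_bits: int = 8) -> float:
    """Quantize then immediately dequantize: the result is a float carrying the int-8 error."""
    bound = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = amax / bound
    q = max(-bound, min(bound, round(value / scale)))  # round and clip
    return q * scale


amax = 2.0  # made-up range, as if found during calibration
for value in (0.5, -1.234, 3.0):
    print(f"{value} -> {fake_quantize(value, amax):.6f}")
```

Values inside the range move by at most half a quantization step, and values outside it are clipped to the range. Training with these perturbed values lets the weights adapt to the quantization error.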

In [21]:
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained-quantized", num_labels=num_labels)
model_q = model_q.cuda()

args.learning_rate /= 10
trainer = Trainer(
    model_q,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
transformers.logging.set_verbosity_error()
print(trainer.evaluate())
trainer.train()
print(trainer.evaluate())
model_q.save_pretrained("roberta-in-bert-trained-quantized-bis")
del model_q
del trainer

[INFO|trainer.py:437] 2021-12-06 18:34:07,721 >> Using amp half precision backend


{'eval_loss': 0.4492516815662384, 'eval_accuracy': 0.8271013754457464, 'eval_runtime': 46.2281, 'eval_samples_per_second': 212.317, 'eval_steps_per_second': 3.331}
{'eval_loss': 0.4492516815662384, 'eval_accuracy': 0.8271013754457464, 'eval_runtime': 46.2281, 'eval_samples_per_second': 212.317, 'eval_steps_per_second': 3.331}
{'loss': 0.4752, 'learning_rate': 9.188396349413299e-07, 'epoch': 0.08}
{'eval_loss': 0.4362102150917053, 'eval_accuracy': 0.8346408558329088, 'eval_runtime': 46.4717, 'eval_samples_per_second': 211.204, 'eval_steps_per_second': 3.314, 'epoch': 0.08}
{'loss': 0.4643, 'learning_rate': 8.373533246414604e-07, 'epoch': 0.16}
{'eval_loss': 0.42539361119270325, 'eval_accuracy': 0.8370860927152318, 'eval_runtime': 46.5627, 'eval_samples_per_second': 210.791, 'eval_steps_per_second': 3.307, 'epoch': 0.16}
{'loss': 0.4509, 'learning_rate': 7.558670143415907e-07, 'epoch': 0.24}
{'eval_loss': 0.42584264278411865, 'eval_accuracy': 0.8367804381049414, 'eval_runtime': 46.5106, 

### Benchmark

#### Export a `QDQ Pytorch` model to `ONNX`

To perform the export, we need to enable the fake quantization mode from Pytorch.

In [22]:
data = encoded_dataset["train"][0: 3]
input_torch = {k: torch.tensor(v, dtype=torch.long, device="cuda") for k, v in data.items() if k in ["input_ids", "attention_mask", "token_type_ids"]}

from pytorch_quantization.nn import TensorQuantizer
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained-quantized-bis", num_labels=num_labels)
model_q = model_q.cuda()
TensorQuantizer.use_fb_fake_quant = True
convert_to_onnx(model_q, output_path="model_q.onnx", inputs_pytorch=input_torch)
TensorQuantizer.use_fb_fake_quant = False
del model_q



#### Convert `ONNX` graph to `TensorRT` engine

In [23]:
engine = build_engine(
    runtime=runtime,
    onnx_file_path="model_q.onnx",
    logger=trt_logger,
    min_shape=(batch_size, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=False,
    int8=True,
)

#### Prepare input and output buffer

In [24]:
profile_index = 0
np_input = {"input_ids": np.random.randint(1, 10000, size=(batch_size, max_seq_len), dtype=np.int64),
 "attention_mask": np.ones(shape=(batch_size, max_seq_len), dtype=np.int64),
            }

stream: Stream = pycuda.driver.Stream()

context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(profile_index=profile_index, stream_handle=stream.handle)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]

#### Inference on `TensorRT`

In [25]:
tensorrt_output = infer_tensorrt(
    context=context,
    host_inputs=np_input,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
    stream=stream,
)
print(tensorrt_output)

[array([[ 0.1358001 , -1.4377486 ,  1.3672757 ],
       [-0.16206698, -1.149481  ,  1.4266016 ],
       [ 0.0163878 , -1.0470941 ,  1.2498031 ],
       [-0.21079333, -0.91275144,  1.2614312 ],
       [ 0.13416213, -1.2132894 ,  1.0915226 ],
       [-0.23387383, -0.6663823 ,  1.0708152 ],
       [-0.4426742 , -0.64095986,  0.6767337 ],
       [-0.39520252, -0.6310587 ,  1.162437  ],
       [-0.11956491, -0.9094458 ,  1.2330313 ],
       [-0.34652767, -0.56745625,  1.1321819 ],
       [-0.3788384 , -0.9477967 ,  1.3850961 ],
       [-1.079162  ,  0.04613969,  0.9176692 ],
       [-0.12555303, -0.8791798 ,  1.2635291 ],
       [-0.12463601, -0.63906515,  0.95351076],
       [ 0.31858096, -0.410717  ,  0.69519377],
       [ 0.07587517, -0.58817637,  0.82071406],
       [ 0.1137608 , -0.8322618 ,  0.6675602 ],
       [-0.50839895, -0.8443974 ,  1.462322  ],
       [-0.14658742, -1.1222454 ,  1.3913041 ],
       [ 0.05990895, -1.4671483 ,  1.5297441 ],
       [ 0.17553274, -0.26642302,  0.67

#### Conversion with `trtexec` (command line approach)

In [26]:
#!/usr/src/tensorrt/bin/trtexec --onnx=model_q.onnx --shapes=input_ids:32x256,attention_mask:32x256 --int8 --workspace=6000  --saveEngine="test.plan"

## Method 2: use a dedicated QDQ model

In method 2, the idea is to take the source code of a specific model and manually add `QDQ` nodes to it. That way, quantization will work out of the box. Even if `Bert` has many variations, it seems that very few of them are really used. The Hugging Face `transformers` library includes a `QDQ` version of the `Bert` model.
Our library offers a dedicated implementation of `Roberta`.

To adapt another architecture, you need to:

* replace linear layers with their quantized version
* replace operations not supported out of the box by TensorRT with similar code that is supported.

> it's not a complex process, but it requires some knowledge of `ONNX` supported operations and `TensorRT` framework

The process below is a bit simpler than method 1:

* fine-tune the model on the task
* calibrate
* Quantization Aware Training (QAT)

> you may skip step 1 if you want

### Fine tuning the model

In [29]:
model_roberta: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
model_roberta = model_roberta.cuda()

args.learning_rate = 1e-5
trainer = Trainer(
    model_roberta,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
transformers.logging.set_verbosity_error()
trainer.train()
print(trainer.evaluate())
# {'eval_loss': 0.3559744358062744, 'eval_accuracy': 0.8655119714722364, 'eval_runtime': 19.6678, 'eval_samples_per_second': 499.04, 'eval_steps_per_second': 7.83, 'epoch': 0.98}
trainer.save_model("roberta-model")
del model_roberta
del trainer

[INFO|trainer.py:437] 2021-12-06 20:38:02,464 >> Using amp half precision backend


{'loss': 0.6886, 'learning_rate': 9.188396349413299e-06, 'epoch': 0.08}
{'eval_loss': 0.4678966999053955, 'eval_accuracy': 0.8171166581762608, 'eval_runtime': 18.7354, 'eval_samples_per_second': 523.874, 'eval_steps_per_second': 8.22, 'epoch': 0.08}
{'loss': 0.5021, 'learning_rate': 8.373533246414604e-06, 'epoch': 0.16}
{'eval_loss': 0.4271945059299469, 'eval_accuracy': 0.8333163525216505, 'eval_runtime': 18.5466, 'eval_samples_per_second': 529.209, 'eval_steps_per_second': 8.303, 'epoch': 0.16}
{'loss': 0.4682, 'learning_rate': 7.558670143415907e-06, 'epoch': 0.24}
{'eval_loss': 0.4240091145038605, 'eval_accuracy': 0.8358634742740703, 'eval_runtime': 18.6916, 'eval_samples_per_second': 525.101, 'eval_steps_per_second': 8.239, 'epoch': 0.24}
{'loss': 0.4491, 'learning_rate': 6.743807040417211e-06, 'epoch': 0.33}
{'eval_loss': 0.38295766711235046, 'eval_accuracy': 0.8523688232297504, 'eval_runtime': 18.6766, 'eval_samples_per_second': 525.524, 'eval_steps_per_second': 8.246, 'epoch': 0.

### Calibration

In [30]:

input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
# below we do per-channel quantization for weights, set axis to None to get a per tensor calibration
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)

# keep it on CPU
model_roberta_q: PreTrainedModel = QDQRobertaForSequenceClassification.from_pretrained("roberta-model")
calibrate(model=model_roberta_q, encoded_dataset=encoded_dataset)


model_roberta_q.save_pretrained("roberta-trained-quantized")
del model_roberta_q


0it [00:00, ?it/s]

  0%|          | 0/4 [00:00<?, ?it/s]

### Quantization Aware Training (QAT)

In [31]:

model_roberta_q: PreTrainedModel = QDQRobertaForSequenceClassification.from_pretrained("roberta-trained-quantized", num_labels=num_labels)
model_roberta_q = model_roberta_q.cuda()

args.learning_rate /= 10
print(f"LR: {args.learning_rate}")
trainer = Trainer(
    model_roberta_q,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
transformers.logging.set_verbosity_error()
print(trainer.evaluate())
# 4 batches
# {'eval_loss': 0.38076257705688477, 'eval_accuracy': 0.8552215995924605, 'eval_runtime': 46.9577, 'eval_samples_per_second': 209.018, 'eval_steps_per_second': 3.28}
# 100 batches
# {'eval_loss': 0.386756956577301, 'eval_accuracy': 0.8516556291390729, 'eval_runtime': 48.9996, 'eval_samples_per_second': 200.308, 'eval_steps_per_second': 3.143}
trainer.train()
print(trainer.evaluate())
# {'eval_loss': 0.40235549211502075, 'eval_accuracy': 0.8589913397860418, 'eval_runtime': 46.1754, 'eval_samples_per_second': 212.559, 'eval_steps_per_second': 3.335, 'epoch': 1.0}
model_roberta_q.save_pretrained("roberta-in-bert-trained-quantized-retrained")
del model_roberta_q

[INFO|trainer.py:437] 2021-12-06 21:28:16,421 >> Using amp half precision backend


LR: 1.0000000000000002e-06
{'eval_loss': 0.38657698035240173, 'eval_accuracy': 0.8526744778400408, 'eval_runtime': 47.6064, 'eval_samples_per_second': 206.17, 'eval_steps_per_second': 3.235}
{'eval_loss': 0.38657698035240173, 'eval_accuracy': 0.8526744778400408, 'eval_runtime': 47.6064, 'eval_samples_per_second': 206.17, 'eval_steps_per_second': 3.235}
{'loss': 0.4018, 'learning_rate': 9.187581486310301e-07, 'epoch': 0.08}
{'eval_loss': 0.38418063521385193, 'eval_accuracy': 0.8558329088130413, 'eval_runtime': 46.6509, 'eval_samples_per_second': 210.393, 'eval_steps_per_second': 3.301, 'epoch': 0.08}
{'loss': 0.3954, 'learning_rate': 8.373533246414604e-07, 'epoch': 0.16}
{'eval_loss': 0.3795166015625, 'eval_accuracy': 0.8589913397860418, 'eval_runtime': 46.5562, 'eval_samples_per_second': 210.821, 'eval_steps_per_second': 3.308, 'epoch': 0.16}
{'loss': 0.3916, 'learning_rate': 7.558670143415907e-07, 'epoch': 0.24}
{'eval_loss': 0.3784726560115814, 'eval_accuracy': 0.8558329088130413, 'e

### Benchmark

In [32]:
model_roberta_q: PreTrainedModel = QDQRobertaForSequenceClassification.from_pretrained("roberta-in-bert-trained-quantized-retrained", num_labels=num_labels)
model_roberta_q = model_roberta_q.cuda()

data = encoded_dataset["train"][1: 3]
input_torch = {k: torch.tensor(list(v), dtype=torch.long, device="cuda") for k, v in data.items() if k in ["input_ids", "attention_mask", "token_type_ids"]}

from pytorch_quantization.nn import TensorQuantizer
TensorQuantizer.use_fb_fake_quant = True
convert_to_onnx(model_pytorch=model_roberta_q, output_path="roberta_q.onnx", inputs_pytorch=input_torch)
TensorQuantizer.use_fb_fake_quant = False



## Latency measures

Let's see if what we have done is useful...
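When timing things by hand, a few warmup runs followed by repeated measurements give more stable numbers than a single call. The helper below is a generic sketch using only the standard library; `measure_latency` is a hypothetical name and the lambda stands in for a real inference call.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(fn: Callable[[], object], nb_warmup: int = 3, nb_runs: int = 10) -> List[float]:
    """Return the wall clock duration (in seconds) of each measured call to fn."""
    for _ in range(nb_warmup):
        fn()  # warmup runs are not measured (caches, lazy initialization...)
    timings: List[float] = []
    for _ in range(nb_runs):
        start = time.perf_counter()
        # for GPU inference, fn must block until the result is ready
        # (for instance by synchronizing the device), otherwise only the launch is timed
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# stand-in workload, to be replaced by a call to the model
timings = measure_latency(lambda: sum(range(10_000)))
print(f"mean: {statistics.mean(timings) * 1000:.3f} ms, stdev: {statistics.stdev(timings) * 1000:.3f} ms")
```

Comparing means is only fair when both models are measured with the same batch size and sequence length.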


### TensorRT baseline

Below we export a Roberta model whose classification head is randomly initialized; the purpose is only to check the performance.

In [33]:
data = encoded_dataset["train"][1:10]
input_torch = {k: torch.tensor(list(v), dtype=torch.long, device="cuda")
               for k, v in data.items() if k in ["input_ids", "attention_mask", "token_type_ids"]}

baseline_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
baseline_model = baseline_model.cuda()
convert_to_onnx(baseline_model, output_path="baseline.onnx", inputs_pytorch=input_torch)
del baseline_model

In [34]:
#!/usr/src/tensorrt/bin/trtexec --onnx=baseline.onnx --shapes=input_ids:1x384,attention_mask:1x384 --best --workspace=6000

## Pytorch baseline