# A recipe to perform Nvidia GPU int-8 quantization on most transformer models

Nvidia recently added a new model called `QDQBert` to the Hugging Face `transformers` library.
The sole purpose of this model is to show how to add GPU quantization to vanilla Bert.
There are also some demo scripts that show how to use the model on the SQuAD task.

**GPU quantization is a way to roughly double the inference speed of your model on GPU**.
In theory it can be applied to any model and, unlike distillation, it should not decrease your model's accuracy if done well.

Unfortunately, this level of performance is not easy to reach: it requires good knowledge of the TensorRT API, ONNX export, and the quantization process. The purpose of this tutorial is to show a good enough process to perform quantization.

## What is int-8 quantization?

The basic idea behind int-8 quantization is that instead of doing deep learning computations with `float` numbers (usually encoded on 32 bits), you use integers (encoded on 8 bits). On a large matrix multiplication, it has 2 effects:

* it divides the memory footprint of the quantized tensors by 4 (8 bits instead of 32), making **memory transfer faster** (on GPU, many operations are very fast to compute and memory transfer is the main bottleneck; such operations are called memory bound)
* it also makes **computation faster** by accelerating the slowest operations (in transformers, mainly the big matrix multiplications in the self-attention computation)
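
To give a rough sense of the memory effect, the sketch below compares the storage needed for a single weight matrix in FP32 and in int-8. The 768 x 3072 shape is an assumption, chosen to resemble a feed-forward layer of a base-size Bert-like model; the real saving depends on which tensors actually end up quantized.

```python
# Bytes needed to store one weight matrix, in FP32 and in int-8.
# The shape is an assumption (hidden size 768, intermediate size 3072).
rows, cols = 768, 3072
n_values = rows * cols

fp32_bytes = n_values * 4  # 32 bits = 4 bytes per value
int8_bytes = n_values * 1  # 8 bits = 1 byte per value

print(f"FP32: {fp32_bytes / 2**20:.2f} MiB")
print(f"int8: {int8_bytes / 2**20:.2f} MiB")
print(f"ratio: {fp32_bytes / int8_bytes:.0f}x")
```

Less data to move means less time spent waiting on memory, which is where memory bound operations lose most of their time.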

An 8-bit signed integer can encode values from -128 to +127, with no fractional part (as it's an integer).
So an 8-bit integer can't encode values like `1280.872654`.

However, we can use our integer if it's associated with a scale (stored as FP32). For instance, with a scale of 20, I can set my integer to 64 (64*20=1280): it's not exactly `1280.872654`, but it's close enough.
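
The round trip described above can be sketched in a few lines of plain Python. The helper names `quantize` and `dequantize` are illustrative only (they are not part of any library), and the convention assumed here is "FP32 value ≈ integer × scale".

```python
def quantize(value: float, scale: float) -> int:
    """Map an FP32 value to a signed 8-bit integer, given a scale."""
    q = round(value / scale)
    return max(-128, min(127, q))  # clamp to the int-8 range


def dequantize(q: int, scale: float) -> float:
    """Recover an approximate FP32 value from the integer and its scale."""
    return q * scale


scale = 20.0
original = 1280.872654
q = quantize(original, scale)
approx = dequantize(q, scale)

print(f"integer: {q}, restored: {approx}, error: {abs(original - approx):.6f}")
```

The error is bounded by half the scale as long as the value is not clamped, which is why choosing the scale well matters so much.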

That's why we need to perform a step called `calibration` during which the range of values and the scale (encoded as a FP32 float) will be computed.

Basically, we know that by converting a FP32 to an int-8 and its scale, we will lose some information, and the goal of the calibration is to minimize this loss.

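If the values in a matrix go from -1.5 to +2, the largest absolute value is 2, so the matrix may be encoded with integers taking values from -128 to +127, associated with a scale of about 2/127 ≈ 0.0157 (127*0.0157≈2). This is equivalent to multiplying each FP32 value by roughly 64 before rounding it (2*64=128).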
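
As a minimal sketch of how a scale can be derived from the observed range, the snippet below uses symmetric quantization with `scale = amax / 127`, where `amax` is the largest absolute value. This is one common convention and an assumption here, not necessarily what a given library does internally. With values between -1.5 and +2, it gives a scale of about 0.0157, which amounts to multiplying each FP32 value by roughly 64 before rounding.

```python
def scale_from_range(values):
    """Symmetric per-tensor scale: the largest absolute value maps to 127."""
    amax = max(abs(v) for v in values)
    return amax / 127


values = [-1.5, -0.3, 0.0, 0.7, 2.0]
scale = scale_from_range(values)

# quantize, then map back to FP32 to measure what was lost
quantized = [max(-127, min(127, round(v / scale))) for v in values]
restored = [q * scale for q in quantized]
max_error = max(abs(v - r) for v, r in zip(values, restored))

print(f"scale: {scale:.4f}")
print(f"quantized: {quantized}")
print(f"max round-trip error: {max_error:.5f}")
```

Calibration is essentially the search for the `amax` (and therefore the scale) that keeps this round-trip error low on representative data.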


[A good documentation on quantization](https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf)


## Why a dedicated tutorial?

Out of the box, the code from Nvidia only supports the vanilla `Bert` model (and not similar models, like RoBERTa & co).
The demo from Nvidia is on the SQuAD task; it's cool, but it makes the code a lot less clear than needed.

To be both simple and cover most use cases, in this tutorial we will see:

* how to perform GPU quantization on **any** transformer model (not just Bert) using a simple trick
* how to apply quantization to a common task like classification (which is easier to understand than question answering)
* how to measure the performance gain (latency)

# Dependencies installation

We install the `master` branch of the `transformers` library to use a new model, **QDQBert**, and `transformer-deploy` to leverage `TensorRT` models (the TensorRT API is not simple to master, so using a wrapper is highly advised).

In [1]:
#! pip install git+https://github.com/huggingface/transformers
#! pip install git+https://github.com/ELS-RD/transformer-deploy
#! pip install scikit-learn datasets -U
#! pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com

In [2]:
# check that the GPU is enabled
! nvidia-smi

Wed Dec  1 18:59:19 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 79%   60C    P8    52W / 350W |    311MiB / 24267MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

# Fine-tuning a model on a text classification task

This part is inspired by the [official notebooks from Hugging Face](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb).

In [3]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In [4]:
task = "mnli"
num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model_checkpoint = "roberta-base"
batch_size = 32
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"

### Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.

In [5]:
from datasets import load_dataset, load_metric
import datasets

actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)
dataset

Reusing dataset glue (/home/geantvert/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/5 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9832
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9796
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 9847
    })
})

### Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

In [7]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

We can double check it does work on our current dataset:

In [8]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence 1: Conceptually cream skimming has two basic dimensions - product and geography.
Sentence 2: Product and geography are what make cream skimming work. 


We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`, which ensures that an input longer than `max_length` (256 tokens here) is truncated. With `padding="max_length"`, shorter inputs are also padded to that same length.
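
To make the effect of truncation and fixed-length padding concrete, here is a toy sketch working on plain lists of token ids. The function `pad_and_truncate` and the `pad_id` value are hypothetical and only illustrate the idea; the real tokenizer also handles special tokens, sentence pairs and its own truncation strategy.

```python
from typing import Dict, List


def pad_and_truncate(token_ids: List[int], max_length: int, pad_id: int) -> Dict[str, List[int]]:
    """Cut the sequence to max_length, then pad it up to max_length."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 for real tokens, 0 for padding that the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_and_truncate([5, 6, 7], max_length=5, pad_id=1)
long = pad_and_truncate(list(range(10)), max_length=5, pad_id=1)
print(short)
print(long)
```

Every sample ends up with the same length, so batches always have a fixed shape.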

In [9]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True, padding="max_length", max_length=256)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True, padding="max_length", max_length=256)

In [10]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at /home/geantvert/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-1f1a13d917d99a50.arrow
Loading cached processed dataset at /home/geantvert/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-9b9dbd19d82c4713.arrow
Loading cached processed dataset at /home/geantvert/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-b37b7241dc97daf7.arrow
Loading cached processed dataset at /home/geantvert/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-f78b5169435f1ed4.arrow
Loading cached processed dataset at /home/geantvert/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-87c8e6fc7a3e0678.arrow


## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since all our tasks are about sentence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which is always 2, except for STS-B which is a regression problem and MNLI where we have 3 labels):

In [11]:
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
import numpy as np
from tqdm.notebook import tqdm

from typing import Dict, OrderedDict
import torch
from torch import Tensor
from transformers import (
    AutoModelForSequenceClassification,
    PreTrainedModel,
    QDQBertForSequenceClassification,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    IntervalStrategy,
)
import pytorch_quantization
from pytorch_quantization import calib
import shutil

In [12]:
def convert_to_onnx(model_pytorch: PreTrainedModel, output_path: str, inputs_pytorch: Dict[str, torch.Tensor]) -> None:
    with torch.no_grad():
        torch.onnx.export(
            model_pytorch,  # model to optimize
            args=(inputs_pytorch["input_ids"], inputs_pytorch["attention_mask"]),  # tuple of inputs (add inputs_pytorch["token_type_ids"] if the model uses it)
            f=output_path,  # output path / file object
            opset_version=13,  # the ONNX opset version to use
            do_constant_folding=True,  # simplify model (replace constant expressions)
            input_names=["input_ids", "attention_mask"],  # input names (add "token_type_ids" if the model uses it)
            output_names=["model_output"],  # output name
            dynamic_axes={  # declare dynamic axes for each input / output (dynamic axis == variable length axis)
                "input_ids": {0: "batch_size", 1: "sequence"},
                "attention_mask": {0: "batch_size", 1: "sequence"},
                #"token_type_ids": {0: "batch_size", 1: "sequence"},
                "model_output": {0: "batch_size"},
            },
            verbose=False,
        )

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

In [13]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]

nb_step = 1000
strategy = IntervalStrategy.STEPS
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = strategy,
    eval_steps=nb_step,
    logging_steps=nb_step,
    save_steps=nb_step,
    save_strategy = strategy,
    learning_rate=1e-5,  # 7.5e-6 https://github.com/pytorch/fairseq/issues/2057#issuecomment-643674771
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size*2,
    num_train_epochs=1,
    fp16=True,
    group_by_length=False,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

## Transplant weights from one model into Bert architecture

First, you need to know that not all models are quantization compliant. The optimization engine (`TensorRT`) searches for some patterns and will fail to optimize the model if it doesn't find them. This requires the model code to be written in a certain way. For that reason, we will try to reuse what works.

We will leverage the fact that since Bert was released, very few improvements have been made to the transformer architecture (at least for encoder-only models).
Better models have appeared, but most of the work has gone into improving the pretraining step.
So the idea is to take the weights from those newer models and put them inside Bert.

The reason for this process is to avoid modifying the source code of these other models.
Copy-pasting the quantization part of the QDQ model into another one is not hard (only a few blocks are modified) but would require some work on the user side, making quantization harder than it should be.
The process described below is not perfect but should work for most users.

**steps**:

* load Bert model
* retrieve layer/weight names
* load target model (here Roberta)
* replace the weight/layer names of the target model with those from Bert
* override the architecture name in model configuration

If there is no 1-to-1 correspondence (it happens), try to keep at least the embeddings and self-attention weights. Of course, if a model is very different, the transplant may cost some accuracy. In our experience, if your train set is big enough, it should not happen.
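
The renaming step can be illustrated on a toy ordered dict, with strings standing in for tensors. The helper `rename_keys_in_order` and the key names below are illustrative only. The trick assumes both models list their layers in the same order, and that no new key collides with an old key still waiting in the dict.

```python
from collections import OrderedDict


def rename_keys_in_order(weights: OrderedDict, new_keys: list) -> OrderedDict:
    """Give the i-th entry of `weights` the i-th name of `new_keys`, keeping the order."""
    assert len(weights) == len(new_keys), "no 1-to-1 mapping between the two models"
    for new_key in new_keys:
        # take the oldest entry out of the dict ...
        _, tensor = weights.popitem(last=False)
        # ... and append it back at the end under its new name
        weights[new_key] = tensor
    return weights


# toy stand-ins for the source model weights and the Bert key names
source_weights = OrderedDict([
    ("roberta.embeddings.weight", "tensor_0"),
    ("roberta.layer.0.weight", "tensor_1"),
    ("classifier.out.weight", "tensor_2"),
])
bert_like_keys = ["bert.embeddings.weight", "bert.layer.0.weight", "classifier.weight"]

renamed = rename_keys_in_order(source_weights, bert_like_keys)
print(list(renamed.items()))
```

After one full pass, every entry has been popped and re-appended exactly once, so the original order is preserved under the new names.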


In [14]:
print(model_checkpoint)

roberta-base


In [15]:
model_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
bert_keys = list(model_bert.state_dict().keys())
del model_bert

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
model.save_pretrained("roberta-in-bert")
del model
model_weights: OrderedDict[str, Tensor] = torch.load("roberta-in-bert/pytorch_model.bin")


# a deliberately simple check
# in real life, check the layer names and find a way to map self attention and embeddings from the original model to Bert
assert len(model_weights) == len(bert_keys)

for bert_key in bert_keys:
    # popitem(last=False) removes the first weight from the ordered dict ...
    _, weight = model_weights.popitem(last=False)
    # ... and we re-insert them, in order, with a new key
    model_weights[bert_key] = weight

# we re-export the weights
torch.save(model_weights, "roberta-in-bert/pytorch_model.bin")
del model_weights


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

We override the architecture name to make `transformers` believe it is Bert...

In [16]:
# =====> change architecture to bert base <======
import json

with open("roberta-in-bert/config.json") as f:
    content = json.load(f)
    content['architectures'] = ["bert"]

with open("roberta-in-bert/config.json", mode="w") as f:
    json.dump(content, f)

## Model training


When you create a classification model from a pretrained one, the last layer is randomly initialized.
We don't want to use these totally random values to compute the calibration of tensors.
Moreover, our train set is a bit small, and it's easy to overfit.

Therefore, we train our `Roberta into Bert` model on about 1/6 of the train set.
The goal is to slightly adapt the weights to the new architecture, not to get the best score.
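
As a quick sanity check on that fraction, the arithmetic below uses values that appear elsewhere in this notebook: the size of the MNLI train split, the per-device batch size, and the `max_steps` set in the training cell. It assumes a single GPU and no gradient accumulation.

```python
train_size = 392_702  # rows in the MNLI train split
batch_size = 32       # per-device train batch size
max_steps = 2_000     # number of optimization steps

examples_seen = batch_size * max_steps
fraction = examples_seen / train_size

print(f"{examples_seen} examples seen, i.e. {fraction:.3f} of an epoch")
```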

> Another approach is to fully train your model, perform calibration, and then retrain it on a small part of the data with a low learning rate (usually 1/10 of the original one).


In [17]:
model_bert = BertForSequenceClassification.from_pretrained("roberta-in-bert", num_labels=num_labels)
model_bert = model_bert.cuda()

args.max_steps = 2000
trainer = Trainer(
    model_bert,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
print(trainer.evaluate())
model_bert.save_pretrained("roberta-in-bert-trained")
del trainer
del model_bert

You are using a model of type roberta to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
max_steps is given, it will override any value given in num_train_epochs
Using amp half precision backend
The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
***** Running training *****
  Num examples = 392702
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2000


Step,Training Loss,Validation Loss,Accuracy
1000,0.7235,0.532824,0.792562
2000,0.5491,0.483588,0.809068


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 64
Saving model checkpoint to roberta-base-finetuned-mnli/checkpoint-1000
Configuration saved in roberta-base-finetuned-mnli/checkpoint-1000/config.json
Model weights saved in roberta-base-finetuned-mnli/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in roberta-base-finetuned-mnli/checkpoint-1000/tokenizer_config.json
Special tokens file saved in roberta-base-finetuned-mnli/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 64
Saving model checkpoint to roberta-base-finetuned-mnli/checkpoint-2000
Configuratio

Configuration saved in roberta-in-bert-trained/config.json


{'eval_loss': 0.4835878908634186, 'eval_accuracy': 0.8090677534386144, 'eval_runtime': 19.1183, 'eval_samples_per_second': 513.384, 'eval_steps_per_second': 8.055, 'epoch': 0.16}


Model weights saved in roberta-in-bert-trained/pytorch_model.bin


# Quantization

Below we start the quantization process.
It follows these steps:

* perform the calibration
* perform quantization aware training

By passing representative data through the model, we calibrate it, meaning it gets the right range / scale to convert FP32 tensors (weights and activations) to int-8 ones.

## Calibration

### Activate histogram calibration

There are several kinds of calibrators. Below we use the percentile one (99.99th percentile, `histogram`), whose purpose is simply to discard the most extreme values before computing the range / scale.
The other option is `max`: it's much faster, but expect lower accuracy.
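
To see why discarding extreme values helps, here is a simplified sketch in plain Python. It is not how `pytorch-quantization` implements its calibrators (the histogram calibrator works on binned counts rather than a full sort), and the helper names and synthetic data are assumptions made for illustration.

```python
import math


def amax_max(values):
    """'max' calibration: keep the largest absolute value."""
    return max(abs(v) for v in values)


def amax_percentile(values, percentile):
    """Percentile calibration (nearest rank): ignore the most extreme values."""
    ordered = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def round_trip_error(value, amax):
    """Error after quantizing to int-8 with scale = amax / 127 and mapping back."""
    scale = amax / 127
    q = max(-127, min(127, round(value / scale)))
    return abs(value - q * scale)


# synthetic activations in [-1, 1], plus a single large outlier
values = [((i % 2001) - 1000) / 1000 for i in range(99_999)] + [100.0]

amax_a = amax_max(values)
amax_b = amax_percentile(values, 99.99)
err_max = round_trip_error(0.5, amax_a)
err_pct = round_trip_error(0.5, amax_b)

print(f"max calibration:        amax={amax_a}, error on 0.5 = {err_max:.4f}")
print(f"percentile calibration: amax={amax_b}, error on 0.5 = {err_pct:.4f}")
```

With `max`, the single outlier stretches the range so much that typical values fall into very few integer buckets. The percentile approach clips the outlier instead and keeps far more resolution for the bulk of the distribution.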

The second calibration option is the granularity: calibration can be done per tensor or per channel (more fine-grained, slower).
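
The difference between the two granularities can be sketched on a tiny weight matrix. In this illustration each row is treated as one channel, which is assumed to mirror what `axis=(0,)` selects for a linear layer; the helper names are hypothetical.

```python
def amax_per_tensor(matrix):
    """One range for the whole matrix."""
    return max(abs(v) for row in matrix for v in row)


def amax_per_channel(matrix):
    """One range per row (channel)."""
    return [max(abs(v) for v in row) for row in matrix]


def quantize_row(row, scale):
    """Quantize a row of FP32 values to int-8 with the given scale."""
    return [max(-127, min(127, round(v / scale))) for v in row]


weights = [
    [0.01, -0.02, 0.015],  # channel with small values
    [1.5, -2.0, 0.7],      # channel with large values
]

tensor_amax = amax_per_tensor(weights)
channel_amax = amax_per_channel(weights)

# the small-valued channel, quantized with each granularity
coarse = quantize_row(weights[0], tensor_amax / 127)
fine = quantize_row(weights[0], channel_amax[0] / 127)

print(f"per-tensor amax: {tensor_amax}, first row -> {coarse}")
print(f"per-channel amax: {channel_amax}, first row -> {fine}")
```

With a single range for the whole tensor, the small-valued channel collapses onto a couple of integers. With one range per channel, it uses the full int-8 range, at the cost of storing one scale per channel.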

In [18]:
# you can also use "max" instead of "histogram"
input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
# below we do per-channel quantization for weights, set axis to None to get a per tensor calibration
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)

### Perform calibration

During this step we will enable the calibration nodes, and pass some representative data to the model.
It will then be used to compute the scale/range.

In [19]:
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained", num_labels=num_labels)
model_q = model_q.cuda()

# Find the TensorQuantizer and enable calibration
for name, module in tqdm(model_q.named_modules()):
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.disable_quant()
            module.enable_calib()
        else:
            module.disable()

with torch.no_grad():
    for start_index in tqdm(range(0, 4*batch_size, batch_size)):
        end_index = start_index + batch_size
        data = encoded_dataset["train"][start_index:end_index]
        input_torch = {k: torch.tensor(list(v), dtype=torch.long, device="cuda")
                       for k, v in data.items() if k in ["input_ids", "attention_mask", "token_type_ids"]}
        model_q(**input_torch)


print("calibration")
# Finalize calibration
for name, module in model_q.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            if isinstance(module._calibrator, calib.MaxCalibrator):
                module.load_calib_amax()
            else:
                module.load_calib_amax("percentile", percentile=99.99)
            module.enable_quant()
            module.disable_calib()
        else:
            module.enable()

model_q.cuda()

count = 0
for name, mod in model_q.named_modules():
    if isinstance(mod, pytorch_quantization.nn.TensorQuantizer):
        print(f"{name:80} {mod}")
        count += 1
print(f"{count} TensorQuantizers found in model")
model_q.save_pretrained("roberta-in-bert-trained-quantized")

loading configuration file roberta-in-bert-trained/config.json
You are using a model of type bert to instantiate a model of type qdqbert. This is not supported for all configurations of models and can yield errors.
Model config QDQBertConfig {
  "_name_or_path": "roberta-in-bert",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "qdqbert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  

I1201 19:07:22.795491 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:07:22.795853 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:22.796230 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:07:22.807137 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:07:22.807927 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:07:22.808553 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:22.809141 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:07:22.809771 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:22.810495 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:22.811114 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19

I1201 19:07:23.058811 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:07:23.059446 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.060050 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:07:23.079895 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:07:23.080400 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:07:23.080774 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.081119 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:07:23.081963 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.082303 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.092197 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear 

I1201 19:07:23.328983 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:07:23.329701 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:07:23.330348 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.331055 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:07:23.331630 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.332135 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.332654 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.333236 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.343668 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:07:23.344560 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantL

I1201 19:07:23.619890 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:07:23.620731 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:07:23.621435 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.622109 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:07:23.623023 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.623664 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.634196 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:07:23.634888 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:07:23.635725 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:07:23.636688 140057592104768 tensor_quantizer.py:105] Creating

0it [00:00, ?it/s]

I1201 19:07:24.008212 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.008661 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.008988 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.009320 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.009638 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.009947 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.010272 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.010650 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.011093 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.011521 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.011932 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.012334 140057592104768 tensor_q

I1201 19:07:24.045287 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.045724 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.046156 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.046822 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.047271 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.047720 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.048171 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.048631 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.049073 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.049527 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.049985 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.050450 14005759210476

I1201 19:07:24.084179 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.084587 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.084959 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.092437 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.093339 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.094649 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.099028 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.099557 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.100039 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.100597 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.100946 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.101308 140057592104768 tens

I1201 19:07:24.139753 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.140066 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.140388 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.140700 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.141023 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.141333 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.141650 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.141971 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.142299 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.142667 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.142989 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.143304 140057592104768 tensor_q

I1201 19:07:24.174127 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.174479 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.174819 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.175151 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.175482 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.175847 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.176198 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.176537 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.176899 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.177248 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.177613 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.177950 14005759210476

I1201 19:07:24.208091 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.208422 140057592104768 tensor_quantizer.py:179] Enable MaxCalibrator
I1201 19:07:24.208777 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.209120 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator
I1201 19:07:24.209468 140057592104768 tensor_quantizer.py:183] Disable `quant` stage.
I1201 19:07:24.209815 140057592104768 tensor_quantizer.py:179] Enable HistogramCalibrator


  0%|          | 0/4 [00:00<?, ?it/s]

I1201 19:07:24.249449 140057592104768 histogram.py:69] Calibrator encountered negative values. It shouldn't happen after ReLU. Make sure this is the right tensor to calibrate.
I1201 19:07:24.322341 140057592104768 max.py:60] Calibrator encountered negative values. It shouldn't happen after ReLU. Make sure this is the right tensor to calibrate.
W1201 19:11:25.924171 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
W1201 19:11:25.925113 140057592104768 tensor_quantizer.py:238] Call .cuda() if running on GPU after loading calibrated amax.
I1201 19:11:25.925732 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:25.926429 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:25.926960 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:25.927570 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:25.928478 140057592104768 tensor_qua

I1201 19:11:25.971887 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:25.972258 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:25.973086 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:25.973466 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:25.973832 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:25.974312 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:25.974988 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:25.975597 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:25.976069 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:25.976484 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:25.977218 140057592104768 tensor_quantizer.py:173] Dis

I1201 19:11:26.025758 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.026160 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.026600 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:26.027027 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.027422 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.028437 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.028844 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.029254 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.030018 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.030419 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.030812 140057592104768 tensor_quantizer.py:173] Dis

I1201 19:11:26.074990 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.075353 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.076299 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.076682 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.077061 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.077822 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.078284 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.078689 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.079439 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.079832 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.080327 140057592104768 tensor_quantizer.py:173] Disable H

calibration


W1201 19:11:26.124270 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.124982 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.125337 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.125682 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([3072, 1]).
I1201 19:11:26.126024 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.126360 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.126821 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.127168 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.127512 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.127857 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:26.128201 140057592104768

W1201 19:11:26.164236 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:26.164555 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.164862 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.165282 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.166933 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.167483 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.167847 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:26.168207 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.168518 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.169337 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.169656 140057592104768 tensor

W1201 19:11:26.203794 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.204129 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.204463 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.205213 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.205532 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.205856 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.206314 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.206920 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.207242 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.207595 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([3072, 1]).
I1201 19:11:26.207930 140057592104768

W1201 19:11:26.243480 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:26.243837 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.244161 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.245077 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.245416 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.245760 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.246103 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:26.246461 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.246803 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.247718 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.248057 140057592104768 tensor

W1201 19:11:26.287186 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.287554 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.287908 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.288288 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
I1201 19:11:26.288911 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.289278 140057592104768 tensor_quantizer.py:173] Disable MaxCalibrator
W1201 19:11:26.289983 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.290372 140057592104768 tensor_quantizer.py:187] Enable `quant` stage.
W1201 19:11:26.290744 140057592104768 tensor_quantizer.py:173] Disable HistogramCalibrator
W1201 19:11:26.291477 140057592104768 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
I1201 19:11:26.291844 140057592104768 tensor

bert.encoder.layer.0.attention.self.query._input_quantizer                       TensorQuantizer(8bit fake per-tensor amax=4.3667 calibrator=HistogramCalibrator scale=1.0 quant)
bert.encoder.layer.0.attention.self.query._weight_quantizer                      TensorQuantizer(8bit fake axis=(0,) amax=[0.2286, 0.7135](768) calibrator=MaxCalibrator scale=1.0 quant)
bert.encoder.layer.0.attention.self.key._input_quantizer                         TensorQuantizer(8bit fake per-tensor amax=4.3667 calibrator=HistogramCalibrator scale=1.0 quant)
bert.encoder.layer.0.attention.self.key._weight_quantizer                        TensorQuantizer(8bit fake axis=(0,) amax=[0.2130, 0.8616](768) calibrator=MaxCalibrator scale=1.0 quant)
bert.encoder.layer.0.attention.self.value._input_quantizer                       TensorQuantizer(8bit fake per-tensor amax=4.3667 calibrator=HistogramCalibrator scale=1.0 quant)
bert.encoder.layer.0.attention.self.value._weight_quantizer                      TensorQuantiz

Model weights saved in roberta-in-bert-trained-quantized/pytorch_model.bin
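
The quantizer summary printed above shows two kinds of `amax`: a single value for inputs (`per-tensor`) and a vector of 768 values for weights (`axis=(0,)`, one per output channel). As a rough illustration of the difference, here is a minimal pure Python sketch. The helper names and the toy matrix are made up for this example and are not part of `pytorch_quantization`.

```python
def per_tensor_amax(matrix):
    """Largest absolute value over the whole matrix (one scalar)."""
    return max(abs(v) for row in matrix for v in row)


def per_channel_amax(matrix):
    """Largest absolute value of each row (axis 0), one value per row."""
    return [max(abs(v) for v in row) for row in matrix]


# Toy 3x4 "weight" matrix; the values are invented for illustration.
weights = [
    [0.10, -0.22, 0.05, 0.01],
    [0.70, -0.30, 0.15, -0.71],
    [-0.02, 0.03, -0.04, 0.01],
]

print(per_tensor_amax(weights))   # 0.71
print(per_channel_amax(weights))  # [0.22, 0.71, 0.04]
```

With a single `amax` of 0.71, the third row (values around 0.03) would use only a handful of the 255 integer levels. A per-row `amax` gives each row its own scale, which is presumably why it is used for weights.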


## Quantization aware training

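Quantization aware training (QAT) is not a mandatory step, but it is highly recommended to get the best accuracy. Basically, we redo the fine-tuning with quantization enabled, so that the weights can adapt to the rounding error. In the run below, accuracy goes from 0.78 right after calibration to 0.85 after one epoch.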
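
To make "quantization enabled" more concrete: during this training, each quantizer applies a *fake* quantization, meaning values are rounded to the int-8 grid and immediately converted back to float. The sketch below is a simplified scalar version in pure Python, assuming a symmetric range of [-127, 127] and `scale = amax / 127`. The function `fake_quantize` is a hypothetical helper written for this example, not the library implementation (which works on tensors and, to my understanding, lets gradients pass through the rounding unchanged).

```python
def fake_quantize(x, amax, num_bits=8):
    """Quantize x to a signed integer grid, then map it back to a float."""
    bound = 2 ** (num_bits - 1) - 1      # 127 for 8 bits (symmetric range)
    scale = amax / bound                 # float step between two integer levels
    q = round(x / scale)                 # nearest integer level
    q = max(-bound, min(bound, q))       # clip values outside [-amax, amax]
    return q * scale                     # back to float, rounding error included


amax = 4.3667  # per-tensor amax reported for the layer 0 query input quantizer
for value in (0.0, 1.0, -2.5, 10.0):
    print(value, "->", fake_quantize(value, amax))
```

Values inside the calibrated range come back with an error of at most half a step, while values beyond `amax` are clipped. The model sees exactly this error during training and can learn to compensate for it.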

In [20]:
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained-quantized", num_labels=num_labels)
model_q = model_q.cuda()

args.max_steps = -1
trainer = Trainer(
    model_q,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
print(trainer.evaluate())
trainer.train()
print(trainer.evaluate())
model_q.save_pretrained("roberta-in-bert-trained-quantized-bis")
del model_q

loading configuration file roberta-in-bert-trained-quantized/config.json
Model config QDQBertConfig {
  "_name_or_path": "roberta-in-bert-trained",
  "architectures": [
    "QDQBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "qdqbert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.13.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 502

I1201 19:11:27.439879 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.440475 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:11:27.450418 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:11:27.451315 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:11:27.451903 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.452303 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:11:27.452749 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.453151 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.453549 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.453957 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.465221 14005759210

I1201 19:11:27.693244 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.693841 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:11:27.714617 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:11:27.715979 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:11:27.716634 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.717667 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:11:27.719031 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.719666 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.731079 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:11:27.731843 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear 

I1201 19:11:27.980991 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:11:27.981393 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.981833 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:11:27.982313 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.982766 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.983162 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.983512 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:27.993079 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:11:27.993498 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:11:27.993848 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1

I1201 19:11:28.249994 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:11:28.250553 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:28.251643 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:11:28.253252 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:28.253765 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:28.265286 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear with axis None!
I1201 19:11:28.266306 140057592104768 _utils.py:75] Weight is fake quantized to 8 bits in QuantLinear with axis (0,)!
I1201 19:11:28.267012 140057592104768 tensor_quantizer.py:101] Creating histogram calibrator
I1201 19:11:28.267527 140057592104768 tensor_quantizer.py:105] Creating Max calibrator
I1201 19:11:28.278982 140057592104768 _utils.py:72] Input is fake quantized to 8 bits in QuantLinear 

All the weights of QDQBertForSequenceClassification were initialized from the model checkpoint at roberta-in-bert-trained-quantized.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QDQBertForSequenceClassification for predictions without further training.
Using amp half precision backend
The following columns in the evaluation set  don't have a corresponding argument in `QDQBertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 64


The following columns in the training set  don't have a corresponding argument in `QDQBertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
***** Running training *****
  Num examples = 392702
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 12272


{'eval_loss': 0.5553710460662842, 'eval_accuracy': 0.7799286805909322, 'eval_runtime': 46.6334, 'eval_samples_per_second': 210.472, 'eval_steps_per_second': 3.302}


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1000 | 0.5814 | 0.505601 | 0.805807 |
| 2000 | 0.5424 | 0.481971 | 0.811105 |
| 3000 | 0.5108 | 0.469823 | 0.823637 |
| 4000 | 0.494 | 0.459618 | 0.821905 |
| 5000 | 0.4827 | 0.418851 | 0.837596 |
| 6000 | 0.4712 | 0.417829 | 0.836373 |
| 7000 | 0.4607 | 0.43154 | 0.834947 |
| 8000 | 0.4601 | 0.402023 | 0.847376 |
| 9000 | 0.4577 | 0.396712 | 0.846052 |
| 10000 | 0.4354 | 0.398412 | 0.84646 |


The following columns in the evaluation set  don't have a corresponding argument in `QDQBertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 64
Saving model checkpoint to roberta-base-finetuned-mnli/checkpoint-1000
Configuration saved in roberta-base-finetuned-mnli/checkpoint-1000/config.json
Model weights saved in roberta-base-finetuned-mnli/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in roberta-base-finetuned-mnli/checkpoint-1000/tokenizer_config.json
Special tokens file saved in roberta-base-finetuned-mnli/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `QDQBertForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 64
Saving model checkpoint to roberta-base-finetuned-mnli/checkpoint-2000
Config

Loading best model from roberta-base-finetuned-mnli/checkpoint-12000 (score: 0.8502292409577178).
W1201 20:35:01.955523 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.0.attention.self.query._input_quantizer: Overwriting amax.
W1201 20:35:01.956298 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.0.attention.self.query._weight_quantizer: Overwriting amax.
W1201 20:35:01.959111 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.0.attention.self.key._input_quantizer: Overwriting amax.
W1201 20:35:01.960055 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.0.attention.self.key._weight_quantizer: Overwriting amax.
W1201 20:35:01.961468 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.0.attention.self.value._input_quantizer: Overwriting amax.
W1201 20:35:01.962328 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.0.attention.self.value._weight_quantizer: Overwriting amax.
W1201 20:35:01.963392 140057592104768 tensor_quantizer.py

W1201 20:35:02.018628 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.2.output.add_local_input_quantizer: Overwriting amax.
W1201 20:35:02.019512 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.2.output.add_residual_input_quantizer: Overwriting amax.
W1201 20:35:02.021052 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.3.attention.self.query._input_quantizer: Overwriting amax.
W1201 20:35:02.021683 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.3.attention.self.query._weight_quantizer: Overwriting amax.
W1201 20:35:02.023229 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.3.attention.self.key._input_quantizer: Overwriting amax.
W1201 20:35:02.024051 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.3.attention.self.key._weight_quantizer: Overwriting amax.
W1201 20:35:02.025203 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.3.attention.self.value._input_quantizer: Overwriting amax.
W1201 20:35:02.025818 

W1201 20:35:02.080285 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.5.output.dense._input_quantizer: Overwriting amax.
W1201 20:35:02.081100 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.5.output.dense._weight_quantizer: Overwriting amax.
W1201 20:35:02.082259 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.5.output.add_local_input_quantizer: Overwriting amax.
W1201 20:35:02.082955 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.5.output.add_residual_input_quantizer: Overwriting amax.
W1201 20:35:02.084679 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.6.attention.self.query._input_quantizer: Overwriting amax.
W1201 20:35:02.085377 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.6.attention.self.query._weight_quantizer: Overwriting amax.
W1201 20:35:02.086830 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.6.attention.self.key._input_quantizer: Overwriting amax.
W1201 20:35:02.087531 14005759210476

W1201 20:35:02.141115 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.8.intermediate.dense._weight_quantizer: Overwriting amax.
W1201 20:35:02.143184 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.8.output.dense._input_quantizer: Overwriting amax.
W1201 20:35:02.143835 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.8.output.dense._weight_quantizer: Overwriting amax.
W1201 20:35:02.144592 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.8.output.add_local_input_quantizer: Overwriting amax.
W1201 20:35:02.145140 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.8.output.add_residual_input_quantizer: Overwriting amax.
W1201 20:35:02.146503 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.9.attention.self.query._input_quantizer: Overwriting amax.
W1201 20:35:02.147159 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.9.attention.self.query._weight_quantizer: Overwriting amax.
W1201 20:35:02.148222 1400575921047

W1201 20:35:02.197219 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.11.attention.output.add_residual_input_quantizer: Overwriting amax.
W1201 20:35:02.199527 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.11.intermediate.dense._input_quantizer: Overwriting amax.
W1201 20:35:02.200487 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.11.intermediate.dense._weight_quantizer: Overwriting amax.
W1201 20:35:02.202985 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.11.output.dense._input_quantizer: Overwriting amax.
W1201 20:35:02.203737 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.11.output.dense._weight_quantizer: Overwriting amax.
W1201 20:35:02.204701 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.11.output.add_local_input_quantizer: Overwriting amax.
W1201 20:35:02.205333 140057592104768 tensor_quantizer.py:402] bert.encoder.layer.11.output.add_residual_input_quantizer: Overwriting amax.
The following columns i

{'eval_loss': 0.39855679869651794, 'eval_accuracy': 0.8502292409577178, 'eval_runtime': 47.3757, 'eval_samples_per_second': 207.174, 'eval_steps_per_second': 3.251, 'epoch': 1.0}


# Latency measures

Let's see if what we have done is useful...
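
Whatever the backend, the measurement itself follows the same pattern: a few warmup calls, then many timed calls, then summary statistics. Below is a minimal, backend-agnostic sketch of that pattern. The `benchmark` helper and its parameters are hypothetical names chosen for this example. For GPU code run from Python, the timed function would also need to wait for the GPU to finish (for instance with `torch.cuda.synchronize()`), because CUDA calls are asynchronous.

```python
import statistics
import time


def benchmark(fn, warmup=10, runs=100):
    """Call fn repeatedly and return latency statistics in milliseconds."""
    # Warmup calls are not timed: the first calls often pay one-off costs
    # (lazy initialization, caches, memory allocation).
    for _ in range(warmup):
        fn()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)

    ordered = sorted(timings_ms)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        # Simple nearest-rank style approximation of the 99th percentile.
        "p99_ms": ordered[int(0.99 * (len(ordered) - 1))],
    }


# Stand-in workload; replace it with a call to the model under test.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Looking at the median and a high percentile in addition to the mean helps to spot runs disturbed by other processes.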


## TensorRT quantized model

In [23]:
from pytorch_quantization.nn import TensorQuantizer
TensorQuantizer.use_fb_fake_quant = True
model_q = QDQBertForSequenceClassification.from_pretrained("roberta-in-bert-trained-quantized-bis", num_labels=num_labels)
model_q = model_q.cuda()
print(trainer.evaluate())
convert_to_onnx(model_q, output_path="model_q.onnx", inputs_pytorch=input_torch)
TensorQuantizer.use_fb_fake_quant = False
del model_q

W1201 21:32:19.778870 140057592104768 tensor_quantizer.py:280] Use Pytorch's native experimental fake quantization.
  inputs, amax.item() / bound, 0,
  quant_dim = list(amax.shape).index(list(amax_sequeeze.shape)[0])


In [None]:
!/usr/src/tensorrt/bin/trtexec --onnx=model_q.onnx --shapes=input_ids:1x384,attention_mask:1x384 --best --workspace=6000

## TensorRT baseline

In [None]:
baseline_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
baseline_model = baseline_model.cuda()
convert_to_onnx(baseline_model, output_path="baseline.onnx", inputs_pytorch=input_torch)
del baseline_model

In [None]:
!/usr/src/tensorrt/bin/trtexec --onnx=baseline.onnx --shapes=input_ids:1x384,attention_mask:1x384 --fp16 --workspace=6000


## PyTorch baseline