# Quick Get Started Notebook of Intel® Neural Compressor for Pytorch


This notebook is designed to provide an easy-to-follow guide for getting started with the [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) library for [pytorch](https://github.com/pytorch/pytorch) framework.

In the following sections, we are going to use a DistilBert model fine-tuned on MRPC as an example to show how to apply post-training quantization on [transformers](https://github.com/huggingface/transformers) models using the INC library.


The main objectives of this notebook are:

1. Prerequisite: Prepare necessary environment, model and dataset.
2. Quantization with INC: Walk through the step-by-step process of applying post-training quantization.
3. Benchmark with INC: Evaluate the performance of the FP32 and INT8 models.


## 1. Prerequisite

### 1.1 Environment

If you have Jupyter Notebook, you may directly run this notebook. We will use pip to install or upgrade [neural-compressor](https://github.com/intel/neural-compressor), [pytorch](https://github.com/pytorch/pytorch) and other required packages.

Otherwise, you can setup a new environment. First, we install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```shell
conda create -n inc_notebook python==3.8
conda activate inc_notebook
pip install jupyter
jupyter notebook
```
The last command will launch Jupyter Notebook and we can open this notebook in browser to continue.

Then, let's install necessary packages.

In [3]:
%cd ~/neural-compressor
!pwd


/data/home/dujianhua/neural-compressor
/data/home/dujianhua/neural-compressor


In [1]:
# install neural-compressor from source
import sys
# !git clone https://github.com/intel/neural-compressor.git
# %cd ./neural-compressor
!{sys.executable} -m pip install -r requirements.txt
!{sys.executable} setup.py install
%cd ..

# or install stable basic version from pypi
# !{sys.executable} -m pip install neural-compressor


81.07s - pydevd: Sending message related to process being replaced timed-out after 5 seconds


^C
[31mERROR: Operation cancelled by user[0m[31m
[0m

88.06s - pydevd: Sending message related to process being replaced timed-out after 5 seconds


/data/home/dujianhua/anaconda3/envs/inc_notebook/bin/python: can't open file '/data/home/dujianhua/neural-compressor/examples/notebook/pytorch/setup.py': [Errno 2] No such file or directory
/data/home/dujianhua/neural-compressor/examples/notebook


In [11]:
# install other packages used in this notebook.
!cd ./neural-compressor

!{sys.executable} -m pip install -r requirements.txt


[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'[0m[31m
[0m

### 1.2 Load Dataset

The General Language Understanding Evaluation (GLUE) benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases from one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. This dataset is built from the SQuAD dataset.
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs2) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 1 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not. This dataset is built from the Winograd Schema Challenge dataset.

Here, we use MRPC task. We download and load the required dataset from hub.

In [1]:
import datasets
import numpy as np
import transformers
from datasets import load_dataset, load_metric
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EvalPrediction,
    Trainer,
)

2023-10-31 16:31:22.257636: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
task_name = 'mrpc'
raw_datasets = load_dataset("glue", task_name)
label_list = raw_datasets["train"].features["label"].names
num_labels = len(label_list)

Found cached dataset glue (/data/home/dujianhua/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

### 1.3 Prepare Model
Download the pretrained model [textattack/distilbert-base-uncased-MRPC](https://huggingface.co/textattack/distilbert-base-uncased-MRPC) to a pytorch model.

In [3]:
model_name = 'textattack/distilbert-base-uncased-MRPC'

config = AutoConfig.from_pretrained(
    model_name,
    num_labels=num_labels,
    finetuning_task=task_name,
    use_auth_token=None,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_auth_token=None,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    from_tf=False,
    config=config,
    use_auth_token=None,
)

### 1.4 Dataset Preprocessing
We need to preprocess the raw dataset.

In [4]:
sentence1_key, sentence2_key = ("sentence1", "sentence2")
padding = "max_length"
label_to_id = None
max_seq_length = 128

def preprocess_function(examples):
    args = (
        (examples[sentence1_key], examples[sentence2_key])
    )
    result = tokenizer(*args, padding=padding, max_length=max_seq_length, truncation=True)
    return result

raw_datasets = raw_datasets.map(preprocess_function, batched=True)

Loading cached processed dataset at /data/home/dujianhua/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-24ad1972b8225650.arrow
Loading cached processed dataset at /data/home/dujianhua/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-27c2369cdbd7171d.arrow


Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

## 2. Quantization with Intel® Neural Compressor

### 2.1 Define metric, evaluate function, and dataloader

In this part, we define a GLUE metirc and use it to generate an evaluate function for INC.

Refer to doc [metric.md](https://github.com/intel/neural-compressor/blob/master/docs/source/metric.md#build-custom-metric-with-python-api) for how to build your own metric.
Refer to doc [dataset.md](https://github.com/intel/neural-compressor/blob/master/docs/source/dataset.md#user-specific-dataset) and [dataloader.md](https://github.com/intel/neural-compressor/blob/master/docs/source/dataloader.md#build-custom-dataloader-with-python-apiapi) for how to build your own dataset and dataloader.

In [5]:
eval_dataset = raw_datasets["validation"]
metric = load_metric("glue", task_name)
data_collator = None

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    result = metric.compute(predictions=preds, references=p.label_ids)
    if len(result) > 1:
        result["combined_score"] = np.mean(list(result.values())).item()
    return result

# Initialize our Trainer
trainer = Trainer(
    model=model,
    train_dataset=None,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

eval_dataloader = trainer.get_eval_dataloader()

# for transformers 4.31.0: accelerate dataloader
# please use the code below to avoid error 
if eval_dataloader.batch_size is None:
    def _build_inc_dataloader(dataloader):
        class INCDataLoader:
            __iter__ = dataloader.__iter__
            def __init__(self) -> None:
                self.dataloader = dataloader
                self.batch_size = dataloader.total_batch_size
        return INCDataLoader()
    eval_dataloader = _build_inc_dataloader(eval_dataloader)
batch_size = eval_dataloader.batch_size

def take_eval_steps(model, trainer, save_metrics=False):
    trainer.model = model
    metrics = trainer.evaluate()
    bert_task_acc_keys = ['eval_f1', 'eval_accuracy', 'eval_matthews_correlation',
                            'eval_pearson', 'eval_mcc', 'eval_spearmanr']
    for key in bert_task_acc_keys:
        if key in metrics.keys():
            throughput = metrics.get("eval_samples_per_second")
            print('Batch size = %d' % batch_size)
            print("Finally Eval {} Accuracy: {}".format(key, metrics[key]))
            print("Latency: %.3f ms" % (1000 / throughput))
            print("Throughput: {} samples/sec".format(throughput))
            return metrics[key]
    assert False, "No metric returned, Please check inference metric!"

def eval_func(model):
    return take_eval_steps(model, trainer)

  metric = load_metric("glue", task_name)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


### 2.2 Run Quantization

So far, we can finally start to quantize the model. 

To start, we need to set the configuration for post-training quantization using `PostTrainingQuantConfig` class. Once the configuration is set, we can proceed to the next step by calling the `quantization.fit()` function. This function performs the quantization process on the model and will return the best quantized model.

In [6]:
from neural_compressor.quantization import fit
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion
tuning_criterion = TuningCriterion(max_trials=600)
conf = PostTrainingQuantConfig(approach="static", tuning_criterion=tuning_criterion)
q_model = fit(model, conf=conf, calib_dataloader=eval_dataloader, eval_func=eval_func)

2023-10-31 16:32:57 [INFO] Start auto tuning.
2023-10-31 16:32:57 [INFO] Execute the tuning process due to detect the evaluation function.
2023-10-31 16:32:57 [INFO] Adaptor has 5 recipes.
2023-10-31 16:32:57 [INFO] 0 recipes specified by user.
2023-10-31 16:32:57 [INFO] 3 recipes require future tuning.
2023-10-31 16:32:57 [INFO] *** Initialize auto tuning
2023-10-31 16:32:57 [INFO] {
2023-10-31 16:32:57 [INFO]     'PostTrainingQuantConfig': {
2023-10-31 16:32:57 [INFO]         'AccuracyCriterion': {
2023-10-31 16:32:57 [INFO]             'criterion': 'relative',
2023-10-31 16:32:57 [INFO]             'higher_is_better': True,
2023-10-31 16:32:57 [INFO]             'tolerable_loss': 0.01,
2023-10-31 16:32:57 [INFO]             'absolute': None,
2023-10-31 16:32:57 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x7f3f980df3d0>>,
2023-10-31 16:32:57 [INFO]             'relative': 0.01
2023-10-31 16:32:57 [INFO]    

2023-10-31 16:33:43 [INFO] Save tuning history to /data/home/dujianhua/neural-compressor/examples/notebook/pytorch/nc_workspace/2023-10-31_16-32-55/./history.snapshot.
2023-10-31 16:33:43 [INFO] FP32 baseline is: [Accuracy: 0.9027, Duration (seconds): 44.6601]
2023-10-31 16:33:43 [INFO] Quantize the model with default config.


Batch size = 64
Finally Eval eval_f1 Accuracy: 0.9026845637583893
Latency: 109.445 ms
Throughput: 9.137 samples/sec


2023-10-31 16:33:43 [INFO] Fx trace of the entire model failed, We will conduct auto quantization
2023-10-31 16:33:46 [ERROR] Unexpected exception NotImplementedError("Could not run 'quantized::embedding_bag_prepack' with arguments from the 'QuantizedCUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::embedding_bag_prepack' is only available for these backends: [QuantizedCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnaps

## 3. Benchmark with Intel® Neural Compressor

INC provides a benchmark feature to measure the model performance with the objective settings.

In [7]:
# fp32 benchmark
!{sys.executable} benchmark.py --input_model ./pytorch_model.bin 2>&1|tee fp32_benchmark.log

# int8 benchmark
!{sys.executable} benchmark.py --input_model ./saved_results/best_model.pt 2>&1|tee int8_benchmark.log


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/bin/bash: {sys.executable}: command not found
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/bin/bash: {sys.executable}: command not found
