This document describes how to quantize a model with PostTrainingStatic on the text classification task and obtain the benchmarking results.
Python 3.6 or a higher version is recommended. The dependent packages are listed in requirements.txt; install them as follows:
cd examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx
pip install -r requirements.txt
python run_glue.py \
--model_name_or_path yoshitomo-matsubara/bert-large-uncased-rte \
--task_name rte \
--do_eval \
--do_train \
--max_seq_length 128 \
--per_device_eval_batch_size 16 \
--no_cuda \
--output_dir saved_results \
--tune \
--overwrite_output_dir
NOTE: saved_results is the path to the fine-tuned output_dir.
or
bash run_quant.sh --topology=topology_name --input_model=model_name_or_path
NOTE: Users can also install Open MPI with Conda.
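For example, Open MPI can be installed from the conda-forge channel (a minimal sketch; the channel and package name are assumptions and should be adjusted to your environment):
# install Open MPI from the conda-forge channel (assumed channel/package name)
conda install -c conda-forge openmpi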
In run_glue.py, set config.quant_level to 1 and config.tuning_criterion.strategy to "basic" with the following statements:
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion
tuning_criterion = TuningCriterion(max_trials=600, strategy="basic")
conf = PostTrainingQuantConfig(quant_level=1, approach="static", tuning_criterion=tuning_criterion)
Then run the following command:
mpirun -np <NUM_PROCESS> \
-mca btl_tcp_if_include <NETWORK_INTERFACE> \
-x OMP_NUM_THREADS=<MAX_NUM_THREADS> \
--host <HOSTNAME1>,<HOSTNAME2>,<HOSTNAME3> \
bash run_distributed_tuning.sh
- <NUM_PROCESS> is the number of processes; we recommend setting it equal to the number of hosts.
- <MAX_NUM_THREADS> is the number of threads; we recommend setting it equal to the number of physical cores on one node.
- <HOSTNAME> is a host name. The argument --host <HOSTNAME1>,<HOSTNAME2>,... can be replaced with --hostfile <HOSTFILE>, where each line in <HOSTFILE> is a host name.
- -mca btl_tcp_if_include <NETWORK_INTERFACE> sets the network interface used for MPI communication between hosts. For example, <NETWORK_INTERFACE> can be set to 192.168.20.0/24 to allow MPI communication between all hosts in the 192.168.20.* network segment.
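For instance, a concrete invocation for two hosts might look as follows (the host names, subnet and thread count below are illustrative placeholders, not recommended values):
# example: 2 hosts on the 192.168.20.* subnet, 56 physical cores per node (all values hypothetical)
mpirun -np 2 \
    -mca btl_tcp_if_include 192.168.20.0/24 \
    -x OMP_NUM_THREADS=56 \
    --host node1,node2 \
    bash run_distributed_tuning.sh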
# int8
bash run_benchmark.sh --topology=topology_name --mode=performance --int8=true --input_model=saved_results
# fp32
bash run_benchmark.sh --topology=topology_name --mode=performance --input_model=model_name_or_path
Topology Name | Model Name | Dataset/Task Name |
---|---|---|
bert_large_RTE | yoshitomo-matsubara/bert-large-uncased-rte | rte |
xlm-roberta-base_MRPC | Intel/xlm-roberta-base-mrpc | mrpc |
bert_base_MRPC | Intel/bert-base-uncased-mrpc | mrpc |
bert_base_CoLA | textattack/bert-base-uncased-CoLA | cola |
bert_base_STS-B | textattack/bert-base-uncased-STS-B | stsb |
bert_base_SST-2 | gchhablani/bert-base-cased-finetuned-sst2 | sst2 |
bert_base_RTE | ModelTC/bert-base-uncased-rte | rte |
bert_large_QNLI | textattack/bert-base-uncased-QNLI | qnli |
bert_large_CoLA | yoshitomo-matsubara/bert-large-uncased-cola | cola |
distilbert_base_MRPC | textattack/distilbert-base-uncased-MRPC | mrpc |
xlnet_base_cased_MRPC | Intel/xlnet-base-cased-mrpc | mrpc |
roberta_base_MRPC | textattack/roberta-base-MRPC | mrpc
camembert_base_MRPC | Intel/camembert-base-mrpc | mrpc |
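For example, to quantize the bert_base_MRPC topology from the table and then benchmark the result, assuming the quantized model is written to saved_results as in the earlier run_glue.py example:
bash run_quant.sh --topology=bert_base_MRPC --input_model=Intel/bert-base-uncased-mrpc
bash run_benchmark.sh --topology=bert_base_MRPC --mode=performance --int8=true --input_model=saved_results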
We provide an API, save_for_huggingface_upstream, to collect the configuration files, tokenizer files and INT8 model weights in the transformers format.
from neural_compressor.utils.load_huggingface import save_for_huggingface_upstream
save_for_huggingface_upstream(q_model, tokenizer, output_dir)
Users can upstream the files in output_dir to the model hub and reuse them with our OptimizedModel API.
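As a sketch of the upstream step, assuming the huggingface_hub package is installed and you are authenticated, the collected files can be pushed to the hub as follows (the repo_id below is a placeholder):
from huggingface_hub import HfApi

# push everything that save_for_huggingface_upstream wrote to output_dir
# (the repo_id is a placeholder; replace it with your own hub repository)
api = HfApi()
api.create_repo(repo_id="<username>/bert-base-uncased-mrpc-int8", exist_ok=True)
api.upload_folder(folder_path=output_dir, repo_id="<username>/bert-base-uncased-mrpc-int8")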
We provide an API, OptimizedModel, to initialize INT8 models from the HuggingFace model hub; its usage is the same as the model classes provided by transformers.
from neural_compressor.utils.load_huggingface import OptimizedModel
model = OptimizedModel.from_pretrained(
model_args.model_name_or_path,
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
We have also upstreamed several INT8 models to the HuggingFace model hub so that users can ramp up quickly.
- The user specifies the FP32 model, the calibration dataset q_dataloader, the evaluation dataset eval_dataloader and metrics (see the sketch after this list).
- The user specifies the FP32 model, the calibration dataset q_dataloader and a custom eval_func that encapsulates the evaluation dataset and metrics by itself (this is the approach used in the updated run_glue.py below).
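A minimal sketch of the first scenario follows; the Metric class usage and the eval_metric/eval_dataloader arguments to fit are assumptions about the Neural Compressor 2.x API and may need to be adapted to your installed version.
# sketch of scenario 1: let Neural Compressor drive evaluation with a dataloader and a metric
from neural_compressor import Metric, quantization
from neural_compressor.config import PostTrainingQuantConfig

conf = PostTrainingQuantConfig(approach="static")
accuracy = Metric(name="topk", k=1)  # top-1 accuracy as the tuning objective (assumed built-in metric name)
q_model = quantization.fit(
    model,                             # FP32 model
    conf=conf,
    calib_dataloader=eval_dataloader,  # calibration data (q_dataloader)
    eval_dataloader=eval_dataloader,   # evaluation data
    eval_metric=accuracy,
)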
We update run_glue.py as follows:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
eval_dataloader = trainer.get_eval_dataloader()
batch_size = eval_dataloader.batch_size
metric_name = "eval_f1"
def take_eval_steps(model, trainer, metric_name, save_metrics=False):
trainer.model = model
metrics = trainer.evaluate()
return metrics.get(metric_name)
def eval_func(model):
return take_eval_steps(model, trainer, metric_name)
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion
from neural_compressor.quantization import fit

tuning_criterion = TuningCriterion(max_trials=600)
conf = PostTrainingQuantConfig(approach="static", tuning_criterion=tuning_criterion, use_distributed_tuning=False)
q_model = fit(model, conf=conf, calib_dataloader=eval_dataloader, eval_func=eval_func)
q_model.save(training_args.output_dir)