This document describes how to quantize a model with PostTrainingStatic on the text classification task and obtain the benchmarking results.
Python 3.6 or a higher version is recommended. The dependent packages are listed in requirements.txt; install them as follows:
cd examples/pytorch/nlp/huggingface_models/text-classification/quantization/ptq_static/fx
pip install -r requirements.txt
python run_glue.py \
--model_name_or_path yoshitomo-matsubara/bert-large-uncased-rte \
--task_name rte \
--do_eval \
--do_train \
--max_seq_length 128 \
--per_device_eval_batch_size 16 \
--no_cuda \
--output_dir saved_results \
--tune \
--overwrite_output_dir
NOTE: saved_results is the path to the fine-tuned output_dir.
or
bash run_quant.sh --topology=topology_name --input_model=model_name_or_path
NOTE: Users can also install Open MPI with Conda.
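For example, Open MPI can be installed from the conda-forge channel (a minimal sketch; the channel and package name are assumptions and should be adjusted to your environment):
# install Open MPI from the conda-forge channel (assumed channel/package name)
conda install -c conda-forge openmpi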
In run_glue.py, set config.quant_level to 1 and config.tuning_criterion.strategy to "basic" with the following statements:
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion
tuning_criterion = TuningCriterion(max_trials=600, strategy="basic")
conf = PostTrainingQuantConfig(quant_level=1, approach="static", tuning_criterion=tuning_criterion)
Then run the following command:
mpirun -np <NUM_PROCESS> \
-mca btl_tcp_if_include <NETWORK_INTERFACE> \
-x OMP_NUM_THREADS=<MAX_NUM_THREADS> \
--host <HOSTNAME1>,<HOSTNAME2>,<HOSTNAME3> \
bash run_distributed_tuning.sh
- <NUM_PROCESS> is the number of processes; we recommend setting it equal to the number of hosts.
- <MAX_NUM_THREADS> is the number of threads; we recommend setting it equal to the number of physical cores on one node.
- <HOSTNAME> is a host name. The argument --host <HOSTNAME1>,<HOSTNAME2>,... can be replaced with --hostfile <HOSTFILE>, where each line in <HOSTFILE> is a host name.
- -mca btl_tcp_if_include <NETWORK_INTERFACE> sets the network interface used for MPI communication between hosts. For example, <NETWORK_INTERFACE> can be set to 192.168.20.0/24 to allow MPI communication between all hosts in the 192.168.20.* network segment.
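For instance, a concrete invocation for two hosts might look as follows (the host names, subnet and thread count below are illustrative placeholders, not recommended values):
# example: 2 hosts on the 192.168.20.* subnet, 56 physical cores per node (all values hypothetical)
mpirun -np 2 \
    -mca btl_tcp_if_include 192.168.20.0/24 \
    -x OMP_NUM_THREADS=56 \
    --host node1,node2 \
    bash run_distributed_tuning.sh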
# int8
bash run_benchmark.sh --topology=topology_name --mode=performance --int8=true --input_model=saved_results
# fp32
bash run_benchmark.sh --topology=topology_name --mode=performance --input_model=model_name_or_path
Topology Name | Model Name | Dataset/Task Name |
---|---|---|
bert_large_RTE | yoshitomo-matsubara/bert-large-uncased-rte | rte |
xlm-roberta-base_MRPC | Intel/xlm-roberta-base-mrpc | mrpc |
bert_base_MRPC | Intel/bert-base-uncased-mrpc | mrpc |
bert_base_CoLA | textattack/bert-base-uncased-CoLA | cola |
bert_base_STS-B | textattack/bert-base-uncased-STS-B | stsb |
bert_base_SST-2 | gchhablani/bert-base-cased-finetuned-sst2 | sst2 |
bert_base_RTE | ModelTC/bert-base-uncased-rte | rte |
bert_large_QNLI | textattack/bert-base-uncased-QNLI | qnli |
bert_large_CoLA | yoshitomo-matsubara/bert-large-uncased-cola | cola |
distilbert_base_MRPC | textattack/distilbert-base-uncased-MRPC | mrpc |
xlnet_base_cased_MRPC | Intel/xlnet-base-cased-mrpc | mrpc |
roberta_base_MRPC | textattack/roberta-base-MRPC | mrpc
camembert_base_MRPC | Intel/camembert-base-mrpc | mrpc |
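For example, to quantize the bert_base_MRPC topology from the table and then benchmark the result, assuming the quantized model is written to saved_results as in the earlier run_glue.py example:
bash run_quant.sh --topology=bert_base_MRPC --input_model=Intel/bert-base-uncased-mrpc
bash run_benchmark.sh --topology=bert_base_MRPC --mode=performance --int8=true --input_model=saved_results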
We provide an API, save_for_huggingface_upstream, to collect the configuration files, tokenizer files and INT8 model weights in the transformers format.
from neural_compressor.utils.load_huggingface import save_for_huggingface_upstream
save_for_huggingface_upstream(q_model, tokenizer, output_dir)
Users can upstream the files in output_dir to the model hub and reuse them with our OptimizedModel API.
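As a sketch of the upstream step, assuming the huggingface_hub package is installed and you are authenticated, the collected files can be pushed to the hub as follows (the repo_id below is a placeholder):
from huggingface_hub import HfApi

# push everything that save_for_huggingface_upstream wrote to output_dir
# (the repo_id is a placeholder; replace it with your own hub repository)
api = HfApi()
api.create_repo(repo_id="<username>/bert-base-uncased-mrpc-int8", exist_ok=True)
api.upload_folder(folder_path=output_dir, repo_id="<username>/bert-base-uncased-mrpc-int8")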
We provide an API, OptimizedModel, to initialize INT8 models from the HuggingFace model hub; its usage is the same as the model classes provided by transformers.
from neural_compressor.utils.load_huggingface import OptimizedModel
model = OptimizedModel.from_pretrained(
model_args.model_name_or_path,
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
We have also upstreamed several INT8 models to the HuggingFace model hub so that users can ramp up quickly.
- The user specifies the FP32 model, the calibration dataset q_dataloader, the evaluation dataset eval_dataloader and metrics (see the sketch after this list).
- The user specifies the FP32 model, the calibration dataset q_dataloader and a custom eval_func that encapsulates the evaluation dataset and metrics by itself (this is the approach used in the updated run_glue.py below).
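A minimal sketch of the first scenario follows; the Metric class usage and the eval_metric/eval_dataloader arguments to fit are assumptions about the Neural Compressor 2.x API and may need to be adapted to your installed version.
# sketch of scenario 1: let Neural Compressor drive evaluation with a dataloader and a metric
from neural_compressor import Metric, quantization
from neural_compressor.config import PostTrainingQuantConfig

conf = PostTrainingQuantConfig(approach="static")
accuracy = Metric(name="topk", k=1)  # top-1 accuracy as the tuning objective (assumed built-in metric name)
q_model = quantization.fit(
    model,                             # FP32 model
    conf=conf,
    calib_dataloader=eval_dataloader,  # calibration data (q_dataloader)
    eval_dataloader=eval_dataloader,   # evaluation data
    eval_metric=accuracy,
)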
We update run_glue.py as follows:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
eval_dataloader = trainer.get_eval_dataloader()
batch_size = eval_dataloader.batch_size
metric_name = "eval_f1"
def take_eval_steps(model, trainer, metric_name, save_metrics=False):
trainer.model = model
metrics = trainer.evaluate()
return metrics.get(metric_name)
def eval_func(model):
return take_eval_steps(model, trainer, metric_name)
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion
from neural_compressor.quantization import fit

tuning_criterion = TuningCriterion(max_trials=600)
conf = PostTrainingQuantConfig(approach="static", tuning_criterion=tuning_criterion, use_distributed_tuning=False)
q_model = fit(model, conf=conf, calib_dataloader=eval_dataloader, eval_func=eval_func)
q_model.save(training_args.output_dir)