Official PyTorch implementation of Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
The Outlier Suppression+ (OS+) effectively suppresses outliers in large language models for better quantization performance without extra inference burden. It first identifies the outlier asymmetric shape across channels and proposes a channel-wise shifting technique with a migration pattern to eliminate it. It then focuses on the outlier concentration phenomenon and proposes to scale down outlier channels toward an elaborate objective.
We assess the efficacy of our approach under both standard and fine-grained quantization settings. On standard one, OS+ achieves near-floating-point performance on 8-bit and 6-bit BERT, OPTs, BLOOM, and BLOOMZ. On fine-grained one, OS+ can surpass others by 9.41% on 4-bit LLaMA with per-token quantization and obtain lossless results on 4-bit OPT with per-group quantization.
In the following sections, Support gives supported models and quantization schemes, [Getting Started](#getting started) introduces the whole procedure to run this project including data preparation, quantization, evaluation and updated model export. Evaluation lists configs for each table in the paper for other researchers to reproduce.
-
Models: detailed analysis of outliers can be found in outlier_analysis.md
- OPTs
- BLOOM, BLOOMZ
- LLaMA
-
Quantization scheme
- Standard quantization (per-channel vs per-tensor, symmetric vs asymmetric)
- Per-token quantization proposed in ZeroQuant
- Per-group quantization proposed in ZeroQuant-V2
-
Bits
- 8-bit
- 6-bit
- 4-bit
-
Prepare A100 to run the code.
-
Install transformers (version 4.30.0) and datasets (version 2.13.1).
-
Prepare pre-trained models such as OPTs, BLOOMs.
-
For codes of FP tasks, we adopt the implementation in lm-evaluation-harness. Thus, download your dataset and update your own data path in lm_eval/dataset directory for later evaluation.
-
Considering quantization on zero-shot tasks here, additionally prepare the pile datasets (part of train datasets) for later calibration. Please first download the PILE dataset to ${pile_path}, and run the following code which will randomly select 128 samples for calibration.
dataset_path = ${pile_path} calibration_size = 128 raw_dataset = load_dataset("json", data_files=dataset_path, split="train") random_rows = random.sample(range(raw_dataset.num_rows), calibration_size) raw_dataset = raw_dataset.select(random_rows) raw_dataset.save_to_disk(${pile_cali_path})
Conduct quantization on specific models and with certain quantization schemes. Here is a simple example for OPT-66B and 8-bit per-tensor symmetric quantization.
# opt.sh
model_size=66b
task_name=winogrande
model_path=model_path
q_config=exp/opt/int8.yaml # q_config.yaml path
is_quant=True # True: quantization; False: fp16
export CUDA_VISIBLE_DEVICES=0,1,2,3
python main.py \
--model opt \
--model_args pretrained=$model_path \
--tasks $task_name \
--batch_size 16 \
--no_cache \
--dtype 'float16' \
--is_quant ${is_quant} \
--config ${q_config} \
2>&1 | tee experiment/opt_${model_size}_${task_name}.log
Model is assigned with --model and --model_args pretrained args. Quantization config is assigned with q_config. The task is assigned with task_name. By running the above command, you will get an accuracy of 69.0.
As the method only updates weights and biases of the floating-point model, we can easily export a new FP model with weaker outliers, enjoying convenience for further development. Here is an example to export opt-6.7B.
- Obtain the scaling and shifting values by turning on the "is_export" choice in opt.sh file. It will store the calculated values in the ${quant_dir}.
# opt.sh
model_size=6.7b
task_name=winogrande
model_path=model_path
q_config=exp/opt/int8.yaml # q_config.yaml path
is_quant=True # True: quantization; False: fp16
export CUDA_VISIBLE_DEVICES=0
python main.py \
--model opt \
--model_args pretrained=$model_path \
--tasks $task_name \
--batch_size 16 \
--no_cache \
--dtype 'float16' \
--is_quant ${is_quant} \
--is_export \
--config ${q_config} \
2>&1 | tee experiment/opt_${model_size}_${task_name}.log
- Use them to update the original FP model by running export.sh
# export.sh
model_size=6.7b # model size
model_type=opt # model type
model_path=model_path # original model path
output_path=output_path # new FP model path
scale_shift_list=exp/opt/scale_shift_list.pth # scaling and shifting values path
export CUDA_VISIBLE_DEVICES=0
python quant_transformer/solver/export.py \
--model_path $model_path \
--scale_shift_list $scale_shift_list \
--model_type $model_type \
--output_path $output_path
quant:
a_qconfig: # quantization details for activation
quantizer: FixedFakeQuantize # quantizer type
observer: AvgMinMaxObserver # calibration methods
bit: 8 # bit selection
symmetric: True # True: symmetric quantization, False: asymmetric one
ch_axis: -1 # -1: per-layer quantization
w_qconfig: # quantization details for weight
quantizer: FixedQuantize # Quantizer type
observer: MinMaxObserver # calibration methods
bit: 8 # bit selection
symmetric: True # True: symmetric quantization, False: asymmetric one
ch_axis: -1 # 0: per-channel quantization, -1: per-layer one
calibrate: 128 # calibration size
calibrate_path: /mnt/lustre/weixiuying.vendor/datasets/nlp_datasets/pile_cali # calibration dataset path, make sure there is _cali in the name
except_quantizer: null
is_remove_padding: True # True: remove [PAD] during calibration
migrate: True # True: shifting and scaling operations, False: no shifting and scaling operations.
model:
max_length: 512 # For PIQA, Winogrande tasks, 512 is enough. For WikiText2, a longer one can improve FP16 results.
-
8-bit and 6-bit OPTs (Table 2 and Table 9)
opt.sh:
model_size=66b task_name=winogrande model_path=model_path q_config=exp/opt/int8.yaml # q_config.yaml path is_quant=True # True: quantization; False: fp16 export CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \ --model opt \ --model_args pretrained=$model_path \ --tasks $task_name \ --batch_size 16 \ --no_cache \ --dtype 'float16' \ --is_quant ${is_quant} \ --config ${q_config} \ 2>&1 | tee experiment/opt_${model_size}_${task_name}.log
q_config: exp/opt/int8.yaml exp/opt/int6.yaml
-
8-bit and 6-bit BLOOM and BLOOMZ (Table 2)
bloom.sh:
model_size=176b task_name=winogrande model_path=model_path q_config=exp/bloom/int8.yaml # quantization config path is_quant=True # True: quantization; False: fp16 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py \ --model bloom \ --model_args pretrained=$model_path \ --tasks ${task_name} \ --batch_size 12 \ --no_cache \ --dtype 'float16' \ --is_quant ${is_quant} \ --config ${q_config} \ 2>&1 | tee experiment/bloom_${model_size}_${task_name}.log
q_config: exp/bloom/int8.yaml exp/bloom/int6.yaml exp/bloomz/int8.yaml exp/bloomz/int6.yaml
Here, OS+ is combined with fine-grained quantization to validate its wide application and go extremely low bit setting like 4-bit quantization.
-
6-bit and 4-bit LLaMA with per-token quantization (Table 3 and Table 10)
Per-token quantization which customizes quantization parameters for individual tokens, can bring better predictions, especially for lower-bit quantization and longer output like WikiText2.
llama.sh:
model_size=7b task_name=piqa model_path=model_path q_config=exp/llama/int6_token.yaml # quantization config path is_quant=True # True: quantization; False: fp16 export CUDA_VISIBLE_DEVICES=0 python main.py \ --model llama \ --model_args pretrained=${model_path} \ --tasks ${task_name} \ --batch_size 16 \ --no_cache \ --dtype 'bfloat16' \ --is_quant ${is_quant} \ --config ${q_config}" 2>&1 | tee experiment/llama_${model_size}_${task_name}.log
q_config: exp/llama/int6_token_disable.yaml exp/llama/int4_token_disable.yaml exp/llama/int6_token.yaml exp/llama/int4_token.yaml
-
4-bit OPT with per-group quantization (Figure 3)
opt.sh:
the same with the one in standard quantization
q_config:
exp/opt/int4_group.yaml
If you find this repo useful for your research, please consider citing the paper:
@article{wei2023outlier,
title={Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling},
author={Wei, Xiuying and Zhang, Yunchen and Li, Yuhang and Zhang, Xiangguo and Gong, Ruihao and Guo, Jinyang and Liu, Xianglong},
journal={arXiv preprint arXiv:2304.09145},
year={2023}
}