Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li
In this work, we explore the phenomenon of block redundancy in existing LLMs and propose BlockPruner, a general block pruning framework. It first decomposes each Transformer layer into two minimal residual blocks: a multi-head attention (MHA) block and an MLP block. It then assesses each block with our proposed block importance metric, and iteratively prunes the block with the lowest importance.
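To make the decomposition concrete, here is a toy pre-norm decoder layer (a minimal PyTorch sketch, not the repository's code) in which the MHA and MLP residual blocks can be bypassed independently; pruning a block simply leaves the residual stream untouched:

```python
import torch
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    """Illustrative only: a pre-norm decoder layer viewed as two minimal
    residual blocks (MHA and MLP), each of which can be pruned on its own."""

    def __init__(self, d_model=256, n_heads=4, keep_mha=True, keep_mlp=True):
        super().__init__()
        self.keep_mha, self.keep_mlp = keep_mha, keep_mlp
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        if self.keep_mha:  # pruning the MHA block skips this residual branch
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        if self.keep_mlp:  # pruning the MLP block skips this one
            x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(1, 8, 256)
print(ToyDecoderLayer(keep_mha=False)(x).shape)  # layer still runs with its MHA block pruned
```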
We experiment with three series of models: Llama2, Baichuan2, and Qwen1.5, using the 7B and 13B variants for Llama2 and Baichuan2, and the 7B and 14B variants for Qwen1.5.
The full evaluation results are reported in our paper (arXiv:2406.10594).
To use and evaluate BlockPruner, first install the following dependencies:
```
torch==2.2.1
lm_eval==0.4.0  # provided in ./lm_eval
```
Below is the script for obtaining the pruning sequence for a given model:
```bash
export CUDA_VISIBLE_DEVICES=0

model_name=Llama-2-7b
nsamples=64
dataset=alpaca
block_num=20

python block_search.py \
    --model-path models/${model_name} \
    --block-type mix \
    --cal-nsamples ${nsamples} \
    --del-block-num ${block_num} \
    --cal-dataset ${dataset} \
    --ppl-search-path ppls \
    --ppl-eval-batch-size 2 \
    --device cuda
```
You can obtain pruning sequences for different block types by changing `block-type` to `mha`, `mlp`, or `mix`.
`del-block-num` sets the maximum number of blocks in the pruning sequence, typically constrained to about one-third of the total number of blocks in the model.
`nsamples` indicates the number of calibration samples used for the perplexity computation; we use 256 in the paper.
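Conceptually, the search performs a greedy loop like the following (an illustrative Python sketch, not the actual `block_search.py` implementation; `ppl_with_blocks_bypassed` is a hypothetical helper that evaluates perplexity on the calibration samples with the given blocks skipped):

```python
def greedy_block_search(blocks, del_block_num, ppl_with_blocks_bypassed):
    """Illustrative greedy search: repeatedly drop the block whose removal
    hurts calibration perplexity the least (i.e., the least important block)."""
    removed = []              # pruning sequence, in removal order
    remaining = list(blocks)  # e.g. [("mha", 0), ("mlp", 0), ("mha", 1), ...]
    for _ in range(del_block_num):
        # Score every remaining block by the perplexity after bypassing it
        # together with everything already removed.
        scores = {b: ppl_with_blocks_bypassed(removed + [b]) for b in remaining}
        victim = min(scores, key=scores.get)
        removed.append(victim)
        remaining.remove(victim)
    return removed
```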
We evaluate our pruning algorithm on five benchmarks: PIQA, WinoGrande, HellaSwag, ARC-c, and ARC-e. You can download and install the official `lm_eval` code, or use the version we provide in `./lm_eval`. Below is our evaluation script:
```bash
export CUDA_VISIBLE_DEVICES=0

model_name=Llama-2-7b
block_num=12
dataset=wikitext2
ppl_search_file=ppls/${model_name}_mix_alpaca_ns_64_del_order_list.json

python eval.py \
    --do-eval \
    --model-path models/${model_name} \
    --del-block-num ${block_num} \
    --cal-dataset ${dataset} \
    --ppl-search-file ${ppl_search_file} \
    --ppl-eval-batch-size 1 \
    --device cuda \
    --compute-dtype bf16
```
`ppl-search-file` is the pruning sequence file obtained in the previous step.
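If you want to inspect the pruning sequence itself, the file is plain JSON and can be loaded directly (a minimal sketch; the exact schema is whatever `block_search.py` writes):

```python
import json

# Path produced by the search script above.
path = "ppls/Llama-2-7b_mix_alpaca_ns_64_del_order_list.json"
with open(path) as f:
    del_order = json.load(f)

# e.g. the first few entries are the blocks pruned earliest
print(del_order[:5])
```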
If you find this work relevant to your research or applications, please feel free to cite our work!
```bibtex
@article{zhong2024blockpruner,
  title={BlockPruner: Fine-grained Pruning for Large Language Models},
  author={Zhong, Longguang and Wan, Fanqi and Chen, Ruijun and Quan, Xiaojun and Li, Liangzhi},
  journal={arXiv preprint arXiv:2406.10594},
  year={2024}
}
```