Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li
In this work, we explore the phenomenon of block redundancy in existing LLMs and propose BlockPruner, a general block pruning framework. It first decomposes each Transformer layer into two minimal residual blocks: a multi-head attention (MHA) block and an MLP block. It then assesses each block with our proposed block importance metric, and iteratively prunes the block with the lowest importance.
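To make the decomposition concrete, here is a toy pre-norm decoder layer (a minimal PyTorch sketch, not the repository's code) in which the MHA and MLP residual blocks can be bypassed independently; pruning a block simply leaves the residual stream untouched:

```python
import torch
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    """Illustrative only: a pre-norm decoder layer viewed as two minimal
    residual blocks (MHA and MLP), each of which can be pruned on its own."""

    def __init__(self, d_model=256, n_heads=4, keep_mha=True, keep_mlp=True):
        super().__init__()
        self.keep_mha, self.keep_mlp = keep_mha, keep_mlp
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        if self.keep_mha:  # pruning the MHA block skips this residual branch
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        if self.keep_mlp:  # pruning the MLP block skips this one
            x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(1, 8, 256)
print(ToyDecoderLayer(keep_mha=False)(x).shape)  # layer still runs with its MHA block pruned
```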
We experiment with three series of models: Llama2, Baichuan2, and Qwen1.5, using the 7B and 13B variants for Llama2 and Baichuan2, and the 7B and 14B variants for Qwen1.5.
The full evaluation results are reported in our paper (arXiv:2406.10594).
To use and evaluate BlockPruner, first install the following dependencies:
```
torch==2.2.1
lm_eval==0.4.0  # provided in ./lm_eval
```
Below is the script for obtaining the pruning sequence for a given model:
```bash
export CUDA_VISIBLE_DEVICES=0

model_name=Llama-2-7b
nsamples=64
dataset=alpaca
block_num=20

python block_search.py \
    --model-path models/${model_name} \
    --block-type mix \
    --cal-nsamples ${nsamples} \
    --del-block-num ${block_num} \
    --cal-dataset ${dataset} \
    --ppl-search-path ppls \
    --ppl-eval-batch-size 2 \
    --device cuda
```
You can obtain pruning sequences for different block types by changing `block-type` to `mha`, `mlp`, or `mix`.
`del-block-num` sets the maximum number of blocks in the pruning sequence, typically constrained to about one-third of the total number of blocks in the model.
`nsamples` indicates the number of calibration samples used for the perplexity computation; we use 256 in the paper.
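Conceptually, the search performs a greedy loop like the following (an illustrative Python sketch, not the actual `block_search.py` implementation; `ppl_with_blocks_bypassed` is a hypothetical helper that evaluates perplexity on the calibration samples with the given blocks skipped):

```python
def greedy_block_search(blocks, del_block_num, ppl_with_blocks_bypassed):
    """Illustrative greedy search: repeatedly drop the block whose removal
    hurts calibration perplexity the least (i.e., the least important block)."""
    removed = []              # pruning sequence, in removal order
    remaining = list(blocks)  # e.g. [("mha", 0), ("mlp", 0), ("mha", 1), ...]
    for _ in range(del_block_num):
        # Score every remaining block by the perplexity after bypassing it
        # together with everything already removed.
        scores = {b: ppl_with_blocks_bypassed(removed + [b]) for b in remaining}
        victim = min(scores, key=scores.get)
        removed.append(victim)
        remaining.remove(victim)
    return removed
```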
We evaluate our pruning algorithm on five benchmarks: PIQA, WinoGrande, HellaSwag, ARC-c, and ARC-e. You can download and install the official `lm_eval` code, or use the version we provide in `./lm_eval`. Below is our evaluation script:
```bash
export CUDA_VISIBLE_DEVICES=0

model_name=Llama-2-7b
block_num=12
dataset=wikitext2
ppl_search_file=ppls/${model_name}_mix_alpaca_ns_64_del_order_list.json

python eval.py \
    --do-eval \
    --model-path models/${model_name} \
    --del-block-num ${block_num} \
    --cal-dataset ${dataset} \
    --ppl-search-file ${ppl_search_file} \
    --ppl-eval-batch-size 1 \
    --device cuda \
    --compute-dtype bf16
```
`ppl-search-file` is the pruning sequence file obtained in the previous step.
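If you want to inspect the pruning sequence itself, the file is plain JSON and can be loaded directly (a minimal sketch; the exact schema is whatever `block_search.py` writes):

```python
import json

# Path produced by the search script above.
path = "ppls/Llama-2-7b_mix_alpaca_ns_64_del_order_list.json"
with open(path) as f:
    del_order = json.load(f)

# e.g. the first few entries are the blocks pruned earliest
print(del_order[:5])
```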
If you find this work relevant to your research or applications, please feel free to cite our work!
```bibtex
@article{zhong2024blockpruner,
  title={BlockPruner: Fine-grained Pruning for Large Language Models},
  author={Zhong, Longguang and Wan, Fanqi and Chen, Ruijun and Quan, Xiaojun and Li, Liangzhi},
  journal={arXiv preprint arXiv:2406.10594},
  year={2024}
}
```