Official codebase for Shortened LLaMA: A Simple Depth Pruning for Large Language Models [[ArXiv](https://arxiv.org/abs/2402.02834)] [[ICLR 2024 Workshop on ME-FoMo](https://openreview.net/forum?id=18VGxuOdpu)].
```bash
conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
```
The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation.
- Pruning criterion: PPL (first script) or Taylor+ (second script).
- 20% pruning of LLaMA-1-7B (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_llama-7b_crit-ppl.sh
  bash script/prune_llama-7b_crit-taylor.sh
  ```
- 20% pruning of Vicuna-7b-v1.3 (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_vicuna-7b_crit-ppl.sh
  bash script/prune_vicuna-7b_crit-taylor.sh
  ```
- 21% pruning of Vicuna-13b-v1.3 (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_vicuna-13b_crit-ppl.sh
  bash script/prune_vicuna-13b_crit-taylor.sh
  ```
- Pruning of CatPPT-base (based on `MistralForCausalLM`):

  ```bash
  bash script/prune_CatPPT_crit-ppl.sh
  bash script/prune_CatPPT_crit-taylor.sh
  ```
- Pruning of Gemma-2b (based on `GemmaForCausalLM`):

  ```bash
  bash script/prune_gemma-2b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-2b_crit-taylor_yesBOS.sh
  ```
- Pruning of Gemma-7b (based on `GemmaForCausalLM`):

  ```bash
  bash script/prune_gemma-7b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-7b_crit-taylor_yesBOS.sh
  ```
- Pruning of Llama-3-8B (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_llama3-8b_crit-ppl.sh
  bash script/prune_llama3-8b_crit-taylor.sh
  ```
After identifying unimportant Transformer blocks, we perform one-shot pruning and light LoRA-based retraining.
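The block-selection step can be sketched as follows. This is an illustrative, simplified version in pure Python, not the repository's actual code: it assumes one importance score per Transformer block (e.g., from the PPL or Taylor+ criterion) and drops the lowest-scoring blocks to reach the target pruning ratio.

```python
# Hypothetical sketch of depth-pruning block selection; function and
# variable names are illustrative, not the repository's actual API.

def select_blocks_to_keep(importance, prune_ratio):
    """Return indices of blocks kept after pruning `prune_ratio` of them."""
    n_blocks = len(importance)
    n_drop = round(n_blocks * prune_ratio)
    # Indices of the n_drop least-important blocks.
    drop = set(sorted(range(n_blocks), key=lambda i: importance[i])[:n_drop])
    # Surviving blocks keep their original order.
    return [i for i in range(n_blocks) if i not in drop]

# Toy example: 10 blocks, prune 20% -> the two lowest-importance blocks go.
scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.95, 0.5, 0.4, 0.3]
print(select_blocks_to_keep(scores, 0.2))  # [0, 2, 3, 5, 6, 7, 8, 9]
```

One-shot pruning then simply rebuilds the model's layer list from the kept indices before LoRA retraining recovers quality.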
- Available at 🤗 Hugging Face Models:

  | Source Model | Pruning Ratio | Pruning Criterion | HF Models Link |
  |---|---|---|---|
  | LLaMA-1-7B | 20% | PPL | [nota-ai/st-llama-1-5.5b-ppl](https://huggingface.co/nota-ai/st-llama-1-5.5b-ppl) |
  | LLaMA-1-7B | 20% | Taylor+ | [nota-ai/st-llama-1-5.5b-taylor](https://huggingface.co/nota-ai/st-llama-1-5.5b-taylor) |
  | Vicuna-v1.3-7B | 20% | PPL | [nota-ai/st-vicuna-v1.3-5.5b-ppl](https://huggingface.co/nota-ai/st-vicuna-v1.3-5.5b-ppl) |
  | Vicuna-v1.3-7B | 20% | Taylor+ | [nota-ai/st-vicuna-v1.3-5.5b-taylor](https://huggingface.co/nota-ai/st-vicuna-v1.3-5.5b-taylor) |
  | Vicuna-v1.3-13B | 21% | PPL | [nota-ai/st-vicuna-v1.3-10.5b-ppl](https://huggingface.co/nota-ai/st-vicuna-v1.3-10.5b-ppl) |
  | Vicuna-v1.3-13B | 21% | Taylor+ | [nota-ai/st-vicuna-v1.3-10.5b-taylor](https://huggingface.co/nota-ai/st-vicuna-v1.3-10.5b-taylor) |
- To measure (1) PPL on WikiText2 & PTB and (2) accuracy on seven commonsense reasoning tasks (with EleutherAI/lm-evaluation-harness at commit 3326c54), use:

  ```bash
  bash script/evaluate.sh
  ```
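As a reminder of what the PPL numbers mean, perplexity is the exponential of the mean negative log-likelihood per token. The snippet below is a minimal illustration, not the evaluation harness's implementation:

```python
import math

# Perplexity from per-token negative log-likelihoods (illustrative only).
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/4 has NLL ln(4) per token,
# so its perplexity is exactly 4 regardless of sequence length.
nlls = [math.log(4.0)] * 8
print(perplexity(nlls))  # 4.0
```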
- To test other pruning ratios, use:

  ```bash
  bash script/prune.sh
  ```
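To get a feel for what a given ratio means in block counts, the arithmetic below assumes a 32-block model such as LLaMA-1-7B; the scripts' exact rounding behavior may differ:

```python
# Back-of-envelope mapping from a target depth-pruning ratio to a block
# count, assuming 32 Transformer blocks (illustrative; actual scripts
# may round differently).
def blocks_to_drop(n_blocks, ratio):
    return round(n_blocks * ratio)

for ratio in (0.1, 0.2, 0.35):
    print(f"{ratio:.0%} -> drop {blocks_to_drop(32, ratio)} of 32 blocks")
```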
- To obtain baselines using the magnitude pruning criterion, use:

  ```bash
  bash script/prune_llama-7b_crit-magnitude.sh
  bash script/prune_vicuna-7b_crit-magnitude.sh
  bash script/prune_vicuna-13b_crit-magnitude.sh
  ```
- To measure latency & throughput, use:

  ```bash
  bash script/measure_time.sh
  ```
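A typical timing loop looks like the hedged sketch below (illustrative, not the script's code): warm-up runs are excluded, latency is wall time per generation, and throughput is generated tokens per second.

```python
import time

# Illustrative latency/throughput loop around a generation callable.
# On GPU, a real measurement would also call torch.cuda.synchronize()
# before reading the clock.
def measure(generate_fn, n_tokens, n_runs=5, warmup=2):
    for _ in range(warmup):              # warm-up runs, not timed
        generate_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    elapsed = time.perf_counter() - start
    latency = elapsed / n_runs                   # seconds per generation
    throughput = n_tokens * n_runs / elapsed     # tokens per second
    return latency, throughput

# Dummy stand-in for a model.generate call producing 128 tokens.
lat, tps = measure(lambda: time.sleep(0.001), n_tokens=128)
print(f"{lat:.4f} s/run, {tps:.0f} tok/s")
```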
- To measure VRAM requirements, use:

  ```bash
  bash script/measure_vram.sh
  ```
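For a rough sanity check on the script's numbers, weight memory alone is parameter count times bytes per parameter (this back-of-envelope estimate ignores activations and the KV cache):

```python
# Weight-only VRAM estimate in GiB; 2 bytes per parameter = fp16/bf16.
def weight_vram_gib(n_params, bytes_per_param=2):
    return n_params * bytes_per_param / 2**30

# A 5.5B-parameter pruned model in fp16 needs roughly 10 GiB for weights,
# versus about 12.5 GiB for the original ~6.7B-parameter LLaMA-1-7B.
print(round(weight_vram_gib(5.5e9), 1))  # 10.2
print(round(weight_vram_gib(6.7e9), 1))  # 12.5
```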
- To measure GPU compute utilization, use:

  ```bash
  bash script/measure_gpuutil.sh
  ```
The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (ours; depth pruning) for the LLaMA-1-7B model:

```bash
pip install transformers==4.33.1  # to run LLM-Pruner's model
python src/app.py
```
- All rights related to this repository and the compressed models are reserved by Nota Inc.
- The intended use is strictly limited to research and non-commercial projects.
- Microsoft for Startups Founders Hub and Gwangju AICA for generously providing GPU resources.
- LLM-Pruner, which utilizes LM Evaluation Harness, PEFT, and Alpaca-LoRA. Thanks for the pioneering work on structured pruning of LLMs!
- Meta AI's LLaMA and LMSYS Org's Vicuna. Thanks for the open-source LLMs!
```bibtex
@article{kim2024shortened,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}

@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}
```