Official codebase for Shortened LLaMA: A Simple Depth Pruning for Large Language Models [[ArXiv](https://arxiv.org/abs/2402.02834)] [[ICLR 2024 Workshop on ME-FoMo](https://openreview.net/forum?id=18VGxuOdpu)].
```bash
conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
```
The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation.
- Pruning criterion: PPL (first script) or Taylor+ (second script).
- 20% pruning of LLaMA-1-7B (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_llama-7b_crit-ppl.sh
  bash script/prune_llama-7b_crit-taylor.sh
  ```
- 20% pruning of Vicuna-7b-v1.3 (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_vicuna-7b_crit-ppl.sh
  bash script/prune_vicuna-7b_crit-taylor.sh
  ```
- 21% pruning of Vicuna-13b-v1.3 (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_vicuna-13b_crit-ppl.sh
  bash script/prune_vicuna-13b_crit-taylor.sh
  ```
- Pruning of CatPPT-base (based on `MistralForCausalLM`):

  ```bash
  bash script/prune_CatPPT_crit-ppl.sh
  bash script/prune_CatPPT_crit-taylor.sh
  ```
- Pruning of Gemma-2b (based on `GemmaForCausalLM`):

  ```bash
  bash script/prune_gemma-2b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-2b_crit-taylor_yesBOS.sh
  ```
- Pruning of Gemma-7b (based on `GemmaForCausalLM`):

  ```bash
  bash script/prune_gemma-7b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-7b_crit-taylor_yesBOS.sh
  ```
- Pruning of Llama-3-8B (based on `LlamaForCausalLM`):

  ```bash
  bash script/prune_llama3-8b_crit-ppl.sh
  bash script/prune_llama3-8b_crit-taylor.sh
  ```
After identifying unimportant Transformer blocks, we perform one-shot pruning and light LoRA-based retraining.
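The block-selection step can be sketched as follows. This is an illustrative, simplified version in pure Python, not the repository's actual code: it assumes one importance score per Transformer block (e.g., from the PPL or Taylor+ criterion) and drops the lowest-scoring blocks to reach the target pruning ratio.

```python
# Hypothetical sketch of depth-pruning block selection; function and
# variable names are illustrative, not the repository's actual API.

def select_blocks_to_keep(importance, prune_ratio):
    """Return indices of blocks kept after pruning `prune_ratio` of them."""
    n_blocks = len(importance)
    n_drop = round(n_blocks * prune_ratio)
    # Indices of the n_drop least-important blocks.
    drop = set(sorted(range(n_blocks), key=lambda i: importance[i])[:n_drop])
    # Surviving blocks keep their original order.
    return [i for i in range(n_blocks) if i not in drop]

# Toy example: 10 blocks, prune 20% -> the two lowest-importance blocks go.
scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.95, 0.5, 0.4, 0.3]
print(select_blocks_to_keep(scores, 0.2))  # [0, 2, 3, 5, 6, 7, 8, 9]
```

One-shot pruning then simply rebuilds the model's layer list from the kept indices before LoRA retraining recovers quality.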
- Available at 🤗 Hugging Face Models:

  | Source Model | Pruning Ratio | Pruning Criterion | HF Models Link |
  |---|---|---|---|
  | LLaMA-1-7B | 20% | PPL | [nota-ai/st-llama-1-5.5b-ppl](https://huggingface.co/nota-ai/st-llama-1-5.5b-ppl) |
  | LLaMA-1-7B | 20% | Taylor+ | [nota-ai/st-llama-1-5.5b-taylor](https://huggingface.co/nota-ai/st-llama-1-5.5b-taylor) |
  | Vicuna-v1.3-7B | 20% | PPL | [nota-ai/st-vicuna-v1.3-5.5b-ppl](https://huggingface.co/nota-ai/st-vicuna-v1.3-5.5b-ppl) |
  | Vicuna-v1.3-7B | 20% | Taylor+ | [nota-ai/st-vicuna-v1.3-5.5b-taylor](https://huggingface.co/nota-ai/st-vicuna-v1.3-5.5b-taylor) |
  | Vicuna-v1.3-13B | 21% | PPL | [nota-ai/st-vicuna-v1.3-10.5b-ppl](https://huggingface.co/nota-ai/st-vicuna-v1.3-10.5b-ppl) |
  | Vicuna-v1.3-13B | 21% | Taylor+ | [nota-ai/st-vicuna-v1.3-10.5b-taylor](https://huggingface.co/nota-ai/st-vicuna-v1.3-10.5b-taylor) |
- To measure (1) PPL on WikiText2 & PTB and (2) accuracy on seven commonsense reasoning tasks (with EleutherAI/lm-evaluation-harness at commit 3326c54), use:

  ```bash
  bash script/evaluate.sh
  ```
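As a reminder of what the PPL numbers mean, perplexity is the exponential of the mean negative log-likelihood per token. The snippet below is a minimal illustration, not the evaluation harness's implementation:

```python
import math

# Perplexity from per-token negative log-likelihoods (illustrative only).
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/4 has NLL ln(4) per token,
# so its perplexity is exactly 4 regardless of sequence length.
nlls = [math.log(4.0)] * 8
print(perplexity(nlls))  # 4.0
```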
- To test other pruning ratios, use:

  ```bash
  bash script/prune.sh
  ```
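To get a feel for what a given ratio means in block counts, the arithmetic below assumes a 32-block model such as LLaMA-1-7B; the scripts' exact rounding behavior may differ:

```python
# Back-of-envelope mapping from a target depth-pruning ratio to a block
# count, assuming 32 Transformer blocks (illustrative; actual scripts
# may round differently).
def blocks_to_drop(n_blocks, ratio):
    return round(n_blocks * ratio)

for ratio in (0.1, 0.2, 0.35):
    print(f"{ratio:.0%} -> drop {blocks_to_drop(32, ratio)} of 32 blocks")
```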
- To obtain baselines using the magnitude pruning criterion, use:

  ```bash
  bash script/prune_llama-7b_crit-magnitude.sh
  bash script/prune_vicuna-7b_crit-magnitude.sh
  bash script/prune_vicuna-13b_crit-magnitude.sh
  ```
- To measure latency & throughput, use:

  ```bash
  bash script/measure_time.sh
  ```
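A typical timing loop looks like the hedged sketch below (illustrative, not the script's code): warm-up runs are excluded, latency is wall time per generation, and throughput is generated tokens per second.

```python
import time

# Illustrative latency/throughput loop around a generation callable.
# On GPU, a real measurement would also call torch.cuda.synchronize()
# before reading the clock.
def measure(generate_fn, n_tokens, n_runs=5, warmup=2):
    for _ in range(warmup):              # warm-up runs, not timed
        generate_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    elapsed = time.perf_counter() - start
    latency = elapsed / n_runs                   # seconds per generation
    throughput = n_tokens * n_runs / elapsed     # tokens per second
    return latency, throughput

# Dummy stand-in for a model.generate call producing 128 tokens.
lat, tps = measure(lambda: time.sleep(0.001), n_tokens=128)
print(f"{lat:.4f} s/run, {tps:.0f} tok/s")
```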
- To measure VRAM requirements, use:

  ```bash
  bash script/measure_vram.sh
  ```
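For a rough sanity check on the script's numbers, weight memory alone is parameter count times bytes per parameter (this back-of-envelope estimate ignores activations and the KV cache):

```python
# Weight-only VRAM estimate in GiB; 2 bytes per parameter = fp16/bf16.
def weight_vram_gib(n_params, bytes_per_param=2):
    return n_params * bytes_per_param / 2**30

# A 5.5B-parameter pruned model in fp16 needs roughly 10 GiB for weights,
# versus about 12.5 GiB for the original ~6.7B-parameter LLaMA-1-7B.
print(round(weight_vram_gib(5.5e9), 1))  # 10.2
print(round(weight_vram_gib(6.7e9), 1))  # 12.5
```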
- To measure GPU compute utilization, use:

  ```bash
  bash script/measure_gpuutil.sh
  ```
The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (ours; depth pruning) for the LLaMA-1-7B model:

```bash
pip install transformers==4.33.1  # to run LLM-Pruner's model
python src/app.py
```
- All rights related to this repository and the compressed models are reserved by Nota Inc.
- The intended use is strictly limited to research and non-commercial projects.
- Microsoft for Startups Founders Hub and Gwangju AICA for generously providing GPU resources.
- LLM-Pruner, which utilizes LM Evaluation Harness, PEFT, and Alpaca-LoRA. Thanks for the pioneering work on structured pruning of LLMs!
- Meta AI's LLaMA and LMSYS Org's Vicuna. Thanks for the open-source LLMs!
```bibtex
@article{kim2024shortened,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}

@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}
```