Shortened LLM by Nota AI

Official codebase for Shortened LLaMA: A Simple Depth Pruning for Large Language Models [ArXiv] [ICLR 2024 Workshop on ME-FoMo].

Installation

conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
Note on package versions:
  • Parts of the following repositories are included for evaluation:
    • src/LLMPruner: horseee/LLM-Pruner version 213ffa4
    • src/lm_eval: EleutherAI/lm-evaluation-harness version 3326c54
  • Torch version used in our experiments: 2.0.1 for RTX3090 & A100; 2.1.1 for H100.
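A quick way to confirm your environment matches these versions (this snippet is just a convenience check, not part of the repository):

# check_env.py -- hypothetical helper, not included in this repository
import torch
import transformers

print("torch:", torch.__version__)                 # expected 2.0.1 (RTX3090/A100) or 2.1.1 (H100), per the note above
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())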

Examples

The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation. A minimal sketch of loading the resulting checkpoint follows the list below.

  • In each pair of commands, the first script uses the PPL pruning criterion and the second uses Taylor+.
  • 20% pruning of LLaMA-1-7b (based on LlamaForCausalLM)
    bash script/prune_llama-7b_crit-ppl.sh
    bash script/prune_llama-7b_crit-taylor.sh
  • 20% pruning of Vicuna-7b-v1.3 (based on LlamaForCausalLM)
    bash script/prune_vicuna-7b_crit-ppl.sh
    bash script/prune_vicuna-7b_crit-taylor.sh
  • 21% pruning of Vicuna-13b-v1.3 (based on LlamaForCausalLM)
    bash script/prune_vicuna-13b_crit-ppl.sh
    bash script/prune_vicuna-13b_crit-taylor.sh
  • pruning of CatPPT-base (based on MistralForCausalLM)
    bash script/prune_CatPPT_crit-ppl.sh
    bash script/prune_CatPPT_crit-taylor.sh
  • pruning of Gemma-2b (based on GemmaForCausalLM)
    bash script/prune_gemma-2b_crit-ppl_yesBOS.sh
    bash script/prune_gemma-2b_crit-taylor_yesBOS.sh
  • pruning of Gemma-7b (based on GemmaForCausalLM)
    bash script/prune_gemma-7b_crit-ppl_yesBOS.sh
    bash script/prune_gemma-7b_crit-taylor_yesBOS.sh
  • pruning of Llama-3-8B (based on LlamaForCausalLM)
    bash script/prune_llama3-8b_crit-ppl.sh
    bash script/prune_llama3-8b_crit-taylor.sh
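
Once a script finishes, the pruned and retrained model can typically be loaded with the standard transformers APIs. This is only a minimal sketch, assuming the output is saved as a regular Hugging Face checkpoint; the directory name below is an assumption and should be replaced with the path your script actually writes to.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "output/prune_llama-7b_crit-ppl"  # hypothetical output directory; use your script's actual output path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto")
model.eval()

prompt = "Depth pruning removes entire Transformer blocks, so"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))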

Model Description

After identifying unimportant Transformer blocks, we perform one-shot pruning and light LoRA-based retraining.

A figure illustrating the method is provided in the repository.
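
To make the idea concrete, here is a minimal sketch of one-shot depth (block) pruning on a LlamaForCausalLM. The model path and block indices below are placeholders; in practice the indices come from an importance criterion such as PPL or Taylor+, and the pruned model is then lightly retrained with LoRA.

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/llama-1-7b")  # placeholder model path

# Indices of Transformer blocks judged unimportant (placeholder values, not a real ranking).
unimportant = {24, 25, 26, 27, 28, 29}

kept = [block for i, block in enumerate(model.model.layers) if i not in unimportant]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)  # keep the config consistent with the shrunken model

# The shrunken model is then briefly retrained with LoRA before zero-shot evaluation.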

Model Links

Zero-shot Evaluation

  • To measure (1) PPL on WikiText2 & PTB and (2) accuracy on seven commonsense reasoning tasks, via EleutherAI/lm-evaluation-harness version 3326c54, use the script below (a rough PPL sketch also follows):

    bash script/evaluate.sh
    Zero-shot result tables are provided in the repository.
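
The script above, which wraps lm-evaluation-harness, is the reference. Purely as an illustration of what the perplexity part measures, here is a rough standalone sketch of WikiText2 PPL with non-overlapping 2048-token windows; it is not the exact protocol used by the harness, and the checkpoint path is a placeholder.

import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/pruned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto").eval()

# Concatenate the WikiText2 test split into one long token sequence.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

window, nlls = 2048, []
for i in range(0, ids.size(1) - window + 1, window):
    chunk = ids[:, i : i + window]
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss)  # mean negative log-likelihood over the window
print("WikiText2 PPL:", math.exp(torch.stack(nlls).mean().item()))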

Other Scripts

  • To test other pruning ratios, use:

    bash script/prune.sh
  • To obtain baselines using the magnitude pruning criterion, use:

    bash script/prune_llama-7b_crit-magnitude.sh
    bash script/prune_vicuna-7b_crit-magnitude.sh
    bash script/prune_vicuna-13b_crit-magnitude.sh
  • To measure latency & throughput (a standalone measurement sketch also follows this list), use:

    bash script/measure_time.sh
  • To measure VRAM requirements, use:

    bash script/measure_vram.sh
  • To measure GPU compute utilization, use:

    bash script/measure_gpuutil.sh
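
The measurement scripts above are the reference. As a rough standalone sketch, generation latency, throughput, and peak VRAM for a single checkpoint can be estimated as follows; the checkpoint path is a placeholder, and the numbers will not exactly match the scripts' methodology.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/pruned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("Depth pruning removes entire Transformer blocks.", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()

# Time a single generation call with CUDA events.
start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
start.record()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
end.record()
torch.cuda.synchronize()

elapsed_s = start.elapsed_time(end) / 1000                         # milliseconds -> seconds
new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"latency: {elapsed_s:.2f} s, throughput: {new_tokens / elapsed_s:.1f} tok/s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")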

Gradio Demo: Width✄ vs. Depth✄

The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (Ours; depth pruning) for the LLaMA-1-7B model:

pip install transformers==4.33.1 # to run LLM-Pruner's model
python src/app.py
A demo screenshot (captured on an A100 80GB GPU) is provided in the repository.
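
src/app.py is the actual demo; the following is only a minimal sketch of the side-by-side idea, with both checkpoint paths as placeholders.

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(path):
    tok = AutoTokenizer.from_pretrained(path)
    mdl = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map="auto").eval()
    return tok, mdl

width_tok, width_model = load("path/to/llm-pruner-model")   # width-pruned model (LLM-Pruner), placeholder path
depth_tok, depth_model = load("path/to/shortened-llama")    # depth-pruned model (ours), placeholder path

def generate(prompt):
    outputs = []
    for tok, mdl in ((width_tok, width_model), (depth_tok, depth_model)):
        ids = tok(prompt, return_tensors="pt").to(mdl.device)
        with torch.no_grad():
            out = mdl.generate(**ids, max_new_tokens=64)
        outputs.append(tok.decode(out[0], skip_special_tokens=True))
    return outputs[0], outputs[1]

gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt"),
    outputs=[gr.Textbox(label="Width-pruned (LLM-Pruner)"), gr.Textbox(label="Depth-pruned (Shortened LLaMA)")],
).launch()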

License

  • All rights related to this repository and the compressed models are reserved by Nota Inc.
  • The intended use is strictly limited to research and non-commercial projects.

Acknowledgments

Citation

@article{kim2024shortened,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},      
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}
@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}