Fast gradient checkpointing (FastCkpt) is designed to accelerate training with memory-efficient attention such as FlashAttention and LightSeq. FastCkpt provides monkey patches for both rematerialization-aware checkpointing and FlashAttention, so you can apply both in a single line!
Paper: https://arxiv.org/pdf/2310.03294.pdf
- [2023/10] FastCkpt now supports LlamaModel in Huggingface!
pip install fastckpt
FastCkpt now supports the HF training pipeline.
To use fastckpt with flash_attn, import and run replace_hf_ckpt_with_fast_ckpt before importing transformers:
# add monkey patch for fastckpt
from fastckpt.llama_flash_attn_ckpt_monkey_patch import replace_hf_ckpt_with_fast_ckpt
replace_hf_ckpt_with_fast_ckpt()
# import transformers and other packages
import transformers
...
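The patch above is applied before `import transformers`. One general reason ordering matters for monkey patches in Python is that a `from module import name` statement copies a reference at import time, so code that grabbed the name before the patch keeps calling the original. The following is a minimal sketch of that behavior using stand-in objects; `fake_lib`, `checkpoint`, and `fast_checkpoint` are hypothetical names, and whether this is the exact mechanism behind FastCkpt's ordering requirement is an assumption.

```python
import types

# Hypothetical stand-in for a library module that exposes a checkpoint function.
fake_lib = types.ModuleType("fake_lib")


def checkpoint(x):
    """Original implementation shipped by the (hypothetical) library."""
    return f"original({x})"


def fast_checkpoint(x):
    """Replacement implementation installed by the monkey patch."""
    return f"fast({x})"


fake_lib.checkpoint = checkpoint

# A consumer that binds the name BEFORE the patch,
# equivalent to `from fake_lib import checkpoint`.
early_binding = fake_lib.checkpoint

# Apply the monkey patch: rebind the attribute on the module object.
fake_lib.checkpoint = fast_checkpoint

# A consumer that binds the name AFTER the patch.
late_binding = fake_lib.checkpoint

early_result = early_binding("h")  # still reaches the original function
late_result = late_binding("h")    # reaches the replacement
```

Here `early_result` comes from the original function and `late_result` from the replacement, which is why applying a patch first is the safer default.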
To replace only LlamaAttention with flash_attn without changing the checkpointing strategy, import and run replace_llama_attn_with_flash_attn:
# add monkey patch for flash_attn only
from fastckpt.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
replace_llama_attn_with_flash_attn()
# import transformers and other packages
import transformers
...
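Swapping only the attention implementation can be pictured as rebinding a single method on a class while leaving the rest of the class untouched. The sketch below shows that pattern on a toy class; `ToyAttention`, `replacement_forward`, and `replace_toy_attn` are hypothetical names chosen for illustration, and this is not FastCkpt's actual implementation.

```python
class ToyAttention:
    """Hypothetical stand-in for an attention module."""

    def forward(self, scores):
        # Reference path: normalize the scores so they sum to 1.
        total = sum(scores)
        return [s / total for s in scores]

    def extra_repr(self):
        # Unrelated method that the patch should leave alone.
        return "ToyAttention()"


def replacement_forward(self, scores):
    # Alternative path producing the same normalized output;
    # it sets a flag so callers can tell which path ran.
    self.used_replacement = True
    total = 0.0
    for s in scores:
        total += s
    return [s / total for s in scores]


def replace_toy_attn():
    # Monkey patch: rebind only the `forward` method on the class.
    ToyAttention.forward = replacement_forward


layer = ToyAttention()
before = layer.forward([1.0, 1.0, 2.0])

replace_toy_attn()
after = layer.forward([1.0, 1.0, 2.0])
```

Because the method is rebound on the class, the already-created `layer` picks up the replacement, while `extra_repr` and the computed values stay the same.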
If you find this repo useful, please cite:
@article{li2023lightseq,
title={LIGHTSEQ: SEQUENCE LEVEL PARALLELISM FOR DISTRIBUTED TRAINING OF LONG CONTEXT TRANS},
author={Li, Dacheng and Shao, Rulin and Xie, Anze and Xing, Eric P and Gonzalez, Joseph E and Stoica, Ion and Ma, Xuezhe and Zhang, Hao},
journal={arXiv preprint arXiv:2310.03294},
year={2023}
}