Flash attention gives a poor speedup, only about a 5% improvement in inference speed #49

Closed
zcuuu opened this issue Aug 4, 2023 · 5 comments

Comments


zcuuu commented Aug 4, 2023

Hi,
I followed the flash attention installation steps you provided and installed flash attention successfully.
At runtime the log also shows:
use flash_attn rotary
use flash_attn rms_norm

Testing on an A100 machine, installing flash attention only gives me an inference speedup of less than 5% compared to not installing it (measured as the generation time per token).
So I'd like to ask: in your internal tests, roughly how much of a performance improvement does flash attention bring?
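
For reference, a minimal sketch of how the per-token generation time mentioned above could be measured. It assumes a Hugging Face transformers causal LM and tokenizer that are already loaded on the GPU; the prompt and token count are arbitrary, and how flash attention is enabled depends on the model and repo version.

```python
# Minimal per-token latency sketch (assumes `model` and `tokenizer` are an
# already-loaded Hugging Face transformers causal LM / tokenizer on GPU).
import time
import torch

@torch.no_grad()
def avg_token_time(model, tokenizer, prompt, new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    # Force exactly `new_tokens` decoding steps so the per-token average is
    # comparable between runs with and without flash attention.
    model.generate(**inputs, max_new_tokens=new_tokens,
                   min_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / new_tokens
```

Comparing this number for the same model loaded with and without flash attention enabled gives the relative speedup discussed in this issue.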


jeffchy commented Aug 4, 2023

FlashAttention is mainly useful for training, where it reduces memory usage and improves speed. It generally doesn't help much for inference.

@jackaihfia2334

Same here, and I even used the latest flash-attention-2. Is it simply that flash-attn doesn't speed up inference noticeably?

@logicwong (Contributor)

It actually depends a lot on sequence length: the speedup is negligible for short sequences but quite noticeable for long ones. A quick test (see the sketch after this list):

  • with a context length around 5000, the speedup is over 20%;
  • with a context length around 50 (the case in the README), the speedup is under 5%.
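
A rough sketch of how such a context-length comparison could be reproduced, assuming a transformers version that accepts the attn_implementation argument and a model supported natively by transformers. The model id, filler prompt, and token counts are illustrative and are not the setup behind the numbers above.

```python
# Context-length sweep: compare generation time with "eager" attention vs
# "flash_attention_2". Model id and prompt construction are illustrative only.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # example model; adjust to your setup

def load(attn_impl):
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, attn_implementation=attn_impl
    )
    return model.cuda().eval()

@torch.no_grad()
def time_generate(model, tokenizer, ctx_len, new_tokens=32):
    # Build a prompt of roughly ctx_len tokens by repeating a short filler.
    prompt = "hello " * ctx_len
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens,
                   min_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
for ctx in (50, 5000):
    times = {}
    for impl in ("eager", "flash_attention_2"):
        model = load(impl)
        times[impl] = time_generate(model, tokenizer, ctx)
        del model
        torch.cuda.empty_cache()
    speedup = times["eager"] / times["flash_attention_2"]
    print(f"ctx~{ctx}: eager {times['eager']:.2f}s, "
          f"flash {times['flash_attention_2']:.2f}s, speedup {speedup:.2f}x")
```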

@logicwong (Contributor)

One addition: the experiment above only covered context length, but output length also matters a lot.
When the output length is small and the context length is large, the bottleneck is mainly the computation over the context (the prefill), which is where FlashAttention has an advantage;
when the output length is large, the bottleneck shifts to the autoregressive single-step decoding, and FlashAttention's speedup becomes much weaker.
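
To see which phase dominates in a given workload, prefill and decode can be timed separately. The sketch below assumes a standard transformers causal LM with a KV cache and uses a simple greedy decode loop; the names and token counts are illustrative.

```python
# Separate prefill (one forward pass over the whole context) from decode
# (autoregressive single-token steps reusing the KV cache).
import time
import torch

@torch.no_grad()
def prefill_vs_decode(model, tokenizer, prompt, new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Prefill: process the entire context at once. FlashAttention helps most here.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(**inputs, use_cache=True)
    torch.cuda.synchronize()
    prefill = time.perf_counter() - t0

    # Decode: one token per step with the cached keys/values (greedy decoding).
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(-1)
    t0 = time.perf_counter()
    for _ in range(new_tokens):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(-1)
    torch.cuda.synchronize()
    decode = (time.perf_counter() - t0) / new_tokens

    print(f"prefill: {prefill * 1000:.1f} ms total, "
          f"decode: {decode * 1000:.2f} ms/token")
```

If prefill dominates (long context, short output), flash attention should help noticeably; if the decode loop dominates (long output), the gain will be small.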

@123yanchangwei

It actually depends a lot on sequence length: the speedup is negligible for short sequences but quite noticeable for long ones. A quick test:

  • with a context length around 5000, the speedup is over 20%;
  • with a context length around 50 (the case in the README), the speedup is under 5%.

May I ask what environment you ran this in? On qwen2.5-1.5B and 32B, flash attention gives me roughly a 30% slowdown in both cases.
