Flash attention gives little speedup: only about a 5% improvement in inference speed #49
Comments
Flash attention is mainly useful in training, where it reduces memory use and speeds things up; it generally doesn't help much for inference.
Same here, and I'm even using the latest flash-attention-2. Is it simply that flash-attn doesn't noticeably accelerate inference?
It actually depends a lot on sequence length: the speedup is marginal for short sequences but quite substantial for long ones. I ran a quick test.
One more point: the test above only varied the context length, but the output length also matters a great deal.
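For reference, here is a minimal sketch of this kind of length-dependent comparison. It is an illustrative assumption, not the test reported above: it requires PyTorch 2.3+ on a CUDA GPU, it times only the attention operator rather than a full model, and the head count, head dimension, and sequence lengths are placeholders.

```python
# Minimal sketch: compare the flash SDPA backend against the math backend
# at several sequence lengths to show why long contexts benefit more.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def time_attention(seq_len, backend, n_heads=32, head_dim=128, iters=20):
    # Random fp16 q/k/v tensors; shapes and dtypes are illustrative only.
    q = torch.randn(1, n_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
    k, v = q.clone(), q.clone()
    with sdpa_kernel(backend):
        # Warmup, then time with CUDA events.
        torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

for seq_len in (512, 2048, 8192):
    t_flash = time_attention(seq_len, SDPBackend.FLASH_ATTENTION)
    t_math = time_attention(seq_len, SDPBackend.MATH)
    print(f"seq={seq_len}: flash {t_flash:.2f} ms vs math {t_math:.2f} ms")
```

In a full model the gap is diluted by the MLP, embedding, and sampling cost, which is one reason end-to-end speedups look much smaller than attention-only numbers.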
What environment did you run this in? On qwen2.5-1.5B and 32B, flash attention gives me roughly a 30% slowdown.
Hi
I followed the flash attention installation steps you provide and installed flash attention successfully.
At runtime the log also shows:
use flash_attn rotary
use flash_attn rms_norm
Testing on an A100, installing flash attention gives less than a 5% inference speedup (measured as per-token generation latency) compared with not installing it.
So I'd like to ask: in your internal tests, roughly how much of a performance improvement does flash attention bring?
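For anyone trying to reproduce this kind of per-token measurement, below is a minimal sketch. It is not this repo's official benchmark and makes several assumptions: it uses the Hugging Face `attn_implementation` switch rather than this repo's own flash-attn rotary/rms_norm hooks, the model name and lengths are placeholders, and it requires a recent transformers release with flash-attn installed.

```python
# Minimal sketch: per-token generation latency with and without flash attention.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder model path

def per_token_latency(attn_impl, prompt_len=2048, new_tokens=256):
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL,
        torch_dtype=torch.float16,
        attn_implementation=attn_impl,  # "flash_attention_2" or "eager"
    ).cuda().eval()
    # Random prompt tokens of a fixed length; content does not matter for timing.
    ids = torch.randint(0, tok.vocab_size, (1, prompt_len), device="cuda")
    mask = torch.ones_like(ids)
    with torch.no_grad():
        model.generate(ids, attention_mask=mask, max_new_tokens=8)  # warmup
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(ids, attention_mask=mask,
                       min_new_tokens=new_tokens, max_new_tokens=new_tokens)
        torch.cuda.synchronize()
    # Averages prefill + decode time over generated tokens (an approximation).
    return (time.perf_counter() - start) / new_tokens * 1000  # ms/token

for impl in ("eager", "flash_attention_2"):
    print(impl, f"{per_token_latency(impl):.1f} ms/token")
```

Varying `prompt_len` and `new_tokens` should make the sequence-length dependence discussed in the comments above visible.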