
Why does using flash attention in the inference stage lead to slower inference? #27

Closed
xyfZzz opened this issue Jul 24, 2023 · 2 comments

xyfZzz commented Jul 24, 2023

Hi, I've seen the eval scripts mention that using flash attention will be slower. I'm wondering why using flash attention at the inference stage leads to a slowdown, since my impression is that flash attention speeds things up.

parser.add_argument("--longchat_flash_attn", action='store_true', help="Only apply to longchat models. Whether to enable flash attention to save memory, but slower.")

DachengLi1 (Owner) commented

@xyfZzz Flash attention by itself does not support a kv_cache, so our naive implementation recomputes the keys and values over the full sequence at every decoding step. We have a member on the vLLM team working on better support.
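
To make the cost difference concrete, here is a minimal PyTorch sketch of one decoding step with and without a KV cache. This is not the LongChat code; it assumes standard scaled dot-product attention, and the function names (`cached_decode_step`, `naive_decode_step`) are illustrative.

```python
import torch
import torch.nn.functional as F

def cached_decode_step(q_new, k_new, v_new, k_cache, v_cache):
    # With a KV cache: append the new key/value and attend once.
    # Per-step cost grows linearly with the current sequence length.
    k_cache = torch.cat([k_cache, k_new], dim=1)          # (B, T+1, D)
    v_cache = torch.cat([v_cache, v_new], dim=1)          # (B, T+1, D)
    scores = q_new @ k_cache.transpose(-2, -1)            # (B, 1, T+1)
    attn = F.softmax(scores / k_cache.shape[-1] ** 0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache               # (B, 1, D)

def naive_decode_step(hidden_states, wq, wk, wv):
    # Without a cache (the naive path described above): re-project and
    # re-attend over the entire prefix at every step, so per-step cost
    # grows quadratically with the sequence length.
    # (Causal masking is omitted here for brevity.)
    q = hidden_states @ wq                                # (B, T, D), recomputed each step
    k = hidden_states @ wk
    v = hidden_states @ wv
    scores = q @ k.transpose(-2, -1)                      # (B, T, T)
    attn = F.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                       # (B, T, D)
```

With the cache, generating n tokens costs O(n²) attention work in total; recomputing everything from scratch at each step raises that to O(n³), which is why the flag's help text warns that this path is slower despite flash attention's kernel-level speedups.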

xyfZzz (Author) commented Jul 24, 2023

I understand. Thank you for your explanation!
