After trying the llama example with either the cuda or the flash-attn feature, I realized the generation times are quite similar. I would expect flash attention to give a significant improvement in token generation speed (at least, according to the authors of the paper).

I am running these tests on an NVIDIA RTX 4090, using the commands:

cargo run --release --features flash-attn --example llama -- --use-flash-attn --sample-len 1000

with

47.79993663930067 token/s

and

cargo run --release --features cuda --example llama -- --sample-len 1000

with

47.11094751627298 token/s

I have also experimented with the falcon 7b example, and noticed the same lack of speed improvement.

Sorry for the late reply. I think one issue with the measurement here might be that we're including the time to generate the first token, which is bounded by the model being loaded in an asynchronous way. I've tweaked it in #2106 so that we only measure the time spent after the first token. With this change I get 68.5 token/s without flash-attn and 74.4 token/s with flash-attn (on an H100), so not a massive speedup, but it seems to have some effect.
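For reference, a minimal sketch of the timing approach described above: start the clock only after the first token, so that model loading and other one-time startup costs don't get mixed into the throughput number. The `generate_next_token` function here is a hypothetical placeholder for the example's forward-pass-and-sample step, not candle's actual API, and this is only an approximation of the change in #2106, not the patch itself.

```rust
use std::time::Instant;

// Hypothetical stand-in for one decoding step (model forward pass + sampling).
// In the real llama example this would run the model on the current token.
fn generate_next_token(step: usize) -> u32 {
    step as u32
}

fn main() {
    let sample_len: usize = 1000;
    let mut start_gen: Option<Instant> = None;
    let mut generated = 0usize;

    for step in 0..sample_len {
        let _token = generate_next_token(step);
        if step == 0 {
            // Start timing only after the first token: its latency is dominated
            // by the asynchronous model load / one-time setup, not by the
            // steady-state decoding speed we want to compare.
            start_gen = Some(Instant::now());
        } else {
            generated += 1;
        }
    }

    if let Some(start) = start_gen {
        let dt = start.elapsed().as_secs_f64();
        println!("{:.2} token/s", generated as f64 / dt);
    }
}
```

Measured this way, the flash-attn and cuda runs are compared on decoding throughput alone rather than on startup time.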