
flash attention does not yield speed gains on llama example #2069

Open
jorgeantonio21 opened this issue Apr 15, 2024 · 1 comment

Comments

jorgeantonio21 (Contributor) commented Apr 15, 2024

After trying the llama example with either the cuda or flash-attn features, I realized the generation times are quite similar. I would expect flash attention to yield a significant improvement in token generation speed (at least according to the authors of the paper).

I am running these tests on an NVIDIA RTX 4090, using the following commands:

cargo run --release --features flash-attn --example llama -- --use-flash-attn --sample-len 1000

with

47.79993663930067 token/s

and

cargo run --release --features cuda --example llama -- --sample-len 1000

with

47.11094751627298 token/s

I have also experimented with the falcon 7b example and noticed the same lack of speed improvement.
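
For context, the token/s figure printed above is, as I understand it, roughly the number of generated tokens divided by the total elapsed wall-clock time, along these lines (a minimal sketch rather than the actual example code; `generate_next_token` is a hypothetical stand-in for the model forward pass plus sampling):

```rust
use std::time::Instant;

// Minimal sketch of a naive throughput measurement: the clock starts before
// the very first token, so any one-off cost (asynchronous model loading,
// kernel warm-up) is folded into the reported average.
fn naive_tokens_per_second(sample_len: usize, mut generate_next_token: impl FnMut() -> u32) -> f64 {
    let start = Instant::now();
    let mut generated = 0usize;
    for _ in 0..sample_len {
        let _token = generate_next_token(); // hypothetical stand-in for forward pass + sampling
        generated += 1;
    }
    generated as f64 / start.elapsed().as_secs_f64()
}
```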

jorgeantonio21 changed the title from "flash attention does not yield speed gains" to "flash attention does not yield speed gains on llama example" on Apr 15, 2024
LaurentMazare (Collaborator) commented

Sorry for the late reply. I think one issue with the measurement here is that we're including the time to generate the first token, which is bounded by the model being loaded in an asynchronous way. I've tweaked it in #2106 so that we only measure the time spent after the first token. With this change I get 68.5 token/s without flash-attn and 74.4 token/s with flash-attn (on an H100), so it's not a massive speedup, but it does have some effect.
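
Roughly, the adjusted measurement looks like this (a sketch under the same assumptions as the snippet above, not the exact code from #2106; `generate_next_token` is again a hypothetical stand-in):

```rust
use std::time::Instant;

// Sketch of the adjusted measurement: generate the first token before starting
// the clock, so the asynchronous model load / warm-up cost is excluded and the
// reported figure reflects steady-state generation speed.
fn steady_state_tokens_per_second(sample_len: usize, mut generate_next_token: impl FnMut() -> u32) -> f64 {
    let _first = generate_next_token(); // pays the one-off loading/warm-up cost
    let start = Instant::now();
    let mut generated = 0usize;
    for _ in 1..sample_len {
        let _token = generate_next_token();
        generated += 1;
    }
    generated as f64 / start.elapsed().as_secs_f64()
}
```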
