After trying the llama example with either the cuda or the flash-attn feature, I realized the generation times are quite similar. I would expect flash attention to give a significant improvement in token generation speed (at least, according to the authors of the paper).

I am running these tests on an NVIDIA RTX 4090, using the commands:

cargo run --release --features flash-attn --example llama -- --use-flash-attn --sample-len 1000

with

47.79993663930067 token/s

and

cargo run --release --features cuda --example llama -- --sample-len 1000

with

47.11094751627298 token/s

I have also experimented with the falcon 7b example, and noticed the same lack of speed improvement.

Sorry for the late reply. I think one issue with the measurement here might be that we're including the time to generate the first token, which is bounded by the model being loaded in an asynchronous way. I've tweaked it in #2106 so that we only measure the time spent after the first token. With this change I get 68.5 token/s without flash-attn and 74.4 token/s with flash-attn (on an H100), so not a massive speedup, but it seems to have some effect.
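For reference, a minimal sketch of the timing approach described above: start the clock only after the first token, so that model loading and other one-time startup costs don't get mixed into the throughput number. The `generate_next_token` function here is a hypothetical placeholder for the example's forward-pass-and-sample step, not candle's actual API, and this is only an approximation of the change in #2106, not the patch itself.

```rust
use std::time::Instant;

// Hypothetical stand-in for one decoding step (model forward pass + sampling).
// In the real llama example this would run the model on the current token.
fn generate_next_token(step: usize) -> u32 {
    step as u32
}

fn main() {
    let sample_len: usize = 1000;
    let mut start_gen: Option<Instant> = None;
    let mut generated = 0usize;

    for step in 0..sample_len {
        let _token = generate_next_token(step);
        if step == 0 {
            // Start timing only after the first token: its latency is dominated
            // by the asynchronous model load / one-time setup, not by the
            // steady-state decoding speed we want to compare.
            start_gen = Some(Instant::now());
        } else {
            generated += 1;
        }
    }

    if let Some(start) = start_gen {
        let dt = start.elapsed().as_secs_f64();
        println!("{:.2} token/s", generated as f64 / dt);
    }
}
```

Measured this way, the flash-attn and cuda runs are compared on decoding throughput alone rather than on startup time.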