Hi all,
I've been playing with cudnn-frontend to test the Flash Attention kernel. Overall it's easy to use and fast, but I've come across a limitation that I don't really understand: it seems the kernel can't be used in paged mode with packed (ragged) tensors. Other paged attention kernels support this combination, and it makes a big difference in performance since tokens can be batched per sequence instead of padded.
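To make the combination concrete, here is a minimal sketch in plain NumPy (not the cudnn-frontend API) of the layout I mean: a packed/ragged query tensor with per-sequence offsets, plus a paged KV cache addressed through a page table. All shapes and names below are hypothetical, just for illustration.

```python
import numpy as np

num_heads, head_dim, page_size = 8, 128, 16
seq_lens = np.array([5, 12, 3])          # hypothetical per-sequence token counts
total_tokens = int(seq_lens.sum())

# Packed Q: all tokens of all sequences concatenated, no padding.
# Per-sequence boundaries are given by cumulative (ragged) offsets.
q_packed = np.random.randn(total_tokens, num_heads, head_dim).astype(np.float16)
q_offsets = np.concatenate([[0], np.cumsum(seq_lens)])   # [0, 5, 17, 20]

# Paged KV cache: a pool of fixed-size pages plus a page table that maps
# (sequence, logical page index) -> physical page id in the pool.
num_pages = 64
k_pool = np.zeros((num_pages, page_size, num_heads, head_dim), dtype=np.float16)
v_pool = np.zeros_like(k_pool)
pages_per_seq = (seq_lens + page_size - 1) // page_size
page_table = np.full((len(seq_lens), pages_per_seq.max()), -1, dtype=np.int32)
# ...page_table would then be filled with the physical page ids allocated
#    to each sequence by the cache manager...
```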
So two questions about that:
- Is this a limitation of cudnn-frontend only? I couldn't find any mention of such a limitation in the backend docs.
- Are there plans to add this feature in the future?