
Support for packed layout with paged attention #132

Open
Corendos opened this issue Mar 4, 2025 · 10 comments

Comments


Corendos commented Mar 4, 2025

Hi all,

I've been playing with cudnn-frontend to test the Flash Attention kernel. Overall, it's easy to use and fast, but I've come across a limitation that I don't really understand.

It seems that the kernel can't be used in paged mode with packed tensors. This is something that other paged attention kernels support (and it makes a big difference in performance, since tokens can be batched per sequence).

So, two questions about that:

  1. Is this a limitation only in cudnn-frontend? I couldn't find such a limitation in the backend documentation.
  2. Are there plans to add this feature in the future?
@nvmbreughe

Hi @Corendos,

Thanks for your question. We are actually enabling this in the upcoming cuDNN frontend release. It's roughly a week out, so if you'd like to enable it earlier, you can just comment out L336 in scaled_dot_product_flash_attention.h and manually verify that your backend cuDNN version is v9.7 or later.

Please note that in this case only Q will have packed support. In theory, packing could be combined with paged K and V caches: it would then be the page tables themselves that would be packed. However, the amount of compression you get from this is minimal. But please let us know if you disagree or if you see other good use cases for that!
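
As a sanity check, here is a minimal sketch of that manual version verification (assuming cuDNN 9.x's major * 10000 + minor * 100 + patch version encoding, e.g. 90701 for 9.7.1, consistent with the "v90800" that shows up in the logs later in this thread):

#include <cudnn.h>
#include <cstdio>

int main() {
    // cuDNN 9.x encodes its version as major * 10000 + minor * 100 + patch.
    size_t version = cudnnGetVersion();
    if (version < 90700) {
        std::fprintf(stderr, "cuDNN backend %zu is older than 9.7.0\n", version);
        return 1;
    }
    std::printf("cuDNN backend %zu is recent enough\n", version);
    return 0;
}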


Corendos commented Mar 4, 2025

Hey @nvmbreughe, thanks for the quick answer!

Funny that you suggest that, because that's exactly what I tried! 😁
I was just not sure about the correctness of the output.

To give more context, we are currently working on an ML framework that uses XLA under the hood (through the PJRT abstraction). XLA has some support for cudnn flash attention, just not the paged version, so we hacked in support. It was working great for decode but quite slow for prefill due to the naive approach, so I was wondering if cudnn supported packed + paged like other kernels do. So overall that's great news!

There is still a small issue on the XLA side: I'm facing a CUDA error about misalignment.
I tweaked the cudnn frontend samples to reproduce it, but the error doesn't trigger. The log output of cudnn is the same in both cases, so I guess it's an issue in XLA; I just wanted to see if that rang a bell?

As for:

In theory, packing could be combined with paged K and V caches: it would then be the page tables themselves that would be packed. However, the amount of compression you get from this is minimal. But please let us know if you disagree or if you see other good use cases for that!

I was mainly looking for packed Q support, so I don't think compressing the page tables is really required.

Cheers 😁

@nvmbreughe

Happy to help, @Corendos.

Funny that you suggest that, because that's exactly what I tried!

Ha, great! That will work as long as the cuDNN backend is at least v9.7.

There is still a small issue on the XLA side: I'm facing a CUDA error about misalignment.
I tweaked the cudnn frontend samples to reproduce it, but the error doesn't trigger. The log output of cudnn is the same in both cases, so I guess it's an issue in XLA; I just wanted to see if that rang a bell?

Not sure. compute-sanitizer "may" tell you more, so maybe try running it through there?
Do you have a way to get the starting addresses of each tensor XLA allocated? Does it happen with paged caches only?


Corendos commented Mar 6, 2025

Not sure. compute-sanitizer "may" tell you more, so maybe try running it through there?
Do you have a way to get the starting addresses of each tensor XLA allocated? Does it happen with paged caches only?

OK, so this was a mistake on my side. Due to the way XLA works, the UID used when building the graph can differ from the one used when executing it. I introduced a mismatch, so the wrong tensors were passed.
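
For anyone hitting the same thing, here is a rough sketch of pinning the UIDs explicitly (assuming the cudnn-frontend 1.x graph API; `q_device_ptr` is a placeholder, and the dims/strides are taken from the logs below):

#include <cudnn_frontend.h>
#include <unordered_map>

namespace fe = cudnn_frontend;

constexpr int64_t Q_UID = 1;  // fixed once, shared by build and execution

void build_q(fe::graph::Graph& graph) {
    auto Q = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("Q")
                              .set_uid(Q_UID)  // explicit, not auto-assigned
                              .set_dim({17, 32, 1, 128})
                              .set_stride({4096, 128, 128, 1}));
    // ... add the SDPA node, validate, and build the execution plan ...
}

void run(fe::graph::Graph& graph, cudnnHandle_t handle,
         void* q_device_ptr, void* workspace) {
    // Key the variant pack by the same constant so the mapping cannot drift:
    std::unordered_map<int64_t, void*> variant_pack = {{Q_UID, q_device_ptr}};
    graph.execute(handle, variant_pack, workspace);
}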

Ha, great! That will work as long as the cuDNN backend is at least v9.7.

About that, I think I discovered a bug. From the documentation, I understand that in the case of a packed layout, Q's first dimension (let's call it T) can be different from the batch size B.

However, when I try to build a graph with such a difference, I get an error. Here are the cuDNN logs:

[cudnn_frontend] 
{"context":{"compute_data_type":"FLOAT","intermediate_data_type":"FLOAT","io_data_type":"HALF","name":"","sm_count":-1},"cudnn_backend_version":"9.7.1","cudnn_frontend_version":11000,"json_version":"1.0","nodes":[{"alibi_mask":false,"attn_scale_value":"3DB504F3","diagonal_alignment":"TOP_LEFT","dropout_probability":null,"inputs":{"K":"container_K","Page_table_K":"page_table_k","Page_table_V":"page_table_v","Q":"Q","SEQ_LEN_KV":"seq_kv","SEQ_LEN_Q":"seq_q","V":"container_V"},"is_inference":true,"left_bound":null,"max_seq_len_kv":4096,"name":"flash_attention","outputs":{"O":"flash_attention::O"},"padding_mask":true,"right_bound":null,"tag":"SDPA_FWD"}],"tensors":{"Q":{"data_type":null,"dim":[17,32,1,128],"is_pass_by_value":false,"is_virtual":false,"name":"Q","pass_by_value":null,"reordering_type":"NONE","stride":[4096,128,128,1],"uid":1,"uid_assigned":true},"container_K":{"data_type":null,"dim":[32768,8,16,128],"is_pass_by_value":false,"is_virtual":false,"name":"container_K","pass_by_value":null,"reordering_type":"NONE","stride":[16384,128,1024,1],"uid":2,"uid_assigned":true},"container_V":{"data_type":null,"dim":[32768,8,16,128],"is_pass_by_value":false,"is_virtual":false,"name":"container_V","pass_by_value":null,"reordering_type":"NONE","stride":[16384,128,1024,1],"uid":3,"uid_assigned":true},"flash_attention::O":{"data_type":null,"dim":[17,32,1,128],"is_pass_by_value":false,"is_virtual":false,"name":"flash_attention::O","pass_by_value":null,"reordering_type":"NONE","stride":[4096,128,128,1],"uid":4,"uid_assigned":true},"page_table_k":{"data_type":"INT32","dim":[16,1,256,1],"is_pass_by_value":false,"is_virtual":false,"name":"page_table_k","pass_by_value":null,"reordering_type":"NONE","stride":[256,256,1,1],"uid":9,"uid_assigned":true},"page_table_v":{"data_type":"INT32","dim":[16,1,256,1],"is_pass_by_value":false,"is_virtual":false,"name":"page_table_v","pass_by_value":null,"reordering_type":"NONE","stride":[256,256,1,1],"uid":10,"uid_assigned":true},"seq_kv":{"data_type":"INT32","dim":[17,1,1,1],"is_pass_by_value":false,"is_virtual":false,"name":"seq_kv","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":8,"uid_assigned":true},"seq_q":{"data_type":"INT32","dim":[17,1,1,1],"is_pass_by_value":false,"is_virtual":false,"name":"seq_q","pass_by_value":null,"reordering_type":"NONE","stride":[1,1,1,1],"uid":7,"uid_assigned":true}}}
[cudnn_frontend] INFO: Validating SDPANode flash_attention...
[cudnn_frontend] INFO: Validating SDPANode flash_attention...
[cudnn_frontend] INFO: Inferrencing properties for Scaled_dot_product_flash_attention node  flash_attention...
[cudnn_frontend] INFO: Validating PagedCacheLoadNode paged_k_cache_operation...
[cudnn_frontend] INFO: Inferrencing properties for matmul node bmm1...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node attn_scale...
[cudnn_frontend] INFO:attn_scale::OUT_0 stride computed from bmm1::C
[cudnn_frontend] INFO: Inferrencing properties for pointwise node gen_row_idx_padding...
[cudnn_frontend] INFO:gen_row_idx_padding::OUT_0 stride computed from attn_scale::OUT_0
[cudnn_frontend] INFO: Inferrencing properties for pointwise node gen_col_idx_padding...
[cudnn_frontend] INFO:gen_col_idx_padding::OUT_0 stride computed from attn_scale::OUT_0
[cudnn_frontend] INFO: Inferrencing properties for pointwise node lt_row_sq_padding...
[cudnn_frontend] INFO:lt_row_sq_padding::OUT_0 stride computed from gen_row_idx_padding::OUT_0
[cudnn_frontend] INFO: Inferrencing properties for pointwise node lt_col_skv_padding...
[cudnn_frontend] INFO:lt_col_skv_padding::OUT_0 stride computed from gen_col_idx_padding::OUT_0
[cudnn_frontend] INFO: Inferrencing properties for pointwise node and_row_col_padding...
[cudnn_frontend] INFO:and_row_col_padding::OUT_0 stride computed from lt_col_skv_padding::OUT_0
[cudnn_frontend] INFO: Inferrencing properties for pointwise node select_padding...
[cudnn_frontend] INFO:select_padding::OUT_0 stride computed from and_row_col_padding::OUT_0
[cudnn_frontend] INFO: Validating SoftmaxNode softmax...
[cudnn_frontend] INFO: Inferrencing properties for Softmax node softmax.
[cudnn_frontend] INFO: Inferrencing properties for reduction node M...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node sub...
[cudnn_frontend] INFO:sub_M stride computed from select_padding::OUT_0
[cudnn_frontend] INFO: Inferrencing properties for pointwise node exp...
[cudnn_frontend] INFO:exp_sub_M stride computed from sub_M
[cudnn_frontend] INFO: Inferrencing properties for reduction node sum...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node log...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node add...
[cudnn_frontend] INFO: stride computed from log::OUT_0
[cudnn_frontend] INFO: Inferrencing properties for pointwise node div...
[cudnn_frontend] INFO: stride computed from exp_sub_M
[cudnn_frontend] INFO: Validating PagedCacheLoadNode paged_v_cache_operation...
[cudnn_frontend] INFO: Inferrencing properties for matmul node bmm2...
[cudnn_frontend] INFO: Creating cudnn tensors for node named 'flash_attention':
[cudnn_frontend] INFO: Creating Backend Tensor named 'attn_scale::IN_1' with UID 5
[cudnn_frontend] CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["FLOAT"] Id: 5 nDims 4 VectorCount: 1 vectorDimension -1 Dim [ 1,1,1,1 ] Str [ 1,1,1,1 ] isVirtual: 0 isByValue: 1 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: Creating Backend Tensor named 'container_V' with UID 3
[cudnn_frontend] CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["HALF"] Id: 3 nDims 4 VectorCount: 1 vectorDimension -1 Dim [ 32768,8,16,128 ] Str [ 16384,128,1024,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: Creating Backend Tensor named 'container_K' with UID 2
[cudnn_frontend] CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["HALF"] Id: 2 nDims 4 VectorCount: 1 vectorDimension -1 Dim [ 32768,8,16,128 ] Str [ 16384,128,1024,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: Creating Backend Tensor named 'Q' with UID 1
[cudnn_frontend] INFO: Creating Backend Tensor named 'ragged_offset_q' with UID 12
[cudnn_frontend] CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["INT32"] Id: 12 nDims 4 VectorCount: 1 vectorDimension -1 Dim [ 17,1,1,1 ] Str [ 1,1,1,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] ERROR: CUDNN_BACKEND_TENSOR_DESCRIPTOR cudnnFinalize failedptrDesc->finalize() cudnn_status: CUDNN_STATUS_BAD_PARAM. ["CUDNN_BACKEND_API_FAILED"] because (e.getCudnnStatus() != CUDNN_STATUS_SUCCESS) at /mnt/hugo/cudnn-frontend/include/cudnn_frontend/cudnn_interface.h:86
[cudnn_frontend] ERROR: detail::create_cudnn_tensor(tensor, tensors, potential_uid, used_uids) at /mnt/hugo/cudnn-frontend/include/cudnn_frontend/node_interface.h:395
[cudnn_frontend] ERROR: create_cudnn_tensors_node(uid_to_backend_tensors, potential_uid, used_uids) at /mnt/hugo/cudnn-frontend/include/cudnn_frontend/node_interface.h:242
[cudnn_frontend] ERROR: sub_node->create_cudnn_tensors_subtree(uid_to_backend_tensors, potential_uid, used_uids) at /mnt/hugo/cudnn-frontend/include/cudnn_frontend/node_interface.h:244
[cudnn_frontend] ERROR: create_cudnn_tensors_subtree(uid_to_tensors, start_uid, used_uids) at /mnt/hugo/cudnn-frontend/include/cudnn_frontend/graph_interface.h:566
[cudnn_frontend] ERROR: this->build_operation_graph(handle) at /mnt/hugo/cudnn-frontend/include/cudnn_frontend/graph_interface.h:1502

If you want to reproduce, here is a gist containing the modified sample I used: https://gist.github.com/Corendos/ab4712e1c53b72ff114b108635bc5c1f

I saw that there was a recent release of the cuDNN backend (9.8.0); do you know by any chance if this was fixed in that version?


Corendos commented Mar 6, 2025

After a bit more investigation, it seems that the error originates when the Q tensor is built here:

status = detail::finalize(m_tensor.pointer->get_backend_descriptor());

I also tried with the 9.8.0 release of the cuDNN backend, but the error still triggers.

I'll check whether the error also happens when the kernel is used in a non-paged way and keep you posted.


Corendos commented Mar 7, 2025

I just found out that you can enable logging in the cuDNN backend with CUDNN_LOGDEST_DBG=stdout CUDNN_LOGLEVEL_DBG=3, and here is the output:

cudnn_debug_log.txt

The interesting part being:

I! CuDNN (v90800 87) function cudnnBackendFinalize() called:
i!     descriptor: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR:
i!         type: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[256,32,1,128];
i!         strideA: type=int; val=[4096,128,128,1];
i!         uid: type=int64_t; val=1;
i!         alignmentInBytes: type=int64_t; val=16;
i!         isVirtual: type=bool; val=false;
i!         isByVal: type=bool; val=false;
i! Time: 2025-03-07T09:16:49.176468 (0d+0h+0m+1s since start)
i! Process=354365; Thread=354365; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90800 87) function cudnnBackendFinalize() called:
i!     status: type=cudnnStatus_t; val=CUDNN_STATUS_BAD_PARAM (2000);
i! Time: 2025-03-07T09:16:49.176476 (0d+0h+0m+1s since start)
i! Process=354365; Thread=354365; GPU=NULL; Handle=NULL; StreamId=NULL.


E! CuDNN (v90800 87) function cudnnBackendFinalize() called:
e!     Info: Traceback contains 3 message(s)
e!         Error: CUDNN_STATUS_BAD_PARAM; Reason: CUDNN_ATTR_TENSOR_RAGGED_OFFSET_DESC ragged dim should match dim value + 1 of original tensor. All other offset dim values should be singleton. at: offset_dimA[dim] != this->_dimA[dim] + 1 && offset_dimA[dim] != 1
e!         Error: CUDNN_STATUS_BAD_PARAM; Reason: finalize_internal()
e!         Error: CUDNN_STATUS_BAD_PARAM; Reason: ptrDesc->finalize()
e! Time: 2025-03-07T09:16:49.176482 (0d+0h+0m+1s since start)
e! Process=354365; Thread=354365; GPU=NULL; Handle=NULL; StreamId=NULL.

So this seems like a potential bug: the whole point of packed tensors is to allow a Q tensor with more than one token per batch entry.

Is there a way to report this directly to the cuDNN backend team? I'd love to help if needed; please let me know how I can contribute! 😁


mnicely commented Mar 8, 2025

@Corendos thanks for reporting the bug. @nvmbreughe, can you create an NVBug next week?

@Corendos would you be interested in connecting to discuss your use cases?


steeve commented Mar 8, 2025

Please also note that we tried NOP-ing out the assertion inside cudnn_graph.so, and unfortunately it fails later.

@Anerudhan

Hi @Corendos / @steeve,

From the sample, it looks like there is a mismatch between our documentation and the expected shapes of the ragged offset and Q tensors.

Looking at the tensors:

"Q" -> dim [17,32,1,128]
"ragged_offset_q" -> dim [17,1,1,1]
"page_table_k" -> data_type INT32, dim [16,1,256,1]

The expectation is:

Q is [B,H,S,D]
Ragged offset is [B+1,1,1,1]
Page_table is [B,...]

The reason the ragged offset has B+1 entries is that the first offset starts at 0.
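
To illustrate, a small sketch of building such an offset (illustrative only: whether the entries count tokens, elements, or bytes depends on the layout; this just shows the B+1 shape):

#include <cstdint>
#include <vector>

// The B+1 ragged offset is the exclusive prefix sum of the per-sequence
// lengths, with a leading 0 entry.
std::vector<int32_t> make_ragged_offset(const std::vector<int32_t>& seq_len) {
    std::vector<int32_t> offset(seq_len.size() + 1, 0);  // B + 1 entries
    for (size_t i = 0; i < seq_len.size(); ++i)
        offset[i + 1] = offset[i] + seq_len[i];
    return offset;  // e.g. {5, 2, 7} -> {0, 5, 7, 14}
}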

Hope that makes sense.

Regards,
Anerudhan

@Corendos

Hi all!

@Corendos would you be interested in connecting to discuss your use cases?

@mnicely I would love to; how do you want to proceed?

The expectation is:

Ragged offset is [B+1,1,1,1]
Page_table is [B,...]

The reason the ragged offset has B+1 entries is that the first offset starts at 0.

It's true that the kernel documentation says that, but there is also the Supported Tensor Layouts section, which introduces a new name for a dimension. It says that in the case of a packed layout, Q has a shape called THD, with T = sum(seq_len), and this allows the batch size B and T to be different.

Also, in my understanding, forcing the ragged offset to be of size B + 1 and Q to be of size B is no different from the non-packed layout. In that case, you have a 1-to-1 mapping between Q "slots" and offsets, which is equivalent to non-packed. The use case I see for this kernel (and how other popular paged attention kernels work) is to allow prefill, where you process multiple input tokens per batch entry. In other words, you have T >> B, which currently seems to be impossible.
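
To make the numbers concrete, here is a hypothetical prefill example (plain arithmetic, not cuDNN API):

#include <numeric>
#include <vector>

int main() {
    // Two prefill requests with prompt lengths 100 and 50 packed together:
    std::vector<int> seq_len_q = {100, 50};  // B = 2
    int T = std::accumulate(seq_len_q.begin(), seq_len_q.end(), 0);
    // T == 150 >> B == 2: packed Q would be [T, H, D], while the page tables
    // and seq-length tensors stay [B, ...]. Tying Q's first dim to B instead
    // allows only one token per request, i.e. the decode-only case.
    return T == 150 ? 0 : 1;
}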
