
Support for packed layout with paged attention #132

Open
@Corendos

Description

Hi all,

I've been playing with cudnn-frontend to test the Flash Attention kernel. Overall, it's easy to use and fast, but I've come across a limitation that I don't really understand.

It seems that the kernel can't be used in paged mode with packed tensors. Other paged attention kernels do support this combination, and it makes a big difference in performance, since tokens can be batched per sequence instead of padding every sequence to the same length. A sketch of the configuration I'm trying is included after the questions below.

So, two questions about that:

  1. Is this a limitation of cudnn-frontend only? I couldn't find such a limitation documented for the backend.
  2. Are there plans to add this feature in the future?
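
For concreteness, here is roughly the graph I'm trying to build. This is only a sketch based on my reading of the packed-layout and paged-attention samples in this repo: the dimensions, strides, and variable names are made up for illustration, and I may well be holding the API wrong.

```cpp
#include <cudnn_frontend.h>
#include <cmath>
#include <iostream>

namespace fe = cudnn_frontend;

int main() {
    // Illustrative sizes only.
    int64_t b = 4, h = 8, d = 128;              // batch, heads, head dim
    int64_t s_q = 1024;                         // max query length
    int64_t block_size = 32, num_blocks = 256;  // paged KV cache geometry
    int64_t max_s_kv = 2048;
    int64_t table_size = (max_s_kv + block_size - 1) / block_size;

    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::HALF)
        .set_intermediate_data_type(fe::DataType_t::FLOAT)
        .set_compute_data_type(fe::DataType_t::FLOAT);

    // Packed Q: a ragged offset tensor marks where each sequence's tokens
    // start in the flattened token dimension.
    auto ragged_q = graph.tensor(fe::graph::Tensor_attributes()
                                     .set_name("ragged_offset_q")
                                     .set_dim({b + 1, 1, 1, 1})
                                     .set_stride({1, 1, 1, 1})
                                     .set_data_type(fe::DataType_t::INT32));
    auto q = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("Q")
                              .set_dim({b, h, s_q, d})
                              .set_stride({s_q * h * d, d, h * d, 1}));
    q->set_ragged_offset(ragged_q);  // <- the packed-layout part

    // Paged K/V: container tensors hold fixed-size blocks; page tables map
    // (batch, logical block index) -> physical block index in the container.
    auto k = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("K_container")
                              .set_dim({num_blocks, h, block_size, d})
                              .set_stride({h * block_size * d, block_size * d, d, 1}));
    auto v = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("V_container")
                              .set_dim({num_blocks, h, block_size, d})
                              .set_stride({h * block_size * d, block_size * d, d, 1}));
    auto page_table_k = graph.tensor(fe::graph::Tensor_attributes()
                                         .set_name("page_table_k")
                                         .set_dim({b, 1, table_size, 1})
                                         .set_stride({table_size, table_size, 1, 1})
                                         .set_data_type(fe::DataType_t::INT32));
    auto page_table_v = graph.tensor(fe::graph::Tensor_attributes()
                                         .set_name("page_table_v")
                                         .set_dim({b, 1, table_size, 1})
                                         .set_stride({table_size, table_size, 1, 1})
                                         .set_data_type(fe::DataType_t::INT32));

    // Per-sequence actual lengths, used with the padding mask.
    auto seq_q = graph.tensor(fe::graph::Tensor_attributes()
                                  .set_name("seq_len_q")
                                  .set_dim({b, 1, 1, 1})
                                  .set_stride({1, 1, 1, 1})
                                  .set_data_type(fe::DataType_t::INT32));
    auto seq_kv = graph.tensor(fe::graph::Tensor_attributes()
                                   .set_name("seq_len_kv")
                                   .set_dim({b, 1, 1, 1})
                                   .set_stride({1, 1, 1, 1})
                                   .set_data_type(fe::DataType_t::INT32));

    auto opts = fe::graph::SDPA_attributes()
                    .set_name("paged_sdpa")
                    .set_is_inference(true)
                    .set_attn_scale(1.0f / std::sqrt(static_cast<float>(d)))
                    .set_padding_mask(true)
                    .set_seq_len_q(seq_q)
                    .set_seq_len_kv(seq_kv)
                    .set_paged_attention_k_table(page_table_k)  // <- the paged part
                    .set_paged_attention_v_table(page_table_v)
                    .set_paged_attention_max_seq_len_kv(static_cast<int>(max_s_kv));

    auto [o, stats] = graph.sdpa(q, k, v, opts);
    o->set_output(true);

    // In my tests, combining the ragged offset on Q with the paged K/V
    // setup is rejected here:
    auto status = graph.validate();
    std::cout << status.get_message() << std::endl;
    return 0;
}
```

Each of the two pieces (the ragged offset on Q, and the paged K/V containers plus page tables) seems fine on its own; it's only when both are set that the graph is rejected.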
