
How did flash-attn compute attention for cu_seqlens #850

Closed
zigzagcai opened this issue Feb 23, 2024 · 3 comments

@zigzagcai commented Feb 23, 2024

Hi,

We know that cu_seqlens exists for compute efficiency when training over multiple variable-length samples, and that the attention mask can be derived from cu_seqlens. We could split the packed (cumulative) sequence back into a batch and pad the empty positions with zeros, but that approach hurts training efficiency, since compute is spent on meaningless padding tokens.
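For illustration only (hypothetical sequence lengths, not flash-attn code), a minimal sketch of the two layouts: a zero-padded batch versus a packed tensor indexed by cu_seqlens:

```python
import torch

nheads, headdim = 2, 8
seqlens = [5, 3, 7]                                      # hypothetical per-sample lengths
seqs = [torch.randn(s, nheads, headdim) for s in seqlens]

# Padded layout: (batch, max_seqlen, nheads, headdim) -- compute is wasted on pad tokens.
padded = torch.zeros(len(seqlens), max(seqlens), nheads, headdim)
for i, s in enumerate(seqs):
    padded[i, : s.shape[0]] = s

# Packed ("varlen") layout: (total_tokens, nheads, headdim) plus cumulative offsets.
packed = torch.cat(seqs, dim=0)                                       # shape (15, nheads, headdim)
cu_seqlens = torch.tensor([0] + seqlens).cumsum(0).to(torch.int32)    # tensor([0, 5, 8, 15])

# Sample i lives at packed[cu_seqlens[i] : cu_seqlens[i + 1]].
assert torch.equal(packed[int(cu_seqlens[1]) : int(cu_seqlens[2])], seqs[1])
```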

I am not very familiar with the implementation details of flash-attn, so I am curious: where can I find the implementation, or the mechanism, by which flash-attn computes attention directly over the packed (cumulative) sequence and produces separate results for each sample?

Thanks!

@tridao (Contributor) commented Feb 23, 2024

We just launch parallel work (i.e. thread blocks) to process each attn head of each sequence, and each thread block will figure out the start and end idx of each sequence from cu_seqlens.
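(For illustration, a rough Python analogy of that partitioning. The real implementation is a CUDA kernel; this reference loop only shows how each per-sequence, per-head unit of work derives its start and end indices from cu_seqlens.)

```python
import torch

def varlen_attention_reference(q, k, v, cu_seqlens):
    """q, k, v: (total_tokens, nheads, headdim); cu_seqlens: (batch + 1,) int tensor."""
    out = torch.empty_like(q)
    softmax_scale = q.shape[-1] ** -0.5
    for b in range(cu_seqlens.numel() - 1):        # one unit of work per sequence...
        start, end = int(cu_seqlens[b]), int(cu_seqlens[b + 1])
        for h in range(q.shape[1]):                # ...and per attention head
            scores = (q[start:end, h] @ k[start:end, h].transpose(0, 1)) * softmax_scale
            out[start:end, h] = torch.softmax(scores, dim=-1) @ v[start:end, h]
    return out
```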

@zigzagcai (Author)

> We just launch parallel work (i.e. thread blocks) to process each attn head of each sequence, and each thread block will figure out the start and end idx of each sequence from cu_seqlens.

Got it. Thanks for the explanation!

@zigzagcai (Author) commented Mar 6, 2024

> We just launch parallel work (i.e. thread blocks) to process each attn head of each sequence, and each thread block will figure out the start and end idx of each sequence from cu_seqlens.

We observe that the flash-attn API provides fwd/bwd and varlen_fwd/varlen_bwd entry points to handle inputs without/with cu_seqlens. Both input patterns are passed into run_mha_fwd and run_mha_bwd, and are ultimately processed by the templated flash_fwd_kernel and flash_bwd_kernel.
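(For reference, a hedged usage sketch of the Python-level wrappers around these two paths; the argument names below follow my reading of flash_attn.flash_attn_interface, so check the installed version for the exact signatures.)

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

nheads, headdim = 4, 64
device, dtype = "cuda", torch.float16

# Fixed-length path: (batch, seqlen, nheads, headdim) -> fwd/bwd.
q = torch.randn(2, 128, nheads, headdim, device=device, dtype=dtype)
out = flash_attn_func(q, q, q, causal=True)

# Varlen path: packed (total_tokens, nheads, headdim) plus cu_seqlens -> varlen_fwd/varlen_bwd.
seqlens = [5, 3, 7]
qp = torch.randn(sum(seqlens), nheads, headdim, device=device, dtype=dtype)
cu_seqlens = torch.tensor([0] + seqlens, device=device).cumsum(0).to(torch.int32)
out_varlen = flash_attn_varlen_func(
    qp, qp, qp,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
```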

In the flash attention kernel, a structure named BlockInfo is defined to store the offsets of q, k, and v. These offsets are calculated from cu_seqlens_q and cu_seqlens_k, allowing attention to be computed one row block at a time (compute_attn_1rowblock). Through BlockInfo, each thread block knows which rows of q, k, and v it should compute attention over.

Hence, each sequence in the packed q, k, v can have a variable length, since the GEMM is broken down into row-block computations, which avoids wasting compute on meaningless padding tokens.
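(A small sketch of that offset bookkeeping in Python; the field names roughly mirror my reading of the BlockInfo struct and are approximations, not the actual C++ definitions.)

```python
def block_info(cu_seqlens_q, cu_seqlens_k, bidb):
    """Per-sequence offsets a thread block handling batch index bidb would use."""
    sum_s_q = int(cu_seqlens_q[bidb])                        # token offset of this sequence in packed q
    sum_s_k = int(cu_seqlens_k[bidb])                        # token offset in packed k/v
    actual_seqlen_q = int(cu_seqlens_q[bidb + 1]) - sum_s_q  # true (unpadded) query length
    actual_seqlen_k = int(cu_seqlens_k[bidb + 1]) - sum_s_k  # true (unpadded) key length
    return sum_s_q, sum_s_k, actual_seqlen_q, actual_seqlen_k

# e.g. with cu_seqlens_q = cu_seqlens_k = [0, 5, 8, 15], sequence 1 starts at token 5 and has length 3.
```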
