[WIP]: Initial DynamicFlexAttention wrapper class for dynamic sequence lengths #1960
base: main
Conversation
See the following gist: https://gist.github.com/zyklotomic/527cb96da86c2b5f5984bede3be9b227
Hey! Great PR! Do you know why …
I have some interesting findings to report back! I should have dug deeper initially. It turns out that getting dynamic shapes to work is something that has been actively worked on, and it is apparently available in the nightly version of PyTorch. Links of interest: https://github.com/pytorch/pytorch/blob/8d08b4901586f230353a558ee00c16ad57f95178/torch/_inductor/kernel/flex_attention.py#L705 (most recent commit as of writing), which points to https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/flex_decoding.py#L336
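Roughly, the dynamic-shapes route those links cover would look something like the sketch below. This is just an illustration, not something I've benchmarked: the shapes and dtypes are arbitrary, and whether `dynamic=True` actually avoids per-shape recompilation of the flex attention kernel depends on the nightly build.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Let torch.compile trace with dynamic sequence lengths instead of padding.
compiled_flex = torch.compile(flex_attention, dynamic=True)

def run(seq_len: int):
    q = torch.randn(1, 8, seq_len, 64, device="cuda", dtype=torch.float16)
    k = torch.randn(1, 8, seq_len, 64, device="cuda", dtype=torch.float16)
    v = torch.randn(1, 8, seq_len, 64, device="cuda", dtype=torch.float16)
    return compiled_flex(q, k, v)

out_a = run(384)  # first call compiles
out_b = run(512)  # ideally reuses the dynamic-shape kernel on nightly
```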
I did try my example notebook and set … Not a … As for your question on why … What do you think is the best course of action? Should we wait for the PyTorch folks to stabilize instead?
I think I only just understood what you mean. If I understand correctly, my wrapper class handles the padding for you based on the input size.
Demo notebook: https://colab.research.google.com/drive/1X7CpQgIqgRpV2aIUgS_p7u1TR4ITUfXF?usp=sharing — it might be a bit primitive to use a temporary print statement to confirm that the flex attention module was indeed being invoked, but I don't think there was any better way.
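One possible alternative to the temporary print would be a forward hook; just a sketch, where `model.attn` and `sample_input` are placeholders for whatever the demo notebook actually builds, not names from this PR:

```python
import torch

calls = []
# `model.attn` stands in for the flex attention module under test.
hook = model.attn.register_forward_hook(
    lambda mod, args, out: calls.append(type(mod).__name__)
)

with torch.no_grad():
    model(sample_input)  # whatever the demo notebook normally runs

assert calls, "the flex attention module was never invoked"
hook.remove()
```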
I had a stab at making FlexAttention work without excessive recompilation. I am not fully confident in this approach; it still feels pretty janky to me. Hence, I wanted confirmation on whether this is the right approach.
In essence, the kernel has to recompile every time the input sizes change. So why not compile a kernel once for a larger size, pad the inputs up to that size when necessary, and slice the padding back off the result before returning? See the code for more thorough comments.
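To make the pad-and-slice idea concrete, here is a rough, self-contained sketch. It is illustrative only — `PaddedFlexAttention`, `max_seq_len`, and the shapes are my own placeholder names, not the PR's actual `DynamicFlexAttention` API — and it assumes the PyTorch 2.5+ `flex_attention` / `create_block_mask` interface.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention, create_block_mask


class PaddedFlexAttention(torch.nn.Module):
    # Illustrative only -- not the PR's DynamicFlexAttention API.
    def __init__(self, max_seq_len: int = 1024):
        super().__init__()
        self.max_seq_len = max_seq_len
        # Compile once; every call sees (B, H, max_seq_len, D) inputs,
        # so shorter sequences should not trigger a recompile.
        self.flex = torch.compile(flex_attention)
        # Keep the real length in a tensor so the mask_mod below does not
        # close over a Python int that the compiler would specialize on.
        self.register_buffer("seq_len", torch.tensor(0, dtype=torch.int32))

    def forward(self, q, k, v):
        # q, k, v: (B, H, S, D) with S <= max_seq_len
        B, H, S, D = q.shape
        assert S <= self.max_seq_len, "input longer than the compiled size"
        self.seq_len.fill_(S)
        pad = self.max_seq_len - S

        # Pad the sequence dimension up to the fixed compiled size.
        q_p = F.pad(q, (0, 0, 0, pad))
        k_p = F.pad(k, (0, 0, 0, pad))
        v_p = F.pad(v, (0, 0, 0, pad))

        # Mask out padded key positions so they do not affect the softmax.
        def mask_mod(b, h, q_idx, kv_idx):
            return kv_idx < self.seq_len

        block_mask = create_block_mask(
            mask_mod, B=None, H=None,
            Q_LEN=self.max_seq_len, KV_LEN=self.max_seq_len,
            device=q.device,
        )

        out = self.flex(q_p, k_p, v_p, block_mask=block_mask)
        # Slice the padding back off before returning.
        return out[:, :, :S, :]
```

The actual length is stored in a buffer rather than captured as a Python int because a plain int in the mask closure would likely get baked into the compiled kernel and defeat the purpose; I'm not certain this avoids recompilation in every case, which is part of what I'd like feedback on.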
I haven't had the chance to really test the performance yet. There are potential enhancements too, which I mention in the comments.
Will attach testing code for a demo in a bit.