Add new flash attn features to cuDNN SDPA API and remove fused attn #21228
base: main
Conversation
Cjkkkk commented on May 14, 2024:
- Add variable sequence length: accepts two additional tensors, seqlen_q and seqlen_kv, indicating the non-padded lengths, to reduce computation.
- Add MQA/GQA.
- Add broadcast bias: the bias can be broadcast over the batch/head dims.
- Add dbias calculation.
- Remove fused attn and default to flash attn.
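As a rough illustration of two of the features above (GQA and variable sequence lengths), here is a plain-JAX reference sketch; this is not the PR's cuDNN flash-attention path, and the function name and shapes are hypothetical:

```python
# Hypothetical reference sketch: grouped-query attention (fewer KV heads
# than query heads) with padding masks derived from seqlen_q / seqlen_kv.
# Plain JAX, not the cuDNN kernel this PR dispatches to.
import jax
import jax.numpy as jnp

def gqa_reference(q, k, v, seqlen_q, seqlen_kv):
    """q: [B, T, Nq, H]; k, v: [B, S, Nkv, H], with Nq a multiple of Nkv."""
    B, T, Nq, H = q.shape
    S, Nkv = k.shape[1], k.shape[2]
    # GQA: each KV head serves a contiguous group of query heads.
    groups = Nq // Nkv
    k = jnp.repeat(k, groups, axis=2)
    v = jnp.repeat(v, groups, axis=2)
    logits = jnp.einsum("btnh,bsnh->bnts", q, k) / jnp.sqrt(H)
    # Variable sequence length: mask key positions past the real length.
    kv_mask = jnp.arange(S)[None, :] < seqlen_kv[:, None]          # [B, S]
    logits = jnp.where(kv_mask[:, None, None, :], logits, -1e9)
    out = jnp.einsum("bnts,bsnh->btnh", jax.nn.softmax(logits), v)
    # Zero out the padded query rows as well.
    q_mask = jnp.arange(T)[None, :] < seqlen_q[:, None]            # [B, T]
    return out * q_mask[:, :, None, None].astype(out.dtype)
```

With uniform inputs, unmasked output rows average the value vectors, and rows past seqlen_q come back zeroed.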
@superbobry Hi Sergei, could you help review this PR?

@superbobry Hi Sergei, any updates on this?

No updates just yet, sorry. I will review some time tomorrow.
I did my best to read through, but these large diffs are really hard to get through. Please send smaller PRs for any follow-up changes.
I would also recommend asking someone from NVIDIA to review the cuDNN APIs etc.
```diff
@@ -41,10 +41,42 @@ class AttentionLayout(Enum):
   BTNH = 0
   BNTH = 1

+class MaskType(Enum):
+  NO_MASK = 0
```
OOC, why not use `None` instead when a mask is not specified?
Just a choice to make it more explicit.
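The trade-off being discussed can be sketched as follows; member names other than `NO_MASK`, and the helper `attention_config`, are illustrative rather than the PR's actual API:

```python
# Illustrative only: an explicit enum member for "no mask" versus an
# Optional[...] argument. Members besides NO_MASK are hypothetical.
from enum import Enum

class MaskType(Enum):
    NO_MASK = 0
    CAUSAL = 1

def attention_config(mask_type: MaskType = MaskType.NO_MASK) -> str:
    # With an enum, every call site spells out its intent explicitly,
    # and handling each case exhaustively is easy to check.
    if mask_type is MaskType.NO_MASK:
        return "no masking applied"
    return f"masking: {mask_type.name.lower()}"
```

Passing `None` would serve the same purpose, but an explicit `NO_MASK` member keeps the "no mask" case a first-class, nameable value.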
Understood, sorry for the large PR; I will create smaller ones next time. I think people from NVIDIA don't have access to approve and merge the PR?
Please address the comments, and then we can merge.
Comments addressed, sorry about the delay.

Can you squash the PR please?
Force-pushed from 627a064 to 403ad05.