Maybe I missed it, but I didn’t see any code using flash-enabled cross attention with key_padding_mask analogous to FlashAttention in flash_attn/flash_attention.py. Is there any reason this is the case? I have a working implementation (with the same structure as FlashAttention) and would be happy to submit a pull request if there's interest. Thanks for the great work!
I think we have some of that here. If you have something easier to use, would love to see it.
In general for the best performance, one should remove all the padding tokens at the very beginning, pass through all the layers, and then add back the padding tokens so as to avoid wasting computation. We do that in our BERT implementation.
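To illustrate the pattern described above, here is a minimal pure-PyTorch sketch of removing padding tokens before the transformer layers and restoring them afterwards. The helper names (`unpad`, `repad`) are illustrative only; the actual FlashAttention BERT implementation uses its own helpers in `flash_attn/bert_padding.py`, whose exact signatures may differ.

```python
import torch


def unpad(hidden_states, key_padding_mask):
    """Drop padding tokens, keeping the indices needed to restore them.

    hidden_states: (batch, seqlen, dim)
    key_padding_mask: (batch, seqlen) bool, True for real tokens
    """
    batch, seqlen, dim = hidden_states.shape
    # Flatten (batch, seqlen) and keep only the positions of real tokens.
    indices = torch.nonzero(key_padding_mask.flatten(), as_tuple=False).flatten()
    unpadded = hidden_states.reshape(batch * seqlen, dim)[indices]
    # Cumulative sequence lengths, the format variable-length attention kernels expect.
    seqlens = key_padding_mask.sum(dim=1, dtype=torch.int32)
    cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, 0, dtype=torch.int32), (1, 0))
    return unpadded, indices, cu_seqlens


def repad(unpadded, indices, batch, seqlen):
    """Scatter unpadded tokens back into a zero-padded (batch, seqlen, dim) tensor."""
    dim = unpadded.shape[-1]
    out = torch.zeros(batch * seqlen, dim, dtype=unpadded.dtype, device=unpadded.device)
    out[indices] = unpadded
    return out.reshape(batch, seqlen, dim)


if __name__ == "__main__":
    batch, seqlen, dim = 2, 8, 16
    x = torch.randn(batch, seqlen, dim)
    mask = torch.zeros(batch, seqlen, dtype=torch.bool)
    mask[0, :5] = True  # first sequence has 5 real tokens
    mask[1, :3] = True  # second sequence has 3 real tokens

    tokens, indices, cu_seqlens = unpad(x, mask)  # (8, dim): only real tokens
    # ... pass `tokens` (plus cu_seqlens) through all transformer layers here ...
    y = repad(tokens, indices, batch, seqlen)     # back to (batch, seqlen, dim)
    assert torch.equal(y[mask], x[mask])
```

The point of the pattern is that every layer then operates on the concatenated real tokens only, so no compute is spent on padding positions; the padding is only reintroduced once at the very end.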