
[Draft] support qk head_dim different from vo head_dim #980

Open · wants to merge 1 commit into main

Conversation

@defei-coder commented Jun 6, 2024

Support a query/key head_dim different from the value head_dim; fixes #753 and #952.
Recently, DeepSeek-V2 proposed a new attention mechanism called MLA (Multi-head Latent Attention), which uses low-rank key-value joint compression to eliminate the inference-time key-value cache bottleneck and thereby supports efficient inference. MLA uses query/key head_dim=192 and value head_dim=128, but FlashAttention does not support this combination. Although it can be worked around by padding the value head_dim from 128 to 192, that approach increases global memory usage and hurts performance.
To make FlashAttention more versatile, I modified the code to support this combination. To keep compilation time down, only this one combination is added; users can enable other combinations as needed.
Compared with padding the value head_dim from 128 to 192, using query/key head_dim=192 with value head_dim=128 saves global memory and improves performance (forward is roughly 15% faster, backward roughly 5% faster).
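For reference, here is a minimal sketch of the padding workaround described above, using the `flash_attn_func` interface (q/k/v of shape `[batch, seqlen, nheads, head_dim]`); the tensor names and sizes are illustrative, and the final unpadded call assumes the behavior this PR adds:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

batch, seqlen, nheads = 2, 1024, 16
head_dim_qk, head_dim_v = 192, 128  # the MLA combination from DeepSeek-V2

q = torch.randn(batch, seqlen, nheads, head_dim_qk, device="cuda", dtype=torch.float16)
k = torch.randn(batch, seqlen, nheads, head_dim_qk, device="cuda", dtype=torch.float16)
v = torch.randn(batch, seqlen, nheads, head_dim_v, device="cuda", dtype=torch.float16)

# Workaround on mainline: zero-pad V up to the q/k head_dim, then slice the
# output back down. The padded columns of V contribute only zeros to the
# output, so the result is unchanged, but the extra 64 lanes per head cost
# global memory and bandwidth.
v_padded = F.pad(v, (0, head_dim_qk - head_dim_v))          # (..., 192)
out = flash_attn_func(q, k, v_padded, causal=True)[..., :head_dim_v]

# With this PR, the unpadded V can (per the description) be passed directly:
# out = flash_attn_func(q, k, v, causal=True)               # (..., 128)
```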

Development

Successfully merging this pull request may close these issues.

Is it possible to relax V shape requirements to have different head dim than q/k?