
ggml : fix FA mask dim 2 and 3 #14505


Merged
ggerganov merged 3 commits into master from gg/fa-fix-dim-2 on Jul 3, 2025

Conversation

ggerganov (Member) commented on Jul 2, 2025

In #14500, @JohannesGaessler correctly noted that the FA code did not utilize dim 3 of the mask. I overlooked this, and while updating #14363 I realized that we need to align the dimensions.

The fix is simple; I will try to apply it myself to the Vulkan and CUDA backends later today.

There is also a small fix for ggml_soft_max_ext(): it was incorrectly requiring the mask to be a 3D tensor. A test for this is added.
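
As a rough illustration (not part of this PR; the shapes, types, scale and max_bias values below are made up), the relaxed check allows passing a 4D mask to ggml_soft_max_ext(), with dims 2 and 3 mirroring the mask layout described for ggml_flash_attn_ext() further down:

#include "ggml.h"

// scores: [n_kv, n_batch, n_head, n_seq] - illustrative sizes
// mask:   [n_kv, n_batch, ne32,   ne33 ] - dims 2 and 3 line up with the scores
static struct ggml_tensor * softmax_4d_mask(struct ggml_context * ctx) {
    struct ggml_tensor * kq   = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 256, 32, 8, 2);
    struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 256, 32, 1, 2);

    // scale and max_bias are placeholder values
    return ggml_soft_max_ext(ctx, kq, mask, 0.125f, 0.0f);
}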

ggerganov (Member, Author) commented on Jul 2, 2025

For the CUDA kernels I think I updated most of the variants, but the MMA path is more difficult to figure out. @JohannesGaessler, could you take a look at 2a20a7e and see if it makes sense? Any help with updating the MMA code is appreciated.

Edit: never mind, I think my approach was not correct.

ggerganov marked this pull request as ready for review on July 3, 2025 at 05:26
ggerganov (Member, Author) commented
@JohannesGaessler @jeffbolznv For now, I will disable the broadcast of ggml_flash_attn_ext() in Vulkan again (it was already disabled for CUDA). Apologies for changing the API again; I messed it up a bit in #14435.

To summarize, the correct broadcast logic that we have to support is as follows:

llama.cpp/ggml/include/ggml.h

Lines 1983 to 2003 in 89ee2f1

// q:    [n_embd_k, n_batch,     n_head,    ne3 ]
// k:    [n_embd_k, n_kv,        n_head_kv, ne3 ]
// v:    [n_embd_v, n_kv,        n_head_kv, ne3 ] !! not transposed !!
// mask: [n_kv,     n_batch_pad, ne32,      ne33] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
// res:  [n_embd_v, n_head,      n_batch,   ne3 ] !! permuted !!
//
// broadcast:
//   n_head % n_head_kv == 0
//   n_head % ne32      == 0
//   ne3    % ne33      == 0
//
GGML_API struct ggml_tensor * ggml_flash_attn_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias,
        float                 logit_softcap);

In practice, ne32 will probably always be equal to 1, i.e. we specify the mask for a single attention head and broadcast it to all other heads. But in theory, ne32 could be any integer that broadcasts to n_head.

On master, ne33 is always equal to 1, but with the upcoming #14363 it can also be equal to ne3 > 1 (the number of parallel sequences).

This way, the mask that we pass to ggml_soft_max_ext() and ggml_flash_attn_ext() is technically the same: the dimensions correspond 1-to-1.
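
As a minimal sketch (the sizes are illustrative and not from the PR; the F32 q, F16 k/v/mask types and the GGML_KQ_MASK_PAD padding are assumptions following common ggml usage), tensors that satisfy the broadcast rules above could be built like this:

#include <math.h>

#include "ggml.h"

static struct ggml_tensor * build_fa(struct ggml_context * ctx) {
    const int n_embd_k  = 128, n_embd_v  = 128;
    const int n_batch   =  32, n_kv      = 256;
    const int n_head    =  16, n_head_kv =   4; // n_head % n_head_kv == 0
    const int ne3       =   2;                  // parallel sequences
    const int ne32      =   1;                  // mask shared across heads (n_head % ne32 == 0)
    const int ne33      = ne3;                  // per-sequence mask        (ne3 % ne33 == 0)

    const int n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD);

    struct ggml_tensor * q    = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_k, n_batch,     n_head,    ne3);
    struct ggml_tensor * k    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_k, n_kv,        n_head_kv, ne3);
    struct ggml_tensor * v    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_v, n_kv,        n_head_kv, ne3);
    struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv,     n_batch_pad, ne32,      ne33);

    // res: [n_embd_v, n_head, n_batch, ne3] (permuted, as per the header comment)
    return ggml_flash_attn_ext(ctx, q, k, v, mask, 1.0f/sqrtf((float) n_embd_k), 0.0f, 0.0f);
}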

Merging this for now so I can continue working on top of it; later on we'll hopefully add support for these cases.

ggerganov merged commit 9067487 into master on Jul 3, 2025
39 of 48 checks passed
ggerganov deleted the gg/fa-fix-dim-2 branch on July 3, 2025 at 07:47