
ggml : fix FA mask dim 2 and 3 #14505


Merged
ggerganov merged 3 commits into master from gg/fa-fix-dim-2 on Jul 3, 2025

Conversation

ggerganov (Member) commented on Jul 2, 2025

In #14500, @JohannesGaessler correctly noted that the FA code did not utilize dim 3 of the mask. I overlooked this, and while updating #14363 I realized that we need to align the dimensions.

The fix is simple; I will try to apply it myself to the Vulkan and CUDA backends later today.

There is also a small fix for ggml_soft_max_ext(): it was incorrectly requiring the mask to be a 3D tensor. A test for this is added.
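
As a rough illustration (not part of this PR; the shapes, types, scale and max_bias values below are made up), the relaxed check allows passing a 4D mask to ggml_soft_max_ext(), with dims 2 and 3 mirroring the mask layout described for ggml_flash_attn_ext() further down:

#include "ggml.h"

// scores: [n_kv, n_batch, n_head, n_seq] - illustrative sizes
// mask:   [n_kv, n_batch, ne32,   ne33 ] - dims 2 and 3 line up with the scores
static struct ggml_tensor * softmax_4d_mask(struct ggml_context * ctx) {
    struct ggml_tensor * kq   = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 256, 32, 8, 2);
    struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 256, 32, 1, 2);

    // scale and max_bias are placeholder values
    return ggml_soft_max_ext(ctx, kq, mask, 0.125f, 0.0f);
}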

ggerganov (Member, Author) commented on Jul 2, 2025

For the CUDA kernels I think I updated most of the variants, but the MMA path is more difficult to figure out. @JohannesGaessler, could you take a look at 2a20a7e and see if it makes sense? Any help with updating the MMA code is appreciated.

Edit: never mind, I think my approach was not correct.

ggerganov marked this pull request as ready for review on July 3, 2025 at 05:26
ggerganov (Member, Author) commented
@JohannesGaessler @jeffbolznv For now, I will disable the broadcast of ggml_flash_attn_ext() in Vulkan again (it was already disabled for CUDA). Apologies for changing the API again; I messed it up a bit in #14435.

To summarize, the correct broadcast logic that we have to support is as follows:

llama.cpp/ggml/include/ggml.h

Lines 1983 to 2003 in 89ee2f1

// q:    [n_embd_k, n_batch,     n_head,    ne3 ]
// k:    [n_embd_k, n_kv,        n_head_kv, ne3 ]
// v:    [n_embd_v, n_kv,        n_head_kv, ne3 ] !! not transposed !!
// mask: [n_kv,     n_batch_pad, ne32,      ne33] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
// res:  [n_embd_v, n_head,      n_batch,   ne3 ] !! permuted !!
//
// broadcast:
//   n_head % n_head_kv == 0
//   n_head % ne32      == 0
//   ne3    % ne33      == 0
//
GGML_API struct ggml_tensor * ggml_flash_attn_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias,
        float                 logit_softcap);

In practice, ne32 will probably always be equal to 1, i.e. we specify the mask for a single attention head and broadcast it to all other heads. But in theory, ne32 could be any integer that broadcasts to n_head.

On master, ne33 is always equal to 1, but with the upcoming #14363 it can also be equal to ne3 > 1 (the number of parallel sequences).

This way, the mask that we pass to ggml_soft_max_ext() and ggml_flash_attn_ext() is technically the same: the dimensions correspond 1-to-1.
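
As a minimal sketch (the sizes are illustrative and not from the PR; the F32 q, F16 k/v/mask types and the GGML_KQ_MASK_PAD padding are assumptions following common ggml usage), tensors that satisfy the broadcast rules above could be built like this:

#include <math.h>

#include "ggml.h"

static struct ggml_tensor * build_fa(struct ggml_context * ctx) {
    const int n_embd_k  = 128, n_embd_v  = 128;
    const int n_batch   =  32, n_kv      = 256;
    const int n_head    =  16, n_head_kv =   4; // n_head % n_head_kv == 0
    const int ne3       =   2;                  // parallel sequences
    const int ne32      =   1;                  // mask shared across heads (n_head % ne32 == 0)
    const int ne33      = ne3;                  // per-sequence mask        (ne3 % ne33 == 0)

    const int n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD);

    struct ggml_tensor * q    = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_k, n_batch,     n_head,    ne3);
    struct ggml_tensor * k    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_k, n_kv,        n_head_kv, ne3);
    struct ggml_tensor * v    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_v, n_kv,        n_head_kv, ne3);
    struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv,     n_batch_pad, ne32,      ne33);

    // res: [n_embd_v, n_head, n_batch, ne3] (permuted, as per the header comment)
    return ggml_flash_attn_ext(ctx, q, k, v, mask, 1.0f/sqrtf((float) n_embd_k), 0.0f, 0.0f);
}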

Merging this for now so I can continue working on top of it; later on we'll hopefully add support for these cases.

ggerganov merged commit 9067487 into master on Jul 3, 2025
39 of 48 checks passed
ggerganov deleted the gg/fa-fix-dim-2 branch on July 3, 2025 at 07:47