[WIP] GQA supports per head smooth softmax #25269

tianleiwu · 2025-07-03T00:38:28Z

Description

This extends the Smooth Softmax feature. The difference is that each head has a learnable smooth factor s whose exponential is added to the denominator of the softmax. The smooth factor s acts like an extra element that joins the softmax.

The smooth factor enters the softmax as follows (for Smooth Softmax, s is the constant 0):

$$\mathrm{softmax}_{i} = \frac{\exp(x_{i})}{\exp(s) + \sum_{j} \exp(x_{j})}$$

When the head_sink input is given, s for the current head is looked up in head_sink.
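To make the formula concrete, here is a minimal sketch of a softmax with a per-head sink value. The names `SoftmaxWithSink` and `PerHeadSoftmax` are hypothetical helpers written for illustration, and the assumption that a missing head_sink falls back to s = 0 only mirrors the Smooth Softmax case described above. This is not the MLAS or CUDA kernel from this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over one row of attention scores, with exp(sink) added to the
// denominator only. Smooth Softmax corresponds to sink = 0.
// Hypothetical helper for illustration; not the actual kernel.
std::vector<float> SoftmaxWithSink(const std::vector<float>& x, float sink) {
  // Take the maximum over the scores and the sink for numerical stability.
  float max_val = sink;
  for (float v : x) max_val = std::max(max_val, v);

  // The sink contributes to the denominator but produces no output element.
  float denom = std::exp(sink - max_val);
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = std::exp(x[i] - max_val);
    denom += out[i];
  }
  for (float& v : out) v /= denom;
  return out;
}

// Applies the softmax to each head's scores. head_sink holds one value per
// head; passing nullptr is assumed to mean Smooth Softmax (s = 0).
std::vector<std::vector<float>> PerHeadSoftmax(
    const std::vector<std::vector<float>>& scores,
    const std::vector<float>* head_sink) {
  std::vector<std::vector<float>> out;
  out.reserve(scores.size());
  for (std::size_t h = 0; h < scores.size(); ++h) {
    const float s = (head_sink != nullptr) ? (*head_sink)[h] : 0.0f;
    out.push_back(SoftmaxWithSink(scores[h], s));
  }
  return out;
}
```

Because the sink only enlarges the denominator, the outputs of a row sum to less than 1. A larger s for a head pushes more probability mass into the sink and away from the real tokens.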

Changes in progress:

- Update the operator spec to add an optional new input, head_sink.
- Implement the CPU (MLAS) kernel.
- Implement the CUDA kernel.

github-actions

You can commit the suggested changes from lintrunner.

github-actions · 2025-07-05T23:26:03Z

onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h


-          if (use_smooth_softmax_) {
-            ComputeSmoothSoftmaxInplace(output_softmax + start_offset, 1, static_cast<int>(window_size), nullptr);
+
+          if (use_smooth_softmax_ || head_sink != nullptr) {


Suggested change

if (use_smooth_softmax_) {

ComputeSmoothSoftmaxInplace(output_softmax + start_offset, 1, static_cast<int>(window_size), nullptr);

if (use_smooth_softmax_ || head_sink != nullptr) {

if (use_smooth_softmax_ || head_sink != nullptr) {

### Description

support smooth softmax for non-FA GQA implementation

This change depends on:
- #25269

Work items:
- [x] support smooth softmax
- [x] support bias
- [x] support head sink (per-head smooth softmax)

The following will not be included in this PR:
- support for FlashAttention
- support sliding window

update spec

1d84450

tianleiwu marked this pull request as draft July 3, 2025 00:38

fs-eire mentioned this pull request Jul 5, 2025

[webgpu] support smooth softmax for non-FA GQA implementation #25285

Merged

3 tasks

Implement CPU

a1b51b7

github-actions bot reviewed Jul 5, 2025
