[Common] Improved fused MoE aux loss kernel for large # of experts by denera · Pull Request #2758 · NVIDIA/TransformerEngine

denera · 2026-03-13T11:19:38Z

Description

Eliminates the expensive cluster management API calls and minimizes the number of atomic operations to improve performance for larger numbers of experts.

TODO: Perf testing on all archs.

greptile-apps · 2026-04-28T19:30:14Z

Greptile Summary

This PR replaces the cooperative-groups/cluster-launch based MoE aux-loss forward kernel with a simpler, architecture-agnostic multi-block atomicAdd approach that scales better to large expert counts without the overhead of cluster management APIs.

The new kernel has each CTA compute a partial dot-product of probs × tokens_per_expert, then atomically accumulate into a float buffer that is converted to the output dtype by a second single-thread kernel; C_coeff is precomputed on the host and stored at Coeff_buf[0] for the existing backward kernel to reuse.
The Coeff_buf allocation is correctly widened from 1 to 2 floats in all three binding layers (PyTorch router.cpp, JAX C++ router.cpp, and JAX Python router.py), and cudaMemsetAsync correctly zeroes only the accumulator slot before launch.
The total_num_tokens and topk arguments remain in the kernel signature but are now unused inside the body; these dead parameters may produce compiler warnings and should be removed.
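
The data flow summarized above can be modelled on the host. What follows is a minimal pure-Python reference sketch, not the CUDA kernel itself. The function name, the `rows_per_block` partitioning, and the choice to scale each partial sum by `c_coeff` before accumulating are assumptions made for illustration; only the two-slot buffer layout (C_coeff at index 0, float accumulator at index 1) is taken from the summary.

```python
def aux_loss_forward_reference(probs, tokens_per_expert, c_coeff, rows_per_block=2):
    """Host-side model of the multi-block accumulation (illustrative only).

    probs: list of rows, one per token, each holding one probability per expert.
    tokens_per_expert: list with one count per expert.
    c_coeff: scalar supplied by the caller; its formula is not modelled here.
    """
    # Slot 0 keeps C_coeff for the backward pass; slot 1 is the float accumulator.
    coeff_buf = [0.0, 0.0]
    coeff_buf[1] = 0.0      # stands in for zeroing the accumulator before launch
    coeff_buf[0] = c_coeff  # stands in for the single write of C_coeff

    # Each "block" handles a contiguous slice of token rows (assumed partitioning).
    for start in range(0, len(probs), rows_per_block):
        partial = 0.0
        for row in probs[start:start + rows_per_block]:
            for p, count in zip(row, tokens_per_expert):
                partial += p * count
        # One accumulation per block, standing in for that block's atomicAdd.
        coeff_buf[1] += c_coeff * partial

    # Stands in for the final single-thread kernel that writes the output value.
    aux_loss = coeff_buf[1]
    return aux_loss, coeff_buf
```

A model like this could serve as a comparison baseline in a unit test: the result should not depend on how rows are split across blocks, apart from floating-point summation order.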

Confidence Score: 5/5

The kernel rewrite is safe to merge: stream-ordering guarantees correct sequencing between the memset, forward kernel, and convert kernel; C_coeff is correctly preserved for the backward; and buffer sizes are consistently updated across all three binding layers.

The multi-block atomicAdd pattern is well-formed — Coeff_buf[1] is zeroed before launch and only read by a subsequent kernel on the same stream, so no inter-block race can produce wrong results. Coeff_buf[0] is written once by block 0 thread 0 and read only by future backward kernel launches, safely ordered by CUDA stream semantics. The 2-float buffer allocation is correctly propagated through PyTorch, JAX C++, and JAX Python. The only finding is cosmetic: two now-dead kernel parameters that the compiler may warn about.

transformer_engine/common/fused_router/fused_moe_aux_loss.cu — dead total_num_tokens and topk kernel parameters worth cleaning up.

Important Files Changed

Filename	Overview
transformer_engine/common/fused_router/fused_moe_aux_loss.cu	Replaces cooperative-groups cluster launch with a simpler multi-block atomicAdd approach; correctly stores C_coeff at Coeff_buf[0] for the backward, zeroes Coeff_buf[1] before launch, and uses a separate tiny kernel to write the final aux_loss value. Contains dead kernel parameters (total_num_tokens, topk): the kernel body no longer reads them because the launcher precomputes C_coeff on the host.
transformer_engine/pytorch/csrc/extensions/router.cpp	Const_buf correctly resized from scalar {} to {2} to accommodate both C_coeff at index 0 and the float accumulator at index 1.
transformer_engine/jax/cpp_extensions/router.py	Abstract shape for const_buf_aval updated from (1,) to (2,) to match the new 2-float kernel layout.
transformer_engine/jax/csrc/extensions/router.cpp	Forward and backward TensorWrapper shapes for const_buf updated from {1} to {2}; grad_aux_loss and aux_loss correctly remain {1}.

Sequence Diagram

sequenceDiagram
    participant Host as Host (launcher)
    participant Memset as cudaMemsetAsync
    participant FwdKernel as fwd_kernel (grid_size blocks)
    participant ConvKernel as convert_accum_to_output
    participant BwdKernel as bwd_kernel

    Host->>Host: compute C_coeff
    Host->>Memset: zero Coeff_buf[1] (stream)
    Memset-->>Host: enqueued
    Host->>FwdKernel: launch grid_size blocks (stream)
    Note over FwdKernel: block0/thread0: Coeff_buf[0] = C_coeff
    Note over FwdKernel: each CTA partial dot-product then atomicAdd Coeff_buf[1]
    FwdKernel-->>Host: kernel complete
    Host->>ConvKernel: launch 1x1 (stream)
    Note over ConvKernel: aux_loss[0] = Coeff_buf[1]
    ConvKernel-->>Host: done
    Note over BwdKernel: separate call reads Coeff_buf[0] as C_coeff
    BwdKernel->>BwdKernel: grad_probs = C_coeff times tokens_per_expert times grad_aux_loss
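
The backward step at the end of the diagram can be written out as a small reference function. This is a hedged pure-Python sketch: the function name and the `num_tokens` argument are hypothetical, and it assumes the gradient row is identical for every token, which follows from the loss being linear in probs with a per-expert weight of C_coeff times tokens_per_expert.

```python
def aux_loss_backward_reference(coeff_buf, tokens_per_expert, grad_aux_loss, num_tokens):
    """Host-side model of the backward formula (illustrative only).

    coeff_buf: two-slot buffer from the forward pass; only slot 0 (C_coeff) is read.
    tokens_per_expert: list with one count per expert.
    grad_aux_loss: upstream scalar gradient.
    num_tokens: number of rows in probs (hypothetical argument for this sketch).
    """
    c_coeff = coeff_buf[0]
    # grad_probs[t][e] = C_coeff * tokens_per_expert[e] * grad_aux_loss for every t.
    grad_row = [c_coeff * count * grad_aux_loss for count in tokens_per_expert]
    return [list(grad_row) for _ in range(num_tokens)]
```

Because only slot 0 is read here, the accumulator in slot 1 can hold any value by the time the backward runs.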

denera · 2026-04-29T22:16:36Z

/te-ci pytorch

denera · 2026-04-29T22:32:20Z

@ptrendx Greptile's P2 issue seems to stem from the fact that check_shared_memory_capacity_num_experts() is built for checking shared memory size vs. num_experts but we actually compute shared memory size based on num_cols. In practice though, num_experts == num_cols from the way we invoke all the fused router APIs from the framework side. It's not clear to me why the num_experts and num_cols were ever set up as separate parameters for these fused router functions in the first place, but I didn't want to make that change in this PR. I'd like to talk to someone who is familiar with the E2E use cases and understand whether there is ever a possibility of these two being different values somehow before we streamline the function signatures in a separate PR.
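
To make the mismatch concrete, here is a hedged sketch of a capacity check keyed on num_cols, the quantity the shared memory size is actually derived from. The helper name, the 4-byte per-column footprint, and the 48 KiB default limit are all assumptions for illustration and are not taken from the TransformerEngine sources.

```python
def fits_in_shared_memory(num_cols, bytes_per_col=4, limit_bytes=48 * 1024):
    """Return True if a per-column staging buffer fits within limit_bytes.

    Hypothetical helper: bytes_per_col and limit_bytes are placeholders that a
    real check would derive from the kernel's layout and the target device.
    """
    if num_cols <= 0:
        raise ValueError("num_cols must be positive")
    return num_cols * bytes_per_col <= limit_bytes
```

If num_experts and num_cols are always equal at the call sites, a check like this gives the same answer either way. The distinction only matters if the two can ever diverge.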

denera · 2026-05-01T05:25:45Z

/te-ci

denera · 2026-05-05T21:04:57Z

/te-ci

…VIDIA#2758)

* added new implementation of fused_moe_aux_loss_forward kernel
* Fix race condition, type-punning, and namespace bugs in fused_moe_aux_loss_v2 kernel
  - Accumulate into a float buffer instead of atomicAdd-ing directly into aux_loss (which could be fp16/bf16), fixing a buffer overflow and wrong results for non-float dtypes
  - Zero the accumulator on the host before launch to eliminate the race between block 0's init and other blocks' atomicAdds
  - Move kernel into fused_router namespace so symbols resolve correctly
  - Round block size up to a warp multiple for well-defined shuffles
  - Allocate Const_buf with 2 elements to hold both C_coeff and the float accumulator
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* added shared memory check on number of experts
* removed duplicate syncwarp
* updated TE/JAX primitive for fused MoE aux loss to comply with the new V2 API in TE/common
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* added missing syncthreads after atomicAdds
* restored the small 1grid/1block kernel for casting accumulated float result to DataType
* fixed inter-block race on accumulation coefficient
* fixed the intermediate coefficient buffer getting passed onto the backward pass correctly
* removed old kernel, removed _v2 name from new kernel
* removed unused num_experts from kernel
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
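
The commit message mentions rounding the block size up to a warp multiple so that warp shuffles operate on full warps. As a small illustration of that arithmetic only (the helper name is hypothetical, and a warp size of 32 threads is assumed):

```python
WARP_SIZE = 32  # assumed threads per warp


def round_up_to_warp_multiple(num_threads, warp_size=WARP_SIZE):
    """Round a requested thread count up to the next multiple of warp_size."""
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    # Integer ceiling division, then scale back up to a thread count.
    return ((num_threads + warp_size - 1) // warp_size) * warp_size
```

With this rounding, a request for 100 threads becomes 128, so the last warp is fully populated and no shuffle reads from a lane that does not exist.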

denera self-assigned this Mar 13, 2026

denera added the 2.15.0 label Mar 13, 2026

ptrendx added the MoE label Mar 17, 2026

nvMelissa mentioned this pull request Mar 26, 2026

Fused router optimization for GroupedTensor #2457

Closed

ptrendx added this to the 2.15 milestone Apr 23, 2026

denera force-pushed the common/fused-router-aux-loss branch from 1071b6b to 0dcef3b April 24, 2026 18:06

KshitijLakhani removed the 2.15.0 label Apr 24, 2026

denera force-pushed the common/fused-router-aux-loss branch from 53ca925 to 0e503ae April 28, 2026 19:24

denera marked this pull request as ready for review April 28, 2026 19:27

denera removed this from the 2.15 milestone Apr 28, 2026

denera added the 2.16.0 label Apr 28, 2026

denera requested a review from ptrendx April 28, 2026 19:28

greptile-apps Bot reviewed Apr 28, 2026

Comment thread transformer_engine/common/fused_router/fused_moe_aux_loss_v2.cu Outdated

Comment thread transformer_engine/common/fused_router/fused_moe_aux_loss_v2.cu Outdated

Comment thread transformer_engine/common/fused_router/fused_moe_aux_loss_v2.cu Outdated

denera force-pushed the common/fused-router-aux-loss branch 2 times, most recently from 53f044c to c120c6c April 28, 2026 19:59

greptile-apps Bot reviewed Apr 28, 2026

Comment thread transformer_engine/common/fused_router/fused_moe_aux_loss_v2.cu Outdated

Comment thread transformer_engine/common/fused_router/fused_moe_aux_loss_v2.cu Outdated

ptrendx reviewed Apr 28, 2026
