
[None][feat] Optimize GDN of Qwen3-Next/3.5; adds BF16 TRTLLM MoE#12557

Draft
rosenrodt wants to merge 9 commits into NVIDIA:main from rosenrodt:qwen3next-3_5-pyt-perf

Conversation

@rosenrodt
Collaborator

@rosenrodt rosenrodt commented Mar 26, 2026


Description

  • Enable BF16 TRTLLM MoE through FlashInfer in the PyTorch backend.
  • Fix the Mamba2 metadata prefill bubble in chunked-prefill serving (by @Wong4j).
  • Improve Gated Delta Net kernel performance:
    • Tune causal-conv launch configurations for varlen / short-sequence workloads.
    • Perform indexed state updates in place inside the kernel.
    • Keep decode q/k/v tensors as views instead of materializing a new packed tensor.
    • Change the raster order of fused_sigmoid_gating_delta_rule_update_kernel.
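The raster-order change in the last bullet can be illustrated with a small sketch (not the actual Triton kernel; all names and dimensions here are hypothetical). The idea is simply which grid axis varies fastest as linear program ids increase, which affects cache locality for neighboring thread blocks:

```python
# Hypothetical sketch: remapping a linear program id to 2D grid coordinates.
# Whether the head axis or the sequence-block axis varies fastest determines
# which tiles adjacent program ids touch, and hence L2 reuse.

def linear_to_grid(pid, num_heads, num_seq_blocks, head_major=True):
    """Map a linear program id to (head, seq_block) coordinates."""
    if head_major:
        # heads vary fastest: consecutive pids share the same sequence tile
        return pid % num_heads, pid // num_heads
    # sequence blocks vary fastest: consecutive pids share the same head
    return pid // num_seq_blocks, pid % num_seq_blocks

# Enumerate the full 4x3 grid under the head-major raster order.
grid = [linear_to_grid(p, num_heads=4, num_seq_blocks=3) for p in range(12)]
```

Which order wins depends on the tile shapes and working-set sizes of the specific kernel, so this is only meant to show the knob being tuned, not the chosen order.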

Perf

Qwen3.5-35B-A3B BF16 TP1
ISL/OSL=4k/1k synthetic (ignore_eos=True)
Tested on B200

| Concurrency | CUTLASS MoE (baseline) | CUTLASS MoE (this PR) | TRTLLM MoE (this PR) | Speedup (TRTLLM vs. baseline) |
|---:|---:|---:|---:|---:|
| 1 | 180.42 | 178.74 | 210.98 | 1.17 |
| 8 | 1039.22 | 1077.51 | 1190.87 | 1.15 |
| 64 | 3786.6 | 3995.66 | 4307.72 | 1.14 |
| 128 | 4840.73 | 5272.36 | 5631.7 | 1.16 |
| 256 | 5232.83 | 6179.34 | 6485.89 | 1.24 |
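The Speedup column is the ratio of the TRTLLM MoE (this PR) throughput to the CUTLASS MoE baseline, which can be checked directly:

```python
# Reproduce the Speedup column: TRTLLM MoE (this PR) / CUTLASS MoE (baseline).
baseline = [180.42, 1039.22, 3786.6, 4840.73, 5232.83]
trtllm = [210.98, 1190.87, 4307.72, 5631.7, 6485.89]

speedup = [round(t / b, 2) for t, b in zip(trtllm, baseline)]
print(speedup)  # matches the table's Speedup column
```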

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

rosenrodt and others added 4 commits March 25, 2026 22:19
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Three optimizations to eliminate GPU idle bubbles during prefill in
Mamba2Metadata.prepare() for hybrid GDN models (e.g. Qwen3.5):

1. Remove tl.constexpr from num_seqs and N in _cu_seqlens_triton_kernel.
   Triton JIT recompiles for each unique constexpr value (~120ms each).
   In serving, num_seqs varies every prefill step, causing repeated
   recompilation. With dynamic parameters, only one compilation occurs.

2. Accept total_seqlens from caller to skip first GPU->CPU sync.
   cu_seqlens[-1].item() blocked on all pending GPU work. The caller
   (Mamba2Metadata.prepare) already has num_ctx_tokens on CPU.

3. Compute extra_chunks with pure Python arithmetic on CPU seq_lens
   to eliminate the second GPU->CPU sync (cumsum + p[-1].item()).

Before: _prepare_inputs ~120-460ms per prefill step (Triton recompile +
        GPU sync bubbles)
After:  _prepare_inputs ~1-2ms steady state

Verified: 9200+ random equivalence tests + e2e serving assertion with
1000 requests (0 mismatches). GSM8K accuracy unchanged (90.07% on full
1319 samples).

Signed-off-by: Shijie Wang <jaywan@nvidia.com>
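Optimizations 2 and 3 above can be sketched in pure CPU Python (hypothetical helper; names and the exact extra-chunk semantics are assumptions, with `seq_lens` already on the CPU as in Mamba2Metadata.prepare):

```python
# Hypothetical sketch: compute cu_seqlens boundaries, the total token count,
# and a per-batch extra-chunk count with CPU arithmetic only, so neither a
# cu_seqlens[-1].item() nor a cumsum-based GPU->CPU sync is needed.
from itertools import accumulate

def cpu_prefill_metadata(seq_lens, chunk_size):
    # prefix sums over CPU-resident sequence lengths
    cu_seqlens = [0] + list(accumulate(seq_lens))
    total_seqlens = cu_seqlens[-1]  # known without touching the GPU
    # assumption: one "base" chunk per sequence; anything beyond it is extra
    extra_chunks = sum((l + chunk_size - 1) // chunk_size - 1 for l in seq_lens)
    return cu_seqlens, total_seqlens, extra_chunks
```

Only the boundary values are needed on the host, so the GPU-side tensors can still be filled asynchronously by the (now unspecialized) Triton kernel.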
- update chunked Gated Delta Rule prefill to use indexed in-kernel state updates
- remove explicit Qwen3Next prefill state gather/scatter in forward_extend
- retune causalConv1d forward launch selection for varlen and short sequences

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
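The gather/scatter removal in the first two bullets can be illustrated with a small numpy sketch (shapes and names are hypothetical, not the actual kernel code):

```python
# Hypothetical sketch of replacing an explicit gather -> compute -> scatter
# round trip with an in-place indexed update of a recurrent state cache.
import numpy as np

def update_states_gather_scatter(cache, idx, delta):
    states = cache[idx]          # gather: fancy indexing allocates a temporary
    states += delta              # compute on the temporary
    cache[idx] = states          # scatter the result back
    return cache

def update_states_in_place(cache, idx, delta):
    # indexed in-place update, no temporary; with unique indices this
    # produces the same result as the gather/scatter version
    np.add.at(cache, idx, delta)
    return cache
```

Doing the update in-kernel avoids both the temporary allocation and the extra memory traffic of the round trip.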
@rosenrodt
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40455 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --disable-fast-fail

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40456 [ run ] triggered by Bot. Commit: 85ec854 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40456 [ run ] completed with state SUCCESS. Commit: 85ec854
/LLM/main/L0_MergeRequest_PR pipeline #31545 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40481 [ run ] triggered by Bot. Commit: 252269f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40481 [ run ] completed with state SUCCESS. Commit: 252269f
/LLM/main/L0_MergeRequest_PR pipeline #31569 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40500 [ run ] triggered by Bot. Commit: 162777e Link to invocation

- keep decode qkv views and make the fused recurrent kernel stride-aware
- restore the decode tile choice that wins on the representative bs256 pure-decode benchmark

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
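The view-keeping change above can be sketched with numpy (dimensions are hypothetical): the q/k/v slices share storage with the packed buffer, and a stride-aware kernel can consume them directly instead of requiring a freshly materialized packed tensor.

```python
# Hypothetical sketch: decode q/k/v as strided views into one packed buffer.
import numpy as np

batch, d_q, d_k, d_v = 4, 16, 16, 8
packed = np.zeros((batch, d_q + d_k + d_v), dtype=np.float32)

q = packed[:, :d_q]              # view: shares memory with `packed`
k = packed[:, d_q:d_q + d_k]     # view
v = packed[:, d_q + d_k:]        # view: no copy, just different strides
```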
@rosenrodt rosenrodt force-pushed the qwen3next-3_5-pyt-perf branch from 162777e to 6b67c8e Compare March 27, 2026 13:55
@rosenrodt
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40502 [ run ] triggered by Bot. Commit: 6b67c8e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40502 [ run ] completed with state SUCCESS. Commit: 6b67c8e
/LLM/main/L0_MergeRequest_PR pipeline #31590 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
@rosenrodt
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40530 [ run ] triggered by Bot. Commit: 5b0a3fb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40530 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

@rosenrodt
Collaborator Author

cc @VALLIS-NERIA @nv-guomingz, as this PR modifies some of the GDN and Mamba state kernels.

