[None][feat] Optimize GDN of Qwen3-Next/3.5; adds BF16 TRTLLM MoE #12557
rosenrodt wants to merge 9 commits into NVIDIA:main from
Conversation
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Three optimizations to eliminate GPU idle bubbles during prefill in
Mamba2Metadata.prepare() for hybrid GDN models (e.g. Qwen3.5):
1. Remove tl.constexpr from num_seqs and N in _cu_seqlens_triton_kernel.
Triton JIT recompiles for each unique constexpr value (~120ms each).
In serving, num_seqs varies every prefill step, causing repeated
recompilation. With dynamic parameters, only one compilation occurs.
2. Accept total_seqlens from the caller to skip the first GPU->CPU sync.
cu_seqlens[-1].item() blocks on all pending GPU work. The caller
(Mamba2Metadata.prepare) already has num_ctx_tokens on the CPU.
3. Compute extra_chunks with pure Python arithmetic on CPU seq_lens
to eliminate the second GPU->CPU sync (cumsum + p[-1].item()).
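Optimization 1 hinges on how Triton specializes kernels: `tl.constexpr` arguments become part of the compiled-kernel cache key, so a value that changes every step defeats the cache. Here is a toy model of that behavior using `lru_cache` as a stand-in for Triton's JIT cache; it is an illustration of the caching pattern, not Triton's actual implementation.

```python
from functools import lru_cache

compilations = 0

@lru_cache(maxsize=None)
def compile_kernel(specialization_key):
    # Stand-in for Triton's JIT cache: one "compilation" (~120 ms in the
    # PR's measurements) happens per distinct specialization key.
    global compilations
    compilations += 1
    return f"kernel<{specialization_key}>"

# Before the fix: num_seqs was tl.constexpr, so every new batch size
# appears in the cache key and forces a recompile.
for num_seqs in [1, 7, 3, 12, 7, 5]:
    compile_kernel(("num_seqs", num_seqs))
before = compilations  # 5 distinct values -> 5 compilations

# After the fix: num_seqs is a dynamic kernel argument, so only the truly
# static parameters remain in the key and a single compilation is reused.
compilations = 0
compile_kernel.cache_clear()
for _ in [1, 7, 3, 12, 7, 5]:
    compile_kernel(("static_shapes_only",))
after = compilations  # 1 compilation, reused for every batch size
```

With the dynamic parameter, all six serving steps hit the same cached kernel, which is why the recompilation cost disappears after the first prefill.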
Before: _prepare_inputs ~120-460ms per prefill step (Triton recompile +
GPU sync bubbles)
After: _prepare_inputs ~1-2ms steady state
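Optimizations 2 and 3 share one idea: any count derivable from the CPU-resident seq_lens can be computed with plain Python integers, avoiding a device-side cumsum followed by a blocking `.item()`. The sketch below is illustrative only; the function name and the exact chunk-counting rule are assumptions, not the actual Mamba2Metadata code.

```python
def prepare_counts_cpu(seq_lens, chunk_size=256):
    """Compute token/chunk totals from CPU-side seq_lens with pure Python
    arithmetic, replacing a GPU cumsum plus a blocking tensor.item()
    device->host sync. (Illustrative; not the PR's exact formula.)"""
    total_seqlens = sum(seq_lens)  # replaces cu_seqlens[-1].item()
    # ceil(n / chunk_size) per sequence: a sequence not aligned to a
    # chunk boundary contributes an extra partial chunk.
    num_chunks = sum(-(-n // chunk_size) for n in seq_lens)
    # Chunks beyond what a single packed stream of the same total length
    # would need (assumed definition, for illustration).
    extra_chunks = num_chunks - (total_seqlens // chunk_size)
    return total_seqlens, num_chunks, extra_chunks

total, chunks, extra = prepare_counts_cpu([3, 256, 300], chunk_size=256)
```

Because no tensor leaves the device here, the host never stalls waiting for pending GPU work, which is where the 100ms-scale bubbles came from.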
Verified: 9200+ random equivalence tests + e2e serving assertion with
1000 requests (0 mismatches). GSM8K accuracy unchanged (90.07% on full
1319 samples).
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
- update chunked Gated Delta Rule prefill to use indexed in-kernel state updates
- remove explicit Qwen3Next prefill state gather/scatter in forward_extend
- retune causalConv1d forward launch selection for varlen and short sequences

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
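The first two bullets replace a gather -> update -> scatter round trip with the kernel indexing the state cache directly. A minimal sketch of the difference, using plain Python lists and made-up names (not the PR's actual kernels or state layout):

```python
def update_gather_scatter(state_cache, slot_idx, deltas):
    # Old pattern: materialize a contiguous copy of the active states,
    # update it, then copy the results back to their slots.
    gathered = [state_cache[i] for i in slot_idx]        # gather (copy)
    updated = [s + d for s, d in zip(gathered, deltas)]  # update
    for i, s in zip(slot_idx, updated):                  # scatter (copy)
        state_cache[i] = s

def update_indexed(state_cache, slot_idx, deltas):
    # New pattern: read and write each slot in place through its index,
    # so no temporary gathered buffer is allocated or copied back.
    for i, d in zip(slot_idx, deltas):
        state_cache[i] += d

# Both paths leave the state cache in the same final state.
a = [0.0] * 6
b = [0.0] * 6
update_gather_scatter(a, [4, 1], [1.5, 2.5])
update_indexed(b, [4, 1], [1.5, 2.5])
```

In a GPU kernel the indexed form additionally saves two full passes over the state tensor and the intermediate allocation, which matters when the recurrent state is large.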
Force-pushed from e2c4962 to 85ec854 (compare)
/bot run --disable-fail-fast
PR_Github #40455 Bot args parsing error: usage: /bot [-h]
PR_Github #40456 [ run ] triggered by Bot. Commit:
PR_Github #40456 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40481 [ run ] triggered by Bot. Commit:
PR_Github #40481 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40500 [ run ] triggered by Bot. Commit:
- keep decode qkv views and make the fused recurrent kernel stride-aware
- restore the decode tile choice that wins on the representative bs256 pure-decode benchmark

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
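"Stride-aware" here means the kernel addresses its input through explicit strides instead of assuming contiguity, so a q/k/v view sliced out of a fused qkv buffer can be consumed without a `.contiguous()` copy. A hedged sketch of the addressing pattern over a flat buffer; all names and the toy row-sum computation are illustrative, not the PR's actual kernel signature:

```python
def row_sums_strided(flat, base, row_stride, n_rows, n_cols):
    """Sum each logical row of a (possibly non-contiguous) 2D view,
    addressing the flat buffer as base + r * row_stride + c."""
    return [
        sum(flat[base + r * row_stride + c] for c in range(n_cols))
        for r in range(n_rows)
    ]

# A fused "qkv" buffer: 2 token rows of [q0 q1 | k0 k1 | v0 v1].
qkv = [1, 2, 10, 20, 100, 200,
       3, 4, 30, 40, 300, 400]

# The q view is width-2 rows starting at offset 0 with row stride 6.
# A stride-aware kernel consumes this view directly; a contiguity-only
# kernel would first need a copy of the q columns.
q_sums = row_sums_strided(qkv, base=0, row_stride=6, n_rows=2, n_cols=2)
```

The same call with `base=2` would walk the k view of the buffer, again with no copy.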
Force-pushed from 162777e to 6b67c8e (compare)
/bot run --disable-fail-fast
PR_Github #40502 [ run ] triggered by Bot. Commit:
PR_Github #40502 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40530 [ run ] triggered by Bot. Commit:
PR_Github #40530 [ run ] completed with state
cc @VALLIS-NERIA @nv-guomingz as this PR modifies some of the GDN and mamba state kernels
@coderabbitai summary
Description
Perf
- Model: Qwen3.5-35B-A3B, BF16, TP1
- Workload: ISL/OSL = 4k/1k synthetic (ignore_eos=True)
- Tested on B200
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.