[None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron …#13773

Merged
farazkh80 merged 9 commits into NVIDIA:main from farazkh80:faraz/b12x-flashinfer-moe-pr
May 25, 2026
Collaborator

@farazkh80 farazkh80 commented May 5, 2026

Description

Adds a new MoE backend, FlashInferNvfp4Sm12xFusedMoE, for Nemotron-Super-120B-NVFP4 on SM120 (RTX 5090 / RTX PRO 6000 / GB202) and SM121 (DGX Spark / GB10). It is auto-selected on the CUTLASS path when quant_algo == NVFP4, SM is 120/121, and flashinfer is importable — otherwise falls back to plain CutlassFusedMoE. There is no new user-facing moe_backend value; the heuristic in get_moe_cls() mirrors the existing MEGAMOE_DEEPGEMM can_implement-gated pattern.
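The selection rule above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in create_moe.py: `select_moe_cls`, `flashinfer_available`, and the two stand-in classes are hypothetical; only the class names and the three conditions (NVFP4, SM 120/121, flashinfer importable) come from the description.

```python
import importlib.util
from typing import Optional


# Hypothetical stand-ins; the real classes live in tensorrt_llm.
class CutlassFusedMoE:
    pass


class FlashInferNvfp4Sm12xFusedMoE(CutlassFusedMoE):
    pass


def flashinfer_available() -> bool:
    """Assumed check: True if a top-level `flashinfer` package can be located."""
    return importlib.util.find_spec("flashinfer") is not None


def select_moe_cls(quant_algo: Optional[str], sm_version: int,
                   has_flashinfer: Optional[bool] = None) -> type:
    """Promote to the b12x-backed class only when all three conditions hold."""
    if has_flashinfer is None:
        has_flashinfer = flashinfer_available()
    if quant_algo == "NVFP4" and sm_version in (120, 121) and has_flashinfer:
        return FlashInferNvfp4Sm12xFusedMoE
    # Any failed condition falls back to the plain CUTLASS backend.
    return CutlassFusedMoE
```

Because the promoted class subclasses the fallback, callers that only expect a CutlassFusedMoE keep working either way.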

Hybrid composition:

  • Prefill (x.shape[0] >= 64) routes through the inherited CutlassFusedMoE NVFP4 GroupGEMM. b12x's 12-CTA-per-token MMA pattern is suboptimal at large m.
  • Decode (x.shape[0] < 64) dispatches to FlashInfer's B12xMoEWrapper.run. Beats CUTLASS by ~17.6 % TPOT.
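The hybrid split can be pictured as a single threshold check on the token count. The sketch below is illustrative only: `hybrid_run_moe` and the two callables are hypothetical placeholders for the inherited CUTLASS path and the B12xMoEWrapper call; the threshold of 64 is the one stated above.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# From the PR description: x.shape[0] >= 64 is treated as prefill.
PREFILL_TOKEN_THRESHOLD = 64


def hybrid_run_moe(num_tokens: int,
                   cutlass_path: Callable[[], T],
                   b12x_path: Callable[[], T]) -> T:
    """Route large-m batches to CUTLASS and small-m batches to b12x."""
    if num_tokens >= PREFILL_TOKEN_THRESHOLD:
        return cutlass_path()  # prefill: NVFP4 GroupGEMM
    return b12x_path()         # decode: FlashInfer b12x wrapper


# Usage with placeholder callables that just report which path ran.
print(hybrid_run_moe(2048, lambda: "cutlass", lambda: "b12x"))
print(hybrid_run_moe(1, lambda: "cutlass", lambda: "b12x"))
```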

Hardware constraints (rejected by can_implement or __init__)

NVFP4 only; SM120/121 only; bf16/fp16 activation only; ep_size == 1 only; no MoE alltoall; activations limited to Relu2 and Swiglu; no swiglu_gptoss_style. Fp4QuantizedTensor input rejected on the decode path (b12x quantizes activations internally).
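A rough sketch of how such guard rails might be expressed as a single check returning an (ok, reason) pair. The function name, its parameters, and the string values are assumptions made for illustration; the real backend splits these checks between can_implement and __init__, and its signature may differ.

```python
from typing import Optional, Tuple

SUPPORTED_SM = {120, 121}
SUPPORTED_ACT_DTYPES = {"bfloat16", "float16"}
SUPPORTED_ACTIVATIONS = {"Relu2", "Swiglu"}


def check_constraints(*, quant_algo: str, sm_version: int, act_dtype: str,
                      ep_size: int, use_alltoall: bool, activation: str,
                      swiglu_gptoss_style: bool = False
                      ) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if supported, else (False, reason). Hypothetical API."""
    if quant_algo != "NVFP4":
        return False, "NVFP4 only"
    if sm_version not in SUPPORTED_SM:
        return False, f"SM{sm_version} unsupported; SM120/121 only"
    if act_dtype not in SUPPORTED_ACT_DTYPES:
        return False, "bf16/fp16 activations only"
    if ep_size != 1:
        return False, "ep_size == 1 only"
    if use_alltoall:
        return False, "MoE alltoall not supported"
    if activation not in SUPPORTED_ACTIVATIONS:
        return False, "activation must be Relu2 or Swiglu"
    if swiglu_gptoss_style:
        return False, "swiglu_gptoss_style not supported"
    return True, None
```

Returning a reason string alongside the boolean lets the caller log why a fallback to the plain CUTLASS backend happened.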

Performance

Single RTX PRO 6000 Blackwell (SM120, 97 GB), Nemotron-Super-120B-NVFP4, TRT-LLM 1.3.0rc14, FlashInfer 0.6.8, cutlass-dsl 4.5.0; ISL=2048, OSL=1024, 5 reqs, conc=1, KV reuse off, cuda_graph_config: {batch_sizes: [1]}, max_num_tokens=2048.

| Variant | Total tput (tok/s) | TTFT P50 (ms) | TPOT P50 (ms) |
|---|---|---|---|
| moe_backend: CUTLASS (pre-PR) | 70.58 | 154.67 | 13.97 |
| FlashInfer-only (pure b12x) | 85.32 | 229.26 | 11.50 |
| moe_backend: CUTLASS (this PR, auto-promoted) | 85.92 | 154.53 | 11.49 |

Matches CUTLASS on TTFT (prefill on CUTLASS), matches pure b12x on TPOT (decode on b12x), beats both on total throughput. Tokens/Watt +20.4 % vs CUTLASS.
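For readers who want to re-derive the relative changes, the arithmetic from the P50 rows of the table is sketched below. The variable names are illustrative. Note that the TPOT change computed this way comes out near 17.8 %, slightly different from the ~17.6 % quoted above, which presumably uses a different aggregate or rounding.

```python
# Figures copied from the performance table (pre-PR CUTLASS vs. auto-promoted hybrid).
baseline = {"tput": 70.58, "ttft_ms": 154.67, "tpot_ms": 13.97}
hybrid = {"tput": 85.92, "ttft_ms": 154.53, "tpot_ms": 11.49}


def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


tput_gain = pct_change(hybrid["tput"], baseline["tput"])
ttft_change = pct_change(hybrid["ttft_ms"], baseline["ttft_ms"])
tpot_change = pct_change(hybrid["tpot_ms"], baseline["tpot_ms"])

print(f"throughput: {tput_gain:+.1f} %")
print(f"TTFT P50:   {ttft_change:+.2f} %")
print(f"TPOT P50:   {tpot_change:+.1f} %")
```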

GSM8K accuracy parity (1319 samples, 8-shot CoT, greedy)

| Backend | Accuracy | Δ vs CUTLASS |
|---|---|---|
| CUTLASS (baseline) | 92.418 % | |
| CUTLASS (auto-promoted, this PR) | 92.267 % | −0.151 pp |

Statistically indistinguishable (≈ 2 questions out of 1319; well within 95 % binomial CI of ±1.5 pp).
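The claim can be checked with a quick normal-approximation (Wald) interval. This is a minimal sketch that assumes the accuracies are simple proportions over the 1319 samples; it is not the evaluation harness used for the PR.

```python
import math


def binomial_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


n = 1319
baseline_acc = 0.92418   # CUTLASS (baseline)
promoted_acc = 0.92267   # CUTLASS (auto-promoted, this PR)

half_width_pp = binomial_ci_half_width(baseline_acc, n) * 100.0
delta_pp = (promoted_acc - baseline_acc) * 100.0
delta_questions = round(abs(promoted_acc - baseline_acc) * n)

print(f"95% CI half-width: {half_width_pp:.2f} pp")
print(f"observed delta:    {delta_pp:+.3f} pp ({delta_questions} questions)")
```

The observed gap is roughly a tenth of the interval half-width, which is why a two-question difference is treated as noise.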

PR Checklist

  • PR description clearly explains what and why.
  • PR follows TRT-LLM CODING GUIDELINES.
  • Test cases provided.
  • No new dependencies added.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Added FlashInfer as a selectable MoE backend option in model configuration for optimized Mixture of Experts inference on SM120/SM121 GPUs with NVFP4 quantization
  • Documentation

    • Updated developer guide with FlashInfer MoE backend capabilities matrix and supported constraints documentation

@farazkh80 farazkh80 requested review from a team as code owners May 5, 2026 18:38
@farazkh80 farazkh80 requested review from xxi-nv and zhenhuaw-me May 5, 2026 18:38
@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

This PR adds a new MoE backend option (FlashInferFusedMoE) that uses FlashInfer's B12xMoE kernel for NVFP4-quantized inference on Blackwell hardware (SM120/SM121). The implementation includes the core class, backend selection logic, configuration updates, and negative-path test coverage.

Changes

FlashInfer MoE Backend Addition

| Layer / File(s) | Summary |
|---|---|
| Configuration & API: tensorrt_llm/llmapi/llm_args.py | MoeConfig.backend literal type updated to include "FLASHINFER" backend option. |
| Core Implementation: tensorrt_llm/_torch/modules/fused_moe/fused_moe_flashinfer.py | New FlashInferFusedMoE class (subclass of CutlassFusedMoE) implementing FlashInfer b12x MoE compute path. Validates SM120/SM121 and NVFP4-only quantization; post_load_weights() converts per-expert FP8/FP4 scales into b12x MMA layout via convert_sf_to_mma_layout, builds FlashInfer-compatible weight tensors with packed uint8 views, and instantiates B12xMoEWrapper. Input quantization passes through; run_moe dispatches to the wrapper. |
| Backend Selection & Dispatch: tensorrt_llm/_torch/modules/fused_moe/create_moe.py | get_moe_cls() gains "FLASHINFER" branch with validation (requires NVFP4 quant, rejects unsupported SM versions). create_moe_backend() dispatch changed from exact equality check to issubclass(moe_cls, CutlassFusedMoE) to support subclasses. |
| Module Exports: tensorrt_llm/_torch/modules/fused_moe/__init__.py | Import and export FlashInferFusedMoE in module's __all__. |
| Documentation & Tests: tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md, tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py | Developer guide documents FlashInferFusedMoE in backend inventory and constraint section (SM/NVFP4 gating, lazy wrapper build, scale/layout conversions). Test module validates can_implement() rejection of unsupported SM/quantization/activation combos and get_moe_cls() error handling and selection. |

Sequence Diagram

sequenceDiagram
    participant User
    participant MoeFactory as MoE Factory<br/>(create_moe)
    participant FI as FlashInferFusedMoE
    participant B12x as B12xMoEWrapper<br/>(FlashInfer)
    participant Weights as Weight Storage

    User->>MoeFactory: get_moe_cls(backend="FLASHINFER", quant_config)
    MoeFactory->>MoeFactory: Validate NVFP4 & SM versions
    MoeFactory-->>User: Return FlashInferFusedMoE class

    User->>FI: __init__(model_config)
    FI->>FI: Validate ep_size==1, no alltoall
    FI-->>User: Instance ready

    User->>Weights: Load model weights
    User->>FI: post_load_weights()
    FI->>FI: Import B12xMoEWrapper
    FI->>FI: Convert scales to MMA layout
    FI->>FI: Build FP4 uint8 weight views
    FI->>B12x: B12xMoEWrapper(experts, weights)
    B12x-->>FI: Wrapper instantiated
    FI-->>User: Initialization complete

    loop Forward Pass
        User->>FI: quantize_input(x)
        FI-->>User: (x, None) — passthrough
        User->>FI: run_moe(x, routing_tensors)
        FI->>B12x: run(x, token_selected_experts, ...)
        B12x->>B12x: Compute FP4 MoE with internal quant
        B12x-->>FI: Output tensor
        FI-->>User: MoE result
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the main change: adding a FlashInfer MoE backend for NVFP4 quantization on SM120/SM121 processors. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description thoroughly explains the new FlashInfer NVFP4 MoE backend, its hardware constraints, performance metrics, accuracy parity, and includes a completed checklist. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)

216-238: ⚠️ Potential issue | 🔴 Critical

The issubclass() dispatch at line 216 makes the CuteDslFusedMoE and DeepGemmFusedMoE branches unreachable and will cause runtime errors.

Both CuteDslFusedMoE and DeepGemmFusedMoE inherit from CutlassFusedMoE, so they are now captured by the broader issubclass(moe_cls, CutlassFusedMoE) check before their exact-class branches (lines 269 and 285) can execute. The generic Cutlass path then attempts to pass arguments like swiglu_alpha, swiglu_beta, and swiglu_limit that these subclasses' narrower constructors do not accept, resulting in unexpected keyword argument errors. Only FlashInferFusedMoE is compatible because its __init__(self, *args, **kwargs) forwards all arguments.

Either revert to exact-class checks (moe_cls ==) for all three subclasses, or verify that the constructors accept the full argument set being passed.
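The failure mode described in this finding can be reproduced with toy classes. The sketch below is a simplified model, not the create_moe.py code: the constructors are hypothetical and reduced to the keyword arguments named above, and `create_broad` / `create_ordered` are illustrative names for the two dispatch orders.

```python
class CutlassFusedMoE:
    def __init__(self, *, swiglu_alpha=None, swiglu_beta=None, swiglu_limit=None):
        self.swiglu_alpha = swiglu_alpha
        self.swiglu_beta = swiglu_beta
        self.swiglu_limit = swiglu_limit


class CuteDslFusedMoE(CutlassFusedMoE):
    def __init__(self):  # narrower constructor: no swiglu_* keywords
        super().__init__()


class FlashInferFusedMoE(CutlassFusedMoE):
    def __init__(self, *args, **kwargs):  # forwards everything
        super().__init__(*args, **kwargs)


def create_broad(moe_cls):
    """Broad check first: the exact-class branch below is never reached."""
    if issubclass(moe_cls, CutlassFusedMoE):
        return moe_cls(swiglu_alpha=1.0, swiglu_beta=0.0, swiglu_limit=None)
    if moe_cls == CuteDslFusedMoE:  # unreachable for any CutlassFusedMoE subclass
        return moe_cls()
    raise ValueError(f"unsupported class {moe_cls!r}")


def create_ordered(moe_cls):
    """Narrow checks first, then the generic subclass path."""
    if moe_cls is CuteDslFusedMoE:
        return moe_cls()
    if issubclass(moe_cls, CutlassFusedMoE):
        return moe_cls(swiglu_alpha=1.0, swiglu_beta=0.0, swiglu_limit=None)
    raise ValueError(f"unsupported class {moe_cls!r}")


try:
    create_broad(CuteDslFusedMoE)
except TypeError as exc:
    print("broad dispatch fails:", exc)
print("ordered dispatch works:", type(create_ordered(CuteDslFusedMoE)).__name__)
```

Either ordering handles the subclass with the forwarding constructor; only the narrower constructor exposes the problem.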

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 216 - 238,
The dispatch using issubclass(moe_cls, CutlassFusedMoE) wrongly catches
CuteDslFusedMoE and DeepGemmFusedMoE (both subclassing CutlassFusedMoE) and
passes unsupported kwargs (swiglu_alpha, swiglu_beta, swiglu_limit), causing
runtime errors; fix by making the dispatch check exact-class comparisons for
CuteDslFusedMoE and DeepGemmFusedMoE (i.e. moe_cls == CuteDslFusedMoE and
moe_cls == DeepGemmFusedMoE) or move those subclass-specific branches before the
generic issubclass(CutlassFusedMoE) branch so their narrower constructors run,
ensuring only FlashInferFusedMoE (which accepts arbitrary kwargs) is handled by
the broad issubclass(CutlassFusedMoE) path.
🧹 Nitpick comments (1)
tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py (1)

34-131: ⚡ Quick win

Cover the remaining no-GPU guard rails too.

This file exercises selection-time gating, but the backend's other hard rejects are still untested: ep_size != 1, alltoall enabled, Fp4QuantizedTensor input, and x_sf is not None. Those are pure-Python validation paths, so adding them here would catch the runtime guard rails most likely to regress during refactors.

As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py` around
lines 34 - 131, Add unit tests in
tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py that exercise
the remaining pure-Python guard rails on selection/creation: call
FlashInferFusedMoE.can_implement (and get_moe_cls where appropriate) with
ep_size != 1, with alltoall enabled, with an input type simulated as
Fp4QuantizedTensor, and with x_sf set (not None) to assert they return (or
raise) the expected hard rejects; reference FlashInferFusedMoE.can_implement and
get_moe_cls to locate validation logic, patch get_sm_version to a supported SM
(e.g., 120) so only these specific guards are hit, and assert the returned ok is
False and/or get_moe_cls raises ValueError matching the relevant reason text for
each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md`:
- Around line 145-163: Add the runtime dependency floor for the FlashInfer
backend to this section: state the required minimum versions of the external
packages (e.g., flashinfer and the CUTLASS/DSL package) and any toolchain
constraints so users know the exact package combo needed before selecting
FLASHINFER; place this note alongside the "FlashInferFusedMoE — additional
constraints" paragraph and reference the selection point
(get_moe_cls("FLASHINFER", ...)) and the lazy wrapper initialization in
post_load_weights() so developers see the dependency requirement when reading
the backend constraints.

In `@tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py`:
- Around line 15-21: Add integration test definitions that exercise the
FlashInferFusedMoE backend on SM120/SM121 hardware: create a perf test entry
named with the l0_* pattern (so it runs in pre-merge CI) that sets
moe_backend=FLASHINFER and targets the SM120/SM121 GPU variant, and add
corresponding scheduled QA entries following the llm_perf_* naming convention
that also specify moe_backend=FLASHINFER/FlashInferFusedMoE and the same GPU
targets; ensure the new YAML entries include the job name, test selector,
hardware requirements, and any perf thresholds so the backend is executed in
both pre-merge CI and scheduled QA runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8dc4a17c-d10b-410b-8630-d20e1b303ed2

📥 Commits

Reviewing files that changed from the base of the PR and between 2da7a97 and 57cdae4.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md
  • tensorrt_llm/_torch/modules/fused_moe/__init__.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_flashinfer.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py

Comment thread tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md Outdated
Comment thread tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py Outdated
farazkh80 added a commit to farazkh80/TensorRT-LLM that referenced this pull request May 8, 2026
… diff, helper scripts

Captures the investigation findings + reproducible artifacts for the
``B12xLukeFusedMoE`` backend committed in 374b483. Lives under
.claude_docs/ rather than docs/ since the artifacts are working-files
(bench logs, container scripts, PR body) rather than user-facing docs.

Files:

- ``B12X_LUKE_RESULTS.md``: bench numbers vs the FI hybrid baseline
  (TPOT 12.97 ms vs FI 11.49 ms = +12.8% regression) + token-parity
  result + success-criteria table + root-cause writeup + follow-up probes.
- ``FI_VS_LUKE_DELTA.md``: side-by-side architectural comparison of
  flashinfer-vendored b12x and lukealonso/b12x master HEAD. Includes
  bisect data (986a405a / c9cc90ec / 1378cea7 all ~13.5 ms TPOT, gap to
  FI predates published luke history), trace-probe evidence (luke's
  MoEMicroKernel does fire; the slowdown is intrinsic), and the kernel-
  source diff that locates the root cause: FI uses Blackwell's warp-
  specialized producer/consumer pattern (5 warps/CTA: 4 MMA + 1 dedicated
  TMA-load with cute.arch.setmaxregister_increase/_decrease register
  repartitioning), luke uses a flat 16-warp design with no producer/
  consumer split. Concludes that closing the gap requires either an
  upstream rewrite or hand-porting FI's kernel (~12k lines of CuTe DSL
  cascade — out of scope here).
- ``PR_BODY_b12x_luke.md``: PR description used when opening the
  stacked PR against ``faraz/b12x-flashinfer-moe-pr`` on the
  farazkh80/TensorRT-LLM fork.
- ``start_runtime_container_b12x_luke.sh``: docker run helper that
  installs flashinfer + lukealonso/b12x @ 1378cea7 + cutlass-dsl 4.4.2
  trio + cache_dit (rc14 dep absent in rc12 base) + LD_LIBRARY_PATH
  fix-up so docker exec inherits libnvonnxparser.
- ``sync_b12x_luke_files.sh``: syncs edited fused_moe submodule files
  from the host source tree into the container's wheel-installed
  site-packages (the rc12 base image's tensorrt_llm has new imports
  like cache_dit that block PYTHONPATH overlay; targeted file copy is
  safer).
- ``bench_kvoff_b12x_luke.yml``: bench yaml clone of
  bench_kvoff_flashinfer.yml with moe_config.backend swapped to
  B12X_LUKE.
- ``parity_check_b12x_luke.py``: token-parity script with --moe-backend
  flag for FLASHINFER vs B12X_LUKE A/B (skipped this run, kept for
  future use).
- ``_patch_tp_moe_trace.py``: idempotent patch that injects [trace-luke]
  prints into b12x.integration.tp_moe._launch_compact_static, used to
  prove luke's micro path actually fires (not a fall-through bug). Not
  needed at runtime; kept for reproducibility.

The bench logs themselves and parent-PR (NVIDIA#13773) artifacts
(HYBRID_DOC.md, HYBRID_RESULTS.md, etc.) are intentionally NOT
committed: bench logs live under /home/farazkh_scratch/logs/ and
parent-PR docs belong on the parent branch.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
farazkh80 added a commit to farazkh80/TensorRT-LLM that referenced this pull request May 8, 2026
…ide Draft PR

The fork-side PR_BODY_b12x_luke.md targets the GitHub UI on
farazkh80/TensorRT-LLM and assumes a stacked base of `b12x-hybrid`.
The NVIDIA-side PR (filed as Draft against NVIDIA:main) carries the
3 NVIDIA#13773 commits as overlap, so its body needs:

- A prominent DRAFT-blocked-on-NVIDIA#13773 header at the top.
- A condensed framing that names the perf regression upfront.
- Same bench data + warp-spec swap evidence + recommendation.

Open URL:
  https://github.com/NVIDIA/TensorRT-LLM/compare/main...farazkh80:b12x-luke-decode?expand=1

Use the "Create draft pull request" dropdown.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
@farazkh80 farazkh80 force-pushed the faraz/b12x-flashinfer-moe-pr branch from b4c7031 to 1570171 Compare May 8, 2026 21:40
@tbraun96

Disclosure: I work on Atlas. We've been running NVFP4 MoE on sm_120 / sm_121 for a while and hit the FlashInfer side of this gap pretty hard. Sharing the patch in case it's useful prior art for the SM120/SM121 backend wiring here.

Two things that bit us on FlashInfer + NVFP4 on consumer Blackwell that may or may not already be on your radar:

1. E2M1 conversion PTX is Hopper/SM10-only. FlashInfer's CUTLASS headers gate CUDA_PTX_FP4FP6_CVT_ENABLED for SM ranges that exclude sm_120/sm_121. The kernel still emits .tile::scatter4 PTX on those archs and faceplants at JIT time. We patch around this with a software conversion using __float_as_uint and bit-twiddling. Patch script in the public repo:

docker/gb10/fix_flashinfer_e2m1_sm121.py

About 30 lines. Disables the hardware E2M1 path for SM121 specifically and falls back to the software conversion. Roughly 32x speedup over the broken-PTX path on a 35B baseline (1.1 to 35 tok/s for us) so worth catching on the SM120 side too.

2. NVFP4 MoE backend dispatch bug. Upstream FlashInfer's select_nvfp4_moe_backend() returned None for k_cls on sm_121, which made vLLM fall through to a broken path silently. We patch that one separately:

docker/gb10/fix_flashinfer_nvfp4_moe_backend.py

If FlashInfer's gating already handles this on SM120 cleanly in your branch, ignore. If not, this is the call site that was lying to callers.

For end-to-end NVFP4 numbers on sm_121 with these patches in (Qwen3.6-35B-A3B-NVFP4, MTP K=2): 214.6 tok/s decode at c=1, sparkrun-benchmark-validated bundle attached on this PR:

Avarok-Cybersecurity/atlas-recipes#2

Glad to see Nemotron NVFP4 landing for SM120/121. Happy to dig into either of the patches above if you want to confirm whether your backend wiring handles them differently.
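For context on item 1, a software E2M1 conversion amounts to rounding a float to the nearest of the eight FP4 magnitudes and attaching a sign bit. The Python sketch below only illustrates that idea and is not the Atlas patch, which is described as CUDA using __float_as_uint. The tie-breaking rule (prefer the code with mantissa bit 0), the saturation at 6.0, and the low-nibble-first packing are assumptions; NaN handling is omitted.

```python
import math

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the 3-bit magnitude code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def encode_e2m1(value: float) -> int:
    """Round a float to the nearest E2M1 value and return its 4-bit code."""
    sign = 0x8 if math.copysign(1.0, value) < 0 else 0x0
    mag = min(abs(value), E2M1_MAGNITUDES[-1])  # saturate at 6.0 (assumption)
    # Nearest magnitude; ties go to the code whose mantissa bit is 0.
    code = min(range(8), key=lambda c: (abs(E2M1_MAGNITUDES[c] - mag), c & 1))
    return sign | code


def decode_e2m1(nibble: int) -> float:
    """Inverse of encode_e2m1 for a 4-bit code."""
    mag = E2M1_MAGNITUDES[nibble & 0x7]
    return -mag if nibble & 0x8 else mag


def pack_pair(lo: float, hi: float) -> int:
    """Pack two values into one byte, first value in the low nibble (assumption)."""
    return (encode_e2m1(hi) << 4) | encode_e2m1(lo)
```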

@farazkh80 farazkh80 force-pushed the faraz/b12x-flashinfer-moe-pr branch from 1570171 to 11dcf4a Compare May 11, 2026 16:15
farazkh80 added a commit to farazkh80/TensorRT-LLM that referenced this pull request May 12, 2026
The b12x MoE kernel introduced by PR NVIDIA#13773 (FLASHINFER_NVFP4SM12X)
JIT-compiles via nvidia-cutlass-dsl, whose CUDA 13 runtime libraries
ship as a separate optional wheel (nvidia-cutlass-dsl-libs-cu13) and
are NOT pulled automatically by the main nvidia-cutlass-dsl wheel.

Without this wheel, executor initialization on SM120/SM121 hosts dies
with ptxas "Unexpected instruction types specified for '_mma'" because
the chip->compute_target conversion falls back to a path that strips
the 'a' suffix (sm_120a -> sm_120), and ptxas is then invoked with
-opt-arch=sm_120 against PTX that has .target sm_120a with sm_120a-only
mma instruction forms.

The runtime requirement was documented in the PR body but never made
binding via requirements.txt. Pin it explicitly at the same version as
the main wheel so fresh builds reproduce the same working environment.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
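A pre-flight check for the two wheels named in this commit message could look like the sketch below. It relies only on the standard-library importlib.metadata; the function names and messages are hypothetical, and the "same version" rule mirrors the pin described above rather than any documented requirement of the packages themselves.

```python
from importlib import metadata
from typing import Callable, Optional, Tuple

MAIN_WHEEL = "nvidia-cutlass-dsl"
LIBS_WHEEL = "nvidia-cutlass-dsl-libs-cu13"


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_cutlass_dsl_wheels(
    get_version: Callable[[str], Optional[str]] = installed_version,
) -> Tuple[bool, str]:
    """Verify both wheels are present and pinned to the same version."""
    main = get_version(MAIN_WHEEL)
    libs = get_version(LIBS_WHEEL)
    if main is None:
        return False, f"{MAIN_WHEEL} is not installed"
    if libs is None:
        return False, f"{LIBS_WHEEL} is not installed"
    if main != libs:
        return False, f"version mismatch: {MAIN_WHEEL}=={main}, {LIBS_WHEEL}=={libs}"
    return True, "ok"


print(check_cutlass_dsl_wheels())
```

Running such a check at startup would surface the missing wheel as a readable message instead of the ptxas error quoted above.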
@farazkh80 farazkh80 requested a review from a team as a code owner May 12, 2026 18:12
@farazkh80 farazkh80 force-pushed the faraz/b12x-flashinfer-moe-pr branch from 8f589ea to bbd21f2 Compare May 12, 2026 20:16
@farazkh80
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48025 [ run ] triggered by Bot. Commit: 8536641 Link to invocation

@xxi-nv
Collaborator

xxi-nv commented May 12, 2026

@QiJune @kaiyux Please help review the skills added in this PR. Also, may I confirm whether developers can add skills to this repo directly?

Comment thread tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md Outdated
Comment thread tests/unittest/_torch/modules/moe/test_cute_dsl_b12x_moe_backend.py
Comment thread tensorrt_llm/_torch/modules/fused_moe/fused_moe_flashinfer_nvfp4_sm12x.py Outdated
farazkh80 added a commit to farazkh80/TensorRT-LLM that referenced this pull request May 14, 2026
…to-promote

Replace the user-facing `moe_backend: FLASHINFER_NVFP4SM12X` knob with
transparent heuristic auto-promotion on the `CUTLASS` path. When the
user selects `moe_backend: CUTLASS` (the default), `get_moe_cls()` now
returns `FlashInferNvfp4Sm12xFusedMoE` automatically when:

  - quant_config has NVFP4
  - SM version is 120 or 121
  - `import flashinfer` succeeds

Otherwise it returns `CutlassFusedMoE` (the pre-PR behaviour). The
class itself, its weight lifecycle, and its hybrid dispatch (`m >= 64`
to CUTLASS prefill, `m < 64` to b12x decode) are unchanged; only the
selection plumbing moves.

This responds to xxi-nv's review comment on PR NVIDIA#13773 asking whether
the b12x backend could be selected via a heuristic rather than an
explicit name. Mirrors the existing `MEGAMOE_DEEPGEMM` pattern of
`can_implement`-gated promotion with a CUTLASS fallback.

Drops `"FLASHINFER_NVFP4SM12X"` from `MoeConfig.backend` Literal — the
class stays importable as an internal API for tests and for direct
construction, but is no longer a valid user-facing config string.

Tests in `test_flashinfer_nvfp4_sm12x_moe_backend.py` flipped from
"explicit name raises on bad config" to "heuristic auto-promotes vs
falls back to CutlassFusedMoE". Internal `MoeBackendType` entry kept
so `test_moe_backend.py` parametrization continues to cover the
backend; `create_test_backend` routes the enum through
`moe_backend="CUTLASS"` to exercise the same code path users hit.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
farazkh80 added a commit to farazkh80/TensorRT-LLM that referenced this pull request May 14, 2026
…MOE guide

test_moe_module.py: register MoeBackendType.FLASHINFER_NVFP4SM12X in
BACKEND_TYPES so the unified ConfigurableMoE matrix exercises it.
_create_model_config maps the internal enum value to
moe_backend="CUTLASS" before passing into ModelConfig — the enum is
internal-only after the heuristic auto-promote landed; users reach the
backend via the CUTLASS path.

MOE_DEVELOPER_GUIDE.md: remove the dedicated
FlashInferNvfp4Sm12xFusedMoE section (composition / dispatch policy /
weight-conversion algebra / hard-reject list) and drop the Nvfp4Sm12x
matrix column. The class's NVFP4 support on SM120/121 is already
covered by the CUTLASS row in the matrix (auto-promote target). Only
the single inventory-table entry under "Backends" remains, pointing at
the backend file for anyone who wants the details.

Both changes respond to xxi-nv's review comments on PR NVIDIA#13773 asking
that test_moe_module.py / test_moe_backend.py cover the new backend
and that the MoE guide stay high-level.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
@farazkh80 farazkh80 force-pushed the faraz/b12x-flashinfer-moe-pr branch 2 times, most recently from cfc5d8b to 60a2634 Compare May 14, 2026 19:20
farazkh80 added a commit to farazkh80/TensorRT-LLM that referenced this pull request May 14, 2026
The b12x MoE kernel introduced by PR NVIDIA#13773 (FLASHINFER_NVFP4SM12X)
JIT-compiles via nvidia-cutlass-dsl, whose CUDA 13 runtime libraries
ship as a separate optional wheel (nvidia-cutlass-dsl-libs-cu13) and
are NOT pulled automatically by the main nvidia-cutlass-dsl wheel.

Without this wheel, executor initialization on SM120/SM121 hosts dies
with ptxas "Unexpected instruction types specified for '_mma'" because
the chip->compute_target conversion falls back to a path that strips
the 'a' suffix (sm_120a -> sm_120), and ptxas is then invoked with
-opt-arch=sm_120 against PTX that has .target sm_120a with sm_120a-only
mma instruction forms.

The runtime requirement was documented in the PR body but never made
binding via requirements.txt. Pin it explicitly at the same version as
the main wheel so fresh builds reproduce the same working environment.
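
A quick way to confirm an environment matches this pin is to compare the two installed wheel versions. The sketch below uses only the standard library; the helper names (`installed_version`, `check_cutlass_dsl_wheels`) are hypothetical and not part of TensorRT-LLM, and the check assumes "same version" means an exact string match.

```python
from importlib import metadata
from typing import Callable, Optional, Tuple

MAIN_WHEEL = "nvidia-cutlass-dsl"
LIBS_WHEEL = "nvidia-cutlass-dsl-libs-cu13"


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_cutlass_dsl_wheels(
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Tuple[bool, str]:
    """Check that the libs wheel is installed and matches the main wheel."""
    main_version = lookup(MAIN_WHEEL)
    libs_version = lookup(LIBS_WHEEL)
    if main_version is None:
        return False, f"{MAIN_WHEEL} is not installed"
    if libs_version is None:
        return False, f"{LIBS_WHEEL} is not installed"
    if libs_version != main_version:
        return False, f"version mismatch: {main_version} vs {libs_version}"
    return True, f"both wheels at {main_version}"
```

Calling `check_cutlass_dsl_wheels()` with no arguments inspects the current environment; the `lookup` parameter exists only so the logic can be exercised without installing anything.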

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
farazkh80 added 4 commits May 22, 2026 10:06
…to-promote

Replace the user-facing `moe_backend: FLASHINFER_NVFP4SM12X` knob with
transparent heuristic auto-promotion on the `CUTLASS` path. When the
user selects `moe_backend: CUTLASS` (the default), `get_moe_cls()` now
returns `FlashInferNvfp4Sm12xFusedMoE` automatically when:

  - quant_config has NVFP4
  - SM version is 120 or 121
  - `import flashinfer` succeeds

Otherwise it returns `CutlassFusedMoE` (the pre-PR behaviour). The
class itself, its weight lifecycle, and its hybrid dispatch (CUTLASS
prefill at `m >= 64`, b12x decode at `m < 64`) are unchanged; only the
selection plumbing moves.

This responds to xxi-nv's review comment on PR NVIDIA#13773 asking whether
the b12x backend could be selected via a heuristic rather than an
explicit name. Mirrors the existing `MEGAMOE_DEEPGEMM` pattern of
`can_implement`-gated promotion with a CUTLASS fallback.

Drops `"FLASHINFER_NVFP4SM12X"` from `MoeConfig.backend` Literal — the
class stays importable as an internal API for tests and for direct
construction, but is no longer a valid user-facing config string.

Tests in `test_flashinfer_nvfp4_sm12x_moe_backend.py` flipped from
"explicit name raises on bad config" to "heuristic auto-promotes vs
falls back to CutlassFusedMoE". Internal `MoeBackendType` entry kept
so `test_moe_backend.py` parametrization continues to cover the
backend; `create_test_backend` routes the enum through
`moe_backend="CUTLASS"` to exercise the same code path users hit.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
…MOE guide

test_moe_module.py: register MoeBackendType.FLASHINFER_NVFP4SM12X in
BACKEND_TYPES so the unified ConfigurableMoE matrix exercises it.
_create_model_config maps the internal enum value to
moe_backend="CUTLASS" before passing into ModelConfig — the enum is
internal-only after the heuristic auto-promote landed; users reach the
backend via the CUTLASS path.

MOE_DEVELOPER_GUIDE.md: remove the dedicated
FlashInferNvfp4Sm12xFusedMoE section (composition / dispatch policy /
weight-conversion algebra / hard-reject list) and drop the Nvfp4Sm12x
matrix column. The class's NVFP4 support on SM120/121 is already
covered by the CUTLASS row in the matrix (auto-promote target). Only
the single inventory-table entry under "Backends" remains, pointing at
the backend file for anyone who wants the details.

Both changes respond to xxi-nv's review comments on PR NVIDIA#13773 asking
that test_moe_module.py / test_moe_backend.py cover the new backend
and that the MoE guide stay high-level.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
Pre-commit hooks flagged 3 cosmetic formatting tweaks (collapse multi-line
ternary/f-string/blank-line) in the MoE test files added/edited earlier
in this PR. No behaviour change.

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
…ethod

Addresses three of @xxi-nv's PR NVIDIA#13773 follow-up comments (May 15):

- Move post_load_weights into a dedicated quantization_method
  (NVFP4CuteDslB12xFusedMoEMethod, sibling of NVFP4CuteDslFusedMoEMethod).
  Backend post_load_weights is now inherited from CutlassFusedMoE and is a
  thin pass-through to self.quant_method.post_load_weights(self). All b12x
  weight prep (SF un-normalization, convert_sf_to_mma_layout,
  B12xMoEWrapper instantiation, shared output buffer) lives next to the
  rest of the NVFP4 quant-method family.

- Make the backend a member of the cuteDSL family: switch the parent class
  to CuteDslFusedMoE. The hybrid prefill path keeps explicit
  CutlassFusedMoE.method(self, ...) calls so the same C++ CUTLASS NVFP4
  GroupGEMM still runs at m>=64 — the MRO change does not affect which
  kernels execute. create_moe.py constructor call moved into the
  CuteDslFusedMoE branch (narrower init signature).

- Rename file / class / enum / test to match the cuteDSL family:
    fused_moe_flashinfer_nvfp4_sm12x.py -> fused_moe_cute_dsl_b12x.py
    FlashInferNvfp4Sm12xFusedMoE        -> CuteDslB12xFusedMoE
    MoeBackendType.FLASHINFER_NVFP4SM12X -> MoeBackendType.CUTE_DSL_B12X
    test_flashinfer_nvfp4_sm12x_moe_backend.py
                                        -> test_cute_dsl_b12x_moe_backend.py

Also adds a local output_dtype fallback in the backend's run_moe before
delegating to CutlassFusedMoE.run_moe — schedulers that drive run_moe
directly (the KV-cache capacity probe) leave it unset, which surfaces as a
'trtllm::fused_moe() Expected ScalarType output_dtype but instead found
NoneType' on the prefill probe. Mirrors fused_moe_cutlass.py:691's
forward_chunk convention; FP4-packed uint8 falls back to bf16.
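
The fallback can be pictured roughly as follows. This is a sketch under assumptions: dtypes are modelled as plain strings rather than `torch.dtype` values so the snippet runs without PyTorch, `resolve_output_dtype` is a hypothetical name, and the rule that an unset dtype otherwise defaults to the activation dtype is my reading of the "forward_chunk convention", not something the message spells out.

```python
from typing import Optional

PACKED_FP4_DTYPE = "uint8"   # stand-in for torch.uint8 (FP4-packed activations)
FALLBACK_DTYPE = "bfloat16"  # stand-in for torch.bfloat16


def resolve_output_dtype(output_dtype: Optional[str], activation_dtype: str) -> str:
    """Pick an output dtype when the caller left `output_dtype` unset."""
    if output_dtype is not None:
        # The caller was explicit; nothing to infer.
        return output_dtype
    if activation_dtype == PACKED_FP4_DTYPE:
        # Packed FP4 bytes are not a usable output type, so fall back to bf16.
        return FALLBACK_DTYPE
    # Assumed default: produce output in the activation's own dtype.
    return activation_dtype
```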

Validation:
- 25/25 unit tests pass (test_cute_dsl_b12x_moe_backend.py, --noconftest)
- trtllm-bench Nemotron-Super-120B-NVFP4 on SM120: 86.75 tok/s vs 85.92
  pre-refactor baseline (HYBRID_RESULTS.md, May 7) — within 1% noise
- nsys: CUTLASS sm120 block-scaled NVFP4 GroupGEMM kernels fire on
  prefill (m=2048, 96 calls); b12x cuteDSL MoEStatic/Dynamic/Micro kernels
  fire on decode (m=1, 40/80/160 calls); [b12x] quantize_input NVTX
  ranges present (400 calls)

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
@farazkh80 farazkh80 force-pushed the faraz/b12x-flashinfer-moe-pr branch from dfc5343 to f291746 Compare May 22, 2026 17:07
@farazkh80
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49967 [ run ] triggered by Bot. Commit: f291746 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49967 [ run ] completed with state SUCCESS. Commit: f291746
/LLM/main/L0_MergeRequest_PR pipeline #39533 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

The CUTLASS path in get_moe_cls was auto-promoting to CuteDslB12xFusedMoE
on SM120/SM121 + NVFP4 when flashinfer was importable, silently overriding
explicit moe_backend=CUTLASS requests. On GB10 (DGX Spark, sm_121) this
broke L0_Test-SBSA-Single-GPU GB10-PyTorch-1
test_configurable_moe_single_gpu CUTLASS+NVFP4 cases with:

  TypeError: 'CuteDslB12xFusedMoE' object does not support the
             context manager protocol

Selection now lives on the CUTEDSL path: CUTEDSL + NVFP4 + SM120/121 +
flashinfer importable -> CuteDslB12xFusedMoE; otherwise CuteDslFusedMoE
(or CutlassFusedMoE for unsupported quant). Explicit moe_backend=CUTLASS
always returns CutlassFusedMoE.
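
The selection rules in the paragraph above can be written out as a small decision function. This is a sketch, not the real `get_moe_cls()`: it returns class names as strings, `select_moe_class_name` and `flashinfer_importable` are hypothetical helpers, and it treats NVFP4 as the only quantization the CUTEDSL path accepts, which is a simplification on my part.

```python
import importlib.util
from typing import Optional

B12X_SM_VERSIONS = (120, 121)


def flashinfer_importable() -> bool:
    """True if a top-level `flashinfer` module can be located."""
    return importlib.util.find_spec("flashinfer") is not None


def select_moe_class_name(
    moe_backend: str,
    quant_algo: Optional[str],
    sm_version: int,
    has_flashinfer: Optional[bool] = None,
) -> str:
    """Mirror the selection rules described in the commit message."""
    if has_flashinfer is None:
        has_flashinfer = flashinfer_importable()
    backend = moe_backend.upper()
    if backend == "CUTLASS":
        # An explicit CUTLASS request is never promoted.
        return "CutlassFusedMoE"
    if backend == "CUTEDSL":
        if quant_algo != "NVFP4":
            # Unsupported quantization (simplified here to "anything but NVFP4").
            return "CutlassFusedMoE"
        if sm_version in B12X_SM_VERSIONS and has_flashinfer:
            return "CuteDslB12xFusedMoE"
        return "CuteDslFusedMoE"
    raise ValueError(f"backend {moe_backend!r} is not modelled in this sketch")
```

Passing `has_flashinfer` explicitly keeps the function deterministic in tests; leaving it as `None` probes the current environment.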

Also register CuteDslB12xFusedMoE in the ConfigurableMoE allowlist in
create_moe(): the bare backend instance lacks __enter__/__exit__, so
`with create_moe(...)` callers (test_configurable_moe_single_gpu) need
the ConfigurableMoE wrapper that already provides the context manager
protocol for the other cuteDSL-family backends.
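
The failure mode and the fix can be reproduced in miniature. The classes below are stand-ins (`BareBackend` and `ContextManagedMoE` are hypothetical names, not the TensorRT-LLM classes); the sketch only shows why a `with` statement needs a wrapper that defines `__enter__` and `__exit__`.

```python
class BareBackend:
    """Stand-in for a backend instance without context-manager methods."""

    def forward(self, x):
        return x


class ContextManagedMoE:
    """Stand-in for a wrapper that adds the context manager protocol."""

    def __init__(self, backend):
        self.backend = backend

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Returning False lets any exception from the with-body propagate.
        return False

    def forward(self, x):
        return self.backend.forward(x)


def supports_with(obj) -> bool:
    """The with statement looks these methods up on the type, not the instance."""
    return all(hasattr(type(obj), name) for name in ("__enter__", "__exit__"))
```

Using `with BareBackend():` directly raises at runtime (a `TypeError` on recent Python versions, matching the CI log), while `with ContextManagedMoE(BareBackend()) as moe:` works.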

Tests / docs updated:
- test_cute_dsl_b12x_moe_backend.py: heuristic ownership assertions
  rewritten to verify CUTEDSL-path behaviour (CUTLASS never promotes;
  CUTEDSL selects b12x when eligible, falls back to CuteDslFusedMoE on
  unsupported SM or missing flashinfer, and to CutlassFusedMoE on
  unsupported quant).
- test_moe_module.py / test_moe_backend.py: MoeBackendType.CUTE_DSL_B12X
  internal-only enum now remaps to "CUTEDSL" (was "CUTLASS") so the
  unified test harness drives the same code path users hit.
- MOE_DEVELOPER_GUIDE.md: file-map row clarifies "select via the
  CUTEDSL backend path", and the Backend Capability Matrix's CuteDSL
  column extends NVFP4 to SM100/103/120/121 (b12x is the SM120/121
  cuteDSL-family member).
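
The enum remapping used by the test harness might look like the sketch below. `SketchBackendType` and `to_moe_backend_string` are hypothetical; only the `CUTE_DSL_B12X` member name and its mapping to `"CUTEDSL"` come from the message, and the other members and their values are illustrative.

```python
from enum import Enum


class SketchBackendType(Enum):
    """Illustrative stand-in for the test-side backend enum."""

    CUTLASS = "CUTLASS"
    CUTEDSL = "CUTEDSL"
    CUTE_DSL_B12X = "CUTE_DSL_B12X"  # internal-only, not a user-facing string


# Internal-only members are remapped to the config string users would write.
_INTERNAL_TO_CONFIG = {SketchBackendType.CUTE_DSL_B12X: "CUTEDSL"}


def to_moe_backend_string(backend_type: SketchBackendType) -> str:
    """Return the `moe_backend` string a test should pass into the config."""
    return _INTERNAL_TO_CONFIG.get(backend_type, backend_type.value)
```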

Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
@farazkh80
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50066 [ run ] triggered by Bot. Commit: 1ce053d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50066 [ run ] completed with state SUCCESS. Commit: 1ce053d
/LLM/main/L0_MergeRequest_PR pipeline #39622 completed with status: 'FAILURE'

@farazkh80
Collaborator Author

/bot help

@NVIDIA NVIDIA deleted a comment from tensorrt-cicd May 24, 2026
@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@NVIDIA NVIDIA deleted a comment from tensorrt-cicd May 24, 2026
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd May 24, 2026
@farazkh80
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50104 [ run ] triggered by Bot. Commit: 1ce053d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50104 [ run ] completed with state SUCCESS. Commit: 1ce053d
/LLM/main/L0_MergeRequest_PR pipeline #39657 completed with status: 'FAILURE'

@farazkh80
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50115 [ run ] triggered by Bot. Commit: 1ce053d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50115 [ run ] completed with state SUCCESS. Commit: 1ce053d
/LLM/main/L0_MergeRequest_PR pipeline #39668 completed with status: 'FAILURE'

@farazkh80
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50218 [ run ] triggered by Bot. Commit: 1ce053d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50218 [ run ] completed with state SUCCESS. Commit: 1ce053d
/LLM/main/L0_MergeRequest_PR pipeline #39755 completed with status: 'SUCCESS'

CI Report

Link to invocation


5 participants