[None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron …#13773
📝 Walkthrough
This PR adds a new MoE backend option (
Changes: FlashInfer MoE Backend Addition
Sequence Diagram
sequenceDiagram
participant User
participant MoeFactory as MoE Factory<br/>(create_moe)
participant FI as FlashInferFusedMoE
participant B12x as B12xMoEWrapper<br/>(FlashInfer)
participant Weights as Weight Storage
User->>MoeFactory: get_moe_cls(backend="FLASHINFER", quant_config)
MoeFactory->>MoeFactory: Validate NVFP4 & SM versions
MoeFactory-->>User: Return FlashInferFusedMoE class
User->>FI: __init__(model_config)
FI->>FI: Validate ep_size==1, no alltoall
FI-->>User: Instance ready
User->>Weights: Load model weights
User->>FI: post_load_weights()
FI->>FI: Import B12xMoEWrapper
FI->>FI: Convert scales to MMA layout
FI->>FI: Build FP4 uint8 weight views
FI->>B12x: B12xMoEWrapper(experts, weights)
B12x-->>FI: Wrapper instantiated
FI-->>User: Initialization complete
loop Forward Pass
User->>FI: quantize_input(x)
FI-->>User: (x, None) — passthrough
User->>FI: run_moe(x, routing_tensors)
FI->>B12x: run(x, token_selected_experts, ...)
B12x->>B12x: Compute FP4 MoE with internal quant
B12x-->>FI: Output tensor
FI-->>User: MoE result
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
216-238: ⚠️ Potential issue | 🔴 Critical
The `issubclass()` dispatch at line 216 makes the `CuteDslFusedMoE` and `DeepGemmFusedMoE` branches unreachable and will cause runtime errors.
Both `CuteDslFusedMoE` and `DeepGemmFusedMoE` inherit from `CutlassFusedMoE`, so they are now captured by the broader `issubclass(moe_cls, CutlassFusedMoE)` check before their exact-class branches (lines 269 and 285) can execute. The generic Cutlass path then attempts to pass arguments like `swiglu_alpha`, `swiglu_beta`, and `swiglu_limit` that these subclasses' narrower constructors do not accept, resulting in unexpected keyword argument errors. Only `FlashInferFusedMoE` is compatible because its `__init__(self, *args, **kwargs)` forwards all arguments.
Either revert to exact-class checks (`moe_cls ==`) for all three subclasses, or verify that the constructors accept the full argument set being passed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 216 - 238, The dispatch using issubclass(moe_cls, CutlassFusedMoE) wrongly catches CuteDslFusedMoE and DeepGemmFusedMoE (both subclassing CutlassFusedMoE) and passes unsupported kwargs (swiglu_alpha, swiglu_beta, swiglu_limit), causing runtime errors; fix by making the dispatch check exact-class comparisons for CuteDslFusedMoE and DeepGemmFusedMoE (i.e. moe_cls == CuteDslFusedMoE and moe_cls == DeepGemmFusedMoE) or move those subclass-specific branches before the generic issubclass(CutlassFusedMoE) branch so their narrower constructors run, ensuring only FlashInferFusedMoE (which accepts arbitrary kwargs) is handled by the broad issubclass(CutlassFusedMoE) path.
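The ordering problem described in this comment can be reproduced in isolation. The sketch below is a minimal illustration only: the two classes are hypothetical stand-ins that share names with the real ones but none of their code, and `create_broad_first` / `create_exact_first` are invented helper names, not functions from `create_moe.py`.

```python
class CutlassFusedMoE:
    """Stand-in base class whose constructor accepts the wide SwiGLU argument set."""

    def __init__(self, num_experts, swiglu_alpha=None, swiglu_beta=None, swiglu_limit=None):
        self.num_experts = num_experts
        self.swiglu_alpha = swiglu_alpha


class CuteDslFusedMoE(CutlassFusedMoE):
    """Stand-in subclass with a narrower constructor (no SwiGLU kwargs)."""

    def __init__(self, num_experts):
        super().__init__(num_experts)


def create_broad_first(moe_cls):
    # The broad issubclass() check runs first, so the subclass is captured here
    # and receives keyword arguments its constructor does not accept.
    if issubclass(moe_cls, CutlassFusedMoE):
        return moe_cls(num_experts=8, swiglu_alpha=1.0, swiglu_beta=0.0, swiglu_limit=7.0)
    if moe_cls == CuteDslFusedMoE:  # never reached for CuteDslFusedMoE
        return moe_cls(num_experts=8)
    raise ValueError(f"unsupported class: {moe_cls!r}")


def create_exact_first(moe_cls):
    # Exact-class branches run before the generic branch, so each constructor
    # only sees the arguments it declares.
    if moe_cls == CuteDslFusedMoE:
        return moe_cls(num_experts=8)
    if issubclass(moe_cls, CutlassFusedMoE):
        return moe_cls(num_experts=8, swiglu_alpha=1.0, swiglu_beta=0.0, swiglu_limit=7.0)
    raise ValueError(f"unsupported class: {moe_cls!r}")
```

With the broad check first, `create_broad_first(CuteDslFusedMoE)` raises `TypeError` for the unexpected keyword argument; with the exact-class branch first, the same call succeeds.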
🧹 Nitpick comments (1)
tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py (1)
34-131: ⚡ Quick win
Cover the remaining no-GPU guard rails too.
This file exercises selection-time gating, but the backend's other hard rejects are still untested: `ep_size != 1`, alltoall enabled, `Fp4QuantizedTensor` input, and `x_sf is not None`. Those are pure-Python validation paths, so adding them here would catch the runtime guard rails most likely to regress during refactors.
As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py` around lines 34 - 131, Add unit tests in tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py that exercise the remaining pure-Python guard rails on selection/creation: call FlashInferFusedMoE.can_implement (and get_moe_cls where appropriate) with ep_size != 1, with alltoall enabled, with an input type simulated as Fp4QuantizedTensor, and with x_sf set (not None) to assert they return (or raise) the expected hard rejects; reference FlashInferFusedMoE.can_implement and get_moe_cls to locate validation logic, patch get_sm_version to a supported SM (e.g., 120) so only these specific guards are hit, and assert the returned ok is False and/or get_moe_cls raises ValueError matching the relevant reason text for each case.
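A table-driven shape for such guard-rail tests might look like the sketch below. It is an assumption-heavy illustration: `check_guards` is a hypothetical stand-in for the backend's validation, its parameters and reason strings are invented, and the real tests would call `FlashInferFusedMoE.can_implement` / `get_moe_cls` instead.

```python
from typing import Optional, Tuple


def check_guards(ep_size: int, alltoall: bool, x_is_fp4: bool,
                 x_sf: Optional[object]) -> Tuple[bool, Optional[str]]:
    """Hypothetical stand-in for the four hard rejects named in the review."""
    if ep_size != 1:
        return False, "ep_size must be 1"
    if alltoall:
        return False, "alltoall is not supported"
    if x_is_fp4:
        return False, "Fp4QuantizedTensor input is not supported"
    if x_sf is not None:
        return False, "x_sf must be None"
    return True, None


# One row per guard rail: keyword arguments plus a substring expected in the reason.
REJECT_CASES = [
    (dict(ep_size=2, alltoall=False, x_is_fp4=False, x_sf=None), "ep_size"),
    (dict(ep_size=1, alltoall=True, x_is_fp4=False, x_sf=None), "alltoall"),
    (dict(ep_size=1, alltoall=False, x_is_fp4=True, x_sf=None), "Fp4QuantizedTensor"),
    (dict(ep_size=1, alltoall=False, x_is_fp4=False, x_sf=object()), "x_sf"),
]


def run_guard_cases() -> int:
    """Assert every reject case fails for the expected reason; return the case count."""
    for kwargs, needle in REJECT_CASES:
        ok, reason = check_guards(**kwargs)
        assert not ok and needle in reason
    # The all-clear configuration must still pass.
    ok, reason = check_guards(ep_size=1, alltoall=False, x_is_fp4=False, x_sf=None)
    assert ok and reason is None
    return len(REJECT_CASES)
```

Keeping one row per guard makes it obvious when a new hard reject is added without a matching test.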
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md`:
- Around line 145-163: Add the runtime dependency floor for the FlashInfer
backend to this section: state the required minimum versions of the external
packages (e.g., flashinfer and the CUTLASS/DSL package) and any toolchain
constraints so users know the exact package combo needed before selecting
FLASHINFER; place this note alongside the "FlashInferFusedMoE — additional
constraints" paragraph and reference the selection point
(get_moe_cls("FLASHINFER", ...)) and the lazy wrapper initialization in
post_load_weights() so developers see the dependency requirement when reading
the backend constraints.
In `@tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py`:
- Around line 15-21: Add integration test definitions that exercise the
FlashInferFusedMoE backend on SM120/SM121 hardware: create a perf test entry
named with the l0_* pattern (so it runs in pre-merge CI) that sets
moe_backend=FLASHINFER and targets the SM120/SM121 GPU variant, and add
corresponding scheduled QA entries following the llm_perf_* naming convention
that also specify moe_backend=FLASHINFER/FlashInferFusedMoE and the same GPU
targets; ensure the new YAML entries include the job name, test selector,
hardware requirements, and any perf thresholds so the backend is executed in
both pre-merge CI and scheduled QA runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8dc4a17c-d10b-410b-8630-d20e1b303ed2
📒 Files selected for processing (6)
- tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md
- tensorrt_llm/_torch/modules/fused_moe/__init__.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_flashinfer.py
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/_torch/modules/moe/test_flashinfer_moe_backend.py
… diff, helper scripts
Captures the investigation findings + reproducible artifacts for the ``B12xLukeFusedMoE`` backend committed in 374b483. Lives under .claude_docs/ rather than docs/ since the artifacts are working-files (bench logs, container scripts, PR body) rather than user-facing docs.
Files:
- ``B12X_LUKE_RESULTS.md``: bench numbers vs the FI hybrid baseline (TPOT 12.97 ms vs FI 11.49 ms = +12.8% regression) + token-parity result + success-criteria table + root-cause writeup + follow-up probes.
- ``FI_VS_LUKE_DELTA.md``: side-by-side architectural comparison of flashinfer-vendored b12x and lukealonso/b12x master HEAD. Includes bisect data (986a405a / c9cc90ec / 1378cea7 all ~13.5 ms TPOT, gap to FI predates published luke history), trace-probe evidence (luke's MoEMicroKernel does fire; the slowdown is intrinsic), and the kernel-source diff that locates the root cause: FI uses Blackwell's warp-specialized producer/consumer pattern (5 warps/CTA: 4 MMA + 1 dedicated TMA-load with cute.arch.setmaxregister_increase/_decrease register repartitioning), luke uses a flat 16-warp design with no producer/consumer split. Concludes that closing the gap requires either an upstream rewrite or hand-porting FI's kernel (~12k lines of CuTe DSL cascade, out of scope here).
- ``PR_BODY_b12x_luke.md``: PR description used when opening the stacked PR against ``faraz/b12x-flashinfer-moe-pr`` on the farazkh80/TensorRT-LLM fork.
- ``start_runtime_container_b12x_luke.sh``: docker run helper that installs flashinfer + lukealonso/b12x @ 1378cea7 + cutlass-dsl 4.4.2 trio + cache_dit (rc14 dep absent in rc12 base) + LD_LIBRARY_PATH fix-up so docker exec inherits libnvonnxparser.
- ``sync_b12x_luke_files.sh``: syncs edited fused_moe submodule files from the host source tree into the container's wheel-installed site-packages (the rc12 base image's tensorrt_llm has new imports like cache_dit that block PYTHONPATH overlay; targeted file copy is safer).
- ``bench_kvoff_b12x_luke.yml``: bench yaml clone of bench_kvoff_flashinfer.yml with moe_config.backend swapped to B12X_LUKE.
- ``parity_check_b12x_luke.py``: token-parity script with --moe-backend flag for FLASHINFER vs B12X_LUKE A/B (skipped this run, kept for future use).
- ``_patch_tp_moe_trace.py``: idempotent patch that injects [trace-luke] prints into b12x.integration.tp_moe._launch_compact_static, used to prove luke's micro path actually fires (not a fall-through bug). Not needed at runtime; kept for reproducibility.
The bench logs themselves and parent-PR (NVIDIA#13773) artifacts (HYBRID_DOC.md, HYBRID_RESULTS.md, etc.) are intentionally NOT committed: bench logs live under /home/farazkh_scratch/logs/ and parent-PR docs belong on the parent branch.
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
…ide Draft PR
The fork-side PR_BODY_b12x_luke.md targets the GitHub UI on farazkh80/TensorRT-LLM and assumes a stacked base of `b12x-hybrid`. The NVIDIA-side PR (filed as Draft against NVIDIA:main) carries the 3 NVIDIA#13773 commits as overlap, so its body needs:
- A prominent DRAFT-blocked-on-NVIDIA#13773 header at the top.
- A condensed framing that names the perf regression upfront.
- Same bench data + warp-spec swap evidence + recommendation.
Open URL: https://github.com/NVIDIA/TensorRT-LLM/compare/main...farazkh80:b12x-luke-decode?expand=1
Use the "Create draft pull request" dropdown.
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
b4c7031 to
1570171
Compare
|
Disclosure: I work on Atlas. We've been running NVFP4 MoE on
Two things that bit us on FlashInfer + NVFP4 on consumer Blackwell that may or may not already be on your radar:
1. E2M1 conversion PTX is Hopper/SM10-only. FlashInfer's CUTLASS headers gate
About 30 lines. Disables the hardware E2M1 path for SM121 specifically and falls back to the software conversion. Roughly 32x speedup over the broken-PTX path on a 35B baseline (1.1 to 35 tok/s for us) so worth catching on the SM120 side too.
2. NVFP4 MoE backend dispatch bug. Upstream FlashInfer's
If FlashInfer's gating already handles this on SM120 cleanly in your branch, ignore. If not, this is the call site that was lying to callers.
For end-to-end NVFP4 numbers on
Glad to see Nemotron NVFP4 landing for SM120/121. Happy to dig into either of the patches above if you want to confirm whether your backend wiring handles them differently.
1570171 to
11dcf4a
Compare
The b12x MoE kernel introduced by PR NVIDIA#13773 (FLASHINFER_NVFP4SM12X) JIT-compiles via nvidia-cutlass-dsl, whose CUDA 13 runtime libraries ship as a separate optional wheel (nvidia-cutlass-dsl-libs-cu13) and are NOT pulled automatically by the main nvidia-cutlass-dsl wheel. Without this wheel, executor initialization on SM120/SM121 hosts dies with ptxas "Unexpected instruction types specified for '_mma'" because the chip->compute_target conversion falls back to a path that strips the 'a' suffix (sm_120a -> sm_120), and ptxas is then invoked with -opt-arch=sm_120 against PTX that has .target sm_120a with sm_120a-only mma instruction forms.
The runtime requirement was documented in the PR body but never made binding via requirements.txt. Pin it explicitly at the same version as the main wheel so fresh builds reproduce the same working environment.
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
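Before pinning, the missing-wheel condition can be detected with a small environment check. This is a sketch, assuming the two distribution names quoted in the commit message are what `pip` records; the helper names `installed_version` and `check_wheel_pair` are illustrative and not part of TensorRT-LLM.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_wheel_pair(main: str = "nvidia-cutlass-dsl",
                     libs: str = "nvidia-cutlass-dsl-libs-cu13") -> str:
    """Summarise whether the main wheel and its runtime-libs wheel are both
    installed at the same version (the pairing the commit pins)."""
    main_v = installed_version(main)
    libs_v = installed_version(libs)
    if main_v is None:
        return f"missing: {main}"
    if libs_v is None:
        return f"missing: {libs}"
    if main_v != libs_v:
        return f"version mismatch: {main}=={main_v} vs {libs}=={libs_v}"
    return f"ok: both at {main_v}"
```

Calling `print(check_wheel_pair())` at container start-up would flag the missing runtime-libs wheel before the first JIT compile reaches ptxas.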
8f589ea to
bbd21f2
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #48025 [ run ] triggered by Bot. Commit: |
…to-promote
Replace the user-facing `moe_backend: FLASHINFER_NVFP4SM12X` knob with transparent heuristic auto-promotion on the `CUTLASS` path. When the user selects `moe_backend: CUTLASS` (the default), `get_moe_cls()` now returns `FlashInferNvfp4Sm12xFusedMoE` automatically when:
- quant_config has NVFP4
- SM version is 120 or 121
- `import flashinfer` succeeds
Otherwise it returns `CutlassFusedMoE` (the pre-PR behaviour). The class itself, its weight lifecycle, and its hybrid `m >= 64` decode dispatch are unchanged; only the selection plumbing moves.
This responds to xxi-nv's review comment on PR NVIDIA#13773 asking whether the b12x backend could be selected via a heuristic rather than an explicit name. Mirrors the existing `MEGAMOE_DEEPGEMM` pattern of `can_implement`-gated promotion with a CUTLASS fallback.
Drops `"FLASHINFER_NVFP4SM12X"` from `MoeConfig.backend` Literal; the class stays importable as an internal API for tests and for direct construction, but is no longer a valid user-facing config string.
Tests in `test_flashinfer_nvfp4_sm12x_moe_backend.py` flipped from "explicit name raises on bad config" to "heuristic auto-promotes vs falls back to CutlassFusedMoE". Internal `MoeBackendType` entry kept so `test_moe_backend.py` parametrization continues to cover the backend; `create_test_backend` routes the enum through `moe_backend="CUTLASS"` to exercise the same code path users hit.
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
…MOE guide
test_moe_module.py: register MoeBackendType.FLASHINFER_NVFP4SM12X in BACKEND_TYPES so the unified ConfigurableMoE matrix exercises it. _create_model_config maps the internal enum value to moe_backend="CUTLASS" before passing into ModelConfig; the enum is internal-only after the heuristic auto-promote landed, and users reach the backend via the CUTLASS path.
MOE_DEVELOPER_GUIDE.md: remove the dedicated FlashInferNvfp4Sm12xFusedMoE section (composition / dispatch policy / weight-conversion algebra / hard-reject list) and drop the Nvfp4Sm12x matrix column. The class's NVFP4 support on SM120/121 is already covered by the CUTLASS row in the matrix (auto-promote target). Only the single inventory-table entry under "Backends" remains, pointing at the backend file for anyone who wants the details.
Both changes respond to xxi-nv's review comments on PR NVIDIA#13773 asking that test_moe_module.py / test_moe_backend.py cover the new backend and that the MoE guide stay high-level.
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
cfc5d8b to
60a2634
Compare
Pre-commit hooks flagged 3 cosmetic formatting tweaks (collapse multi-line ternary/f-string/blank-line) in the MoE test files added/edited earlier in this PR. No behaviour change. Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
…ethod
Addresses three of @xxi-nv's PR NVIDIA#13773 follow-up comments (May 15):
- Move post_load_weights into a dedicated quantization_method (NVFP4CuteDslB12xFusedMoEMethod, sibling of NVFP4CuteDslFusedMoEMethod). Backend post_load_weights is now inherited from CutlassFusedMoE and is a thin pass-through to self.quant_method.post_load_weights(self). All b12x weight prep (SF un-normalization, convert_sf_to_mma_layout, B12xMoEWrapper instantiation, shared output buffer) lives next to the rest of the NVFP4 quant-method family.
- Make the backend a member of the cuteDSL family: switch the parent class to CuteDslFusedMoE. The hybrid prefill path keeps explicit CutlassFusedMoE.method(self, ...) calls so the same C++ CUTLASS NVFP4 GroupGEMM still runs at m>=64; the MRO change does not affect which kernels execute. create_moe.py constructor call moved into the CuteDslFusedMoE branch (narrower init signature).
- Rename file / class / enum / test to match the cuteDSL family:
  fused_moe_flashinfer_nvfp4_sm12x.py -> fused_moe_cute_dsl_b12x.py
  FlashInferNvfp4Sm12xFusedMoE -> CuteDslB12xFusedMoE
  MoeBackendType.FLASHINFER_NVFP4SM12X -> MoeBackendType.CUTE_DSL_B12X
  test_flashinfer_nvfp4_sm12x_moe_backend.py -> test_cute_dsl_b12x_moe_backend.py
Also adds a local output_dtype fallback in the backend's run_moe before delegating to CutlassFusedMoE.run_moe; schedulers that drive run_moe directly (the KV-cache capacity probe) leave it unset, which surfaces as a 'trtllm::fused_moe() Expected ScalarType output_dtype but instead found NoneType' on the prefill probe. Mirrors fused_moe_cutlass.py:691's forward_chunk convention; FP4-packed uint8 falls back to bf16.
Validation:
- 25/25 unit tests pass (test_cute_dsl_b12x_moe_backend.py, --noconftest)
- trtllm-bench Nemotron-Super-120B-NVFP4 on SM120: 86.75 tok/s vs 85.92 pre-refactor baseline (HYBRID_RESULTS.md, May 7), within 1% noise
- nsys: CUTLASS sm120 block-scaled NVFP4 GroupGEMM kernels fire on prefill (m=2048, 96 calls); b12x cuteDSL MoEStatic/Dynamic/Micro kernels fire on decode (m=1, 40/80/160 calls); [b12x] quantize_input NVTX ranges present (400 calls)
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
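The hybrid routing and the `output_dtype` fallback described in this commit can be sketched without any GPU code. Everything below is illustrative: dtypes are plain strings rather than `torch.dtype` values, the function names and path labels are invented, and returning the input dtype when it is not `uint8` is an assumption about the convention, which the commit message does not spell out.

```python
from typing import Optional

LARGE_M_THRESHOLD = 64  # the m >= 64 cutoff quoted in the commit message


def pick_moe_path(num_tokens: int) -> str:
    """Label the kernel family that would handle a batch of `num_tokens` rows."""
    if num_tokens >= LARGE_M_THRESHOLD:
        return "cutlass_nvfp4_groupgemm"  # prefill-sized batches
    return "b12x_cute_dsl"  # decode-sized batches


def resolve_output_dtype(output_dtype: Optional[str], input_dtype: str) -> str:
    """Fill in a missing output dtype before delegating to the CUTLASS path."""
    if output_dtype is not None:
        return output_dtype  # the caller already chose a dtype
    if input_dtype == "uint8":
        return "bfloat16"  # FP4-packed uint8 input falls back to bf16
    return input_dtype  # assumption: otherwise reuse the activation dtype
```

For the shapes reported in the validation notes, `pick_moe_path(2048)` selects the CUTLASS label and `pick_moe_path(1)` selects the b12x label.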
dfc5343 to
f291746
Compare
|
/bot run |
|
PR_Github #49967 [ run ] triggered by Bot. Commit: |
|
PR_Github #49967 [ run ] completed with state
|
The CUTLASS path in get_moe_cls was auto-promoting to CuteDslB12xFusedMoE
on SM120/SM121 + NVFP4 when flashinfer was importable, silently overriding
explicit moe_backend=CUTLASS requests. On GB10 (DGX Spark, sm_121) this
broke L0_Test-SBSA-Single-GPU GB10-PyTorch-1
test_configurable_moe_single_gpu CUTLASS+NVFP4 cases with:
TypeError: 'CuteDslB12xFusedMoE' object does not support the
context manager protocol
Selection now lives on the CUTEDSL path: CUTEDSL + NVFP4 + SM120/121 +
flashinfer importable -> CuteDslB12xFusedMoE; otherwise CuteDslFusedMoE
(or CutlassFusedMoE for unsupported quant). Explicit moe_backend=CUTLASS
always returns CutlassFusedMoE.
Also register CuteDslB12xFusedMoE in the ConfigurableMoE allowlist in
create_moe(): the bare backend instance lacks __enter__/__exit__, so
`with create_moe(...)` callers (test_configurable_moe_single_gpu) need
the ConfigurableMoE wrapper that already provides the context manager
protocol for the other cuteDSL-family backends.
Tests / docs updated:
- test_cute_dsl_b12x_moe_backend.py: heuristic ownership assertions
rewritten to verify CUTEDSL-path behaviour (CUTLASS never promotes;
CUTEDSL selects b12x when eligible, falls back to CuteDslFusedMoE on
unsupported SM or missing flashinfer, and to CutlassFusedMoE on
unsupported quant).
- test_moe_module.py / test_moe_backend.py: MoeBackendType.CUTE_DSL_B12X
internal-only enum now remaps to "CUTEDSL" (was "CUTLASS") so the
unified test harness drives the same code path users hit.
- MOE_DEVELOPER_GUIDE.md: file-map row clarifies "select via the
CUTEDSL backend path", and the Backend Capability Matrix's CuteDSL
column extends NVFP4 to SM100/103/120/121 (b12x is the SM120/121
cuteDSL-family member).
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
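The selection rules in this commit message reduce to a small decision function. The sketch below returns class names as strings to stay self-contained; `select_moe_cls_name`, its parameters, and `is_importable` are hypothetical names, and the real `get_moe_cls()` takes different inputs.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def select_moe_cls_name(moe_backend: str, has_nvfp4: bool, sm_version: int,
                        flashinfer_ok: bool, quant_supported: bool = True) -> str:
    """Mirror the selection rules stated in the commit message."""
    if moe_backend == "CUTLASS":
        # An explicit CUTLASS request is never promoted.
        return "CutlassFusedMoE"
    if moe_backend == "CUTEDSL":
        if not quant_supported:
            return "CutlassFusedMoE"  # fallback for unsupported quant
        if has_nvfp4 and sm_version in (120, 121) and flashinfer_ok:
            return "CuteDslB12xFusedMoE"
        return "CuteDslFusedMoE"
    raise ValueError(f"backend not covered by this sketch: {moe_backend}")
```

A caller would pass `flashinfer_ok=is_importable("flashinfer")`, so hosts without FlashInfer stay on `CuteDslFusedMoE`.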
|
/bot run |
|
PR_Github #50066 [ run ] triggered by Bot. Commit: |
|
PR_Github #50066 [ run ] completed with state
|
|
/bot help |
|
/bot run |
|
PR_Github #50104 [ run ] triggered by Bot. Commit: |
|
PR_Github #50104 [ run ] completed with state
|
|
/bot run |
|
PR_Github #50115 [ run ] triggered by Bot. Commit: |
|
PR_Github #50115 [ run ] completed with state
|
|
/bot run |
|
PR_Github #50218 [ run ] triggered by Bot. Commit: |
|
PR_Github #50218 [ run ] completed with state |
NVIDIA#13773) Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
Description
Adds a new MoE backend, `FlashInferNvfp4Sm12xFusedMoE`, for Nemotron-Super-120B-NVFP4 on SM120 (RTX 5090 / RTX PRO 6000 / GB202) and SM121 (DGX Spark / GB10). It is auto-selected on the `CUTLASS` path when `quant_algo == NVFP4`, SM is 120/121, and `flashinfer` is importable; otherwise it falls back to plain `CutlassFusedMoE`. There is no new user-facing `moe_backend` value; the heuristic in `get_moe_cls()` mirrors the existing `MEGAMOE_DEEPGEMM` `can_implement`-gated pattern.
Hybrid composition:
- `x.shape[0] >= 64`: routes through the inherited `CutlassFusedMoE` NVFP4 GroupGEMM. b12x's 12-CTA-per-token MMA pattern is suboptimal at large `m`.
- `x.shape[0] < 64`: dispatches to FlashInfer's `B12xMoEWrapper.run`. Beats CUTLASS by ~17.6 % TPOT.
Hardware constraints (rejected by `can_implement` or `__init__`)
NVFP4 only; SM120/121 only; bf16/fp16 activation only; `ep_size == 1` only; no MoE alltoall; activations limited to `Relu2` and `Swiglu`; no `swiglu_gptoss_style`. `Fp4QuantizedTensor` input rejected on the decode path (b12x quantizes activations internally).
Performance
Single RTX PRO 6000 Blackwell (SM120, 97 GB), Nemotron-Super-120B-NVFP4, TRT-LLM 1.3.0rc14, FlashInfer 0.6.8, cutlass-dsl 4.5.0; ISL=2048, OSL=1024, 5 reqs, conc=1, KV reuse off, `cuda_graph_config: {batch_sizes: [1]}`, `max_num_tokens=2048`.
Configurations compared:
- `moe_backend: CUTLASS` (pre-PR)
- `moe_backend: CUTLASS` (this PR, auto-promoted)
Matches CUTLASS on TTFT (prefill on CUTLASS), matches pure b12x on TPOT (decode on b12x), beats both on total throughput. Tokens/Watt +20.4 % vs CUTLASS.
GSM8K accuracy parity (1319 samples, 8-shot CoT, greedy)
- `CUTLASS` (baseline)
- `CUTLASS` (auto-promoted, this PR)
Statistically indistinguishable (≈ 2 questions out of 1319; well within 95 % binomial CI of ±1.5 pp).
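The parity claim can be sanity-checked with the normal-approximation binomial interval. In the sketch below the accuracy value 0.92 is a hypothetical placeholder (the per-backend accuracies are not reproduced in this text); only the sample count 1319 and the two-question gap come from the description.

```python
import math


def binomial_ci_half_width_pp(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI for a proportion, in percentage points."""
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


N_SAMPLES = 1319          # GSM8K sample count from the description
ASSUMED_ACCURACY = 0.92   # hypothetical placeholder, not a reported result

# A two-question difference expressed in percentage points.
delta_pp = 100.0 * 2 / N_SAMPLES

# Sampling noise at the assumed accuracy level.
half_width_pp = binomial_ci_half_width_pp(ASSUMED_ACCURACY, N_SAMPLES)

print(f"delta = {delta_pp:.2f} pp, 95% CI half-width = {half_width_pp:.2f} pp")
```

At an accuracy near the assumed level, the two-question gap is roughly a tenth of the interval half-width, which is consistent with the "statistically indistinguishable" reading.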
PR Checklist
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.
Summary by CodeRabbit
New Features
Documentation