[None][feat] enable TRTLLM-Gen internal routing#13997
Conversation
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
📝 WalkthroughWalkthroughThis PR adds internal routing support to TRTLLM-Gen MoE fusion and introduces auxiliary stream scheduling for shared experts. It extends the MoE custom op to accept routing logits and bias as alternatives to external routing tensors, implements graph analysis to extract routing metadata from noaux_tc_op producers, and introduces the MultiStreamMOE transform for performance optimization. ChangesInternal Routing and Auxiliary Stream MoE Optimization
Sequence Diagram(s)The custom op routing flow and transform orchestration are visualized in the hidden review stack artifact above. 🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_moe.py (1)
195-223:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRewire every shared-branch root to the aux stream, not just the first one.
first_sharedonly covers one direct consumer offork_point. Shared-expert MLPs commonly fan out immediately into parallel gate/up projections, so sibling roots keep readingfork_pointon the main stream. That splits one logical shared branch across streams, and the laterend_aux/wait_auxnodes no longer bracket the full shared-expert computation.Suggested fix
- shared_nodes.sort(key=lambda n: node_order.get(n, 0)) - first_shared = shared_nodes[0] - - # Sanity check: the first shared op must directly consume the fork - # point so we can wire begin_aux_stream_passthrough into it. - if fork_point not in first_shared.all_input_nodes: + shared_nodes.sort(key=lambda n: node_order.get(n, 0)) + root_shared_nodes = [n for n in shared_nodes if fork_point in n.all_input_nodes] + if not root_shared_nodes: ad_logger.warning( - f"First shared-expert op ({first_shared.name}) does not directly " - f"consume fork point ({fork_point.name}); skipping." + f"No shared-expert root directly consumes fork point ({fork_point.name}); skipping." ) continue + first_shared = root_shared_nodes[0] with graph.inserting_before(first_shared): begin_aux_node = graph.call_function( begin_aux_stream_passthrough, args=(fork_point,), ) - first_shared.args = tuple( - begin_aux_node if arg is fork_point else arg for arg in first_shared.args - ) + for root in root_shared_nodes: + root.args = tuple( + begin_aux_node if arg is fork_point else arg for arg in root.args + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_moe.py` around lines 195 - 223, The current code only rewires the single first_shared consumer to read from the begin_aux_node, leaving sibling shared roots still tied to fork_point; change the rewrite so that every shared branch root in shared_nodes that directly consumes fork_point is updated to use begin_aux_node instead of fork_point. Locate the block where begin_aux_node is created via begin_aux_stream_passthrough and replace the single reassignment of first_shared.args with a loop (or comprehension) over shared_nodes that for each node checks its args (or all_input_nodes membership) and replaces any arg that is fork_point with begin_aux_node; ensure this uses the same begin_aux_node and preserves nodes that do not reference fork_point so the entire shared-expert subtree executes on the aux stream and remains bracketed by the corresponding end_aux/wait_aux nodes.
🧹 Nitpick comments (2)
examples/auto_deploy/model_registry/configs/super_v3.yaml (1)
54-56: Add perf coverage before enabling this path by default.This turns on a MoE routing hot path for
trtllm_gen, but I don't see a matching perf sanity/test-list update in the diff to catch latency or throughput regressions. Please add atests/integration/defs/perf/test_perf_sanity.pycase plus atests/integration/test_lists/test-db/l0_perf.ymlentry, and the corresponding QAllm_perf_*.ymlupdate if you want scheduled coverage.As per coding guidelines, "If the PR touches performance-sensitive paths ... check whether a perf test entry is present or updated in: (a) tests/integration/test_lists/test-db/l0_perf.yml ... and (b) tests/integration/test_lists/qa/llm_perf_*.yml ..."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/auto_deploy/model_registry/configs/super_v3.yaml` around lines 54 - 56, The new config enables a MoE routing hot path (fuse_nvfp4_moe with backend: trtllm_gen and enable_trtllm_gen_internal_routing: true) but no perf coverage was added; add a performance sanity test and test-list entries: create a tests/integration/defs/perf/test_perf_sanity.py that exercises the trtllm_gen MoE routing case (reference the fuse_nvfp4_moe config name in the test), add a corresponding entry in tests/integration/test_lists/test-db/l0_perf.yml to include that test, and update the appropriate QA list under tests/integration/test_lists/qa/ (llm_perf_*.yml) so this path is scheduled for perf runs.tests/unittest/auto_deploy/singlegpu/transformations/library/test_moe_fusion.py (1)
1631-1658: ⚡ Quick winCover the transform rewrite, not just the helper.
This only proves
_extract_noaux_internal_routing()can parse a hand-built graph. The risky behavior in this PR is thefuse_nvfp4_moerewrite that drops external routing tensors, threadsrouter_logits/routing_biasintotrtllm_nvfp4_trtllm_gen_moe_fused, and stays on the external path for all-to-all. A regression there would still pass this test. Please add a transform-level regression for both the internal-routing rewrite and the all-to-all fallback.As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/auto_deploy/singlegpu/transformations/library/test_moe_fusion.py` around lines 1631 - 1658, The current unit test only validates the helper _extract_noaux_internal_routing on a hand-built graph; you must add transform-level tests that run the actual fuse_nvfp4_moe rewrite and assert its behavior for both internal-routing fusion and the all-to-all fallback. Create two tests (or one parametrized) that build an FX graph with real external routing tensors (router_logits, routing_bias) and the noaux op, apply the transformation function fuse_nvfp4_moe, and then assert: (1) when internal-routing is detected the graph contains a single trtllm_nvfp4_trtllm_gen_moe_fused node and router_logits/routing_bias are threaded into that node, and (2) when routing is externally required (all-to-all fallback case) the transform does not drop external routing tensors and keeps the all-to-all path intact; include the with_ep_mask True/False cases similar to test_nvfp4_trtllm_gen_internal_routing to cover masking behavior. Ensure assertions inspect node targets/args in the transformed FX graph to uniquely identify fuse_nvfp4_moe and trtllm_nvfp4_trtllm_gen_moe_fused behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py`:
- Around line 1-2: This file missing the required NVIDIA copyright header; open
fused_moe.py (module fused_moe) and insert the standard NVIDIA source header
(with the current modification year) at the very top of the file before any
imports (before the existing import math / import operator lines); ensure the
header text matches the project's canonical copyright header used in other .py
files and retains any required SPDX or license lines.
In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_moe_fusion.py`:
- Around line 1-2: Add the required NVIDIA copyright header block at the very
top of this modified Python file (before the existing import lines such as
"import json" and "import operator") in the same format used by other source
files in the repo and update the modification year to the current year; ensure
the header is a proper multi-line comment/docblock matching the project's
standard and leave the rest of the file (including imports and tests in
test_moe_fusion.py) unchanged below the header.
---
Outside diff comments:
In `@tensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_moe.py`:
- Around line 195-223: The current code only rewires the single first_shared
consumer to read from the begin_aux_node, leaving sibling shared roots still
tied to fork_point; change the rewrite so that every shared branch root in
shared_nodes that directly consumes fork_point is updated to use begin_aux_node
instead of fork_point. Locate the block where begin_aux_node is created via
begin_aux_stream_passthrough and replace the single reassignment of
first_shared.args with a loop (or comprehension) over shared_nodes that for each
node checks its args (or all_input_nodes membership) and replaces any arg that
is fork_point with begin_aux_node; ensure this uses the same begin_aux_node and
preserves nodes that do not reference fork_point so the entire shared-expert
subtree executes on the aux stream and remains bracketed by the corresponding
end_aux/wait_aux nodes.
---
Nitpick comments:
In `@examples/auto_deploy/model_registry/configs/super_v3.yaml`:
- Around line 54-56: The new config enables a MoE routing hot path
(fuse_nvfp4_moe with backend: trtllm_gen and enable_trtllm_gen_internal_routing:
true) but no perf coverage was added; add a performance sanity test and
test-list entries: create a tests/integration/defs/perf/test_perf_sanity.py that
exercises the trtllm_gen MoE routing case (reference the fuse_nvfp4_moe config
name in the test), add a corresponding entry in
tests/integration/test_lists/test-db/l0_perf.yml to include that test, and
update the appropriate QA list under tests/integration/test_lists/qa/
(llm_perf_*.yml) so this path is scheduled for perf runs.
In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_moe_fusion.py`:
- Around line 1631-1658: The current unit test only validates the helper
_extract_noaux_internal_routing on a hand-built graph; you must add
transform-level tests that run the actual fuse_nvfp4_moe rewrite and assert its
behavior for both internal-routing fusion and the all-to-all fallback. Create
two tests (or one parametrized) that build an FX graph with real external
routing tensors (router_logits, routing_bias) and the noaux op, apply the
transformation function fuse_nvfp4_moe, and then assert: (1) when
internal-routing is detected the graph contains a single
trtllm_nvfp4_trtllm_gen_moe_fused node and router_logits/routing_bias are
threaded into that node, and (2) when routing is externally required (all-to-all
fallback case) the transform does not drop external routing tensors and keeps
the all-to-all path intact; include the with_ep_mask True/False cases similar to
test_nvfp4_trtllm_gen_internal_routing to cover masking behavior. Ensure
assertions inspect node targets/args in the transformed FX graph to uniquely
identify fuse_nvfp4_moe and trtllm_nvfp4_trtllm_gen_moe_fused behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d26d551d-4218-4461-a57e-0bd2562bbf93
📒 Files selected for processing (5)
examples/auto_deploy/model_registry/configs/super_v3.yamltensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.pytensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.pytensorrt_llm/_torch/auto_deploy/transform/library/multi_stream_moe.pytests/unittest/auto_deploy/singlegpu/transformations/library/test_moe_fusion.py
|
PR_Github #47733 [ run ] triggered by Bot. Commit: |
|
PR_Github #47733 [ run ] completed with state
|
Route matched noaux NVFP4 MoE patterns to TRTLLM-Gen internal routing by default when possible. Keep the external-routing path for all-to-all cases, preserve routing_bias dtype for TRTLLM-Gen, and cover direct and EP-masked matcher shapes. Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com>
39322a9 to
66129fd
Compare
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #47892 [ run ] triggered by Bot. Commit: |
|
PR_Github #47892 [ run ] completed with state
|
Signed-off-by: tcherckez-nvidia <127761168+tcherckez-nvidia@users.noreply.github.com>
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #47961 [ run ] triggered by Bot. Commit: |
|
PR_Github #47961 [ run ] completed with state
|
|
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #48099 [ run ] triggered by Bot. Commit: |
|
PR_Github #48099 [ run ] completed with state |
|
/bot run |
|
PR_Github #48135 [ run ] triggered by Bot. Commit: |
|
PR_Github #48135 [ run ] completed with state |
…p4_trtllm_gen_moe_fused The custom_op-decorated real impl of trtllm_nvfp4_trtllm_gen_moe_fused lost its batch_info_host parameter when main was merged into this branch (merge d8f3517 pulled in NVIDIA#13997's TRTLLM-Gen internal-routing rewrite, which dropped batch_info_host while introducing router_logits/routing_bias/ top_k/n_group/topk_group/routed_scaling_factor). The register_fake impl kept batch_info_host, so the registered schema and the fake schema went out of sync. The MoE call site (added by this PR's runtime-max-tokens feature) passes batch_info_host uniformly to every MoE op. The dispatcher rejects it on trtllm_nvfp4_trtllm_gen_moe_fused with: RuntimeError: Unknown keyword argument 'batch_info_host' for operator 'auto_deploy::trtllm_nvfp4_trtllm_gen_moe_fused'. surfaced by TestNemotronSuperV3::test_accuracy[nvfp4-4-attn_dp_on-trtllm] on DGX_B200-4_GPUs-AutoDeploy-1. Re-add batch_info_host to the real-impl signature in the same position the fake impl already has it, and forward it into the underlying _trtllm_nvfp4_trtllm_gen_moe_impl helper (which already accepts and threads it through). Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com> Signed-off-by: tcherckez-nvidia <127761168+tcherckez-nvidia@users.noreply.github.com>
Route matched noaux NVFP4 MoE patterns to TRTLLM-Gen internal routing by default when possible.
Keep the external-routing path for all-to-all cases, preserve routing_bias dtype for TRTLLM-Gen, and cover direct and EP-masked matcher shapes.
Summary by CodeRabbit
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.