[None][perf] Extend customMoeRouting kernel to support Qwen3.5 #13433
nv-guomingz merged 1 commit into NVIDIA:main from
Conversation
/bot run
📝 Walkthrough

The PR extends MoE routing capabilities to support larger expert counts (up to 512) and higher top-k values (up to 16). CUDA kernels are refactored to adaptively select block sizes based on expert count, constraint validation is relaxed in host-side ops, dispatch-logic thresholds are updated, and test coverage is expanded for the new parameter ranges.

Changes

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1)
256-265: `CASE(384)` is unreachable dead code.

`nextPowerOfTwo()` only returns powers of 2 (32, 64, 128, 256, 512, ...). Since 384 is not a power of 2, `maxNumExperts` will never equal 384, making `CASE(384)` unreachable. This wastes compilation time instantiating unused kernel templates. Note: the pre-existing `CASE(96)` has the same issue.

Consider removing `CASE(384)` or, if non-power-of-2 expert counts should be supported efficiently, modifying the dispatch logic to round to the closest supported template value rather than the next power of two.

♻️ Proposed fix to remove unreachable case
```diff
 CASE(128)
 CASE(256)
-CASE(384)
 CASE(512)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu` around lines 256 - 265, The switch dispatch contains unreachable instantiations CASE(384) and CASE(96) because nextPowerOfTwo() only yields powers of two (e.g., 32,64,128,256,512), so remove the unreachable CASE(384) and CASE(96) entries from the switch over maxNumExperts (or alternatively change the dispatching logic that computes maxNumExperts from nextPowerOfTwo() to instead round/clip to the nearest supported template values); ensure kernelInstance handling remains correct after removing those CASEs and reference the symbols maxNumExperts, nextPowerOfTwo(), CASE(...), and kernelInstance when making the change.
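The unreachability claim can be checked mechanically. Below is a minimal Python analogue (an assumption, not the actual CUDA code) of the `nextPowerOfTwo()` helper the dispatch relies on, showing that neither 384 nor 96 can ever be the dispatch value:

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n; a Python analogue (assumption) of the
    C++ nextPowerOfTwo() helper used to compute maxNumExperts."""
    p = 1
    while p < n:
        p <<= 1
    return p

# The dispatch value is always a power of two, so a CASE(384) or CASE(96)
# branch can never be selected for any expert count up to 512.
reachable = {next_power_of_two(n) for n in range(1, 513)}
print(sorted(reachable))  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(384 in reachable, 96 in reachable)  # False False
```

Any expert count in (256, 512] rounds up to 512, which is why `CASE(512)` alone covers the new Qwen3.5 range.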
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bfe24621-9b9d-4f13-b7b1-3cc633d809ac
📒 Files selected for processing (5)
- cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
- cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
- cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
- tensorrt_llm/_torch/modules/fused_moe/routing.py
- tests/unittest/_torch/modules/test_moe_routing.py
PR_Github #45396 [ run ] triggered by Bot. Commit:
PR_Github #45396 [ run ] completed with state
1620bcf to 8c509b8 (Compare)

/bot run --disable-fail-fast

PR_Github #45579 [ run ] triggered by Bot. Commit:
PR_Github #45579 [ run ] completed with state
This PR enables the gatherTopK+softmax → customMoeRoutingKernel optimization for Qwen3.5. Qwen3.5-397B-A17B-NVFP4 and Qwen3-Next-80B have num_experts=512, top_k=10, which previously failed the fast-path guard (`num_experts > 128 || top_k > 8`) in RenormalizeMoeRoutingMethod.apply and fell back to a PyTorch path (torch.topk + softmax_warp_forward), costing ~2.5% of GPU time per MoE layer on ADP4 1k/1k decode.

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
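For illustration, here is a hedged sketch of the relaxed fast-path guard described above. `use_fused_routing` is a hypothetical helper, not the actual code in routing.py; the thresholds are the limits stated in this PR (512 experts, top-k 16, up from 128 and 8):

```python
def use_fused_routing(num_experts: int, top_k: int) -> bool:
    """Hypothetical sketch of the fast-path guard in
    RenormalizeMoeRoutingMethod.apply. The old guard
    (num_experts > 128 or top_k > 8) rejected Qwen3.5-class shapes;
    the extended kernel raises the limits to 512 experts and top-k 16."""
    MAX_NUM_EXPERTS = 512  # previously 128
    MAX_TOP_K = 16         # previously 8
    return num_experts <= MAX_NUM_EXPERTS and top_k <= MAX_TOP_K

# Qwen3.5-397B-A17B / Qwen3-Next-80B shape now takes the fused kernel path
# instead of falling back to torch.topk + softmax:
print(use_fused_routing(512, 10))  # True
```

Under the old limits the same call would have returned False, routing these models to the slower PyTorch fallback.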
8c509b8 to bc693ed (Compare)
/bot run --disable-fail-fast

PR_Github #45632 [ run ] triggered by Bot. Commit:
PR_Github #45632 [ run ] completed with state
/bot run

PR_Github #45641 [ run ] triggered by Bot. Commit:
PR_Github #45641 [ run ] completed with state
/bot run --disable-fail-fast

PR_Github #45852 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #45959 [ run ] triggered by Bot. Commit:
PR_Github #45959 [ run ] completed with state
Summary by CodeRabbit
Release Notes
New Features
Performance
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.