[https://nvbugs/6140411][fix] Fix index error of shared expert when loading weights#13856
Conversation
📝 WalkthroughWalkthroughThis change adds conditional dynamic quantization support to ChangesDynamic Quantization Parameter and Computation Refactoring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
3239-3263: 💤 Low valueConsider: TRTLLMGen path may need similar handling for dynamic quantization.
The
_finalize_shared_expert_alphasmethod calls_reconcile_and_compute_alphaswithout passingdst_fc31_weight_scale_2anddst_fc2_weight_scale_2. IfNVFP4TRTLLMGenFusedMoEMethod(or related classes) are used withforce_dynamic_quantization=True, the weight_scale_2 tensors won't be populated for TRTLLMGen-specific shared experts.This may be intentional if TRTLLMGen doesn't support dynamic quantization for shared experts yet, but worth noting for future work if this feature combination becomes needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 3239 - 3263, The _finalize_shared_expert_alphas function calls _reconcile_and_compute_alphas for TRTLLMGen shared experts but does not pass dst_fc31_weight_scale_2 or dst_fc2_weight_scale_2, so when NVFP4TRTLLMGenFusedMoEMethod (or related classes) is used with force_dynamic_quantization=True the TRTLLMGen-specific weight_scale_2 tensors won't be populated; update _finalize_shared_expert_alphas to either supply the appropriate dst_fc31_weight_scale_2 and dst_fc2_weight_scale_2 (e.g. from module attributes created for TRTLLMGen path like tmp_trtllmgen_shared_weight_scale_2 variants) or call _reconcile_and_compute_alphas with explicit None/placeholder and ensure _reconcile_and_compute_alphas handles dynamic-quantization case for TRTLLMGen, referencing the symbols _finalize_shared_expert_alphas, _reconcile_and_compute_alphas, tmp_trtllmgen_shared_weight_scale_2, dst_fc31_weight_scale_2, dst_fc2_weight_scale_2, NVFP4TRTLLMGenFusedMoEMethod and force_dynamic_quantization so shared experts get their weight_scale_2 populated or gracefully skipped.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around line 3239-3263: The _finalize_shared_expert_alphas function calls
_reconcile_and_compute_alphas for TRTLLMGen shared experts but does not pass
dst_fc31_weight_scale_2 or dst_fc2_weight_scale_2, so when
NVFP4TRTLLMGenFusedMoEMethod (or related classes) is used with
force_dynamic_quantization=True the TRTLLMGen-specific weight_scale_2 tensors
won't be populated; update _finalize_shared_expert_alphas to either supply the
appropriate dst_fc31_weight_scale_2 and dst_fc2_weight_scale_2 (e.g. from module
attributes created for TRTLLMGen path like tmp_trtllmgen_shared_weight_scale_2
variants) or call _reconcile_and_compute_alphas with explicit None/placeholder
and ensure _reconcile_and_compute_alphas handles dynamic-quantization case for
TRTLLMGen, referencing the symbols _finalize_shared_expert_alphas,
_reconcile_and_compute_alphas, tmp_trtllmgen_shared_weight_scale_2,
dst_fc31_weight_scale_2, dst_fc2_weight_scale_2, NVFP4TRTLLMGenFusedMoEMethod
and force_dynamic_quantization so shared experts get their weight_scale_2
populated or gracefully skipped.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e71e12b5-87b7-4dbd-9c65-234fe0354127
📒 Files selected for processing (1)
tensorrt_llm/_torch/modules/fused_moe/quantization.py
|
/bot run --stage-list "GB300-4_GPUs-PyTorch-PerfSanity-Post-Merge-1" |
|
PR_Github #47216 [ run ] triggered by Bot. Commit: |
|
PR_Github #47216 [ run ] completed with state
|
|
/bot run --stage-list "GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" |
|
/bot run --disable-fail-fast --stage-list "GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-9,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-6,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-10" |
|
PR_Github #47323 [ run ] triggered by Bot. Commit: |
|
PR_Github #47322 [ run ] triggered by Bot. Commit: |
|
PR_Github #47322 [ run ] completed with state
|
|
/bot run --disable-fail-fast --stage-list "GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-9,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-6,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-10" |
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
ef80c2b to
f54ac11
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47480 [ run ] triggered by Bot. Commit: |
|
PR_Github #47480 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47485 [ run ] triggered by Bot. Commit: |
|
PR_Github #47485 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47530 [ run ] triggered by Bot. Commit: |
|
PR_Github #47530 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47546 [ run ] triggered by Bot. Commit: |
|
PR_Github #47546 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47547 [ run ] triggered by Bot. Commit: |
|
PR_Github #47547 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47581 [ run ] triggered by Bot. Commit: |
|
PR_Github #47581 [ run ] completed with state |
…oading weights (NVIDIA#13856) Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
Summary by CodeRabbit
New Features
Improvements
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.