[None][fix] Continue MoE comm fallback on init exceptions#13331
[None][fix] Continue MoE comm fallback on init exceptions#13331xxi-nv merged 3 commits intoNVIDIA:mainfrom
Conversation
Catch non-RuntimeError failures during communication strategy initialization so auto-selection can continue to the next fallback. Log these failures at info level so fallback decisions are visible in runtime logs. Signed-off-by: xxi <95731198+xxi-nv@users.noreply.github.com>
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughSingle file updated to broaden exception handling in Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~5 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (1)
1-1:⚠️ Potential issue | 🟡 MinorUpdate copyright year on this modified source file.
Line [1] still shows
2025even though this file has meaningful modifications in this PR.As per coding guidelines: “All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py` at line 1, Update the SPDX copyright header year at the top of this file from "2025" to the year of latest meaningful modification (e.g., "2026") so the header reflects the current modification date; edit the header comment line that begins with "SPDX-FileCopyrightText:" in this module (communication_factory.py) to replace the old year with the correct one.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`:
- Around line 143-145: The current strategy probe blocks in
communication_factory.py (e.g., the NVLinkOneSided probe and the other three
strategy probes) catch Exception too broadly; change each try/except to only
catch recoverable initialization errors (RuntimeError and ValueError) and log
the error, and for any other unexpected exception re-raise it after logging the
traceback (use logger.exception or capture traceback before raising). Update the
except clauses around the NVLinkOneSided, NVLinkP2P, IB and Socket/other
strategy constructor probes to follow this pattern so only (RuntimeError,
ValueError) are swallowed and all other exceptions propagate after being logged.
---
Outside diff comments:
In
`@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`:
- Line 1: Update the SPDX copyright header year at the top of this file from
"2025" to the year of latest meaningful modification (e.g., "2026") so the
header reflects the current modification date; edit the header comment line that
begins with "SPDX-FileCopyrightText:" in this module (communication_factory.py)
to replace the old year with the correct one.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d5775bc9-f73a-408e-aa4d-095556992418
📒 Files selected for processing (1)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
|
PR_Github #44941 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #45046 [ run ] triggered by Bot. Commit: |
|
PR_Github #45046 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #45272 [ run ] triggered by Bot. Commit: |
|
PR_Github #45272 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #45461 [ run ] triggered by Bot. Commit: |
|
PR_Github #45461 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46024 [ run ] triggered by Bot. Commit: |
|
PR_Github #46024 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46229 [ run ] triggered by Bot. Commit: |
|
PR_Github #46229 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46306 [ run ] triggered by Bot. Commit: |
|
PR_Github #46306 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #46495 [ run ] triggered by Bot. Commit: |
|
PR_Github #46495 [ run ] completed with state |
Catch non-RuntimeError failures during communication strategy initialization so auto-selection can continue to the next fallback. Log these failures at info level so fallback decisions are visible in runtime logs.
Summary by CodeRabbit
Bug Fixes
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.