[None][fix] Continue MoE comm fallback on init exceptions #13331

Merged
xxi-nv merged 3 commits into NVIDIA:main from xxi-nv:fixCommFallBack on May 2, 2026

Conversation

@xxi-nv
Collaborator

@xxi-nv xxi-nv commented Apr 22, 2026

Catch non-RuntimeError failures during communication strategy initialization so auto-selection can continue to the next fallback. Log these failures at info level so fallback decisions are visible in runtime logs.

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling in communication strategy selection to catch a wider range of exceptions and ensure graceful fallback to alternative strategies.
  • Chores

    • Enhanced logging visibility for communication strategy diagnostics by promoting informational messages.
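
The effect of promoting a message from debug to info can be illustrated with a small sketch. It uses Python's standard `logging` module as a stand-in (an assumption: the project has its own logger wrapper, which is not shown in this PR page), and it assumes the runtime is configured at INFO level. The logger name and message text are hypothetical.

```python
import io
import logging

# Capture output in memory so the effect of the level filter is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)

logger = logging.getLogger("moe_comm_loglevel_sketch")  # hypothetical name
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # assumed runtime log level
logger.propagate = False

# At INFO level the debug record is dropped and only the info record is emitted.
logger.debug("DeepEP not available: debug message")
logger.info("DeepEP not available: info message")

captured = stream.getvalue()
```

With this configuration only the info-level line reaches the handler, which is why a fallback decision logged at debug level would not show up in a default runtime log.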

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Catch non-RuntimeError failures during communication strategy initialization so auto-selection can continue to the next fallback. Log these failures at info level so fallback decisions are visible in runtime logs.

Signed-off-by: xxi <95731198+xxi-nv@users.noreply.github.com>
@xxi-nv xxi-nv requested a review from a team as a code owner April 22, 2026 08:43
@xxi-nv xxi-nv requested a review from QiJune April 22, 2026 08:43
@xxi-nv
Collaborator Author

xxi-nv commented Apr 22, 2026

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

Single file updated to broaden exception handling in create_strategy's auto-selection mechanism. The function now catches Exception instead of RuntimeError for four candidate strategies (NVLinkOneSided, NVLinkTwoSided, DeepEP, DeepEPLowLatency), with logging level changed from debug to info for unavailability cases.
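
A minimal sketch of the fallback pattern described here, assuming a simple ordered list of candidate classes. This is not the code from communication_factory.py: `select_strategy`, the `catch` parameter, and the three classes are hypothetical names used only for illustration.

```python
import logging

logger = logging.getLogger("moe_comm_fallback_sketch")  # hypothetical name


class KernelBackedStrategy:
    """Hypothetical candidate whose init fails with a non-RuntimeError."""

    def __init__(self):
        raise ImportError("optional kernel library not installed")


class TopologyBoundStrategy:
    """Hypothetical candidate whose init fails with a RuntimeError."""

    def __init__(self):
        raise RuntimeError("unsupported topology")


class DefaultStrategy:
    """Hypothetical last-resort strategy that always initializes."""


def select_strategy(candidates, default, catch=Exception):
    """Try each candidate in order; fall through to `default` if all fail.

    `catch` is the exception type that is treated as "candidate unavailable".
    Passing RuntimeError models the narrower behavior described as the
    pre-change state; Exception models the broader one.
    """
    for cls in candidates:
        try:
            return cls()
        except catch as exc:
            logger.info("%s not available: %s", cls.__name__, exc)
    return default()


chosen = select_strategy(
    [KernelBackedStrategy, TopologyBoundStrategy], DefaultStrategy
)
```

With `catch=RuntimeError`, the `ImportError` from the first candidate would propagate and abort selection before the second candidate is tried; with `catch=Exception`, selection continues to the default.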

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Error Handling & Logging: `tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py` | Updated exception handling in create_strategy to catch Exception instead of RuntimeError for four candidate strategies, and elevated log level from debug to info for strategy unavailability notifications. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The PR description explains the issue and solution, but the description template sections (Description, Test Coverage) are not filled out; only the header explanation is provided before the template comments. | Fill out the 'Description' section explaining the issue and why the fix is needed, and the 'Test Coverage' section listing relevant tests that validate the changes. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: catching exceptions during MoE communication fallback initialization to allow continuation to next strategy. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year on this modified source file.

Line [1] still shows 2025 even though this file has meaningful modifications in this PR.

As per coding guidelines: “All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`
at line 1, Update the SPDX copyright header year at the top of this file from
"2025" to the year of latest meaningful modification (e.g., "2026") so the
header reflects the current modification date; edit the header comment line that
begins with "SPDX-FileCopyrightText:" in this module (communication_factory.py)
to replace the old year with the correct one.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`:
- Around line 143-145: The current strategy probe blocks in
communication_factory.py (e.g., the NVLinkOneSided probe and the other three
strategy probes) catch Exception too broadly; change each try/except to only
catch recoverable initialization errors (RuntimeError and ValueError) and log
the error, and for any other unexpected exception re-raise it after logging the
traceback (use logger.exception or capture traceback before raising). Update the
except clauses around the NVLinkOneSided, NVLinkTwoSided, DeepEP, and
DeepEPLowLatency strategy constructor probes to follow this pattern so only
(RuntimeError, ValueError) are swallowed and all other exceptions propagate
after being logged.

---

Outside diff comments:
In
`@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`:
- Line 1: Update the SPDX copyright header year at the top of this file from
"2025" to the year of latest meaningful modification (e.g., "2026") so the
header reflects the current modification date; edit the header comment line that
begins with "SPDX-FileCopyrightText:" in this module (communication_factory.py)
to replace the old year with the correct one.
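
The narrower handling proposed in the inline comment above could look roughly like the following sketch. It illustrates the reviewer bot's suggestion only and is not necessarily what was merged; `probe`, `RECOVERABLE`, and the three factory functions are hypothetical names, and the standard `logging` module stands in for the project's logger.

```python
import logging

logger = logging.getLogger("moe_comm_probe_sketch")  # hypothetical name

# Exception types treated as "strategy unavailable" in this sketch.
RECOVERABLE = (RuntimeError, ValueError)


def probe(factory):
    """Return factory() on success, or None if init fails recoverably.

    Any exception outside RECOVERABLE is logged with its traceback and
    re-raised so that unexpected bugs are not silently swallowed.
    """
    name = getattr(factory, "__name__", repr(factory))
    try:
        return factory()
    except RECOVERABLE as exc:
        logger.info("%s not available: %s", name, exc)
        return None
    except Exception:
        logger.exception("Unexpected error while probing %s", name)
        raise


def working_factory():
    return "ready"


def unsupported_factory():
    raise ValueError("unsupported configuration")


def buggy_factory():
    raise TypeError("unexpected keyword argument")
```

The trade-off is that a narrow list keeps genuine programming errors visible, while a broad `except Exception` guarantees that auto-selection always reaches the next fallback.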

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d5775bc9-f73a-408e-aa4d-095556992418

📥 Commits

Reviewing files that changed from the base of the PR and between 36fb5f0 and 34a9ba3.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py

@tensorrt-cicd
Collaborator

PR_Github #44941 [ run ] triggered by Bot. Commit: 34a9ba3 Link to invocation

Collaborator

@QiJune QiJune left a comment

LGTM

@xxi-nv
Collaborator Author

xxi-nv commented Apr 23, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45046 [ run ] triggered by Bot. Commit: 34a9ba3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45046 [ run ] completed with state SUCCESS. Commit: 34a9ba3
/LLM/main/L0_MergeRequest_PR pipeline #35352 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 24, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45272 [ run ] triggered by Bot. Commit: b8e794f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45272 [ run ] completed with state SUCCESS. Commit: b8e794f
/LLM/main/L0_MergeRequest_PR pipeline #35529 completed with status: 'FAILURE'

@xxi-nv
Collaborator Author

xxi-nv commented Apr 24, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45461 [ run ] triggered by Bot. Commit: bea3ec2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45461 [ run ] completed with state SUCCESS. Commit: bea3ec2
/LLM/main/L0_MergeRequest_PR pipeline #35693 completed with status: 'FAILURE'

@xxi-nv
Collaborator Author

xxi-nv commented Apr 29, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46024 [ run ] triggered by Bot. Commit: bea3ec2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46024 [ run ] completed with state SUCCESS. Commit: bea3ec2
/LLM/main/L0_MergeRequest_PR pipeline #36177 completed with status: 'FAILURE'

@xxi-nv
Collaborator Author

xxi-nv commented Apr 29, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46229 [ run ] triggered by Bot. Commit: bea3ec2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46229 [ run ] completed with state SUCCESS. Commit: bea3ec2
/LLM/main/L0_MergeRequest_PR pipeline #36341 completed with status: 'FAILURE'

@xxi-nv
Collaborator Author

xxi-nv commented Apr 30, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46306 [ run ] triggered by Bot. Commit: bea3ec2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46306 [ run ] completed with state SUCCESS. Commit: bea3ec2
/LLM/main/L0_MergeRequest_PR pipeline #36406 completed with status: 'FAILURE'

@xxi-nv
Collaborator Author

xxi-nv commented May 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46495 [ run ] triggered by Bot. Commit: bea3ec2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46495 [ run ] completed with state SUCCESS. Commit: bea3ec2
/LLM/main/L0_MergeRequest_PR pipeline #36557 completed with status: 'SUCCESS'

CI Report

Link to invocation

@xxi-nv xxi-nv merged commit 0f68ba6 into NVIDIA:main May 2, 2026
5 checks passed
@xxi-nv xxi-nv deleted the fixCommFallBack branch May 2, 2026 02:13
4 participants