
[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails #13172

Open
ziyixiong-nv wants to merge 1 commit into NVIDIA:main from ziyixiong-nv:repair-bot-bug5859886

Conversation

@ziyixiong-nv
Collaborator

@ziyixiong-nv ziyixiong-nv commented Apr 18, 2026

Summary

  • Fix for NVBugs 5859886: [TensorRT-LLM][L0][Post-Merge][main]accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_fp8_blockscale[disable_skip_indexer] timeout
  • Root cause: DeepEP relies on NVSHMEM internally and hangs during the forward pass when NVLink symmetric memory initialization fails
  • Fix: cache the initialization failure and skip DeepEP/DeepEPLowLatency strategies, falling back to AllGatherReduceScatter
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes

    • Improved NVLink workspace initialization failure handling with graceful fallback mechanism to prevent repeated initialization attempts on known failures.
  • Tests

    • Removed skip waiver for a DeepSeek test case.

@ziyixiong-nv ziyixiong-nv requested a review from a team as a code owner April 18, 2026 00:26
@ziyixiong-nv ziyixiong-nv requested a review from yuxianq April 18, 2026 00:26
@coderabbitai
Contributor

coderabbitai bot commented Apr 18, 2026

📝 Walkthrough

Walkthrough

The changes add failure tracking and error handling for NVLink workspace initialization in the MOE communication module. A new class-level flag records initialization failures, and the strategy factory conditionally skips DeepEP strategies when this flag is set. A test waiver entry is removed.

Changes

Cohort / File(s) / Summary

MOE Communication Error Handling
  • tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
Added a class-level flag _WORKSPACE_INIT_FAILED to track NVLink initialization failures. nvlink_one_sided.py now wraps workspace allocation in try/except to capture RuntimeError and AssertionError, setting the flag and re-raising on failure. communication_factory.py conditionally bypasses DeepEP/DeepEPLowLatency strategy selection when the flag is set, falling back to AllGatherReduceScatter.

Test Waivers
  • tests/integration/test_lists/waives.txt
Removed the waiver entry for accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_fp8_blockscale[disable_skip_indexer].
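The failure-caching pattern summarized above can be sketched as follows. This is a minimal illustration of the approach, not the actual TensorRT-LLM code: the method `_init_workspace` and the helper `_allocate_symmetric_memory` are hypothetical stand-ins for the real `MnnvlMemory.initialize()` / `moe_a2a_initialize` calls; only the flag name `_WORKSPACE_INIT_FAILED` comes from the PR.

```python
class NVLinkOneSided:
    # Class-level flag: once workspace init fails in this process, every
    # MoE layer skips further attempts (mirrors _WORKSPACE_INIT_FAILED).
    _WORKSPACE_INIT_FAILED = False
    _WORKSPACE = None

    def _init_workspace(self):
        # Short-circuit on a previously recorded failure instead of
        # re-attempting an allocation that is known to fail.
        if NVLinkOneSided._WORKSPACE_INIT_FAILED:
            raise RuntimeError(
                "NVLinkOneSided: workspace initialization previously "
                "failed on this node, skipping repeated attempt.")
        try:
            # Stand-in for MnnvlMemory.initialize() + moe_a2a_initialize().
            NVLinkOneSided._WORKSPACE = self._allocate_symmetric_memory()
        except (RuntimeError, AssertionError):
            # Cache the failure so the communication factory can fall
            # back to an NCCL-based strategy, then re-raise.
            NVLinkOneSided._WORKSPACE_INIT_FAILED = True
            raise

    def _allocate_symmetric_memory(self):
        # Hypothetical placeholder for the real NVLink/MNNVL allocation.
        return object()
```

The key design point is that the flag is class-level, so a failure observed while building one MoE layer is visible to all subsequent layers and to the factory.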

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive
    Explanation: The PR description includes a summary, test plan, and links, but lacks a detailed explanation of what was changed and why.
    Resolution: Expand the description to explain the root cause, the implementation approach, and how the fix resolves the timeout issue in more detail.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Title check — ✅ Passed: The title clearly and specifically summarizes the main change: skipping DeepEP when NVLink symmetric memory initialization fails, which directly addresses the root cause of the timeout issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py (1)

230-237: Consider checking _WORKSPACE_INIT_FAILED before MnnvlMemory.initialize().

The failure check (line 232) occurs after MnnvlMemory.initialize() (line 230). If MnnvlMemory.initialize() is safe to call repeatedly regardless of workspace state, this is fine. However, moving the check earlier would be slightly more defensive.

♻️ Optional: Move failure check before MnnvlMemory.initialize()
+        if self._WORKSPACE_INIT_FAILED:
+            raise RuntimeError(
+                "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously "
+                "failed on this node, skipping repeated initialization attempt."
+            )
+
         # Initialize or reuse workspace
         MnnvlMemory.initialize()
 
-        if self._WORKSPACE_INIT_FAILED:
-            raise RuntimeError(
-                "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously "
-                "failed on this node, skipping repeated initialization attempt."
-            )
-
         if self._WORKSPACE is None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py`
around lines 230 - 237, The _WORKSPACE_INIT_FAILED flag should be checked before
attempting to re-initialize workspace resources; move the check for
self._WORKSPACE_INIT_FAILED to immediately before the call to
MnnvlMemory.initialize() in NVLinkOneSided so that if the workspace init
previously failed we raise the RuntimeError early and skip calling
MnnvlMemory.initialize(); keep the same RuntimeError message and behavior
otherwise.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ef44833d-2b45-4f1e-8224-51e047ee2be3

📥 Commits

Reviewing files that changed from the base of the PR and between 813d877 and 8f778a3.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@ziyixiong-nv ziyixiong-nv changed the title [https://nvbugs/5859886][fix] [TensorRT-LLM][L0][Post-Merge][main]accuracy/test_ [https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails Apr 18, 2026
[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails

When NVLinkOneSided workspace initialization fails (MNNVL allocation or
NVSHMEM moe_a2a_initialize), cache the failure and propagate it to skip
DeepEP/DeepEPLowLatency strategies in CommunicationFactory. DeepEP also
relies on NVSHMEM internally and would hang during forward pass if the
NVLink symmetric memory infrastructure is unavailable.

Changes:
- NVLinkOneSided: wrap workspace init (MnnvlMemory + moe_a2a_initialize)
  in try-except, set _WORKSPACE_INIT_FAILED on failure to avoid repeated
  attempts across MoE layers and signal the factory.
- CommunicationFactory: check _WORKSPACE_INIT_FAILED before trying
  DeepEP/DeepEPLowLatency; fall through to AllGatherReduceScatter (NCCL).
- Remove test waiver for test_fp8_blockscale[disable_skip_indexer].
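The factory-side fallback described in the commit message can be sketched as below. The function name and signature are illustrative (the real CommunicationFactory selects strategy objects, not strings); the strategy names and the skip condition follow the PR description.

```python
def select_comm_strategy(prefer_deep_ep: bool,
                         workspace_init_failed: bool) -> str:
    """Pick a MoE communication strategy, skipping DeepEP variants when
    NVLink symmetric-memory initialization is known to have failed."""
    # DeepEP also relies on NVSHMEM internally; selecting it after an
    # NVLink symmetric-memory failure would hang the forward pass, so
    # the flag gates the DeepEP/DeepEPLowLatency path entirely.
    if prefer_deep_ep and not workspace_init_failed:
        return "DeepEP"
    # Fall through to the NCCL-based strategy, which does not depend on
    # NVLink symmetric memory.
    return "AllGatherReduceScatter"
```

This makes the degradation graceful: instead of each MoE layer re-attempting a doomed initialization, the cached failure steers every layer onto the NCCL path at strategy-selection time.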

Signed-off-by: Ziyi Xiong <219238287+ziyixiong-nv@users.noreply.github.com>
@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug5859886 branch from 8f778a3 to ef935a1 Compare April 20, 2026 02:22
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44256 [ run ] triggered by Bot. Commit: ef935a1

@tensorrt-cicd
Collaborator

PR_Github #44256 [ run ] completed with state FAILURE. Commit: ef935a1
/LLM/main/L0_MergeRequest_PR pipeline #34677 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

