[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails#13172
[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails#13172ziyixiong-nv wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
📝 WalkthroughWalkthroughThe changes add failure tracking and error handling for NVLink workspace initialization in the MOE communication module. A new class-level flag records initialization failures, and the strategy factory conditionally skips DeepEP strategies when this flag is set. A test waiver entry is removed. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py (1)
230-237: Consider checking_WORKSPACE_INIT_FAILEDbeforeMnnvlMemory.initialize().The failure check (line 232) occurs after
MnnvlMemory.initialize()(line 230). IfMnnvlMemory.initialize()is safe to call repeatedly regardless of workspace state, this is fine. However, moving the check earlier would be slightly more defensive.♻️ Optional: Move failure check before MnnvlMemory.initialize()
+ if self._WORKSPACE_INIT_FAILED: + raise RuntimeError( + "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously " + "failed on this node, skipping repeated initialization attempt." + ) + # Initialize or reuse workspace MnnvlMemory.initialize() - if self._WORKSPACE_INIT_FAILED: - raise RuntimeError( - "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously " - "failed on this node, skipping repeated initialization attempt." - ) - if self._WORKSPACE is None:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py` around lines 230 - 237, The _WORKSPACE_INIT_FAILED flag should be checked before attempting to re-initialize workspace resources; move the check for self._WORKSPACE_INIT_FAILED to immediately before the call to MnnvlMemory.initialize() in NVLinkOneSided so that if the workspace init previously failed we raise the RuntimeError early and skip calling MnnvlMemory.initialize(); keep the same RuntimeError message and behavior otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py`:
- Around line 230-237: The _WORKSPACE_INIT_FAILED flag should be checked before
attempting to re-initialize workspace resources; move the check for
self._WORKSPACE_INIT_FAILED to immediately before the call to
MnnvlMemory.initialize() in NVLinkOneSided so that if the workspace init
previously failed we raise the RuntimeError early and skip calling
MnnvlMemory.initialize(); keep the same RuntimeError message and behavior
otherwise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ef44833d-2b45-4f1e-8224-51e047ee2be3
📒 Files selected for processing (3)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.pytensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.pytests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
…ails When NVLinkOneSided workspace initialization fails (MNNVL allocation or NVSHMEM moe_a2a_initialize), cache the failure and propagate it to skip DeepEP/DeepEPLowLatency strategies in CommunicationFactory. DeepEP also relies on NVSHMEM internally and would hang during forward pass if the NVLink symmetric memory infrastructure is unavailable. Changes: - NVLinkOneSided: wrap workspace init (MnnvlMemory + moe_a2a_initialize) in try-except, set _WORKSPACE_INIT_FAILED on failure to avoid repeated attempts across MoE layers and signal the factory. - CommunicationFactory: check _WORKSPACE_INIT_FAILED before trying DeepEP/DeepEPLowLatency; fall through to AllGatherReduceScatter (NCCL). - Remove test waiver for test_fp8_blockscale[disable_skip_indexer]. Signed-off-by: Ziyi Xiong <219238287+ziyixiong-nv@users.noreply.github.com>
8f778a3 to
ef935a1
Compare
|
/bot run |
|
PR_Github #44256 [ run ] triggered by Bot. Commit: |
|
PR_Github #44256 [ run ] completed with state
|
Summary
Test plan
Links
Summary by CodeRabbit
Bug Fixes
Tests