[https://nvbugs/6072808][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test #13610
Conversation
/bot run --stages "DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46122 Bot args parsing error: usage: /bot [-h]
/bot run --help
/bot help
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run
Launch build/test pipelines. All previously running jobs will be killed.

kill
Kill all running builds associated with the pull request.

skip
Skip testing for the latest commit on the pull request.

reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot run --stage-list "DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46129 Bot args parsing error: usage: /bot [-h]
PR_Github #46130 [ run ] triggered by Bot. Commit:
PR_Github #46130 [ run ] completed with state
/bot run
📝 Walkthrough
A test file now implements a retry mechanism to handle port availability race conditions during distributed GPU initialization. Up to five attempts are made, each with a newly selected port, with specific error handling for address-in-use scenarios versus other network errors.
Sequence Diagram(s)
sequenceDiagram
participant Test as Test Process
participant Port as Port Selector
participant MPI as MpiPoolSession
participant Worker as Worker Processes
participant NCCL as dist.init_process_group
loop Retry (up to 5 times)
Test->>Port: Select free port
Port-->>Test: Port number
Test->>MPI: Create session with port
Test->>Worker: Submit job
Worker->>NCCL: Initialize with selected port
alt Success
NCCL-->>Worker: Initialized
Worker-->>Test: Job completed
Test->>Test: Exit retry loop
else EADDRINUSE (Port claimed)
NCCL-->>Worker: DistNetworkError
Worker-->>Test: Job failed
Test->>Test: Suppress error, continue to next retry
else Other DistNetworkError
NCCL-->>Worker: DistNetworkError
Worker-->>Test: Job failed
Test->>Test: Re-raise immediately
end
end
alt All retries exhausted
Test->>Test: Raise RuntimeError (repeated port conflicts)
end

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py (1)
153-183: QA list update is not needed for this PR scope
This change is confined to tests/unittest/... retry behavior, so no tests/integration/test_lists/qa/* entry update is required. As per coding guidelines: "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py` around lines 153 - 183, Update the PR description and changelog note to explicitly state that QA list updates are unnecessary because the change only touches unit tests under tests/unittest/...; mention this explicitly in the review response or PR body and reference the test function name test_allreduce_fusion (and the file tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py) so reviewers can verify the scope is limited and no tests/integration/test_lists/qa/* entry update is required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py`:
- Around line 175-178: The except DistNetworkError as e block currently checks
the error message with mixed casing and misses lowercase-only variants; update
the port-conflict detection in that except block (where last_exc = e is set) to
examine a lower-cased error string once and reject raising only if none of the
port-conflict keywords are present (e.g., check if not any(k in str(e).lower()
for k in ("eaddrinuse", "address already in use")) then raise); this ensures
messages like "eaddrinuse" are recognized and treated as port conflicts for
retry.
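The suggested fix can be sketched as a small standalone helper (the name `is_port_conflict` is hypothetical; the PR performs this check inline inside the `except DistNetworkError as e` block):

```python
# Sketch of the case-insensitive port-conflict check suggested above.
# Lower-case the error message once, then match against both the errno
# name and the human-readable variant of the message.
PORT_CONFLICT_KEYWORDS = ("eaddrinuse", "address already in use")


def is_port_conflict(exc: Exception) -> bool:
    """Return True if the error message indicates a port conflict."""
    msg = str(exc).lower()
    return any(keyword in msg for keyword in PORT_CONFLICT_KEYWORDS)
```

This recognizes messages such as "EADDRINUSE", "eaddrinuse", and "Address already in use" uniformly, while anything else (e.g. connection refused) is treated as a non-retryable error.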
---
Nitpick comments:
In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py`:
- Around line 153-183: Update the PR description and changelog note to
explicitly state that QA list updates are unnecessary because the change only
touches unit tests under tests/unittest/...; mention this explicitly in the
review response or PR body and reference the test function name
test_allreduce_fusion (and the file
tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py)
so reviewers can verify the scope is limited and no
tests/integration/test_lists/qa/* entry update is required.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 70f5afce-8fbe-4e87-8268-be505c29558d
📒 Files selected for processing (1)
tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py
PR_Github #46296 [ run ] triggered by Bot. Commit:
already opened #13606 which has a different approach
PR_Github #46296 [ run ] completed with state
@tcherckez-nvidia your PR:
The PRs are complementary, not competing. Your two ideas are both legitimate hardening:
So let's merge both
/bot run
PR_Github #46363 [ run ] triggered by Bot. Commit:
PR_Github #46363 [ run ] completed with state
Can we fix it other than trying 5 times?
@tcherckez-nvidia summarizing our conversation for now: it's too complex to add such a port manager, and it won't eliminate collisions because not all tests will use it. I think this method gives a very high success rate; it solved the problem with the other dist tests that use PyTorch dist. This test uses MPI dist, so it also needs such a mechanism.
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46629 [ run ] triggered by Bot. Commit:
The multi-GPU test_allreduce_fusion test reserves a port via
get_free_port() in the parent and broadcasts it to MpiPoolSession
workers, which then call dist.init_process_group("nccl"). There is a
TOCTOU race: the port can be grabbed between the parent socket close
and the NCCL bind, producing DistNetworkError(EADDRINUSE) and a CI
failure (e.g. on DGX_H100-4_GPUs-AutoDeploy-1).
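A typical `get_free_port()` implementation makes the race window concrete (generic sketch, not necessarily the exact helper in the repo):

```python
import socket


def get_free_port() -> int:
    """Ask the OS for a free TCP port, then release it.

    The gap between the socket closing (end of the with block) and the
    worker's later NCCL bind is the TOCTOU window described above: any
    other process on the host can claim the port in between.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        return s.getsockname()[1]
```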
The spawn_multiprocess_job path already recovers from this via
_PORT_CONFLICT_EXIT_CODE, but the MpiPoolSession path does not. Wrap
submit_sync in a 5-attempt retry that fetches a fresh port and
recreates the MPI pool when DistNetworkError(EADDRINUSE) is observed;
other errors propagate immediately, mirroring the existing
spawn_multiprocess_job behavior.
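The retry described above can be sketched as a standalone loop. Here `run_job` and `pick_port` are hypothetical stand-ins for the real `MpiPoolSession.submit_sync` setup and `get_free_port`, and port conflicts are detected by message matching instead of catching a real `DistNetworkError`:

```python
MAX_ATTEMPTS = 5  # matches the 5-attempt retry described above


def run_with_port_retry(run_job, pick_port, is_port_conflict):
    """Call run_job(port) with a fresh port per attempt.

    Retry only on port-conflict errors; anything else propagates
    immediately, mirroring the spawn_multiprocess_job behavior.
    """
    last_exc = None
    for _ in range(MAX_ATTEMPTS):
        port = pick_port()  # fresh port (the PR also recreates the MPI pool)
        try:
            return run_job(port)
        except Exception as exc:
            if not is_port_conflict(exc):
                raise  # other errors propagate immediately
            last_exc = exc  # EADDRINUSE: retry with a new port
    raise RuntimeError("repeated port conflicts after retries") from last_exc
```

A job that hits EADDRINUSE on its first attempts simply gets re-submitted with a new port until it succeeds or the attempt budget is exhausted.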
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
fd7de86 to 5a3a621
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46631 [ run ] triggered by Bot. Commit:
PR_Github #46631 [ run ] completed with state
/bot skip --comment "the failures are not related to this change, all AD tests passed"
PR_Github #46678 [ skip ] triggered by Bot. Commit:
PR_Github #46678 [ skip ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.