
[https://nvbugs/6072808][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test#13610

Merged
MrGeva merged 2 commits into NVIDIA:main from nv-auto-deploy:fix/ad-allreduce-fusion-port-retry on May 4, 2026

Conversation

@MrGeva (Collaborator) commented Apr 29, 2026

The multi-GPU test_allreduce_fusion test reserves a port via get_free_port() in the parent and broadcasts it to MpiPoolSession workers, which then call dist.init_process_group("nccl"). There is a TOCTOU race: the port can be grabbed between the parent socket close and the NCCL bind, producing DistNetworkError(EADDRINUSE) and a CI failure (e.g. on DGX_H100-4_GPUs-AutoDeploy-1).
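
For context, the usual shape of such a helper makes the race easy to see; this is a minimal sketch of the pattern, not necessarily the repository's exact get_free_port:

    import socket

    def get_free_port() -> int:
        # Ask the OS for an ephemeral port, then release it immediately.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))      # OS assigns a currently free port
            port = s.getsockname()[1]
        # The socket is closed here, so the port is free again. Anything that
        # binds it between this return and the workers' NCCL bind wins the
        # race, and the test then fails with EADDRINUSE.
        return port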

The spawn_multiprocess_job path already recovers from this via _PORT_CONFLICT_EXIT_CODE, but the MpiPoolSession path does not. Wrap submit_sync in a 5-attempt retry that fetches a fresh port and recreates the MPI pool when DistNetworkError(EADDRINUSE) is observed; other errors propagate immediately, mirroring the existing spawn_multiprocess_job behavior.
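
A minimal sketch of that retry shape, assuming the get_free_port helper above and illustrative names for the job function and pool cleanup (the exact test code and import paths may differ):

    # MpiPoolSession comes from tensorrt_llm; exact import path omitted here.
    from torch.distributed import DistNetworkError  # exposed by recent torch releases

    MAX_ATTEMPTS = 5

    def run_with_port_retry(world_size: int) -> None:
        last_exc = None
        for _ in range(MAX_ATTEMPTS):
            port = get_free_port()                           # fresh port each attempt
            session = MpiPoolSession(n_workers=world_size)   # recreate the MPI pool
            try:
                session.submit_sync(_run_job, port)          # workers bind NCCL on `port`
                return                                       # success: stop retrying
            except DistNetworkError as e:
                if "EADDRINUSE" not in str(e):
                    raise                                    # other errors propagate immediately
                last_exc = e                                 # lost the race; retry with a new port
            finally:
                session.shutdown()                           # assumed cleanup API
        raise RuntimeError("repeated EADDRINUSE port conflicts") from last_exc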

Summary by CodeRabbit

  • Tests
    • Enhanced test reliability by implementing retry logic to handle port allocation conflicts during distributed training initialization.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot run --stages "DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #46122 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --stages DGX_H100-4_GPUs-AutoDeploy-1

Link to invocation

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot run --help

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot run --stage-list "DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #46129 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --help

Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46130 [ run ] triggered by Bot. Commit: 77472c9 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46130 [ run ] completed with state SUCCESS. Commit: 77472c9
/LLM/main/L0_MergeRequest_PR pipeline #36262 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@MrGeva (Collaborator, Author) commented Apr 30, 2026

/bot run

MrGeva marked this pull request as ready for review April 30, 2026 04:41
MrGeva requested a review from a team as a code owner April 30, 2026 04:41
MrGeva requested a review from tcherckez-nvidia April 30, 2026 04:41
MrGeva enabled auto-merge (squash) April 30, 2026 04:41
MrGeva changed the title from "[None][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test" to "[https://nvbugs/6072808][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test" Apr 30, 2026
@coderabbitai (Contributor) Bot commented Apr 30, 2026

📝 Walkthrough

Walkthrough

A test file now implements a retry mechanism to handle port availability race conditions during distributed GPU initialization. Up to five attempts are made, each with a newly selected port, with specific error handling for address-in-use scenarios versus other network errors.

Changes

  • Cohort: Distributed Test Retry Logic
    File: tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py
    Summary: Modified the test to retry up to 5 times with fresh port selection on each attempt. Catches DistNetworkError with EADDRINUSE to retry, re-raises other DistNetworkError immediately, and raises RuntimeError if all retries fail due to port conflicts.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Process
    participant Port as Port Selector
    participant MPI as MpiPoolSession
    participant Worker as Worker Processes
    participant NCCL as dist.init_process_group

    loop Retry (up to 5 times)
        Test->>Port: Select free port
        Port-->>Test: Port number
        Test->>MPI: Create session with port
        Test->>Worker: Submit job
        Worker->>NCCL: Initialize with selected port
        
        alt Success
            NCCL-->>Worker: Initialized
            Worker-->>Test: Job completed
            Test->>Test: Exit retry loop
        else EADDRINUSE (Port claimed)
            NCCL-->>Worker: DistNetworkError
            Worker-->>Test: Job failed
            Test->>Test: Suppress error, continue to next retry
        else Other DistNetworkError
            NCCL-->>Worker: DistNetworkError
            Worker-->>Test: Job failed
            Test->>Test: Re-raise immediately
        end
    end
    
    alt All retries exhausted
        Test->>Test: Raise RuntimeError (repeated port conflicts)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ❓ Inconclusive: The PR description explains the issue and solution comprehensively, but it does not fill in the required template sections (Description, Test Coverage, PR Checklist). Resolution: organize the description by filling in the required template sections and confirming PR Checklist items as applicable.

✅ Passed checks (3 passed)

  • Title check ✅ Passed: The title clearly and concisely describes the main change: adding retry logic for EADDRINUSE errors in the AutoDeploy allreduce-fusion test.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Contributor) Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py (1)

153-183: QA list update is not needed for this PR scope

This change is confined to tests/unittest/... retry behavior, so no tests/integration/test_lists/qa/* entry update is required.

As per coding guidelines: "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py`
around lines 153 - 183, Update the PR description and changelog note to
explicitly state that QA list updates are unnecessary because the change only
touches unit tests under tests/unittest/...; mention this explicitly in the
review response or PR body and reference the test function name
test_allreduce_fusion (and the file
tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py)
so reviewers can verify the scope is limited and no
tests/integration/test_lists/qa/* entry update is required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py`:
- Around line 175-178: The except DistNetworkError as e block currently checks
the error message with mixed casing and misses lowercase-only variants; update
the port-conflict detection in that except block (where last_exc = e is set) to
examine a lower-cased error string once and reject raising only if none of the
port-conflict keywords are present (e.g., check if not any(k in str(e).lower()
for k in ("eaddrinuse", "address already in use")) then raise); this ensures
messages like "eaddrinuse" are recognized and treated as port conflicts for
retry.
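
In code, the suggested normalization is small; a sketch with a hypothetical helper name:

    def _is_port_conflict(exc: Exception) -> bool:
        # Lower-case the message once so "EADDRINUSE", "eaddrinuse", and
        # "Address already in use" variants are all treated as port conflicts.
        msg = str(exc).lower()
        return any(k in msg for k in ("eaddrinuse", "address already in use"))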


📥 Commits

Reviewing files that changed from the base of the PR and between e903428 and 77472c9.

📒 Files selected for processing (1)
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py

@tensorrt-cicd (Collaborator)

PR_Github #46296 [ run ] triggered by Bot. Commit: 77472c9 Link to invocation

@tcherckez-nvidia (Collaborator)

Already opened #13606, which takes a different approach.

@tensorrt-cicd (Collaborator)

PR_Github #46296 [ run ] completed with state SUCCESS. Commit: 77472c9
/LLM/main/L0_MergeRequest_PR pipeline #36397 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@MrGeva (Collaborator, Author) commented Apr 30, 2026

Already opened #13606, which takes a different approach.

@tcherckez-nvidia your PR:

  1. Move get_free_port() into the workers — rank 0 picks the port and mpi_broadcasts it to peers, instead of the parent picking it
    and passing it through submit_sync. Shrinks the TOCTOU window from "parent picks → MPI submit/spin-up → workers init NCCL"
    (milliseconds–seconds) to "worker rank 0 picks → broadcast → init NCCL" (microseconds).

  2. Add torch.distributed.barrier() before cleanup() — keeps all ranks aligned at teardown so a fast rank can't return to the
    parent and start the next parametrization while a slow rank still holds the rendezvous socket.

The PRs are complementary, not competing. Your two ideas are both legitimate hardening:

  • The barrier addresses something my fix doesn't — leaked rendezvous between adjacent parametrizations.
  • Picking the port inside workers is genuinely cleaner (no large race window), but on shared CI hosts a microsecond window can and does hit; "smaller probability" ≠ "won't happen."

So let's merge both.
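
For illustration, the worker-side flow described in points 1 and 2 might look roughly like this; get_free_port and mpi_broadcast are assumed helpers, and this is a sketch of the idea, not the actual #13606 diff:

    import torch.distributed as dist

    def _worker_job(rank: int, world_size: int) -> None:
        # Rank 0 picks the port immediately before NCCL init, shrinking the
        # race window to microseconds; peers receive it via MPI broadcast.
        port = get_free_port() if rank == 0 else None
        port = mpi_broadcast(port, root=0)     # assumed MPI broadcast helper
        dist.init_process_group(
            "nccl",
            init_method=f"tcp://127.0.0.1:{port}",
            rank=rank,
            world_size=world_size,
        )
        try:
            pass  # run the fused allreduce + residual + RMSNorm checks here
        finally:
            dist.barrier()                     # align ranks before teardown so a fast
            dist.destroy_process_group()       # rank can't start the next parametrization
                                               # while a slow rank holds the rendezvous socket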

@MrGeva (Collaborator, Author) commented Apr 30, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #46363 [ run ] triggered by Bot. Commit: fd7de86 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46363 [ run ] completed with state SUCCESS. Commit: fd7de86
/LLM/main/L0_MergeRequest_PR pipeline #36449 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lucaslie disabled auto-merge May 1, 2026 03:24
@tcherckez-nvidia (Collaborator)

Can we fix this other than by retrying 5 times?
Maybe there should be some sort of test resource manager that assigns ports per process to avoid conflicts.

@MrGeva (Collaborator, Author) commented May 4, 2026

@tcherckez-nvidia summarizing our conversation for now: it's too complex to add such a port manager, and it won't eliminate collisions because not all tests would use it. I think this method gives a very high success rate; it solved the problem for the other distributed tests that use PyTorch dist. This test uses MPI dist, so it also needs such a mechanism.

@MrGeva (Collaborator, Author) commented May 4, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

MrGeva enabled auto-merge (squash) May 4, 2026 08:32
@tensorrt-cicd (Collaborator)

PR_Github #46629 [ run ] triggered by Bot. Commit: fd7de86 Link to invocation

MrGeva added 2 commits May 4, 2026 11:39

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
MrGeva force-pushed the fix/ad-allreduce-fusion-port-retry branch from fd7de86 to 5a3a621 May 4, 2026 08:39
@MrGeva (Collaborator, Author) commented May 4, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #46631 [ run ] triggered by Bot. Commit: 5a3a621 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46631 [ run ] completed with state SUCCESS. Commit: 5a3a621
/LLM/main/L0_MergeRequest_PR pipeline #36676 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@MrGeva (Collaborator, Author) commented May 4, 2026

/bot skip --comment "the failures are not related to this change, all AD tests passed"

@tensorrt-cicd (Collaborator)

PR_Github #46678 [ skip ] triggered by Bot. Commit: 5a3a621 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46678 [ skip ] completed with state SUCCESS. Commit: 5a3a621
Skipping testing for commit 5a3a621

Link to invocation

MrGeva merged commit 2a04493 into NVIDIA:main May 4, 2026
6 checks passed