
[https://nvbugs/6072808][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test#13610

Merged
MrGeva merged 2 commits into NVIDIA:main from nv-auto-deploy:fix/ad-allreduce-fusion-port-retry on May 4, 2026

Conversation

@MrGeva (Collaborator) commented Apr 29, 2026

The multi-GPU test_allreduce_fusion test reserves a port via get_free_port() in the parent and broadcasts it to MpiPoolSession workers, which then call dist.init_process_group("nccl"). There is a TOCTOU race: the port can be grabbed between the parent socket close and the NCCL bind, producing DistNetworkError(EADDRINUSE) and a CI failure (e.g. on DGX_H100-4_GPUs-AutoDeploy-1).
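
For context, the usual shape of such a helper makes the race easy to see; this is a minimal sketch of the pattern, not necessarily the repository's exact get_free_port:

    import socket

    def get_free_port() -> int:
        # Ask the OS for an ephemeral port, then release it immediately.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))      # OS assigns a currently free port
            port = s.getsockname()[1]
        # The socket is closed here, so the port is free again. Anything that
        # binds it between this return and the workers' NCCL bind wins the
        # race, and the test then fails with EADDRINUSE.
        return port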

The spawn_multiprocess_job path already recovers from this via _PORT_CONFLICT_EXIT_CODE, but the MpiPoolSession path does not. Wrap submit_sync in a 5-attempt retry that fetches a fresh port and recreates the MPI pool when DistNetworkError(EADDRINUSE) is observed; other errors propagate immediately, mirroring the existing spawn_multiprocess_job behavior.
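
A minimal sketch of that retry shape, assuming the get_free_port helper above and illustrative names for the job function and pool cleanup (the exact test code and import paths may differ):

    # MpiPoolSession comes from tensorrt_llm; exact import path omitted here.
    from torch.distributed import DistNetworkError  # exposed by recent torch releases

    MAX_ATTEMPTS = 5

    def run_with_port_retry(world_size: int) -> None:
        last_exc = None
        for _ in range(MAX_ATTEMPTS):
            port = get_free_port()                           # fresh port each attempt
            session = MpiPoolSession(n_workers=world_size)   # recreate the MPI pool
            try:
                session.submit_sync(_run_job, port)          # workers bind NCCL on `port`
                return                                       # success: stop retrying
            except DistNetworkError as e:
                if "EADDRINUSE" not in str(e):
                    raise                                    # other errors propagate immediately
                last_exc = e                                 # lost the race; retry with a new port
            finally:
                session.shutdown()                           # assumed cleanup API
        raise RuntimeError("repeated EADDRINUSE port conflicts") from last_exc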

Summary by CodeRabbit

  • Tests
    • Enhanced test reliability by implementing retry logic to handle port allocation conflicts during distributed training initialization.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot run --stages "DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #46122 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --stages DGX_H100-4_GPUs-AutoDeploy-1

Link to invocation

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot run --help

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@MrGeva (Collaborator, Author) commented Apr 29, 2026

/bot run --stage-list "DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #46129 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --help

Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46130 [ run ] triggered by Bot. Commit: 77472c9 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46130 [ run ] completed with state SUCCESS. Commit: 77472c9
/LLM/main/L0_MergeRequest_PR pipeline #36262 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@MrGeva (Collaborator, Author) commented Apr 30, 2026

/bot run

MrGeva marked this pull request as ready for review April 30, 2026 04:41
MrGeva requested a review from a team as a code owner April 30, 2026 04:41
MrGeva requested a review from tcherckez-nvidia April 30, 2026 04:41
MrGeva enabled auto-merge (squash) April 30, 2026 04:41
MrGeva changed the title from "[None][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test" to "[https://nvbugs/6072808][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test" Apr 30, 2026
@coderabbitai (Contributor) Bot commented Apr 30, 2026

📝 Walkthrough

Walkthrough

A test file now implements a retry mechanism to handle port availability race conditions during distributed GPU initialization. Up to five attempts are made, each with a newly selected port, with specific error handling for address-in-use scenarios versus other network errors.

Changes

  • Cohort: Distributed Test Retry Logic
    File: tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py
    Summary: Modified the test to retry up to 5 times with fresh port selection on each attempt. Catches DistNetworkError with EADDRINUSE to retry, re-raises other DistNetworkError immediately, and raises RuntimeError if all retries fail due to port conflicts.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Process
    participant Port as Port Selector
    participant MPI as MpiPoolSession
    participant Worker as Worker Processes
    participant NCCL as dist.init_process_group

    loop Retry (up to 5 times)
        Test->>Port: Select free port
        Port-->>Test: Port number
        Test->>MPI: Create session with port
        Test->>Worker: Submit job
        Worker->>NCCL: Initialize with selected port
        
        alt Success
            NCCL-->>Worker: Initialized
            Worker-->>Test: Job completed
            Test->>Test: Exit retry loop
        else EADDRINUSE (Port claimed)
            NCCL-->>Worker: DistNetworkError
            Worker-->>Test: Job failed
            Test->>Test: Suppress error, continue to next retry
        else Other DistNetworkError
            NCCL-->>Worker: DistNetworkError
            Worker-->>Test: Job failed
            Test->>Test: Re-raise immediately
        end
    end
    
    alt All retries exhausted
        Test->>Test: Raise RuntimeError (repeated port conflicts)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ❓ Inconclusive: The PR description explains the issue and solution comprehensively, but it does not fill in the required template sections (Description, Test Coverage, PR Checklist). Resolution: organize the description by filling in the required template sections and confirming PR Checklist items as applicable.

✅ Passed checks (3 passed)

  • Title check ✅ Passed: The title clearly and concisely describes the main change: adding retry logic for EADDRINUSE errors in the AutoDeploy allreduce-fusion test.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Contributor) Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py (1)

153-183: QA list update is not needed for this PR scope

This change is confined to tests/unittest/... retry behavior, so no tests/integration/test_lists/qa/* entry update is required.

As per coding guidelines: "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py`
around lines 153 - 183, Update the PR description and changelog note to
explicitly state that QA list updates are unnecessary because the change only
touches unit tests under tests/unittest/...; mention this explicitly in the
review response or PR body and reference the test function name
test_allreduce_fusion (and the file
tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py)
so reviewers can verify the scope is limited and no
tests/integration/test_lists/qa/* entry update is required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py`:
- Around line 175-178: The except DistNetworkError as e block currently checks
the error message with mixed casing and misses lowercase-only variants; update
the port-conflict detection in that except block (where last_exc = e is set) to
examine a lower-cased error string once and reject raising only if none of the
port-conflict keywords are present (e.g., check if not any(k in str(e).lower()
for k in ("eaddrinuse", "address already in use")) then raise); this ensures
messages like "eaddrinuse" are recognized and treated as port conflicts for
retry.
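
In code, the suggested normalization is small; a sketch with a hypothetical helper name:

    def _is_port_conflict(exc: Exception) -> bool:
        # Lower-case the message once so "EADDRINUSE", "eaddrinuse", and
        # "Address already in use" variants are all treated as port conflicts.
        msg = str(exc).lower()
        return any(k in msg for k in ("eaddrinuse", "address already in use"))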


📥 Commits

Reviewing files that changed from the base of the PR and between e903428 and 77472c9.

📒 Files selected for processing (1)
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py

@tensorrt-cicd (Collaborator)

PR_Github #46296 [ run ] triggered by Bot. Commit: 77472c9 Link to invocation

@tcherckez-nvidia (Collaborator)

Already opened #13606, which takes a different approach.

@tensorrt-cicd (Collaborator)

PR_Github #46296 [ run ] completed with state SUCCESS. Commit: 77472c9
/LLM/main/L0_MergeRequest_PR pipeline #36397 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@MrGeva (Collaborator, Author) commented Apr 30, 2026

Already opened #13606, which takes a different approach.

@tcherckez-nvidia your PR:

  1. Move get_free_port() into the workers — rank 0 picks the port and mpi_broadcasts it to peers, instead of the parent picking it
    and passing it through submit_sync. Shrinks the TOCTOU window from "parent picks → MPI submit/spin-up → workers init NCCL"
    (milliseconds–seconds) to "worker rank 0 picks → broadcast → init NCCL" (microseconds).

  2. Add torch.distributed.barrier() before cleanup() — keeps all ranks aligned at teardown so a fast rank can't return to the
    parent and start the next parametrization while a slow rank still holds the rendezvous socket.

The PRs are complementary, not competing. Your two ideas are both legitimate hardening:

  • The barrier addresses something my fix doesn't — leaked rendezvous between adjacent parametrizations.
  • Picking the port inside workers is genuinely cleaner (no large race window), but on shared CI hosts a microsecond window can and does hit; "smaller probability" ≠ "won't happen."

So let's merge both.
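
For illustration, the worker-side flow described in points 1 and 2 might look roughly like this; get_free_port and mpi_broadcast are assumed helpers, and this is a sketch of the idea, not the actual #13606 diff:

    import torch.distributed as dist

    def _worker_job(rank: int, world_size: int) -> None:
        # Rank 0 picks the port immediately before NCCL init, shrinking the
        # race window to microseconds; peers receive it via MPI broadcast.
        port = get_free_port() if rank == 0 else None
        port = mpi_broadcast(port, root=0)     # assumed MPI broadcast helper
        dist.init_process_group(
            "nccl",
            init_method=f"tcp://127.0.0.1:{port}",
            rank=rank,
            world_size=world_size,
        )
        try:
            pass  # run the fused allreduce + residual + RMSNorm checks here
        finally:
            dist.barrier()                     # align ranks before teardown so a fast
            dist.destroy_process_group()       # rank can't start the next parametrization
                                               # while a slow rank holds the rendezvous socket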

@MrGeva (Collaborator, Author) commented Apr 30, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #46363 [ run ] triggered by Bot. Commit: fd7de86 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46363 [ run ] completed with state SUCCESS. Commit: fd7de86
/LLM/main/L0_MergeRequest_PR pipeline #36449 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lucaslie disabled auto-merge May 1, 2026 03:24
@tcherckez-nvidia (Collaborator)

Can we fix this other than by retrying 5 times?
Maybe there should be some sort of test resource manager that assigns ports per process to avoid conflicts.

@MrGeva (Collaborator, Author) commented May 4, 2026

@tcherckez-nvidia summarizing our conversation for now: it's too complex to add such a port manager, and it won't eliminate collisions because not all tests would use it. I think this method gives a very high success rate; it solved the problem for the other distributed tests that use PyTorch dist. This test uses MPI dist, so it also needs such a mechanism.

@MrGeva (Collaborator, Author) commented May 4, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

MrGeva enabled auto-merge (squash) May 4, 2026 08:32
@tensorrt-cicd (Collaborator)

PR_Github #46629 [ run ] triggered by Bot. Commit: fd7de86 Link to invocation

MrGeva added 2 commits May 4, 2026 11:39

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
MrGeva force-pushed the fix/ad-allreduce-fusion-port-retry branch from fd7de86 to 5a3a621 May 4, 2026 08:39
@MrGeva (Collaborator, Author) commented May 4, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd (Collaborator)

PR_Github #46631 [ run ] triggered by Bot. Commit: 5a3a621 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46631 [ run ] completed with state SUCCESS. Commit: 5a3a621
/LLM/main/L0_MergeRequest_PR pipeline #36676 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@MrGeva (Collaborator, Author) commented May 4, 2026

/bot skip --comment "the failures are not related to this change, all AD tests passed"

@tensorrt-cicd (Collaborator)

PR_Github #46678 [ skip ] triggered by Bot. Commit: 5a3a621 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #46678 [ skip ] completed with state SUCCESS. Commit: 5a3a621
Skipping testing for commit 5a3a621

Link to invocation

MrGeva merged commit 2a04493 into NVIDIA:main May 4, 2026
6 checks passed