Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0 #459

Merged
mawad-amd merged 3 commits into main from copilot/fix-torchrun-port-collisions on Mar 19, 2026

Contributor

Copilot AI commented Mar 18, 2026

Concurrent CI jobs on the same self-hosted runner intermittently fail with EADDRINUSE on port 29500 because torchrun defaults to that port for rendezvous.

Changes

Updated every torchrun invocation across all CI workflow files and helper scripts to use --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1, letting the OS assign a free ephemeral port at runtime:

  • .github/workflows/iris-external-validation-test.yml — 2 invocations (External Validation Test and External Gluon Validation Test)
  • .github/scripts/run_tests.sh — replaced --standalone with the explicit rendezvous flags
  • .github/scripts/run_new_examples.sh — replaced --standalone with the explicit rendezvous flags
  • .github/scripts/run_perf_benchmark.sh — added rendezvous flags (previously had no port collision protection)

Before:

torchrun --nproc_per_node=2 test_iris_distributed.py
torchrun --nproc_per_node=$NUM_RANKS --standalone tests/run_tests_distributed.py ...
torchrun --nproc_per_node=8 examples/${EXAMPLE_PATH}/benchmark.py ...

After:

torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=2 test_iris_distributed.py
torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=$NUM_RANKS tests/run_tests_distributed.py ...
torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=8 examples/${EXAMPLE_PATH}/benchmark.py ...

Original prompt

This section details the original issue you should resolve.

<issue_title>Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0</issue_title>
<issue_description>## Problem

CI jobs using torchrun fail intermittently with EADDRINUSE on port 29500 when multiple jobs land on the same self-hosted runner concurrently:

torch.distributed.DistNetworkError: The server socket has failed to listen on any
local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE,
message: address already in use

This happens because torchrun defaults to port 29500 for rendezvous and the workflows do not override it. Example from iris-external-validation-test.yml:

torchrun --nproc_per_node=2 test_iris_distributed.py

Solution

Per the PyTorch docs on stacked single-node multi-worker, the fix for --nnodes=1 is to let the OS auto-assign a free port:

torchrun \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:0 \
    --nnodes=1 \
    --nproc_per_node=2 \
    test_iris_distributed.py

localhost:0 tells the kernel to pick a free ephemeral port — no collisions, no race conditions.
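
To make the port-0 behaviour concrete, here is a minimal sketch that asks the kernel for a free TCP port the same way. It is an illustration only: `free_port` is a hypothetical helper that is not part of torchrun or this repository, and it assumes `python3` is available on the runner.

```shell
#!/usr/bin/env bash
# Illustration only: ask the kernel for a free TCP port by binding to port 0.
# free_port is a hypothetical helper (not part of torchrun); assumes python3 is on PATH.
free_port() {
    command -v python3 >/dev/null 2>&1 || return 1
    python3 -c '
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))      # port 0 means "let the OS choose"
print(s.getsockname()[1])     # the port the kernel actually assigned
s.close()
'
}

if port="$(free_port)"; then
    echo "kernel assigned port: ${port}"
else
    echo "python3 not available; skipping demo"
fi
```

Note that this helper releases the port before returning, so another process could claim it in the meantime. Passing port 0 straight to torchrun avoids that gap, since the process that binds the port is the one that uses it, which is presumably what "no race conditions" refers to above.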

What needs to change

Update every torchrun invocation across all CI workflow files in .github/workflows/ to include --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1. At minimum:

  • .github/workflows/iris-external-validation-test.yml (two torchrun calls: External Validation Test and External Gluon Validation Test)
  • Any other workflow files that invoke torchrun (check iris-tests.yml, iris-performance-regression-test.yml, etc.)</issue_description>

Comments on the Issue (you are @copilot in this section)

@mawad-amd @copilot Please fix this issue. Update every `torchrun` invocation across all CI workflow files in `.github/workflows/` to use `--rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1` so the OS auto-assigns a free port instead of defaulting to 29500.

…oint=localhost:0 --nnodes=1

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0" to "Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0" on Mar 18, 2026
Copilot AI requested a review from mawad-amd March 18, 2026 20:33
@mawad-amd
Collaborator

@copilot You only updated some of the workflow files. Please update all CI YAML files and scripts under .github/workflows/ that invoke torchrun. Check every .yml file — including iris-tests.yml, iris-performance-regression-test.yml, and any others — and add --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 to every torchrun invocation.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 18, 2026

@copilot You only updated some of the workflow files. Please update all CI YAML files and scripts under .github/workflows/ that invoke torchrun. Check every .yml file — including `iris-tes...

Updated all torchrun invocations across all CI scripts in commit 2ad1ee6:

  • .github/workflows/iris-external-validation-test.yml (already done — 2 invocations)
  • .github/scripts/run_tests.sh — replaced --standalone with explicit --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1
  • .github/scripts/run_new_examples.sh — same replacement
  • .github/scripts/run_perf_benchmark.sh — added the flags (this one had no port collision protection at all)

@mawad-amd mawad-amd marked this pull request as ready for review March 19, 2026 00:57
@mawad-amd mawad-amd requested review from BKP and neoblizz as code owners March 19, 2026 00:57
Copilot AI review requested due to automatic review settings March 19, 2026 00:57
Collaborator

@mawad-amd mawad-amd left a comment

PR looks good. Verified all torchrun invocations across CI files are updated with --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1.
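
A check like this could be scripted rather than done by eye. The sketch below lists `torchrun` lines under a directory that do not set `--rdzv-endpoint`. It is not part of the PR: `find_unpinned_torchrun` is a hypothetical helper name, and the match is line by line, so an invocation split across backslash continuations (with the flag on a later line) or a comment mentioning torchrun would be reported as a false positive.

```shell
#!/usr/bin/env bash
# Sketch: report torchrun lines that lack --rdzv-endpoint.
# find_unpinned_torchrun is a hypothetical helper name, not something in the repository.
find_unpinned_torchrun() {
    local root="${1:-.github}"
    # Search YAML workflows and shell scripts; -H/-n print file name and line number.
    # The second grep drops lines that already set the rendezvous endpoint.
    # "|| true" keeps the exit status at 0 when nothing is left to report.
    find "$root" -type f \( -name '*.yml' -o -name '*.yaml' -o -name '*.sh' \) \
        -exec grep -Hn 'torchrun' {} + \
        | grep -v -e '--rdzv-endpoint' || true
}

# Only run against the repository layout if it is present.
if [ -d .github ]; then
    find_unpinned_torchrun .github
fi
```

Empty output means every matching line already carries the flag; any line printed is worth a manual look.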

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to eliminate intermittent torchrun rendezvous port collisions (EADDRINUSE on default port 29500) in concurrent CI jobs by switching all single-node runs to use an OS-assigned ephemeral rendezvous port via --rdzv-endpoint=localhost:0 and explicitly setting --nnodes=1.

Changes:

  • Updated torchrun calls in the external validation workflow to pass explicit rendezvous flags and --nnodes=1.
  • Replaced --standalone in CI helper scripts with explicit rendezvous flags using localhost:0.
  • Added rendezvous flags to the performance benchmark script that previously had no collision protection.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/iris-external-validation-test.yml | Adds explicit rendezvous configuration to avoid default port collisions in workflow runs. |
| .github/scripts/run_tests.sh | Replaces --standalone with explicit rendezvous flags using an ephemeral port. |
| .github/scripts/run_new_examples.sh | Replaces --standalone with explicit rendezvous flags using an ephemeral port. |
| .github/scripts/run_perf_benchmark.sh | Adds explicit rendezvous configuration to prevent benchmark job collisions. |

Comments suppressed due to low confidence (2)

.github/scripts/run_tests.sh:1

  • Consider using --rdzv-endpoint=127.0.0.1:0 instead of localhost:0 to avoid environments where localhost resolves to IPv6 (::1) and IPv6 is disabled/misconfigured on the runner. Using an explicit IPv4 loopback makes rendezvous binding/connection behavior more deterministic in CI. If you change this, apply it consistently to all updated torchrun invocations.
#!/bin/bash

.github/scripts/run_new_examples.sh:1

  • The rendezvous flags are now duplicated across multiple scripts. To reduce copy/paste drift (e.g., future flag changes needing edits in several places), consider defining a shared variable (e.g., TORCHRUN_RDZV_ARGS=...) near the top of each script and reusing it in the torchrun invocations.
#!/bin/bash
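
Both suppressed suggestions could be combined in one place. The following is a rough sketch, not what the merged PR does: `TORCHRUN_RDZV_ARGS` is the name proposed in the comment above, while `RDZV_HOST` and `run_torchrun` are hypothetical names introduced here for illustration. The `127.0.0.1` override is untested against the runners in question.

```shell
#!/usr/bin/env bash
# Sketch of the shared-variable suggestion. TORCHRUN_RDZV_ARGS comes from the review
# comment; RDZV_HOST and run_torchrun are hypothetical names introduced here.
RDZV_HOST="${RDZV_HOST:-localhost}"   # set RDZV_HOST=127.0.0.1 to force the IPv4 loopback

# A bash array keeps each flag as a separate word, so call sites need no re-quoting.
TORCHRUN_RDZV_ARGS=(
    --rdzv-backend=c10d
    --rdzv-endpoint="${RDZV_HOST}:0"
    --nnodes=1
)

run_torchrun() {
    # Prepend the shared rendezvous flags to whatever the caller passes.
    torchrun "${TORCHRUN_RDZV_ARGS[@]}" "$@"
}
```

A script would then call, for example, `run_torchrun --nproc_per_node=2 test_iris_distributed.py`, and a future flag change would touch only the array definition. Sharing the definition across scripts (for instance by sourcing a common file) would be a further step that this sketch does not cover.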

@mawad-amd mawad-amd merged commit 1c5ee38 into main Mar 19, 2026
200 of 206 checks passed
@mawad-amd mawad-amd deleted the copilot/fix-torchrun-port-collisions branch March 19, 2026 22:33
Development

Successfully merging this pull request may close these issues.

Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0