Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0 by Copilot · Pull Request #459 · ROCm/iris

Copilot · 2026-03-18T20:31:40Z

Concurrent CI jobs on the same self-hosted runner intermittently fail with EADDRINUSE on port 29500 because torchrun defaults to that port for rendezvous.
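To confirm this failure mode on a runner, one option is to check whether something is already listening on the default port before a job starts. The sketch below is illustrative only: `port_in_use` is a hypothetical helper, not part of the repository, and it relies on bash's `/dev/tcp` pseudo-device.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if something accepts TCP connections on
# host:port, non-zero otherwise. Uses bash's /dev/tcp pseudo-device, so it
# needs bash rather than plain sh.
port_in_use() {
    local host="$1" port="$2"
    # Open (and immediately close) a connection inside a subshell.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if port_in_use 127.0.0.1 29500; then
    echo "port 29500 is taken; a default torchrun rendezvous would likely hit EADDRINUSE"
else
    echo "port 29500 looks free right now"
fi
```

A check like this only describes the moment it runs. Another job can still claim the port afterwards, which is why the change avoids the fixed port instead of probing for it.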

Changes

Updated every torchrun invocation across all CI workflow files and helper scripts to use --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1, letting the OS assign a free ephemeral port at runtime:

.github/workflows/iris-external-validation-test.yml — 2 invocations (External Validation Test and External Gluon Validation Test)
.github/scripts/run_tests.sh — replaced --standalone with the explicit rendezvous flags
.github/scripts/run_new_examples.sh — replaced --standalone with the explicit rendezvous flags
.github/scripts/run_perf_benchmark.sh — added rendezvous flags (previously had no port collision protection)

Before:

torchrun --nproc_per_node=2 test_iris_distributed.py
torchrun --nproc_per_node=$NUM_RANKS --standalone tests/run_tests_distributed.py ...
torchrun --nproc_per_node=8 examples/${EXAMPLE_PATH}/benchmark.py ...

After:

torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=2 test_iris_distributed.py
torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=$NUM_RANKS tests/run_tests_distributed.py ...
torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=8 examples/${EXAMPLE_PATH}/benchmark.py ...
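The behaviour the `:0` suffix relies on can be observed outside torchrun. The following is a minimal sketch, assuming `python3` is on `PATH` (bash cannot open a listening socket by itself); `probe_free_port` is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Bind a TCP socket to port 0 and print the port the kernel assigned.
# probe_free_port is a hypothetical helper; python3 is assumed to be installed.
probe_free_port() {
    python3 -c 'import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))    # port 0 asks the kernel to choose a free port
print(s.getsockname()[1])   # report the port that was actually assigned
s.close()'
}

port_a="$(probe_free_port)"
port_b="$(probe_free_port)"
echo "kernel-assigned ports: ${port_a} ${port_b}"
```

This probe is for demonstration only. Picking a port this way and then handing it to another process reintroduces a race, because the port is released before it is reused. Passing `localhost:0` to torchrun avoids that by letting the process that needs the port request it directly.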

Original prompt

This section details the original issue you should resolve

<issue_title>Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0</issue_title>
<issue_description>## Problem

CI jobs using torchrun fail intermittently with EADDRINUSE on port 29500 when multiple jobs land on the same self-hosted runner concurrently:
torch.distributed.DistNetworkError: The server socket has failed to listen on any
local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE,
message: address already in use
This happens because torchrun defaults to port 29500 for rendezvous and the workflows do not override it. Example from iris-external-validation-test.yml:
torchrun --nproc_per_node=2 test_iris_distributed.py
## Solution

Per the PyTorch docs on stacked single-node multi-worker, the fix for --nnodes=1 is to let the OS auto-assign a free port:
torchrun \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:0 \
    --nnodes=1 \
    --nproc_per_node=2 \
    test_iris_distributed.py
localhost:0 tells the kernel to pick a free ephemeral port — no collisions, no race conditions.

## What needs to change

Update every torchrun invocation across all CI workflow files in .github/workflows/ to include --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1. At minimum:

.github/workflows/iris-external-validation-test.yml (two torchrun calls: External Validation Test and External Gluon Validation Test)

Any other workflow files that invoke torchrun (check iris-tests.yml, iris-performance-regression-test.yml, etc.)</issue_description>

Comments on the Issue (you are @copilot in this section)
@mawad-amd @copilot Please fix this issue. Update every `torchrun` invocation across all CI workflow files in `.github/workflows/` to use `--rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1` so the OS auto-assigns a free port instead of defaulting to 29500.

Fixes #458 (Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0)


mawad-amd · 2026-03-18T20:36:29Z

@copilot You only updated some of the workflow files. Please update all CI YAML files and scripts under .github/workflows/ that invoke torchrun. Check every .yml file — including iris-tests.yml, iris-performance-regression-test.yml, and any others — and add --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 to every torchrun invocation.

Copilot · 2026-03-18T20:40:17Z

@copilot You only updated some of the workflow files. Please update all CI YAML files and scripts under .github/workflows/ that invoke torchrun. Check every .yml file — including `iris-tes...

Updated all torchrun invocations across all CI scripts in commit 2ad1ee6:

.github/workflows/iris-external-validation-test.yml (already done — 2 invocations)
.github/scripts/run_tests.sh — replaced --standalone with explicit --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1
.github/scripts/run_new_examples.sh — same replacement
.github/scripts/run_perf_benchmark.sh — added the flags (this one had no port collision protection at all)

mawad-amd

PR looks good. Verified all torchrun invocations across CI files are updated with --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1.
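A verification like the one described above can be partly automated. The sketch below is one possible audit, not the check the reviewer actually ran: `find_unpatched_torchrun` and the `AUDIT_DIR` variable are hypothetical names. The match is line-based, so a torchrun command split across lines with backslashes, or a comment that mentions torchrun, may be reported even though nothing is wrong.

```shell
#!/bin/bash
# Hypothetical audit: list lines under a directory that invoke torchrun
# without the ephemeral-port rendezvous endpoint.
# Exit status is 0 if any such line is found, non-zero if none are.
find_unpatched_torchrun() {
    local dir="$1"
    grep -rnE '(^|[[:space:]])torchrun[[:space:]]' "$dir" \
        | grep -v -- '--rdzv-endpoint=localhost:0'
}

# AUDIT_DIR is an illustrative override; it defaults to the CI directory.
audit_dir="${AUDIT_DIR:-.github}"
if [ -d "$audit_dir" ]; then
    if find_unpatched_torchrun "$audit_dir"; then
        echo "found torchrun invocations without --rdzv-endpoint=localhost:0" >&2
    else
        echo "all torchrun invocations carry the rendezvous endpoint flag"
    fi
fi
```

Run from the repository root, the function prints each offending `file:line:content` entry, which makes it easy to see which workflow or script was missed.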

Copilot

Pull request overview

This PR aims to eliminate intermittent torchrun rendezvous port collisions (EADDRINUSE on default port 29500) in concurrent CI jobs by switching all single-node runs to use an OS-assigned ephemeral rendezvous port via --rdzv-endpoint=localhost:0 and explicitly setting --nnodes=1.

Changes:

Updated torchrun calls in the external validation workflow to pass explicit rendezvous flags and --nnodes=1.
Replaced --standalone in CI helper scripts with explicit rendezvous flags using localhost:0.
Added rendezvous flags to the performance benchmark script that previously had no collision protection.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
.github/workflows/iris-external-validation-test.yml	Adds explicit rendezvous configuration to avoid default port collisions in workflow runs.
.github/scripts/run_tests.sh	Replaces `--standalone` with explicit rendezvous flags using an ephemeral port.
.github/scripts/run_new_examples.sh	Replaces `--standalone` with explicit rendezvous flags using an ephemeral port.
.github/scripts/run_perf_benchmark.sh	Adds explicit rendezvous configuration to prevent benchmark job collisions.

Comments suppressed due to low confidence (2)

.github/scripts/run_tests.sh:1

Consider using --rdzv-endpoint=127.0.0.1:0 instead of localhost:0 to avoid environments where localhost resolves to IPv6 (::1) and IPv6 is disabled/misconfigured on the runner. Using an explicit IPv4 loopback makes rendezvous binding/connection behavior more deterministic in CI. If you change this, apply it consistently to all updated torchrun invocations.

#!/bin/bash
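If hard-coding `127.0.0.1` everywhere is undesirable, the suggestion above could instead be applied conditionally. This is a rough sketch under stated assumptions: `pick_rdzv_host` is a hypothetical helper, and `getent` is assumed to be available on the runner (when it is not, the sketch falls back to `localhost`).

```shell
#!/bin/bash
# Hypothetical helper: given the first address localhost resolves to,
# choose the host part of the rendezvous endpoint.
pick_rdzv_host() {
    local first_addr="$1"
    case "$first_addr" in
        *:*) echo "127.0.0.1" ;;  # IPv6 literal such as ::1: use IPv4 loopback
        *)   echo "localhost" ;;  # IPv4 or unknown: keep localhost
    esac
}

# First address reported for localhost; empty if getent is unavailable.
first_addr="$(getent hosts localhost 2>/dev/null | awk 'NR==1 {print $1}')"
RDZV_ENDPOINT="$(pick_rdzv_host "$first_addr"):0"
echo "rendezvous endpoint: ${RDZV_ENDPOINT}"
```

The simpler alternative, as the comment proposes, is to write `--rdzv-endpoint=127.0.0.1:0` directly in every invocation and skip the lookup entirely.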

.github/scripts/run_new_examples.sh:1

The rendezvous flags are now duplicated across multiple scripts. To reduce copy/paste drift (e.g., future flag changes needing edits in several places), consider defining a shared variable (e.g., TORCHRUN_RDZV_ARGS=...) near the top of each script and reusing it in the torchrun invocations.

#!/bin/bash
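A shared variable of the kind suggested above might look like the following in a bash script. This is a sketch, not code from the PR: the array form and the `cmd` variable are assumptions, and the command is printed rather than executed so the example runs without torchrun or GPUs.

```shell
#!/bin/bash
# Rendezvous flags defined once so every torchrun call in the script stays in sync.
TORCHRUN_RDZV_ARGS=(--rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1)

# Default rank count for the illustration; the real scripts set NUM_RANKS themselves.
NUM_RANKS="${NUM_RANKS:-2}"

# Build the command as an array so each flag remains a separate argument.
cmd=(torchrun "${TORCHRUN_RDZV_ARGS[@]}" --nproc_per_node="${NUM_RANKS}" tests/run_tests_distributed.py)

# Print the command instead of running it.
printf '%s ' "${cmd[@]}"
echo
# To execute for real, replace the printf with: "${cmd[@]}"
```

Using an array rather than a plain string avoids relying on word splitting, so a later flag whose value contains spaces would still be passed through intact.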

Initial plan

91d4cb8

Copilot AI assigned Copilot and mawad-amd Mar 18, 2026

Copilot started work on behalf of mawad-amd March 18, 2026 20:31

Fix torchrun port collisions by using --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1

fcf5a53

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0~~ Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0 Mar 18, 2026

Copilot AI requested a review from mawad-amd March 18, 2026 20:33

Copilot finished work on behalf of mawad-amd March 18, 2026 20:33

Copilot started work on behalf of mawad-amd March 18, 2026 20:36

Fix torchrun port collisions in all CI scripts and workflows

2ad1ee6

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Copilot finished work on behalf of mawad-amd March 18, 2026 20:40

mawad-amd marked this pull request as ready for review March 19, 2026 00:57

mawad-amd requested review from BKP and neoblizz as code owners March 19, 2026 00:57

Copilot AI review requested due to automatic review settings March 19, 2026 00:57

mawad-amd approved these changes Mar 19, 2026

Copilot AI reviewed Mar 19, 2026

mawad-amd merged commit 1c5ee38 into main Mar 19, 2026
200 of 206 checks passed

mawad-amd deleted the copilot/fix-torchrun-port-collisions branch March 19, 2026 22:33

mawad-amd merged 3 commits into main from copilot/fix-torchrun-port-collisions
