Fix torchrun port collisions in CI by using rdzv-endpoint localhost:0 #459
…oint=localhost:0 --nnodes=1 Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
@copilot You only updated some of the workflow files. Please update all CI YAML files and scripts under |
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Updated all
mawad-amd
left a comment
PR looks good. Verified all torchrun invocations across CI files are updated with --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1.
Pull request overview
This PR aims to eliminate intermittent torchrun rendezvous port collisions (EADDRINUSE on default port 29500) in concurrent CI jobs by switching all single-node runs to use an OS-assigned ephemeral rendezvous port via --rdzv-endpoint=localhost:0 and explicitly setting --nnodes=1.
Changes:
- Updated `torchrun` calls in the external validation workflow to pass explicit rendezvous flags and `--nnodes=1`.
- Replaced `--standalone` in CI helper scripts with explicit rendezvous flags using `localhost:0`.
- Added rendezvous flags to the performance benchmark script that previously had no collision protection.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/iris-external-validation-test.yml | Adds explicit rendezvous configuration to avoid default port collisions in workflow runs. |
| .github/scripts/run_tests.sh | Replaces --standalone with explicit rendezvous flags using an ephemeral port. |
| .github/scripts/run_new_examples.sh | Replaces --standalone with explicit rendezvous flags using an ephemeral port. |
| .github/scripts/run_perf_benchmark.sh | Adds explicit rendezvous configuration to prevent benchmark job collisions. |
Comments suppressed due to low confidence (2)
.github/scripts/run_tests.sh:1
- Consider using `--rdzv-endpoint=127.0.0.1:0` instead of `localhost:0` to avoid environments where `localhost` resolves to IPv6 (`::1`) and IPv6 is disabled or misconfigured on the runner. Using an explicit IPv4 loopback address makes rendezvous binding and connection behavior more deterministic in CI. If you change this, apply it consistently to all updated `torchrun` invocations.
#!/bin/bash
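One way to act on this suggestion only where it matters is sketched below. The helper name `pick_rdzv_endpoint` is hypothetical and not part of the PR, and the `getent ahosts` lookup assumes a glibc-based Linux runner. The script only prints the flag it would use.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): given the first address that
# `localhost` resolves to, choose the rendezvous endpoint.
# An address containing ':' is treated as IPv6.
pick_rdzv_endpoint() {
    case "$1" in
        *:*) echo "127.0.0.1:0" ;;  # IPv6 such as ::1: pin the IPv4 loopback
        *)   echo "localhost:0" ;;  # IPv4 or unknown: keep localhost
    esac
}

# Assumption: `getent ahosts` is available (glibc-based Linux).
# If it is not, first_addr stays empty and localhost:0 is kept.
first_addr="$(getent ahosts localhost 2>/dev/null | awk 'NR==1 {print $1}' || true)"

echo "--rdzv-endpoint=$(pick_rdzv_endpoint "$first_addr")"
```

Hard-coding `127.0.0.1:0` everywhere, as the comment proposes, is simpler and is probably the better choice unless some runners need `localhost`.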
.github/scripts/run_new_examples.sh:1
- The rendezvous flags are now duplicated across multiple scripts. To reduce copy/paste drift (e.g., future flag changes needing edits in several places), consider defining a shared variable (e.g., `TORCHRUN_RDZV_ARGS=...`) near the top of each script and reusing it in the `torchrun` invocations.
#!/bin/bash
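A minimal sketch of the shared-variable idea follows. The variable name comes from the review comment and the flag values from this PR. The helper `torchrun_cmd`, the path `tests/example.py`, and the process count are hypothetical placeholders. The command is printed rather than executed.

```shell
#!/bin/bash
# Shared rendezvous flags, defined once near the top of the script.
# Flag values are the ones this PR uses.
TORCHRUN_RDZV_ARGS="--rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1"

# Hypothetical helper: prints the torchrun command line for a given process
# count and script. Drop the `echo` to actually launch torchrun.
torchrun_cmd() {
    nproc="$1"
    shift
    # $TORCHRUN_RDZV_ARGS is intentionally unquoted so it splits into flags.
    echo torchrun $TORCHRUN_RDZV_ARGS --nproc_per_node="$nproc" "$@"
}

torchrun_cmd 2 tests/example.py
```

In a bash-only script, an array (`TORCHRUN_RDZV_ARGS=(...)` expanded as `"${TORCHRUN_RDZV_ARGS[@]}"`) would avoid relying on word splitting. The plain string is used here to keep the sketch portable.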
Concurrent CI jobs on the same self-hosted runner intermittently fail with `EADDRINUSE` on port `29500` because `torchrun` defaults to that port for rendezvous.
Changes
Updated every `torchrun` invocation across all CI workflow files and helper scripts to use `--rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1`, letting the OS assign a free ephemeral port at runtime:
- `.github/workflows/iris-external-validation-test.yml`: 2 invocations (External Validation Test and External Gluon Validation Test)
- `.github/scripts/run_tests.sh`: replaced `--standalone` with the explicit rendezvous flags
- `.github/scripts/run_new_examples.sh`: replaced `--standalone` with the explicit rendezvous flags
- `.github/scripts/run_perf_benchmark.sh`: added rendezvous flags (previously had no port collision protection)
Before:
After:
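As a rough illustration of the kind of change described, and not the PR's actual diff, the sketch below prints a before and an after command line. The script path `tests/example.py` and the process count are hypothetical placeholders. The "before" form assumes the `--standalone` variant that the helper scripts used.

```shell
#!/bin/bash
# Hypothetical placeholders; not taken from the PR.
NPROC=2
SCRIPT="tests/example.py"

# Before: the --standalone form that the helper scripts used.
before_cmd() {
    echo "torchrun --standalone --nproc_per_node=$NPROC $SCRIPT"
}

# After: explicit rendezvous flags from this PR. Port 0 asks the OS to
# assign a free ephemeral port at runtime.
after_cmd() {
    echo "torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=$NPROC $SCRIPT"
}

before_cmd
after_cmd
```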