Simplifying socket connect, allow for using host address #279

Merged
nileshnegi merged 1 commit into ROCm:candidate from gilbertlee-amd:SocketCommUpgrade on May 1, 2026

@gilbertlee-amd
Collaborator

Motivation

This PR aims to make socket-based multi-rank connections simpler to set up.
The new flow is now:

The first rank (rank 0) only needs to specify the total number of ranks via TB_NUM_RANKS:

Node 0> TB_NUM_RANKS=4 ./TransferBench

Rank 0 then prints connection instructions for the other ranks:

[INFO] TB_MASTER_ADDR not set; using detected IPv4 12.34.56.78
[INFO] Socket rank 0: on each other host set TB_RANK to a unique value in 1..3, then for example:
       TB_NUM_RANKS=4 TB_MASTER_ADDR=12.34.56.78 TB_MASTER_PORT=29500 TB_RANK=1
[INFO] Waiting for connections from 3 other rank(s) [TB_MASTER_ADDR=12.34.56.78 TB_MASTER_PORT=29500]

Other ranks can then connect by setting those environment variables (adjusting TB_RANK for each node):

Node 1> TB_NUM_RANKS=4 TB_MASTER_ADDR=12.34.56.78 TB_MASTER_PORT=29500 TB_RANK=1 ./TransferBench
Node 2> TB_NUM_RANKS=4 TB_MASTER_ADDR=12.34.56.78 TB_MASTER_PORT=29500 TB_RANK=2 ./TransferBench
Node 3> TB_NUM_RANKS=4 TB_MASTER_ADDR=12.34.56.78 TB_MASTER_PORT=29500 TB_RANK=3 ./TransferBench
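The enablement rules summarized in the Copilot overview (socket mode turns on when TB_NUM_RANKS>=2, TB_RANK defaults to 0, and TB_MASTER_ADDR is required only for workers) can be sketched roughly as follows. This is a hypothetical illustration, not the code in src/header/TransferBench.hpp: `SocketConfig` and `ResolveSocketConfig` are invented names, the range check on TB_RANK is an assumption, and the default port of 29500 is only inferred from the sample log output.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical summary of the socket settings one process ends up with.
struct SocketConfig {
  bool enabled = false;    // socket communicator is only used for >= 2 ranks
  int numRanks = 1;
  int rank = 0;            // TB_RANK defaults to 0, i.e. the listening rank
  std::string masterAddr;  // may stay empty on rank 0 (detected later)
  int masterPort = 29500;  // assumed default, inferred from the sample log
};

// Returns the value of an environment variable, or std::nullopt if unset.
static std::optional<std::string> GetEnv(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr) return std::nullopt;
  return std::string(value);
}

SocketConfig ResolveSocketConfig() {
  SocketConfig cfg;
  if (auto n = GetEnv("TB_NUM_RANKS")) cfg.numRanks = std::stoi(*n);
  cfg.enabled = cfg.numRanks >= 2;
  if (!cfg.enabled) return cfg;  // single-rank run: nothing else to read

  if (auto r = GetEnv("TB_RANK")) cfg.rank = std::stoi(*r);
  if (cfg.rank < 0 || cfg.rank >= cfg.numRanks)
    throw std::invalid_argument("TB_RANK must be in 0..TB_NUM_RANKS-1");

  if (auto p = GetEnv("TB_MASTER_PORT")) cfg.masterPort = std::stoi(*p);
  if (auto a = GetEnv("TB_MASTER_ADDR")) cfg.masterAddr = *a;

  // Workers must be told where rank 0 listens; rank 0 can detect its own address.
  if (cfg.rank != 0 && cfg.masterAddr.empty())
    throw std::invalid_argument("TB_MASTER_ADDR is required when TB_RANK != 0");
  return cfg;
}
```

Under these rules, the Node 0 command with only TB_NUM_RANKS=4 yields an enabled rank-0 config with an empty address, while a worker that omits TB_MASTER_ADDR fails early with a clear error instead of hanging.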

A hostname can also be used instead of an IPv4 address when setting TB_MASTER_ADDR:

Node 0> TB_NUM_RANKS=4 TB_MASTER_ADDR=hostname ./TransferBench
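According to the Copilot overview, a hostname in TB_MASTER_ADDR is resolved with getaddrinfo. A minimal POSIX sketch of that step, restricted to IPv4, might look like this; `ResolveToIPv4` is a hypothetical name, and taking the first returned address is an assumption rather than the PR's confirmed behavior.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cassert>
#include <optional>
#include <string>

// Resolves a hostname or dotted-quad string to an IPv4 address string.
// Returns std::nullopt if resolution fails or yields no IPv4 address.
std::optional<std::string> ResolveToIPv4(const std::string& host) {
  addrinfo hints{};
  hints.ai_family = AF_INET;        // only ask for IPv4 results
  hints.ai_socktype = SOCK_STREAM;  // avoid duplicate entries per socket type
  addrinfo* results = nullptr;
  if (getaddrinfo(host.c_str(), nullptr, &hints, &results) != 0 || results == nullptr)
    return std::nullopt;

  // Take the first result and format it back into dotted-quad notation.
  char buf[INET_ADDRSTRLEN] = {};
  auto* sin = reinterpret_cast<sockaddr_in*>(results->ai_addr);
  const char* ok = inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
  freeaddrinfo(results);  // always release the list returned by getaddrinfo
  if (ok == nullptr) return std::nullopt;
  return std::string(buf);
}
```

Because getaddrinfo also accepts numeric addresses, the same function handles both TB_MASTER_ADDR=12.34.56.78 and TB_MASTER_ADDR=hostname, so no separate code path is needed for the two forms.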

Technical Details

Minor changes to the socket communicator code in src/header/TransferBench.hpp.
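The PR does not describe the wire protocol used once workers connect. As a hypothetical illustration of the rendezvous implied by the "Waiting for connections from 3 other rank(s)" message, rank 0 could accept TB_NUM_RANKS-1 TCP connections and have each worker announce its TB_RANK as a 4-byte big-endian integer. All function names and the message format are invented for this sketch, and it binds to loopback only so that it is self-contained; a real rank 0 would listen on an address reachable from other hosts.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Opens a listening TCP socket on 127.0.0.1:port (port 0 picks a free port).
int ListenOnLoopback(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) throw std::runtime_error("socket failed");
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      listen(fd, SOMAXCONN) != 0) {
    close(fd);
    throw std::runtime_error("bind/listen failed");
  }
  return fd;
}

// Returns the port a bound socket is actually using.
uint16_t BoundPort(int fd) {
  sockaddr_in addr{};
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0)
    throw std::runtime_error("getsockname failed");
  return ntohs(addr.sin_port);
}

// Worker side: connect to rank 0 and send this process's rank (network order).
int ConnectAndAnnounce(const char* ipv4, uint16_t port, int rank) {
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1)
    throw std::invalid_argument("not a dotted-quad IPv4 address");
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) throw std::runtime_error("socket failed");
  uint32_t netRank = htonl(static_cast<uint32_t>(rank));
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      send(fd, &netRank, sizeof(netRank), 0) != static_cast<ssize_t>(sizeof(netRank))) {
    close(fd);
    throw std::runtime_error("connect/send failed");
  }
  return fd;
}

// Rank 0 side: accept numRanks-1 workers and index their sockets by rank.
// Each rank in 1..numRanks-1 must appear exactly once.
std::vector<int> AcceptWorkers(int listenFd, int numRanks) {
  std::vector<int> fds(numRanks, -1);  // fds[0] stays -1 (rank 0 is us)
  auto fail = [&fds](int fd, const char* msg) {
    if (fd >= 0) close(fd);
    for (int f : fds)
      if (f >= 0) close(f);  // release sockets accepted so far
    throw std::runtime_error(msg);
  };
  for (int i = 1; i < numRanks; ++i) {
    int fd = accept(listenFd, nullptr, nullptr);
    if (fd < 0) fail(-1, "accept failed");
    uint32_t netRank = 0;
    if (recv(fd, &netRank, sizeof(netRank), MSG_WAITALL) !=
        static_cast<ssize_t>(sizeof(netRank)))
      fail(fd, "short read of rank id");
    int rank = static_cast<int>(ntohl(netRank));
    if (rank < 1 || rank >= numRanks || fds[rank] != -1)
      fail(fd, "invalid or duplicate TB_RANK");
    fds[rank] = fd;
  }
  return fds;
}
```

Pending connections wait in the listen backlog, so workers may connect in any order once rank 0 is listening. In this sketch a worker that starts before rank 0 is listening simply fails to connect; a practical implementation would likely retry connect() with a timeout.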

Contributor

Copilot AI left a comment


Pull request overview

This PR streamlines TransferBench’s socket-based multi-rank startup so rank 0 can start listening with only TB_NUM_RANKS set, auto-detect (or resolve) a reachable master address, and print concrete connection instructions for worker ranks.

Changes:

  • Update socket communicator enablement to trigger when TB_NUM_RANKS>=2, default TB_RANK to 0, and require TB_MASTER_ADDR only for workers.
  • Add IPv4 auto-detection for rank 0 (TB_MASTER_IFACE optional) and allow hostname resolution for TB_MASTER_ADDR via getaddrinfo.
  • Refresh env var help/usage text and changelog to reflect the simplified socket flow.
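The overview mentions IPv4 auto-detection for rank 0 with an optional TB_MASTER_IFACE. One plausible way to do this on Linux is to walk getifaddrs() and take the first IPv4 address on an interface that is up. This is an assumption about the approach, not the PR's actual code, and `DetectIPv4` is a hypothetical name. Loopback is skipped unless an interface is named explicitly, since other hosts cannot reach 127.0.0.1.

```cpp
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cassert>
#include <optional>
#include <string>

// Returns the first IPv4 address on an interface that is up.
// If ifaceName is non-empty (e.g. taken from TB_MASTER_IFACE), only that
// interface is considered and loopback is allowed; otherwise loopback is skipped.
std::optional<std::string> DetectIPv4(const std::string& ifaceName) {
  ifaddrs* list = nullptr;
  if (getifaddrs(&list) != 0) return std::nullopt;

  std::optional<std::string> found;
  for (ifaddrs* ifa = list; ifa != nullptr; ifa = ifa->ifa_next) {
    if (ifa->ifa_addr == nullptr || ifa->ifa_addr->sa_family != AF_INET) continue;
    if (!(ifa->ifa_flags & IFF_UP)) continue;
    if (!ifaceName.empty()) {
      if (ifaceName != ifa->ifa_name) continue;  // honor the requested interface
    } else if (ifa->ifa_flags & IFF_LOOPBACK) {
      continue;  // 127.0.0.1 is useless to remote ranks
    }
    char buf[INET_ADDRSTRLEN] = {};
    auto* sin = reinterpret_cast<sockaddr_in*>(ifa->ifa_addr);
    if (inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)) != nullptr) {
      found = std::string(buf);
      break;
    }
  }
  freeifaddrs(list);  // getifaddrs allocates the whole list
  return found;
}
```

On hosts with several NICs, "first non-loopback address" may not be the one the other nodes can reach, which is presumably why an explicit interface override such as TB_MASTER_IFACE is useful.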

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

  • src/header/TransferBench.hpp: Implements the new socket setup flow, IPv4 detection, and hostname-to-IPv4 resolution.
  • src/client/EnvVars.hpp: Updates printed documentation for socket-related environment variables (incl. TB_MASTER_IFACE).
  • src/client/Client.cpp: Updates CLI usage examples to match the new socket startup flow.
  • CHANGELOG.md: Notes the socket communicator usability change for the release notes.


Comment threads (3): src/header/TransferBench.hpp
nileshnegi merged commit 15b7605 into ROCm:candidate on May 1, 2026
8 checks passed
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
nileshnegi mentioned this pull request on May 2, 2026
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
