[cherry-pick] Add configurable cvs_exec_timeout, rccl-tests -T, and -A algoproto output (#153) #168

Open
speriaswamy-amd wants to merge 1 commit into release/cvs-0.2.0 from cherry-pick/cvs-0.2.0/pr-153

@speriaswamy-amd
Contributor

Cherry-pick of #153 into release/cvs-0.2.0.

Adds three optional rccl_test_params keys applied symmetrically to rccl_regression and rccl_perf:

  • cvs_exec_timeout (default 2400s) replaces the hardcoded shdl.exec 500s timeout
  • rccl_timeout appends -T <value> when set
  • output_algo_proto_channels appends -A 1 when truthy

Commit:

Made with Cursor

…proto output to rccl_perf and rccl_regression (#153)

* Add configurable cvs_exec_timeout, rccl-tests -T timeout, and -A algoproto output to rccl_perf and rccl_regression

At cluster scale (62 nodes x 12 dtypes x 24 sizes per collective), one
rccl-tests invocation can run 30+ minutes. Both `rccl_regression` and
`rccl_perf` wrap their `mpirun` call in `shdl.exec(cmd, timeout=500)` (8 min
20 s), which aborts perfectly healthy benchmarks well before they finish.
There is also no way today to (a) bound rccl-tests' own internal timer or
(b) ask rccl-tests to print which NCCL algorithm/protocol/channel
combination it selected per message size — both critical for triaging
real-world performance.

This change adds three optional `rccl_test_params` keys, applied
symmetrically to both `rccl_regression` and `rccl_perf`:

| Key                          | Default          | Effect                                           |
|------------------------------|------------------|--------------------------------------------------|
| `cvs_exec_timeout`           | `2400` (40 min)  | Replaces hardcoded `timeout=500` on `shdl.exec`. |
| `rccl_timeout`               | `None` (omitted) | When set, appended as `-T <value>`.              |
| `output_algo_proto_channels` | `False`          | When truthy, appends `-A 1`. Boolean toggle.     |

`-A` is a boolean per the rccl-tests help text:

    -A,--output_algo_proto_channels <0/1>
       enable algorithm/protocol/channels output (default: 0)

`output_algo_proto_channels` is therefore a Python boolean (`bool(...)`)
that becomes `-A 1` on the command line when truthy and is omitted otherwise.
Users do not need to know about the integer wire format; they write
`"output_algo_proto_channels": true` in their config.
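The key handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper name `build_extra_flags` and the flat dict layout are hypothetical; only the key names, defaults, and resulting flags are taken from the description above.

```python
def build_extra_flags(rccl_test_params):
    """Return (cvs_exec_timeout, extra_flags) from a params dict.

    Hypothetical helper: the real tests may extract these keys inline.
    """
    # 2400 s (40 min) is the documented default replacing the old 500 s.
    cvs_exec_timeout = int(rccl_test_params.get("cvs_exec_timeout", 2400))
    # None means "-T" is omitted entirely.
    rccl_timeout = rccl_test_params.get("rccl_timeout")
    # Any truthy value enables "-A 1"; falsy values omit the flag.
    output_algo = bool(rccl_test_params.get("output_algo_proto_channels", False))

    extra_flags = []
    if rccl_timeout is not None:
        extra_flags += ["-T", str(rccl_timeout)]
    if output_algo:
        extra_flags += ["-A", "1"]
    return cvs_exec_timeout, " ".join(extra_flags)
```

One consequence of the `bool(...)` conversion: a non-empty string such as `"false"` is truthy in Python, so the config should use a JSON boolean (`true`/`false`) rather than a quoted string.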

The model config at `cvs/input/config_file/rccl/rccl_config.json` is updated
to demonstrate the three new keys with an inline `_comment_*` description,
matching the file's existing self-documenting style.

Both functions get the same key extractions, the same `extra_flags` build
block, the same splice point in `test_cmd` (immediately before `-Z json`),
and the same `timeout=cvs_exec_timeout` substitution. This ensures users
have a single uniform schema across `tests/rccl/rccl_perf.py` and
`tests/rccl/rccl_regression.py`.
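A minimal sketch of the splice point described above, assuming `test_cmd` is a plain string. The helper name `splice_extra_flags`, the fallback behavior when the marker is missing, and the sample command line are all made up for illustration; the actual tests may build the command differently (for example with an f-string).

```python
def splice_extra_flags(test_cmd, extra_flags, marker="-Z json"):
    """Insert extra_flags immediately before the first occurrence of marker."""
    if not extra_flags:
        return test_cmd
    head, sep, tail = test_cmd.partition(marker)
    if not sep:
        # Marker not found: append at the end (a choice made for this sketch).
        return f"{test_cmd.rstrip()} {extra_flags}"
    return f"{head.rstrip()} {extra_flags} {sep}{tail}"


# Hypothetical command line, used only to show where the flags land.
cmd = "mpirun -np 8 ./all_reduce_perf -b 8 -e 1G -f 2 -Z json"
spliced = splice_extra_flags(cmd, "-T 300 -A 1")
# The spliced command would then be run with the configurable timeout,
# i.e. shdl.exec(spliced, timeout=cvs_exec_timeout) instead of timeout=500.
```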

Validated:
- 4-node smoke with `cvs_exec_timeout=600`, `rccl_timeout=300`,
  `output_algo_proto_channels=true`: rccl-tests command line in test.log
  contains ` -T 300 -A 1 ` immediately before `-Z json`; pytest passes.
- 62-node overnight: `cvs_exec_timeout=21600` and `rccl_timeout=1800` both
  exercised, with rccl-tests invocations completing in 25-35 minutes without
  a CVS abort and without rccl-tests overrunning its own timer.
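The smoke-test check on the logged command line could be automated along these lines. This is only a sketch: the function name `flags_immediately_before_json` and the sample log line are hypothetical, and the real `test.log` format may differ.

```python
import re


def flags_immediately_before_json(log_text, rccl_timeout):
    """True if ' -T <rccl_timeout> -A 1 ' sits directly before '-Z json'."""
    pattern = rf" -T {re.escape(str(rccl_timeout))} -A 1 -Z json\b"
    return re.search(pattern, log_text) is not None


# Made-up log line for illustration only.
sample_log = "cmd: mpirun -np 32 ./all_reduce_perf -b 8 -e 1G -T 300 -A 1 -Z json"
```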

* Move cvs_exec_timeout to cvs params section in config json

(cherry picked from commit e470ab4)