[cherry-pick] Add configurable cvs_exec_timeout, rccl-tests -T, and -A algoproto output (#153) #168
Open
speriaswamy-amd wants to merge 1 commit into
…proto output to rccl_perf and rccl_regression (#153)

* Add configurable cvs_exec_timeout, rccl-tests -T timeout, and -A algoproto output to rccl_perf and rccl_regression

At cluster scale (62 nodes x 12 dtypes x 24 sizes per collective), one rccl-tests invocation can run 30+ minutes. Both `rccl_regression` and `rccl_perf` wrap their `mpirun` call in `shdl.exec(cmd, timeout=500)` (8 min 20 sec), which aborts perfectly healthy benchmarks well before they finish. There is also no way today to (a) bound rccl-tests' own internal timer or (b) ask rccl-tests to print which NCCL algorithm/protocol/channel combination it selected per message size. Both are critical for triaging real-world performance.

This change adds three optional `rccl_test_params` keys, applied symmetrically to both `rccl_regression` and `rccl_perf`:

| Key                          | Default          | Effect                                           |
|------------------------------|------------------|--------------------------------------------------|
| `cvs_exec_timeout`           | `2400` (40 min)  | Replaces hardcoded `timeout=500` on `shdl.exec`. |
| `rccl_timeout`               | `None` (omitted) | When set, appended as `-T <value>`.              |
| `output_algo_proto_channels` | `False`          | When truthy, appends `-A 1`. Boolean toggle.     |

`-A` is a boolean per the rccl-tests help text:

`-A,--output_algo_proto_channels <0/1> enable algorithm/protocol/channels output (default: 0)`

`output_algo_proto_channels` is therefore a Python boolean (`bool(...)`) that becomes `-A 1` on the command line when truthy and is omitted otherwise. Users do not need to know about the integer wire format; they write `"output_algo_proto_channels": true` in their config.

The model config at `cvs/input/config_file/rccl/rccl_config.json` is updated to demonstrate the three new keys with an inline `_comment_*` description, matching the file's existing self-documenting style.
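The key handling described in the table can be sketched roughly as follows. This is an illustrative reconstruction, not the code from the diff: the helper names `build_extra_flags` and `resolve_exec_timeout` are hypothetical, and the sketch assumes the keys arrive in a plain dict decoded from the JSON config.

```python
def build_extra_flags(params):
    """Return the optional rccl-tests flags implied by the config keys.

    Hypothetical helper: the PR describes an inline `extra_flags` block,
    not a function with this name.
    """
    rccl_timeout = params.get("rccl_timeout")  # None means "omit -T"
    output_algo = bool(params.get("output_algo_proto_channels", False))

    extra_flags = []
    if rccl_timeout is not None:
        extra_flags.append(f"-T {rccl_timeout}")
    if output_algo:
        # rccl-tests expects <0/1>; the config exposes a boolean instead.
        extra_flags.append("-A 1")
    return " ".join(extra_flags)


def resolve_exec_timeout(params, default=2400):
    """Return the shdl.exec timeout in seconds (default 2400 per the table)."""
    return int(params.get("cvs_exec_timeout", default))


if __name__ == "__main__":
    cfg = {"rccl_timeout": 300, "output_algo_proto_channels": True}
    print(build_extra_flags(cfg))
    print(resolve_exec_timeout(cfg))
```

One caveat of the `bool(...)` coercion: JSON `true`/`false` decode to Python booleans and behave as expected, but a quoted string such as `"false"` is a non-empty string and would therefore be treated as truthy.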
Both functions get the same key extractions, the same `extra_flags` build block, the same splice point in `test_cmd` (immediately before `-Z json`), and the same `timeout=cvs_exec_timeout` substitution. This ensures users have a single uniform schema across `tests/rccl/rccl_perf.py` and `tests/rccl/rccl_regression.py`.

Validated:

- 4-node smoke with `cvs_exec_timeout=600`, `rccl_timeout=300`, `output_algo_proto_channels=true`: rccl-tests command line in test.log contains ` -T 300 -A 1 ` immediately before `-Z json`; pytest passes.
- 62-node overnight: `cvs_exec_timeout=21600` and `rccl_timeout=1800` both exercised, with rccl-tests invocations completing in 25-35 minutes without CVS abort and without rccl-tests overrunning its own timer.

* Move cvs_exec_timeout to cvs params section in config json

(cherry picked from commit e470ab4)
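To make the splice point concrete, here is a minimal sketch of placing the extra flags immediately before `-Z json`. It is an assumption about the shape of the logic, not the PR's code: the helper `splice_before_json_flag` is hypothetical, the real change presumably builds `test_cmd` in place rather than post-processing a string, and the base command in the usage example is illustrative only.

```python
def splice_before_json_flag(test_cmd, extra_flags, marker="-Z json"):
    """Insert extra_flags immediately before the last `marker` in test_cmd.

    Hypothetical helper for illustration; returns test_cmd unchanged when
    there are no extra flags to add.
    """
    if not extra_flags:
        return test_cmd
    head, sep, tail = test_cmd.rpartition(marker)
    if not sep:
        raise ValueError(f"marker {marker!r} not found in command")
    return f"{head.rstrip()} {extra_flags} {marker}{tail}"


if __name__ == "__main__":
    # Illustrative command line; the real one is assembled by the tests.
    base = "all_reduce_perf -b 8 -e 1G -f 2 -Z json"
    cmd = splice_before_json_flag(base, "-T 300 -A 1")
    print(cmd)
    # Mirrors the smoke-test check described above.
    assert " -T 300 -A 1 -Z json" in cmd
```

Keeping the flags directly ahead of `-Z json` makes the log check from the 4-node smoke test a simple substring match.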
cijohnson
approved these changes
May 14, 2026
Cherry-pick of #153 into `release/cvs-0.2.0`.

Adds three optional `rccl_test_params` keys applied symmetrically to `rccl_regression` and `rccl_perf`:

- `cvs_exec_timeout` (default 2400s) replaces the hardcoded `shdl.exec` 500s timeout
- `rccl_timeout` appends `-T <value>` when set
- `output_algo_proto_channels` appends `-A 1` when truthy

Commit: e470ab4 Add configurable cvs_exec_timeout, rccl-tests -T timeout, and -A algoproto output to rccl_perf and rccl_regression (#153)

Made with Cursor