feat(aorta): multi-node disaggregated launch via single cluster.json (AIMVT-173)#171
Open
speriaswamy-amd wants to merge 2 commits into
…(AIMVT-173)

Run the CVS Aorta pipeline across N nodes from one cluster.json by orchestrating torchrun on every node in parallel, rendezvousing on the head, and consolidating per-node torch_profiler trees into <aorta_path>/combined_traces/node_<rank>/ for the host parser. Single-node behavior is unchanged: master_launch_mode='auto' keeps the legacy script path for 1-node clusters.

Notable runtime fixes shaken out by the 2-node validation on g17u19+f16u13:
- launch container as root (+render group) so /dev/kfd is accessible
- pull head-node traces over SSH when orchestrator != head physical host
- pack all training_overrides behind a single --override (aorta argparse uses nargs="*" and silently drops earlier groups otherwise)
- initialise trace_mtime before the freshest-trace comparison

Adds AortaMultiNodeConfig + Pydantic schema, refactors AortaRunner.run(), documents the new block in docs/reference/configuration-files/aorta.rst.

Co-authored-by: Cursor <cursoragent@cursor.com>
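For reviewers who want a concrete picture of the per-node launch described in this commit, here is a minimal sketch of how one torchrun command per node could be assembled. It is not the code in cvs/runners/aorta.py: build_torchrun_command and combined_trace_dir are hypothetical helpers, and the port 29500, nproc_per_node=8, train.py, and /opt/aorta values are assumed for illustration only. The torchrun flags themselves (--nnodes, --node_rank, --nproc_per_node, --master_addr, --master_port) are standard.

```python
from pathlib import PurePosixPath


def build_torchrun_command(nnodes, node_rank, nproc_per_node,
                           master_addr, master_port, train_script,
                           script_args=()):
    """Hypothetical helper: argv for one node of a multi-node torchrun job."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",            # 0 is the head (rendezvous) node
        f"--nproc_per_node={nproc_per_node}",  # usually one process per GPU
        f"--master_addr={master_addr}",        # every node points at the head
        f"--master_port={master_port}",
        train_script,
        *script_args,
    ]


def combined_trace_dir(aorta_path, node_rank):
    """Hypothetical helper: destination for one node's torch_profiler tree."""
    return str(PurePosixPath(aorta_path) / "combined_traces" / f"node_{node_rank}")


# Assumed example values; hostnames are the two nodes named in this commit.
nodes = ["g17u19", "f16u13"]
commands = [
    build_torchrun_command(
        nnodes=len(nodes),
        node_rank=rank,
        nproc_per_node=8,
        master_addr=nodes[0],
        master_port=29500,
        train_script="train.py",
    )
    for rank in range(len(nodes))
]

for rank, cmd in enumerate(commands):
    print(" ".join(cmd), "->", combined_trace_dir("/opt/aorta", rank))
```

Every node gets the same master address and port and differs only in --node_rank, which is also what keys the node_<rank> directory under combined_traces.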
24 unittest cases covering the new launch-mode resolution, master-port picking, torchrun command construction, base-env merging, combined_traces helper, local trace-tree copy, train_script existence check, and the Pydantic AortaMultiNodeConfigFile schema. Also pins the "single --override group" invariant in two places to prevent the argparse(nargs="*") regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
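The argparse(nargs="*") regression these tests guard against can be reproduced in isolation. The sketch below is a standalone demonstration, not the aorta parser: it assumes only that --override is declared with nargs="*" and the default store action, and pack_overrides is a hypothetical helper showing the "single --override group" shape.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed declaration: nargs="*" with the default "store" action, so each
    # new --override occurrence replaces the list stored by the previous one.
    parser.add_argument("--override", nargs="*", default=[])
    return parser


def pack_overrides(overrides):
    """Hypothetical helper: put every key=value pair behind one --override."""
    pairs = [f"{key}={value}" for key, value in overrides.items()]
    return ["--override", *pairs] if pairs else []


parser = build_parser()

# Repeating the flag: only the last group survives.
repeated = parser.parse_args(["--override", "a=1", "--override", "b=2"])
print(repeated.override)  # ['b=2']

# Packing all pairs behind a single flag keeps them all.
packed = parser.parse_args(pack_overrides({"a": 1, "b": 2}))
print(packed.override)  # ['a=1', 'b=2']
```

A parser declared with action="append" or action="extend" would accumulate repeated groups instead; packing on the caller side avoids depending on a change to the training script.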
Summary
Implements AIMVT-173: run the CVS Aorta benchmark across N nodes from a single cluster.json, mirroring the disaggregated launch pattern used by the existing PyTorch xDiT and SGLang multi-node test suites.

The AortaRunner now orchestrates torchrun on every node in parallel (rendezvousing on the head) and consolidates per-node torch_profiler trees into <aorta_path>/combined_traces/node_<rank>/ so the host parser sees a single unified set. Single-node behavior is unchanged: multi_node.master_launch_mode='auto' keeps the legacy experiment_script path for 1-node clusters, and pre-existing yamls without a multi_node: block still validate (Pydantic supplies sensible defaults).

What's in the diff
- cvs/runners/aorta.py: AortaMultiNodeConfig dataclass; _resolve_launch_mode, _pick_master_port, _build_torchrun_command, _run_single_node, _collect_multi_node_traces + local/remote copy helpers; refactored run()
- cvs/parsers/schemas.py: AortaMultiNodeConfigFile Pydantic schema + train_script existence check
- cvs/input/config_file/aorta/aorta_benchmark.yaml: new multi_node: block with inline docs
- docs/reference/configuration-files/aorta.rst: new "Multi-node disaggregated launch" section + parameter table
- cvs/runners/unittests/test_aorta_multinode.py: 24 unit tests (launch-mode resolution, port selection, command construction, env merging, trace-tree copy, schema validation, single --override group invariant)
- cvs/tests/benchmark/test_aorta.py: wire multi_node block through the runner-config fixture

Validation
End-to-end cvs run test_aorta against a real 2-node cluster (g17u19 head + f16u13 worker, 16x MI300X total): 5/5 pytest cases pass in 148s, traces collected from both nodes, host parser produced metrics for all 16 ranks. Four runtime bugs surfaced and were fixed during this validation:
- container ran under the jenkins UID and couldn't open /dev/kfd despite --privileged → now passes user="root" and group_add=["video","render"]
- orchestrator != head physical host → head-node traces are now pulled over SSH
- UnboundLocalError: trace_mtime in the freshest-trace selector → initialised upfront
- --override key=val --override key=val … collapsed to last group only (aorta train.py uses argparse(nargs="*")) → packed behind a single --override

Test plan
- ruff check . --exclude .venv - clean
- ruff format --check - clean
- python -m unittest discover -s cvs - 288/288 pass (existing 264 + 24 new)
- cvs run test_aorta on real 2-node cluster - 5/5 pass
- pre-existing yaml without a multi_node: block still validates and runs single-node

Made with Cursor