Multi-orchestrator backend: pluggable baremetal/container execution layer + RVS health migration#135
Conversation
cijohnson
left a comment
There was a problem hiding this comment.
@atnair-amd , i see that you have done good optimizations like test case parameterization etc, but lets not mix that with the orchestrator support PR, you can create a AIMVT ticket for this nice to have features and work on it later.
@solaiys , @frepaul , @anujmittal-amd please start your review after @atnair-amd fixies this comment.
@solaiys , @frepaul , @anujmittal-amd @cijohnson reverted and ready for review |
|
Proof of rvs run on docker |
| return OrchestratorConfig( | ||
| orchestrator="baremetal", | ||
| node_dict={"10.0.0.1": {}, "10.0.0.2": {}}, | ||
| username="atnair", |
There was a problem hiding this comment.
Can we use some generic user say "testuser"
| cfg = OrchestratorConfig( | ||
| orchestrator="container", | ||
| node_dict={"10.0.0.1": {}}, | ||
| username="atnair", |
| self.log.debug("Container mode not enabled, skipping teardown") | ||
| return True | ||
|
|
||
| if self.container_config.get('launch', False): |
There was a problem hiding this comment.
This should be if not self.container_config.get('launch', False):
| return True | ||
|
|
||
| if self.container_config.get('launch', False): | ||
| self.log.debug("launch is true, not stopping externally managed containers") |
There was a problem hiding this comment.
This shold be self.log.debug("launch is False, not stopping externally managed containers")
|
|
||
| This method should be called explicitly by tests for cleanup. | ||
| It respects the 'enabled' flag and handles container lifecycle appropriately. | ||
| If 'launch' is true, assumes containers should not be stopped (externally managed). |
There was a problem hiding this comment.
If 'launch' is False, assumes containers should not be stopped (externally managed).
| bool: True if container is running on all hosts, False otherwise | ||
| """ | ||
| # Check if container is running on all hosts | ||
| cmd = f"sudo docker ps --filter name=^{container_name}$ --filter status=running --format '{{{{.Names}}}}'" |
There was a problem hiding this comment.
docker hardcoded in container.py, instead runtime/docker.py should implemnt is_running and call it here
There was a problem hiding this comment.
changed; moved to runtimes
|
|
||
| if cfg.orchestrator == "container": | ||
| if not orch.setup_containers(): | ||
| pytest.fail("ContainerOrchestrator.setup_containers() failed") |
There was a problem hiding this comment.
Why dont we fail with msg "Failed to launch container : "
| if not orch.setup_containers(): | ||
| pytest.fail("ContainerOrchestrator.setup_containers() failed") | ||
| if not orch.setup_sshd(): | ||
| pytest.fail("ContainerOrchestrator.setup_sshd() failed") |
There was a problem hiding this comment.
Why dont we fail with msg "Failed to setup sshd in container : "
212fe46 to
a903145
Compare
| Args: | ||
| **kwargs: Required orchestrator configuration keys | ||
| """ | ||
| self.orchestrator = kwargs['orchestrator'] |
There was a problem hiding this comment.
Possible keyError here
self.orchestrator = kwargs['orchestrator']
self.node_dict = kwargs['node_dict']
self.username = kwargs['username']
self.priv_key_file = kwargs['priv_key_file']
self.password = kwargs.get('password')
There was a problem hiding this comment.
added error checking; now raises value error if any key is missing
| return cmd_output | ||
|
|
||
| def _handle_timeout_exception(self, output, e): | ||
| def inform_unreachability_structured(self, cmd_output): |
There was a problem hiding this comment.
is this method dead code , if yes we can remove this method
|
@atnair-amd , can you run the rvs test in both baremetal and container env and attach the log/reports zip file in the https://amd-hub.atlassian.net/browse/AIMVT-156 |
…ardown_containers The teardown short-circuit was the inverse of the documented contract. Per PR #135's container block schema, launch=true means CVS owns the container lifecycle (and must tear down on cleanup); launch=false means externally managed (CVS only verifies and never stops them). Inverts the condition, updates the log line and docstring, and flips the matching unit tests in test_container.py to pin the corrected semantics. The no-container-id short-circuit test is also retargeted to launch=True so it actually exercises the no-id branch (under the old wrong logic launch=False short-circuited first and absorbed it).
CVS is unambiguously AMD-only (README.md:7 "validate AMD AI clusters end to end"; docs at rocm.docs.amd.com/projects/cvs). The gpu_passthrough config key was introduced by PR #135 itself and had zero non-test references. When True (the original default) it emitted --gpus all, which on docker 29.4.1 fails on AMD-only hosts without the AMD container toolkit ("failed to discover GPU vendor from CDI"). AMD GPU access is already provided by the device flags auto-injected from DEFAULT_CONTAINER_ARGS (/dev/kfd, /dev/dri, /dev/infiniband). The knob and the --gpus all rendering are pure dead weight; remove both: - Drop the gpu_passthrough lookup, gpu_args local, and --gpus all emission in DockerRuntime.setup_containers. - Update the docstring to call out the AMD-only intent explicitly. Pinned by new test_cmd_never_contains_gpus_all (3 subTest scenarios: minimal config, legacy config still setting gpu_passthrough=True silently ignored, config with extra runtime args). Defends against future regression that re-introduces the flag. Verified byte-equivalence against the cleaned code: the rendered docker run cmd from the post-fix container validation run is character-for-character identical to what the cleaned code produces with the same orchestrator inputs.
6636601 to
3b2fea9
Compare
Introduce a pluggable execution layer under cvs/core so distributed tests do not depend on raw parallel-SSH handles. Cluster and testsuite JSON are merged into OrchestratorConfig; OrchestratorFactory instantiates a backend based on the orchestrator key (default baremetal). Orchestrators: - BaremetalOrchestrator runs commands over SSH using existing Pssh clients for the head node and for all nodes, and builds MPI jobs (hostfile, mpirun) via distribute_using_mpi. - ContainerOrchestrator extends baremetal for running work inside long-lived containers, with Docker and Enroot runtimes under cvs/core/runtimes. rccl_lib is refactored to accept a single orchestrator: UCX and PML helpers use exec_on_head, and RCCL collectives use distribute_using_mpi instead of hand-built mpirun strings with separate phdl/shdl. RCCL pytest modules use a module-scoped orchestrator fixture built from CLI cluster and config paths, with teardown cleanup. Cluster metric snapshots use the parallel handle (orch.all) where full-cluster execution is required. Pssh gains an optional detailed mode on exec() so callers receive per-host output plus exit_code dicts, with matching unit tests. The abstract Orchestrator base documents future Slurm/Kubernetes-style backends; only baremetal and container are registered in the factory in this change. Signed-off-by: Ignatious Johnson <ichristo@amd.com>
Replaces the phdl Pssh fixture with an orch fixture built via OrchestratorFactory.create_orchestrator(...). For container backends, the fixture calls setup_containers() + setup_sshd() and registers teardown_containers() as a finalizer. For baremetal, no container lifecycle is invoked. - Renames phdl -> orch across all 8 test signatures, 5 helper signatures, and the ~13 phdl.exec(...) call sites. Pssh.exec and Orchestrator.exec share the (cmd, timeout=...) -> Dict[host, str] shape so the call-site changes are name-only. - Revert RCCL lib and tests to origin/main The orchestrator-based rewrite of cvs/lib/rccl_lib.py and the cvs/tests/rccl/*.py suite from 81a8ccb is not ready; restore those files to their origin/main state. The new orchestrator/runtime layer under cvs/core/ and the additive parallel_ssh_lib.py changes are retained for follow-up work.
3b2fea9 to
140be4f
Compare
What this changes
Introduces a pluggable execution substrate under
cvs/core/so test suitescan run identically on baremetal hosts (SSH) or inside long-lived per-host containers
(
docker exec), with the choice deferred to the cluster JSON. Migrates the RVS healthsuite to the new ABC as the reference consumer.
Layered as:
OrchestratorABC (cvs/core/orchestrators/base.py) —exec,exec_on_head,setup_env,cleanup,distribute_using_mpi, plus the additive surface needed for backend-blind suites:privileged_prefix(),prepare(),dispose(),host_all,host_head.BaremetalOrchestrator(cvs/core/orchestrators/baremetal.py) — runs commands overPssh(paramiko);host_all/host_headaliasself.all/self.headsince the SSH namespace already targets the host.ContainerOrchestrator(cvs/core/orchestrators/container.py) — inherits the host-SSH plumbing and overlays a runtime;prepare()doessetup_containers+setup_sshdwith rollback on partial failure;dispose()tears down via the runtime directly to bypass the existinglaunch:trueshort-circuit.ContainerRuntimeProtocol (cvs/core/runtimes/base.py) with concreteDockerRuntimeand stubEnrootRuntime.cvs/core/orchestrators/factory.py,cvs/core/runtimes/factory.py) instantiated fromOrchestratorConfig.from_configs(cluster_file, test_config).cvs/tests/health/rvs_cvs.pyis rewritten backend-blind (noorchestrator_typechecks, noisinstance, no"baremetal"/"container"literals, no/tmp/literals — all scratch viasealed_tmp(...)keyed onMULTIORCH_RUN_ID). New conftest + helpers incvs/tests/health/conftest.pyandcvs/tests/health/_rvs_orch_helpers.py.cvs/lib/parallel_ssh_lib.py) gains an optionaldetailedmode onexec()returning per-host{output, exit_code}; covered bycvs/lib/unittests/test_parallel_ssh_lib.py.cvs/core/unittests/test_orchestrator_abc.py.cvs/core/README.md.Sibling suites (
transferbench_cvs.py,agfhc_cvs.py,csp_qual_agfhc.py) areuntouched and continue to work — every new fixture and hook is gated on the test
opting in via the
orchfixture.Configuring the backend
Selection is via the
orchestratorkey in the cluster JSON. Two worked examples ship in this PR:Baremetal —
cvs/input/cluster_file/cluster_baremetal.json{ "orchestrator": "baremetal", "username": "{user-id}", "priv_key_file": "/home/{user-id}/.ssh/id_rsa", "head_node_dict": { "mgmt_ip": "{xx.xx.xx.xx|hostname-1}" }, "node_dict": { "{xx.xx.xx.xx|hostname-1}": { "bmc_ip": "NA", "vpc_ip": "{xx.xx.xx.xx|hostname-1}" }, "{xx.xx.xx.xx|hostname-2}": { "bmc_ip": "NA", "vpc_ip": "{xx.xx.xx.xx|hostname-2}" } } }orchestratordefaults to"baremetal"if omitted. Same shape as the existingcluster.json; 3-node variant incluster_3node.json.Container (Docker) —
cvs/input/cluster_file/cluster_container.json{ "orchestrator": "container", "username": "{user-id}", "priv_key_file": "/home/{user-id}/.ssh/id_rsa", "head_node_dict": { "mgmt_ip": "{xx.xx.xx.xx|hostname-1}" }, "node_dict": { "{xx.xx.xx.xx|hostname-1}": { "bmc_ip": "NA", "vpc_ip": "{xx.xx.xx.xx|hostname-1}" }, "{xx.xx.xx.xx|hostname-2}": { "bmc_ip": "NA", "vpc_ip": "{xx.xx.xx.xx|hostname-2}" } }, "container": { "enabled": true, "launch": true, "image": "rocm/cvs:latest", "name": "cvs_container", "runtime": { "name": "docker", "args": { "network": "host", "ipc": "host", "privileged": true } } } }Container-mode notes (full reference in
cvs/core/README.md):launch: true→ CVS owns the container lifecycle (started inprepare(), stopped indispose());false→ externally managed, CVS only verifies and uses.prepare()starts the container and runssshd -p 2224inside it; subsequentorch.execSSHes into the container namespace as root, which is whyprivileged_prefix()is""for the container backend (vs"sudo "for baremetal).orch.host_all/orch.host_head.runtime.argskey come fromDEFAULT_CONTAINER_ARGSincvs/core/orchestrators/container.py.Commits
Please focus review on these 5 commits (new in this PR)
040dacbExtend Orchestrator ABC for backend-blind test suites6decba7Add multi-orch health conftest, helpers, and capstdout enrichment7e176c3Migratervs_cvs.pyto the multi-orch backend14288b4Add developer documentation on adding more containers / orchestrators867f98fAdd sample cluster configs (cluster_baremetal.json,cluster_container.json,cluster_3node.json)Already validated — included for context only (no re-review needed)
81a8ccbAdd cluster orchestrator layer and refactor RCCL to use it7f44134Revert RCCL lib and tests to origin/mainThese two predate the new work, were validated previously, and the RCCL portion of
81a8ccbwas reverted by7f44134(so the net effect on RCCL is zero — only the orchestrator/runtime layer remains, which is then extended by the 5 commits above).