feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124)#130
Conversation
Introduce the `slurm_multi` self-managed multi-node SLURM launcher, fully additive vs develop. Verbatim minimal-delta port from PR #86 limited to the slurm_multi surface; no develop functionality removed. - common.py: register `slurm_multi` in VALID_LAUNCHERS, add hyphen alias normalization (`slurm-multi` -> `slurm_multi`). All other develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES, tools_include_rocprof_family, full configure_multi_node_profiling rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved. - slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation detection in __init__; helper methods (_get_allocation_node_count, _validate_allocation_nodes); slurm_multi early dispatch in prepare(); _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for deploy() gate); _generate_slurm_multi_command; new elif in _generate_launcher_command (existing primus/sglang-disagg dispatch preserved); gated bash-in-salloc branch in deploy() (slurm_multi only -- non-slurm_multi launchers continue to use sbatch); _run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in collect_results(); reservation= kwarg threaded to existing SlurmNodeSelector(...) call. - slurm_node_selector.py: optional `reservation` parameter so the reservation propagates to srun health/cleanup commands. Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions from develop. Co-authored-by: Cursor <cursoragent@cursor.com>
…or slurm_multi Wire the build/run orchestration paths needed by the slurm_multi launcher. All existing flows for non-slurm_multi launchers continue to work unchanged; new behavior is gated on `distributed.launcher == "slurm_multi"` (or `slurm-multi`). - execution/container_runner.py: SLURM_MULTI_ALIASES, SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs the model script directly on baremetal so srun/scontrol work inside it, propagates env from model_info.env_vars and additional_context.env_vars); self-managed launcher early dispatch in run_container() that fires only when launcher matches SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside `.sh` in script-extension detection (T-C2). Develop helpers preserved verbatim (_print_run_env_table, _docker_image_exists_locally, _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image, _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough, PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist, PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping). - execution/docker_builder.py: after build, write resolved registry image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute node finds the registry image. Pure addition (one block before manifest assembly). - orchestration/build_orchestrator.py: import shutil; capture user-supplied slurm keys (_original_user_slurm_keys); add `use_image: Optional[str]` and `build_on_compute: bool` to execute(); mutex validation between use_image and build_on_compute; use_image dispatch (incl. "auto" sentinel that resolves from model card via _resolve_image_from_model_card); build_on_compute dispatch (registry required, validated in _execute_build_on_compute); slurm_multi registry gate that runs DiscoverModels early to detect slurm_multi tags and either raises a structured ConfigurationError with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the model card env_vars (implicit --use-image fallback); _execute_with_prebuilt_image; _resolve_image_from_model_card; _execute_build_on_compute (sbatch wrapper that builds + pushes from one compute node, polls completion, marks `built_on_compute: true` in the manifest). Develop's load_credentials top-level usage, detect_local_gpu_arch parameter, and resolved_arch print all preserved unchanged. Note: PR #86's universal merge of model_info.env_vars into docker_env_vars in run_container() was DELIBERATELY DROPPED -- it is not slurm_multi-essential (slurm_multi gets env_vars via _prepare_slurm_multi_script's wrapper export block and via _run_self_managed's own merge). Applying it universally would silently change behavior for non-slurm_multi launchers and has an incorrect precedence vs additional_context.env_vars; it can be revisited as a separate PR if needed. Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions from develop except the agreed `.sh` -> `.sh|.slurm` line modification (T-C2: harmless extension to script-extension recognition). Co-authored-by: Cursor <cursoragent@cursor.com>
Surface the orchestrator paths from the previous commit through the `madengine build` CLI: - cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the local Docker build and use the named image; `auto` resolves from the model card's DOCKER_IMAGE_NAME); add `--build-on-compute` (build on a SLURM compute node and push to the configured registry; manifest records `built_on_compute: true`); 3 mutex validation blocks (--use-image vs --registry, --use-image vs --build-on-compute, --build-on-compute requires --registry); pass both new kwargs through to validate_additional_context, create_args_namespace, and BuildOrchestrator.execute(). - cli/validators.py: add `use_image: Optional[str] = None` parameter (signature + docstring only -- body unchanged) so build.py's call site does not TypeError. PR #86's other validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH validation, dockerfile_matched fallback removal) are NOT inherited; develop validation is preserved. Required for the slurm_multi launcher's documented build options in MAD-private #186. Co-authored-by: Cursor <cursoragent@cursor.com>
- tests/unit/test_slurm_multi.py: nine tests across three classes
- TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS;
SLURM_MULTI_ALIASES contains both canonical and hyphen forms.
- TestNormalizeSlurmMultiAliases: canonical and hyphen alias both
normalize to `slurm_multi`; unknown values still fall through to
the existing `slurm -> docker` default.
- TestSlurmMultiPrepareScript: end-to-end fixture that mirrors
MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry
verbatim. Asserts that prepare() takes the slurm_multi early-
dispatch path, sets `_is_slurm_multi = True` for the deploy()
gate, emits an SBATCH header that reflects the model card slurm
block, and emits every model_info.env_vars key as
`export KEY="value"` in the wrapper script. This is the explicit
safety net for the Q2 decision to drop PR #86's universal
docker_env_vars merge -- if slurm_multi ever stops propagating
those env_vars to the workload, this test fails loudly.
Existing tests in tests/unit/test_deployment.py and
tests/unit/test_container_runner.py are NOT modified.
- examples/slurm-configs/minimal/slurm-multi-minimal.json:
small reference config alongside develop's existing
*-minimal.json examples (`-f` add because *.json is gitignored,
matching how existing files in this dir were originally added).
Co-authored-by: Cursor <cursoragent@cursor.com>
…lti bash branch The earlier feat(deployment) commit ported `_run_inside_existing_allocation` verbatim from PR #86 (used by the slurm_multi bash-in-salloc branch). That method constructs `DeploymentResult(..., skip_monitoring=True)` to signal that the script ran synchronously and the monitor phase should be skipped. Develop's `DeploymentResult` dataclass did not have that field, so the construction raised `TypeError` at runtime, the failure was swallowed by the orchestrator, and `madengine run` exited 0 even when the wrapper script failed. Two minimal additive edits to deployment/base.py: - DeploymentResult: add `skip_monitoring: bool = False` field. - BaseDeployment.execute(): change the monitor guard from `if self.config.monitor:` to `if self.config.monitor and not result.skip_monitoring:` so the bash branch can correctly bypass the SLURM job poll (there is no SLURM job to poll — the script ran inline). Behavior for non-slurm_multi launchers is unchanged: they never construct DeploymentResult with skip_monitoring=True, the field defaults to False, and the monitor() call still fires as before. Discovered during R5.1b cluster smoke test of `pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186. Co-authored-by: Cursor <cursoragent@cursor.com>
…eporter
The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm)
writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv.
`_collect_slurm_multi_results` correctly finds and ingests that CSV via
`_collect_results_parse_perf_csv` (results['successful_runs'] is
populated, manifest exit codes are right). However the CLI's
post-run reporter (`display_performance_table` in cli/utils.py) reads
from cwd 'perf.csv' by default, so users saw a cosmetic
"Performance CSV not found: perf.csv" warning even on a successful
slurm_multi run. Local and classic-SLURM flows already leave a
cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not.
Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is
located and parsed, also write its rows into the conventional cwd
perf.csv path:
* if cwd 'perf.csv' is absent: shutil.copy() the per-job file
* if it exists (from previous runs): append data rows (skip the
per-job header so the cwd file stays single-headed)
Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is
not modified or deleted. Wrapped in try/except so any I/O failure
degrades to a yellow warning rather than failing the run.
Affects only `_collect_slurm_multi_results`, which only runs for the
slurm_multi launcher dispatch added earlier on this branch. Zero
effect on non-slurm_multi launchers (they still take develop's
classic collect_results path which writes cwd perf.csv via
update_perf_csv).
Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes
(Distributed_Inference_CI), bash branch in salloc, parallel docker
pull on all nodes, proxy/prefill/decode came up, benchmark
concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd
perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns
match per-job source).
Co-authored-by: Cursor <cursoragent@cursor.com>
- CHANGELOG.md: new [Unreleased] section documenting the slurm_multi launcher, --use-image / --build-on-compute CLI flags, the slurm_multi build registry gate, the bash-in-salloc execution path, the DeploymentResult.skip_monitoring field, the SlurmNodeSelector reservation parameter, the new contract tests + minimal example, and the cwd perf.csv aggregation fix. - cli/commands/build.py: comment near the --use-image Typer option flagging that `is_flag=False, flag_value="auto"` is being deprecated by Typer, with the planned migration path. Behavior unchanged (matches MAD-private PR #186's documented UX). Co-authored-by: Cursor <cursoragent@cursor.com>
Manifest-shape correctness (Copilot C2):
- _execute_with_prebuilt_image now keys built_images by model_name (one
entry per model, all pointing at the same use_image) so it matches
built_models and the rest of the codebase's invariant that
ContainerRunner.run_models_from_manifest() relies on at
built_models.get(image_name, {}). Without this, --use-image (and the
implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every
model in the run phase.
Multi-model dedup TypeError fix (Copilot C3):
- Replace tuple(sorted((m.get('distributed') or {}).items())) and the
same pattern for slurm with json.dumps(..., sort_keys=True, default=str).
The old pattern raised TypeError: unhashable type: 'dict' as soon as a
model's distributed block contained nested dicts (sglang_disagg /
vllm_disagg).
slurm_multi wrapper completion-marker on failure (Copilot C6):
- The wrapper script enables `set -e` at the top, so a non-zero exit
from the model script terminated the wrapper before SCRIPT_EXIT_CODE
was captured and the completion marker was written -- failed runs
looked like hangs to monitor(). Wrap the bash invocation in
`set +e` / `set -e` so the marker is always written.
Self-managed launcher hygiene (Copilot C4, C5, C7):
- Drop unused `import shutil` inside _run_self_managed.
- Drop unused script_name local in _run_self_managed (only script_path
is consumed by the function).
- Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card
credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak.
Build orchestrator hygiene (Copilot C1, C8):
- Drop top-level `import shutil` (unused).
- Drop the two `requires_auth = ...` assignments in
_execute_build_on_compute (assigned but never read).
Validators docstring honesty (Copilot C10):
- validate_additional_context's `use_image` parameter is currently
informational only (parameter retained for build.py call-site
compatibility); the docstring previously claimed it skipped required-
field validation, which the body never implemented. Tightened the
docstring to match reality.
New contract tests (tests/unit/test_slurm_multi.py):
- test_wrapper_disables_set_e_around_model_script: locks in C6's
set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation.
- test_built_images_and_models_share_keys: locks in C2's invariant
using a fake 2-model fixture and a stubbed BuildOrchestrator.
- test_multi_model_nested_dict_distributed_does_not_raise: regression
for C3 using two sglang_disagg-style cards with differing nested
dicts in `distributed`; would TypeError under the old code.
Skipped: Copilot C9 (broader BuildOrchestrator branch tests for
--use-image / --build-on-compute / registry-gating). The slurm_multi
contract suite + multi-node cluster smoke already cover the slurm_multi
surface end-to-end; broader orchestrator-branch coverage is a worthwhile
follow-up but out of scope for this minimal-additive PR.
Tests:
- pytest tests/unit/test_slurm_multi.py -v -> 12 passed (was 9).
- pytest tests/unit -q -> 436 passed / 2 failed; the 2 failures are
the same PermissionError environmental baseline that exists on
develop (test_upload_file_to_mongodb_file_not_found and
test_validate_additional_context_file_not_found, both raise
PermissionError on '/nonexistent/file.json' instead of FileNotFound
on this filesystem). Zero new failures.
Co-authored-by: Cursor <cursoragent@cursor.com>
…launcher # Conflicts: # CHANGELOG.md
…d support multi-model build-on-compute - Extract is_self_managed_launcher() helper in deployment/common.py; replace 3 duplicated SLURM_MULTI_ALIASES + normalize blocks across slurm.py, container_runner.py, and build_orchestrator.py. Remove dead _generate_slurm_multi_command (orphaned after early-dispatch bypass). - Preserve model-card xP/yD over distributed.sglang_disagg defaults (skip-if-set in _prepare_slurm_multi_script). - Generalize _execute_build_on_compute to build/tag/push all discovered models in a single sbatch job; key manifest by model_name to match _execute_with_prebuilt_image so ContainerRunner can join entries. Apply shlex.quote to registry_host/cwd/image names (consistent with v2.0.3 hardening pass). - Tests: parametrize is_self_managed_launcher, add TestXpYdSkipIfSet and TestBuildOnComputeManifestShape covering the multi-model contract. - CHANGELOG: clarify slurm_multi as an escape-hatch launcher distinct from templated launchers. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a new slurm_multi self-managed multi-node SLURM launcher: a wrapper SBATCH script that runs the model's own .slurm script directly on the head node (no madengine wrapper on compute nodes) and lets the script orchestrate its own per-node Docker containers via srun. Existing templated launchers are untouched and the new path is gated on distributed.launcher == "slurm_multi" (or slurm-multi). The PR also adds madengine build --use-image (with auto model-card resolution) and --build-on-compute for building on a compute node and pushing to a registry, plus a result-collection path that ingests model-generated perf.csv.
Changes:
- New
slurm_multideployment path: registry/alias normalization,_prepare_slurm_multi_script, bash-in-salloc execution, reservation propagation throughSlurmNodeSelector, and a dedicated_collect_slurm_multi_results. - New self-managed launcher execution path in
ContainerRunner(_run_self_managed) with shell-quoted script/args, key-only env logging, and.slurmrecognition. - New
build --use-image/--build-on-computeorchestration including registry validation/normalization, multi-model build manifest, and a slurm_multi-aware fallback to the model card'sDOCKER_IMAGE_NAME.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/deployment/common.py |
Adds slurm_multi to VALID_LAUNCHERS, hyphen alias normalization, and is_self_managed_launcher helper. |
src/madengine/deployment/slurm.py |
Allocation detection, _prepare_slurm_multi_script, bash-in-salloc deploy, slurm_multi result collection, reservation passthrough. |
src/madengine/deployment/slurm_node_selector.py |
Optional reservation forwarded to health/cleanup srun commands. |
src/madengine/deployment/base.py |
New DeploymentResult.skip_monitoring field and monitor-skip logic. |
src/madengine/execution/container_runner.py |
Self-managed launcher dispatch + _run_self_managed; .slurm recognized in container path. |
src/madengine/execution/docker_builder.py |
Propagates registry_image into built_models[*].env_vars.DOCKER_IMAGE_NAME. |
src/madengine/orchestration/build_orchestrator.py |
use_image/build_on_compute flags, prebuilt/build-on-compute manifest emission, slurm_multi registry gate. |
src/madengine/cli/commands/build.py |
New --use-image / --build-on-compute Typer options with mutual-exclusion checks. |
src/madengine/cli/validators.py |
Threads use_image through validate_additional_context (informational only). |
tests/unit/test_slurm_multi.py |
Contract tests for registry, normalization, _prepare_slurm_multi_script, xP/yD, prebuilt and build-on-compute manifest shapes. |
examples/slurm-configs/minimal/slurm-multi-minimal.json |
Minimal 3-node slurm_multi example config. |
CHANGELOG.md |
Unreleased entry documenting the new launcher and CLI flags. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…leanup files Cache the DiscoverModels result from the registry-check pass so the build step reuses it instead of running a second discovery (avoiding duplicate get_models_json.py execution and duplicate console output). Expand DEFAULT_CLEAN_FILES to include build_manifest.json and related perf artefacts so stale manifests from prior e2e tests can't silently cause the wrong image to be executed. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Resolves the CHANGELOG conflict by adopting upstream's v2.0.3 release date (2026-05-26 — finalized upstream via #122) and graduating this branch's Unreleased entries into a new v2.1.0 section dated 2026-05-28, since slurm_multi / --use-image / --build-on-compute are feature work. Auto-merged from upstream: - fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment (#126) - docs/wiki/index.html (wiki path rename, #129)
- slurm.py: shell-quote env_var values, model script name, and model args
in the generated SLURM wrapper script to prevent shell metacharacters
($(), backticks, ;, "$, etc.) in user-supplied inputs from breaking
the script or triggering host-shell expansion. Mirrors the hardening
pattern in container_runner._run_self_managed.
- container_runner.py: log env var keys from additional_context using
the same `ENV: {key}=<set>` pattern already used for model-card env
vars, so operators can see what was applied without leaking values.
- docker_builder.py: warn (via rich_console) when an entry in
built_images has no matching key in built_models while injecting
DOCKER_IMAGE_NAME for slurm_multi parallel pulls, so future keying
divergence is caught instead of silently no-opping.
- build_orchestrator.py (prebuilt manifest): mark built_images entries
with `local_image: True` so ContainerRunner.run_models_from_manifest()
resolves run_image to use_image (and pulls on demand) instead of
falling through to the else branch where run_image would be the model
name (the new dict key) and _resolve_docker_image() would fail. Fixes
--use-image for non-slurm_multi targets.
- build_orchestrator.py (build_on_compute heredoc): shell-quote
model_name (and reuse the already-quoted local/registry/dockerfile
tokens) in echo lines too, so a `"` or `$` in those fields cannot
break the generated sbatch script.
Expand v2.1.0 CHANGELOG with missing entries (local self-managed execution, model card config merge, DockerBuilder registry injection, shell-quoting security hardening, early discovery reuse, e2e cleanup defaults). Update launchers.md with full slurm_multi section (architecture, build/run workflows, salloc support). Update cli-reference.md with --use-image and --build-on-compute flags. Update deployment.md with slurm_multi, salloc, pre-built image sections, and reservation parameter. Update configuration.md with reservation, exclusive, and expanded launcher list. Update slurm-configs README with slurm_multi minimal example, workflow, and config reference. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Summary
Introduces the
slurm_multilauncher — a self-managed, bash-in-salloc execution pathfor multi-node SLURM workloads. Unlike the templated
sbatch-based launcher,slurm_multiruns model scripts directly on the allocation, allowing user scripts to drive
srun,scontrol, and other SLURM utilities without interference from madengine's job template.This is a fully additive change: no existing local, Kubernetes, or templated SLURM flows
are modified. New behavior is gated on
distributed.launcher == "slurm_multi"(orslurm-multi).Changes
Deployment layer
deployment/common.py: registersslurm_multiinVALID_LAUNCHERS; normalizesslurm-multi→slurm_multi; extractsis_self_managed_launcher()helper (replacesthree duplicated alias/normalize blocks across the codebase)
deployment/slurm.py: addsSLURM_MULTI_ALIASES; reservation field + allocationdetection;
_prepare_slurm_multi_script; gated bash-in-salloc branch indeploy();_run_inside_existing_allocation;_collect_slurm_multi_results+ early dispatch incollect_results();slurm_node_selectorreservation propagationdeployment/slurm_node_selector.py: optionalreservationparameter for health/cleanupsruncommandsExecution layer
execution/container_runner.py:SELF_MANAGED_LAUNCHERSconstant;_run_self_managed(runs script on bare metal, propagates env vars); self-managed launcher early dispatch
in
run_container();.slurmextension recognized alongside.shexecution/docker_builder.py: writes resolved registry image intobuilt_models[*].env_vars.DOCKER_IMAGE_NAMEsosrun docker pullon each compute noderesolves correctly
Orchestration layer
orchestration/build_orchestrator.py: addsuse_imageandbuild_on_computeflagsto
execute(); mutex validation between the two;--use-image autosentinel thatresolves from model card;
build_on_computedispatch (registry required);slurm_multiregistry gate with auto-fallback to model card
DOCKER_IMAGE_NAME;_execute_build_on_computesbatch wrapper that builds + pushes from one compute nodeand polls for completion
CLI
cli/commands/build.py: exposes--use-imageand--build-on-computeflags onmadengine buildTests
tests/unit/test_slurm_multi.py: contract tests foris_self_managed_launcher,xP/yD skip-if-set behavior, and multi-model
build_on_computemanifest shapeexamples/slurm-configs/minimal/slurm-multi-minimal.json: minimal example configDesign notes
xP/yDvalues are preserved overdistributed.sglang_disaggdefaults(skip-if-set in
_prepare_slurm_multi_script)build_on_computebuilds and tags all discovered models in a singlesbatchjob;manifest is keyed by
model_nameto match_execute_with_prebuilt_imagesoContainerRunnercan join entries correctly_execute_build_on_computeare hardened withshlex.quoteconsistent with the v2.0.3 hardening pass
model_info.env_varsmerge from PR Updates to slurm launcher #86 was deliberately not ported —it is not slurm_multi-essential and would silently change behavior for other launchers;
can be revisited as a separate PR