feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124) by coketaste · Pull Request #130 · ROCm/madengine

coketaste · 2026-05-27T21:25:30Z

Summary

Introduces the slurm_multi launcher — a self-managed, bash-in-salloc execution path
for multi-node SLURM workloads. Unlike the templated sbatch-based launcher, slurm_multi
runs model scripts directly on the allocation, allowing user scripts to drive srun,
scontrol, and other SLURM utilities without interference from madengine's job template.

This is a fully additive change: no existing local, Kubernetes, or templated SLURM flows
are modified. New behavior is gated on distributed.launcher == "slurm_multi" (or slurm-multi).

Changes

Deployment layer

deployment/common.py: registers slurm_multi in VALID_LAUNCHERS; normalizes
slurm-multi → slurm_multi; extracts is_self_managed_launcher() helper (replaces
three duplicated alias/normalize blocks across the codebase)
deployment/slurm.py: adds SLURM_MULTI_ALIASES; reservation field + allocation
detection; _prepare_slurm_multi_script; gated bash-in-salloc branch in deploy();
_run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in
collect_results(); slurm_node_selector reservation propagation
deployment/slurm_node_selector.py: optional reservation parameter for health/cleanup
srun commands

Execution layer

execution/container_runner.py: SELF_MANAGED_LAUNCHERS constant; _run_self_managed
(runs script on bare metal, propagates env vars); self-managed launcher early dispatch
in run_container(); .slurm extension recognized alongside .sh
execution/docker_builder.py: writes resolved registry image into
built_models[*].env_vars.DOCKER_IMAGE_NAME so srun docker pull on each compute node
resolves correctly

Orchestration layer

orchestration/build_orchestrator.py: adds use_image and build_on_compute flags
to execute(); mutex validation between the two; --use-image auto sentinel that
resolves from model card; build_on_compute dispatch (registry required); slurm_multi
registry gate with auto-fallback to model card DOCKER_IMAGE_NAME;
_execute_build_on_compute sbatch wrapper that builds + pushes from one compute node
and polls for completion

CLI

cli/commands/build.py: exposes --use-image and --build-on-compute flags on
madengine build

Tests

tests/unit/test_slurm_multi.py: contract tests for is_self_managed_launcher,
xP/yD skip-if-set behavior, and multi-model build_on_compute manifest shape
examples/slurm-configs/minimal/slurm-multi-minimal.json: minimal example config

Design notes

Model-card xP/yD values are preserved over distributed.sglang_disagg defaults
(skip-if-set in _prepare_slurm_multi_script)
build_on_compute builds and tags all discovered models in a single sbatch job;
manifest is keyed by model_name to match _execute_with_prebuilt_image so
ContainerRunner can join entries correctly
Shell arguments in _execute_build_on_compute are hardened with shlex.quote
consistent with the v2.0.3 hardening pass
The universal model_info.env_vars merge from PR Updates to slurm launcher #86 was deliberately not ported —
it is not slurm_multi-essential and would silently change behavior for other launchers;
can be revisited as a separate PR

Introduce the `slurm_multi` self-managed multi-node SLURM launcher, fully additive vs develop. Verbatim minimal-delta port from PR #86 limited to the slurm_multi surface; no develop functionality removed. - common.py: register `slurm_multi` in VALID_LAUNCHERS, add hyphen alias normalization (`slurm-multi` -> `slurm_multi`). All other develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES, tools_include_rocprof_family, full configure_multi_node_profiling rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved. - slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation detection in __init__; helper methods (_get_allocation_node_count, _validate_allocation_nodes); slurm_multi early dispatch in prepare(); _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for deploy() gate); _generate_slurm_multi_command; new elif in _generate_launcher_command (existing primus/sglang-disagg dispatch preserved); gated bash-in-salloc branch in deploy() (slurm_multi only -- non-slurm_multi launchers continue to use sbatch); _run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in collect_results(); reservation= kwarg threaded to existing SlurmNodeSelector(...) call. - slurm_node_selector.py: optional `reservation` parameter so the reservation propagates to srun health/cleanup commands. Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions from develop. Co-authored-by: Cursor <cursoragent@cursor.com>

…or slurm_multi Wire the build/run orchestration paths needed by the slurm_multi launcher. All existing flows for non-slurm_multi launchers continue to work unchanged; new behavior is gated on `distributed.launcher == "slurm_multi"` (or `slurm-multi`). - execution/container_runner.py: SLURM_MULTI_ALIASES, SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs the model script directly on baremetal so srun/scontrol work inside it, propagates env from model_info.env_vars and additional_context.env_vars); self-managed launcher early dispatch in run_container() that fires only when launcher matches SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside `.sh` in script-extension detection (T-C2). Develop helpers preserved verbatim (_print_run_env_table, _docker_image_exists_locally, _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image, _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough, PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist, PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping). - execution/docker_builder.py: after build, write resolved registry image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute node finds the registry image. Pure addition (one block before manifest assembly). - orchestration/build_orchestrator.py: import shutil; capture user-supplied slurm keys (_original_user_slurm_keys); add `use_image: Optional[str]` and `build_on_compute: bool` to execute(); mutex validation between use_image and build_on_compute; use_image dispatch (incl. "auto" sentinel that resolves from model card via _resolve_image_from_model_card); build_on_compute dispatch (registry required, validated in _execute_build_on_compute); slurm_multi registry gate that runs DiscoverModels early to detect slurm_multi tags and either raises a structured ConfigurationError with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the model card env_vars (implicit --use-image fallback); _execute_with_prebuilt_image; _resolve_image_from_model_card; _execute_build_on_compute (sbatch wrapper that builds + pushes from one compute node, polls completion, marks `built_on_compute: true` in the manifest). Develop's load_credentials top-level usage, detect_local_gpu_arch parameter, and resolved_arch print all preserved unchanged. Note: PR #86's universal merge of model_info.env_vars into docker_env_vars in run_container() was DELIBERATELY DROPPED -- it is not slurm_multi-essential (slurm_multi gets env_vars via _prepare_slurm_multi_script's wrapper export block and via _run_self_managed's own merge). Applying it universally would silently change behavior for non-slurm_multi launchers and has an incorrect precedence vs additional_context.env_vars; it can be revisited as a separate PR if needed. Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions from develop except the agreed `.sh` -> `.sh|.slurm` line modification (T-C2: harmless extension to script-extension recognition). Co-authored-by: Cursor <cursoragent@cursor.com>

Surface the orchestrator paths from the previous commit through the `madengine build` CLI: - cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the local Docker build and use the named image; `auto` resolves from the model card's DOCKER_IMAGE_NAME); add `--build-on-compute` (build on a SLURM compute node and push to the configured registry; manifest records `built_on_compute: true`); 3 mutex validation blocks (--use-image vs --registry, --use-image vs --build-on-compute, --build-on-compute requires --registry); pass both new kwargs through to validate_additional_context, create_args_namespace, and BuildOrchestrator.execute(). - cli/validators.py: add `use_image: Optional[str] = None` parameter (signature + docstring only -- body unchanged) so build.py's call site does not TypeError. PR #86's other validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH validation, dockerfile_matched fallback removal) are NOT inherited; develop validation is preserved. Required for the slurm_multi launcher's documented build options in MAD-private #186. Co-authored-by: Cursor <cursoragent@cursor.com>

- tests/unit/test_slurm_multi.py: nine tests across three classes - TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS; SLURM_MULTI_ALIASES contains both canonical and hyphen forms. - TestNormalizeSlurmMultiAliases: canonical and hyphen alias both normalize to `slurm_multi`; unknown values still fall through to the existing `slurm -> docker` default. - TestSlurmMultiPrepareScript: end-to-end fixture that mirrors MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry verbatim. Asserts that prepare() takes the slurm_multi early- dispatch path, sets `_is_slurm_multi = True` for the deploy() gate, emits an SBATCH header that reflects the model card slurm block, and emits every model_info.env_vars key as `export KEY="value"` in the wrapper script. This is the explicit safety net for the Q2 decision to drop PR #86's universal docker_env_vars merge -- if slurm_multi ever stops propagating those env_vars to the workload, this test fails loudly. Existing tests in tests/unit/test_deployment.py and tests/unit/test_container_runner.py are NOT modified. - examples/slurm-configs/minimal/slurm-multi-minimal.json: small reference config alongside develop's existing *-minimal.json examples (`-f` add because *.json is gitignored, matching how existing files in this dir were originally added). Co-authored-by: Cursor <cursoragent@cursor.com>

…lti bash branch The earlier feat(deployment) commit ported `_run_inside_existing_allocation` verbatim from PR #86 (used by the slurm_multi bash-in-salloc branch). That method constructs `DeploymentResult(..., skip_monitoring=True)` to signal that the script ran synchronously and the monitor phase should be skipped. Develop's `DeploymentResult` dataclass did not have that field, so the construction raised `TypeError` at runtime, the failure was swallowed by the orchestrator, and `madengine run` exited 0 even when the wrapper script failed. Two minimal additive edits to deployment/base.py: - DeploymentResult: add `skip_monitoring: bool = False` field. - BaseDeployment.execute(): change the monitor guard from `if self.config.monitor:` to `if self.config.monitor and not result.skip_monitoring:` so the bash branch can correctly bypass the SLURM job poll (there is no SLURM job to poll — the script ran inline). Behavior for non-slurm_multi launchers is unchanged: they never construct DeploymentResult with skip_monitoring=True, the field defaults to False, and the monitor() call still fires as before. Discovered during R5.1b cluster smoke test of `pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186. Co-authored-by: Cursor <cursoragent@cursor.com>

…eporter The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm) writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv. `_collect_slurm_multi_results` correctly finds and ingests that CSV via `_collect_results_parse_perf_csv` (results['successful_runs'] is populated, manifest exit codes are right). However the CLI's post-run reporter (`display_performance_table` in cli/utils.py) reads from cwd 'perf.csv' by default, so users saw a cosmetic "Performance CSV not found: perf.csv" warning even on a successful slurm_multi run. Local and classic-SLURM flows already leave a cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not. Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is located and parsed, also write its rows into the conventional cwd perf.csv path: * if cwd 'perf.csv' is absent: shutil.copy() the per-job file * if it exists (from previous runs): append data rows (skip the per-job header so the cwd file stays single-headed) Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is not modified or deleted. Wrapped in try/except so any I/O failure degrades to a yellow warning rather than failing the run. Affects only `_collect_slurm_multi_results`, which only runs for the slurm_multi launcher dispatch added earlier on this branch. Zero effect on non-slurm_multi launchers (they still take develop's classic collect_results path which writes cwd perf.csv via update_perf_csv). Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes (Distributed_Inference_CI), bash branch in salloc, parallel docker pull on all nodes, proxy/prefill/decode came up, benchmark concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns match per-job source). Co-authored-by: Cursor <cursoragent@cursor.com>

- CHANGELOG.md: new [Unreleased] section documenting the slurm_multi launcher, --use-image / --build-on-compute CLI flags, the slurm_multi build registry gate, the bash-in-salloc execution path, the DeploymentResult.skip_monitoring field, the SlurmNodeSelector reservation parameter, the new contract tests + minimal example, and the cwd perf.csv aggregation fix. - cli/commands/build.py: comment near the --use-image Typer option flagging that `is_flag=False, flag_value="auto"` is being deprecated by Typer, with the planned migration path. Behavior unchanged (matches MAD-private PR #186's documented UX). Co-authored-by: Cursor <cursoragent@cursor.com>

Manifest-shape correctness (Copilot C2): - _execute_with_prebuilt_image now keys built_images by model_name (one entry per model, all pointing at the same use_image) so it matches built_models and the rest of the codebase's invariant that ContainerRunner.run_models_from_manifest() relies on at built_models.get(image_name, {}). Without this, --use-image (and the implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every model in the run phase. Multi-model dedup TypeError fix (Copilot C3): - Replace tuple(sorted((m.get('distributed') or {}).items())) and the same pattern for slurm with json.dumps(..., sort_keys=True, default=str). The old pattern raised TypeError: unhashable type: 'dict' as soon as a model's distributed block contained nested dicts (sglang_disagg / vllm_disagg). slurm_multi wrapper completion-marker on failure (Copilot C6): - The wrapper script enables `set -e` at the top, so a non-zero exit from the model script terminated the wrapper before SCRIPT_EXIT_CODE was captured and the completion marker was written -- failed runs looked like hangs to monitor(). Wrap the bash invocation in `set +e` / `set -e` so the marker is always written. Self-managed launcher hygiene (Copilot C4, C5, C7): - Drop unused `import shutil` inside _run_self_managed. - Drop unused script_name local in _run_self_managed (only script_path is consumed by the function). - Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak. Build orchestrator hygiene (Copilot C1, C8): - Drop top-level `import shutil` (unused). - Drop the two `requires_auth = ...` assignments in _execute_build_on_compute (assigned but never read). Validators docstring honesty (Copilot C10): - validate_additional_context's `use_image` parameter is currently informational only (parameter retained for build.py call-site compatibility); the docstring previously claimed it skipped required- field validation, which the body never implemented. Tightened the docstring to match reality. New contract tests (tests/unit/test_slurm_multi.py): - test_wrapper_disables_set_e_around_model_script: locks in C6's set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation. - test_built_images_and_models_share_keys: locks in C2's invariant using a fake 2-model fixture and a stubbed BuildOrchestrator. - test_multi_model_nested_dict_distributed_does_not_raise: regression for C3 using two sglang_disagg-style cards with differing nested dicts in `distributed`; would TypeError under the old code. Skipped: Copilot C9 (broader BuildOrchestrator branch tests for --use-image / --build-on-compute / registry-gating). The slurm_multi contract suite + multi-node cluster smoke already cover the slurm_multi surface end-to-end; broader orchestrator-branch coverage is a worthwhile follow-up but out of scope for this minimal-additive PR. Tests: - pytest tests/unit/test_slurm_multi.py -v -> 12 passed (was 9). - pytest tests/unit -q -> 436 passed / 2 failed; the 2 failures are the same PermissionError environmental baseline that exists on develop (test_upload_file_to_mongodb_file_not_found and test_validate_additional_context_file_not_found, both raise PermissionError on '/nonexistent/file.json' instead of FileNotFound on this filesystem). Zero new failures. Co-authored-by: Cursor <cursoragent@cursor.com>

…launcher # Conflicts: # CHANGELOG.md

…d support multi-model build-on-compute - Extract is_self_managed_launcher() helper in deployment/common.py; replace 3 duplicated SLURM_MULTI_ALIASES + normalize blocks across slurm.py, container_runner.py, and build_orchestrator.py. Remove dead _generate_slurm_multi_command (orphaned after early-dispatch bypass). - Preserve model-card xP/yD over distributed.sglang_disagg defaults (skip-if-set in _prepare_slurm_multi_script). - Generalize _execute_build_on_compute to build/tag/push all discovered models in a single sbatch job; key manifest by model_name to match _execute_with_prebuilt_image so ContainerRunner can join entries. Apply shlex.quote to registry_host/cwd/image names (consistent with v2.0.3 hardening pass). - Tests: parametrize is_self_managed_launcher, add TestXpYdSkipIfSet and TestBuildOnComputeManifestShape covering the multi-model contract. - CHANGELOG: clarify slurm_multi as an escape-hatch launcher distinct from templated launchers. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new slurm_multi self-managed multi-node SLURM launcher: a wrapper SBATCH script that runs the model's own .slurm script directly on the head node (no madengine wrapper on compute nodes) and lets the script orchestrate its own per-node Docker containers via srun. Existing templated launchers are untouched and the new path is gated on distributed.launcher == "slurm_multi" (or slurm-multi). The PR also adds madengine build --use-image (with auto model-card resolution) and --build-on-compute for building on a compute node and pushing to a registry, plus a result-collection path that ingests model-generated perf.csv.

Changes:

New slurm_multi deployment path: registry/alias normalization, _prepare_slurm_multi_script, bash-in-salloc execution, reservation propagation through SlurmNodeSelector, and a dedicated _collect_slurm_multi_results.
New self-managed launcher execution path in ContainerRunner (_run_self_managed) with shell-quoted script/args, key-only env logging, and .slurm recognition.
New build --use-image/--build-on-compute orchestration including registry validation/normalization, multi-model build manifest, and a slurm_multi-aware fallback to the model card's DOCKER_IMAGE_NAME.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`src/madengine/deployment/common.py`	Adds `slurm_multi` to `VALID_LAUNCHERS`, hyphen alias normalization, and `is_self_managed_launcher` helper.
`src/madengine/deployment/slurm.py`	Allocation detection, `_prepare_slurm_multi_script`, bash-in-salloc deploy, slurm_multi result collection, reservation passthrough.
`src/madengine/deployment/slurm_node_selector.py`	Optional `reservation` forwarded to health/cleanup `srun` commands.
`src/madengine/deployment/base.py`	New `DeploymentResult.skip_monitoring` field and monitor-skip logic.
`src/madengine/execution/container_runner.py`	Self-managed launcher dispatch + `_run_self_managed`; `.slurm` recognized in container path.
`src/madengine/execution/docker_builder.py`	Propagates `registry_image` into `built_models[*].env_vars.DOCKER_IMAGE_NAME`.
`src/madengine/orchestration/build_orchestrator.py`	`use_image`/`build_on_compute` flags, prebuilt/build-on-compute manifest emission, slurm_multi registry gate.
`src/madengine/cli/commands/build.py`	New `--use-image` / `--build-on-compute` Typer options with mutual-exclusion checks.
`src/madengine/cli/validators.py`	Threads `use_image` through `validate_additional_context` (informational only).
`tests/unit/test_slurm_multi.py`	Contract tests for registry, normalization, `_prepare_slurm_multi_script`, xP/yD, prebuilt and build-on-compute manifest shapes.
`examples/slurm-configs/minimal/slurm-multi-minimal.json`	Minimal 3-node `slurm_multi` example config.
`CHANGELOG.md`	Unreleased entry documenting the new launcher and CLI flags.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…leanup files Cache the DiscoverModels result from the registry-check pass so the build step reuses it instead of running a second discovery (avoiding duplicate get_models_json.py execution and duplicate console output). Expand DEFAULT_CLEAN_FILES to include build_manifest.json and related perf artefacts so stale manifests from prior e2e tests can't silently cause the wrong image to be executed. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Resolves the CHANGELOG conflict by adopting upstream's v2.0.3 release date (2026-05-26 — finalized upstream via #122) and graduating this branch's Unreleased entries into a new v2.1.0 section dated 2026-05-28, since slurm_multi / --use-image / --build-on-compute are feature work. Auto-merged from upstream: - fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment (#126) - docs/wiki/index.html (wiki path rename, #129)

- slurm.py: shell-quote env_var values, model script name, and model args in the generated SLURM wrapper script to prevent shell metacharacters ($(), backticks, ;, "$, etc.) in user-supplied inputs from breaking the script or triggering host-shell expansion. Mirrors the hardening pattern in container_runner._run_self_managed. - container_runner.py: log env var keys from additional_context using the same `ENV: {key}=<set>` pattern already used for model-card env vars, so operators can see what was applied without leaking values. - docker_builder.py: warn (via rich_console) when an entry in built_images has no matching key in built_models while injecting DOCKER_IMAGE_NAME for slurm_multi parallel pulls, so future keying divergence is caught instead of silently no-opping. - build_orchestrator.py (prebuilt manifest): mark built_images entries with `local_image: True` so ContainerRunner.run_models_from_manifest() resolves run_image to use_image (and pulls on demand) instead of falling through to the else branch where run_image would be the model name (the new dict key) and _resolve_docker_image() would fail. Fixes --use-image for non-slurm_multi targets. - build_orchestrator.py (build_on_compute heredoc): shell-quote model_name (and reuse the already-quoted local/registry/dockerfile tokens) in echo lines too, so a `"` or `$` in those fields cannot break the generated sbatch script.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Expand v2.1.0 CHANGELOG with missing entries (local self-managed execution, model card config merge, DockerBuilder registry injection, shell-quoting security hardening, early discovery reuse, e2e cleanup defaults). Update launchers.md with full slurm_multi section (architecture, build/run workflows, salloc support). Update cli-reference.md with --use-image and --build-on-compute flags. Update deployment.md with slurm_multi, salloc, pre-built image sections, and reservation parameter. Update configuration.md with reservation, exclusive, and expanded launcher list. Update slurm-configs README with slurm_multi minimal example, workflow, and config reference. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

raviguptaamd and others added 10 commits May 9, 2026 08:10

Merge remote-tracking branch 'upstream/develop' into add_slurm_multi_…

2e8f1a4

…launcher # Conflicts: # CHANGELOG.md

coketaste assigned coketaste and raviguptaamd May 27, 2026

Copilot AI review requested due to automatic review settings May 27, 2026 21:25

coketaste requested review from Cemberk, Rohan138, gargrahul and leconcio as code owners May 27, 2026 21:25

Copilot started reviewing on behalf of coketaste May 27, 2026 21:25 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

coketaste and others added 3 commits May 28, 2026 02:19

Copilot AI review requested due to automatic review settings May 28, 2026 02:54

Copilot started reviewing on behalf of coketaste May 28, 2026 02:54 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread tests/unit/test_slurm_multi.py

Comment thread tests/unit/test_slurm_multi.py

coketaste merged commit 38cae42 into develop May 29, 2026

This was referenced May 30, 2026

docs(wiki): comprehensive codebase wiki rewrite for v2.1.0 #135

Merged

feat(slurm): add slurm_multi multi-node SLURM launcher (minimal-additive port from PR #86) #124

Closed

Updates to slurm launcher #86

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124)#130

feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124)#130
coketaste merged 14 commits into
developfrom
add_slurm_multi_launcher-patch

coketaste commented May 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

coketaste commented May 27, 2026

Summary

Changes

Design notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants