Skip to content

feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124)#130

Merged
coketaste merged 14 commits into
developfrom
add_slurm_multi_launcher-patch
May 29, 2026
Merged

feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124)#130
coketaste merged 14 commits into
developfrom
add_slurm_multi_launcher-patch

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Summary

Introduces the slurm_multi launcher — a self-managed, bash-in-salloc execution path
for multi-node SLURM workloads. Unlike the templated sbatch-based launcher, slurm_multi
runs model scripts directly on the allocation, allowing user scripts to drive srun,
scontrol, and other SLURM utilities without interference from madengine's job template.

This is a fully additive change: no existing local, Kubernetes, or templated SLURM flows
are modified. New behavior is gated on distributed.launcher == "slurm_multi" (or slurm-multi).

Changes

Deployment layer

  • deployment/common.py: registers slurm_multi in VALID_LAUNCHERS; normalizes
    slurm-multislurm_multi; extracts is_self_managed_launcher() helper (replaces
    three duplicated alias/normalize blocks across the codebase)
  • deployment/slurm.py: adds SLURM_MULTI_ALIASES; reservation field + allocation
    detection; _prepare_slurm_multi_script; gated bash-in-salloc branch in deploy();
    _run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in
    collect_results(); slurm_node_selector reservation propagation
  • deployment/slurm_node_selector.py: optional reservation parameter for health/cleanup
    srun commands

Execution layer

  • execution/container_runner.py: SELF_MANAGED_LAUNCHERS constant; _run_self_managed
    (runs script on bare metal, propagates env vars); self-managed launcher early dispatch
    in run_container(); .slurm extension recognized alongside .sh
  • execution/docker_builder.py: writes resolved registry image into
    built_models[*].env_vars.DOCKER_IMAGE_NAME so srun docker pull on each compute node
    resolves correctly

Orchestration layer

  • orchestration/build_orchestrator.py: adds use_image and build_on_compute flags
    to execute(); mutex validation between the two; --use-image auto sentinel that
    resolves from model card; build_on_compute dispatch (registry required); slurm_multi
    registry gate with auto-fallback to model card DOCKER_IMAGE_NAME;
    _execute_build_on_compute sbatch wrapper that builds + pushes from one compute node
    and polls for completion

CLI

  • cli/commands/build.py: exposes --use-image and --build-on-compute flags on
    madengine build

Tests

  • tests/unit/test_slurm_multi.py: contract tests for is_self_managed_launcher,
    xP/yD skip-if-set behavior, and multi-model build_on_compute manifest shape
  • examples/slurm-configs/minimal/slurm-multi-minimal.json: minimal example config

Design notes

  • Model-card xP/yD values are preserved over distributed.sglang_disagg defaults
    (skip-if-set in _prepare_slurm_multi_script)
  • build_on_compute builds and tags all discovered models in a single sbatch job;
    manifest is keyed by model_name to match _execute_with_prebuilt_image so
    ContainerRunner can join entries correctly
  • Shell arguments in _execute_build_on_compute are hardened with shlex.quote
    consistent with the v2.0.3 hardening pass
  • The universal model_info.env_vars merge from PR Updates to slurm launcher #86 was deliberately not ported —
    it is not slurm_multi-essential and would silently change behavior for other launchers;
    can be revisited as a separate PR

raviguptaamd and others added 10 commits May 9, 2026 08:10
Introduce the `slurm_multi` self-managed multi-node SLURM launcher,
fully additive vs develop. Verbatim minimal-delta port from PR #86
limited to the slurm_multi surface; no develop functionality removed.

- common.py: register `slurm_multi` in VALID_LAUNCHERS, add hyphen
  alias normalization (`slurm-multi` -> `slurm_multi`). All other
  develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES,
  tools_include_rocprof_family, full configure_multi_node_profiling
  rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved.
- slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation
  detection in __init__; helper methods (_get_allocation_node_count,
  _validate_allocation_nodes); slurm_multi early dispatch in prepare();
  _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for
  deploy() gate); _generate_slurm_multi_command; new elif in
  _generate_launcher_command (existing primus/sglang-disagg dispatch
  preserved); gated bash-in-salloc branch in deploy() (slurm_multi
  only -- non-slurm_multi launchers continue to use sbatch);
  _run_inside_existing_allocation; _collect_slurm_multi_results +
  early dispatch in collect_results(); reservation= kwarg threaded
  to existing SlurmNodeSelector(...) call.
- slurm_node_selector.py: optional `reservation` parameter so the
  reservation propagates to srun health/cleanup commands.

Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions
from develop.

Co-authored-by: Cursor <cursoragent@cursor.com>
…or slurm_multi

Wire the build/run orchestration paths needed by the slurm_multi
launcher. All existing flows for non-slurm_multi launchers continue
to work unchanged; new behavior is gated on
`distributed.launcher == "slurm_multi"` (or `slurm-multi`).

- execution/container_runner.py: SLURM_MULTI_ALIASES,
  SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs
  the model script directly on baremetal so srun/scontrol work
  inside it, propagates env from model_info.env_vars and
  additional_context.env_vars); self-managed launcher early dispatch
  in run_container() that fires only when launcher matches
  SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside
  `.sh` in script-extension detection (T-C2). Develop helpers
  preserved verbatim (_print_run_env_table, _docker_image_exists_locally,
  _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image,
  _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough,
  PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist,
  PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping).
- execution/docker_builder.py: after build, write resolved registry
  image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the
  run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute
  node finds the registry image. Pure addition (one block before
  manifest assembly).
- orchestration/build_orchestrator.py: import shutil; capture
  user-supplied slurm keys (_original_user_slurm_keys); add
  `use_image: Optional[str]` and `build_on_compute: bool` to
  execute(); mutex validation between use_image and build_on_compute;
  use_image dispatch (incl. "auto" sentinel that resolves from model
  card via _resolve_image_from_model_card); build_on_compute
  dispatch (registry required, validated in _execute_build_on_compute);
  slurm_multi registry gate that runs DiscoverModels early to detect
  slurm_multi tags and either raises a structured ConfigurationError
  with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the
  model card env_vars (implicit --use-image fallback);
  _execute_with_prebuilt_image; _resolve_image_from_model_card;
  _execute_build_on_compute (sbatch wrapper that builds + pushes from
  one compute node, polls completion, marks `built_on_compute: true`
  in the manifest). Develop's load_credentials top-level usage,
  detect_local_gpu_arch parameter, and resolved_arch print all
  preserved unchanged.

Note: PR #86's universal merge of model_info.env_vars into
docker_env_vars in run_container() was DELIBERATELY DROPPED -- it
is not slurm_multi-essential (slurm_multi gets env_vars via
_prepare_slurm_multi_script's wrapper export block and via
_run_self_managed's own merge). Applying it universally would
silently change behavior for non-slurm_multi launchers and has an
incorrect precedence vs additional_context.env_vars; it can be
revisited as a separate PR if needed.

Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions
from develop except the agreed `.sh` -> `.sh|.slurm` line modification
(T-C2: harmless extension to script-extension recognition).

Co-authored-by: Cursor <cursoragent@cursor.com>
Surface the orchestrator paths from the previous commit through the
`madengine build` CLI:

- cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the
  local Docker build and use the named image; `auto` resolves from
  the model card's DOCKER_IMAGE_NAME); add `--build-on-compute`
  (build on a SLURM compute node and push to the configured
  registry; manifest records `built_on_compute: true`); 3 mutex
  validation blocks (--use-image vs --registry, --use-image vs
  --build-on-compute, --build-on-compute requires --registry); pass
  both new kwargs through to validate_additional_context,
  create_args_namespace, and BuildOrchestrator.execute().
- cli/validators.py: add `use_image: Optional[str] = None`
  parameter (signature + docstring only -- body unchanged) so
  build.py's call site does not TypeError. PR #86's other
  validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH
  validation, dockerfile_matched fallback removal) are NOT
  inherited; develop validation is preserved.

Required for the slurm_multi launcher's documented build options
in MAD-private #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
- tests/unit/test_slurm_multi.py: nine tests across three classes
  - TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS;
    SLURM_MULTI_ALIASES contains both canonical and hyphen forms.
  - TestNormalizeSlurmMultiAliases: canonical and hyphen alias both
    normalize to `slurm_multi`; unknown values still fall through to
    the existing `slurm -> docker` default.
  - TestSlurmMultiPrepareScript: end-to-end fixture that mirrors
    MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry
    verbatim. Asserts that prepare() takes the slurm_multi early-
    dispatch path, sets `_is_slurm_multi = True` for the deploy()
    gate, emits an SBATCH header that reflects the model card slurm
    block, and emits every model_info.env_vars key as
    `export KEY="value"` in the wrapper script. This is the explicit
    safety net for the Q2 decision to drop PR #86's universal
    docker_env_vars merge -- if slurm_multi ever stops propagating
    those env_vars to the workload, this test fails loudly.
  Existing tests in tests/unit/test_deployment.py and
  tests/unit/test_container_runner.py are NOT modified.
- examples/slurm-configs/minimal/slurm-multi-minimal.json:
  small reference config alongside develop's existing
  *-minimal.json examples (`-f` add because *.json is gitignored,
  matching how existing files in this dir were originally added).

Co-authored-by: Cursor <cursoragent@cursor.com>
…lti bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation`
verbatim from PR #86 (used by the slurm_multi bash-in-salloc branch).
That method constructs `DeploymentResult(..., skip_monitoring=True)` to
signal that the script ran synchronously and the monitor phase should
be skipped. Develop's `DeploymentResult` dataclass did not have that
field, so the construction raised `TypeError` at runtime, the failure
was swallowed by the orchestrator, and `madengine run` exited 0 even
when the wrapper script failed.

Two minimal additive edits to deployment/base.py:

- DeploymentResult: add `skip_monitoring: bool = False` field.
- BaseDeployment.execute(): change the monitor guard from
  `if self.config.monitor:` to
  `if self.config.monitor and not result.skip_monitoring:`
  so the bash branch can correctly bypass the SLURM job poll
  (there is no SLURM job to poll — the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never
construct DeploymentResult with skip_monitoring=True, the field
defaults to False, and the monitor() call still fires as before.

Discovered during R5.1b cluster smoke test of
`pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
…eporter

The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm)
writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv.
`_collect_slurm_multi_results` correctly finds and ingests that CSV via
`_collect_results_parse_perf_csv` (results['successful_runs'] is
populated, manifest exit codes are right). However the CLI's
post-run reporter (`display_performance_table` in cli/utils.py) reads
from cwd 'perf.csv' by default, so users saw a cosmetic
"Performance CSV not found: perf.csv" warning even on a successful
slurm_multi run. Local and classic-SLURM flows already leave a
cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not.

Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is
located and parsed, also write its rows into the conventional cwd
perf.csv path:
  * if cwd 'perf.csv' is absent: shutil.copy() the per-job file
  * if it exists (from previous runs): append data rows (skip the
    per-job header so the cwd file stays single-headed)

Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is
not modified or deleted. Wrapped in try/except so any I/O failure
degrades to a yellow warning rather than failing the run.

Affects only `_collect_slurm_multi_results`, which only runs for the
slurm_multi launcher dispatch added earlier on this branch. Zero
effect on non-slurm_multi launchers (they still take develop's
classic collect_results path which writes cwd perf.csv via
update_perf_csv).

Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes
(Distributed_Inference_CI), bash branch in salloc, parallel docker
pull on all nodes, proxy/prefill/decode came up, benchmark
concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd
perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns
match per-job source).

Co-authored-by: Cursor <cursoragent@cursor.com>
- CHANGELOG.md: new [Unreleased] section documenting the slurm_multi
  launcher, --use-image / --build-on-compute CLI flags, the slurm_multi
  build registry gate, the bash-in-salloc execution path, the
  DeploymentResult.skip_monitoring field, the SlurmNodeSelector
  reservation parameter, the new contract tests + minimal example,
  and the cwd perf.csv aggregation fix.
- cli/commands/build.py: comment near the --use-image Typer option
  flagging that `is_flag=False, flag_value="auto"` is being deprecated
  by Typer, with the planned migration path. Behavior unchanged
  (matches MAD-private PR #186's documented UX).

Co-authored-by: Cursor <cursoragent@cursor.com>
Manifest-shape correctness (Copilot C2):
- _execute_with_prebuilt_image now keys built_images by model_name (one
  entry per model, all pointing at the same use_image) so it matches
  built_models and the rest of the codebase's invariant that
  ContainerRunner.run_models_from_manifest() relies on at
  built_models.get(image_name, {}). Without this, --use-image (and the
  implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every
  model in the run phase.

Multi-model dedup TypeError fix (Copilot C3):
- Replace tuple(sorted((m.get('distributed') or {}).items())) and the
  same pattern for slurm with json.dumps(..., sort_keys=True, default=str).
  The old pattern raised TypeError: unhashable type: 'dict' as soon as a
  model's distributed block contained nested dicts (sglang_disagg /
  vllm_disagg).

slurm_multi wrapper completion-marker on failure (Copilot C6):
- The wrapper script enables `set -e` at the top, so a non-zero exit
  from the model script terminated the wrapper before SCRIPT_EXIT_CODE
  was captured and the completion marker was written -- failed runs
  looked like hangs to monitor(). Wrap the bash invocation in
  `set +e` / `set -e` so the marker is always written.

Self-managed launcher hygiene (Copilot C4, C5, C7):
- Drop unused `import shutil` inside _run_self_managed.
- Drop unused script_name local in _run_self_managed (only script_path
  is consumed by the function).
- Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card
  credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak.

Build orchestrator hygiene (Copilot C1, C8):
- Drop top-level `import shutil` (unused).
- Drop the two `requires_auth = ...` assignments in
  _execute_build_on_compute (assigned but never read).

Validators docstring honesty (Copilot C10):
- validate_additional_context's `use_image` parameter is currently
  informational only (parameter retained for build.py call-site
  compatibility); the docstring previously claimed it skipped required-
  field validation, which the body never implemented. Tightened the
  docstring to match reality.

New contract tests (tests/unit/test_slurm_multi.py):
- test_wrapper_disables_set_e_around_model_script: locks in C6's
  set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation.
- test_built_images_and_models_share_keys: locks in C2's invariant
  using a fake 2-model fixture and a stubbed BuildOrchestrator.
- test_multi_model_nested_dict_distributed_does_not_raise: regression
  for C3 using two sglang_disagg-style cards with differing nested
  dicts in `distributed`; would TypeError under the old code.

Skipped: Copilot C9 (broader BuildOrchestrator branch tests for
--use-image / --build-on-compute / registry-gating). The slurm_multi
contract suite + multi-node cluster smoke already cover the slurm_multi
surface end-to-end; broader orchestrator-branch coverage is a worthwhile
follow-up but out of scope for this minimal-additive PR.

Tests:
- pytest tests/unit/test_slurm_multi.py -v  -> 12 passed (was 9).
- pytest tests/unit -q  -> 436 passed / 2 failed; the 2 failures are
  the same PermissionError environmental baseline that exists on
  develop (test_upload_file_to_mongodb_file_not_found and
  test_validate_additional_context_file_not_found, both raise
  PermissionError on '/nonexistent/file.json' instead of FileNotFound
  on this filesystem). Zero new failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
…d support multi-model build-on-compute

- Extract is_self_managed_launcher() helper in deployment/common.py; replace
  3 duplicated SLURM_MULTI_ALIASES + normalize blocks across slurm.py,
  container_runner.py, and build_orchestrator.py. Remove dead
  _generate_slurm_multi_command (orphaned after early-dispatch bypass).
- Preserve model-card xP/yD over distributed.sglang_disagg defaults
  (skip-if-set in _prepare_slurm_multi_script).
- Generalize _execute_build_on_compute to build/tag/push all discovered
  models in a single sbatch job; key manifest by model_name to match
  _execute_with_prebuilt_image so ContainerRunner can join entries.
  Apply shlex.quote to registry_host/cwd/image names (consistent with
  v2.0.3 hardening pass).
- Tests: parametrize is_self_managed_launcher, add TestXpYdSkipIfSet and
  TestBuildOnComputeManifestShape covering the multi-model contract.
- CHANGELOG: clarify slurm_multi as an escape-hatch launcher distinct
  from templated launchers.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 21:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new slurm_multi self-managed multi-node SLURM launcher: a wrapper SBATCH script that runs the model's own .slurm script directly on the head node (no madengine wrapper on compute nodes) and lets the script orchestrate its own per-node Docker containers via srun. Existing templated launchers are untouched and the new path is gated on distributed.launcher == "slurm_multi" (or slurm-multi). The PR also adds madengine build --use-image (with auto model-card resolution) and --build-on-compute for building on a compute node and pushing to a registry, plus a result-collection path that ingests model-generated perf.csv.

Changes:

  • New slurm_multi deployment path: registry/alias normalization, _prepare_slurm_multi_script, bash-in-salloc execution, reservation propagation through SlurmNodeSelector, and a dedicated _collect_slurm_multi_results.
  • New self-managed launcher execution path in ContainerRunner (_run_self_managed) with shell-quoted script/args, key-only env logging, and .slurm recognition.
  • New build --use-image/--build-on-compute orchestration including registry validation/normalization, multi-model build manifest, and a slurm_multi-aware fallback to the model card's DOCKER_IMAGE_NAME.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/madengine/deployment/common.py Adds slurm_multi to VALID_LAUNCHERS, hyphen alias normalization, and is_self_managed_launcher helper.
src/madengine/deployment/slurm.py Allocation detection, _prepare_slurm_multi_script, bash-in-salloc deploy, slurm_multi result collection, reservation passthrough.
src/madengine/deployment/slurm_node_selector.py Optional reservation forwarded to health/cleanup srun commands.
src/madengine/deployment/base.py New DeploymentResult.skip_monitoring field and monitor-skip logic.
src/madengine/execution/container_runner.py Self-managed launcher dispatch + _run_self_managed; .slurm recognized in container path.
src/madengine/execution/docker_builder.py Propagates registry_image into built_models[*].env_vars.DOCKER_IMAGE_NAME.
src/madengine/orchestration/build_orchestrator.py use_image/build_on_compute flags, prebuilt/build-on-compute manifest emission, slurm_multi registry gate.
src/madengine/cli/commands/build.py New --use-image / --build-on-compute Typer options with mutual-exclusion checks.
src/madengine/cli/validators.py Threads use_image through validate_additional_context (informational only).
tests/unit/test_slurm_multi.py Contract tests for registry, normalization, _prepare_slurm_multi_script, xP/yD, prebuilt and build-on-compute manifest shapes.
examples/slurm-configs/minimal/slurm-multi-minimal.json Minimal 3-node slurm_multi example config.
CHANGELOG.md Unreleased entry documenting the new launcher and CLI flags.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/deployment/slurm.py Outdated
Comment thread src/madengine/deployment/slurm.py Outdated
Comment thread src/madengine/orchestration/build_orchestrator.py
Comment thread src/madengine/execution/docker_builder.py Outdated
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/orchestration/build_orchestrator.py
coketaste and others added 3 commits May 28, 2026 02:19
…leanup files

Cache the DiscoverModels result from the registry-check pass so the
build step reuses it instead of running a second discovery (avoiding
duplicate get_models_json.py execution and duplicate console output).

Expand DEFAULT_CLEAN_FILES to include build_manifest.json and related
perf artefacts so stale manifests from prior e2e tests can't silently
cause the wrong image to be executed.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Resolves the CHANGELOG conflict by adopting upstream's v2.0.3 release
date (2026-05-26 — finalized upstream via #122) and graduating this
branch's Unreleased entries into a new v2.1.0 section dated 2026-05-28,
since slurm_multi / --use-image / --build-on-compute are feature work.

Auto-merged from upstream:
- fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment (#126)
- docs/wiki/index.html (wiki path rename, #129)
- slurm.py: shell-quote env_var values, model script name, and model args
  in the generated SLURM wrapper script to prevent shell metacharacters
  ($(), backticks, ;, "$, etc.) in user-supplied inputs from breaking
  the script or triggering host-shell expansion. Mirrors the hardening
  pattern in container_runner._run_self_managed.
- container_runner.py: log env var keys from additional_context using
  the same `ENV: {key}=<set>` pattern already used for model-card env
  vars, so operators can see what was applied without leaking values.
- docker_builder.py: warn (via rich_console) when an entry in
  built_images has no matching key in built_models while injecting
  DOCKER_IMAGE_NAME for slurm_multi parallel pulls, so future keying
  divergence is caught instead of silently no-opping.
- build_orchestrator.py (prebuilt manifest): mark built_images entries
  with `local_image: True` so ContainerRunner.run_models_from_manifest()
  resolves run_image to use_image (and pulls on demand) instead of
  falling through to the else branch where run_image would be the model
  name (the new dict key) and _resolve_docker_image() would fail. Fixes
  --use-image for non-slurm_multi targets.
- build_orchestrator.py (build_on_compute heredoc): shell-quote
  model_name (and reuse the already-quoted local/registry/dockerfile
  tokens) in echo lines too, so a `"` or `$` in those fields cannot
  break the generated sbatch script.
Copilot AI review requested due to automatic review settings May 28, 2026 02:54
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment thread tests/unit/test_slurm_multi.py
Comment thread tests/unit/test_slurm_multi.py
Expand v2.1.0 CHANGELOG with missing entries (local self-managed
execution, model card config merge, DockerBuilder registry injection,
shell-quoting security hardening, early discovery reuse, e2e cleanup
defaults).

Update launchers.md with full slurm_multi section (architecture,
build/run workflows, salloc support). Update cli-reference.md with
--use-image and --build-on-compute flags. Update deployment.md with
slurm_multi, salloc, pre-built image sections, and reservation
parameter. Update configuration.md with reservation, exclusive, and
expanded launcher list. Update slurm-configs README with slurm_multi
minimal example, workflow, and config reference.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants