Merged
44 changes: 44 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,50 @@ All notable changes to madengine will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [2.1.0] - 2026-05-28

### Added

- **`slurm_multi` SLURM escape-hatch launcher**: New self-managed multi-node launcher for workloads that orchestrate their own per-node Docker containers via `srun` (e.g. SGLang Disaggregated proxy + prefill + decode topologies). Selected via `distributed.launcher: "slurm_multi"` (or `"slurm-multi"` alias). Generates a wrapper SBATCH script that runs the model's `.slurm` script directly on baremetal so `srun`/`scontrol` work inside it; performs parallel `srun docker pull` of the registry image on all allocated nodes when the model card sets `env_vars.DOCKER_IMAGE_NAME`. Honors model-card and `--additional-context` `slurm` fields (`partition`, `nodes`, `gpus_per_node`, `time`, `exclusive`, `reservation`, `nodelist`). This launcher coexists with the standard templated launchers (torchrun, vllm, sglang, deepspeed, megatron, torchtitan, primus) — those continue to flow through the standard sbatch template unchanged; only `slurm_multi`/`slurm-multi` takes the self-managed bypass path.

- **`madengine build --use-image [IMAGE | auto]`**: Skip the local Docker build and use a pre-built image instead. With no value, resolves to the model card's `env_vars.DOCKER_IMAGE_NAME` automatically. Mutually exclusive with `--registry` and `--build-on-compute`. Manifest entries are keyed by model name with `local_image: True` so `ContainerRunner.run_models_from_manifest()` resolves `run_image` correctly and pulls on demand.

- **`madengine build --build-on-compute`**: Build Docker images on a SLURM compute node and push to a registry, then have `madengine run` pull the image in parallel on all allocated nodes. Requires `--registry`. The resulting manifest carries `built_on_compute: true`.

- **slurm_multi build registry gate**: When `madengine build` discovers a `slurm_multi` model and no `--registry`/`--use-image`/`--build-on-compute` is given, the orchestrator either auto-uses `env_vars.DOCKER_IMAGE_NAME` from the model card (implicit `--use-image` fallback) or raises a structured `ConfigurationError` with the four supported options listed.

- **bash-in-salloc execution path** for slurm_multi: when `madengine run` detects `SLURM_JOB_ID` (i.e. running inside an existing `salloc`), the slurm_multi launcher runs the generated wrapper synchronously with `bash` instead of nesting another `sbatch` job. Other launchers continue to use `sbatch` even inside `salloc` (no behavior change for non-slurm_multi).

- **Local self-managed launcher execution** (`container_runner.py`): `ContainerRunner._run_self_managed()` runs the model script directly on the host for self-managed launchers, bypassing madengine's Docker wrapper. Used when `madengine run` detects a `slurm_multi` launcher in local/non-SLURM contexts. Environment variables from the model card and `--additional-context` are injected; keys are logged without values to avoid leaking credentials.

- **Model card config merge into manifest `deployment_config`**: `_execute_with_prebuilt_image` now merges the model card's `distributed` and `slurm` sections into the manifest's `deployment_config`, so the run phase auto-detects SLURM deployment and launcher settings without requiring `--additional-context`. User-supplied CLI values take precedence over model card defaults.

- **`DockerBuilder` registry image injection for parallel pull**: After a successful registry push, `DockerBuilder.generate_manifest()` now sets `DOCKER_IMAGE_NAME` in each `built_models` entry's `env_vars` to the registry image, enabling slurm_multi parallel `srun docker pull` on all nodes without requiring manual image specification.

- **`DeploymentResult.skip_monitoring`** (`deployment/base.py`): new dataclass field so synchronous deploy paths (e.g. slurm_multi's bash-in-salloc) can skip the monitor poll.

- **`SlurmNodeSelector` `reservation` parameter**: optional reservation name forwarded to srun health/cleanup commands so node-prep srun calls run inside the reservation.

- **`tests/unit/test_slurm_multi.py`**: contract tests for `slurm_multi` registry membership, hyphen alias normalization, end-to-end env_vars-export contract against MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` model card, and `_execute_with_prebuilt_image` manifest key-set contract (`built_images.keys() == built_models.keys()`).

- **`examples/slurm-configs/minimal/slurm-multi-minimal.json`**: minimal reference config for the new launcher.

### Changed

- **Early model discovery reuse in `BuildOrchestrator`**: The `DiscoverModels` result from the slurm_multi registry-gate check is now cached and reused for the actual build step, avoiding duplicate `get_models_json.py` execution and duplicate console output.

- **E2E test cleanup defaults expanded**: `DEFAULT_CLEAN_FILES` in `tests/fixtures/utils.py` now includes `build_manifest.json` and related perf artefacts (`perf_super.json`, `perf_entry.csv`, etc.) so stale manifests from prior e2e tests cannot silently cause the wrong image to be executed.

### Fixed

- **slurm_multi: cwd `perf.csv` aggregation**: After a successful slurm_multi run, `madengine run` previously printed a cosmetic `Performance CSV not found: perf.csv` warning even though `_collect_slurm_multi_results` had ingested the per-job CSV from `/shared_inference/$USER/$JOBID/perf.csv`, because the reporter (`display_performance_table`) reads cwd `perf.csv` by default. Now `_collect_slurm_multi_results` also writes the per-job rows into cwd `perf.csv` (copied if absent, data rows appended if present) so reporting and HTML generation work without extra arguments. Local and classic-SLURM flows are unchanged.

### Security

- **Shell injection hardening in slurm_multi wrapper scripts**: `shlex.quote()` is applied to env_var values, the model script name, and model args in the generated SBATCH wrapper script (`slurm.py::_prepare_slurm_multi_script`) and the local self-managed runner (`container_runner.py::_run_self_managed`), preventing shell metacharacters (`$()`, backticks, `;`, `"`, etc.) in user-supplied inputs from triggering host-shell expansion.

## [2.0.3] - 2026-05-26

### Added
15 changes: 13 additions & 2 deletions docs/cli-reference.md
@@ -97,6 +97,8 @@ madengine build [OPTIONS]
| `--tags` | `-t` | TEXT | `[]` | Model tags to build (can specify multiple) |
| `--target-archs` | `-a` | TEXT | `[]` | Target GPU architectures (e.g., gfx908,gfx90a,gfx942) |
| `--registry` | `-r` | TEXT | `None` | Docker registry to push images to |
| `--use-image` | | TEXT | `None` | Skip Docker build and use a pre-built image. Omit value or pass `auto` to resolve from model card's `DOCKER_IMAGE_NAME`. Mutually exclusive with `--registry` and `--build-on-compute` |
| `--build-on-compute` | | FLAG | `False` | Build Docker images on a SLURM compute node and push to registry. Requires `--registry` |
| `--batch-manifest` | | TEXT | `None` | Input batch.json file for batch build mode |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |
@@ -142,6 +144,15 @@ madengine build --tags model \

# Real-time output with verbose logging
madengine build --tags model --live-output --verbose

# Use a pre-built image (skip Docker build)
madengine build --tags model --use-image lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x

# Auto-detect image from model card's DOCKER_IMAGE_NAME
madengine build --tags model --use-image

# Build on SLURM compute node and push to registry
madengine build --tags model --build-on-compute --registry docker.io/myorg
```

**Default Values:**
@@ -658,6 +669,6 @@ madengine recognizes these environment variables:

---

**Version:** 2.0.0
**Last Updated:** December 2025
**Version:** 2.1.0
**Last Updated:** May 2026

5 changes: 5 additions & 0 deletions docs/configuration.md
@@ -472,6 +472,8 @@ Automatically applies (see presets under `src/madengine/deployment/presets/k8s/`
- `gpus_per_node` - GPUs per node (default: 1)
- `nodes` - Number of nodes (default: 1)
- `nodelist` - Comma-separated node names to run on (e.g. `"node01,node02"`); when set, job is restricted to these nodes and automatic node health preflight is skipped
- `reservation` - SLURM reservation name; forwarded to srun health/cleanup commands and SBATCH directives
- `exclusive` - Exclusive node access (default: `true`)
- `time` - Wall time limit HH:MM:SS (required)
- `mem` - Memory per node (e.g., "64G")
- `mail_user` - Email for notifications
@@ -521,8 +523,11 @@ Automatically applies (see presets under `src/madengine/deployment/presets/k8s/`
- `deepspeed` - ZeRO optimization
- `megatron` - Large transformers (K8s + SLURM)
- `torchtitan` - LLM pre-training
- `primus` - Primus unified pretrain
- `vllm` - LLM inference
- `sglang` - Structured generation
- `sglang-disagg` - Disaggregated SGLang
- `slurm_multi` / `slurm-multi` - Self-managed multi-container topologies (SLURM only)

See [Launchers Guide](launchers.md) for details.

50 changes: 50 additions & 0 deletions docs/deployment.md
@@ -144,6 +144,7 @@ This creates:
- `vllm` - LLM inference
- `sglang` - Structured generation
- `sglang-disagg` - Disaggregated SGLang (multi-node)
- `slurm_multi` / `slurm-multi` - Self-managed multi-container topologies (SLURM only, escape hatch)

See [Launchers Guide](launchers.md) for details.

@@ -242,8 +243,10 @@ The deployment target is automatically detected from the `slurm` key in the conf
- `gpus_per_node`: Number of GPUs per node
- `nodes`: Number of nodes (for multi-node)
- `nodelist`: Comma-separated node names to run on (e.g. `"node01,node02"`); when set, job runs only on these nodes and node health preflight is skipped
- `reservation`: SLURM reservation name; forwarded to srun health/cleanup commands
- `time`: Wall time limit (HH:MM:SS)
- `mem`: Memory per node (e.g., "64G")
- `exclusive`: Exclusive node access (default: `true`)
- `mail_user`: Email for job notifications
- `mail_type`: Notification types (BEGIN, END, FAIL, ALL)

@@ -291,6 +294,53 @@ scontrol show job <job_id>
tail -f slurm-<job_id>.out
```

### Pre-Built Images and Build-on-Compute

For workloads that use externally maintained Docker images (e.g. SGLang, vLLM releases):

```bash
# Skip Docker build, use a pre-built image
madengine build --tags model --use-image lmsysorg/sglang:latest

# Auto-detect image from model card's DOCKER_IMAGE_NAME
madengine build --tags model --use-image

# Build on a SLURM compute node and push to registry
madengine build --tags model --build-on-compute --registry docker.io/myorg
```

The manifest generated by `--use-image` merges the model card's `distributed` and `slurm` config into `deployment_config`, so the run phase auto-detects SLURM deployment without additional `--additional-context`.
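
As a rough illustration of that merge, the sketch below treats the model card's `distributed` and `slurm` sections as defaults and lets user-supplied context override them key by key. The helper name `merge_deployment_config` and the dictionary shapes are assumptions made for illustration; this is not madengine's actual implementation.

```python
from copy import deepcopy


def merge_deployment_config(model_card: dict, cli_context: dict) -> dict:
    """Hypothetical helper: model card values are defaults, CLI values win."""
    merged = {}
    for section in ("distributed", "slurm"):
        # Copy so the model card itself is never mutated.
        values = deepcopy(model_card.get(section, {}))
        # User-supplied context overrides individual keys, not whole sections.
        values.update(cli_context.get(section, {}))
        if values:
            merged[section] = values
    return merged


model_card = {
    "distributed": {"launcher": "slurm_multi"},
    "slurm": {"partition": "gpu", "nodes": 3},
}
cli_context = {"slurm": {"nodes": 4}}  # e.g. parsed from --additional-context
deployment_config = merge_deployment_config(model_card, cli_context)
```

Under these assumptions, `deployment_config` keeps `partition` from the model card while `nodes` takes the user-supplied value.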

### slurm_multi Launcher (Self-Managed)

For workloads that orchestrate their own per-node Docker containers (e.g. SGLang Disaggregated proxy + prefill + decode topologies), use the `slurm_multi` launcher:

```json
{
  "distributed": {
    "launcher": "slurm_multi"
  },
  "slurm": {
    "partition": "gpu",
    "nodes": 3,
    "gpus_per_node": 8,
    "reservation": "my-reservation"
  }
}
```

Unlike templated launchers, slurm_multi runs the model's `.slurm` script directly on baremetal. The script manages its own Docker containers via `srun` internally. See [Launchers Guide — slurm_multi](launchers.md#9-slurm_multi-self-managed-escape-hatch) for details.

### Running Inside salloc

When `madengine run` detects an existing SLURM allocation (`SLURM_JOB_ID` is set, e.g. inside `salloc`), the slurm_multi launcher runs the generated wrapper script synchronously with `bash` instead of nesting another `sbatch`. Other launchers continue to use `sbatch` even inside `salloc`.

```bash
salloc --nodes=3 --gpus-per-node=8 --partition=gpu
madengine run --manifest-file build_manifest.json
# → Detects salloc, runs synchronously
```
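
The decision described above can be sketched as a small function. The name `choose_runner` and its signature are hypothetical; only the `SLURM_JOB_ID` check and the `bash` versus `sbatch` outcome come from the text.

```python
import os
from typing import Mapping, Optional


def choose_runner(launcher: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical sketch: return the command used to start the wrapper script."""
    env = os.environ if environ is None else environ
    # SLURM sets SLURM_JOB_ID inside an allocation such as one created by salloc.
    inside_allocation = bool(env.get("SLURM_JOB_ID"))
    is_slurm_multi = launcher in ("slurm_multi", "slurm-multi")
    if is_slurm_multi and inside_allocation:
        return "bash"    # run synchronously, no nested job
    return "sbatch"      # every other case submits a job


print(choose_runner("slurm_multi", {"SLURM_JOB_ID": "12345"}))
print(choose_runner("torchrun", {"SLURM_JOB_ID": "12345"}))
```

Passing the environment as a parameter keeps the check easy to exercise without actually running inside `salloc`.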

### Cancellation

```bash
108 changes: 107 additions & 1 deletion docs/launchers.md
@@ -20,6 +20,7 @@ madengine provides unified support for multiple distributed frameworks, enabling
| **vLLM** | Inference | High-throughput LLM serving | ✅ | ✅ | ✅ |
| **SGLang** | Inference | Fast LLM inference | ✅ | ✅ | ✅ |
| **SGLang Disaggregated** | Inference | Large-scale disaggregated inference | ✅ | ✅ | ✅ (min 3) |
| **slurm_multi** | Escape hatch | Self-managed multi-container topologies | ❌ | ✅ | ✅ |

---

@@ -557,6 +558,108 @@ madengine run --tags model --config custom-split-config.json

---

### 9. slurm_multi (Self-Managed Escape Hatch)

**Purpose**: Run workloads that manage their own per-node Docker containers via `srun` — an escape hatch for topologies that don't fit the standard templated launchers.

**When to Use**:
- ✅ Multi-container SLURM topologies (e.g. SGLang Disaggregated proxy + prefill + decode)
- ✅ Workloads whose `.slurm` script orchestrates Docker containers via `srun` internally
- ✅ Scenarios requiring baremetal `srun`/`scontrol` access from the model script
- ❌ NOT a peer of templated launchers — use torchrun, vllm, sglang, etc. for standard workloads

**Configuration**:
```json
{
  "distributed": {
    "launcher": "slurm_multi",
    "nnodes": 3,
    "nproc_per_node": 8
  },
  "slurm": {
    "partition": "gpu",
    "nodes": 3,
    "gpus_per_node": 8,
    "time": "04:00:00",
    "exclusive": true,
    "reservation": "my-reservation"
  }
}
```

**How It Works**:

Unlike templated launchers that inject `MAD_MULTI_NODE_RUNNER` and wrap the model script inside a Docker container, slurm_multi:

1. Generates a wrapper SBATCH script that exports `env_vars` from the model card
2. Runs the model's own `.slurm` script directly on baremetal (head node)
3. Lets the model script orchestrate per-node Docker containers via `srun` internally
4. Performs parallel `srun docker pull` on all allocated nodes when using registry images
5. Writes a completion marker file for robust job completion detection

```
┌─────────────────────────────────────────────────┐
│ madengine build --use-image <image> │
│ → Generates manifest with pre-built image │
│ → Merges model card slurm/distributed config │
└───────────────────┬─────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ madengine run --manifest-file manifest.json │
│ → Detects slurm_multi launcher │
│ → Generates wrapper SBATCH script │
│ → Parallel docker pull on all nodes (if needed) │
│ → Submits sbatch (or runs bash if inside salloc)│
└───────────────────┬─────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Model's .slurm script runs on head node │
│ → Orchestrates Docker containers via srun │
│ → Manages its own topology (proxy/prefill/...) │
│ → Writes perf.csv (collected by madengine) │
└─────────────────────────────────────────────────┘
```
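
To make the wrapper step more concrete, here is a minimal sketch of how a wrapper script might export `env_vars` and invoke the model's `.slurm` script with shell-quoted values. The function `render_wrapper` and the exact script layout are assumptions; the real generator (`_prepare_slurm_multi_script`) may emit additional SBATCH directives, the parallel `docker pull`, and the completion marker, none of which are shown here.

```python
import shlex
from typing import Dict, List


def render_wrapper(env_vars: Dict[str, str], script: str, args: List[str]) -> str:
    """Hypothetical sketch of a wrapper script body (not madengine's real output)."""
    lines = ["#!/bin/bash"]
    for key, value in env_vars.items():
        # Keys are assumed to be valid shell identifiers; values are quoted so
        # that $(), backticks, ';' and quotes are treated as literal text.
        lines.append(f"export {key}={shlex.quote(str(value))}")
    command = ["bash", shlex.quote(script)] + [shlex.quote(arg) for arg in args]
    lines.append(" ".join(command))
    return "\n".join(lines) + "\n"


wrapper = render_wrapper(
    {"DOCKER_IMAGE_NAME": "lmsysorg/sglang:latest", "NOTE": "$(whoami); echo pwned"},
    "run.slurm",
    ["--model", "qwen 32b"],
)
print(wrapper)
```

`shlex.quote` leaves already-safe strings such as `run.slurm` untouched and wraps anything else in single quotes, so the generated script stays readable for the common case.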

**Build Phase**:

slurm_multi models typically use pre-built images. The build phase has a **registry gate**: if no `--registry`, `--use-image`, or `--build-on-compute` is given, the orchestrator either auto-detects `DOCKER_IMAGE_NAME` from the model card (implicit `--use-image`) or raises a `ConfigurationError` with supported options.

```bash
# Use a pre-built image (recommended for slurm_multi)
madengine build --tags my_model --use-image lmsysorg/sglang:latest

# Auto-detect image from model card's DOCKER_IMAGE_NAME
madengine build --tags my_model --use-image

# Build on compute node and push to registry
madengine build --tags my_model --build-on-compute --registry docker.io/myorg
```
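
The registry gate and the flag constraints can be summarized in a short decision function. This is a sketch under stated assumptions: `resolve_image_source`, the returned mode labels, and the local `ConfigurationError` class are hypothetical stand-ins, and treating `--registry` alone as a local build followed by a push is an inference from the flag's description.

```python
from typing import Optional, Tuple


class ConfigurationError(Exception):
    """Stand-in for madengine's error type; the real class may differ."""


def resolve_image_source(
    registry: Optional[str],
    use_image: Optional[str],  # None = flag absent; "auto" = flag given without a value
    build_on_compute: bool,
    env_vars: dict,
) -> Tuple[str, str]:
    """Hypothetical sketch: return (mode, image or registry) for a slurm_multi model."""
    card_image = env_vars.get("DOCKER_IMAGE_NAME")
    if use_image is not None:
        if registry or build_on_compute:
            raise ConfigurationError(
                "--use-image cannot be combined with --registry or --build-on-compute"
            )
        if use_image == "auto":
            if not card_image:
                raise ConfigurationError("no DOCKER_IMAGE_NAME in the model card")
            return ("prebuilt", card_image)
        return ("prebuilt", use_image)
    if build_on_compute:
        if not registry:
            raise ConfigurationError("--build-on-compute requires --registry")
        return ("build-on-compute", registry)
    if registry:
        return ("build-and-push", registry)
    if card_image:
        # Implicit --use-image fallback described above.
        return ("prebuilt", card_image)
    raise ConfigurationError(
        "slurm_multi needs --registry, --use-image, --build-on-compute, "
        "or env_vars.DOCKER_IMAGE_NAME in the model card"
    )
```

For example, calling it with no flags and a model card that sets `DOCKER_IMAGE_NAME` yields the `prebuilt` mode, matching the implicit fallback.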

**Run Phase — salloc support**:

When `madengine run` detects `SLURM_JOB_ID` (running inside an existing `salloc` allocation), the slurm_multi launcher runs the wrapper script synchronously with `bash` instead of nesting another `sbatch`. Other launchers continue to use `sbatch` inside `salloc` (no behavior change).

```bash
# Inside salloc: runs synchronously with bash
salloc --nodes=3 --gpus-per-node=8 --partition=gpu
madengine run --manifest-file build_manifest.json
```

**Alias**: `"slurm-multi"` (hyphen) is normalized to `"slurm_multi"` (underscore).
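
A minimal sketch of that normalization, assuming a simple alias table: the names `KNOWN_LAUNCHERS`, `LAUNCHER_ALIASES`, and `normalize_launcher` are hypothetical. An explicit alias map is used rather than a blanket hyphen-to-underscore replacement because `sglang-disagg` keeps its hyphen.

```python
KNOWN_LAUNCHERS = {
    "torchrun", "deepspeed", "megatron", "torchtitan", "primus",
    "vllm", "sglang", "sglang-disagg", "slurm_multi",
}

# Only the documented alias is mapped; other names pass through unchanged.
LAUNCHER_ALIASES = {"slurm-multi": "slurm_multi"}


def normalize_launcher(name: str) -> str:
    """Hypothetical sketch: map aliases to canonical names and reject unknowns."""
    canonical = LAUNCHER_ALIASES.get(name, name)
    if canonical not in KNOWN_LAUNCHERS:
        raise ValueError(f"Unknown launcher type '{name}'")
    return canonical


print(normalize_launcher("slurm-multi"))
```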

**Features**:
- Wrapper SBATCH script with shell-quoted env_vars (injection-safe)
- Parallel `srun docker pull` on all nodes for registry images
- Completion marker for robust job status detection
- bash-in-salloc synchronous execution path
- `DeploymentResult.skip_monitoring` for synchronous runs
- Model card slurm/distributed config auto-merged into manifest

**Examples**:
- SLURM: `examples/slurm-configs/minimal/slurm-multi-minimal.json`

---

## Comparison Matrix

### Training Launchers
@@ -732,7 +835,7 @@ SGLANG_NODE_RANK=${SLURM_PROCID}
```bash
Error: Unknown launcher type 'xyz'
```
Solution: Use one of: `torchrun`, `deepspeed`, `megatron`, `torchtitan`, `primus`, `vllm`, `sglang`, `sglang-disagg`
Solution: Use one of: `torchrun`, `deepspeed`, `megatron`, `torchtitan`, `primus`, `vllm`, `sglang`, `sglang-disagg`, `slurm_multi` (or `slurm-multi`)

**2. Multi-Node Communication Fails**
```bash
@@ -782,6 +885,9 @@ $MAD_MULTI_NODE_RUNNER your_training_script.py --args

# For vLLM/sglang (no MAD_MULTI_NODE_RUNNER)
python your_inference_script.py --args

# For slurm_multi (no MAD_MULTI_NODE_RUNNER; script runs on baremetal and manages Docker via srun)
# The model's .slurm script is executed directly — it handles srun, docker run, etc. internally
```

### Launcher Detection