73 changes: 62 additions & 11 deletions docs/launchers.md
@@ -16,6 +16,7 @@ madengine provides unified support for multiple distributed frameworks, enabling
| **DeepSpeed** | Training | ZeRO optimization training | ✅ | ✅ | ✅ |
| **Megatron-LM** | Training | Large-scale transformer training | ✅ | ✅ | ✅ |
| **TorchTitan** | Training | LLM pre-training (FSDP2+TP+PP) | ✅ | ✅ | ✅ |
| **Primus** | Training | Megatron/TorchTitan/Jax via Primus config | ✅ | ✅ | ✅ |
| **vLLM** | Inference | High-throughput LLM serving | ✅ | ✅ | ✅ |
| **SGLang** | Inference | Fast LLM inference | ✅ | ✅ | ✅ |
| **SGLang Disaggregated** | Inference | Large-scale disaggregated inference | ✅ | ✅ | ✅ (min 3) |
@@ -224,6 +225,49 @@ TORCHTITAN_CONTEXT_PARALLEL_SIZE=1
- K8s: `examples/k8s-configs/minimal/torchtitan-single-node-minimal.json`
- SLURM: `examples/slurm-configs/minimal/torchtitan-single-node-minimal.json`

---

### 5b. Primus

**Purpose**: Unified pretrain entry for Megatron-LM, TorchTitan, and Jax/MaxText via Primus experiment YAML.

**When to Use**:
- Run Primus example configs (e.g. `examples/megatron/configs/MI300X/*.yaml`) via madengine
- Single container image plus a config path; madengine provides scheduling and tools/metrics

**Configuration**:
```json
{
"distributed": {
"launcher": "primus",
"nnodes": 2,
"nproc_per_node": 8,
"primus": {
"config_path": "examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml",
"cli_extra": "",
"backend": "megatron"
}
}
}
```

Setting the optional **`primus.backend`** (e.g. `MaxText`, `megatron`) makes madengine emit `export BACKEND=...` before your model script when path-based detection is not enough. If omitted, madengine infers `BACKEND` from the **model name** when it follows `primus_pretrain/<launcher>_<arch>_...` (e.g. `primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain` → `torchtitan`), matching `scripts/primus_pretrain/get_models_json.py` in MAD-internal.
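The naming convention can be sketched as a tiny parser (an illustration of the documented `primus_pretrain/<launcher>_<arch>_...` pattern, not the actual `get_models_json.py` code):

```python
import re
from typing import Optional

# Illustrative only: mirrors the documented convention
# primus_pretrain/<launcher>_<arch>_... (not the MAD-internal script itself).
_NAME_RE = re.compile(r"^primus_pretrain/([a-z]+)_[A-Za-z0-9]+_")

def infer_backend(model_name: str) -> Optional[str]:
    """Return the backend token encoded in the model name, or None."""
    m = _NAME_RE.match(model_name)
    return m.group(1) if m else None

print(infer_backend("primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain"))  # torchtitan
```

A name that does not match the pattern yields `None`, which is when an explicit `primus.backend` becomes necessary.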

**Features**:
- Launcher sets `PRIMUS_CONFIG_PATH`, optional `PRIMUS_CLI_EXTRA`, and optional `BACKEND` from `primus.backend`; no `MAD_MULTI_NODE_RUNNER`
- Model script (e.g. `run.sh`) sets `EXP` and calls Primus `run_pretrain.sh`
- NNODES, NODE_RANK, MASTER_ADDR, etc. set by madengine job template
- Use with MAD-Internal Primus submodule and `scripts/primus_pretrain/run.sh`

**Container image**: Prefer `docker/primus.ubuntu.amd.Dockerfile` with `COPY scripts/Primus/ /workspace/Primus/` and `PRIMUS_ROOT=/workspace/Primus`. On **Kubernetes**, the Job’s emptyDir hides image files under `/workspace`; madengine bundles `scripts/Primus/examples/...` into the ConfigMap as `Primus/examples/...` so the init container recreates `/workspace/Primus`. `run.sh` resolves `PRIMUS_ROOT` in that order (see script comments).
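The lookup order can be sketched as follows (a Python illustration of the order described; the real resolution happens in `run.sh`, and the bundled-path name is an assumption):

```python
import os

def resolve_primus_root(script_dir: str = ".") -> str:
    """Sketch of the documented PRIMUS_ROOT lookup order (not run.sh itself):
    1. an explicit PRIMUS_ROOT env var (e.g. baked into the image),
    2. the image-provided /workspace/Primus checkout,
    3. a ConfigMap-bundled Primus/ directory next to the model script (K8s).
    """
    explicit = os.environ.get("PRIMUS_ROOT", "")
    if explicit and os.path.isdir(explicit):
        return explicit
    if os.path.isdir("/workspace/Primus"):
        return "/workspace/Primus"
    bundled = os.path.join(script_dir, "Primus")
    if os.path.isdir(bundled):
        return bundled
    raise FileNotFoundError("no Primus checkout found; set PRIMUS_ROOT explicitly")
```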

**Examples**:
- SLURM: `examples/slurm-configs/minimal/primus-minimal.json`
- K8s: `examples/k8s-configs/minimal/primus-minimal.json`
- K8s (Primus vs upstream workload API, MaxText caveats, TorchTitan/Megatron/MaxText sample JSON): `examples/k8s-configs/README.md` section **Primus on Kubernetes**

---

**Model Configuration** (TOML):
```toml
[model]
@@ -519,16 +563,16 @@ madengine run --tags model --config custom-split-config.json

### Training Launchers

| Feature | torchrun | DeepSpeed | Megatron-LM | TorchTitan |
|---------|----------|-----------|-------------|------------|
| **Data Parallel** | ✅ DDP | ✅ ZeRO | ✅ | ✅ FSDP2 |
| **Tensor Parallel** | ❌ | ❌ | ✅ | ✅ |
| **Pipeline Parallel** | ❌ | ✅ | ✅ | ✅ |
| **Memory Efficiency** | Medium | High (ZeRO) | High | Very High |
| **Ease of Use** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| **Model Size** | Small-Medium | Medium-Large | Very Large | Very Large |
| **K8s Support** | ✅ | ✅ | ✅ | ✅ |
| **SLURM Support** | ✅ | ✅ | ✅ | ✅ |
| Feature | torchrun | DeepSpeed | Megatron-LM | TorchTitan | Primus |
|---------|----------|-----------|-------------|------------|--------|
| **Data Parallel** | ✅ DDP | ✅ ZeRO | ✅ | ✅ FSDP2 | via config |
| **Tensor Parallel** | ❌ | ❌ | ✅ | ✅ | via config |
| **Pipeline Parallel** | ❌ | ✅ | ✅ | ✅ | via config |
| **Memory Efficiency** | Medium | High (ZeRO) | High | Very High | config-driven |
| **Ease of Use** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| **Model Size** | Small-Medium | Medium-Large | Very Large | Very Large | config-driven |
| **K8s Support** | ✅ | ✅ | ✅ | ✅ | ✅ |
| **SLURM Support** | ✅ | ✅ | ✅ | ✅ | ✅ |

### Inference Launchers

@@ -646,6 +690,13 @@ TORCHTITAN_FSDP_ENABLED=1
MAD_MULTI_NODE_RUNNER="torchrun ..."
```

**Primus**:
```bash
PRIMUS_CONFIG_PATH="examples/megatron/configs/MI300X/..."
PRIMUS_CLI_EXTRA="" # optional
# No MAD_MULTI_NODE_RUNNER (model script calls Primus run_pretrain.sh)
```

**vLLM**:
```bash
VLLM_TENSOR_PARALLEL_SIZE=4
@@ -681,7 +732,7 @@ SGLANG_NODE_RANK=${SLURM_PROCID}
```bash
Error: Unknown launcher type 'xyz'
```
Solution: Use one of: `torchrun`, `deepspeed`, `megatron`, `torchtitan`, `vllm`, `sglang`, `sglang-disagg`
Solution: Use one of: `torchrun`, `deepspeed`, `megatron`, `torchtitan`, `primus`, `vllm`, `sglang`, `sglang-disagg`

**2. Multi-Node Communication Fails**
```bash
37 changes: 35 additions & 2 deletions examples/k8s-configs/README.md
@@ -185,10 +185,33 @@ To validate rendered YAML after a debug run, install [kubeconform](https://githu

### Multi-node DNS (PyTorch vs Ray)

For **PyTorch-native** launchers (`torchrun`, `deepspeed`, `torchtitan`, `megatron`), multi-node Jobs use a **headless Service** whose name matches `pod.spec.subdomain`, per Kubernetes DNS rules, so pods get stable per-pod DNS names for rendezvous.
For **PyTorch-native** launchers (`torchrun`, `deepspeed`, `torchtitan`, `megatron`, `primus`), multi-node Jobs use a **headless Service** whose name matches `pod.spec.subdomain`, per Kubernetes DNS rules, so pods get stable per-pod DNS names for rendezvous.

For **Ray-based** multi-node (`vllm`, `sglang`), a headless Service may still be created for networking, but **per-pod DNS via `subdomain` is not applied** the same way as for PyTorch; production multi-node Ray on Kubernetes often uses **KubeRay** (see upstream vLLM / Ray docs). Treat Job-based multi-node Ray as a best-effort path.
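The stable per-pod name follows the standard Kubernetes pattern; a quick sketch (assuming the default `cluster.local` cluster domain, with hypothetical Job and Service names):

```python
def pod_fqdn(pod_hostname: str, subdomain: str, namespace: str,
             cluster_domain: str = "cluster.local") -> str:
    """Per-pod DNS name available when pod.spec.subdomain names a headless Service."""
    return f"{pod_hostname}.{subdomain}.{namespace}.svc.{cluster_domain}"

# e.g. the rank-0 pod of an Indexed Job, usable as MASTER_ADDR
print(pod_fqdn("trainjob-0", "trainjob-svc", "default"))
# trainjob-0.trainjob-svc.default.svc.cluster.local
```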

### Primus on Kubernetes

Upstream Primus separates two ideas: a **universal** training driver (`examples/run_pretrain.sh`, used by all backends) and an optional **Kubernetes workload API client** (`examples/run_k8s_pretrain.sh`) that talks to a remote service to create workloads. Those are not the same integration as madengine’s native K8s path.

| Approach | Role |
|----------|------|
| Primus `run_k8s_pretrain.sh` | HTTP client to an external **workload API** (replicas, image, `backend`, `exp`, node labels, etc.). Does **not** match madengine’s `kubectl`-style Job YAML flow. |
| madengine `distributed.launcher: "primus"` | Renders a standard **Job** (and headless **Service** when `nnodes > 1`), injects cluster env (`MASTER_ADDR`, `NNODES`, `NODE_RANK` / `JOB_COMPLETION_INDEX`, `GPUS_PER_NODE`), `PRIMUS_CONFIG_PATH`, optional `PRIMUS_CLI_EXTRA`, and optional `BACKEND` from `distributed.primus.backend`, then runs your model script (e.g. `scripts/primus_pretrain/run.sh`). |

**Backends** (one madengine launcher; Primus routes inside `run_pretrain.sh`):

| Experiment path (typical) | Backend |
|---------------------------|---------|
| `examples/torchtitan/...` | TorchTitan (PyTorch / `torchrun`) |
| `examples/megatron/...` | Megatron-LM |
| `examples/maxtext/...` | MaxText (JAX; coordinator env from master) |

Set `distributed.primus.config_path` to your YAML under the Primus repo layout. Use optional `distributed.primus.backend` (e.g. `MaxText`, `megatron`) to emit `export BACKEND=...` when you want to override the inferred backend. If `backend` is omitted, madengine sets `BACKEND` from the **model name** when it matches `primus_pretrain/<launcher>_<arch>_...` (e.g. `primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain` → `torchtitan`).

**MaxText caveat:** For **multi-node** MaxText, Primus `run_pretrain.sh` may run **in-container `apt` installs** (InfiniBand-related packages). Many clusters disallow that unless the image is pre-baked or policy allows it. madengine logs a **warning** when MaxText is detected (`backend` or path) and `nnodes > 1`.

Primus examples under [`basic/`](basic/): [`primus-single-node-multi-gpu.json`](basic/primus-single-node-multi-gpu.json) (one pod, multi-GPU) and [`primus-multi-node.json`](basic/primus-multi-node.json) (Indexed Job). The same files work for TorchTitan, Megatron, and MaxText: set `distributed.primus.config_path` to your experiment YAML, and rely on madengine’s `BACKEND` inference from the model name (`primus_pretrain/<launcher>_<arch>_...`), setting `distributed.primus.backend` only when you need an explicit override. Set `docker_env_vars.HF_TOKEN` to a placeholder or inject the token via runtime secrets — do not commit real tokens.

### Full Configs (Reference Examples)

Complete configurations showing all available fields:
@@ -213,6 +236,8 @@ Complete configurations showing all available fields:
| [`basic/torchtitan-multi-node-basic.json`](basic/torchtitan-multi-node-basic.json) | 8/node | 4 | TorchTitan | Llama 3.1 70B+ training |
| [`basic/vllm-multi-node-basic.json`](basic/vllm-multi-node-basic.json) | 4/node | 2 | vLLM | High-throughput inference |
| [`basic/sglang-multi-node-basic.json`](basic/sglang-multi-node-basic.json) | 4/node | 2 | SGLang | Distributed inference |
| [`basic/primus-single-node-multi-gpu.json`](basic/primus-single-node-multi-gpu.json) | 8 | 1 | primus | Primus pretrain (single pod; edit `primus.config_path`) |
| [`basic/primus-multi-node.json`](basic/primus-multi-node.json) | 8/node | 2+ | primus | Primus pretrain (multi-pod; edit `nnodes`, `primus.config_path`) |

---

@@ -551,13 +576,21 @@ Configuration for distributed workloads (training and inference):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `launcher` | string | - | Launcher type: `torchrun`, `deepspeed`, `torchtitan`, `vllm`, `sglang` |
| `launcher` | string | - | Launcher type: `torchrun`, `deepspeed`, `torchtitan`, `megatron`, `primus`, `vllm`, `sglang` |
| `enabled` | boolean | `false` | Enable distributed execution (legacy, prefer `launcher`) |
| `backend` | string | `"nccl"` | `"nccl"`, `"gloo"`, or `"mpi"` |
| `nnodes` | integer | `1` | Number of nodes |
| `nproc_per_node` | integer | gpu_count | Processes per node (= GPUs per node) |
| `master_port` | integer | `29500` | Master communication port |

When `launcher` is **`primus`**, set nested **`primus`** (under `distributed`):

| Field | Type | Description |
|-------|------|-------------|
| `primus.config_path` | string | Path to the Primus experiment YAML (e.g. under `examples/torchtitan/...`). |
| `primus.cli_extra` | string | Optional extra arguments passed to Primus CLI. |
| `primus.backend` | string | Optional. If set, madengine emits `export BACKEND=...` (e.g. `MaxText`, `megatron`) before your `run.sh`. |
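Putting the fields together, a minimal `distributed` block might look like this (the `config_path` value is illustrative; substitute your own experiment YAML):

```json
{
  "distributed": {
    "launcher": "primus",
    "nnodes": 2,
    "nproc_per_node": 8,
    "primus": {
      "config_path": "examples/torchtitan/configs/...",
      "cli_extra": "",
      "backend": "torchtitan"
    }
  }
}
```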

#### Environment Variables

Custom environment variables for containers:
1 change: 1 addition & 0 deletions src/madengine/deployment/common.py
@@ -17,6 +17,7 @@
"torchtitan",
"deepspeed",
"megatron-lm",
"primus",
"vllm",
"sglang",
"sglang-disagg"
140 changes: 140 additions & 0 deletions src/madengine/deployment/k8s_names.py
@@ -0,0 +1,140 @@
#!/usr/bin/env python3
"""
Kubernetes-safe names for metadata.name, label values, and container names.

Model names from data.json may contain ``/``, spaces, or uppercase letters that
are invalid for ``metadata.name`` (RFC 1123 subdomain) or for label values.
Container names must be a single DNS label (no dots), stricter than Job names.

Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
"""

from __future__ import annotations

import hashlib
import re
from typing import Final

# Kubernetes DNS subdomain total length (metadata.name)
_MAX_OBJECT_NAME_LEN: Final[int] = 253
# Label value max length
_MAX_LABEL_VALUE_LEN: Final[int] = 63
# Container / initContainer names: DNS label only (no dots); see Pod validation.
_MAX_DNS_LABEL_LEN: Final[int] = 63


def _trim_edges_alnum(s: str) -> str:
    """Ensure string starts and ends with [a-z0-9] (required for RFC1123 names)."""
    s = s.strip("-.")
    if not s:
        return "x"
    # Strip leading non-alphanumeric
    while s and not s[0].isalnum():
        s = s[1:]
    while s and not s[-1].isalnum():
        s = s[:-1]
    return s or "x"


def sanitize_k8s_object_name(prefix: str, raw_model_name: str, max_total_len: int = _MAX_OBJECT_NAME_LEN) -> str:
    """
    Build a valid ``metadata.name`` substring from a model name.

    Args:
        prefix: Leading segment (e.g. ``madengine``). May contain only chars valid in the final name.
        raw_model_name: Original model name (may include ``/``, ``_``, spaces).
        max_total_len: Maximum total length (default 253).

    Returns:
        A lowercase name safe for Kubernetes ``metadata.name`` (Job, PVC, Service, etc.).
    """
    raw = (raw_model_name or "").strip()
    pfx = (prefix or "").strip().lower()
    pfx = re.sub(r"[^a-z0-9.-]+", "-", pfx)
    pfx = re.sub(r"-+", "-", pfx).strip("-")
    if not pfx:
        pfx = "madengine"

    body = raw.lower()
    body = re.sub(r"[^a-z0-9.-]+", "-", body)
    body = re.sub(r"-+", "-", body).strip("-")
    if not body:
        body = "model"

    combined = f"{pfx}-{body}"
    combined = _trim_edges_alnum(combined)
    # Dots are allowed in RFC1123 but avoid double semantics; keep as-is if present
    if len(combined) <= max_total_len:
        return combined

    # Too long: stable short hash + truncated body
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    # room: prefix + "-" + digest + "-" + rest
    anchor = f"{pfx}-{digest}"
    room = max_total_len - len(anchor) - 1
    if room < 8:
        # Extreme: prefix alone too long — fall back to hash-only tail
        return _trim_edges_alnum(f"{digest}-{hashlib.sha256(raw.encode()).hexdigest()[:20]}")[:max_total_len]

    tail = body[:room] if room > 0 else ""
    tail = _trim_edges_alnum(tail) if tail else "m"
    out = f"{anchor}-{tail}"
    if len(out) > max_total_len:
        out = out[:max_total_len]
    return _trim_edges_alnum(out)


def sanitize_k8s_container_name(name_hint: str, max_len: int = _MAX_DNS_LABEL_LEN) -> str:
    """
    Sanitize for ``spec.containers[].name`` / initContainer names.

    Kubernetes rejects dots and other subdomain punctuation here: names must be a
    single DNS **label** (``[a-z0-9]([-a-z0-9]*[a-z0-9])?``), max 63 characters.
    Job/PVC ``metadata.name`` may still contain dots; do not reuse that string
    verbatim as a container name.
    """
    s = (name_hint or "").strip().lower()
    s = re.sub(r"[^a-z0-9-]+", "-", s)
    s = re.sub(r"-+", "-", s).strip("-")
    if not s:
        s = "madengine-main"
    s = _trim_edges_alnum(s)
    if len(s) > max_len:
        digest = hashlib.sha256((name_hint or "").encode("utf-8")).hexdigest()[:8]
        room = max_len - len(digest) - 1
        if room < 4:
            return digest[:max_len]
        head = s[:room]
        head = _trim_edges_alnum(head)
        out = f"{digest}-{head}"
        if len(out) > max_len:
            out = out[:max_len]
        return _trim_edges_alnum(out) or "m"
    return s


def sanitize_k8s_label_value(raw: str, max_len: int = _MAX_LABEL_VALUE_LEN) -> str:
    """
    Sanitize a string for use as a Kubernetes **label value** (max 63 chars).

    Label values must be empty or begin/end with alphanumeric, with ``-``, ``_``, ``.`` inside.
    """
    s = (raw or "").strip().lower()
    s = re.sub(r"[^a-z0-9._-]+", "-", s)
    s = re.sub(r"-+", "-", s).strip("-_.")
    if not s:
        return "model"
    s = _trim_edges_alnum(s)
    if len(s) <= max_len:
        return s
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
    # digest + '-' + remainder
    remainder = max_len - len(digest) - 1
    if remainder < 4:
        return digest[:max_len]
    tail = s[:remainder]
    tail = _trim_edges_alnum(tail)
    out = f"{digest}-{tail}"
    if len(out) > max_len:
        out = out[:max_len]
    return _trim_edges_alnum(out)