73 changes: 62 additions & 11 deletions docs/launchers.md
@@ -16,6 +16,7 @@ madengine provides unified support for multiple distributed frameworks, enabling
| **DeepSpeed** | Training | ZeRO optimization training | ✅ | ✅ | ✅ |
| **Megatron-LM** | Training | Large-scale transformer training | ✅ | ✅ | ✅ |
| **TorchTitan** | Training | LLM pre-training (FSDP2+TP+PP) | ✅ | ✅ | ✅ |
| **Primus** | Training | Megatron/TorchTitan/Jax via Primus config | ✅ | ✅ | ✅ |
| **vLLM** | Inference | High-throughput LLM serving | ✅ | ✅ | ✅ |
| **SGLang** | Inference | Fast LLM inference | ✅ | ✅ | ✅ |
| **SGLang Disaggregated** | Inference | Large-scale disaggregated inference | ✅ | ✅ | ✅ (min 3) |
@@ -224,6 +225,49 @@ TORCHTITAN_CONTEXT_PARALLEL_SIZE=1
- K8s: `examples/k8s-configs/minimal/torchtitan-single-node-minimal.json`
- SLURM: `examples/slurm-configs/minimal/torchtitan-single-node-minimal.json`

---

### 5b. Primus

**Purpose**: Unified pretrain entry for Megatron-LM, TorchTitan, and Jax/MaxText via Primus experiment YAML.

**When to Use**:
- Run Primus example configs (e.g. `examples/megatron/configs/MI300X/*.yaml`) via madengine
- Single container image plus a config path; madengine provides scheduling and tools/metrics

**Configuration**:
```json
{
"distributed": {
"launcher": "primus",
"nnodes": 2,
"nproc_per_node": 8,
"primus": {
"config_path": "examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml",
"cli_extra": "",
"backend": "megatron"
}
}
}
```

Setting the optional **`primus.backend`** (e.g. `MaxText`, `megatron`) makes madengine emit `export BACKEND=...` before your model script when path-based detection is not enough. If omitted, madengine infers `BACKEND` from the **model name** when it follows `primus_pretrain/<launcher>_<arch>_...` (e.g. `primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain` → `torchtitan`), matching `scripts/primus_pretrain/get_models_json.py` in MAD-internal.
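The naming convention can be sketched as a tiny parser (an illustration of the documented `primus_pretrain/<launcher>_<arch>_...` pattern, not the actual `get_models_json.py` code):

```python
import re
from typing import Optional

# Illustrative only: mirrors the documented convention
# primus_pretrain/<launcher>_<arch>_... (not the MAD-internal script itself).
_NAME_RE = re.compile(r"^primus_pretrain/([a-z]+)_[A-Za-z0-9]+_")

def infer_backend(model_name: str) -> Optional[str]:
    """Return the backend token encoded in the model name, or None."""
    m = _NAME_RE.match(model_name)
    return m.group(1) if m else None

print(infer_backend("primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain"))  # torchtitan
```

A name that does not match the pattern yields `None`, which is when an explicit `primus.backend` becomes necessary.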

**Features**:
- Launcher sets `PRIMUS_CONFIG_PATH`, optional `PRIMUS_CLI_EXTRA`, and optional `BACKEND` from `primus.backend`; no `MAD_MULTI_NODE_RUNNER`
- Model script (e.g. `run.sh`) sets `EXP` and calls Primus `run_pretrain.sh`
- NNODES, NODE_RANK, MASTER_ADDR, etc. set by madengine job template
- Use with MAD-Internal Primus submodule and `scripts/primus_pretrain/run.sh`

**Container image**: Prefer `docker/primus.ubuntu.amd.Dockerfile` with `COPY scripts/Primus/ /workspace/Primus/` and `PRIMUS_ROOT=/workspace/Primus`. On **Kubernetes**, the Job’s emptyDir hides image files under `/workspace`; madengine bundles `scripts/Primus/examples/...` into the ConfigMap as `Primus/examples/...` so the init container recreates `/workspace/Primus`. `run.sh` resolves `PRIMUS_ROOT` in that order (see script comments).
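The lookup order can be sketched as follows (a Python illustration of the order described; the real resolution happens in `run.sh`, and the bundled-path name is an assumption):

```python
import os

def resolve_primus_root(script_dir: str = ".") -> str:
    """Sketch of the documented PRIMUS_ROOT lookup order (not run.sh itself):
    1. an explicit PRIMUS_ROOT env var (e.g. baked into the image),
    2. the image-provided /workspace/Primus checkout,
    3. a ConfigMap-bundled Primus/ directory next to the model script (K8s).
    """
    explicit = os.environ.get("PRIMUS_ROOT", "")
    if explicit and os.path.isdir(explicit):
        return explicit
    if os.path.isdir("/workspace/Primus"):
        return "/workspace/Primus"
    bundled = os.path.join(script_dir, "Primus")
    if os.path.isdir(bundled):
        return bundled
    raise FileNotFoundError("no Primus checkout found; set PRIMUS_ROOT explicitly")
```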

**Examples**:
- SLURM: `examples/slurm-configs/minimal/primus-minimal.json`
- K8s: `examples/k8s-configs/minimal/primus-minimal.json`
- K8s (Primus vs upstream workload API, MaxText caveats, TorchTitan/Megatron/MaxText sample JSON): `examples/k8s-configs/README.md` section **Primus on Kubernetes**

---

**Model Configuration** (TOML):
```toml
[model]
@@ -519,16 +563,16 @@ madengine run --tags model --config custom-split-config.json

### Training Launchers

| Feature | torchrun | DeepSpeed | Megatron-LM | TorchTitan |
|---------|----------|-----------|-------------|------------|
| **Data Parallel** | ✅ DDP | ✅ ZeRO | ✅ | ✅ FSDP2 |
| **Tensor Parallel** | ❌ | ❌ | ✅ | ✅ |
| **Pipeline Parallel** | ❌ | ✅ | ✅ | ✅ |
| **Memory Efficiency** | Medium | High (ZeRO) | High | Very High |
| **Ease of Use** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| **Model Size** | Small-Medium | Medium-Large | Very Large | Very Large |
| **K8s Support** | ✅ | ✅ | ✅ | ✅ |
| **SLURM Support** | ✅ | ✅ | ✅ | ✅ |
| Feature | torchrun | DeepSpeed | Megatron-LM | TorchTitan | Primus |
|---------|----------|-----------|-------------|------------|--------|
| **Data Parallel** | ✅ DDP | ✅ ZeRO | ✅ | ✅ FSDP2 | via config |
| **Tensor Parallel** | ❌ | ❌ | ✅ | ✅ | via config |
| **Pipeline Parallel** | ❌ | ✅ | ✅ | ✅ | via config |
| **Memory Efficiency** | Medium | High (ZeRO) | High | Very High | config-driven |
| **Ease of Use** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| **Model Size** | Small-Medium | Medium-Large | Very Large | Very Large | config-driven |
| **K8s Support** | ✅ | ✅ | ✅ | ✅ | ✅ |
| **SLURM Support** | ✅ | ✅ | ✅ | ✅ | ✅ |

### Inference Launchers

@@ -646,6 +690,13 @@ TORCHTITAN_FSDP_ENABLED=1
MAD_MULTI_NODE_RUNNER="torchrun ..."
```

**Primus**:
```bash
PRIMUS_CONFIG_PATH="examples/megatron/configs/MI300X/..."
PRIMUS_CLI_EXTRA="" # optional
# No MAD_MULTI_NODE_RUNNER (model script calls Primus run_pretrain.sh)
```

**vLLM**:
```bash
VLLM_TENSOR_PARALLEL_SIZE=4
@@ -681,7 +732,7 @@ SGLANG_NODE_RANK=${SLURM_PROCID}
```bash
Error: Unknown launcher type 'xyz'
```
Solution: Use one of: `torchrun`, `deepspeed`, `megatron`, `torchtitan`, `vllm`, `sglang`, `sglang-disagg`
Solution: Use one of: `torchrun`, `deepspeed`, `megatron`, `torchtitan`, `primus`, `vllm`, `sglang`, `sglang-disagg`

**2. Multi-Node Communication Fails**
```bash
37 changes: 35 additions & 2 deletions examples/k8s-configs/README.md
@@ -185,10 +185,33 @@ To validate rendered YAML after a debug run, install [kubeconform](https://githu

### Multi-node DNS (PyTorch vs Ray)

For **PyTorch-native** launchers (`torchrun`, `deepspeed`, `torchtitan`, `megatron`), multi-node Jobs use a **headless Service** whose name matches `pod.spec.subdomain`, per Kubernetes DNS rules, so pods get stable per-pod DNS names for rendezvous.
For **PyTorch-native** launchers (`torchrun`, `deepspeed`, `torchtitan`, `megatron`, `primus`), multi-node Jobs use a **headless Service** whose name matches `pod.spec.subdomain`, per Kubernetes DNS rules, so pods get stable per-pod DNS names for rendezvous.

For **Ray-based** multi-node (`vllm`, `sglang`), a headless Service may still be created for networking, but **per-pod DNS via `subdomain` is not applied** the same way as for PyTorch; production multi-node Ray on Kubernetes often uses **KubeRay** (see upstream vLLM / Ray docs). Treat Job-based multi-node Ray as a best-effort path.
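The stable per-pod name follows the standard Kubernetes pattern; a quick sketch (assuming the default `cluster.local` cluster domain, with hypothetical Job and Service names):

```python
def pod_fqdn(pod_hostname: str, subdomain: str, namespace: str,
             cluster_domain: str = "cluster.local") -> str:
    """Per-pod DNS name available when pod.spec.subdomain names a headless Service."""
    return f"{pod_hostname}.{subdomain}.{namespace}.svc.{cluster_domain}"

# e.g. the rank-0 pod of an Indexed Job, usable as MASTER_ADDR
print(pod_fqdn("trainjob-0", "trainjob-svc", "default"))
# trainjob-0.trainjob-svc.default.svc.cluster.local
```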

### Primus on Kubernetes

Upstream Primus separates two ideas: a **universal** training driver (`examples/run_pretrain.sh`, used by all backends) and an optional **Kubernetes workload API client** (`examples/run_k8s_pretrain.sh`) that talks to a remote service to create workloads. Those are not the same integration as madengine’s native K8s path.

| Approach | Role |
|----------|------|
| Primus `run_k8s_pretrain.sh` | HTTP client to an external **workload API** (replicas, image, `backend`, `exp`, node labels, etc.). Does **not** match madengine’s `kubectl`-style Job YAML flow. |
| madengine `distributed.launcher: "primus"` | Renders a standard **Job** (and headless **Service** when `nnodes > 1`), injects cluster env (`MASTER_ADDR`, `NNODES`, `NODE_RANK` / `JOB_COMPLETION_INDEX`, `GPUS_PER_NODE`), `PRIMUS_CONFIG_PATH`, optional `PRIMUS_CLI_EXTRA`, and optional `BACKEND` from `distributed.primus.backend`, then runs your model script (e.g. `scripts/primus_pretrain/run.sh`). |

**Backends** (one madengine launcher; Primus routes inside `run_pretrain.sh`):

| Experiment path (typical) | Backend |
|---------------------------|---------|
| `examples/torchtitan/...` | TorchTitan (PyTorch / `torchrun`) |
| `examples/megatron/...` | Megatron-LM |
| `examples/maxtext/...` | MaxText (JAX; coordinator env from master) |

Set `distributed.primus.config_path` to your YAML under the Primus repo layout. Use optional `distributed.primus.backend` (e.g. `MaxText`, `megatron`) to emit `export BACKEND=...` when you want to override the inferred backend. If `backend` is omitted, madengine sets `BACKEND` from the **model name** when it matches `primus_pretrain/<launcher>_<arch>_...` (e.g. `primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain` → `torchtitan`).

**MaxText caveat:** For **multi-node** MaxText, Primus `run_pretrain.sh` may run **in-container `apt` installs** (InfiniBand-related packages). Many clusters disallow that unless the image is pre-baked or policy allows it. madengine logs a **warning** when MaxText is detected (`backend` or path) and `nnodes > 1`.

Primus examples under [`basic/`](basic/): [`primus-single-node-multi-gpu.json`](basic/primus-single-node-multi-gpu.json) (one pod, multi-GPU) and [`primus-multi-node.json`](basic/primus-multi-node.json) (Indexed Job). The same files work for TorchTitan, Megatron, and MaxText: set `distributed.primus.config_path` to your experiment YAML, and rely on madengine’s `BACKEND` inference from the model name (`primus_pretrain/<launcher>_<arch>_...`), setting `distributed.primus.backend` only when you need an explicit override. Set `docker_env_vars.HF_TOKEN` to a placeholder or inject the token via runtime secrets — do not commit real tokens.

### Full Configs (Reference Examples)

Complete configurations showing all available fields:
@@ -213,6 +236,8 @@ Complete configurations showing all available fields:
| [`basic/torchtitan-multi-node-basic.json`](basic/torchtitan-multi-node-basic.json) | 8/node | 4 | TorchTitan | Llama 3.1 70B+ training |
| [`basic/vllm-multi-node-basic.json`](basic/vllm-multi-node-basic.json) | 4/node | 2 | vLLM | High-throughput inference |
| [`basic/sglang-multi-node-basic.json`](basic/sglang-multi-node-basic.json) | 4/node | 2 | SGLang | Distributed inference |
| [`basic/primus-single-node-multi-gpu.json`](basic/primus-single-node-multi-gpu.json) | 8 | 1 | primus | Primus pretrain (single pod; edit `primus.config_path`) |
| [`basic/primus-multi-node.json`](basic/primus-multi-node.json) | 8/node | 2+ | primus | Primus pretrain (multi-pod; edit `nnodes`, `primus.config_path`) |

---

@@ -551,13 +576,21 @@ Configuration for distributed workloads (training and inference):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `launcher` | string | - | Launcher type: `torchrun`, `deepspeed`, `torchtitan`, `vllm`, `sglang` |
| `launcher` | string | - | Launcher type: `torchrun`, `deepspeed`, `torchtitan`, `megatron`, `primus`, `vllm`, `sglang` |
| `enabled` | boolean | `false` | Enable distributed execution (legacy, prefer `launcher`) |
| `backend` | string | `"nccl"` | `"nccl"`, `"gloo"`, or `"mpi"` |
| `nnodes` | integer | `1` | Number of nodes |
| `nproc_per_node` | integer | gpu_count | Processes per node (= GPUs per node) |
| `master_port` | integer | `29500` | Master communication port |

When `launcher` is **`primus`**, set nested **`primus`** (under `distributed`):

| Field | Type | Description |
|-------|------|-------------|
| `primus.config_path` | string | Path to the Primus experiment YAML (e.g. under `examples/torchtitan/...`). |
| `primus.cli_extra` | string | Optional extra arguments passed to Primus CLI. |
| `primus.backend` | string | Optional. If set, madengine emits `export BACKEND=...` (e.g. `MaxText`, `megatron`) before your `run.sh`. |
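Putting the fields together, a minimal `distributed` block might look like this (the `config_path` value is illustrative; substitute your own experiment YAML):

```json
{
  "distributed": {
    "launcher": "primus",
    "nnodes": 2,
    "nproc_per_node": 8,
    "primus": {
      "config_path": "examples/torchtitan/configs/...",
      "cli_extra": "",
      "backend": "torchtitan"
    }
  }
}
```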

#### Environment Variables

Custom environment variables for containers:
1 change: 1 addition & 0 deletions src/madengine/deployment/common.py
@@ -17,6 +17,7 @@
"torchtitan",
"deepspeed",
"megatron-lm",
"primus",
"vllm",
"sglang",
"sglang-disagg"
140 changes: 140 additions & 0 deletions src/madengine/deployment/k8s_names.py
@@ -0,0 +1,140 @@
#!/usr/bin/env python3
"""
Kubernetes-safe names for metadata.name, label values, and container names.

Model names from data.json may contain ``/``, spaces, or uppercase letters that
are invalid for ``metadata.name`` (RFC 1123 subdomain) or for label values.
Container names must be a single DNS label (no dots), stricter than Job names.

Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
"""

from __future__ import annotations

import hashlib
import re
from typing import Final

# Kubernetes DNS subdomain total length (metadata.name)
_MAX_OBJECT_NAME_LEN: Final[int] = 253
# Label value max length
_MAX_LABEL_VALUE_LEN: Final[int] = 63
# Container / initContainer names: DNS label only (no dots); see Pod validation.
_MAX_DNS_LABEL_LEN: Final[int] = 63


def _trim_edges_alnum(s: str) -> str:
    """Ensure string starts and ends with [a-z0-9] (required for RFC1123 names)."""
    s = s.strip("-.")
    if not s:
        return "x"
    # Strip leading non-alphanumeric
    while s and not s[0].isalnum():
        s = s[1:]
    while s and not s[-1].isalnum():
        s = s[:-1]
    return s or "x"


def sanitize_k8s_object_name(prefix: str, raw_model_name: str, max_total_len: int = _MAX_OBJECT_NAME_LEN) -> str:
    """
    Build a valid ``metadata.name`` substring from a model name.

    Args:
        prefix: Leading segment (e.g. ``madengine``). May contain only chars valid in the final name.
        raw_model_name: Original model name (may include ``/``, ``_``, spaces).
        max_total_len: Maximum total length (default 253).

    Returns:
        A lowercase name safe for Kubernetes ``metadata.name`` (Job, PVC, Service, etc.).
    """
    raw = (raw_model_name or "").strip()
    pfx = (prefix or "").strip().lower()
    pfx = re.sub(r"[^a-z0-9.-]+", "-", pfx)
    pfx = re.sub(r"-+", "-", pfx).strip("-")
    if not pfx:
        pfx = "madengine"

    body = raw.lower()
    body = re.sub(r"[^a-z0-9.-]+", "-", body)
    body = re.sub(r"-+", "-", body).strip("-")
    if not body:
        body = "model"

    combined = f"{pfx}-{body}"
    combined = _trim_edges_alnum(combined)
    # Dots are allowed in RFC1123 but avoid double semantics; keep as-is if present
    if len(combined) <= max_total_len:
        return combined

    # Too long: stable short hash + truncated body
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    # room: prefix + "-" + digest + "-" + rest
    anchor = f"{pfx}-{digest}"
    room = max_total_len - len(anchor) - 1
    if room < 8:
        # Extreme: prefix alone too long — fall back to hash-only tail
        return _trim_edges_alnum(f"{digest}-{hashlib.sha256(raw.encode()).hexdigest()[:20]}")[:max_total_len]

    tail = body[:room] if room > 0 else ""
    tail = _trim_edges_alnum(tail) if tail else "m"
    out = f"{anchor}-{tail}"
    if len(out) > max_total_len:
        out = out[:max_total_len]
    return _trim_edges_alnum(out)


def sanitize_k8s_container_name(name_hint: str, max_len: int = _MAX_DNS_LABEL_LEN) -> str:
    """
    Sanitize for ``spec.containers[].name`` / initContainer names.

    Kubernetes rejects dots and other subdomain punctuation here: names must be a
    single DNS **label** (``[a-z0-9]([-a-z0-9]*[a-z0-9])?``), max 63 characters.
    Job/PVC ``metadata.name`` may still contain dots; do not reuse that string
    verbatim as a container name.
    """
    s = (name_hint or "").strip().lower()
    s = re.sub(r"[^a-z0-9-]+", "-", s)
    s = re.sub(r"-+", "-", s).strip("-")
    if not s:
        s = "madengine-main"
    s = _trim_edges_alnum(s)
    if len(s) > max_len:
        digest = hashlib.sha256((name_hint or "").encode("utf-8")).hexdigest()[:8]
        room = max_len - len(digest) - 1
        if room < 4:
            return digest[:max_len]
        head = s[:room]
        head = _trim_edges_alnum(head)
        out = f"{digest}-{head}"
        if len(out) > max_len:
            out = out[:max_len]
        return _trim_edges_alnum(out) or "m"
    return s


def sanitize_k8s_label_value(raw: str, max_len: int = _MAX_LABEL_VALUE_LEN) -> str:
    """
    Sanitize a string for use as a Kubernetes **label value** (max 63 chars).

    Label values must be empty or begin/end with alphanumeric, with ``-``, ``_``, ``.`` inside.
    """
    s = (raw or "").strip().lower()
    s = re.sub(r"[^a-z0-9._-]+", "-", s)
    s = re.sub(r"-+", "-", s).strip("-_.")
    if not s:
        return "model"
    s = _trim_edges_alnum(s)
    if len(s) <= max_len:
        return s
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
    # digest + '-' + remainder
    remainder = max_len - len(digest) - 1
    if remainder < 4:
        return digest[:max_len]
    tail = s[:remainder]
    tail = _trim_edges_alnum(tail)
    out = f"{digest}-{tail}"
    if len(out) > max_len:
        out = out[:max_len]
    return _trim_edges_alnum(out)