Skip to content

feat(deployment): Primus on local/K8s/SLURM#99

Merged
coketaste merged 3 commits intomainfrom
coketaste/v2-primus
Apr 15, 2026
Merged

feat(deployment): Primus on local/K8s/SLURM#99
coketaste merged 3 commits intomainfrom
coketaste/v2-primus

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Add first-class Primus launcher support for Kubernetes and SLURM: infer backend and examples overlay from model/config, sanitize Job/Service/ ConfigMap/container names (k8s_names), and extend templates and mixin command generation.

Docker: resolve repo-root build context for Primus Dockerfiles; align container run log filenames with build logs when the image is a full registry/repo:tag reference (container_runner_helpers).

Docs: launchers and k8s-config examples README.

Tests: integration (Docker context), unit coverage for execution helpers, K8s secrets/PVC/names, and consolidated Primus tests (test_primus).

Add first-class Primus launcher support for Kubernetes and SLURM: infer
backend and examples overlay from model/config, sanitize Job/Service/
ConfigMap/container names (k8s_names), and extend templates and mixin
command generation.

Docker: resolve repo-root build context for Primus Dockerfiles; align
container run log filenames with build logs when the image is a full
registry/repo:tag reference (container_runner_helpers).

Docs: launchers and k8s-config examples README.

Tests: integration (Docker context), unit coverage for execution helpers,
K8s secrets/PVC/names, and consolidated Primus tests (test_primus).
@coketaste coketaste self-assigned this Apr 10, 2026
@coketaste coketaste changed the title feat(deployment): Primus on K8s/SLURM, K8s naming, and run log parity feat(deployment): Primus on local/K8s/SLURM Apr 10, 2026
Copilot AI review requested due to automatic review settings April 14, 2026 14:57
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class Primus launcher support across local, Kubernetes, and SLURM deployments, plus supporting infra for K8s-safe naming and improved Docker context/log handling so Primus images/configs run consistently across environments.

Changes:

  • Introduce Primus backend/config inference helpers and integrate Primus launcher command/env generation for Kubernetes and SLURM.
  • Add Kubernetes name/label/container sanitization utilities and update K8s templates to use sanitized labels/container names.
  • Improve Docker build/run ergonomics for Primus (repo-root build context; normalize run log naming for full image refs) and add corresponding tests/docs.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/unit/test_primus.py New unit coverage for Primus backend/overlay inference and K8s Primus command generation
tests/unit/test_k8s.py Adds unit tests for K8s name/label/container sanitization utilities
tests/unit/test_execution.py Extends log naming tests to cover full registry refs mapping to CI tag
tests/integration/test_docker_integration.py Integration test ensuring Primus Docker builds use repo-root context
src/madengine/orchestration/run_orchestrator.py Restores docker_env_vars from manifest context during local run
src/madengine/execution/docker_builder.py Chooses repo-root Docker build context for Primus dockerfiles by hint
src/madengine/execution/container_runner_helpers.py Normalizes docker image refs for run log filename consistency
src/madengine/execution/container_runner.py Adds Primus env passthrough; resolves missing Primus images via fallback; preserves resolved image in results
src/madengine/deployment/templates/slurm/job.sh.j2 Avoids printing MAD_MULTI_NODE_RUNNER for launchers that don’t use it
src/madengine/deployment/templates/kubernetes/service.yaml.j2 Switches model label to sanitized model_label
src/madengine/deployment/templates/kubernetes/job.yaml.j2 Switches model label to model_label; uses main_container_name; includes primus/torchtitan in distributed env block
src/madengine/deployment/templates/kubernetes/configmap.yaml.j2 Switches model label to sanitized model_label
src/madengine/deployment/slurm.py Adds Primus launcher env generation (PRIMUS_CONFIG_PATH/CLI_EXTRA/BACKEND inference)
src/madengine/deployment/primus_backend.py New helper module: merges Primus config; infers BACKEND and examples overlay subdirs
src/madengine/deployment/kubernetes_launcher_mixin.py Adds K8s Primus launcher env/command generation
src/madengine/deployment/kubernetes.py Adds K8s name sanitization; bundles Primus examples overlay into ConfigMap; wires Primus launcher
src/madengine/deployment/k8s_names.py New K8s naming utilities for object names, label values, and container names
src/madengine/deployment/common.py Adds “primus” to valid launcher list
examples/k8s-configs/README.md Documents Primus-on-K8s behavior, caveats, and example configs
docs/launchers.md Adds Primus launcher documentation and updates launcher lists/tables
Comments suppressed due to low confidence (1)

src/madengine/deployment/templates/kubernetes/service.yaml.j2:12

  • service.yaml.j2 uses service_name as metadata.name and job.yaml.j2 uses the same value via pod.spec.subdomain/launcher DNS. Kubernetes Service names must be DNS labels (no dots, <=63). If service_name is derived from job_name that may contain dots (e.g. '1.7b') or exceed 63, the Service will be rejected and multi-node DNS (pod-0....) will fail. Ensure service_name is sanitized as a DNS label and keep subdomain/launcher DNS consistent with that sanitized value.
  name: {{ service_name }}
  namespace: {{ namespace }}
  labels:
    app: madengine
    model: {{ model_label }}
spec:
  clusterIP: None  # Headless service for torchrun coordination
  selector:
    job-name: {{ job_name }}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coketaste coketaste merged commit 641ae9f into main Apr 15, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants