Kubernetes deployment (split from PR #86) by raviguptaamd · Pull Request #118 · ROCm/madengine

raviguptaamd · 2026-05-01T16:12:03Z

Summary

Splits the Kubernetes deployment work out of PR #86 into a dedicated PR so reviewers can focus #86 on slurm_multi and related changes. This PR contains the full Kubernetes deployment stack — adapters, templates, presets, scripts, examples, and unit tests.

Why split?

PR #86 was originally based on coketaste/refactor-dist, which introduced both the SLURM and Kubernetes deployment paths. The PR base has since been retargeted to develop. Reviewers asked to keep #86 scoped to slurm_multi; this PR carries the rest forward independently.

What's in this PR (and only this PR)

src/madengine/deployment/
- kubernetes.py — KubernetesDeployment (auto-registered by factory.py when module is importable)
- kubernetes_launcher_mixin.py
- k8s_secrets.py
- presets/k8s/** — defaults, GPU vendor profiles, multi-/single-gpu / multi-node profiles
- templates/kubernetes/** — job.yaml.j2, configmap.yaml.j2, service.yaml.j2, pvc*.yaml.j2
src/madengine/scripts/k8s/** — data download wrappers (download_aws/local/minio/nas.sh), tooling, profiler/rocenv wrappers
examples/k8s-configs/** — minimal and basic configs for native, torchrun, sglang, vllm, megatron, deepspeed, slurm-multi-on-k8s
tests/unit/test_k8s.py

The shared / mixed orchestration & deployment files (run_orchestrator.py, build_orchestrator.py, base.py, common.py, config_loader.py, validators.py) that already contain k8s code paths land via PR #86; this PR re-introduces the runtime callers (KubernetesDeployment, secrets, etc.) once #86 merges.

Sequencing

Updates to slurm launcher #86 (slurm_multi, focused) merges into develop.
This PR rebases onto develop (the dormant k8s code paths in shared files become live again) and merges.
After both land, a follow-up cherry-picks the combined work into main per repo policy.

Test plan

tests/unit/test_k8s.py passes locally with kubernetes extras installed
madengine run --additional-context '{"deploy": "k8s", ...}' smoke test on a dev cluster
pytest tests/unit/test_config_loader.py tests/unit/test_deployment.py still passes (shared code unchanged in this PR)

Made with Cursor

…t all Docker operations

… for truly successful runs while correctly identifying and handling various failure scenarios.

Implement a batch input arg for madengine-cli build

…derlying build logic use the batch_build_metadata argument to perform the correct tagging and pushing for each model.

…n phase

…age from batch.json for each model.

Update the flow use per-model registry settings

…batch mode

…late updates (ROCm#88) * refactor(k8s): secrets handling, launcher mixin, and template updates - Add k8s_secrets helpers for image pull / runtime secrets with configurable strategies; add unit tests. - Introduce KubernetesLauncherMixin for torchrun and related launcher command generation; wire into kubernetes deployment. - Update K8s Job/ConfigMap templates, k8s defaults preset, and kubernetes.py. - Refresh configuration/deployment docs and k8s-configs README. - Adjust tests (conftest, dummy credential fixture) for new behavior. - Fix CLI merge of --additional-context with --additional-context-file when the CLI value uses Python dict syntax (fallback after JSON parse failure). * fix(k8s): Path import for vLLM/SGLang commands; public registry secret helper - Import Path at module level in kubernetes_launcher_mixin so _generate_vllm_command and _generate_sglang_command do not raise NameError; drop redundant inner import in _generate_torchrun_command. - Rename _build_registry_secret_data to build_registry_secret_data and document it as the supported API for dockerconfigjson preview logic. - Update kubernetes.py and unit tests accordingly.

ROCm#89) * feat(cli): add default values for gpu_vendor and guest_os in build command Add sensible defaults (AMD/UBUNTU) to simplify build operations for the common use case while maintaining full backward compatibility and explicit override capability. * fix(build): wire gpu_vendor/guest_os defaults into Context for Dockerfile filtering validate_additional_context already applied AMD/UBUNTU defaults, but the merged dict was not passed to BuildOrchestrator, so Context.filter() saw no vendor keys and could build both AMD and NVIDIA Dockerfiles. - Add core/additional_context_defaults.py with apply_build_context_defaults() - Refactor CLI validators to use the shared helper - build command: pass repr(validated context) and drop duplicate file load - BuildOrchestrator: apply defaults after merge; build Context from final additional_context post-ConfigLoader - Update unit/integration tests; add tests for defaults helper * feat(cli): validate additional context and scope GPU arch warnings Add structural checks for --additional-context (nested types, known keys) with clear typer.Exit messages and examples. Unify run merge/validate with build when context is non-empty. Remove the duplicate MAD_SYSTEM_GPU_ARCHITECTURE info line from build-only Context init. After model discovery, warn only when a selected Dockerfile needs ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default and the effective arch cannot be resolved (context or Dockerfile). Add dockerfile_requires_explicit_mad_arch_build_arg helper, drop duplicate DockerBuilder methods, and extend unit tests (validators, CLI, Dockerfile heuristic, orchestrator warning, context messages). * fix(cli): normalize gpu_vendor/guest_os; clean unused imports in validators - After validation, write canonical uppercase gpu_vendor and guest_os into the context dict so Context.filter() matches exactly. - Align test_validate_additional_context_case_insensitive with that behavior. - Drop unused imports: Panel in validators.py; patch/MagicMock in test_validators.py (flake8 F401). * docs(readme): fix misleading AMD/UBUNTU wording for `madengine run` quick start * docs(usage): fix misleading AMD/UBUNTU wording for `madengine run` quick start * Fixed the docker registry url * Added unit tests for docker builder

* fix(k8s): harden collect_results and align PVC defaults with examples - kubernetes: wait for terminal pod phase before recording per-node status; read primary workload container exit code by name (not container_statuses[0]) - presets + PVC templates: align shared/results PVC behavior with nfs-banff / RWX usage - examples: update torchrun single- and multi-node JSON configs; document in k8s-configs README * feat(k8s): per-pod results layout, PVC collection hardening, expand k8s tests Kubernetes deployment: - Collect PVC artifacts into k8s_results/<job>/<pod>/pvc/ by matching /results/<subdir>/ to full pod names (indexed Job pods, unmapped fallback). - Harden collector: short settle delay, retries for listing /results, explicit -c collector on kubectl exec/cp, clearer warnings when collection fails. Docs: - Document local k8s_results layout (pod.log vs pvc/) and multi-node Service behavior (PyTorch vs Ray notes in k8s-configs README). Tests: - Rename test_k8s_secrets.py to test_k8s.py; keep secrets/configmap tests and add unit tests for PVC subdir ↔ pod name mapping helpers.

Restore v1-style --skip-model-run on madengine run: model execution is skipped only when this invocation ran a build (_did_build_phase). If an existing --manifest-file is used, the flag is ignored and a warning is printed. - CLI: new --skip-model-run, passed through args; workflow panel and success messaging when the run phase is skipped - RunOrchestrator: short-circuit before local/distributed execute; empty run summary with skipped_model_run; use (skip_model_run is True) so MagicMock test doubles do not enable the path by accident - Tests: unit coverage for skip vs run-only; help text; conftest skip_model_run

* feat(execution): configurable log error pattern scan (additional_context) Allow disabling or tuning post-run log grep via additional_context and/or model_info: - log_error_pattern_scan / disable_log_error_scan to skip substring scan - log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs) - log_error_patterns to override default pattern list Extract resolution into container_runner_helpers.resolve_log_error_scan_config; validate new keys in CLI validators; add unit tests. Motivation: avoid false FAILURE from benign RuntimeError: lines in suites already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571). * feat(execution): configurable log error pattern scan; document and drop duplicate flag Add optional additional_context / model keys to control post-run log substring checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns). Resolve settings in container_runner_helpers; wire into ContainerRunner; validate CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only. Document behavior in README, docs/configuration.md, cli-reference, usage, profiling, and docs index. Cross-link troubleshooting for benign RuntimeError: false failures. Tests: extend test_container_runner_helpers; unit suite passes.

…egex; test validators (ROCm#93) * feat(execution): configurable log error pattern scan (additional_context) Allow disabling or tuning post-run log grep via additional_context and/or model_info: - log_error_pattern_scan / disable_log_error_scan to skip substring scan - log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs) - log_error_patterns to override default pattern list Extract resolution into container_runner_helpers.resolve_log_error_scan_config; validate new keys in CLI validators; add unit tests. Motivation: avoid false FAILURE from benign RuntimeError: lines in suites already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571). * feat(execution): configurable log error pattern scan; document and drop duplicate flag Add optional additional_context / model keys to control post-run log substring checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns). Resolve settings in container_runner_helpers; wire into ContainerRunner; validate CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only. Document behavior in README, docs/configuration.md, cli-reference, usage, profiling, and docs index. Cross-link troubleshooting for benign RuntimeError: false failures. Tests: extend test_container_runner_helpers; unit suite passes. * refactor: scan container logs in Python; validate log_error_* in tests Replace sh/grep-based log error detection with in-process scanning to avoid shell quoting and injection issues when patterns come from model_info or additional_context. - Add log_text_has_error_pattern() with literal benign substrings vs explicit regex benign rules (ROCProf). - Wire ContainerRunner to read the log once and use the helper. - Document log_error_benign_patterns as literal substrings in helpers. - Extend validate_additional_context tests for log_error_pattern_scan, log_error_benign_patterns, and log_error_patterns (valid + invalid).

Resolve conflicts with v2 branch: - context.py, reporting/update_perf_super.py: keep refactor-dis implementations - reporting/update_perf_csv.py: use main's safer column reorder (available_columns) - utils/config_parser.py: take main's path resolution (KNOWN_REPOS, candidates, cache) - Drop legacy tools/discover_models.py and tools/run_models.py (superseded by utils/) Unit tests: extend Context mocks so init_gpu_context does not require ROCm on CI. Made-with: Cursor

- Add docker_image field to built_images and built_models in --use-image mode - Merge model's distributed.launcher into deployment_config for proper detection - Make reservation optional (empty string is ignored) - Pass reservation to SlurmNodeSelector for health checks Co-authored-by: Cursor <cursoragent@cursor.com>

Add missing newline after 'fi' statement to prevent syntax error in generated sbatch script when using --build-on-compute option. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace legacy launcher names with unified slurm_multi: - Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS - Add slurm_multi as the only baremetal multi-node launcher - Update deployment logic in slurm.py and kubernetes.py - Rename example config files from sglang-disagg-* to slurm-multi-* - Update docs/launchers.md references Breaking change: sglang-disagg and vllm-disagg are no longer valid launcher names. Use slurm_multi instead. Made-with: Cursor

…ures docs/cli-reference.md: - Add --use-image and --build-on-compute options to build command - Add examples for pre-built image and compute node build workflows - Document when to use each mode docs/usage.md: - Add "Pre-built Image Mode" section with examples - Add "Build on Compute Node" section explaining workflow docs/deployment.md: - Add "SLURM Allocation Detection" section - Add "Baremetal Execution (slurm_multi)" section - Document environment variables and behavior inside salloc docs/launchers.md: - Fix example file paths (slurm_multi -> slurm-multi) Made-with: Cursor

--use-image changes: - Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME) - Add mutual exclusivity with --registry and --build-on-compute - Handle multiple models with different images (use first, warn) --build-on-compute changes: - Require --registry option (build once, push, pull everywhere) - Build on 1 compute node only, push to registry - Merge SLURM config from model card + additional-context (CLI overrides) - Add registry credential validation before build - Add docker login step in build script - Smart registry image naming (namespace/repo vs namespace formats) - Parallel docker pull on all nodes in run phase - Clear error messages for missing partition, credentials Run phase changes: - Detect built_on_compute flag in manifest - Pull image in parallel on all nodes via srun before model execution Documentation updated for new workflows. Made-with: Cursor

- slurm_multi launcher now requires --registry or --use-image - Parallel docker pull on all nodes for any registry image (not just built_on_compute) - DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds - Documentation updated for slurm_multi requirements Made-with: Cursor

Ensure consistent documentation across: - deployment.md - launchers.md - usage.md All now mention that slurm_multi requires --registry or --use-image. Made-with: Cursor

- Make gpu_vendor/guest_os optional when using --use-image (pre-built image) - Remove model name truncation in CLI results table - Merge model card slurm and distributed config into build_manifest.json - Add skip_monitoring flag for synchronous salloc execution - Add completion marker file for robust sbatch job completion detection Made-with: Cursor

…cher column - Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist is set via model card or --additional-context - Include nodelist in model card slurm merge key list so it propagates to the build manifest - Rename perf table column from "Launcher" to "Workload" for clarity Made-with: Cursor

…time) When --additional-context provides partial slurm overrides (e.g. only nodelist), ConfigLoader defaults were overwriting model card values like nodes and time. Fixed by capturing original user slurm keys before ConfigLoader runs, and deep-merging manifest slurm config with runtime context instead of replacing it wholesale. Made-with: Cursor

- Rename BAREMETAL_LAUNCHERS to SELF_MANAGED_LAUNCHERS to avoid confusion with PR ROCm#62's baremetal_vm (Docker-less nodes with VMs) - Rename _run_on_baremetal -> _run_self_managed, _prepare_baremetal_script -> _prepare_slurm_multi_script throughout - Add slurm_multi to common.py VALID_LAUNCHERS (canonical list) and add hyphen-variant normalization (slurm-multi -> slurm_multi) - Remove duplicate VALID_LAUNCHERS from kubernetes.py (now inherits from common.py) - Fix SBATCH output filename pattern: _%j -> _%j_%t to match the glob pattern in collect_results (madengine-*_{deployment_id}_*.out) - Add _collect_slurm_multi_results for explicit slurm_multi results collection path with perf.csv retry and NFS propagation handling - Add --use-image / --build-on-compute mutual exclusivity check in build_orchestrator.py - Document --use-image vs MAD_CONTAINER_IMAGE relationship in cli-reference.md Made-with: Cursor

Address all 9 inline comments from copilot-pull-request-reviewer[bot]: ROCm#1 build_orchestrator.py — _execute_with_prebuilt_image now keys manifest['built_models'] by model_name (not use_image), so multiple models that share the same pre-built image are all preserved in the manifest. ROCm#2 build_orchestrator.py — warn when discovered models have differing distributed/slurm configs in the prebuilt-image flow; the post-merge step still uses models[0]'s config but operators are now told. ROCm#3 build_orchestrator.py — _execute_build_on_compute() now raises ConfigurationError early when registry is None instead of falling into registry.replace/.split/.lower with NoneType. ROCm#4 build_orchestrator.py — credentials-required error now emits per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io) instead of Docker-Hub-only PAT guidance. ROCm#5 container_runner.py — document the shell=True trust boundary on the inner subprocess.run; cmd is internally constructed and any user model_args are routed through shlex-quoted assembly in the caller. ROCm#6 slurm.py — drop duplicate `from typing import Optional` import. ROCm#7 slurm.py — slurm_multi wrapper no longer hard-codes `#SBATCH --exclusive`; honours self.slurm_config.get('exclusive', True) to match the standard SLURM template behaviour. ROCm#8 slurm_node_selector.py — cleanup_node()'s srun_cmd is now built once and includes both --job-name (when provided) and --reservation (when set); the second in-try reassignment that dropped --job-name is gone. ROCm#9 run_orchestrator.py — replace the shallow `merged.update(...)` with a real recursive _deep_merge so the comment ("deep-merge") matches the behaviour: nested dicts under slurm/k8s/distributed/etc. are merged per-leaf, runtime --additional-context still wins on conflicts. Made-with: Cursor

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

coketaste added 30 commits July 11, 2025 15:58

Implemented a batch input arg for madengine-cli build

b22bc7b

enhanced logging system is now active and will automatically highligh…

768dcf9

…t all Docker operations

Fix the error local variable docker_image referenced before assignment

a4b324f

Updated the perf dataframe output

ebfb472

The fixes are backward compatible and maintain existing functionality…

e47572e

… for truly successful runs while correctly identifying and handling various failure scenarios.

Fixed the problematic log

3a73edc

Fixed the error pattern, removed the wrong string

e1000a4

Fixed the error of test prof

06934d3

Updated the interface of mad_cli

59dd584

Merge pull request ROCm#17 from ROCm/coketaste/refactor-stage

d696784

Implement a batch input arg for madengine-cli build

Update README.md

5821b3b

ensure that the DistributedOrchestrator.build_phase method and the un…

30f1329

…derlying build logic use the batch_build_metadata argument to perform the correct tagging and pushing for each model.

Updated the build batch manifest to distributed orchestrator

f6c18fa

Debug the batch manifest

11895f9

Update the flow use per-model registry settings for both build and ru…

27627aa

…n phase

correct registry image will be used for each model as intended

c7c6d37

The push_image function now accepts and uses the explicit registry_im…

7449493

…age from batch.json for each model.

Updated the explicit_registry_image assignment

7f2c63b

Debug the registry info setting

9f50d04

Updated the function of export build manifest

05f8a26

Add verbose for debugging

8f8dc88

Debug the export build manifest

de6b49c

Debug the registry extract from batch build metadata

f1a3905

Debug the exaction

d412956

Merge pull request ROCm#23 from ROCm/coketaste/refactor-batch

a03fa0d

Update the flow use per-model registry settings

Corrected the content of synthetic image which built_new is false in …

624cc29

…batch mode

Fixed the type error in additional context

af7ddb4

Debug the parsing of gpu vendoer and guest os

b5a800b

Correct the pattern of Dockerfile

bc18784

Updated the print

558b7af

coketaste and others added 22 commits March 23, 2026 17:41

refactor: improve code quality and streamline unit tests (ROCm#94)

8753fb6

Added ChangeLog for 2.0.0

fd0f4ec

Updates to slurm launcher

284f7d2

Fix syntax error in build-on-compute sbatch script generation

e2c089f

Add missing newline after 'fi' statement to prevent syntax error in generated sbatch script when using --build-on-compute option. Co-authored-by: Cursor <cursoragent@cursor.com>

Add slurm_multi registry requirement to all docs

b13e1f1

Ensure consistent documentation across: - deployment.md - launchers.md - usage.md All now mention that slurm_multi requires --registry or --use-image. Made-with: Cursor

Copilot AI review requested due to automatic review settings May 1, 2026 16:12

raviguptaamd requested review from Cemberk, Rohan138, coketaste, gargrahul and leconcio as code owners May 1, 2026 16:12

Copilot AI reviewed May 1, 2026

View reviewed changes

raviguptaamd mentioned this pull request May 1, 2026

Updates to slurm launcher #86

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kubernetes deployment (split from PR #86)#118

Kubernetes deployment (split from PR #86)#118
raviguptaamd wants to merge 332 commits intoROCm:developfrom
raviguptaamd:raviguptaamd/k8s_deployment

raviguptaamd commented May 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

raviguptaamd commented May 1, 2026

Summary

Why split?

What's in this PR (and only this PR)

Sequencing

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants