Skip to content

Kubernetes deployment (split from PR #86)#118

Open
raviguptaamd wants to merge 332 commits intoROCm:developfrom
raviguptaamd:raviguptaamd/k8s_deployment
Open

Kubernetes deployment (split from PR #86)#118
raviguptaamd wants to merge 332 commits intoROCm:developfrom
raviguptaamd:raviguptaamd/k8s_deployment

Conversation

@raviguptaamd
Copy link
Copy Markdown

Summary

Splits the Kubernetes deployment work out of PR #86 into a dedicated PR so reviewers can focus #86 on slurm_multi and related changes. This PR contains the full Kubernetes deployment stack — adapters, templates, presets, scripts, examples, and unit tests.

Why split?

PR #86 was originally based on coketaste/refactor-dist, which introduced both the SLURM and Kubernetes deployment paths. The PR base has since been retargeted to develop. Reviewers asked to keep #86 scoped to slurm_multi; this PR carries the rest forward independently.

What's in this PR (and only this PR)

  • src/madengine/deployment/
    • kubernetes.pyKubernetesDeployment (auto-registered by factory.py when module is importable)
    • kubernetes_launcher_mixin.py
    • k8s_secrets.py
    • presets/k8s/** — defaults, GPU vendor profiles, multi-/single-gpu / multi-node profiles
    • templates/kubernetes/**job.yaml.j2, configmap.yaml.j2, service.yaml.j2, pvc*.yaml.j2
  • src/madengine/scripts/k8s/** — data download wrappers (download_aws/local/minio/nas.sh), tooling, profiler/rocenv wrappers
  • examples/k8s-configs/** — minimal and basic configs for native, torchrun, sglang, vllm, megatron, deepspeed, slurm-multi-on-k8s
  • tests/unit/test_k8s.py

The shared / mixed orchestration & deployment files (run_orchestrator.py, build_orchestrator.py, base.py, common.py, config_loader.py, validators.py) that already contain k8s code paths land via PR #86; this PR re-introduces the runtime callers (KubernetesDeployment, secrets, etc.) once #86 merges.

Sequencing

  1. Updates to slurm launcher #86 (slurm_multi, focused) merges into develop.
  2. This PR rebases onto develop (the dormant k8s code paths in shared files become live again) and merges.
  3. After both land, a follow-up cherry-picks the combined work into main per repo policy.

Test plan

  • tests/unit/test_k8s.py passes locally with kubernetes extras installed
  • madengine run --additional-context '{"deploy": "k8s", ...}' smoke test on a dev cluster
  • pytest tests/unit/test_config_loader.py tests/unit/test_deployment.py still passes (shared code unchanged in this PR)

Made with Cursor

coketaste added 30 commits July 11, 2025 15:58
… for truly successful runs while correctly identifying and handling various failure scenarios.
Implement a batch input arg for madengine-cli build
…derlying build logic use the batch_build_metadata argument to perform the correct tagging and pushing for each model.
Update the flow use per-model registry settings
coketaste and others added 22 commits March 23, 2026 17:41
…late updates (ROCm#88)

* refactor(k8s): secrets handling, launcher mixin, and template updates

- Add k8s_secrets helpers for image pull / runtime secrets with configurable
  strategies; add unit tests.
- Introduce KubernetesLauncherMixin for torchrun and related launcher command
  generation; wire into kubernetes deployment.
- Update K8s Job/ConfigMap templates, k8s defaults preset, and kubernetes.py.
- Refresh configuration/deployment docs and k8s-configs README.
- Adjust tests (conftest, dummy credential fixture) for new behavior.
- Fix CLI merge of --additional-context with --additional-context-file when
  the CLI value uses Python dict syntax (fallback after JSON parse failure).

* fix(k8s): Path import for vLLM/SGLang commands; public registry secret helper

- Import Path at module level in kubernetes_launcher_mixin so
  _generate_vllm_command and _generate_sglang_command do not raise
  NameError; drop redundant inner import in _generate_torchrun_command.
- Rename _build_registry_secret_data to build_registry_secret_data and
  document it as the supported API for dockerconfigjson preview logic.
- Update kubernetes.py and unit tests accordingly.
ROCm#89)

* feat(cli): add default values for gpu_vendor and guest_os in build command

Add sensible defaults (AMD/UBUNTU) to simplify build operations for the
common use case while maintaining full backward compatibility and explicit
override capability.

* fix(build): wire gpu_vendor/guest_os defaults into Context for Dockerfile filtering

validate_additional_context already applied AMD/UBUNTU defaults, but the merged
dict was not passed to BuildOrchestrator, so Context.filter() saw no vendor keys
and could build both AMD and NVIDIA Dockerfiles.

- Add core/additional_context_defaults.py with apply_build_context_defaults()
- Refactor CLI validators to use the shared helper
- build command: pass repr(validated context) and drop duplicate file load
- BuildOrchestrator: apply defaults after merge; build Context from final
  additional_context post-ConfigLoader
- Update unit/integration tests; add tests for defaults helper

* feat(cli): validate additional context and scope GPU arch warnings

Add structural checks for --additional-context (nested types, known keys)
with clear typer.Exit messages and examples. Unify run merge/validate with
build when context is non-empty.

Remove the duplicate MAD_SYSTEM_GPU_ARCHITECTURE info line from build-only
Context init. After model discovery, warn only when a selected Dockerfile
needs ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default and the effective
arch cannot be resolved (context or Dockerfile).

Add dockerfile_requires_explicit_mad_arch_build_arg helper, drop duplicate
DockerBuilder methods, and extend unit tests (validators, CLI, Dockerfile
heuristic, orchestrator warning, context messages).

* fix(cli): normalize gpu_vendor/guest_os; clean unused imports in validators
- After validation, write canonical uppercase gpu_vendor and guest_os into the
  context dict so Context.filter() matches exactly.
- Align test_validate_additional_context_case_insensitive with that behavior.
- Drop unused imports: Panel in validators.py; patch/MagicMock in
  test_validators.py (flake8 F401).

* docs(readme): fix misleading AMD/UBUNTU wording for `madengine run` quick start

* docs(usage): fix misleading AMD/UBUNTU wording for `madengine run` quick start

* Fixed the docker registry url

* Added unit tests for docker builder
* fix(k8s): harden collect_results and align PVC defaults with examples
- kubernetes: wait for terminal pod phase before recording per-node status;
  read primary workload container exit code by name (not container_statuses[0])
- presets + PVC templates: align shared/results PVC behavior with nfs-banff / RWX usage
- examples: update torchrun single- and multi-node JSON configs; document in k8s-configs README

* feat(k8s): per-pod results layout, PVC collection hardening, expand k8s tests
Kubernetes deployment:
- Collect PVC artifacts into k8s_results/<job>/<pod>/pvc/ by matching
  /results/<subdir>/ to full pod names (indexed Job pods, unmapped fallback).
- Harden collector: short settle delay, retries for listing /results, explicit
  -c collector on kubectl exec/cp, clearer warnings when collection fails.

Docs:
- Document local k8s_results layout (pod.log vs pvc/) and multi-node Service
  behavior (PyTorch vs Ray notes in k8s-configs README).

Tests:
- Rename test_k8s_secrets.py to test_k8s.py; keep secrets/configmap tests and
  add unit tests for PVC subdir ↔ pod name mapping helpers.
Restore v1-style --skip-model-run on madengine run: model execution
is skipped only when this invocation ran a build (_did_build_phase). If an
existing --manifest-file is used, the flag is ignored and a warning is printed.

- CLI: new --skip-model-run, passed through args; workflow panel and success
  messaging when the run phase is skipped
- RunOrchestrator: short-circuit before local/distributed execute; empty run
  summary with skipped_model_run; use (skip_model_run is True) so MagicMock test
  doubles do not enable the path by accident
- Tests: unit coverage for skip vs run-only; help text; conftest skip_model_run
* feat(execution): configurable log error pattern scan (additional_context)
Allow disabling or tuning post-run log grep via additional_context and/or
model_info:

- log_error_pattern_scan / disable_log_error_scan to skip substring scan
- log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs)
- log_error_patterns to override default pattern list

Extract resolution into container_runner_helpers.resolve_log_error_scan_config;
validate new keys in CLI validators; add unit tests.

Motivation: avoid false FAILURE from benign RuntimeError: lines in suites
already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571).

* feat(execution): configurable log error pattern scan; document and drop duplicate flag
Add optional additional_context / model keys to control post-run log substring
checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns).
Resolve settings in container_runner_helpers; wire into ContainerRunner; validate
CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only.

Document behavior in README, docs/configuration.md, cli-reference, usage, profiling,
and docs index. Cross-link troubleshooting for benign RuntimeError: false failures.

Tests: extend test_container_runner_helpers; unit suite passes.
…egex; test validators (ROCm#93)

* feat(execution): configurable log error pattern scan (additional_context)
Allow disabling or tuning post-run log grep via additional_context and/or
model_info:

- log_error_pattern_scan / disable_log_error_scan to skip substring scan
- log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs)
- log_error_patterns to override default pattern list

Extract resolution into container_runner_helpers.resolve_log_error_scan_config;
validate new keys in CLI validators; add unit tests.

Motivation: avoid false FAILURE from benign RuntimeError: lines in suites
already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571).

* feat(execution): configurable log error pattern scan; document and drop duplicate flag
Add optional additional_context / model keys to control post-run log substring
checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns).
Resolve settings in container_runner_helpers; wire into ContainerRunner; validate
CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only.

Document behavior in README, docs/configuration.md, cli-reference, usage, profiling,
and docs index. Cross-link troubleshooting for benign RuntimeError: false failures.

Tests: extend test_container_runner_helpers; unit suite passes.

* refactor: scan container logs in Python; validate log_error_* in tests
Replace sh/grep-based log error detection with in-process scanning to avoid
shell quoting and injection issues when patterns come from model_info or
additional_context.

- Add log_text_has_error_pattern() with literal benign substrings vs
  explicit regex benign rules (ROCProf).
- Wire ContainerRunner to read the log once and use the helper.
- Document log_error_benign_patterns as literal substrings in helpers.
- Extend validate_additional_context tests for log_error_pattern_scan,
  log_error_benign_patterns, and log_error_patterns (valid + invalid).
Resolve conflicts with v2 branch:
- context.py, reporting/update_perf_super.py: keep refactor-dis implementations
- reporting/update_perf_csv.py: use main's safer column reorder (available_columns)
- utils/config_parser.py: take main's path resolution (KNOWN_REPOS, candidates, cache)
- Drop legacy tools/discover_models.py and tools/run_models.py (superseded by utils/)

Unit tests: extend Context mocks so init_gpu_context does not require ROCm on CI.

Made-with: Cursor
- Add docker_image field to built_images and built_models in --use-image mode
- Merge model's distributed.launcher into deployment_config for proper detection
- Make reservation optional (empty string is ignored)
- Pass reservation to SlurmNodeSelector for health checks

Co-authored-by: Cursor <cursoragent@cursor.com>
Add missing newline after 'fi' statement to prevent syntax error
in generated sbatch script when using --build-on-compute option.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace legacy launcher names with unified slurm_multi:
- Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS
- Add slurm_multi as the only baremetal multi-node launcher
- Update deployment logic in slurm.py and kubernetes.py
- Rename example config files from sglang-disagg-* to slurm-multi-*
- Update docs/launchers.md references

Breaking change: sglang-disagg and vllm-disagg are no longer valid
launcher names. Use slurm_multi instead.

Made-with: Cursor
…ures

docs/cli-reference.md:
- Add --use-image and --build-on-compute options to build command
- Add examples for pre-built image and compute node build workflows
- Document when to use each mode

docs/usage.md:
- Add "Pre-built Image Mode" section with examples
- Add "Build on Compute Node" section explaining workflow

docs/deployment.md:
- Add "SLURM Allocation Detection" section
- Add "Baremetal Execution (slurm_multi)" section
- Document environment variables and behavior inside salloc

docs/launchers.md:
- Fix example file paths (slurm_multi -> slurm-multi)

Made-with: Cursor
--use-image changes:
- Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME)
- Add mutual exclusivity with --registry and --build-on-compute
- Handle multiple models with different images (use first, warn)

--build-on-compute changes:
- Require --registry option (build once, push, pull everywhere)
- Build on 1 compute node only, push to registry
- Merge SLURM config from model card + additional-context (CLI overrides)
- Add registry credential validation before build
- Add docker login step in build script
- Smart registry image naming (namespace/repo vs namespace formats)
- Parallel docker pull on all nodes in run phase
- Clear error messages for missing partition, credentials

Run phase changes:
- Detect built_on_compute flag in manifest
- Pull image in parallel on all nodes via srun before model execution

Documentation updated for new workflows.

Made-with: Cursor
- slurm_multi launcher now requires --registry or --use-image
- Parallel docker pull on all nodes for any registry image (not just built_on_compute)
- DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds
- Documentation updated for slurm_multi requirements

Made-with: Cursor
Ensure consistent documentation across:
- deployment.md
- launchers.md
- usage.md

All now mention that slurm_multi requires --registry or --use-image.

Made-with: Cursor
- Make gpu_vendor/guest_os optional when using --use-image (pre-built image)
- Remove model name truncation in CLI results table
- Merge model card slurm and distributed config into build_manifest.json
- Add skip_monitoring flag for synchronous salloc execution
- Add completion marker file for robust sbatch job completion detection

Made-with: Cursor
…cher column

- Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist
  is set via model card or --additional-context
- Include nodelist in model card slurm merge key list so it propagates
  to the build manifest
- Rename perf table column from "Launcher" to "Workload" for clarity

Made-with: Cursor
…time)

When --additional-context provides partial slurm overrides (e.g. only
nodelist), ConfigLoader defaults were overwriting model card values like
nodes and time. Fixed by capturing original user slurm keys before
ConfigLoader runs, and deep-merging manifest slurm config with runtime
context instead of replacing it wholesale.

Made-with: Cursor
- Rename BAREMETAL_LAUNCHERS to SELF_MANAGED_LAUNCHERS to avoid
  confusion with PR ROCm#62's baremetal_vm (Docker-less nodes with VMs)
- Rename _run_on_baremetal -> _run_self_managed, _prepare_baremetal_script
  -> _prepare_slurm_multi_script throughout
- Add slurm_multi to common.py VALID_LAUNCHERS (canonical list) and
  add hyphen-variant normalization (slurm-multi -> slurm_multi)
- Remove duplicate VALID_LAUNCHERS from kubernetes.py (now inherits
  from common.py)
- Fix SBATCH output filename pattern: _%j -> _%j_%t to match the glob
  pattern in collect_results (madengine-*_{deployment_id}_*.out)
- Add _collect_slurm_multi_results for explicit slurm_multi results
  collection path with perf.csv retry and NFS propagation handling
- Add --use-image / --build-on-compute mutual exclusivity check in
  build_orchestrator.py
- Document --use-image vs MAD_CONTAINER_IMAGE relationship in
  cli-reference.md

Made-with: Cursor
Address all 9 inline comments from copilot-pull-request-reviewer[bot]:

ROCm#1 build_orchestrator.py — _execute_with_prebuilt_image now keys
   manifest['built_models'] by model_name (not use_image), so multiple
   models that share the same pre-built image are all preserved in the
   manifest.

ROCm#2 build_orchestrator.py — warn when discovered models have differing
   distributed/slurm configs in the prebuilt-image flow; the post-merge
   step still uses models[0]'s config but operators are now told.

ROCm#3 build_orchestrator.py — _execute_build_on_compute() now raises
   ConfigurationError early when registry is None instead of falling
   into registry.replace/.split/.lower with NoneType.

ROCm#4 build_orchestrator.py — credentials-required error now emits
   per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io)
   instead of Docker-Hub-only PAT guidance.

ROCm#5 container_runner.py — document the shell=True trust boundary on the
   inner subprocess.run; cmd is internally constructed and any user
   model_args are routed through shlex-quoted assembly in the caller.

ROCm#6 slurm.py — drop duplicate `from typing import Optional` import.

ROCm#7 slurm.py — slurm_multi wrapper no longer hard-codes
   `#SBATCH --exclusive`; honours self.slurm_config.get('exclusive', True)
   to match the standard SLURM template behaviour.

ROCm#8 slurm_node_selector.py — cleanup_node()'s srun_cmd is now built once
   and includes both --job-name (when provided) and --reservation (when
   set); the second in-try reassignment that dropped --job-name is gone.

ROCm#9 run_orchestrator.py — replace the shallow `merged.update(...)` with
   a real recursive _deep_merge so the comment ("deep-merge") matches the
   behaviour: nested dicts under slurm/k8s/distributed/etc. are merged
   per-leaf, runtime --additional-context still wins on conflicts.

Made-with: Cursor
Copilot AI review requested due to automatic review settings May 1, 2026 16:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants