Kubernetes deployment (split from PR #86)#118
Open
raviguptaamd wants to merge 332 commits intoROCm:developfrom
Open
Kubernetes deployment (split from PR #86)#118raviguptaamd wants to merge 332 commits intoROCm:developfrom
raviguptaamd wants to merge 332 commits intoROCm:developfrom
Conversation
…t all Docker operations
… for truly successful runs while correctly identifying and handling various failure scenarios.
Implement a batch input arg for madengine-cli build
…derlying build logic use the batch_build_metadata argument to perform the correct tagging and pushing for each model.
…age from batch.json for each model.
Update the flow use per-model registry settings
…late updates (ROCm#88) * refactor(k8s): secrets handling, launcher mixin, and template updates - Add k8s_secrets helpers for image pull / runtime secrets with configurable strategies; add unit tests. - Introduce KubernetesLauncherMixin for torchrun and related launcher command generation; wire into kubernetes deployment. - Update K8s Job/ConfigMap templates, k8s defaults preset, and kubernetes.py. - Refresh configuration/deployment docs and k8s-configs README. - Adjust tests (conftest, dummy credential fixture) for new behavior. - Fix CLI merge of --additional-context with --additional-context-file when the CLI value uses Python dict syntax (fallback after JSON parse failure). * fix(k8s): Path import for vLLM/SGLang commands; public registry secret helper - Import Path at module level in kubernetes_launcher_mixin so _generate_vllm_command and _generate_sglang_command do not raise NameError; drop redundant inner import in _generate_torchrun_command. - Rename _build_registry_secret_data to build_registry_secret_data and document it as the supported API for dockerconfigjson preview logic. - Update kubernetes.py and unit tests accordingly.
ROCm#89) * feat(cli): add default values for gpu_vendor and guest_os in build command Add sensible defaults (AMD/UBUNTU) to simplify build operations for the common use case while maintaining full backward compatibility and explicit override capability. * fix(build): wire gpu_vendor/guest_os defaults into Context for Dockerfile filtering validate_additional_context already applied AMD/UBUNTU defaults, but the merged dict was not passed to BuildOrchestrator, so Context.filter() saw no vendor keys and could build both AMD and NVIDIA Dockerfiles. - Add core/additional_context_defaults.py with apply_build_context_defaults() - Refactor CLI validators to use the shared helper - build command: pass repr(validated context) and drop duplicate file load - BuildOrchestrator: apply defaults after merge; build Context from final additional_context post-ConfigLoader - Update unit/integration tests; add tests for defaults helper * feat(cli): validate additional context and scope GPU arch warnings Add structural checks for --additional-context (nested types, known keys) with clear typer.Exit messages and examples. Unify run merge/validate with build when context is non-empty. Remove the duplicate MAD_SYSTEM_GPU_ARCHITECTURE info line from build-only Context init. After model discovery, warn only when a selected Dockerfile needs ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default and the effective arch cannot be resolved (context or Dockerfile). Add dockerfile_requires_explicit_mad_arch_build_arg helper, drop duplicate DockerBuilder methods, and extend unit tests (validators, CLI, Dockerfile heuristic, orchestrator warning, context messages). * fix(cli): normalize gpu_vendor/guest_os; clean unused imports in validators - After validation, write canonical uppercase gpu_vendor and guest_os into the context dict so Context.filter() matches exactly. - Align test_validate_additional_context_case_insensitive with that behavior. - Drop unused imports: Panel in validators.py; patch/MagicMock in test_validators.py (flake8 F401). * docs(readme): fix misleading AMD/UBUNTU wording for `madengine run` quick start * docs(usage): fix misleading AMD/UBUNTU wording for `madengine run` quick start * Fixed the docker registry url * Added unit tests for docker builder
* fix(k8s): harden collect_results and align PVC defaults with examples - kubernetes: wait for terminal pod phase before recording per-node status; read primary workload container exit code by name (not container_statuses[0]) - presets + PVC templates: align shared/results PVC behavior with nfs-banff / RWX usage - examples: update torchrun single- and multi-node JSON configs; document in k8s-configs README * feat(k8s): per-pod results layout, PVC collection hardening, expand k8s tests Kubernetes deployment: - Collect PVC artifacts into k8s_results/<job>/<pod>/pvc/ by matching /results/<subdir>/ to full pod names (indexed Job pods, unmapped fallback). - Harden collector: short settle delay, retries for listing /results, explicit -c collector on kubectl exec/cp, clearer warnings when collection fails. Docs: - Document local k8s_results layout (pod.log vs pvc/) and multi-node Service behavior (PyTorch vs Ray notes in k8s-configs README). Tests: - Rename test_k8s_secrets.py to test_k8s.py; keep secrets/configmap tests and add unit tests for PVC subdir ↔ pod name mapping helpers.
Restore v1-style --skip-model-run on madengine run: model execution is skipped only when this invocation ran a build (_did_build_phase). If an existing --manifest-file is used, the flag is ignored and a warning is printed. - CLI: new --skip-model-run, passed through args; workflow panel and success messaging when the run phase is skipped - RunOrchestrator: short-circuit before local/distributed execute; empty run summary with skipped_model_run; use (skip_model_run is True) so MagicMock test doubles do not enable the path by accident - Tests: unit coverage for skip vs run-only; help text; conftest skip_model_run
* feat(execution): configurable log error pattern scan (additional_context) Allow disabling or tuning post-run log grep via additional_context and/or model_info: - log_error_pattern_scan / disable_log_error_scan to skip substring scan - log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs) - log_error_patterns to override default pattern list Extract resolution into container_runner_helpers.resolve_log_error_scan_config; validate new keys in CLI validators; add unit tests. Motivation: avoid false FAILURE from benign RuntimeError: lines in suites already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571). * feat(execution): configurable log error pattern scan; document and drop duplicate flag Add optional additional_context / model keys to control post-run log substring checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns). Resolve settings in container_runner_helpers; wire into ContainerRunner; validate CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only. Document behavior in README, docs/configuration.md, cli-reference, usage, profiling, and docs index. Cross-link troubleshooting for benign RuntimeError: false failures. Tests: extend test_container_runner_helpers; unit suite passes.
…egex; test validators (ROCm#93) * feat(execution): configurable log error pattern scan (additional_context) Allow disabling or tuning post-run log grep via additional_context and/or model_info: - log_error_pattern_scan / disable_log_error_scan to skip substring scan - log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs) - log_error_patterns to override default pattern list Extract resolution into container_runner_helpers.resolve_log_error_scan_config; validate new keys in CLI validators; add unit tests. Motivation: avoid false FAILURE from benign RuntimeError: lines in suites already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571). * feat(execution): configurable log error pattern scan; document and drop duplicate flag Add optional additional_context / model keys to control post-run log substring checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns). Resolve settings in container_runner_helpers; wire into ContainerRunner; validate CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only. Document behavior in README, docs/configuration.md, cli-reference, usage, profiling, and docs index. Cross-link troubleshooting for benign RuntimeError: false failures. Tests: extend test_container_runner_helpers; unit suite passes. * refactor: scan container logs in Python; validate log_error_* in tests Replace sh/grep-based log error detection with in-process scanning to avoid shell quoting and injection issues when patterns come from model_info or additional_context. - Add log_text_has_error_pattern() with literal benign substrings vs explicit regex benign rules (ROCProf). - Wire ContainerRunner to read the log once and use the helper. - Document log_error_benign_patterns as literal substrings in helpers. - Extend validate_additional_context tests for log_error_pattern_scan, log_error_benign_patterns, and log_error_patterns (valid + invalid).
Resolve conflicts with v2 branch: - context.py, reporting/update_perf_super.py: keep refactor-dis implementations - reporting/update_perf_csv.py: use main's safer column reorder (available_columns) - utils/config_parser.py: take main's path resolution (KNOWN_REPOS, candidates, cache) - Drop legacy tools/discover_models.py and tools/run_models.py (superseded by utils/) Unit tests: extend Context mocks so init_gpu_context does not require ROCm on CI. Made-with: Cursor
- Add docker_image field to built_images and built_models in --use-image mode - Merge model's distributed.launcher into deployment_config for proper detection - Make reservation optional (empty string is ignored) - Pass reservation to SlurmNodeSelector for health checks Co-authored-by: Cursor <cursoragent@cursor.com>
Add missing newline after 'fi' statement to prevent syntax error in generated sbatch script when using --build-on-compute option. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace legacy launcher names with unified slurm_multi: - Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS - Add slurm_multi as the only baremetal multi-node launcher - Update deployment logic in slurm.py and kubernetes.py - Rename example config files from sglang-disagg-* to slurm-multi-* - Update docs/launchers.md references Breaking change: sglang-disagg and vllm-disagg are no longer valid launcher names. Use slurm_multi instead. Made-with: Cursor
…ures docs/cli-reference.md: - Add --use-image and --build-on-compute options to build command - Add examples for pre-built image and compute node build workflows - Document when to use each mode docs/usage.md: - Add "Pre-built Image Mode" section with examples - Add "Build on Compute Node" section explaining workflow docs/deployment.md: - Add "SLURM Allocation Detection" section - Add "Baremetal Execution (slurm_multi)" section - Document environment variables and behavior inside salloc docs/launchers.md: - Fix example file paths (slurm_multi -> slurm-multi) Made-with: Cursor
--use-image changes: - Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME) - Add mutual exclusivity with --registry and --build-on-compute - Handle multiple models with different images (use first, warn) --build-on-compute changes: - Require --registry option (build once, push, pull everywhere) - Build on 1 compute node only, push to registry - Merge SLURM config from model card + additional-context (CLI overrides) - Add registry credential validation before build - Add docker login step in build script - Smart registry image naming (namespace/repo vs namespace formats) - Parallel docker pull on all nodes in run phase - Clear error messages for missing partition, credentials Run phase changes: - Detect built_on_compute flag in manifest - Pull image in parallel on all nodes via srun before model execution Documentation updated for new workflows. Made-with: Cursor
- slurm_multi launcher now requires --registry or --use-image - Parallel docker pull on all nodes for any registry image (not just built_on_compute) - DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds - Documentation updated for slurm_multi requirements Made-with: Cursor
Ensure consistent documentation across: - deployment.md - launchers.md - usage.md All now mention that slurm_multi requires --registry or --use-image. Made-with: Cursor
- Make gpu_vendor/guest_os optional when using --use-image (pre-built image) - Remove model name truncation in CLI results table - Merge model card slurm and distributed config into build_manifest.json - Add skip_monitoring flag for synchronous salloc execution - Add completion marker file for robust sbatch job completion detection Made-with: Cursor
…cher column - Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist is set via model card or --additional-context - Include nodelist in model card slurm merge key list so it propagates to the build manifest - Rename perf table column from "Launcher" to "Workload" for clarity Made-with: Cursor
…time) When --additional-context provides partial slurm overrides (e.g. only nodelist), ConfigLoader defaults were overwriting model card values like nodes and time. Fixed by capturing original user slurm keys before ConfigLoader runs, and deep-merging manifest slurm config with runtime context instead of replacing it wholesale. Made-with: Cursor
- Rename BAREMETAL_LAUNCHERS to SELF_MANAGED_LAUNCHERS to avoid confusion with PR ROCm#62's baremetal_vm (Docker-less nodes with VMs) - Rename _run_on_baremetal -> _run_self_managed, _prepare_baremetal_script -> _prepare_slurm_multi_script throughout - Add slurm_multi to common.py VALID_LAUNCHERS (canonical list) and add hyphen-variant normalization (slurm-multi -> slurm_multi) - Remove duplicate VALID_LAUNCHERS from kubernetes.py (now inherits from common.py) - Fix SBATCH output filename pattern: _%j -> _%j_%t to match the glob pattern in collect_results (madengine-*_{deployment_id}_*.out) - Add _collect_slurm_multi_results for explicit slurm_multi results collection path with perf.csv retry and NFS propagation handling - Add --use-image / --build-on-compute mutual exclusivity check in build_orchestrator.py - Document --use-image vs MAD_CONTAINER_IMAGE relationship in cli-reference.md Made-with: Cursor
Address all 9 inline comments from copilot-pull-request-reviewer[bot]: ROCm#1 build_orchestrator.py — _execute_with_prebuilt_image now keys manifest['built_models'] by model_name (not use_image), so multiple models that share the same pre-built image are all preserved in the manifest. ROCm#2 build_orchestrator.py — warn when discovered models have differing distributed/slurm configs in the prebuilt-image flow; the post-merge step still uses models[0]'s config but operators are now told. ROCm#3 build_orchestrator.py — _execute_build_on_compute() now raises ConfigurationError early when registry is None instead of falling into registry.replace/.split/.lower with NoneType. ROCm#4 build_orchestrator.py — credentials-required error now emits per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io) instead of Docker-Hub-only PAT guidance. ROCm#5 container_runner.py — document the shell=True trust boundary on the inner subprocess.run; cmd is internally constructed and any user model_args are routed through shlex-quoted assembly in the caller. ROCm#6 slurm.py — drop duplicate `from typing import Optional` import. ROCm#7 slurm.py — slurm_multi wrapper no longer hard-codes `#SBATCH --exclusive`; honours self.slurm_config.get('exclusive', True) to match the standard SLURM template behaviour. ROCm#8 slurm_node_selector.py — cleanup_node()'s srun_cmd is now built once and includes both --job-name (when provided) and --reservation (when set); the second in-try reassignment that dropped --job-name is gone. ROCm#9 run_orchestrator.py — replace the shallow `merged.update(...)` with a real recursive _deep_merge so the comment ("deep-merge") matches the behaviour: nested dicts under slurm/k8s/distributed/etc. are merged per-leaf, runtime --additional-context still wins on conflicts. Made-with: Cursor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Splits the Kubernetes deployment work out of PR #86 into a dedicated PR so reviewers can focus #86 on
slurm_multiand related changes. This PR contains the full Kubernetes deployment stack — adapters, templates, presets, scripts, examples, and unit tests.Why split?
PR #86 was originally based on
coketaste/refactor-dist, which introduced both the SLURM and Kubernetes deployment paths. The PR base has since been retargeted todevelop. Reviewers asked to keep #86 scoped toslurm_multi; this PR carries the rest forward independently.What's in this PR (and only this PR)
src/madengine/deployment/kubernetes.py—KubernetesDeployment(auto-registered byfactory.pywhen module is importable)kubernetes_launcher_mixin.pyk8s_secrets.pypresets/k8s/**— defaults, GPU vendor profiles, multi-/single-gpu / multi-node profilestemplates/kubernetes/**—job.yaml.j2,configmap.yaml.j2,service.yaml.j2,pvc*.yaml.j2src/madengine/scripts/k8s/**— data download wrappers (download_aws/local/minio/nas.sh), tooling, profiler/rocenv wrappersexamples/k8s-configs/**— minimal and basic configs for native, torchrun, sglang, vllm, megatron, deepspeed, slurm-multi-on-k8stests/unit/test_k8s.pyThe shared / mixed orchestration & deployment files (
run_orchestrator.py,build_orchestrator.py,base.py,common.py,config_loader.py,validators.py) that already contain k8s code paths land via PR #86; this PR re-introduces the runtime callers (KubernetesDeployment, secrets, etc.) once #86 merges.Sequencing
develop.develop(the dormant k8s code paths in shared files become live again) and merges.mainper repo policy.Test plan
tests/unit/test_k8s.pypasses locally withkubernetesextras installedmadengine run --additional-context '{"deploy": "k8s", ...}'smoke test on a dev clusterpytest tests/unit/test_config_loader.py tests/unit/test_deployment.pystill passes (shared code unchanged in this PR)Made with Cursor