feat(cli): add default values for gpu_vendor and guest_os in build CLI (#89)
Merged
coketaste merged 8 commits into coketaste/refactor-dis on Mar 31, 2026
Conversation
…mmand
Add sensible defaults (AMD/UBUNTU) to simplify build operations for the common use case while maintaining full backward compatibility and explicit override capability.
…file filtering
validate_additional_context already applied AMD/UBUNTU defaults, but the merged dict was not passed to BuildOrchestrator, so Context.filter() saw no vendor keys and could build both AMD and NVIDIA Dockerfiles.
- Add core/additional_context_defaults.py with apply_build_context_defaults()
- Refactor CLI validators to use the shared helper
- build command: pass repr(validated context) and drop duplicate file load
- BuildOrchestrator: apply defaults after merge; build Context from final additional_context post-ConfigLoader
- Update unit/integration tests; add tests for defaults helper
Pull request overview
Adds centralized defaults for build-time additional_context (gpu_vendor=AMD, guest_os=UBUNTU) so common build workflows can omit --additional-context, while keeping override support and updating docs/tests accordingly.
Changes:
- Introduce `madengine.core.additional_context_defaults` with shared defaults + helper to apply them.
- Apply defaults in both CLI validation (`validate_additional_context`) and `BuildOrchestrator` initialization.
- Update build command wiring, tests, and documentation to reflect defaulting behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/madengine/core/additional_context_defaults.py | Adds shared defaults/constants and a helper to apply them. |
| src/madengine/cli/validators.py | Applies defaults during validation and prints informational messages. |
| src/madengine/orchestration/build_orchestrator.py | Ensures the orchestrator has defaults so Dockerfile filtering can work without explicit context. |
| src/madengine/cli/commands/build.py | Uses the validated and merged context (and reuses it for batch manifest processing). |
| tests/unit/test_validators.py | New validator-focused tests for defaulting/merging behavior. |
| tests/unit/test_additional_context_defaults.py | New tests for the shared defaults helper. |
| tests/unit/test_orchestration.py | Updates expectations for the BuildOrchestrator default context. |
| tests/integration/test_orchestrator_workflows.py | Updates integration expectations for orchestrator defaults. |
| docs/usage.md | Updates quick start examples and notes about defaults. |
| docs/configuration.md | Documents default values and when they apply. |
| docs/cli-reference.md | Documents build defaults and messaging. |
| README.md | Updates quick start examples and notes about defaults. |
Comments suppressed due to low confidence (1)
src/madengine/cli/validators.py:96
`gpu_vendor` is uppercased for validation, but the normalized value is not written back into `context`. Downstream code (notably `Context.filter()`, which uses exact string equality) can fail to match Dockerfile contexts if the user passes lowercase values (e.g., "amd"). After validating, persist the canonical value back into `context` (and do the same for `guest_os`).
# Validate gpu_vendor
gpu_vendor = context["gpu_vendor"].upper()
if gpu_vendor not in VALID_GPU_VENDORS:
console.print(f"❌ Invalid gpu_vendor: [red]{context['gpu_vendor']}[/red]")
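A sketch of the fix this comment suggests: normalize, validate, and then write the canonical value back into the context dict. The standalone function shape and the contents of VALID_GPU_VENDORS are assumptions for illustration; the real validator lives in src/madengine/cli/validators.py:

```python
# Hypothetical sketch of the suggested fix. The set of valid vendors and
# the ValueError (instead of the CLI's console.print + exit) are assumed.
VALID_GPU_VENDORS = {"AMD", "NVIDIA"}


def normalize_gpu_vendor(context: dict) -> None:
    gpu_vendor = context["gpu_vendor"].upper()
    if gpu_vendor not in VALID_GPU_VENDORS:
        raise ValueError(f"Invalid gpu_vendor: {context['gpu_vendor']}")
    # Persist the canonical uppercase value so Context.filter(), which
    # compares with exact string equality, matches inputs like "amd".
    context["gpu_vendor"] = gpu_vendor
```

The later commit in this PR ("normalize gpu_vendor/guest_os") applies exactly this kind of write-back for both keys.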
- Add structural checks for --additional-context (nested types, known keys) with clear typer.Exit messages and examples.
- Unify run merge/validate with build when context is non-empty.
- Remove the duplicate MAD_SYSTEM_GPU_ARCHITECTURE info line from build-only Context init.
- After model discovery, warn only when a selected Dockerfile needs ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default and the effective arch cannot be resolved (context or Dockerfile).
- Add dockerfile_requires_explicit_mad_arch_build_arg helper, drop duplicate DockerBuilder methods, and extend unit tests (validators, CLI, Dockerfile heuristic, orchestrator warning, context messages).
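The structural checks described above might look roughly like the sketch below. The known-key set uses key names that appear elsewhere in this PR (gpu_vendor, guest_os, log_error_* keys), but the function name and error-message format are assumptions; the real validator would print a rich message and raise typer.Exit instead of returning strings:

```python
# Hypothetical structural validation for --additional-context.
# Known keys are taken from names mentioned in this PR; the real list
# and behavior (typer.Exit with examples) may differ.

KNOWN_KEYS = {
    "gpu_vendor",
    "guest_os",
    "log_error_pattern_scan",
    "log_error_benign_patterns",
    "log_error_patterns",
}


def check_additional_context(context: object) -> list[str]:
    """Return human-readable structural errors (empty list if valid)."""
    if not isinstance(context, dict):
        return ['additional context must be a JSON object, e.g. \'{"gpu_vendor": "AMD"}\'']
    errors = []
    for key, value in context.items():
        if key not in KNOWN_KEYS:
            errors.append(f"unknown key: {key!r} (known keys: {sorted(KNOWN_KEYS)})")
        elif isinstance(value, dict) and not all(
            isinstance(v, str) for v in value.values()
        ):
            # Nested-type check: nested mappings must map to strings.
            errors.append(f"nested values under {key!r} must be strings")
    return errors
```

Returning all errors at once (rather than failing on the first) lets the CLI print one clear message with examples before exiting.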
…dators
- After validation, write canonical uppercase gpu_vendor and guest_os into the context dict so Context.filter() matches exactly.
- Align test_validate_additional_context_case_insensitive with that behavior.
- Drop unused imports: Panel in validators.py; patch/MagicMock in test_validators.py (flake8 F401).
leconcio (Contributor) previously approved these changes on Mar 27, 2026, leaving a comment:
I've reviewed and tested this branch; it's indeed backwards compatible with madengine V1.
coketaste added a commit that referenced this pull request on Apr 9, 2026:
- Changed docker.io to dockerhub
- Fix the test case of context
- Updated README.md
- Fix the unit test of e2e distributed run with profiling
- Fixed the issue of mocks gpu
- Rewrite the unit test gpu version
- Fixed the manifest name error
- Fixed the missing manifest file
- Updated the warning message of missing cred
- Updated the MAD_DOCKERHUB_ creds parsing logic
- Updated README
- Implemented a batch input arg for madengine-cli build
- Enhanced logging system is now active and will automatically highlight all Docker operations
- Fix the error "local variable docker_image referenced before assignment"
- Updated the perf dataframe output
- The fixes are backward compatible and maintain existing functionality for truly successful runs while correctly identifying and handling various failure scenarios
- Fixed the problematic log
- Fixed the error pattern, removed the wrong string
- Fixed the error of test prof
- Updated the interface of mad_cli
- Update README.md
- Ensure that the DistributedOrchestrator.build_phase method and the underlying build logic use the batch_build_metadata argument to perform the correct tagging and pushing for each model
- Updated the build batch manifest to distributed orchestrator
- Debug the batch manifest
- Update the flow to use per-model registry settings for both build and run phase
- Correct registry image will be used for each model as intended
- The push_image function now accepts and uses the explicit registry_image from batch.json for each model
- Updated the explicit_registry_image assignment
- Debug the registry info setting
- Updated the function of export build manifest
- Add verbose for debugging
- Debug the export build manifest
- Debug the registry extract from batch build metadata
- Debug the extraction
- Corrected the content of synthetic image where built_new is false in batch mode
- Fixed the type error in additional context
- Debug the parsing of gpu vendor and guest os
- Correct the pattern of Dockerfile
- Updated the print
- Update the rich print
- Figured out a critical issue about dual CLI implementation creating maintenance burden
- Fixed the dockerfile matching
- Refactored the logic in _process_batch_manifest_entries() to include all fields from the discovered model in the build_manifest
- Added unit tests for new unified error handlers
- Updated README.md
- Implemented a SLURM runner that follows the same comprehensive pattern as the existing SSH, Ansible, and Kubernetes runners, ensuring consistency while highlighting SLURM-specific features like job arrays, HPC module systems, and shared filesystem requirements
- Fixed the errors in unit tests
- Used Rich console print to replace part of regular print to enhance the formatting log, following best practices
- Updated Rich console print to enhance log readability
- Update the new line
- Updated the new line for all sections
- Updated final table of dataframe
- Updated the display of dataframe from head to tail
- Updated the checking of gpu status
- Cleanup
- Updated README
- Added discover command to mad_cli
- Implemented CLI detection of MAD_CONTAINER_IMAGE in additional context; production-ready and maintains full backward compatibility with existing madengine workflows
- Implemented the core multi-GPU architectures support for docker image building
- Implemented unit tests for the feature of multi-gpu arch
- Debug and fix the unit test of multi gpu arch
- Debug the issue of display results table
- Enhanced the results table, and improved the flow of handling the gpu arch suffix in the docker image name
- Creates architecture-specific images with proper naming and metadata, regardless of the underlying Dockerfile configuration
- Fixed the syntax error
- Ported changes from coketaste/amd-smi
- Revert "ported changes from coketaste/amd-smi". This reverts commit 5444a67.
- Fixed the tools for distributed mode
- Fixed the cleanup
- Fixed the table of results
- Fixed the GPU Product Name
- Fixed the issue in selftest
- Enhanced unit tests and cleanup
- Refactor the architecture and flow
- Updated the REFACTOR Plan
- Updated PLAN
- Update Plan
- Update Plan
- Refactor the new madengine cli architecture and flow
- Fixed the copy issue of scripts
- Fixed the migration issues and fixed the tags multitag inputs
- Improved error handler and updated unit tests
- Fixed the prescript with prof power and vram
- Fixed the issues of data provider and tags
- Refactored the GPU Tool Manager with a factory to handle AMD ROCm and Nvidia CUDA tools, synced up the changes in legacy madengine, updated their unit tests, and cleaned up the codebase
- Reorganize codebase: move docker_builder, discover_models, update_perf_csv
  - Move docker_builder.py to execution/ (Docker operations)
  - Move discover_models.py to utils/ (shared utility)
  - Move update_perf_csv.py to reporting/ (shared by both CLIs)
  - Create database/ placeholder for future MongoDB API
  - Update all imports across codebase
  - Add comprehensive README documentation
  Tests: 58/59 passing (98.3%). Legacy compatibility: maintained.
- Reorganize file structure and cleanup
  - Move documentation files to better organization
  - Update code references and imports
  - Clean up test fixtures with dummy values only
- Cleanup runners
- Fixed the unit tests for new madengine cli and deprecated unit tests of legacy madengine
- Implemented k8s deployment
- Fixed the perf csv unified format
- Updated the flow of k8s
- Init examples of k8s config for different use cases
- Fixed the data provider for minio
- Fixed the data nas
- Implemented k8s tools package and enabled download and pre-script tools in the flow running on k8s pod
- Fixed the tools encapsulation
- Added PVC storage support for results generated by workload on k8s pod
- Implemented torchrun as runner for multigpu and multinode
- Fixed the torchrun on multigpu on k8s
- Fixed the tools as pre/post scripts running on multigpu k8s
- Fixed the error handler for empty performance due to benchmark failure on k8s
- Fixed the chain tools
- Fixed the multinode on k8s
- Updated k8s config
- Updated the pod creation with PVC
- Clean up
- Refactored context to handle new madengine cli to support local and deploy cases
- Migrated mad_cli with new cli structure design
- Fixed the unit tests for new madengine cli
- Fixed the unit tests
- Improved the interface of k8s-configs to simplify the complex config to minimal
- Fixed the timeout issue
- Added models for testing distribution with different launchers
- Removed non-existent environment variables; kept only standard MIOpen variables: MIOPEN_FIND_MODE=1 and MIOPEN_USER_DB_PATH=/tmp/.miopen; added cache clearing in kubernetes/job.yaml.j2 before training starts; deletes corrupted cache files (rm -rf ${MIOPEN_USER_DB_PATH}/*); ensures every run starts with a clean cache
- Added new models dummy_megatron_lm and dummy_deepspeed to validate the implementation of distribution on cluster
- Implemented Slurm launcher for running distributed workloads
- Fixed the issue of deployment config
- Fixed the gpu resolution for setting num gpus
- Fixed the multinode configs and cleaned up old MULTI_NODE args which are not used anymore
- Fixed the streaming log
- Refactored the torchrun on slurm with multi gpu and multi node
- Fixed the job template of slurm for single and multinode
- Fixed the multinode job
- Fixed the multinode on slurm node
- Fixed the multinode slurm deploy case: no SIGBUS crashes and worker nodes succeed with proper status message
- Validate madengine-cli on compute node
- Fixed the unit tests for slurm deploy update
- Created run script of dummy deepspeed
- Implemented inference serving launchers using vllm and sglang for distribution on slurm cluster
- Fixed the issue in vllm for v1 engine
- Debug and test vllm deploy on multinode of slurm with v1 engine and ray as launcher; refactor k8s-configs and slurm-configs
- Implemented launchers of vllm and sglang to run workloads on single and multinode of slurm cluster; fixed the slurm-configs
- Fixed the unit tests of skip gpu arch
- Fixed the issue of copy common; add pyt_huggingface_gpt2 and pyt_huggingface_bert for validating development
- v2.0 development: (1) 6 launchers implemented for multigpu and multinode on k8s and slurm cluster; (2) bug fixes for error handler; (3) cleanup of dead code; (4) updated docs of project
- Updated the gpu_vendor and guest_os fields in config
- Updated README.md and its sections in docs
- Updated the README and cleanup
- Replace sleep with cat
- Updated torchrun with run.sh pattern on k8s deployment
- Updated Megatron-LM base image using ROCm/megatron-lm:latest
- Implemented report and database commands
- Added the perf superset feature to collect configs and multi-results
- Fixed the k8s pvc issue
- Updated the context saving logic
- Reorganize unit tests, remove redundant and edge cases, add examples of batch-manifest, and update documentation of the batch build feature
- Make a universal solution with docker exec
- Replaced sleep with tail
- Fixed the docker pull issue on compute node if a layer of the image crashed
- Removed MySQL database interface
- Updated the Megatron-LM launcher on both k8s and slurm
- Updated k8s-configs megatron-lm and deepspeed
- Updated the config of gradient_accumulation_steps
- Removed legacy madengine and its relatives; use madengine as unified modern CLI with k8s and SLURM
- Cleanup and enhance README of project
- Fixed the vllm multinode on k8s
- Fixed the format error in kubernetes
- Implemented sglang-disagg launcher for slurm and k8s on multinode
- Updated docs of project to reflect recent changes
- Updated README of project
- Fixed the stack tools for tracing
- Fixed the tools stack for gpu info power and gpu info vram profilers
- Fixed the error of regex pattern mismatch
- Added a model to test theRock and its performance on PyTorch framework; added tool detect_therock
- Fixed the error handling to validate the exact bug
- Updated docs to reflect recent changes
- Fixed the module combining logs of building and running
- Fixed the trace shell script
- Updated trace script
- Fixed the SLURM job setup of HIP_VISIBLE_DEVICES for Ray/vLLM
- Debug the vllm multinode on SLURM
- Fixed the superset upload to mongodb
- Fixed the unit tests checking hardware capability
- Fixed the superset data generation and uploading
- Added new opt to clean perf intermediate files
- Enhanced rocprof v3 profiling: 8 pre-configured profiling profiles for different bottleneck types
  - Hardware counter definitions for compute, memory, and communication analysis
  - Ready-to-use configuration files for single-GPU, multi-GPU, and multi-node setups
  - Perfetto visualization support for timeline analysis
  - Full custom command support via cmd and env_vars fields
  - Automatic rocprof/rocprofv3 detection via existing wrapper script
  - Comprehensive documentation with examples for every scenario
- Fixed the configs, tracing, and wrapper of rocprof v3
- Added profiling-configs and counters configs; enhanced the multigpu profiling with rocprof v3 and stacked tools chain
- Updated the unit tests of profiling with rocprof v3
- Updated the docs of project
- Added unit tests for superset
- Fixed the error of parsing config and gathering multi_results into the superset dataframe
- Fixed the issue of dockerhub auth in pod init for image pulling
- Implemented multinode profiling and tracing on k8s and slurm cluster; updated the tables of execution results and performance results
- Fixed the multi results case on k8s infra
- Updated config for profiling
- Added nodelist arg for slurm, and improved the configuration module
- Updated docs about changes in slurm
- Added a new rocprofv3 config mode for tool
- Updated rocprofv3_agent tool config
- Added rocprofv3_agent_full with counter collection or instruction mix
- Updated trace config
- Added rocprofv3 thread tracing
- Updated the config of thread tracing; note that ROCPROF_ATT_LIBRARY_PATH needs to be set
- Updated tools.json and removed thread tracing config for rocprof v3
- Fixed the error patterns and modified the silent error check via subprocess
- Updated the additional context and manifest printout
- Updated interface of madengine report to-html using --csv-file-path
- madengine v2: ROCm path override + RPD e2e dependency fix (#76)
  - feat(madengine): ROCm path override and RPD e2e fix
    ROCM_PATH / --rocm-path:
    - Add get_rocm_path() and wire through Context, ROCmToolManager, gpu_tool_factory, gpu_validator, container_runner, run orchestrator; add --rocm-path to run CLI
    - Unit tests for get_rocm_path, Context, ROCmToolManager, run --help; update fixtures and test_get_cached_managers for (vendor, rocm_path) cache
    RPD e2e:
    - pre_scripts/trace.sh: install nlohmann-json3-dev (Ubuntu) and json-devel (CentOS) so the rocmProfileData rpd_tracer build finds nlohmann/json.hpp
  - Updated docs
- Added madengine logo icon
- Resize the logo icon
- Updated the icon of madengine
- Updated docs
- Updated README
- Fixed the issue: appended rows now match the existing header; model, performance, metric, and status show in the correct column (#77). Use header index to replace fixed column order.
- Updated README.md
- feat(k8s): align K8s perf reporting with Docker (multi_result + perf_super) (#78)
  K8s orchestrator now uses the same reporting path as single-node Docker when a multiple_results CSV is present, so both produce the same artifacts: perf.csv, perf_entry.json, perf_entry.csv, perf_super.json, perf_entry_super.json/csv, and perf_super.csv.
  - Add Docker-compatible reporting path in _collect_results:
    - _build_common_info_dict() for common_info (args, gpu_arch, etc.)
    - _ensure_perf_csv_exists() so update_perf_csv can read perf.csv
    - Call update_perf_csv, update_perf_super_json, update_perf_super_csv with scripts_base_dir so --config from models.json resolves correctly
  - Multi-node multiple_results: resolve one CSV path per run
    - _resolve_multiple_results_csv(): single pod → use that CSV; multi-pod → merge all pod CSVs with sum/average rules
    - _merge_multi_node_multiple_results_csv(): align rows by index; performance aggregated by metric type (throughput→sum, latency→avg, memory→max); extra columns by _aggregation_for_extra_column (sum/avg/max/first)
    - _aggregation_for_extra_column() for consistent multi-node semantics
  - Keep legacy row-by-row _write_to_perf_csv when reporting module unavailable; record failure when no CSV found
  - job.yaml.j2: no functional change required; existing copy block and find fallback for multiple_results already support this refactor
- Fixed launcher type issue on k8s
- update model discovery to handle tags in subdirectories for madengineV2 (#85)
  - Fixed model discovery to support tags in subdirectories
  - Made additional context text and file mutually exclusive on build and run
- madengine v2 improve code quality (#81)
  - refactor(madengine): improve code quality, remove dead code, reduce duplication
    - Remove dead test: delete tests/test_cleanup.py (imported removed madengine.tools.run_models).
    - Constants: add _get_env_or_creds_or_default() in core/constants.py and refactor NAS_NODES, MAD_AWS_S3, MAD_MINIO, PUBLIC_GITHUB_ROCM_KEY to use it; add tests/unit/test_constants.py.
    - Path helpers: add utils/path_utils.py with scripts_base_dir_from() and get_madengine_root(); use in container_runner and kubernetes; add tests/unit/test_path_utils.py.
    - Run-details: add utils/run_details.py with get_pipeline(), get_build_number(), and flatten_tags_in_place(); refactor container_runner and kubernetes to use them; add tests/unit/test_run_details.py.
    - Deployment: add apply_deployment_config() in config_loader and create_jinja_env() in base; refactor slurm and kubernetes init to use them; add TestApplyDeploymentConfig and tests/unit/test_deployment_base.py.
    All changes preserve behavior. Unit test suite (157+ tests) passes.
  - test: reorganize unit and integration tests, reduce duplication
    Unit tests (tests/unit/):
    - Remove duplicate get_rocm_path coverage from test_constants (kept in test_rocm_path).
    - Consolidate by topic: test_utils (path_utils + run_details), test_deployment (base + common), test_execution (container_runner_helpers + dockerfile_utils), test_orchestration (image_filtering + orchestrator_logic).
    - Parametrize and merge redundant cases (e.g. falsy inputs, env/default, launcher normalization, timeout behavior).
    - Drop duplicate tests (e.g. gfx_compilation in dockerfile_utils) and trim edge-case bloat; keep important behavior.
    - Reduce from 16 to 12 unit test files; 203 tests passing.
    Integration tests (tests/integration/):
    - Fix 6 failures in multi-GPU-arch tests: use madengine.execution.dockerfile_utils instead of removed DockerBuilder private methods.
    - Merge error tests into test_errors.py (CLI + unified system, dedupe setup_logging and context serialization).
    - Merge batch manifest into test_orchestrator_workflows; merge multi_gpu_arch into test_platform_integration.
    - Reduce from 10 to 7 integration test files; 155 passed, 1 skipped.
- Add shared e2e helpers in tests/fixtures/utils.py:
  - build_run_command(), assert_model_in_perf_csv(), get_run_live_log_path(), get_timeout_seconds_from_log()
  - DEFAULT_CLEAN_FILES for consistent cleanup lists
  - Use DEFAULT_CLEAN_FILES (and + extras) in all e2e modules: test_run_workflows, test_execution_features, test_data_workflows, test_build_workflows, test_profiling_workflows, test_scripting_workflows
  - Parametrize context tests: merge test_dockerfile_picked_on_detected_context_0/1 into test_dockerfile_picked_on_detected_context (ctx_value, expected_performance)
  - Parametrize timeout-in-log tests: replace four separate tests with test_timeout_value_in_log (tags, log_base_name, expected_seconds, extra_args)
  - Leave timeout-fire and other e2e tests unchanged; 60 e2e tests still pass
- Stop tracking run_directory (now in .gitignore)
- Multi-node SLURM: no node writes perf.csv; after the job, collect_results parses each node's .out, aggregates (e.g. sum for samples_per_second), and writes one row. 2N×8G should report about 2× the 1N×8G throughput.
- Refactored slurm_output to slurm_results to improve the management of results collected from each node
- Fixed the workload's output to slurm results
- Refactored the reporting system for local, k8s, and slurm, respectively, to handle single result and multi results
- Fixed the multi results collection for multi node
- Fixed the test duration missing in multi results case
- Improve madengine reliability, error handling, and exit codes for Jenkins
  - Record pre-run failures (e.g. pull/setup) in the performance CSV with status FAILURE and add to failed_runs so the perf table and exit code stay correct.
  - Preserve typer.Exit in cli_main so run/build exit codes (0/1/2/3/4) are passed through to the shell/Jenkins.
  - Use ExitCode.BUILD_FAILURE (2) when BuildError is raised; let BuildError propagate from RunOrchestrator instead of wrapping in MADRuntimeError.
  - Make update_perf_csv create the perf CSV with PERF_CSV_HEADER when the file is missing but a result (exception/single/multiple) is provided.
  - Use shared PERF_CSV_HEADER in container_runner.ensure_perf_csv_exists and fix read_json to use a context manager for the file handle.
  - Add unit tests for exit codes, cli_main typer.Exit preservation, BuildError → BUILD_FAILURE, setup-failure recording to CSV, and update_perf_csv create-when-missing; fix unclosed file in update_perf_csv.
- Cleanup
- Updated docs of project
- update model discovery to handle tags in subdirectories (#83)
  Co-authored-by: leconcio <leconcio@amd.com>
- feat(discover): support scope/tag to filter by scripts subdir and tag
  Add --tags scope/tag (e.g. MAD-private/inference) so discovery is limited to models under scripts/<scope>/ with the given tag. Exactly one / and no : in the tag enables scoped matching; scope/all selects all models in that scope.
  - Parse scope/tag in DiscoverModels.select_models()
  - Add _model_entry_has_tag() for list or comma-separated tag checks
  - Document scoped tags in discover command help
  - Add unit tests in tests/unit/test_discover_models.py
- Cleanup
- fix: dummy fixtures and core for multi-node and SLURM
  - Update .gitignore, docker.py, slurm.py
  - dummy_vllm: Dockerfile (coreutils), run.sh, run_vllm_inference.py
  - dummy_torchtitan: Dockerfile, run.sh
  - dummy_sglang: README, run.sh, run_sglang_inference.py
- test: consolidate run exit-code smoke tests in test_cli.py
- Updated README of project
- madengine v2 refactor(k8s) secrets handling, launcher mixin, and template updates (#88)
  - refactor(k8s): secrets handling, launcher mixin, and template updates
    - Add k8s_secrets helpers for image pull / runtime secrets with configurable strategies; add unit tests.
    - Introduce KubernetesLauncherMixin for torchrun and related launcher command generation; wire into kubernetes deployment.
    - Update K8s Job/ConfigMap templates, k8s defaults preset, and kubernetes.py.
    - Refresh configuration/deployment docs and k8s-configs README.
    - Adjust tests (conftest, dummy credential fixture) for new behavior.
    - Fix CLI merge of --additional-context with --additional-context-file when the CLI value uses Python dict syntax (fallback after JSON parse failure).
  - fix(k8s): Path import for vLLM/SGLang commands; public registry secret helper
    - Import Path at module level in kubernetes_launcher_mixin so _generate_vllm_command and _generate_sglang_command do not raise NameError; drop redundant inner import in _generate_torchrun_command.
    - Rename _build_registry_secret_data to build_registry_secret_data and document it as the supported API for dockerconfigjson preview logic.
    - Update kubernetes.py and unit tests accordingly.
- feat(cli): add default values for gpu_vendor and guest_os in build CLI (#89)
  - feat(cli): add default values for gpu_vendor and guest_os in build command: add sensible defaults (AMD/UBUNTU) to simplify build operations for the common use case while maintaining full backward compatibility and explicit override capability.
  - fix(build): wire gpu_vendor/guest_os defaults into Context for Dockerfile filtering: validate_additional_context already applied AMD/UBUNTU defaults, but the merged dict was not passed to BuildOrchestrator, so Context.filter() saw no vendor keys and could build both AMD and NVIDIA Dockerfiles. Add core/additional_context_defaults.py with apply_build_context_defaults(); refactor CLI validators to use the shared helper; build command: pass repr(validated context) and drop duplicate file load; BuildOrchestrator: apply defaults after merge; build Context from final additional_context post-ConfigLoader; update unit/integration tests; add tests for defaults helper.
  - feat(cli): validate additional context and scope GPU arch warnings: add structural checks for --additional-context (nested types, known keys) with clear typer.Exit messages and examples. Unify run merge/validate with build when context is non-empty. Remove the duplicate MAD_SYSTEM_GPU_ARCHITECTURE info line from build-only Context init. After model discovery, warn only when a selected Dockerfile needs ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default and the effective arch cannot be resolved (context or Dockerfile). Add dockerfile_requires_explicit_mad_arch_build_arg helper, drop duplicate DockerBuilder methods, and extend unit tests (validators, CLI, Dockerfile heuristic, orchestrator warning, context messages).
  - fix(cli): normalize gpu_vendor/guest_os; clean unused imports in validators: after validation, write canonical uppercase gpu_vendor and guest_os into the context dict so Context.filter() matches exactly. Align test_validate_additional_context_case_insensitive with that behavior. Drop unused imports: Panel in validators.py; patch/MagicMock in test_validators.py (flake8 F401).
  - docs(readme): fix misleading AMD/UBUNTU wording for `madengine run` quick start
  - docs(usage): fix misleading AMD/UBUNTU wording for `madengine run` quick start
- Fixed the docker registry url
- Added unit tests for docker builder
- madengine v2: k8s PVC collection hardening (#90)
  - fix(k8s): harden collect_results and align PVC defaults with examples
    - kubernetes: wait for terminal pod phase before recording per-node status; read primary workload container exit code by name (not container_statuses[0])
    - presets + PVC templates: align shared/results PVC behavior with nfs-banff / RWX usage
    - examples: update torchrun single- and multi-node JSON configs; document in k8s-configs README
  - feat(k8s): per-pod results layout, PVC collection hardening, expand k8s tests
    Kubernetes deployment:
    - Collect PVC artifacts into k8s_results/<job>/<pod>/pvc/ by matching /results/<subdir>/ to full pod names (indexed Job pods, unmapped fallback).
    - Harden collector: short settle delay, retries for listing /results, explicit -c collector on kubectl exec/cp, clearer warnings when collection fails.
    Docs:
    - Document local k8s_results layout (pod.log vs pvc/) and multi-node Service behavior (PyTorch vs Ray notes in k8s-configs README).
    Tests:
    - Rename test_k8s_secrets.py to test_k8s.py; keep secrets/configmap tests and add unit tests for PVC subdir ↔ pod name mapping helpers.
- feat(cli): add --skip-model-run for local full workflows (#91)
  Restore v1-style --skip-model-run on madengine run: model execution is skipped only when this invocation ran a build (_did_build_phase). If an existing --manifest-file is used, the flag is ignored and a warning is printed.
  - CLI: new --skip-model-run, passed through args; workflow panel and success messaging when the run phase is skipped
  - RunOrchestrator: short-circuit before local/distributed execute; empty run summary with skipped_model_run; use (skip_model_run is True) so MagicMock test doubles do not enable the path by accident
  - Tests: unit coverage for skip vs run-only; help text; conftest skip_model_run
- feat(execution): configurable log error pattern scan (#92)
  - feat(execution): configurable log error pattern scan (additional_context)
    Allow disabling or tuning post-run log grep via additional_context and/or model_info:
    - log_error_pattern_scan / disable_log_error_scan to skip substring scan
    - log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs)
    - log_error_patterns to override default pattern list
    Extract resolution into container_runner_helpers.resolve_log_error_scan_config; validate new keys in CLI validators; add unit tests.
    Motivation: avoid false FAILURE from benign "RuntimeError:" lines in suites already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571).
* feat(execution): configurable log error pattern scan; document and drop duplicate flag

  Add optional additional_context / model keys to control post-run log substring checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns). Resolve settings in container_runner_helpers; wire into ContainerRunner; validate CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only.

  Document behavior in README, docs/configuration.md, cli-reference, usage, profiling, and the docs index. Cross-link troubleshooting for benign RuntimeError: false failures.

  Tests: extend test_container_runner_helpers; unit suite passes.

* refactor(container): Python log error scan; split benign literal vs regex; test validators (#93)

  * feat(execution): configurable log error pattern scan (additional_context)
  * feat(execution): configurable log error pattern scan; document and drop duplicate flag
  * refactor: scan container logs in Python; validate log_error_* in tests

    Replace sh/grep-based log error detection with in-process scanning to avoid shell quoting and injection issues when patterns come from model_info or additional_context.

    - Add log_text_has_error_pattern() with literal benign substrings vs explicit regex benign rules (ROCProf).
    - Wire ContainerRunner to read the log once and use the helper.
    - Document log_error_benign_patterns as literal substrings in helpers.
    - Extend validate_additional_context tests for log_error_pattern_scan, log_error_benign_patterns, and log_error_patterns (valid + invalid).

* refactor: improve code quality and streamline unit tests (#94)

---------

Co-authored-by: Satya Nikhil <skodukul@amd.com>
Co-authored-by: leconcio <leconcio@amd.com>
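The in-process scan from #93 can be sketched as follows. `log_text_has_error_pattern` is the helper named above, but the signature, default error patterns, and the ROCProf regex rule here are illustrative assumptions:

```python
import re

# Assumed defaults; the real list lives in container_runner_helpers.
DEFAULT_ERROR_PATTERNS = ["Error:", "RuntimeError:"]
# Explicit regex benign rules (e.g. ROCProf noise), kept separate from
# user-supplied literal substrings.
BENIGN_REGEX_RULES = [re.compile(r"rocprof", re.IGNORECASE)]

def log_text_has_error_pattern(log_text, error_patterns=None, benign_literals=()):
    """Return True if any log line hits an error pattern and is not benign.

    Patterns never pass through a shell, so there is no quoting or
    injection concern when they come from model_info/additional_context.
    """
    patterns = error_patterns or DEFAULT_ERROR_PATTERNS
    for line in log_text.splitlines():
        if not any(p in line for p in patterns):
            continue
        if any(lit in line for lit in benign_literals):  # literal substrings
            continue
        if any(rule.search(line) for rule in BENIGN_REGEX_RULES):
            continue
        return True
    return False

log = "INFO ok\nRuntimeError: expected by pytest\nError: real failure\n"
print(log_text_has_error_pattern(log, benign_literals=["expected by pytest"]))
```

Reading the log once and scanning in Python also avoids re-invoking grep per pattern, which is the behavior change the refactor describes.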
feat(cli): add default values for gpu_vendor and guest_os in build command
Add sensible defaults (AMD/UBUNTU) to simplify build operations for the
common use case while maintaining full backward compatibility and explicit
override capability.
Changes:
- Introduce madengine.core.additional_context_defaults with shared default constants and an apply_build_context_defaults() helper.
- Apply defaults in both CLI validation (validate_additional_context) and BuildOrchestrator initialization.
- Update build command wiring, tests, and documentation.
Benefits:
Testing:
Documentation:
Backward Compatibility:
Example usage:

```bash
# Before (required)
madengine build --tags dummy --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# After (optional for AMD/Ubuntu)
madengine build --tags dummy

# Override for other environments
madengine build --tags dummy --additional-context '{"gpu_vendor": "NVIDIA"}'
```
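The defaulting behavior shown above (explicit values always win, AMD/UBUNTU fill only the missing keys) can be sketched as a small helper. The name apply_build_context_defaults comes from the PR summary; the body here is an assumption about how it works, not the actual implementation:

```python
# Shared defaults for build-time additional_context.
BUILD_CONTEXT_DEFAULTS = {"gpu_vendor": "AMD", "guest_os": "UBUNTU"}

def apply_build_context_defaults(additional_context):
    """Return a copy of additional_context with missing build keys defaulted.

    User-supplied values are never overwritten; defaults fill absent keys only.
    """
    merged = dict(BUILD_CONTEXT_DEFAULTS)
    merged.update(additional_context or {})
    return merged

# Omitting keys picks up the AMD/UBUNTU defaults:
print(apply_build_context_defaults({}))
# Explicit overrides are preserved:
print(apply_build_context_defaults({"gpu_vendor": "NVIDIA"}))
```

Centralizing this in one helper is what lets both the CLI validator and BuildOrchestrator agree on the same defaults, so Dockerfile filtering sees vendor keys even when --additional-context is omitted.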