refactor: improve code quality and streamline unit tests#94

Merged
coketaste merged 1 commit into coketaste/refactor-dis from
coketaste/refactor-dis-quality-minor
Apr 9, 2026

Conversation

@coketaste
Collaborator

Addressed minor refactorings from code quality review to improve
maintainability, type safety, and test efficiency.

Changes:

  • Replace direct print() calls with Rich console.print() in mongodb.py
    for consistent terminal output formatting
  • Rename RuntimeError to ExecutionError to avoid shadowing Python built-in
    (backward compatibility alias maintained)
  • Add pre-commit hooks configuration with black, isort, flake8, mypy,
    bandit, and general file checks
  • Streamline test_error_handling.py from 503 to 278 lines (45% reduction)
    by removing edge cases and focusing on core functionality
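As a hedged illustration of the first bullet, the substitution looks roughly like the sketch below; the function and message are hypothetical, not the actual mongodb.py call sites.

```python
# Minimal sketch of the print() -> console.print() substitution described
# above; report_upload and its message are hypothetical examples.
try:
    from rich.console import Console
    console = Console()
except ImportError:
    # Fallback keeps the sketch runnable without the Rich dependency.
    class Console:
        def print(self, *args, **kwargs):
            print(*args)
    console = Console()

def report_upload(collection: str, count: int) -> str:
    message = f"Uploaded {count} documents to {collection}"
    console.print(message)  # previously: print(message)
    return message
```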

Impact:

  • Net -185 lines while improving code quality
  • 100% backward compatibility maintained
  • All 39 unit tests passing
  • Automated quality enforcement via pre-commit hooks

Files modified:

  • src/madengine/core/errors.py (ExecutionError + alias)
  • src/madengine/database/mongodb.py (console.print)
  • src/madengine/cli/commands/run.py (updated imports)
  • src/madengine/orchestration/run_orchestrator.py (updated imports)
  • tests/unit/test_error_handling.py (streamlined)
  • .pre-commit-config.yaml (new)
  • pyproject.toml (bandit config)
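The ExecutionError rename and its compatibility alias can be sketched as follows; the base class name is an assumption, only the rename/alias pattern is from this PR.

```python
# Sketch of the rename in src/madengine/core/errors.py (base class assumed).
class MADError(Exception):
    """Assumed base class for madengine errors."""

class ExecutionError(MADError):
    """Raised when a model run fails; replaces the old RuntimeError name,
    which shadowed Python's built-in exception."""

# Backward-compatibility alias: existing imports of RuntimeError from this
# module keep working; new code should use ExecutionError.
RuntimeError = ExecutionError
```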

@coketaste coketaste self-assigned this Apr 9, 2026
@coketaste coketaste merged commit 8753fb6 into coketaste/refactor-dis Apr 9, 2026
coketaste added a commit that referenced this pull request Apr 9, 2026
* Changed docker.io to dockerhub

* Fix the test case of context

* Updated README.md

* Fix the unit test of e2e distributed run with profiling

* Fixed the issue of mocks gpu

* Rewrite the unit test gpu version

* Fixed the manifest name error

* Fixed the missing manifest file

* Updated the warning message of missing cred

* Updated the MAD_DOCKERHUB_ creds parsing logic

* Updated README

* Implemented a batch input arg for madengine-cli build

* Enabled the enhanced logging system, which automatically highlights all Docker operations

* Fixed the "local variable docker_image referenced before assignment" error

* Updated the perf dataframe output

* The fixes are backward compatible and maintain existing functionality for truly successful runs while correctly identifying and handling various failure scenarios.

* Fixed the problematic log

* Fixed the error pattern, removed the wrong string

* Fixed the error of test prof

* Updated the interface of mad_cli

* Update README.md

* ensure that the DistributedOrchestrator.build_phase method and the underlying build logic use the batch_build_metadata argument to perform the correct tagging and pushing for each model.

* Updated the build batch manifest to distributed orchestrator

* Debug the batch manifest

* Update the flow use per-model registry settings for both build and run phase

* correct registry image will be used for each model as intended

* The push_image function now accepts and uses the explicit registry_image from batch.json for each model.

* Updated the explicit_registry_image assignment

* Debug the registry info setting

* Updated the function of export build manifest

* Add verbose for debugging

* Debug the export build manifest

* Debug the registry extract from batch build metadata

* Debug the extraction

* Corrected the content of synthetic images whose built_new is false in batch mode

* Fixed the type error in additional context

* Debug the parsing of gpu vendor and guest os

* Correct the pattern of Dockerfile

* Updated the print

* Update the rich print

* Figured out a critical issue about dual CLI implementation creating maintenance burden

* Fixed the Dockerfile matching

* refactored the logic in _process_batch_manifest_entries() to include all fields from the discovered model in the build_manifest

* Added unit tests for new unified error handlers

* Updated README.md

* Implemented a SLURM runner that follows the same comprehensive pattern as the existing SSH, Ansible, and Kubernetes runners, ensuring consistency while highlighting SLURM-specific features like job arrays, HPC module systems, and shared filesystem requirements.

* Fixed the errors in unit tests

* Used Rich console print to replace some regular print calls, improving log formatting per best practices

* Updated Rich console print to enhance log readability

* Update the new line

* Updated the new line for all sections

* Updated final table of dataframe

* Updated the display of dataframe from head to tail

* Updated the checking gpu status

* Cleanup

* Updated README

* Added discover command to mad_cli

* Implemented CLI detection of MAD_CONTAINER_IMAGE in additional context; production-ready and maintains full backward compatibility with existing madengine workflows

* Implemented the core multi-GPU architectures support for docker image building

* Implemented unit tests for the feature of multi-gpu arch

* Debug and fix the unit test of multi gpu arch

* Debug the issue of display results table

* Enhanced the results table and improved handling of the gpu arch suffix in the docker image name

* Creates architecture-specific images with proper naming and metadata, regardless of the underlying Dockerfile configuration.

* Fixed the syntax error

* ported changes from coketaste/amd-smi

* Revert "ported changes from coketaste/amd-smi"

This reverts commit 5444a67.

* Fixed the tools for distributed mode

* Fixed the cleanup

* Fixed the table of results

* Fixed the GPU Product Name

* Fixed the issue in selftest

* Enhanced unit tests and cleanup

* Refactor the architecture and flow

* Updated the REFACTOR Plan

* Updated PLAN

* Update Plan

* Update Plan

* Refactor the new madengine cli architecture and flow

* Fixed the copy issue of scripts

* Fixed the migration issues and fixed multi-tag inputs for the tags option

* Improved error handler and updated unit tests

* Fixed the prescript with prof power and vram

* Fixed the issues of data provider and tags

* Refactored the GPU Tool Manager with factory to handle AMD ROCm and Nvidia CUDA tools, sync up the changes in legacy madengine, updated their unit tests and cleanup codebase

* Reorganize codebase: move docker_builder, discover_models, update_perf_csv

   - Move docker_builder.py to execution/ (Docker operations)
   - Move discover_models.py to utils/ (shared utility)
   - Move update_perf_csv.py to reporting/ (shared by both CLIs)
   - Create database/ placeholder for future MongoDB API
   - Update all imports across codebase
   - Add comprehensive README documentation

   Tests: 58/59 passing (98.3%)
   Legacy compatibility: Maintained

* Reorganize file structure and cleanup

- Move documentation files to better organization
- Update code references and imports
- Clean up test fixtures with dummy values only

* cleanup runners

* fixed the unit tests for the new madengine cli and deprecated the unit tests of legacy madengine

* Implemented k8s deployment

* Fixed the perf csv unified format

* Updated the flow of k8s

* Init examples of k8s config for different use cases

* Fixed the data provider for minio

* Fixed the data nas

* Implemented k8s tools package and Enabled download and pre-script tools in the flow running on k8s pod

* Fixed the tools encapsulated

* Added PVC storage support for results generated by workload on k8s pod

* Implemented the torchrun as runner for multigpu and multinode

* Fixed the torchrun on multigpu on k8s

* Fixed the tools as pre/post scripts running on multigpu k8s

* Fixed the error handler of performance empty due to benchmark failed on k8s

* Fixed the chain tools

* Fixed the multinode on k8s

* Updated k8s config

* Updated the pod creation with PVC

* Clean up

* Refactored context to handle new madengine cli to support local and deploy cases

* Migrated mad_cli to the new cli structure design

* Fixed the unit tests for new madengine cli

* Fixed the unit tests

* Improved the interface of k8s-configs to simplify the complex config to minimal

* Fixed the timeout issue

* Added models for testing distribution with different launchers

* Removed non-existent environment variables:
- Kept only standard MIOpen variables: MIOPEN_FIND_MODE=1 and MIOPEN_USER_DB_PATH=/tmp/.miopen
- Added cache clearing in kubernetes/job.yaml.j2 before training starts
- Delete corrupted cache files (rm -rf ${MIOPEN_USER_DB_PATH}/*) so every run starts with a clean cache

* Added new models of dummy_megatron_lm and dummy_deepspeed to validate the implementation of distribution on cluster

* Implemented launcher of Slurm running distributed workload

* Fixed the issue of deployment config

* Fixed the gpu resolution for setting num gpus

* Fixed the multinode configs and cleaned up old MULTI_NODE args that are no longer used

* Fixed the streaming log

* Refactored the torchrun on slurm with multi gpu and multi node

* Fixed the job template of slurm for single and multinode

* Fixed the multinode job

* Fixed the multinode on slurm node

* Fixed the multinode slurm deploy case: no SIGBUS crashes, and worker nodes succeed with a proper status message

* Validate madengine-cli on compute node

* Fixed the unit tests for slurm deploy update

* Created run script of dummy deepspeed

* Implemented inference serving launchers using vllm and sglang for distribution on slurm cluster

* Fixed the issue in vllm for v1 engine

* Debug and test vllm deploy on multinode of slurm with v1 engine and ray as launcher, refactor k8s-configs and slurm-configs

* Implemented launchers of vllm and sglang to run workload on single and multinode of slurm cluster; Fixed the slurm-configs

* Fixed the unit tests of skip gpu arch

* Fixed the issue of copy common and add pyt_huggingface_gpt2 and pyt_huggingface_bert for validating development

* v2.0 development:
(1) implemented 6 launchers for multigpu and multinode on k8s and slurm clusters
(2) fixed bugs in the error handler
(3) cleaned up dead code
(4) updated project docs

* Updated the gpu_vendor and guest_os fields in config

* Updated README.md and its sections in docs

* Updated the README and cleanup

* Replace sleep with cat

* Updated torchrun with run.sh pattern on k8s deployment

* Updated Megatron-lm base image using ROCm/megatron-lm:latest

* Implemented report and database commands

* Added the feature of perf superset to collect configs and multi-results

* Fixed the k8s pvc issue

* Updated the context saving logic

* Reorganize unit tests, removing redundant and edge cases; add batch-manifest examples and update the batch build documentation

* Make a universal solution with docker exec

* Replaced sleep with tail

* Fixed the docker pull issue on the compute node when an image layer is corrupted

* Removed MySQL database interface

* Updated the Megatron-lm launcher on both k8s and slurm

* Updated k8s-configs megatron-lm and deepspeed

* Updated the config of gradient_accumulation_steps

* Removed legacy madengine and its relative, use madengine as unified modern CLI with k8s and SLURM

* Cleanup and enhance README of project

* Fixed the vllm multinode on k8s

* Fixed the format error in kubernetes

* Implemented sglang-disagg launcher for slurm and k8s on multinode

* Updated docs of project refer to recent changes

* Updated README of project

* Fixed the stack tools for tracing

* Fixed the tools stack for gpu info power and gpu info vram profilers

* Fixed the error of regex pattern mismatch

* Added a model to test theRock and its performance on the PyTorch framework; added the detect_therock tool

* Fixed the error handling to validate the exact bug

* Updated docs refer to recent changes

* Fixed the module combining logs of building and running

* Fixed the trace shell script

* Updated trace script

* Fixed the SLURM job setup HIP_VISIBLE_DEVICES for Ray/vLLM

* Debug the vllm multinode on SLURM

* Fixed the superset upload to mongodb

* Fixed the unit tests checking hardware capability

* Fixed the superset data generation and uploading

* Added a new option to clean intermediate perf files

* Enhanced rocprof v3 profiling: 8 pre-configured profiling profiles for different bottleneck types
- Hardware counter definitions for compute, memory, and communication analysis
- Ready-to-use configuration files for single-GPU, multi-GPU, and multi-node setups
- Perfetto visualization support for timeline analysis
- Full custom command support via cmd and env_vars fields
- Automatic rocprof/rocprofv3 detection via existing wrapper script
- Comprehensive documentation with examples for every scenario

* Fixed the configs, tracing and wrapper of rocprof v3

* Added profiling-configs and counters configs, enhanced the multigpu profiling with rocprof v3 and stacked tools chain

* Updated the unit tests of profiling with rocprof v3

* Updated the docs of project

* Added unit tests for superset

* Fixed errors in parsing config and gathering multi_results into the superset dataframe

* Fixed the issue of dockerhub auth in pod init for Image Pulling

* Implemented multinode profiling and tracing on k8s and slurm cluster, updated the tables of execution results and performance results

* Fixed the multi results case on k8s infra

* Updated config for profiling

* Added nodelist arg for slurm, and improved the configuration module

* Updated docs about changes in slurm

* Added a new rocprofv3 config mode for tool

* Updated rocprofv3_agent tool config

* Added rocprofv3_agent_full with counter collection or instruction mix

* Updated trace config

* Added rocprofv3 thread tracing

* Updated the config of thread tracing and need to set ROCPROF_ATT_LIBRARY_PATH

* Updated tools.json and remove thread tracing config for rocprof v3

* Fixed the error patterns and modified the silent error check via subprocess

* Updated the additional context and manifest printout

* Updated interface of madengine report to-html using --csv-file-path

* madengine v2: ROCm path override + RPD e2e dependency fix (#76)

* feat(madengine): ROCm path override and RPD e2e fix

ROCM_PATH / --rocm-path:
- Add get_rocm_path() and wire through Context, ROCmToolManager, gpu_tool_factory,
  gpu_validator, container_runner, run orchestrator; add --rocm-path to run CLI
- Unit tests for get_rocm_path, Context, ROCmToolManager, run --help; update
  fixtures and test_get_cached_managers for (vendor, rocm_path) cache

RPD e2e:
- pre_scripts/trace.sh: install nlohmann-json3-dev (Ubuntu) and json-devel
  (CentOS) so rocmProfileData rpd_tracer build finds nlohmann/json.hpp

* Updated docs

* Added madengine logo icon

* Resize the logo icon

* Updated the icon of madengine

* Updated docs

* Updated README

* Fixed the issue. Appended rows now match the existing header; model, performance, metric, and status show in the correct column (#77)

Use header index to replace fixed column order
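The header-index fix can be sketched like this (a simplified, hypothetical helper; the real update_perf_csv logic is more involved):

```python
def append_row_by_header(csv_text: str, row: dict) -> str:
    # Hypothetical sketch of the fix described above: look up each value by
    # the column name in the existing header, instead of assuming a fixed
    # positional order, so model/performance/metric/status land in the
    # correct columns even if the header order changes.
    lines = csv_text.rstrip("\n").splitlines()
    header = lines[0].split(",")
    values = [str(row.get(col, "")) for col in header]
    lines.append(",".join(values))
    return "\n".join(lines) + "\n"
```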

* Updated README.md

* feat(k8s): align K8s perf reporting with Docker (multi_result + perf_super) (#78)

K8s orchestrator now uses the same reporting path as single-node Docker
when multiple_results CSV is present, so both produce the same artifacts:
perf.csv, perf_entry.json, perf_entry.csv, perf_super.json,
perf_entry_super.json/csv, and perf_super.csv.

- Add Docker-compatible reporting path in _collect_results:
  - _build_common_info_dict() for common_info (args, gpu_arch, etc.)
  - _ensure_perf_csv_exists() so update_perf_csv can read perf.csv
  - Call update_perf_csv, update_perf_super_json, update_perf_super_csv
    with scripts_base_dir so --config from models.json resolves correctly

- Multi-node multiple_results: resolve one CSV path per run
  - _resolve_multiple_results_csv(): single pod → use that CSV; multi-pod
    → merge all pod CSVs with sum/average rules
  - _merge_multi_node_multiple_results_csv(): align rows by index;
    performance aggregated by metric type (throughput→sum, latency→avg,
    memory→max); extra columns by _aggregation_for_extra_column (sum/avg/max/first)
  - _aggregation_for_extra_column() for consistent multi-node semantics

- Keep legacy row-by-row _write_to_perf_csv when reporting module
  unavailable; record failure when no CSV found

- job.yaml.j2: no functional change required; existing copy block and
  find fallback for multiple_results already support this refactor
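The per-metric aggregation rule above could be sketched as follows (the helper name and substring matching are assumptions, not the actual merge code):

```python
def aggregate_metric(metric_name: str, values):
    # Hedged sketch of the multi-node merge rule described above:
    # throughput-like metrics sum across pods, latency-like metrics average,
    # memory-like metrics take the max. Matching by substring is an
    # assumption about how metric types are classified.
    name = metric_name.lower()
    if "latency" in name:
        return sum(values) / len(values)
    if "memory" in name:
        return max(values)
    return sum(values)  # throughput and other additive metrics
```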

* Fixed launcher type issue on k8s

* update model discovery to handle tags in subdirectories for madengineV2 (#85)

* fixed model discovery to support tags in subdirectories

* made additional context text and file mutually exclusive on build and run

* madengine v2 improve code quality (#81)

* refactor(madengine): improve code quality, remove dead code, reduce duplication

- Remove dead test: delete tests/test_cleanup.py (imported removed
  madengine.tools.run_models).
- Constants: add _get_env_or_creds_or_default() in core/constants.py and
  refactor NAS_NODES, MAD_AWS_S3, MAD_MINIO, PUBLIC_GITHUB_ROCM_KEY to use
  it; add tests/unit/test_constants.py.
- Path helpers: add utils/path_utils.py with scripts_base_dir_from() and
  get_madengine_root(); use in container_runner and kubernetes; add
  tests/unit/test_path_utils.py.
- Run-details: add utils/run_details.py with get_pipeline(), get_build_number(),
  and flatten_tags_in_place(); refactor container_runner and kubernetes to use
  them; add tests/unit/test_run_details.py.
- Deployment: add apply_deployment_config() in config_loader and create_jinja_env()
  in base; refactor slurm and kubernetes init to use them; add
  TestApplyDeploymentConfig and tests/unit/test_deployment_base.py.

All changes preserve behavior. Unit test suite (157+ tests) passes.

* test: reorganize unit and integration tests, reduce duplication

Unit tests (tests/unit/):
- Remove duplicate get_rocm_path coverage from test_constants (kept in test_rocm_path).
- Consolidate by topic: test_utils (path_utils + run_details), test_deployment
  (base + common), test_execution (container_runner_helpers + dockerfile_utils),
  test_orchestration (image_filtering + orchestrator_logic).
- Parametrize and merge redundant cases (e.g. falsy inputs, env/default, launcher
  normalization, timeout behavior).
- Drop duplicate test (e.g. gfx_compilation in dockerfile_utils) and trim
  edge-case bloat; keep important behavior.
- Reduce from 16 to 12 unit test files; 203 tests passing.

Integration tests (tests/integration/):
- Fix 6 failures in multi-GPU-arch tests: use madengine.execution.dockerfile_utils
  instead of removed DockerBuilder private methods.
- Merge error tests into test_errors.py (CLI + unified system, dedupe setup_logging
  and context serialization).
- Merge batch manifest into test_orchestrator_workflows; merge multi_gpu_arch into
  test_platform_integration.
- Reduce from 10 to 7 integration test files; 155 passed, 1 skipped.

* - Add shared e2e helpers in tests/fixtures/utils.py:
  - build_run_command(), assert_model_in_perf_csv(),
    get_run_live_log_path(), get_timeout_seconds_from_log()
  - DEFAULT_CLEAN_FILES for consistent cleanup lists
- Use DEFAULT_CLEAN_FILES (and + extras) in all e2e modules:
  test_run_workflows, test_execution_features, test_data_workflows,
  test_build_workflows, test_profiling_workflows, test_scripting_workflows
- Parametrize context tests: merge test_dockerfile_picked_on_detected_context_0/1
  into test_dockerfile_picked_on_detected_context (ctx_value, expected_performance)
- Parametrize timeout-in-log tests: replace four separate tests with
  test_timeout_value_in_log (tags, log_base_name, expected_seconds, extra_args)
- Leave timeout-fire and other e2e tests unchanged; 60 e2e tests still pass

* Stop tracking run_directory (now in .gitignore)

* Multi-node SLURM: No node writes perf.csv; after the job, collect_results parses each node’s .out, aggregates (e.g. sum for samples_per_second), and writes one row. 2N×8G should report about 2× the 1N×8G throughput.

* Refactored the slurm_output to slurm_results to improve the management of result collected from each node

* Fixed the workload's output to slurm results

* Refactored the reporting system for local, k8s and slurm, respectively, to handle single and multi results

* Fixed the multi results collection for multi node

* Fixed the test duration missing in multi results case

* Improve madengine reliability, error handling, and exit codes for Jenkins

- Record pre-run failures (e.g. pull/setup) in the performance CSV with status
  FAILURE and add to failed_runs so the perf table and exit code stay correct.
- Preserve typer.Exit in cli_main so run/build exit codes (0/1/2/3/4) are
  passed through to the shell/Jenkins.
- Use ExitCode.BUILD_FAILURE (2) when BuildError is raised; let BuildError
  propagate from RunOrchestrator instead of wrapping in MADRuntimeError.
- Make update_perf_csv create the perf CSV with PERF_CSV_HEADER when the file
  is missing but a result (exception/single/multiple) is provided.
- Use shared PERF_CSV_HEADER in container_runner.ensure_perf_csv_exists and
  fix read_json to use a context manager for the file handle.
- Add unit tests for exit codes, cli_main typer.Exit preservation, BuildError
  → BUILD_FAILURE, setup-failure recording to CSV, and update_perf_csv
  create-when-missing; fix unclosed file in update_perf_csv.
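The exit-code propagation above can be sketched as follows; SystemExit stands in for typer.Exit, and only BUILD_FAILURE = 2 is stated in the commit, the rest is assumed:

```python
class BuildError(Exception):
    pass

BUILD_FAILURE = 2  # ExitCode.BUILD_FAILURE from the commit; other codes assumed

def cli_main(run):
    # Sketch of the pattern described above: map BuildError to exit code 2
    # and let an explicit exit request (typer.Exit in the real CLI,
    # SystemExit here) pass through unchanged instead of being swallowed.
    try:
        run()
        return 0
    except BuildError:
        return BUILD_FAILURE
    except SystemExit as exc:
        return int(exc.code or 0)
```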

* Cleanup

* Updated docs of project

* update model discovery to handle tags in subdirectories (#83)

---------

Co-authored-by: leconcio <leconcio@amd.com>

* feat(discover): support scope/tag to filter by scripts subdir and tag

Add --tags scope/tag (e.g. MAD-private/inference) so discovery is limited
to models under scripts/<scope>/ with the given tag. Exactly one / and
no : in the tag enables scoped matching; scope/all selects all models
in that scope.

- Parse scope/tag in DiscoverModels.select_models()
- Add _model_entry_has_tag() for list or comma-separated tag checks
- Document scoped tags in discover command help
- Add unit tests in tests/unit/test_discover_models.py
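The scope/tag rule above reduces to a small parser, sketched here with an assumed helper name:

```python
def parse_scoped_tag(tag: str):
    # Sketch of the rule described above: exactly one '/' and no ':' enables
    # scoped matching; anything else is treated as a plain tag. A result of
    # (scope, "all") would then select every model under scripts/<scope>/.
    if tag.count("/") == 1 and ":" not in tag:
        scope, _, sub = tag.partition("/")
        return scope, sub
    return None, tag
```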

* Cleanup

* fix: dummy fixtures and core for multi-node and SLURM

- Update .gitignore, docker.py, slurm.py
- dummy_vllm: Dockerfile (coreutils), run.sh, run_vllm_inference.py
- dummy_torchtitan: Dockerfile, run.sh
- dummy_sglang: README, run.sh, run_sglang_inference.py

* test: consolidate run exit-code smoke tests in test_cli.py

* Updated README of project

* madengine v2 refactor(k8s) secrets handling, launcher mixin, and template updates (#88)

* refactor(k8s): secrets handling, launcher mixin, and template updates

- Add k8s_secrets helpers for image pull / runtime secrets with configurable
  strategies; add unit tests.
- Introduce KubernetesLauncherMixin for torchrun and related launcher command
  generation; wire into kubernetes deployment.
- Update K8s Job/ConfigMap templates, k8s defaults preset, and kubernetes.py.
- Refresh configuration/deployment docs and k8s-configs README.
- Adjust tests (conftest, dummy credential fixture) for new behavior.
- Fix CLI merge of --additional-context with --additional-context-file when
  the CLI value uses Python dict syntax (fallback after JSON parse failure).

* fix(k8s): Path import for vLLM/SGLang commands; public registry secret helper

- Import Path at module level in kubernetes_launcher_mixin so
  _generate_vllm_command and _generate_sglang_command do not raise
  NameError; drop redundant inner import in _generate_torchrun_command.
- Rename _build_registry_secret_data to build_registry_secret_data and
  document it as the supported API for dockerconfigjson preview logic.
- Update kubernetes.py and unit tests accordingly.

* feat(cli): add default values for gpu_vendor and guest_os in build CLI (#89)

* feat(cli): add default values for gpu_vendor and guest_os in build command

Add sensible defaults (AMD/UBUNTU) to simplify build operations for the
common use case while maintaining full backward compatibility and explicit
override capability.

* fix(build): wire gpu_vendor/guest_os defaults into Context for Dockerfile filtering

validate_additional_context already applied AMD/UBUNTU defaults, but the merged
dict was not passed to BuildOrchestrator, so Context.filter() saw no vendor keys
and could build both AMD and NVIDIA Dockerfiles.

- Add core/additional_context_defaults.py with apply_build_context_defaults()
- Refactor CLI validators to use the shared helper
- build command: pass repr(validated context) and drop duplicate file load
- BuildOrchestrator: apply defaults after merge; build Context from final
  additional_context post-ConfigLoader
- Update unit/integration tests; add tests for defaults helper

* feat(cli): validate additional context and scope GPU arch warnings

Add structural checks for --additional-context (nested types, known keys)
with clear typer.Exit messages and examples. Unify run merge/validate with
build when context is non-empty.

Remove the duplicate MAD_SYSTEM_GPU_ARCHITECTURE info line from build-only
Context init. After model discovery, warn only when a selected Dockerfile
needs ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default and the effective
arch cannot be resolved (context or Dockerfile).

Add dockerfile_requires_explicit_mad_arch_build_arg helper, drop duplicate
DockerBuilder methods, and extend unit tests (validators, CLI, Dockerfile
heuristic, orchestrator warning, context messages).

* fix(cli): normalize gpu_vendor/guest_os; clean unused imports in validators
- After validation, write canonical uppercase gpu_vendor and guest_os into the
  context dict so Context.filter() matches exactly.
- Align test_validate_additional_context_case_insensitive with that behavior.
- Drop unused imports: Panel in validators.py; patch/MagicMock in
  test_validators.py (flake8 F401).

* docs(readme): fix misleading AMD/UBUNTU wording for `madengine run` quick start

* docs(usage): fix misleading AMD/UBUNTU wording for `madengine run` quick start

* Fixed the docker registry url

* Added unit tests for docker builder

* madengine v2: k8s PVC collection hardening (#90)

* fix(k8s): harden collect_results and align PVC defaults with examples
- kubernetes: wait for terminal pod phase before recording per-node status;
  read primary workload container exit code by name (not container_statuses[0])
- presets + PVC templates: align shared/results PVC behavior with nfs-banff / RWX usage
- examples: update torchrun single- and multi-node JSON configs; document in k8s-configs README

* feat(k8s): per-pod results layout, PVC collection hardening, expand k8s tests
Kubernetes deployment:
- Collect PVC artifacts into k8s_results/<job>/<pod>/pvc/ by matching
  /results/<subdir>/ to full pod names (indexed Job pods, unmapped fallback).
- Harden collector: short settle delay, retries for listing /results, explicit
  -c collector on kubectl exec/cp, clearer warnings when collection fails.

Docs:
- Document local k8s_results layout (pod.log vs pvc/) and multi-node Service
  behavior (PyTorch vs Ray notes in k8s-configs README).

Tests:
- Rename test_k8s_secrets.py to test_k8s.py; keep secrets/configmap tests and
  add unit tests for PVC subdir ↔ pod name mapping helpers.

* feat(cli): add --skip-model-run for local full workflows (#91)

Restore v1-style --skip-model-run on madengine run: model execution
is skipped only when this invocation ran a build (_did_build_phase). If an
existing --manifest-file is used, the flag is ignored and a warning is printed.

- CLI: new --skip-model-run, passed through args; workflow panel and success
  messaging when the run phase is skipped
- RunOrchestrator: short-circuit before local/distributed execute; empty run
  summary with skipped_model_run; use (skip_model_run is True) so MagicMock test
  doubles do not enable the path by accident
- Tests: unit coverage for skip vs run-only; help text; conftest skip_model_run

* feat(execution): configurable log error pattern scan (#92)

* feat(execution): configurable log error pattern scan (additional_context)
Allow disabling or tuning post-run log grep via additional_context and/or
model_info:

- log_error_pattern_scan / disable_log_error_scan to skip substring scan
- log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs)
- log_error_patterns to override default pattern list

Extract resolution into container_runner_helpers.resolve_log_error_scan_config;
validate new keys in CLI validators; add unit tests.

Motivation: avoid false FAILURE from benign RuntimeError: lines in suites
already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571).

* feat(execution): configurable log error pattern scan; document and drop duplicate flag
Add optional additional_context / model keys to control post-run log substring
checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns).
Resolve settings in container_runner_helpers; wire into ContainerRunner; validate
CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only.

Document behavior in README, docs/configuration.md, cli-reference, usage, profiling,
and docs index. Cross-link troubleshooting for benign RuntimeError: false failures.

Tests: extend test_container_runner_helpers; unit suite passes.

* refactor(container): Python log error scan; split benign literal vs regex; test validators (#93)

* refactor: scan container logs in Python; validate log_error_* in tests
Replace sh/grep-based log error detection with in-process scanning to avoid
shell quoting and injection issues when patterns come from model_info or
additional_context.

- Add log_text_has_error_pattern() with literal benign substrings vs
  explicit regex benign rules (ROCProf).
- Wire ContainerRunner to read the log once and use the helper.
- Document log_error_benign_patterns as literal substrings in helpers.
- Extend validate_additional_context tests for log_error_pattern_scan,
  log_error_benign_patterns, and log_error_patterns (valid + invalid).
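A hedged sketch of the in-process scan (the real log_text_has_error_pattern signature in container_runner_helpers may differ):

```python
import re

def log_text_has_error_pattern(log_text, error_patterns,
                               benign_literals=(), benign_regexes=()):
    # Sketch of the scan described above: a line counts as an error when it
    # contains any error pattern and is not excused by a literal benign
    # substring or an explicit benign regex. No shell/grep is involved, so
    # patterns from model_info or additional_context cannot cause quoting
    # or injection issues.
    for line in log_text.splitlines():
        if not any(p in line for p in error_patterns):
            continue
        if any(b in line for b in benign_literals):
            continue
        if any(re.search(rx, line) for rx in benign_regexes):
            continue
        return True
    return False
```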

* refactor: improve code quality and streamline unit tests (#94)

---------

Co-authored-by: Satya Nikhil <skodukul@amd.com>
Co-authored-by: leconcio <leconcio@amd.com>