Skip to content

madengine v2 improve code quality#81

Merged
coketaste merged 17 commits intocoketaste/refactor-disfrom
coketaste/refactor-dis-quality
Mar 18, 2026
Merged

madengine v2 improve code quality#81
coketaste merged 17 commits intocoketaste/refactor-disfrom
coketaste/refactor-dis-quality

Conversation

@coketaste
Copy link
Collaborator

  1. Dead code and broken tests
    1.1 Remove or replace tests/test_cleanup.py
  2. Duplication and shared helpers
    2.1 Constants loading in core/constants.py
    2.2 Deployment init and Jinja2 setup (SLURM vs K8s)
    2.3 Run-details / perf row construction (K8s + container_runner)
    2.4 scripts_base_dir and madengine root

…uplication

- Remove dead test: delete tests/test_cleanup.py (imported removed
  madengine.tools.run_models).
- Constants: add _get_env_or_creds_or_default() in core/constants.py and
  refactor NAS_NODES, MAD_AWS_S3, MAD_MINIO, PUBLIC_GITHUB_ROCM_KEY to use
  it; add tests/unit/test_constants.py.
- Path helpers: add utils/path_utils.py with scripts_base_dir_from() and
  get_madengine_root(); use in container_runner and kubernetes; add
  tests/unit/test_path_utils.py.
- Run-details: add utils/run_details.py with get_pipeline(), get_build_number(),
  and flatten_tags_in_place(); refactor container_runner and kubernetes to use
  them; add tests/unit/test_run_details.py.
- Deployment: add apply_deployment_config() in config_loader and create_jinja_env()
  in base; refactor slurm and kubernetes init to use them; add
  TestApplyDeploymentConfig and tests/unit/test_deployment_base.py.

All changes preserve behavior. Unit test suite (157+ tests) passes.
Unit tests (tests/unit/):
- Remove duplicate get_rocm_path coverage from test_constants (kept in test_rocm_path).
- Consolidate by topic: test_utils (path_utils + run_details), test_deployment
  (base + common), test_execution (container_runner_helpers + dockerfile_utils),
  test_orchestration (image_filtering + orchestrator_logic).
- Parametrize and merge redundant cases (e.g. falsy inputs, env/default, launcher
  normalization, timeout behavior).
- Drop duplicate test (e.g. gfx_compilation in dockerfile_utils) and trim
  edge-case bloat; keep important behavior.
- Reduce from 16 to 12 unit test files; 203 tests passing.

Integration tests (tests/integration/):
- Fix 6 failures in multi-GPU-arch tests: use madengine.execution.dockerfile_utils
  instead of removed DockerBuilder private methods.
- Merge error tests into test_errors.py (CLI + unified system, dedupe setup_logging
  and context serialization).
- Merge batch manifest into test_orchestrator_workflows; merge multi_gpu_arch into
  test_platform_integration.
- Reduce from 10 to 7 integration test files; 155 passed, 1 skipped.
  - build_run_command(), assert_model_in_perf_csv(),
    get_run_live_log_path(), get_timeout_seconds_from_log()
  - DEFAULT_CLEAN_FILES for consistent cleanup lists
- Use DEFAULT_CLEAN_FILES (and + extras) in all e2e modules:
  test_run_workflows, test_execution_features, test_data_workflows,
  test_build_workflows, test_profiling_workflows, test_scripting_workflows
- Parametrize context tests: merge test_dockerfile_picked_on_detected_context_0/1
  into test_dockerfile_picked_on_detected_context (ctx_value, expected_performance)
- Parametrize timeout-in-log tests: replace four separate tests with
  test_timeout_value_in_log (tags, log_base_name, expected_seconds, extra_args)
- Leave timeout-fire and other e2e tests unchanged; 60 e2e tests still pass
…ults parses each node’s .out, aggregates (e.g. sum for samples_per_second), and writes one row. 2N×8G should report about 2× the 1N×8G throughput.
@coketaste coketaste self-assigned this Mar 8, 2026
coketaste and others added 9 commits March 9, 2026 15:18
…kins

- Record pre-run failures (e.g. pull/setup) in the performance CSV with status
  FAILURE and add to failed_runs so the perf table and exit code stay correct.
- Preserve typer.Exit in cli_main so run/build exit codes (0/1/2/3/4) are
  passed through to the shell/Jenkins.
- Use ExitCode.BUILD_FAILURE (2) when BuildError is raised; let BuildError
  propagate from RunOrchestrator instead of wrapping in MADRuntimeError.
- Make update_perf_csv create the perf CSV with PERF_CSV_HEADER when the file
  is missing but a result (exception/single/multiple) is provided.
- Use shared PERF_CSV_HEADER in container_runner.ensure_perf_csv_exists and
  fix read_json to use a context manager for the file handle.
- Add unit tests for exit codes, cli_main typer.Exit preservation, BuildError
  → BUILD_FAILURE, setup-failure recording to CSV, and update_perf_csv
  create-when-missing; fix unclosed file in update_perf_csv.
@coketaste coketaste changed the title Improve code quality madengine v2 Improve code quality Mar 18, 2026
@coketaste coketaste changed the title madengine v2 Improve code quality madengine v2 improve code quality Mar 18, 2026
@coketaste coketaste merged commit cdb26b0 into coketaste/refactor-dis Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants