update model discovery to handle tags in subdirectories for madenginev1 by leconcio · Pull Request #83 · ROCm/madengine

leconcio · 2026-03-11T21:41:49Z

Motivation

update model discovery to handle tags in subdirectories for AIMODELS-518

Technical Details

Added "tags" for subdirectory model dictionary as {SUBDIR}/{tag} to enable user to isolate models in subdirectory

Test Plan

Check if subdirectory models can be filtered using tags, this was verified on the CI pipeline

Test Result

https://ml-ci-internal.amd.com/job/DLM/job/DeepLearningModels/view/change-requests/job/PR-2887/2/

Submission Checklist

[ X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* refactor(madengine): improve code quality, remove dead code, reduce duplication - Remove dead test: delete tests/test_cleanup.py (imported removed madengine.tools.run_models). - Constants: add _get_env_or_creds_or_default() in core/constants.py and refactor NAS_NODES, MAD_AWS_S3, MAD_MINIO, PUBLIC_GITHUB_ROCM_KEY to use it; add tests/unit/test_constants.py. - Path helpers: add utils/path_utils.py with scripts_base_dir_from() and get_madengine_root(); use in container_runner and kubernetes; add tests/unit/test_path_utils.py. - Run-details: add utils/run_details.py with get_pipeline(), get_build_number(), and flatten_tags_in_place(); refactor container_runner and kubernetes to use them; add tests/unit/test_run_details.py. - Deployment: add apply_deployment_config() in config_loader and create_jinja_env() in base; refactor slurm and kubernetes init to use them; add TestApplyDeploymentConfig and tests/unit/test_deployment_base.py. All changes preserve behavior. Unit test suite (157+ tests) passes. * test: reorganize unit and integration tests, reduce duplication Unit tests (tests/unit/): - Remove duplicate get_rocm_path coverage from test_constants (kept in test_rocm_path). - Consolidate by topic: test_utils (path_utils + run_details), test_deployment (base + common), test_execution (container_runner_helpers + dockerfile_utils), test_orchestration (image_filtering + orchestrator_logic). - Parametrize and merge redundant cases (e.g. falsy inputs, env/default, launcher normalization, timeout behavior). - Drop duplicate test (e.g. gfx_compilation in dockerfile_utils) and trim edge-case bloat; keep important behavior. - Reduce from 16 to 12 unit test files; 203 tests passing. Integration tests (tests/integration/): - Fix 6 failures in multi-GPU-arch tests: use madengine.execution.dockerfile_utils instead of removed DockerBuilder private methods. - Merge error tests into test_errors.py (CLI + unified system, dedupe setup_logging and context serialization). - Merge batch manifest into test_orchestrator_workflows; merge multi_gpu_arch into test_platform_integration. - Reduce from 10 to 7 integration test files; 155 passed, 1 skipped. * - Add shared e2e helpers in tests/fixtures/utils.py: - build_run_command(), assert_model_in_perf_csv(), get_run_live_log_path(), get_timeout_seconds_from_log() - DEFAULT_CLEAN_FILES for consistent cleanup lists - Use DEFAULT_CLEAN_FILES (and + extras) in all e2e modules: test_run_workflows, test_execution_features, test_data_workflows, test_build_workflows, test_profiling_workflows, test_scripting_workflows - Parametrize context tests: merge test_dockerfile_picked_on_detected_context_0/1 into test_dockerfile_picked_on_detected_context (ctx_value, expected_performance) - Parametrize timeout-in-log tests: replace four separate tests with test_timeout_value_in_log (tags, log_base_name, expected_seconds, extra_args) - Leave timeout-fire and other e2e tests unchanged; 60 e2e tests still pass * Stop tracking run_directory (now in .gitignore) * Multi-node SLURM: No node writes perf.csv; after the job, collect_results parses each node’s .out, aggregates (e.g. sum for samples_per_second), and writes one row. 2N×8G should report about 2× the 1N×8G throughput. * Refactored the slurm_output to slurm_results to improve the management of result collected from each node * Fixed the workload's output to slurm results * Refactored the reporting system for local, k8s and slurm, respectively, to handle single result and mult results * Fixed the multi results collection for multi node * Fixed the test duration missing in multi results case * Improve madengine reliability, error handling, and exit codes for Jenkins - Record pre-run failures (e.g. pull/setup) in the performance CSV with status FAILURE and add to failed_runs so the perf table and exit code stay correct. - Preserve typer.Exit in cli_main so run/build exit codes (0/1/2/3/4) are passed through to the shell/Jenkins. - Use ExitCode.BUILD_FAILURE (2) when BuildError is raised; let BuildError propagate from RunOrchestrator instead of wrapping in MADRuntimeError. - Make update_perf_csv create the perf CSV with PERF_CSV_HEADER when the file is missing but a result (exception/single/multiple) is provided. - Use shared PERF_CSV_HEADER in container_runner.ensure_perf_csv_exists and fix read_json to use a context manager for the file handle. - Add unit tests for exit codes, cli_main typer.Exit preservation, BuildError → BUILD_FAILURE, setup-failure recording to CSV, and update_perf_csv create-when-missing; fix unclosed file in update_perf_csv. * Cleanup * Updated docs of project * update model discovery to handle tags in subdirectories (#83) --------- Co-authored-by: leconcio <leconcio@amd.com>

update model discovery to handle tags in subdirectories

0ddf97d

leconcio changed the title ~~update model discovery to handle tags in subdirectories~~ update model discovery to handle tags in subdirectories for madenginev! Mar 11, 2026

leconcio changed the title ~~update model discovery to handle tags in subdirectories for madenginev!~~ update model discovery to handle tags in subdirectories for madenginev1 Mar 11, 2026

gargrahul merged commit 1a3515d into main Mar 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update model discovery to handle tags in subdirectories for madenginev1#83

update model discovery to handle tags in subdirectories for madenginev1#83
gargrahul merged 1 commit intomainfrom
update-subdirectory-discovery-v1

leconcio commented Mar 11, 2026 •

edited by gargrahul

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

leconcio commented Mar 11, 2026 • edited by gargrahul Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

leconcio commented Mar 11, 2026 •

edited by gargrahul

Loading