refactor(discover): scope-based model tag selection (unscoped vs scoped)#109
refactor(discover): scope-based model tag selection (unscoped vs scoped)#109
Conversation
… selection Introduce a `strict` parameter on `DiscoverModels` (default `False`). When `strict=True`, tag resolution only matches on the full model name, preventing a bare tag like `pyt_huggingface_gpt2` from inadvertently selecting namespaced models (`MAD/pyt_huggingface_gpt2`, `MAD-private/pyt_huggingface_gpt2`). `BuildOrchestrator` (used by `madengine run`) now passes `strict=True`, so `run` does exact-name matching while `discover` retains the existing loose behaviour. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a strict mode to model discovery to enforce exact full-name matching (intended for madengine run) while keeping the existing loose/tag-based matching for discover.
Changes:
- Added
strict: bool = Falseparameter toDiscoverModelsand used it to control selection matching logic. - Updated model selection to use exact-name matching when
strict=True(unscoped tags path). - Updated
BuildOrchestratorto instantiateDiscoverModels(..., strict=True).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/madengine/utils/discover_models.py | Introduces strict flag and updates model selection matching behavior. |
| src/madengine/orchestration/build_orchestrator.py | Enables strict discovery when building via BuildOrchestrator. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Remove the strict flag and short-name backward-compat matching from DiscoverModels. Scoping is now determined solely by the presence of a scope/ prefix in the tag: - Unscoped tags (e.g. inference, pyt_foo) match by tag-list field or exact root-level name; they never cross into a scoped namespace via short-name split. - Scoped tags (e.g. MAD/inference, MAD/pyt_foo) restrict candidates to the named scope, then match by tag-list or exact full name as before. This restores madengine run --tags <tag> loose selection while preventing unintended cross-scope name leakage. Remove strict_discovery from BuildOrchestrator.execute() and RunOrchestrator._build_phase() as they are no longer needed. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR aims to introduce strict exact-name selection for model discovery (primarily for madengine run) to avoid unscoped tags accidentally selecting scoped/namespaced models, while keeping discover behavior loose by default.
Changes:
- Updated
DiscoverModels.select_models()to stop matching models by short-name suffix (name.split('/')[-1]). - Refactored/rewrote unit tests to reflect the new unscoped tag-selection semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/madengine/utils/discover_models.py |
Removes short-name fallback matching for both standard and custom models during tag selection. |
tests/unit/test_discover_models.py |
Updates test cases and expectations around unscoped tag selection and scope boundaries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Refactors DiscoverModels tag selection so unscoped tags no longer “short-name” match scoped models via name.split('/')[-1], aligning discovery behavior with tag shape (unscoped vs scope/...) and updating unit tests accordingly.
Changes:
- Removed short-name suffix matching (
split('/')[-1]) for bothmodels.jsonentries andCustomModels during selection. - Updated unit tests to validate unscoped name matching stays root-only while tag-field matching remains scope-agnostic.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/madengine/utils/discover_models.py |
Removes suffix-based model-name matching in select_models() to prevent unscoped tags from crossing scope boundaries by short name. |
tests/unit/test_discover_models.py |
Replaces short-name backward-compat tests with coverage for unscoped vs scoped selection behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… in unscoped branch When extra args are appended with colon syntax (e.g. inference:batch-size=32), the raw tag string was passed to _model_entry_has_tag and the 'all' check instead of the already-extracted model_name (the portion before ':'). This caused tag-list matching to silently fail for any --tags value that included extra args, because it searched for 'inference:batch-size=32' in the model's tags list instead of 'inference'. Fix both the models and custom_models loops in the unscoped branch to use model_name consistently. Add a regression test for this case. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Clarify that unscoped tag name-based matching is root-only, but tag-field matching is scope-agnostic and can select models in any scope. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Add get_gpu_arch() utility to tests/fixtures/utils.py for runtime GPU architecture detection (AMD/NVIDIA), used by test fixtures - Add dynamic_skip_gpu_arch_model_dir fixture that injects the current GPU arch into the skip list, so the test works on any machine (not just gfx942 nodes) - Fix gen_sys_env_details context override in container_runner.py: the old 'or' condition made the context key a no-op since the parameter defaults to True; now context can explicitly disable it - Disable sys env collection in test_docker_gpus via additional_context so the pre-script doesn't OOM-kill on MI350X (gfx950) with 6 GPUs Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Refactors model discovery tag selection so scoping is inferred from tag shape (scope/tag vs unscoped), eliminating cross-scope short-name leakage and fixing tag matching when :extra-args are present.
Changes:
- Updated
DiscoverModels.select_models()to remove short-name (split('/')[-1]) matching and to match tags using the pre-colon token (model_name) soinference:...works. - Added/updated unit tests to validate unscoped vs scoped selection semantics and scope boundaries.
- Made sys-env collection in
ContainerRunnerexplicitly overridable viaadditional_context.gen_sys_env_details, and adjusted e2e tests/fixtures accordingly.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/discover_models.py |
Implements scope-shaped tag selection and fixes tag matching when tags include :extra-args. |
tests/unit/test_discover_models.py |
Updates coverage for unscoped tags (root-only name match, scope-agnostic tag match) and scoped tag behavior. |
src/madengine/execution/container_runner.py |
Allows additional_context to explicitly disable sys-env detail collection. |
tests/e2e/test_run_workflows.py |
Disables sys-env detail collection for a GPU-binding workflow via additional_context. |
tests/fixtures/utils.py |
Introduces get_gpu_arch() helper with caching for e2e fixture manipulation. |
tests/e2e/test_build_workflows.py |
Adds fixture that mutates models.json to include the detected GPU arch in skip_gpu_arch to reduce hardware coupling. |
CHANGELOG.md |
Documents the discovery behavior changes and related fixes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- get_gpu_arch() now returns a short model token (e.g. A100, H100) on NVIDIA instead of the full product string (NVIDIA A100-SXM4-40GB), matching the format expected by skip_gpu_arch model definitions - AMD path now resolves ROCm via get_rocm_path() (consistent with has_gpu()) - filter_images_by_skip_gpu_arch() applies the same two-step normalization (strip NVIDIA prefix, drop SKU suffix) so runtime arch matches skip list - Added @requires_gpu to test_commandline_argument_skip_gpu_arch and test_commandline_argument_disable_skip_gpu_arch_fail to skip on CPU-only CI Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…o coketaste/v2-discover
There was a problem hiding this comment.
Pull request overview
Refines model discovery tag selection so scoping is determined by tag shape (unscoped vs scope/...), reducing accidental cross-scope model selection while keeping tag-based selection flexible. Also updates GPU-arch skip behavior and E2E stability by improving GPU architecture detection and making system-env collection controllable via runtime context.
Changes:
- Adjusted
DiscoverModels.select_models()matching to prevent unscoped name tags from crossing scope boundaries and to correctly handle:<extra-args>tags. - Added GPU architecture detection helper for tests and made E2E skip-gpu-arch tests hardware-agnostic.
- Made system environment collection optionally disable-able via context and updated E2E workflow to avoid OOM in system-env pre-scripts.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/discover_models.py |
Updates tag parsing/matching rules for unscoped vs scoped tags and fixes :<extra-args> handling. |
tests/unit/test_discover_models.py |
Replaces/updates unit tests to validate new unscoped/scoped tag-selection semantics. |
src/madengine/orchestration/image_filtering.py |
Aligns NVIDIA GPU-arch normalization for skip_gpu_arch filtering. |
tests/fixtures/utils.py |
Adds cached get_gpu_arch() helper used by E2E tests for dynamic skip list injection. |
tests/e2e/test_build_workflows.py |
Introduces dynamic fixture for skip_gpu_arch tests and guards them with requires_gpu. |
src/madengine/execution/container_runner.py |
Fixes precedence so context can disable system environment collection scripts. |
tests/e2e/test_run_workflows.py |
Disables sys-env collection for a GPU-binding test to avoid OOM in pre-scripts. |
CHANGELOG.md |
Documents the discovery behavior changes and test/runtime adjustments. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| skip_list = [arch.strip() for arch in skip_gpu_arch_str.split(",")] | ||
| sys_gpu_arch = runtime_gpu_arch | ||
| if sys_gpu_arch and "NVIDIA" in sys_gpu_arch: | ||
| sys_gpu_arch = sys_gpu_arch.split()[1] | ||
| # Normalize "NVIDIA A100-SXM4-40GB" -> "A100", "NVIDIA H100 PCIe" -> "H100" | ||
| # (mirrors get_gpu_arch() in tests/fixtures/utils.py) | ||
| after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA ") :] | ||
| sys_gpu_arch = after_prefix.split("-")[0].split()[0] | ||
|
|
||
| if sys_gpu_arch in skip_list: |
There was a problem hiding this comment.
the Copilot comment as a false positive.
Summary
Refines how
--tagsselect models in discovery so that scoping is driven by the tag shape (unscoped vsScope/...), instead of astrictflag or short-name suffix matching across scopes.Behavior
inference,pyt_foo): match models via the tag list, or by exact name among root-level (unscoped) models only. They do not follow into a scoped tree vianame.split("/")[-1].MAD/inference,MAD/pyt_foo): restrict candidates to the given scope, then match by tag list or full name as before.This restores loose
madengine run --tags <tag>selection (tag-list / root-level naming) while stopping accidental selection ofMAD/...orMAD-private/...models from a bare unscoped tag.Implementation notes
strictparameter and related plumbing fromBuildOrchestrator/RunOrchestrator(nostrict_discovery).Test plan
madengine discover/runwith unscoped and scoped tags cover expected models only.tests/unit/test_discover_models.py