[FEATURE] Extensible tag classification model discovery through Entry Points #463
Roel Bollens (RoelBollens-TomTom) wants to merge 18 commits into dev
filters = []

if tags:
    filters.append(lambda key: all(tag in key.tags for tag in tags))
all() here means --tag foo --tag bar requires BOTH tags (AND). Is that the intended semantics, or should repeated --tag flags use OR?
This was intentional, but I guess this should still be formalized so I'll leave this open for discussion.
from the coding session minutes:
--tag semantics are AND
The mockups assume --tag overture --tag system:feature means "must have both tags" (AND). This contradicts the design doc's stated OR semantics. The group implicitly operated with AND throughout the exercise and no one objected. This should be formalized.
Ah, great. I'm good with that; it didn't track with how I'd been thinking about it earlier is all.
I'll still leave this open. Maybe after refactoring the other CLI commands opinions might change.
What we discussed this morning in the coding session: AND makes the most sense for CLI list, but there are other use cases where OR fits better.
I did some very cursory LLM-driven research about common CLI paradigms that support both AND and OR in filtering. The best examples I could come up with were kubectl label selectors with -l and docker filters, but neither fits our exact use case.
I thought about it a bit and came up with some priorities:
- CLI should support both AND and OR.
- Should not use any special shell characters, especially characters like ! that can get expanded even within double quotes.
- The most common use cases should be simple.
- Should not require a lot of typing.
The best idea I can come up with is to allow --tag to express an "OR of ANDs" by supporting multiple --tag values.
- --tag can be repeated as many times as you want.
- Each --tag supports a plus-sign-separated list. (Alternative: comma.)
- Within a --tag the plus-separated values form an AND expression.
- The multiple --tag arguments are unioned together as a big OR expression.
- This allows you to express an "OR of ANDs" which is probably powerful enough for most CLI use cases.
- I would also suggest allowing * wildcards to further strengthen the CLI power.
Examples
Get all models matching single tag, foo.
--tag foo
Get all models that are both foo AND bar.
--tag foo+bar
Get all models that are in an Overture theme AND also bar.
--tag overture:theme=*+bar
Get all models that are either foo OR bar.
--tag foo --tag bar
Get all models that are either Overture transportation features or Overture places features.
--tag feature+overture:theme=transportation --tag feature+overture:theme=places
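The "OR of ANDs" proposal above can be sketched in a few lines of Python. This is only an illustration, not code from the PR: parse_tag_args and matches are hypothetical names, and the standard library's fnmatchcase stands in for the suggested * wildcard (it also gives ? and [...] special meaning, which the proposal does not ask for).

```python
from fnmatch import fnmatchcase


def parse_tag_args(tag_args: list[str]) -> list[list[str]]:
    """Split each --tag value on '+' into one AND-group of patterns."""
    return [[part for part in arg.split("+") if part] for arg in tag_args]


def matches(model_tags: set[str], groups: list[list[str]]) -> bool:
    """True if at least one group has every pattern matched by some tag."""
    if not groups:
        return True  # assumption: no --tag at all means no filtering
    return any(
        all(any(fnmatchcase(tag, pattern) for tag in model_tags) for pattern in group)
        for group in groups
    )


# --tag feature+overture:theme=transportation --tag feature+overture:theme=places
groups = parse_tag_args(
    ["feature+overture:theme=transportation", "feature+overture:theme=places"]
)
print(matches({"feature", "overture:theme=places"}, groups))
```

The outer any() is the OR across repeated --tag flags and the inner all() is the AND within one flag, so a single plain --tag foo degenerates to a one-element group.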
--tag foo+bar I'm less a fan of, because it introduces a query language in the value. I'd rather keep --tag as a simple repeatable flag. But I like adding the wildcard for namespace:* and ns:predicate=*.
A git log style --all-match is something I considered to switch --tag values from OR logic to AND. My concern with that is that some commands may want AND by default, and that still doesn't give a clean way to support both AND and OR on the same command, if that's at all desired.
Alternative ideas:
- --tag to include, --filter to exclude
- -i/--include to include, -e/--exclude to exclude
- existing is --tags and --exclude-tags
I'm still struggling to wrap my head around whether AND or OR is used because I don't have a mental model that leads to expectations. I think I expect it to differ by command. However, in the process of writing this out, my current inclination is:
- --tag treated as OR to create a preliminary filter / result.
- --filter (or --include for symmetry with --exclude, although the AND/OR inconsistency is trouble) treated as AND, applied to the preliminary result with AND to positively filter using large sets.
- --exclude treated as OR, applied to the preliminary result with AND to filter out smaller sets.
The idea is that --tag does an initial select to reduce "everything" to an inclusive result, and then --filter draws a smaller circle and --exclude does a targeted job of removing undesired models.
# I want to expand my criteria once I commit to filtering
# OR
overture-schema list-types \
--tag overture:theme=places \
--tag overture:theme=buildings
# I want to (positively) filter the preliminary results (from ORed tags) to datasets licensed under CDLA 2.0
# OR, with ANDed filters applied
overture-schema list-types \
--tag overture:theme=places \
--tag overture:theme=base \
--filter license=CDLA-Permissive-2.0
# I want to (negatively) filter out ODbL-licensed datasets
# OR, with ORed exclusions applied as filters
overture-schema list-types \
--tag overture:theme=base \
--tag overture:theme=divisions \
--exclude license=ODbL-1.0
# I want to produce an increasingly broad JSON Schema equivalent
# OR
overture-schema json-schema \
--tag overture:theme=places \
--tag overture:theme=buildings
# I want to select a narrow set of models to validate against
# AND, using --filter instead of --tag
overture-schema validate \
--filter overture:theme=base \
--filter source_type=raster
# I want to generate docs for an arbitrary set of models
# OR; AND would be too limiting
overture-schema-codegen generate --format markdown --output-dir /tmp/overture \
--tag overture:theme=places \
--tag overture:theme=buildings
Attempting to clarify my idea further (and use set algebra to describe it):
- --tag defines the scope. Without --tag, the scope is all models. With one or more --tag flags, the scope is the union of models matching any listed tag: --tag X --tag Y keeps models tagged X or Y.
- --filter narrows the scope. Each --filter predicate adds a requirement every model must satisfy: --filter X --filter Y keeps only models matching both (AND).
- --exclude removes from the scope. Models matching any listed exclusion are dropped: --exclude X --exclude Y drops models matching X OR Y.
So --tag is OR, --filter is AND, --exclude is OR-then-subtract.
Tag-based discovery is the primary entry point. OR is the right default for --tag: users reach for it to declare interest in any of several themes, and it is what we concluded while working through this together. --filter is for the stricter case where every result must satisfy an additional requirement.
Equivalent set algebra: result = (⋃ tags) ∩ (⋂ filters) \ (⋃ excludes), with absent classes
contributing no restriction.
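To make the set algebra concrete, here is a minimal sketch using plain Python sets. The function name select, the index mapping from tag to model names, and the model names themselves are all made up for illustration; the real discovery layer works on ModelKey objects rather than strings.

```python
from collections.abc import Sequence


def select(
    universe: set[str],
    index: dict[str, set[str]],
    tags: Sequence[str] = (),
    filters: Sequence[str] = (),
    excludes: Sequence[str] = (),
) -> set[str]:
    """Compute (union of tags) & (intersection of filters) - (union of excludes)."""
    # Absent --tag contributes no restriction: the scope is the whole universe.
    if tags:
        scope = set().union(*(index.get(tag, set()) for tag in tags))
    else:
        scope = set(universe)
    # Each --filter intersects; with no filters the scope is left unchanged.
    for flt in filters:
        scope &= index.get(flt, set())
    # Each --exclude subtracts; with no excludes nothing is removed.
    for exc in excludes:
        scope -= index.get(exc, set())
    return scope


universe = {"place", "building", "land"}
index = {
    "overture:theme=places": {"place"},
    "overture:theme=base": {"land"},
    "license=ODbL-1.0": {"land"},
}
print(select(universe, index,
             tags=["overture:theme=places", "overture:theme=base"],
             excludes=["license=ODbL-1.0"]))
```

Subtracting after intersecting matches the left-to-right reading of the formula, (T ∩ F) \ E.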
Victor Schappert (vcschapp)
left a comment
Left some comments, but I'm generally aligned and would merge once Roel Bollens (@RoelBollens-TomTom) and Seth Fitzsimmons (@mojodna) are jointly aligned on merging.
Left some thoughts on the AND/OR issue in the CLI, probably above there somewhere. 👆
from overture.schema.system.feature import Feature

logger = logging.getLogger(__name__)
This seems private, should it get the appropriate underscore prefix?
RESERVED_TAGS: dict[str, set[str]] = {
    "overture": {"overture-schema-core"},
    "feature": {"overture-schema-system"},
}
I thought feature_provider was intentionally emitting feature rather than system:feature to be less clunky, per the PR description.
I understood from the PR description that you can reserve both plain tags and namespaces.
Personally I like this approach.
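For illustration, a reservation check covering both plain tags and namespaces might look like the sketch below. RESERVED_TAGS is copied from the diff; RESERVED_NAMESPACES and is_allowed are assumptions made for this example (the PR description lists overture:* and system:* as reserved, but the data structure and function the PR actually uses for that are not shown here).

```python
RESERVED_TAGS: dict[str, set[str]] = {
    "overture": {"overture-schema-core"},
    "feature": {"overture-schema-system"},
}

# Assumed shape, mirroring RESERVED_TAGS: namespace -> packages allowed to use it.
RESERVED_NAMESPACES: dict[str, set[str]] = {
    "overture": {"overture-schema-core"},
    "system": {"overture-schema-system"},
}


def is_allowed(tag: str, package: str) -> bool:
    """Return True if `package` may emit `tag` (hypothetical helper)."""
    if tag in RESERVED_TAGS:
        return package in RESERVED_TAGS[tag]
    namespace, sep, _rest = tag.partition(":")
    if sep and namespace in RESERVED_NAMESPACES:
        return package in RESERVED_NAMESPACES[namespace]
    return True  # unreserved tags are open to any package


print(is_allowed("feature", "overture-schema-system"))
print(is_allowed("overture:theme=buildings", "some-third-party-package"))
```

Checking the plain tag first keeps feature and overture reservable on their own, independent of whether a namespace with the same name is reserved.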
TagProvider: TypeAlias = Callable[[type[BaseModel], ModelKey, set[str]], set[str]]

ModelDict: TypeAlias = dict[ModelKey, type[BaseModel]]

TagProviderDict: TypeAlias = dict[TagProviderKey, TagProvider]

ModelKeyFilter: TypeAlias = Callable[[ModelKey], bool]
Are any of these intended to be _Private, e.g. maybe TagProviderDict?
key = replace(
    key,
    tags=frozenset(generate_tags(model_class, key, tag_providers)),
)
Basically switching the return type to tuple[str] from set[str]?
I have no opinion on this.
You could go the other way and return collections.abc.Iterable[str] - that way anything that can produce a sequence of strings via iteration would suffice.
Signed-off-by: Roel <75250264+RoelBollens-TomTom@users.noreply.github.com>
Co-authored-by: Seth Fitzsimmons <sethfitz@amazon.com> Signed-off-by: Roel <75250264+RoelBollens-TomTom@users.noreply.github.com>
… filtering logic
- Removes overture tag provider (was deferred)
- Simplified tags
- Reserved tags instead of reserved namespaces
- Fixes small issue introduced in earlier commit
Signed-off-by: Roel <75250264+RoelBollens-TomTom@users.noreply.github.com>
… CLI commands
Signed-off-by: Roel <75250264+RoelBollens-TomTom@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
`filter_models` selects feature types from the registry through three
combinators applied to the same tag grammar (plain `feature`,
namespaced `system:extension`, or compound `overture:theme=buildings`):
    --tag      OR      defines scope (any-of)
    --filter   AND     narrows scope (all-of)
    --exclude  OR-NOT  subtracts (none-of)
    --type     OR      closed-list match on ModelKey.name (orthogonal)

    T = ⋃ tag predicates       (absent → U)
    F = ⋂ filter predicates    (absent → U)
    E = ⋃ exclude predicates   (absent → ∅)
    result = (T ∩ F \ E) restricted to type_names if non-empty
The mental model is procedural: --tag widens, --filter narrows,
--exclude subtracts. Without --tag the scope is every registered
model. An empty selector imposes no filtering.
A `TagSelector` value object carries the three tag predicates:
    class TagSelector:
        include_any: tuple[str, ...] = ()
        require_all: tuple[str, ...] = ()
        exclude_any: tuple[str, ...] = ()
Field names encode the combinator (any-of / all-of / none-of),
deliberately distinct from CLI flag names. Flags are user-facing
affordances; field names are implementation-facing and self-document
at the call site.
`type_names` lives on `filter_models` as a keyword, not on
`TagSelector`. It's a closed-list match on `ModelKey.name`, orthogonal
to the tag predicate algebra. Isolating it makes `TagSelector`'s
purpose statable in one sentence and confines a future fold-in of
`--type` to a kwarg deletion that doesn't disturb `TagSelector`.
User-facing help text frames flags as acting on feature types
("Include feature types with these tags — defines scope (OR;
repeatable)"). Internal API docstrings keep "models" since they
describe the Python class layer; "feature types" is the user-facing
vocabulary for entry-point-registered top-level types, distinct from
the Pydantic models used for nested fields.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
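As a rough illustration of the commit message above, the selector and the orthogonal type_names keyword could fit together as sketched here. Only the three TagSelector field names and the filter_models / type_names names come from the commit message; the matches method, the simplified ModelKey stand-in, the exact-string tag comparison, and the filter_models signature are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelKey:
    """Simplified stand-in; the real ModelKey carries more fields."""
    name: str
    tags: frozenset[str] = frozenset()


@dataclass(frozen=True)
class TagSelector:
    include_any: tuple[str, ...] = ()
    require_all: tuple[str, ...] = ()
    exclude_any: tuple[str, ...] = ()

    def matches(self, tags: frozenset[str]) -> bool:
        # any-of: only restricts when at least one --tag was given
        if self.include_any and not any(t in tags for t in self.include_any):
            return False
        # all-of: all() over an empty tuple is True, so absent means no restriction
        if not all(t in tags for t in self.require_all):
            return False
        # none-of
        return not any(t in tags for t in self.exclude_any)


def filter_models(models: dict, selector: TagSelector = TagSelector(), *, type_names=()):
    """Keep entries whose key passes the selector and, if given, the name list."""
    return {
        key: model
        for key, model in models.items()
        if selector.matches(key.tags) and (not type_names or key.name in type_names)
    }


models = {
    ModelKey("building", frozenset({"feature", "overture:theme=buildings"})): object,
    ModelKey("place", frozenset({"feature", "overture:theme=places"})): object,
}
selected = filter_models(models, TagSelector(include_any=("overture:theme=places",)))
print([key.name for key in selected])
```

Keeping type_names as a keyword on the function rather than a field on the selector mirrors the rationale in the commit message: the selector stays a pure tag predicate.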
Use provider_key.name (always a string) instead of provider.__name__, which raises AttributeError when a provider is a callable instance without __name__, masking the original error inside the except block. Add exc_info=True to preserve the traceback in the warning.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
Replace unittest.TestCase classes with module-level pytest functions parametrized over the tag lists. Per-tag parametrization isolates failures to the offending input instead of stopping at the first assertion in a loop.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
Fixes D100 reported by pydocstyle / make docformat.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
Plain tags, namespaces, and predicates now share a single TAG_PART pattern: lowercase alphanumeric start followed by alphanumeric, hyphen, underscore, or dot. Values remain case-permissive. Drops the prior asymmetry where namespaces and predicates allowed dots but plain tags did not.
Make generate_tags private (its sole caller is discover_models) and broaden TagProvider's return type to Iterable[str] so providers can yield, return lists, or return sets.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
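One way to read the TAG_PART description together with the [prefix:]key[=value] tag format is the regular expression sketched below. The exact pattern in the PR is not shown in this thread, so treat every detail as an assumption: in particular that the trailing characters are lowercase only, and that a value is any non-empty run without whitespace, colon, or equals sign. parse_tag is a hypothetical helper.

```python
from __future__ import annotations

import re

# Assumed reading: lowercase alphanumeric start, then lowercase alphanumerics,
# dot, underscore, or hyphen.
TAG_PART = r"[a-z0-9][a-z0-9._-]*"
# Assumed: values are case-permissive but may not contain ':' or '='.
VALUE = r"[^\s:=]+"
TAG_RE = re.compile(
    rf"(?:(?P<prefix>{TAG_PART}):)?(?P<key>{TAG_PART})(?:=(?P<value>{VALUE}))?"
)


def parse_tag(tag: str) -> tuple[str | None, str, str | None]:
    """Split a tag into (prefix, key, value); raise ValueError if malformed."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"invalid tag: {tag!r}")
    return match["prefix"], match["key"], match["value"]


print(parse_tag("overture:theme=buildings"))
print(parse_tag("system:extension"))
print(parse_tag("feature"))
```

Using fullmatch rather than match rejects nested colons and repeated equals signs, because any leftover text after the optional value fails the whole match.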
The provider's first argument is the value loaded from an `overture.models` entry point. For discriminated-union features (e.g. `Segment`) that's `Annotated[Union[...], Field(...)]`, not `type[BaseModel]`, so the prior signature was a lie. Widen `TagProvider` and the in-tree providers to accept `Any` and document the boundary in `discovery/types.py`.
Strip `typing_util.collect_types` to the cases discovery actually meets today: `Annotated`, `Union`/`X | Y`, plain class. Drop the unreached `NewType` and `Literal` branches. Point at `overture-schema-codegen`'s `extraction/type_analyzer.py:analyze_type` as the more capable implementation, with consolidation across system, core, and cli flagged as future work.
`theme_provider` extracts the theme via `_theme_literal`, which asserts that `theme` is a single-value `str` `Literal[...]` and raises `TypeError` otherwise. `_generate_tags` catches and logs at WARNING, so third-party model-definition bugs surface visibly without crashing discovery.
Promote tag-rejection logging from DEBUG to WARNING so authorization failures (invalid tags, reserved tags, reserved namespaces) don't disappear silently in normal operation.
Convert filter tests from direct `_filter_tags` calls to a fake `TagProvider` driven through `_generate_tags`. Tests now exercise provider invocation and merge wiring, not just the filter, and decouple from the private filter name. Provider-behavior tests still call the providers directly.
Add discriminated-union coverage for both `feature_provider` and `theme_provider`, plus a `TypeError` case for a non-Literal `theme`.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
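The single-value `Literal` check described for `_theme_literal` can be illustrated with the standard typing helpers. This is a sketch under assumptions, not the PR's implementation: it reads annotations with `get_type_hints` from plain classes, whereas the real helper presumably inspects Pydantic models, and the names `theme_literal`, `Building`, and `Broken` are invented.

```python
from typing import Literal, get_args, get_origin, get_type_hints


def theme_literal(model: type) -> str:
    """Return the single string inside `theme: Literal[...]`, else raise TypeError."""
    annotation = get_type_hints(model).get("theme")
    args = get_args(annotation)
    if (
        get_origin(annotation) is not Literal
        or len(args) != 1
        or not isinstance(args[0], str)
    ):
        raise TypeError(
            f"{model.__name__}.theme must be a single-value str Literal, got {annotation!r}"
        )
    return args[0]


class Building:  # hypothetical stand-in for a model class
    theme: Literal["buildings"]


class Broken:  # hypothetical model whose theme is not a Literal
    theme: str


print(theme_literal(Building))
try:
    theme_literal(Broken)
except TypeError as exc:
    print(f"rejected: {exc}")
```

Raising `TypeError` here and catching it one level up is what lets a malformed third-party model produce a warning instead of aborting discovery.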
added_tags = set(provider(model_class, key, tags.copy())) - tags
filtered_tags = _filter_tags(added_tags, provider_key)
tags.update(filtered_tags)
Dropped tags (when the tag provider manipulates the tag set) are silently ignored, so tag providers may only add tags. While I think this is desired behavior, the tag provider signature makes it look like they can be removed.
Thoughts?
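The add-only behaviour can be seen in a stripped-down version of the loop. This sketch is not the PR's `_generate_tags`: providers here take only the current tag set (the real ones also receive the model and key), the reserved-tag filtering step is left out, and the function and provider names are invented.

```python
import logging
from collections.abc import Callable, Iterable

logger = logging.getLogger(__name__)

# Simplified provider signature, assumed for this example only.
Provider = Callable[[set[str]], Iterable[str]]


def generate_tags(providers: dict[str, Provider]) -> set[str]:
    tags: set[str] = set()
    for name, provider in providers.items():
        try:
            # The provider gets a copy, so mutating its argument cannot touch
            # `tags`; subtracting `tags` keeps only what is genuinely new.
            added = set(provider(tags.copy())) - tags
        except Exception:
            # Log by key name and keep the traceback, then move on.
            logger.warning("tag provider %s failed", name, exc_info=True)
            continue
        tags.update(added)  # update() can only grow the set
    return tags


def adds_feature(current: set[str]) -> set[str]:
    return {"feature"}


def tries_to_remove(current: set[str]) -> set[str]:
    current.discard("feature")  # only affects the copy
    return current | {"draft"}


print(sorted(generate_tags({"a": adds_feature, "b": tries_to_remove})))
```

Because the merge is a set difference followed by update(), a provider that returns fewer tags than it was given has no way to drop earlier ones, which is the mismatch with the signature that the comment above points out.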
Add Discovery and Tagging sections to system's README, covering the overture.models / overture.tag_providers entry point groups, the tag format, provider contract, namespace and tag reservation, the built-in providers, and TagSelector-based filtering.
Update core's README: replace the stale Discovery bullet (discovery has moved to system) with one describing the authority and theme tag providers core contributes.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
Tag providers now receive the concrete BaseModel subclasses for the entry point instead of the raw entry-point value. _generate_tags walks the model once via collect_types and passes the result to every provider, so providers can't forget to handle discriminated unions and the walk happens once per model rather than once per provider.
The TagProvider type alias drops Any in favor of Iterable[type[BaseModel]], honestly typing what providers receive. The first arg of _generate_tags is annotated Any to match the entry-point loader, which yields union expressions that aren't type[BaseModel].
All three registered providers (feature_provider, authority_provider, theme_provider) update to the new signature; unit tests pass concrete classes directly while union-handling tests move to the _generate_tags integration boundary, where the walk now lives.
Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
Well, I'm happy with the state of the PR, but that's because I worked through my reactions to it and pushed a series of follow-up commits. (Not to say that it wasn't great before, but I took liberties with actually responding to my own comments by making changes. Mostly specific to filtering and supporting the nuances of union types like Segment.)
It's definitely worth others looking over the TagSelector business, particularly how they're handled (and what the CLI UX looks like), in addition to the other follow-ups I made.
Extensible tag classification model discovery through Entry Points
This replaces the hardcoded model classification system with tag-based classification model discovery through Entry Points. It is based on #440 by Seth and on several ad-hoc schema coding sessions in which Seth, Vic, Dana, Tristan and Roel participated.
Model discovery moved into `system`, eliminating assumptions about Overture in the process. The hardcoded `namespace` concept (`"overture"`, `"annex"`) and the `ModelKind` classifier are replaced with tags -- string labels derived by tag providers. Tags become the filtering, grouping, and classification mechanism for model discovery, driven by introspection and package metadata rather than central coordination. `system` provides generic tag-based grouping without understanding what any particular tag means. Any package can register tag providers that classify models without special support in the discovery layer.
Purpose
Tags serve three roles:
--tag system:feature, --tag draft)
These roles overlap -- a tag like `overture:theme=buildings` serves both filtering and taxonomy. The design accommodates this overlap through structured tags that encode both ownership and dimension.
Tag Format
Tags are strings following the pattern `[prefix:]key[=value]`:
- plain: `overture`, `draft`, `feature`
- namespaced: `system:extension` -- `:` separates ownership
- compound: `overture:theme=buildings`
`:` signals ownership and enables prefix reservation (see Privileged Packages and Tag Reservation). `=` signals a dimension with a value (groupable via `--group-by`). One level of each -- no nested colons or multiple `=` signs.
Minimal launch set
- `feature` (was: `system:feature`)
- `overture:theme=<theme>` (`buildings`, `transportation`)
- `overture` (was: `overture:official`)
Reserved tags
Tags can be reserved either as simple tags or by namespaces. These are the tags and namespaces that are currently reserved:
- `feature`: `overture-schema-system`
- `overture`: `overture-schema-core`
- `overture:*`: `overture-schema-core`
- `system:*`: `overture-schema-system`
Extensions
Additional extensions and their accompanying tags will be introduced in a future PR. Extensions allow existing types to be augmented with new fields (columns).
CLI
The `list-types` command has been updated to support filtering and grouping by tags. Currently, it no longer displays the description or fully qualified class name. The `json-schema` and `validate` commands from the overture-schema cli and the `generate` command from the overture-codegen cli have been updated to filter on tags instead of filtering by theme and type. Further changes can be introduced in a future update.
Examples
Deviations