Add identifier expression indexes for high-cardinality dataset types by koenvo · Pull Request #66 · PySport/ingestify

koenvo · 2026-04-01T12:59:38Z

- SqlAlchemySessionProvider.create_identifier_indexes(): creates composite expression indexes on identifier JSONB keys (Postgres only, IF NOT EXISTS)
- DatasetStore.create_indexes(): delegates to repository, configured via identifier_index_configs from dataset_types config
- `ingestify sync-indexes` CLI command to trigger index creation explicitly (never automatic to avoid locking large tables)
- identifier_index: true option in dataset_types config
- test-postgres job in test.yml with Postgres 15 service

- IdentifierTransformer now stores and returns declared key_type per key
- register_transformation() accepts optional key_type ('str' or 'int')
- Repository query building uses declared key_type for JSONB cast instead of inferring from Python value type at runtime
- create_identifier_indexes() generates typed expressions: (identifier->>'key') for str, ((identifier->>'key')::integer) for int
- main.py passes key_type from config to both transformer and index configs
- Tests updated to use new dict key format {name, key_type}
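The typed expressions described above can be illustrated with a minimal sketch. The helper name `identifier_expression` is hypothetical and not taken from the ingestify codebase; only the two output shapes come from the commit message.

```python
def identifier_expression(key: str, key_type: str = "str") -> str:
    """Return a Postgres expression for one key of the identifier JSONB column.

    Hypothetical helper for illustration; not the ingestify implementation.
    """
    # Escape single quotes so the key stays a valid SQL string literal.
    safe_key = key.replace("'", "''")
    text_expr = f"(identifier->>'{safe_key}')"
    if key_type == "str":
        # ->> already yields text, so no cast is needed.
        return text_expr
    if key_type == "int":
        # Cast the extracted text to integer, wrapped in its own parentheses.
        return f"({text_expr}::integer)"
    raise ValueError(f"unsupported key_type: {key_type!r}")


print(identifier_expression("season", "str"))
print(identifier_expression("match_id", "int"))
```

Declaring the type up front means the index expression and the query expression can be built from the same string, which is what allows the planner to match a query against the expression index.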

koenvo added 9 commits April 1, 2026 14:46

Expand docs for identifier_index with YAML examples and sync-indexes usage

bf8e465

Apply black formatting

bfcdc59

Fix test fixtures: use engine from conftest instead of LocalFileRepository

27b7587

Use partial index (WHERE dataset_type = '...') for identifier indexes

0a1f785

Limits each index to a single dataset_type, so it is smaller and dataset_type is an implicit condition rather than a post-scan filter.
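As a rough sketch of what such a partial index statement might look like, the function below assembles the DDL as a string. The table name `dataset`, the index name, and the function itself are assumptions made for illustration; the PR description does not show the generated SQL.

```python
def partial_index_ddl(table, index_name, expressions, dataset_type):
    """Build a CREATE INDEX statement restricted to one dataset_type.

    Hypothetical helper; table and index names are illustrative assumptions.
    """
    columns = ", ".join(expressions)
    # Escape single quotes so the predicate literal stays valid SQL.
    literal = dataset_type.replace("'", "''")
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} ({columns}) "
        f"WHERE dataset_type = '{literal}'"
    )


ddl = partial_index_ddl(
    table="dataset",  # assumed table name
    index_name="idx_identifier_match",  # assumed index name
    expressions=["(identifier->>'match_id')"],
    dataset_type="match",
)
print(ddl)
```

Because rows of other dataset types are never stored in the index, it stays smaller, and a query that filters on the same `dataset_type` value does not need to re-check that condition against rows fetched through the index.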

Include provider in partial index predicate and index name

4190d16

Two different providers can share the same dataset_type name, so the partial index WHERE clause now matches both provider and dataset_type. Index name uses provider_dataset_type to avoid collisions.
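The collision this commit avoids can be shown with a small sketch. The `idx_identifier_` prefix, the `provider` column name, and the provider values are assumptions for illustration; the commit message only states that the name uses provider_dataset_type and that the predicate matches both fields.

```python
def index_name(provider: str, dataset_type: str) -> str:
    """Derive an index name from provider and dataset_type (assumed prefix)."""
    return f"idx_identifier_{provider}_{dataset_type}"


def index_predicate(provider: str, dataset_type: str) -> str:
    """Build the WHERE clause body for the partial index."""

    def quote(value: str) -> str:
        # Double any single quotes so the literal stays valid SQL.
        return "'" + value.replace("'", "''") + "'"

    return f"provider = {quote(provider)} AND dataset_type = {quote(dataset_type)}"


# Two providers sharing the dataset_type "match" (example values only).
for provider in ("statsbomb", "wyscout"):
    print(index_name(provider, "match"), "|", index_predicate(provider, "match"))
```

With only `dataset_type` in the name, both providers would map to the same index name, and `IF NOT EXISTS` would silently skip creating the second one.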

Fix test assertions to match provider_dataset_type index name format

755de1d

Update docs: key_type, partial index predicate, correct index names

1710ef1

koenvo merged commit 4187ced into main Apr 3, 2026
13 checks passed

koenvo deleted the feature/identifier-indexes branch April 3, 2026 13:08

koenvo merged 9 commits into main from feature/identifier-indexes
