Add identifier expression indexes for high-cardinality dataset types #66

Merged: koenvo merged 9 commits into main from feature/identifier-indexes on Apr 3, 2026

koenvo (Contributor) commented Apr 1, 2026

  • SqlAlchemySessionProvider.create_identifier_indexes(): creates composite expression indexes on identifier JSONB keys (Postgres only, IF NOT EXISTS)
  • DatasetStore.create_indexes(): delegates to repository, configured via identifier_index_configs from dataset_types config
  • ingestify sync-indexes CLI command to trigger index creation explicitly (never automatic to avoid locking large tables)
  • identifier_index: true option in dataset_types config
  • test-postgres job in test.yml with Postgres 15 service
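As a rough illustration of how the `identifier_index: true` option might feed `identifier_index_configs`, the sketch below filters a list of dataset type entries and keeps only those that opt in. The function name and the entry fields `provider`, `dataset_type`, and `identifier_keys` are assumptions made for this example, not the project's confirmed config schema.

```python
from typing import Any, Dict, List


def collect_identifier_index_configs(
    dataset_types: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Return one index config per dataset type that sets identifier_index.

    Hypothetical helper: the entry layout is assumed for illustration only.
    """
    configs = []
    for entry in dataset_types:
        # Index creation is opt-in, so entries without the flag are skipped.
        if not entry.get("identifier_index", False):
            continue
        configs.append(
            {
                "provider": entry["provider"],
                "dataset_type": entry["dataset_type"],
                # Assumed default: keys without a declared type are strings.
                "keys": [
                    {"name": key["name"], "key_type": key.get("key_type", "str")}
                    for key in entry.get("identifier_keys", [])
                ],
            }
        )
    return configs
```

Because nothing is created at this stage, a list built this way can be inspected before `ingestify sync-indexes` is run against a large table.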

koenvo added 9 commits April 1, 2026 14:46
- SqlAlchemySessionProvider.create_identifier_indexes(): creates composite
  expression indexes on identifier JSONB keys (Postgres only, IF NOT EXISTS)
- DatasetStore.create_indexes(): delegates to repository, configured via
  identifier_index_configs from dataset_types config
- `ingestify sync-indexes` CLI command to trigger index creation explicitly
  (never automatic to avoid locking large tables)
- identifier_index: true option in dataset_types config
- test-postgres job in test.yml with Postgres 15 service
- IdentifierTransformer now stores and returns declared key_type per key
- register_transformation() accepts optional key_type ('str' or 'int')
- Repository query building uses declared key_type for JSONB cast instead
  of inferring from Python value type at runtime
- create_identifier_indexes() generates typed expressions:
  (identifier->>'key') for str, ((identifier->>'key')::integer) for int
- main.py passes key_type from config to both transformer and index configs
- Tests updated to use new dict key format {name, key_type}
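The typed expressions described above can be sketched as a small helper that maps a declared `key_type` to the matching JSONB extraction. This is a minimal sketch, not the code from this PR: the function name is hypothetical, and the identifier-name check is an added safeguard that the commit message does not mention.

```python
def identifier_expression(name: str, key_type: str) -> str:
    """Build the SQL expression for one identifier key (hypothetical helper).

    'str' keys use the raw text value; 'int' keys are cast to integer so the
    index expression and the query expression have the same type.
    """
    # Added safeguard (an assumption): reject names that could break quoting.
    if not name.isidentifier():
        raise ValueError(f"invalid identifier key name: {name!r}")
    if key_type == "str":
        return f"(identifier->>'{name}')"
    if key_type == "int":
        return f"((identifier->>'{name}')::integer)"
    raise ValueError(f"unsupported key_type: {key_type!r}")
```

Using the same declared type for query building and for the index matters because Postgres only considers an expression index when the query expression matches the indexed expression.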
Makes each index a partial index limited to a single dataset_type, so the
index is smaller and dataset_type is an implicit condition rather than a
post-scan filter.
Two different providers can share the same dataset_type name, so the
partial index WHERE clause now matches both provider and dataset_type.
Index name uses provider_dataset_type to avoid collisions.
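A partial expression index scoped to one provider and dataset_type could be generated roughly as follows. This is a sketch under stated assumptions: the table name `dataset`, the column names `provider` and `dataset_type`, the `idx_..._identifier_` name prefix, and the function name are all illustrative, and the example values are made up. It also assumes provider and dataset_type come from trusted configuration, since they are interpolated directly into the statement.

```python
from typing import Sequence


def build_partial_index_sql(
    provider: str,
    dataset_type: str,
    expressions: Sequence[str],
    table: str = "dataset",  # assumed table name
) -> str:
    """Return a CREATE INDEX statement for one provider/dataset_type pair."""
    # Including both provider and dataset_type in the name avoids collisions
    # when two providers share a dataset_type name.
    index_name = f"idx_{table}_identifier_{provider}_{dataset_type}"
    columns = ", ".join(expressions)
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} ({columns}) "
        f"WHERE provider = '{provider}' AND dataset_type = '{dataset_type}'"
    )


# Example with made-up values; each expression is already parenthesized.
sql = build_partial_index_sql(
    "provider_a",
    "match",
    ["(identifier->>'season')", "((identifier->>'match_id')::integer)"],
)
```

The WHERE clause is what makes the index partial: rows for other providers or dataset types are never stored in it, which keeps each index small.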
koenvo merged commit 4187ced into main on Apr 3, 2026
13 checks passed
koenvo deleted the feature/identifier-indexes branch on Apr 3, 2026 13:08