fix: Add paginated merge and load-vocab-source command #13

Open
nicoloesch wants to merge 2 commits into main from 12-reduce-mem

@nicoloesch
Collaborator

Summary

Fixes #12

  • Adds paginated merge operations to orm-loader so large staging-to-target merges commit in bounded batches rather than one transaction.
  • Adds a new load-vocab-source command to omop-alchemy with bulk mode, progress feedback, and crash-resilient retry.

orm-loader: paginated merge via _rownum

Staging tables now get a _rownum BIGINT GENERATED ALWAYS AS IDENTITY column at creation time. merge_insert, merge_replace, and merge_upsert all accept a merge_batch_size parameter (default 1,000,000 rows). For tables larger than one batch, an index on _rownum is built on the staging table and rows are processed in range-keyed batches, each committed independently. This bounds WAL accumulation to one batch per transaction instead of the full table. Tables with fewer rows than merge_batch_size fall through to the original single-statement path.
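The range-keyed batching described above can be sketched roughly as follows. This is an illustration only, not the orm-loader implementation: `rownum_batches` and `batch_insert_sql` are hypothetical helper names, the `%(low)s` / `%(high)s` placeholders assume a psycopg-style driver, and the sketch assumes `_rownum` starts at 1 (the PostgreSQL identity default).

```python
from typing import Iterator, Sequence, Tuple


def rownum_batches(max_rownum: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) _rownum ranges covering 1..max_rownum."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    low = 1
    while low <= max_rownum:
        high = min(low + batch_size - 1, max_rownum)
        yield low, high
        low = high + 1


def batch_insert_sql(target: str, staging: str, columns: Sequence[str]) -> str:
    """Build a parameterised INSERT ... SELECT for a single _rownum range.

    Hypothetical helper: table and column names are trusted here and
    interpolated directly; real code should quote identifiers.
    """
    cols = ", ".join(columns)
    return (
        f"INSERT INTO {target} ({cols}) "
        f"SELECT {cols} FROM {staging} "
        f"WHERE _rownum BETWEEN %(low)s AND %(high)s"
    )
```

In such a scheme the caller would execute the statement once per range and commit after each one, so a single transaction never covers more than `batch_size` staging rows.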

The COPY statement was updated to include an explicit column list so the identity column is excluded from input.
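As a rough illustration of why the explicit column list matters: a `GENERATED ALWAYS` identity column rejects user-supplied values, so the COPY statement has to name only the data columns. The sketch below builds such a statement; `quote_ident` and `build_copy_sql` are hypothetical names, and the CSV options are an assumption rather than what orm-loader actually passes.

```python
from typing import Sequence


def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(table: str, columns: Sequence[str]) -> str:
    """Build COPY ... FROM STDIN with a column list that omits _rownum."""
    # Leave the identity column out so PostgreSQL generates it per row.
    data_columns = [c for c in columns if c != "_rownum"]
    if not data_columns:
        raise ValueError("no data columns to copy")
    col_list = ", ".join(quote_ident(c) for c in data_columns)
    # FORMAT csv / HEADER true are assumed options for this sketch.
    return (
        f"COPY {quote_ident(table)} ({col_list}) "
        f"FROM STDIN WITH (FORMAT csv, HEADER true)"
    )
```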

omop-alchemy: load-vocab-source command

New cli_vocab.py implementing a load-vocab-source command with:

  • --bulk-mode: disables FK triggers and drops indexes before loading, then rebuilds after. Substantially faster than per-table management for a full vocabulary reload.
  • --merge-strategy: replace, upsert, or insert_if_empty.
  • --merge-batch-size: passed through to the orm-loader paginated merge.
  • Progress bar with per-phase descriptions including the post-load index rebuild phase, which can take 15+ minutes on concept_ancestor.
  • Crash-resilient retry: if a retryable connection error occurs mid-merge and the strategy is insert_if_empty, the partially loaded table is truncated before retrying. This is safe because FK triggers are disabled via ALTER TABLE ... DISABLE TRIGGER ALL, and that disabled state persists across a crash and recovery.
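The retry behaviour in the last bullet could look something like the minimal sketch below. It is not the code in cli_vocab.py: `RetryableConnectionError`, `merge_with_retry`, and the `max_attempts` parameter are hypothetical, and the reason given for truncating (a partial load would presumably leave the table non-empty for an insert_if_empty retry) is an inference from the strategy name.

```python
from typing import Callable


class RetryableConnectionError(Exception):
    """Hypothetical stand-in for a transient database connection failure."""


def merge_with_retry(
    merge: Callable[[], None],
    truncate_target: Callable[[], None],
    strategy: str,
    max_attempts: int = 3,
) -> int:
    """Run merge(), retrying on retryable errors; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            merge()
            return attempt
        except RetryableConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            if strategy == "insert_if_empty":
                # Assumption: a partial load leaves committed rows behind,
                # so clear the target before the next attempt.
                truncate_target()
    raise ValueError("max_attempts must be at least 1")
```

Other strategies (replace, upsert) are retried here without truncation, on the assumption that they tolerate rows already present in the target.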

@nicoloesch nicoloesch requested a review from gkennos June 3, 2026 04:38

Development

Successfully merging this pull request may close these issues.

Vocabulary load is slow and unstable on large tables
