feat(api): bulk upload id contract — sequence advance and sanity bound (#2362)#2367
Open
jh-RLI wants to merge 3 commits into
Adds POST /api/v0/tables/<table>/bulk-upload - the tracer bullet of the bulk upload path (slice 2 of #2362):

- The request body IS the CSV (text/csv); the server streams it into PostgreSQL COPY FROM STDIN without buffering the file in memory.
- Append-only, one transaction per request: a malformed row anywhere rolls back the entire upload.
- The delimiter is a required, whitelisted parameter (comma, semicolon, tab) - never inferred from metadata or content.
- The CSV header (required) maps columns by name; header names are whitelisted against the table's actual columns and quoted, so no unvalidated identifier ever reaches the SQL.
- Same authorization chain as the row API: token auth, write permission, embargo check, table-registry resolution (internal tables are unreachable by construction).
- Deliberately bypasses the edit-journal meta tables: bulk-loaded rows have no per-row change history. This trades the (currently unread) per-row provenance for order-of-magnitude ingestion speed; an audit event record follows in a later slice.
- COPY is FROM STDIN only; no code path for COPY FROM file/PROGRAM.

New HTTP-seam test module api/tests/test_bulk_upload.py (14 tests) covers the happy path per delimiter, auth/permission/embargo denials, all-or-nothing rollback, journal bypass, and target-table containment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
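The delimiter and identifier whitelisting described in this commit can be illustrated with a minimal sketch. Every name here (`ALLOWED_DELIMITERS`, `quote_ident`, `build_copy_sql`, the `schema`/`table` arguments) is hypothetical and not taken from the PR; the point is only that the statement is assembled from whitelisted pieces, never from raw client input.

```python
# Hypothetical sketch, not the PR's implementation.
# Maps the accepted delimiter names to ready-made SQL literals.
ALLOWED_DELIMITERS = {
    "comma": "','",
    "semicolon": "';'",
    "tab": "E'\\t'",
}


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(schema, table, header, table_columns, delimiter_name):
    """Build a COPY ... FROM STDIN statement from validated parts only."""
    if delimiter_name not in ALLOWED_DELIMITERS:
        raise ValueError(f"unsupported delimiter: {delimiter_name!r}")
    # Header names must be actual columns of the target table.
    unknown = [name for name in header if name not in table_columns]
    if unknown:
        raise ValueError(f"unknown columns: {', '.join(unknown)}")
    columns = ", ".join(quote_ident(name) for name in header)
    target = f"{quote_ident(schema)}.{quote_ident(table)}"
    return (
        f"COPY {target} ({columns}) FROM STDIN "
        f"WITH (FORMAT csv, HEADER true, "
        f"DELIMITER {ALLOWED_DELIMITERS[delimiter_name]})"
    )
```

The resulting string would then be handed to the database driver's COPY streaming interface together with the request body; that part depends on the driver and is left out here.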
Slice 3 of #2362, on top of the bulk upload tracer bullet:

- Header preflight before the body streams: reject duplicate column names, names not in the table, and missing required columns (NOT NULL without default), each with a 400 naming the offenders.
- Strip a UTF-8 BOM from the header (Excel exports).
- FORCE_NULL on all uploaded columns: an empty field is NULL whether quoted or not - a deliberate deviation from COPY's native CSV rule, because many writers quote every field and would silently store empty strings instead of NULLs.
- Sanitized failure responses: the database's data-level message plus the CSV line number and column (header-adjusted), never raw SQL, server context dumps, or internal paths - including the no-diagnostics fallback (lost connection), which stays generic.

Test module grows to 20 HTTP-seam tests covering each contract rule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
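The header preflight rules above can be sketched as a pure function. This is an illustration under assumptions: `preflight_header`, its arguments, and the error wording are hypothetical, and the real endpoint presumably turns a non-empty error list into a 400 response.

```python
import csv
from collections import Counter

BOM = "\ufeff"  # how a UTF-8 byte order mark appears after decoding


def preflight_header(header_line, delimiter, table_columns, required_columns):
    """Parse the CSV header and collect contract violations.

    Returns (names, errors); an empty errors list means the header passes.
    """
    if header_line.startswith(BOM):
        header_line = header_line[len(BOM):]
    names = next(csv.reader([header_line.rstrip("\r\n")], delimiter=delimiter))

    errors = []
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        errors.append(f"duplicate columns: {', '.join(duplicates)}")
    unknown = sorted(set(names) - set(table_columns))
    if unknown:
        errors.append(f"unknown columns: {', '.join(unknown)}")
    # required_columns: NOT NULL columns without a default (assumed precomputed).
    missing = sorted(set(required_columns) - set(names))
    if missing:
        errors.append(f"missing required columns: {', '.join(missing)}")
    return names, errors
```

Collecting all violations before responding, rather than failing on the first one, lets a single 400 name every offender.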
Slice 4 of #2362. Clients may include or omit the id column:

- Omitted: the table's id sequence assigns ids as usual.
- Included: after COPY, still inside the upload's transaction, the id sequence is advanced to the table's max(id) via setval(GREATEST(...)), so a subsequent row-API insert cannot hit a duplicate key. GREATEST plus a pg_advisory_xact_lock on the sequence keep the sequence from ever moving backwards, including under concurrent uploads (setval is non-transactional, so racing reads could otherwise regress it).
- Uploads that RAISE the table's max(id) above a generous sanity bound (2^48) are rejected and rolled back, so a single upload cannot exhaust the id sequence for every writer of a shared table. The bound only judges ids introduced by the upload itself: a pre-existing high id (the row API enforces no bound) does not poison the table for future bulk uploads.

Test module grows to 25 HTTP-seam tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
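One way the lock-then-advance step could look in SQL is sketched below. The helper `sequence_advance_sql` is hypothetical, the exact statements in the PR are not shown on this page, and the sketch assumes `table` and `sequence` are trusted, already-quoted identifiers (for example resolved through the table registry), since they are interpolated into the SQL text.

```python
def sequence_advance_sql(table: str, sequence: str, id_column: str = "id") -> list[str]:
    """Return the statements to run after COPY, inside the same transaction.

    Assumption: `table`, `sequence` and `id_column` are trusted identifiers.
    """
    return [
        # Transaction-scoped advisory lock keyed on the sequence's OID;
        # PostgreSQL releases it automatically at commit or rollback.
        f"SELECT pg_advisory_xact_lock('{sequence}'::regclass::oid::bigint)",
        # GREATEST keeps the sequence from moving backwards when the
        # uploaded ids are all below the sequence's current value.
        f"SELECT setval('{sequence}'::regclass, GREATEST("
        f"(SELECT max({id_column}) FROM {table}), "
        f"(SELECT last_value FROM {sequence})))",
    ]
```

Taking the lock first means the read of max(id) and last_value and the following setval cannot interleave with another upload's, which is the race the commit message describes.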
This was referenced Jul 3, 2026
Summary of the discussion
This PR stacks on top of #2364 and #2366; wait until both are merged, then rebase this PR.
Part of #2362 (Slice 4: id contract). Clients may include or omit the id
column in a bulk upload:

- Omitted: the table's id sequence assigns ids as usual.
- Included: after COPY, still inside the upload's transaction, the id
  sequence is advanced to the table's max(id) (setval(GREATEST(…))), so
  a subsequent row-API insert can't hit a duplicate key. A
  pg_advisory_xact_lock on the sequence serializes concurrent uploads:
  setval is non-transactional, and two racing reads could otherwise move
  the sequence backwards. With the lock and GREATEST, it never does.
- Uploads that raise the table's max(id) above 2^48 are rejected and
  rolled back, so one upload can't exhaust the sequence for every writer
  of a shared table. The bound judges only ids introduced by the upload
  itself: a pre-existing high id (the row API enforces no bound) does not
  poison the table for future bulk uploads.
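The sanity-bound rule (judge only what the upload itself introduced) reduces to comparing max(id) before and after the COPY. A minimal sketch, assuming both values are read inside the upload's transaction; `upload_exceeds_bound` and `ID_SANITY_BOUND` are hypothetical names:

```python
ID_SANITY_BOUND = 2**48


def upload_exceeds_bound(max_before, max_after, bound=ID_SANITY_BOUND):
    """True if this upload raised the table's max(id) above the bound.

    max_before / max_after are max(id) before and after COPY; None means
    the table had no rows at that point.
    """
    if max_after is None:
        return False
    raised_by_upload = max_before is None or max_after > max_before
    return raised_by_upload and max_after > bound
```

A table whose max(id) was already above the bound passes as long as the upload does not push it higher, which matches the pre-existing-high-id case described above.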
The review pass caught two real defects before commit: the bound originally
judged the whole table's max(id) (a table with one legitimately huge
pre-existing id would reject every future id-bearing upload), and the
unserialized setval race. Both fixed and tested.
Tests: module grows to 25 HTTP-seam tests (no-id sequence assignment,
explicit-ids-then-row-insert collision check, bound rejection with
rollback, monotonic sequence, pre-existing-high-id regression). Full suite
green. Changelog entry included.
Automation
Closes #
PR-Assignee
CONTRIBUTING.md
CHANGELOG.md
mkdocs
Reviewer
Reviewer Guidelines