Skip to content

feat(api): bulk upload id contract — sequence advance and sanity bound (#2362)#2367

Open
jh-RLI wants to merge 3 commits into
developfrom
feature-2362-bulk-upload-id-contract
Open

feat(api): bulk upload id contract — sequence advance and sanity bound (#2362)#2367
jh-RLI wants to merge 3 commits into
developfrom
feature-2362-bulk-upload-id-contract

Conversation

@jh-RLI

@jh-RLI jh-RLI commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Summary of the discussion

This PR stacks ontop of #2364 and #2366 wait until both are merged and rebase this PR.

Part of #2362 (Slice 4 — id contract). Clients may include or omit the id
column in a bulk upload:

  • Omitted: the table's id sequence assigns ids as usual.
  • Included: after COPY, still inside the upload's transaction, the id
    sequence is advanced to the table's max(id) (setval(GREATEST(…))), so
    a subsequent row-API insert can't hit a duplicate key. A
    pg_advisory_xact_lock on the sequence serializes concurrent uploads —
    setval is non-transactional, and two racing reads could otherwise move
    the sequence backwards. The sequence never moves backwards.
  • Sanity bound: uploads that raise the table's max(id) above 2^48
    are rejected and rolled back, so one upload can't exhaust the sequence
    for every writer of a shared table. The bound judges only ids introduced
    by the upload itself — a pre-existing high id (the row API enforces no
    bound) does not poison the table for future bulk uploads.

The review pass caught two real defects before commit: the bound originally
judged the whole table's max(id) (a table with one legitimately huge
pre-existing id would reject every future id-bearing upload), and the
unserialized setval race. Both fixed and tested.

Tests: module grows to 25 HTTP-seam tests (no-id sequence assignment,
explicit-ids-then-row-insert collision check, bound rejection with
rollback, monotonic sequence, pre-existing-high-id regression). Full suite
green. Changelog entry included.

Automation

Closes #

PR-Assignee

Reviewer

  • 🐙 Follow the
    Reviewer Guidelines
  • 🐙 Provided feedback and show sufficient appreciation for the work done

jh-RLI and others added 3 commits July 3, 2026 18:51
Adds POST /api/v0/tables/<table>/bulk-upload - the tracer bullet of the
bulk upload path (slice 2 of #2362):

- The request body IS the CSV (text/csv); the server streams it into
  PostgreSQL COPY FROM STDIN without buffering the file in memory.
- Append-only, one transaction per request: a malformed row anywhere
  rolls back the entire upload.
- The delimiter is a required, whitelisted parameter (comma, semicolon,
  tab) - never inferred from metadata or content.
- The CSV header (required) maps columns by name; header names are
  whitelisted against the table's actual columns and quoted, so no
  unvalidated identifier ever reaches the SQL.
- Same authorization chain as the row API: token auth, write
  permission, embargo check, table-registry resolution (internal
  tables are unreachable by construction).
- Deliberately bypasses the edit-journal meta tables: bulk-loaded rows
  have no per-row change history. This trades the (currently unread)
  per-row provenance for order-of-magnitude ingestion speed; an audit
  event record follows in a later slice.
- COPY is FROM STDIN only; no code path for COPY FROM file/PROGRAM.

New HTTP-seam test module api/tests/test_bulk_upload.py (14 tests)
covers the happy path per delimiter, auth/permission/embargo denials,
all-or-nothing rollback, journal bypass, and target-table containment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Slice 3 of #2362, on top of the bulk upload tracer bullet:

- Header preflight before the body streams: reject duplicate column
  names, names not in the table, and missing required columns
  (NOT NULL without default), each with a 400 naming the offenders.
- Strip a UTF-8 BOM from the header (Excel exports).
- FORCE_NULL on all uploaded columns: an empty field is NULL whether
  quoted or not - a deliberate deviation from COPY's native CSV rule,
  because many writers quote every field and would silently store
  empty strings instead of NULLs.
- Sanitized failure responses: the database's data-level message plus
  the CSV line number and column (header-adjusted), never raw SQL,
  server context dumps, or internal paths - including the no-diagnostics
  fallback (lost connection), which stays generic.

Test module grows to 20 HTTP-seam tests covering each contract rule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
#2362)

Slice 4 of #2362. Clients may include or omit the id column:

- Omitted: the table's id sequence assigns ids as usual.
- Included: after COPY, still inside the upload's transaction, the id
  sequence is advanced to the table's max(id) via setval(GREATEST(...)),
  so a subsequent row-API insert cannot hit a duplicate key. GREATEST
  plus a pg_advisory_xact_lock on the sequence keep the sequence from
  ever moving backwards, including under concurrent uploads (setval is
  non-transactional, so racing reads could otherwise regress it).
- Uploads that RAISE the table's max(id) above a generous sanity bound
  (2^48) are rejected and rolled back, so a single upload cannot
  exhaust the id sequence for every writer of a shared table. The
  bound only judges ids introduced by the upload itself: a pre-existing
  high id (the row API enforces no bound) does not poison the table
  for future bulk uploads.

Test module grows to 25 HTTP-seam tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant