fix(pgque.nack): canonical + idempotent DLQ terminal handling by NikolayS · Pull Request #116 · NikolayS/pgque

NikolayS · 2026-04-30T09:46:57Z

Summary

Fixes two related bugs in pgque.nack() / DLQ flow, tracked in the umbrella #113.

Closes Bug: nack() DLQ path trusts caller-supplied pgque.message and allows forged DLQ rows #98
Closes Bug: repeated nack() of same DLQ-bound message creates duplicate replayable DLQ rows #104

Fix #98 — nack() DLQ path trusts caller-supplied pgque.message (forge)

Root cause: nack() passed caller-supplied composite fields (i_msg.type, i_msg.payload, etc.) directly to event_dead() without verifying the msg_id actually belonged to the active batch. Any writer with an active batch could forge a pgque.message with an arbitrary ev_id, payload, type, or retry count, inserting fake DLQ rows.

Fix: nack() now calls get_batch_events(i_batch_id) and looks up the row by ev_id = i_msg.msg_id. If msg_id is not found in the batch, it raises msg_id % not found in batch %. Only canonical fields from the batch data tables reach event_dead().

Red test evidence:

psql:tests/test_nack_dlq_canonical.sql:84: ERROR:
  FAIL #98: nack() should have rejected forged msg_id 999999

Green (after fix):

NOTICE:  PASS #98: nack() rejected forged msg_id and DLQ is clean

Fix #104 — repeated nack() creates duplicate replayable DLQ rows

Root cause: event_dead() used a plain INSERT with no duplicate guard. Calling nack() twice before ack() on the same terminal message inserted two dead_letter rows, causing dlq_replay_all() to replay the payload twice.

Fix: Added unique index dl_queue_consumer_ev_idx on (dl_queue_id, dl_consumer_id, ev_id) and changed the INSERT in event_dead() to ON CONFLICT (dl_queue_id, dl_consumer_id, ev_id) DO NOTHING. Repeated terminal nack() calls are now idempotent.

Red test evidence:

NOTICE:  DLQ row count: 2
ERROR:  FAIL #104: expected 1 DLQ row, got 2

Green (after fix):

NOTICE:  PASS #104: repeated nack() is idempotent (1 DLQ row)

Changes

File	Change
`sql/pgque-api/receive.sql`	`nack()` re-queries canonical event via `get_batch_events`; raises on unknown `msg_id`
`sql/pgque-additions/dlq.sql`	New unique index `dl_queue_consumer_ev_idx`; `event_dead()` insert uses `ON CONFLICT DO NOTHING`
`sql/pgque.sql`	Regenerated (both changes above)
`tests/test_nack_dlq_canonical.sql`	New red/green regression tests for both bugs
`tests/test_api_dlq.sql`	Test 2 updated: no longer forges `retry_count` (was the #98 hole); uses `max_retries=0` instead
`tests/run_all.sql`	Added `test_nack_dlq_canonical` to suite

Test plan

Red tests confirmed failing on main before fix
Green tests pass after fix
Full suite (tests/run_all.sql) passes — all tests green

🤖 Generated with Claude Code

#98: nack() now re-queries the canonical event from the active batch using get_batch_events(batch_id) + ev_id lookup instead of trusting the caller-supplied pgque.message composite. A forged msg_id not present in the batch raises an exception; forged payload/type/extras are ignored. #104: event_dead() uses ON CONFLICT DO NOTHING on a new unique index dl_queue_consumer_ev_idx (dl_queue_id, dl_consumer_id, ev_id), so repeated nack() calls for the same terminal message produce exactly one dead_letter row. Red tests in test_nack_dlq_canonical.sql confirmed both bugs before fix; both pass green after. Existing test_api_dlq.sql updated: test 2 no longer forges retry_count (the #98 vulnerability), instead uses max_retries=0 to trigger DLQ routing via canonical ev_retry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

NikolayS · 2026-04-30T09:50:43Z

REV review — PR #116 (independent reviewer)

Verdict: REQUEST CHANGES (non-blocking nits + one docs gap; core fix is sound)

Confidence: high on the correctness of the fix; high on the docs gap; medium on test-coverage hardening suggestions.

Security (HIGH priority — #98 is a forgery vector)

Canonical re-query is correct. nack() now selects ev_id, ev_time, ev_txid, ev_retry, ev_type, ev_data, ev_extra1..4 from pgque.get_batch_events(i_batch_id) filtered by ev_id = i_msg.msg_id, and only those canonical values are passed to event_dead() / event_retry(). Caller-supplied type, payload, retry_count, created_at, and extras are correctly ignored.
Cross-batch attack is blocked. pgque.subscription declares unique (sub_batch), so any given batch_id resolves to exactly one (queue, consumer) pair. get_batch_events(batch_id) only returns rows that belong to that batch, so passing msg_id from a different batch raises msg_id % not found in batch %. Verified.
SECURITY DEFINER + search_path preserved. Both pgque.nack and pgque.event_dead keep language plpgsql security definer set search_path = pgque, pg_catalog.
Information leak in error message. raise exception 'msg_id % not found in batch %', i_msg.msg_id, i_batch_id echoes the caller's input back. Not a real leak (caller already knows what they passed). Acceptable.
Index choice for Bug: repeated nack() of same DLQ-bound message creates duplicate replayable DLQ rows #104 is correct. Unique on (dl_queue_id, dl_consumer_id, ev_id) correctly handles fan-out: the same ev_id can still enter the DLQ for two different consumers (different dl_consumer_id), but cannot be duplicated for the same (queue, consumer, ev_id) triple.
ON CONFLICT (col-list) DO NOTHING does not mask other constraint violations. Only the matching unique index on the listed columns is suppressed; FK violations on dl_queue_id/dl_consumer_id still raise. Good.

Bug hunter

NULL msg_id is implicitly handled but not explicitly tested. When i_msg.msg_id is null, the where ev_id = null in the canonical re-query yields no rows, so not found raises 'msg_id <NULL> not found in batch %'. Behavior is safe but the test suite never exercises it. Suggest a third test case: nack with a pgque.message whose msg_id is NULL — assert it raises and DLQ stays clean.
Test 1 catches when others, which is too broad. The forge-rejection test exception when others then v_ok := true; passes for any failure, including bugs unrelated to the forge check (e.g., a regression in get_batch_events). Tighten to match sqlerrm like 'msg_id % not found in batch %' or sqlstate 'P0001' so the test pinpoints the intended error.
Forged-payload-with-valid-msg_id case is not covered. The most interesting Bug: nack() DLQ path trusts caller-supplied pgque.message and allows forged DLQ rows #98 attack is the one where the attacker passes a real msg_id from the active batch but forges type/payload/extras/retry_count. The PR's Test 1 only covers a non-existent msg_id = 999999. Suggest adding a Test 1b: receive a real message, mutate v_msg.payload, v_msg.type, v_msg.retry_count := 999, call nack, then assert the DLQ row contains the real realpayload/realtype and the canonical ev_retry (0), not the forged values. This is the assertion that proves canonical-only writes — currently nothing in the suite asserts canonical fields land in DLQ.
ev_txid::text::xid8 cast is correct but worth a one-line comment. get_batch_events.ev_txid is declared bigint (legacy PgQ signature) while the underlying column and dead_letter.ev_txid are xid8; the text round-trip is the codebase convention (sql/pgque.sql lines 252-253 document it). Drop a 1-line comment so future readers don't think it's a typo.
Concurrent double-nack race is logically safe but untested. Two concurrent nack calls each pass get_batch_events, each call event_dead; the unique index + ON CONFLICT DO NOTHING makes the duplicate insert a no-op; both return 1. Since concurrent tests are hard in psql-only suites, this is acceptable to defer, but worth a comment in the test file noting that Test 2 covers serial-only and that the race is reasoned about above.
event_retry (PgQ primitive) still uses a subtransaction (exception when unique_violation) to suppress duplicate retry-queue inserts. Pre-existing PgQ behavior, not introduced by this PR, but flagged in CLAUDE.md design rule 4 ("no subtransactions in hot paths"). Out of scope for this PR; leave for Bug: concurrent receive() for same consumer can double-deliver same events #97/Security: any pgque_writer can ack another consumer/app's active batch by batch_id #102 follow-up.

Test analyzer

Red+green tests are present for both Bug: nack() DLQ path trusts caller-supplied pgque.message and allows forged DLQ rows #98 and Bug: repeated nack() of same DLQ-bound message creates duplicate replayable DLQ rows #104. Red evidence is documented in the PR body.
tests/test_api_dlq.sql Test 2 was correctly updated: it used to forge retry_count := 2 (the very vulnerability Bug: nack() DLQ path trusts caller-supplied pgque.message and allows forged DLQ rows #98 fixes), now uses max_retries=0 so the canonical ev_retry=0 triggers DLQ on first nack. Good cleanup.
Test isolation: each test creates and drops its own queue. No shared state. Good.
Coverage gaps (see Bug hunter Set up PgQ git submodule (vendor/pgq/) #1, Add pgque-specific schema additions (config table, roles, queue_max_retries) #3 above): NULL msg_id, valid-msg_id-with-forged-payload.

Guidelines / CLAUDE.md compliance

Lowercase SQL keywords throughout. Good.
snake_case identifiers. Good.
Schema-qualification (pgque.X). Good.
SECURITY DEFINER ... SET search_path = pgque, pg_catalog. Good.
Conventional Commits: fix(pgque.nack): canonical lookup + idempotent DLQ insert. Good.
Both source files (sql/pgque-api/receive.sql, sql/pgque-additions/dlq.sql) AND the regenerated sql/pgque.sql are in the same commit. Lesson from fix(pgque.receive): reject max_return < 1 and document batch-ownership #114 applied. Good.

Docs (this is the main blocking nit)

docs/reference.md was not updated. The current entry for pgque.nack says: "If msg.retry_count is below the queue's max_retries, re-queues..." — but after this fix, nack IGNORES msg.retry_count and re-reads ev_retry from the canonical batch row. Users upgrading from v0.1 who relied on forged retry_count for tests/operations (as the prior tests/test_api_dlq.sql itself did) will be surprised. Please update the reference entry to:
- State that nack re-queries the canonical event from the active batch and ignores caller-supplied type/payload/retry_count/created_at/extras in msg.
- Document the new msg_id % not found in batch % error.
- Add a one-line migration note ("v0.2 hardening: do not forge retry_count to drive DLQ; configure max_retries instead, as the updated tests/test_api_dlq.sql test 2 demonstrates").
docs/tutorial.md line 254 ("Keep nacking. ... On the next nack it becomes retry_count=1...") still describes user-driven retry counting; the wording is fine because the canonical retry counter advances naturally, but a sentence noting that the counter is server-side would help.

Anti-leak scan

CI

5/5 PG version test jobs (14, 15, 16, 17, 18) pass. client-smoke and verify pass. claude-review is the bot we're invoking; not a self-block.

Summary

Counts: 0 must-fix-before-merge security/correctness issues. 1 docs gap (reference.md still describes pre-fix behavior). 3 test-coverage hardening suggestions (NULL msg_id, valid-msg_id-with-forged-payload assertion, narrower exception match). 0 anti-leak hits.

The fix is conceptually right and the implementation is clean. The docs update is the only thing I'd want before merge; the test-coverage suggestions are nice-to-have but would meaningfully harden the regression surface for a security-sensitive fix.

I am NOT approving or denying — flagging for the maintainer to address the docs gap and decide on test hardening.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

NikolayS · 2026-04-30T10:28:46Z

REV Review — PR #116 (round 2)

CI: 5/5 PG matrix (14, 15, 16, 17, 18) green; client-smoke pass; verify pass; claude-review is this bot.
R1 docs gap status: addressed
R1 hardening tests status: Test 1 narrow / 1b forged / 1c NULL — all three present
Migration-note scrub (c808143) impact: preserved gap fix; only the historical "v0.2 hardening note" paragraph was removed (correct call — no v0.x users to migrate). The three behavioral statements survive.
Verdict: READY FOR USER REVIEW

Verification table

R1 finding	Status in r2
`docs/reference.md` `nack()` entry described pre-fix behavior (BLOCKER)	Fixed in `38713e3`; survives scrub `c808143`. New entry explicitly says caller-supplied `type`/`payload`/`retry_count`/`created_at`/`extra*` are ignored, terminal DLQ insert is idempotent, and unknown/NULL `msg_id` raises `msg_id % not found in batch %`.
Test 1 `when others` is too broad	Fixed in `fc1db18`: now `exception when raise_exception` plus `assert sqlerrm like 'msg_id % not found in batch %'`.
Forged-payload-with-valid-msg_id case missing	Added as Test 1b in `fc1db18`: builds a `pgque.message` with the real `msg_id` but forged `type='forgetype'`, `payload='forgepayload'`, `retry_count=999`, `extras='fe1..fe4'`; asserts the DLQ row has `ev_type='realtype'`, `ev_data='realpayload'`, `ev_retry=0`, `ev_extra1 is null`. This is the assertion that proves canonical-only writes — exactly what was missing in r1.
NULL msg_id untested	Added as Test 1c in `fc1db18`: composite with `msg_id=null`; asserts `nack` raises `msg_id % not found in batch %` and DLQ stays empty.
`ev_txid::text::xid8` cast deserves a one-line comment	Added in `a2359d8`: comment in both `sql/pgque-api/receive.sql` and the regenerated `sql/pgque.sql` ("ev_txid is bigint in get_batch_events (legacy PgQ signature); text round-trip is the codebase convention to widen to xid8 without loss.").

Scrub impact (`c808143`)

Diff is exactly two deleted lines (the migration paragraph + its blank line). The three behavioral bullets remain at lines 88, 91, 92 of the post-scrub file. No regression.

Anti-leak

Counts

0 blocking, 0 non-blocking, 0 potential (all r1 items resolved). 0 anti-leak hits.

REV-style review (security, bugs, tests, guidelines, docs). SOC2 items skipped per project policy.

NikolayS · 2026-04-30T10:42:04Z

Comment-scrub sweep applied per CLAUDE.md style (commit cae72ec): removed Fix #98, Fix #104, and idempotent-insert narrative from receive.sql, dlq.sql, and pgque.sql.

This was referenced Apr 30, 2026

Bug: nack() DLQ path trusts caller-supplied pgque.message and allows forged DLQ rows #98

Closed

Bug: repeated nack() of same DLQ-bound message creates duplicate replayable DLQ rows #104

Closed

NikolayS and others added 3 commits April 30, 2026 02:54

docs: update nack() entry for canonical-lookup fix

38713e3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: harden nack canonical tests (REV follow-up)

fc1db18

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: comment bigint::text::xid8 cast convention

a2359d8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This was referenced Apr 30, 2026

fix(clients/python): consumer warns + acks unhandled event types #121

Merged

docs: reference + examples + roles cleanup (#99, #105, #112) #126

Merged

docs: drop pre-release migration note (no v0.x users)

c808143

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

NikolayS merged commit df88589 into main Apr 30, 2026
8 checks passed

NikolayS deleted the fix/nack-dlq-canonical branch April 30, 2026 10:31

NikolayS mentioned this pull request Apr 30, 2026

Bug: receive() can strand consumer on empty active batch with no batch_id to ack #103

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pgque.nack): canonical + idempotent DLQ terminal handling#116

fix(pgque.nack): canonical + idempotent DLQ terminal handling#116
NikolayS merged 5 commits intomainfrom
fix/nack-dlq-canonical

NikolayS commented Apr 30, 2026

Uh oh!

NikolayS commented Apr 30, 2026

Uh oh!

NikolayS commented Apr 30, 2026

Uh oh!

Uh oh!

NikolayS commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NikolayS commented Apr 30, 2026

Summary

Fix #98 — nack() DLQ path trusts caller-supplied pgque.message (forge)

Fix #104 — repeated nack() creates duplicate replayable DLQ rows

Changes

Test plan

Uh oh!

NikolayS commented Apr 30, 2026

REV review — PR #116 (independent reviewer)

Security (HIGH priority — #98 is a forgery vector)

Bug hunter

Test analyzer

Guidelines / CLAUDE.md compliance

Docs (this is the main blocking nit)

Anti-leak scan

CI

Summary

Uh oh!

NikolayS commented Apr 30, 2026

REV Review — PR #116 (round 2)

Verification table

Scrub impact (c808143)

Anti-leak

Counts

Uh oh!

Uh oh!

NikolayS commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Scrub impact (`c808143`)