Skip to content

fix(pgque.nack): canonical + idempotent DLQ terminal handling#116

Merged
NikolayS merged 5 commits intomainfrom
fix/nack-dlq-canonical
Apr 30, 2026
Merged

fix(pgque.nack): canonical + idempotent DLQ terminal handling#116
NikolayS merged 5 commits intomainfrom
fix/nack-dlq-canonical

Conversation

@NikolayS
Copy link
Copy Markdown
Owner

Summary

Fixes two related bugs in pgque.nack() / DLQ flow, tracked in the umbrella #113.

Fix #98 — nack() DLQ path trusts caller-supplied pgque.message (forge)

Root cause: nack() passed caller-supplied composite fields (i_msg.type, i_msg.payload, etc.) directly to event_dead() without verifying the msg_id actually belonged to the active batch. Any writer with an active batch could forge a pgque.message with an arbitrary ev_id, payload, type, or retry count, inserting fake DLQ rows.

Fix: nack() now calls get_batch_events(i_batch_id) and looks up the row by ev_id = i_msg.msg_id. If msg_id is not found in the batch, it raises msg_id % not found in batch %. Only canonical fields from the batch data tables reach event_dead().

Red test evidence:

psql:tests/test_nack_dlq_canonical.sql:84: ERROR:
  FAIL #98: nack() should have rejected forged msg_id 999999

Green (after fix):

NOTICE:  PASS #98: nack() rejected forged msg_id and DLQ is clean

Fix #104 — repeated nack() creates duplicate replayable DLQ rows

Root cause: event_dead() used a plain INSERT with no duplicate guard. Calling nack() twice before ack() on the same terminal message inserted two dead_letter rows, causing dlq_replay_all() to replay the payload twice.

Fix: Added unique index dl_queue_consumer_ev_idx on (dl_queue_id, dl_consumer_id, ev_id) and changed the INSERT in event_dead() to ON CONFLICT (dl_queue_id, dl_consumer_id, ev_id) DO NOTHING. Repeated terminal nack() calls are now idempotent.

Red test evidence:

NOTICE:  DLQ row count: 2
ERROR:  FAIL #104: expected 1 DLQ row, got 2

Green (after fix):

NOTICE:  PASS #104: repeated nack() is idempotent (1 DLQ row)

Changes

File Change
sql/pgque-api/receive.sql nack() re-queries canonical event via get_batch_events; raises on unknown msg_id
sql/pgque-additions/dlq.sql New unique index dl_queue_consumer_ev_idx; event_dead() insert uses ON CONFLICT DO NOTHING
sql/pgque.sql Regenerated (both changes above)
tests/test_nack_dlq_canonical.sql New red/green regression tests for both bugs
tests/test_api_dlq.sql Test 2 updated: no longer forges retry_count (was the #98 hole); uses max_retries=0 instead
tests/run_all.sql Added test_nack_dlq_canonical to suite

Test plan

  • Red tests confirmed failing on main before fix
  • Green tests pass after fix
  • Full suite (tests/run_all.sql) passes — all tests green

🤖 Generated with Claude Code

#98: nack() now re-queries the canonical event from the active batch
using get_batch_events(batch_id) + ev_id lookup instead of trusting
the caller-supplied pgque.message composite. A forged msg_id not
present in the batch raises an exception; forged payload/type/extras
are ignored.

#104: event_dead() uses ON CONFLICT DO NOTHING on a new unique index
dl_queue_consumer_ev_idx (dl_queue_id, dl_consumer_id, ev_id), so
repeated nack() calls for the same terminal message produce exactly
one dead_letter row.

Red tests in test_nack_dlq_canonical.sql confirmed both bugs before
fix; both pass green after. Existing test_api_dlq.sql updated: test 2
no longer forges retry_count (the #98 vulnerability), instead uses
max_retries=0 to trigger DLQ routing via canonical ev_retry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NikolayS
Copy link
Copy Markdown
Owner Author

REV review — PR #116 (independent reviewer)

Verdict: REQUEST CHANGES (non-blocking nits + one docs gap; core fix is sound)

Confidence: high on the correctness of the fix; high on the docs gap; medium on test-coverage hardening suggestions.


Security (HIGH priority — #98 is a forgery vector)

  • Canonical re-query is correct. nack() now selects ev_id, ev_time, ev_txid, ev_retry, ev_type, ev_data, ev_extra1..4 from pgque.get_batch_events(i_batch_id) filtered by ev_id = i_msg.msg_id, and only those canonical values are passed to event_dead() / event_retry(). Caller-supplied type, payload, retry_count, created_at, and extras are correctly ignored.
  • Cross-batch attack is blocked. pgque.subscription declares unique (sub_batch), so any given batch_id resolves to exactly one (queue, consumer) pair. get_batch_events(batch_id) only returns rows that belong to that batch, so passing msg_id from a different batch raises msg_id % not found in batch %. Verified.
  • SECURITY DEFINER + search_path preserved. Both pgque.nack and pgque.event_dead keep language plpgsql security definer set search_path = pgque, pg_catalog.
  • Information leak in error message. raise exception 'msg_id % not found in batch %', i_msg.msg_id, i_batch_id echoes the caller's input back. Not a real leak (caller already knows what they passed). Acceptable.
  • Index choice for Bug: repeated nack() of same DLQ-bound message creates duplicate replayable DLQ rows #104 is correct. Unique on (dl_queue_id, dl_consumer_id, ev_id) correctly handles fan-out: the same ev_id can still enter the DLQ for two different consumers (different dl_consumer_id), but cannot be duplicated for the same (queue, consumer, ev_id) triple.
  • ON CONFLICT (col-list) DO NOTHING does not mask other constraint violations. Only the matching unique index on the listed columns is suppressed; FK violations on dl_queue_id/dl_consumer_id still raise. Good.

Bug hunter

  1. NULL msg_id is implicitly handled but not explicitly tested. When i_msg.msg_id is null, the where ev_id = null in the canonical re-query yields no rows, so not found raises 'msg_id <NULL> not found in batch %'. Behavior is safe but the test suite never exercises it. Suggest a third test case: nack with a pgque.message whose msg_id is NULL — assert it raises and DLQ stays clean.

  2. Test 1 catches when others, which is too broad. The forge-rejection test exception when others then v_ok := true; passes for any failure, including bugs unrelated to the forge check (e.g., a regression in get_batch_events). Tighten to match sqlerrm like 'msg_id % not found in batch %' or sqlstate 'P0001' so the test pinpoints the intended error.

  3. Forged-payload-with-valid-msg_id case is not covered. The most interesting Bug: nack() DLQ path trusts caller-supplied pgque.message and allows forged DLQ rows #98 attack is the one where the attacker passes a real msg_id from the active batch but forges type/payload/extras/retry_count. The PR's Test 1 only covers a non-existent msg_id = 999999. Suggest adding a Test 1b: receive a real message, mutate v_msg.payload, v_msg.type, v_msg.retry_count := 999, call nack, then assert the DLQ row contains the real realpayload/realtype and the canonical ev_retry (0), not the forged values. This is the assertion that proves canonical-only writes — currently nothing in the suite asserts canonical fields land in DLQ.

  4. ev_txid::text::xid8 cast is correct but worth a one-line comment. get_batch_events.ev_txid is declared bigint (legacy PgQ signature) while the underlying column and dead_letter.ev_txid are xid8; the text round-trip is the codebase convention (sql/pgque.sql lines 252-253 document it). Drop a 1-line comment so future readers don't think it's a typo.

  5. Concurrent double-nack race is logically safe but untested. Two concurrent nack calls each pass get_batch_events, each call event_dead; the unique index + ON CONFLICT DO NOTHING makes the duplicate insert a no-op; both return 1. Since concurrent tests are hard in psql-only suites, this is acceptable to defer, but worth a comment in the test file noting that Test 2 covers serial-only and that the race is reasoned about above.

  6. event_retry (PgQ primitive) still uses a subtransaction (exception when unique_violation) to suppress duplicate retry-queue inserts. Pre-existing PgQ behavior, not introduced by this PR, but flagged in CLAUDE.md design rule 4 ("no subtransactions in hot paths"). Out of scope for this PR; leave for Bug: concurrent receive() for same consumer can double-deliver same events #97/Security: any pgque_writer can ack another consumer/app's active batch by batch_id #102 follow-up.

Test analyzer

Guidelines / CLAUDE.md compliance

  • Lowercase SQL keywords throughout. Good.
  • snake_case identifiers. Good.
  • Schema-qualification (pgque.X). Good.
  • SECURITY DEFINER ... SET search_path = pgque, pg_catalog. Good.
  • Conventional Commits: fix(pgque.nack): canonical lookup + idempotent DLQ insert. Good.
  • Both source files (sql/pgque-api/receive.sql, sql/pgque-additions/dlq.sql) AND the regenerated sql/pgque.sql are in the same commit. Lesson from fix(pgque.receive): reject max_return < 1 and document batch-ownership #114 applied. Good.

Docs (this is the main blocking nit)

  • docs/reference.md was not updated. The current entry for pgque.nack says: "If msg.retry_count is below the queue's max_retries, re-queues..." — but after this fix, nack IGNORES msg.retry_count and re-reads ev_retry from the canonical batch row. Users upgrading from v0.1 who relied on forged retry_count for tests/operations (as the prior tests/test_api_dlq.sql itself did) will be surprised. Please update the reference entry to:
    • State that nack re-queries the canonical event from the active batch and ignores caller-supplied type/payload/retry_count/created_at/extras in msg.
    • Document the new msg_id % not found in batch % error.
    • Add a one-line migration note ("v0.2 hardening: do not forge retry_count to drive DLQ; configure max_retries instead, as the updated tests/test_api_dlq.sql test 2 demonstrates").
  • docs/tutorial.md line 254 ("Keep nacking. ... On the next nack it becomes retry_count=1...") still describes user-driven retry counting; the wording is fine because the canonical retry counter advances naturally, but a sentence noting that the counter is server-side would help.

Anti-leak scan

grep -iE "gitlab|sahmed|artifact[_-]?registry|@AR\b|wi[ -]?#?7[67]|round[ -]?[4-9]\b|R[4-9]\b" over the diff, PR body, and commit message: clean. No external/private refs.

CI

5/5 PG version test jobs (14, 15, 16, 17, 18) pass. client-smoke and verify pass. claude-review is the bot we're invoking; not a self-block.


Summary

Counts: 0 must-fix-before-merge security/correctness issues. 1 docs gap (reference.md still describes pre-fix behavior). 3 test-coverage hardening suggestions (NULL msg_id, valid-msg_id-with-forged-payload assertion, narrower exception match). 0 anti-leak hits.

The fix is conceptually right and the implementation is clean. The docs update is the only thing I'd want before merge; the test-coverage suggestions are nice-to-have but would meaningfully harden the regression surface for a security-sensitive fix.

I am NOT approving or denying — flagging for the maintainer to address the docs gap and decide on test hardening.

NikolayS and others added 3 commits April 30, 2026 02:54
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NikolayS
Copy link
Copy Markdown
Owner Author

REV Review — PR #116 (round 2)

CI: 5/5 PG matrix (14, 15, 16, 17, 18) green; client-smoke pass; verify pass; claude-review is this bot.
R1 docs gap status: addressed
R1 hardening tests status: Test 1 narrow / 1b forged / 1c NULL — all three present
Migration-note scrub (c808143) impact: preserved gap fix; only the historical "v0.2 hardening note" paragraph was removed (correct call — no v0.x users to migrate). The three behavioral statements survive.
Verdict: READY FOR USER REVIEW

Verification table

R1 finding Status in r2
docs/reference.md nack() entry described pre-fix behavior (BLOCKER) Fixed in 38713e3; survives scrub c808143. New entry explicitly says caller-supplied type/payload/retry_count/created_at/extra* are ignored, terminal DLQ insert is idempotent, and unknown/NULL msg_id raises msg_id % not found in batch %.
Test 1 when others is too broad Fixed in fc1db18: now exception when raise_exception plus assert sqlerrm like 'msg_id % not found in batch %'.
Forged-payload-with-valid-msg_id case missing Added as Test 1b in fc1db18: builds a pgque.message with the real msg_id but forged type='forgetype', payload='forgepayload', retry_count=999, extras='fe1..fe4'; asserts the DLQ row has ev_type='realtype', ev_data='realpayload', ev_retry=0, ev_extra1 is null. This is the assertion that proves canonical-only writes — exactly what was missing in r1.
NULL msg_id untested Added as Test 1c in fc1db18: composite with msg_id=null; asserts nack raises msg_id % not found in batch % and DLQ stays empty.
ev_txid::text::xid8 cast deserves a one-line comment Added in a2359d8: comment in both sql/pgque-api/receive.sql and the regenerated sql/pgque.sql ("ev_txid is bigint in get_batch_events (legacy PgQ signature); text round-trip is the codebase convention to widen to xid8 without loss.").

Scrub impact (c808143)

Diff is exactly two deleted lines (the migration paragraph + its blank line). The three behavioral bullets remain at lines 88, 91, 92 of the post-scrub file. No regression.

Anti-leak

grep -iE "gitlab|sahmed|artifact[_-]?registry|@AR\b|wi[ -]?#?7[67]|round[ -]?[4-9]\b|R[4-9]\b" over the full PR diff: clean.

Counts

0 blocking, 0 non-blocking, 0 potential (all r1 items resolved). 0 anti-leak hits.


REV-style review (security, bugs, tests, guidelines, docs). SOC2 items skipped per project policy.

@NikolayS NikolayS merged commit df88589 into main Apr 30, 2026
8 checks passed
@NikolayS NikolayS deleted the fix/nack-dlq-canonical branch April 30, 2026 10:31
@NikolayS
Copy link
Copy Markdown
Owner Author

Comment-scrub sweep applied per CLAUDE.md style (commit cae72ec): removed Fix #98, Fix #104, and idempotent-insert narrative from receive.sql, dlq.sql, and pgque.sql.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant