fix(pgque.nack): canonical + idempotent DLQ terminal handling #116
#98: nack() now re-queries the canonical event from the active batch using get_batch_events(batch_id) plus an ev_id lookup, instead of trusting the caller-supplied pgque.message composite. A forged msg_id that is not present in the batch raises an exception. Forged payload, type, and extras are ignored.

#104: event_dead() uses ON CONFLICT DO NOTHING on a new unique index dl_queue_consumer_ev_idx (dl_queue_id, dl_consumer_id, ev_id), so repeated nack() calls for the same terminal message produce exactly one dead_letter row.

Red tests in test_nack_dlq_canonical.sql confirmed both bugs before the fix, and both pass after it. The existing test_api_dlq.sql is updated: test 2 no longer forges retry_count (the #98 vulnerability). Instead, it uses max_retries=0 to trigger DLQ routing through the canonical ev_retry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
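The test_api_dlq.sql change depends on the DLQ routing decision reading the retry counter from the canonical batch row (ev_retry) and not from the caller's retry_count. The following is a hedged Python sketch of that idea. The comparison it uses is an assumption for illustration, not pgque's confirmed rule: dead-letter once ev_retry reaches max_retries, with NULL read as 0. route_on_nack() is a hypothetical name.

```python
from typing import Optional


def route_on_nack(canonical_ev_retry: Optional[int], max_retries: int) -> str:
    """Pick the destination of a nacked event from the batch's own retry counter.

    Assumed rule (hypothetical): "dlq" once the canonical retry count has
    reached max_retries, otherwise "retry". None models a NULL ev_retry,
    i.e. an event that has not been retried yet.
    """
    retries_so_far = canonical_ev_retry or 0
    return "dlq" if retries_so_far >= max_retries else "retry"


# With max_retries=0, the very first nack() already dead-letters,
# whatever the caller wrote into its own retry_count field.
print(route_on_nack(None, 0))  # dlq
print(route_on_nack(None, 3))  # retry
print(route_on_nack(3, 3))     # dlq
```

The function never accepts the caller's retry_count at all. Under this model, test 2 has to pick a threshold (max_retries=0) to reach the DLQ, because it can no longer forge a counter.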
REV review: PR #116 (independent reviewer)
Verdict: REQUEST CHANGES (non-blocking nits + one docs gap; core fix is sound)
Confidence: high on the correctness of the fix; high on the docs gap; medium on the test-coverage hardening suggestions.

Security (HIGH priority: #98 is a forgery vector)
Bug hunter
Test analyzer
Guidelines / CLAUDE.md compliance
Docs (this is the main blocking nit)
Anti-leak scan
CI
5/5 PG version test jobs (14, 15, 16, 17, 18) pass.

Summary
Counts: 0 must-fix-before-merge security/correctness issues. 1 docs gap (reference.md still describes pre-fix behavior). 3 test-coverage hardening suggestions (NULL msg_id, valid-msg_id-with-forged-payload assertion, narrower exception match). 0 anti-leak hits.

The fix is conceptually right and the implementation is clean. The docs update is the only thing I'd want before merge. The test-coverage suggestions are nice-to-have, but they would meaningfully harden the regression surface for a security-sensitive fix. I am NOT approving or denying. I am flagging this for the maintainer to address the docs gap and decide on test hardening.
REV Review: PR #116 (round 2)
CI: 5/5 PG matrix (14, 15, 16, 17, 18) green.

Verification table
Scrub impact (c808143)
The diff is exactly two deleted lines: the migration paragraph and its blank line. The three behavioral bullets remain at lines 88, 91, and 92 of the post-scrub file. No regression.

Anti-leak
Counts: 0 blocking, 0 non-blocking, 0 potential (all r1 items resolved). 0 anti-leak hits. REV-style review (security, bugs, tests, guidelines, docs). SOC2 items skipped per project policy.
Summary
Fixes two related bugs in the pgque.nack() / DLQ flow, tracked in umbrella issue #113.

Fix #98: nack() DLQ path trusts caller-supplied pgque.message (forge)
Root cause: nack() passed caller-supplied composite fields (i_msg.type, i_msg.payload, etc.) directly to event_dead() without verifying that the msg_id belonged to the active batch. Any writer with an active batch could forge a pgque.message with an arbitrary ev_id, payload, type, or retry count, and so insert fake DLQ rows.

Fix: nack() now calls get_batch_events(i_batch_id) and looks up the row by ev_id = i_msg.msg_id. If msg_id is not found in the batch, it raises "msg_id % not found in batch %". Only canonical fields from the batch data tables reach event_dead().

Red test evidence:
Green (after fix):
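Here is a minimal sketch of the canonical-lookup rule. It is written in Python rather than PL/pgSQL so that it runs standalone. It assumes the batch's rows have already been fetched, which in pgque is the job of get_batch_events, and it models only a few PgQ event columns. Message, BatchEvent, NackError, and canonical_event() are hypothetical names, not pgque's API. The key property is that the caller's message contributes only msg_id, and everything passed on toward the DLQ comes from the batch row.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Message:
    """Caller-supplied composite (hypothetical stand-in for pgque.message)."""
    msg_id: Optional[int]
    type: str
    payload: str
    retry_count: int


@dataclass(frozen=True)
class BatchEvent:
    """Canonical row from the active batch (simplified subset of event columns)."""
    ev_id: int
    ev_type: str
    ev_data: str
    ev_retry: Optional[int]


class NackError(Exception):
    """Raised when the caller names an event that is not in the batch."""


def canonical_event(batch_id: int, batch_events: List[BatchEvent], msg: Message) -> BatchEvent:
    """Return the batch's own row for msg.msg_id, ignoring every other caller field."""
    for ev in batch_events:
        # None never equals an int here, much as NULL = x is never true in SQL,
        # so a NULL msg_id falls through to the error below.
        if ev.ev_id == msg.msg_id:
            return ev
    raise NackError(f"msg_id {msg.msg_id} not found in batch {batch_id}")


# Usage: a forged payload/type/retry_count on a real msg_id is ignored.
batch = [BatchEvent(ev_id=1, ev_type="order.created", ev_data='{"id": 7}', ev_retry=None)]
forged = Message(msg_id=1, type="evil", payload='{"id": 999}', retry_count=99)
print(canonical_event(42, batch, forged).ev_data)  # {"id": 7}
```

The sketch also covers two cases the review asks to pin down in tests. A NULL msg_id and an unknown msg_id both raise the same "not found in batch" error, and that message is specific enough to match narrowly.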
Fix #104 — repeated nack() creates duplicate replayable DLQ rows
Root cause: event_dead() used a plain INSERT with no duplicate guard. Calling nack() twice before ack() on the same terminal message inserted two dead_letter rows, which caused dlq_replay_all() to replay the payload twice.

Fix: Added unique index dl_queue_consumer_ev_idx on (dl_queue_id, dl_consumer_id, ev_id) and changed the INSERT in event_dead() to ON CONFLICT (dl_queue_id, dl_consumer_id, ev_id) DO NOTHING. Repeated terminal nack() calls are now idempotent.

Red test evidence:
Green (after fix):
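Below is a runnable sketch of the idempotent insert. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. SQLite 3.24 and later accept the same ON CONFLICT (...) DO NOTHING form when a unique index covers the target columns. The dead_letter table here is trimmed to the indexed columns plus a payload, whereas the real pgque schema has more. This event_dead() is a hypothetical simplification of the PR's function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dead_letter (
        dl_id          INTEGER PRIMARY KEY,
        dl_queue_id    INTEGER NOT NULL,
        dl_consumer_id INTEGER NOT NULL,
        ev_id          INTEGER NOT NULL,
        ev_data        TEXT
    );
    -- Same key shape as the PR's dl_queue_consumer_ev_idx.
    CREATE UNIQUE INDEX dl_queue_consumer_ev_idx
        ON dead_letter (dl_queue_id, dl_consumer_id, ev_id);
    """
)


def event_dead(queue_id: int, consumer_id: int, ev_id: int, ev_data: str) -> None:
    """Record a terminal event; a repeat for the same key is silently skipped."""
    conn.execute(
        """
        INSERT INTO dead_letter (dl_queue_id, dl_consumer_id, ev_id, ev_data)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (dl_queue_id, dl_consumer_id, ev_id) DO NOTHING
        """,
        (queue_id, consumer_id, ev_id, ev_data),
    )


def dlq_rows() -> int:
    """Count dead_letter rows, i.e. what a replay-all would re-publish."""
    return conn.execute("SELECT COUNT(*) FROM dead_letter").fetchone()[0]


# Two terminal nack() calls for the same message before ack():
event_dead(1, 10, 500, '{"id": 7}')
event_dead(1, 10, 500, '{"id": 7}')
print(dlq_rows())  # 1
```

dl_consumer_id is part of the key. A different consumer dead-lettering the same ev_id on the same queue therefore still gets its own row, so the deduplication is per consumer, not global.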
Changes
| File | Change |
|---|---|
| sql/pgque-api/receive.sql | nack() re-queries the canonical event via get_batch_events; raises on unknown msg_id |
| sql/pgque-additions/dlq.sql | New unique index dl_queue_consumer_ev_idx; event_dead() insert uses ON CONFLICT DO NOTHING |
| sql/pgque.sql | |
| tests/test_nack_dlq_canonical.sql | |
| tests/test_api_dlq.sql | No longer forges retry_count (was the #98 hole); uses max_retries=0 instead |
| tests/run_all.sql | Adds test_nack_dlq_canonical to the suite |

Test plan
main before fix
tests/run_all.sql) passes; all tests green

🤖 Generated with Claude Code