fix(roles): restore PgQ producer/consumer split (#102, #106)#163
fix(roles): restore PgQ producer/consumer split (#102, #106)#163
Conversation
PgQue had collapsed PgQ's two-role model: pgque_writer was granted both producer primitives (insert_event, send*) AND every consumer primitive (register_consumer*, next_batch*, get_batch_events, finish_batch, event_retry, plus the modern receive/ack/nack/subscribe/unsubscribe wrappers). pgque_writer was also a member of pgque_reader. Effect: any role that could pgque.send() could also ack any consumer's batch by id (#102), reposition another consumer's cursor, replay another consumer's events, or read another consumer's active batch payloads (#106). PgQ's upstream grants.ini puts these consumer primitives on pgq_reader; pgq_writer is producer-only, and pgq_admin inherits both as siblings. Restore that split: - pgque_reader is no longer inherited by pgque_writer; both are members of pgque_admin (sibling model, same as PgQ). - All consumer-side primitives move from pgque_writer to pgque_reader: register_consumer / register_consumer_at / unregister_consumer, next_batch / next_batch_info / next_batch_custom, get_batch_events, finish_batch, event_retry (both overloads). - Modern consume API (receive, ack, nack, subscribe, unsubscribe) moves to pgque_reader. send/send_batch stay on pgque_writer. - Apps that produce AND consume must hold both roles explicitly. Tests: - test_pgque_roles.sql: assert pgque_writer must NOT have execute on ack, nack, receive, finish_batch, next_batch, get_batch_events, register_consumer_at, event_retry, subscribe; assert pgque_admin retains both via membership. - New test_security_producer_isolation.sql: end-to-end repro — a producer-only role tries to ack/finish/register_consumer_at/ get_batch_events/event_retry/next_batch a victim consumer's batch and is denied with insufficient_privilege at every step. Docs (README + reference) rewritten to describe the producer/consumer split, dual-grant pattern for produce+consume apps, and the remaining consumer-vs-consumer interference (which the role split does not solve; that needs ownership checks in the API and is tracked separately). Breaking change: existing deployments that rely on pgque_writer for ack/receive must add `grant pgque_reader to <role>;` after upgrade.
REV Code Review Report
BLOCKING ISSUES (5)HIGH
HIGH
HIGH README/docs — Breaking-change migration steps missing (confidence 8)
HIGH
MEDIUM
NON-BLOCKING (3)LOW Commit subject is 54 chars (CLAUDE.md says < 50) (confidence 7)
LOW
LOW Single commit mixes red and green; no separable RED-only commit (confidence 6)
POTENTIAL ISSUES (5)MEDIUM
MEDIUM
MEDIUM README.md:174 + receive.sql:146 + pgque.sql:4750 + commit message —
MEDIUM
MEDIUM
Summary
Result: BLOCKED — five blocking issues. Most important: the upgrade path is broken (the entire security fix is a no-op for in-place upgrades), and the test coverage claims a REV-style review (5 parallel agents, GitHub-adapted from postgres-ai/rev). SOC2 checks skipped per project policy. Generated by Claude Code |
REV review of the original commit on this branch surfaced one critical bug and several smaller issues. This commit addresses them: CRITICAL — upgrade path PostgreSQL preserves function-level grants across `create or replace function`, and never auto-revokes role memberships. Re-running `\i sql/pgque.sql` on a pre-#163 database therefore left the old `grant pgque_reader to pgque_writer` membership AND the old `grant execute on function pgque.<consumer-fn> to pgque_writer` grants in place — silently nullifying the entire security fix on in-place upgrades. - roles.sql: emit `revoke pgque_reader from pgque_writer` (idempotent) before re-granting admin membership. - roles.sql / receive.sql / send.sql: `revoke ... from pgque_writer` for every consumer-side function before re-granting on pgque_reader. Verified end-to-end: starting from a pre-fix install, running the new pgque.sql in place clears pgque_writer's ack/finish/etc grants and the producer-only attacker is denied. Tests - test_security_producer_isolation.sql: - Idempotent preamble: drops leftover roles/queue from any prior aborted run before setup. - Roles created NOLOGIN (set role does not need LOGIN; reduces blast radius if test aborts before cleanup). - Add the missing `receive()` attacker block (the comment claimed it but only `next_batch` was tested). - Add the `event_retry(timestamptz)` overload as a separate attacker block (each overload has its own privilege row). - Fix swapped #106-B / #106-C labels. - test_pgque_roles.sql: - Add negative grant assertions for the 5 functions that moved but were not previously asserted: event_retry(timestamptz), next_batch_info, next_batch_custom, register_consumer, unregister_consumer. Docs - README: add "Upgrading from a pre-#163 install" block with the audit query and the explicit `grant pgque_reader to <role>` step. - blueprints/PHASES.md: drop stale `admin > writer > reader` inheritance language; describe the sibling model. - README + receive.sql comments: change `closes #102, #106` to `refs #102, #106` so GitHub does not auto-close on merge (the consumer→consumer half is tracked in #164). 72/72 PASS, 0 FAIL on PG16 fresh install AND on simulated upgrade (pre-fix install → post-fix install over the top).
Fixes for REV findings — pushed in fa96c1fAll 5 BLOCKING + 3 NON-BLOCKING + 5 POTENTIAL issues from the REV review are addressed. Plus an additional finding I uncovered while validating the upgrade-path fix. Critical: upgrade path was even worse than the agent flaggedThe Bug Hunter caught the missing RED → GREEN evidenceThe third scenario is the one Bug Hunter flagged: install with the old code, then re-run the new pgque.sql on top — without the function-level revokes, this kept "ATTACK SUCCEEDED" (verified before the fix). With the revokes, it's denied. Full regression after the fixesFresh PG16 install + What changed (per finding)
CI from the previous push was green on PG 14-18; the new push is being checked now. Generated by Claude Code |
REV Round 2 ReportRe-running 5 parallel agents against CI: ✅ all 8 checks success on Verification of round-1 findings
No round-1 finding remains open. No new BLOCKING issues. Round-2 findingsNON-BLOCKING (3) — followup-worthy but not gating merge: MEDIUM
MEDIUM
MEDIUM No machine-verifiable upgrade-path regression test (conf 6, new)
LOW (3):
INFO (2):
PlanI'll fix the two trivial items (heading level + audit-query redundancy comment), file follow-up issues for the 3 deferred MEDIUMs and 1 LOW (SPECx), then approve+merge once CI re-greens. REV round 2 — 5 parallel agents. Result: BLOCKERS CLEARED. Generated by Claude Code |
- README upgrade section: drop the spurious `###` heading inside the Roles section (no other sub-headings there). - README audit query: remove redundant `pgque_admin` from the exclusion list and explain inline that it is filtered by the second predicate (it is a member of pgque_reader). Cosmetic only. No SQL behavior change. Tests 72/72 PASS. REV round 2 followups tracked in #165 (upgrade-path regression test) and #166 (subscribe location, dlq_replay rationale, SPECx note, teardown polish).
REV Round 3 Report —
|
| Round-2 finding | Round-3 verdict |
|---|---|
LOW conf 5: ### heading inconsistent with surrounding section |
✅ FIXED — replaced with bold inline label |
LOW conf 5: redundant pgque_admin in audit NOT IN |
✅ FIXED — removed; explanatory comment added |
| INFO conf 7: commit subject > 50 chars (carried) | ✅ NOT REPRODUCED — ab45ced subject is 36 chars |
New round-3 findings
INFO README.md:197 — Removing the ### heading dropped the auto-generated anchor, which makes the upgrade section harder to deep-link from release notes / migration guides (conf 6)
Tradeoff: the heading-removal was the explicit fix from round 2 (consistency with surrounding section that has no sub-headings). Keeping the trade.
If linkability becomes important later: add<a id="upgrading-from-pre-163"></a>immediately above the bold label.
No other findings. NO_FINDINGS from both agents at confidence ≥ 4 with severity above INFO.
Lifecycle status
- ✅ CI green (7/8 confirmed; claude-review pending)
- ✅ REV done (3 rounds — round 1 surfaced blockers, round 2 verified blocker fixes, round 3 confirmed polish)
- ✅ Testing done with RED/GREEN/UPGRADED evidence
- ➡️ Next: merge once claude-review completes; delete branch.
REV round 3 — focused 2-agent review on a 11-line polish commit.
Generated by Claude Code
Add tests/test_security_get_batch_cursor.sql to lock in the grant posture
for pgque.get_batch_cursor (both 3-arg and 4-arg overloads):
- pgque_reader / pgque_writer must get insufficient_privilege (42501)
on either overload.
- pgque_admin (or members) can still invoke both overloads.
- The 4-arg overload's extra_where remains a raw SQL fragment; the test
runs a UNION ALL forgery probe under admin to document the threat
model: the chosen fix is *boundary lockdown*, not predicate parsing
(SPECx Key Design Rule #2 — "the PgQ engine is sacred").
Note: PR #163 already revoked PUBLIC EXECUTE and never re-granted
get_batch_cursor to reader/writer, so the assertions are green at HEAD.
Subsequent commits add an explicit revoke + warning comment + docs note
for defense-in-depth and to prevent regression.
Wires into tests/run_all.sql alongside the other test_security_*.sql files.
Refs #108
get_batch_cursor's 4-arg overload concatenates i_extra_where verbatim into the dynamic cursor body. A caller that controls extra_where can inject arbitrary predicate SQL or use "false UNION ALL SELECT ..." to forge event rows returned to application code. The PgQ engine body is sacred (SPECx Key Design Rule #2), so this is fixed at the boundary, not by parsing predicates: - Add explicit "revoke execute on function pgque.get_batch_cursor(...) from public, pgque_reader, pgque_writer" for both overloads. PR #163 already produced this posture by deny-by-default; the explicit revoke makes intent visible and prevents regression if a future grant block accidentally re-exposes the function. - Add a SECURITY comment block above the function definition warning that i_extra_where is raw SQL and must never receive user input. Access remains via "grant execute on all functions in schema pgque to pgque_admin" earlier in the same grants block. Refs #108
Add tests/test_security_get_batch_cursor.sql to lock in the grant posture
for pgque.get_batch_cursor (both 3-arg and 4-arg overloads):
- pgque_reader / pgque_writer must get insufficient_privilege (42501)
on either overload.
- pgque_admin (or members) can still invoke both overloads.
- The 4-arg overload's extra_where remains a raw SQL fragment; the test
runs a UNION ALL forgery probe under admin to document the threat
model: the chosen fix is *boundary lockdown*, not predicate parsing
(SPECx Key Design Rule #2 — "the PgQ engine is sacred").
Note: PR #163 already revoked PUBLIC EXECUTE and never re-granted
get_batch_cursor to reader/writer, so the assertions are green at HEAD.
Subsequent commits add an explicit revoke + warning comment + docs note
for defense-in-depth and to prevent regression.
Wires into tests/run_all.sql alongside the other test_security_*.sql files.
Refs #108
get_batch_cursor's 4-arg overload concatenates i_extra_where verbatim into the dynamic cursor body. A caller that controls extra_where can inject arbitrary predicate SQL or use "false UNION ALL SELECT ..." to forge event rows returned to application code. The PgQ engine body is sacred (SPECx Key Design Rule #2), so this is fixed at the boundary, not by parsing predicates: - Add explicit "revoke execute on function pgque.get_batch_cursor(...) from public, pgque_reader, pgque_writer" for both overloads. PR #163 already produced this posture by deny-by-default; the explicit revoke makes intent visible and prevents regression if a future grant block accidentally re-exposes the function. - Add a SECURITY comment block above the function definition warning that i_extra_where is raw SQL and must never receive user input. Access remains via "grant execute on all functions in schema pgque to pgque_admin" earlier in the same grants block. Refs #108
…108) (#169) * test: red for #108 Add tests/test_security_get_batch_cursor.sql to lock in the grant posture for pgque.get_batch_cursor (both 3-arg and 4-arg overloads): - pgque_reader / pgque_writer must get insufficient_privilege (42501) on either overload. - pgque_admin (or members) can still invoke both overloads. - The 4-arg overload's extra_where remains a raw SQL fragment; the test runs a UNION ALL forgery probe under admin to document the threat model: the chosen fix is *boundary lockdown*, not predicate parsing (SPECx Key Design Rule #2 — "the PgQ engine is sacred"). Note: PR #163 already revoked PUBLIC EXECUTE and never re-granted get_batch_cursor to reader/writer, so the assertions are green at HEAD. Subsequent commits add an explicit revoke + warning comment + docs note for defense-in-depth and to prevent regression. Wires into tests/run_all.sql alongside the other test_security_*.sql files. Refs #108 * fix(security): restrict get_batch_cursor to pgque_admin (#108) get_batch_cursor's 4-arg overload concatenates i_extra_where verbatim into the dynamic cursor body. A caller that controls extra_where can inject arbitrary predicate SQL or use "false UNION ALL SELECT ..." to forge event rows returned to application code. The PgQ engine body is sacred (SPECx Key Design Rule #2), so this is fixed at the boundary, not by parsing predicates: - Add explicit "revoke execute on function pgque.get_batch_cursor(...) from public, pgque_reader, pgque_writer" for both overloads. PR #163 already produced this posture by deny-by-default; the explicit revoke makes intent visible and prevents regression if a future grant block accidentally re-exposes the function. - Add a SECURITY comment block above the function definition warning that i_extra_where is raw SQL and must never receive user input. Access remains via "grant execute on all functions in schema pgque to pgque_admin" earlier in the same grants block. Refs #108 * docs: mark get_batch_cursor as admin/trusted-only (#108) Add a security note to docs/reference.md for both get_batch_cursor overloads explaining that extra_where is a raw SQL fragment (not a parameter), the UNION-ALL row-forgery risk, and that access is admin-only — never pass user-controlled text. Steer application code toward pgque.receive instead. Refs #108 --------- Co-authored-by: Claude <noreply@anthropic.com>
Closes #165 and #166. Background After #163 merged, audit found docs/reference.md "PgQ primitives" section still listed `Grant: pgque_writer` for 10 consumer-side functions that moved to pgque_reader: register_consumer (3 forms), next_batch (3 forms), get_batch_events, finish_batch, event_retry (2 forms). blueprints/SPECx.md and docs/tutorial.md also did not state the sibling relationship. Pure doc gap; no functional bug, but readers landing on those sections would be misled about which role they need. Changes reference.md - Fix 10 stale `Grant: pgque_writer` lines under "PgQ primitives". - Add a one-line rationale to dlq_replay/dlq_replay_all explaining why they remain on pgque_writer (replay = produce action). - Add a parenthetical to subscribe/unsubscribe explaining that Source: send.sql is intentional (file co-locates produce + sub management). tutorial.md - Replace the bare "three roles" mention with a short list that states the sibling rule and points to the reference. SPECx.md (Sprint-1 acceptance bullets) - Add a sibling-relationship bullet next to the existing role-test bullets. dlq.sql comment - Expand the "writer-level" rationale to explain the produce-side framing. test_security_producer_isolation.sql - Mirror the preamble's defensive `revoke all privileges on schema pgque from <test_role>` in the normal-path teardown. tests/test_upgrade_grants.sql (NEW) - Red/green regression test for #165: simulate the pre-#163 permission set (writer inherits reader + 15 function grants on writer), replay the explicit revokes from roles.sql / receive.sql / send.sql, then assert pgque_writer has no execute on any moved function and is no longer a member of pgque_reader. TDD evidence: skipping the revoke block in section 3 of the test makes section 4 fail with `upgrade should have revoked pgque_reader membership from pgque_writer`. Restoring the revokes makes it pass. run_all.sql wires the new test into the suite. Tests: 75/75 PASS on PG16 (was 72/72; +3 from new test).
Closes #165 and #166. Background After #163 merged, audit found docs/reference.md "PgQ primitives" section still listed `Grant: pgque_writer` for 10 consumer-side functions that moved to pgque_reader: register_consumer (3 forms), next_batch (3 forms), get_batch_events, finish_batch, event_retry (2 forms). blueprints/SPECx.md and docs/tutorial.md also did not state the sibling relationship. Pure doc gap; no functional bug, but readers landing on those sections would be misled about which role they need. Changes reference.md - Fix 10 stale `Grant: pgque_writer` lines under "PgQ primitives". - Add a one-line rationale to dlq_replay/dlq_replay_all explaining why they remain on pgque_writer (replay = produce action). - Add a parenthetical to subscribe/unsubscribe explaining that Source: send.sql is intentional (file co-locates produce + sub management). tutorial.md - Replace the bare "three roles" mention with a short list that states the sibling rule and points to the reference. SPECx.md (Sprint-1 acceptance bullets) - Add a sibling-relationship bullet next to the existing role-test bullets. dlq.sql comment - Expand the "writer-level" rationale to explain the produce-side framing. test_security_producer_isolation.sql - Mirror the preamble's defensive `revoke all privileges on schema pgque from <test_role>` in the normal-path teardown. tests/test_upgrade_grants.sql (NEW) - Red/green regression test for #165: simulate the pre-#163 permission set (writer inherits reader + 15 function grants on writer), replay the explicit revokes from roles.sql / receive.sql / send.sql, then assert pgque_writer has no execute on any moved function and is no longer a member of pgque_reader. TDD evidence: skipping the revoke block in section 3 of the test makes section 4 fail with `upgrade should have revoked pgque_reader membership from pgque_writer`. Restoring the revokes makes it pass. run_all.sql wires the new test into the suite. Tests: 75/75 PASS on PG16 (was 72/72; +3 from new test).
Summary
PgQue had collapsed PgQ's two-role model. Upstream PgQ (
pgq/structure/grants.ini) puts consumer primitives (finish_batch,next_batch*,get_batch_events,register_consumer*,event_retry) onpgq_reader, withpgq_writergranted only producer primitives (insert_event, triggers).pgq_adminis a sibling member of both.In PgQue:
receive,ack,nack,subscribe,unsubscribe— was granted directly topgque_writer.pgque_writerwas also a member ofpgque_reader.Effect: any role that could call
pgque.send()could also ack any consumer's batch by id (#102), reposition another consumer's cursor, replay another consumer's events, or read another consumer's active batch payloads (#106). PgQ's role separation, designed exactly to prevent this, had been undone.Fix — restore PgQ's sibling split
pgque_readeris no longer inherited bypgque_writer. Both are now members ofpgque_admin(sibling model, matches PgQ'screate role pgq_admin in role pgq_reader, pgq_writer;).pgque_writertopgque_reader:register_consumer,register_consumer_at,unregister_consumernext_batch,next_batch_info,next_batch_customget_batch_events,finish_batchevent_retry(both overloads)receive,ack,nack,subscribe,unsubscribe) moves topgque_reader.send/send_batchstay onpgque_writer.Apps that produce and consume must now hold both
pgque_readerandpgque_writerexplicitly.Tests
tests/test_pgque_roles.sql: positive assertions migrated topgque_reader; new negative assertions thatpgque_writerdoes not haveexecuteonack,nack,receive,finish_batch,next_batch,get_batch_events,register_consumer_at,event_retry,subscribe;pgque_adminkeeps both via membership.tests/test_security_producer_isolation.sql: end-to-end Security: any pgque_writer can ack another consumer/app's active batch by batch_id #102/Security: low-level primitives let writers mutate/read other consumers' active batches #106 repro. A producer-only role (onlypgque_writer) tries toack/finish_batch/register_consumer_at/get_batch_events/event_retry/next_batchagainst a victim consumer's batch and is denied withinsufficient_privilegeat every step.Local run on a fresh PG16 install: 70/70 PASS, 0 FAIL.
Docs
README.mdanddocs/reference.mdrewritten to describe the sibling split, dual-grant pattern for produce+consume apps, and the remaining (unsolved) consumer-vs-consumer interference.Breaking change
Existing deployments that grant
pgque_writerfor ack/receive must addgrant pgque_reader to <role>;after upgrade. Documented in the commit message and README.Out of scope — deferred to a later release (#164)
The role split closes the producer → consumer boundary, not the consumer → consumer one. Any
pgque_readercan still ack another consumer's batch by id. Closing that requires ownership checks insideack/nack(e.g.ack(queue, consumer, batch_id)) — the second half of #102's "Suggested fix options".Decision: not in this release. Tracked in #164. The producer/consumer split lands now and is the higher-leverage half (it closes the most common misconfiguration — apps using
pgque_writerfor everything). Consumer-vs-consumer ownership is a meaningful API change (new arities, breaking signature change) and needs its own PR + client-driver coordination.Test plan
bash build/transform.sh— cleantests/run_all.sqlon PG16 — 70 PASS, 0 FAILtest_security_producer_isolation.sqlexercises 6 attacker paths under a pure producer role; all deniedverifyandclient-smokegreenCloses #102 (producer side) and #106 (producer side). Consumer-side ownership tracked in #164.