Rotate subscription and tick tables to avoid held-xmin bloat (#61) #62
Follow-up suggestion: decouple metadata rotation cadence from event rotation

The PR currently ties subscription/tick rotation to the existing event-rotation cadence. Observed in the R6 smoke benchmark: metadata rotation at 30 s is 4-6× more frequent than it needs to be, and the excess shows up as pg_class churn and TRUNCATE traffic.

Proposal

Add a separate config field on `pgque.queue`:

```sql
ALTER TABLE pgque.queue
  ADD COLUMN IF NOT EXISTS queue_metadata_rotation_period interval NOT NULL DEFAULT '2 minutes';
```

Default to 2 min; the rotation trigger in the ticker uses this field for subscription/tick rotation specifically, while event-table rotation keeps using its existing period. At a 2 min cadence the expected peak bloat still meets the acceptance criteria while cutting pg_class bloat and TRUNCATE churn by 4×. Benchmark override: bench runs can set the field explicitly instead of relying on the default.

Rationale for separation

Event tables are bulk-storage tables where rotation = capacity management (drop old events you no longer need). Metadata tables are tiny-hot tables where rotation = bloat mitigation. Same mechanism, different economics — they deserve independent knobs.

Scope for this PR vs follow-up

This is a design-polish suggestion, not a correctness blocker for landing the current PR. Options: (1) fold the new field and default into this PR; (2) land as-is and file a follow-up.

I'd lean toward (1) since the acceptance criteria for the metadata-bloat fix (pgque#61) arguably includes "reasonable default cadence for production" — and 30 s matching events is a debatable default. Either way, the rotation design itself is validated by the R6 smoke; this is a tuning refinement.
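A rough sketch of the cadence gate the ticker could evaluate under this proposal (hypothetical: `queue_metadata_rotation_period` is the proposed column above; `queue_name` and the `pgque.meta_rotation.last_rotation_time` singleton column are assumed from the PR's design for illustration):

```sql
-- Sketch only, not the PR's code: is a metadata rotation due right now?
SELECT q.queue_name,
       now() - m.last_rotation_time >= q.queue_metadata_rotation_period AS metadata_rotation_due
  FROM pgque.queue q
 CROSS JOIN pgque.meta_rotation m;
```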
REV Code Review — #62

Summary: Fixes #61 by extending PgQ's 3-table rotation pattern to `pgque.subscription` and `pgque.tick`. Installed on PG18 locally — full suite passes.

BLOCKING — 0

NON-BLOCKING

- MEDIUM — No new regression tests for rotation logic
- MEDIUM —
- LOW —
- LOW —
- LOW —
POTENTIAL ISSUES

- MEDIUM — Hot-path overhead: every write to `pgque.subscription`/`pgque.tick` now routes through an INSTEAD OF trigger and a `meta_rotation` lookup
- LOW — The subscription view's
- INFO — The tick rotation's "defer if any consumer's `sub_last_tick` lives in the target slot" rule means a single stuck consumer can block tick rotation indefinitely while subscription rotation keeps happening
Verified on a local PG18 install: no crashes, no FK violations, no grant errors on install, no issues with the INSTEAD OF triggers. The design holds together in practice.

Summary

Result: REQUEST CHANGES (non-blocking). The design is correct, the manual validation holds, and the existing test suite keeps passing. But shipping a change of this scope without a dedicated regression test for the rotation logic is what the request for changes comes down to.

REV-assisted review (SOC2 checks skipped per request).
docs: complete bench methodology + tooling + ops gotchas under benchmark/ (issue #77)

Adds a strictly-additive benchmark/ directory documenting the methodology, tooling, and operational lessons from the pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench that backs #61 and PR #62.

- README.md: entry point + quick-start
- METHODOLOGY.md: adapted from GitLab #77 note 3263767264
- OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8 bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
- HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
- tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

No pgque production SQL is touched. Refs: #61, #62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: complete bench methodology + tooling + ops gotchas under benchmark/

Adds a strictly-additive benchmark/ directory documenting the methodology, tooling, and operational lessons from the pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench that backs #61 and PR #62.

- README.md: entry point + quick-start
- METHODOLOGY.md: methodology fix per review feedback
- OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8 bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
- HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
- tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

No pgque production SQL is touched. Refs: #61, #62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: complete bench methodology + tooling + ops gotchas under benchmark/

  Adds a strictly-additive benchmark/ directory documenting the methodology, tooling, and operational lessons from the pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench that backs #61 and PR #62.

  - README.md: entry point + quick-start
  - METHODOLOGY.md: methodology fix per review feedback
  - OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8 bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
  - HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
  - tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

  No pgque production SQL is touched. Refs: #61, #62.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(benchmark): shell-style polish per CLAUDE.md

  - benchmark/runners/fix_nvme_mount.sh: switch to /usr/bin/env bash shebang; use set -Eeuo pipefail (was set -euo pipefail without -E)
  - benchmark/runners/run_r7.sh: add -Ee flags to existing pipefail
  - benchmark/runners/clean_reinstall.sh: read -r in two while-loops

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(benchmark): anti-leak scrub + polish pass

  Remove all references to private internal URLs, round numbering (R4/R5/R6/R8), and private repository paths from benchmark/ files.

  - METHODOLOGY.md: drop internal URL + note IDs from header; remove internal posting-style section (§9); neutralize round refs; fix /tmp/bench_r<N> path reference
  - README.md: drop internal reference comment
  - HARDWARE.md: fix binary units (GB→GiB, TB→TiB); drop R7 round ref
  - OPS_GOTCHAS.md: neutralize R4/R6 round refs in lessons; fix binary units (GB→GiB, MB/s→MiB/s)
  - consumers/*.sql: drop "R6 instrumented" prefix from all 7 files
  - runners/run_r7.sh: remove R6/R7 round refs from inline comments
  - tooling/sys_metrics_sampler.py: remove R7 from docstring
  - tooling/parse_events_consumed.py: remove R6 from docstring + msg
  - charts/r5_analyze.py, r6_smoke_chart.py: remove Rn from docstrings, chart titles, and file-size output (KB→KiB)

  PR description updated separately via gh pr edit.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(benchmark): address REV r1 findings (anti-leak + style)

  - Remove private WI refs from bootstrap.sh:41,44 (replace with "see methodology notes")
  - Fix set -Eeuo pipefail in 7 shell scripts that only had partial flags
  - Fix broken OPS_GOTCHAS.md:185 link (install_pgque.sh → install/README.md)
  - Fix binary unit in install_pgboss.sh:2 (GB → GiB)
  - Fix run_r7.sh tool paths: resolve from benchmark/tooling/ by default instead of hardcoded /tmp/r7 and /tmp/r6; override via R7_DIR/R6_DIR

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(benchmark): correct three methodology claims

  - logging_collector=off does not mean zero log I/O; journald still writes to disk (#123)
  - premake=20 planner cost is first-query-in-session, not per-query; root cause of steady-state regression is a follow-up (#124)
  - add RAISE NOTICE observer-effect caveat for high-frequency use (#127)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(benchmark): drop incorrect PgBouncer speculation

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Nik Samokhvalov <nik@Niks-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Converts pgque.subscription and pgque.tick from single base tables to 3-child UNION-ALL views, mirroring the event-table rotation pattern, to cap held-xmin bloat under high-update load.

Changes:

- sql/pgque-additions/metadata_rotation.sql (new): tables tick_0/1/2, subscription_0/1/2, template tables, INSTEAD OF routing triggers, overrides for maint_tables_to_vacuum(), maint_operations(), unregister_consumer(), and new functions maint_rotate_metadata() + maint_rotate_metadata_step2().
- sql/pgque-additions/lifecycle.sql: pgque_rotate_step2 cron job also calls maint_rotate_metadata_step2().
- sql/pgque-api/maint.sql: skip maint_rotate_metadata_step2 in maint() loop (needs separate transaction).
- build/transform.sh: add metadata_rotation.sql to assembly.
- sql/pgque.sql, sql/pgque-tle.sql: regenerated.

Rebased from a9e992f onto current main; logic relocated from generated sql/pgque.sql into sql/pgque-additions/ per CLAUDE.md. All SECURITY DEFINER functions pin search_path. Regression + acceptance tests pass.

https://claude.ai/code/session_015Tf54iAd24uQPKCBaXeiTV
Rebase complete — force-pushed from a9e992f to 8acc53f.
| File | Role |
|---|---|
| `sql/pgque-additions/metadata_rotation.sql` | New. All rotation DDL and functions live here — this is the canonical source. |
| `sql/pgque-additions/lifecycle.sql` | `pgque_rotate_step2` cron job now also calls `maint_rotate_metadata_step2()`. |
| `sql/pgque-api/maint.sql` | `maint()` skips `maint_rotate_metadata_step2` (needs separate tx, same pattern as `maint_rotate_tables_step2`). |
| `build/transform.sh` | Adds `metadata_rotation.sql` to the assembly loop (after `dlq.sql`). |
| `sql/pgque.sql`, `sql/pgque-tle.sql` | Regenerated — not hand-edited. |

`sql/pgque.sql` is now derived, not the source of truth for this feature.
What `metadata_rotation.sql` contains

- `pgque.meta_rotation` singleton table (rotation pointer)
- `pgque.tick_tmpl`, `tick_0/1/2` child tables + `pgque.tick` UNION ALL view
- `pgque.subscription_tmpl`, `subscription_0/1/2` child tables + `pgque.subscription` UNION ALL view
- `pgque._subscription_route()` and `pgque._tick_route()` INSTEAD OF trigger functions (sketched below)
- `pgque.maint_rotate_metadata()` and `pgque.maint_rotate_metadata_step2()` — both SECURITY DEFINER with `set search_path = pgque, pg_catalog`
- Override of `maint_tables_to_vacuum()` (lists child tables, not views)
- Override of `maint_operations()` (emits both metadata rotation entries)
- Override of `unregister_consumer()` (drops `FOR UPDATE OF s` — a view can't be locked; `FOR UPDATE OF c` kept)
- Explicit `SELECT` grants on the new views to `public` (replacing the OID-renamed base-table grants from `grants.sql`)
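To make the view + routing items above concrete, here is a trimmed-down sketch of the pattern. It is illustrative only: the column set is reduced to three fields, and the real definitions live in `sql/pgque-additions/metadata_rotation.sql`.

```sql
-- Illustrative sketch of the 3-child rotation pattern (not the shipped DDL).
CREATE TABLE pgque.meta_rotation (
    cur_subscription_table int         NOT NULL DEFAULT 0,
    last_rotation_time     timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE pgque.subscription_tmpl (
    sub_queue     text NOT NULL,
    sub_consumer  text NOT NULL,
    sub_last_tick bigint
);
CREATE TABLE pgque.subscription_0 (LIKE pgque.subscription_tmpl INCLUDING DEFAULTS);
CREATE TABLE pgque.subscription_1 (LIKE pgque.subscription_tmpl INCLUDING DEFAULTS);
CREATE TABLE pgque.subscription_2 (LIKE pgque.subscription_tmpl INCLUDING DEFAULTS);

-- The view exposes only the active child, selected via the rotation pointer.
CREATE VIEW pgque.subscription AS
SELECT s.* FROM pgque.subscription_0 s JOIN pgque.meta_rotation m ON m.cur_subscription_table = 0
UNION ALL
SELECT s.* FROM pgque.subscription_1 s JOIN pgque.meta_rotation m ON m.cur_subscription_table = 1
UNION ALL
SELECT s.* FROM pgque.subscription_2 s JOIN pgque.meta_rotation m ON m.cur_subscription_table = 2;

-- INSTEAD OF trigger function: route every write to the active child.
CREATE FUNCTION pgque._subscription_route() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    cur int;
BEGIN
    SELECT cur_subscription_table INTO cur FROM pgque.meta_rotation;
    IF tg_op = 'INSERT' THEN
        EXECUTE format('INSERT INTO pgque.subscription_%s SELECT ($1).*', cur) USING NEW;
        RETURN NEW;
    ELSIF tg_op = 'UPDATE' THEN
        EXECUTE format('UPDATE pgque.subscription_%s SET sub_last_tick = ($1).sub_last_tick
                         WHERE sub_queue = ($1).sub_queue AND sub_consumer = ($1).sub_consumer', cur)
          USING NEW;
        RETURN NEW;
    ELSE  -- DELETE
        EXECUTE format('DELETE FROM pgque.subscription_%s
                         WHERE sub_queue = ($1).sub_queue AND sub_consumer = ($1).sub_consumer', cur)
          USING OLD;
        RETURN OLD;
    END IF;
END $$;

CREATE TRIGGER subscription_route
INSTEAD OF INSERT OR UPDATE OR DELETE ON pgque.subscription
FOR EACH ROW EXECUTE FUNCTION pgque._subscription_route();
```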
Test results (PG 16, local)
- Regression (`tests/run_all.sql`): `=== ALL TESTS PASSED ===`
- Acceptance (`tests/acceptance/run_acceptance.sql`): `=== ALL ACCEPTANCE TESTS PASSED ===`
- Idempotent reinstall: no ERROR/FATAL output on second `\i sql/pgque.sql`
- Functional smoke: rotation returns 0 when too soon, returns 1 after backdating `last_rotation_time`; pointer flips from `cur=0` to `cur=1` correctly
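The functional smoke is reproducible in a few statements (sketch; the 0/1 return convention and the pointer flip are taken from the results above, and the singleton's column names follow the design):

```sql
SELECT pgque.maint_rotate_metadata();                    -- returns 0: too soon
UPDATE pgque.meta_rotation
   SET last_rotation_time = last_rotation_time - interval '1 day';  -- backdate
SELECT pgque.maint_rotate_metadata();                    -- returns 1: rotated
SELECT cur_subscription_table FROM pgque.meta_rotation;  -- pointer flipped 0 -> 1
```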
Caveats / things to review
- `subscription_tmpl`/`tick_tmpl` tables persist after install — they are empty but sit in the schema. This is intentional (column-schema carrier for `LIKE … INCLUDING DEFAULTS`), but they could be dropped after the child tables are created if preferred.
- The `FOR UPDATE OF s` removal in `unregister_consumer` is safe for the single-consumer unregister path (the DELETE provides the lock), but worth a second pair of eyes given PgQ's subconsumer logic.
- R6-style benchmark (dead-tuple bound under held xmin) not yet run — the acceptance criteria from #61 ("Metadata-table bloat under held xmin — subscription/tick need rotation": `subscription` dead tuples ≤ 500, `tick` dead tuples ≤ 200, iter-TPS within 20%) require a long-running soak test, not included here.
- The tick view's extra `tick_child_table` column is visible to code that does `SELECT *` from `pgque.tick`. PgQ code only ever selects named columns, but worth noting.
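The second caveat is easy to demonstrate: row locks cannot target a set-operation view, which is exactly why `FOR UPDATE OF s` had to go (hypothetical consumer name; the error is what stock Postgres raises for a UNION ALL view):

```sql
SELECT 1
  FROM pgque.subscription s
 WHERE s.sub_consumer = 'demo_consumer'
   FOR UPDATE OF s;
-- ERROR:  FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT
```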
Generated by Claude Code
Idle sweep (30 s per cell, 100 ev/s, single laptop, PG16):

- tick_period_ms=1000: p50 503, p99 994, max 1004 ms (1 tick/sec floor)
- tick_period_ms=100: p50 53, p99 103, max 105 ms (default; clean)
- tick_period_ms=10: p50 8, p99 864, max 1013 ms (tail blows up)
- tick_period_ms=1: p50 3, p99 460, max 548 ms (tail blows up)

Held-xmin (tick=100 ms, 1000 ev/s, 5 min RR tx open):

- baseline: p50 52.6, p99 103.4, max 143.8 ms
- held-xmin: p50 53.8, p99 104.7, max 235.6 ms

Median essentially flat under held-xmin; worst-case roughly doubles. Milder than expected — consistent with the design goal of stable latency under load on moderate timescales. Longer runs / higher tick rates would amplify the bloat-driven tail (PR #62 territory).

Refs #69, PR #204
* feat: configurable sub-second tick rate (default 10 Hz)

  pg_cron's minimum schedule is 1 second, which capped end-to-end latency at ~1 s for non-LISTEN consumers. Drive sub-second ticking from inside a single pg_cron slot via a new pgque.ticker_loop() procedure that re-invokes pgque.ticker() at pgque.config.tick_period_ms cadence (default 100 ms = 10 Hz). The procedure commits between iterations so each tick gets its own transaction and rotation isn't blocked by a held xmin (this is also why ticker_loop is a PROCEDURE, not a FUNCTION). Tunable at runtime with pgque.set_tick_period_ms(ms); changes apply on the next pg_cron slot without rescheduling.

  Refs #69

* docs: explain default 10 Hz ticking, configurability, pg_cron log hygiene

  - README: replace "1 second tick" framing with "10 Hz default"; new "Tick rate" section covers `pgque.set_tick_period_ms`, trade-offs (WAL, NOTIFY, metadata-table dead tuples), and clarifies that the per-second pg_cron slot count does NOT increase with sub-second ticking — the cron.job_run_details growth rate is unchanged.
  - docs/three-latencies.md: rewrite the cadence table around tick_period_ms with new rough numbers per rate; explicitly note the pg_cron logging problem is independent of sub-second ticking.
  - docs/reference.md: document `pgque.ticker_loop()` and `pgque.set_tick_period_ms(ms)`; expand `start()` and `ticker()` notes.
  - docs/tutorial.md: extend "Production cadence" with the 10 Hz default, why ticker_loop is a procedure (per-iteration commit / snapshot semantics / xmin), and the unchanged log-hygiene recipe.
  - lifecycle.sql: switch status() detail strings from `||` concatenation to format().

  Refs #69

* docs: switch from "Hz" to "ticks/sec" terminology

  "Hz" reads as electronics/CPU-clock vocabulary in queue/DB context, and implies a precision the loop doesn't actually deliver — the real cycle is tick_work + sleep, so a 100 ms period yields ~9 ticks/s, not exactly 10. "ticks/sec" is more natural for this domain. Replaces all user-facing strings (status() detail, start() notice, README, docs, test message). No behaviour change.

* fix: tighten tick_period_ms range to 1..1000, add bg-worker note

  Three small follow-ups from review:

  1. Range 1..1000 (was 1..60000): pg_cron's minimum schedule is 1 s, so tick_period_ms > 1000 collapses to "one tick per pg_cron slot" inside ticker_loop's clamp. Reject explicitly rather than silently clamping. To tick less often than once per second, edit the pg_cron schedule string directly.
  2. status() now includes the configured tick_period_ms even when pg_cron is unavailable, so operators driving the ticker manually can see the value they configured.
  3. README "Tick rate" section gains a note about cron.max_running_jobs: ticker_loop holds one pg_cron bg worker for ~1 s per slot (vs. ~10 ms previously), bounding ~30 pgque-bearing databases per cluster at pg_cron's default of 32.

  Note on the statement_timeout guardrail considered earlier: NOT folded in. pg_cron concatenates `SET statement_timeout = '...'; CALL pgque.ticker_loop()` into a single multi-statement transaction, and the COMMIT inside the procedure raises "invalid transaction termination" in that wrapper. Documented inline in pgque.start().

  Refs #69, PR #204

* bench: tick-rate latency harness

  Two scenarios:

  - idle: sweep tick_period_ms in {1, 10, 100, 1000} and measure producer→consumer e2e latency at light load.
  - held-xmin: a long-running RR transaction holds xmin while a 1000 ev/s producer runs at the default 100 ms tick.

  Demonstrates how tick/subscription metadata UPDATEs degrade under blocked vacuum. Latency is measured server-side: producer stamps clock_timestamp() into the payload, consumer subtracts at receive time. No client clock skew.

  Refs #69, PR #204

* bench: tick-rate latency results + README

  Idle sweep (30 s per cell, 100 ev/s, single laptop, PG16):

  - tick_period_ms=1000: p50 503, p99 994, max 1004 ms (1 tick/sec floor)
  - tick_period_ms=100: p50 53, p99 103, max 105 ms (default; clean)
  - tick_period_ms=10: p50 8, p99 864, max 1013 ms (tail blows up)
  - tick_period_ms=1: p50 3, p99 460, max 548 ms (tail blows up)

  Held-xmin (tick=100 ms, 1000 ev/s, 5 min RR tx open):

  - baseline: p50 52.6, p99 103.4, max 143.8 ms
  - held-xmin: p50 53.8, p99 104.7, max 235.6 ms

  Median essentially flat under held-xmin; worst-case roughly doubles. Milder than expected — consistent with the design goal of stable latency under load on moderate timescales. Longer runs / higher tick rates would amplify the bloat-driven tail (PR #62 territory).

  Refs #69, PR #204

* ci: add pg_cron CI variant; document statement_timeout limitation

  Two follow-ups from the PR review.

  1. pg_cron CI variant (closes the test-coverage gap)

     New ci/Dockerfile.pgcron installs postgresql-NN-cron over the official postgres image; a new pgcron-test workflow job builds it, starts PG with shared_preload_libraries=pg_cron and cron.use_background_workers=on, runs the full regression suite, and explicitly fails if any pg_cron-only test prints "SKIP: pg_cron not installed" (so a future accidental loss of coverage is loud, not silent). This makes test_tick_period's "pgque.start() schedules CALL pgque.ticker_loop()" assertion run in CI for the first time, plus the four pgque.start() / pgque.stop() cases in test_pgcron_lifecycle.

  2. statement_timeout guardrail: honestly cannot be enforced

     Tried two approaches; neither works:

     - SET statement_timeout = '...'; CALL pgque.ticker_loop() in the pg_cron command — pg_cron concatenates them into one multi-statement transaction, and the procedure's COMMIT raises "invalid transaction termination".
     - set_config('statement_timeout', '1500ms', is_local := true) inside the procedure body — this updates the GUC value, but statement_timeout is a top-level-statement timer. The CALL is the statement; its timer is fixed at invocation. Setting the GUC mid-procedure does not restart or re-arm the timer, so subsequent pg_sleep / pgque.ticker() work runs unguarded. Verified by reproduction.

     Reverted the set_config attempt and replaced the inline comment with the full diagnosis. ticker_loop self-bounds via clock_timestamp() to limit how many additional iterations a slow ticker can chain, but a genuinely hung pgque.ticker() will pin the pg_cron worker until an admin pg_cancel_backend()s it. ticker() has no indefinite-block code paths under normal operation; we accept the residual risk over shipping a guardrail that doesn't actually fire.

  Refs #69, PR #204

* fix: reject non-divisor tick periods

---------

Co-authored-by: Claude <noreply@anthropic.com>
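For reference, the ticker_loop shape the first commit describes, as a minimal sketch (assumptions: `pgque.config.tick_period_ms` is from the commit text; the fixed 1-second slot bound stands in for the real clamping and error handling):

```sql
CREATE OR REPLACE PROCEDURE pgque.ticker_loop()
LANGUAGE plpgsql AS $$
DECLARE
    period_ms int;
    slot_end  timestamptz := clock_timestamp() + interval '1 second';
BEGIN
    LOOP
        SELECT c.tick_period_ms INTO period_ms FROM pgque.config c;
        PERFORM pgque.ticker();
        COMMIT;  -- own transaction per tick: snapshot (and xmin) released every iteration
        EXIT WHEN clock_timestamp() >= slot_end;  -- self-bound to one pg_cron slot
        PERFORM pg_sleep(period_ms / 1000.0);
    END LOOP;
END $$;
```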
Summary
Fixes #61.
Applies PgQ's own 3-table rotation pattern to the two PgQ metadata tables that are NOT already rotated — `pgque.subscription` and `pgque.tick` — so that held-xmin bloat on those tables is bounded instead of unbounded.

The event tables (`pgque.event_<queue>_0/_1/_2`) already rotate, and R4/R5 showed their dead-tuple count stays at 0 even under a 10-minute idle-in-transaction. But the metadata tables don't rotate; under the same R5 run they peaked at ~14 312 dead tuples on `pgque.subscription` and ~7 154 on `pgque.tick`, which dragged pgque's consumer iter-TPS from ~3 500 down to ~1 540 during the held-xmin window.

This PR extends the rotation pattern to those two tables.
Design
- `pgque.subscription` becomes a view (`UNION ALL` over three physical children `subscription_0 / _1 / _2`), filtered to the currently-active child via `pgque.meta_rotation.cur_subscription_table`. INSTEAD OF triggers route every `INSERT`/`UPDATE`/`DELETE` to the active child.
- `pgque.tick` becomes a view (`UNION ALL` over `tick_0 / _1 / _2`) with no active-child filter — a consumer's `sub_last_tick` can legitimately reference a tick that was inserted before the most recent rotation. INSTEAD OF triggers route `INSERT`s to the active child; `DELETE`s (as issued by the existing `maint_rotate_tables_step1`) fan out to all three children.
- `pgque.maint_rotate_metadata()` performs the rotation step1: `TRUNCATE` the `(cur + 1) % 3` slot, `INSERT … SELECT` live subscription rows from `cur` → the new slot, flip `cur` (see the sketch after this list). For tick, truncation is gated on "no live `sub_last_tick` references rows on the target slot."
- `pgque.maint_rotate_metadata_step2()` is the step2 counterpart (same pattern PgQ uses for event rotation), scheduled by `pgque.start()` alongside the existing `pgque_rotate_step2` cron job. The rotation cadence comes from `pgque.meta_rotation_period`.
- `maint_operations()` emits both rotation calls; `maint()` skips the step2 call (it needs its own transaction, same as the event `maint_rotate_tables_step2` case).

No external dependencies added. Install path stays `psql -f sql/pgque.sql`. Uninstall picks up the new children transitively via `DROP SCHEMA … CASCADE`. No C accelerator.
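A minimal sketch of the subscription half of step1 (assumptions: a hard-coded cadence constant stands in for `pgque.meta_rotation_period`, and the 0/1 return convention is taken from the test plan; the shipped function additionally handles the tick slot and its truncate gate):

```sql
CREATE OR REPLACE FUNCTION pgque.maint_rotate_metadata() RETURNS int
LANGUAGE plpgsql SECURITY DEFINER SET search_path = pgque, pg_catalog AS $$
DECLARE
    cur int;
    nxt int;
    last_rot timestamptz;
BEGIN
    SELECT cur_subscription_table, last_rotation_time
      INTO cur, last_rot
      FROM pgque.meta_rotation FOR UPDATE;     -- serialize concurrent rotations
    IF now() - last_rot < interval '30 seconds' THEN  -- cadence gate (period source simplified)
        RETURN 0;                              -- too soon
    END IF;
    nxt := (cur + 1) % 3;
    EXECUTE format('TRUNCATE pgque.subscription_%s', nxt);
    EXECUTE format('INSERT INTO pgque.subscription_%s SELECT * FROM pgque.subscription_%s',
                   nxt, cur);                  -- copy only live rows; dead tuples stay behind
    UPDATE pgque.meta_rotation
       SET cur_subscription_table = nxt,
           last_rotation_time = now();         -- flip the pointer
    RETURN 1;
END $$;
```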
The subscription view filters to `cur`, which means that any transaction whose MVCC snapshot was taken before a rotation committed will (via `pgque.meta_rotation`) see the old `cur`'s child. That matches how PgQ's event tables already work: long-running read snapshots that predate a rotation see the pre-rotation layout. pgque itself has no code path that reads `pgque.subscription` under a long-lived snapshot; `next_batch_custom`/`finish_batch` are short transactions.
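Illustrated as a two-session interleaving (sketch; REPEATABLE READ pins the snapshot, and the old child's rows persist until that slot is truncated two rotations later):

```sql
-- Session A                                 -- Session B
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM pgque.subscription;     -- snapshot taken: resolves to the old child
                                             SELECT pgque.maint_rotate_metadata();  -- flips cur
SELECT count(*) FROM pgque.subscription;     -- same snapshot: still the pre-rotation child
COMMIT;
SELECT count(*) FROM pgque.subscription;     -- fresh snapshot: the new child
```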
sub_last_tickcurrently resolves to a row in it. In the (rare) case where a consumer has lagged for longer than two rotation periods, the tick flip is deferred until the consumer catches up. Subscription rotation still happens in that case — which is the dominant bloat source.Acceptance criteria
pgque.subscription_*peak dead tuples ≤ 500 (was 14 312 on R5 without this patch).pgque.tick_*peak dead tuples ≤ 200 (was 7 154 on R5 without this patch).tests/*.sqlcontinue to pass.Test plan
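The dead-tuple bounds are directly observable from the cumulative statistics view; nothing pgque-specific is needed:

```sql
-- Watch the rotated metadata children during a soak run.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
 WHERE schemaname = 'pgque'
   AND relname ~ '^(subscription|tick)_[0-2]$'
 ORDER BY relname;
```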
Test plan

- Fresh install (`psql -f sql/pgque.sql`) produces no errors or warnings.
- `pgque.create_queue`, `pgque.register_consumer`, send → ticker → next_batch → finish_batch happy path works across multiple forced rotations.
- Across a full forced rotation cycle (`cur = 0 → 1 → 2 → 0`), the `pgque.subscription` view returns exactly one row per `(sub_queue, sub_consumer)` and `sub_last_tick` resolves via the view.
- Dead tuples on `subscription_<cur>` are cleared to 0 after rotation (heap-only — tuples on the previous slot stick around until that slot is truncated two rotations later).

R6 smoke results will be posted as a follow-up comment on this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>