
Rotate subscription and tick tables to avoid held-xmin bloat (#61)#62

Draft
NikolayS wants to merge 1 commit into main from feat/metadata-rotation

Conversation

@NikolayS
Owner

Summary

Fixes #61.

Applies PgQ's own 3-table rotation pattern to the two PgQ metadata tables that are NOT already rotated — pgque.subscription and pgque.tick — so that held-xmin bloat on those tables is bounded instead of unbounded.

The event tables (pgque.event_<queue>_0/_1/_2) already rotate, and R4/R5 showed their dead-tuple count stays at 0 even under a 10-minute idle-in-transaction. But the metadata tables don't rotate; under the same R5 run they peaked at ~14 312 dead tuples on pgque.subscription and ~7 154 on pgque.tick, which dragged pgque's consumer iter-TPS from ~3 500 down to ~1 540 during the held-xmin window.

This PR extends the rotation pattern to those two tables.

Design

  • pgque.subscription becomes a view (UNION ALL over three physical children subscription_0 / _1 / _2), filtered to the currently-active child via pgque.meta_rotation.cur_subscription_table. INSTEAD OF triggers route every INSERT / UPDATE / DELETE to the active child.
  • pgque.tick becomes a view (UNION ALL over tick_0 / _1 / _2) with no active-child filter — a consumer's sub_last_tick can legitimately reference a tick that was inserted before the most recent rotation. INSTEAD OF triggers route INSERTs to the active child. DELETEs (as issued by the existing maint_rotate_tables_step1) fan out to all three children.
  • pgque.maint_rotate_metadata() performs the rotation step1: TRUNCATE the (cur + 1) % 3 slot, INSERT … SELECT live subscription rows from cur → the new slot, flip cur. For tick, truncation is gated on "no live sub_last_tick references rows on the target slot."
  • pgque.maint_rotate_metadata_step2() is the step2 counterpart (same pattern PgQ uses for event rotation), scheduled by pgque.start() alongside the existing pgque_rotate_step2 cron job.
  • Default rotation period is 30 s, tunable via GUC pgque.meta_rotation_period.
  • maint_operations() emits both rotation calls; maint() skips the step2 call (it needs its own transaction, same as the event maint_rotate_tables_step2 case).

No external dependencies added. Install path stays psql -f sql/pgque.sql. Uninstall picks up the new children transitively via DROP SCHEMA … CASCADE. No C accelerator.
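
A minimal sketch of the view-plus-routing-trigger shape described above; the column set, singleton layout, and trigger body are illustrative assumptions, not the PR's actual DDL:

```sql
-- Sketch only: tiny illustrative column set, not pgque's real schema.
create schema if not exists pgque;

create table pgque.meta_rotation (
    single_row             bool primary key default true check (single_row),
    cur_subscription_table int not null default 0,          -- 0, 1 or 2
    last_rotation_time     timestamptz not null default now()
);
insert into pgque.meta_rotation default values;

create table pgque.subscription_0 (sub_queue int, sub_consumer text, sub_last_tick bigint);
create table pgque.subscription_1 (like pgque.subscription_0 including defaults);
create table pgque.subscription_2 (like pgque.subscription_0 including defaults);

-- UNION ALL over the three children, filtered to the currently active child.
create view pgque.subscription as
select s.* from pgque.subscription_0 s
 where (select cur_subscription_table from pgque.meta_rotation) = 0
union all
select s.* from pgque.subscription_1 s
 where (select cur_subscription_table from pgque.meta_rotation) = 1
union all
select s.* from pgque.subscription_2 s
 where (select cur_subscription_table from pgque.meta_rotation) = 2;

-- INSTEAD OF trigger routes writes to whichever child is active right now
-- (only INSERT shown here; UPDATE/DELETE route analogously in the real code).
create function pgque._subscription_route() returns trigger
language plpgsql as $$
declare
    v_cur int;
begin
    select cur_subscription_table into v_cur from pgque.meta_rotation;
    if v_cur = 0 then
        insert into pgque.subscription_0 values (new.*);
    elsif v_cur = 1 then
        insert into pgque.subscription_1 values (new.*);
    else
        insert into pgque.subscription_2 values (new.*);
    end if;
    return new;
end $$;

create trigger subscription_route
    instead of insert on pgque.subscription
    for each row execute function pgque._subscription_route();
```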

Correctness trade-offs

The subscription view filters to cur, which means that any transaction whose MVCC snapshot was taken before a rotation committed will (via pgque.meta_rotation) see the old cur's child. That matches how PgQ's event tables already work: long-running read snapshots that predate a rotation see the pre-rotation layout. pgque itself has no code path that reads pgque.subscription under a long-lived snapshot; next_batch_custom / finish_batch are short transactions.

Tick rotation is more conservative than subscription rotation: we only truncate the target tick slot if no live sub_last_tick currently resolves to a row in it. In the (rare) case where a consumer has lagged for longer than two rotation periods, the tick flip is deferred until the consumer catches up. Subscription rotation still happens in that case — which is the dominant bloat source.
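
A sketch of what step 1 plus that tick gate could look like, continuing the illustrative layout from the Design section above (the real function also enforces the rotation period and the step2 sequencing gate, both omitted here; tick_id / tick_child_table are assumed column names):

```sql
-- Sketch of maint_rotate_metadata() step 1, assuming the illustrative layout above.
create or replace function pgque.maint_rotate_metadata() returns int
language plpgsql as $$
declare
    v_cur  int;
    v_next int;
begin
    -- serialize concurrent rotations on the singleton pointer row
    select cur_subscription_table into v_cur
      from pgque.meta_rotation for update;
    v_next := (v_cur + 1) % 3;

    -- subscription: truncate the target slot, copy live rows from the still-active
    -- child (the view resolves to v_cur until the flip below), then flip
    execute format('truncate table pgque.subscription_%s', v_next);
    execute format(
        'insert into pgque.subscription_%s select * from pgque.subscription_%s',
        v_next, v_cur);

    -- tick: truncate the target slot only if no live sub_last_tick still resolves
    -- to a row stored there; a lagging consumer defers the tick truncation
    if not exists (
        select 1
          from pgque.subscription s
          join pgque.tick t on t.tick_id = s.sub_last_tick
         where t.tick_child_table = v_next)
    then
        execute format('truncate table pgque.tick_%s', v_next);
    end if;

    update pgque.meta_rotation
       set cur_subscription_table = v_next,
           last_rotation_time     = now();
    return 1;
end $$;
```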

Acceptance criteria

  • pgque consumer iter-TPS during held-xmin window within 20% of the clean baseline.
  • pgque.subscription_* peak dead tuples ≤ 500 (was 14 312 on R5 without this patch).
  • pgque.tick_* peak dead tuples ≤ 200 (was 7 154 on R5 without this patch).
  • All existing tests/*.sql continue to pass.

Test plan

  • Fresh install on a scratch database (psql -f sql/pgque.sql) produces no errors or warnings.
  • pgque.create_queue, pgque.register_consumer, send → ticker → next_batch → finish_batch happy path works across multiple forced rotations.
  • After 3 forced rotations (full cycle through cur = 0 → 1 → 2 → 0), pgque.subscription view returns exactly one row per (sub_queue, sub_consumer) and sub_last_tick resolves via the view.
  • Dead tuples on subscription_<cur> are cleared to 0 after rotation (the newly active slot was just truncated and repopulated with live rows); tuples on the previous slot stick around until that slot is truncated two rotations later.
  • R6 smoke test (pgbench, 5 min clean + 10 min idle-in-tx + 5 min recovery) passes the acceptance criteria above.

R6 smoke results will be posted as a follow-up comment on this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@NikolayS NikolayS marked this pull request as draft April 18, 2026 03:47
@NikolayS
Owner Author

Follow-up suggestion: decouple metadata rotation cadence from event rotation

The PR currently ties subscription/tick rotation to the existing queue_rotation_period (same knob that drives event-table rotation). That works and passes the acceptance criteria in the R6 combined smoke, but the two have very different write rates and different optimal rotation cadences.

Observed in R6 smoke (benchmark queue_rotation_period=30s):

  • Event tables: ~1000 INSERTs/s → at 30 s cadence each partition sees ~30 k rows → rotation at 30 s is appropriate.
  • pgque.subscription: ~2 UPDATEs/s (only on finish_batch when a batch has events — not on every ticker call) → at 30 s cadence each rotation-side accumulates only ~60 dead tuples. Way under target (500).
  • pgque.tick: rate depends on ticker cadence, in practice ~1 row/s → ~30 dead per 30 s rotation. Also way under target (200).

So metadata rotation at 30 s is 4-6× more frequent than it needs to be. Costs:

  • pg_class churn: each rotation cycle calls TRUNCATE on 3 metadata tables → ~3 relfilenode UPDATEs on pg_class per cycle. At 30 s cadence = ~360 dead pg_class tuples/hour. Over a 24 h production workload = ~8 640 dead pg_class rows. pg_class is hit by every query planner lookup — its bloat affects ALL queries, not just pgque.
  • TRUNCATE AccessExclusive lock: brief (small table) but non-zero. 3 locks per cycle × 120 cycles/hour = 360 brief blockers/hour.
  • WAL per TRUNCATE: small but measurable over long runs.

Proposal

Add a separate config field on pgque.queue:

ALTER TABLE pgque.queue
  ADD COLUMN IF NOT EXISTS queue_metadata_rotation_period interval NOT NULL DEFAULT '2 minutes';

Default to 2 min; the rotation trigger in the ticker uses this field for subscription/tick rotation specifically, while event-table rotation keeps using queue_rotation_period.
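
One possible shape for that ticker-side gate (a sketch; whether a global rotation takes the min or max of the per-queue settings is left open here, and last_rotation_time is the pointer-row column the PR already uses):

```sql
-- Sketch: metadata rotation is due on its own cadence, independent of
-- queue_rotation_period (which keeps driving event-table rotation elsewhere).
select pgque.maint_rotate_metadata()
  from pgque.meta_rotation mr
 where now() - mr.last_rotation_time
       >= (select min(queue_metadata_rotation_period) from pgque.queue);
```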

At 2 min cadence the expected peak bloat is:

| Table | Write rate | Peak dead per cycle | vs target |
|---|---|---|---|
| subscription | ~2 /s | ~240 | well under 500 ✓ |
| tick | ~1-2 /s | ~120-240 | under 200-500 depending on workload |

Still meets acceptance criteria while cutting pg_class bloat and TRUNCATE churn by 4×.

Benchmark override: UPDATE pgque.queue SET queue_metadata_rotation_period='30 seconds' — preserves stress-test behavior.

Rationale for separation

Event tables are bulk-storage tables where rotation = capacity management (drop old events you no longer need). Metadata tables are tiny-hot tables where rotation = bloat mitigation. Same mechanism, different economics — deserve independent knobs.

Scope for this PR vs follow-up

This is a design-polish suggestion, not a correctness blocker for landing the current PR. Options:

  1. Add the queue_metadata_rotation_period field in this PR as a small addition (+10 LoC) with default '2 minutes'. Bench override sets it to 30 s explicitly.
  2. Land this PR as-is and open a follow-up issue/PR specifically for the decoupled cadence.

I'd lean toward (1) since the acceptance criteria for the metadata-bloat fix (pgque#61) arguably includes "reasonable default cadence for production" — and 30 s matching events is a debatable default.

Either way, the rotation design itself is validated by the R6 smoke; this is a tuning refinement.

@NikolayS NikolayS marked this pull request as ready for review April 18, 2026 04:44
@NikolayS
Owner Author

REV Code Review — #62

Summary: Fixes #61 by extending PgQ's 3-table rotation pattern to pgque.subscription and pgque.tick — the two PgQ metadata tables that were never rotated — so held-xmin bloat on them is bounded instead of unbounded. Tables become UNION ALL views over three physical children (*_0 / *_1 / *_2), an INSTEAD OF trigger routes writes to the currently-active child (tracked in new singleton pgque.meta_rotation), and pgque.maint_rotate_metadata + its _step2 counterpart do the rotate-truncate-copy cycle on the existing pgque_rotate_step2 cron. 30 s default cadence, tunable via GUC pgque.meta_rotation_period. +552/-29 across two files.

Installed on PG18 locally — full tests/run_all.sql passes, and a forced-rotation smoke test (described below under Verified) walks the pointer 0 → 1 → 2 → 0 correctly and preserves consumer semantics.


BLOCKING — 0

NON-BLOCKING

MEDIUM No new regression tests for rotation logic

The diff touches two files — sql/pgque.sql and sql/pgque_uninstall.sql. Nothing under tests/. The only item in the test plan that specifically exercises the new behavior (R6 smoke test) is unchecked, and its checking happens outside the repo's test suite. That means CI on PG 14–18 doesn't actually prove rotation is correct; it only proves the existing send/receive/dlq paths still work on top of the new storage layout — which isn't the same thing.

Minimum regression coverage to add:

  1. Force 3 full rotations (cur = 0 → 1 → 2 → 0) and assert pgque.subscription returns exactly one row per (sub_queue, sub_consumer) throughout.
  2. Between rotations, run a full send → ticker → next_batch → finish_batch cycle to prove writes route to the active child and reads resolve via the view.
  3. Create a consumer, let it lag so its sub_last_tick points at a specific tick slot, force rotation, assert the tick rotation defers (not truncates) and the lagged consumer can still receive().
  4. Call maint_rotate_metadata() twice without an intervening _step2 and assert the second call returns 0 (the held-xmin gate).

I ran a version of tests 1 and 2 manually on this branch and they passed — cur_sub cycled through 0/1/2/0 across three forced rotations, the view kept showing 2 rows with 2 consumers, and c1 / c2 both cycled receive+ack cleanly on each rotation. But "worked on Nik's laptop once" ≠ "covered in CI against PG 14–18 on every future PR." Given how central this change is, a tests/test_metadata_rotation.sql added to run_all.sql is load-bearing for future refactor safety, not optional polish.
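
A rough shape for items 1 and 2 as a new file in the existing suite (file name, queue/consumer names, and the backdating trick mirror the manual smoke described under Verified below; exact function signatures are assumptions):

```sql
-- tests/test_metadata_rotation.sql (sketch; add to run_all.sql)
\set ON_ERROR_STOP on

select pgque.create_queue('rotq');
select pgque.register_consumer('rotq', 'c1');
select pgque.register_consumer('rotq', 'c2');

-- One forced rotation; repeat twice more to walk cur 0 -> 1 -> 2 -> 0.
-- Under psql autocommit each statement is its own transaction, so step2 really
-- runs in a separate transaction from step1, as the design requires.
update pgque.meta_rotation set last_rotation_time = now() - interval '1 hour';
select pgque.maint_rotate_metadata();
select pgque.maint_rotate_metadata_step2();

-- Item 1: exactly one row per (sub_queue, sub_consumer) through the view.
do $$
begin
    if (select count(*) from pgque.subscription) <> 2 then
        raise exception 'expected 2 subscription rows via the view, got %',
                        (select count(*) from pgque.subscription);
    end if;
end $$;

-- Item 2: between rotations, run a full send -> ticker -> next_batch ->
-- finish_batch cycle (omitted here) to prove writes route to the active child.
```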

MEDIUM sql/pgque.sql L2406-L2418 (unregister_consumer) — FOR UPDATE OF s silently dropped, correctness justification under-argued

Original:

   for update of s, c;

New:

   for update of c;

with the comment: "Callers of unregister_consumer() are rare and single-threaded in practice; we rely on the subsequent DELETE to provide the necessary row lock."

"Rare and single-threaded in practice" is a behavior claim, not a correctness proof. The DELETE does take a row lock — but it's a DELETE on the view, which fires the INSTEAD OF trigger, which DELETEs from all three children (subscription_0/1/2). If two concurrent unregister_consumer calls race on the same (queue, consumer), both pass the SELECT without blocking, both fire the DELETE trigger, and we rely on the DELETE being idempotent — which it is (the where sub_queue=... and sub_consumer=... is a no-op on the second one). So operationally it probably works out.

But: (s.sub_last_tick IS NULL AND s.sub_next_tick IS NULL) OR ... is computed in the SELECT and used further down the function body. If one of the racing transactions sees that condition as true while the other sees false (because the row state is mid-flight from an unrelated finish_batch), downstream logic that branches on _is_subconsumer could take different paths, and the function body after the snippet cut off here might depend on that.

Ask: either (a) add a test that runs two unregister_consumer calls against the same consumer concurrently and asserts end-state is clean, or (b) consider taking a short advisory lock on (sub_queue, sub_consumer) as a replacement for the dropped row lock. The comment needs to describe the correctness argument, not just the expected concurrency regime.
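
For option (b), the lock could be as small as this (sketch; the parameter names are placeholders and the single-key text hash is one arbitrary keying choice):

```sql
-- Sketch for option (b): serialize concurrent unregister_consumer() calls on
-- the same (queue, consumer) with a transaction-scoped advisory lock, standing
-- in for the row lock that FOR UPDATE OF s used to provide.
perform pg_advisory_xact_lock(
    hashtextextended(x_queue_name || '/' || x_consumer_name, 0));
```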

LOW sql/pgque.sql L1300-L1490 (maint_rotate_metadata) — if mr.last_rotation_step2_txid is null then return 0 end if isn't quite the "held-xmin gate" the comment describes

The comment is 15 lines of careful reasoning about held-xmin trade-offs, ending with:

if mr.last_rotation_step2_txid is null then
    -- previous rotation's step2 has not yet been observed;
    -- not safe to rotate again.
    return 0;
end if;

But this only checks whether _step2 has executed in some prior transaction — it does not inspect any backend's xmin. An external long transaction that predates the previous _step2 and holds an old xmin will still see stale rows on whatever child gets truncated next cycle. That's a trade-off the comment acknowledges further up ("External queries that read pgque.subscription under a long snapshot may see stale rows or no rows after the prev-prev slot is truncated") — but the variable name last_rotation_step2_txid and the "held xmin" framing of the nearby code make it read as if that were being gated, when it isn't.

Fix: rename the field to something like last_step2_txid (no "held xmin" framing) and shorten the gate comment to: "step2 must have run in a separate transaction so the pointer flip is visible to new snapshots — this is a sequencing gate, not a held-xmin gate. Truncating the prev-prev slot is intentionally not gated on backend_xmin (see file header)."

LOW sql/pgque.sql L165 — redundant FK constraint on pgque.tick_tmpl

create table if not exists pgque.tick_tmpl (
    ...
    constraint tick_tmpl_queue_fkey foreign key (tick_queue)
                               references pgque.queue (queue_id)
);

The children use LIKE pgque.tick_tmpl INCLUDING DEFAULTS, which does not copy the FK: LIKE never copies foreign-key constraints at all (INCLUDING CONSTRAINTS / INCLUDING ALL only pick up CHECK and not-null constraints). Each child explicitly redeclares its own tick_N_queue_fkey FK, so tick_tmpl's FK is dead weight — it constrains a table that by design will never hold rows.
Fix: drop the constraint from tick_tmpl. The same harmless-but-dead pattern is worth checking on subscription_tmpl — which, FWIW, already omits the FKs, so the inconsistency between the two templates happens to fall on the right side.

LOW sql/pgque.sql L4552 — pgque_rotate_step2 cron command concatenates two SELECTs; if the first fails the second never runs

$sql$SELECT pgque.maint_rotate_tables_step2();
     SELECT pgque.maint_rotate_metadata_step2();$sql$

Each _step2 is defensive (it updates a single config row), so an outright failure is unlikely. But if maint_rotate_tables_step2() ever raises (e.g. lock timeout from a concurrent drop_queue), maint_rotate_metadata_step2() gets skipped that cycle → next rotation hits the step2_txid IS NULL gate and is deferred. In pathological cases with recurring step-1 failures, metadata rotation could stall without any log-visible reason.
Fix: either split into two cron jobs (pgque_rotate_step2_events, pgque_rotate_step2_metadata), or wrap each call in a DO $$ BEGIN ... EXCEPTION WHEN OTHERS THEN ... END $$ block so one failure doesn't cascade. Two cron jobs is the simpler, cleaner option.
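
If the two-job route is taken, the scheduling could look roughly like this (sketch; job names are the ones proposed above, and the schedule string should match whatever cadence pgque_rotate_step2 uses today):

```sql
-- Sketch: two independent pg_cron jobs so a failure in the event step2 can no
-- longer starve the metadata step2 (and vice versa).
select cron.schedule('pgque_rotate_step2_events',   '* * * * *',
                     $$select pgque.maint_rotate_tables_step2()$$);
select cron.schedule('pgque_rotate_step2_metadata', '* * * * *',
                     $$select pgque.maint_rotate_metadata_step2()$$);
```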

POTENTIAL ISSUES

MEDIUM Hot-path overhead: every finish_batch / next_batch_custom now goes through an INSTEAD OF PL/pgSQL trigger plus a UNION-ALL view with a scalar subquery reading pgque.meta_rotation on each of its three branches

The acceptance criterion is "within 20% of clean baseline," which is honest about the cost. But the per-statement breakdown is:

  1. View resolution: (select cur_subscription_table from pgque.meta_rotation) evaluated once per UNION branch (3×) per query.
  2. Predicate = 0 / = 1 / = 2 filters two of three branches to empty — but only after the subquery evaluates.
  3. INSTEAD OF trigger fires per row, runs PL/pgSQL branching on v_cur, then does an UPDATE on the child.

For finish_batch, that's one trigger fire + one underlying UPDATE per ack. For next_batch_custom, same pattern on the UPDATE side. At 3 500 iter-TPS (the baseline in the PR description), the PL/pgSQL overhead is real. R6 smoke test results will say how much of it materializes.

Possible follow-ups (not for this PR): (a) materialize cur_subscription_table into a session GUC that rotation bumps and cron refreshes — avoids the per-query scalar subquery; (b) rewrite the INSTEAD OF trigger routing in PL/v8 or a C extension (the C route doesn't fit the "no C accelerator" principle; noted for completeness); (c) use a partitioned table with a rotating partition-pruned predicate instead of a view.

LOW The subscription view's where (select cur_subscription_table from pgque.meta_rotation) = N predicate is re-evaluated per transaction snapshot, which means READ COMMITTED callers see rotation happen mid-transaction

In RC mode, each statement gets a fresh snapshot, so back-to-back statements within one transaction can read different cur values. Operationally this appears correct: finish_batch finds the row by sub_batch, which is unique across children at any instant (because rotation copies and flips atomically under EXCLUSIVE lock). But if a caller ever does SELECT ... FROM pgque.subscription and uses the result across statements, they might see ghosts. Not this PR's bug; flagging because the commit note says "pgque itself has no code path that reads pgque.subscription under a long-lived snapshot" — which is a statement that needs a watch going forward (new code that reads subscription across statements + relies on row identity would silently break).
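
A contrived illustration of the watch item (READ COMMITTED, two sessions; assumes the view shape sketched earlier in this thread):

```sql
-- Session A, default READ COMMITTED, one transaction:
begin;
select * from pgque.subscription;  -- statement 1: snapshot resolves cur = 0, reads subscription_0
-- ... session B now runs maint_rotate_metadata() + _step2() and commits ...
select * from pgque.subscription;  -- statement 2: fresh snapshot, cur = 1, reads subscription_1
commit;
-- Both statements return the same logical rows (rotation copies live rows before
-- flipping), but physical row identity differs; code that caches ctids or child
-- table names across statements would silently break.
```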

INFO The tick rotation's "defer if any consumer's sub_last_tick lives in the target slot" rule means a single stuck consumer can block tick rotation indefinitely while subscription rotation keeps happening

Author explicitly accepts this ("Subscription rotation still happens in that case — which is the dominant bloat source"). Worth instrumenting: a metric / log line on consecutive deferrals would make the operational picture observable. The pgque.meta_rotation singleton could grow a tick_rotation_deferred_cycles bigint counter. Not this PR's scope; just noting so it doesn't fall off.
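
A sketch of that counter (the column name is just a proposal):

```sql
-- Sketch: track consecutive deferred tick truncations on the rotation singleton.
alter table pgque.meta_rotation
    add column if not exists tick_rotation_deferred_cycles bigint not null default 0;

-- Inside maint_rotate_metadata(), the deferral branch would then do roughly:
--   update pgque.meta_rotation
--      set tick_rotation_deferred_cycles = tick_rotation_deferred_cycles + 1;
--   raise notice 'pgque: tick rotation deferred, % consecutive cycles',
--                (select tick_rotation_deferred_cycles from pgque.meta_rotation);
-- ...and reset the counter to 0 whenever the tick slot is actually truncated.
```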


Verified on a local PG18 install

  • Full tests/run_all.sql passes (send, receive, dlq, roles) — so the view+trigger layer is transparent to every test that already exists. ✓
  • Forced 3 rotations (update pgque.meta_rotation set last_rotation_time = now() - interval '1 hour' + call maint_rotate_metadata + maint_rotate_metadata_step2). cur_subscription_table cycles through 0 → 1 → 2 → 0. ✓
  • After each rotation, view row count stays at the right number (2 consumers → 2 rows in the view), while the prev children retain their stale copies until they become the target of the next rotation. ✓
  • Inter-rotation pgque.receive() + pgque.ack() on both consumers works — writes route correctly to whichever child is currently active. ✓
  • Post-full-cycle (cur=0 again after 3 rotations), insert + ticker + receive + ack on c2 returns the fresh event. ✓
  • pgque.subscription_0/1/2 dead-tuple count sanity: on this scratch instance with 2 consumers it holds the expected 0/2/2 shape, matching the "active child is fresh" claim.

No crashes, no FK violations, no grant errors on install, no issues with the INSTEAD OF triggers. The design holds together in practice.

Summary

| Area | Findings | Potential |
|---|---|---|
| Security | 0 | 0 |
| Bugs | 0 | 0 |
| Tests | 1 MEDIUM | 0 |
| Correctness/Concurrency | 1 MEDIUM (unregister lock), 1 LOW (gate framing) | 1 LOW (RC snapshot) |
| Performance | 0 | 1 MEDIUM (hot-path trigger overhead) |
| Guidelines | 0 | 0 |
| Docs | 0 | 0 |
| Operational | 0 | 1 INFO (deferred-tick observability) |
| Cleanup | 2 LOW (redundant FK, cron command split) | 0 |

Result: REQUEST CHANGES (non-blocking). The design is correct, the manual validation holds, and the existing test suite keeps passing. But shipping a change of this scope without a dedicated tests/test_metadata_rotation.sql in the run_all.sql CI path is a maintainability risk the merge shouldn't absorb — the 3-rotation cycle test I ran by hand in 2 minutes belongs in the tree. The unregister_consumer FOR UPDATE removal wants an explicit concurrency argument (or a test) rather than a "rare in practice" note. And the cron-command concatenation at L4552 is a trivially cheap fix (split it into two jobs) that removes a dependent-failure mode before anyone has to debug it in production. R6 smoke results will close out the perf question independently.


REV-assisted review (SOC2 checks skipped per request).

NikolayS pushed a commit that referenced this pull request Apr 18, 2026
…ark/ (issue #77)

Adds a strictly-additive benchmark/ directory documenting the
methodology, tooling, and operational lessons from the
pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench
that backs #61 and PR #62.

- README.md: entry point + quick-start
- METHODOLOGY.md: adapted from GitLab #77 note 3263767264
- OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale
  rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8
  bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
- HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
- tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

No pgque production SQL is touched.
Refs: #61, #62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NikolayS NikolayS marked this pull request as draft April 18, 2026 14:06
NikolayS pushed a commit that referenced this pull request Apr 30, 2026
…ark/

Adds a strictly-additive benchmark/ directory documenting the
methodology, tooling, and operational lessons from the
pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench
that backs #61 and PR #62.

- README.md: entry point + quick-start
- METHODOLOGY.md: methodology fix per review feedback
- OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale
  rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8
  bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
- HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
- tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

No pgque production SQL is touched.
Refs: #61, #62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NikolayS added a commit that referenced this pull request Apr 30, 2026
* docs: complete bench methodology + tooling + ops gotchas under benchmark/

Adds a strictly-additive benchmark/ directory documenting the
methodology, tooling, and operational lessons from the
pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench
that backs #61 and PR #62.

- README.md: entry point + quick-start
- METHODOLOGY.md: methodology fix per review feedback
- OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale
  rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8
  bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
- HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
- tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

No pgque production SQL is touched.
Refs: #61, #62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(benchmark): shell-style polish per CLAUDE.md

- benchmark/runners/fix_nvme_mount.sh: switch to /usr/bin/env bash shebang;
  use set -Eeuo pipefail (was set -euo pipefail without -E)
- benchmark/runners/run_r7.sh: add -Ee flags to existing pipefail
- benchmark/runners/clean_reinstall.sh: read -r in two while-loops

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(benchmark): anti-leak scrub + polish pass

Remove all references to private internal URLs, round numbering
(R4/R5/R6/R8), and private repository paths from benchmark/ files.

- METHODOLOGY.md: drop internal URL + note IDs from header; remove
  internal posting-style section (§9); neutralize round refs; fix
  /tmp/bench_r<N> path reference
- README.md: drop internal reference comment
- HARDWARE.md: fix binary units (GB→GiB, TB→TiB); drop R7 round ref
- OPS_GOTCHAS.md: neutralize R4/R6 round refs in lessons; fix
  binary units (GB→GiB, MB/s→MiB/s)
- consumers/*.sql: drop "R6 instrumented" prefix from all 7 files
- runners/run_r7.sh: remove R6/R7 round refs from inline comments
- tooling/sys_metrics_sampler.py: remove R7 from docstring
- tooling/parse_events_consumed.py: remove R6 from docstring + msg
- charts/r5_analyze.py, r6_smoke_chart.py: remove Rn from docstrings,
  chart titles, and file-size output (KB→KiB)

PR description updated separately via gh pr edit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(benchmark): address REV r1 findings (anti-leak + style)

- Remove private WI refs from bootstrap.sh:41,44 (replace with
  "see methodology notes")
- Fix set -Eeuo pipefail in 7 shell scripts that only had partial flags
- Fix broken OPS_GOTCHAS.md:185 link (install_pgque.sh → install/README.md)
- Fix binary unit in install_pgboss.sh:2 (GB → GiB)
- Fix run_r7.sh tool paths: resolve from benchmark/tooling/ by default
  instead of hardcoded /tmp/r7 and /tmp/r6; override via R7_DIR/R6_DIR

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(benchmark): correct three methodology claims

- logging_collector=off does not mean zero log I/O; journald still
  writes to disk (#123)
- premake=20 planner cost is first-query-in-session, not per-query;
  root cause of steady-state regression is a follow-up (#124)
- add RAISE NOTICE observer-effect caveat for high-frequency use (#127)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(benchmark): drop incorrect PgBouncer speculation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Nik Samokhvalov <nik@Niks-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Converts pgque.subscription and pgque.tick from single base
tables to 3-child UNION-ALL views, mirroring the event-table
rotation pattern, to cap held-xmin bloat under high-update load.

Changes:
- sql/pgque-additions/metadata_rotation.sql (new): tables
  tick_0/1/2, subscription_0/1/2, template tables, INSTEAD OF
  routing triggers, overrides for maint_tables_to_vacuum(),
  maint_operations(), unregister_consumer(), and new functions
  maint_rotate_metadata() + maint_rotate_metadata_step2().
- sql/pgque-additions/lifecycle.sql: pgque_rotate_step2 cron
  job also calls maint_rotate_metadata_step2().
- sql/pgque-api/maint.sql: skip maint_rotate_metadata_step2
  in maint() loop (needs separate transaction).
- build/transform.sh: add metadata_rotation.sql to assembly.
- sql/pgque.sql, sql/pgque-tle.sql: regenerated.

Rebased from a9e992f onto current main; logic relocated from
generated sql/pgque.sql into sql/pgque-additions/ per CLAUDE.md.
All SECURITY DEFINER functions pin search_path.
Regression + acceptance tests pass.

https://claude.ai/code/session_015Tf54iAd24uQPKCBaXeiTV
@NikolayS NikolayS force-pushed the feat/metadata-rotation branch from a9e992f to 8acc53f on May 4, 2026 20:52
Owner Author

NikolayS commented May 4, 2026

Rebase complete — 8acc53f on current main

Strategy: Could not cleanly rebase a9e992f onto main (55-commit divergence, 4 000-line generated-file conflict). Instead: re-ported the logic from /tmp/pr62.diff directly into source-of-truth files and regenerated, per CLAUDE.md "Keep generated files and source changes consistent."


Files changed

| File | Role |
|---|---|
| sql/pgque-additions/metadata_rotation.sql | New. All rotation DDL and functions live here — this is the canonical source. |
| sql/pgque-additions/lifecycle.sql | pgque_rotate_step2 cron job now also calls maint_rotate_metadata_step2(). |
| sql/pgque-api/maint.sql | maint() skips maint_rotate_metadata_step2 (needs separate tx, same pattern as maint_rotate_tables_step2). |
| build/transform.sh | Adds metadata_rotation.sql to the assembly loop (after dlq.sql). |
| sql/pgque.sql, sql/pgque-tle.sql | Regenerated — not hand-edited. |

sql/pgque.sql is now derived, not the source of truth for this feature.


What metadata_rotation.sql contains

  • pgque.meta_rotation singleton table (rotation pointer)
  • pgque.tick_tmpl, tick_0/1/2 child tables + pgque.tick UNION ALL view
  • pgque.subscription_tmpl, subscription_0/1/2 child tables + pgque.subscription UNION ALL view
  • pgque._subscription_route() and pgque._tick_route() INSTEAD OF trigger functions
  • pgque.maint_rotate_metadata() and pgque.maint_rotate_metadata_step2() — both SECURITY DEFINER with set search_path = pgque, pg_catalog
  • Override of maint_tables_to_vacuum() (lists child tables, not views)
  • Override of maint_operations() (emits both metadata rotation entries)
  • Override of unregister_consumer() (drops FOR UPDATE OF s — view can't be locked; FOR UPDATE OF c kept)
  • Explicit SELECT grants on the new views to public (replacing the OID-renamed base-table grants from grants.sql)

Test results (PG 16, local)

  • Regression (tests/run_all.sql): === ALL TESTS PASSED ===
  • Acceptance (tests/acceptance/run_acceptance.sql): === ALL ACCEPTANCE TESTS PASSED ===
  • Idempotent reinstall: no ERROR/FATAL output on second \i sql/pgque.sql
  • Functional smoke: rotation returns 0 when too-soon, returns 1 after backdating last_rotation_time; pointer flips from cur=0 to cur=1 correctly

Caveats / things to review

  1. subscription_tmpl / tick_tmpl tables persist after install — they are empty but sit in the schema. This is intentional (column-schema carrier for LIKE … INCLUDING DEFAULTS), but could be dropped after child tables are created if preferred.
  2. FOR UPDATE OF s removal in unregister_consumer is safe for the single-consumer unregister path (DELETE provides the lock), but worth a second pair of eyes given PgQ's subconsumer logic.
  3. R6-style benchmark (dead-tuple bound under held xmin) not yet run — the acceptance criteria from #61 ("Metadata-table bloat under held xmin — subscription/tick need rotation"): subscription dead-tuples ≤ 500, tick dead-tuples ≤ 200, iter-TPS within 20%, require a long-running soak test, not included here.
  4. The tick view exposes an extra tick_child_table column, visible to any code that does SELECT * FROM pgque.tick. PgQ code only ever selects named columns, but worth noting.

Generated by Claude Code

NikolayS pushed a commit that referenced this pull request May 4, 2026
Idle sweep (30 s per cell, 100 ev/s, single laptop, PG16):
  tick_period_ms=1000: p50 503, p99 994, max 1004 ms (1 tick/sec floor)
  tick_period_ms=100:  p50  53, p99  103, max  105 ms (default; clean)
  tick_period_ms=10:   p50   8, p99  864, max 1013 ms (tail blows up)
  tick_period_ms=1:    p50   3, p99  460, max  548 ms (tail blows up)

Held-xmin (tick=100ms, 1000 ev/s, 5 min RR tx open):
  baseline:     p50 52.6, p99 103.4, max 143.8 ms
  held-xmin:    p50 53.8, p99 104.7, max 235.6 ms

Median essentially flat under held-xmin; worst-case roughly doubles.
Milder than expected — consistent with the design goal of stable latency
under load on moderate timescales. Longer runs / higher tick rates
would amplify the bloat-driven tail (PR #62 territory).

Refs #69, PR #204
NikolayS added a commit that referenced this pull request May 5, 2026
* feat: configurable sub-second tick rate (default 10 Hz)

pg_cron's minimum schedule is 1 second, which capped end-to-end latency
at ~1 s for non-LISTEN consumers.  Drive sub-second ticking from inside
a single pg_cron slot via a new pgque.ticker_loop() procedure that
re-invokes pgque.ticker() at pgque.config.tick_period_ms cadence
(default 100 ms = 10 Hz).

The procedure commits between iterations so each tick gets its own
transaction and rotation isn't blocked by a held xmin (this is also why
ticker_loop is a PROCEDURE, not a FUNCTION).

Tunable at runtime with pgque.set_tick_period_ms(ms); changes apply on
the next pg_cron slot without rescheduling.

Refs #69

* docs: explain default 10 Hz ticking, configurability, pg_cron log hygiene

- README: replace "1 second tick" framing with "10 Hz default"; new "Tick
  rate" section covers `pgque.set_tick_period_ms`, trade-offs (WAL,
  NOTIFY, metadata-table dead tuples), and clarifies that the per-second
  pg_cron slot count does NOT increase with sub-second ticking — the
  cron.job_run_details growth rate is unchanged.
- docs/three-latencies.md: rewrite the cadence table around tick_period_ms
  with new rough numbers per rate; explicitly note the pg_cron logging
  problem is independent of sub-second ticking.
- docs/reference.md: document `pgque.ticker_loop()` and
  `pgque.set_tick_period_ms(ms)`; expand `start()` and `ticker()` notes.
- docs/tutorial.md: extend "Production cadence" with the 10 Hz default,
  why ticker_loop is a procedure (per-iteration commit / snapshot
  semantics / xmin), and the unchanged log-hygiene recipe.
- lifecycle.sql: switch status() detail strings from `||` concatenation
  to format().

Refs #69

* docs: switch from "Hz" to "ticks/sec" terminology

"Hz" reads as electronics/CPU-clock vocabulary in queue/DB context, and
implies a precision the loop doesn't actually deliver — real cycle is
tick_work + sleep, so 100 ms period yields ~9 ticks/s, not exactly 10.
"ticks/sec" is more natural for this domain.

Replaces all user-facing strings (status() detail, start() notice,
README, docs, test message). No behaviour change.

* fix: tighten tick_period_ms range to 1..1000, add bg-worker note

Three small follow-ups from review:

1. Range 1..1000 (was 1..60000): pg_cron's minimum schedule is 1 s, so
   tick_period_ms > 1000 collapses to "one tick per pg_cron slot" inside
   ticker_loop's clamp. Reject explicitly rather than silently clamping.
   To tick less often than once per second, edit the pg_cron schedule
   string directly.

2. status() now includes the configured tick_period_ms even when pg_cron
   is unavailable, so operators driving the ticker manually can see the
   value they configured.

3. README "Tick rate" section gains a note about cron.max_running_jobs:
   ticker_loop holds one pg_cron bg worker for ~1 s per slot (vs. ~10 ms
   previously), bounding ~30 pgque-bearing databases per cluster at
   pg_cron's default of 32.

Note on the statement_timeout guardrail considered earlier: NOT folded
in. pg_cron concatenates `SET statement_timeout = '...'; CALL
pgque.ticker_loop()` into a single multi-statement transaction, and the
COMMIT inside the procedure raises "invalid transaction termination" in
that wrapper. Documented inline in pgque.start().

Refs #69, PR #204

* bench: tick-rate latency harness

Two scenarios:
- idle: sweep tick_period_ms in {1, 10, 100, 1000} and measure
  producer→consumer e2e latency at light load.
- held-xmin: a long-running RR transaction holds xmin while a 1000 ev/s
  producer runs at the default 100 ms tick. Demonstrates how
  tick/subscription metadata UPDATEs degrade under blocked vacuum.

Latency is measured server-side: producer stamps clock_timestamp() into
the payload, consumer subtracts at receive time. No client clock skew.

Refs #69, PR #204

* bench: tick-rate latency results + README

Idle sweep (30 s per cell, 100 ev/s, single laptop, PG16):
  tick_period_ms=1000: p50 503, p99 994, max 1004 ms (1 tick/sec floor)
  tick_period_ms=100:  p50  53, p99  103, max  105 ms (default; clean)
  tick_period_ms=10:   p50   8, p99  864, max 1013 ms (tail blows up)
  tick_period_ms=1:    p50   3, p99  460, max  548 ms (tail blows up)

Held-xmin (tick=100ms, 1000 ev/s, 5 min RR tx open):
  baseline:     p50 52.6, p99 103.4, max 143.8 ms
  held-xmin:    p50 53.8, p99 104.7, max 235.6 ms

Median essentially flat under held-xmin; worst-case roughly doubles.
Milder than expected — consistent with the design goal of stable latency
under load on moderate timescales. Longer runs / higher tick rates
would amplify the bloat-driven tail (PR #62 territory).

Refs #69, PR #204

* ci: add pg_cron CI variant; document statement_timeout limitation

Two follow-ups from the PR review.

1. pg_cron CI variant (closes the test-coverage gap)
   New ci/Dockerfile.pgcron installs postgresql-NN-cron over the official
   postgres image; new pgcron-test workflow job builds it, starts PG with
   shared_preload_libraries=pg_cron and cron.use_background_workers=on,
   runs the full regression suite, and explicitly fails if any
   pg_cron-only test prints "SKIP: pg_cron not installed" (so a future
   accidental loss of coverage is loud, not silent).

   This makes test_tick_period's "pgque.start() schedules CALL
   pgque.ticker_loop()" assertion run in CI for the first time, plus the
   four pgque.start() / pgque.stop() cases in test_pgcron_lifecycle.

2. statement_timeout guardrail: honestly cannot be enforced
   Tried two approaches; neither works:

   - SET statement_timeout = '...'; CALL pgque.ticker_loop() in the
     pg_cron command -- pg_cron concatenates them into one
     multi-statement transaction, and the procedure's COMMIT raises
     "invalid transaction termination".
   - set_config('statement_timeout', '1500ms', is_local := true) inside
     the procedure body -- this updates the GUC value, but
     statement_timeout is a top-level-statement timer. The CALL is the
     statement; its timer is fixed at invocation. Setting the GUC
     mid-procedure does not restart or re-arm the timer, so subsequent
     pg_sleep / pgque.ticker() work runs unguarded. Verified by
     reproduction.

   Reverted the set_config attempt and replaced the inline comment with
   the full diagnosis. ticker_loop self-bounds via clock_timestamp() to
   limit how many additional iterations a slow ticker can chain, but a
   genuinely hung pgque.ticker() will pin the pg_cron worker until an
   admin pg_cancel_backend()s it. ticker() has no indefinite-block code
   paths under normal operation; we accept the residual risk over
   shipping a guardrail that doesn't actually fire.

Refs #69, PR #204

* fix: reject non-divisor tick periods

---------

Co-authored-by: Claude <noreply@anthropic.com>