
bench: xmin-horizon torture test for PG-backed queues#80

Merged
NikolayS merged 1 commit into main from bench/xmin-horizon-v1
Apr 30, 2026

Conversation

@NikolayS
Owner

Summary

Reproducible benchmark in bench/xmin-horizon/ contrasting a generic SKIP LOCKED queue and pgque under a held xmin horizon — the most common operational failure mode of SKIP LOCKED queues in production. Causes are routine: long REPEATABLE READ transactions, pg_dump, idle logical replication slots, hot_standby_feedback=on with a slow replica. While xmin is held, VACUUM cannot reclaim dead tuples generated by the queue's INSERT → UPDATE → DELETE lifecycle, the queue table bloats, and read latency climbs.
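All of these causes are visible from standard catalog views. A minimal sketch of how to spot what is holding the horizon back (standard `pg_stat_activity` and `pg_replication_slots` columns; not part of the harness):

```sql
-- Backends pinning the xmin horizon, oldest first
SELECT pid, state, backend_xmin, xact_start, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Replication slots can hold xmin even with no backend attached
SELECT slot_name, active, xmin, catalog_xmin
FROM pg_replication_slots;
```

A long-idle row at the top of the first query, or an inactive slot with a non-null `xmin`, is the usual culprit.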

Single-laptop reproducer. PG17 in Docker Compose, aggressive autovacuum baked in (autovacuum_vacuum_scale_factor = 0.005, autovacuum_naptime = 10s, autovacuum_vacuum_cost_limit = 10000) so the result cannot be dismissed as "you didn't tune."

Headline

3-minute cells, 4 producers + 4 consumers + 2 bystander clients on an unrelated 1M-row table, 800 enqueues/s.

| Scenario      | Workload    | Dequeue (jobs/s) | n_dead_tup | Table size | Bystander avg lat |
|---------------|-------------|------------------|------------|------------|-------------------|
| baseline      | SKIP LOCKED | 797              | 6,397      | 1.0 MiB    | 1.35 ms           |
| baseline      | pgque       | 792              | 0          | 13.4 MiB   | 1.50 ms           |
| RR holds xmin | SKIP LOCKED | 517              | 91,593     | 15.1 MiB   | 2.05 ms           |
| RR holds xmin | pgque       | 804              | 0          | 27.0 MiB   | 1.45 ms           |

When xmin is blocked, the SKIP LOCKED queue's dead tuple count grows by 14×, table size by 15×, dequeue throughput drops by ~35%, and bystander query latency on an unrelated table sharing buffer cache rises by ~50%. pgque is unaffected — n_dead_tup = 0 across all pgque.event_* tables in every cell, throughput and bystander latency unchanged from baseline.
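The per-table figures come from cumulative statistics PostgreSQL already tracks; a sketch of the kind of query a collector can run every few seconds (standard `pg_stat_user_tables` columns — the harness's actual collector may differ):

```sql
SELECT schemaname, relname, n_live_tup, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE schemaname IN ('public', 'pgque')
ORDER BY n_dead_tup DESC;
```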

What's in

  • blueprints/BENCH_XMIN_HORIZON.md — benchmark spec
  • bench/xmin-horizon/ — reproducer (Docker Compose, pgbench scripts, orchestration)
  • bench/xmin-horizon/results/ — raw 5s metrics + final-bloat snapshots from today's run
  • bench/xmin-horizon/results/results.md — generated summary
  • docs/benchmarks.md — top section added with summary table + link to reproducer

How to reproduce

cd bench/xmin-horizon
make up                 # PG17 in Docker
DURATION=180 make run-s1 run-s2 report
make down

Override DURATION, PRODUCERS, CONSUMERS, ENQUEUE_RATE to taste.

Framing

This is a comparison of queue implementations under realistic operational conditions, not a takedown of any specific library. SKIP LOCKED is a sound primitive; the failure mode is structural to its INSERT/UPDATE/DELETE lifecycle, not specific to any one implementation.
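For reference, the generic dequeue pattern under test looks roughly like this (a sketch of the common idiom, not the harness's exact pgbench script):

```sql
BEGIN;
-- Claim one job without blocking on rows other consumers hold
SELECT id, payload
FROM jobs
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 1;
-- ... process the job, then remove it (id from the SELECT above) ...
DELETE FROM jobs WHERE id = :claimed_id;
COMMIT;
```

Every DELETE leaves a dead tuple that VACUUM can reclaim only once no snapshot older than it remains — which is exactly the coupling to the xmin horizon the benchmark exercises.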

Test plan

  • Harness brings up PG17 cleanly via docker compose up -d
  • All 4 cells run end-to-end via make run-s1 run-s2
  • Numbers reproducible (run twice, same shape)
  • Generated results.md matches headline above
  • docs/benchmarks.md renders with new section at the top
  • Future: S3 (idle logical replication slot scenario)
  • Future: PG16 + PG17 matrix
  • Future: Charts (gnuplot or matplotlib over the 5s CSV)

Notes

  • pgque's n_live_tup is large in S2 (~287K) because rotation can't truncate while xmin is held — this is deferred reclamation, not bloat. n_dead_tup is what VACUUM needs to chase, and it stays at 0.
  • The bench uses pgbench with rate-limited producers and unconstrained consumers; the consumer's "no-op" transactions (when the queue is empty) used to inflate the dequeue counter — fixed by counting only transactions that actually deleted a row.
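The DELETE-vs-rotation distinction in the first note can be seen in miniature (a self-contained sketch, any scratch database):

```sql
-- DELETE marks rows dead; VACUUM must reclaim them, gated by xmin.
-- TRUNCATE swaps in a new relfilenode and resets the table's stats,
-- leaving nothing for VACUUM to chase — the property rotation relies on.
CREATE TEMP TABLE t AS SELECT g FROM generate_series(1, 1000) g;
DELETE FROM t;   -- n_dead_tup climbs to 1000 until VACUUM runs
TRUNCATE t;      -- n_dead_tup drops back to 0 immediately
```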

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@NikolayS
Owner Author

REV Review

CI: ✅ all 8 checks green (test 14/15/16/17/18, verify, client-smoke, claude-review)
Functional evidence: Bench harness ran 2×2 matrix (S1+S2 × skiplocked+pgque), 3-min cells, 800 enq/s, PG17. Raw CSVs + pgbench logs in bench/xmin-horizon/results/. Headline: under blocked xmin, SKIP LOCKED gets ~14× dead tuples / ~15× table size / −35% throughput / +50% bystander latency; pgque unaffected (n_dead_tup=0 everywhere, throughput unchanged).
Verdict: READY TO MERGE

Blocking

none

Non-blocking

  • [MEDIUM] [conf 9/10] scripts/run-scenario.sh:24-35 — pgque state not reset between cells. The skiplocked branch only truncates public.jobs; it leaves pgque.event_*_* rows from any prior pgque cell intact. Effect: s2-skiplocked/final-bloat.csv shows pgque.event_1_0 with 143,564 live tuples carried over from s1-pgque. This doesn't affect the numbers relevant to each workload (skiplocked's public.jobs, pgque's pgque.event_*_*), but it is confusing in the dumps. Fix: also reset pgque queue tables (or redeploy the schema) at the start of each cell.
  • [MEDIUM] [conf 9/10] results/results.md — baseline size comparison (1.0 MiB skiplocked vs 13.4 MiB pgque) reads as a pgque negative without context. pgque pre-allocates rotation tables (event_0/1/2 + template + per-queue subtables); the absolute-size gap is structural, not workload-driven. The strong claim is n_dead_tup (0 vs 91k) and delta growth, not absolute size. Add 2 sentences explaining this in the Findings section, or reframe size as Δ(S2 − S1) per workload.
  • [LOW] [conf 7/10] results/results.md — S2 pgque dequeued (141,501) > enqueued (141,200) by 301. The Notes section explains n_live_tup semantics but not this overcount. Most likely the consumer drained a few tail batches after the producer stopped. Worth one line in Notes acknowledging it so reviewers don't suspect a counter bug.
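One way to implement the reset suggested in the first item, without hard-coding pgque's rotation table names (a sketch assuming the event tables all live in schema pgque and match `event_%`; adjust the pattern to the real naming scheme):

```sql
DO $$
DECLARE r record;
BEGIN
  FOR r IN SELECT schemaname, tablename FROM pg_tables
           WHERE schemaname = 'pgque' AND tablename LIKE 'event\_%'
  LOOP
    EXECUTE format('TRUNCATE TABLE %I.%I', r.schemaname, r.tablename);
  END LOOP;
END $$;
```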

Potential

  • [LOW] [conf 6/10] results/results.md — xmin age for s2-pgque is 372 s, ~2× the cell duration (177 s). Suggests either a stale RR-holder from s2-skiplocked surviving the cleanup, a long-lived pgbench connection holding backend_xmin, or a measurement artifact in the collector's xact_start query. Worth a follow-up to verify kill $RR_PID actually terminates the held transaction (e.g., check pg_stat_activity post-cleanup).
  • [LOW] [conf 6/10] scripts/collect.sh SQL composition. $WORKLOAD is interpolated into the SQL string, but it's validated against the skiplocked|pgque enum upstream in run-scenario.sh. Acceptable for a bench harness; flagging only because future contributors might extend the workload set without preserving the validation.
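For the interpolation concern, psql variables keep the value out of the SQL text entirely — a sketch, assuming a `collect.sql` file (hypothetical name) invoked from the shell script:

```sql
-- Shell side:  psql "$DSN" -v workload="$WORKLOAD" -f collect.sql
-- In the file, :'workload' expands as a safely quoted literal, so a
-- future workload value cannot alter the statement's structure:
SELECT sum(n_dead_tup)
FROM pg_stat_user_tables
WHERE schemaname = CASE :'workload' WHEN 'pgque' THEN 'pgque'
                                    ELSE 'public' END;
```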

REV-style review (security, bugs, tests, guidelines, docs). SOC2 items skipped per project policy. Anti-leak scan clean.

@NikolayS NikolayS merged commit 9b3f89f into main Apr 30, 2026
8 checks passed
@NikolayS NikolayS deleted the bench/xmin-horizon-v1 branch April 30, 2026 02:58