feat: configurable sub-second tick rate (default 10 ticks/sec) #204
Three small follow-ups from review:

1. Range 1..1000 (was 1..60000): pg_cron's minimum schedule is 1 s, so
   tick_period_ms > 1000 collapses to "one tick per pg_cron slot" inside
   ticker_loop's clamp. Reject explicitly rather than silently clamping.
   To tick less often than once per second, edit the pg_cron schedule
   string directly.

2. status() now includes the configured tick_period_ms even when pg_cron
   is unavailable, so operators driving the ticker manually can see the
   value they configured.

3. The README "Tick rate" section gains a note about cron.max_running_jobs:
   ticker_loop holds one pg_cron bg worker for ~1 s per slot (vs. ~10 ms
   previously), bounding ~30 pgque-bearing databases per cluster at
   pg_cron's default of 32.

Note on the statement_timeout guardrail considered earlier: NOT folded in.
pg_cron concatenates `SET statement_timeout = '...'; CALL pgque.ticker_loop()`
into a single multi-statement transaction, and the COMMIT inside the
procedure raises "invalid transaction termination" in that wrapper.
Documented inline in pgque.start().

Refs #69, PR #204
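A minimal sketch of the range check in (1), assuming `pgque.config` is a single-row settings table; the shipped function may differ:

```sql
-- Hypothetical shape of the 1..1000 validation; everything except the
-- pgque.set_tick_period_ms / pgque.config names is an assumption.
create or replace function pgque.set_tick_period_ms(p_ms integer)
returns void
language plpgsql
as $$
begin
    if p_ms is null or p_ms < 1 or p_ms > 1000 then
        raise exception 'tick_period_ms must be in 1..1000 (got %); to tick '
            'less often than once per second, edit the pg_cron schedule', p_ms;
    end if;
    update pgque.config set tick_period_ms = p_ms;  -- read on the next pg_cron slot
end;
$$;
```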
Two scenarios:
- idle: sweep tick_period_ms in {1, 10, 100, 1000} and measure
producer→consumer e2e latency at light load.
- held-xmin: a long-running RR transaction holds xmin while a 1000 ev/s
producer runs at the default 100 ms tick. Demonstrates how
tick/subscription metadata UPDATEs degrade under blocked vacuum.
Latency is measured server-side: producer stamps clock_timestamp() into
the payload, consumer subtracts at receive time. No client clock skew.
Refs #69, PR #204
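The measurement technique as one self-contained sketch (`bench_events` and the payload shape are invented here; the committed harness may differ):

```sql
-- Producer side: stamp the server clock into the payload at enqueue time.
create table if not exists bench_events (payload jsonb);
insert into bench_events (payload)
values (jsonb_build_object('sent_at', clock_timestamp()));

-- Consumer side: subtract at receive time. Both timestamps come from the
-- same server clock, so client clock skew never enters the number.
select clock_timestamp() - (payload->>'sent_at')::timestamptz as e2e_latency
from bench_events;
```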
Idle sweep (30 s per cell, 100 ev/s, single laptop, PG16):

  tick_period_ms=1000: p50 503, p99 994,  max 1004 ms  (1 tick/sec floor)
  tick_period_ms=100:  p50 53,  p99 103,  max 105 ms   (default; clean)
  tick_period_ms=10:   p50 8,   p99 864,  max 1013 ms  (tail blows up)
  tick_period_ms=1:    p50 3,   p99 460,  max 548 ms   (tail blows up)

Held-xmin (tick=100 ms, 1000 ev/s, 5 min RR tx open):

  baseline:  p50 52.6, p99 103.4, max 143.8 ms
  held-xmin: p50 53.8, p99 104.7, max 235.6 ms

Median essentially flat under held-xmin; worst-case roughly doubles.
Milder than expected — consistent with the design goal of stable latency
under load on moderate timescales. Longer runs / higher tick rates would
amplify the bloat-driven tail (PR #62 territory).

Refs #69, PR #204
Latency bench results

Harness landed in the benchmark commit above.

1. Idle sweep — tick rate vs. e2e latency

30 s per cell, 100 ev/s producer, queue configured to tick on every event.

| tick_period_ms | p50 (ms) | p99 (ms) | max (ms) |
|---:|---:|---:|---:|
| 1000 | 503 | 994 | 1004 |
| 100 | 53 | 103 | 105 |
| 10 | 8 | 864 | 1013 |
| 1 | 3 | 460 | 548 |

Reads:

- 1000 ms is the old 1 tick/sec floor: p50 of roughly half a second, max ~1 s.
- 100 ms (the default) is the clean point: p50 ~53 ms, max ~105 ms.
- At 10 ms and below the median improves but the tail blows up toward ~1 s: the loop can't always finish its inner iterations within one pg_cron slot, so the next slot lands on a still-running worker.

2. Held-xmin — does metadata bloat hurt the default path?

5 min at the default 100 ms tick / 1000 ev/s producer, with a separate session holding a repeatable-read transaction open.

| scenario | p50 (ms) | p99 (ms) | max (ms) |
|---|---:|---:|---:|
| baseline | 52.6 | 103.4 | 143.8 |
| held-xmin | 53.8 | 104.7 | 235.6 |

Reads:

- Median and p99 essentially flat; worst-case roughly doubles.

The takeaway is not "held-xmin is fine, deprioritise #62". It's that under a moderate held-vacuum window the median is robust and the tail degrades gracefully — which is the design goal. PR #62's rotation fix extends the ceiling on that stability. Out of scope here as previously noted.

Status of the "known gaps" from the PR description: two items still open — the pg_cron CI variant (gap #2) and the statement_timeout guardrail (gap #4); follow-up below.

Bench raw output and a methodology README are committed in benchmark/tick-rate/.
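For anyone reproducing scenario 2, the xmin-holding session has this general shape (a sketch; the exact commands are in benchmark/tick-rate/):

```sql
-- Open a repeatable-read transaction and take a snapshot; while this session
-- sits idle, the pinned xmin horizon keeps vacuum from reclaiming dead
-- tick/subscription metadata tuples.
begin transaction isolation level repeatable read;
select 1;  -- the first query materializes the snapshot
-- ...leave open for the 5-minute window, then: commit;
```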
Two follow-ups from the PR review.
1. pg_cron CI variant (closes the test-coverage gap)
New ci/Dockerfile.pgcron installs postgresql-NN-cron over the official
postgres image; new pgcron-test workflow job builds it, starts PG with
shared_preload_libraries=pg_cron and cron.use_background_workers=on,
runs the full regression suite, and explicitly fails if any
pg_cron-only test prints "SKIP: pg_cron not installed" (so a future
accidental loss of coverage is loud, not silent).
This makes test_tick_period's "pgque.start() schedules CALL
pgque.ticker_loop()" assertion run in CI for the first time, plus the
four pgque.start() / pgque.stop() cases in test_pgcron_lifecycle.
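The SKIP string the workflow watches for comes from a guard at the top of each pg_cron-dependent test; a sketch of the assumed shape (the real tests may detect pg_cron differently):

```sql
do $$
begin
    if not exists (select 1 from pg_extension where extname = 'pg_cron') then
        raise notice 'SKIP: pg_cron not installed';
        return;
    end if;
    -- ...pg_cron-dependent assertions run here...
end;
$$;
```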
2. statement_timeout guardrail: honestly cannot be enforced
Tried two approaches; neither works:
- SET statement_timeout = '...'; CALL pgque.ticker_loop() in the
pg_cron command -- pg_cron concatenates them into one
multi-statement transaction, and the procedure's COMMIT raises
"invalid transaction termination".
- set_config('statement_timeout', '1500ms', is_local := true) inside
the procedure body -- this updates the GUC value, but
statement_timeout is a top-level-statement timer. The CALL is the
statement; its timer is fixed at invocation. Setting the GUC
mid-procedure does not restart or re-arm the timer, so subsequent
pg_sleep / pgque.ticker() work runs unguarded. Verified by
reproduction (sketch below).
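A minimal repro sketch of the second attempt (the procedure name is invented):

```sql
-- Shows that a mid-procedure set_config does not re-arm the
-- statement_timeout timer of the enclosing CALL.
create or replace procedure demo_timeout_not_rearmed()
language plpgsql
as $$
begin
    perform set_config('statement_timeout', '1500ms', true);  -- is_local
    perform pg_sleep(10);  -- sleeps the full 10 s; no timeout fires
end;
$$;

call demo_timeout_not_rearmed();  -- completes despite the '1500ms' setting
```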
Reverted the set_config attempt and replaced the inline comment with
the full diagnosis. ticker_loop self-bounds via clock_timestamp() to
limit how many additional iterations a slow ticker can chain, but a
genuinely hung pgque.ticker() will pin the pg_cron worker until an
admin pg_cancel_backend()s it. ticker() has no indefinite-block code
paths under normal operation; we accept the residual risk over
shipping a guardrail that doesn't actually fire.
Refs #69, PR #204
Closing the remaining gaps

Both items left from the earlier "what's left" rundown are addressed in commit a43967d:

✅ pg_cron CI variant (gap #2)

New ci/Dockerfile.pgcron plus a pgcron-test workflow job: the full regression suite runs with pg_cron preloaded and fails explicitly if any test prints "SKIP: pg_cron not installed". This makes the pgque.start() schedule-wiring assertion and the pgque.start()/pgque.stop() lifecycle cases run in CI for the first time.

❌→📝 statement_timeout guardrail (gap #4)

Cannot be made to work; documented honestly. Tried: a SET prefix in the pg_cron command (pg_cron folds it into one multi-statement transaction, which breaks the procedure's COMMIT) and set_config() inside the procedure body (the CALL's timer is fixed at invocation and never re-arms). Reverted the set_config attempt and documented the full diagnosis inline.

PR description updated. Ready for another look.
REV review — PR #204

Scope: configurable sub-second tick rate / default 10 ticks/sec, v0.2.0 readiness.

Verdict

Request changes if the API keeps promising arbitrary tick_period_ms values: non-divisors of 1000 ms do not deliver the cadence the docs and status() advertise. The default (100 ms) is unaffected.

Blocking / must fix before merge

MAJOR — Non-divisor tick periods produce inaccurate effective rates. Evidence from the ticker_loop source:

    v_iter_budget := greatest(1, v_window_ms / v_period_ms);

This is integer floor division within the 1000 ms window, but status/docs report the idealized cadence. Examples: tick_period_ms = 300 gives floor(1000/300) = 3 iterations per slot (3 ticks/sec actual) against an advertised 3.33; 600 gives 1 iteration (1 tick/sec) against an advertised 1.67.

Tests cover clean divisor values only, so the gap is invisible to the suite. Recommended v0.2.0 fix: restrict accepted values to divisors of 1000 ms, or explicitly document/report the floored effective cadence. I prefer restriction for now: simple, honest, and testable.

Non-blocking docs fixes

- LOW — WAL math in the tick-rate trade-off docs doesn't hold up as written.
- LOW — the 1 ms figures overstate achievable latency: the docs imply millisecond-class e2e latency at tick_period_ms = 1, while the bench shows a p99 in the hundreds of ms.

Positive evidence

Summary

This is likely a good v0.2.0 feature, but only after the accepted-value contract matches actual scheduler behavior.
REV blocker fix: reject non-divisor tick periods

Pushed.

What changed

pgque.set_tick_period_ms now accepts only divisors of 1000 ms; any non-divisor is rejected with an explicit error instead of silently flooring the iteration budget.

Validation

Covered by the validation cases in tests/test_tick_period.sql.

This keeps the v0.2.0 behavior honest: every advertised tick period maps exactly to an integer number of ticker iterations inside pg_cron's 1000 ms slot.
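A sketch of the tightened check (hypothetical shape; the shipped validation may differ):

```sql
-- Accept only periods that divide the 1000 ms pg_cron slot evenly, so the
-- advertised cadence always equals the integer iteration budget.
do $$
declare
    p_ms int := 300;  -- the non-divisor example from the review
begin
    if p_ms < 1 or p_ms > 1000 or 1000 % p_ms <> 0 then
        raise exception 'tick_period_ms must be a divisor of 1000 ms, got %', p_ms;
    end if;
end;
$$;
```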
Pulled and verified.

The honest-cadence framing ("every advertised tick period maps exactly to an integer number of iterations inside the 1000 ms slot") is materially better than the silent-floor behavior the original implementation had. Doc fixes for the two LOW items look good as well.

LGTM. Ready for external review on my end.
pg_cron's minimum schedule is 1 second, which capped end-to-end latency at ~1 s for non-LISTEN consumers. Drive sub-second ticking from inside a single pg_cron slot via a new pgque.ticker_loop() procedure that re-invokes pgque.ticker() at pgque.config.tick_period_ms cadence (default 100 ms = 10 Hz). The procedure commits between iterations so each tick gets its own transaction and rotation isn't blocked by a held xmin (this is also why ticker_loop is a PROCEDURE, not a FUNCTION). Tunable at runtime with pgque.set_tick_period_ms(ms); changes apply on the next pg_cron slot without rescheduling. Refs #69
…iene

- README: replace "1 second tick" framing with "10 Hz default"; new "Tick rate"
  section covers `pgque.set_tick_period_ms`, trade-offs (WAL, NOTIFY,
  metadata-table dead tuples), and clarifies that the per-second pg_cron slot
  count does NOT increase with sub-second ticking — the cron.job_run_details
  growth rate is unchanged.
- docs/three-latencies.md: rewrite the cadence table around tick_period_ms
  with new rough numbers per rate; explicitly note the pg_cron logging problem
  is independent of sub-second ticking.
- docs/reference.md: document `pgque.ticker_loop()` and
  `pgque.set_tick_period_ms(ms)`; expand `start()` and `ticker()` notes.
- docs/tutorial.md: extend "Production cadence" with the 10 Hz default, why
  ticker_loop is a procedure (per-iteration commit / snapshot semantics /
  xmin), and the unchanged log-hygiene recipe.
- lifecycle.sql: switch status() detail strings from `||` concatenation to
  format().

Refs #69
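The last bullet's change, illustrated (the literal detail strings here are invented):

```sql
-- before: 'tick period ' || v_ms || ' ms (' || 1000 / v_ms || ' ticks/sec)'
-- after:  a single format() call per detail string
select format('tick period %s ms (%s ticks/sec)', 100, 1000 / 100);
```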
"Hz" reads as electronics/CPU-clock vocabulary in queue/DB context, and implies a precision the loop doesn't actually deliver — real cycle is tick_work + sleep, so 100 ms period yields ~9 ticks/s, not exactly 10. "ticks/sec" is more natural for this domain. Replaces all user-facing strings (status() detail, start() notice, README, docs, test message). No behaviour change.
Rebased on the current base branch; force-pushed from 5feed16 to 8b58cb2.
Summary

Configurable polling rate even when driven by pg_cron. Default jumps from 1 tick/sec (the pg_cron floor) to 10 ticks/sec (every 100 ms).
- `pgque.ticker_loop()` PROCEDURE: pg_cron fires it once a second; the procedure re-invokes `pgque.ticker()` every `tick_period_ms` ms inside that one slot, with a commit between iterations so each tick gets its own transaction and held xmin doesn't pile up against rotation.
- `pgque.config.tick_period_ms` (default `100`, range `1..1000`), added with a safe `alter table ... add column if not exists` so existing installs upgrade cleanly.
- `pgque.set_tick_period_ms(ms)` — takes effect on the next pg_cron slot (≤1 s) without rescheduling.
- `pgque.start()` now schedules `CALL pgque.ticker_loop()` instead of `SELECT pgque.ticker()`; `pgque.status()` reports the current cadence in ticks/sec.
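Assumed usage (call shapes are sketches; see docs/reference.md for the exact signatures):

```sql
select pgque.set_tick_period_ms(50);  -- 20 ticks/sec, applied on the next pg_cron slot
select * from pgque.status();         -- includes the configured cadence in ticks/sec
```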
Why a PROCEDURE (and why no `SET search_path` on it)

- Each `pgque.ticker()` call must run in its own transaction: it records a `pg_snapshot` to mark a batch boundary, and two ticks inside one transaction would record the same snapshot — consumers couldn't tell them apart.
- A transaction held open across the whole slot also pins the xmin floor and blocks `maint_rotate_tables`. Per-iteration commit bounds the held-xmin window to `tick_period_ms` (100 ms by default) instead of the 1-second pg_cron slot.
- Postgres only allows `COMMIT` mid-flight inside a procedure — and forbids combining `COMMIT` with a `SET` clause. The body is therefore fully schema-qualified, runs as `SECURITY INVOKER`, and is admin-only. The actual security boundary stays in the `SECURITY DEFINER` functions (`pgque.ticker`, the `pgque.config` updater) that ticker_loop calls.
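A minimal sketch of the loop shape under those constraints (assumes `pgque.config` is a single-row table; the shipped procedure is fully schema-qualified and carries more bookkeeping):

```sql
create or replace procedure pgque.ticker_loop()
language plpgsql
security invoker
as $$
declare
    v_period_ms int;
    v_deadline  timestamptz := clock_timestamp() + interval '1 second';
begin
    -- Read once per slot: set_tick_period_ms changes apply on the next slot.
    select tick_period_ms into v_period_ms from pgque.config;
    while clock_timestamp() < v_deadline loop
        perform pgque.ticker();
        commit;  -- own transaction (and pg_snapshot) per tick; releases xmin
        perform pg_sleep(v_period_ms / 1000.0);
    end loop;
end;
$$;
```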
Known limitation: no statement_timeout guardrail on ticker()

A misbehaving `pgque.ticker()` call can pin a pg_cron worker until an admin runs `pg_cancel_backend()`. Two approaches tried; neither fires:

- `SET statement_timeout = '...'; CALL pgque.ticker_loop()` in the pg_cron command — pg_cron joins them into one multi-statement transaction; the procedure's `COMMIT` then raises "invalid transaction termination".
- `set_config('statement_timeout', '...', is_local := true)` inside the procedure body — updates the GUC, but `statement_timeout` is a top-level-statement timer. The CALL is the statement; its timer is fixed at invocation, so changing the GUC mid-procedure has no effect on subsequent pg_sleep / ticker() calls.
The `clock_timestamp()`-based budget inside `ticker_loop` limits how many additional iterations a slow run can chain, but it cannot cancel a stuck inner call. `pgque.ticker()` has no indefinite-block code paths under normal operation; we accept the residual risk over shipping a guardrail that doesn't actually fire. Documented inline in `sql/pgque-additions/lifecycle.sql`.

Docs
- README: log hygiene clarified as not made worse by sub-second ticking (still one pg_cron slot per second, regardless of `tick_period_ms`); bg-worker / `cron.max_running_jobs` note added.
- docs/three-latencies.md: cadence table rewritten around `tick_period_ms`; trade-off bullets (WAL, NOTIFY, metadata dead tuples).
- docs/reference.md: documents `ticker_loop()` and `set_tick_period_ms`.
- docs/tutorial.md: the 10 ticks/sec default and why ticker_loop is a procedure.
Bench

`benchmark/tick-rate/` — full harness + results in the directory README.

Idle sweep (30 s / cell, 100 ev/s, single laptop, PG 16):

| tick_period_ms | p50 (ms) | p99 (ms) | max (ms) |
|---:|---:|---:|---:|
| 1000 | 503 | 994 | 1004 |
| 100 | 53 | 103 | 105 |
| 10 | 8 | 864 | 1013 |
| 1 | 3 | 460 | 548 |

Default `tick_period_ms = 100` is the clean point: median ~53 ms, max ~105 ms. Periods of 10 ms and below improve the median but the tail blows up to ~1 s (the procedure can't always finish its inner iterations within one pg_cron slot, so the next slot lands on a still-running worker).
Held-xmin (default 100 ms tick, 1000 ev/s):

| scenario | p50 (ms) | p99 (ms) | max (ms) |
|---|---:|---:|---:|
| baseline | 52.6 | 103.4 | 143.8 |
| held-xmin | 53.8 | 104.7 | 235.6 |

Median essentially flat; worst-case roughly doubles. Milder than expected on a 5-min window; longer durations / higher tick rates would amplify the tail (PR #62 territory, out of scope).
Test plan

- `tests/test_tick_period.sql` — defaults / setter / validation / multi-tick / single-tick / `pgque.start()` schedule wiring.
- `tests/run_all.sql` suite green locally on PG 16, with and without pg_cron.
- Existing CI (`test` job) — green.
- New pg_cron CI variant (`ci/Dockerfile.pgcron` + `pgcron-test` workflow). Runs the full suite with pg_cron preloaded; explicitly fails if any test prints "SKIP: pg_cron not installed", closing the coverage gap that previously kept the schedule-wiring test effectively unrun in CI.
- Bench: idle sweep over `tick_period_ms ∈ {1, 10, 100, 1000}` plus a 5-min held-xmin run.

Refs #69