You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
promote_waiting_with_outbox is ~58% of total database time (7.0M calls, ~40.8ms mean). It is not a slow query — the SQL plan is optimal. The cost is call volume: one full bulk-lane transaction per task completion (~4k/min). Each call's per-transaction exec time feeds the DB-pressure-controller EMA, so this is what trips HOVER-JD ("DB pressure at floor") under burst load.
Related: realtime/index load reductions in #395 (removes a separate ~14% WAL-decode load that may relieve this symptom indirectly).
Evidence
EXPLAIN confirms an index scan on idx_tasks_waiting_by_job for the waiting-row pick; PK update; ON CONFLICT(task_id) DO NOTHING outbox insert. tasks autovacuum healthy (~5% dead). Nothing to tune at statement level.
Promote fires once per completion via q.Execute (full BEGIN/COMMIT) — internal/jobs/stream_worker.go:390.
Per-transaction exec time feeds the pressure EMA — internal/db/queue.go:317.
Not connection-bound: 1 active / 57 of 120 connections during a burst. More compute/connections would worsen contention.
~40ms mean is concurrent-burst contention, not work (idle DB ≈ 1–3ms).
Architecture (why a gate is safe)
Both waiting producers converge on a per-job Redis ZSET keyed by task_id, drained by the dispatcher which enforces the concurrency cap (over-cap items stay in the ZSET — internal/broker/dispatcher.go:199).
Indirectly (removes 14% WAL-decode load → lower DB CPU/latency → may stop HOVER-JD)
none
done
1
Gate — skip promote when no enqueue-backlog waiting, + periodic promote-all sweep
Yes — the real lever (eliminates the 0-row majority of 7M calls)
medium, mitigated by sweep
~300 LOC, 7–8 files
2
Lane-gated autocommit
No (~0% — that figure is server exec time; only saves client round-trips, single-digit ms on the EMA)
low
~25 LOC
3
Root-cause the dispatcher (below)
Potentially — could reduce/remove promote
unknown
investigation
Deepest open question (possibly the best fix)
Enqueue-waiting tasks already get an outbox row → sweeper → ZSET → dispatcher. So why is per-completion promote needed? The comment at internal/db/queue.go:1751 says backlogs "went idle" without it, but the root cause of why the ZSET/dispatcher path alone left them stranded was not fully established. If found and fixed, promote could be dropped or made low-frequency — a smaller, cleaner win than the gate. The gate treats the symptom; this would treat the cause. Time-box this before committing to the gate.
Gate design (if we build it)
Signal, one site: Redis waiting-hint set where enqueue creates waiting rows (processedWaiting > 0, internal/db/queue.go:1221); cleared when a promote returns 0. Retry-waiting needs no instrumentation.
Gate: cheap Redis read at internal/jobs/stream_worker.go:390 before calling promote.
Backstop: periodic promote-all over getActiveJobs() in the existing reclaim/reconcile cadence + a "stranded-found" metric so a regression is visible, not silent.
Wrinkle: enqueue lives in the DB layer (no Redis handle) — surface processedWaiting up to the jobs-layer caller rather than couple layers.
Risks / challenges
No fix that is both small and meaningful; the 58% is intrinsic to a per-completion design.
Gate correctness rests entirely on the sweep backstop + hint coverage; ship with review-app load testing, not blind.
Summary
promote_waiting_with_outboxis ~58% of total database time (7.0M calls, ~40.8ms mean). It is not a slow query — the SQL plan is optimal. The cost is call volume: one full bulk-lane transaction per task completion (~4k/min). Each call's per-transaction exec time feeds the DB-pressure-controller EMA, so this is what tripsHOVER-JD("DB pressure at floor") under burst load.Related: realtime/index load reductions in #395 (removes a separate ~14% WAL-decode load that may relieve this symptom indirectly).
Evidence
EXPLAINconfirms an index scan onidx_tasks_waiting_by_jobfor the waiting-row pick; PK update;ON CONFLICT(task_id) DO NOTHINGoutbox insert.tasksautovacuum healthy (~5% dead). Nothing to tune at statement level.q.Execute(fullBEGIN/COMMIT) —internal/jobs/stream_worker.go:390.internal/db/queue.go:317.Architecture (why a gate is safe)
Both waiting producers converge on a per-job Redis ZSET keyed by
task_id, drained by the dispatcher which enforces the concurrency cap (over-cap items stay in the ZSET —internal/broker/dispatcher.go:199).task_id→ promote + retry cannot double-dispatch (internal/broker/scheduler.go:210).ScheduleAndAck) → do not need promote.Options
HOVER-JD)Deepest open question (possibly the best fix)
Enqueue-waiting tasks already get an outbox row → sweeper → ZSET → dispatcher. So why is per-completion promote needed? The comment at
internal/db/queue.go:1751says backlogs "went idle" without it, but the root cause of why the ZSET/dispatcher path alone left them stranded was not fully established. If found and fixed, promote could be dropped or made low-frequency — a smaller, cleaner win than the gate. The gate treats the symptom; this would treat the cause. Time-box this before committing to the gate.Gate design (if we build it)
processedWaiting > 0,internal/db/queue.go:1221); cleared when a promote returns 0. Retry-waiting needs no instrumentation.internal/jobs/stream_worker.go:390before calling promote.getActiveJobs()in the existing reclaim/reconcile cadence + a "stranded-found" metric so a regression is visible, not silent.processedWaitingup to the jobs-layer caller rather than couple layers.Risks / challenges
Recommended next steps
pg_stat_statements, re-run the load test.HOVER-JDfrequency,exec_ema_msvs 500ms, promote's new share over a clean stats window.