Skip to content

Reduce promote_waiting_with_outbox DB cost (~58% of DB time) #396

@simonsmallchua

Description

@simonsmallchua

Summary

promote_waiting_with_outbox is ~58% of total database time (7.0M calls, ~40.8ms mean). It is not a slow query — the SQL plan is optimal. The cost is call volume: one full bulk-lane transaction per task completion (~4k/min). Each call's per-transaction exec time feeds the DB-pressure-controller EMA, so this is what trips HOVER-JD ("DB pressure at floor") under burst load.

Related: realtime/index load reductions in #395 (removes a separate ~14% WAL-decode load that may relieve this symptom indirectly).

Evidence

  • EXPLAIN confirms an index scan on idx_tasks_waiting_by_job for the waiting-row pick; PK update; ON CONFLICT(task_id) DO NOTHING outbox insert. tasks autovacuum healthy (~5% dead). Nothing to tune at statement level.
  • Promote fires once per completion via q.Execute (full BEGIN/COMMIT) — internal/jobs/stream_worker.go:390.
  • Per-transaction exec time feeds the pressure EMA — internal/db/queue.go:317.
  • Not connection-bound: 1 active / 57 of 120 connections during a burst. More compute/connections would worsen contention.
  • ~40ms mean is concurrent-burst contention, not work (idle DB ≈ 1–3ms).

Architecture (why a gate is safe)

Both waiting producers converge on a per-job Redis ZSET keyed by task_id, drained by the dispatcher which enforces the concurrency cap (over-cap items stay in the ZSET — internal/broker/dispatcher.go:199).

  • ZSET dedupes by task_id → promote + retry cannot double-dispatch (internal/broker/scheduler.go:210).
  • Retry-backoff waiting tasks are ZSET-native (ScheduleAndAck) → do not need promote.
  • Enqueue-backlog waiting tasks are the only ones promote is load-bearing for.

Options

# Option Cuts the 58%? Risk Effort
0 Merge #395, re-measure Indirectly (removes 14% WAL-decode load → lower DB CPU/latency → may stop HOVER-JD) none done
1 Gate — skip promote when no enqueue-backlog waiting, + periodic promote-all sweep Yes — the real lever (eliminates the 0-row majority of 7M calls) medium, mitigated by sweep ~300 LOC, 7–8 files
2 Lane-gated autocommit No (~0% — that figure is server exec time; only saves client round-trips, single-digit ms on the EMA) low ~25 LOC
3 Root-cause the dispatcher (below) Potentially — could reduce/remove promote unknown investigation

Deepest open question (possibly the best fix)

Enqueue-waiting tasks already get an outbox row → sweeper → ZSET → dispatcher. So why is per-completion promote needed? The comment at internal/db/queue.go:1751 says backlogs "went idle" without it, but the root cause of why the ZSET/dispatcher path alone left them stranded was not fully established. If found and fixed, promote could be dropped or made low-frequency — a smaller, cleaner win than the gate. The gate treats the symptom; this would treat the cause. Time-box this before committing to the gate.

Gate design (if we build it)

  • Signal, one site: Redis waiting-hint set where enqueue creates waiting rows (processedWaiting > 0, internal/db/queue.go:1221); cleared when a promote returns 0. Retry-waiting needs no instrumentation.
  • Gate: cheap Redis read at internal/jobs/stream_worker.go:390 before calling promote.
  • Backstop: periodic promote-all over getActiveJobs() in the existing reclaim/reconcile cadence + a "stranded-found" metric so a regression is visible, not silent.
  • Wrinkle: enqueue lives in the DB layer (no Redis handle) — surface processedWaiting up to the jobs-layer caller rather than couple layers.

Risks / challenges

  • No fix that is both small and meaningful; the 58% is intrinsic to a per-completion design.
  • Gate correctness rests entirely on the sweep backstop + hint coverage; ship with review-app load testing, not blind.
  • Measurement dependency: Cut DB load: drop jobs realtime, remove dead index #395 may already drop the EMA below threshold — building the gate pre-emptively risks complexity on a non-problem.

Recommended next steps

  1. Merge Cut DB load: drop jobs realtime, remove dead index #395. Reset pg_stat_statements, re-run the load test.
  2. Measure: HOVER-JD frequency, exec_ema_ms vs 500ms, promote's new share over a clean stats window.
  3. Decide with data:
    • EMA comfortably under threshold → close; the 58% is cheap enough.
    • Still tripping → time-box the dispatcher root-cause investigation (Option 3) before defaulting to the gate (Option 1).
  4. Skip Option 2 unless the EMA sits just over threshold and nothing else moves it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions