Reduce promote_waiting_with_outbox DB cost (~58% of DB time)

## Summary
`promote_waiting_with_outbox` is ~58% of total database time (7.0M calls, ~40.8ms mean). It is **not a slow query** — the SQL plan is optimal. The cost is **call volume**: one full bulk-lane transaction per task completion (~4k/min). Each call's per-transaction exec time feeds the DB-pressure-controller EMA, so this is what trips `HOVER-JD` ("DB pressure at floor") under burst load.

Related: realtime/index load reductions in #395 (removes a separate ~14% WAL-decode load that may relieve this symptom indirectly).

## Evidence
- `EXPLAIN` confirms an index scan on `idx_tasks_waiting_by_job` for the waiting-row pick; PK update; `ON CONFLICT(task_id) DO NOTHING` outbox insert. `tasks` autovacuum healthy (~5% dead). Nothing to tune at statement level.
- Promote fires once per completion via `q.Execute` (full `BEGIN`/`COMMIT`) — `internal/jobs/stream_worker.go:390`.
- Per-transaction exec time feeds the pressure EMA — `internal/db/queue.go:317`.
- Not connection-bound: 1 active / 57 of 120 connections during a burst. More compute/connections would worsen contention.
- ~40ms mean is concurrent-burst contention, not work (idle DB ≈ 1–3ms).

## Architecture (why a gate is safe)
Both waiting producers converge on a per-job Redis ZSET keyed by `task_id`, drained by the dispatcher which enforces the concurrency cap (over-cap items stay in the ZSET — `internal/broker/dispatcher.go:199`).
- ZSET dedupes by `task_id` → promote + retry cannot double-dispatch (`internal/broker/scheduler.go:210`).
- Retry-backoff waiting tasks are ZSET-native (`ScheduleAndAck`) → do not need promote.
- Enqueue-backlog waiting tasks are the only ones promote is load-bearing for.

## Options
| # | Option | Cuts the 58%? | Risk | Effort |
|---|---|---|---|---|
| 0 | Merge #395, re-measure | Indirectly (removes 14% WAL-decode load → lower DB CPU/latency → may stop `HOVER-JD`) | none | done |
| 1 | **Gate** — skip promote when no enqueue-backlog waiting, + periodic promote-all sweep | Yes — the real lever (eliminates the 0-row majority of 7M calls) | medium, mitigated by sweep | ~300 LOC, 7–8 files |
| 2 | Lane-gated autocommit | No (~0% — that figure is server exec time; only saves client round-trips, single-digit ms on the EMA) | low | ~25 LOC |
| 3 | Root-cause the dispatcher (below) | Potentially — could reduce/remove promote | unknown | investigation |

## Deepest open question (possibly the best fix)
Enqueue-waiting tasks already get an outbox row → sweeper → ZSET → dispatcher. So *why* is per-completion promote needed? The comment at `internal/db/queue.go:1751` says backlogs "went idle" without it, but the root cause of why the ZSET/dispatcher path alone left them stranded was not fully established. If found and fixed, promote could be dropped or made low-frequency — a smaller, cleaner win than the gate. The gate treats the symptom; this would treat the cause. **Time-box this before committing to the gate.**

## Gate design (if we build it)
- Signal, one site: Redis waiting-hint set where enqueue creates waiting rows (`processedWaiting > 0`, `internal/db/queue.go:1221`); cleared when a promote returns 0. Retry-waiting needs no instrumentation.
- Gate: cheap Redis read at `internal/jobs/stream_worker.go:390` before calling promote.
- Backstop: periodic promote-all over `getActiveJobs()` in the existing reclaim/reconcile cadence + a "stranded-found" metric so a regression is visible, not silent.
- Wrinkle: enqueue lives in the DB layer (no Redis handle) — surface `processedWaiting` up to the jobs-layer caller rather than couple layers.

## Risks / challenges
- No fix that is both small and meaningful; the 58% is intrinsic to a per-completion design.
- Gate correctness rests entirely on the sweep backstop + hint coverage; ship with review-app load testing, not blind.
- Measurement dependency: #395 may already drop the EMA below threshold — building the gate pre-emptively risks complexity on a non-problem.

## Recommended next steps
1. Merge #395. Reset `pg_stat_statements`, re-run the load test.
2. Measure: `HOVER-JD` frequency, `exec_ema_ms` vs 500ms, promote's new share over a clean stats window.
3. Decide with data:
   - EMA comfortably under threshold → close; the 58% is cheap enough.
   - Still tripping → time-box the dispatcher root-cause investigation (Option 3) before defaulting to the gate (Option 1).
4. Skip Option 2 unless the EMA sits just over threshold and nothing else moves it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce promote_waiting_with_outbox DB cost (~58% of DB time) #396

Summary

Evidence

Architecture (why a gate is safe)

Options

Deepest open question (possibly the best fix)

Gate design (if we build it)

Risks / challenges

Recommended next steps

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

#	Option	Cuts the 58%?	Risk	Effort
0	Merge #395, re-measure	Indirectly (removes 14% WAL-decode load → lower DB CPU/latency → may stop `HOVER-JD`)	none	done
1	Gate — skip promote when no enqueue-backlog waiting, + periodic promote-all sweep	Yes — the real lever (eliminates the 0-row majority of 7M calls)	medium, mitigated by sweep	~300 LOC, 7–8 files
2	Lane-gated autocommit	No (~0% — that figure is server exec time; only saves client round-trips, single-digit ms on the EMA)	low	~25 LOC
3	Root-cause the dispatcher (below)	Potentially — could reduce/remove promote	unknown	investigation

Reduce promote_waiting_with_outbox DB cost (~58% of DB time) #396

Description

Summary

Evidence

Architecture (why a gate is safe)

Options

Deepest open question (possibly the best fix)

Gate design (if we build it)

Risks / challenges

Recommended next steps

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions