Stop outbox inheriting stale run_at #350
Updates to Preview Branch (fix/outbox-oldest-age-tail)
Tasks are run on every commit but only new migration files are pushed.
🐝 Review App Deployed. Homepage: https://hover-pr-350.fly.dev
Summary
The Grafana `Outbox oldest age` panel sawtoothed up to 5+ hours despite `Dispatch success ratio = 100%`, `Outbox backlog < 50`, and `task_outbox_dead` being empty. Investigation against production data found two compounding bugs.

Root cause
`promote_waiting_with_outbox` inserts `COALESCE(t.run_at, NOW())` into `task_outbox.run_at`. `tasks.run_at` is `NOT NULL`, so the `NOW()` fallback never applies, and it equals `created_at` for waiting tasks (no code path schedules a future `run_at` on a waiting task). With ~881k waiting tasks in production carrying `run_at` more than 30 minutes old (oldest > 3 days), every freshly promoted outbox row inherited an arbitrarily ancient timestamp.

Live evidence (sampled at 18:16 AEST):
- `run_at` 6h 13min in the past
- `run_at` 3h 17min in the past
- `run_at` rows had outbox dwell time < 1.5 seconds

The probe metric `bee.broker.outbox_age_seconds` is computed as `NOW() - MIN(run_at) WHERE run_at <= NOW()`, so it reported the inherited staleness rather than actual outbox dwell time. The sawtooth on the dashboard was the gauge tracking the inherited age of whichever long-waiting task got promoted next, not rows being stuck.

Fix
- `20260425081706_outbox_runat_use_now.sql`: `promote_waiting_with_outbox` now inserts `NOW()` into `task_outbox.run_at` instead of inheriting from the parent task. Retry/back-off paths in `Sweeper.bumpAttempts` continue to set future `run_at` values; only the initial insert changes.
- `internal/broker/probe.go`: the gauge now measures `NOW() - MIN(created_at)` over due rows, i.e. true dwell time. `created_at` is set to `NOW()` on insert and is monotonic w.r.t. row arrival.
- `internal/observability/observability.go`: gauge description updated to match.

Deadlock-safe `ORDER BY id` ordering from migration `20260425000001` is preserved. `ON CONFLICT (task_id) DO NOTHING` semantics unchanged. Function return value unchanged.

Test plan
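As a rough illustration of what the dashboard checks in this test plan are looking for, here is a minimal in-memory Go sketch of the two gauge definitions. It is not the code in `internal/broker/probe.go`: the names `outboxRow`, `promote`, `oldestAge`, and `demo` are hypothetical, and the timestamps are arbitrary apart from the 6h 13min offset borrowed from the live evidence.

```go
package main

import (
	"fmt"
	"time"
)

// outboxRow is a hypothetical in-memory stand-in for a task_outbox row.
type outboxRow struct {
	runAt     time.Time // when the row becomes due
	createdAt time.Time // when the row was inserted into the outbox
}

// promote models the initial outbox insert. With inherit=true it copies the
// parent task's run_at (old behaviour); otherwise it uses now (new behaviour).
func promote(taskRunAt, now time.Time, inherit bool) outboxRow {
	runAt := now
	if inherit {
		runAt = taskRunAt
	}
	return outboxRow{runAt: runAt, createdAt: now}
}

// oldestAge returns now minus the minimum key(r) over due rows
// (run_at <= now), or 0 when no row is due.
func oldestAge(rows []outboxRow, now time.Time, key func(outboxRow) time.Time) time.Duration {
	var oldest time.Time
	found := false
	for _, r := range rows {
		if r.runAt.After(now) {
			continue // not due yet, e.g. a retry scheduled in the future
		}
		if k := key(r); !found || k.Before(oldest) {
			oldest, found = k, true
		}
	}
	if !found {
		return 0
	}
	return now.Sub(oldest)
}

func byRunAt(r outboxRow) time.Time     { return r.runAt }     // old gauge key
func byCreatedAt(r outboxRow) time.Time { return r.createdAt } // new gauge key

// demo promotes one long-waiting task and probes the outbox one second later.
func demo() (oldGauge, dwellOldInsert, dwellFixedInsert time.Duration) {
	insertedAt := time.Date(2026, 4, 25, 8, 0, 0, 0, time.UTC) // arbitrary
	taskRunAt := insertedAt.Add(-(6*time.Hour + 13*time.Minute))
	probeAt := insertedAt.Add(time.Second)

	old := []outboxRow{promote(taskRunAt, insertedAt, true)}
	fixed := []outboxRow{promote(taskRunAt, insertedAt, false)}

	return oldestAge(old, probeAt, byRunAt),
		oldestAge(old, probeAt, byCreatedAt),
		oldestAge(fixed, probeAt, byCreatedAt)
}

func main() {
	oldGauge, dwellOld, dwellFixed := demo()
	fmt.Println("old insert, run_at gauge:      ", oldGauge)
	fmt.Println("old insert, created_at gauge:  ", dwellOld)
	fmt.Println("fixed insert, created_at gauge:", dwellFixed)
}
```

In this sketch the `run_at`-based gauge reports the parent task's age, while the `created_at`-based gauge reports only the time the row has spent in the outbox, which is the behaviour the dashboard should show after deploy.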
- `go build ./...`
- `go vet ./...`
- `go test ./...` (all packages pass)
- `bee.broker.outbox_age_seconds` should drop into the seconds range and stay there. Sawtooth pattern should disappear.
- `outbox_run_at_age` should be ≤ outbox row dwell time, not minutes/hours ahead of it.

Out of scope
The 881k waiting tasks with stale `run_at` are unrelated to this fix and should drain naturally as the admission loop catches up. If they don't, a follow-up housekeeping task can `UPDATE tasks SET run_at = NOW() WHERE status = 'waiting' AND run_at < NOW() - interval '1 hour'`.
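If that housekeeping task is ever needed, one possible shape is to apply the `UPDATE` in bounded batches rather than as a single statement over several hundred thousand rows. The Go sketch below is only an assumption about how that might look: batching is not part of this PR, the `tasks.id` column and the `LIMIT` subquery are guesses about the schema, and `resetBatch` and `drain` are hypothetical helpers. `main` uses a simulated step so the sketch runs without a database.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// resetStaleRunAt is the UPDATE from the PR description restricted to one
// batch. ASSUMPTION: tasks has an id column usable in the subquery.
const resetStaleRunAt = `
UPDATE tasks SET run_at = NOW()
WHERE id IN (
    SELECT id FROM tasks
    WHERE status = 'waiting' AND run_at < NOW() - interval '1 hour'
    ORDER BY id
    LIMIT $1
)`

// resetBatch runs one batch against db and reports how many rows changed.
func resetBatch(ctx context.Context, db *sql.DB, limit int) (int64, error) {
	res, err := db.ExecContext(ctx, resetStaleRunAt, limit)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}

// drain calls step repeatedly until a batch touches fewer rows than limit,
// and returns the total number of rows touched.
func drain(limit int, step func(limit int) (int64, error)) (int64, error) {
	if limit <= 0 {
		return 0, errors.New("limit must be positive")
	}
	var total int64
	for {
		n, err := step(limit)
		if err != nil {
			return total, err
		}
		total += n
		if n < int64(limit) {
			return total, nil
		}
	}
}

func main() {
	// Simulated step: pretend 1120 stale rows exist. A real caller would pass
	// func(n int) (int64, error) { return resetBatch(ctx, db, n) } instead.
	remaining := int64(1120)
	step := func(limit int) (int64, error) {
		n := int64(limit)
		if remaining < n {
			n = remaining
		}
		remaining -= n
		return n, nil
	}

	total, err := drain(500, step)
	if err != nil {
		fmt.Println("drain failed:", err)
		return
	}
	fmt.Println("rows reset:", total)
}
```

Keeping the loop in `drain` separate from the SQL call makes the stopping rule testable without a database connection.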