Skip to content

Stop outbox inheriting stale run_at#350

Merged
simonsmallchua merged 1 commit into
mainfrom
fix/outbox-oldest-age-tail
Apr 25, 2026
Merged

Stop outbox inheriting stale run_at#350
simonsmallchua merged 1 commit into
mainfrom
fix/outbox-oldest-age-tail

Conversation

@simonsmallchua
Copy link
Copy Markdown
Contributor

@simonsmallchua simonsmallchua commented Apr 25, 2026

Summary

The Grafana Outbox oldest age panel sawtoothed up to 5+ hours despite Dispatch success ratio = 100%, Outbox backlog < 50, and task_outbox_dead being empty. Investigation against production data found two compounding bugs.

Root cause

promote_waiting_with_outbox inserts COALESCE(t.run_at, NOW()) into task_outbox.run_at. tasks.run_at is NOT NULL and equals created_at for waiting tasks (no code path schedules a future run_at on a waiting task). With ~881k waiting tasks in production carrying run_at more than 30 minutes old (oldest > 3 days), every freshly-promoted outbox row inherited an arbitrarily ancient timestamp.

Live evidence (sampled at 18:16 AEST):

  • Outbox row created 1 second ago, run_at 6h 13min in the past
  • Outbox row created 1.4s ago, run_at 3h 17min in the past
  • All 10 oldest-by-run_at rows had outbox dwell time < 1.5 seconds

The probe metric bee.broker.outbox_age_seconds is computed NOW() - MIN(run_at) WHERE run_at <= NOW(), so it reported the inherited staleness rather than actual outbox dwell time. The sawtooth on the dashboard was the gauge tracking the inherited age of whichever long-waiting task got promoted next, not rows being stuck.

Fix

  1. Migration 20260425081706_outbox_runat_use_now.sqlpromote_waiting_with_outbox now inserts NOW() into task_outbox.run_at instead of inheriting from the parent task. Retry/back-off paths in Sweeper.bumpAttempts continue to set future run_at values; only the initial insert changes.
  2. internal/broker/probe.go — gauge now measures NOW() - MIN(created_at) over due rows, i.e. true dwell time. created_at is set to NOW() on insert and is monotonic w.r.t. row arrival.
  3. internal/observability/observability.go — gauge description updated to match.

Deadlock-safe ORDER BY id ordering from migration 20260425000001 is preserved. ON CONFLICT (task_id) DO NOTHING semantics unchanged. Function return value unchanged.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./... (all packages pass)
  • Migration applies cleanly on review app
  • Post-deploy: bee.broker.outbox_age_seconds should drop into the seconds range and stay there. Sawtooth pattern should disappear.
  • Sample new outbox rows: outbox_run_at_age should be ≤ outbox row dwell time, not minutes/hours ahead of it.

Out of scope

The 881k waiting tasks with stale run_at are unrelated to this fix and should drain naturally as the admission loop catches up. If they don't, a follow-up housekeeping task can UPDATE tasks SET run_at = NOW() WHERE status = 'waiting' AND run_at < NOW() - interval '1 hour'.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed task outbox dispatch timing to use current time when inserting new tasks, instead of inheriting timing from parent tasks, ensuring more predictable task dispatch behavior.
    • Corrected the outbox queue age metric to measure true dwell time from task creation rather than inherited dispatch times, providing more accurate performance insights.

@supabase
Copy link
Copy Markdown

supabase Bot commented Apr 25, 2026

Updates to Preview Branch (fix/outbox-oldest-age-tail) ↗︎

Deployments Status Updated
Database Sat, 25 Apr 2026 08:21:37 UTC
Services Sat, 25 Apr 2026 08:21:37 UTC
APIs Sat, 25 Apr 2026 08:21:37 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Sat, 25 Apr 2026 08:21:43 UTC
Migrations Sat, 25 Apr 2026 08:22:25 UTC
Seeding Sat, 25 Apr 2026 08:22:32 UTC
Edge Functions Sat, 25 Apr 2026 08:22:32 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 285a9378-2a48-45ec-b98d-f18d1f231c14

📥 Commits

Reviewing files that changed from the base of the PR and between b59dfb8 and bef6692.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • internal/broker/probe.go
  • internal/observability/observability.go
  • supabase/migrations/20260425081706_outbox_runat_use_now.sql

📝 Walkthrough

Walkthrough

This change modifies how task promotion and outbox age measurement operate. The promote_waiting_with_outbox function now sets task_outbox.run_at to NOW() unconditionally on insert instead of inheriting the parent task's run_at. Outbox age metrics are recalibrated to measure from created_at instead of run_at for accurate dwell time calculation.

Changes

Cohort / File(s) Summary
Observability & Metrics
internal/broker/probe.go, internal/observability/observability.go
Updated outbox age measurement from MIN(run_at) to MIN(created_at) with corresponding metric description and inline documentation clarifying semantics of run_at vs. created_at.
Database Migration
supabase/migrations/20260425081706_outbox_runat_use_now.sql
Replaced promote_waiting_with_outbox function to unconditionally set task_outbox.run_at = NOW() on initial outbox inserts during task promotion, with atomic ordering by priority/creation order and skip-locked concurrent-safe selection.
Documentation
CHANGELOG.md
Added changelog entries documenting the semantic changes to task promotion (no longer inheriting parent run_at) and outbox age measurement basis (created_at instead of run_at).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐇 Hops through the queue with timing so true,
NOW sets the dispatch, no old tasks in view,
Created arrives, not the run that was worn,
Fresh outbox dwell time, no staleness forlorn! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the main fix: preventing the outbox from inheriting stale run_at values, which is the core bug addressed across all changed files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/outbox-oldest-age-tail

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

Release Versions

App patch: v0.33.6v0.33.7

Changelog

Fixed

  • promote_waiting_with_outbox no longer inherits tasks.run_at into newly-
    inserted task_outbox rows. Waiting tasks always carry run_at = created_at,
    so promoting a task that had been waiting for hours/days baked an ancient
    timestamp into the outbox row. Production ran ~881k waiting tasks with
    run_at more than 30 minutes old (oldest > 3 days), causing the
    bee.broker.outbox_age_seconds gauge to climb past 5 hours despite actual
    outbox dwell times being sub-second. Outbox rows now use NOW() for run_at
    on initial insert; retry/back-off paths in Sweeper.bumpAttempts continue to
    set future run_at values.
  • bee.broker.outbox_age_seconds gauge now measures actual outbox dwell time
    (NOW() - MIN(created_at) over due rows) instead of staleness of the
    inherited run_at value.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
internal/broker/probe.go 0.00% 1 Missing ⚠️
internal/observability/observability.go 0.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-350.fly.dev
Dashboard: https://hover-pr-350.fly.dev/dashboard

@simonsmallchua simonsmallchua merged commit 2c7bdf2 into main Apr 25, 2026
11 checks passed
@simonsmallchua simonsmallchua deleted the fix/outbox-oldest-age-tail branch April 25, 2026 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant