Skip to content

Outbox aging + counter-sync + worker observability fixes#342

Merged
simonsmallchua merged 8 commits into
mainfrom
fix-counter-sync-prepared-stmt
Apr 24, 2026
Merged

Outbox aging + counter-sync + worker observability fixes#342
simonsmallchua merged 8 commits into
mainfrom
fix-counter-sync-prepared-stmt

Conversation

@simonsmallchua
Copy link
Copy Markdown
Contributor

@simonsmallchua simonsmallchua commented Apr 24, 2026

Consolidates PR #340 and the 42P05 counter-sync fix into a single branch so one CI run + one preview covers all the worker-stability work.

Summary

Outbox aging (originally #340)

  • task_outbox_dead table captures rows past the retry cap with last_error for triage.
  • Scheduler.ScheduleBatch returns a typed *BatchError exposing FailedIndices, so the sweeper deletes the succeeded rows and only bumps attempts on the failed ones (previously every row in a 500-row batch was punished when one ZADD failed).
  • Sweeper tick bounded by SET LOCAL statement_timeout (default 5s) plus context.WithTimeout around the whole tx, so a wedged backend can't hold locks indefinitely.
  • JobManager.CancelJob deletes task_outbox rows for the cancelled job in the same tx — stops stale rows inflating backlog/oldest-age gauges.
  • bee.broker.outbox_sweep_total{outcome=dispatched|retried|dead_lettered} counter.
  • Moved dead-letter insert to attempts + 1 so the terminal attempt count is recorded.
  • Investigation notes at docs/diagnostics/outbox-aging-investigation.md.

Worker observability

  • Worker Fly processes now launch via scripts/start.sh (not the bare binary), so the Alloy metrics sidecar runs on every process. Before: bee.worker.* and bee.broker.* from prod hover-worker and every hover-worker-pr-* were silently dropped. Now tagged with app=hover-worker[-pr-N] and environment=production|staging.
  • scripts/start.sh now accepts the binary name as \$1 (default main), so one script serves both API and worker.

Counter sync 42P05 fix

  • DefaultDBSyncFunc used tx.PrepareContext + stmt.ExecContext. pgx v5 hashes the SQL into stmt_<md5> — deterministic across pgx pools. Since PR Bound outbox sweep aging #340's worker split, API + worker have separate pgx pools but share Supabase's pgbouncer transaction-mode backend conns, so the second process PREPAREs a name the first already left on the backend → SQLSTATE 42P05.
  • Fixed by dropping the explicit prepare and using tx.ExecContext directly — that honours the pool's default_query_exec_mode=simple_protocol (already set for pooler.supabase.com URLs) and skips the PREPARE entirely.
  • Swept internal/ and cmd/: no other PrepareContext / .Prepare( sites exist.

Why the 42P05 only surfaced now

62dd480c ("Deploy worker app per preview PR", 2026-04-19) split the worker into its own Fly app. Before: one pgx pool → pgx's stmtcache stayed coherent. After: two pgx pools hashing the same SQL to the same name, sharing one pgbouncer. 529 collisions in one ~6h PR #340 preview window, all on the same stmt_32c9a907… name.

Test plan

  • go test ./internal/broker/ -short
  • go build ./...
  • Deploy to PR Outbox aging + counter-sync + worker observability fixes #342 preview, run crawls, confirm broker: failed to sync running counters to DB drops to zero in log summaries
  • Confirm bee.broker.* + bee.worker.* series appear in Grafana tagged app=hover-worker-pr-342, environment=staging
  • Confirm task_outbox_dead accepts rows and outbox_sweep_total counter increments by outcome

Summary by CodeRabbit

  • New Features
    • Outbox sweeper: per-outcome metrics, per-tick DB time bounds, capped retries, and moving permanently failing rows to a dead-letter table.
    • Scheduler now reports per-entry partial failures so successful dispatches aren’t retried.
  • Bug Fixes
    • Running-counter DB sync avoids prepared-statement collisions.
    • Cancelling a job now removes its outbox rows in the same transaction.
    • Worker startup ensures metrics sidecar runs so broker/worker metrics are preserved.
  • Documentation
    • Added outbox aging investigation and remediation guide.
  • Tests
    • Integration tests cover dead-lettering and healthy multi-row dispatch.

@grafana-sync-hover-simon-smallchua
Copy link
Copy Markdown

Hey there! 👋
Grafana spotted some changes to your dashboard.

See the original and preview of hover-overview.json.


Posted by simonsmallchua.grafana.net · Repository: Repository (repository-27dc5a5)

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Walkthrough

Adds outbox dead-lettering and migration, per-tick statement timeouts, partial per-entry scheduler failure handling via BatchError, per-outcome sweep metrics, transactional cleanup of cancelled-job outbox rows, fixes running-counter DB sync to avoid prepared-statement collisions, starts the Alloy metrics sidecar for all Fly processes, and adds tests and docs.

Changes

Cohort / File(s) Summary
Changelog / Docs
CHANGELOG.md, docs/diagnostics/outbox-aging-investigation.md
Expanded Unreleased changelog; new outbox aging investigation doc with hypotheses, diagnostic queries, mitigations, and post-fix notes.
Outbox Sweeper & Options
internal/broker/outbox.go
Added MaxAttempts and StatementTimeout options, per-tick timeout context, SET LOCAL statement_timeout inside TX, partial-success handling (delete succeeded, bump or dead-letter failed rows), atomic move-to-dead-letter helper, and per-outcome counters.
Outbox Integration Tests
internal/broker/outbox_integration_test.go
Added TestOutboxSweeper_DeadLetter and TestOutboxSweeper_HealthyMultiRow.
Scheduler & Tests
internal/broker/scheduler.go, internal/broker/scheduler_test.go
Introduced exported BatchError (with FailedIndices, Total, Err) for per-entry failures; updated logging; unit test distinguishing pipeline-level failures from per-entry failures.
Running-counter DB sync
internal/broker/counters.go
Replaced PrepareContext usage with direct tx.ExecContext for running_tasks updates to avoid prepared-statement collisions.
Job cancellation cleanup
internal/jobs/manager.go
CancelJob now deletes matching task_outbox rows within the same transaction that marks tasks skipped.
Observability
internal/observability/observability.go
Added bee.broker.outbox_sweep_total counter and RecordBrokerOutboxSweep helper to record dispatched/retried/dead_lettered counts.
Startup / Fly configs
scripts/start.sh, fly.worker.toml, .fly/review_apps.worker.toml
start.sh parameterized to launch chosen binary and ensure Alloy metrics sidecar; Fly process configs updated to run start.sh so worker processes include metrics sidecar.
DB migration
supabase/migrations/20260423132003_outbox_dead_letter.sql
Added task_outbox_dead table with forensic fields (dead_lettered_at, last_error) and indexes on job_id and dead_lettered_at DESC.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Sweeper as "Outbox Sweeper\n(runner)"
  participant DB as "Postgres DB"
  participant Scheduler as "Redis Scheduler"
  participant Observ as "Observability"

  rect rgba(200,200,255,0.5)
  Sweeper->>DB: Begin TX\nSET LOCAL statement_timeout
  end

  Sweeper->>Scheduler: ScheduleBatch(ctx, rows)  -- pipeline ZADDs
  alt Full success
    Scheduler-->>Sweeper: nil (all dispatched)
    Sweeper->>DB: DELETE FROM task_outbox WHERE id IN (...)
  else Partial failures (*BatchError)
    Scheduler-->>Sweeper: *BatchError {FailedIndices}
    Sweeper->>DB: DELETE succeeded rows\nUPDATE attempts for failed rows\nMOVE to task_outbox_dead if attempts>=MaxAttempts
  else Pipeline/Exec failure
    Scheduler-->>Sweeper: error (non-BatchError)
    Sweeper->>DB: rollback TX
  end

  Sweeper->>DB: Commit TX
  Sweeper->>Observ: RecordBrokerOutboxSweep(outcome, count)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I nibbled through the outbox hay,
Moved tired rows where they may stay,
Timeouts set and counters sing,
Retries capped, dead letters ring,
A rabbit cheers: metrics hop away!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes three major changes: outbox aging mitigation, counter-sync prepared-statement fix, and worker observability improvements via Alloy sidecar integration.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-counter-sync-prepared-stmt

Comment @coderabbitai help to get the list of available commands and usage tips.

@supabase
Copy link
Copy Markdown

supabase Bot commented Apr 24, 2026

Updates to Preview Branch (fix-counter-sync-prepared-stmt) ↗︎

Deployments Status Updated
Database Fri, 24 Apr 2026 10:03:23 UTC
Services Fri, 24 Apr 2026 10:03:23 UTC
APIs Fri, 24 Apr 2026 10:03:23 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Fri, 24 Apr 2026 10:03:26 UTC
Migrations Fri, 24 Apr 2026 10:03:28 UTC
Seeding Fri, 24 Apr 2026 10:03:29 UTC
Edge Functions Fri, 24 Apr 2026 10:03:33 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 24, 2026

Release Versions

App patch: v0.33.0v0.33.1

Changelog

Added

  • bee.broker.outbox_sweep_total counter with
    outcome={dispatched, retried, dead_lettered} labels so partial-failure and
    dead-letter rates are visible without a database session.
  • task_outbox_dead table capturing rows whose attempts exceeded the retry
    cap (default 10), with dead_lettered_at and last_error for triage.
  • Outbox investigation notes at docs/diagnostics/outbox-aging-investigation.md
    covering the oldest-age growth pattern, ranked hypotheses, and diagnostic
    queries.

Changed

  • Scheduler.ScheduleBatch now returns a typed *BatchError on partial
    pipeline failure, exposing FailedIndices so the outbox sweeper can DELETE
    the succeeded rows and only bump the failed ones. Previously every row in a
    500-row batch had attempts bumped whenever any single ZADD failed.
  • Outbox sweeper bounds each tick's DB work with SET LOCAL statement_timeout
    (default 5 s) to keep a wedged backend from holding locks indefinitely.
  • JobManager.CancelJob now deletes task_outbox rows for the cancelled job in
    the same transaction, preventing stale rows from contributing to the backlog
    and oldest-age gauges.

Fixed

  • Running-counter DB sync now uses tx.ExecContext instead of
    tx.PrepareContext, so the sync loop no longer fails with
    prepared statement "stmt_…" already exists (SQLSTATE 42P05) when the worker
    shares Supabase's pgbouncer with the API. The explicit prepare was creating
    deterministically named server-side statements that collided across pooled
    backends; the pool-level default_query_exec_mode=simple_protocol setting
    already in place for pooler URLs now takes effect.
  • Worker Fly processes now launch via scripts/start.sh instead of running the
    binary directly, so the Alloy metrics sidecar runs on every process. Without
    this, bee.worker.* and bee.broker.* metrics from the prod hover-worker
    app and every hover-worker-pr-* review app were silently dropped before
    reaching Grafana Cloud.
  • scripts/start.sh now accepts the binary name as $1 (default main), so a
    single script launches either the API or the worker alongside Alloy. Both
    fly.worker.toml and .fly/review_apps.worker.toml point at
    ./start.sh worker.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 4.72973% with 141 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
internal/broker/outbox.go 0.00% 111 Missing ⚠️
internal/broker/scheduler.go 38.88% 10 Missing and 1 partial ⚠️
internal/observability/observability.go 0.00% 11 Missing ⚠️
internal/jobs/manager.go 0.00% 5 Missing ⚠️
internal/broker/counters.go 0.00% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-342.fly.dev
Dashboard: https://hover-pr-342.fly.dev/dashboard

@grafana-sync-hover-simon-smallchua
Copy link
Copy Markdown

Hey there! 👋
Grafana spotted some changes to your dashboard.

See the original and preview of hover-overview.json.


Posted by simonsmallchua.grafana.net · Repository: Repository (repository-27dc5a5)

@simonsmallchua simonsmallchua changed the title Fix 42P05 in counter sync under pgbouncer Outbox aging + counter-sync + worker observability fixes Apr 24, 2026
@simonsmallchua simonsmallchua mentioned this pull request Apr 24, 2026
5 tasks
@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-342.fly.dev
Dashboard: https://hover-pr-342.fly.dev/dashboard

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/broker/outbox.go (1)

325-328: Consider using errors.As directly in the switch.

The isBatchError helper works correctly, but Go's errors.As can be used directly in a type switch for slightly cleaner code. This is a minor stylistic point; the current approach is perfectly valid.

♻️ Optional: Inline the type assertion
-	switch {
-	case schedErr == nil:
+	var be *BatchError
+	switch {
+	case schedErr == nil:
 		succeeded = make([]int64, 0, len(claimed))
 		for _, r := range claimed {
 			succeeded = append(succeeded, r.id)
 		}
-	case isBatchError(schedErr):
-		be := schedErr.(*BatchError) //nolint:errcheck // checked by isBatchError
+	case errors.As(schedErr, &be):
 		failedSet := make(map[int]struct{}, len(be.FailedIndices))

This eliminates the helper function and the //nolint comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/broker/outbox.go` around lines 325 - 328, The helper function
isBatchError simply wraps errors.As for *BatchError; remove this helper and
inline the check by using errors.As directly in the relevant switch/case or
conditional (replace calls to isBatchError(err) with errors.As(err, &be) where
be is a *BatchError variable declared in that scope), and remove the now-unused
isBatchError function and its nolint comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/broker/outbox.go`:
- Around line 325-328: The helper function isBatchError simply wraps errors.As
for *BatchError; remove this helper and inline the check by using errors.As
directly in the relevant switch/case or conditional (replace calls to
isBatchError(err) with errors.As(err, &be) where be is a *BatchError variable
declared in that scope), and remove the now-unused isBatchError function and its
nolint comment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0b0928cf-f375-4f8c-b368-2cb7241499b9

📥 Commits

Reviewing files that changed from the base of the PR and between 3091b9b and bebbb4a.

📒 Files selected for processing (12)
  • .fly/review_apps.worker.toml
  • CHANGELOG.md
  • docs/diagnostics/outbox-aging-investigation.md
  • fly.worker.toml
  • internal/broker/outbox.go
  • internal/broker/outbox_integration_test.go
  • internal/broker/scheduler.go
  • internal/broker/scheduler_test.go
  • internal/jobs/manager.go
  • internal/observability/observability.go
  • scripts/start.sh
  • supabase/migrations/20260423132003_outbox_dead_letter.sql
✅ Files skipped from review due to trivial changes (1)
  • supabase/migrations/20260423132003_outbox_dead_letter.sql

@grafana-sync-hover-simon-smallchua
Copy link
Copy Markdown

Hey there! 👋
Grafana spotted some changes to your dashboard.

See the original and preview of hover-overview.json.


Posted by simonsmallchua.grafana.net · Repository: Repository (repository-27dc5a5)

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/broker/outbox.go (1)

253-253: Consider defensive nil check on be.Err.

The current scheduler implementation guarantees be.Err is non-nil when BatchError is returned, but adding a guard would protect against future refactors:

🛡️ Optional defensive check
-		lastErrMsg = be.Err.Error()
+		if be.Err != nil {
+			lastErrMsg = be.Err.Error()
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/broker/outbox.go` at line 253, Add a defensive nil-check before
dereferencing be.Err when setting lastErrMsg in the BatchError handling path:
guard the assignment (where lastErrMsg = be.Err.Error()) so that if be.Err is
nil you set a sensible fallback (e.g., "unknown error" or include be.Message)
instead of calling Error() on nil; update the code handling BatchError and any
use-sites of lastErrMsg to rely on this non-nil string to prevent panics if
future refactors allow be.Err to be nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/broker/outbox.go`:
- Line 253: Add a defensive nil-check before dereferencing be.Err when setting
lastErrMsg in the BatchError handling path: guard the assignment (where
lastErrMsg = be.Err.Error()) so that if be.Err is nil you set a sensible
fallback (e.g., "unknown error" or include be.Message) instead of calling
Error() on nil; update the code handling BatchError and any use-sites of
lastErrMsg to rely on this non-nil string to prevent panics if future refactors
allow be.Err to be nil.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ef9f4b8f-97c3-48e9-aeec-8bcf76231d04

📥 Commits

Reviewing files that changed from the base of the PR and between bebbb4a and bf017d1.

📒 Files selected for processing (1)
  • internal/broker/outbox.go

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-342.fly.dev
Dashboard: https://hover-pr-342.fly.dev/dashboard

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant