Outbox aging + counter-sync + worker observability fixes#342
Conversation
|
Hey there! 👋 See the original and preview of hover-overview.json. Posted by simonsmallchua.grafana.net · Repository: Repository ( |
📝 WalkthroughWalkthroughAdds outbox dead-lettering and migration, per-tick statement timeouts, partial per-entry scheduler failure handling via Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Sweeper as "Outbox Sweeper\n(runner)"
participant DB as "Postgres DB"
participant Scheduler as "Redis Scheduler"
participant Observ as "Observability"
rect rgba(200,200,255,0.5)
Sweeper->>DB: Begin TX\nSET LOCAL statement_timeout
end
Sweeper->>Scheduler: ScheduleBatch(ctx, rows) -- pipeline ZADDs
alt Full success
Scheduler-->>Sweeper: nil (all dispatched)
Sweeper->>DB: DELETE FROM task_outbox WHERE id IN (...)
else Partial failures (*BatchError)
Scheduler-->>Sweeper: *BatchError {FailedIndices}
Sweeper->>DB: DELETE succeeded rows\nUPDATE attempts for failed rows\nMOVE to task_outbox_dead if attempts>=MaxAttempts
else Pipeline/Exec failure
Scheduler-->>Sweeper: error (non-BatchError)
Sweeper->>DB: rollback TX
end
Sweeper->>DB: Commit TX
Sweeper->>Observ: RecordBrokerOutboxSweep(outcome, count)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Updates to Preview Branch (fix-counter-sync-prepared-stmt) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
Release VersionsApp patch: ChangelogAdded
Changed
Fixed
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
🐝 Review App Deployed Homepage: https://hover-pr-342.fly.dev |
|
Hey there! 👋 See the original and preview of hover-overview.json. Posted by simonsmallchua.grafana.net · Repository: Repository ( |
|
🐝 Review App Deployed Homepage: https://hover-pr-342.fly.dev |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/broker/outbox.go (1)
325-328: Consider usingerrors.Asdirectly in the switch.The
isBatchErrorhelper works correctly, but Go'serrors.Ascan be used directly in a type switch for slightly cleaner code. This is a minor stylistic point; the current approach is perfectly valid.♻️ Optional: Inline the type assertion
- switch { - case schedErr == nil: + var be *BatchError + switch { + case schedErr == nil: succeeded = make([]int64, 0, len(claimed)) for _, r := range claimed { succeeded = append(succeeded, r.id) } - case isBatchError(schedErr): - be := schedErr.(*BatchError) //nolint:errcheck // checked by isBatchError + case errors.As(schedErr, &be): failedSet := make(map[int]struct{}, len(be.FailedIndices))This eliminates the helper function and the
//nolintcomment.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/broker/outbox.go` around lines 325 - 328, The helper function isBatchError simply wraps errors.As for *BatchError; remove this helper and inline the check by using errors.As directly in the relevant switch/case or conditional (replace calls to isBatchError(err) with errors.As(err, &be) where be is a *BatchError variable declared in that scope), and remove the now-unused isBatchError function and its nolint comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/broker/outbox.go`:
- Around line 325-328: The helper function isBatchError simply wraps errors.As
for *BatchError; remove this helper and inline the check by using errors.As
directly in the relevant switch/case or conditional (replace calls to
isBatchError(err) with errors.As(err, &be) where be is a *BatchError variable
declared in that scope), and remove the now-unused isBatchError function and its
nolint comment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0b0928cf-f375-4f8c-b368-2cb7241499b9
📒 Files selected for processing (12)
.fly/review_apps.worker.tomlCHANGELOG.mddocs/diagnostics/outbox-aging-investigation.mdfly.worker.tomlinternal/broker/outbox.gointernal/broker/outbox_integration_test.gointernal/broker/scheduler.gointernal/broker/scheduler_test.gointernal/jobs/manager.gointernal/observability/observability.goscripts/start.shsupabase/migrations/20260423132003_outbox_dead_letter.sql
✅ Files skipped from review due to trivial changes (1)
- supabase/migrations/20260423132003_outbox_dead_letter.sql
|
Hey there! 👋 See the original and preview of hover-overview.json. Posted by simonsmallchua.grafana.net · Repository: Repository ( |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/broker/outbox.go (1)
253-253: Consider defensive nil check onbe.Err.The current scheduler implementation guarantees
be.Erris non-nil whenBatchErroris returned, but adding a guard would protect against future refactors:🛡️ Optional defensive check
- lastErrMsg = be.Err.Error() + if be.Err != nil { + lastErrMsg = be.Err.Error() + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/broker/outbox.go` at line 253, Add a defensive nil-check before dereferencing be.Err when setting lastErrMsg in the BatchError handling path: guard the assignment (where lastErrMsg = be.Err.Error()) so that if be.Err is nil you set a sensible fallback (e.g., "unknown error" or include be.Message) instead of calling Error() on nil; update the code handling BatchError and any use-sites of lastErrMsg to rely on this non-nil string to prevent panics if future refactors allow be.Err to be nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/broker/outbox.go`:
- Line 253: Add a defensive nil-check before dereferencing be.Err when setting
lastErrMsg in the BatchError handling path: guard the assignment (where
lastErrMsg = be.Err.Error()) so that if be.Err is nil you set a sensible
fallback (e.g., "unknown error" or include be.Message) instead of calling
Error() on nil; update the code handling BatchError and any use-sites of
lastErrMsg to rely on this non-nil string to prevent panics if future refactors
allow be.Err to be nil.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ef9f4b8f-97c3-48e9-aeec-8bcf76231d04
📒 Files selected for processing (1)
internal/broker/outbox.go
|
🐝 Review App Deployed Homepage: https://hover-pr-342.fly.dev |
Consolidates PR #340 and the 42P05 counter-sync fix into a single branch so one CI run + one preview covers all the worker-stability work.
Summary
Outbox aging (originally #340)
task_outbox_deadtable captures rows past the retry cap withlast_errorfor triage.Scheduler.ScheduleBatchreturns a typed*BatchErrorexposingFailedIndices, so the sweeper deletes the succeeded rows and only bumps attempts on the failed ones (previously every row in a 500-row batch was punished when one ZADD failed).SET LOCAL statement_timeout(default 5s) pluscontext.WithTimeoutaround the whole tx, so a wedged backend can't hold locks indefinitely.JobManager.CancelJobdeletestask_outboxrows for the cancelled job in the same tx — stops stale rows inflating backlog/oldest-age gauges.bee.broker.outbox_sweep_total{outcome=dispatched|retried|dead_lettered}counter.attempts + 1so the terminal attempt count is recorded.docs/diagnostics/outbox-aging-investigation.md.Worker observability
scripts/start.sh(not the bare binary), so the Alloy metrics sidecar runs on every process. Before:bee.worker.*andbee.broker.*from prodhover-workerand everyhover-worker-pr-*were silently dropped. Now tagged withapp=hover-worker[-pr-N]andenvironment=production|staging.scripts/start.shnow accepts the binary name as\$1(defaultmain), so one script serves both API and worker.Counter sync 42P05 fix
DefaultDBSyncFuncusedtx.PrepareContext+stmt.ExecContext. pgx v5 hashes the SQL intostmt_<md5>— deterministic across pgx pools. Since PR Bound outbox sweep aging #340's worker split, API + worker have separate pgx pools but share Supabase's pgbouncer transaction-mode backend conns, so the second process PREPAREs a name the first already left on the backend → SQLSTATE 42P05.tx.ExecContextdirectly — that honours the pool'sdefault_query_exec_mode=simple_protocol(already set forpooler.supabase.comURLs) and skips the PREPARE entirely.internal/andcmd/: no otherPrepareContext/.Prepare(sites exist.Why the 42P05 only surfaced now
62dd480c("Deploy worker app per preview PR", 2026-04-19) split the worker into its own Fly app. Before: one pgx pool → pgx's stmtcache stayed coherent. After: two pgx pools hashing the same SQL to the same name, sharing one pgbouncer. 529 collisions in one ~6h PR #340 preview window, all on the samestmt_32c9a907…name.Test plan
go test ./internal/broker/ -shortgo build ./...broker: failed to sync running counters to DBdrops to zero in log summariesbee.broker.*+bee.worker.*series appear in Grafana taggedapp=hover-worker-pr-342, environment=stagingtask_outbox_deadaccepts rows andoutbox_sweep_totalcounter increments by outcomeSummary by CodeRabbit