Bound outbox sweep aging #340
Hey there! 👋 See the original and preview of hover-overview.json. Posted by simonsmallchua.grafana.net · Repository: Repository ( |
📝 Walkthrough
Implements outbox dead‑lettering and partial‑batch error signaling, bounds sweeper ticks with statement_timeout, moves expired rows to
Changes
Sequence Diagram
sequenceDiagram
participant DB as Database
participant Sweeper as Outbox Sweeper
participant Scheduler as Scheduler
participant Observability as Observability
participant DeadLetter as task_outbox_dead
Note over Sweeper,DB: sweeper tick begins (tickCtx + SET LOCAL statement_timeout)
Sweeper->>DB: SELECT ... FOR UPDATE SKIP LOCKED
DB-->>Sweeper: Claimed rows
Sweeper->>Scheduler: ScheduleBatch(entries)
alt Partial per-entry failures (BatchError)
Scheduler-->>Sweeper: BatchError(FailedIndices, Total, Err)
Sweeper->>DB: DELETE rows for succeeded indices
Sweeper->>DB: UPDATE attempts/run_at for failed indices
else All succeeded
Scheduler-->>Sweeper: nil
Sweeper->>DB: DELETE all claimed rows
else Pipeline error (non-BatchError)
Scheduler-->>Sweeper: error
Sweeper->>DB: UPDATE attempts/run_at for all claimed rows
end
alt attempts >= MaxAttempts
Sweeper->>DeadLetter: INSERT row(s) with last_error, dead_lettered_at
Sweeper->>DB: DELETE from task_outbox
Sweeper->>Observability: RecordBrokerOutboxSweep("dead_lettered", count)
else Retried
Sweeper->>Observability: RecordBrokerOutboxSweep("retried", count)
else Dispatched
Sweeper->>Observability: RecordBrokerOutboxSweep("dispatched", count)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
|
Updates to Preview Branch (investigate-outbox-aging) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
|
🐝 Review App Deployed Homepage: https://hover-pr-340.fly.dev |
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/broker/scheduler.go (1)
164-167: ⚠️ Potential issue | 🟠 Major
Don't return on `Pipeline.Exec` error before inspecting per-command results.
The error return at line 165 prevents per-command error inspection that occurs later (lines 173–186), collapsing partial command failures into a full-batch failure. In go-redis/v9, when some pipelined commands fail server-side, `Exec` returns a non-nil error and the command slice with per-command `Err()` values set. The current code ignores this per-command information, treating any `Exec` error as a complete batch failure instead of distinguishing partial failures where `FailedIndices` would apply.
Capture the `Exec` error without returning immediately; inspect per-command results first to properly handle partial-failure cases.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/broker/scheduler.go` around lines 164 - 167, The current early return on pipe.Exec(ctx) hides per-command results needed to detect partial failures; change the handling in schedule batch so that you capture the error from pipe.Exec(ctx) into a variable (e.g., execErr) but do NOT return immediately, iterate the returned cmds slice and inspect each command's Err() to build/append FailedIndices and partial errors, and only return a final error that incorporates execErr plus per-command failures if appropriate; update logic around cmds, entries, and FailedIndices to distinguish full Exec failures from server-side per-command errors.
internal/broker/outbox.go (1)
72-88: ⚠️ Potential issue | 🟠 Major
`StatementTimeout` is documented as default-on, but `NewOutboxSweeper` never applies that default.
`DefaultOutboxSweeperOpts()` sets a 5 s statement timeout, but `NewOutboxSweeper()` only backfills `MaxAttempts` and leaves `StatementTimeout` at zero. Any caller that constructs `OutboxSweeperOpts{}` directly will silently lose the new safeguard.
Suggested fix

     func NewOutboxSweeper(db *sql.DB, scheduler *Scheduler, opts OutboxSweeperOpts) *Sweeper {
         if opts.Interval <= 0 {
             opts.Interval = 500 * time.Millisecond
         }
         if opts.BatchSize <= 0 {
             opts.BatchSize = 200
         }
         if opts.BaseBackoff <= 0 {
             opts.BaseBackoff = 2 * time.Second
         }
         if opts.MaxBackoff <= 0 {
             opts.MaxBackoff = 5 * time.Minute
         }
         if opts.MaxAttempts <= 0 {
             opts.MaxAttempts = DefaultOutboxMaxAttempts
         }
    +    if opts.StatementTimeout <= 0 {
    +        opts.StatementTimeout = 5 * time.Second
    +    }
         return &Sweeper{db: db, scheduler: scheduler, opts: opts}
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/broker/outbox.go` around lines 72 - 88, NewOutboxSweeper currently fails to apply the default StatementTimeout from DefaultOutboxSweeperOpts, so callers that pass an empty OutboxSweeperOpts get StatementTimeout == 0; update NewOutboxSweeper to detect opts.StatementTimeout <= 0 and set it to the default (use the value from DefaultOutboxSweeperOpts() or the documented 5s) alongside the existing backfills for Interval/BatchSize/BaseBackoff/MaxBackoff/MaxAttempts so the statement timeout safeguard is always applied even when opts is constructed directly.
🧹 Nitpick comments (2)
supabase/migrations/20260423132003_outbox_dead_letter.sql (1)
20-20: Consider a unique index on `original_id` for dead-letter integrity and lookup speed.
You already query by `original_id` in tests/triage paths; a unique index would both accelerate that and guarantee one dead-letter row per source outbox row.
Suggested migration addition

    +CREATE UNIQUE INDEX IF NOT EXISTS idx_task_outbox_dead_original_id
    +  ON public.task_outbox_dead (original_id);

Also applies to: 39-43
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/migrations/20260423132003_outbox_dead_letter.sql` at line 20, Add a unique index on the dead-letter table's original_id to enforce one dead-letter row per source outbox row and speed lookups: modify the outbox_dead_letter migration to create a unique index/constraint on original_id (e.g., CREATE UNIQUE INDEX or ALTER TABLE ... ADD CONSTRAINT UNIQUE on original_id) and include the same change where original_id is defined/used (lines referenced around 39-43) so queries and tests that filter by original_id benefit from the uniqueness and performance guarantee.
internal/broker/outbox_integration_test.go (1)
283-314: `TestOutboxSweeper_PartialFailure` currently validates only healthy dispatch.
The test name/comment says partial failure, but the assertions only prove successful sweep+delete. Consider renaming it to reflect healthy multi-row dispatch, or add a dedicated case that actually drives `*BatchError` handling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/broker/outbox_integration_test.go` around lines 283 - 314, The test TestOutboxSweeper_PartialFailure currently only asserts successful dispatch (no attempts bumped) and doesn't exercise the partial-failure code path; either rename the test to reflect a healthy dispatch scenario (e.g., TestOutboxSweeper_HealthyDispatch) or extend it to simulate a partial ScheduleBatch failure and assert BatchError handling: locate the test function TestOutboxSweeper_PartialFailure and modify it to (A) inject a scheduler stub/mock where ScheduleBatch returns a *BatchError indicating some failed entries and successful ones, then assert that only failed rows have attempts incremented while successful rows are DELETEd after NewOutboxSweeper(..., OutboxSweeperOpts{BatchSize: 50}).Tick(ctx), or (B) change the test name and message to match the current healthy dispatch assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/diagnostics/outbox-aging-investigation.md`:
- Line 3: This doc is ambiguous about whether it describes the pre-fix state or
the post-fix behavior: either explicitly label the whole document (or the noted
sections) as a "pre-fix investigation snapshot" and keep the proposed fixes as
historical notes, OR update the text to reflect the merged behavior (remove/mark
as implemented the proposed fixes) so the doc matches current code; specifically
call out CancelJob and task_outbox behavior (e.g., state that CancelJob now
cleans task_outbox if that is true) and update the descriptions of the
counter/label changes to match the implemented counter names and semantics so
the file is accurate for ops.
In `@internal/broker/outbox.go`:
- Around line 247-255: The dead-lettering logic checks terminality with
"r.attempts+1 >= s.opts.MaxAttempts" but then appends the row unchanged so the
stored attempts value is one less than the terminal count; update the code path
that moves items to dead-letter (the deadLetter append and the
moveToDeadLetter() call sites) to record the bumped attempt count (i.e., use
r.attempts+1 or set attempts = r.attempts+1 before appending) so the persisted
dead-letter rows reflect the same attempt number used for the terminal decision;
apply the same fix to the other occurrence around moveToDeadLetter() referenced
in the review.
- Around line 140-150: The Tick transaction currently sets only PostgreSQL's
statement_timeout which won't cancel the whole transaction while the code waits
on ScheduleBatch; wrap the entire Tick operation in a context.WithTimeout (using
a sensible deadline, e.g. s.opts.StatementTimeout or a new s.opts.TickTimeout)
and use that derived ctx for beginning the transaction, all tx.ExecContext calls
and the call to ScheduleBatch so that a hung ScheduleBatch cancels the
transaction; remember to defer cancel() and propagate the timeout error back so
locks are released promptly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ef44366e-3e61-4dbe-b85e-025e9dea7b2f
📒 Files selected for processing (9)
- CHANGELOG.md
- docs/diagnostics/outbox-aging-investigation.md
- internal/broker/outbox.go
- internal/broker/outbox_integration_test.go
- internal/broker/scheduler.go
- internal/broker/scheduler_test.go
- internal/jobs/manager.go
- internal/observability/observability.go
- supabase/migrations/20260423132003_outbox_dead_letter.sql
|
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/broker/outbox.go (1)
72-88: ⚠️ Potential issue | 🟠 Major
Default `StatementTimeout` when callers leave it unset.
`NewOutboxSweeper(...)` now normalises `MaxAttempts`, but it still leaves `StatementTimeout` at zero. Any zero-value or partially-filled `OutboxSweeperOpts` will therefore run unbounded ticks, which undercuts the new lock-release guard.
Suggested change

     func NewOutboxSweeper(db *sql.DB, scheduler *Scheduler, opts OutboxSweeperOpts) *Sweeper {
         if opts.Interval <= 0 {
             opts.Interval = 500 * time.Millisecond
         }
    @@
         if opts.MaxAttempts <= 0 {
             opts.MaxAttempts = DefaultOutboxMaxAttempts
         }
    +    if opts.StatementTimeout <= 0 {
    +        opts.StatementTimeout = DefaultOutboxSweeperOpts().StatementTimeout
    +    }
         return &Sweeper{db: db, scheduler: scheduler, opts: opts}
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/broker/outbox.go` around lines 72 - 88, NewOutboxSweeper currently normalizes several fields on OutboxSweeperOpts but leaves StatementTimeout at zero, allowing unbounded DB statements; update NewOutboxSweeper to set a sensible default (e.g., 30s) when opts.StatementTimeout <= 0 so the Sweeper uses a bounded statement timeout, referencing NewOutboxSweeper, OutboxSweeperOpts, StatementTimeout and Sweeper in your change.
🧹 Nitpick comments (1)
internal/broker/outbox_integration_test.go (1)
246-280: Assert the persisted terminal attempt count too.
This test already hits the terminal retry boundary, but it only checks row movement and `last_error`. Adding an `attempts` assertion would lock in the off-by-one fix and stop the `9`/`10` literals drifting away from `DefaultOutboxMaxAttempts`.
Suggested change

    -    _, err := db.ExecContext(context.Background(),
    -        `UPDATE task_outbox SET attempts = $1 WHERE id = $2`, 9, id)
    +    const maxAttempts = DefaultOutboxMaxAttempts
    +    _, err := db.ExecContext(context.Background(),
    +        `UPDATE task_outbox SET attempts = $1 WHERE id = $2`, maxAttempts-1, id)
         require.NoError(t, err)
    @@
    -        MaxAttempts: 10,
    +        MaxAttempts: maxAttempts,
         })
    @@
    -    var dead int
    -    var lastErr string
    +    var dead int
    +    var attempts int
    +    var lastErr string
         require.NoError(t, db.QueryRowContext(ctx,
    -        `SELECT COUNT(*), COALESCE(MAX(last_error), '')
    +        `SELECT COUNT(*), COALESCE(MAX(attempts), 0), COALESCE(MAX(last_error), '')
                FROM task_outbox_dead WHERE original_id = $1`, id,
    -    ).Scan(&dead, &lastErr))
    +    ).Scan(&dead, &attempts, &lastErr))
         assert.Equal(t, 1, dead, "dead-lettered row must appear in task_outbox_dead")
    +    assert.Equal(t, maxAttempts, attempts, "dead-lettered row must record the terminal attempt")
         assert.NotEmpty(t, lastErr, "last_error must capture the ScheduleBatch failure")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/broker/outbox_integration_test.go` around lines 246 - 280, The test seeds a row at attempts = 9 and checks movement to task_outbox_dead and last_error but doesn't assert the persisted attempts value, risking off-by-one regressions against DefaultOutboxMaxAttempts; update the test (around insertOutboxFixture, NewOutboxSweeper/OutboxSweeperOpts and the SELECT from task_outbox_dead) to also query and assert that the attempts column in the dead-letter row equals the expected terminal attempts (e.g., DefaultOutboxMaxAttempts or the MaxAttempts value passed into OutboxSweeper) so the terminal count is locked in by the test.
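The off-by-one that this assertion guards against can be illustrated with a small sketch. The `row` type and `afterFailure` helper are hypothetical and are not taken from `outbox.go`; the point is only that the bumped attempt count drives both the terminal decision and the value that gets persisted.

```go
package main

import "fmt"

// row is a hypothetical stand-in for a claimed task_outbox row.
type row struct {
	id       int64
	attempts int
}

// afterFailure returns the attempt count to persist and whether the row is
// terminal. The same bumped value is used for both, so a dead-letter row
// records the attempt that triggered the move.
func afterFailure(r row, maxAttempts int) (attempts int, deadLetter bool) {
	attempts = r.attempts + 1
	return attempts, attempts >= maxAttempts
}

func main() {
	const maxAttempts = 10
	attempts, dead := afterFailure(row{id: 1, attempts: maxAttempts - 1}, maxAttempts)
	fmt.Println(attempts, dead) // prints 10 true
}
```

Seeding the fixture at `maxAttempts - 1`, as the suggested test change does, exercises exactly this boundary.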
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/diagnostics/outbox-aging-investigation.md`:
- Around line 3-4: The markdown breaks the phrase "PR `#340`" across lines causing
MD018; in docs/diagnostics/outbox-aging-investigation.md locate the sentence
containing "Status: historical — investigation notes captured before the fixes
landed in PR" and either join the following line so "PR `#340`." stays on the same
line as that sentence or escape the hash as "\#340" to prevent it being treated
as a header; update the sentence in place (do not change surrounding content) so
the document passes MD018.
- Around line 218-225: The doc currently implies both cancel and archive cleanup
landed, but only jobs.CancelJob was implemented; update the Outcome table and
surrounding text (the "3. Cancel/archive cleanup" row and the Line 218 lead-in)
to say only "Cancel cleanup" or similar and reference jobs.CancelJob and its
behavior (deletes task_outbox rows in the same tx) so ops readers aren't misled
into thinking archive cleanup exists when it does not.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8f72022b-003f-4b93-aa38-f9597bd5b368
📒 Files selected for processing (3)
- docs/diagnostics/outbox-aging-investigation.md
- internal/broker/outbox.go
- internal/broker/outbox_integration_test.go
Status: historical — investigation notes captured before the fixes landed in PR
#340. Code pointers and line numbers refer to the pre-fix tree. The "Suggested
Keep `PR #340` on the same line.
The hard wrap leaves #340 at the start of Line 4, which trips MD018 and can render oddly in Markdown. Fold that sentence onto one line, or escape the hash.
Suggested change
-Status: historical — investigation notes captured before the fixes landed in PR
-#340. Code pointers and line numbers refer to the pre-fix tree. The "Suggested
+Status: historical — investigation notes captured before the fixes landed in PR `#340`. Code pointers and line numbers refer to the pre-fix tree. The "Suggested
fixes" section has all been implemented; see the Outcome section at the bottom
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 4-4: No space after hash on atx style heading
(MD018, no-missing-space-atx)
Every suggested fix above was implemented; this section records what landed so
the doc stays useful as an ops reference.

| Suggested fix                 | Implemented as                                                                                                                 |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| 1. Per-entry failure tracking | `*broker.BatchError` (`internal/broker/scheduler.go`) + partition-by-index in `Sweeper.Tick` (`internal/broker/outbox.go`).    |
| 2. Dead-letter cap            | `OutboxSweeperOpts.MaxAttempts` (default 10) + `task_outbox_dead` table (migration `20260423132003_outbox_dead_letter.sql`).   |
| 3. Cancel/archive cleanup     | `jobs.CancelJob` now deletes the job's `task_outbox` rows in the same tx (`internal/jobs/manager.go`).                         |
Don’t imply archive cleanup landed if this PR only wires CancelJob.
Suggested fix 3 is still phrased as cancel/archive, but the Outcome table only maps jobs.CancelJob. With the Line 218 lead-in, ops readers can easily infer archive cleanup exists when the table says otherwise.
Suggested change
-Every suggested fix above was implemented; this section records what landed so
+Most suggested fixes above were implemented; this section records what landed so
the doc stays useful as an ops reference.
@@
-| 3. Cancel/archive cleanup | `jobs.CancelJob` now deletes the job's `task_outbox` rows in the same tx (`internal/jobs/manager.go`). |
+| 3. Cancel cleanup             | `jobs.CancelJob` now deletes the job's `task_outbox` rows in the same tx (`internal/jobs/manager.go`).                                    |
|
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/start.sh (1)
12-32: ⚠️ Potential issue | 🟡 Minor
Validate `APP_BIN` before starting Alloy.
If `./${APP_BIN}` is invalid, the script exits at Line 31 after Alloy has already been spawned. Reorder this so startup fails cleanly before any sidecar process is launched.
Proposed diff

     APP_BIN="${1:-main}"

    +if [ ! -x "./${APP_BIN}" ]; then
    +  echo "start.sh: ./${APP_BIN} is not executable in $(pwd)" >&2
    +  exit 127
    +fi
    +
     # Start Alloy metrics agent in background (skipped if either credential is absent)
     alloy_pid=""
     if [ -n "$GRAFANA_CLOUD_API_KEY" ] && [ -n "$GRAFANA_CLOUD_USER" ]; then
       echo "Starting Alloy metrics agent for ${APP_BIN}"
       /usr/local/bin/alloy run --storage.path=/tmp/alloy-wal /app/alloy.river &
       alloy_pid=$!
     else
       echo "Grafana Cloud credentials not fully set, skipping metrics agent"
     fi
    @@
    -if [ ! -x "./${APP_BIN}" ]; then
    -  echo "start.sh: ./${APP_BIN} is not executable in $(pwd)" >&2
    -  exit 127
    -fi

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/start.sh` around lines 12 - 32, Validate the application binary before launching the Alloy sidecar: move the executable check for ./${APP_BIN} (the if [ ! -x "./${APP_BIN}" ] block) so it runs before the Alloy spawn code that starts /usr/local/bin/alloy run and sets alloy_pid; ensure the script exits (exit 127) if APP_BIN is not executable and only then proceed to start Alloy, preserving the term() function and trap INT TERM behavior so no alloy process is launched when the startup should fail.
🧹 Nitpick comments (1)
scripts/start.sh (1)
6-10: Consider restricting `APP_BIN` to known roles (main/worker).
Allowing arbitrary values for `$1` makes misconfiguration easier; a small guard gives deterministic failures and a tighter startup contract.
Proposed diff

     APP_BIN="${1:-main}"
    +case "$APP_BIN" in
    +  main|worker) ;;
    +  *)
    +    echo "start.sh: unsupported binary '$APP_BIN' (allowed: main, worker)" >&2
    +    exit 64
    +    ;;
    +esac

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/start.sh` around lines 6 - 10, The script currently assigns APP_BIN="${1:-main}" allowing arbitrary binaries; add a validation step after reading $1 to restrict allowed values to "main" or "worker" (default to "main" when empty) and exit non-zero with a clear error message if an invalid role is supplied. Update the start.sh logic that sets APP_BIN to normalize/validate the input and fail fast, referencing the APP_BIN variable to enforce this tight startup contract.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1a0bddbb-64df-43d5-b95d-aecb1282dcd3
📒 Files selected for processing (4)
- .fly/review_apps.worker.toml
- CHANGELOG.md
- fly.worker.toml
- scripts/start.sh
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
|
Superseded by #342, which now contains all of this plus the Alloy sidecar fix and the 42P05 counter-sync fix on one branch. |
Summary
Investigation + fixes for the `bee.broker.outbox_age_seconds` gauge climbing to 2.78 h during production runs and sawtoothing to 5–11 h afterwards.
See docs/diagnostics/outbox-aging-investigation.md for the full ranked hypotheses and diagnostic queries.
Fixes (together expected to cut peak age by roughly 90% or more)
- `ScheduleBatch` per-entry failures: returns typed `*BatchError` with `FailedIndices`; the sweeper DELETEs the successes and only bumps attempts on the entries that actually failed. Previously a single flaky ZADD bumped all 500 rows.
- `task_outbox_dead`: rows past `MaxAttempts` (default 10) move atomically with the failing error message attached. Bounds worst-case age to `MaxAttempts × MaxBackoff` = ~50 min regardless of which hypothesis is the real driver.
- `CancelJob` outbox cleanup: deletes `task_outbox` rows for the cancelled job in the same tx as the status flip.
- `statement_timeout` on sweep tx: 5 s budget so a wedged sweeper backend can't hold locks indefinitely (self-heals SKIP LOCKED starvation if the sweeper itself is the offender).
- `bee.broker.outbox_sweep_total` counter: `outcome={dispatched, retried, dead_lettered}` labels so future incidents are diagnosable without a DB session.
What's NOT changed
- `ScheduleBatch` public contract for non-sweeper callers: still a plain `error`; callers that want partial-failure info can use `errors.As` with a `*BatchError` target.
Test plan
- `go test ./internal/broker/ ./internal/jobs/ ./internal/observability/`: pass.
- `scripts/security-check.sh`: clean (govulncheck, gosec, ESLint).
- `TestOutboxSweeper_DeadLetter`, `TestOutboxSweeper_PartialFailure`, existing `HappyPath` + `ConcurrentClaim` + `RedisDown_RetriesSucceed` against Supabase preview branch in CI.
- `20260423132003_outbox_dead_letter.sql` applies cleanly on preview branch.
- `bee.broker.outbox_sweep_total{outcome}` split in Grafana after deploy to confirm the counters are wired.
Summary by CodeRabbit
New Features
Changed