Stabilise worker under heavy load#346
Conversation
Fixes the two root causes Sentry exposed during the 2026-04-24 overnight load test: deterministic lock ordering in promote_waiting_with_outbox (HOVER-K2) and in counters.go DefaultDBSyncFunc (HOVER-K4), so concurrent promoters and counter syncs no longer deadlock on tasks/jobs. Also bounds link-discovery fan-out (HOVER-KG, 2,153 live goroutines), extends the outbox sweeper timeout (HOVER-K3), removes the redundant GNH_PRESSURE_INITIAL_LIMIT env var from prod fly tomls, flips ForceAttemptHTTP2 off to silence H2 DATA-after-END_STREAM log noise, demotes the body-cap warn to debug, and drops job.id labels from broker gauges / skew histogram to cut Mimir series cardinality ~85x (probe aggregates totals once per tick so dashboard sum() queries still work).
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds an env-configurable outbox sweeper statement-timeout (default raised to 15s), enforces deterministic DB row-lock ordering for counter sync and task promotion (with SQL migration), reduces metric cardinality by removing job IDs and aggregating stream stats, introduces a global link-discovery semaphore, and centralizes HTTP transport creation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Updates to Preview Branch (claude/unruffled-khayyam-f70265) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
Release VersionsApp patch: ChangelogFixed
Changed
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
🐝 Review App Deployed Homepage: https://hover-pr-346.fly.dev |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/crawler/crawler.go (1)
1214-1222: Consider extracting shared transport defaults to avoid drift.Line 1215–1223 duplicates the base transport shape from Line 372–380. A small helper for common transport config would make future tuning safer.
♻️ Optional refactor sketch
+func newBaseTransport() *http.Transport { + return &http.Transport{ + MaxIdleConns: 150, + MaxIdleConnsPerHost: 25, + MaxConnsPerHost: 50, + IdleConnTimeout: 120 * time.Second, + TLSHandshakeTimeout: 10 * time.Second, + DisableCompression: true, + ForceAttemptHTTP2: false, + } +}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/crawler/crawler.go` around lines 1214 - 1222, The duplicated http.Transport setup should be extracted into a shared helper to prevent drift: create a function (e.g. newBaseTransport or configureTransportDefaults) that returns *http.Transport populated with the common fields (MaxIdleConns, MaxIdleConnsPerHost, MaxConnsPerHost, IdleConnTimeout, TLSHandshakeTimeout, DisableCompression, ForceAttemptHTTP2) and replace both the transport initializations (the local transport variable and the earlier baseTransport) to call that helper, then apply any call-site specific overrides on the returned transport as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/crawler/crawler.go`:
- Around line 1214-1222: The duplicated http.Transport setup should be extracted
into a shared helper to prevent drift: create a function (e.g. newBaseTransport
or configureTransportDefaults) that returns *http.Transport populated with the
common fields (MaxIdleConns, MaxIdleConnsPerHost, MaxConnsPerHost,
IdleConnTimeout, TLSHandshakeTimeout, DisableCompression, ForceAttemptHTTP2) and
replace both the transport initializations (the local transport variable and the
earlier baseTransport) to call that helper, then apply any call-site specific
overrides on the returned transport as needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 22d9797c-026a-4319-b24d-2a4585ab41c7
📒 Files selected for processing (11)
cmd/worker/main.gofly.tomlfly.worker.tomlinternal/broker/counters.gointernal/broker/outbox.gointernal/broker/probe.gointernal/crawler/crawler.gointernal/jobs/executor.gointernal/jobs/stream_worker.gointernal/observability/observability.gosupabase/migrations/20260425000001_promote_waiting_deterministic_lock_order.sql
Both http.Transport literals in crawler.go were byte-identical after the ForceAttemptHTTP2 flip. Pull them into newBaseHTTPTransport() so future tuning can't leave the two sites out of sync.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/crawler/crawler.go (1)
1213-1232: Consider deriving probe transport from the shared helper too.You’ve centralised two call sites well; doing the same for the probe client will further reduce config drift (timeouts/H2 posture) while preserving probe-specific limits.
♻️ Suggested refactor
- probeTransport := &http.Transport{ - MaxIdleConns: 20, - MaxIdleConnsPerHost: 5, - MaxConnsPerHost: 10, - IdleConnTimeout: 30 * time.Second, - TLSHandshakeTimeout: 10 * time.Second, - } + probeTransport := newBaseHTTPTransport() + probeTransport.MaxIdleConns = 20 + probeTransport.MaxIdleConnsPerHost = 5 + probeTransport.MaxConnsPerHost = 10 + probeTransport.IdleConnTimeout = 30 * time.Second + probeTransport.TLSHandshakeTimeout = 10 * time.Second🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/crawler/crawler.go` around lines 1213 - 1232, The probe HTTP client should reuse the shared transport from newBaseHTTPTransport to avoid config drift: change the probe client initialization to call newBaseHTTPTransport(), then apply probe-specific overrides (e.g. adjust MaxConnsPerHost/MaxIdleConnsPerHost/IdleConnTimeout or other limits) on the returned *http.Transport before using it; preserve existing SSRF-safe DialContext and any round-trip wrappers currently attached to the probe client so only the transport defaults are centralized while probe limits remain localized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/crawler/crawler.go`:
- Around line 1213-1232: The probe HTTP client should reuse the shared transport
from newBaseHTTPTransport to avoid config drift: change the probe client
initialization to call newBaseHTTPTransport(), then apply probe-specific
overrides (e.g. adjust MaxConnsPerHost/MaxIdleConnsPerHost/IdleConnTimeout or
other limits) on the returned *http.Transport before using it; preserve existing
SSRF-safe DialContext and any round-trip wrappers currently attached to the
probe client so only the transport defaults are centralized while probe limits
remain localized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b5c3ef4e-fb56-4423-8fee-b450177eecbe
📒 Files selected for processing (1)
internal/crawler/crawler.go
|
🐝 Review App Deployed Homepage: https://hover-pr-346.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-346.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-346.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-346.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-346.fly.dev |
Summary
Addresses the escalating Sentry issues from the 2026-04-24 overnight load-test window (12:24 → 15:23 UTC, ~3h heavy job load). Two independent deadlocks, one goroutine leak, one timeout regression, plus Mimir cardinality control and config/log cleanup.
P0 — deadlocks
40P01onjobsduring counter sync, 2,086 events, escalating) —DefaultDBSyncFunciterated a Go map in random order, so concurrent sync ticks on different worker VMs acquiredjobsrow locks in different sequences.internal/broker/counters.go: sortjobIDsbefore the per-job UPDATE loop and wrap the zero-stale UPDATE in aWITH targets AS (... ORDER BY id FOR UPDATE)CTE.40P01ontasks/task_outboxduring promotion, 3,172 events) —promote_waiting_with_outboxordered only bypriority_score, created_at, leaving the tie-breaker non-deterministic. Concurrent promoters could lock task rows in different sequences; thetrg_update_job_queue_countersAFTER trigger then updatedjobson top of that, completing the cycle. New migration20260425000001_promote_waiting_deterministic_lock_order.sqladdsid ASCto the picked CTE, inserts apicked_orderedstage that re-sorts by id, and orders the outbox insert by id.P1 — bounded link-discovery fan-out
handleDiscoveredLinks, 2,468 events, 2,153 live goroutines at one event) — fan-out had no ceiling.internal/jobs/stream_worker.go: addedlinkDiscoverySemsemaphore sized viaJOBS_LINK_DISCOVERY_MAX_INFLIGHT(default 32).P2 — outbox sweeper budget
bump attemptshitting context deadline, 645 events) — sweeper shared the shedding pool and lost the race when pressure shed was active.internal/broker/outbox.goraisesStatementTimeout5s → 15s, env-overridable viaOUTBOX_SWEEP_STATEMENT_TIMEOUT_MSwired throughcmd/worker/main.go.P3 — config / log cleanup
GNH_PRESSURE_INITIAL_LIMIT = "100"fromfly.tomlandfly.worker.toml—pressure.godefaults it toDB_QUEUE_MAX_CONCURRENCY, so setting it explicitly only invites drift and the "exceeds queue cap" clamp warning.flyctl secrets unset GNH_PRESSURE_INITIAL_LIMIT --app hoverand--app hover-worker(done).ForceAttemptHTTP2: true → falseat two crawler call sites to silence 1,226+DATA after END_STREAMlog lines; ALPN still negotiates H1.Mimir cardinality
job.idlabel from broker stream gauges,bee_jobs_*gauges, andbee.broker.counter_sync_skewhistogram.probe.gonow accumulatesStreamLength/ScheduledDepth/Pendingtotals across active jobs and emits once per tick, so dashboardsum(...)queries continue to return the total. Cardinality goes from6N+1to7series per worker (~85× reduction at N=100 jobs × 30 workers).task.domain/task.id/job.idatobservability.go:729-732) for per-job debugging.Out of scope
hover) — clean during the window.Test plan
go build ./...go vet ./...go test -count=1 -short ./internal/broker/... ./internal/db/... ./internal/jobs/... ./internal/crawler/... ./internal/observability/...gofmtclean on all touched Go filessum(bee_broker_scheduled_zset_depth)andsum(bee_broker_consumer_pending)still render correctly after metric-label change (expect a brief zombie-series window in Mimir)Summary by CodeRabbit
New Features
Bug Fixes
Performance Improvements
Chores