
Stabilise worker under heavy load #346

Merged
simonsmallchua merged 8 commits into main from claude/unruffled-khayyam-f70265
Apr 25, 2026

Conversation

@simonsmallchua
Contributor

@simonsmallchua simonsmallchua commented Apr 24, 2026

Summary

Addresses the escalating Sentry issues from the 2026-04-24 overnight load-test window (12:24 → 15:23 UTC, ~3h heavy job load). Two independent deadlocks, one goroutine leak, one timeout regression, plus Mimir cardinality control and config/log cleanup.

P0 — deadlocks

  • HOVER-K4 (40P01 on jobs during counter sync, 2,086 events, escalating) — DefaultDBSyncFunc iterated a Go map in random order, so concurrent sync ticks on different worker VMs acquired jobs row locks in different sequences. internal/broker/counters.go: sort jobIDs before the per-job UPDATE loop and wrap the zero-stale UPDATE in a WITH targets AS (... ORDER BY id FOR UPDATE) CTE.
  • HOVER-K2 (40P01 on tasks/task_outbox during promotion, 3,172 events) — promote_waiting_with_outbox ordered only by priority_score, created_at, leaving the tie-breaker non-deterministic. Concurrent promoters could lock task rows in different sequences; the trg_update_job_queue_counters AFTER trigger then updated jobs on top of that, completing the cycle. New migration 20260425000001_promote_waiting_deterministic_lock_order.sql adds id ASC to the picked CTE, inserts a picked_ordered stage that re-sorts by id, and orders the outbox insert by id.
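The counter-sync half of the fix depends on every worker visiting jobs in the same order. A minimal sketch of that idea follows. The `sortedJobIDs` helper, the string job IDs, and the printed statement are illustrative assumptions, not the code in internal/broker/counters.go.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedJobIDs returns the keys of a per-job counter map in ascending order.
// Go randomises map iteration order, so ranging over the map directly would
// issue the per-job UPDATEs (and take their row locks) in a different
// sequence on every tick and on every worker.
func sortedJobIDs(counters map[string]int64) []string {
	ids := make([]string, 0, len(counters))
	for id := range counters {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Hypothetical running-task counters keyed by job ID.
	counters := map[string]int64{"job-c": 3, "job-a": 7, "job-b": 1}

	for _, id := range sortedJobIDs(counters) {
		// The real loop would run the per-job UPDATE here; printing stands in for it.
		fmt.Printf("UPDATE jobs SET running_tasks = %d WHERE id = %q\n", counters[id], id)
	}
}
```

Which order is chosen matters less than every concurrent writer using the same one: once all lockers agree on a total order, no cycle of waiters can form.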

P1 — bounded link-discovery fan-out

  • HOVER-KG (pool saturation on handleDiscoveredLinks, 2,468 events, 2,153 live goroutines at one event) — fan-out had no ceiling. internal/jobs/stream_worker.go: added linkDiscoverySem semaphore sized via JOBS_LINK_DISCOVERY_MAX_INFLIGHT (default 32).

P2 — outbox sweeper budget

  • HOVER-K3 (outbox bump attempts hitting context deadline, 645 events) — sweeper shared the shedding pool and lost the race when pressure shed was active. internal/broker/outbox.go raises StatementTimeout 5s → 15s, env-overridable via OUTBOX_SWEEP_STATEMENT_TIMEOUT_MS wired through cmd/worker/main.go.

P3 — config / log cleanup

  • Removed redundant GNH_PRESSURE_INITIAL_LIMIT = "100" from fly.toml and fly.worker.toml. pressure.go defaults it to DB_QUEUE_MAX_CONCURRENCY, so setting it explicitly only invites drift and the "exceeds queue cap" clamp warning. ⚠️ Post-merge: flyctl secrets unset GNH_PRESSURE_INITIAL_LIMIT --app hover and --app hover-worker (done).
  • Flipped ForceAttemptHTTP2: true → false at two crawler call sites to silence 1,226+ DATA after END_STREAM log lines; ALPN still negotiates H1.
  • Demoted the body-cap skip log from Warn to Debug.

Mimir cardinality

  • Dropped job.id label from broker stream gauges, bee_jobs_* gauges, and bee.broker.counter_sync_skew histogram. probe.go now accumulates StreamLength/ScheduledDepth/Pending totals across active jobs and emits once per tick, so dashboard sum(...) queries continue to return the total. Cardinality goes from 6N+1 to 7 series per worker (~85× reduction at N=100 jobs × 30 workers).
  • Per-job drill-down on these metrics is intentionally gone — use trace spans (which still carry task.domain/task.id/job.id at observability.go:729-732) for per-job debugging.
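The aggregation and the cardinality arithmetic can be sketched as follows. The struct and function names are illustrative rather than the ones in probe.go; the three field names and the 6N+1 and 7 series counts come from the description above.

```go
package main

import "fmt"

// streamStats holds the three per-job depths named in the PR description.
type streamStats struct {
	StreamLength   int64
	ScheduledDepth int64
	Pending        int64
}

// aggregate sums per-job stats into one total, so a single label-free sample
// per metric can be emitted each tick.
func aggregate(perJob map[string]streamStats) streamStats {
	var total streamStats
	for _, s := range perJob {
		total.StreamLength += s.StreamLength
		total.ScheduledDepth += s.ScheduledDepth
		total.Pending += s.Pending
	}
	return total
}

// seriesPerWorkerBefore is the count quoted in the PR: 6N+1 series for N jobs.
func seriesPerWorkerBefore(jobs int) int { return 6*jobs + 1 }

// seriesPerWorkerAfter is the fixed count once job.id is dropped.
const seriesPerWorkerAfter = 7

func main() {
	// Hypothetical depths for two active jobs.
	total := aggregate(map[string]streamStats{
		"job-a": {StreamLength: 120, ScheduledDepth: 4, Pending: 9},
		"job-b": {StreamLength: 80, ScheduledDepth: 1, Pending: 3},
	})
	fmt.Printf("emit once per tick: %+v\n", total)

	before := seriesPerWorkerBefore(100)
	fmt.Printf("series per worker: %d before, %d after, ratio %.1f\n",
		before, seriesPerWorkerAfter, float64(before)/float64(seriesPerWorkerAfter))
}
```

At N=100 the per-worker count is 6×100+1 = 601 series before and 7 after, and 601/7 ≈ 85.9, which is where the ~85× figure comes from. The worker count multiplies both sides equally, so the ratio is the same across 30 workers.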

Out of scope

  • Larger Postgres pool — lock contention, not pool size, was the cap.
  • Supabase plan / pgbouncer mode changes.
  • API side (hover) — clean during the window.
  • Sentry environment-scoping for HOVER-KC / HOVER-JS "relation does not exist" leak — handled via Sentry UI inbound filters, no code change.

Test plan

  • go build ./...
  • go vet ./...
  • go test -count=1 -short ./internal/broker/... ./internal/db/... ./internal/jobs/... ./internal/crawler/... ./internal/observability/...
  • gofmt clean on all touched Go files
  • CI: migration applies cleanly against a fresh Supabase branch
  • Load-test rerun on prod: confirm HOVER-K4, HOVER-K2, HOVER-K3 → 0 events; HOVER-KG → < 50/h; "DB pressure at floor" plateau no longer sustained
  • Grafana: confirm sum(bee_broker_scheduled_zset_depth) and sum(bee_broker_consumer_pending) still render correctly after metric-label change (expect a brief zombie-series window in Mimir)

Summary by CodeRabbit

  • New Features

    • Add env var to configure outbox sweeper statement timeout.
    • Global concurrency limiter for link-discovery processing.
  • Bug Fixes

    • Deterministic database locking to prevent concurrent deadlocks.
    • Deterministic lock-ordering for promoting waiting tasks.
  • Performance Improvements

    • Increased default outbox sweeper timeout for stability.
    • Aggregate stream metrics to reduce cardinality.
    • Disable forced HTTP/2 to reduce noisy logs.
  • Chores

    • Remove explicit pressure initial limit from runtime configs; adjust log level to info.
    • Demote oversized upload logs from warning to debug.

Fixes the two root causes Sentry exposed during the 2026-04-24 overnight
load test: deterministic lock ordering in promote_waiting_with_outbox
(HOVER-K2) and in counters.go DefaultDBSyncFunc (HOVER-K4), so concurrent
promoters and counter syncs no longer deadlock on tasks/jobs.

Also bounds link-discovery fan-out (HOVER-KG, 2,153 live goroutines),
extends the outbox sweeper timeout (HOVER-K3), removes the redundant
GNH_PRESSURE_INITIAL_LIMIT env var from prod fly tomls, flips
ForceAttemptHTTP2 off to silence H2 DATA-after-END_STREAM log noise,
demotes the body-cap warn to debug, and drops job.id labels from broker
gauges / skew histogram to cut Mimir series cardinality ~85x
(probe aggregates totals once per tick so dashboard sum() queries still
work).
@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5ab6119f-48d1-4246-be49-e36ec2902588

📥 Commits

Reviewing files that changed from the base of the PR and between 09bcfa2 and a9e8066.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • fly.toml
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • fly.toml

📝 Walkthrough

Walkthrough

Adds an env-configurable outbox sweeper statement-timeout (default raised to 15s), enforces deterministic DB row-lock ordering for counter sync and task promotion (with SQL migration), reduces metric cardinality by removing job IDs and aggregating stream stats, introduces a global link-discovery semaphore, and centralizes HTTP transport creation.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Configuration & Env | cmd/worker/main.go, fly.toml, fly.worker.toml | Introduce OUTBOX_SWEEP_STATEMENT_TIMEOUT_MS parsed in worker startup to override sweeper statement timeout; remove explicit GNH_PRESSURE_INITIAL_LIMIT env entries and set LOG_LEVEL to info. |
| Outbox Sweeper Defaults | internal/broker/outbox.go, cmd/worker/main.go | Increase default per-tick StatementTimeout from 5s → 15s and allow overriding via environment variable. |
| Broker DB Sync / Locking | internal/broker/counters.go, supabase/migrations/20260425000001_promote_waiting_deterministic_lock_order.sql | Make DB lock acquisition deterministic: sort job IDs and use ORDER BY id FOR UPDATE CTEs in sync; replace promote_waiting_with_outbox with a deterministic-locking SQL function to avoid deadlocks. |
| Probe & Observability | internal/broker/probe.go, internal/observability/observability.go | Aggregate stream metrics across jobs into a single per-tick emission and remove job.id from metric labels/BrokerStreamStats to reduce cardinality; refactor probeJob to return counts for aggregation. |
| Link-discovery Concurrency | internal/jobs/stream_worker.go | Add global semaphore on StreamWorkerPool to cap link-discovery concurrency (JOBS_LINK_DISCOVERY_MAX_INFLIGHT, default 32); acquire slot in handleDiscoveredLinks. |
| Crawler Transport | internal/crawler/crawler.go | Extract newBaseHTTPTransport() and reuse it for clients/probes; set ForceAttemptHTTP2: false to avoid noisy HTTP/2 logs. |
| Logging | internal/jobs/executor.go | Demote oversized HTML-upload log from Warn → Debug. |
| Changelog / Docs | CHANGELOG.md | Document fixes/changes: deterministic locking, outbox timeout/env override, metric cardinality reduction, link-discovery semaphore, and transport centralization. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰 I tidy locks in patient rows,
Timeouts lengthen as the traffic grows,
Metrics trimmed to steady streams,
Semaphore keeps chaotic dreams,
I hop away with grateful nose 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Stabilise worker under heavy load' accurately summarizes the main objective of the PR, which addresses worker failures and stability issues from a ~3-hour load test by fixing deadlocks, bounding concurrency, and adjusting timeouts. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@supabase

supabase Bot commented Apr 24, 2026

Updates to Preview Branch (claude/unruffled-khayyam-f70265) ↗︎

Deployments Status Updated
Database Sat, 25 Apr 2026 00:41:33 UTC
Services Sat, 25 Apr 2026 00:41:33 UTC
APIs Sat, 25 Apr 2026 00:41:33 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Sat, 25 Apr 2026 00:41:35 UTC
Migrations Sat, 25 Apr 2026 00:41:36 UTC
Seeding Sat, 25 Apr 2026 00:41:38 UTC
Edge Functions Sat, 25 Apr 2026 00:41:38 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

Release Versions

App patch: v0.33.3 → v0.33.4

Changelog

Fixed

  • promote_waiting_with_outbox now sorts the picked CTE with id ASC as a
    tie-breaker and orders the outbox INSERT … SELECT by id, making concurrent
    promoters lock task rows in a deterministic sequence. The AFTER trigger
    trg_update_job_queue_counters then updates the parent jobs row on top of
    an acyclic lock graph, so the 40P01 deadlock that fired ~3,000 times during
    the 2026-04-24 overnight load test (Sentry HOVER-K2) cannot form. Migration
    20260425000001_promote_waiting_deterministic_lock_order.
  • broker.DefaultDBSyncFunc now sorts jobIDs before issuing per-job
    UPDATE jobs SET running_tasks = … and wraps the zero-stale UPDATE in a
    WITH targets AS (… ORDER BY id FOR UPDATE) UPDATE jobs … FROM targets CTE.
    Previously the sync loop iterated a Go map (random order), so two worker VMs
    ticking concurrently locked jobs rows in different sequences and deadlocked
    (Sentry HOVER-K4, ~2,000 events/12h, escalating).
  • StreamWorkerPool.handleDiscoveredLinks now acquires from a bounded semaphore
    (JOBS_LINK_DISCOVERY_MAX_INFLIGHT, default 32) before each link is enqueued.
    Previously the fan-out was unbounded; one Sentry event captured 2,153 live
    goroutines in the worker (HOVER-KG, ~2,500 events/12h).
  • Outbox sweeper StatementTimeout raised from 5 s to 15 s and made
    env-overridable via OUTBOX_SWEEP_STATEMENT_TIMEOUT_MS, so the
    bump attempts UPDATE no longer hits the context deadline when the shedding
    pool is under pressure (Sentry HOVER-K3).
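The lock-ordering idea in the first Fixed item can be illustrated outside SQL. This sketch mirrors the two stages described for the migration: pick rows using id as the final tie-breaker, then re-sort the picked rows by id before locking. The integer IDs, the sort directions for priority_score and created_at, and the function names are assumptions made for illustration; the migration itself is SQL and is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// task carries only the columns that take part in the ordering.
type task struct {
	ID            int64   // assumed integer ID to keep the sketch short
	PriorityScore float64
	CreatedAt     int64 // Unix seconds
}

// less orders by priority_score, then created_at, then id. Without the final
// id comparison, tasks with equal score and timestamp have no defined order.
func less(a, b task) bool {
	if a.PriorityScore != b.PriorityScore {
		return a.PriorityScore > b.PriorityScore // direction is an assumption
	}
	if a.CreatedAt != b.CreatedAt {
		return a.CreatedAt < b.CreatedAt // direction is an assumption
	}
	return a.ID < b.ID
}

// pick returns the first limit tasks under the total order defined by less.
func pick(tasks []task, limit int) []task {
	sorted := append([]task(nil), tasks...)
	sort.Slice(sorted, func(i, j int) bool { return less(sorted[i], sorted[j]) })
	if limit > len(sorted) {
		limit = len(sorted)
	}
	return sorted[:limit]
}

// lockOrder returns the picked task IDs in ascending order, standing in for
// the stage that re-sorts by id before rows are locked.
func lockOrder(picked []task) []int64 {
	ids := make([]int64, 0, len(picked))
	for _, t := range picked {
		ids = append(ids, t.ID)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	// Three tasks tie on score and timestamp; one outranks them.
	tasks := []task{
		{ID: 3, PriorityScore: 1, CreatedAt: 100},
		{ID: 1, PriorityScore: 1, CreatedAt: 100},
		{ID: 2, PriorityScore: 1, CreatedAt: 100},
		{ID: 4, PriorityScore: 5, CreatedAt: 200},
	}
	fmt.Println("lock order:", lockOrder(pick(tasks, 3)))
}
```

Because the comparator is a total order, two promoters that see the same candidate rows pick the same set no matter how the rows arrive, and both then lock that set in ascending id order.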

Changed

  • Per-job job.id label removed from the broker stream-depth gauges
    (bee.broker.stream_length, bee.broker.scheduled_zset_depth,
    bee.broker.consumer_pending), the running-tasks / concurrency-limit gauges
    (bee.jobs.running_tasks, bee.jobs.concurrency_limit), and the counter-sync
    skew histogram (bee.broker.counter_sync_skew). The probe loop now sums
    per-job depths across all active jobs and emits one aggregate per tick, so
    dashboard sum(...) queries continue to render the correct totals while Mimir
    series cardinality drops from 6N + 1 to 7 per worker (≈85× reduction at
    100 jobs × 30 workers). Per-job drill-down on these metrics is intentionally
    gone — use trace spans (which still carry task.domain / task.id /
    job.id) for that.
  • Crawler http.Transport configuration centralised in newBaseHTTPTransport.
    Both CreateHTTPClient and the colly base transport now derive from the
    helper, and the probe-only HEAD client derives from it too with overrides for
    its smaller pool. Among other things this makes ForceAttemptHTTP2: false
    apply uniformly across the three clients.
  • ForceAttemptHTTP2: false on crawler transports silences ~1,200
    received DATA after END_STREAM log lines per heavy-load window; ALPN still
    negotiates HTTP/1.1 and per-request throughput is unchanged.
  • HTML body-cap skip log demoted from Warn to Debug — a single large domain was
    producing thousands of warns per job.
  • GNH_PRESSURE_INITIAL_LIMIT removed from fly.toml and fly.worker.toml.
    pressure.go already defaults the initial limit to DB_QUEUE_MAX_CONCURRENCY
    (the safe maximum), so setting it explicitly only invited drift and the
    "exceeds queue cap" clamp warning when the two values got out of sync.
  • Production API LOG_LEVEL lowered from debug to info in fly.toml, now
    matching the worker. Debug verbosity was retained while we were chasing the
    load-test issues and is no longer needed; review apps stay on debug.

@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 17.04545% with 73 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/jobs/stream_worker.go | 0.00% | 24 Missing ⚠️ |
| internal/broker/counters.go | 0.00% | 23 Missing ⚠️ |
| internal/broker/probe.go | 0.00% | 13 Missing ⚠️ |
| internal/observability/observability.go | 0.00% | 8 Missing ⚠️ |
| cmd/worker/main.go | 0.00% | 2 Missing ⚠️ |
| internal/broker/outbox.go | 0.00% | 1 Missing ⚠️ |
| internal/crawler/crawler.go | 93.75% | 1 Missing ⚠️ |
| internal/jobs/executor.go | 0.00% | 1 Missing ⚠️ |


@github-actions
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-346.fly.dev
Dashboard: https://hover-pr-346.fly.dev/dashboard

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
internal/crawler/crawler.go (1)

1214-1222: Consider extracting shared transport defaults to avoid drift.

Lines 1215–1223 duplicate the base transport shape from lines 372–380. A small helper for common transport config would make future tuning safer.

♻️ Optional refactor sketch
+func newBaseTransport() *http.Transport {
+	return &http.Transport{
+		MaxIdleConns:        150,
+		MaxIdleConnsPerHost: 25,
+		MaxConnsPerHost:     50,
+		IdleConnTimeout:     120 * time.Second,
+		TLSHandshakeTimeout: 10 * time.Second,
+		DisableCompression:  true,
+		ForceAttemptHTTP2:   false,
+	}
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/crawler/crawler.go` around lines 1214 - 1222, The duplicated
http.Transport setup should be extracted into a shared helper to prevent drift:
create a function (e.g. newBaseTransport or configureTransportDefaults) that
returns *http.Transport populated with the common fields (MaxIdleConns,
MaxIdleConnsPerHost, MaxConnsPerHost, IdleConnTimeout, TLSHandshakeTimeout,
DisableCompression, ForceAttemptHTTP2) and replace both the transport
initializations (the local transport variable and the earlier baseTransport) to
call that helper, then apply any call-site specific overrides on the returned
transport as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 22d9797c-026a-4319-b24d-2a4585ab41c7

📥 Commits

Reviewing files that changed from the base of the PR and between 1809fed and 15d146f.

📒 Files selected for processing (11)
  • cmd/worker/main.go
  • fly.toml
  • fly.worker.toml
  • internal/broker/counters.go
  • internal/broker/outbox.go
  • internal/broker/probe.go
  • internal/crawler/crawler.go
  • internal/jobs/executor.go
  • internal/jobs/stream_worker.go
  • internal/observability/observability.go
  • supabase/migrations/20260425000001_promote_waiting_deterministic_lock_order.sql

Both http.Transport literals in crawler.go were byte-identical after the
ForceAttemptHTTP2 flip. Pull them into newBaseHTTPTransport() so future
tuning can't leave the two sites out of sync.
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
internal/crawler/crawler.go (1)

1213-1232: Consider deriving probe transport from the shared helper too.

You’ve centralised two call sites well; doing the same for the probe client will further reduce config drift (timeouts/H2 posture) while preserving probe-specific limits.

♻️ Suggested refactor
-	probeTransport := &http.Transport{
-		MaxIdleConns:        20,
-		MaxIdleConnsPerHost: 5,
-		MaxConnsPerHost:     10,
-		IdleConnTimeout:     30 * time.Second,
-		TLSHandshakeTimeout: 10 * time.Second,
-	}
+	probeTransport := newBaseHTTPTransport()
+	probeTransport.MaxIdleConns = 20
+	probeTransport.MaxIdleConnsPerHost = 5
+	probeTransport.MaxConnsPerHost = 10
+	probeTransport.IdleConnTimeout = 30 * time.Second
+	probeTransport.TLSHandshakeTimeout = 10 * time.Second
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/crawler/crawler.go` around lines 1213 - 1232, The probe HTTP client
should reuse the shared transport from newBaseHTTPTransport to avoid config
drift: change the probe client initialization to call newBaseHTTPTransport(),
then apply probe-specific overrides (e.g. adjust
MaxConnsPerHost/MaxIdleConnsPerHost/IdleConnTimeout or other limits) on the
returned *http.Transport before using it; preserve existing SSRF-safe
DialContext and any round-trip wrappers currently attached to the probe client
so only the transport defaults are centralized while probe limits remain
localized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b5c3ef4e-fb56-4423-8fee-b450177eecbe

📥 Commits

Reviewing files that changed from the base of the PR and between 15d146f and ee162fa.

📒 Files selected for processing (1)
  • internal/crawler/crawler.go
