Fix fd exhaustion from probe transport leak#297
Conversation
|
Updates to Preview Branch (fix/fd-exhaustion-under-load) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis change addresses file-descriptor exhaustion by replacing per-request HTTP client creation with a shared, reusable client in the crawler, adding FD usage monitoring and pressure calculation, gating DB pool acquisitions under FD pressure, adding DB pool env vars, and attempting to raise the container FD soft limit at startup. Changes
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Release VersionsApp patch: ChangelogFixed
Added
|
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 29-51: Run Prettier on the modified changelog hunk to fix CI
formatting errors: reformat the "Unreleased:patch" section (the headings and
bullet list items shown) using the project's Prettier config (e.g., run prettier
--write or your editor integration) so the Markdown spacing/line breaks and list
wrapping match the repo's formatting rules before merging.
In `@internal/crawler/crawler.go`:
- Around line 413-435: The probe HTTP client is using probeTransport directly so
HEAD cache-status probes skip the AIA chain-repair layer; wrap the probe
transport with the same AIA transport used by the main client (use
newAIATransport) before assigning it to Crawler.probeClient so probes use the
aia layer (i.e., pass probeTransport and the existing aiaRT into newAIATransport
and set that as the probe client's Transport in the Crawler constructor).
In `@internal/util/fdlimit.go`:
- Around line 3-7: The parser that reads the "Max open files" line currently
calls strconv.Atoi and silently leaves limit==0 on parse failure; change that
parsing logic to return a non-nil error when strconv.Atoi fails (do not treat 0
as success), and propagate that error back to callers so they can explicitly
disable FD-pressure guards/gauges; specifically update the code that locates the
"Max open files" line and the strconv.Atoi invocation to check and return the
parse error, and update any callers that assume a nil error on unparsable input
to handle the error path instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: da87cba9-4409-4d0a-97ad-91fd2724ef61
📒 Files selected for processing (7)
CHANGELOG.mdDockerfilefly.tomlinternal/crawler/crawler.gointernal/db/queue.gointernal/observability/observability.gointernal/util/fdlimit.go
| ## [Unreleased:patch] | ||
|
|
||
| ### Fixed | ||
|
|
||
| - Fix file descriptor exhaustion under load caused by per-request HTTP transport | ||
| leak in `CheckCacheStatus()` — each probe call created a new `http.Transport` | ||
| that held idle connections open indefinitely, exhausting 10240 fds during | ||
| sustained crawling | ||
| - Add shared `probeClient` to `Crawler` struct, reused across all cache probe | ||
| requests with proper idle connection limits | ||
| - Add global `MaxIdleConns` cap (150) to main crawler and `CreateHTTPClient` | ||
| transports to prevent unbounded idle socket accumulation across hosts | ||
| - Reduce production DB pool from 70 → 45 open connections via env vars to match | ||
| actual queue semaphore usage (40 concurrent ops + 5 reserved) | ||
| - Add file descriptor pressure detection in `ensurePoolCapacity` — rejects DB | ||
| operations early with `ErrPoolSaturated` (triggering existing retry/backoff) | ||
| when fd usage exceeds 90%, instead of failing with cryptic DNS errors | ||
| - Raise container fd soft limit to hard limit ceiling in Dockerfile as safety net | ||
|
|
||
| ### Added | ||
|
|
||
| - File descriptor observability gauges (`bee.process.fd.current`, | ||
| `bee.process.fd.limit`, `bee.process.fd.pressure`) for Grafana monitoring |
There was a problem hiding this comment.
Run Prettier on this changelog hunk before merge.
CI is already reporting a Prettier formatting warning, and this is the only touched Markdown file in the patch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CHANGELOG.md` around lines 29 - 51, Run Prettier on the modified changelog
hunk to fix CI formatting errors: reformat the "Unreleased:patch" section (the
headings and bullet list items shown) using the project's Prettier config (e.g.,
run prettier --write or your editor integration) so the Markdown spacing/line
breaks and list wrapping match the repo's formatting rules before merging.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/crawler/crawler.go (1)
363-367: 🧹 Nitpick | 🔵 TrivialExtract transport limits into shared constants/config to prevent drift.
MaxIdleConns,IdleConnTimeout, and TLS timeout values are now tuned in multiple places. Centralising these values will reduce future config skew betweenNewandCreateHTTPClient.♻️ Proposed refactor
+const ( + mainMaxIdleConns = 150 + mainIdleConnTimeout = 120 * time.Second + probeMaxIdleConns = 20 + probeIdleConnTimeout = 30 * time.Second + defaultTLSHandshakeTime = 10 * time.Second +) ... baseTransport := &http.Transport{ - MaxIdleConns: 150, // Global cap — prevents idle socket accumulation across hosts + MaxIdleConns: mainMaxIdleConns, // Global cap — prevents idle socket accumulation across hosts MaxIdleConnsPerHost: 25, MaxConnsPerHost: 50, - IdleConnTimeout: 120 * time.Second, - TLSHandshakeTimeout: 10 * time.Second, + IdleConnTimeout: mainIdleConnTimeout, + TLSHandshakeTimeout: defaultTLSHandshakeTime, ... probeTransport := &http.Transport{ - MaxIdleConns: 20, + MaxIdleConns: probeMaxIdleConns, MaxIdleConnsPerHost: 5, MaxConnsPerHost: 10, - IdleConnTimeout: 30 * time.Second, - TLSHandshakeTimeout: 10 * time.Second, + IdleConnTimeout: probeIdleConnTimeout, + TLSHandshakeTimeout: defaultTLSHandshakeTime, } ... transport := &http.Transport{ - MaxIdleConns: 150, // Global cap — prevents idle socket accumulation across hosts + MaxIdleConns: mainMaxIdleConns, // Global cap — prevents idle socket accumulation across hosts MaxIdleConnsPerHost: 25, MaxConnsPerHost: 50, - IdleConnTimeout: 120 * time.Second, - TLSHandshakeTimeout: 10 * time.Second, + IdleConnTimeout: mainIdleConnTimeout, + TLSHandshakeTimeout: defaultTLSHandshakeTime,Also applies to: 416-420, 1070-1075
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/crawler/crawler.go` around lines 363 - 367, Extract the repeated transport tuning values into package-level shared constants (e.g., DefaultMaxIdleConns, DefaultMaxIdleConnsPerHost, DefaultMaxConnsPerHost, DefaultIdleConnTimeout, DefaultTLSHandshakeTimeout) or a small config struct in the crawler package, then replace the hard-coded literals used in New (the transport initialization block) and CreateHTTPClient with those constants/config fields so both functions read the same source of truth; update any other places referenced in the review (the other transport init sites) to use the same constants/config to avoid drift and keep behavior identical across New and CreateHTTPClient.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/util/fdlimit.go`:
- Line 37: The error string returned in the function that parses
/proc/self/limits (the return statement returning current, 0, fmt.Errorf("Max
open files line not found in /proc/self/limits")) starts with an uppercase
letter and fails staticcheck ST1005; change that fmt.Errorf message to start
with a lowercase letter (e.g., "max open files line not found in
/proc/self/limits") so the error string conforms to the lint rule while keeping
the same contextual text and location in the function that reads/parses open
file limits.
---
Outside diff comments:
In `@internal/crawler/crawler.go`:
- Around line 363-367: Extract the repeated transport tuning values into
package-level shared constants (e.g., DefaultMaxIdleConns,
DefaultMaxIdleConnsPerHost, DefaultMaxConnsPerHost, DefaultIdleConnTimeout,
DefaultTLSHandshakeTimeout) or a small config struct in the crawler package,
then replace the hard-coded literals used in New (the transport initialization
block) and CreateHTTPClient with those constants/config fields so both functions
read the same source of truth; update any other places referenced in the review
(the other transport init sites) to use the same constants/config to avoid drift
and keep behavior identical across New and CreateHTTPClient.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1a3e64b0-5a9a-42b8-ac1a-555af4691c93
📒 Files selected for processing (2)
internal/crawler/crawler.gointernal/util/fdlimit.go
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/util/fdlimit.go`:
- Around line 42-48: FDPressure currently computes fail-open semantics itself
but other code (internal/db/queue.go) recomputes pressure inline causing drift;
add a shared helper function FDPressureFrom(current, limit) that implements the
canonical logic (return 0 when limit <= 0 or on error-equivalent inputs,
otherwise current/limit as float64), change FDPressure() to call
FDPressureFrom(FDUsage() results) and update the inline pressure computation in
internal/db/queue.go to call FDPressureFrom instead so both places share
identical edge-case handling.
- Around line 19-37: The code currently re-reads and reparses /proc/self/limits
on every FDUsage() call; add a concurrency-safe cache for the parsed soft FD
limit so we only parse once. Introduce a package-level cachedSoftLimit int (and
a sync.Once or atomic flag) and move the /proc/self/limits parsing logic into an
init-once helper (e.g., populateCachedSoftLimit or getCachedSoftLimit) that
stores the parsed soft limit into cachedSoftLimit; then have FDUsage() call that
helper and return the cached soft limit on subsequent calls. Make sure the
caching is safe for concurrent callers and that errors during the first parse
are handled and propagated appropriately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 32274ec3-f629-458f-b472-a7c6b91c0c91
📒 Files selected for processing (1)
internal/util/fdlimit.go
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/db/queue.go`:
- Around line 769-770: Replace the magic literal 0.90 used in the fd pressure
check with a named constant (e.g., fdPressureThreshold or maxFDPressure)
declared in the same package const block; update the condition that uses
fdPressure (the line calling log.Warn()) to compare against this constant, add a
short comment documenting the threshold purpose, and search for other
occurrences of 0.90 related to fd pressure to reuse the constant so tuning is
centralized.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a46d9412-304a-4432-b6f9-edc0c53da17f
📒 Files selected for processing (2)
internal/db/queue.gointernal/util/fdlimit.go
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 1 file(s) based on 1 unresolved review comment. Files modified:
Commit: The changes have been pushed to the Time taken: |
Fixed 1 file(s) based on 1 unresolved review comment. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Dockerfile`:
- Line 58: The startup currently runs ulimit in the Dockerfile CMD with stderr
discarded, so there's no visibility whether the file descriptor limit was raised
and internal/util/fdlimit.go caches the soft limit via sync.Once; either add an
observable step to the startup (e.g., change the CMD to emit the resulting
ulimit -n value or an explicit echo/log of the ulimit outcome before exec
./main) or add a short log in Go at startup (in main.main or an init) that reads
and logs the cached FD limit from the code in internal/util/fdlimit.go (the
function that retrieves/caches the soft limit) so operators can see the
effective FD limit at boot.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fa2716ef-0801-4c95-92cc-2500cf88d1ac
📒 Files selected for processing (3)
Dockerfileinternal/db/queue.gointernal/util/fdlimit.go
|
🐝 Review App Deployed Homepage: https://hover-pr-297.fly.dev |
Fix fd exhaustion from probe transport leak
Summary
CheckCacheStatus()created a newhttp.Transport+http.Clienton every call (up to 3× per URL × thousands of tasks). Each transport held idle connections open indefinitely, exhausting all 10,240 file descriptors under sustained load.probeClienton theCrawlerstruct, initialised once with proper connection limits (MaxIdleConns: 20,MaxConnsPerHost: 10,IdleConnTimeout: 30s).MaxIdleConns: 150cap to main crawler andCreateHTTPClienttransports.ensurePoolCapacity— triggersErrPoolSaturatedat >90% fd usage so existing retry/backoff handles it gracefully instead of cryptic DNS errors.bee.process.fd.current,.limit,.pressure).Context
Load test over ~24h generated 5.1K
pgconn.ConnectErrorevents in Sentry (socket: too many open files). Supabase connections only peaked at 54/90 (60%) — the DB wasn't the bottleneck. The Fly VM's 10,240 fd limit was exhausted by leaked HTTP transports from probe calls.Test plan
go build ./...passesgo test ./internal/crawler/...passesgo test ./internal/db/...passesgo test ./internal/util/...passesgofmtclean on all changed filesfly ssh console→ls /proc/$(pidof main)/fd | wc -lduring loadbee.process.fd.*gaugesSummary by CodeRabbit
Bug Fixes
New Features
Chores