Cache robots.txt rules per domain #385
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

JobManager caches robots.txt discovery per normalized, lowercased domain with configurable positive (1h) and negative (60s) TTLs, collapses concurrent misses via singleflight, and avoids caching caller-side cancellations. Tests and CHANGELOG updated to reflect behavior.

Changes
Robots Cache Implementation
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Caller
participant GetRobotsRules
participant RobotsCache
participant SingleflightGroup
participant Crawler
Caller->>GetRobotsRules: GetRobotsRules(domain, ctx)
GetRobotsRules->>GetRobotsRules: normalise & lowercase domain
GetRobotsRules->>RobotsCache: check cached {rules,err,expiresAt}
alt cache hit (not expired)
RobotsCache-->>GetRobotsRules: cached rules/err
else cache miss or expired
GetRobotsRules->>SingleflightGroup: Do(domainKey, fetchFn)
alt first caller
SingleflightGroup->>Crawler: DiscoverSitemapsAndRobots(ctx, domain)
Crawler-->>SingleflightGroup: rules, err
SingleflightGroup->>RobotsCache: store {rules,err,expiresAt} using TTL
else concurrent callers
SingleflightGroup-->>GetRobotsRules: wait & receive result from first caller
end
end
GetRobotsRules-->>Caller: rules, err
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
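For orientation, a minimal Go sketch of the flow shown in the diagram above. The type names, TTL constants, and the `DiscoverSitemapsAndRobots` signature are assumptions based on this summary, not the actual code in `internal/jobs/manager.go`:

```go
package jobs

import (
	"context"
	"strings"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

// RobotsRules and normaliseDomain stand in for the project's real rule type
// and util.NormaliseDomain helper.
type RobotsRules struct{ Disallow []string }

func normaliseDomain(d string) string {
	d = strings.TrimPrefix(d, "https://")
	return strings.TrimPrefix(d, "http://")
}

type robotsCacheEntry struct {
	rules     *RobotsRules
	err       error
	expiresAt time.Time
}

const (
	robotsTTLPos = time.Hour        // successful fetches are reused for up to an hour
	robotsTTLNeg = 60 * time.Second // failures are retried after a minute
)

type crawler interface {
	DiscoverSitemapsAndRobots(ctx context.Context, domain string) (*RobotsRules, error)
}

type JobManager struct {
	crawler     crawler
	robotsMutex sync.RWMutex
	robotsCache map[string]robotsCacheEntry // assumed to be initialised in the constructor (not shown)
	robotsGroup singleflight.Group
}

func (jm *JobManager) GetRobotsRules(ctx context.Context, domain string) (*RobotsRules, error) {
	key := strings.ToLower(normaliseDomain(domain))

	// Fast path: serve a non-expired entry, whether it holds rules or a cached error.
	jm.robotsMutex.RLock()
	if e, ok := jm.robotsCache[key]; ok && time.Now().Before(e.expiresAt) {
		jm.robotsMutex.RUnlock()
		return e.rules, e.err
	}
	jm.robotsMutex.RUnlock()

	// Collapse concurrent misses for the same key into a single origin fetch.
	v, err, _ := jm.robotsGroup.Do(key, func() (interface{}, error) {
		rules, ferr := jm.crawler.DiscoverSitemapsAndRobots(ctx, key)

		ttl := robotsTTLPos
		if ferr != nil {
			ttl = robotsTTLNeg
		}
		// Note: the review comments further down also discuss skipping the cache
		// write for caller-side cancellations; that refinement is omitted here.
		jm.robotsMutex.Lock()
		jm.robotsCache[key] = robotsCacheEntry{rules: rules, err: ferr, expiresAt: time.Now().Add(ttl)}
		jm.robotsMutex.Unlock()

		return rules, ferr
	})
	rules, _ := v.(*RobotsRules)
	return rules, err
}
```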
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Updates to Preview Branch (work/robots-cache) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎.
Release Versions
App patch:
Changelog
Changed
Codecov Report

❌ Patch coverage is
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/jobs/manager.go`:
- Around line 412-433: The code caches robots results under the lowercased,
normalised key but calls jm.crawler.DiscoverSitemapsAndRobots(ctx, domain) with
the raw domain, causing mismatched fetches and bad cache entries; update the
call inside jm.robotsGroup.Do to use the canonical key (pass key instead of
domain) so jm.crawler.DiscoverSitemapsAndRobots, any error formatting
(fmt.Errorf("jobs: fetch robots for %s"...)) and the cached entry all operate on
the same normalised origin string.
- Around line 438-448: The current code unconditionally caches errors from
out.err under robotsTTLNeg which allows transient context cancellations to
poison jm.robotsCache; change the logic in the robots caching block to detect
context.Canceled and context.DeadlineExceeded (use errors.Is(out.err,
context.Canceled) || errors.Is(out.err, context.DeadlineExceeded)) and do not
write a negative cache entry for those transient errors (either skip writing to
jm.robotsCache or set ttl to zero and avoid adding an entry unless ttl>0).
Update the conditional around jm.robotsCache assignment (referencing
robotsTTLPos, robotsTTLNeg, out.err, robotsCacheEntry, jm.robotsMutex,
jm.robotsCache) so only non-transient errors are cached with robotsTTLNeg.
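A fragment-level sketch of how the two suggestions above might look together inside the singleflight callback. The identifiers (`jm.robotsGroup`, `jm.crawler.DiscoverSitemapsAndRobots`, `robotsCacheEntry`, `robotsTTLPos`/`robotsTTLNeg`, `jm.robotsMutex`, `jm.robotsCache`) come from the review comments; the exact error wording and surrounding code are assumptions:

```go
v, err, _ := jm.robotsGroup.Do(key, func() (interface{}, error) {
	// Use the canonical key so the fetch, the error message and the cache
	// entry all refer to the same normalised origin.
	rules, ferr := jm.crawler.DiscoverSitemapsAndRobots(ctx, key)
	if ferr != nil {
		ferr = fmt.Errorf("jobs: fetch robots for %s: %w", key, ferr)
	}

	// Do not let transient caller-side cancellations poison the negative cache.
	if errors.Is(ferr, context.Canceled) || errors.Is(ferr, context.DeadlineExceeded) {
		return rules, ferr
	}

	ttl := robotsTTLPos
	if ferr != nil {
		ttl = robotsTTLNeg
	}
	jm.robotsMutex.Lock()
	jm.robotsCache[key] = robotsCacheEntry{rules: rules, err: ferr, expiresAt: time.Now().Add(ttl)}
	jm.robotsMutex.Unlock()
	return rules, ferr
})
```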
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 826131a5-6589-4283-aeed-cb4203c98137
📒 Files selected for processing (3)
- CHANGELOG.md
- internal/jobs/manager.go
- internal/jobs/robots_cache_test.go
🐝 Review App Deployed. Homepage: https://hover-pr-385.fly.dev
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/jobs/robots_cache_test.go`:
- Around line 175-192: The test currently drops errors from concurrent calls to
GetRobotsRules (the goroutine uses "_, _ =" so failures are ignored); change the
goroutines to capture the returned error (call jm.GetRobotsRules(ctx,
"swarm.com") and send the error into a buffered channel or append to a protected
slice) and after wg.Wait() drain and assert on those errors (e.g., ensure all
are nil or match the expected error when the stub simulates failure). Use the
existing fanout, wg, gate and stub.calls identifiers to locate the loop and
assertions to replace the discarded errors with explicit checks.
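One way the suggested change could look, sketched against the identifiers mentioned above (`fanout`, `wg`, `gate`, `stub.calls`). Whether the stub simulates success or failure, and therefore exactly what to assert, is assumed here:

```go
errCh := make(chan error, fanout) // buffered so goroutines never block on send

for i := 0; i < fanout; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-gate // all goroutines released together, as in the existing test
		_, err := jm.GetRobotsRules(ctx, "swarm.com")
		errCh <- err // capture instead of discarding with `_, _ =`
	}()
}

close(gate)
wg.Wait()
close(errCh)

for err := range errCh {
	if err != nil { // or compare against the stub's simulated error
		t.Fatalf("GetRobotsRules: unexpected error: %v", err)
	}
}

if stub.calls != 1 { // singleflight should collapse the fanout into one fetch
	t.Fatalf("expected 1 origin fetch, got %d", stub.calls)
}
```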
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 15acd520-5d6e-41a5-9e3e-e9d3e49f011c
📒 Files selected for processing (2)
- internal/jobs/manager.go
- internal/jobs/robots_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/jobs/manager.go
🐝 Review App Deployed. Homepage: https://hover-pr-385.fly.dev
Cache robots.txt rules per domain
Summary
`JobManager.GetRobotsRules` previously refetched `/robots.txt` on every call. The stream worker's job-info cache expires at five-minute intervals, so a long crawl re-hit the origin every five minutes, and a 429 on `/robots.txt` came back on the next read with no negative caching. On 2026-05-11 the merrypeople.com run produced two 429s on `/robots.txt` five minutes apart. No functional damage (the code falls through to "no restrictions"), but unnecessary origin load and noisy logs.
This change adds a per-domain in-memory cache with:
- Positive TTL (1h): fetches `/robots.txt` once per hour at most.
- Negative TTL (60s): a failed fetch of `/robots.txt` unblocks within a minute and stops re-hammering the origin in the meantime.
- Singleflight: concurrent misses for the same domain collapse on one origin fetch instead of N parallel requests.
- Key normalisation: `lower(util.NormaliseDomain(domain))`, so case-variant and scheme-prefixed inputs share one entry.
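For illustration, a caller-side view of the behaviour these bullets describe (the method signature is taken from the review snippets above; everything else is assumed):

```go
// First call for a domain goes to the origin and warms the cache.
rules, err := jm.GetRobotsRules(ctx, "https://Example.COM")

// Calls within the positive TTL (1h) are served from memory; case-variant and
// scheme-prefixed spellings share the same normalised, lowercased key.
rules, err = jm.GetRobotsRules(ctx, "example.com")

// After a failed fetch the error itself is cached, so the origin is left alone
// for the negative TTL (60s) before the next retry.
_, _ = rules, err
```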
Restart loses the cache, which is fine — a single fetch on the next
read warms it. No Redis involvement (single worker box; in-memory is
the right granularity).
Test plan
- `go test -race -short ./...` passes
- `scripts/security-check.sh` clean

Risk
- A cold cache after restart triggers at most one fetch per domain seen during the next crawl window.
- The cache lives on the JobManager, which is a long-lived singleton in the worker process; no leak risk because the key space is bounded by the number of domains the worker has ever touched (≪ memory budget).
Need help on this PR? Tag @codesmith with what you need.

Summary by CodeRabbit