Cache robots.txt rules per domain #385
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

JobManager caches robots.txt discovery per normalized, lowercased domain with configurable positive (1h) and negative (60s) TTLs, collapses concurrent misses via singleflight, and avoids caching caller-side cancellations. Tests and CHANGELOG updated to reflect behavior.

Changes
Robots Cache Implementation
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Caller
participant GetRobotsRules
participant RobotsCache
participant SingleflightGroup
participant Crawler
Caller->>GetRobotsRules: GetRobotsRules(domain, ctx)
GetRobotsRules->>GetRobotsRules: normalise & lowercase domain
GetRobotsRules->>RobotsCache: check cached {rules,err,expiresAt}
alt cache hit (not expired)
RobotsCache-->>GetRobotsRules: cached rules/err
else cache miss or expired
GetRobotsRules->>SingleflightGroup: Do(domainKey, fetchFn)
alt first caller
SingleflightGroup->>Crawler: DiscoverSitemapsAndRobots(ctx, domain)
Crawler-->>SingleflightGroup: rules, err
SingleflightGroup->>RobotsCache: store {rules,err,expiresAt} using TTL
else concurrent callers
SingleflightGroup-->>GetRobotsRules: wait & receive result from first caller
end
end
GetRobotsRules-->>Caller: rules, err
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
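For orientation, a minimal Go sketch of the flow shown in the diagram above. The type names, TTL constants, and the `DiscoverSitemapsAndRobots` signature are assumptions based on this summary, not the actual code in `internal/jobs/manager.go`:

```go
package jobs

import (
	"context"
	"strings"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

// RobotsRules and normaliseDomain stand in for the project's real rule type
// and util.NormaliseDomain helper.
type RobotsRules struct{ Disallow []string }

func normaliseDomain(d string) string {
	d = strings.TrimPrefix(d, "https://")
	return strings.TrimPrefix(d, "http://")
}

type robotsCacheEntry struct {
	rules     *RobotsRules
	err       error
	expiresAt time.Time
}

const (
	robotsTTLPos = time.Hour        // successful fetches are reused for up to an hour
	robotsTTLNeg = 60 * time.Second // failures are retried after a minute
)

type crawler interface {
	DiscoverSitemapsAndRobots(ctx context.Context, domain string) (*RobotsRules, error)
}

type JobManager struct {
	crawler     crawler
	robotsMutex sync.RWMutex
	robotsCache map[string]robotsCacheEntry // assumed to be initialised in the constructor (not shown)
	robotsGroup singleflight.Group
}

func (jm *JobManager) GetRobotsRules(ctx context.Context, domain string) (*RobotsRules, error) {
	key := strings.ToLower(normaliseDomain(domain))

	// Fast path: serve a non-expired entry, whether it holds rules or a cached error.
	jm.robotsMutex.RLock()
	if e, ok := jm.robotsCache[key]; ok && time.Now().Before(e.expiresAt) {
		jm.robotsMutex.RUnlock()
		return e.rules, e.err
	}
	jm.robotsMutex.RUnlock()

	// Collapse concurrent misses for the same key into a single origin fetch.
	v, err, _ := jm.robotsGroup.Do(key, func() (interface{}, error) {
		rules, ferr := jm.crawler.DiscoverSitemapsAndRobots(ctx, key)

		ttl := robotsTTLPos
		if ferr != nil {
			ttl = robotsTTLNeg
		}
		// Note: the review comments further down also discuss skipping the cache
		// write for caller-side cancellations; that refinement is omitted here.
		jm.robotsMutex.Lock()
		jm.robotsCache[key] = robotsCacheEntry{rules: rules, err: ferr, expiresAt: time.Now().Add(ttl)}
		jm.robotsMutex.Unlock()

		return rules, ferr
	})
	rules, _ := v.(*RobotsRules)
	return rules, err
}
```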
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Updates to Preview Branch (work/robots-cache) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎.
Release Versions
App patch:
Changelog
Changed
Codecov Report

❌ Patch coverage is
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/jobs/manager.go`:
- Around line 412-433: The code caches robots results under the lowercased,
normalised key but calls jm.crawler.DiscoverSitemapsAndRobots(ctx, domain) with
the raw domain, causing mismatched fetches and bad cache entries; update the
call inside jm.robotsGroup.Do to use the canonical key (pass key instead of
domain) so jm.crawler.DiscoverSitemapsAndRobots, any error formatting
(fmt.Errorf("jobs: fetch robots for %s"...)) and the cached entry all operate on
the same normalised origin string.
- Around line 438-448: The current code unconditionally caches errors from
out.err under robotsTTLNeg which allows transient context cancellations to
poison jm.robotsCache; change the logic in the robots caching block to detect
context.Canceled and context.DeadlineExceeded (use errors.Is(out.err,
context.Canceled) || errors.Is(out.err, context.DeadlineExceeded)) and do not
write a negative cache entry for those transient errors (either skip writing to
jm.robotsCache or set ttl to zero and avoid adding an entry unless ttl>0).
Update the conditional around jm.robotsCache assignment (referencing
robotsTTLPos, robotsTTLNeg, out.err, robotsCacheEntry, jm.robotsMutex,
jm.robotsCache) so only non-transient errors are cached with robotsTTLNeg.
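A fragment-level sketch of how the two suggestions above might look together inside the singleflight callback. The identifiers (`jm.robotsGroup`, `jm.crawler.DiscoverSitemapsAndRobots`, `robotsCacheEntry`, `robotsTTLPos`/`robotsTTLNeg`, `jm.robotsMutex`, `jm.robotsCache`) come from the review comments; the exact error wording and surrounding code are assumptions:

```go
v, err, _ := jm.robotsGroup.Do(key, func() (interface{}, error) {
	// Use the canonical key so the fetch, the error message and the cache
	// entry all refer to the same normalised origin.
	rules, ferr := jm.crawler.DiscoverSitemapsAndRobots(ctx, key)
	if ferr != nil {
		ferr = fmt.Errorf("jobs: fetch robots for %s: %w", key, ferr)
	}

	// Do not let transient caller-side cancellations poison the negative cache.
	if errors.Is(ferr, context.Canceled) || errors.Is(ferr, context.DeadlineExceeded) {
		return rules, ferr
	}

	ttl := robotsTTLPos
	if ferr != nil {
		ttl = robotsTTLNeg
	}
	jm.robotsMutex.Lock()
	jm.robotsCache[key] = robotsCacheEntry{rules: rules, err: ferr, expiresAt: time.Now().Add(ttl)}
	jm.robotsMutex.Unlock()
	return rules, ferr
})
```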
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 826131a5-6589-4283-aeed-cb4203c98137
📒 Files selected for processing (3)
- CHANGELOG.md
- internal/jobs/manager.go
- internal/jobs/robots_cache_test.go
🐝 Review App Deployed. Homepage: https://hover-pr-385.fly.dev
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/jobs/robots_cache_test.go`:
- Around line 175-192: The test currently drops errors from concurrent calls to
GetRobotsRules (the goroutine uses "_, _ =" so failures are ignored); change the
goroutines to capture the returned error (call jm.GetRobotsRules(ctx,
"swarm.com") and send the error into a buffered channel or append to a protected
slice) and after wg.Wait() drain and assert on those errors (e.g., ensure all
are nil or match the expected error when the stub simulates failure). Use the
existing fanout, wg, gate and stub.calls identifiers to locate the loop and
assertions to replace the discarded errors with explicit checks.
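One way the suggested change could look, sketched against the identifiers mentioned above (`fanout`, `wg`, `gate`, `stub.calls`). Whether the stub simulates success or failure, and therefore exactly what to assert, is assumed here:

```go
errCh := make(chan error, fanout) // buffered so goroutines never block on send

for i := 0; i < fanout; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-gate // all goroutines released together, as in the existing test
		_, err := jm.GetRobotsRules(ctx, "swarm.com")
		errCh <- err // capture instead of discarding with `_, _ =`
	}()
}

close(gate)
wg.Wait()
close(errCh)

for err := range errCh {
	if err != nil { // or compare against the stub's simulated error
		t.Fatalf("GetRobotsRules: unexpected error: %v", err)
	}
}

if stub.calls != 1 { // singleflight should collapse the fanout into one fetch
	t.Fatalf("expected 1 origin fetch, got %d", stub.calls)
}
```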
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 15acd520-5d6e-41a5-9e3e-e9d3e49f011c
📒 Files selected for processing (2)
- internal/jobs/manager.go
- internal/jobs/robots_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/jobs/manager.go
🐝 Review App Deployed. Homepage: https://hover-pr-385.fly.dev
Cache robots.txt rules per domain
Summary
`JobManager.GetRobotsRules` previously refetched `/robots.txt` on every call. The stream worker's job-info cache expires at five-minute intervals, so a long crawl re-hit the origin every five minutes, and a 429 on `/robots.txt` came back on the next read with no negative caching. On 2026-05-11 the merrypeople.com run produced two 429s on `/robots.txt` five minutes apart. No functional damage (the code falls through to "no restrictions"), but unnecessary origin load and noisy logs.
This change adds a per-domain in-memory cache with:
- Positive TTL (1h): fetches `/robots.txt` once per hour at most.
- Negative TTL (60s): a failed fetch of `/robots.txt` unblocks within a minute and stops re-hammering the origin in the meantime.
- Singleflight: concurrent misses for the same domain collapse on one origin fetch instead of N parallel requests.
- Key normalisation: `lower(util.NormaliseDomain(domain))`, so case-variant and scheme-prefixed inputs share one entry.
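For illustration, a caller-side view of the behaviour these bullets describe (the method signature is taken from the review snippets above; everything else is assumed):

```go
// First call for a domain goes to the origin and warms the cache.
rules, err := jm.GetRobotsRules(ctx, "https://Example.COM")

// Calls within the positive TTL (1h) are served from memory; case-variant and
// scheme-prefixed spellings share the same normalised, lowercased key.
rules, err = jm.GetRobotsRules(ctx, "example.com")

// After a failed fetch the error itself is cached, so the origin is left alone
// for the negative TTL (60s) before the next retry.
_, _ = rules, err
```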
Restart loses the cache, which is fine — a single fetch on the next
read warms it. No Redis involvement (single worker box; in-memory is
the right granularity).
Test plan
- `go test -race -short ./...` passes
- `scripts/security-check.sh` clean

Risk
- A cold cache after restart triggers at most one fetch per domain seen during the next crawl window.
- The cache lives on the JobManager, which is a long-lived singleton in the worker process; no leak risk because the key space is bounded by the number of domains the worker has ever touched (≪ memory budget).
Need help on this PR? Tag @codesmith with what you need.

Summary by CodeRabbit