Document case-based crawl handling #370

Merged
simonsmallchua merged 5 commits into main from work/crawl-handling-docs on Apr 30, 2026

Conversation

@simonsmallchua simonsmallchua commented Apr 28, 2026

Summary

Adds a flat case → action table covering how Hover handles each domain
or page condition we encounter. Optimised for skim-reading and
incremental growth: each row is a case, each row tells you what
happens to the job and what happens to the task that surfaced
the case, with a pointer to the source code.

What's in it

  • docs/architecture/CRAWL_HANDLING.md (new) — three tables:
    • Domain-level cases (12 rows): healthy, WAF wall variants, robots.txt outcomes, existing-active-job, quota, cancellation, terminal job, etc.
    • Page-level cases (~20 rows): 2xx with content, 2xx empty/SPA shell, redirects, 4xx/5xx, WAF responses (with vs without fingerprint), 429, timeouts, TLS errors, robots.txt-disallowed paths, cross-subdomain links, dedupe, max_pages overflow, stale-task reclaim.
    • Reference tables for job statuses, task statuses, and domains table columns.
  • DATABASE.md updates: complete (and accurate) job-status list, new domains columns documented, cross-link to CRAWL_HANDLING.md.
  • ARCHITECTURE.md update: Task Lifecycle section points to the new doc.

Why

The job-status list in DATABASE.md was stale (5 statuses listed; 9
actually exist in code), and domains.waf_blocked / waf_vendor /
waf_blocked_at weren't documented anywhere. As we add more cases
(row 4 of #365 will introduce failure_class, row 2 will add Shopify-
specific handling), there needs to be a single skim-able place to
answer "what happens when X?". Tables grow well; long-form prose
doesn't.

Note on dependency

A few rows reference behaviour shipping in #368 (the EnqueueURLs
terminal-status guard, the lowered breaker default of 2). If #368
merges first, the doc is accurate on day one; if this lands first, the
doc is slightly aspirational on those rows until #368 merges. No code
changes here, so either order is safe.

Test plan

  • Render the doc on GitHub and check the tables format correctly.
  • Verify the case rows match production behaviour (spot-check 2-3 of each table).
  • Confirm cross-links from DATABASE.md and ARCHITECTURE.md resolve.

Summary by CodeRabbit

  • Documentation
    • Added a Crawl Handling reference enumerating domain/page cases, recovery rules, and job/task status mappings.
    • Expanded architecture and database docs to describe per-domain crawl pacing, WAF verdict caching, and an expanded job lifecycle.
    • Updated API and config docs: task status options now include waiting/skipped; archival retention now counts blocked jobs; indexes and doc indices updated.

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eba9cf27-7822-46fb-8798-dd64e7e296fe

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae50d7 and 5677d8e.

📒 Files selected for processing (2)
  • internal/crawler/waf.go
  • internal/jobs/manager.go
✅ Files skipped from review due to trivial changes (2)
  • internal/jobs/manager.go
  • internal/crawler/waf.go

Walkthrough

Added a new crawl-handling architecture document and wired it into docs; expanded database schema docs to document per-domain pacing and WAF caching fields and an expanded job-status lifecycle; API/config docs updated for task statuses and archival semantics; small in-code comments added referencing the new spec.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Architecture index & entry: docs/architecture/README.md, README.md, CLAUDE.md, docs/architecture/ARCHITECTURE.md | Inserted references to a new CRAWL_HANDLING.md and updated architecture indexes to point to the crawl-handling spec. |
| Crawl handling spec: docs/architecture/CRAWL_HANDLING.md | New detailed case→action tables for domain- and page-level crawl handling (WAF variants, robots, response classes, pacing, retries, dedupe, recovery) and mappings to job/task statuses. |
| Database & status model: docs/architecture/DATABASE.md | Documented domains columns for crawl pacing and WAF verdict caching (crawl_delay_seconds, adaptive delay/floor, waf_blocked, waf_vendor, waf_blocked_at) plus a partial index; expanded job lifecycle/statuses and noted validated transitions. |
| API & config references: docs/architecture/API.md, docs/architecture/CONFIG_REFERENCE.md | API: GET /v1/jobs/{job_id}/tasks status param accepts waiting and skipped and links to crawl-handling semantics. Config: ARCHIVE_RETENTION_JOBS description updated to include blocked among terminal jobs; table formatting adjusted. |
| In-code docs/comments: internal/crawler/waf.go, internal/jobs/manager.go, internal/jobs/types.go | Added/updated developer comments pointing maintainers to update CRAWL_HANDLING.md when adding WAF fingerprints, job statuses, or transition edges; no behavioral changes. |
| Top-level README: README.md | Added "Crawl Handling Cases" link to new spec in Documentation section. |
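
As a rough illustration of the API change summarised above, the sketch below builds a request URL for `GET /v1/jobs/{job_id}/tasks` with the `status` filter set to one of the newly accepted values. The host `hover.example`, the job ID, and the helper name `tasksURL` are assumptions for the example, and it assumes the parameter takes a single value; check API.md for the actual contract.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// tasksURL builds the URL for GET /v1/jobs/{job_id}/tasks filtered by
// task status. Hypothetical helper; not part of the Hover codebase.
func tasksURL(baseURL, jobID, status string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", fmt.Errorf("parse base url: %w", err)
	}
	// Append the endpoint path; URL.String escapes the path if needed.
	u.Path = strings.TrimRight(u.Path, "/") + "/v1/jobs/" + jobID + "/tasks"

	q := u.Query()
	q.Set("status", status) // e.g. "waiting" or "skipped"
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	for _, status := range []string{"waiting", "skipped"} {
		s, err := tasksURL("https://hover.example", "job_123", status)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s)
	}
}
```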

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • Trim narrative comments per CLAUDE.md rule #364 — touches the same in-code comments and documentation areas (internal/jobs/manager.go, internal/jobs/types.go, internal/crawler/waf.go) and appears related to job-status and WAF documentation updates.

Poem

🐰 I hopped the docs with careful paws,

Mapped WAF walls, robots, and crawl laws,
Rows of cases, actions aligned,
I flagged the paths for curious minds,
Nibble, note, and bound along — hooray! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Document case-based crawl handling' accurately and concisely summarizes the main objective of this pull request: introducing comprehensive documentation for crawl handling behavior via case tables. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

supabase Bot commented Apr 28, 2026

Updates to Preview Branch (work/crawl-handling-docs) ↗︎

Deployments Status Updated
Database Wed, 29 Apr 2026 03:31:19 UTC
Services Wed, 29 Apr 2026 03:31:19 UTC
APIs Wed, 29 Apr 2026 03:31:19 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Wed, 29 Apr 2026 03:31:20 UTC
Migrations Wed, 29 Apr 2026 03:31:22 UTC
Seeding Wed, 29 Apr 2026 03:31:24 UTC
Edge Functions Wed, 29 Apr 2026 03:31:24 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/architecture/CRAWL_HANDLING.md (1)

11-13: Narrow the “single source of truth” wording.

ValidateStatusTransition is a status-transition validator; it does not represent lock ordering or trigger semantics. Tightening this sentence will avoid overclaiming scope.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/CRAWL_HANDLING.md` around lines 11 - 13, The sentence
overstates scope: change the phrasing so ValidateStatusTransition in
internal/jobs/manager.go is described only as the canonical validator for status
transitions, not for lock ordering or trigger semantics; update the text to
point readers to internal/jobs/manager.go for full state machine transition
rules and to mention that ValidateStatusTransition specifically enforces
status-transition validation (separate concerns like lock ordering and trigger
semantics are documented elsewhere or require separate references).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/architecture/CRAWL_HANDLING.md`:
- Line 23: Replace the non-canonical status string "initialising" with the
runtime canonical "initializing" in the documentation and any literals
referenced (e.g., the table entries at lines mentioning setupJobURLDiscovery)
and update the status transition validator to allow transitions involving the
"paused" state by adding the missing transition rules inside
ValidateStatusTransition (ensure this function now recognizes transitions
to/from "paused" consistent with the documented enforcement); reference
ValidateStatusTransition and internal/jobs/manager.go::setupJobURLDiscovery when
making these edits.

In `@docs/architecture/DATABASE.md`:
- Around line 219-234: The documentation uses the wrong status literal and an
incomplete description of the validator: change the documented status
`initialising` to the runtime `initializing` and update the lifecycle text to
reflect that `ValidateStatusTransition` (in internal/jobs/manager.go) does not
currently model `paused` transitions; either add `paused` transition rules to
the validator map in ValidateStatusTransition or remove/annotate `paused` from
the documented allowed lifecycle list so the docs and runtime match. Ensure
references to status literals (`pending`, `initializing`, `running`, `paused`,
`completed`, `failed`, `cancelled`, `blocked`, `archived`) match exactly between
the doc and the validator.

---

Nitpick comments:
In `@docs/architecture/CRAWL_HANDLING.md`:
- Around line 11-13: The sentence overstates scope: change the phrasing so
ValidateStatusTransition in internal/jobs/manager.go is described only as the
canonical validator for status transitions, not for lock ordering or trigger
semantics; update the text to point readers to internal/jobs/manager.go for full
state machine transition rules and to mention that ValidateStatusTransition
specifically enforces status-transition validation (separate concerns like lock
ordering and trigger semantics are documented elsewhere or require separate
references).
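
To make the review points above concrete, here is a minimal sketch of a map-based status-transition validator that includes edges to and from `paused`. The nine status literals come from the review comment; the transition edges themselves are illustrative assumptions, not the rules in internal/jobs/manager.go, and `validateTransition` is a stand-in name rather than the real ValidateStatusTransition.

```go
package main

import "fmt"

type JobStatus string

const (
	StatusPending      JobStatus = "pending"
	StatusInitializing JobStatus = "initializing"
	StatusRunning      JobStatus = "running"
	StatusPaused       JobStatus = "paused"
	StatusCompleted    JobStatus = "completed"
	StatusFailed       JobStatus = "failed"
	StatusCancelled    JobStatus = "cancelled"
	StatusBlocked      JobStatus = "blocked"
	StatusArchived     JobStatus = "archived"
)

// allowed maps each status to the statuses it may move to.
// Assumption: these edges are invented for illustration only.
var allowed = map[JobStatus][]JobStatus{
	StatusPending:      {StatusInitializing, StatusCancelled},
	StatusInitializing: {StatusRunning, StatusFailed, StatusBlocked, StatusCancelled},
	StatusRunning:      {StatusPaused, StatusCompleted, StatusFailed, StatusBlocked, StatusCancelled},
	StatusPaused:       {StatusRunning, StatusCancelled},
	StatusCompleted:    {StatusArchived},
	StatusFailed:       {StatusArchived},
	StatusCancelled:    {StatusArchived},
	StatusBlocked:      {StatusArchived},
}

// validateTransition checks only whether from -> to is a listed edge.
// It says nothing about lock ordering or trigger semantics.
func validateTransition(from, to JobStatus) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid status transition: %s -> %s", from, to)
}

func main() {
	fmt.Println(validateTransition(StatusRunning, StatusPaused))
	fmt.Println(validateTransition(StatusCompleted, StatusRunning))
}
```

Keeping the edges in one table means a documentation row and a code edge can be compared directly, which is the kind of doc/runtime match the review asks for.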
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8d22e38-478a-4eeb-b7fe-1b6985b83000

📥 Commits

Reviewing files that changed from the base of the PR and between 4e98d97 and 506151d.

📒 Files selected for processing (3)
  • docs/architecture/ARCHITECTURE.md
  • docs/architecture/CRAWL_HANDLING.md
  • docs/architecture/DATABASE.md

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@github-actions

🐝 Review App Deployed

Homepage: https://hover-pr-370.fly.dev
Dashboard: https://hover-pr-370.fly.dev/dashboard

@simonsmallchua simonsmallchua merged commit 2ba6e6a into main Apr 30, 2026
18 of 19 checks passed
@simonsmallchua simonsmallchua deleted the work/crawl-handling-docs branch April 30, 2026 21:03