Document case-based crawl handling by simonsmallchua · Pull Request #370 · Good-Native/hover

simonsmallchua · 2026-04-28T22:23:56Z

Summary

Adds a flat case → action table covering how Hover handles each domain
or page condition we encounter. Optimised for skim-reading and
incremental growth: each row is a case, each row tells you what
happens to the job and what happens to the task that surfaced
the case, with a pointer to the source code.

What's in it

- docs/architecture/CRAWL_HANDLING.md (new), three tables:
  - Domain-level cases (12 rows): healthy, WAF wall variants, robots.txt outcomes, existing-active-job, quota, cancellation, terminal job, etc.
  - Page-level cases (~20 rows): 2xx with content, 2xx empty/SPA shell, redirects, 4xx/5xx, WAF responses (with vs without fingerprint), 429, timeouts, TLS errors, robots.txt-disallowed paths, cross-subdomain links, dedupe, max_pages overflow, stale-task reclaim.
  - Reference tables for job statuses, task statuses, and domains table columns.
- DATABASE.md updates: complete (and accurate) job-status list, new domains columns documented, cross-link to CRAWL_HANDLING.md.
- ARCHITECTURE.md update: Task Lifecycle section points to the new doc.

Why

The job-status list in DATABASE.md was stale (5 statuses listed; 9
actually exist in code), and domains.waf_blocked / waf_vendor /
waf_blocked_at weren't documented anywhere. As we add more cases
(row 4 of #365 will introduce failure_class, row 2 will add Shopify-
specific handling), there needs to be a single skim-able place to
answer "what happens when X?". Tables grow well; long-form prose
doesn't.

Note on dependency

A few rows reference behaviour shipping in #368 (the EnqueueURLs
terminal-status guard, the lowered breaker default of 2). If #368
merges first, the doc is accurate on day one; if this lands first, the
doc is slightly aspirational on those rows until #368 merges. No code
changes here, so either order is safe.

Test plan

Render the doc on GitHub and check the tables format correctly.
Verify the case rows match production behaviour (spot-check 2-3 of each table).
Confirm cross-links from DATABASE.md and ARCHITECTURE.md resolve.

Summary by CodeRabbit

Documentation
- Added a Crawl Handling reference enumerating domain/page cases, recovery rules, and job/task status mappings.
- Expanded architecture and database docs to describe per-domain crawl pacing, WAF verdict caching, and an expanded job lifecycle.
- Updated API and config docs: task status options now include waiting/skipped; archival retention now counts blocked jobs; indexes and doc indices updated.

coderabbitai · 2026-04-28T22:24:11Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eba9cf27-7822-46fb-8798-dd64e7e296fe

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae50d7 and 5677d8e.

📒 Files selected for processing (2)

internal/crawler/waf.go
internal/jobs/manager.go

✅ Files skipped from review due to trivial changes (2)

internal/jobs/manager.go
internal/crawler/waf.go

Walkthrough

Added a new crawl-handling architecture document and wired it into docs; expanded database schema docs to document per-domain pacing and WAF caching fields and an expanded job-status lifecycle; API/config docs updated for task statuses and archival semantics; small in-code comments added referencing the new spec.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Architecture index & entry: `docs/architecture/README.md`, `README.md`, `CLAUDE.md`, `docs/architecture/ARCHITECTURE.md` | Inserted references to a new `CRAWL_HANDLING.md` and updated architecture indexes to point to the crawl-handling spec. |
| Crawl handling spec: `docs/architecture/CRAWL_HANDLING.md` | New detailed case→action tables for domain- and page-level crawl handling (WAF variants, robots, response classes, pacing, retries, dedupe, recovery) and mappings to job/task statuses. |
| Database & status model: `docs/architecture/DATABASE.md` | Documented `domains` columns for crawl pacing and WAF verdict caching (`crawl_delay_seconds`, adaptive delay/floor, `waf_blocked`, `waf_vendor`, `waf_blocked_at`) plus a partial index; expanded job lifecycle/statuses and noted validated transitions. |
| API & config references: `docs/architecture/API.md`, `docs/architecture/CONFIG_REFERENCE.md` | API: `GET /v1/jobs/{job_id}/tasks` `status` param accepts `waiting` and `skipped` and links to crawl-handling semantics. Config: `ARCHIVE_RETENTION_JOBS` description updated to include `blocked` among terminal jobs; table formatting adjusted. |
| In-code docs/comments: `internal/crawler/waf.go`, `internal/jobs/manager.go`, `internal/jobs/types.go` | Added/updated developer comments pointing maintainers to update `CRAWL_HANDLING.md` when adding WAF fingerprints, job statuses, or transition edges; no behavioral changes. |
| Top-level README: `README.md` | Added "Crawl Handling Cases" link to new spec in Documentation section. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Trim narrative comments per CLAUDE.md rule #364 — touches the same in-code comments and documentation areas (internal/jobs/manager.go, internal/jobs/types.go, internal/crawler/waf.go) and appears related to job-status and WAF documentation updates.

Poem

🐰 I hopped the docs with careful paws,

Mapped WAF walls, robots, and crawl laws,
Rows of cases, actions aligned,
I flagged the paths for curious minds,
Nibble, note, and bound along — hooray! 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Document case-based crawl handling' accurately and concisely summarizes the main objective of this pull request: introducing comprehensive documentation for crawl handling behavior via case tables. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

supabase · 2026-04-28T22:24:11Z

Updates to Preview Branch (work/crawl-handling-docs) ↗︎

| Deployments | Status | Updated |
| --- | --- | --- |
| Database | ✅ | Wed, 29 Apr 2026 03:31:19 UTC |
| Services | ✅ | Wed, 29 Apr 2026 03:31:19 UTC |
| APIs | ✅ | Wed, 29 Apr 2026 03:31:19 UTC |

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

| Tasks | Status | Updated |
| --- | --- | --- |
| Configurations | ✅ | Wed, 29 Apr 2026 03:31:20 UTC |
| Migrations | ✅ | Wed, 29 Apr 2026 03:31:22 UTC |
| Seeding | ✅ | Wed, 29 Apr 2026 03:31:24 UTC |
| Edge Functions | ✅ | Wed, 29 Apr 2026 03:31:24 UTC |

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

docs/architecture/CRAWL_HANDLING.md (1)

11-13: Narrow the “single source of truth” wording.

ValidateStatusTransition is a status-transition validator; it does not represent lock ordering or trigger semantics. Tightening this sentence will avoid overclaiming scope.
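
To make the distinction concrete, here is a minimal sketch of what a pure status-transition validator does. It is not the implementation in internal/jobs/manager.go: the lowercase `validateStatusTransition` name, the `allowed` map, and every edge in it are assumptions for illustration, and the real transition table may differ. Note that it only answers "is this edge permitted?" and carries no information about lock ordering or triggers.

```go
package main

import "fmt"

// JobStatus holds a job status literal.
type JobStatus string

const (
	Pending      JobStatus = "pending"
	Initializing JobStatus = "initializing"
	Running      JobStatus = "running"
	Paused       JobStatus = "paused"
	Completed    JobStatus = "completed"
	Failed       JobStatus = "failed"
	Cancelled    JobStatus = "cancelled"
	Blocked      JobStatus = "blocked"
	Archived     JobStatus = "archived"
)

// allowed maps each source status to the statuses it may move to.
// These edges are HYPOTHETICAL; Paused deliberately has no entry, so any
// transition out of it is rejected, which is how an unmodelled status behaves.
var allowed = map[JobStatus][]JobStatus{
	Pending:      {Initializing, Failed, Cancelled},
	Initializing: {Running, Blocked, Failed, Cancelled},
	Running:      {Completed, Blocked, Failed, Cancelled},
	Completed:    {Archived},
	Failed:       {Archived},
	Cancelled:    {Archived},
	Blocked:      {Archived},
}

// validateStatusTransition returns nil when from -> to is a listed edge
// and a descriptive error otherwise.
func validateStatusTransition(from, to JobStatus) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid status transition: %s -> %s", from, to)
}

func main() {
	fmt.Println(validateStatusTransition(Pending, Initializing))
	fmt.Println(validateStatusTransition(Completed, Running))
	fmt.Println(validateStatusTransition(Paused, Running))
}
```

Keeping the edges in one map means that documenting a new status and permitting it are the same small change, which is why the docs and the validator are expected to agree on the exact literals.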

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/CRAWL_HANDLING.md` around lines 11 - 13, The sentence
overstates scope: change the phrasing so ValidateStatusTransition in
internal/jobs/manager.go is described only as the canonical validator for status
transitions, not for lock ordering or trigger semantics; update the text to
point readers to internal/jobs/manager.go for full state machine transition
rules and to mention that ValidateStatusTransition specifically enforces
status-transition validation (separate concerns like lock ordering and trigger
semantics are documented elsewhere or require separate references).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/architecture/CRAWL_HANDLING.md`:
- Line 23: Replace the non-canonical status string "initialising" with the
runtime canonical "initializing" in the documentation and any literals
referenced (e.g., the table entries at lines mentioning setupJobURLDiscovery)
and update the status transition validator to allow transitions involving the
"paused" state by adding the missing transition rules inside
ValidateStatusTransition (ensure this function now recognizes transitions
to/from "paused" consistent with the documented enforcement); reference
ValidateStatusTransition and internal/jobs/manager.go::setupJobURLDiscovery when
making these edits.

In `@docs/architecture/DATABASE.md`:
- Around line 219-234: The documentation uses the wrong status literal and an
incomplete description of the validator: change the documented status
`initialising` to the runtime `initializing` and update the lifecycle text to
reflect that `ValidateStatusTransition` (in internal/jobs/manager.go) does not
currently model `paused` transitions; either add `paused` transition rules to
the validator map in ValidateStatusTransition or remove/annotate `paused` from
the documented allowed lifecycle list so the docs and runtime match. Ensure
references to status literals (`pending`, `initializing`, `running`, `paused`,
`completed`, `failed`, `cancelled`, `blocked`, `archived`) match exactly between
the doc and the validator.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8d22e38-478a-4eeb-b7fe-1b6985b83000

📥 Commits

Reviewing files that changed from the base of the PR and between 4e98d97 and 506151d.

📒 Files selected for processing (3)

docs/architecture/ARCHITECTURE.md
docs/architecture/CRAWL_HANDLING.md
docs/architecture/DATABASE.md

codecov · 2026-04-28T22:46:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

github-actions · 2026-04-28T22:47:58Z

🐝 Review App Deployed

Homepage: https://hover-pr-370.fly.dev
Dashboard: https://hover-pr-370.fly.dev/dashboard

github-actions · 2026-04-29T03:35:07Z

🐝 Review App Deployed

Homepage: https://hover-pr-370.fly.dev
Dashboard: https://hover-pr-370.fly.dev/dashboard

Document case-based crawl handling

506151d

coderabbitai Bot reviewed Apr 28, 2026

Comment thread docs/architecture/CRAWL_HANDLING.md

Comment thread docs/architecture/DATABASE.md

simonsmallchua added 2 commits April 29, 2026 08:32

Fix stale status references in API and config docs

88acee0

Reference CRAWL_HANDLING from canonical entrypoints

0cb0220

simonsmallchua added 2 commits April 29, 2026 08:52

Tighten doc-pointer comments

2ae50d7

Merge branch 'main' into work/crawl-handling-docs

5677d8e

simonsmallchua merged commit 2ba6e6a into main Apr 30, 2026
18 of 19 checks passed

simonsmallchua deleted the work/crawl-handling-docs branch April 30, 2026 21:03

Document case-based crawl handling #370: simonsmallchua merged 5 commits into main from work/crawl-handling-docs