Document case-based crawl handling#370
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (2)
📝 WalkthroughWalkthroughAdded a new crawl-handling architecture document and wired it into docs; expanded database schema docs to document per-domain pacing and WAF caching fields and an expanded job-status lifecycle; API/config docs updated for task statuses and archival semantics; small in-code comments added referencing the new spec. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 4/5 reviews remaining, refill in 12 minutes. Comment |
|
Updates to Preview Branch (work/crawl-handling-docs) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/architecture/CRAWL_HANDLING.md (1)
11-13: Narrow the “single source of truth” wording.
ValidateStatusTransitionis a status-transition validator; it does not represent lock ordering or trigger semantics. Tightening this sentence will avoid overclaiming scope.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/CRAWL_HANDLING.md` around lines 11 - 13, The sentence overstates scope: change the phrasing so ValidateStatusTransition in internal/jobs/manager.go is described only as the canonical validator for status transitions, not for lock ordering or trigger semantics; update the text to point readers to internal/jobs/manager.go for full state machine transition rules and to mention that ValidateStatusTransition specifically enforces status-transition validation (separate concerns like lock ordering and trigger semantics are documented elsewhere or require separate references).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/architecture/CRAWL_HANDLING.md`:
- Line 23: Replace the non-canonical status string "initialising" with the
runtime canonical "initializing" in the documentation and any literals
referenced (e.g., the table entries at lines mentioning setupJobURLDiscovery)
and update the status transition validator to allow transitions involving the
"paused" state by adding the missing transition rules inside
ValidateStatusTransition (ensure this function now recognizes transitions
to/from "paused" consistent with the documented enforcement); reference
ValidateStatusTransition and internal/jobs/manager.go::setupJobURLDiscovery when
making these edits.
In `@docs/architecture/DATABASE.md`:
- Around line 219-234: The documentation uses the wrong status literal and an
incomplete description of the validator: change the documented status
`initialising` to the runtime `initializing` and update the lifecycle text to
reflect that `ValidateStatusTransition` (in internal/jobs/manager.go) does not
currently model `paused` transitions; either add `paused` transition rules to
the validator map in ValidateStatusTransition or remove/annotate `paused` from
the documented allowed lifecycle list so the docs and runtime match. Ensure
references to status literals (`pending`, `initializing`, `running`, `paused`,
`completed`, `failed`, `cancelled`, `blocked`, `archived`) match exactly between
the doc and the validator.
---
Nitpick comments:
In `@docs/architecture/CRAWL_HANDLING.md`:
- Around line 11-13: The sentence overstates scope: change the phrasing so
ValidateStatusTransition in internal/jobs/manager.go is described only as the
canonical validator for status transitions, not for lock ordering or trigger
semantics; update the text to point readers to internal/jobs/manager.go for full
state machine transition rules and to mention that ValidateStatusTransition
specifically enforces status-transition validation (separate concerns like lock
ordering and trigger semantics are documented elsewhere or require separate
references).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e8d22e38-478a-4eeb-b7fe-1b6985b83000
📒 Files selected for processing (3)
docs/architecture/ARCHITECTURE.mddocs/architecture/CRAWL_HANDLING.mddocs/architecture/DATABASE.md
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
🐝 Review App Deployed Homepage: https://hover-pr-370.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-370.fly.dev |
Summary
Adds a flat case → action table covering how Hover handles each domain
or page condition we encounter. Optimised for skim-reading and
incremental growth: each row is a case, each row tells you what
happens to the job and what happens to the task that surfaced
the case, with a pointer to the source code.
What's in it
docs/architecture/CRAWL_HANDLING.md(new) — three tables:domainstable columns.DATABASE.mdupdates: complete (and accurate) job-status list, newdomainscolumns documented, cross-link toCRAWL_HANDLING.md.ARCHITECTURE.mdupdate: Task Lifecycle section points to the new doc.Why
The job-status list in
DATABASE.mdwas stale (5 statuses listed; 9actually exist in code), and
domains.waf_blocked/waf_vendor/waf_blocked_atweren't documented anywhere. As we add more cases(row 4 of #365 will introduce
failure_class, row 2 will add Shopify-specific handling), there needs to be a single skim-able place to
answer "what happens when X?". Tables grow well; long-form prose
doesn't.
Note on dependency
A few rows reference behaviour shipping in #368 (the
EnqueueURLsterminal-status guard, the lowered breaker default of 2). If #368
merges first, the doc is accurate on day one; if this lands first the
doc is slightly aspirational on those rows for the duration. No code
changes here, so either order is safe.
Test plan
DATABASE.mdandARCHITECTURE.mdresolve.Summary by CodeRabbit