feat(web-sources): crawl websites into an assistant's knowledge base #378

Merged: philmerrell merged 1 commit into develop from feature/web-source-crawl on May 23, 2026

@philmerrell (Contributor)

Summary

Adds an "Add web content" flow in the assistant editor that lets users attach a single web page or a bounded crawl of a site to the assistant's knowledge base. Discovered pages flow through the existing S3 → ingestion Lambda → chunking/embedding pipeline — no new infra.

  • New backend package `apis/app_api/web_sources/` (BFS crawler, repository, routes, URL/SSRF helpers).
  • New SPA `WebSourceDialogComponent` with single-page and crawl modes (depth, max pages, concurrency, delay sliders); editor wires it as a sibling to the connector buttons.
  • Two new pinned deps: `beautifulsoup4==4.13.5`, `trafilatura==2.0.0`.
  • Drive-by UX fix on the existing docs list: delete is now optimistic (row removes instantly, rolls back on failure) and works on stuck-uploading docs.

Behind the scenes

  • Crawler respects robots.txt, stays same-domain, applies per-host jitter and bounded concurrency, SSRF-guards every URL, enforces a 5 MB per-page cap and a 15-minute crawl budget, and always finalizes in a `finally` block.
  • `CrawlJob` rows live in the assistants table at `SK=CRAWL#{crawl_id}`, get a 30-day TTL on terminal status, and cascade-delete when the last web doc for that root URL is removed.
  • `list_active_crawls` self-heals — a `running` row older than 20 minutes (the crawler's budget + 5 min buffer) is auto-finalized, so a crashed process can't leave the SPA in perma-poll.
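  • Crawler workers and the route-level background task are both held by strong references; the asyncio event loop keeps only weak references to tasks, so an otherwise unreferenced task could be garbage-collected mid-execution.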
  • Floats in `CrawlSettings` (min/maxDelay) are coerced to `Decimal` before DynamoDB writes — boto3 rejects bare `float`.
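
To illustrate the float coercion in the last bullet: boto3's DynamoDB serializer raises a `TypeError` on Python `float` values and asks for `Decimal` instead. A minimal sketch of a recursive coercion helper follows; `to_dynamo` and the `settings` field names are hypothetical and not taken from this PR.

```python
from decimal import Decimal
from typing import Any


def to_dynamo(value: Any) -> Any:
    """Recursively replace floats with Decimal so boto3 can serialize the item."""
    if isinstance(value, float):
        # Convert via str() so 0.1 becomes Decimal("0.1") instead of the long
        # binary expansion that Decimal(0.1) would carry over.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_dynamo(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo(v) for v in value]
    return value


# Hypothetical settings payload; the key names are illustrative only.
settings = {"maxDepth": 2, "minDelay": 0.5, "maxDelay": 1.5}
item = to_dynamo(settings)
```

Integers and strings pass through unchanged, so the helper can be applied to a whole item just before the write.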

Decisions worth flagging

  • HTTP-only for v1. Static HTML → markdown via trafilatura (BS4 fallback). JavaScript-only sites render as empty pages; called out in the modal. Browser rendering via AgentCore browser sessions is a follow-up.
  • Submit-and-watch UX rather than discover-then-pick — the modal closes on Start; pages stream into the docs list as they're ingested via an incremental discovery merge (no list-wide refresh).

Test plan

  • `cd backend && uv sync --extra agentcore --extra dev && uv run pytest tests/apis/app_api/web_sources/ -v` — 60 tests
  • `cd backend && uv run pytest tests/architecture/ tests/routes/test_documents.py` — boundary + adjacent docs routes still green
  • `cd frontend/ai.client && npx tsc --noEmit` — clean
  • Manual: single-page import of a Wikipedia article → doc reaches `complete` and is citable
  • Manual: small-site crawl with depth 2 / max 10 → pages stream into the list, "Crawling…" badge clears on finish
  • Manual: 404 URL → root doc transitions to `failed` with a readable error
  • Manual: SSRF guard rejects `http://127.0.0.1/...` with 422 in the modal
  • Manual: stale `running` CrawlJob (from a crashed process before this PR) gets reaped on next watcher tick and polling stops
  • Manual: delete a complete web doc → row disappears immediately, S3 markdown + vectors cleaned up, orphan CrawlJob row goes too
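
For reviewers reproducing the SSRF check from the test plan, the idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the code in `url_utils`: `is_blocked_ip`, `resolve_host`, and `check_url` are hypothetical names, and the resolver is injectable so the example runs without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_ip(addr: str) -> bool:
    """True for loopback, private, link-local and other non-public addresses."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before classifying.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def resolve_host(host: str) -> list[str]:
    """Default resolver: every address the hostname maps to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def check_url(url: str, resolver=resolve_host) -> None:
    """Raise ValueError unless the URL is http(s) and points at public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("only http(s) URLs with a host are allowed")
    try:
        # Literal IP in the URL: no DNS lookup needed.
        addrs = [str(ipaddress.ip_address(parts.hostname))]
    except ValueError:
        addrs = resolver(parts.hostname)
    if not addrs or any(is_blocked_ip(a) for a in addrs):
        raise ValueError("URL resolves to a non-public address")


# Passes silently: the stub resolver returns a public address.
check_url("https://example.com/docs", resolver=lambda host: ["93.184.216.34"])
```

A check like this only validates the address seen at check time; a production guard would also need to apply it to redirects and to the address actually connected to.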

Out of scope (follow-ups)

  • JS-rendering via AgentCore browser (separate plumbing into app-api).
  • Cancel-crawl button (crawler would need to poll its own status between fetches).
  • Sitemap.xml-driven crawls and scheduled re-crawls.

🤖 Generated with Claude Code

Adds an "Add web content" flow alongside the existing connector imports in
the assistant editor. Single-page mode (default) and bounded BFS crawl mode
share one dialog; the backend writes extracted markdown to the documents
bucket so the existing S3-event ingestion Lambda chunks and embeds it
exactly as a device upload would.

Backend
- New `apis/app_api/web_sources/` package (models, routes, crawler, repo,
  url_utils). Endpoints under `/assistants/{id}/web-sources/`: `POST /crawl`,
  `GET /crawls?active=true`, `GET /crawls/{id}`. Uses
  `get_current_user_from_session` per the auth-dependency rule.
- BFS crawler: per-host jitter, bounded concurrency, robots.txt-respecting,
  same-domain, SSRF-guarded, 5 MB per-page cap, 15-minute crawl budget,
  always-finalize-on-exit. trafilatura → markdown with BS4 fallback.
- `CrawlJob` rows persisted in the assistants table via the adjacency-list
  pattern (`SK=CRAWL#{crawl_id}`). Floats coerced to `Decimal` before
  put_item (DynamoDB rejects bare floats). Terminal rows get a 30-day TTL
  and cascade-delete when the last web doc for that root is removed.
- Cleanup cascade: `cleanup_document_resources` now reaps orphaned terminal
  `CrawlJob` rows after deleting a web doc.
- Self-heal: `list_active_crawls` auto-finalizes any `running` row older
  than 20 minutes (mirrors the stale-doc auto-fail pattern), so a crashed
  process can't leave the SPA in perma-poll.
- Crawler holds strong refs to worker tasks; the route holds a module-level
  set of in-flight crawl tasks (Python's weak task tracking would otherwise
  GC them mid-execution).
- New deps: beautifulsoup4 4.13.5, trafilatura 2.0.0.

Frontend
- New `WebSourceDialogComponent`: URL input + "Crawl linked pages" toggle
  revealing depth / max-pages / concurrency / delay sliders. Submit-and-watch
  UX — modal closes on Start, pages appear in the docs list as they're
  ingested. Style tokens match the file-source dialog.
- `WebSourceService` thin client for the three endpoints.
- Editor wiring: "Add web content" button next to the connector buttons,
  with an inline "Crawling…" badge while a crawl is in flight.
- Crawl watcher polls `/web-sources/crawls?active=true` every 5 s; new
  pages surface via an incremental discovery merge — no list-wide refresh.
- Document delete: optimistic UI removes the row immediately and rolls
  back on failure (no more wait-then-disappear). Stale-uploading docs can
  now be deleted regardless of polling state.

Tests
- `tests/apis/app_api/web_sources/` (60 tests): URL normalization,
  same-domain, SSRF guard, BFS bounds, robots, per-page failure handling,
  crawl finalization, route 202/404/422/401, CrawlJob put/get/list round
  trip with float delays, stale-row reaper, cascade cleanup, TTL on
  finalize.
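
The stale-row reaper covered by these tests can be pictured as a pure function over job rows. This is a sketch under assumptions: the field names (`status`, `started_at`, `error`) and the choice of `failed` as the finalized status are illustrative, and the real `list_active_crawls` reads from and writes to DynamoDB rather than a list.

```python
from datetime import datetime, timedelta, timezone

CRAWL_BUDGET = timedelta(minutes=15)
STALE_AFTER = CRAWL_BUDGET + timedelta(minutes=5)  # 20-minute cutoff


def reap_stale(jobs: list[dict], now: datetime) -> list[dict]:
    """Return the jobs still active, finalizing stale `running` rows in place."""
    active = []
    for job in jobs:
        if job["status"] != "running":
            continue  # already terminal, never reported as active
        if now - job["started_at"] > STALE_AFTER:
            # Assumed terminal status; the crawler process is presumed dead.
            job["status"] = "failed"
            job["error"] = "crawl exceeded its time budget"
            continue
        active.append(job)
    return active


now = datetime(2026, 5, 23, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": "fresh", "status": "running", "started_at": now - timedelta(minutes=5)},
    {"id": "stale", "status": "running", "started_at": now - timedelta(minutes=25)},
    {"id": "done", "status": "complete", "started_at": now - timedelta(minutes=40)},
]
active = reap_stale(jobs, now)
```

Here only the `fresh` job remains active, so a poller driven by this result would stop once no live crawl is left.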

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@philmerrell merged commit 2c38475 into develop on May 23, 2026
39 checks passed
@philmerrell deleted the feature/web-source-crawl branch on May 23, 2026 18:25