feat(web-sources): crawl websites into an assistant's knowledge base #378
Merged
Adds an "Add web content" flow alongside the existing connector imports in
the assistant editor. Single-page mode (default) and bounded BFS crawl mode
share one dialog; the backend writes extracted markdown to the documents
bucket so the existing S3-event ingestion Lambda chunks and embeds it
exactly as a device upload would.
Backend
- New `apis/app_api/web_sources/` package (models, routes, crawler, repo,
url_utils). Endpoints under `/assistants/{id}/web-sources/`: `POST /crawl`,
`GET /crawls?active=true`, `GET /crawls/{id}`. Uses
`get_current_user_from_session` per the auth-dependency rule.
- BFS crawler: per-host jitter, bounded concurrency, robots.txt-respecting,
same-domain, SSRF-guarded, 5 MB per-page cap, 15-minute crawl budget,
always-finalize-on-exit. trafilatura → markdown with BS4 fallback.
- `CrawlJob` rows persisted in the assistants table via the adjacency-list
pattern (`SK=CRAWL#{crawl_id}`). Floats coerced to `Decimal` before
`put_item` (boto3's DynamoDB serializer rejects Python floats). Terminal rows get a 30-day TTL
and cascade-delete when the last web doc for that root is removed.
- Cleanup cascade: `cleanup_document_resources` now reaps orphaned terminal
`CrawlJob` rows after deleting a web doc.
- Self-heal: `list_active_crawls` auto-finalizes any `running` row older
than 20 minutes (mirrors the stale-doc auto-fail pattern), so a crashed
process can't leave the SPA in perma-poll.
- Crawler holds strong refs to worker tasks; the route holds a module-level
set of in-flight crawl tasks (the asyncio event loop keeps only weak
references to tasks, so they could otherwise be garbage-collected mid-execution).
- New deps: beautifulsoup4 4.13.5, trafilatura 2.0.0.
Frontend
- New `WebSourceDialogComponent`: URL input + "Crawl linked pages" toggle
revealing depth / max-pages / concurrency / delay sliders. Submit-and-watch
UX — modal closes on Start, pages appear in the docs list as they're
ingested. Style tokens match the file-source dialog.
- `WebSourceService` thin client for the three endpoints.
- Editor wiring: "Add web content" button next to the connector buttons,
with an inline "Crawling…" badge while a crawl is in flight.
- Crawl watcher polls `/web-sources/crawls?active=true` every 5 s; new
pages surface via an incremental discovery merge — no list-wide refresh.
- Document delete: optimistic UI removes the row immediately and rolls
back on failure (no more wait-then-disappear). Stale-uploading docs can
now be deleted regardless of polling state.
Tests
- `tests/apis/app_api/web_sources/` (60 tests): URL normalization,
same-domain, SSRF guard, BFS bounds, robots, per-page failure handling,
crawl finalization, route 202/404/422/401, CrawlJob put/get/list round
trip with float delays, stale-row reaper, cascade cleanup, TTL on
finalize.
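The float-delay round trip depends on coercing floats before `put_item`. A minimal sketch of such a coercion, assuming a recursive helper (the name `to_dynamo` and the field names in the usage note are hypothetical, not taken from the PR):

```python
from decimal import Decimal
from typing import Any


def to_dynamo(value: Any) -> Any:
    """Recursively replace Python floats with Decimal for DynamoDB writes."""
    if isinstance(value, float):
        # Go through str() so 0.1 becomes Decimal("0.1"), not its binary expansion.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamo(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo(item) for item in value]
    return value
```

For example, `to_dynamo({"delay_seconds": 0.5, "max_pages": 25})` leaves the integer alone and converts only the delay. This sketch does not handle NaN or infinity, which DynamoDB numbers cannot represent.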
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 23, 2026
Summary
Adds an "Add web content" flow in the assistant editor that lets users attach a single web page or a bounded crawl of a site to the assistant's knowledge base. Discovered pages flow through the existing S3 → ingestion Lambda → chunking/embedding pipeline — no new infra.
Behind the scenes
Decisions worth flagging
Test plan
Out of scope (follow-ups)
🤖 Generated with Claude Code