feat(web-sources): crawl websites into an assistant's knowledge base by philmerrell · Pull Request #378 · Boise-State-Development/agentcore-public-stack

philmerrell · 2026-05-23T15:28:17Z

Summary

Adds an "Add web content" flow in the assistant editor that lets users attach a single web page or a bounded crawl of a site to the assistant's knowledge base. Discovered pages flow through the existing S3 → ingestion Lambda → chunking/embedding pipeline — no new infra.

New backend package `apis/app_api/web_sources/` (BFS crawler, repository, routes, URL/SSRF helpers).
New SPA `WebSourceDialogComponent` with single-page and crawl modes (depth, max pages, concurrency, delay sliders); editor wires it as a sibling to the connector buttons.
Two new pinned deps: `beautifulsoup4==4.13.5`, `trafilatura==2.0.0`.
Drive-by UX fix on the existing docs list: delete is now optimistic (row removes instantly, rolls back on failure) and works on stuck-uploading docs.
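
A minimal sketch of what "bounded crawl" means here, assuming an in-memory link graph instead of real fetches. `bounded_bfs`, `get_links`, and the `SITE` map are hypothetical names for illustration, not the PR's actual crawler API; robots.txt, SSRF checks, and concurrency are deliberately left out.

```python
from collections import deque
from urllib.parse import urlsplit


def bounded_bfs(root, get_links, max_depth=2, max_pages=10):
    """Return same-host URLs reachable from root in BFS order, honouring both bounds."""
    root_host = urlsplit(root).hostname
    visited = [root]
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth bound are kept but not expanded
        for link in get_links(url):
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            if link in seen or urlsplit(link).hostname != root_host:
                continue  # skip duplicates and off-domain links
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for fetched pages.
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.test/x",
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": ["https://example.com/d"],
}

pages = bounded_bfs("https://example.com/", lambda u: SITE.get(u, []), max_depth=2, max_pages=10)
```

With depth 2, `/d` is never reached because `/c` sits at the depth bound, and the off-domain link is dropped by the host comparison.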

Behind the scenes

Crawler respects robots.txt, stays same-domain, applies per-host jitter and bounded concurrency, SSRF-guards every URL, enforces a 5 MB per-page cap and a 15-minute crawl budget, and always finalizes in a `finally` block.
`CrawlJob` rows live in the assistants table at `SK=CRAWL#{crawl_id}`, get a 30-day TTL on terminal status, and cascade-delete when the last web doc for that root URL is removed.
`list_active_crawls` self-heals — a `running` row older than 20 minutes (the crawler's budget + 5 min buffer) is auto-finalized, so a crashed process can't leave the SPA in perma-poll.
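Strong refs are held to both the crawler's worker tasks and the route-level background task, because asyncio's event loop keeps only weak references to tasks and an unreferenced one can be garbage-collected mid-execution.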
Floats in `CrawlSettings` (min/maxDelay) are coerced to `Decimal` before DynamoDB writes — boto3 rejects bare `float`.
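
The float coercion can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: `to_dynamo` is a hypothetical helper, and of the keys shown only `minDelay` and `maxDelay` are named in this description (`maxDepth` and `maxPages` are made up for the example).

```python
from decimal import Decimal


def to_dynamo(value):
    """Recursively replace floats with Decimal; leave every other type untouched."""
    if isinstance(value, float):
        # Go through str() so 0.1 becomes Decimal("0.1") rather than the
        # long binary expansion that Decimal(0.1) would carry.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamo(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_dynamo(item) for item in value]
    return value


# Hypothetical settings payload; only minDelay/maxDelay are floats.
settings = {"maxDepth": 2, "maxPages": 10, "minDelay": 0.5, "maxDelay": 1.5}
item = to_dynamo(settings)
```

Integers pass through unchanged, so only the delay fields end up as `Decimal` in the item handed to `put_item`.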

Decisions worth flagging

HTTP-only for v1. Static HTML → markdown via trafilatura (BS4 fallback). JavaScript-only sites render as empty pages; called out in the modal. Browser rendering via AgentCore browser sessions is a follow-up.
Submit-and-watch UX rather than discover-then-pick — the modal closes on Start; pages stream into the docs list as they're ingested via an incremental discovery merge (no list-wide refresh).

Test plan

`cd backend && uv sync --extra agentcore --extra dev && uv run pytest tests/apis/app_api/web_sources/ -v` — 60 tests
`cd backend && uv run pytest tests/architecture/ tests/routes/test_documents.py` — boundary + adjacent docs routes still green
`cd frontend/ai.client && npx tsc --noEmit` — clean
Manual: single-page import of a Wikipedia article → doc reaches `complete` and is citable
Manual: small-site crawl with depth 2 / max 10 → pages stream into the list, "Crawling…" badge clears on finish
Manual: 404 URL → root doc transitions to `failed` with a readable error
Manual: SSRF guard rejects `http://127.0.0.1/...` with 422 in the modal
Manual: stale `running` CrawlJob (from a crashed process before this PR) gets reaped on next watcher tick and polling stops
Manual: delete a complete web doc → row disappears immediately, S3 markdown + vectors cleaned up, orphan CrawlJob row goes too
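
For reviewers reproducing the SSRF check by hand, a guard of this kind can be sketched with the standard library alone. `check_url`, `is_public_address`, and the injectable `resolve` parameter are hypothetical names for illustration; the helpers in `url_utils` may be structured differently, and mapping the rejection to a 422 is left to the route.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip: str) -> bool:
    """True only for addresses that are safe to fetch from a server-side crawler."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def check_url(url: str, resolve=None) -> None:
    """Raise ValueError if the URL must not be fetched."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("only http(s) URLs with a host are allowed")
    if resolve is None:
        # Default resolver: every address the host name maps to.
        def resolve(host):
            return [info[4][0] for info in socket.getaddrinfo(host, None)]
    for ip in resolve(parts.hostname):
        if not is_public_address(ip):
            raise ValueError(f"{parts.hostname} resolves to a non-public address")


def rejected(url: str, resolve=None) -> bool:
    """Small helper for trying URLs: True when check_url refuses the URL."""
    try:
        check_url(url, resolve)
    except ValueError:
        return True
    return False
```

Checking every resolved address, not just the literal host string, is what catches a public-looking name that points at loopback or a metadata endpoint. A check done only before the request can still be bypassed if the name resolves differently at connect time, so this sketch is a first line of defence rather than a complete one.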

Out of scope (follow-ups)

JS-rendering via AgentCore browser (separate plumbing into app-api).
Cancel-crawl button (crawler would need to poll its own status between fetches).
Sitemap.xml-driven crawls and scheduled re-crawls.

🤖 Generated with Claude Code

Adds an "Add web content" flow alongside the existing connector imports in the assistant editor. Single-page mode (default) and bounded BFS crawl mode share one dialog; the backend writes extracted markdown to the documents bucket so the existing S3-event ingestion Lambda chunks and embeds it exactly as a device upload would.

Backend
- New `apis/app_api/web_sources/` package (models, routes, crawler, repo, url_utils). Endpoints under `/assistants/{id}/web-sources/`: `POST /crawl`, `GET /crawls?active=true`, `GET /crawls/{id}`. Uses `get_current_user_from_session` per the auth-dependency rule.
- BFS crawler: per-host jitter, bounded concurrency, robots.txt-respecting, same-domain, SSRF-guarded, 5 MB per-page cap, 15-minute crawl budget, always-finalize-on-exit. trafilatura → markdown with BS4 fallback.
- `CrawlJob` rows persisted in the assistants table via the adjacency-list pattern (`SK=CRAWL#{crawl_id}`). Floats coerced to `Decimal` before put_item (DynamoDB rejects bare floats). Terminal rows get a 30-day TTL and cascade-delete when the last web doc for that root is removed.
- Cleanup cascade: `cleanup_document_resources` now reaps orphaned terminal `CrawlJob` rows after deleting a web doc.
- Self-heal: `list_active_crawls` auto-finalizes any `running` row older than 20 minutes (mirrors the stale-doc auto-fail pattern), so a crashed process can't leave the SPA in perma-poll.
- Crawler holds strong refs to worker tasks; the route holds a module-level set of in-flight crawl tasks (Python's weak task tracking would otherwise GC them mid-execution).
- New deps: beautifulsoup4 4.13.5, trafilatura 2.0.0.

Frontend
- New `WebSourceDialogComponent`: URL input + "Crawl linked pages" toggle revealing depth / max-pages / concurrency / delay sliders. Submit-and-watch UX: modal closes on Start, pages appear in the docs list as they're ingested. Style tokens match the file-source dialog.
- `WebSourceService` thin client for the three endpoints.
- Editor wiring: "Add web content" button next to the connector buttons, with an inline "Crawling…" badge while a crawl is in flight.
- Crawl watcher polls `/web-sources/crawls?active=true` every 5 s; new pages surface via an incremental discovery merge, with no list-wide refresh.
- Document delete: optimistic UI removes the row immediately and rolls back on failure (no more wait-then-disappear). Stale-uploading docs can now be deleted regardless of polling state.

Tests
- `tests/apis/app_api/web_sources/` (60 tests): URL normalization, same-domain, SSRF guard, BFS bounds, robots, per-page failure handling, crawl finalization, route 202/404/422/401, CrawlJob put/get/list round trip with float delays, stale-row reaper, cascade cleanup, TTL on finalize.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

philmerrell merged commit 2c38475 into develop May 23, 2026
39 checks passed

philmerrell deleted the feature/web-source-crawl branch May 23, 2026 18:25

This was referenced May 23, 2026

refactor(assistant-editor): inline knowledge-base row + connector skeletons #379

Merged

feat(assistants): ground consumer chat in knowledge base only #382

Merged

Feature: Web Crawling Engine for Assistant Knowledge Bases #115

Open

philmerrell merged 1 commit into develop from feature/web-source-crawl
