Description
Problem
When a user submits a URL for indexing, only that single page is fetched and indexed. If the page links to related documentation (e.g. a docs site sidebar, a wiki with interlinked pages, an API reference with subpages), the user must manually submit each URL individually.
This makes it impractical to index an entire documentation site or knowledge base section.
Proposed Feature
Add opt-in spidering when submitting a URL, with strict limits to prevent runaway crawling.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `spider` | boolean | `false` | Enable link following |
| `maxPages` | number | `25` | Hard cap on total pages indexed per spider run |
| `maxDepth` | number | `2` | How many hops from the seed URL (0 = seed only) |
| `sameDomain` | boolean | `true` | Only follow links on the same domain/subdomain |
| `pathPrefix` | string | `null` | Only follow links matching this URL path prefix (e.g. `/docs/`) |
| `excludePatterns` | string[] | `[]` | Glob patterns for URLs to skip (e.g. `*/changelog*`, `*/api/v1/*`) |
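The table above maps directly onto the `SpiderOptions` interface proposed below. As a minimal sketch (field names and defaults taken from the table; everything else is an assumption):

```typescript
// Sketch of the proposed options interface; names and defaults mirror the
// parameter table above. This is illustrative, not the final shape.
interface SpiderOptions {
  spider: boolean;           // enable link following
  maxPages: number;          // hard cap on total pages indexed per spider run
  maxDepth: number;          // hops from the seed URL (0 = seed only)
  sameDomain: boolean;       // only follow links on the same domain/subdomain
  pathPrefix: string | null; // only follow links matching this path prefix
  excludePatterns: string[]; // glob patterns for URLs to skip
}

const DEFAULT_SPIDER_OPTIONS: SpiderOptions = {
  spider: false,
  maxPages: 25,
  maxDepth: 2,
  sameDomain: true,
  pathPrefix: null,
  excludePatterns: [],
};
```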
API Surface
REST: `POST /api/v1/documents/url`

```json
{
  "url": "https://docs.example.com/getting-started",
  "spider": true,
  "maxPages": 50,
  "maxDepth": 2,
  "sameDomain": true,
  "pathPrefix": "/docs/"
}
```
MCP Tool: `submit-document` with `url` + spider options
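As a sketch, the MCP tool invocation could look like the following (the `submit-document` tool name is from this issue; the argument shape is assumed to mirror the REST body):

```json
{
  "name": "submit-document",
  "arguments": {
    "url": "https://docs.example.com/getting-started",
    "spider": true,
    "maxPages": 50,
    "maxDepth": 2,
    "pathPrefix": "/docs/"
  }
}
```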
Implementation Plan
Phase 1: Link Extraction (src/core/link-extractor.ts — new file)
- Create `extractLinks(html: string, baseUrl: string): string[]`
- Parse the fetched HTML (before markdown conversion) for `<a href="...">` tags
- Resolve relative URLs against the base URL
- Deduplicate and normalize (strip fragments, trailing slashes)
- Filter out non-http(s) schemes: mailto:, tel:, javascript:, etc.
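A minimal sketch of `extractLinks` along these lines, using a regex for brevity (a production implementation would more likely use a proper HTML parser; the function name and signature are from the plan above):

```typescript
// Illustrative link extractor: resolves, normalizes, filters, deduplicates.
// Regex-based parsing is a simplification for the sketch.
function extractLinks(html: string, baseUrl: string): string[] {
  const links = new Set<string>(); // Set gives deduplication for free
  const hrefRe = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;
  let m: RegExpExecArray | null;
  while ((m = hrefRe.exec(html)) !== null) {
    let resolved: URL;
    try {
      resolved = new URL(m[1], baseUrl); // resolve relative URLs against the base
    } catch {
      continue; // skip unparseable hrefs
    }
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      continue; // filter out mailto:, tel:, javascript:, etc.
    }
    resolved.hash = ""; // strip fragments
    let normalized = resolved.toString();
    if (normalized.endsWith("/")) normalized = normalized.slice(0, -1); // strip trailing slash
    links.add(normalized);
  }
  return [...links];
}
```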
Phase 2: Spider Engine (src/core/spider.ts — new file)
- Create a `SpiderOptions` interface with all parameters above
- Create `spiderUrl(seedUrl: string, options: SpiderOptions): AsyncGenerator<SpiderResult>`
- BFS crawl using a queue: `{ url: string, depth: number }[]`
- Track visited URLs in a `Set<string>` to avoid cycles
- For each page:
  a. Run existing SSRF validation (`validateUrl` + `isPrivateIP`) on every URL before fetching
  b. Fetch with `fetchAndConvert()` from url-fetcher.ts
  c. Extract links from the raw HTML (need to capture the HTML before markdown conversion)
  d. Filter links by `sameDomain`, `pathPrefix`, `excludePatterns`
  e. Enqueue unvisited links if `depth < maxDepth` and `visited.size < maxPages`
  f. Yield each `{ url, title, content, depth }` as it is fetched
- Respect rate limiting: add a configurable delay between requests (default 1 second)
- Abort cleanly if maxPages is reached
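The BFS loop above could be sketched as follows. The fetcher and link extractor are injected so the loop can be unit tested without network access; SSRF validation, rate limiting, robots.txt, and link filtering are elided, and the result carries raw HTML rather than the `{ url, title, content, depth }` shape the plan specifies:

```typescript
// Simplified BFS crawl sketch under the assumptions stated above.
interface SpiderResult { url: string; depth: number; html: string; }

type Fetcher = (url: string) => Promise<string>; // returns raw HTML
type LinkExtractor = (html: string, baseUrl: string) => string[];

async function* spiderUrl(
  seedUrl: string,
  opts: { maxPages: number; maxDepth: number },
  fetchPage: Fetcher,
  extractLinks: LinkExtractor,
): AsyncGenerator<SpiderResult> {
  const queue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }];
  const visited = new Set<string>([seedUrl]); // cycle detection
  while (queue.length > 0) {
    const { url, depth } = queue.shift()!; // FIFO queue => breadth-first order
    let html: string;
    try {
      html = await fetchPage(url);
    } catch {
      continue; // individual page failures do not abort the spider
    }
    yield { url, depth, html };
    if (depth >= opts.maxDepth) continue; // do not follow links past maxDepth
    for (const link of extractLinks(html, url)) {
      if (visited.has(link)) continue;
      if (visited.size >= opts.maxPages) break; // stop enqueuing at the page budget
      visited.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
}
```

Because `visited` counts every URL ever enqueued, it doubles as the page budget: at most `maxPages` pages are ever fetched.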
Phase 3: Integration with Existing Code
- Modify `url-fetcher.ts`:
  - Export the raw HTML before markdown conversion (currently it converts inline)
  - Add a `fetchRaw(url: string): Promise<{ html: string, finalUrl: string }>` alongside the existing `fetchAndConvert()`
  - Or refactor `fetchAndConvert` to optionally return raw HTML
- Modify `src/api/routes.ts`:
  - Update `POST /api/v1/documents/url` to accept spider options
  - When `spider: true`, use the spider engine instead of a single fetch
  - Index each yielded page as a separate document
  - Return `{ documents: [...], pagesIndexed: number, pagesCrawled: number }`
- Modify `src/mcp/server.ts`:
  - Add spider parameters to the `submit-document` tool schema
  - Wire through to the spider engine
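On the routes side, consuming the spider's async generator and assembling the proposed response shape might look like this (the `indexSpiderRun` helper is hypothetical, and the per-page indexing call is elided; in this simplified sketch every crawled page is indexed, so the two counts coincide):

```typescript
// Hypothetical helper for the route handler: drain the spider generator,
// index each page as a separate document, and build the response payload.
interface SpiderRunSummary {
  documents: { url: string; title: string }[];
  pagesIndexed: number;
  pagesCrawled: number;
}

async function indexSpiderRun(
  pages: AsyncIterable<{ url: string; title: string; content: string }>,
): Promise<SpiderRunSummary> {
  const documents: { url: string; title: string }[] = [];
  let pagesCrawled = 0;
  for await (const page of pages) {
    pagesCrawled++;
    // ...index `page` as a separate document here (indexing call elided)...
    documents.push({ url: page.url, title: page.title });
  }
  return { documents, pagesIndexed: documents.length, pagesCrawled };
}
```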
Phase 4: Safety & Observability
- Hard limits (non-overridable):
  - Absolute max pages: 200 (even if the user requests more)
  - Absolute max depth: 5
  - Request timeout per page: 30s
  - Total spider timeout: 10 minutes
  - Respect robots.txt (parse and honor Disallow rules)
- Logging:
  - Log each page fetched at INFO level
  - Log skipped URLs (filtered, already visited, blocked by robots.txt) at DEBUG
  - Log a summary at completion: pages indexed, pages skipped, time elapsed
- Error handling:
  - Individual page fetch failures should NOT abort the spider
  - Log the error, skip the page, continue crawling
  - Return partial results with an error count in the response
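Enforcing the non-overridable caps is a simple clamp applied to whatever the user submits. The limit values below come from the list above; the helper name is hypothetical:

```typescript
// Clamp user-supplied options to the absolute, non-overridable hard limits.
const ABSOLUTE_MAX_PAGES = 200;
const ABSOLUTE_MAX_DEPTH = 5;

function clampSpiderLimits(opts: { maxPages: number; maxDepth: number }) {
  return {
    maxPages: Math.min(Math.max(opts.maxPages, 1), ABSOLUTE_MAX_PAGES),
    maxDepth: Math.min(Math.max(opts.maxDepth, 0), ABSOLUTE_MAX_DEPTH),
  };
}
```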
Phase 5: Tests (tests/unit/spider.test.ts — new file)
- Unit tests for link extraction (relative URLs, fragments, dedup, scheme filtering)
- Unit tests for spider engine with mocked fetcher (BFS order, depth limit, page limit, domain filtering, path prefix, exclude patterns, cycle detection)
- Integration test: seed URL with 3 levels of links, verify correct pages indexed
- Safety tests: maxPages enforced, maxDepth enforced, private IP links skipped, robots.txt respected
Files to Create/Modify
| File | Action |
|---|---|
| `src/core/link-extractor.ts` | New — HTML link extraction |
| `src/core/spider.ts` | New — Spider engine |
| `src/core/url-fetcher.ts` | Modify — export raw HTML option |
| `src/api/routes.ts` | Modify — accept spider params on POST /documents/url |
| `src/mcp/server.ts` | Modify — add spider params to submit-document tool |
| `src/api/openapi.ts` | Modify — document spider parameters |
| `tests/unit/spider.test.ts` | New — spider tests |
| `tests/unit/link-extractor.test.ts` | New — link extraction tests |
Complexity Estimate
Medium-large. Core spider engine ~200 lines, link extractor ~80 lines, integration ~100 lines across routes/MCP, tests ~300 lines: roughly 700 lines of new code in total.