feat: URL spidering — crawl linked pages with configurable depth and page limits #315

@RobertLD

Description

Problem

When a user submits a URL for indexing, only that single page is fetched and indexed. If the page links to related documentation (e.g. a docs site sidebar, a wiki with interlinked pages, an API reference with subpages), the user must submit each linked URL individually.

This makes it impractical to index an entire documentation site or knowledge base section.

Proposed Feature

Add opt-in spidering when submitting a URL, with strict limits to prevent runaway crawling.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `spider` | `boolean` | `false` | Enable link following |
| `maxPages` | `number` | `25` | Hard cap on total pages indexed per spider run |
| `maxDepth` | `number` | `2` | How many hops from the seed URL (`0` = seed only) |
| `sameDomain` | `boolean` | `true` | Only follow links on the same domain/subdomain |
| `pathPrefix` | `string` | `null` | Only follow links matching this URL path prefix (e.g. `/docs/`) |
| `excludePatterns` | `string[]` | `[]` | Glob patterns for URLs to skip (e.g. `*/changelog*`, `*/api/v1/*`) |
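For illustration, `excludePatterns` matching could be as simple as translating each glob into a regular expression. This is a sketch supporting only the `*` wildcard; a real implementation might use a glob library such as minimatch, and the function name here is hypothetical:

```typescript
// Escape regex metacharacters so everything except "*" is matched literally.
const escapeRegExp = (s: string): string =>
  s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Sketch: translate a glob like "*/changelog*" into /^.*\/changelog.*$/
// and test the full URL against it. Only "*" is supported.
export function matchesExcludePattern(url: string, pattern: string): boolean {
  const re = new RegExp(
    "^" + pattern.split("*").map(escapeRegExp).join(".*") + "$",
  );
  return re.test(url);
}
```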

API Surface

REST:

```
POST /api/v1/documents/url
{
  "url": "https://docs.example.com/getting-started",
  "spider": true,
  "maxPages": 50,
  "maxDepth": 2,
  "sameDomain": true,
  "pathPrefix": "/docs/"
}
```

MCP Tool:

submit-document with url + spider options
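The request options and their defaults from the table above could be normalized with a small helper. A sketch only; the `SpiderParams` and `withSpiderDefaults` names are hypothetical, not existing project API:

```typescript
// Spider options as submitted by the caller; all fields optional,
// mirroring the parameter table above.
export interface SpiderParams {
  spider?: boolean;
  maxPages?: number;
  maxDepth?: number;
  sameDomain?: boolean;
  pathPrefix?: string | null;
  excludePatterns?: string[];
}

// Fill in the documented defaults; caller-supplied values override them.
// (In this simple sketch an explicitly-undefined field would also
// override a default; a real implementation would filter those out.)
export function withSpiderDefaults(params: SpiderParams): Required<SpiderParams> {
  return {
    spider: false,
    maxPages: 25,
    maxDepth: 2,
    sameDomain: true,
    pathPrefix: null,
    excludePatterns: [],
    ...params,
  };
}
```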

Implementation Plan

Phase 1: Link Extraction (src/core/link-extractor.ts — new file)

  1. Create extractLinks(html: string, baseUrl: string): string[]
    • Parse the fetched HTML (before markdown conversion) for <a href="..."> tags
    • Resolve relative URLs against the base URL
    • Deduplicate and normalize (strip fragments, trailing slashes)
    • Filter out non-http(s) schemes, mailto:, tel:, javascript:, etc.
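A minimal sketch of `extractLinks` along these lines, using a regex purely for illustration; the real implementation would likely use a proper HTML parser:

```typescript
// Sketch of extractLinks: pull <a href="..."> values out of raw HTML,
// resolve them against the base URL, normalize, and deduplicate.
// The regex-based parse is a simplification for illustration only.
export function extractLinks(html: string, baseUrl: string): string[] {
  const hrefRe = /<a\s[^>]*href=["']([^"']+)["']/gi;
  const seen = new Set<string>();
  for (const match of html.matchAll(hrefRe)) {
    let resolved: URL;
    try {
      resolved = new URL(match[1], baseUrl); // resolve relative URLs
    } catch {
      continue; // skip malformed hrefs
    }
    // Filter out non-http(s) schemes (mailto:, tel:, javascript:, ...).
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      continue;
    }
    resolved.hash = ""; // strip fragments
    let normalized = resolved.toString();
    if (normalized.endsWith("/")) {
      normalized = normalized.slice(0, -1); // strip trailing slash
    }
    seen.add(normalized); // Set handles deduplication
  }
  return [...seen];
}
```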

Phase 2: Spider Engine (src/core/spider.ts — new file)

  1. Create SpiderOptions interface with all parameters above
  2. Create spiderUrl(seedUrl: string, options: SpiderOptions): AsyncGenerator<SpiderResult>
    • BFS crawl using a queue: { url: string, depth: number }[]
    • Track visited URLs in a Set<string> to avoid cycles
    • For each page:
      a. Run existing SSRF validation (validateUrl + isPrivateIP) on every URL before fetching
      b. Fetch with fetchAndConvert() from url-fetcher.ts
      c. Extract links from the raw HTML (captured before markdown conversion; see Phase 3)
      d. Filter links by sameDomain, pathPrefix, excludePatterns
      e. Enqueue unvisited links if depth < maxDepth and visited.size < maxPages
      f. Yield each { url, title, content, depth } as it is fetched
    • Respect rate limiting: add configurable delay between requests (default 1 second)
    • Abort cleanly if maxPages reached
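The crawl loop above can be sketched as follows. `fetchPage` and `extractLinks` are injected so the loop is testable with mocks; SSRF validation, rate-limit delays, `excludePatterns`, and robots.txt handling are omitted for brevity, and all names here are assumptions rather than the project's existing API:

```typescript
export interface SpiderOptions {
  maxPages: number;
  maxDepth: number;
  sameDomain: boolean;
  pathPrefix?: string | null;
}

export interface SpiderResult {
  url: string;
  depth: number;
  content: string;
}

type FetchPage = (url: string) => Promise<{ html: string; content: string }>;
type ExtractLinks = (html: string, baseUrl: string) => string[];

// BFS crawl from the seed: FIFO queue of { url, depth }, visited set for
// cycle detection, hard stop once maxPages pages have been fetched.
export async function* spiderUrl(
  seedUrl: string,
  opts: SpiderOptions,
  fetchPage: FetchPage,
  extractLinks: ExtractLinks,
): AsyncGenerator<SpiderResult> {
  const seedHost = new URL(seedUrl).hostname;
  const queue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }];
  const visited = new Set<string>([seedUrl]);
  let fetched = 0;

  while (queue.length > 0 && fetched < opts.maxPages) {
    const { url, depth } = queue.shift()!; // FIFO => breadth-first order
    let page: { html: string; content: string };
    try {
      page = await fetchPage(url);
    } catch {
      continue; // individual fetch failures must not abort the spider
    }
    fetched += 1;
    yield { url, depth, content: page.content };

    if (depth >= opts.maxDepth) continue; // don't expand past the depth limit
    for (const link of extractLinks(page.html, url)) {
      if (visited.has(link)) continue; // already seen (cycle detection)
      const u = new URL(link);
      if (opts.sameDomain && u.hostname !== seedHost) continue;
      if (opts.pathPrefix && !u.pathname.startsWith(opts.pathPrefix)) continue;
      visited.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
}
```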

Phase 3: Integration with Existing Code

  1. Modify url-fetcher.ts:

    • Export the raw HTML before markdown conversion (currently it converts inline)
    • Add a fetchRaw(url: string): Promise<{ html: string, finalUrl: string }> alongside existing fetchAndConvert()
    • Or refactor fetchAndConvert to optionally return raw HTML
  2. Modify src/api/routes.ts:

    • Update POST /api/v1/documents/url to accept spider options
    • When spider: true, use the spider engine instead of single fetch
    • Index each yielded page as a separate document
    • Return { documents: [...], pagesIndexed: number, pagesCrawled: number }
  3. Modify src/mcp/server.ts:

    • Add spider parameters to the submit-document tool schema
    • Wire through to spider engine
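As a sketch of step 2, the route handler could drain the spider generator and build the documented response shape. `collectSpiderResponse`, `indexDocument`, and `IndexedDoc` are assumed names for illustration:

```typescript
interface IndexedDoc {
  url: string;
  id: string;
}

// Consume the spider's async generator, index each yielded page as a
// separate document, and report the documented response shape.
export async function collectSpiderResponse(
  results: AsyncIterable<{ url: string; content: string }>,
  indexDocument: (url: string, content: string) => Promise<IndexedDoc>,
): Promise<{ documents: IndexedDoc[]; pagesIndexed: number; pagesCrawled: number }> {
  const documents: IndexedDoc[] = [];
  let pagesCrawled = 0;
  for await (const page of results) {
    pagesCrawled += 1;
    try {
      documents.push(await indexDocument(page.url, page.content));
    } catch {
      // Indexing failures are skipped; the crawl itself continues,
      // so the caller still gets partial results.
    }
  }
  return { documents, pagesIndexed: documents.length, pagesCrawled };
}
```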

Phase 4: Safety & Observability

  1. Hard limits (non-overridable):

    • Absolute max pages: 200 (even if user requests more)
    • Absolute max depth: 5
    • Request timeout per page: 30s
    • Total spider timeout: 10 minutes
    • Respect robots.txt (parse and honor Disallow rules)
  2. Logging:

    • Log each page fetched at INFO level
    • Log skipped URLs (filtered, already visited, blocked by robots.txt) at DEBUG
    • Log summary at completion: pages indexed, pages skipped, time elapsed
  3. Error handling:

    • Individual page fetch failures should NOT abort the spider
    • Log the error, skip the page, continue crawling
    • Return partial results with error count in response
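Clamping user-supplied limits to the non-overridable caps listed above might look like the following sketch (constant and function names are assumptions):

```typescript
// Hard caps that user-supplied options can never exceed (values from
// the limits listed above).
const ABSOLUTE_MAX_PAGES = 200;
const ABSOLUTE_MAX_DEPTH = 5;

// Clamp requested limits into [1, 200] pages and [0, 5] depth.
export function clampSpiderLimits(maxPages: number, maxDepth: number) {
  return {
    maxPages: Math.min(Math.max(maxPages, 1), ABSOLUTE_MAX_PAGES),
    maxDepth: Math.min(Math.max(maxDepth, 0), ABSOLUTE_MAX_DEPTH),
  };
}
```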

Phase 5: Tests (tests/unit/spider.test.ts — new file)

  1. Unit tests for link extraction (relative URLs, fragments, dedup, scheme filtering)
  2. Unit tests for spider engine with mocked fetcher (BFS order, depth limit, page limit, domain filtering, path prefix, exclude patterns, cycle detection)
  3. Integration test: seed URL with 3 levels of links, verify correct pages indexed
  4. Safety tests: maxPages enforced, maxDepth enforced, private IP links skipped, robots.txt respected

Files to Create/Modify

| File | Action |
| --- | --- |
| `src/core/link-extractor.ts` | New — HTML link extraction |
| `src/core/spider.ts` | New — Spider engine |
| `src/core/url-fetcher.ts` | Modify — export raw HTML option |
| `src/api/routes.ts` | Modify — accept spider params on POST /documents/url |
| `src/mcp/server.ts` | Modify — add spider params to submit-document tool |
| `src/api/openapi.ts` | Modify — document spider parameters |
| `tests/unit/spider.test.ts` | New — spider tests |
| `tests/unit/link-extractor.test.ts` | New — link extraction tests |

Complexity Estimate

Medium-large. Core spider engine is ~200 lines. Link extractor ~80 lines. Integration ~100 lines across routes/MCP. Tests ~300 lines. Total ~700 lines new code.
