feat: URL spidering — crawl linked pages with configurable depth and page limits #315

@RobertLD

Description

Problem

When a user submits a URL for indexing, only that single page is fetched and indexed. If the page links to related documentation (e.g. a docs site sidebar, a wiki with interlinked pages, an API reference with subpages), the user must submit each linked URL individually.

This makes it impractical to index an entire documentation site or knowledge base section.

Proposed Feature

Add opt-in spidering when submitting a URL, with strict limits to prevent runaway crawling.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `spider` | `boolean` | `false` | Enable link following |
| `maxPages` | `number` | `25` | Hard cap on total pages indexed per spider run |
| `maxDepth` | `number` | `2` | How many hops from the seed URL (`0` = seed only) |
| `sameDomain` | `boolean` | `true` | Only follow links on the same domain/subdomain |
| `pathPrefix` | `string` | `null` | Only follow links matching this URL path prefix (e.g. `/docs/`) |
| `excludePatterns` | `string[]` | `[]` | Glob patterns for URLs to skip (e.g. `*/changelog*`, `*/api/v1/*`) |
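For illustration, `excludePatterns` matching could be as simple as translating each glob into a regular expression. This is a sketch supporting only the `*` wildcard; a real implementation might use a glob library such as minimatch, and the function name here is hypothetical:

```typescript
// Escape regex metacharacters so everything except "*" is matched literally.
const escapeRegExp = (s: string): string =>
  s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Sketch: translate a glob like "*/changelog*" into /^.*\/changelog.*$/
// and test the full URL against it. Only "*" is supported.
export function matchesExcludePattern(url: string, pattern: string): boolean {
  const re = new RegExp(
    "^" + pattern.split("*").map(escapeRegExp).join(".*") + "$",
  );
  return re.test(url);
}
```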

API Surface

REST:

```
POST /api/v1/documents/url
{
  "url": "https://docs.example.com/getting-started",
  "spider": true,
  "maxPages": 50,
  "maxDepth": 2,
  "sameDomain": true,
  "pathPrefix": "/docs/"
}
```

MCP Tool:

submit-document with url + spider options
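The request options and their defaults from the table above could be normalized with a small helper. A sketch only; the `SpiderParams` and `withSpiderDefaults` names are hypothetical, not existing project API:

```typescript
// Spider options as submitted by the caller; all fields optional,
// mirroring the parameter table above.
export interface SpiderParams {
  spider?: boolean;
  maxPages?: number;
  maxDepth?: number;
  sameDomain?: boolean;
  pathPrefix?: string | null;
  excludePatterns?: string[];
}

// Fill in the documented defaults; caller-supplied values override them.
// (In this simple sketch an explicitly-undefined field would also
// override a default; a real implementation would filter those out.)
export function withSpiderDefaults(params: SpiderParams): Required<SpiderParams> {
  return {
    spider: false,
    maxPages: 25,
    maxDepth: 2,
    sameDomain: true,
    pathPrefix: null,
    excludePatterns: [],
    ...params,
  };
}
```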

Implementation Plan

Phase 1: Link Extraction (src/core/link-extractor.ts — new file)

  1. Create extractLinks(html: string, baseUrl: string): string[]
    • Parse the fetched HTML (before markdown conversion) for <a href="..."> tags
    • Resolve relative URLs against the base URL
    • Deduplicate and normalize (strip fragments, trailing slashes)
    • Filter out non-http(s) schemes, mailto:, tel:, javascript:, etc.
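A minimal sketch of `extractLinks` along these lines, using a regex purely for illustration; the real implementation would likely use a proper HTML parser:

```typescript
// Sketch of extractLinks: pull <a href="..."> values out of raw HTML,
// resolve them against the base URL, normalize, and deduplicate.
// The regex-based parse is a simplification for illustration only.
export function extractLinks(html: string, baseUrl: string): string[] {
  const hrefRe = /<a\s[^>]*href=["']([^"']+)["']/gi;
  const seen = new Set<string>();
  for (const match of html.matchAll(hrefRe)) {
    let resolved: URL;
    try {
      resolved = new URL(match[1], baseUrl); // resolve relative URLs
    } catch {
      continue; // skip malformed hrefs
    }
    // Filter out non-http(s) schemes (mailto:, tel:, javascript:, ...).
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      continue;
    }
    resolved.hash = ""; // strip fragments
    let normalized = resolved.toString();
    if (normalized.endsWith("/")) {
      normalized = normalized.slice(0, -1); // strip trailing slash
    }
    seen.add(normalized); // Set handles deduplication
  }
  return [...seen];
}
```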

Phase 2: Spider Engine (src/core/spider.ts — new file)

  1. Create SpiderOptions interface with all parameters above
  2. Create spiderUrl(seedUrl: string, options: SpiderOptions): AsyncGenerator<SpiderResult>
    • BFS crawl using a queue: { url: string, depth: number }[]
    • Track visited URLs in a Set<string> to avoid cycles
    • For each page:
      a. Run existing SSRF validation (validateUrl + isPrivateIP) on every URL before fetching
      b. Fetch with fetchAndConvert() from url-fetcher.ts
      c. Extract links from the raw HTML (captured before markdown conversion; see Phase 3)
      d. Filter links by sameDomain, pathPrefix, excludePatterns
      e. Enqueue unvisited links if depth < maxDepth and visited.size < maxPages
      f. Yield each { url, title, content, depth } as it is fetched
    • Respect rate limiting: add configurable delay between requests (default 1 second)
    • Abort cleanly if maxPages reached
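The crawl loop above can be sketched as follows. `fetchPage` and `extractLinks` are injected so the loop is testable with mocks; SSRF validation, rate-limit delays, `excludePatterns`, and robots.txt handling are omitted for brevity, and all names here are assumptions rather than the project's existing API:

```typescript
export interface SpiderOptions {
  maxPages: number;
  maxDepth: number;
  sameDomain: boolean;
  pathPrefix?: string | null;
}

export interface SpiderResult {
  url: string;
  depth: number;
  content: string;
}

type FetchPage = (url: string) => Promise<{ html: string; content: string }>;
type ExtractLinks = (html: string, baseUrl: string) => string[];

// BFS crawl from the seed: FIFO queue of { url, depth }, visited set for
// cycle detection, hard stop once maxPages pages have been fetched.
export async function* spiderUrl(
  seedUrl: string,
  opts: SpiderOptions,
  fetchPage: FetchPage,
  extractLinks: ExtractLinks,
): AsyncGenerator<SpiderResult> {
  const seedHost = new URL(seedUrl).hostname;
  const queue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }];
  const visited = new Set<string>([seedUrl]);
  let fetched = 0;

  while (queue.length > 0 && fetched < opts.maxPages) {
    const { url, depth } = queue.shift()!; // FIFO => breadth-first order
    let page: { html: string; content: string };
    try {
      page = await fetchPage(url);
    } catch {
      continue; // individual fetch failures must not abort the spider
    }
    fetched += 1;
    yield { url, depth, content: page.content };

    if (depth >= opts.maxDepth) continue; // don't expand past the depth limit
    for (const link of extractLinks(page.html, url)) {
      if (visited.has(link)) continue; // already seen (cycle detection)
      const u = new URL(link);
      if (opts.sameDomain && u.hostname !== seedHost) continue;
      if (opts.pathPrefix && !u.pathname.startsWith(opts.pathPrefix)) continue;
      visited.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
}
```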

Phase 3: Integration with Existing Code

  1. Modify url-fetcher.ts:

    • Export the raw HTML before markdown conversion (currently it converts inline)
    • Add a fetchRaw(url: string): Promise<{ html: string, finalUrl: string }> alongside existing fetchAndConvert()
    • Or refactor fetchAndConvert to optionally return raw HTML
  2. Modify src/api/routes.ts:

    • Update POST /api/v1/documents/url to accept spider options
    • When spider: true, use the spider engine instead of single fetch
    • Index each yielded page as a separate document
    • Return { documents: [...], pagesIndexed: number, pagesCrawled: number }
  3. Modify src/mcp/server.ts:

    • Add spider parameters to the submit-document tool schema
    • Wire through to spider engine
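As a sketch of step 2, the route handler could drain the spider generator and build the documented response shape. `collectSpiderResponse`, `indexDocument`, and `IndexedDoc` are assumed names for illustration:

```typescript
interface IndexedDoc {
  url: string;
  id: string;
}

// Consume the spider's async generator, index each yielded page as a
// separate document, and report the documented response shape.
export async function collectSpiderResponse(
  results: AsyncIterable<{ url: string; content: string }>,
  indexDocument: (url: string, content: string) => Promise<IndexedDoc>,
): Promise<{ documents: IndexedDoc[]; pagesIndexed: number; pagesCrawled: number }> {
  const documents: IndexedDoc[] = [];
  let pagesCrawled = 0;
  for await (const page of results) {
    pagesCrawled += 1;
    try {
      documents.push(await indexDocument(page.url, page.content));
    } catch {
      // Indexing failures are skipped; the crawl itself continues,
      // so the caller still gets partial results.
    }
  }
  return { documents, pagesIndexed: documents.length, pagesCrawled };
}
```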

Phase 4: Safety & Observability

  1. Hard limits (non-overridable):

    • Absolute max pages: 200 (even if user requests more)
    • Absolute max depth: 5
    • Request timeout per page: 30s
    • Total spider timeout: 10 minutes
    • Respect robots.txt (parse and honor Disallow rules)
  2. Logging:

    • Log each page fetched at INFO level
    • Log skipped URLs (filtered, already visited, blocked by robots.txt) at DEBUG
    • Log summary at completion: pages indexed, pages skipped, time elapsed
  3. Error handling:

    • Individual page fetch failures should NOT abort the spider
    • Log the error, skip the page, continue crawling
    • Return partial results with error count in response
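Clamping user-supplied limits to the non-overridable caps listed above might look like the following sketch (constant and function names are assumptions):

```typescript
// Hard caps that user-supplied options can never exceed (values from
// the limits listed above).
const ABSOLUTE_MAX_PAGES = 200;
const ABSOLUTE_MAX_DEPTH = 5;

// Clamp requested limits into [1, 200] pages and [0, 5] depth.
export function clampSpiderLimits(maxPages: number, maxDepth: number) {
  return {
    maxPages: Math.min(Math.max(maxPages, 1), ABSOLUTE_MAX_PAGES),
    maxDepth: Math.min(Math.max(maxDepth, 0), ABSOLUTE_MAX_DEPTH),
  };
}
```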

Phase 5: Tests (tests/unit/spider.test.ts — new file)

  1. Unit tests for link extraction (relative URLs, fragments, dedup, scheme filtering)
  2. Unit tests for spider engine with mocked fetcher (BFS order, depth limit, page limit, domain filtering, path prefix, exclude patterns, cycle detection)
  3. Integration test: seed URL with 3 levels of links, verify correct pages indexed
  4. Safety tests: maxPages enforced, maxDepth enforced, private IP links skipped, robots.txt respected

Files to Create/Modify

| File | Action |
| --- | --- |
| `src/core/link-extractor.ts` | New — HTML link extraction |
| `src/core/spider.ts` | New — Spider engine |
| `src/core/url-fetcher.ts` | Modify — export raw HTML option |
| `src/api/routes.ts` | Modify — accept spider params on POST /documents/url |
| `src/mcp/server.ts` | Modify — add spider params to submit-document tool |
| `src/api/openapi.ts` | Modify — document spider parameters |
| `tests/unit/spider.test.ts` | New — spider tests |
| `tests/unit/link-extractor.test.ts` | New — link extraction tests |

Complexity Estimate

Medium-large. Core spider engine is ~200 lines. Link extractor ~80 lines. Integration ~100 lines across routes/MCP. Tests ~300 lines. Total ~700 lines new code.
