SiteSavvy v0.5.0 — Capture the web, your way. (Major release)

A massive update that transforms SiteSavvy from a scraper into an AI-powered web research tool. This release adds LLM integration, RAG问答, an MCP server, 6 new output formats, and 15+ new features.

🤖 AI & Intelligence

LLM content extraction (--ai-extract): use a language model to extract clean main content from messy HTML
Per-page summaries + site digest (--summarize): generate a TL;DR for every page and a single site-digest.md for the whole crawl
Auto-categorization (--categorize): tag each page (article/product/listing/docs/…)

💬 RAG问答 — `sitesavvy ask "..."`

After crawling with --index, ask natural-language questions about the site and get cited answers:

sitesavvy crawl https://docs.example.com --mode text --format md --index
sitesavvy ask "how do I configure authentication?"

Backed by a SQLite vector store with cosine similarity over OpenAI-compatible embeddings.

🔌 MCP Server — `sitesavvy mcp`

Expose SiteSavvy to AI assistants (Claude, Cursor, VS Code Copilot) via the Model Context Protocol. Six tools: crawl, list_pages, search, get_page, ask, info. Configure in Claude Desktop:

{"mcpServers": {"sitesavvy": {"command": "sitesavvy", "args": ["mcp"]}}}

📋 9 Output Formats

html, md, txt, pdf, epub, zip, sqlite (queryable DB), warc (ISO 28500:2017, replayweb.page-compatible), obsidian (Markdown vault with wikilinks)

🎯 Scraping Power

--include / --exclude URL patterns (glob */** or re: regex)
--scope "main" CSS selector restricts link discovery + content extraction
--sitemap seeds URLs from sitemap.xml (incl. sitemap indexes)
--max-pages / --max-bytes / --max-time budgets
--structured emits JSON-LD + Open Graph + table sidecars
--proxy http://... or socks5://...
--screenshots full-page PNGs (headless)
--archive submits every page to the Wayback Machine

⚙️ Config & UX

sitesavvy.toml config files with [default] + [profiles.<name>]
6 built-in presets: --preset docs|blog|wiki|shop|archive|research
sitesavvy new interactive wizard
sitesavvy diff <old> <new> compares two crawls
--report generates an HTML crawl report
sitesavvy init-config writes an example config

Installation

pip install sitesavvy          # from PyPI

Or download the platform binary for your OS (no Python required):

Linux: sitesavvy-0.5.0-linux-x86_64.tar.gz
macOS: sitesavvy-0.5.0-macos-x86_64.tar.gz (built by CI)
Windows: sitesavvy-0.5.0-windows-x86_64.exe (built by CI)

AI Configuration

AI features use any OpenAI-compatible API:

export SITESAVVY_LLM_API_KEY=sk-...
export SITESAVVY_LLM_BASE_URL=https://api.openai.com/v1   # or http://localhost:11434/v1 for Ollama
export SITESAVVY_LLM_MODEL=gpt-4o-mini

Quality

282 tests (171 new), 90% coverage
ruff + mypy clean
17 new modules, 15 new config fields

See the full changelog for every detail.

Legal

SiteSavvy is for personal, non-commercial use only. Respect copyright, ToS, and robots.txt of every site you crawl. MIT licensed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SiteSavvy v0.5.0 — AI, RAG, MCP server & 15+ new features

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

SiteSavvy v0.5.0 — Capture the web, your way. (Major release)

🤖 AI & Intelligence

💬 RAG问答 — `sitesavvy ask "..."`

🔌 MCP Server — `sitesavvy mcp`

📋 9 Output Formats

🎯 Scraping Power

⚙️ Config & UX

Installation

AI Configuration

Quality

Legal

Uh oh!

SiteSavvy v0.5.0 — AI, RAG, MCP server & 15+ new features

SiteSavvy v0.5.0 — Capture the web, your way. (Major release)

🤖 AI & Intelligence

💬 RAG问答 — sitesavvy ask "..."

🔌 MCP Server — sitesavvy mcp

📋 9 Output Formats

🎯 Scraping Power

⚙️ Config & UX

Installation

AI Configuration

Quality

Legal

Uh oh!

💬 RAG问答 — `sitesavvy ask "..."`

🔌 MCP Server — `sitesavvy mcp`