SiteSavvy v0.5.0 — AI, RAG, MCP server & 15+ new features
SiteSavvy v0.5.0 — Capture the web, your way. (Major release)
A massive update that transforms SiteSavvy from a scraper into an AI-powered web research tool. This release adds LLM integration, RAG Q&A, an MCP server, 6 new output formats, and 15+ new features.
🤖 AI & Intelligence
- LLM content extraction (--ai-extract): use a language model to extract clean main content from messy HTML
- Per-page summaries + site digest (--summarize): generate a TL;DR for every page and a single site-digest.md for the whole crawl
- Auto-categorization (--categorize): tag each page (article/product/listing/docs/…)
💬 RAG Q&A — sitesavvy ask "..."
After crawling with --index, ask natural-language questions about the site and get cited answers:
sitesavvy crawl https://docs.example.com --mode text --format md --index
sitesavvy ask "how do I configure authentication?"
Backed by a SQLite vector store with cosine similarity over OpenAI-compatible embeddings.
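To make the retrieval step concrete, here is a minimal sketch of cosine-similarity search over vectors stored in SQLite. It is not SiteSavvy's actual code: the `chunks` table, its columns, the JSON encoding of embeddings, and the function names are all assumptions for illustration, and tiny 3-dimensional vectors stand in for real embeddings.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(conn, query_vec, k=3):
    """Return the k stored chunks most similar to query_vec, best first."""
    rows = conn.execute("SELECT url, text, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), url, text) for url, text, emb in rows]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy data (hypothetical schema): embeddings stored as JSON text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (url TEXT, text TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        ("https://docs.example.com/auth", "auth page text", json.dumps([0.9, 0.1, 0.0])),
        ("https://docs.example.com/install", "install page text", json.dumps([0.0, 0.2, 0.9])),
    ],
)

# A query vector pointing along the first axis should rank the auth page first.
results = top_k(conn, [1.0, 0.0, 0.0], k=1)
```

In a real pipeline the query vector would come from the same embedding model used at index time, and the URLs of the top-ranked chunks would become the citations in the answer.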
🔌 MCP Server — sitesavvy mcp
Expose SiteSavvy to AI assistants (Claude, Cursor, VS Code Copilot) via the Model Context Protocol. Six tools: crawl, list_pages, search, get_page, ask, info. Configure in Claude Desktop:
{"mcpServers": {"sitesavvy": {"command": "sitesavvy", "args": ["mcp"]}}}
📋 9 Output Formats
html, md, txt, pdf, epub, zip, sqlite (queryable DB), warc (ISO 28500:2017, replayweb.page-compatible), obsidian (Markdown vault with wikilinks)
🎯 Scraping Power
- --include / --exclude URL patterns (glob */** or re:regex)
- --scope "main": CSS selector that restricts link discovery + content extraction
- --sitemap seeds URLs from sitemap.xml (incl. sitemap indexes)
- --max-pages / --max-bytes / --max-time budgets
- --structured emits JSON-LD + Open Graph + table sidecars
- --proxy http://... or socks5://...
- --screenshots full-page PNGs (headless)
- --archive submits every page to the Wayback Machine
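The include/exclude patterns above can be pictured with a small sketch. The release notes only say that patterns are globs (`*`, `**`) or `re:`-prefixed regexes; the exact semantics below are assumptions: `*` stays within one path segment, `**` crosses `/`, globs must match the whole URL, and an exclude match beats an include match. The helper names `compile_pattern` and `allowed` are hypothetical.

```python
import re


def compile_pattern(pattern):
    """Compile one pattern: a 're:' prefix means regex, anything else is a glob."""
    if pattern.startswith("re:"):
        return re.compile(pattern[3:])
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # assumed: '**' crosses '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # assumed: '*' stays within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")


def allowed(url, include=(), exclude=()):
    """Decide whether a URL should be crawled (assumed: exclude wins over include)."""
    if any(compile_pattern(p).search(url) for p in exclude):
        return False
    # With no include patterns, everything not excluded is allowed.
    return not include or any(compile_pattern(p).search(url) for p in include)


shallow = allowed("https://docs.example.com/guide/auth",
                  include=["https://docs.example.com/guide/*"])
deep = allowed("https://docs.example.com/guide/a/b",
               include=["https://docs.example.com/guide/*"])
pdf = allowed("https://docs.example.com/file.pdf", exclude=[r"re:\.pdf$"])
```

Under these assumptions `shallow` is true, `deep` is false (a single `*` does not cross `/`), and `pdf` is false because the regex exclude matches.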
⚙️ Config & UX
- sitesavvy.toml config files with [default] + [profiles.<name>]
- 6 built-in presets: --preset docs|blog|wiki|shop|archive|research
- sitesavvy new interactive wizard
- sitesavvy diff <old> <new> compares two crawls
- --report generates an HTML crawl report
- sitesavvy init-config writes an example config
Installation
pip install sitesavvy # from PyPI
Or download the platform binary for your OS (no Python required):
- Linux: sitesavvy-0.5.0-linux-x86_64.tar.gz
- macOS: sitesavvy-0.5.0-macos-x86_64.tar.gz (built by CI)
- Windows: sitesavvy-0.5.0-windows-x86_64.exe (built by CI)
AI Configuration
AI features use any OpenAI-compatible API:
export SITESAVVY_LLM_API_KEY=sk-...
export SITESAVVY_LLM_BASE_URL=https://api.openai.com/v1 # or http://localhost:11434/v1 for Ollama
export SITESAVVY_LLM_MODEL=gpt-4o-mini
Quality
- 282 tests (171 new), 90% coverage
- ruff + mypy clean
- 17 new modules, 15 new config fields
See the full changelog for every detail.
Legal
SiteSavvy is for personal, non-commercial use only. Respect copyright, ToS, and robots.txt of every site you crawl. MIT licensed.