Skip to content

SiteSavvy v0.5.0 — AI, RAG, MCP server & 15+ new features

Choose a tag to compare

@Bloody-Crow Bloody-Crow released this 23 Jun 23:21
· 4 commits to main since this release

SiteSavvy v0.5.0 — Capture the web, your way. (Major release)

A massive update that transforms SiteSavvy from a scraper into an AI-powered web research tool. This release adds LLM integration, RAG问答, an MCP server, 6 new output formats, and 15+ new features.

🤖 AI & Intelligence

  • LLM content extraction (--ai-extract): use a language model to extract clean main content from messy HTML
  • Per-page summaries + site digest (--summarize): generate a TL;DR for every page and a single site-digest.md for the whole crawl
  • Auto-categorization (--categorize): tag each page (article/product/listing/docs/…)

💬 RAG问答 — sitesavvy ask "..."

After crawling with --index, ask natural-language questions about the site and get cited answers:

sitesavvy crawl https://docs.example.com --mode text --format md --index
sitesavvy ask "how do I configure authentication?"

Backed by a SQLite vector store with cosine similarity over OpenAI-compatible embeddings.

🔌 MCP Server — sitesavvy mcp

Expose SiteSavvy to AI assistants (Claude, Cursor, VS Code Copilot) via the Model Context Protocol. Six tools: crawl, list_pages, search, get_page, ask, info. Configure in Claude Desktop:

{"mcpServers": {"sitesavvy": {"command": "sitesavvy", "args": ["mcp"]}}}

📋 9 Output Formats

html, md, txt, pdf, epub, zip, sqlite (queryable DB), warc (ISO 28500:2017, replayweb.page-compatible), obsidian (Markdown vault with wikilinks)

🎯 Scraping Power

  • --include / --exclude URL patterns (glob */** or re: regex)
  • --scope "main" CSS selector restricts link discovery + content extraction
  • --sitemap seeds URLs from sitemap.xml (incl. sitemap indexes)
  • --max-pages / --max-bytes / --max-time budgets
  • --structured emits JSON-LD + Open Graph + table sidecars
  • --proxy http://... or socks5://...
  • --screenshots full-page PNGs (headless)
  • --archive submits every page to the Wayback Machine

⚙️ Config & UX

  • sitesavvy.toml config files with [default] + [profiles.<name>]
  • 6 built-in presets: --preset docs|blog|wiki|shop|archive|research
  • sitesavvy new interactive wizard
  • sitesavvy diff <old> <new> compares two crawls
  • --report generates an HTML crawl report
  • sitesavvy init-config writes an example config

Installation

pip install sitesavvy          # from PyPI

Or download the platform binary for your OS (no Python required):

  • Linux: sitesavvy-0.5.0-linux-x86_64.tar.gz
  • macOS: sitesavvy-0.5.0-macos-x86_64.tar.gz (built by CI)
  • Windows: sitesavvy-0.5.0-windows-x86_64.exe (built by CI)

AI Configuration

AI features use any OpenAI-compatible API:

export SITESAVVY_LLM_API_KEY=sk-...
export SITESAVVY_LLM_BASE_URL=https://api.openai.com/v1   # or http://localhost:11434/v1 for Ollama
export SITESAVVY_LLM_MODEL=gpt-4o-mini

Quality

  • 282 tests (171 new), 90% coverage
  • ruff + mypy clean
  • 17 new modules, 15 new config fields

See the full changelog for every detail.

Legal

SiteSavvy is for personal, non-commercial use only. Respect copyright, ToS, and robots.txt of every site you crawl. MIT licensed.