Releases: Bloody-Crow/SiteSavvy
SiteSavvy v0.6.0
Capture the web, your way.
v0.6.0 completes the feature set with 7 new modules covering pagination, authentication, proxy/Tor, stealth, recipes, docs-site mode, and offline full-text search — on top of v0.5.0's AI/RAG/MCP capabilities.
Installation
pip install sitesavvy
Or download the stand-alone binary for your OS (no Python required) from the assets below.
What's new in v0.6.0
| Feature | Flag | Module |
|---|---|---|
| 📄 Pagination awareness | `--follow-pagination` (default on) | `pagination.py` |
| 🔐 Authenticated crawling | `--login-url` / `--login-user` / `--login-pass` | `auth.py` |
| 🌐 Proxy / Tor / SOCKS5 | `--proxy http://...` or `socks5://...` | `proxies.py` |
| 🥸 Stealth mode | `--stealth` | `stealth.py` |
| 🍳 Recipe mode → cookbook EPUB | `--recipe-mode` | `recipe.py` |
| 📚 Docs-site mode | `--docs-mode` | `docs_mode.py` |
| 🔍 Offline full-text search | `--offline-search` | `offline_search.py` |
Quick examples
# Offline-searchable mirror
sitesavvy crawl https://example.com --offline-search --format html --out-dir ./out
# → open ./out/search.html in any browser
# Recipe site → cookbook EPUB
sitesavvy crawl https://recipes.example.com --recipe-mode --out-dir ./out
# → ./out/sitesavvy-cookbook.epub
# Authenticated crawl
sitesavvy crawl https://private.example.com \
--login-url https://private.example.com/login \
--login-user alice --login-pass secret --out-dir ./out
# Tor + stealth
sitesavvy crawl https://example.onion \
--proxy socks5://127.0.0.1:9050 --stealth --out-dir ./out
Stats
- 38 source modules (7 new in v0.6.0)
- 534 tests passing (252 new), 90% coverage
- `ruff check .` clean, `mypy sitesavvy` clean
- Tested on Python 3.12
Release assets
| Asset | OS | Notes |
|---|---|---|
| `sitesavvy-0.6.0-linux-x86_64.tar.gz` | Linux x86_64 | Single-file PyInstaller binary |
| `sitesavvy-0.6.0-macos-x86_64.tar.gz` | macOS x86_64 | Single-file PyInstaller binary |
| `sitesavvy-0.6.0-windows-x86_64.exe` | Windows x86_64 | Single-file PyInstaller binary |
| `sitesavvy-0.6.0-py3-none-any.whl` | Universal | pip install wheel |
| `sitesavvy-0.6.0.tar.gz` | Universal | Source distribution |
Legal
SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. Licensed under the MIT License.
SiteSavvy v0.5.0 — AI, RAG, MCP server & 15+ new features
SiteSavvy v0.5.0 — Capture the web, your way. (Major release)
A massive update that transforms SiteSavvy from a scraper into an AI-powered web research tool. This release adds LLM integration, RAG Q&A, an MCP server, 6 new output formats, and 15+ new features.
🤖 AI & Intelligence
- LLM content extraction (`--ai-extract`): use a language model to extract clean main content from messy HTML
- Per-page summaries + site digest (`--summarize`): generate a TL;DR for every page and a single `site-digest.md` for the whole crawl
- Auto-categorization (`--categorize`): tag each page (article/product/listing/docs/…)
💬 RAG Q&A: sitesavvy ask "..."
After crawling with --index, ask natural-language questions about the site and get cited answers:
sitesavvy crawl https://docs.example.com --mode text --format md --index
sitesavvy ask "how do I configure authentication?"
Backed by a SQLite vector store with cosine similarity over OpenAI-compatible embeddings.
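As a rough illustration of what a SQLite-backed vector store with cosine similarity can look like, here is a minimal sketch. The `chunks` table, its columns, the JSON encoding of embeddings, and the `top_k` helper are all assumptions made for this example, not SiteSavvy's actual schema or API, and the three-dimensional vectors stand in for real embeddings.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(conn, query_vec, k=3):
    """Score every stored chunk against the query and return the k best."""
    rows = conn.execute("SELECT url, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), url) for url, emb in rows]
    return sorted(scored, reverse=True)[:k]


# Hypothetical schema: one row per text chunk, embedding stored as JSON text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (url TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [
        ("https://docs.example.com/auth", json.dumps([0.9, 0.1, 0.0])),
        ("https://docs.example.com/install", json.dumps([0.1, 0.9, 0.0])),
        ("https://docs.example.com/faq", json.dumps([0.5, 0.5, 0.1])),
    ],
)

# Toy query vector; a real query would be embedded by the same model.
results = top_k(conn, [1.0, 0.0, 0.0], k=2)
```

A brute-force scan like this is simple and usually adequate for a single crawled site; the URLs of the best-scoring chunks are what would let an answer cite its sources.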
🔌 MCP Server — sitesavvy mcp
Expose SiteSavvy to AI assistants (Claude, Cursor, VS Code Copilot) via the Model Context Protocol. Six tools: crawl, list_pages, search, get_page, ask, info. Configure in Claude Desktop:
{"mcpServers": {"sitesavvy": {"command": "sitesavvy", "args": ["mcp"]}}}
📋 9 Output Formats
html, md, txt, pdf, epub, zip, sqlite (queryable DB), warc (ISO 28500:2017, replayweb.page-compatible), obsidian (Markdown vault with wikilinks)
🎯 Scraping Power
- `--include` / `--exclude` URL patterns (glob `*`/`**` or `re:` regex)
- `--scope "main"` CSS selector restricts link discovery + content extraction
- `--sitemap` seeds URLs from sitemap.xml (incl. sitemap indexes)
- `--max-pages` / `--max-bytes` / `--max-time` budgets
- `--structured` emits JSON-LD + Open Graph + table sidecars
- `--proxy http://...` or `socks5://...`
- `--screenshots` full-page PNGs (headless)
- `--archive` submits every page to the Wayback Machine
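The include/exclude rule in the list above can be sketched in a few lines. This is only an illustration of the idea, assuming that excludes take precedence and that an empty include list admits everything; `compile_pattern` and `url_allowed` are hypothetical names, and SiteSavvy's real matching rules may differ.

```python
import fnmatch
import re


def compile_pattern(pattern):
    """Return a matcher callable for a 're:' regex or a glob pattern."""
    if pattern.startswith("re:"):
        # Regex patterns may match anywhere in the URL.
        return re.compile(pattern[3:]).search
    # fnmatch.translate lets '*' match any character, including '/',
    # so this sketch does not distinguish '*' from '**'.
    return re.compile(fnmatch.translate(pattern)).match


def url_allowed(url, include=(), exclude=()):
    """Assumed semantics: excludes win; no includes means allow all."""
    if any(compile_pattern(p)(url) for p in exclude):
        return False
    if not include:
        return True
    return any(compile_pattern(p)(url) for p in include)


include = ["https://example.com/docs/**"]
exclude = [r"re:\.(png|jpg)$"]

docs_page = url_allowed("https://example.com/docs/api/index.html", include, exclude)
docs_image = url_allowed("https://example.com/docs/logo.png", include, exclude)
blog_page = url_allowed("https://example.com/blog/post", include, exclude)
```

Here `docs_page` is allowed, `docs_image` is rejected by the exclude regex, and `blog_page` is rejected because it matches no include pattern.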
⚙️ Config & UX
- `sitesavvy.toml` config files with `[default]` + `[profiles.<name>]`
- 6 built-in presets: `--preset docs|blog|wiki|shop|archive|research`
- `sitesavvy new` interactive wizard
- `sitesavvy diff <old> <new>` compares two crawls
- `--report` generates an HTML crawl report
- `sitesavvy init-config` writes an example config
Installation
pip install sitesavvy # from PyPI
Or download the platform binary for your OS (no Python required):
- Linux: `sitesavvy-0.5.0-linux-x86_64.tar.gz`
- macOS: `sitesavvy-0.5.0-macos-x86_64.tar.gz` (built by CI)
- Windows: `sitesavvy-0.5.0-windows-x86_64.exe` (built by CI)
AI Configuration
AI features use any OpenAI-compatible API:
export SITESAVVY_LLM_API_KEY=sk-...
export SITESAVVY_LLM_BASE_URL=https://api.openai.com/v1 # or http://localhost:11434/v1 for Ollama
export SITESAVVY_LLM_MODEL=gpt-4o-mini
Quality
- 282 tests (171 new), 90% coverage
- ruff + mypy clean
- 17 new modules, 15 new config fields
See the full changelog for every detail.
Legal
SiteSavvy is for personal, non-commercial use only. Respect copyright, ToS, and robots.txt of every site you crawl. MIT licensed.
SiteSavvy v0.1.0 — Capture the web, your way.
A modern, async, cross-platform web scraper with full-site and text-only modes, six export formats (HTML, Markdown, Text, PDF, EPUB, ZIP), robots.txt awareness, resume/incremental crawling, and optional Playwright headless rendering.
Installation
Via pip (once published to PyPI)
pip install sitesavvy
From this release (no Python required)
Download the platform binary for your OS, extract, and run:
# Linux (x86_64)
tar -xzf sitesavvy-0.1.0-linux-x86_64.tar.gz
./sitesavvy --help
# Windows
# sitesavvy-0.1.0-windows-x86_64.exe: run in Command Prompt / PowerShell
# macOS
# sitesavvy-0.1.0-macos-x86_64: chmod +x and run
From source (sdist/wheel)
pip install sitesavvy-0.1.0-py3-none-any.whl
# or
pip install sitesavvy-0.1.0.tar.gz
Release Assets
| Asset | Type | Size | Notes |
|---|---|---|---|
| `sitesavvy-0.1.0-linux-x86_64.tar.gz` | PyInstaller binary (Linux) | ~88 MB | Single-file executable, no Python needed |
| `sitesavvy-0.1.0-windows-x86_64.exe` | PyInstaller binary (Windows) | pending | Built by CI |
| `sitesavvy-0.1.0-macos-x86_64` | PyInstaller binary (macOS) | pending | Built by CI |
| `sitesavvy-0.1.0-py3-none-any.whl` | Python wheel | 32 KB | Universal, pip install |
| `sitesavvy-0.1.0.tar.gz` | Source distribution | 28 KB | pip install from source |
Note on platform binaries: PyInstaller binaries are OS- and architecture-specific and must be built on the target OS. The Linux binary is built directly here; the Windows `.exe` and macOS binary will be attached automatically by the GitHub Actions release workflow on the next push (see `.github/workflows/release.yml`).
Features
- Two crawl modes: `full` (mirror the whole site) and `text` (readable text only)
- Six output formats: HTML, Markdown, plain text, PDF, EPUB, ZIP archive
- Polite by default: respects `robots.txt`, per-host rate limiting, auto-throttle on 429/5xx
- Resume & incremental: JSON manifest tracks every URL; resume skips completed work, incremental re-downloads only changed resources (ETag/Last-Modified)
- Concurrency control with a global semaphore and per-host locks
- Dry-run mode to preview URLs before downloading
- Headless rendering via Playwright (with graceful `aiohttp` fallback)
- Fine-grained `--download-types` filtering (html, css, js, img, pdf, other)
- External-link gating: stays on the start host unless `--external` is passed
- Rich CLI with progress tables and coloured output
Verification
- 111 tests, 92% line+branch coverage
- `ruff check .` clean
- `mypy sitesavvy` clean
- Tested on Python 3.12
Legal
SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Licensed under the MIT License.