24 Jun 00:31

338709a

SiteSavvy v0.6.0

Capture the web, your way.

v0.6.0 completes the feature set with 7 new modules covering pagination, authentication, proxy/Tor, stealth, recipes, docs-site mode, and offline full-text search — on top of v0.5.0's AI/RAG/MCP capabilities.

Installation

pip install sitesavvy

Or download the stand-alone binary for your OS (no Python required) from the assets below.

What's new in v0.6.0

Feature	Flag	Module
📄 Pagination awareness	`--follow-pagination` (default on)	`pagination.py`
🔐 Authenticated crawling	`--login-url` / `--login-user` / `--login-pass`	`auth.py`
🌐 Proxy / Tor / SOCKS5	`--proxy http://...` or `socks5://...`	`proxies.py`
🥸 Stealth mode	`--stealth`	`stealth.py`
🍳 Recipe mode → cookbook EPUB	`--recipe-mode`	`recipe.py`
📚 Docs-site mode	`--docs-mode`	`docs_mode.py`
🔍 Offline full-text search	`--offline-search`	`offline_search.py`
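
The notes do not say how `pagination.py` detects the next page. One common convention is a `rel="next"` link in the page head or body, and the sketch below follows only that convention. The names `NextLinkFinder` and `find_next_page` are hypothetical and are not taken from SiteSavvy's source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> or <link> whose rel contains 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link") or self.next_href is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated token list, e.g. rel="next nofollow"
        rels = (attr.get("rel") or "").lower().split()
        if "next" in rels and attr.get("href"):
            self.next_href = attr["href"]


def find_next_page(html, base_url):
    """Return the absolute URL of the next page, or None if none is declared."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(base_url, finder.next_href)


page = '<html><head><link rel="next" href="/blog?page=3"></head><body></body></html>'
next_url = find_next_page(page, "https://example.com/blog?page=2")
```

A crawler built this way would queue `next_url` until `find_next_page` returns `None`. Sites that paginate with "Load more" buttons or numbered links without `rel` attributes would need other heuristics.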

Quick examples

# Offline-searchable mirror
sitesavvy crawl https://example.com --offline-search --format html --out-dir ./out
# → open ./out/search.html in any browser

# Recipe site → cookbook EPUB
sitesavvy crawl https://recipes.example.com --recipe-mode --out-dir ./out
# → ./out/sitesavvy-cookbook.epub

# Authenticated crawl
sitesavvy crawl https://private.example.com \
    --login-url https://private.example.com/login \
    --login-user alice --login-pass secret --out-dir ./out

# Tor + stealth
sitesavvy crawl https://example.onion \
    --proxy socks5://127.0.0.1:9050 --stealth --out-dir ./out

Stats

38 source modules (7 new in v0.6.0)
534 tests passing (252 new), 90% coverage
`ruff check .` clean, `mypy sitesavvy` clean
Tested on Python 3.12

Release assets

Asset	OS	Notes
`sitesavvy-0.6.0-linux-x86_64.tar.gz`	Linux x86_64	Single-file PyInstaller binary
`sitesavvy-0.6.0-macos-x86_64.tar.gz`	macOS x86_64	Single-file PyInstaller binary
`sitesavvy-0.6.0-windows-x86_64.exe`	Windows x86_64	Single-file PyInstaller binary
`sitesavvy-0.6.0-py3-none-any.whl`	Universal	`pip install` wheel
`sitesavvy-0.6.0.tar.gz`	Universal	Source distribution

Legal

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. Licensed under the MIT License.


23 Jun 23:21

Bloody-Crow

v0.5.0

beaa157

SiteSavvy v0.5.0 — AI, RAG, MCP server & 15+ new features

SiteSavvy v0.5.0 — Capture the web, your way. (Major release)

A massive update that transforms SiteSavvy from a scraper into an AI-powered web research tool. This release adds LLM integration, RAG Q&A, an MCP server, 3 new output formats (9 in total), and 15+ new features.

🤖 AI & Intelligence

LLM content extraction (--ai-extract): use a language model to extract clean main content from messy HTML
Per-page summaries + site digest (--summarize): generate a TL;DR for every page and a single site-digest.md for the whole crawl
Auto-categorization (--categorize): tag each page (article/product/listing/docs/…)

💬 RAG Q&A: `sitesavvy ask "..."`

After crawling with --index, ask natural-language questions about the site and get cited answers:

sitesavvy crawl https://docs.example.com --mode text --format md --index
sitesavvy ask "how do I configure authentication?"

Backed by a SQLite vector store with cosine similarity over OpenAI-compatible embeddings.
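
To illustrate what a SQLite vector store with cosine similarity can look like, here is a minimal brute-force sketch. The table name `chunks`, its columns, the float32 BLOB encoding, and the toy 3-dimensional vectors are all assumptions made for illustration. SiteSavvy's actual schema is not described in these notes.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a list of floats into a BLOB (float32, little-endian)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack(blob):
    """Inverse of pack(): BLOB -> tuple of floats."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(conn, query_vec, k=3):
    """Score every stored chunk against the query and return the k best (url, score) pairs."""
    rows = conn.execute("SELECT url, embedding FROM chunks")
    scored = [(url, cosine(query_vec, unpack(blob))) for url, blob in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical schema and toy embeddings; real embeddings come from the embedding API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (url TEXT, embedding BLOB)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [
        ("https://docs.example.com/auth", pack([1.0, 0.0, 0.0])),
        ("https://docs.example.com/install", pack([0.0, 1.0, 0.0])),
        ("https://docs.example.com/login", pack([0.5, 0.5, 0.0])),
    ],
)
results = top_k(conn, [1.0, 0.0, 0.0], k=2)
```

A full scan like this is linear in the number of stored chunks, which is usually acceptable for a single crawled site. The URLs of the top-ranked chunks are what would let an answer cite its sources.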

🔌 MCP Server — `sitesavvy mcp`

Expose SiteSavvy to AI assistants (Claude, Cursor, VS Code Copilot) via the Model Context Protocol. Six tools: crawl, list_pages, search, get_page, ask, info. Configure in Claude Desktop:

{"mcpServers": {"sitesavvy": {"command": "sitesavvy", "args": ["mcp"]}}}

📋 9 Output Formats

html, md, txt, pdf, epub, zip, sqlite (queryable DB), warc (ISO 28500:2017, replayweb.page-compatible), obsidian (Markdown vault with wikilinks)

🎯 Scraping Power

--include / --exclude URL patterns (glob with * and **, or a regex prefixed with re:)
--scope "main" CSS selector restricts link discovery + content extraction
--sitemap seeds URLs from sitemap.xml (incl. sitemap indexes)
--max-pages / --max-bytes / --max-time budgets
--structured emits JSON-LD + Open Graph + table sidecars
--proxy http://... or socks5://...
--screenshots full-page PNGs (headless)
--archive submits every page to the Wayback Machine
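
The include/exclude patterns listed above can be thought of as a small URL filter. The sketch below is one plausible reading, not SiteSavvy's implementation: it assumes `*` stays within one path segment, `**` crosses `/`, a `re:` prefix marks a regular expression searched anywhere in the URL, and exclude rules win over include rules. The helper names are hypothetical.

```python
import re


def compile_pattern(pattern):
    """Compile one pattern string into a regex (assumed semantics, see lead-in)."""
    if pattern.startswith("re:"):
        return re.compile(pattern[3:])
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # '**' may span path separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # '*' stops at the next '/'
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")


def allowed(url, include=(), exclude=()):
    """Exclude wins over include; an empty include list admits every URL."""
    if any(compile_pattern(p).search(url) for p in exclude):
        return False
    if not include:
        return True
    return any(compile_pattern(p).search(url) for p in include)


deep = "https://example.com/docs/a/b.html"
ok_deep = allowed(deep, include=["https://example.com/docs/**"])
ok_shallow = allowed(deep, include=["https://example.com/docs/*"])
ok_pdf = allowed(
    "https://example.com/docs/a.pdf",
    include=["https://example.com/docs/**"],
    exclude=[r"re:\.pdf$"],
)
```

Under these assumptions the deep URL passes the `**` pattern, fails the single `*` pattern, and the PDF is rejected by the exclude rule even though it matches the include rule.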

⚙️ Config & UX

sitesavvy.toml config files with [default] + [profiles.<name>]
6 built-in presets: --preset docs|blog|wiki|shop|archive|research
sitesavvy new interactive wizard
sitesavvy diff <old> <new> compares two crawls
--report generates an HTML crawl report
sitesavvy init-config writes an example config

Installation

pip install sitesavvy          # from PyPI

Or download the platform binary for your OS (no Python required):

Linux: sitesavvy-0.5.0-linux-x86_64.tar.gz
macOS: sitesavvy-0.5.0-macos-x86_64.tar.gz (built by CI)
Windows: sitesavvy-0.5.0-windows-x86_64.exe (built by CI)

AI Configuration

AI features use any OpenAI-compatible API:

export SITESAVVY_LLM_API_KEY=sk-...
export SITESAVVY_LLM_BASE_URL=https://api.openai.com/v1   # or http://localhost:11434/v1 for Ollama
export SITESAVVY_LLM_MODEL=gpt-4o-mini

Quality

282 tests (171 new), 90% coverage
ruff + mypy clean
17 new modules, 15 new config fields

See the full changelog for every detail.

Legal

SiteSavvy is for personal, non-commercial use only. Respect copyright, ToS, and robots.txt of every site you crawl. MIT licensed.


22 Jun 22:15

Bloody-Crow

v0.1.0

724dc6b

SiteSavvy v0.1.0 — Capture the web, your way.

A modern, async, cross-platform web scraper with full-site and text-only modes, six export formats (HTML, Markdown, Text, PDF, EPUB, ZIP), robots.txt awareness, resume/incremental crawling, and optional Playwright headless rendering.

Installation

Via pip (once published to PyPI)

pip install sitesavvy

From this release (no Python required)

Download the platform binary for your OS, extract, and run:

# Linux (x86_64)
tar -xzf sitesavvy-0.1.0-linux-x86_64.tar.gz
./sitesavvy --help

# Windows
# sitesavvy-0.1.0-windows-x86_64.exe — run in Command Prompt / PowerShell

# macOS
# sitesavvy-0.1.0-macos-x86_64 — chmod +x and run

From source (sdist/wheel)

pip install sitesavvy-0.1.0-py3-none-any.whl
# or
pip install sitesavvy-0.1.0.tar.gz

Release Assets

Asset	Type	Size	Notes
`sitesavvy-0.1.0-linux-x86_64.tar.gz`	PyInstaller binary (Linux)	~88 MB	Single-file executable, no Python needed
`sitesavvy-0.1.0-windows-x86_64.exe`	PyInstaller binary (Windows)	pending	Built by CI
`sitesavvy-0.1.0-macos-x86_64`	PyInstaller binary (macOS)	pending	Built by CI
`sitesavvy-0.1.0-py3-none-any.whl`	Python wheel	32 KB	Universal, `pip install`
`sitesavvy-0.1.0.tar.gz`	Source distribution	28 KB	`pip install` from source

Note on platform binaries: PyInstaller binaries are OS- and architecture-specific and must be built on the target OS. The Linux binary is built directly here; the Windows .exe and macOS binary will be attached automatically by the GitHub Actions release workflow on the next push (see .github/workflows/release.yml).

Features

Two crawl modes: full (mirror the whole site) and text (readable text only)
Six output formats: HTML, Markdown, plain text, PDF, EPUB, ZIP archive
Polite by default: respects robots.txt, per-host rate limiting, auto-throttle on 429/5xx
Resume & incremental: JSON manifest tracks every URL; resume skips completed work, incremental re-downloads only changed resources (ETag/Last-Modified)
Concurrency control with a global semaphore and per-host locks
Dry-run mode to preview URLs before downloading
Headless rendering via Playwright (with graceful aiohttp fallback)
Fine-grained --download-types filtering (html, css, js, img, pdf, other)
External-link gating — stays on the start host unless --external is passed
Rich CLI with progress tables and coloured output
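
The concurrency item above pairs a global semaphore with per-host locks. The `asyncio` sketch below shows one way those two pieces can fit together. The `Limiter` class and `fake_fetch` stand-in are hypothetical, and a real crawler would add per-host delays and error handling.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class Limiter:
    """Caps total in-flight requests and allows one request per host at a time."""

    def __init__(self, max_concurrent):
        self._global = asyncio.Semaphore(max_concurrent)
        self._per_host = defaultdict(asyncio.Lock)

    async def run(self, url, fetch):
        host = urlsplit(url).netloc
        # Take the host lock first so tasks queued behind a busy host
        # do not occupy a global slot while they wait.
        async with self._per_host[host]:
            async with self._global:
                return await fetch(url)


async def demo():
    limiter = Limiter(max_concurrent=2)
    active, peak, overlap = 0, 0, False
    busy_hosts = set()

    async def fake_fetch(url):  # stand-in for a real HTTP request
        nonlocal active, peak, overlap
        host = urlsplit(url).netloc
        overlap = overlap or host in busy_hosts
        busy_hosts.add(host)
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        busy_hosts.discard(host)
        return url

    urls = [f"https://{h}.example.com/{i}" for h in "abc" for i in range(3)]
    results = await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return results, peak, overlap


results, peak, overlap = asyncio.run(demo())
```

In this demo `peak` never exceeds the global limit of 2, and `overlap` stays `False` because no host ever has two requests in flight at once.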

Verification

111 tests, 92% line+branch coverage
`ruff check .` clean
`mypy sitesavvy` clean
Tested on Python 3.12

Legal

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Licensed under the MIT License.


Releases: Bloody-Crow/SiteSavvy
