Skip to content

Releases: Bloody-Crow/SiteSavvy

SiteSavvy v0.6.0

24 Jun 00:31

Choose a tag to compare

SiteSavvy v0.6.0

Capture the web, your way.

v0.6.0 completes the feature set with 7 new modules covering pagination, authentication, proxy/Tor, stealth, recipes, docs-site mode, and offline full-text search — on top of v0.5.0's AI/RAG/MCP capabilities.

Installation

pip install sitesavvy

Or download the stand-alone binary for your OS (no Python required) from the assets below.

What's new in v0.6.0

Feature Flag Module
📄 Pagination awareness --follow-pagination (default on) pagination.py
🔐 Authenticated crawling --login-url / --login-user / --login-pass auth.py
🌐 Proxy / Tor / SOCKS5 --proxy http://... or socks5://... proxies.py
🥸 Stealth mode --stealth stealth.py
🍳 Recipe mode → cookbook EPUB --recipe-mode recipe.py
📚 Docs-site mode --docs-mode docs_mode.py
🔍 Offline full-text search --offline-search offline_search.py

Quick examples

# Offline-searchable mirror
sitesavvy crawl https://example.com --offline-search --format html --out-dir ./out
# → open ./out/search.html in any browser

# Recipe site → cookbook EPUB
sitesavvy crawl https://recipes.example.com --recipe-mode --out-dir ./out
# → ./out/sitesavvy-cookbook.epub

# Authenticated crawl
sitesavvy crawl https://private.example.com \
    --login-url https://private.example.com/login \
    --login-user alice --login-pass secret --out-dir ./out

# Tor + stealth
sitesavvy crawl https://example.onion \
    --proxy socks5://127.0.0.1:9050 --stealth --out-dir ./out

Stats

  • 38 source modules (7 new in v0.6.0)
  • 534 tests passing (252 new), 90% coverage
  • ruff check . clean, mypy sitesavvy clean
  • Tested on Python 3.12

Release assets

Asset OS Notes
sitesavvy-0.6.0-linux-x86_64.tar.gz Linux x86_64 Single-file PyInstaller binary
sitesavvy-0.6.0-macos-x86_64.tar.gz macOS x86_64 Single-file PyInstaller binary
sitesavvy-0.6.0-windows-x86_64.exe Windows x86_64 Single-file PyInstaller binary
sitesavvy-0.6.0-py3-none-any.whl Universal pip install wheel
sitesavvy-0.6.0.tar.gz Universal Source distribution

Legal

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. Licensed under the MIT License.

SiteSavvy v0.5.0 — AI, RAG, MCP server & 15+ new features

23 Jun 23:21

Choose a tag to compare

SiteSavvy v0.5.0 — Capture the web, your way. (Major release)

A massive update that transforms SiteSavvy from a scraper into an AI-powered web research tool. This release adds LLM integration, RAG问答, an MCP server, 6 new output formats, and 15+ new features.

🤖 AI & Intelligence

  • LLM content extraction (--ai-extract): use a language model to extract clean main content from messy HTML
  • Per-page summaries + site digest (--summarize): generate a TL;DR for every page and a single site-digest.md for the whole crawl
  • Auto-categorization (--categorize): tag each page (article/product/listing/docs/…)

💬 RAG问答 — sitesavvy ask "..."

After crawling with --index, ask natural-language questions about the site and get cited answers:

sitesavvy crawl https://docs.example.com --mode text --format md --index
sitesavvy ask "how do I configure authentication?"

Backed by a SQLite vector store with cosine similarity over OpenAI-compatible embeddings.

🔌 MCP Server — sitesavvy mcp

Expose SiteSavvy to AI assistants (Claude, Cursor, VS Code Copilot) via the Model Context Protocol. Six tools: crawl, list_pages, search, get_page, ask, info. Configure in Claude Desktop:

{"mcpServers": {"sitesavvy": {"command": "sitesavvy", "args": ["mcp"]}}}

📋 9 Output Formats

html, md, txt, pdf, epub, zip, sqlite (queryable DB), warc (ISO 28500:2017, replayweb.page-compatible), obsidian (Markdown vault with wikilinks)

🎯 Scraping Power

  • --include / --exclude URL patterns (glob */** or re: regex)
  • --scope "main" CSS selector restricts link discovery + content extraction
  • --sitemap seeds URLs from sitemap.xml (incl. sitemap indexes)
  • --max-pages / --max-bytes / --max-time budgets
  • --structured emits JSON-LD + Open Graph + table sidecars
  • --proxy http://... or socks5://...
  • --screenshots full-page PNGs (headless)
  • --archive submits every page to the Wayback Machine

⚙️ Config & UX

  • sitesavvy.toml config files with [default] + [profiles.<name>]
  • 6 built-in presets: --preset docs|blog|wiki|shop|archive|research
  • sitesavvy new interactive wizard
  • sitesavvy diff <old> <new> compares two crawls
  • --report generates an HTML crawl report
  • sitesavvy init-config writes an example config

Installation

pip install sitesavvy          # from PyPI

Or download the platform binary for your OS (no Python required):

  • Linux: sitesavvy-0.5.0-linux-x86_64.tar.gz
  • macOS: sitesavvy-0.5.0-macos-x86_64.tar.gz (built by CI)
  • Windows: sitesavvy-0.5.0-windows-x86_64.exe (built by CI)

AI Configuration

AI features use any OpenAI-compatible API:

export SITESAVVY_LLM_API_KEY=sk-...
export SITESAVVY_LLM_BASE_URL=https://api.openai.com/v1   # or http://localhost:11434/v1 for Ollama
export SITESAVVY_LLM_MODEL=gpt-4o-mini

Quality

  • 282 tests (171 new), 90% coverage
  • ruff + mypy clean
  • 17 new modules, 15 new config fields

See the full changelog for every detail.

Legal

SiteSavvy is for personal, non-commercial use only. Respect copyright, ToS, and robots.txt of every site you crawl. MIT licensed.

SiteSavvy v0.1.0 — Capture the web, your way.

22 Jun 22:15

Choose a tag to compare

SiteSavvy v0.1.0 — Capture the web, your way.

A modern, async, cross-platform web scraper with full-site and text-only modes, six export formats (HTML, Markdown, Text, PDF, EPUB, ZIP), robots.txt awareness, resume/incremental crawling, and optional Playwright headless rendering.

Installation

Via pip (once published to PyPI)

pip install sitesavvy

From this release (no Python required)

Download the platform binary for your OS, extract, and run:

# Linux (x86_64)
tar -xzf sitesavvy-0.1.0-linux-x86_64.tar.gz
./sitesavvy --help

# Windows
# sitesavvy-0.1.0-windows-x86_64.exe — run in Command Prompt / PowerShell

# macOS
# sitesavvy-0.1.0-macos-x86_64 — chmod +x and run

From source (sdist/wheel)

pip install sitesavvy-0.1.0-py3-none-any.whl
# or
pip install sitesavvy-0.1.0.tar.gz

Release Assets

Asset Type Size Notes
sitesavvy-0.1.0-linux-x86_64.tar.gz PyInstaller binary (Linux) ~88 MB Single-file executable, no Python needed
sitesavvy-0.1.0-windows-x86_64.exe PyInstaller binary (Windows) pending Built by CI
sitesavvy-0.1.0-macos-x86_64 PyInstaller binary (macOS) pending Built by CI
sitesavvy-0.1.0-py3-none-any.whl Python wheel 32 KB Universal, pip install
sitesavvy-0.1.0.tar.gz Source distribution 28 KB pip install from source

Note on platform binaries: PyInstaller binaries are OS- and architecture-specific and must be built on the target OS. The Linux binary is built directly here; the Windows .exe and macOS binary will be attached automatically by the GitHub Actions release workflow on the next push (see .github/workflows/release.yml).

Features

  • Two crawl modes: full (mirror the whole site) and text (readable text only)
  • Six output formats: HTML, Markdown, plain text, PDF, EPUB, ZIP archive
  • Polite by default: respects robots.txt, per-host rate limiting, auto-throttle on 429/5xx
  • Resume & incremental: JSON manifest tracks every URL; resume skips completed work, incremental re-downloads only changed resources (ETag/Last-Modified)
  • Concurrency control with a global semaphore and per-host locks
  • Dry-run mode to preview URLs before downloading
  • Headless rendering via Playwright (with graceful aiohttp fallback)
  • Fine-grained --download-types filtering (html, css, js, img, pdf, other)
  • External-link gating — stays on the start host unless --external is passed
  • Rich CLI with progress tables and coloured output

Verification

  • 111 tests, 92% line+branch coverage
  • ruff check . clean
  • mypy sitesavvy clean
  • Tested on Python 3.12

Legal

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Licensed under the MIT License.