SiteSavvy v0.1.0 — Capture the web, your way.

Released by @Bloody-Crow on 22 Jun at 22:15

A modern, async, cross-platform web scraper with full-site and text-only modes, six export formats (HTML, Markdown, Text, PDF, EPUB, ZIP), robots.txt awareness, resume/incremental crawling, and optional Playwright headless rendering.

Installation

Via pip (once published to PyPI)

pip install sitesavvy

From this release (no Python required)

Download the platform binary for your OS, extract, and run:

# Linux (x86_64)
tar -xzf sitesavvy-0.1.0-linux-x86_64.tar.gz
./sitesavvy --help

# Windows (x86_64): run from Command Prompt or PowerShell
# sitesavvy-0.1.0-windows-x86_64.exe --help

# macOS (x86_64): mark the binary executable, then run it
# chmod +x sitesavvy-0.1.0-macos-x86_64
# ./sitesavvy-0.1.0-macos-x86_64 --help

From the Python packages attached to this release (wheel or sdist)

pip install sitesavvy-0.1.0-py3-none-any.whl
# or
pip install sitesavvy-0.1.0.tar.gz

Release Assets

| Asset | Type | Size | Notes |
|---|---|---|---|
| sitesavvy-0.1.0-linux-x86_64.tar.gz | PyInstaller binary (Linux) | ~88 MB | Single-file executable, no Python needed |
| sitesavvy-0.1.0-windows-x86_64.exe | PyInstaller binary (Windows) | pending | Built by CI |
| sitesavvy-0.1.0-macos-x86_64 | PyInstaller binary (macOS) | pending | Built by CI |
| sitesavvy-0.1.0-py3-none-any.whl | Python wheel | 32 KB | Universal, pip install |
| sitesavvy-0.1.0.tar.gz | Source distribution | 28 KB | pip install from source |

Note on platform binaries: PyInstaller binaries are OS- and architecture-specific, so each one must be built on its target OS. The Linux binary is already attached to this release. The Windows .exe and the macOS binary are still pending; the GitHub Actions release workflow (see .github/workflows/release.yml) will attach them automatically on the next push.
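
Because each binary targets one OS and architecture, it can help to pick the asset programmatically. The sketch below is only an illustration, not part of SiteSavvy: `pick_asset` is a hypothetical helper, the asset names are copied from the table above, and falling back to the wheel on anything other than x86_64 is an assumption.

```python
import platform

VERSION = "0.1.0"

# Asset names as listed under Release Assets; only x86_64 binaries are listed.
BINARIES = {
    "Linux": f"sitesavvy-{VERSION}-linux-x86_64.tar.gz",
    "Windows": f"sitesavvy-{VERSION}-windows-x86_64.exe",
    "Darwin": f"sitesavvy-{VERSION}-macos-x86_64",
}
WHEEL = f"sitesavvy-{VERSION}-py3-none-any.whl"


def pick_asset(system=None, machine=None):
    """Return the release asset name that matches this machine.

    Hypothetical helper: falls back to the universal wheel when no
    matching binary exists (assumed policy, e.g. for ARM machines).
    """
    system = system or platform.system()            # 'Linux', 'Windows', 'Darwin'
    machine = (machine or platform.machine()).lower()
    # Windows reports 'AMD64' where Linux and macOS report 'x86_64'.
    if machine in {"x86_64", "amd64"} and system in BINARIES:
        return BINARIES[system]
    return WHEEL


if __name__ == "__main__":
    print(pick_asset())
```

Note that the Windows and macOS names resolve to assets that are still marked pending, so the wheel remains the practical choice on those platforms until CI attaches the binaries.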

Features

  • Two crawl modes: full (mirror the whole site) and text (readable text only)
  • Six output formats: HTML, Markdown, plain text, PDF, EPUB, ZIP archive
  • Polite by default: respects robots.txt, per-host rate limiting, auto-throttle on 429/5xx
  • Resume & incremental: JSON manifest tracks every URL; resume skips completed work, incremental re-downloads only changed resources (ETag/Last-Modified)
  • Concurrency control with a global semaphore and per-host locks
  • Dry-run mode to preview URLs before downloading
  • Headless rendering via Playwright (with graceful aiohttp fallback)
  • Fine-grained --download-types filtering (html, css, js, img, pdf, other)
  • External-link gating — stays on the start host unless --external is passed
  • Rich CLI with progress tables and coloured output
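
The "global semaphore and per-host locks" item above can be pictured with a small asyncio sketch. This is not SiteSavvy's actual code: `PoliteLimiter`, `run`, and `fake_fetch` are hypothetical names, the example.* URLs are placeholders, and a real crawler would call an HTTP client where the sketch sleeps.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteLimiter:
    """Caps total in-flight requests and serialises requests per host."""

    def __init__(self, max_concurrent=4):
        self._global = asyncio.Semaphore(max_concurrent)
        # One lock per host, created lazily the first time a host is seen.
        self._host_locks = defaultdict(asyncio.Lock)

    async def run(self, url, fetch):
        host = urlsplit(url).netloc
        # Take the host lock first, so tasks queued behind a busy host
        # do not hold a global slot while they wait.
        async with self._host_locks[host]:
            async with self._global:
                return await fetch(url)


async def demo():
    """Run six fake fetches and report the peak concurrency observed."""
    limiter = PoliteLimiter(max_concurrent=2)
    in_flight = 0
    peak = 0
    per_host = defaultdict(int)
    per_host_peak = defaultdict(int)

    async def fake_fetch(url):
        # Stand-in for a network request (assumption: no real I/O here).
        nonlocal in_flight, peak
        host = urlsplit(url).netloc
        in_flight += 1
        per_host[host] += 1
        peak = max(peak, in_flight)
        per_host_peak[host] = max(per_host_peak[host], per_host[host])
        await asyncio.sleep(0.01)
        in_flight -= 1
        per_host[host] -= 1
        return url

    urls = [
        "https://example.com/a", "https://example.com/b", "https://example.com/c",
        "https://example.org/a", "https://example.org/b",
        "https://example.net/a",
    ]
    results = await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return peak, dict(per_host_peak), results


if __name__ == "__main__":
    peak, host_peaks, _ = asyncio.run(demo())
    print("peak in flight:", peak, "per-host peaks:", host_peaks)
```

In this sketch the semaphore bounds overall load while the per-host lock guarantees at most one request per host at a time. Per-host rate limiting and auto-throttling would add a delay inside the host lock, which is left out here.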

Verification

  • 111 tests, 92% line+branch coverage
  • ruff check . clean
  • mypy sitesavvy clean
  • Tested on Python 3.12

Legal

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Licensed under the MIT License.