SiteSavvy v0.1.0 — Capture the web, your way.

A modern, async, cross-platform web scraper with full-site and text-only modes, six export formats (HTML, Markdown, Text, PDF, EPUB, ZIP), robots.txt awareness, resume/incremental crawling, and optional Playwright headless rendering.

Installation

Via pip (once published to PyPI)

pip install sitesavvy

From this release (no Python required)

Download the platform binary for your OS, extract, and run:

# Linux (x86_64)
tar -xzf sitesavvy-0.1.0-linux-x86_64.tar.gz
./sitesavvy --help

# Windows (Command Prompt or PowerShell)
.\sitesavvy-0.1.0-windows-x86_64.exe --help

# macOS
chmod +x sitesavvy-0.1.0-macos-x86_64
./sitesavvy-0.1.0-macos-x86_64 --help

From the Python packages (wheel/sdist)

pip install sitesavvy-0.1.0-py3-none-any.whl
# or
pip install sitesavvy-0.1.0.tar.gz

Release Assets

Asset	Type	Size	Notes
`sitesavvy-0.1.0-linux-x86_64.tar.gz`	PyInstaller binary (Linux)	~88 MB	Single-file executable, no Python needed
`sitesavvy-0.1.0-windows-x86_64.exe`	PyInstaller binary (Windows)	pending	Built by CI
`sitesavvy-0.1.0-macos-x86_64`	PyInstaller binary (macOS)	pending	Built by CI
`sitesavvy-0.1.0-py3-none-any.whl`	Python wheel	32 KB	Universal, `pip install`
`sitesavvy-0.1.0.tar.gz`	Source distribution	28 KB	`pip install` from source

Note on platform binaries: PyInstaller binaries are OS- and architecture-specific and must be built on the target OS. The Linux binary attached to this release has already been built; the Windows .exe and macOS binary are still pending and will be attached automatically by the GitHub Actions release workflow on the next push (see .github/workflows/release.yml).

Features

Two crawl modes: full (mirror the whole site) and text (readable text only)
Six output formats: HTML, Markdown, plain text, PDF, EPUB, ZIP archive
Polite by default: respects robots.txt, per-host rate limiting, auto-throttle on 429/5xx
Resume & incremental: JSON manifest tracks every URL; resume skips completed work, incremental re-downloads only changed resources (ETag/Last-Modified)
Concurrency control with a global semaphore and per-host locks
Dry-run mode to preview URLs before downloading
Headless rendering via Playwright (with graceful aiohttp fallback)
Fine-grained --download-types filtering (html, css, js, img, pdf, other)
External-link gating — stays on the start host unless --external is passed
Rich CLI with progress tables and coloured output
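
To illustrate the concurrency model named in the list above (a global semaphore plus per-host locks), here is a minimal asyncio sketch. It is not SiteSavvy's actual code: the class name PoliteScheduler, the max_concurrency parameter, the fake_fetch stand-in, and the choice to take the host lock before the global slot are all assumptions made for the example.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical helper: caps total in-flight fetches and
    allows at most one request per host at a time."""

    def __init__(self, max_concurrency: int = 8) -> None:
        self._global = asyncio.Semaphore(max_concurrency)
        self._host_locks = defaultdict(asyncio.Lock)

    async def run(self, url, fetch):
        host = urlsplit(url).netloc
        # Take the host lock first so requests queued behind a busy
        # host do not occupy global slots while they wait.
        async with self._host_locks[host]:
            async with self._global:
                return await fetch(url)


async def demo():
    """Run nine fake fetches across three hosts and record peak concurrency."""
    scheduler = PoliteScheduler(max_concurrency=2)
    in_flight = 0
    peak = 0
    per_host_peak = 0
    active = defaultdict(int)

    async def fake_fetch(url):
        # Stand-in for a real HTTP request; only tracks concurrency.
        nonlocal in_flight, peak, per_host_peak
        host = urlsplit(url).netloc
        in_flight += 1
        active[host] += 1
        peak = max(peak, in_flight)
        per_host_peak = max(per_host_peak, active[host])
        await asyncio.sleep(0.01)
        in_flight -= 1
        active[host] -= 1
        return url

    urls = [f"https://{h}.example/{i}" for h in ("a", "b", "c") for i in range(3)]
    await asyncio.gather(*(scheduler.run(u, fake_fetch) for u in urls))
    return peak, per_host_peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the global semaphore bounds the total load the crawler generates, while the per-host lock keeps any single server from receiving parallel requests. A real crawler would likely add a delay between requests to the same host on top of this.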

Verification

111 tests, 92% line+branch coverage
ruff check . clean
mypy sitesavvy clean
Tested on Python 3.12

Legal

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Licensed under the MIT License.
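
For readers who want to check by hand what a site's robots.txt permits, the sketch below uses Python's standard urllib.robotparser module. It is an independent illustration, not SiteSavvy's own implementation; the sample rules, the example.com URLs, and the "ExampleCrawler" user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS)
    agent = "ExampleCrawler"  # assumed user-agent string
    for url in (
        "https://example.com/docs/index.html",
        "https://example.com/private/notes.html",
    ):
        print(url, parser.can_fetch(agent, url))
    print("crawl delay:", parser.crawl_delay(agent))
```

With the sample rules, the /docs/ URL is allowed, the /private/ URL is disallowed, and the requested crawl delay is 2 seconds. To check a live site, RobotFileParser also offers set_url() and read() to download the file instead of parsing inline text.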
