SiteSavvy v0.1.0 — Capture the web, your way.
A modern, async, cross-platform web scraper with full-site and text-only modes, six export formats (HTML, Markdown, Text, PDF, EPUB, ZIP), robots.txt awareness, resume/incremental crawling, and optional Playwright headless rendering.
Installation
Via pip (once published to PyPI)
pip install sitesavvy
From this release (no Python required)
Download the platform binary for your OS, extract, and run:
# Linux (x86_64)
tar -xzf sitesavvy-0.1.0-linux-x86_64.tar.gz
./sitesavvy --help
# Windows
# sitesavvy-0.1.0-windows-x86_64.exe: run in Command Prompt / PowerShell
# macOS
# sitesavvy-0.1.0-macos-x86_64: chmod +x and run
From source (sdist/wheel)
pip install sitesavvy-0.1.0-py3-none-any.whl
# or
pip install sitesavvy-0.1.0.tar.gz
Release Assets
| Asset | Type | Size | Notes |
|---|---|---|---|
| `sitesavvy-0.1.0-linux-x86_64.tar.gz` | PyInstaller binary (Linux) | ~88 MB | Single-file executable, no Python needed |
| `sitesavvy-0.1.0-windows-x86_64.exe` | PyInstaller binary (Windows) | pending | Built by CI |
| `sitesavvy-0.1.0-macos-x86_64` | PyInstaller binary (macOS) | pending | Built by CI |
| `sitesavvy-0.1.0-py3-none-any.whl` | Python wheel | 32 KB | Universal, pip install |
| `sitesavvy-0.1.0.tar.gz` | Source distribution | 28 KB | pip install from source |
Note on platform binaries: PyInstaller binaries are OS- and architecture-specific and must be built on the target OS. The Linux binary is built directly here; the Windows `.exe` and macOS binary will be attached automatically by the GitHub Actions release workflow on the next push (see `.github/workflows/release.yml`).
Features
- Two crawl modes: `full` (mirror the whole site) and `text` (readable text only)
- Six output formats: HTML, Markdown, plain text, PDF, EPUB, ZIP archive
- Polite by default: respects `robots.txt`, per-host rate limiting, auto-throttle on 429/5xx
- Resume & incremental: JSON manifest tracks every URL; resume skips completed work, incremental re-downloads only changed resources (ETag/Last-Modified)
- Concurrency control with a global semaphore and per-host locks
- Dry-run mode to preview URLs before downloading
- Headless rendering via Playwright (with graceful `aiohttp` fallback)
- Fine-grained `--download-types` filtering (html, css, js, img, pdf, other)
- External-link gating: stays on the start host unless `--external` is passed
- Rich CLI with progress tables and coloured output
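As an illustration of the concurrency and external-link features listed above, here is a minimal sketch of how a global semaphore, per-host locks, and start-host gating can fit together in `asyncio`. It is not SiteSavvy's actual implementation: `crawl`, `fetch`, `MAX_CONCURRENCY`, and `allow_external` are hypothetical names chosen for this example, and `fetch` is a stand-in that performs no network I/O.

```python
import asyncio
from collections import defaultdict
from typing import Optional
from urllib.parse import urlsplit

MAX_CONCURRENCY = 4  # hypothetical global limit, not a SiteSavvy default


async def fetch(url: str) -> str:
    """Stand-in for a real HTTP request (assumption: no network access)."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def crawl(urls: list[str], start_host: str,
                allow_external: bool = False) -> list[str]:
    # One semaphore caps the total number of in-flight requests.
    global_slots = asyncio.Semaphore(MAX_CONCURRENCY)
    # One lock per host serialises requests to the same server.
    host_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def worker(url: str) -> Optional[str]:
        host = urlsplit(url).hostname or ""
        # External-link gating: skip other hosts unless explicitly allowed.
        if host != start_host and not allow_external:
            return None
        # Take the host lock first so that tasks queued behind a busy
        # host do not occupy a global slot while they wait.
        async with host_locks[host]:
            async with global_slots:
                return await fetch(url)

    results = await asyncio.gather(*(worker(u) for u in urls))
    return [r for r in results if r is not None]


if __name__ == "__main__":
    demo = [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.example.org/x",
    ]
    print(asyncio.run(crawl(demo, "example.com")))
```

Acquiring the per-host lock before the global semaphore is one possible ordering; the reverse would also be safe from deadlock but could leave global slots idle while tasks wait on a single slow host.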
Verification
- 111 tests, 92% line+branch coverage
- `ruff check .` clean
- `mypy sitesavvy` clean
- Tested on Python 3.12
Legal
SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Licensed under the MIT License.