Sitesync is an experimental CLI for synchronizing web source assets into a local, queryable asset store (SQLite + normalized outputs).
This repository is early-stage. Interfaces and storage layouts may change.
- Resumable runs backed by SQLite (task leasing + restart support)
- Pluggable normalization via asset plugins (built-in pagenormalizer, plus entry points)
- Run metadata + a lightweight Markdown status report
- Data access CLI for querying and exporting captured assets
- Domain-scoped allow/deny globs to keep crawl scope bounded
- Configurable fetch timeouts and queue backpressure for long-running runs
- Auth-redirect guardrails with suggested config updates
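As a rough illustration of what "task leasing + restart support" can mean in practice, here is a minimal sketch using Python's standard `sqlite3` module. The `tasks` table, its columns, the `LEASE_SECONDS` value, and the helper names are all hypothetical; Sitesync's actual schema and leasing logic are not described in this document and may differ.

```python
import sqlite3
import time

LEASE_SECONDS = 60  # hypothetical lease duration


def open_queue(path=":memory:"):
    """Open (or create) a queue database with a hypothetical tasks table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id INTEGER PRIMARY KEY,
               url TEXT NOT NULL UNIQUE,
               status TEXT NOT NULL DEFAULT 'pending',
               lease_expires_at REAL
           )"""
    )
    conn.commit()
    return conn


def lease_next(conn, now=None):
    """Lease the oldest task that is pending or whose lease has expired."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, url FROM tasks "
            "WHERE status = 'pending' "
            "   OR (status = 'leased' AND lease_expires_at < ?) "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status = 'leased', lease_expires_at = ? WHERE id = ?",
            (now + LEASE_SECONDS, row[0]),
        )
    return row


def complete(conn, task_id):
    """Mark a task as done so it is never leased again."""
    with conn:
        conn.execute(
            "UPDATE tasks SET status = 'done', lease_expires_at = NULL WHERE id = ?",
            (task_id,),
        )
```

Because an interrupted worker never calls `complete`, its lease eventually expires and the task becomes eligible again, which is one simple way a restarted run can pick up where it left off. This sketch is single-process only; a multi-worker version would need a stricter transaction (for example `BEGIN IMMEDIATE`).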
- Python 3.13
- uv (used by the Make targets): https://github.com/astral-sh/uv
- Optional: Playwright browser binaries if using the Playwright fetcher (uv run playwright install chromium)
Install Sitesync into /usr/local/bin using a wheel build (no sudo):
make install
# or: make install PREFIX=$HOME/.local
Developer tooling (validate/test, creates/updates .venv):
make install-dev
Run Sitesync without activating a virtualenv:
uv run sitesync --help
Build and install a bundled executable (PyInstaller spec is committed as sitesync.spec):
make standalone
# or: make standalone BINDIR=$HOME/.local/bin
Start a run from one or more seed URLs:
uv run sitesync crawl --start-url https://example.com --depth 1
To stop an active run, press Esc twice.
Check status/history:
uv run sitesync status
Sitesync loads YAML configuration in one of two modes:
- Config document: pass --config PATH to load a single configuration file (replace-by-default).
- Default precedence: when --config is not provided, load:
  - config/default.yaml (or packaged default if missing)
  - config/local.yaml (optional; ignored by git)
If a configuration contains multiple source profiles, select one with --source NAME.
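To make the default precedence concrete, the sketch below overlays a local configuration on top of a default one. It assumes nested mappings are merged key by key, with the later file winning on conflicts; this document does not state whether Sitesync merges deeply or replaces top-level sections wholesale, so treat the `overlay` helper and the sample values as illustrative only.

```python
from copy import deepcopy


def overlay(base, override):
    """Return a new dict where values from `override` win over `base`.

    Nested dicts are merged recursively (an assumption, not confirmed
    Sitesync behavior); any other value type is replaced outright.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for the parsed contents of config/default.yaml and config/local.yaml.
default_cfg = {
    "crawler": {"fetch_timeout_seconds": 20},
    "storage": {"path": "./sitesync.sqlite"},
}
local_cfg = {
    "crawler": {"fetch_timeout_seconds": 45},
}

effective = overlay(default_cfg, local_cfg)
print(effective)
```

Under these assumptions the local file overrides only the timeout, while `storage.path` is inherited from the default. Passing `--config PATH` would skip this layering entirely and load the single named file.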
Environment variables are optional. See .env.example.
allowed_domains is a mapping from domain to optional path filters. Path rules are
exact by default; use glob wildcards for broader matches. Deny rules take precedence.
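The rule order described here (domain lookup, then deny rules, then allow rules) can be sketched in a few lines of Python. This is not Sitesync's matcher: it uses the standard library's `fnmatch.fnmatchcase`, where `*` also matches `/`, and it assumes that a domain with no `allow_paths` allows every path and that unlisted domains are out of scope. The `is_in_scope` name is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Python equivalent of an allowed_domains mapping.
ALLOWED_DOMAINS = {
    "example.com": {
        "allow_paths": ["/docs", "/docs/**"],
        "deny_paths": ["/login", "/docs/private/**"],
    },
    "api.example.com": {},  # no path filters: assumed to allow everything
}


def is_in_scope(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if `url` passes the domain, deny, and allow checks."""
    parts = urlsplit(url)
    rules = allowed_domains.get(parts.hostname or "")
    if rules is None:
        return False  # domain not listed at all
    path = parts.path or "/"
    # Deny rules take precedence over allow rules.
    if any(fnmatchcase(path, pattern) for pattern in rules.get("deny_paths", [])):
        return False
    allow = rules.get("allow_paths", [])
    if not allow:
        return True
    return any(fnmatchcase(path, pattern) for pattern in allow)
```

With this sketch, `/docs` and `/docs/guide/intro` on example.com are in scope, while `/docs/private/report` is rejected even though it also matches the `/docs/**` allow rule.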
allowed_domains:
  example.com:
    allow_paths:
      - /docs # exact match only
      - /docs/** # allow subtree
    deny_paths:
      - /login # exact match
      - /docs/private/** # deny subtree
  api.example.com: {}
Optional hard per-task timeout:
crawler:
  fetch_timeout_seconds: 20
Auth redirects
If a fetch ends on /auth/login with a continue= parameter, Sitesync will
skip link discovery on that page and add a runtime deny rule for /auth/**
and the continue path subtree for the rest of the run.
When this happens, Sitesync prints a suggested YAML update at the end of the run
so you can make the deny rules permanent in your config.
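The guardrail above can be approximated as a small function that inspects the final URL of a fetch and returns the deny patterns to add. This is a sketch under assumptions: the function name is hypothetical, and "the continue path subtree" is interpreted here as the continue path itself plus everything beneath it.

```python
from urllib.parse import parse_qs, urlsplit


def auth_redirect_deny_rules(final_url):
    """Return runtime deny patterns if `final_url` looks like an auth redirect.

    An empty list means the page is not treated as an auth redirect.
    """
    parts = urlsplit(final_url)
    if parts.path != "/auth/login":
        return []
    targets = parse_qs(parts.query).get("continue")
    if not targets:
        return []
    # The continue value may be a bare path or a full URL; keep only the path.
    continue_path = urlsplit(targets[0]).path.rstrip("/")
    rules = ["/auth/**"]
    if continue_path:
        rules.append(continue_path)          # the page itself
        rules.append(continue_path + "/**")  # everything beneath it
    return rules
```

For example, a fetch that ends on `https://example.com/auth/login?continue=%2Fdocs%2Fprivate%2Freport` would yield `/auth/**`, `/docs/private/report`, and `/docs/private/report/**`, which are the kind of entries one would then copy into `deny_paths`.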
To create a starter config interactively:
uv run sitesync init
Then run a crawl:
uv run sitesync crawl
Inspect the effective configuration:
uv run sitesync config show --paths
- SQLite database: storage.path or ./sitesync.sqlite
- Run metadata JSON: outputs.base_path/outputs.metadata_subdir (per run)
- Status report: tracking/status.md
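Since the store is a plain SQLite file, it can also be inspected directly. The sketch below opens a database read-only and lists its tables without assuming anything about Sitesync's schema, which is not documented here and may change. The `list_tables` helper is hypothetical, and the path in the usage comment is simply the default location mentioned above.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, opened in read-only mode."""
    # A file: URI with mode=ro prevents accidental writes or file creation.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Usage (assuming the default database location):
#     print(list_tables("./sitesync.sqlite"))
```

For day-to-day inspection the data command group is likely the more stable interface, since the storage layout may change between versions.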
After a run completes, use the data command group to inspect results.
List all sources:
uv run sitesync data
View source summary:
uv run sitesync data source MySite
List runs for a source:
uv run sitesync data source MySite runs
List assets from the most recent completed run:
uv run sitesync data source MySite assets
uv run sitesync data source MySite assets --type page
uv run sitesync data source MySite assets --url "**/products/*"
View asset content:
uv run sitesync data source MySite content 1234
uv run sitesync data source MySite content --url "https://example.com/page"
Search content within a source:
uv run sitesync data source MySite grep "pricing"
uv run sitesync data source MySite grep "pricing" --regex -C 2
Search content across all sources:
uv run sitesync data sources grep "spend management"
View detailed statistics:
uv run sitesync data source MySite stats
Export assets to a directory:
uv run sitesync data source MySite export ./output --with-metadata
uv run sitesync data source MySite export ./output --dry-run
Delete all data for a source:
uv run sitesync data source OldSource delete
uv run sitesync data source OldSource delete --force
Sitesync fetches and normalizes content from configured sources. Ensure you have permission to access targets and comply with applicable laws, access policies, and site terms. Captured content may contain sensitive data; store and handle it appropriately.
- sitesync crawl: start or resume a crawl run
- sitesync init: interactively generate a starter config file
- sitesync config show: show the effective configuration for this invocation
- sitesync status: show recent runs and queue summary
- sitesync data: list sources (discovery)
- sitesync data sources: list sources
- sitesync data sources grep: search across all sources
- sitesync data source <name>: show source summary
- sitesync data source <name> runs: list runs
- sitesync data source <name> assets: list assets
- sitesync data source <name> content: view asset content
- sitesync data source <name> tasks: inspect task queue
- sitesync data source <name> export: export assets to directory
- sitesync data source <name> stats: detailed statistics
- sitesync data source <name> grep: search within source
- sitesync data source <name> delete: delete source data
- sitesync version: print version
Install pre-commit hooks to run ruff on every commit:
uv run pre-commit install
Run hooks manually on all files:
uv run pre-commit run --all-files
make validate # ruff check, ruff format, uv ty, pytest
make test # unit tests only
make e2e # end-to-end tests (requires playwright)
Releases are automated via GitHub Actions:
- Push a tag v* to trigger a release build
- Or use manual workflow dispatch from the Actions tab
Release artifacts:
- Python wheel (sitesync-{version}-py3-none-any.whl)
- Linux x64 binary (sitesync-{version}-linux-x64)
- Linux ARM64 binary (sitesync-{version}-linux-arm64)
Build a wheel locally for testing:
make release-build
- Architecture: docs/architecture.md
- Agent roles: docs/agents.md
- Contributing: CONTRIBUTING.md
- Code of Conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Support: SUPPORT.md
See CHANGELOG.md.
- MIT License: LICENSE
- Third-party notices: THIRD_PARTY_NOTICES.md
No emojis are used in documentation, console output, or logs.