Distributed web crawler built on NATS JetStream and Playwright. Workers pull tasks from the queue, visit pages, and save network traffic to the object store as Parquet files.
To read more about each part of the crawler, see its README file (for example, worker/README.md).
Prerequisites:
- Docker and Docker Compose (for the worker and NATS JetStream)
- uv for running scripts
- Rill to analyze the completed experiments
Start NATS:
docker compose -f docker-compose.hub.yml up -d
Start the worker:
docker compose -f docker-compose.worker.yml up -d
Install script dependencies:
uv sync

You can configure the crawler's endpoints and behaviour through environment variables in .env.
To do so, copy .env.example to .env and adjust the variables you need. I added comments explaining what each variable does.
The default values should work for local development.
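
If one of your own scripts needs the same settings, the lookup order can be sketched as follows. This is a minimal, standard-library-only illustration and not necessarily how the crawler itself loads its configuration. The variable name `EXAMPLE_NATS_URL` in the usage note is hypothetical; the real names are listed in .env.example.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comment lines are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values  # no .env file: callers fall back to their defaults
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def get_setting(name: str, default: str, file_values: dict[str, str]) -> str:
    """Real environment variables win over .env entries, which win over defaults."""
    return os.environ.get(name, file_values.get(name, default))
```

For example, `get_setting("EXAMPLE_NATS_URL", "nats://localhost:4222", load_env_file())` returns the default when neither the environment nor .env defines that (hypothetical) variable. Inline comments after a value are not handled by this sketch.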
- Fetch and cache the Tranco list (only needed once):
  uv run python scripts/fetch_tranco.py
- Seed the queue with URLs:
  uv run dist-seed --accept-cookies --no-navigate --depth 0 --dwell-seconds 30 --scroll-amounts 0 --num-tranco 5000 --browser chromium --no-headless example
- Check the progress:
  uv run dist-status
- Download and post-process results:
  uv run dist-download
- Analyze the results:
  cd analysis && rill start

After the model is built by Rill, you can connect a DuckDB client to tmp/default/duckdb/main.db for further analysis.
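
As one way to connect a client, the sketch below lists the tables in that file using the third-party `duckdb` Python package. It assumes the package is installed and that the path is resolved relative to the directory where Rill was started. The table names depend on the Rill model, so nothing is assumed about them here. Opening the file read-only avoids modifying it, but the connection may still fail while another process holds the file open for writing.

```python
from pathlib import Path


def list_tables(db_path: str = "tmp/default/duckdb/main.db") -> list[str]:
    """Return the table names found in the DuckDB file built by Rill."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"DuckDB file not found: {path}")

    import duckdb  # third-party package, imported lazily (assumed installed)

    # Read-only mode: we only want to inspect the data, never change it.
    con = duckdb.connect(str(path), read_only=True)
    try:
        return [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    finally:
        con.close()
```

From there, `con.execute("SELECT ...").fetchall()` works the same way for ad hoc queries against whichever tables are listed.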
Run all tests (unit tests + e2e tests):
uv run pytest tests/
Run only unit tests:
uv run pytest tests/scripts/ tests/worker/
Run e2e tests (requires Docker Compose):
uv run pytest tests/e2e/

Scraping thousands of URLs benefits greatly from horizontal scaling, so for larger crawls it is a good idea to work with a large number of parallel nodes. Reliability of the nodes is not a high priority (NATS automatically redelivers failed tasks), so AWS Spot Instances or distributed clouds such as Salad can be used to keep the costs low. See worker/README.md for more details on how to distribute to multiple nodes.
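
The reason unreliable nodes are acceptable can be illustrated with the acknowledgement pattern below. This is a sketch, not the project's actual worker code: `process_task` and `handler` are hypothetical names, and the message only needs `data`, `ack()` and `nak()`, which is the shape of a JetStream message in the nats-py client. A task is acknowledged only after it has been handled; a failed task is negatively acknowledged so that it is redelivered, and a node that disappears mid-task never acknowledges at all, so JetStream redelivers the message once the consumer's ack wait expires.

```python
from typing import Awaitable, Callable, Protocol


class AckableMessage(Protocol):
    """Minimal message shape assumed by this sketch."""

    data: bytes

    async def ack(self) -> None: ...

    async def nak(self) -> None: ...


async def process_task(
    msg: AckableMessage,
    handler: Callable[[bytes], Awaitable[None]],
) -> bool:
    """Run the handler on one task and report the outcome back to the queue."""
    try:
        await handler(msg.data)
    except Exception:
        # Failure: ask for redelivery so this or another worker can retry.
        await msg.nak()
        return False
    # Success: acknowledge so the task is not delivered again.
    await msg.ack()
    return True
```

Because the acknowledgement comes last, losing a Spot Instance in the middle of a page visit costs only a repeated visit, not a lost URL.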
Here is a high-level overview of the architecture I used to deploy the crawler to the cloud.
MIT
