distcrawl

Distributed web crawler built on NATS JetStream and Playwright. Workers pull tasks from the queue, visit pages, and save network traffic to the object store as Parquet files.

Documentation

To read more about each part of the crawler, check the README file in its directory, for example worker/README.md.

Prerequisites

Docker and Docker Compose for running the worker and NATS JetStream
uv for running the scripts
Rill for analyzing the completed experiments

Local Setup

Start NATS:

docker compose -f docker-compose.hub.yml up -d

Start the worker:

docker compose -f docker-compose.worker.yml up -d

Install script dependencies:

uv sync

You can configure the crawler's endpoints and behaviour through environment variables in .env. Copy .env.example to .env and adjust the variables you need. I added comments that explain what each variable does. The default values should work for local development.
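To make the precedence between defaults, .env, and the process environment concrete, here is a minimal sketch of how such settings could be resolved. It is not the crawler's actual loader: the variable names NATS_URL and HEADLESS are hypothetical placeholders (the real names are listed in .env.example), and the rule that real environment variables win over .env entries is an assumption of this sketch.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerSettings:
    # Both fields are hypothetical; the real variables are in .env.example.
    nats_url: str
    headless: bool


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(dotenv_text: str = "", environ=None) -> CrawlerSettings:
    """Resolve settings: process environment > .env entries > defaults."""
    environ = os.environ if environ is None else environ
    merged = {**parse_dotenv(dotenv_text), **environ}
    return CrawlerSettings(
        # 4222 is the standard NATS client port.
        nats_url=merged.get("NATS_URL", "nats://localhost:4222"),
        headless=merged.get("HEADLESS", "true").lower() in {"1", "true", "yes"},
    )
```

Calling load_settings("") with an empty environment falls back to the defaults, which matches the idea that local development needs no changes to .env.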

Running an Experiment

Fetch and cache the Tranco list (only needed once):

uv run python scripts/fetch_tranco.py

Seed the queue with URLs:

uv run dist-seed --accept-cookies --no-navigate --depth 0 --dwell-seconds 30 --scroll-amounts 0 --num-tranco 5000 --browser chromium --no-headless example

Check the progress:

uv run dist-status

Download and post-process results:

uv run dist-download

Analyze the results:

cd analysis && rill start

After the model is built by Rill, you can connect a DuckDB client to tmp/default/duckdb/main.db for further analysis.
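If you prefer to query the database from Python instead of a standalone client, the following sketch lists what the file contains. It assumes the third-party duckdb package is installed and that the path is relative to the directory in which rill start was run. Table names depend on the Rill model, so none are hard-coded here.

```python
from pathlib import Path

# Path quoted in the README; assumed to be relative to the directory
# in which `rill start` was run.
DEFAULT_DB = Path("tmp/default/duckdb/main.db")


def list_tables(db_path: Path = DEFAULT_DB) -> list[str]:
    """Return the names reported by SHOW TABLES for the Rill-built file."""
    if not db_path.is_file():
        raise FileNotFoundError(
            f"{db_path} not found; has Rill finished building the model?"
        )
    # Imported lazily so the missing-file check works without the package.
    import duckdb  # third-party: pip install duckdb

    # Open read-only so this sketch cannot modify the analysis database.
    con = duckdb.connect(str(db_path), read_only=True)
    try:
        return [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    finally:
        con.close()
```

DuckDB may refuse the connection while another process still has the file open for writing, so it can be necessary to stop Rill first.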

Running Tests

Run all tests (unit tests + e2e tests):

uv run pytest tests/

Run only unit tests:

uv run pytest tests/scripts/ tests/worker/

Run e2e tests (requires Docker Compose):

uv run pytest tests/e2e/

Deploying to the Cloud

Crawling thousands of URLs benefits greatly from horizontal scaling, so larger crawls should run on many parallel nodes. Node reliability is not a high priority because NATS automatically redelivers failed tasks, so AWS Spot Instances or distributed clouds like Salad can be used to keep costs low. See worker/README.md for more details on how to distribute the crawl across multiple nodes.
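The reason unreliable nodes are acceptable can be illustrated with a toy, in-memory model of at-least-once delivery. This is not the NATS API and not the crawler's code: run_with_redelivery, flaky_node, and the max_deliver cap are hypothetical names chosen for the illustration, and a raised exception stands in for a node that disappears before acknowledging its task.

```python
from collections import deque


def run_with_redelivery(tasks, handler, max_deliver=3):
    """Toy at-least-once queue.

    A task whose handler raises is put back on the queue until it
    succeeds or has been delivered max_deliver times.
    """
    queue = deque((task, 1) for task in tasks)
    done, dropped = [], []
    while queue:
        task, attempt = queue.popleft()
        try:
            handler(task, attempt)
        except Exception:
            if attempt < max_deliver:
                queue.append((task, attempt + 1))  # redeliver later
            else:
                dropped.append(task)  # delivery cap reached
        else:
            done.append(task)  # handler finished, task is acknowledged
    return done, dropped


def flaky_node(url, attempt):
    # Hypothetical worker: simulates a spot instance being reclaimed
    # during the first attempt at one URL.
    if url == "https://b.example" and attempt == 1:
        raise RuntimeError("node was preempted")


done, dropped = run_with_redelivery(
    ["https://a.example", "https://b.example", "https://c.example"],
    flaky_node,
)
print(done)     # ['https://a.example', 'https://c.example', 'https://b.example']
print(dropped)  # []
```

The interrupted URL still completes, only later and possibly on a different node, which is why cheap preemptible capacity is a reasonable fit for this workload.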

Architecture diagram

Here is a high-level overview of the architecture I used to deploy the crawler to the cloud.

License

MIT
