Crawler for probing popular domains for machine-readable, callable, commerce, and payment surfaces.
The crawler reads a ranked domain CSV, writes compact receipt shards for every crawled domain, and writes expanded JSON/evidence only for domains with interesting signals.
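For a sense of what a compact receipt looks like, the sketch below emits one hypothetical NDJSON row; the field names (`domain`, `rank`, `signals`) are illustrative assumptions, not the crawler's actual schema:

```python
import json

# Hypothetical compact receipt for one crawled domain. Real shards live at
# results/receipts/receipt-*.ndjson, one JSON object per line; these field
# names are assumptions for illustration, not concurrent_crawl.py's schema.
receipt = {"domain": "example.com", "rank": 1234, "signals": ["payments"]}
print(json.dumps(receipt))
```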
Requirements:

- Python 3.10 or newer
- `curl` and `unzip` to fetch the Tranco input list
The crawler uses only the Python standard library.
Download the latest standard Tranco list from the permanent URL documented at https://tranco-list.eu/:
```sh
curl -L -o top-1m.csv.zip https://tranco-list.eu/top-1m.csv.zip
unzip -p top-1m.csv.zip top-1m.csv > top-1m.csv
rm top-1m.csv.zip
```

The resulting `top-1m.csv` file is ignored by git.
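If `curl` or `unzip` is unavailable, the same download can be done with the Python standard library alone; a minimal sketch using the URL documented above:

```python
import urllib.request
import zipfile

# Download the zipped Tranco list and extract top-1m.csv into the
# current directory, mirroring the curl/unzip commands above.
URL = "https://tranco-list.eu/top-1m.csv.zip"
urllib.request.urlretrieve(URL, "top-1m.csv.zip")
with zipfile.ZipFile("top-1m.csv.zip") as zf:
    zf.extract("top-1m.csv")
```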
Run a small smoke crawl first:
```sh
python3 concurrent_crawl.py --csv ./top-1m.csv --limit 100 --concurrency 8
```

Run the full crawl:
```sh
python3 concurrent_crawl.py --csv ./top-1m.csv --results-dir ./results --concurrency 24
```

By default, the crawler resumes from `results/checkpoint.json`. Use `--no-resume` to start reading the CSV from the beginning while appending new receipt rows.
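To see where a resumed run will pick up, the checkpoint can be dumped directly; a minimal sketch, assuming only that `results/checkpoint.json` is an ordinary JSON object (its keys are `concurrent_crawl.py` internals):

```python
import json

# Pretty-print the crawler's resume state. The key layout is an
# implementation detail, so this just dumps whatever is stored.
with open("results/checkpoint.json") as f:
    print(json.dumps(json.load(f), indent=2))
```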
Useful options:
```sh
python3 concurrent_crawl.py --help
```

The crawl writes:

- `results/receipts/receipt-*.ndjson`: one compact receipt per crawled domain
- `results/positives/*.json`: expanded receipts for domains with interesting signals
- `results/evidence/<domain>/`: selected raw evidence for interesting domains
- `results/checkpoint.json`: resume state
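After a run, a quick summary can be computed by streaming these files; a minimal sketch that assumes only the directory layout listed above:

```python
import glob

# Count compact receipts across all shards (one JSON object per line)
# and the expanded receipts written for interesting domains.
receipts = 0
for shard in sorted(glob.glob("results/receipts/receipt-*.ndjson")):
    with open(shard) as f:
        receipts += sum(1 for line in f if line.strip())
positives = glob.glob("results/positives/*.json")
print(f"{receipts} receipts, {len(positives)} positive domains")
```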
`results*/` directories are ignored by git.
After a crawl, build the compact public export:
```sh
python3 export_public.py --results-dir ./results --output-dir ./results/exports/public --clean
```

Create a smaller CSV from prior receipt shards:
```sh
python3 build_rerun_slice.py --results-dir ./results --csv ./top-1m.csv --output ./rerun.csv
python3 concurrent_crawl.py --csv ./rerun.csv --results-dir ./results-rerun
```

Generated `rerun*.csv` files are ignored by git.
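`build_rerun_slice.py` is the supported way to produce a slice; for intuition only, a hand-rolled sketch might look like the following, assuming positives are named `<domain>.json` and the ranked CSV uses Tranco's `rank,domain` rows (both are assumptions, not guaranteed by the tools):

```python
import csv
import glob
import os

# Keep only ranked rows whose domain produced an expanded receipt.
# Assumes results/positives/<domain>.json naming, which may not match
# what build_rerun_slice.py actually keys on.
interesting = {os.path.splitext(os.path.basename(p))[0]
               for p in glob.glob("results/positives/*.json")}
with open("top-1m.csv", newline="") as src, \
        open("rerun.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for rank, domain in csv.reader(src):
        if domain in interesting:
            writer.writerow([rank, domain])
```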