Crawler for probing popular domains for machine-readable, callable, commerce, and payment surfaces.
The crawler reads a ranked domain CSV, writes compact receipt shards for every crawled domain, and writes expanded JSON/evidence only for domains with interesting signals.
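For a sense of what a compact receipt looks like, the sketch below emits one hypothetical NDJSON row; the field names (`domain`, `rank`, `signals`) are illustrative assumptions, not the crawler's actual schema:

```python
import json

# Hypothetical compact receipt for one crawled domain. Real shards live at
# results/receipts/receipt-*.ndjson, one JSON object per line; these field
# names are assumptions for illustration, not concurrent_crawl.py's schema.
receipt = {"domain": "example.com", "rank": 1234, "signals": ["payments"]}
print(json.dumps(receipt))
```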
Requirements:

- Python 3.10 or newer
- `curl` and `unzip` to fetch the Tranco input list
The crawler uses only the Python standard library.
Download the latest standard Tranco list from the permanent URL documented at https://tranco-list.eu/:
```sh
curl -L -o top-1m.csv.zip https://tranco-list.eu/top-1m.csv.zip
unzip -p top-1m.csv.zip top-1m.csv > top-1m.csv
rm top-1m.csv.zip
```

The resulting `top-1m.csv` file is ignored by git.
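If `curl` or `unzip` is unavailable, the same download can be done with the Python standard library alone; a minimal sketch using the URL documented above:

```python
import urllib.request
import zipfile

# Download the zipped Tranco list and extract top-1m.csv into the
# current directory, mirroring the curl/unzip commands above.
URL = "https://tranco-list.eu/top-1m.csv.zip"
urllib.request.urlretrieve(URL, "top-1m.csv.zip")
with zipfile.ZipFile("top-1m.csv.zip") as zf:
    zf.extract("top-1m.csv")
```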
Run a small smoke crawl first:
```sh
python3 concurrent_crawl.py --csv ./top-1m.csv --limit 100 --concurrency 8
```

Run the full crawl:
```sh
python3 concurrent_crawl.py --csv ./top-1m.csv --results-dir ./results --concurrency 24
```

By default, the crawler resumes from `results/checkpoint.json`. Use `--no-resume` to start reading the CSV from the beginning while appending new receipt rows.
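To see where a resumed run will pick up, the checkpoint can be dumped directly; a minimal sketch, assuming only that `results/checkpoint.json` is an ordinary JSON object (its keys are `concurrent_crawl.py` internals):

```python
import json

# Pretty-print the crawler's resume state. The key layout is an
# implementation detail, so this just dumps whatever is stored.
with open("results/checkpoint.json") as f:
    print(json.dumps(json.load(f), indent=2))
```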
Useful options:
```sh
python3 concurrent_crawl.py --help
```

The crawl writes:

- `results/receipts/receipt-*.ndjson`: one compact receipt per crawled domain
- `results/positives/*.json`: expanded receipts for domains with interesting signals
- `results/evidence/<domain>/`: selected raw evidence for interesting domains
- `results/checkpoint.json`: resume state
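After a run, a quick summary can be computed by streaming these files; a minimal sketch that assumes only the directory layout listed above:

```python
import glob

# Count compact receipts across all shards (one JSON object per line)
# and the expanded receipts written for interesting domains.
receipts = 0
for shard in sorted(glob.glob("results/receipts/receipt-*.ndjson")):
    with open(shard) as f:
        receipts += sum(1 for line in f if line.strip())
positives = glob.glob("results/positives/*.json")
print(f"{receipts} receipts, {len(positives)} positive domains")
```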
`results*/` directories are ignored by git.
After a crawl, build the compact public export:
```sh
python3 export_public.py --results-dir ./results --output-dir ./results/exports/public --clean
```

Create a smaller CSV from prior receipt shards:
```sh
python3 build_rerun_slice.py --results-dir ./results --csv ./top-1m.csv --output ./rerun.csv
python3 concurrent_crawl.py --csv ./rerun.csv --results-dir ./results-rerun
```

Generated `rerun*.csv` files are ignored by git.
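`build_rerun_slice.py` is the supported way to produce a slice; for intuition only, a hand-rolled sketch might look like the following, assuming positives are named `<domain>.json` and the ranked CSV uses Tranco's `rank,domain` rows (both are assumptions, not guaranteed by the tools):

```python
import csv
import glob
import os

# Keep only ranked rows whose domain produced an expanded receipt.
# Assumes results/positives/<domain>.json naming, which may not match
# what build_rerun_slice.py actually keys on.
interesting = {os.path.splitext(os.path.basename(p))[0]
               for p in glob.glob("results/positives/*.json")}
with open("top-1m.csv", newline="") as src, \
        open("rerun.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for rank, domain in csv.reader(src):
        if domain in interesting:
            writer.writerow([rank, domain])
```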