This project is now optimized for larger batch runs with Playwright.
It also now includes a local platform starter for the next build phase:
- `docker-compose.yml` for self-hosted infra
- `sql/001_init.sql` for canonical crawler and entity tables
- `docs/architecture.md` for the build-first system blueprint
- `scripts/bootstrap-dev.ps1` for fast local startup
It supports:
- concurrent workers
- incremental JSONL output
- checkpoint metadata
- resume support
- retries
- page metadata fetching for every result
- optional proxy rotation
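The checkpoint-and-resume behavior above can be sketched in a few lines: a minimal version that scans the incremental JSONL output for already-completed query IDs and returns only the pending work. The `id` field name and the enumerate-based query IDs are assumptions for illustration; the real crawler's record shape may differ.

```python
import json
from pathlib import Path

def completed_ids(jsonl_path):
    """Collect query IDs already written to the incremental JSONL output.

    Assumes each line is a JSON object with an "id" field (an assumption;
    the real record shape may differ). Partial trailing lines from an
    interrupted run are ignored, which is what makes resume safe.
    """
    done = set()
    path = Path(jsonl_path)
    if not path.exists():
        return done
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            done.add(json.loads(line)["id"])
        except (json.JSONDecodeError, KeyError):
            continue  # ignore a half-written final record
    return done

def pending_queries(queries, jsonl_path):
    """Return (id, query) pairs not yet present in the output file."""
    done = completed_ids(jsonl_path)
    return [(i, q) for i, q in enumerate(queries) if i not in done]
```

Because completed work is recovered from the output file itself, a crashed run can be restarted with no separate bookkeeping beyond the checkpoint metadata.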
Start the local infra stack:
```
.\scripts\bootstrap-dev.ps1
```

Or manually:

```
docker compose up -d postgres redis kafka minio minio-init
```

Optional analytics and search services:

```
docker compose --profile analytics up -d clickhouse
docker compose --profile search up -d opensearch
```

Core local endpoints:
- Postgres: `localhost:5432`
- Redis: `localhost:6379`
- Kafka: `localhost:9092`
- MinIO API: `localhost:9000`
- MinIO Console: `localhost:9001`
The initial relational schema is in `sql/001_init.sql`. The platform blueprint is in `docs/architecture.md`.
The repo now includes the first production-shaped loop for:
- frontier seeding
- scheduler leasing
- HTML fetch + raw artifact storage in MinIO
- parser extraction
- observation write + company resolution in Postgres
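The scheduler-leasing step can be illustrated with a small in-memory stand-in for the `urls` frontier table. The real loop leases rows from Postgres; the column names, status values, and lease window below are assumptions chosen to mirror the description, not the repo's actual schema.

```python
import time

def lease_batch(frontier, batch_size, lease_seconds=300, now=None):
    """Lease up to batch_size ready URLs from an in-memory frontier.

    `frontier` is a list of dicts with "url", "status", and "lease_until"
    keys -- a simplified stand-in for the Postgres `urls` table. A row is
    ready when it is still pending and any previous lease has expired, so
    a crashed worker's URLs become leasable again automatically.
    """
    now = time.time() if now is None else now
    leased = []
    for row in frontier:
        if len(leased) >= batch_size:
            break
        if row["status"] == "pending" and row["lease_until"] <= now:
            row["lease_until"] = now + lease_seconds  # claim the row
            leased.append(row["url"])
    return leased
```

In Postgres the same claim step would typically be done atomically in one query (e.g. with `FOR UPDATE SKIP LOCKED`) so concurrent workers never lease the same row.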
New package: `crawler_platform/`

New scripts:

```
python scripts/seed_urls.py --input seeds.txt
python scripts/run_scheduler.py --batch-size 10
```
Expected environment values are listed in `.env.example`.
Typical local flow:
```
.\scripts\bootstrap-dev.ps1
python -m pip install -r requirements.txt
python scripts/seed_urls.py --input seeds.txt
python scripts/run_scheduler.py --batch-size 10
```

What this loop does:
- inserts seed URLs into the `urls` frontier table
- leases ready URLs from Postgres
- fetches HTML over HTTP
- stores raw HTML in MinIO
- extracts links and a starter company observation
- writes observations and resolves into `companies`
The real large-scale path in this repo is now:
- `services/frontier/main.py`
- `services/scheduler/main.py`
- `workers/http_fetch/main.py`
- `workers/browser_fetch/main.py`
- `services/parser_resolver/main.py`
- `distributed_crawler/shared/`
This is the intended direction for the 1 crore (10 million) company architecture:
- frontier owns URL state, leasing, and priority
- scheduler owns host/domain budgets and render-path decisions
- HTTP workers handle the high-throughput majority path
- browser workers are only for promoted URLs
- parser/resolver maximizes data extraction and writes canonical company records
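The scheduler's render-path decision described above can be sketched as a small pure function. The stats shape, threshold, and promotion rule here are illustrative assumptions, not the repo's exact logic:

```python
def choose_render_path(host, host_stats, block_threshold=0.2):
    """Decide whether a host's URLs go to the HTTP or browser queue.

    `host_stats` maps host -> {"fetches": int, "blocks": int} (an assumed
    shape). Unknown hosts take the cheap HTTP majority path first; hosts
    whose observed block rate exceeds the threshold are promoted to the
    more expensive browser workers.
    """
    stats = host_stats.get(host)
    if not stats or stats["fetches"] == 0:
        return "http"
    block_rate = stats["blocks"] / stats["fetches"]
    return "browser" if block_rate > block_threshold else "http"
```

Keeping the decision per-host rather than per-URL is what lets the HTTP workers stay the high-throughput path while only genuinely risky hosts pay the browser cost.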
Starter commands:
```
python services/frontier/main.py --input seeds.txt
python services/scheduler/main.py --batch-size 100 --render-mode http
python workers/http_fetch/main.py --batch-size 100 --concurrency 20
python services/parser_resolver/main.py --mode parse
python services/parser_resolver/main.py --mode resolve
python workers/browser_fetch/main.py
```

Important:
- `crawler.py` remains as a legacy utility crawler
- the new `distributed_crawler/` + `services/` + `workers/` layout is the architecture path to continue building
- set `KAFKA_ENABLED=true` to split fetch, parse, and resolve into separate stages
- scheduler now applies host-aware selection before leasing and promotes risky hosts to the browser queue
- parser now prioritizes important company pages and extracts richer fields such as addresses, socials, contact links, and page types
- resolver now prefers stronger same-domain evidence from `contact`, `about`, and `leadership` pages over generic pages when selecting company fields
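The resolver's evidence preference can be illustrated with a simple scoring sketch. The weights, page-type labels, and candidate tuple shape are assumptions chosen to mirror the rule above, not the actual resolver internals:

```python
# Assumed page-type weights: contact/about/leadership outrank generic pages.
PAGE_TYPE_WEIGHT = {"contact": 3, "about": 2, "leadership": 2, "generic": 1}

def pick_field_value(candidates, company_domain):
    """Pick the strongest candidate value for a singleton company field.

    Each candidate is (value, source_domain, page_type). Same-domain
    evidence gets a bonus, so a contact page on the company's own domain
    beats an off-domain directory listing even when both supply a value.
    """
    def score(cand):
        _value, domain, page_type = cand
        same_domain = 2 if domain == company_domain else 0
        return same_domain + PAGE_TYPE_WEIGHT.get(page_type, 1)

    if not candidates:
        return None
    return max(candidates, key=score)[0]
```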
You can read input company names from MySQL and save crawler output JSON back into MySQL with:
```
.\.venv\Scripts\python.exe mysql_pipeline.py `
  --host 192.168.1.133 `
  --port 3306 `
  --user crawler `
  --password "YOUR_PASSWORD" `
  --database master `
  --input-table filtered_bharatfleet `
  --output-table bharatfleet_data `
  --engine google `
  --fallback-engine duckduckgo
```

What it does:
- reads `company_name` from `master.filtered_bharatfleet`
- skips names already present in `master.bharatfleet_data` when `--resume` is enabled
- crawls each company name using the current Playwright crawler flow
- inserts `(company_name, json)` into `master.bharatfleet_data`
Recommended guarded batch for search-driven crawling:
```
.\.venv\Scripts\python.exe mysql_pipeline.py `
  --host 192.168.1.133 `
  --port 3306 `
  --user crawler `
  --password "YOUR_PASSWORD" `
  --database master `
  --input-table filtered_bharatfleet `
  --output-table bharatfleet_data `
  --engine google `
  --fallback-engine duckduckgo `
  --limit 20 `
  --min-delay-seconds 4 `
  --max-delay-seconds 9 `
  --batch-size 5 `
  --batch-cooldown-seconds 45 `
  --block-cooldown-seconds 180 `
  --session-max-queries 4 `
  --stop-block-rate 0.20
```

These controls reduce block risk by rotating search engines, refreshing browser sessions, adding jitter between queries, cooling down between batches, and stopping early if challenge/block signals rise too much.
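Two of those controls, the jittered inter-query delay and the early-stop block rate, can be sketched as small helpers. The `min_sample` guard is an illustrative assumption, not a documented flag of `mysql_pipeline.py`:

```python
import random

def jittered_delay(min_s, max_s, rng=random):
    """Uniform jitter between queries, mirroring --min/--max-delay-seconds.

    Randomizing the gap avoids the fixed-interval traffic pattern that
    search engines flag most easily.
    """
    return rng.uniform(min_s, max_s)

def should_stop(total_queries, blocked_queries,
                stop_block_rate=0.20, min_sample=5):
    """Stop early once the observed block/challenge rate exceeds the cap.

    min_sample keeps one unlucky early block from aborting a whole run
    (an assumed guard, not part of the documented CLI).
    """
    if total_queries < min_sample:
        return False
    return blocked_queries / total_queries > stop_block_rate
```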
```
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m playwright install chromium
```

Put one search query per line in `queries.txt`.
Example:
```
L&T
Ford
OpenAI API
```
```
python crawler.py `
  --input queries.txt `
  --output results.jsonl `
  --checkpoint crawler.checkpoint.json `
  --summary-output results.json `
  --concurrency 3 `
  --max-retries 3 `
  --delay 1
```

`results.jsonl`
- primary scalable output
- one JSON record per completed query
- safe for long-running jobs and resume
`crawler.checkpoint.json`
- progress metadata
- latest counters and last completed query ID
`results.json`
- optional aggregated JSON built from `results.jsonl`
- useful for inspection after the run
- `--concurrency 5`: run more workers in parallel. Increase carefully because search engines may rate-limit aggressive traffic.
- `--resume` / `--no-resume`: resume is on by default. The crawler skips any query IDs already written to `results.jsonl`.
- `--proxy-file proxies.txt`: rotate proxies across workers.
- `--engine google`: choose the main engine.
- `--fallback-engine duckduckgo`: use a fallback engine if the primary engine fails.
- `--channel msedge`: prefer a local browser channel like Edge or Chrome before falling back to bundled Chromium.
You can use either format:
```
http://host1:port
http://user:password@host2:port
```

or:

```
http://host1:port,username,password
http://host2:port,,
```
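Both formats can be normalized with a small parser sketch like the following; how the actual crawler reads `proxies.txt` may differ, and empty comma-separated fields are treated as "no credentials":

```python
from urllib.parse import urlsplit, urlunsplit

def parse_proxy_line(line):
    """Normalize one proxies.txt line to (server, username, password).

    Supports both documented formats:
      http://user:password@host:port          (credentials in the URL)
      http://host:port,username,password      (comma-separated, may be empty)
    """
    line = line.strip()
    if "," in line:
        server, username, password = (p.strip() for p in line.split(",", 2))
        return server, username or None, password or None
    parts = urlsplit(line)
    if parts.username:
        # strip credentials out of the URL, return them separately
        server = urlunsplit(
            (parts.scheme, f"{parts.hostname}:{parts.port}", "", "", ""))
        return server, parts.username, parts.password
    return line, None, None
```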
For very large runs:
- use JSONL output, not a single giant JSON object
- keep concurrency modest at first, such as `2` to `5`
- expect Google blocking if you push too fast from one IP
- add proxies only if you have permission to use them
Because titles are always fetched now, large runs will be slower than URL-only crawling. If throughput becomes the bottleneck, the next optimization would be switching title collection to a separate post-processing pass rather than disabling it.
Each result object now includes:
- `url`
- `title`
- `metadata.final_url`
- `metadata.domain`
- `metadata.description`
- `metadata.keywords`
- `metadata.canonical_url`
- `metadata.fetch_duration_seconds`
- `metadata.business.company_name`
- `metadata.business.headquarters_or_location_mentions`
- `metadata.business.emails`
- `metadata.business.phones`
- `metadata.business.social_profiles`
- `metadata.business.brand_domain_mapping`
- `metadata.business.classification`
The crawler also prints terminal logs for every metadata fetch, including the URL and time taken.
Each query record also now includes a smart aggregate:
- `company_profile.company_name`
- `company_profile.official_website`
- `company_profile.primary_domain`
- `company_profile.all_domains`
- `company_profile.headquarters_or_location_mentions`
- `company_profile.emails`
- `company_profile.phones`
- `company_profile.social_profiles`
- `company_profile.source_urls`
- `company_profile.resolved_singletons`
The company profile keeps collecting all multi-value fields across URLs, while singleton fields stop being re-searched once a strong value is found.
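That accumulate-versus-lock behavior can be sketched as a merge function. The split between multi-value and singleton fields below, and the "strong evidence" flag, are assumptions based on the documented profile keys, not the crawler's actual aggregation code:

```python
# Assumed field split, derived from the documented company_profile keys.
MULTI_FIELDS = {"emails", "phones", "social_profiles",
                "all_domains", "source_urls"}
SINGLETON_FIELDS = {"company_name", "official_website", "primary_domain"}

def merge_observation(profile, observation, strong=False):
    """Fold one per-URL observation into the running company_profile.

    Multi-value fields accumulate (deduplicated) across URLs; singleton
    fields are written and then locked once backed by strong evidence,
    so later pages stop overwriting them.
    """
    resolved = profile.setdefault("resolved_singletons", set())
    for field, value in observation.items():
        if value in (None, "", []):
            continue
        if field in MULTI_FIELDS:
            seen = profile.setdefault(field, [])
            for item in value if isinstance(value, list) else [value]:
                if item not in seen:
                    seen.append(item)
        elif field in SINGLETON_FIELDS and field not in resolved:
            profile[field] = value
            if strong:
                resolved.add(field)  # lock: stop re-searching this field
    return profile
```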