A self-hosted web scraper that goes through walls.
Stealth, resilience, and infrastructure in one tool — without a single external API key.
Quick start · Demo UI · Why this exists · Comparison · CLI · Architecture
A single Python package + CLI that handles every layer of modern web scraping:
- Realistic Chrome HTTP stack: headers, client hints, brotli, HTTP/2
- TLS fingerprint impersonation: JA3/JA4 indistinguishable from real Chrome (via curl_cffi)
- Stealth headless browser: Playwright with init scripts that defeat every check on bot.sannysoft.com (31/31)
- Behavioural noise: mouse moves, scroll, dwell
- Per-host rate limiting + exponential backoff with jitter
- Persistent cookie jar: cf_clearance survives between runs
- Sitemap-driven discovery: robots.txt → sitemap.xml → recurse into indexes
- Wayback-style local mirror: saves HTML + same-origin assets, rewrites URLs to local paths
- Markdown export: one clean .md per page, ready to read or feed an LLM
- Self-hosted IP rotation: Tor SOCKS5 profile in docker-compose.yml
- Distributed mode: Redis-backed job queue with DLQ, AOF persistence, N parallel workers, Prometheus /metrics
No paid captcha solvers. No external proxy SaaS. No "request credits." Runs on your hardware, your bandwidth, your terms.
The scraping ecosystem is fragmented. To get a working production stack today, you typically wire together:
requests/httpx ─► basic HTTP
+ playwright ─► JS rendering
+ playwright-stealth ─► defeat headless detection
+ curl_cffi ─► TLS impersonation
+ scrapy + scrapyd ─► queue, scheduling
+ wget ─► local snapshots
+ 2captcha API ─► captchas ($)
+ Bright Data API ─► residential IPs ($)
+ ScrapingBee / ZenRows ─► managed scraping ($)
That's 6+ libraries and 3+ paid services to glue together. scraper ships all of the free parts as one coherent CLI and library, and gives you a clear seam to plug in proxies/captchas only when you actually need them.
Requirements: Docker with Compose.
git clone https://github.com/KyleYYC/scraper && cd scraper
make up
That command:
- creates .env with a random SCRAPER_API_KEY
- builds the Docker image
- starts the private API on the first free port from 8080..8100
- waits for /healthz
- runs an authenticated end-to-end smoke test
- verifies bearer auth, private-target blocking, scraping, results, and artifacts
- prints the exact commands to call it from another service
Target setup time is about 60 seconds on a warm Docker cache. The first run can take longer while Docker pulls the Playwright base image and Python wheels; subsequent runs should stay inside the 60-second path.
After it prints "scraper API is running and verified", load the local
credentials without echoing the secret:
set -a; . ./.env; set +a
export SCRAPER_API_URL="http://127.0.0.1:${SCRAPER_HOST_PORT:-8080}"
Create a scrape job:
curl -sS -X POST "$SCRAPER_API_URL/v1/jobs" \
-H "Authorization: Bearer $SCRAPER_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com","depth":1,"max_pages":1,"markdown":true,"mirror":true}'
Use the returned id in place of {job_id}:
curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}" \
-H "Authorization: Bearer $SCRAPER_API_KEY"
curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}/result" \
-H "Authorization: Bearer $SCRAPER_API_KEY"
Run the same verification again any time:
make verify
Drop-in service clients are included:
python examples/python_client.py https://example.com
node examples/node_client.mjs https://example.com
Run a local web UI demo:
make demo-ui
The UI server keeps SCRAPER_API_KEY private and proxies browser requests to
the scraper API.
Stop it:
make down
git clone https://github.com/KyleYYC/scraper && cd scraper
make docker
docker run --rm -p 127.0.0.1:8080:8080 \
-e SCRAPER_API_KEY="$(openssl rand -hex 32)" \
-v scraper-results:/app/results \
scraper:latest
The container starts the private REST API by default. Override the command to
run the CLI directly: docker run --rm scraper:latest python -m scraper https://example.com.
git clone https://github.com/KyleYYC/scraper && cd scraper
make install # creates .venv, installs deps, fetches Chromium
python -m scraper example.com
Output lands at results/{host}.json.
Run a complete browser demo with one command:
make demo-ui
The runner starts the private scraper API if needed, then starts a local web UI
on the first free port from 5173..5189. The browser talks only to the demo
server; the demo server holds SCRAPER_API_KEY and proxies calls to the scraper
API.
The UI can:
- create generic scrape jobs
- choose depth, max pages, sitemap, browser, Markdown, and mirror options
- poll job status
- display the JSON result
- link generated artifacts such as result.json, markdown/index.md, and mirror/index.html
Recent warm-cache timing from a clean copy:
Elapsed to ready API: 7s
created_ms: 9
job_status: succeeded
result_title: Example Domain
demo_backend_elapsed: 1s
# Crawl 3 levels, same-origin, capped at 30 pages
python -m scraper example.com --depth 3 --max-pages 30
# Auto-discover via robots.txt + sitemap.xml
python -m scraper example.com --sitemap
# Wayback-style local snapshot (open results/snap/index.html in a browser)
python -m scraper example.com --sitemap --browser --mirror results/snap
# Polite — proactively rate-limit, persist cookies, honour robots.txt
python -m scraper example.com \
--rps 2 \
--cookies-dir ~/.scraper/cookies \
--respect-robots
# Maximum stealth — TLS impersonation + browser stealth + behavioural noise
python -m scraper https://protected.example \
--tls --escalate --humanize --browser

| Cost driver | What others do | What scraper does |
|---|---|---|
| Captcha solving | $1–5 per 1,000 captchas (2captcha/CapSolver) | Skipped — captcha-protected pages flagged, not solved |
| Residential proxies | $5–15 per GB (Bright Data, IPRoyal) | Built-in ProxyPool — bring your own, or use the free Tor compose profile |
| Managed scraping APIs | $0.50–5 per 1,000 requests (ScrapingBee, ZenRows) | None — your machine, your bandwidth |
| Cloud compute | Hosted browser farms | Single Docker image (~1.2 GB), runs on a $5 VPS |
| Per-request fees | Always | Zero |
Operational efficiency:
- HTTP/2 + brotli + keep-alive — fewer bytes, fewer round-trips
- Per-host token bucket — proactive pacing avoids 429s in the first place
- Persistent cookie jar — one cf_clearance solve per ~24 h, not per session
- Sitemap discovery — skip the BFS overhead when the site already lists URLs
- Same-origin filter — never accidentally mirror a CDN
- Redis AOF persistence — workers crash, queue survives, no work lost
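The per-host token bucket in the list above can be illustrated with a minimal sketch. This is not the project's rate_limit.py; the class name `HostTokenBucket`, its parameters, and the injectable `clock` are hypothetical and exist only for illustration. The idea is that each host accrues tokens at `rps` per second, and a request that finds the bucket empty is told how long to wait instead of being sent and risking a 429.

```python
import time
from collections import defaultdict


class HostTokenBucket:
    """Hypothetical per-host token bucket (illustrative, not the project's code)."""

    def __init__(self, rps: float, burst: int = 1, clock=time.monotonic):
        self.rps = rps
        self.burst = burst
        self.clock = clock
        # host -> (tokens available, timestamp of the last refill)
        self._state = defaultdict(lambda: (float(burst), clock()))

    def delay_for(self, host: str) -> float:
        """Reserve one token for `host`; return the seconds to wait before sending."""
        tokens, last = self._state[host]
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rps)
        tokens -= 1.0  # may go negative: the debt is the required wait
        self._state[host] = (tokens, now)
        return max(0.0, -tokens / self.rps)
```

A caller would sleep for the returned delay before each request. Because state is keyed by host, a slow target does not throttle requests to other hosts.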
Head-to-head against the same target, vanilla baselines vs scraper:
[scraper] tls.peet.ws (TLS fingerprint)
baseline: t13d3513h1... ← Python stdlib stack
scraper : t13d1516h2... ← byte-identical to real Chrome 124
[scraper] httpbin.org/headers (server-perceived)
baseline: User-Agent: python-httpx/0.28.1
scraper : User-Agent: Mozilla/5.0 (X11; Linux) Chrome/145
Sec-Ch-Ua: "Chromium";v="145", "Not-A.Brand";v="99"
[scraper] bot.sannysoft.com (31-test detection battery)
baseline: 18 pass / 12 fail
scraper : 31 pass / 0 fail
[scraper] arh.antoinevastel.com/areyouheadless
baseline: "You are Chrome headless"
scraper : "You are not Chrome headless"
[scraper] httpbin.org/status/429 (rate-limit reaction)
baseline: 1 attempt, surfaces 429 in 0.3 s
scraper : 6 attempts with exp. backoff (transient 429s clear)
Run it yourself:
make compare

| Capability | requests / httpx | scrapy | raw playwright | scraper |
|---|---|---|---|---|
| Plain HTTP (sync/async) | ✓ | ✓ | overkill | ✓ |
| Realistic Chrome headers + brotli | manual | manual | n/a | default |
| TLS impersonation (JA3/JA4) | ✗ | ✗ | implicit | --tls |
| Stealth init (defeat webdriver checks) | n/a | n/a | ✗ | default |
| Pass bot.sannysoft.com | n/a | n/a | 18/30 | 31/31 |
| JS-rendered DOM | ✗ | needs Splash | ✓ | ✓ |
| Behavioural mimicry (mouse/scroll) | n/a | n/a | manual | --humanize |
| Per-host rate limit (proactive) | ✗ | DOWNLOAD_DELAY | n/a | --rps |
| Backoff on 429/5xx | manual | partial | n/a | default |
| Persistent cookies (cross-run) | per-session | partial | per-session | --cookies-dir |
| robots.txt | urllib.robotparser | built-in | n/a | --respect-robots |
| Sitemap discovery | manual | needs plugin | n/a | --sitemap |
| Wayback-style local mirror | wget --mirror (no JS) | ✗ | ✗ | --mirror |
| Markdown export | ✗ | ✗ | ✗ | --markdown-dir |
| Distributed queue | ✗ | scrapyd ($) | ✗ | Redis (built-in) |
| Worker fleet + DLQ | ✗ | ✗ | ✗ | built-in |
| JSON logs + Prometheus metrics | ✗ | ✗ | ✗ | built-in |
| Self-hosted IP rotation | external proxies | external proxies | external proxies | Tor compose profile |
| Auto-escalate HTTP→TLS→Browser | ✗ | ✗ | n/a | --escalate |
| Zero API keys / fully self-hosted | ✓ | ✓ | ✓ | ✓ |
| One tool that does all of the above | ✗ | ✗ | ✗ | ✓ |
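The "Backoff on 429/5xx" row can be sketched as follows. This is an illustrative take on exponential backoff with full jitter, not the code in http_session.py; the function names, the retryable status set, and the base/cap values are assumptions. With `attempts=6` it makes one initial request plus up to five retries, which matches the six attempts shown in the head-to-head output.

```python
import random
import time

# Assumed set of retryable statuses; the project's actual list may differ.
RETRYABLE = (429, 500, 502, 503, 504)


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0, rng=random.random):
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(retries):
        ceiling = min(cap, base * (2 ** n))
        yield rng() * ceiling


def fetch_with_retry(send, attempts: int = 6, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    status = send()
    for delay in backoff_delays(attempts - 1):
        if status not in RETRYABLE:
            break
        sleep(delay)
        status = send()
    return status
```

Jitter matters when several workers hit the same host: without it they would all retry at the same instants and trigger the rate limit again.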
Requires Python 3.11+ and the few system libraries that Playwright's Chromium needs.
git clone https://github.com/KyleYYC/scraper && cd scraper
make install
make install creates a .venv, installs the package and all dependencies,
and fetches Chromium. On Linux the Playwright postinstall handles system libs;
on macOS / Windows use pip install -e '.[dev]' && playwright install --with-deps chromium.
make docker # builds the image
make docker-api # runs the private API
make docker-mock # runs the bundled mock target on :8765
Image base: mcr.microsoft.com/playwright/python:v1.58.0-noble, with Chromium and its system libs preinstalled. Runs as non-root (pwuser) with a /healthz HEALTHCHECK, and removes pip from the final runtime image after package installation.
The API is meant to stay small and universal: bearer auth, async jobs, JSON results, and optional artifacts. For local or staging use, prefer:
make up
make up creates .env, starts Docker, and runs make verify against the
authenticated REST contract. With a warm Docker cache, this is the intended
2-command, roughly 60-second path after cloning.
The Compose stack publishes local ports on 127.0.0.1 by default, so the API
is not exposed to the LAN during local development.
Manual Docker:
export SCRAPER_API_KEY="$(openssl rand -hex 32)"
docker run --rm -p 127.0.0.1:8080:8080 \
-e SCRAPER_API_KEY="$SCRAPER_API_KEY" \
-v scraper-results:/app/results \
scraper:latest
Create a scrape job:
set -a; . ./.env; set +a
export SCRAPER_API_URL="http://127.0.0.1:${SCRAPER_HOST_PORT:-8080}"
curl -sS -X POST "$SCRAPER_API_URL/v1/jobs" \
-H "Authorization: Bearer $SCRAPER_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com","depth":1,"max_pages":1,"markdown":true,"mirror":true}'
Poll and fetch the result:
curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}" \
-H "Authorization: Bearer $SCRAPER_API_KEY"
curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}/result" \
-H "Authorization: Bearer $SCRAPER_API_KEY"
For services, copy one of the dependency-free examples:
python examples/python_client.py https://example.com
node examples/node_client.mjs https://example.com
Endpoints:

| Method | Path | Notes |
|---|---|---|
| GET | /healthz | Public liveness check |
| GET | /v1/status | Authenticated queue/API status |
| POST | /v1/jobs | Create async scrape job |
| GET | /v1/jobs | List recent jobs; add ?include_artifacts=true only when needed |
| GET | /v1/jobs/{id} | Inspect one job |
| GET | /v1/jobs/{id}/result | Return JSON result |
| GET | /v1/jobs/{id}/artifacts/{path} | Download generated artifact |
| POST | /v1/jobs/{id}/retry | Requeue same request |
| DELETE | /v1/jobs/{id} | Cancel queued job |
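The create, poll, and fetch flow over these endpoints can be sketched in standard-library Python. This is not the bundled examples/python_client.py. The response field names `id` and `status`, and the terminal states `failed` and `cancelled`, are assumptions (only `succeeded` and a returned id appear above), so check them against a real response before relying on this.

```python
import json
import time
import urllib.request


def api_call(base_url, api_key, method, path, payload=None):
    """Send one bearer-authenticated JSON request and return the decoded body."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method=method,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def scrape(url, call, poll_seconds=1.0, max_polls=120, sleep=time.sleep):
    """Create a job, poll it until it finishes, then return the JSON result."""
    job = call("POST", "/v1/jobs", {"url": url, "depth": 1, "max_pages": 1})
    job_id = job["id"]  # assumed field name
    for _ in range(max_polls):
        status = call("GET", f"/v1/jobs/{job_id}")["status"]  # assumed field name
        if status == "succeeded":
            return call("GET", f"/v1/jobs/{job_id}/result")
        if status in ("failed", "cancelled"):  # assumed terminal states
            raise RuntimeError(f"job {job_id} ended as {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

To use it against a running instance, bind the credentials first, for example `call = functools.partial(api_call, os.environ["SCRAPER_API_URL"], os.environ["SCRAPER_API_KEY"])`, then `scrape("https://example.com", call)`. Passing `call` in keeps the polling logic testable without a network.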
Core API environment:
| Var | Default | Notes |
|---|---|---|
| SCRAPER_API_KEY | required | Bearer token, at least 32 chars |
| PORT | 8080 | Railway and most hosts inject this |
| RESULTS_DIR | /app/results | Job metadata, JSON, markdown, mirrors |
| COOKIES_DIR | /app/.cookies | Persistent cookie jar |
| SCRAPER_EMBEDDED_WORKERS | 1 | Single-container background workers |
| SCRAPER_PUBLIC_ONLY | true | Blocks localhost/private/metadata targets |
| SCRAPER_MAX_PAGES | 200 | API request cap |
| SCRAPER_MAX_DEPTH | 5 | API request cap |
| SCRAPER_MAX_RPS | 5 | API request cap |
| SCRAPER_RETENTION_JOBS | 1000 | Keep at most this many terminal local jobs |
| SCRAPER_RETENTION_SECONDS | 604800 | Delete terminal local jobs older than this |
| SCRAPER_RETENTION_BYTES | 0 | Optional local results byte cap; 0 disables |
| SCRAPER_ENABLE_DOCS | false | Enables FastAPI docs when true |
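To make the SCRAPER_PUBLIC_ONLY row concrete, here is a rough sketch of how a public-only check could work. It is an assumption about the approach, not the project's implementation: the function names are hypothetical, and it relies on the standard library's `ipaddress.is_global` to reject loopback, private (RFC 1918), and link-local ranges, which include the 169.254.169.254 cloud metadata address.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(text: str) -> bool:
    """True only for globally routable addresses."""
    return ipaddress.ip_address(text).is_global


def is_public_target(url: str, resolve=None) -> bool:
    """Allow http(s) URLs whose host resolves only to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    if resolve is None:
        # Default resolver: every address the system returns for the host.
        resolve = lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)}
    try:
        addresses = resolve(parts.hostname)
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

A check like this only covers the moment of validation. A robust version would also pin the connection to the validated address, since a hostname can resolve differently between the check and the fetch.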
Railway deployment:
The repo includes railway.json so Railway uses the root Dockerfile,
starts scraper-api, restarts on failure, and checks /healthz.
Two-command CLI deploy, after railway login:
railway init
make railway-vars && railway up -d
To print the template variables without changing a Railway project:
RAILWAY_DRY_RUN=1 make railway-vars
For a Railway template, configure one web service from this GitHub repo:
| Setting | Value |
|---|---|
| Builder | Dockerfile |
| Start command | scraper-api |
| Healthcheck path | /healthz |
| Public networking | HTTP |
| Optional volume | mount at /app/results |
Template variables:
| Var | Template value |
|---|---|
| SCRAPER_API_KEY | ${{ secret(64) }} |
| SCRAPER_EMBEDDED_WORKERS | 1 |
| SCRAPER_PUBLIC_ONLY | true |
| SCRAPER_MAX_PAGES | 200 |
| SCRAPER_MAX_DEPTH | 5 |
| SCRAPER_MAX_RPS | 5 |
| SCRAPER_RETENTION_JOBS | 1000 |
| SCRAPER_RETENTION_SECONDS | 604800 |
| SCRAPER_RETENTION_BYTES | 0 |
| SCRAPER_ENABLE_DOCS | false |
| RESULTS_DIR | /app/results |
| COOKIES_DIR | /app/.cookies |
| LOG_FORMAT | json |
| LOG_LEVEL | INFO |
Railway injects PORT; the API already listens on it. Use the generated
Railway domain as SCRAPER_API_URL in client services.
After publishing the template, copy Railway's generated deploy-button URL into this README. Do not ship a placeholder button URL because it sends users to a dead template.
make up # API on first free port, embedded worker enabled
make stack-results # show what landed in the named volume
make stack-down
For Redis-backed workers, run the API with SCRAPER_EMBEDDED_WORKERS=0, set
REDIS_URL, and start scraper-worker containers with the same results volume.
Zero API keys. Just a local SOCKS5 proxy.
docker compose --profile tor up -d tor
PROXY_POOL=socks5h://localhost:9050 python -m scraper https://target.example
scraper URL [--level N] [-o output] [options]

| Flag | Default | Notes |
|---|---|---|
| URL | (required) | URL or domain (https:// added if missing) |
| --level | (none) | Workshop level 1..7; omit for generic mode |
| --pages | 9 | Listing pages per session (workshop mode) |
| -o, --output | results/{host}.{ext} | .csv or .json |
| --public-key | auto | RSA public key for level-6 preflight |
| --no-headless | off | Run the browser headed (debugging) |
| --tls | off | curl_cffi Chrome TLS impersonation |
| --humanize | off | Mouse / scroll / dwell noise (browser mode) |
| --browser | off | Force browser engine in generic mode |
| --escalate | off | HTTP → TLS → Browser, advances on block detection |
| --depth | 1 | BFS crawl depth |
| --max-pages | 50 | Hard cap on pages crawled |
| --sitemap | off | URL discovery via robots.txt + sitemap.xml |
| --markdown-dir | (none) | Also write one .md per page |
| --mirror | (none) | Save HTML + assets locally (Wayback-style) |
| --rps | 0 | Per-host requests-per-second cap (0 = unlimited) |
| --cookies-dir | (none) | Persist cookies between runs |
| --respect-robots | off | Honour robots.txt Disallow rules |
| --metrics-port | 0 | Prometheus /metrics (0 = disabled) |
| --log-format | text | text or json |
| --log-level | INFO | |
Environment variables:
| Var | Notes |
|---|---|
| PROXY_POOL | Comma-separated proxy URIs (http://u:p@h:1,...) |
| PROXY_POOL_FILE | Path to a file with one URI per line |
| REDIS_URL | Redis connection (default redis://localhost:6379/0) |
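As an illustration of how the two proxy variables might be consumed, the sketch below reads both and picks a proxy per worker. It is not the project's proxy_pool.py. Merging the two sources, skipping `#` comment lines, and hashing the worker id with CRC32 are all assumptions made for the example; the function names are hypothetical.

```python
import os
import zlib
from pathlib import Path


def load_proxies(env=os.environ):
    """Collect URIs from PROXY_POOL (comma-separated) and PROXY_POOL_FILE (one per line)."""
    uris = [u.strip() for u in env.get("PROXY_POOL", "").split(",")]
    path = env.get("PROXY_POOL_FILE")
    if path:
        uris += [line.strip() for line in Path(path).read_text().splitlines()]
    # Drop blanks and comment lines (comment support is an assumption).
    return [u for u in uris if u and not u.startswith("#")]


def proxy_for_worker(worker_id: str, proxies):
    """Sticky choice: the same worker id always maps to the same proxy."""
    if not proxies:
        return None
    return proxies[zlib.crc32(worker_id.encode()) % len(proxies)]
```

A stable hash is used rather than Python's built-in `hash()`, which is randomised per process and would break stickiness across restarts.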
| Mode | Flag | What it does |
|---|---|---|
| Single page | (default) | Fetch one URL, return JSON |
| BFS crawl | --depth N | Same-origin, capped at --max-pages |
| Sitemap-driven | --sitemap | robots.txt → sitemap.xml → fetch each declared URL |
| Markdown export | --markdown-dir | One .md per page, front-matter + clean prose |
| Local mirror | --mirror DIR | HTML + same-origin assets, URLs rewritten to local paths |
| Workshop | --level 1..7 | Per-tier hotel scrape (CSV output); solves the scraping-workshop challenges |
| Auto-escalate | --escalate | HTTP → TLS → Browser, advancing on block detection |
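The auto-escalate mode can be pictured as a loop over engines ordered from cheapest to most expensive. The sketch below is a guess at the control flow, not the project's code: the block statuses, the body markers, and the names `looks_blocked` and `fetch_escalating` are all hypothetical, and each engine is represented by a plain function returning `(status, body)`.

```python
from typing import Callable, Iterable, Optional, Tuple

# An engine is (name, fetch) where fetch(url) returns (status, body).
Engine = Tuple[str, Callable[[str], Tuple[int, str]]]

# Assumed heuristics; the project's real block detection is not documented here.
BLOCK_STATUSES = {403, 429, 503}
BLOCK_MARKERS = ("just a moment", "captcha", "access denied")


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: a blocking status code or a known challenge phrase in the body."""
    lowered = body.lower()
    return status in BLOCK_STATUSES or any(marker in lowered for marker in BLOCK_MARKERS)


def fetch_escalating(url: str, engines: Iterable[Engine]) -> Optional[Tuple[str, int, str]]:
    """Try engines in order; stop at the first response that does not look blocked."""
    result = None
    for name, fetch in engines:
        status, body = fetch(url)
        result = (name, status, body)
        if not looks_blocked(status, body):
            break
    return result
```

If every engine is blocked, the last engine's response is returned so the caller can flag the page rather than lose it.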
python -m scraper acme.example \
--sitemap --browser \
--markdown-dir corpus/acme \
-o corpus/acme.json
Result: every public page from acme.example as one structured JSON file
(corpus/acme.json) plus one Markdown file per page in corpus/acme/ ready
to feed into a RAG pipeline or LLM context window.
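The --sitemap step in this recipe follows the chain robots.txt → sitemap.xml → nested indexes. A minimal standard-library sketch of that chain is shown below. It is not the project's sitemap.py: the function names, the `/sitemap.xml` fallback, and the `limit` parameter are assumptions, and `fetch` is any callable that returns the XML text for a URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def _local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}loc' becomes 'loc'."""
    return tag.rsplit("}", 1)[-1]


def sitemaps_from_robots(robots_txt: str, base_url: str):
    """Return Sitemap: URLs from robots.txt, else the conventional /sitemap.xml."""
    found = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            found.append(line.split(":", 1)[1].strip())
    return found or [urljoin(base_url, "/sitemap.xml")]


def discover_urls(sitemap_url: str, fetch, limit: int = 1000, _seen=None):
    """Walk a sitemap or sitemap index and return up to `limit` page URLs."""
    seen = set() if _seen is None else _seen
    if sitemap_url in seen:  # guard against indexes that reference themselves
        return []
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    # Take the <loc> that is a direct child of each <url> or <sitemap> entry.
    locs = [el.text.strip() for entry in root for el in entry
            if _local(el.tag) == "loc" and el.text]
    if _local(root.tag) != "sitemapindex":
        return locs[:limit]
    urls = []
    for child in locs:
        if len(urls) >= limit:
            break
        urls += discover_urls(child, fetch, limit - len(urls), seen)
    return urls
```

The standard-library XML parser is not hardened against malicious input, so a production crawler would likely want a defensive parser for untrusted sitemaps.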
python -m scraper docs.example \
--sitemap --browser \
--mirror snapshots/docs-2026-04-26
Open snapshots/docs-2026-04-26/index.html to browse offline.
python -m scraper api-docs.example \
--depth 5 --max-pages 500 \
--rps 2 \
--cookies-dir ~/.scraper/cookies \
--metrics-port 9100 \
--log-format json
Then point Prometheus at :9100/metrics:
scraper_pages_total{outcome="ok",mode="http",level="..."} 492
scraper_pages_total{outcome="fail",mode="http",level="..."} 8
scraper_request_retries_total{reason="status:429"} 34
scraper_active_workers 0
SCRAPER_EMBEDDED_WORKERS=0 docker compose --profile redis up --build -d
docker compose --profile redis --profile cli run --rm cli python -m scraper.enqueue \
https://target.example --depth 3 -o /app/results/target.json
docker compose exec redis redis-cli LLEN scraper:jobs

                    ┌─────────────┐
                    │     CLI     │   scraper URL [--flags]
                    └──┬──────┬───┘
                       │      │
        generic mode   │      │   workshop mode (--level N)
                       │      │
                    ┌──▼──────▼───┐
                    │    core     │   orchestrator: which engine, which mode
                    └─┬───────┬───┘
                      │       │
            HTTP/TLS  │       │  Browser (Playwright)
                      │       │
          ┌───────────▼───┐  ┌▼─────────────────────────┐
          │ http_session  │  │ browser + behavioral     │
          │ tls_session   │  │ stealth init script      │
          │ rate_limit    │  │ timezone / locale        │
          │ cookies       │  │ mouse/scroll noise       │
          └─┬─────────────┘  └─┬────────────────────────┘
            │                  │
            └──────────┬───────┘
                       │
          ┌────────────▼────────────────┐
          │ parser → Hotel / dict       │
          │ generic.extract             │  title, links, emails, JSON-LD,
          │                             │  full text, markdown
          └─┬───────────┬───────┬───────┘
            │           │       │
          output      mirror   markdown_dir
        (CSV/JSON)    (local   (one .md
                       site)   per page)

Distributed mode adds:
  enqueue ─────► Redis ────► worker (×N) ──► same orchestrator
                 (AOF)       with lock+
                             DLQ on fail
scraper/
  api.py            private FastAPI job API
  api_models.py     API request / response models
  job_runner.py     shared job execution for API + workers
  local_jobs.py     file-backed single-container job store
  __main__.py       CLI dispatch (generic vs workshop modes)
  core.py           workshop orchestrator (HTTP / Browser / auto)
  generic.py        single-page / BFS / sitemap / content extraction
  sitemap.py        robots.txt + sitemap.xml + index discovery
  robots.py         robots.txt parser + per-UA matching
  mirror.py         Wayback-style local snapshot
  http_session.py   httpx + retry + backoff + cookies + rate limit
  tls_session.py    curl_cffi (Chrome TLS impersonation)
  browser.py        Playwright + stealth init + timezone
  behavioral.py     mouse / scroll / dwell noise
  proxy_pool.py     sticky-per-worker rotation
  cookies.py        persistent per-host cookie jar
  rate_limit.py     per-host token bucket
  observability.py  JSON logs + Prometheus metrics
  queue.py          Redis-backed job queue + status
  worker.py         distributed worker entry point
  enqueue.py        CLI to push jobs onto the queue
  crypto.py         RSA-OAEP-SHA256 (for the workshop challenges)
  parser.py         CSS-selector hotel extraction
  config.py         per-level workshop profiles
  models.py         Hotel / Review dataclasses
  output.py         CSV / JSON writers
mock_site/          local FastAPI mirror of all 7 challenges + sitemap
examples/           service clients + local server-side web UI demo
stress/             real-target stress runner + head-to-head comparison
tests/              pytest (uses fixture-spawned mock + fakeredis)
make test
make coverage
make coverage enforces 100% line coverage for the hosted API/service layer. Browser engines, CLI wrappers, the mock site, and real-target stress harnesses stay out of that percentage because they are covered by integration and smoke tests.
................................................. [100%]
49 passed in 101s
The pytest fixture spawns the bundled FastAPI mock on a random port. No network or API keys required.
make stress
[PASS] books.toscrape.com 1000/1000 books, 1000/1000 details
[PASS] quotes.toscrape.com/js 100 quotes, 10 pages
[PASS] arh.antoinevastel.com "You are not Chrome headless"
[PASS] bot.sannysoft.com 31 pass / 0 fail
[PASS] nowsecure.nl (Cloudflare) title='nowsecure.nl'
[PASS] tls.peet.ws JA4: t13d1516h2 (Chrome shape)
[PASS] httpbin.org/cookies sid=abc123 roundtripped
[PASS] httpbin.org/redirect/5 final_url=/get
[PASS] httpbin.org/encoding/utf8 7808 chars, has_non_ascii=True
[PASS] variety sweep wikipedia / HN / GitHub / python.org
=== Overall: PASS (10/10, 60s) ===
make compare
Side-by-side runs of vanilla httpx / playwright against scraper on the
same targets. Latest: scraper 6 / tie 0 / baseline 0.
- Logs: --log-format json emits line-delimited JSON ready for any log aggregator (Loki, Datadog, CloudWatch).
- Metrics: --metrics-port 9100 exposes Prometheus counters and gauges: scraper_pages_total, scraper_request_retries_total, scraper_active_workers.
- Redis durability: AOF + persistent volume, so a crashed worker fleet picks up where it left off. Per-job locks (TTL) auto-release if a worker dies mid-job.
- Path-traversal protection: the mirror module rejects any URL whose resolved local path escapes the mirror root (verified by test).
- Same-origin filter: BFS crawl, sitemap, and mirror all enforce same registered domain — no accidental cross-site fetches.
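The path-traversal check described above can be approximated in a few lines. This sketch is an assumption about the technique, not the contents of mirror.py: the function name `local_path_for` and the index.html convention for directory URLs are hypothetical. It resolves the candidate path and refuses anything that does not stay under the mirror root, including percent-encoded `..` segments.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def local_path_for(url: str, mirror_root: Path) -> Path:
    """Map a URL to a file under mirror_root, refusing paths that escape it."""
    root = mirror_root.resolve()
    # Decode %2e%2e and similar before resolving, so encoded dots are caught too.
    rel = unquote(urlsplit(url).path).lstrip("/") or "index.html"
    if rel.endswith("/"):
        rel += "index.html"
    candidate = (root / rel).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"refusing to write outside mirror root: {url}")
    return candidate
```

Resolving before comparing is the important step: a plain string prefix check would accept `root/../elsewhere` because the text still starts with the root.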
These are deliberate boundaries — each would require an API key or paid service, breaking the "self-hosted, no keys" promise.
| Not in scope | Why |
|---|---|
| Captcha solving | Requires 2captcha / CapSolver API key (~$3/1k) |
| Residential proxy SaaS | Requires Bright Data / IPRoyal account ($) |
| ML-based fingerprint randomisation | Requires GPU-class compute or external API |
| Cloud-managed scraping (ScrapingBee, etc.) | Requires API key |
The seams are ready: bring your own captcha solver via the CaptchaProvider
ABC pattern (removed in cleanup, easy to restore), or your own proxy URIs via
PROXY_POOL. For free IP rotation, docker compose --profile tor up gives
you Tor SOCKS5 with no signup.
Things I'd build next when the need arises:
- Per-host concurrency limit (currently global)
- Resume-from-failure across CLI invocations (currently only across worker restarts via Redis)
- Browser context fingerprint variance (vary viewport/UA/timezone per worker)
- Sitemap-aware crawl (BFS but seeded from sitemap)
- HAR export alongside mirror
PRs welcome. Style:
- One module per concern; no file > 400 LOC.
- pytest + pyflakes clean before merge.
- Add a unit test for new functionality (uses the mock fixture; no live network required).
- Don't add features, refactor, or introduce abstractions beyond what the task requires.
MIT — code only.