KyleYYC/scraper

scraper

A self-hosted web scraper that goes through walls.

Stealth, resilience, and infrastructure in one tool — without a single external API key.

Quick start · Demo UI · Why this exists · Comparison · CLI · Architecture


What is this?

A single Python package + CLI that handles every layer of modern web scraping:

  • Realistic Chrome HTTP stack — headers, client hints, brotli, HTTP/2
  • TLS fingerprint impersonation — JA3/JA4 indistinguishable from real Chrome (via curl_cffi)
  • Stealth headless browser — Playwright with init scripts that defeat every check on bot.sannysoft.com (31/31)
  • Behavioural noise — mouse moves, scroll, dwell
  • Per-host rate limiting + exponential backoff with jitter
  • Persistent cookie jar: cf_clearance survives between runs
  • Sitemap-driven discovery — robots.txt → sitemap.xml → recurse into indexes
  • Wayback-style local mirror — saves HTML + same-origin assets, rewrites URLs to local paths
  • Markdown export — one clean .md per page, ready to read or feed an LLM
  • Self-hosted IP rotation — Tor SOCKS5 profile in docker-compose.yml
  • Distributed mode — Redis-backed job queue with DLQ, AOF persistence, N parallel workers, Prometheus /metrics
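
The backoff behaviour named in the list above can be sketched in a few lines. This is a minimal illustration of capped exponential backoff with full jitter, not the package's own code; the function name `backoff_delays` and the `base` and `cap` defaults are assumptions chosen for the example.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    if rng is None:
        rng = random.Random()
    for attempt in range(attempts):
        # The ceiling doubles on every retry until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so parallel workers do not retry in lockstep.
        yield rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for number, delay in enumerate(backoff_delays(5), start=1):
        print(f"retry {number}: sleep {delay:.2f}s")
```

Spreading retries across the whole interval, rather than adding a small random offset to a fixed delay, is one common way to keep many workers from hitting a recovering host at the same moment.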

No paid captcha solvers. No external proxy SaaS. No "request credits." Runs on your hardware, your bandwidth, your terms.


Why this exists

The scraping ecosystem is fragmented. To get a working production stack today, you typically wire together:

requests/httpx          ─►  basic HTTP
+ playwright            ─►  JS rendering
+ playwright-stealth    ─►  defeat headless detection
+ curl_cffi             ─►  TLS impersonation
+ scrapy + scrapyd      ─►  queue, scheduling
+ wget                  ─►  local snapshots
+ 2captcha API          ─►  captchas      ($)
+ Bright Data API       ─►  residential IPs ($)
+ ScrapingBee / ZenRows ─►  managed scraping ($)

That's 6+ libraries and 3+ paid services to glue together. scraper ships all of the free parts as one coherent CLI and library, and gives you a clear seam to plug in proxies/captchas only when you actually need them.


Quick start

Hosted API — 2 commands

Requirements: Docker with Compose.

git clone https://github.com/KyleYYC/scraper && cd scraper
make up

make up does the following:

  • creates .env with a random SCRAPER_API_KEY
  • builds the Docker image
  • starts the private API on the first free port from 8080..8100
  • waits for /healthz
  • runs an authenticated end-to-end smoke test
  • verifies bearer auth, private-target blocking, scraping, results, and artifacts
  • prints the exact commands to call it from another service

Target setup time is about 60 seconds on a warm Docker cache. The first run can take longer while Docker pulls the Playwright base image and Python wheels; subsequent runs should stay inside the 60-second path.

After it prints "scraper API is running and verified", load the local credentials without echoing the secret:

set -a; . ./.env; set +a
export SCRAPER_API_URL="http://127.0.0.1:${SCRAPER_HOST_PORT:-8080}"

Create a scrape job:

curl -sS -X POST "$SCRAPER_API_URL/v1/jobs" \
  -H "Authorization: Bearer $SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","depth":1,"max_pages":1,"markdown":true,"mirror":true}'

Substitute the returned id for {job_id} to poll the job, then fetch its result:

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}/result" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"
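
The same create, poll, and fetch flow can be sketched in Python with only the standard library. This is not the bundled examples/python_client.py. The response field names `id` and `status`, and the `failed` and `cancelled` status values, are assumptions for illustration; only the endpoints, the bearer header, the request payload, and the `succeeded` status come from this README.

```python
import json
import time
import urllib.request

TERMINAL_FAILURES = {"failed", "cancelled"}  # assumed status names


def build_request(base_url, api_key, path, payload=None):
    """Build an authenticated request: POST when a JSON payload is given, otherwise GET."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method="POST" if data is not None else "GET",
    )
    request.add_header("Authorization", f"Bearer {api_key}")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def call(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def scrape(base_url, api_key, url, poll_seconds=1.0, max_polls=120):
    """Create a job, poll until it finishes, and return the JSON result."""
    payload = {"url": url, "depth": 1, "max_pages": 1, "markdown": True, "mirror": True}
    job = call(build_request(base_url, api_key, "/v1/jobs", payload))
    job_id = job["id"]  # assumed field name for the returned id
    for _ in range(max_polls):
        state = call(build_request(base_url, api_key, f"/v1/jobs/{job_id}"))
        status = state.get("status")  # assumed field name
        if status == "succeeded":
            return call(build_request(base_url, api_key, f"/v1/jobs/{job_id}/result"))
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

With the environment loaded as shown earlier, a call would look like `scrape(os.environ["SCRAPER_API_URL"], os.environ["SCRAPER_API_KEY"], "https://example.com")` after an `import os`.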

Run the same verification again any time:

make verify

Drop-in service clients are included:

python examples/python_client.py https://example.com
node examples/node_client.mjs https://example.com

Run a local web UI demo:

make demo-ui

The UI server keeps SCRAPER_API_KEY private and proxies browser requests to the scraper API.

Stop it:

make down

Docker — manual

git clone https://github.com/KyleYYC/scraper && cd scraper
make docker
docker run --rm -p 127.0.0.1:8080:8080 \
  -e SCRAPER_API_KEY="$(openssl rand -hex 32)" \
  -v scraper-results:/app/results \
  scraper:latest

The container starts the private REST API by default. Override the command to run the CLI directly: docker run --rm scraper:latest python -m scraper https://example.com.

Local (Python 3.11+)

git clone https://github.com/KyleYYC/scraper && cd scraper
make install              # creates .venv, installs deps, fetches Chromium
python -m scraper example.com

Output lands at results/{host}.json.

Demo UI

Run a complete browser demo with one command:

make demo-ui

The runner starts the private scraper API if needed, then starts a local web UI on the first free port from 5173..5189. The browser talks only to the demo server; the demo server holds SCRAPER_API_KEY and proxies calls to the scraper API.

The UI can:

  • create generic scrape jobs
  • choose depth, max pages, sitemap, browser, Markdown, and mirror options
  • poll job status
  • display the JSON result
  • link generated artifacts such as result.json, markdown/index.md, and mirror/index.html

Recent warm-cache timing from a clean copy:

Elapsed to ready API: 7s
created_ms: 9
job_status: succeeded
result_title: Example Domain
demo_backend_elapsed: 1s

Try the show-off modes

# Crawl 3 levels, same-origin, capped at 30 pages
python -m scraper example.com --depth 3 --max-pages 30

# Auto-discover via robots.txt + sitemap.xml
python -m scraper example.com --sitemap

# Wayback-style local snapshot (open results/snap/index.html in a browser)
python -m scraper example.com --sitemap --browser --mirror results/snap

# Polite — proactively rate-limit, persist cookies, honour robots.txt
python -m scraper example.com \
  --rps 2 \
  --cookies-dir ~/.scraper/cookies \
  --respect-robots

# Maximum stealth — TLS impersonation + browser stealth + behavioural noise
python -m scraper https://protected.example \
  --tls --escalate --humanize --browser

Why it's efficient and cheap

| Cost driver | What others do | What scraper does |
| --- | --- | --- |
| Captcha solving | $1–5 per 1,000 captchas (2captcha/CapSolver) | Skipped: captcha-protected pages are flagged, not solved |
| Residential proxies | $5–15 per GB (Bright Data, IPRoyal) | Built-in ProxyPool: bring your own, or use the free Tor compose profile |
| Managed scraping APIs | $0.50–5 per 1,000 requests (ScrapingBee, ZenRows) | None: your machine, your bandwidth |
| Cloud compute | Hosted browser farms | Single Docker image (~1.2 GB), runs on a $5 VPS |
| Per-request fees | Always | Zero |

Operational efficiency:

  • HTTP/2 + brotli + keep-alive — fewer bytes, fewer round-trips
  • Per-host token bucket — proactive pacing avoids 429s in the first place
  • Persistent cookie jar — one cf_clearance solve per ~24 h, not per session
  • Sitemap discovery — skip the BFS overhead when the site already lists URLs
  • Same-origin filter — never accidentally mirror a CDN
  • Redis AOF persistence — workers crash, queue survives, no work lost
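
The per-host token bucket from the list above can be illustrated as follows. This is a sketch under stated assumptions, not the contents of rate_limit.py: the class name `HostTokenBucket`, the `burst` parameter, and the injectable `clock` are all hypothetical.

```python
import time
from urllib.parse import urlsplit


class HostTokenBucket:
    """Per-host token bucket: each host refills at `rps` tokens per second."""

    def __init__(self, rps, burst=1, clock=time.monotonic):
        self.rps = float(rps)
        self.burst = float(burst)
        self.clock = clock   # injectable so pacing can be tested without sleeping
        self._state = {}     # host -> (tokens, time of last refill)

    def wait_time(self, url):
        """Reserve one request slot for the URL's host; return seconds to sleep first."""
        if self.rps <= 0:    # mirrors the documented "0 = unlimited" convention
            return 0.0
        host = urlsplit(url).netloc.lower()
        now = self.clock()
        tokens, last = self._state.get(host, (self.burst, now))
        # Refill for the elapsed time, never beyond the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rps)
        # Spend one token; a negative balance represents slots already reserved.
        tokens -= 1.0
        self._state[host] = (tokens, now)
        return 0.0 if tokens >= 0 else -tokens / self.rps
```

A caller would run `time.sleep(bucket.wait_time(url))` before each request. Because state is keyed by host, a slow limit on one site does not delay requests to another.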

Comparison

Head-to-head against the same target, vanilla baselines vs scraper:

[scraper] tls.peet.ws (TLS fingerprint)
  baseline: t13d3513h1...                ← Python stdlib stack
  scraper : t13d1516h2...                ← byte-identical to real Chrome 124

[scraper] httpbin.org/headers (server-perceived)
  baseline: User-Agent: python-httpx/0.28.1
  scraper : User-Agent: Mozilla/5.0 (X11; Linux) Chrome/145
            Sec-Ch-Ua: "Chromium";v="145", "Not-A.Brand";v="99"

[scraper] bot.sannysoft.com (31-test detection battery)
  baseline:  18 pass / 12 fail
  scraper :  31 pass /  0 fail

[scraper] arh.antoinevastel.com/areyouheadless
  baseline: "You are Chrome headless"
  scraper : "You are not Chrome headless"

[scraper] httpbin.org/status/429 (rate-limit reaction)
  baseline: 1 attempt, surfaces 429 in 0.3 s
  scraper : 6 attempts with exp. backoff (transient 429s clear)

Run it yourself:

make compare

Feature matrix

Capability requests / httpx scrapy raw playwright scraper
Plain HTTP (sync/async) overkill
Realistic Chrome headers + brotli manual manual n/a default
TLS impersonation (JA3/JA4) implicit --tls
Stealth init (defeat webdriver checks) n/a n/a default
Pass bot.sannysoft.com n/a n/a 18/30 31/31
JS-rendered DOM needs Splash
Behavioural mimicry (mouse/scroll) n/a n/a manual --humanize
Per-host rate limit (proactive) DOWNLOAD_DELAY n/a --rps
Backoff on 429/5xx manual partial n/a default
Persistent cookies (cross-run) per-session partial per-session --cookies-dir
robots.txt urllib.robotparser built-in n/a --respect-robots
Sitemap discovery manual needs plugin n/a --sitemap
Wayback-style local mirror wget --mirror (no JS) --mirror
Markdown export --markdown-dir
Distributed queue scrapyd ($) Redis (built-in)
Worker fleet + DLQ built-in
JSON logs + Prometheus metrics built-in
Self-hosted IP rotation external proxies external proxies external proxies Tor compose profile
Auto-escalate HTTP→TLS→Browser n/a --escalate
Zero API keys / fully self-hosted
One tool that does all of the above

Installation

Requires Python 3.11+ and the system libraries that Playwright's Chromium needs.

git clone https://github.com/KyleYYC/scraper && cd scraper
make install

make install creates a .venv, installs the package and all dependencies, and fetches Chromium. On Linux the Playwright postinstall handles system libs; on macOS / Windows use pip install -e '.[dev]' && playwright install --with-deps chromium.

Docker

make docker        # builds the image
make docker-api    # runs the private API
make docker-mock   # runs the bundled mock target on :8765

Image base: mcr.microsoft.com/playwright/python:v1.58.0-noble — Chromium and its system libs are preinstalled. Runs as non-root (pwuser) with a /healthz HEALTHCHECK, and removes pip from the final runtime image after package installation.

Hosted API

The API is meant to stay small and universal: bearer auth, async jobs, JSON results, and optional artifacts. For local or staging use, prefer:

make up

make up creates .env, starts Docker, and runs make verify against the authenticated REST contract. With a warm Docker cache, this is the intended 2-command, roughly 60-second path after cloning.

The Compose stack publishes local ports on 127.0.0.1 by default, so the API is not exposed to the LAN during local development.

Manual Docker:

export SCRAPER_API_KEY="$(openssl rand -hex 32)"
docker run --rm -p 127.0.0.1:8080:8080 \
  -e SCRAPER_API_KEY="$SCRAPER_API_KEY" \
  -v scraper-results:/app/results \
  scraper:latest

Create a scrape job:

set -a; . ./.env; set +a
export SCRAPER_API_URL="http://127.0.0.1:${SCRAPER_HOST_PORT:-8080}"

curl -sS -X POST "$SCRAPER_API_URL/v1/jobs" \
  -H "Authorization: Bearer $SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","depth":1,"max_pages":1,"markdown":true,"mirror":true}'

Poll and fetch the result:

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}/result" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"

For services, copy one of the dependency-free examples:

python examples/python_client.py https://example.com
node examples/node_client.mjs https://example.com

Endpoints:

| Method | Path | Notes |
| --- | --- | --- |
| GET | /healthz | Public liveness check |
| GET | /v1/status | Authenticated queue/API status |
| POST | /v1/jobs | Create async scrape job |
| GET | /v1/jobs | List recent jobs; add ?include_artifacts=true only when needed |
| GET | /v1/jobs/{id} | Inspect one job |
| GET | /v1/jobs/{id}/result | Return JSON result |
| GET | /v1/jobs/{id}/artifacts/{path} | Download generated artifact |
| POST | /v1/jobs/{id}/retry | Requeue same request |
| DELETE | /v1/jobs/{id} | Cancel queued job |

Core API environment:

| Var | Default | Notes |
| --- | --- | --- |
| SCRAPER_API_KEY | required | Bearer token, at least 32 chars |
| PORT | 8080 | Railway and most hosts inject this |
| RESULTS_DIR | /app/results | Job metadata, JSON, markdown, mirrors |
| COOKIES_DIR | /app/.cookies | Persistent cookie jar |
| SCRAPER_EMBEDDED_WORKERS | 1 | Single-container background workers |
| SCRAPER_PUBLIC_ONLY | true | Blocks localhost/private/metadata targets |
| SCRAPER_MAX_PAGES | 200 | API request cap |
| SCRAPER_MAX_DEPTH | 5 | API request cap |
| SCRAPER_MAX_RPS | 5 | API request cap |
| SCRAPER_RETENTION_JOBS | 1000 | Keep at most this many terminal local jobs |
| SCRAPER_RETENTION_SECONDS | 604800 | Delete terminal local jobs older than this |
| SCRAPER_RETENTION_BYTES | 0 | Optional local results byte cap; 0 disables |
| SCRAPER_ENABLE_DOCS | false | Enables FastAPI docs when true |
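
To make the SCRAPER_PUBLIC_ONLY row concrete, here is one way a "public targets only" check could look. It is a sketch, not the API's actual guard: the function names and the injectable `resolver` are hypothetical, and the exact set of blocked ranges in the real service may differ.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(ip_text):
    """True unless the address is private, loopback, link-local, multicast, reserved, or unspecified."""
    ip = ipaddress.ip_address(ip_text)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def is_public_target(url, resolver=socket.getaddrinfo):
    """Allow a URL only if it is http(s) and every address its host resolves to is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
        addresses = {info[4][0] for info in resolver(parts.hostname, port)}
        return bool(addresses) and all(is_public_ip(address) for address in addresses)
    except (OSError, ValueError):
        # Resolution failures and malformed ports or addresses are treated as "not allowed".
        return False
```

The cloud metadata address 169.254.169.254 is link-local, so it is rejected along with localhost and RFC 1918 ranges. A check like this only covers the first resolution; a production guard would also need to pin the resolved address for the actual connection and re-check redirects.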

Railway deployment:

The repo includes railway.json so Railway uses the root Dockerfile, starts scraper-api, restarts on failure, and checks /healthz.

Two-command CLI deploy, after railway login:

railway init
make railway-vars && railway up -d

To print the template variables without changing a Railway project:

RAILWAY_DRY_RUN=1 make railway-vars

For a Railway template, configure one web service from this GitHub repo:

| Setting | Value |
| --- | --- |
| Builder | Dockerfile |
| Start command | scraper-api |
| Healthcheck path | /healthz |
| Public networking | HTTP |
| Optional volume | mount at /app/results |

Template variables:

| Var | Template value |
| --- | --- |
| SCRAPER_API_KEY | ${{ secret(64) }} |
| SCRAPER_EMBEDDED_WORKERS | 1 |
| SCRAPER_PUBLIC_ONLY | true |
| SCRAPER_MAX_PAGES | 200 |
| SCRAPER_MAX_DEPTH | 5 |
| SCRAPER_MAX_RPS | 5 |
| SCRAPER_RETENTION_JOBS | 1000 |
| SCRAPER_RETENTION_SECONDS | 604800 |
| SCRAPER_RETENTION_BYTES | 0 |
| SCRAPER_ENABLE_DOCS | false |
| RESULTS_DIR | /app/results |
| COOKIES_DIR | /app/.cookies |
| LOG_FORMAT | json |
| LOG_LEVEL | INFO |

Railway injects PORT; the API already listens on it. Use the generated Railway domain as SCRAPER_API_URL in client services.

After publishing the template, copy Railway's generated deploy-button URL into this README. Do not ship a placeholder button URL because it sends users to a dead template.

Distributed stack

make up             # API on first free port, embedded worker enabled
make stack-results  # show what landed in the named volume
make stack-down

For Redis-backed workers, run the API with SCRAPER_EMBEDDED_WORKERS=0, set REDIS_URL, and start scraper-worker containers with the same results volume.

Self-hosted IP rotation via Tor

Zero API keys. Just a local SOCKS5 proxy.

docker compose --profile tor up -d tor
PROXY_POOL=socks5h://localhost:9050 python -m scraper https://target.example

CLI

scraper URL [--level N] [-o output] [options]

| Flag | Default | Notes |
| --- | --- | --- |
| URL | (required) | URL or domain (https:// added if missing) |
| --level | | Workshop level 1..7; omit for generic mode |
| --pages | 9 | Listing pages per session (workshop mode) |
| -o, --output | results/{host}.{ext} | .csv or .json |
| --public-key | auto | RSA public key for level-6 preflight |
| --no-headless | off | Run the browser headed (debugging) |
| --tls | off | curl_cffi Chrome TLS impersonation |
| --humanize | off | Mouse / scroll / dwell noise (browser mode) |
| --browser | off | Force browser engine in generic mode |
| --escalate | off | HTTP → TLS → Browser, advances on block detection |
| --depth | 1 | BFS crawl depth |
| --max-pages | 50 | Hard cap on pages crawled |
| --sitemap | off | URL discovery via robots.txt + sitemap.xml |
| --markdown-dir | | Also write one .md per page |
| --mirror | | Save HTML + assets locally (Wayback-style) |
| --rps | 0 | Per-host requests-per-second cap (0 = unlimited) |
| --cookies-dir | | Persist cookies between runs |
| --respect-robots | off | Honour robots.txt Disallow rules |
| --metrics-port | 0 | Prometheus /metrics (0 = disabled) |
| --log-format | text | text or json |
| --log-level | INFO | |

Environment variables:

| Var | Notes |
| --- | --- |
| PROXY_POOL | Comma-separated proxy URIs (http://u:p@h:1,...) |
| PROXY_POOL_FILE | Path to a file with one URI per line |
| REDIS_URL | Redis connection (default redis://localhost:6379/0) |
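
The two proxy variables could be read along these lines. This is a sketch rather than the behaviour of proxy_pool.py: merging both sources, skipping `#` comment lines, and assigning proxies by worker index are assumptions, and the function names are hypothetical.

```python
import os
from pathlib import Path


def load_proxy_pool(env=None):
    """Collect proxy URIs from PROXY_POOL (comma-separated) and PROXY_POOL_FILE (one per line)."""
    env = os.environ if env is None else env
    uris = [uri.strip() for uri in env.get("PROXY_POOL", "").split(",") if uri.strip()]
    file_path = env.get("PROXY_POOL_FILE")
    if file_path:
        for line in Path(file_path).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Assumption: blank lines and "#" comments are ignored.
            if line and not line.startswith("#"):
                uris.append(line)
    return uris


def pick_proxy(uris, worker_index):
    """Sticky assignment: a given worker index always maps to the same proxy."""
    if not uris:
        return None  # no pool configured, connect directly
    return uris[worker_index % len(uris)]
```

Keeping one proxy per worker means a target sees a stable client address for the lifetime of that worker, which tends to look less suspicious than switching exit IPs on every request.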

Modes

| Mode | Flag | What it does |
| --- | --- | --- |
| Single page | (default) | Fetch one URL, return JSON |
| BFS crawl | --depth N | Same-origin, capped at --max-pages |
| Sitemap-driven | --sitemap | robots.txt → sitemap.xml → fetch each declared URL |
| Markdown export | --markdown-dir | One .md per page, front-matter + clean prose |
| Local mirror | --mirror DIR | HTML + same-origin assets, URLs rewritten to local paths |
| Workshop | --level 1..7 | Per-tier hotel scrape (CSV output); solves the scraping-workshop challenges |
| Auto-escalate | --escalate | HTTP → TLS → Browser, advancing on block detection |
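
The auto-escalate idea, trying the cheapest engine first and moving up only when a response looks blocked, can be sketched as below. The block heuristics (status codes and body markers) and every name in the snippet are assumptions for illustration; the real detection logic is not documented here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Response:
    status: int
    body: str


BLOCK_STATUSES = {403, 429, 503}                 # assumed "blocked" status codes
BLOCK_MARKERS = ("just a moment", "captcha")     # assumed challenge-page markers


def looks_blocked(response: Response) -> bool:
    """Heuristic: a blocking status code or a challenge marker in the body."""
    body = response.body.lower()
    return response.status in BLOCK_STATUSES or any(marker in body for marker in BLOCK_MARKERS)


def fetch_with_escalation(
    url: str, engines: Iterable[Tuple[str, Callable[[str], Response]]]
) -> Tuple[str, Response]:
    """Try engines in order; return the first unblocked response, else the last attempt."""
    last = None
    for name, fetch in engines:
        last = (name, fetch(url))
        if not looks_blocked(last[1]):
            return last
    if last is None:
        raise ValueError("at least one engine is required")
    return last
```

In use, `engines` would be an ordered list such as `[("http", http_fetch), ("tls", tls_fetch), ("browser", browser_fetch)]`, where the three callables are placeholders for the corresponding sessions. Most pages then cost one plain HTTP request, and the browser is launched only for the minority that need it.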

Examples

Scrape a domain, save markdown for an LLM

python -m scraper acme.example \
  --sitemap --browser \
  --markdown-dir corpus/acme \
  -o corpus/acme.json

Result: every public page from acme.example as one structured JSON file (corpus/acme.json) plus one Markdown file per page in corpus/acme/ ready to feed into a RAG pipeline or LLM context window.
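
The discovery step behind --sitemap (robots.txt, then sitemap.xml, then recursion into sitemap indexes) can be sketched with the standard library. This is an illustration, not sitemap.py: the function names, the `fetch` callable, and the `max_sitemaps` safety cap are assumptions.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # namespace from the sitemaps.org protocol


def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls


def parse_sitemap(xml_text):
    """Return ('index' or 'urlset', list of <loc> values) for one sitemap document."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    return kind, locs


def discover(robots_txt, fetch, max_sitemaps=50):
    """Walk sitemap indexes breadth-first and collect page URLs; fetch(url) returns XML text."""
    queue = sitemaps_from_robots(robots_txt)
    seen, pages = set(), []
    while queue and len(seen) < max_sitemaps:
        sitemap_url = queue.pop(0)
        if sitemap_url in seen:
            continue
        seen.add(sitemap_url)
        kind, locs = parse_sitemap(fetch(sitemap_url))
        # An index lists further sitemaps; a urlset lists the pages themselves.
        (queue if kind == "index" else pages).extend(locs)
    return pages
```

The `seen` set and the cap guard against indexes that reference each other, which would otherwise loop forever.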

Mirror a site for offline reading

python -m scraper docs.example \
  --sitemap --browser \
  --mirror snapshots/docs-2026-04-26

Open snapshots/docs-2026-04-26/index.html to browse offline.

Production scrape — polite, persistent, observable

python -m scraper api-docs.example \
  --depth 5 --max-pages 500 \
  --rps 2 \
  --cookies-dir ~/.scraper/cookies \
  --metrics-port 9100 \
  --log-format json

Then point Prometheus at :9100/metrics:

scraper_pages_total{outcome="ok",mode="http",level="..."}    492
scraper_pages_total{outcome="fail",mode="http",level="..."}    8
scraper_request_retries_total{reason="status:429"}             34
scraper_active_workers                                          0
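
For a quick check without a Prometheus server, the sample output above can be parsed directly. The sketch below handles only the simple `name{labels} value` lines shown here, not the full exposition format, and the function names are hypothetical.

```python
import re

# Matches lines such as: scraper_pages_total{outcome="ok",mode="http"} 492
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric name, labels dict, float value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"] or ""))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


def success_ratio(samples):
    """Share of scraper_pages_total samples with outcome="ok", or None if no pages were counted."""
    pages = [(labels, value) for name, labels, value in samples if name == "scraper_pages_total"]
    total = sum(value for _, value in pages)
    ok = sum(value for labels, value in pages if labels.get("outcome") == "ok")
    return ok / total if total else None
```

Applied to the four lines above, the ratio is 492 / (492 + 8) = 0.984.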

Distributed mode — N workers from a Redis queue

SCRAPER_EMBEDDED_WORKERS=0 docker compose --profile redis up --build -d
docker compose --profile redis --profile cli run --rm cli python -m scraper.enqueue \
  https://target.example --depth 3 -o /app/results/target.json
docker compose exec redis redis-cli LLEN scraper:jobs

Architecture

                  ┌─────────────┐
                  │   CLI       │  scraper URL [--flags]
                  └──┬──────┬───┘
                     │      │
       generic mode  │      │  workshop mode (--level N)
                     │      │
                  ┌──▼──────▼───┐
                  │   core      │  orchestrator: which engine, which mode
                  └─┬───────┬───┘
                    │       │
         HTTP/TLS   │       │   Browser (Playwright)
                    │       │
        ┌───────────▼───┐  ┌▼─────────────────────────┐
        │ http_session  │  │ browser + behavioral     │
        │ tls_session   │  │   stealth init script    │
        │ rate_limit    │  │   timezone / locale       │
        │ cookies       │  │   mouse/scroll noise      │
        └─┬─────────────┘  └─┬────────────────────────┘
          │                  │
          └──────────┬───────┘
                     │
        ┌────────────▼────────────────┐
        │  parser  →  Hotel / dict    │
        │  generic.extract            │  title, links, emails, JSON-LD,
        │                              │  full text, markdown
        └─┬───────────┬───────┬───────┘
          │           │       │
       output      mirror   markdown_dir
       (CSV/JSON)  (local   (one .md
                    site)    per page)

Distributed mode adds:

        enqueue ─────► Redis ────► worker (×N) ──► same orchestrator
                       (AOF)        with lock+
                                    DLQ on fail
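
The "DLQ on fail" step in the diagram can be illustrated with an in-memory stand-in. This sketch deliberately uses no Redis and omits the per-job lock; the class name, the `max_attempts` retry policy, and the job dictionary shape are assumptions, not the behaviour of queue.py or worker.py.

```python
from collections import deque


class InMemoryQueue:
    """Toy job queue: failed jobs are retried, then parked in a dead-letter list."""

    def __init__(self, max_attempts=3):
        self.jobs = deque()   # pending jobs, oldest first
        self.dead = []        # dead-letter queue: jobs that exhausted their attempts
        self.max_attempts = max_attempts

    def enqueue(self, url):
        self.jobs.append({"url": url, "attempts": 0})

    def work_once(self, handler):
        """Run one pending job through handler(url); return False when the queue is empty."""
        if not self.jobs:
            return False
        job = self.jobs.popleft()
        job["attempts"] += 1
        try:
            handler(job["url"])
        except Exception as exc:
            job["last_error"] = repr(exc)
            # Requeue until the attempt budget is spent, then move to the DLQ for inspection.
            target = self.dead if job["attempts"] >= self.max_attempts else self.jobs
            target.append(job)
        return True
```

A dead-letter queue keeps a permanently failing URL from blocking the rest of the queue while preserving the job and its last error for later review.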

Module map

scraper/
  api.py             private FastAPI job API
  api_models.py      API request / response models
  job_runner.py      shared job execution for API + workers
  local_jobs.py      file-backed single-container job store
  __main__.py        CLI dispatch (generic vs workshop modes)
  core.py            workshop orchestrator (HTTP / Browser / auto)
  generic.py         single-page / BFS / sitemap / content extraction
  sitemap.py         robots.txt + sitemap.xml + index discovery
  robots.py          robots.txt parser + per-UA matching
  mirror.py          Wayback-style local snapshot
  http_session.py    httpx + retry + backoff + cookies + rate limit
  tls_session.py     curl_cffi (Chrome TLS impersonation)
  browser.py         Playwright + stealth init + timezone
  behavioral.py      mouse / scroll / dwell noise
  proxy_pool.py      sticky-per-worker rotation
  cookies.py         persistent per-host cookie jar
  rate_limit.py      per-host token bucket
  observability.py   JSON logs + Prometheus metrics
  queue.py           Redis-backed job queue + status
  worker.py          distributed worker entry point
  enqueue.py         CLI to push jobs onto the queue
  crypto.py          RSA-OAEP-SHA256 (for the workshop challenges)
  parser.py          CSS-selector hotel extraction
  config.py          per-level workshop profiles
  models.py          Hotel / Review dataclasses
  output.py          CSV / JSON writers
mock_site/           local FastAPI mirror of all 7 challenges + sitemap
examples/            service clients + local server-side web UI demo
stress/              real-target stress runner + head-to-head comparison
tests/               pytest (uses fixture-spawned mock + fakeredis)

Testing

Unit tests (no network)

make test

Hosted-service coverage gate

make coverage

This enforces 100% line coverage for the hosted API/service layer. Browser engines, CLI wrappers, the mock site, and real-target stress harnesses stay out of that percentage because they are covered by integration and smoke tests.

................................................. [100%]
49 passed in 101s

The pytest fixture spawns the bundled FastAPI mock on a random port. No network or API keys required.

Stress (real targets)

make stress
[PASS] books.toscrape.com         1000/1000 books, 1000/1000 details
[PASS] quotes.toscrape.com/js     100 quotes, 10 pages
[PASS] arh.antoinevastel.com      "You are not Chrome headless"
[PASS] bot.sannysoft.com          31 pass / 0 fail
[PASS] nowsecure.nl (Cloudflare)  title='nowsecure.nl'
[PASS] tls.peet.ws                JA4: t13d1516h2 (Chrome shape)
[PASS] httpbin.org/cookies        sid=abc123 roundtripped
[PASS] httpbin.org/redirect/5     final_url=/get
[PASS] httpbin.org/encoding/utf8  7808 chars, has_non_ascii=True
[PASS] variety sweep              wikipedia / HN / GitHub / python.org
=== Overall: PASS (10/10, 60s) ===

Head-to-head vs vanilla

make compare

Side-by-side runs of vanilla httpx / playwright against scraper on the same targets. Latest: scraper 6 / tie 0 / baseline 0.


Operational notes

  • Logs: --log-format json emits line-delimited JSON ready for any log aggregator (Loki, Datadog, CloudWatch).
  • Metrics: --metrics-port 9100 exposes Prometheus counters and gauges: scraper_pages_total, scraper_request_retries_total, scraper_active_workers.
  • Redis durability: AOF + persistent volume, so a crashed worker fleet picks up where it left off. Per-job locks (TTL) auto-release if a worker dies mid-job.
  • Path-traversal protection: the mirror module rejects any URL whose resolved local path escapes the mirror root (verified by test).
  • Same-origin filter: BFS crawl, sitemap, and mirror all enforce same registered domain — no accidental cross-site fetches.
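
The path-traversal rule from the notes above can be expressed compactly. The following is a sketch of the general technique (resolve the candidate path, then require it to stay under the root), not the code in mirror.py; the function name and the index.html defaults are assumptions.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def local_path_for(mirror_root, url):
    """Map a URL to a file under mirror_root, rejecting anything that escapes the root."""
    root = Path(mirror_root).resolve()
    # Decode percent-escapes first so "%2e%2e" cannot hide a ".." segment.
    relative = unquote(urlsplit(url).path).lstrip("/") or "index.html"
    if relative.endswith("/"):
        relative += "index.html"
    # resolve() collapses ".." segments, so the containment check sees the real destination.
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"refusing to write outside the mirror root: {url}")
    return candidate
```

Checking the resolved path, rather than searching the URL for "..", also covers encoded segments and symlinks that already exist inside the mirror directory.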

Honest limitations

These are deliberate boundaries — each would require an API key or paid service, breaking the "self-hosted, no keys" promise.

| Not in scope | Why |
| --- | --- |
| Captcha solving | Requires 2captcha / CapSolver API key (~$3/1k) |
| Residential proxy SaaS | Requires Bright Data / IPRoyal account ($) |
| ML-based fingerprint randomisation | Requires GPU-class compute or external API |
| Cloud-managed scraping (ScrapingBee, etc.) | Requires API key |

The seams are ready: bring your own captcha solver via the CaptchaProvider ABC pattern (removed in cleanup, easy to restore), or your own proxy URIs via PROXY_POOL. For free IP rotation, docker compose --profile tor up gives you Tor SOCKS5 with no signup.


Roadmap

Things I'd build next when the need arises:

  • Per-host concurrency limit (currently global)
  • Resume-from-failure across CLI invocations (currently only across worker restarts via Redis)
  • Browser context fingerprint variance (vary viewport/UA/timezone per worker)
  • Sitemap-aware crawl (BFS but seeded from sitemap)
  • HAR export alongside mirror

Contributing

PRs welcome. Style:

  • One module per concern; no file > 400 LOC.
  • pytest + pyflakes clean before merge.
  • Add a unit test for new functionality (uses the mock fixture; no live network required).
  • Don't add features, refactor, or introduce abstractions beyond what the task requires.

License

MIT — code only.
