scraper

A self-hosted web scraper that goes through walls.

Stealth, resilience, and infrastructure in one tool — without a single external API key.

Quick start · Demo UI · Why this exists · Comparison · CLI · Architecture

What is this?

A single Python package + CLI that handles every layer of modern web scraping:

Realistic Chrome HTTP stack — headers, client hints, brotli, HTTP/2
TLS fingerprint impersonation — JA3/JA4 indistinguishable from real Chrome (via curl_cffi)
Stealth headless browser — Playwright with init scripts that defeat every check on bot.sannysoft.com (31/31)
Behavioural noise — mouse moves, scroll, dwell
Per-host rate limiting + exponential backoff with jitter
Persistent cookie jar — cf_clearance survives between runs
Sitemap-driven discovery — robots.txt → sitemap.xml → recurse into indexes
Wayback-style local mirror — saves HTML + same-origin assets, rewrites URLs to local paths
Markdown export — one clean .md per page, ready to read or feed an LLM
Self-hosted IP rotation — Tor SOCKS5 profile in docker-compose.yml
Distributed mode — Redis-backed job queue with DLQ, AOF persistence, N parallel workers, Prometheus /metrics

No paid captcha solvers. No external proxy SaaS. No "request credits." Runs on your hardware, your bandwidth, your terms.

Why this exists

The scraping ecosystem is fragmented. To get a working production stack today, you typically wire together:

requests/httpx          ─►  basic HTTP
+ playwright            ─►  JS rendering
+ playwright-stealth    ─►  defeat headless detection
+ curl_cffi             ─►  TLS impersonation
+ scrapy + scrapyd      ─►  queue, scheduling
+ wget                  ─►  local snapshots
+ 2captcha API          ─►  captchas      ($)
+ Bright Data API       ─►  residential IPs ($)
+ ScrapingBee / ZenRows ─►  managed scraping ($)

That's 6+ libraries and 3+ paid services to glue together. scraper ships all of the free parts as one coherent CLI and library, and gives you a clear seam to plug in proxies/captchas only when you actually need them.

Quick start

Hosted API — 2 commands

Requirements: Docker with Compose.

git clone https://github.com/KyleYYC/scraper && cd scraper
make up

The make up target:

creates .env with a random SCRAPER_API_KEY
builds the Docker image
starts the private API on the first free port from 8080..8100
waits for /healthz
runs an authenticated end-to-end smoke test
verifies bearer auth, private-target blocking, scraping, results, and artifacts
prints the exact commands to call it from another service

Target setup time is about 60 seconds on a warm Docker cache. The first run can take longer while Docker pulls the Playwright base image and Python wheels; subsequent runs should stay inside the 60-second path.

After it prints "scraper API is running and verified", load the local credentials without echoing the secret:

set -a; . ./.env; set +a
export SCRAPER_API_URL="http://127.0.0.1:${SCRAPER_HOST_PORT:-8080}"

Create a scrape job:

curl -sS -X POST "$SCRAPER_API_URL/v1/jobs" \
  -H "Authorization: Bearer $SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","depth":1,"max_pages":1,"markdown":true,"mirror":true}'

Check the job's status and fetch its result, replacing {job_id} with the id returned by the previous call:

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}/result" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"
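
For callers that prefer Python over curl, the same three requests can be built with the standard library. This is a sketch rather than the bundled examples/python_client.py; the function names are hypothetical, and only the paths, headers, and payload fields shown in the curl commands are taken from this README.

```python
import json
import urllib.request


def create_job_request(api_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST /v1/jobs request shown in the curl example."""
    return urllib.request.Request(
        f"{api_url}/v1/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def job_request(api_url: str, api_key: str, job_id: str, result: bool = False) -> urllib.request.Request:
    """Build GET /v1/jobs/{job_id} or, with result=True, GET /v1/jobs/{job_id}/result."""
    suffix = "/result" if result else ""
    return urllib.request.Request(
        f"{api_url}/v1/jobs/{job_id}{suffix}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the body (assumes the API answers with JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Nothing touches the network until `send` is called. Reading an `id` field from the create response is an assumption based on the "returned id" wording above.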

Run the same verification again any time:

make verify

Drop-in service clients are included:

python examples/python_client.py https://example.com
node examples/node_client.mjs https://example.com

Run a local web UI demo:

make demo-ui

The UI server keeps SCRAPER_API_KEY private and proxies browser requests to the scraper API.

Stop it:

make down

Docker — manual

git clone https://github.com/KyleYYC/scraper && cd scraper
make docker
docker run --rm -p 127.0.0.1:8080:8080 \
  -e SCRAPER_API_KEY="$(openssl rand -hex 32)" \
  -v scraper-results:/app/results \
  scraper:latest

The container starts the private REST API by default. Override the command to run the CLI directly: docker run --rm scraper:latest python -m scraper https://example.com.

Local (Python 3.11+)

git clone https://github.com/KyleYYC/scraper && cd scraper
make install              # creates .venv, installs deps, fetches Chromium
python -m scraper example.com

Output lands at results/{host}.json.

Demo UI

Run a complete browser demo with one command:

make demo-ui

The runner starts the private scraper API if needed, then starts a local web UI on the first free port from 5173..5189. The browser talks only to the demo server; the demo server holds SCRAPER_API_KEY and proxies calls to the scraper API.

The UI can:

create generic scrape jobs
choose depth, max pages, sitemap, browser, Markdown, and mirror options
poll job status
display the JSON result
link generated artifacts such as result.json, markdown/index.md, and mirror/index.html

Recent warm-cache timing from a clean copy:

Elapsed to ready API: 7s
created_ms: 9
job_status: succeeded
result_title: Example Domain
demo_backend_elapsed: 1s

Try the show-off modes

# Crawl 3 levels, same-origin, capped at 30 pages
python -m scraper example.com --depth 3 --max-pages 30

# Auto-discover via robots.txt + sitemap.xml
python -m scraper example.com --sitemap

# Wayback-style local snapshot (open results/snap/index.html in a browser)
python -m scraper example.com --sitemap --browser --mirror results/snap

# Polite — proactively rate-limit, persist cookies, honour robots.txt
python -m scraper example.com \
  --rps 2 \
  --cookies-dir ~/.scraper/cookies \
  --respect-robots

# Maximum stealth — TLS impersonation + browser stealth + behavioural noise
python -m scraper https://protected.example \
  --tls --escalate --humanize --browser
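
What honouring robots.txt means in practice can be shown with the standard library's urllib.robotparser. This is an illustrative sketch, not the project's robots.py; the rules, the `scraper-demo` user agent, and the `allowed_urls` helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the given robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse a body we already have; no network
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

Filtering the frontier before fetching, rather than after, keeps disallowed URLs out of the request log entirely.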

Why it's efficient and cheap

Cost driver	What others do	What `scraper` does
Captcha solving	$1–5 per 1,000 captchas (2captcha/CapSolver)	Skipped — captcha-protected pages flagged, not solved
Residential proxies	$5–15 per GB (Bright Data, IPRoyal)	Built-in `ProxyPool` — bring your own, or use the free Tor compose profile
Managed scraping APIs	$0.50–5 per 1,000 requests (ScrapingBee, ZenRows)	None — your machine, your bandwidth
Cloud compute	Hosted browser farms	Single Docker image (~1.2 GB), runs on a $5 VPS
Per-request fees	Always	Zero

Operational efficiency:

HTTP/2 + brotli + keep-alive — fewer bytes, fewer round-trips
Per-host token bucket — proactive pacing avoids 429s in the first place
Persistent cookie jar — one cf_clearance solve per ~24 h, not per session
Sitemap discovery — skip the BFS overhead when the site already lists URLs
Same-origin filter — never accidentally mirror a CDN
Redis AOF persistence — workers crash, queue survives, no work lost
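
The per-host token bucket idea can be sketched in a few lines. This is a generic illustration under stated assumptions, not the contents of rate_limit.py; `TokenBucket`, `PerHostLimiter`, and the `burst` parameter are hypothetical names.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def wait_time(self) -> float:
        """Take one token; return how long the caller should sleep first (0 if none)."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        self.tokens -= 1  # may go negative: that debt is what the caller waits off
        return 0.0 if self.tokens >= 0 else -self.tokens / self.rate


class PerHostLimiter:
    """One bucket per host, so a slow host never throttles the others."""

    def __init__(self, rps: float, burst: float = 1.0):
        self.buckets = defaultdict(lambda: TokenBucket(rps, burst))

    def acquire(self, host: str) -> None:
        delay = self.buckets[host].wait_time()
        if delay > 0:
            time.sleep(delay)
```

With `PerHostLimiter(rps=2)`, back-to-back calls to `acquire("example.com")` are spaced about 0.5 s apart, while a call for a different host proceeds immediately.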

Comparison

Head-to-head against the same target, vanilla baselines vs scraper:

[scraper] tls.peet.ws (TLS fingerprint)
  baseline: t13d3513h1...                ← Python stdlib stack
  scraper : t13d1516h2...                ← byte-identical to real Chrome 124

[scraper] httpbin.org/headers (server-perceived)
  baseline: User-Agent: python-httpx/0.28.1
  scraper : User-Agent: Mozilla/5.0 (X11; Linux) Chrome/145
            Sec-Ch-Ua: "Chromium";v="145", "Not-A.Brand";v="99"

[scraper] bot.sannysoft.com (31-test detection battery)
  baseline:  18 pass / 12 fail
  scraper :  31 pass /  0 fail

[scraper] arh.antoinevastel.com/areyouheadless
  baseline: "You are Chrome headless"
  scraper : "You are not Chrome headless"

[scraper] httpbin.org/status/429 (rate-limit reaction)
  baseline: 1 attempt, surfaces 429 in 0.3 s
  scraper : 6 attempts with exp. backoff (transient 429s clear)

Run it yourself:

make compare
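
The retry behaviour in the 429 comparison above (six attempts with exponential backoff) can be sketched as follows. This is an illustration, not the code in http_session.py: the retryable status set, the base and cap values, and the "full jitter" strategy are assumptions.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # assumption: which statuses count as transient


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0, rng=random.random):
    """Yield one 'full jitter' delay per retry: rng() * min(cap, base * 2**n)."""
    for n in range(attempts - 1):  # no sleep after the final attempt
        yield rng() * min(cap, base * 2 ** n)


def fetch_with_retry(fetch, attempts: int = 6, sleep=time.sleep) -> int:
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = list(backoff_delays(attempts))
    for attempt in range(attempts):
        status = fetch()
        if status not in RETRYABLE or attempt == attempts - 1:
            return status
        sleep(delays[attempt])
```

Randomising each delay keeps many workers that hit the same 429 from retrying in lockstep.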

Feature matrix

Capability	`requests` / `httpx`	`scrapy`	raw `playwright`	scraper
Plain HTTP (sync/async)	✓	✓	overkill	✓
Realistic Chrome headers + brotli	manual	manual	n/a	default
TLS impersonation (JA3/JA4)	✗	✗	implicit	`--tls`
Stealth init (defeat `webdriver` checks)	n/a	n/a	✗	default
Pass `bot.sannysoft.com`	n/a	n/a	18/30	31/31
JS-rendered DOM	✗	needs Splash	✓	✓
Behavioural mimicry (mouse/scroll)	n/a	n/a	manual	`--humanize`
Per-host rate limit (proactive)	✗	DOWNLOAD_DELAY	n/a	`--rps`
Backoff on 429/5xx	manual	partial	n/a	default
Persistent cookies (cross-run)	per-session	partial	per-session	`--cookies-dir`
robots.txt	`urllib.robotparser`	built-in	n/a	`--respect-robots`
Sitemap discovery	manual	needs plugin	n/a	`--sitemap`
Wayback-style local mirror	`wget --mirror` (no JS)	✗	✗	`--mirror`
Markdown export	✗	✗	✗	`--markdown-dir`
Distributed queue	✗	scrapyd (separate service)	✗	Redis (built-in)
Worker fleet + DLQ	✗	✗	✗	built-in
JSON logs + Prometheus metrics	✗	✗	✗	built-in
Self-hosted IP rotation	external proxies	external proxies	external proxies	Tor compose profile
Auto-escalate HTTP→TLS→Browser	✗	✗	n/a	`--escalate`
Zero API keys / fully self-hosted	✓	✓	✓	✓
One tool that does all of the above	✗	✗	✗	✓

Installation

Requires Python 3.11+ and a few system libraries Playwright Chromium needs.

git clone https://github.com/KyleYYC/scraper && cd scraper
make install

make install creates a .venv, installs the package and all dependencies, and fetches Chromium. On Linux the Playwright postinstall handles system libs; on macOS / Windows use pip install -e '.[dev]' && playwright install --with-deps chromium.

Docker

make docker        # builds the image
make docker-api    # runs the private API
make docker-mock   # runs the bundled mock target on :8765

Image base: mcr.microsoft.com/playwright/python:v1.58.0-noble — Chromium and its system libs are preinstalled. Runs as non-root (pwuser) with a /healthz HEALTHCHECK, and removes pip from the final runtime image after package installation.

Hosted API

The API is meant to stay small and universal: bearer auth, async jobs, JSON results, and optional artifacts. For local or staging use, prefer:

make up

make up creates .env, starts Docker, and runs make verify against the authenticated REST contract. With a warm Docker cache, this is the intended 2-command, roughly 60-second path after cloning.

The Compose stack publishes local ports on 127.0.0.1 by default, so the API is not exposed to the LAN during local development.

Manual Docker:

export SCRAPER_API_KEY="$(openssl rand -hex 32)"
docker run --rm -p 127.0.0.1:8080:8080 \
  -e SCRAPER_API_KEY="$SCRAPER_API_KEY" \
  -v scraper-results:/app/results \
  scraper:latest

Create a scrape job:

set -a; . ./.env; set +a
export SCRAPER_API_URL="http://127.0.0.1:${SCRAPER_HOST_PORT:-8080}"

curl -sS -X POST "$SCRAPER_API_URL/v1/jobs" \
  -H "Authorization: Bearer $SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","depth":1,"max_pages":1,"markdown":true,"mirror":true}'

Poll and fetch the result:

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"

curl -sS "$SCRAPER_API_URL/v1/jobs/{job_id}/result" \
  -H "Authorization: Bearer $SCRAPER_API_KEY"

For services, copy one of the dependency-free examples:

python examples/python_client.py https://example.com
node examples/node_client.mjs https://example.com

Endpoints:

Method	Path	Notes
`GET`	`/healthz`	Public liveness check
`GET`	`/v1/status`	Authenticated queue/API status
`POST`	`/v1/jobs`	Create async scrape job
`GET`	`/v1/jobs`	List recent jobs; add `?include_artifacts=true` only when needed
`GET`	`/v1/jobs/{id}`	Inspect one job
`GET`	`/v1/jobs/{id}/result`	Return JSON result
`GET`	`/v1/jobs/{id}/artifacts/{path}`	Download generated artifact
`POST`	`/v1/jobs/{id}/retry`	Requeue same request
`DELETE`	`/v1/jobs/{id}`	Cancel queued job

Core API environment:

Var	Default	Notes
`SCRAPER_API_KEY`	required	Bearer token, at least 32 chars
`PORT`	`8080`	Railway and most hosts inject this
`RESULTS_DIR`	`/app/results`	Job metadata, JSON, markdown, mirrors
`COOKIES_DIR`	`/app/.cookies`	Persistent cookie jar
`SCRAPER_EMBEDDED_WORKERS`	`1`	Single-container background workers
`SCRAPER_PUBLIC_ONLY`	`true`	Blocks localhost/private/metadata targets
`SCRAPER_MAX_PAGES`	`200`	API request cap
`SCRAPER_MAX_DEPTH`	`5`	API request cap
`SCRAPER_MAX_RPS`	`5`	API request cap
`SCRAPER_RETENTION_JOBS`	`1000`	Keep at most this many terminal local jobs
`SCRAPER_RETENTION_SECONDS`	`604800`	Delete terminal local jobs older than this
`SCRAPER_RETENTION_BYTES`	`0`	Optional local results byte cap; `0` disables
`SCRAPER_ENABLE_DOCS`	`false`	Enables FastAPI docs when true
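
To make the SCRAPER_PUBLIC_ONLY row concrete, here is a rough sketch of the kind of check that blocks localhost, private, and cloud-metadata targets. It is not the project's implementation; `is_blocked_target` is a hypothetical helper, and it only inspects IP-literal hosts.

```python
import ipaddress
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not loopback, RFC 1918, link-local)."""
    return ipaddress.ip_address(address).is_global


def is_blocked_target(url: str) -> bool:
    """Reject non-HTTP(S) URLs, 'localhost', and IP-literal hosts that are not public.

    Hostnames are NOT resolved here. A real guard must also check every address
    the name resolves to, otherwise DNS can point a public name at 127.0.0.1.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in {"http", "https"} or not host:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return not is_public_ip(host)
    except ValueError:
        return False  # not an IP literal; the resolution check is out of scope here
```

The link-local range covers 169.254.169.254, the metadata address used by several cloud providers, which is why it is rejected along with loopback and RFC 1918 ranges.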

Railway deployment:

The repo includes railway.json so Railway uses the root Dockerfile, starts scraper-api, restarts on failure, and checks /healthz.

Two-command CLI deploy, after railway login:

railway init
make railway-vars && railway up -d

To print the template variables without changing a Railway project:

RAILWAY_DRY_RUN=1 make railway-vars

For a Railway template, configure one web service from this GitHub repo:

Setting	Value
Builder	Dockerfile
Start command	`scraper-api`
Healthcheck path	`/healthz`
Public networking	HTTP
Optional volume	mount at `/app/results`

Template variables:

Var	Template value
`SCRAPER_API_KEY`	`${{ secret(64) }}`
`SCRAPER_EMBEDDED_WORKERS`	`1`
`SCRAPER_PUBLIC_ONLY`	`true`
`SCRAPER_MAX_PAGES`	`200`
`SCRAPER_MAX_DEPTH`	`5`
`SCRAPER_MAX_RPS`	`5`
`SCRAPER_RETENTION_JOBS`	`1000`
`SCRAPER_RETENTION_SECONDS`	`604800`
`SCRAPER_RETENTION_BYTES`	`0`
`SCRAPER_ENABLE_DOCS`	`false`
`RESULTS_DIR`	`/app/results`
`COOKIES_DIR`	`/app/.cookies`
`LOG_FORMAT`	`json`
`LOG_LEVEL`	`INFO`

Railway injects PORT; the API already listens on it. Use the generated Railway domain as SCRAPER_API_URL in client services.

After publishing the template, copy Railway's generated deploy-button URL into this README. Do not ship a placeholder button URL because it sends users to a dead template.

Distributed stack

make up             # API on first free port, embedded worker enabled
make stack-results  # show what landed in the named volume
make stack-down

For Redis-backed workers, run the API with SCRAPER_EMBEDDED_WORKERS=0, set REDIS_URL, and start scraper-worker containers with the same results volume.

Self-hosted IP rotation via Tor

Zero API keys. Just a local SOCKS5 proxy.

docker compose --profile tor up -d tor
PROXY_POOL=socks5h://localhost:9050 python -m scraper https://target.example

CLI

scraper URL [--level N] [-o output] [options]

Flag	Default	Notes
`URL`	(required)	URL or domain (`https://` added if missing)
`--level`	—	Workshop level `1..7`; omit for generic mode
`--pages`	`9`	Listing pages per session (workshop mode)
`-o, --output`	`results/{host}.{ext}`	`.csv` or `.json`
`--public-key`	auto	RSA public key for level-6 preflight
`--no-headless`	off	Run the browser headed (debugging)
`--tls`	off	curl_cffi Chrome TLS impersonation
`--humanize`	off	Mouse / scroll / dwell noise (browser mode)
`--browser`	off	Force browser engine in generic mode
`--escalate`	off	HTTP → TLS → Browser, advances on block detection
`--depth`	`1`	BFS crawl depth
`--max-pages`	`50`	Hard cap on pages crawled
`--sitemap`	off	URL discovery via robots.txt + sitemap.xml
`--markdown-dir`	—	Also write one `.md` per page
`--mirror`	—	Save HTML + assets locally (Wayback-style)
`--rps`	`0`	Per-host requests-per-second cap (`0` = unlimited)
`--cookies-dir`	—	Persist cookies between runs
`--respect-robots`	off	Honour robots.txt Disallow rules
`--metrics-port`	`0`	Prometheus `/metrics` (`0` = disabled)
`--log-format`	`text`	`text` or `json`
`--log-level`	`INFO`	Logging verbosity threshold

Environment variables:

Var	Notes
`PROXY_POOL`	Comma-separated proxy URIs (`http://u:p@h:1,...`)
`PROXY_POOL_FILE`	Path to a file with one URI per line
`REDIS_URL`	Redis connection (default `redis://localhost:6379/0`)

Modes

Mode	Flag	What it does
Single page	(default)	Fetch one URL, return JSON
BFS crawl	`--depth N`	Same-origin, capped at `--max-pages`
Sitemap-driven	`--sitemap`	robots.txt → sitemap.xml → fetch each declared URL
Markdown export	`--markdown-dir`	One `.md` per page, front-matter + clean prose
Local mirror	`--mirror DIR`	HTML + same-origin assets, URLs rewritten to local paths
Workshop	`--level 1..7`	Per-tier hotel scrape (CSV output) — solves the scraping-workshop challenges
Auto-escalate	`--escalate`	HTTP → TLS → Browser, advancing on block detection
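
The auto-escalate row can be read as a loop over engines ordered from cheapest to most expensive. The sketch below shows that control flow only; the `Response` type, the block-detection heuristics, and `fetch_escalating` are hypothetical and do not reflect the project's actual detection rules.

```python
from typing import Callable, NamedTuple


class Response(NamedTuple):
    status: int
    body: str


BLOCK_STATUSES = {403, 429, 503}  # assumption: statuses that suggest a bot wall
BLOCK_MARKERS = ("captcha", "just a moment")  # assumption: challenge-page phrases


def looks_blocked(response: Response) -> bool:
    body = response.body.lower()
    return response.status in BLOCK_STATUSES or any(m in body for m in BLOCK_MARKERS)


def fetch_escalating(url: str, engines: list[tuple[str, Callable[[str], Response]]]):
    """Try engines in order; move to the next one only when a response looks blocked."""
    if not engines:
        raise ValueError("at least one engine is required")
    for name, fetch in engines:
        response = fetch(url)
        if not looks_blocked(response):
            break
    return name, response  # the last engine's response is returned even if still blocked
```

Ordering the engines as HTTP, then TLS, then browser means the expensive Chromium launch only happens for hosts that actually need it.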

Examples

Scrape a domain, save markdown for an LLM

python -m scraper acme.example \
  --sitemap --browser \
  --markdown-dir corpus/acme \
  -o corpus/acme.json

Result: every public page from acme.example as one structured JSON file (corpus/acme.json) plus one Markdown file per page in corpus/acme/ ready to feed into a RAG pipeline or LLM context window.

Mirror a site for offline reading

python -m scraper docs.example \
  --sitemap --browser \
  --mirror snapshots/docs-2026-04-26

Open snapshots/docs-2026-04-26/index.html in a browser to read the site offline.

Production scrape — polite, persistent, observable

python -m scraper api-docs.example \
  --depth 5 --max-pages 500 \
  --rps 2 \
  --cookies-dir ~/.scraper/cookies \
  --metrics-port 9100 \
  --log-format json

Then point Prometheus at :9100/metrics:

scraper_pages_total{outcome="ok",mode="http",level="..."}    492
scraper_pages_total{outcome="fail",mode="http",level="..."}    8
scraper_request_retries_total{reason="status:429"}             34
scraper_active_workers                                          0

Distributed mode — N workers from a Redis queue

SCRAPER_EMBEDDED_WORKERS=0 docker compose --profile redis up --build -d
docker compose --profile redis --profile cli run --rm cli python -m scraper.enqueue \
  https://target.example --depth 3 -o /app/results/target.json
docker compose exec redis redis-cli LLEN scraper:jobs

Architecture

                  ┌─────────────┐
                  │   CLI       │  scraper URL [--flags]
                  └──┬──────┬───┘
                     │      │
       generic mode  │      │  workshop mode (--level N)
                     │      │
                  ┌──▼──────▼───┐
                  │   core      │  orchestrator: which engine, which mode
                  └─┬───────┬───┘
                    │       │
         HTTP/TLS   │       │   Browser (Playwright)
                    │       │
        ┌───────────▼───┐  ┌▼─────────────────────────┐
        │ http_session  │  │ browser + behavioral     │
        │ tls_session   │  │   stealth init script    │
        │ rate_limit    │  │   timezone / locale       │
        │ cookies       │  │   mouse/scroll noise      │
        └─┬─────────────┘  └─┬────────────────────────┘
          │                  │
          └──────────┬───────┘
                     │
        ┌────────────▼────────────────┐
        │  parser  →  Hotel / dict    │
        │  generic.extract            │  title, links, emails, JSON-LD,
        │                              │  full text, markdown
        └─┬───────────┬───────┬───────┘
          │           │       │
       output      mirror   markdown_dir
       (CSV/JSON)  (local   (one .md
                    site)    per page)

Distributed mode adds:

        enqueue ─────► Redis ────► worker (×N) ──► same orchestrator
                       (AOF)        with lock+
                                    DLQ on fail
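
The "DLQ on fail" part of the diagram can be modelled without Redis. The class below is an in-memory stand-in meant only to show the retry-then-dead-letter flow; it is not queue.py, it omits the per-job lock, and `max_attempts` is an assumed setting that this README does not document.

```python
from collections import deque


class JobQueue:
    """In-memory stand-in for the Redis queue: pending list, attempt counts, DLQ."""

    def __init__(self, max_attempts: int = 3):
        self.pending = deque()
        self.dead = []  # dead-letter queue: jobs that exhausted their attempts
        self.attempts = {}
        self.max_attempts = max_attempts  # assumption: the real limit is not documented

    def enqueue(self, job: str) -> None:
        self.pending.append(job)

    def run_next(self, handler) -> bool:
        """Pop one job and run it; on failure requeue it or move it to the DLQ."""
        if not self.pending:
            return False
        job = self.pending.popleft()
        self.attempts[job] = self.attempts.get(job, 0) + 1
        try:
            handler(job)
        except Exception:
            exhausted = self.attempts[job] >= self.max_attempts
            (self.dead if exhausted else self.pending).append(job)
        return True
```

A dead-letter list keeps one permanently failing URL from being retried forever while still leaving it available for inspection or a manual retry.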

Module map

scraper/
  api.py             private FastAPI job API
  api_models.py      API request / response models
  job_runner.py      shared job execution for API + workers
  local_jobs.py      file-backed single-container job store
  __main__.py        CLI dispatch (generic vs workshop modes)
  core.py            workshop orchestrator (HTTP / Browser / auto)
  generic.py         single-page / BFS / sitemap / content extraction
  sitemap.py         robots.txt + sitemap.xml + index discovery
  robots.py          robots.txt parser + per-UA matching
  mirror.py          Wayback-style local snapshot
  http_session.py    httpx + retry + backoff + cookies + rate limit
  tls_session.py     curl_cffi (Chrome TLS impersonation)
  browser.py         Playwright + stealth init + timezone
  behavioral.py      mouse / scroll / dwell noise
  proxy_pool.py      sticky-per-worker rotation
  cookies.py         persistent per-host cookie jar
  rate_limit.py      per-host token bucket
  observability.py   JSON logs + Prometheus metrics
  queue.py           Redis-backed job queue + status
  worker.py          distributed worker entry point
  enqueue.py         CLI to push jobs onto the queue
  crypto.py          RSA-OAEP-SHA256 (for the workshop challenges)
  parser.py          CSS-selector hotel extraction
  config.py          per-level workshop profiles
  models.py          Hotel / Review dataclasses
  output.py          CSV / JSON writers
mock_site/           local FastAPI mirror of all 7 challenges + sitemap
examples/            service clients + local server-side web UI demo
stress/              real-target stress runner + head-to-head comparison
tests/               pytest (uses fixture-spawned mock + fakeredis)

Testing

Unit tests (no network)

make test

Hosted-service coverage gate

make coverage

This enforces 100% line coverage for the hosted API/service layer. Browser engines, CLI wrappers, the mock site, and real-target stress harnesses stay out of that percentage because they are covered by integration and smoke tests.

................................................. [100%]
49 passed in 101s

The pytest fixture spawns the bundled FastAPI mock on a random port. No network or API keys required.

Stress (real targets)

make stress

[PASS] books.toscrape.com         1000/1000 books, 1000/1000 details
[PASS] quotes.toscrape.com/js     100 quotes, 10 pages
[PASS] arh.antoinevastel.com      "You are not Chrome headless"
[PASS] bot.sannysoft.com          31 pass / 0 fail
[PASS] nowsecure.nl (Cloudflare)  title='nowsecure.nl'
[PASS] tls.peet.ws                JA4: t13d1516h2 (Chrome shape)
[PASS] httpbin.org/cookies        sid=abc123 roundtripped
[PASS] httpbin.org/redirect/5     final_url=/get
[PASS] httpbin.org/encoding/utf8  7808 chars, has_non_ascii=True
[PASS] variety sweep              wikipedia / HN / GitHub / python.org
=== Overall: PASS (10/10, 60s) ===

Head-to-head vs vanilla

make compare

Side-by-side runs of vanilla httpx / playwright against scraper on the same targets. Latest: scraper 6 / tie 0 / baseline 0.

Operational notes

Logs: --log-format json emits line-delimited JSON ready for any log aggregator (Loki, Datadog, CloudWatch).
Metrics: --metrics-port 9100 exposes Prometheus counters and gauges: scraper_pages_total, scraper_request_retries_total, scraper_active_workers.
Redis durability: AOF + persistent volume, so a crashed worker fleet picks up where it left off. Per-job locks (TTL) auto-release if a worker dies mid-job.
Path-traversal protection: the mirror module rejects any URL whose resolved local path escapes the mirror root (verified by test).
Same-origin filter: BFS crawl, sitemap, and mirror all enforce same registered domain — no accidental cross-site fetches.
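
The path-traversal rule above can be illustrated with pathlib. This is a sketch of the general technique (resolve the candidate path, then check it is still under the root), not the code in mirror.py; `local_path_for` is a hypothetical helper and it ignores query strings.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def local_path_for(url: str, mirror_root: Path) -> Path:
    """Map a URL to a file under mirror_root, raising ValueError if it would escape."""
    root = mirror_root.resolve()
    path = unquote(urlsplit(url).path).lstrip("/")
    if not path or path.endswith("/"):
        path += "index.html"  # directory-style URLs get an index file
    candidate = (root / path).resolve()  # collapses any ".." segments
    if not candidate.is_relative_to(root):
        raise ValueError(f"{url!r} resolves outside the mirror root")
    return candidate
```

Decoding percent-escapes before resolving matters: `%2e%2e` is `..` once unquoted, so checking the raw URL path alone would miss it.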

Honest limitations

These are deliberate boundaries — each would require an API key or paid service, breaking the "self-hosted, no keys" promise.

Not in scope	Why
Captcha solving	Requires 2captcha / CapSolver API key (~$3/1k)
Residential proxy SaaS	Requires Bright Data / IPRoyal account ($)
ML-based fingerprint randomisation	Requires GPU-class compute or external API
Cloud-managed scraping (ScrapingBee, etc.)	Requires API key

The extension points remain: supply your own proxy URIs via PROXY_POOL, or restore the CaptchaProvider ABC (removed during cleanup, easy to bring back) to plug in your own captcha solver. For free IP rotation, docker compose --profile tor up gives you a Tor SOCKS5 proxy with no signup.

Roadmap

Things I'd build next when the need arises:

Per-host concurrency limit (currently global)
Resume-from-failure across CLI invocations (currently only across worker restarts via Redis)
Browser context fingerprint variance (vary viewport/UA/timezone per worker)
Sitemap-aware crawl (BFS but seeded from sitemap)
HAR export alongside mirror

Contributing

PRs welcome. Style:

One module per concern; no file > 400 LOC.
pytest + pyflakes clean before merge.
Add a unit test for new functionality (uses the mock fixture; no live network required).
Don't add features, refactor, or introduce abstractions beyond what the task requires.

License

MIT — code only.
