Multi-Language Web Crawler Framework
AI · Media Download · Distributed · Anti-Bot · Node-Reverse
⭐ If this project helped you, please give it a star!
SuperSpider is a multi-language web crawler framework that ships four production-ready runtimes in Python, Go, Rust, and Java. Each runtime covers the same broad capability surface (web crawling, browser automation, AI extraction, media download, anti-bot, distributed execution) but is optimized for a different engineering environment.
| Runtime | Language | Delivery | Tagline |
|---|---|---|---|
| 🐍 pyspider | Python | virtualenv | AI-first, project-oriented, rapid iteration |
| 🐹 gospider | Go | compiled binary | Concurrent, binary-first, distributed workers |
| 🦀 rustspider | Rust | release binary | Performance-first, feature-gated, strongly typed |
| ☕ javaspider | Java | Maven / JAR | Enterprise-first, browser workflows, audit trails |
- HTTP and browser-based crawling (Playwright + Selenium)
- Scrapy-style project interface with plugin injection
- Dynamic site handling (JavaScript-rendered pages)
- Crawler type templates: hydrated SPA, bootstrap JSON, infinite scroll, login session, and e-commerce search JobSpec starters
- Site presets: JD, Taobao, Tmall, Pinduoduo, Xiaohongshu, and Douyin Shop starter JobSpec templates
- Spider class kits: reusable spider class templates for PySpider, GoSpider, RustSpider, and JavaSpider
- Native ecommerce crawler classes: catalog/detail/review wrappers in all four runtimes, with browser-backed capture companions where the runtime supports them
- Native ecommerce examples: catalog/detail/review examples in all four runtimes with a JD fast path plus a `generic` fallback for unknown storefronts
- Proxy pool with health checking and automatic rotation
- Rate limiting, circuit breaker, deduplication
- Robots.txt compliance
- Session and cookie management
- Checkpoint and incremental crawl (resume interrupted crawls)
- Priority-based crawling: request priority queue with SQLite persistence
- Multi-threaded execution: thread pool, concurrent executor, async executor, rate-limited executor
- Incremental crawling with ETag/Last-Modified: content hash comparison, min-change interval enforcement, delta token generation
- Cookie management: per-domain cookie jar with SameSite, Secure, HttpOnly, auto-expiry, Netscape export/import
- Persistent priority queue: SQLite-backed with URL deduplication, priority sorting, visited tracking
- Worker pool: configurable thread pool with shutdown, wait, and statistics
- Concurrent executor: semaphore-controlled ThreadPoolExecutor with execute-many
- Async executor: asyncio.Semaphore-controlled async task execution
- Rate-limited executor: token bucket algorithm with wait and execute
- Priority task queue: heapq-based priority queue for task scheduling
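The token bucket behind a rate-limited executor can be sketched roughly as follows. This is an illustrative sketch, not SuperSpider's own class: the names `TokenBucket`, `try_acquire`, and `wait_time` are hypothetical, and the injectable `clock` parameter exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def _refill(self) -> None:
        # Add the tokens earned since the last update, never exceeding capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if they are available; never blocks."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens are available (0.0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

A "wait and execute" wrapper would sleep for `wait_time()` and then call `try_acquire()` before running the task. The capacity sets the largest burst, and the rate sets the sustained request frequency.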
All four runtimes can download from:
| Platform | Format |
|---|---|
| YouTube | HLS, DASH, MP4 |
| Bilibili | HLS, DASH, M4S |
| IQIYI | HLS, DASH |
| Tencent Video | HLS, direct link |
| Youku | HLS, DASH |
| Douyin | MP4, direct link |
| Generic HLS | M3U8 streams |
| Generic DASH | MPD manifests |
| FFmpeg merge | TS/M4S → MP4 |
| DRM detection | Widevine, PlayReady |
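For the generic HLS row, the first step of a download is reading the M3U8 playlist and resolving each segment URI. The sketch below uses only the Python standard library and is not taken from any of the four runtimes; `list_hls_segments` is a hypothetical helper, and the example URLs are placeholders.

```python
from urllib.parse import urljoin


def list_hls_segments(playlist_text: str, playlist_url: str) -> list[str]:
    """Return absolute URLs for the URI lines of an HLS media playlist."""
    lines = [line.strip() for line in playlist_text.splitlines()]
    if not lines or lines[0] != "#EXTM3U":
        raise ValueError("not an M3U8 playlist")
    segments = []
    for line in lines[1:]:
        if not line or line.startswith("#"):
            continue  # skip blank lines, tags, and comments
        # Relative URIs are resolved against the playlist's own URL.
        segments.append(urljoin(playlist_url, line))
    return segments
```

In a master playlist the URI lines point at variant playlists rather than segments, so a real downloader would pick a variant first and then repeat this step. The resulting TS or M4S files are what the FFmpeg merge stage combines into MP4.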
- LLM extraction: OpenAI (GPT-4o, etc.) and Anthropic (Claude)
- Entity extraction: named entities, structured data
- Content summarization: automatic page summarization
- Sentiment analysis: positive/negative/neutral classification
- Smart parser: auto-detects page type and extracts relevant fields (PySpider only)
- Schema-driven output: strongly typed structured extraction (PySpider only)
- Few-shot examples: guide the LLM with examples
- XPath suggestion studio: AI-suggested XPath selectors
- Keyword extraction: AI-powered keyword extraction from page content
- Content classification: AI categorization into predefined categories
- Translation: AI-powered content translation to target languages
- Q&A over content: ask questions about crawled page content with context
- TLS fingerprint rotation (JA3/JA4 mimicry)
- Browser behavior simulation (mouse movement, scroll, reading pace)
- WAF and access-friction detection with compliant browser upgrade paths
- Night mode (reduced crawl rate during off-hours)
- Access friction classifier: shared `level`, `signals`, `recommended_actions`, `challenge_handoff`, and `capability_plan` across all four runtimes
- HTTP response diagnostics: PySpider `Response.meta["access_friction"]`, GoSpider `Response.AccessFriction`, RustSpider `Response.access_friction`, and JavaSpider `Page.getField("access_friction")`
- Captcha and login challenge handling: detect CAPTCHA/auth/risk-control pages, pause for authorized human access, persist session assets, and resume only after validation
- Cloudflare/Akamai handling: vendor profiling, browser-render recommendation, artifact capture, and stop conditions when access is denied
- Browser fingerprint management: Canvas, WebGL, font fingerprint generation with session persistence
- Smart delay strategy: adaptive frequency-based delay adjustment with human-like jitter
- Cookie management: per-domain cookie jar with automatic rotation
- SSRF protection (blocks internal network access, cloud metadata endpoints)
- Input sanitization: XSS prevention, HTML cleaning, dangerous character filtering
- Block detection: keyword-based block/ban detection with automatic proxy switching
- Compliance boundary: the framework does not promise automated CAPTCHA cracking, forced risk-control bypass, or access to private/login-gated data without authorization
- SSRF protection: blocks requests to private IPs, cloud metadata (169.254.169.254), loopback, multicast
- URL validation: protocol whitelisting, domain allowlist/blocklist, port restrictions, length limits
- Input sanitization: script tag removal, event handler stripping, HTML entity decoding, filename sanitization
- Circuit breaker: configurable failure threshold, half-open state recovery, prevents cascade failures
- Retry strategies: fixed, linear, exponential, exponential+jitter backoff with configurable status code handling
- Failure classification: automatic categorization (blocked, throttled, anti_bot, timeout, server, proxy)
- Request fingerprinting: SHA-256 fingerprints based on URL + method + headers + cookies + body
- Content deduplication: SHA-256 content hashing to avoid re-processing identical pages
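The SSRF rule above (block private IPs, cloud metadata at 169.254.169.254, loopback, and multicast) can be approximated with Python's standard `ipaddress` module. This is a minimal sketch under stated assumptions, not the framework's validator: `is_blocked_address`, `is_url_allowed`, and `ALLOWED_SCHEMES` are hypothetical names, and only IP-literal hosts are checked.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumed protocol whitelist


def is_blocked_address(host: str) -> bool:
    """True if `host` is an IP literal in a range a crawler should not reach."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; hostnames need a resolve-then-check step
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local   # includes 169.254.169.254 (cloud metadata)
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_url_allowed(url: str) -> bool:
    """Apply the scheme whitelist, then the address check."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    return not is_blocked_address(parts.hostname)
```

A production check would also resolve hostnames and re-validate the resolved addresses at connect time, since a public-looking name can point at an internal address.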
Many modern sites protect their APIs with JavaScript-generated signatures. SuperSpider handles this via a Node.js bridge:
- Node-reverse client for JS-encrypted sites
- Encrypted site crawler (HMAC, AES, token-based)
- JS signature execution via Node.js bridge server
- Supports HMAC-SHA256, AES-encrypted params, timestamp tokens
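To make the HMAC-SHA256 plus timestamp-token case concrete, here is a sketch of the kind of signing scheme such sites use. It is written in Python for illustration only; in SuperSpider the site's own JavaScript is executed through the Node.js bridge. The parameter names `ts` and `sign`, the sorted-query canonical form, and both function names are assumptions, since every site defines its own scheme.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def sign_params(params: dict[str, str], secret: bytes, timestamp: int) -> dict[str, str]:
    """Return params plus `ts` and an HMAC-SHA256 `sign` over the sorted query string."""
    signed = dict(params, ts=str(timestamp))
    canonical = urlencode(sorted(signed.items()))  # stable ordering before hashing
    signed["sign"] = hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()
    return signed


def verify_params(signed: dict[str, str], secret: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    unsigned = {k: v for k, v in signed.items() if k != "sign"}
    canonical = urlencode(sorted(unsigned.items()))
    expected = hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("sign", ""))
```

Because the timestamp is part of the signed string, a server can reject replayed requests whose `ts` is too old, which is why the signature has to be regenerated for each request.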
- Redis queue (native, all four runtimes)
- RabbitMQ: broker-native (Go, Java), bridge (Rust), native (Python)
- Kafka: broker-native (Go, Java), bridge (Rust), native (Python)
- Distributed workers with state machine
- Node discovery: environment variables, file, DNS-SRV, Consul, etcd
- Dataset mirror to database backends
- Autoscaled frontier: auto-adjusts concurrency based on latency and failure rate
- Session pool: reusable session slots with fingerprint profile + proxy affinity
- Dead-letter queue: failed requests after max retries are quarantined for inspection
- Lease-based request dispatching: TTL-gated leases with heartbeat renewal and domain inflight limits
- Checkpoint persistence: SQLite-backed checkpoint manager with auto-save intervals
- Proxy scoring: success/failure ratio-based proxy selection with automatic degradation
- Middleware chain: composable request/response processing pipeline
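Proxy scoring by success/failure ratio, with automatic degradation, could look roughly like this. The sketch is hypothetical: `ProxyStats`, `ProxyPool`, the smoothing formula, and the `min_score` threshold are assumptions chosen for illustration, not the framework's actual scoring rule.

```python
from dataclasses import dataclass


@dataclass
class ProxyStats:
    url: str
    successes: int = 0
    failures: int = 0

    @property
    def score(self) -> float:
        # Smoothed success ratio, so an unused proxy starts at 0.5 instead of 0 or 1.
        return (self.successes + 1) / (self.successes + self.failures + 2)


class ProxyPool:
    def __init__(self, urls, min_score: float = 0.2):
        self.stats = {u: ProxyStats(u) for u in urls}
        self.min_score = min_score  # proxies scoring below this are skipped

    def record(self, url: str, ok: bool) -> None:
        """Record the outcome of one request made through `url`."""
        stats = self.stats[url]
        if ok:
            stats.successes += 1
        else:
            stats.failures += 1

    def pick(self):
        """Return the best-scoring healthy proxy, or None if all are degraded."""
        healthy = [s for s in self.stats.values() if s.score >= self.min_score]
        if not healthy:
            return None
        return max(healthy, key=lambda s: s.score).url
```

Always picking the top score concentrates traffic on one proxy. A weighted random choice over the healthy set is a common alternative when load should be spread.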
| Backend | PySpider | GoSpider | RustSpider | JavaSpider |
|---|---|---|---|---|
| SQLite | ✅ | ✅ | ✅ | ✅ |
| PostgreSQL | ✅ | ✅ process | ✅ driver+process | ✅ |
| MySQL | ✅ | ✅ process | ✅ driver+process | ✅ |
| MongoDB | ✅ | ✅ process | ✅ driver+process | ✅ |
| JSON / CSV / JSONL | ✅ | ✅ | ✅ | ✅ |
- Audit trail: in-memory, file, JSONL, composite
- Monitoring and metrics dashboard
- Preflight validation (check config before crawling)
- Checkpoint and resume (SQLite-backed)
- Incremental crawl (only crawl new/changed pages)
- Structured event logging: trace-id correlated events with Prometheus text and OpenTelemetry export
- Observability collector: request latency tracking, failure classification, outcome histograms
- Artifact store: filesystem-based artifact storage for screenshots, traces, JSON snapshots, HTML
- Graph artifact persistence: per-page DOM graph (nodes, edges, stats) saved automatically during crawl
- Frontier state snapshot: pending, known, leases, domain-inflight, dead-letters all exportable as JSON
- Dashboard API: `/api/v1/monitors/<name>/dashboard` provides real-time crawl stats, performance metrics, resource usage
- REST API server: full spider lifecycle management via HTTP (start/stop/stats/queues/tasks)
- API authentication: Bearer token / X-API-Token support for production API security
- Shared config scaffolding: `config init` bootstraps the cross-runtime contract config
- Site profiling: `profile-site` emits `crawler_type`, `site_family`, `runner_order`, `strategy_hints`, and `job_templates`
- Pre-crawl discovery: `sitemap-discover` expands crawl candidates before you commit to selectors
- Selector debugging: `selector-studio` lets you validate CSS/XPath/regex rules against saved HTML
- Control-plane tools: `plugins`, `jobdir`, `http-cache`, `console`, and `audit` are public operator surfaces
- Browser tooling: `fetch`, `trace`, `mock`, and `codegen` are exposed in the browser CLI across the runtimes
- Shared starter assets: `examples/crawler-types/`, `examples/site-presets/`, and `examples/class-kits/` are the canonical starting points for hard site families
- Research flows: async research, notebook-style output, and scenario playbooks are part of the published runtime surface
The native ecommerce examples, crawler classes, and class kits are built for publicly accessible marketplace data.
They currently aim for:
- product links and identifiers
- price and promotion signals
- shop / seller signals
- review and rating summaries
- images, videos, embedded JSON, and API candidates
Current fast paths:
- `jd`: SKU extraction plus price/review public APIs
- `taobao`, `tmall`, `pinduoduo`, `amazon`: JSON-LD product / aggregate-rating fast paths when present
- `generic`: fallback public-data extraction for unknown storefronts
Each runtime now exposes a unified, class-style ecommerce crawler entrypoint, plus a browser-backed companion in the runtimes that support full browser capture. The naming is intentionally consistent so the examples can be lifted into a project with minimal translation.
These examples and classes do not guarantee universal extraction across every storefront, and they do not imply access to login-gated, private, or user-owned commerce data.
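A JSON-LD product fast path amounts to reading `application/ld+json` script blocks and keeping the `Product` entries. The following standard-library sketch illustrates the idea only; `extract_products` and `_JsonLdCollector` are hypothetical names and are not the classes shipped in the examples.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def extract_products(html: str) -> list[dict]:
    """Return JSON-LD objects whose @type is exactly "Product"."""
    collector = _JsonLdCollector()
    collector.feed(html)
    products = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it and let a fallback path handle the page
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                products.append(item)
    return products
```

Pages that wrap entries in `@graph` or give `@type` as a list are not handled here. When the function returns an empty list, a generic fallback extractor would take over.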
Best for: AI-powered extraction, rapid prototyping, research workflows
- Smart parser: automatically detects page type (article, product, listing, etc.) and extracts relevant fields without writing selectors
- Schema-driven LLM extraction: define a JSON schema, get structured output from any page
- Graph crawler: crawl relationship graphs, extract nodes and edges; REST API `/api/v1/graph/extract`
- Research runtime: Jupyter-style notebook output for data analysis
- Plugin injection: extend any part of the pipeline with Python plugins
- Async runtime: full async/await support with aiohttp
- REST API server: Flask-based server with spider start/stop, task management, queue control, monitoring dashboards, and metrics
- Advanced anti-bot: browser fingerprint generation (Canvas, WebGL, fonts), TLS profile management, captcha detection/handling, human behavior simulation (mouse trajectories, scroll, reading time), smart delay strategy
- Cloudflare & Akamai bypass: specialized header profiles for major WAFs
- Security suite: SSRF protection (blocks private IPs, cloud metadata), URL validation (protocol/domain/port whitelisting), input sanitization (XSS prevention, HTML cleaning)
- Circuit breaker: configurable failure threshold with half-open recovery, prevents cascade failures
- Retry strategies: fixed, linear, exponential, exponential+jitter backoff with async support
- Failure classification: automatic categorization (blocked, throttled, anti_bot, timeout, server, proxy, runtime)
- Autoscaled frontier: auto-adjusts concurrency based on latency and failure rate, with a dead-letter queue
- Session pool: reusable session slots with fingerprint profile + proxy affinity, max 32 sessions
- Middleware chain: composable request/response processing pipeline
- Request fingerprinting: SHA-256 fingerprints from URL + method + headers + cookies + body + meta
- Artifact store: filesystem storage for screenshots, traces, JSON, HTML with metadata
- Prometheus + OTel export: metrics export in Prometheus text format and OpenTelemetry payload
- Robots.txt compliance: crawl-delay respect and disallow enforcement
- Curl converter: convert curl commands to spider requests
- Production config: multi-environment configuration with validation
- Crawler type playbook: `docs/CRAWLER_TYPE_PLAYBOOK.md` plus `examples/crawler-types/`
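The four retry strategies named above differ only in how the delay grows with the attempt number. The sketch below shows one common way to compute them. The function name, the strategy strings, the `base` and `cap` defaults, and the "full jitter" variant are all assumptions for illustration, not PySpider's actual API.

```python
import random


def backoff_delay(strategy: str, attempt: int, base: float = 1.0,
                  cap: float = 60.0, rng=random.random) -> float:
    """Delay in seconds before retry number `attempt` (1-based), limited to `cap`."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    exponential = min(cap, base * 2 ** (attempt - 1))
    if strategy == "fixed":
        return min(cap, base)
    if strategy == "linear":
        return min(cap, base * attempt)
    if strategy == "exponential":
        return exponential
    if strategy == "exponential_jitter":
        # "Full jitter": a uniform draw between 0 and the exponential delay,
        # which spreads out retries from many workers that failed together.
        return rng() * exponential
    raise ValueError(f"unknown strategy: {strategy}")
```

A retry loop would call this with the current attempt number, sleep for the returned delay, and stop once the attempt limit is reached or the response status is not in the retryable set.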
# Install all four runtimes
scripts\windows\install-superspider.bat
bash scripts/linux/install-superspider.sh
bash scripts/macos/install-superspider.sh
# Install only PySpider
# Windows
scripts\windows\install-pyspider.bat
# Linux / macOS
bash scripts/linux/install-pyspider.sh
bash scripts/macos/install-pyspider.sh
Output: `.venv-pyspider`. Run `python -m pyspider version` to verify.
Best for: High-concurrency production crawling, binary deployment, distributed worker clusters
- Single binary: no runtime dependencies, deploy anywhere
- Native Selenium/WebDriver: direct WebDriver protocol, no wrapper overhead
- Broker-native queues: RabbitMQ and Kafka via native Go clients
- Dedicated platform extractors: separate packages for Bilibili, IQIYI, Tencent, Youku, Douyin
- Process + driver DB adapters: flexible database backend selection
- Audit trail module: structured audit logging with composite writers
- Browser automation: Chrome browser pool with lifecycle management, auto-restart on failure, graceful shutdown
- WAF bypass suite: Cloudflare, Akamai, Alibaba Cloud, Tencent Cloud specialized bypass strategies
- Anti-detection: stealth mode, WebDriver property removal, Chrome automation flag masking
- TLS fingerprint rotation: JA3/JA4 mimicry profiles for different browsers
- Behavior simulation: mouse movement, reading pace, scroll patterns
- DASH media downloader: HTTP range-based segment download with parallel workers, FFmpeg merge, retry logic
- Distributed node reverse: Node.js bridge with managed subprocess lifecycle
- Task engine: task creation, execution, status tracking, result storage
- Scheduler: cron-based scheduling, one-time tasks, interval tasks, concurrent management
- Event system: structured events, priority queue, dispatcher, subscriber model
- Monitor suite: performance monitoring, resource tracking, health checks, alerting
- Extractor framework: XPath, CSS selector, regex, JSONPath extraction with validation
- Proxy rotation: proxy pool with health checking, automatic failover
- Rate limiting: token bucket, sliding window, adaptive rate control
- Config-driven crawling: JSON-based spider configuration, template-driven execution
# Windows
scripts\windows\install-gospider.bat
# Linux / macOS
bash scripts/linux/install-gospider.sh
bash scripts/macos/install-gospider.sh
Output: `gospider/gospider` binary. Run `./gospider/gospider --version` to verify.
Best for: Performance-sensitive deployments, strict resource boundaries, feature-gated release control
- Feature-gated modules: compile only what you need (browser, distributed, API, web)
- Native node+playwright process: Playwright runs as a managed subprocess, not a wrapper
- Fantoccini Selenium facade: async Rust WebDriver client
- Real captcha API flow: async 2captcha/Anti-Captcha with polling, not a placeholder
- Driver-level DB adapters: native Rust drivers for Postgres, MySQL, MongoDB
- Benchmark suite: built-in performance benchmarks
- Preflight validation: validate all config and dependencies before starting
- Encrypted site crawler: HMAC-SHA256, AES-encrypted params, timestamp token generation
- Media downloader: HLS/DASH with FFmpeg integration, segment tracking, progress reporting
- Async runtime: tokio-based async execution with cancellation support
- Distributed worker: Redis-based task distribution with worker heartbeat
- Proxy rotation: proxy pool with success rate scoring, automatic failover
- Task scheduler: cron-based scheduling with execution history
- Performance monitor: resource usage tracking, latency histograms, throughput metrics
- Transformer pipeline: composable data transformation stages
- Node reverse: Node.js subprocess management for JS signature execution
- FFI bindings: C-compatible interface for embedding in other languages
- API server: HTTP API for spider control and status querying
- Artifact storage: file-based artifact persistence with metadata
# Windows
scripts\windows\install-rustspider.bat
# Linux / macOS
bash scripts/linux/install-rustspider.sh
bash scripts/macos/install-rustspider.sh
Output: `rustspider/target/release/rustspider`. Run `./rustspider/target/release/rustspider --version` to verify.
Best for: Enterprise Java environments, Maven/JAR delivery, browser-heavy automation, audit-conscious execution
- Maven profiles: `lite` / `ai` / `browser` / `distributed` / `full`, so you build only what you need
- Dedicated audit trail: the strongest audit support of the four runtimes (in-memory, file, JSONL, composite)
- Broker-native queues: RabbitMQ via `amqp-client`, Kafka via `kafka-clients`
- REST API server: built-in `/health`, `/jobs`, `/jobs/{id}`, `/jobs/{id}/result` endpoints
- Async spider runtime: `AsyncSpiderRuntime` for non-blocking execution
- Workflow replay: record and replay browser workflows
- Generic-parser fallback: media parsing never fails silently; it falls back to generic extraction
- Adaptive rate limiter: AI-guided rate control with latency-based backpressure
- Batch media downloader: concurrent download with progress tracking, retry, merge
- User-agent rotator: browser-specific UA pools with header consistency
- Connector framework: pluggable database connectors (SQLite, PostgreSQL, MySQL, MongoDB)
- CLI interface: command-line spider execution with config loading
- Bridge module: cross-runtime communication bridge
- Session management: cookie jar, session persistence across requests
- Media pipeline: YouTube, Bilibili, IQIYI, Tencent, Youku, Douyin with format detection
- Workflow engine: DAG-based workflow execution with conditional branching
# Minimal (core crawling only)
mvn -f javaspider/pom.xml -P lite -DskipTests package
# With AI extraction
mvn -f javaspider/pom.xml -P ai -DskipTests package
# With browser automation
mvn -f javaspider/pom.xml -P browser -DskipTests package
# With distributed runtime
mvn -f javaspider/pom.xml -P distributed -DskipTests package
# Everything
mvn -f javaspider/pom.xml -P full -DskipTests package
# Windows
scripts\windows\install-javaspider.bat
# Linux / macOS
bash scripts/linux/install-javaspider.sh
bash scripts/macos/install-javaspider.sh
Output: `javaspider/target/`. Run `java -jar javaspider/target/javaspider-*.jar --version` to verify.
| Framework | Required |
|---|---|
| 🐍 PySpider | Python 3.10+ recommended, pip, venv |
| 🐹 GoSpider | Go 1.24+ |
| 🦀 RustSpider | Rust 1.70+ recommended, Cargo |
| ☕ JavaSpider | Java 17 target, Maven 3.8+ |
Supported installer operating systems: Windows 10/11 or Windows Server 2022+, Ubuntu/Debian/RHEL-compatible Linux, and macOS 13+. The current Windows verification host is Microsoft Windows 11 Pro 10.0.28000, 64-bit.
| I need... | Use |
|---|---|
| AI-powered extraction with LLM | 🐍 PySpider |
| High-concurrency binary deployment | 🐹 GoSpider |
| Maximum performance, strict boundaries | 🦀 RustSpider |
| Enterprise Java, Maven/JAR, audit trail | ☕ JavaSpider |
| Download YouTube / Bilibili / Douyin | any (all four) |
| Crawl JS-encrypted sites | any (all four support node-reverse) |
| Distributed worker cluster | 🐹 GoSpider or 🦀 RustSpider |
| Rapid prototyping and research | 🐍 PySpider |
| REST API to control crawlers | 🐍 PySpider (Flask) or ☕ JavaSpider |
| Browser fingerprint management | 🐍 PySpider (Canvas, WebGL, fonts) |
| Circuit breaker + retry strategies | 🐍 PySpider (4 strategies + circuit breaker) |
| Cloudflare / Akamai bypass | 🐍 PySpider or 🐹 GoSpider |
| SSRF protection + input sanitization | 🐍 PySpider |
| Workflow automation (DAG) | ☕ JavaSpider |
| Feature-gated compilation | 🦀 RustSpider |
| Single binary deployment | 🐹 GoSpider |
| Prometheus / OTel metrics export | 🐍 PySpider |
| Graph crawling + relationship extraction | 🐍 PySpider |
| Session pool management | 🐍 PySpider (32 sessions with fingerprint affinity) |
| Autoscaled concurrency | 🐍 PySpider (frontier-based auto-scaling) |
| Document | Description |
|---|---|
| `docs/DOCS_INDEX.md` | Canonical documentation index and recommended reading order |
| `docs/FRAMEWORK_CAPABILITIES.md` | Detailed per-framework capability descriptions |
| `docs/FRAMEWORK_CAPABILITY_MATRIX.md` | Full capability comparison tables |
| `docs/ACCESS_FRICTION_PLAYBOOK.md` | High-friction crawl model, challenge handoff, and compliant recovery policy |
| `docs/CRAWL_SCENARIO_GAP_MATRIX.md` | Real crawling scenarios that are still partial or missing across the four runtimes |
| `docs/LATEST_SCENARIO_CASES.md` | Latest practical scenario playbooks and recommended runtime choices |
| `docs/CRAWLER_TYPE_PLAYBOOK.md` | Shared crawler types, runner-order guidance, and JobSpec template mapping |
| `docs/SITE_PRESET_PLAYBOOK.md` | Site-family starter presets for major marketplace and social-commerce domains |
| `examples/class-kits/README.md` | Reusable spider class templates for all four runtimes |
| `docs/SUPERSPIDER_INSTALLS.md` | Install instructions for Windows, Linux, and macOS |
| `docs/FOUR_RUNTIME_HEALTH_REPORT.md` | Current compile, dependency, and test status for all four runtimes |
| `MEDIA_PARITY_REPORT.md` | Media platform coverage evidence |
| `ADVANCED_USAGE_GUIDE.md` | Advanced crawling scenarios |
| `ENCRYPTED_SITE_CRAWLING_GUIDE.md` | JS-encrypted site crawling |
| `NODE_REVERSE_INTEGRATION_GUIDE.md` | Node.js reverse engineering bridge |
| `ULTIMATE_ENHANCEMENT_GUIDE.md` | Full capability enhancement reference |
| `PUBLISH_RELEASE_STATUS.md` | Publish-time verification status and release notes |
| `CHANGELOG.md` | Version history |
| `CONTRIBUTING.md` | Contribution guide |
Checked against the current workspace on 2026-04-25:
| Runtime | Verified command(s) | Current result |
|---|---|---|
| 🐍 PySpider | `python -m pytest tests\test_access_friction.py tests\test_locator_analyzer.py tests\test_super_framework.py tests\test_api_server.py tests\test_core_spider.py tests\test_downloader.py -q` | Pass, 40 tests |
| 🐹 GoSpider | `go test ./...` | Pass |
| 🦀 RustSpider | `cargo test --quiet --lib`, `cargo test --quiet --test access_friction` | Pass on checked slices; full suite is heavy and should be run in CI with a longer timeout window |
| ☕ JavaSpider | `mvn -q test`, `mvn -q -Dtest=HtmlSelectorContractTest test` | Pass |
Notes:
- The four runtimes now share access-friction detection for high-risk pages, browser-upgrade planning, XPath/CSS locator helpers, and browser/devtools-oriented element analysis.
- PySpider full-suite success is not claimed here because an earlier broad `pytest -q` run exceeded the local timeout window; use CI with longer timeouts for unrestricted release coverage.
If SuperSpider saved you time, helped you build something, or taught you something new, please consider giving it a ⭐ star on GitHub!
Why it matters:
- ⭐ Stars help other developers discover this project
- ⭐ Stars motivate continued development and maintenance
- ⭐ Stars show the community that multi-language crawler frameworks are valuable
Has SuperSpider helped you?
- Downloaded videos from YouTube, Bilibili, or other platforms?
- Extracted structured data using AI/LLM?
- Bypassed anti-bot protection on a challenging site?
- Cracked a JS-encrypted API?
- Built a distributed crawler cluster?
- Automated data collection for research or business?
If yes to any of the above, click the ⭐ Star button at the top of this page. It takes 2 seconds and means a lot!
MIT License. See LICENSE for details.