A modular, intelligent data extraction and normalization engine for multimedia metadata across diverse content platforms.
AetherScraper is a sophisticated, API-first orchestration framework designed to intelligently gather, normalize, and structure metadata from a wide array of multimedia content platforms. Unlike conventional scrapers, it employs a modular plugin architecture, semantic understanding via integrated AI APIs, and produces a unified, queryable data schema, turning fragmented platform data into a cohesive knowledge graph. It's built for researchers, developers building content-aware applications, and data archivists who require consistency from chaos.
Think of it as a "linguistic translator for platform data dialects," converting the idiosyncratic JSON of one site into the same structured language as another, all while enriching the data with contextual intelligence.
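As a rough illustration of that translation step, the sketch below maps two differently shaped payloads onto one record type. The platform names, field names, and the NormalizedItem class are all invented for this example; the real AetherSchema may look quite different.

```python
from dataclasses import dataclass, asdict


@dataclass
class NormalizedItem:
    """Hypothetical unified record; the real AetherSchema fields may differ."""
    source: str
    item_id: str
    title: str
    duration_seconds: int


def from_platform_a(raw: dict) -> NormalizedItem:
    # Platform A (invented) nests the title and reports duration in milliseconds.
    return NormalizedItem(
        source="platform_a",
        item_id=str(raw["id"]),
        title=raw["meta"]["title"].strip(),
        duration_seconds=raw["length_ms"] // 1000,
    )


def from_platform_b(raw: dict) -> NormalizedItem:
    # Platform B (invented) uses flat keys and an "MM:SS" duration string.
    minutes, seconds = raw["runtime"].split(":")
    return NormalizedItem(
        source="platform_b",
        item_id=raw["videoId"],
        title=raw["name"].strip(),
        duration_seconds=int(minutes) * 60 + int(seconds),
    )


a = from_platform_a({"id": 42, "meta": {"title": " Deep Sea Life "}, "length_ms": 754000})
b = from_platform_b({"videoId": "xyz", "name": "Deep Sea Life", "runtime": "12:34"})
print(asdict(a))
print(asdict(b))
```

Both records end up with the same keys and the same units, which is the property downstream code relies on.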
- Modular Plugin Architecture: Each supported platform is a separate, hot-swappable plugin. Contribute new ones without touching the core engine.
- AI-Powered Metadata Enrichment: Optional integration with OpenAI's GPT API and Anthropic's Claude API to generate descriptions, tags, summaries, and content classifications, adding a layer of semantic understanding to raw data.
- Unified Normalized Schema: All extracted data conforms to a single, well-defined AetherSchema (JSON Schema/TypeScript), making downstream processing predictable.
- Intelligent Rate Limiting & Resilience: Mimics human interaction patterns, handles CAPTCHAs via proxy services, and features exponential backoff retry logic.
- Multilingual Support & Localization: The UI and extracted metadata can be processed and output in multiple languages. The CLI and API responses respect locale headers.
- Responsive Web Dashboard: A sleek, real-time dashboard (built with SvelteKit) to monitor extraction jobs, manage plugins, and explore the normalized data graph visually.
- Privacy-Centric Design: Operates as a user-agent tool. Includes data sanitization modules and guidelines for ethical scraping. Stores no media content, only metadata.
- 24/7 Community & Automated Support: Active community forums and an AI-powered support bot trained on the project's documentation for instant troubleshooting.
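The exponential backoff retry logic mentioned in the feature list can be pictured with a small sketch like the one below. It is not AetherScraper's actual implementation: the function name, the parameters, and the choice of "full jitter" are assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait ceiling after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling grows as base_delay * 2^attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random amount between 0 and the ceiling.
            sleep(random.uniform(0, cap))


calls = {"count": 0}


def flaky_fetch():
    # Hypothetical fetcher that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": "ok"}


waits = []
result = retry_with_backoff(flaky_fetch, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep=waits.append` records the delays instead of actually waiting, which keeps the example fast to run.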
- Node.js 20+ or Python 3.11+
- An API Key (for AI enrichment features, optional)
Download the latest release package:
Extract the package and navigate into the directory.
Using the provided installer:
# Install core dependencies and CLI tool globally
./install.sh --cli
AetherScraper uses a profile.yaml file to define your extraction targets, output format, and AI preferences.
# profile.yaml
project: "Documentary Research 2026"
output:
  format: "ndjson" # JSON Lines for easy streaming
  directory: "./data/aether_output"
  compression: true
plugins:
  enabled:
    - "plugin_educational_stream"
    - "plugin_creative_archive"
  config:
    plugin_educational_stream:
      depth: "shallow" # shallow, deep, complete
      include_comments: false
ai_enrichment:
  provider: "openai" # openai, claude, or none
  api_key_env: "OPENAI_API_KEY" # Your key stored in environment variable
  modules:
    - "generate_summary"
    - "infer_keywords"
    - "categorize_content"
  language: "en"
rate_limiting:
  requests_per_minute: 30
  respect_robots_txt: true
Run a scrape job using your profile and a list of target URLs.
# Basic run
aether-scraper run --profile ./profile.yaml --urls ./url_list.txt
# Run with verbose logging and a specific output tag
aether-scraper run --profile ./profile.yaml --urls ./url_list.txt --verbose --tag "batch_2026_04_01"
# Validate a plugin without running a full job
aether-scraper plugin test plugin_educational_stream --url "https://example-platform.com/item/xyz"
The following Mermaid diagram illustrates the data flow and modular design of AetherScraper:
graph TB
subgraph "Input Layer"
A[CLI / Dashboard] --> B[Job Scheduler]
C[Profile YAML] --> B
D[URL List] --> B
end
subgraph "Orchestration Core"
B --> E{Plugin Router}
E --> F[Plugin: Platform A]
E --> G[Plugin: Platform B]
E --> H[... Plugin: Platform N]
F --> I[Raw Data Fetcher]
G --> I
H --> I
I --> J[Normalization Engine]
J --> K{AI Enrichment?}
end
subgraph "AI Enrichment Layer (Optional)"
K -- Yes --> L[OpenAI API]
K -- Yes --> M[Claude API]
L --> N[Enriched Schema]
M --> N
end
subgraph "Output Layer"
K -- No --> O[Base Schema]
N --> P[Final Data Assembly]
O --> P
P --> Q[(Local JSON/NDJSON)]
P --> R[Elasticsearch]
P --> S[Web Dashboard]
end
style J fill:#e1f5e1
style L fill:#f3e5f5
style M fill:#fff3e0
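To make the flow in the diagram more concrete, here is a minimal sketch of the routing idea: pick a plugin by URL host, normalize its raw output, and optionally pass the record through an enrichment step. Everything here (the PLUGINS registry, the plugin decorator, the host names, the field names) is hypothetical and only mirrors the boxes in the diagram, not the project's real plugin API.

```python
from urllib.parse import urlparse

# Hypothetical plugin registry: host name -> fetch function returning raw data.
PLUGINS = {}


def plugin(host):
    """Register a fetch function as the plugin for one host."""
    def register(func):
        PLUGINS[host] = func
        return func
    return register


@plugin("platform-a.example")
def fetch_a(url):
    return {"headline": "Sample A", "link": url}


@plugin("platform-b.example")
def fetch_b(url):
    return {"name": "Sample B", "href": url}


def normalize(raw):
    # Map each plugin's field names onto one shared shape.
    return {
        "title": raw.get("headline") or raw.get("name"),
        "url": raw.get("link") or raw.get("href"),
    }


def run_job(urls, enrich=None):
    records = []
    for url in urls:
        fetch = PLUGINS.get(urlparse(url).hostname)
        if fetch is None:
            continue  # no plugin claims this host; skip it
        record = normalize(fetch(url))
        if enrich is not None:  # the optional enrichment branch
            record = enrich(record)
        records.append(record)
    return records


job_output = run_job(["https://platform-a.example/item/1", "https://unknown.example/x"])
print(job_output)
```

Keeping the registry keyed by host means a new platform only needs a new decorated function, which is one plausible reading of "without touching the core engine".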
| Feature | Status | Notes |
|---|---|---|
| Core Extraction Engine | Stable | Time-tested across 50+ platform variants. |
| Plugin Ecosystem | Stable | Public registry available. |
| AI Enrichment Modules | Beta | Requires external API keys. |
| Web Dashboard | Stable | Responsive, dark/light mode. |
| GraphQL API Endpoint | Stable | For querying your scraped data. |

| Operating System | Compatibility | Package Manager Support |
|---|---|---|
| Linux (Ubuntu 22.04+, Fedora 38+) | Full | apt, dnf, snap |
| macOS (Ventura 13.0+) | Full | brew, direct binary |
| Windows (10/11, WSL2 recommended) | Full (CLI/API) / Partial (Dashboard) | winget, standalone installer |
| Docker Container | Full | All-in-one image available. |
To unlock semantic metadata generation, configure your API keys.
- Set your keys as environment variables:
  export OPENAI_API_KEY='your-key-here'
  export CLAUDE_API_KEY='your-key-here'
- Enable modules in your profile.yaml:
  ai_enrichment:
    provider: "openai" # or "claude"
    api_key_env: "OPENAI_API_KEY"
    modules:
      - "generate_summary"
      - "infer_keywords"
      - "detect_sentiment"
      - "translate_title"
The engine will pass cleaned, normalized data to the chosen AI model, appending the intelligent insights to the final output under an aether_insights field.
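A minimal sketch of that last step, assuming the profile has already been parsed into a dictionary: resolve the key named by api_key_env, run each enabled module, and attach the results under aether_insights. The helper names (resolve_api_key, enrich, fake_module) are hypothetical, and fake_module stands in for a real model call so the example runs offline.

```python
import os


def resolve_api_key(ai_config, environ=os.environ):
    """Look up the key named by api_key_env; return None when enrichment is off."""
    if ai_config.get("provider", "none") == "none":
        return None
    key = environ.get(ai_config["api_key_env"])
    if not key:
        raise RuntimeError(f"Environment variable {ai_config['api_key_env']} is not set")
    return key


def enrich(record, modules, run_module):
    """Return a copy of record with one aether_insights entry per module."""
    insights = {name: run_module(name, record) for name in modules}
    return {**record, "aether_insights": insights}


def fake_module(name, record):
    # Stand-in for a model call; a real run would query the chosen provider.
    return f"{name} for {record['title']}"


ai_config = {
    "provider": "openai",
    "api_key_env": "OPENAI_API_KEY",
    "modules": ["generate_summary", "infer_keywords"],
}
key = resolve_api_key(ai_config, environ={"OPENAI_API_KEY": "your-key-here"})
enriched = enrich({"title": "Deep Sea Life"}, ai_config["modules"], fake_module)
print(enriched["aether_insights"])
```

Returning a new dictionary rather than mutating the input keeps the base record available for the non-enriched output path.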
AetherScraper is a tool for metadata aggregation and research.
- Intended Use: Academic research, content indexing for personal archives, market analysis, and building public, non-commercial datasets of publicly available metadata.
- Compliance: You are responsible for using this tool in compliance with:
  - Target websites' Terms of Service (robots.txt is respected by default).
  - All applicable local, national, and international laws and regulations, including but not limited to copyright law (DMCA, EUCD) and data protection regulations (GDPR, CCPA).
  - Ethical research guidelines. Do not harvest personal data.
- No Warranty: This software is provided "as-is," without warranty. The maintainers assume no liability for its use or misuse.
- Data Ownership: This tool does not store or distribute actual multimedia content. It processes metadata which may be subject to the copyrights of the original platforms.
By using AetherScraper, you agree to use it ethically, legally, and at your own risk.
This project is licensed under the MIT License.
See the full legal terms in the LICENSE file distributed with the source code. In short, this permissive license allows commercial use, distribution, modification, and private use, provided that the license and copyright notice are included in all copies.
- Documentation: Comprehensive guides are hosted at https://PhilWand.github.io.
- Issue Tracker: Report bugs or request features on our GitHub Issues page.
- Community Forum: Join discussions, share plugins, and get help from other users at https://PhilWand.github.io.
- Automated Support: Our documentation-trained AI helper is available 24/7 in the web dashboard for instant queries.
Ready to transform scattered platform data into a structured knowledge resource?
© 2026 AetherScraper Project Contributors. "AetherScraper" and its logo are project identifiers. All other trademarks are property of their respective owners.