PhilWand/lustpress-api-client


🔬 AetherScraper

A modular, intelligent data extraction and normalization engine for multimedia metadata across diverse content platforms.

Download

🧠 Overview

AetherScraper is an API-first orchestration framework designed to gather, normalize, and structure metadata from a wide array of multimedia content platforms. Unlike conventional scrapers, it combines a modular plugin architecture with semantic understanding via integrated AI APIs and produces a unified, queryable data schema, turning fragmented platform data into a cohesive knowledge graph. It's built for researchers, developers building content-aware applications, and data archivists who require consistency from chaos.

Think of it as a "linguistic translator for platform data dialects," converting the idiosyncratic JSON of one site into the same structured language as another, all while enriching the data with contextual intelligence.
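To make the "translation" idea concrete, here is a minimal sketch that maps two differently shaped payloads onto one record type. Everything in it is an assumption made for illustration: the `NormalizedItem` type, its field names, and both payload shapes are hypothetical, and the real AetherSchema is not reproduced here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class NormalizedItem:
    """Hypothetical unified record; not the real AetherSchema."""
    source: str
    item_id: str
    title: str
    duration_seconds: int


def from_platform_a(raw: dict) -> NormalizedItem:
    # Assumed shape: {"id": int, "name": str, "length_ms": int}
    return NormalizedItem(
        source="platform_a",
        item_id=str(raw["id"]),
        title=raw["name"].strip(),
        duration_seconds=raw["length_ms"] // 1000,
    )


def from_platform_b(raw: dict) -> NormalizedItem:
    # Assumed shape: {"uuid": str, "meta": {"title": str, "duration": "MM:SS"}}
    minutes, seconds = raw["meta"]["duration"].split(":")
    return NormalizedItem(
        source="platform_b",
        item_id=raw["uuid"],
        title=raw["meta"]["title"].strip(),
        duration_seconds=int(minutes) * 60 + int(seconds),
    )


a = from_platform_a({"id": 42, "name": " Intro to Optics ", "length_ms": 95000})
b = from_platform_b({"uuid": "b-7", "meta": {"title": "Intro to Optics", "duration": "01:35"}})
```

Both `a` and `b` end up with the same fields and units, so downstream code can treat them identically regardless of where they came from.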


✨ Key Features & Capabilities

  • 🧩 Modular Plugin Architecture: Each supported platform is a separate, hot-swappable plugin. Contribute new ones without touching the core engine.
  • 🤖 AI-Powered Metadata Enrichment: Optional integration with OpenAI's GPT API and Anthropic's Claude API to generate descriptions, tags, summaries, and content classifications, adding a layer of semantic understanding to raw data.
  • 🌐 Unified Normalized Schema: All extracted data conforms to a single, well-defined AetherSchema (JSON Schema/TypeScript), making downstream processing predictable.
  • πŸ” Intelligent Rate Limiting & Resilience: Mimics human interaction patterns, handles CAPTCHAs via proxy services, and features exponential backoff retry logic.
  • 💬 Multilingual Support & Localization: The UI is localized, and extracted metadata can be processed and output in multiple languages. The CLI and API responses respect locale headers.
  • 🎨 Responsive Web Dashboard: A sleek, real-time dashboard (built with SvelteKit) to monitor extraction jobs, manage plugins, and explore the normalized data graph visually.
  • πŸ›‘οΈ Privacy-Centric Design: Operates as a user-agent tool. Includes data sanitization modules and guidelines for ethical scraping. Stores no media content, only metadata.
  • πŸ“ž 24/7 Community & Automated Support: Active community forums and an AI-powered support bot trained on the project's documentation for instant troubleshooting.
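The exponential backoff retry logic mentioned in the list above could look roughly like the following. This is a generic sketch of the technique, not AetherScraper's own code: the function names, the default values, and the choice of "full jitter" are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Un-jittered delay before each retry: base * 2**attempt, capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def retry_with_backoff(
    fn: Callable[[], T],
    retries: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; after each failure wait a jittered, exponentially growing delay."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return fn()
        except Exception:
            # "Full jitter": sleep a random amount in [0, delay] so that many
            # clients retrying at once do not hit the server in lockstep.
            sleep(random.uniform(0, delay))
    return fn()  # final attempt; any exception propagates to the caller
```

With `retries=4` and `base=0.5`, the un-jittered delays are 0.5, 1, 2, and 4 seconds, and `fn` is attempted at most five times. The `sleep` parameter exists only so that the waiting can be replaced in tests.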

🚀 Quick Start

Prerequisites

  • Node.js 20+ or Python 3.11+
  • An API Key (for AI enrichment features, optional)

Installation

Download the latest release package: Download

Extract the package and navigate into the directory.

Using the provided installer:

# Install core dependencies and CLI tool globally
./install.sh --cli

Example Profile Configuration

AetherScraper uses a profile.yaml file to define your extraction targets, output format, and AI preferences.

# profile.yaml
project: "Documentary Research 2026"
output:
  format: "ndjson" # JSON Lines for easy streaming
  directory: "./data/aether_output"
  compression: true

plugins:
  enabled:
    - "plugin_educational_stream"
    - "plugin_creative_archive"
  config:
    plugin_educational_stream:
      depth: "shallow" # shallow, deep, complete
      include_comments: false

ai_enrichment:
  provider: "openai" # openai, claude, or none
  api_key_env: "OPENAI_API_KEY" # Your key stored in environment variable
  modules:
    - "generate_summary"
    - "infer_keywords"
    - "categorize_content"
  language: "en"

rate_limiting:
  requests_per_minute: 30
  respect_robots_txt: true
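As a rough illustration of what the `rate_limiting` settings imply, the sketch below spaces requests evenly to honor `requests_per_minute` and checks a URL against robots.txt rules using Python's standard `urllib.robotparser`. The class and function names, the `aether-scraper` user-agent string, and the example URLs are hypothetical; how AetherScraper implements these settings internally is not documented here.

```python
import time
from urllib.robotparser import RobotFileParser


class MinIntervalLimiter:
    """Spaces calls evenly so at most `requests_per_minute` start per minute."""

    def __init__(self, requests_per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next free slot and return the time slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Reserve the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


limiter = MinIntervalLimiter(requests_per_minute=30)  # one request every 2 seconds
rules = "User-agent: *\nDisallow: /private/\n"
ok = allowed_by_robots(rules, "aether-scraper", "https://example.com/item/xyz")
```

A caller would invoke `limiter.wait()` before each request and skip any URL for which `allowed_by_robots` returns `False`.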

Example Console Invocation

Run a scrape job using your profile and a list of target URLs.

# Basic run
aether-scraper run --profile ./profile.yaml --urls ./url_list.txt

# Run with verbose logging and a specific output tag
aether-scraper run --profile ./profile.yaml --urls ./url_list.txt --verbose --tag "batch_2026_04_01"

# Validate a plugin without running a full job
aether-scraper plugin test plugin_educational_stream --url "https://example-platform.com/item/xyz"

πŸ—ΊοΈ System Architecture

The following Mermaid diagram illustrates the data flow and modular design of AetherScraper:

graph TB
    subgraph "Input Layer"
        A[CLI / Dashboard] --> B[Job Scheduler]
        C[Profile YAML] --> B
        D[URL List] --> B
    end

    subgraph "Orchestration Core"
        B --> E{Plugin Router}
        E --> F[Plugin: Platform A]
        E --> G[Plugin: Platform B]
        E --> H[... Plugin: Platform N]
        F --> I[Raw Data Fetcher]
        G --> I
        H --> I
        I --> J[Normalization Engine]
        J --> K{AI Enrichment?}
    end

    subgraph "AI Enrichment Layer (Optional)"
        K -- Yes --> L[OpenAI API]
        K -- Yes --> M[Claude API]
        L --> N[Enriched Schema]
        M --> N
    end

    subgraph "Output Layer"
        K -- No --> O[Base Schema]
        N --> P[Final Data Assembly]
        O --> P
        P --> Q[(Local JSON/NDJSON)]
        P --> R[Elasticsearch]
        P --> S[Web Dashboard]
    end

    style J fill:#e1f5e1
    style L fill:#f3e5f5
    style M fill:#fff3e0
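The flow in the diagram (router, plugin fetch, normalization, optional enrichment, final assembly) can be sketched in a few lines. This is a toy model under stated assumptions: the `PLUGINS` registry, routing by hostname, the `.example` hosts, and the payload shapes are all hypothetical, and the plugins return canned data instead of fetching anything.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

Record = dict
Plugin = Callable[[str], Record]       # URL -> raw platform payload
Enricher = Callable[[Record], Record]  # normalized record -> insights

# Hypothetical plugins keyed by hostname; real plugins would fetch over HTTP.
PLUGINS: dict[str, Plugin] = {
    "a.example": lambda url: {"name": "Clip A", "src": url},
    "b.example": lambda url: {"meta": {"title": "Clip B"}, "link": url},
}


def route(url: str) -> Plugin:
    """Plugin Router: pick a plugin from the URL's hostname."""
    host = urlparse(url).hostname
    if host not in PLUGINS:
        raise LookupError(f"no plugin registered for {host!r}")
    return PLUGINS[host]


def normalize(raw: Record) -> Record:
    """Normalization Engine: map either payload shape to one base schema."""
    title = raw.get("name") or raw.get("meta", {}).get("title", "")
    return {"title": title, "url": raw.get("src") or raw.get("link")}


def run_job(urls: list[str], enrich: Optional[Enricher] = None) -> list[Record]:
    """Route, fetch, normalize, optionally enrich, then assemble the results."""
    results = []
    for url in urls:
        record = normalize(route(url)(url))
        if enrich is not None:  # the "AI Enrichment?" decision node
            record = {**record, "aether_insights": enrich(record)}
        results.append(record)
    return results
```

Keeping the router as a plain lookup table is what lets a new platform be added without touching `run_job`, which mirrors the hot-swappable plugin idea described earlier.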

📋 Feature Comparison & OS Compatibility

| Feature | Status | Notes |
|---|---|---|
| Core Extraction Engine | ✅ Stable | Time-tested across 50+ platform variants. |
| Plugin Ecosystem | ✅ Stable | Public registry available. |
| AI Enrichment Modules | 🔶 Beta | Requires external API keys. |
| Web Dashboard | ✅ Stable | Responsive, dark/light mode. |
| GraphQL API Endpoint | ✅ Stable | For querying your scraped data. |

| 🖥️ Operating System | Compatibility | Package Manager Support |
|---|---|---|
| Linux (Ubuntu 22.04+, Fedora 38+) | ✅ Full | apt, dnf, snap |
| macOS (Ventura 13.0+) | ✅ Full | brew, direct binary |
| Windows (10/11, WSL2 recommended) | ✅ Full (CLI/API) / 🔶 Partial (Dashboard) | winget, standalone installer |
| Docker Container | ✅ Full | All-in-one image available. |

🔑 Integrating AI Enrichment

To unlock semantic metadata generation, configure your API keys.

  1. Set your keys as environment variables:
    export OPENAI_API_KEY='your-key-here'
    export CLAUDE_API_KEY='your-key-here'
  2. Enable modules in your profile.yaml:
    ai_enrichment:
      provider: "openai" # or "claude"
      api_key_env: "OPENAI_API_KEY"
      modules:
        - "generate_summary"
        - "infer_keywords"
        - "detect_sentiment"
        - "translate_title"

The engine will pass cleaned, normalized data to the chosen AI model, appending the intelligent insights to the final output under an aether_insights field.
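A minimal sketch of that step is shown below, with local stand-in functions in place of real model calls. Only the module names, the `api_key_env` setting, and the `aether_insights` field come from the text above. The `MODULES` registry, the function signatures, and the record fields are assumptions, and no provider API is contacted.

```python
import os
from typing import Callable, Mapping


# Hypothetical stand-ins; a real module would send the record to the provider.
def generate_summary(record: Mapping) -> str:
    return f"{record['title']} ({record['duration_seconds']}s)"


def infer_keywords(record: Mapping) -> list[str]:
    return sorted({word.lower() for word in record["title"].split()})


MODULES: dict[str, Callable[[Mapping], object]] = {
    "generate_summary": generate_summary,
    "infer_keywords": infer_keywords,
}


def resolve_api_key(api_key_env: str, env: Mapping[str, str] = os.environ) -> str:
    """Read the key from the environment variable named in the profile."""
    key = env.get(api_key_env, "")
    if not key:
        raise RuntimeError(f"environment variable {api_key_env} is not set")
    return key


def enrich(record: dict, modules: list[str]) -> dict:
    """Return a copy of record with module outputs under 'aether_insights'."""
    insights = {name: MODULES[name](record) for name in modules}
    return {**record, "aether_insights": insights}


item = {"title": "Intro to Optics", "duration_seconds": 95}
enriched = enrich(item, ["generate_summary", "infer_keywords"])
```

Because `enrich` returns a copy, the normalized base record stays unchanged and the insights remain clearly separated from the extracted fields.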


βš–οΈ Disclaimer & Ethical Use

AetherScraper is a tool for metadata aggregation and research.

  • Intended Use: Academic research, content indexing for personal archives, market analysis, and building public, non-commercial datasets of publicly available metadata.
  • Compliance: You are responsible for using this tool in compliance with:
    • Target websites' Terms of Service (robots.txt is respected by default).
    • All applicable local, national, and international laws and regulations, including but not limited to copyright law (DMCA, EUCD) and data protection regulations (GDPR, CCPA).
    • Ethical research guidelines. Do not harvest personal data.
  • No Warranty: This software is provided "as-is," without warranty. The maintainers assume no liability for its use or misuse.
  • Data Ownership: This tool does not store or distribute actual multimedia content. It processes metadata which may be subject to the copyrights of the original platforms.

By using AetherScraper, you agree to use it ethically, legally, and at your own risk.


📄 License

This project is licensed under the MIT License.

See the full legal terms in the LICENSE file distributed with the source code. In short, this permissive license allows commercial use, distribution, modification, and private use, provided that the license and copyright notice are included in all copies.


🆘 Support & Community

  • 📚 Documentation: Comprehensive guides are hosted at https://PhilWand.github.io.
  • 🐛 Issue Tracker: Report bugs or request features on our GitHub Issues page.
  • 💬 Community Forum: Join discussions, share plugins, and get help from other users at https://PhilWand.github.io.
  • 🤖 Automated Support: Our documentation-trained AI helper is available 24/7 in the web dashboard for instant queries.

Ready to transform scattered platform data into a structured knowledge resource?

Download


© 2026 AetherScraper Project Contributors. "AetherScraper" and its logo are project identifiers. All other trademarks are property of their respective owners.
