DataFromURL

Transform any URL into structured JSON data using intelligent scraping and AI parsing.

Features

🖥️ Multiple Interfaces

  • Web Playground - Interactive browser-based UI for testing
  • HTTP API - RESTful JSON API for programmatic access
  • Command Line - Full-featured CLI for terminal usage
  • Client Libraries - Example implementations in Node.js, Python, Shell

🚀 Multi-Strategy Scraping

  • Simple HTTP - Fast, efficient fetching for static pages
  • Browser Rendering - Cloudflare Browser API for JavaScript-heavy sites
  • Proxy Support - Bright Data and Apify integration for complex scenarios
  • Smart Fallback - Automatic strategy selection with retry logic (a sketch of the fallback loop follows this list)
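
Conceptually, the orchestrator walks the eligible strategies in order and falls back to the next one when a fetch fails. A minimal sketch of that loop, reusing the strategy interface shown later under "Adding a New Scraper Strategy" (the orchestrator shape here is illustrative; the real selection logic lives in src/scraper/):

// Illustrative fallback loop only -- not the actual orchestrator implementation.
async function fetchWithFallback(strategies, url, options = {}) {
  const errors = [];
  for (const strategy of strategies) {
    if (!strategy.canRun(options)) continue;     // skip strategies that cannot run
    try {
      const result = await strategy.fetch(url, options);
      if (result && result.html) return result;  // first usable result wins
    } catch (error) {
      errors.push(`${strategy.name}: ${error.message}`);
    }
  }
  throw new Error(`All strategies failed: ${errors.join('; ')}`);
}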

🤖 AI-Powered Extraction

  • Cloudflare Workers AI - Fast, cost-effective LLM parsing ($0.01 per 1k tokens)
  • Intelligent Classification - Auto-detect articles, products, or general content
  • Schema-Based Output - Structured JSON tailored to content type
  • Graceful Degradation - Heuristic fallback when AI is unavailable
  • SEO Content Generation - Generate 1000+ word SEO-optimized content following E-E-A-T principles

🎯 Content Type Support

  • Articles - Title, author, date, body, tags, images
  • Products - Name, price, brand, availability, ratings, images
  • General Pages - Headlines, summaries, key points, sections

🔒 Production-Ready

  • Rate Limiting - Protect against abuse (100 req/min default)
  • SSRF Protection - Block access to private/local addresses
  • Request Validation - Input sanitization and security headers
  • Error Handling - Comprehensive error codes and retries
  • Caching - In-memory LRU cache (1 hour TTL)
  • Logging - Structured logging with configurable levels

📊 Monitoring & Observability

  • Health check endpoint with service status
  • Performance metrics and processing times
  • Detailed error reporting with retry hints
  • Cache statistics and hit rates

Quick Start

Prerequisites

  • Node.js 18.18 or higher
  • (Optional) Cloudflare account for AI parsing

Installation

  1. Clone and install:

    git clone <repository-url>
    cd datafromurl
    npm install
  2. Configure environment:

    cp .env.example .env
    # Edit .env and add your Cloudflare credentials (optional)
  3. Start the server:

    npm start

    Server runs on http://localhost:4000

  4. Open Playground (optional):

    Open your browser and visit: http://localhost:4000/

    The interactive Playground allows you to test extractions directly in your browser with a beautiful UI!

Running Tests

npm test                  # Run all tests
npm run test:watch        # Watch mode
npm run test:coverage     # With coverage

All tests use mocked dependencies - no external API calls are made.

Web Playground

DataFromURL includes a beautiful, interactive web playground for testing extractions directly in your browser!

Access Playground

  1. Start the server:

    npm start
  2. Open your browser and visit:

    http://localhost:4000/
    

Features

  • 🎨 Beautiful UI - Modern, responsive design with gradient backgrounds
  • 🚀 Single URL Extraction - Test individual URLs with full option control
  • 📦 Batch Processing - Process multiple URLs at once
  • 📊 Real-time Results - See extracted data, raw responses, and metadata
  • 📋 Metadata Cards - Quick view of key metrics (type, confidence, time, source)
  • 🎯 Example URLs - Pre-loaded examples for quick testing
  • 📱 Mobile Friendly - Works on all devices

Screenshots

Single URL Extraction:

  • Input any URL
  • Configure options (content type, timeout, browser rendering)
  • View structured results in multiple formats

Batch Processing:

  • Process multiple URLs at once
  • View aggregated results
  • Export-ready JSON output

See public/README.md for detailed Playground documentation.

CLI Usage

The CLI provides a convenient way to extract data from URLs directly from your terminal.

Installation

# Install CLI globally
npm install -g datafromurl

# Or link locally
npm link

Basic Usage

# Extract data from a URL
datafromurl https://example.com/article

# Extract product data with browser rendering
datafromurl https://shop.com/product --type product --browser

# Save output to file
datafromurl https://example.com --output result.json

# Process multiple URLs from a file
datafromurl --batch urls.txt --format csv --output results.csv

See CLI.md for complete CLI documentation.

API Usage

Basic Example

curl -X POST http://localhost:4000/api/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'
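
The exact response schema is documented in API.md. As an illustration only, a successful response bundles the extracted data with metadata such as content type, confidence, source strategy, and timing (the field names below are assumptions, not the documented schema):

// Illustrative response shape only -- field names are assumptions; see API.md for the real schema.
const exampleResponse = {
  success: true,
  data: {
    title: 'Example Article',
    author: 'Jane Doe',
    publishedDate: '2024-01-15',
    body: '...',
    tags: ['example'],
  },
  metadata: {
    contentType: 'article',      // article | product | general
    confidence: 0.92,
    source: 'simple-http',       // which scraping strategy produced the HTML
    processingTimeMs: 1240,
    cached: false,
  },
};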

With Options

curl -X POST http://localhost:4000/api/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "options": {
      "contentType": "product",
      "preferBrowser": true,
      "includeMetadata": true
    }
  }'

Batch Extraction

curl -X POST http://localhost:4000/api/extract/batch \
  -H "Content-Type": "application/json" \
  -d '{
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }'

SEO Content Generation

Generate 1000+ word SEO-optimized content following E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) principles:

curl -X POST http://localhost:4000/api/extract/seo \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "keywords": ["product", "review", "guide"],
    "minWords": 1000,
    "includeStatistics": false
  }'

Features:

  • 1000+ words of high-quality, engaging content
  • Elon Musk writing style - Direct, conversational, explains complex topics simply
  • E-E-A-T compliant - Follows Google's quality guidelines
  • SEO optimized - Proper H1/H2/H3 structure with keywords
  • Schema.org markup - Structured data for search engines
  • Meta tags - Complete SEO meta tags (title, description, Open Graph, Twitter Card)

See API.md for complete API documentation.

Client Libraries

Example client implementations are provided in the examples/ directory:

Quick Start with Clients

# Node.js client
node examples/client-nodejs.js

# Python client
python3 examples/client-python.py

# Shell client
bash examples/client-shell.sh
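
If you would rather call the API without the bundled examples, a minimal Node.js client takes only a few lines. The sketch below uses Node 18's built-in fetch and mirrors the curl examples above; the response handling is illustrative:

// Minimal client sketch -- endpoint and payload follow the curl examples above.
async function extract(url, options = {}) {
  const response = await fetch('http://localhost:4000/api/extract', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, options }),
  });
  if (!response.ok) {
    throw new Error(`Extraction failed: HTTP ${response.status}`);
  }
  return response.json();
}

extract('https://example.com/article', { contentType: 'article' })
  .then((result) => console.log(JSON.stringify(result, null, 2)))
  .catch((error) => console.error(error.message));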

Configuration

Environment variables (.env):

# Server
PORT=4000
NODE_ENV=development

# Logging
LOG_LEVEL=info  # debug, info, warn, error
ENABLE_REQUEST_LOGGING=true

# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_MAX_REQUESTS=100
RATE_LIMIT_WINDOW_MS=60000

# Scraping
DEFAULT_USER_AGENT=DataFromURLBot/1.0
SIMPLE_FETCH_TIMEOUT_MS=8000
BROWSER_FETCH_TIMEOUT_MS=20000
MAX_CONTENT_LENGTH=2097152  # 2MB
RESPECT_ROBOTS=true

# Cloudflare (Optional - for AI parsing)
CLOUDFLARE_ACCOUNT_ID=your_account_id
CLOUDFLARE_API_TOKEN=your_api_token
CLOUDFLARE_MODEL=@cf/meta/llama-3.1-8b-instruct
CLOUDFLARE_BROWSER_ENDPOINT=

# AI Settings
AI_MAX_OUTPUT_TOKENS=2048
AI_TEMPERATURE=0.1

# Bright Data (Optional)
BRIGHTDATA_API_TOKEN=
BRIGHTDATA_DC_PROXY=
BRIGHTDATA_SERP_PROXY=

# Apify (Optional)
APIFY_API_TOKEN=
APIFY_DEFAULT_ACTOR=apify/website-content-crawler
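
These variables are read by the configuration loader (src/config.js). A sketch of the kind of env-to-config mapping involved (illustrative only; the project's actual loader may differ):

// Illustrative env-to-config mapping -- not the project's actual src/config.js.
const config = {
  port: Number(process.env.PORT || 4000),
  logLevel: process.env.LOG_LEVEL || 'info',
  rateLimit: {
    enabled: process.env.RATE_LIMIT_ENABLED !== 'false',
    maxRequests: Number(process.env.RATE_LIMIT_MAX_REQUESTS || 100),
    windowMs: Number(process.env.RATE_LIMIT_WINDOW_MS || 60000),
  },
  scraping: {
    userAgent: process.env.DEFAULT_USER_AGENT || 'DataFromURLBot/1.0',
    simpleTimeoutMs: Number(process.env.SIMPLE_FETCH_TIMEOUT_MS || 8000),
    browserTimeoutMs: Number(process.env.BROWSER_FETCH_TIMEOUT_MS || 20000),
  },
};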

Architecture

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────┐
│   HTTP Server           │
│  - Rate Limiter         │
│  - Validator            │
│  - Request Logger       │
└──────┬──────────────────┘
       │
       ▼
┌─────────────────────────┐
│ Extraction Service      │
│  - Cache Layer (LRU)    │
│  - Error Handler        │
└──────┬──────────────────┘
       │
       ├─────────────────┐
       │                 │
       ▼                 ▼
┌──────────────┐   ┌──────────────┐
│   Scraper    │   │  Processing  │
│ Orchestrator │   │   Pipeline   │
└──────┬───────┘   └──────┬───────┘
       │                  │
       ▼                  ▼
┌──────────────┐   ┌──────────────┐
│  Strategies  │   │ HTML Cleaner │
│  - Simple    │   │  - Cheerio   │
│  - Browser   │   │ - Readability│
│  - Proxy     │   │ Classifier   │
│  - Apify     │   │  - Heuristic │
└──────────────┘   └──────┬───────┘
                          │
                          ▼
                   ┌──────────────┐
                   │  AI Parser   │
                   │  - Cloudflare│
                   │  - Prompts   │
                   │  - Retry     │
                   └──────────────┘

Project Structure

datafromurl/
├── src/
│   ├── ai/                    # AI parsing with Cloudflare Workers AI
│   │   ├── cloudflare-client.js
│   │   ├── parser.js
│   │   └── index.js
│   ├── http/                  # HTTP server and routing
│   │   ├── server.js
│   │   └── validators.js
│   ├── processing/            # HTML cleaning and classification
│   │   ├── html-cleaner.js
│   │   ├── content-classifier.js
│   │   └── index.js
│   ├── scraper/              # Multi-strategy scraper
│   │   ├── strategies.js
│   │   └── index.js
│   ├── services/             # High-level extraction service
│   │   └── extraction-service.js
│   ├── utils/                # Utilities
│   │   ├── cache.js          # LRU cache implementation
│   │   ├── logger.js         # Structured logging
│   │   ├── rate-limiter.js   # Rate limiting
│   │   ├── proxy.js          # Proxy configuration
│   │   ├── html.js           # HTML utilities
│   │   └── json.js           # JSON utilities
│   ├── config.js             # Configuration loader
│   ├── errors.js             # Custom error classes
│   ├── index.js              # Module exports
│   └── server.js             # Server entry point
├── tests/                    # Test suites
│   ├── integration.test.js
│   ├── extraction-service.test.js
│   ├── http-server.test.js
│   ├── strategies.test.js
│   └── fixtures/
├── spec/
│   └── openspec.yaml         # OpenAPI specification
├── task/                     # Planning docs
├── .env.example              # Environment template
├── API.md                    # API documentation
├── CLAUDE.md                 # Development guide
└── package.json

Development

Key Design Principles

  1. Graceful Degradation - Works without AI, proxies, or browser rendering
  2. Strategy Pattern - Easy to add new scraping methods
  3. Dependency Injection - All components are testable
  4. Error Recovery - Retry logic with exponential backoff (sketched below)
  5. Security First - SSRF protection, input validation, rate limiting
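
The retry behavior mentioned in point 4 follows the usual exponential backoff pattern. A minimal sketch (the delays, attempt count, and jitter here are illustrative, not the project's exact settings):

// Illustrative retry helper with exponential backoff and jitter.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === attempts - 1) break;                       // out of retries
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}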

Adding a New Scraper Strategy

import { BaseStrategy } from './strategies.js';

export class CustomStrategy extends BaseStrategy {
  constructor(fetcher) {
    super('custom-strategy', fetcher);
  }

  canRun(options = {}) {
    // Return true if this strategy can run
    return Boolean(someCondition);
  }

  async fetch(url, options = {}) {
    // Implement scraping logic
    return {
      source: this.name,
      status: 200,
      finalUrl: url,
      contentType: 'text/html',
      html: htmlContent,
      durationMs: elapsedTime,
      headers: {},
    };
  }
}

Running in Development Mode

npm run dev  # Enables debug logging and request logging

Performance

Benchmarks (approximate)

  • Simple HTTP fetch: ~200-500ms
  • Browser rendering: ~2-5s
  • AI parsing: ~500-2000ms
  • Total (cached): ~50-100ms
  • Total (uncached, simple): ~1-3s
  • Total (uncached, browser): ~3-7s

Optimization Tips

  1. Enable caching for repeated URLs (the cache behavior is sketched below)
  2. Use contentType hint to skip classification
  3. Avoid preferBrowser unless necessary
  4. Set appropriate timeouts
  5. Use batch endpoint for multiple URLs
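
Caching (tip 1) is what brings repeat extractions down to the ~50-100ms range shown in the benchmarks. The cache is an in-memory LRU with a one-hour TTL; a minimal sketch of that behavior (the real implementation is src/utils/cache.js and may differ in details such as key derivation and size limits):

// Illustrative LRU cache with TTL. A Map keeps insertion order, so the first
// key is always the least recently used entry.
class LruCache {
  constructor(maxEntries = 500, ttlMs = 60 * 60 * 1000) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key);                   // expired
      return undefined;
    }
    this.entries.delete(key);                     // re-insert to mark as recently used
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      this.entries.delete(this.entries.keys().next().value);    // evict oldest
    }
  }
}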

Deployment

Production Checklist

  • Set NODE_ENV=production
  • Configure Cloudflare credentials
  • Enable rate limiting (RATE_LIMIT_ENABLED=true)
  • Set up monitoring/alerting
  • Configure reverse proxy (nginx, Cloudflare)
  • Set up SSL/TLS certificates
  • Review and adjust rate limits
  • Configure log aggregation
  • Set up health check monitoring

Docker

# Build image
docker build -t datafromurl .

# Run container
docker run -p 4000:4000 --env-file .env datafromurl

# Or use Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down

See docker-compose.yml for configuration options.

Troubleshooting

Common Issues

Q: AI parsing not working

  • Ensure CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN are set
  • Check API token has Workers AI permissions
  • Review logs for API errors

Q: Browser rendering fails

  • Verify CLOUDFLARE_BROWSER_ENDPOINT is configured
  • Check Cloudflare Browser Rendering API is enabled
  • Falls back to simple HTTP automatically

Q: Rate limiting too strict

  • Adjust RATE_LIMIT_MAX_REQUESTS and RATE_LIMIT_WINDOW_MS
  • Disable for development: RATE_LIMIT_ENABLED=false

Q: Slow performance

  • Enable caching (on by default)
  • Reduce AI_MAX_OUTPUT_TOKENS
  • Lower timeout values
  • Use simple HTTP instead of browser

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass (npm test)
  5. Submit a pull request

License

MIT License - see LICENSE file for details

Support


Built with ❤️ using Cloudflare Workers AI, Node.js, Cheerio, and @mozilla/readability
