DataFromURL

Transform any URL into structured JSON data using intelligent scraping and AI parsing.

Features

🖥️ Multiple Interfaces

  • Web Playground - Interactive browser-based UI for testing
  • HTTP API - RESTful JSON API for programmatic access
  • Command Line - Full-featured CLI for terminal usage
  • Client Libraries - Example implementations in Node.js, Python, Shell

🚀 Multi-Strategy Scraping

  • Simple HTTP - Fast, efficient fetching for static pages
  • Browser Rendering - Cloudflare Browser API for JavaScript-heavy sites
  • Proxy Support - Bright Data and Apify integration for complex scenarios
  • Smart Fallback - Automatic strategy selection with retry logic (a sketch of the fallback loop follows this list)
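
Conceptually, the orchestrator walks the eligible strategies in order and falls back to the next one when a fetch fails. A minimal sketch of that loop, reusing the strategy interface shown later under "Adding a New Scraper Strategy" (the orchestrator shape here is illustrative; the real selection logic lives in src/scraper/):

// Illustrative fallback loop only -- not the actual orchestrator implementation.
async function fetchWithFallback(strategies, url, options = {}) {
  const errors = [];
  for (const strategy of strategies) {
    if (!strategy.canRun(options)) continue;     // skip strategies that cannot run
    try {
      const result = await strategy.fetch(url, options);
      if (result && result.html) return result;  // first usable result wins
    } catch (error) {
      errors.push(`${strategy.name}: ${error.message}`);
    }
  }
  throw new Error(`All strategies failed: ${errors.join('; ')}`);
}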

🤖 AI-Powered Extraction

  • Cloudflare Workers AI - Fast, cost-effective LLM parsing ($0.01 per 1k tokens)
  • Intelligent Classification - Auto-detect articles, products, or general content
  • Schema-Based Output - Structured JSON tailored to content type
  • Graceful Degradation - Heuristic fallback when AI is unavailable
  • SEO Content Generation - Generate 1000+ word SEO-optimized content following E-E-A-T principles

🎯 Content Type Support

  • Articles - Title, author, date, body, tags, images
  • Products - Name, price, brand, availability, ratings, images
  • General Pages - Headlines, summaries, key points, sections

🔒 Production-Ready

  • Rate Limiting - Protect against abuse (100 req/min default)
  • SSRF Protection - Block access to private/local addresses
  • Request Validation - Input sanitization and security headers
  • Error Handling - Comprehensive error codes and retries
  • Caching - In-memory LRU cache (1 hour TTL)
  • Logging - Structured logging with configurable levels

📊 Monitoring & Observability

  • Health check endpoint with service status
  • Performance metrics and processing times
  • Detailed error reporting with retry hints
  • Cache statistics and hit rates

Quick Start

Prerequisites

  • Node.js 18.18 or higher
  • (Optional) Cloudflare account for AI parsing

Installation

  1. Clone and install:

    git clone <repository-url>
    cd datafromurl
    npm install
  2. Configure environment:

    cp .env.example .env
    # Edit .env and add your Cloudflare credentials (optional)
  3. Start the server:

    npm start

    Server runs on http://localhost:4000

  4. Open Playground (optional):

    Open your browser and visit: http://localhost:4000/

    The interactive Playground allows you to test extractions directly in your browser with a beautiful UI!

Running Tests

npm test                  # Run all tests
npm run test:watch        # Watch mode
npm run test:coverage     # With coverage

All tests use mocked dependencies - no external API calls are made.

Web Playground

DataFromURL includes a beautiful, interactive web playground for testing extractions directly in your browser!

Access Playground

  1. Start the server:

    npm start
  2. Open your browser and visit:

    http://localhost:4000/
    

Features

  • 🎨 Beautiful UI - Modern, responsive design with gradient backgrounds
  • 🚀 Single URL Extraction - Test individual URLs with full option control
  • 📦 Batch Processing - Process multiple URLs at once
  • 📊 Real-time Results - See extracted data, raw responses, and metadata
  • 📋 Metadata Cards - Quick view of key metrics (type, confidence, time, source)
  • 🎯 Example URLs - Pre-loaded examples for quick testing
  • 📱 Mobile Friendly - Works on all devices

Screenshots

Single URL Extraction:

  • Input any URL
  • Configure options (content type, timeout, browser rendering)
  • View structured results in multiple formats

Batch Processing:

  • Process multiple URLs at once
  • View aggregated results
  • Export-ready JSON output

See public/README.md for detailed Playground documentation.

CLI Usage

The CLI provides a convenient way to extract data from URLs directly from your terminal.

Installation

# Install CLI globally
npm install -g datafromurl

# Or link locally
npm link

Basic Usage

# Extract data from a URL
datafromurl https://example.com/article

# Extract product data with browser rendering
datafromurl https://shop.com/product --type product --browser

# Save output to file
datafromurl https://example.com --output result.json

# Process multiple URLs from a file
datafromurl --batch urls.txt --format csv --output results.csv

See CLI.md for complete CLI documentation.

API Usage

Basic Example

curl -X POST http://localhost:4000/api/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'
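
The exact response schema is documented in API.md. As an illustration only, a successful response bundles the extracted data with metadata such as content type, confidence, source strategy, and timing (the field names below are assumptions, not the documented schema):

// Illustrative response shape only -- field names are assumptions; see API.md for the real schema.
const exampleResponse = {
  success: true,
  data: {
    title: 'Example Article',
    author: 'Jane Doe',
    publishedDate: '2024-01-15',
    body: '...',
    tags: ['example'],
  },
  metadata: {
    contentType: 'article',      // article | product | general
    confidence: 0.92,
    source: 'simple-http',       // which scraping strategy produced the HTML
    processingTimeMs: 1240,
    cached: false,
  },
};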

With Options

curl -X POST http://localhost:4000/api/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "options": {
      "contentType": "product",
      "preferBrowser": true,
      "includeMetadata": true
    }
  }'

Batch Extraction

curl -X POST http://localhost:4000/api/extract/batch \
  -H "Content-Type": "application/json" \
  -d '{
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }'

SEO Content Generation

Generate 1000+ word SEO-optimized content following E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) principles:

curl -X POST http://localhost:4000/api/extract/seo \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "keywords": ["product", "review", "guide"],
    "minWords": 1000,
    "includeStatistics": false
  }'

Features:

  • 1000+ words of high-quality, engaging content
  • Elon Musk writing style - Direct, conversational, explains complex topics simply
  • E-E-A-T compliant - Follows Google's quality guidelines
  • SEO optimized - Proper H1/H2/H3 structure with keywords
  • Schema.org markup - Structured data for search engines
  • Meta tags - Complete SEO meta tags (title, description, Open Graph, Twitter Card)

See API.md for complete API documentation.

Client Libraries

Example client implementations are provided in the examples/ directory:

Quick Start with Clients

# Node.js client
node examples/client-nodejs.js

# Python client
python3 examples/client-python.py

# Shell client
bash examples/client-shell.sh
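
If you would rather call the API without the bundled examples, a minimal Node.js client takes only a few lines. The sketch below uses Node 18's built-in fetch and mirrors the curl examples above; the response handling is illustrative:

// Minimal client sketch -- endpoint and payload follow the curl examples above.
async function extract(url, options = {}) {
  const response = await fetch('http://localhost:4000/api/extract', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, options }),
  });
  if (!response.ok) {
    throw new Error(`Extraction failed: HTTP ${response.status}`);
  }
  return response.json();
}

extract('https://example.com/article', { contentType: 'article' })
  .then((result) => console.log(JSON.stringify(result, null, 2)))
  .catch((error) => console.error(error.message));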

Configuration

Environment variables (.env):

# Server
PORT=4000
NODE_ENV=development

# Logging
LOG_LEVEL=info  # debug, info, warn, error
ENABLE_REQUEST_LOGGING=true

# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_MAX_REQUESTS=100
RATE_LIMIT_WINDOW_MS=60000

# Scraping
DEFAULT_USER_AGENT=DataFromURLBot/1.0
SIMPLE_FETCH_TIMEOUT_MS=8000
BROWSER_FETCH_TIMEOUT_MS=20000
MAX_CONTENT_LENGTH=2097152  # 2MB
RESPECT_ROBOTS=true

# Cloudflare (Optional - for AI parsing)
CLOUDFLARE_ACCOUNT_ID=your_account_id
CLOUDFLARE_API_TOKEN=your_api_token
CLOUDFLARE_MODEL=@cf/meta/llama-3.1-8b-instruct
CLOUDFLARE_BROWSER_ENDPOINT=

# AI Settings
AI_MAX_OUTPUT_TOKENS=2048
AI_TEMPERATURE=0.1

# Bright Data (Optional)
BRIGHTDATA_API_TOKEN=
BRIGHTDATA_DC_PROXY=
BRIGHTDATA_SERP_PROXY=

# Apify (Optional)
APIFY_API_TOKEN=
APIFY_DEFAULT_ACTOR=apify/website-content-crawler
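
These variables are read by the configuration loader (src/config.js). A sketch of the kind of env-to-config mapping involved (illustrative only; the project's actual loader may differ):

// Illustrative env-to-config mapping -- not the project's actual src/config.js.
const config = {
  port: Number(process.env.PORT || 4000),
  logLevel: process.env.LOG_LEVEL || 'info',
  rateLimit: {
    enabled: process.env.RATE_LIMIT_ENABLED !== 'false',
    maxRequests: Number(process.env.RATE_LIMIT_MAX_REQUESTS || 100),
    windowMs: Number(process.env.RATE_LIMIT_WINDOW_MS || 60000),
  },
  scraping: {
    userAgent: process.env.DEFAULT_USER_AGENT || 'DataFromURLBot/1.0',
    simpleTimeoutMs: Number(process.env.SIMPLE_FETCH_TIMEOUT_MS || 8000),
    browserTimeoutMs: Number(process.env.BROWSER_FETCH_TIMEOUT_MS || 20000),
  },
};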

Architecture

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────┐
│   HTTP Server           │
│  - Rate Limiter         │
│  - Validator            │
│  - Request Logger       │
└──────┬──────────────────┘
       │
       ▼
┌─────────────────────────┐
│ Extraction Service      │
│  - Cache Layer (LRU)    │
│  - Error Handler        │
└──────┬──────────────────┘
       │
       ├─────────────────┐
       │                 │
       ▼                 ▼
┌──────────────┐   ┌──────────────┐
│   Scraper    │   │  Processing  │
│ Orchestrator │   │   Pipeline   │
└──────┬───────┘   └──────┬───────┘
       │                  │
       ▼                  ▼
┌──────────────┐   ┌──────────────┐
│  Strategies  │   │ HTML Cleaner │
│  - Simple    │   │  - Cheerio   │
│  - Browser   │   │ - Readability│
│  - Proxy     │   │ Classifier   │
│  - Apify     │   │  - Heuristic │
└──────────────┘   └──────┬───────┘
                          │
                          ▼
                   ┌──────────────┐
                   │  AI Parser   │
                   │  - Cloudflare│
                   │  - Prompts   │
                   │  - Retry     │
                   └──────────────┘

Project Structure

datafromurl/
├── src/
│   ├── ai/                    # AI parsing with Cloudflare Workers AI
│   │   ├── cloudflare-client.js
│   │   ├── parser.js
│   │   └── index.js
│   ├── http/                  # HTTP server and routing
│   │   ├── server.js
│   │   └── validators.js
│   ├── processing/            # HTML cleaning and classification
│   │   ├── html-cleaner.js
│   │   ├── content-classifier.js
│   │   └── index.js
│   ├── scraper/              # Multi-strategy scraper
│   │   ├── strategies.js
│   │   └── index.js
│   ├── services/             # High-level extraction service
│   │   └── extraction-service.js
│   ├── utils/                # Utilities
│   │   ├── cache.js          # LRU cache implementation
│   │   ├── logger.js         # Structured logging
│   │   ├── rate-limiter.js   # Rate limiting
│   │   ├── proxy.js          # Proxy configuration
│   │   ├── html.js           # HTML utilities
│   │   └── json.js           # JSON utilities
│   ├── config.js             # Configuration loader
│   ├── errors.js             # Custom error classes
│   ├── index.js              # Module exports
│   └── server.js             # Server entry point
├── tests/                    # Test suites
│   ├── integration.test.js
│   ├── extraction-service.test.js
│   ├── http-server.test.js
│   ├── strategies.test.js
│   └── fixtures/
├── spec/
│   └── openspec.yaml         # OpenAPI specification
├── task/                     # Planning docs
├── .env.example              # Environment template
├── API.md                    # API documentation
├── CLAUDE.md                 # Development guide
└── package.json

Development

Key Design Principles

  1. Graceful Degradation - Works without AI, proxies, or browser rendering
  2. Strategy Pattern - Easy to add new scraping methods
  3. Dependency Injection - All components are testable
  4. Error Recovery - Retry logic with exponential backoff (sketched below)
  5. Security First - SSRF protection, input validation, rate limiting
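
The retry behavior mentioned in point 4 follows the usual exponential backoff pattern. A minimal sketch (the delays, attempt count, and jitter here are illustrative, not the project's exact settings):

// Illustrative retry helper with exponential backoff and jitter.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === attempts - 1) break;                       // out of retries
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}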

Adding a New Scraper Strategy

import { BaseStrategy } from './strategies.js';

export class CustomStrategy extends BaseStrategy {
  constructor(fetcher) {
    super('custom-strategy', fetcher);
  }

  canRun(options = {}) {
    // Return true if this strategy can run
    return Boolean(someCondition);
  }

  async fetch(url, options = {}) {
    // Implement scraping logic
    return {
      source: this.name,
      status: 200,
      finalUrl: url,
      contentType: 'text/html',
      html: htmlContent,
      durationMs: elapsedTime,
      headers: {},
    };
  }
}

Running in Development Mode

npm run dev  # Enables debug logging and request logging

Performance

Benchmarks (approximate)

  • Simple HTTP fetch: ~200-500ms
  • Browser rendering: ~2-5s
  • AI parsing: ~500-2000ms
  • Total (cached): ~50-100ms
  • Total (uncached, simple): ~1-3s
  • Total (uncached, browser): ~3-7s

Optimization Tips

  1. Enable caching for repeated URLs (the cache behavior is sketched below)
  2. Use contentType hint to skip classification
  3. Avoid preferBrowser unless necessary
  4. Set appropriate timeouts
  5. Use batch endpoint for multiple URLs
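
Caching (tip 1) is what brings repeat extractions down to the ~50-100ms range shown in the benchmarks. The cache is an in-memory LRU with a one-hour TTL; a minimal sketch of that behavior (the real implementation is src/utils/cache.js and may differ in details such as key derivation and size limits):

// Illustrative LRU cache with TTL. A Map keeps insertion order, so the first
// key is always the least recently used entry.
class LruCache {
  constructor(maxEntries = 500, ttlMs = 60 * 60 * 1000) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key);                   // expired
      return undefined;
    }
    this.entries.delete(key);                     // re-insert to mark as recently used
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      this.entries.delete(this.entries.keys().next().value);    // evict oldest
    }
  }
}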

Deployment

Production Checklist

  • Set NODE_ENV=production
  • Configure Cloudflare credentials
  • Enable rate limiting (RATE_LIMIT_ENABLED=true)
  • Set up monitoring/alerting
  • Configure reverse proxy (nginx, Cloudflare)
  • Set up SSL/TLS certificates
  • Review and adjust rate limits
  • Configure log aggregation
  • Set up health check monitoring

Docker

# Build image
docker build -t datafromurl .

# Run container
docker run -p 4000:4000 --env-file .env datafromurl

# Or use Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down

See docker-compose.yml for configuration options.

Troubleshooting

Common Issues

Q: AI parsing not working

  • Ensure CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN are set
  • Check API token has Workers AI permissions
  • Review logs for API errors

Q: Browser rendering fails

  • Verify CLOUDFLARE_BROWSER_ENDPOINT is configured
  • Check Cloudflare Browser Rendering API is enabled
  • Falls back to simple HTTP automatically

Q: Rate limiting too strict

  • Adjust RATE_LIMIT_MAX_REQUESTS and RATE_LIMIT_WINDOW_MS
  • Disable for development: RATE_LIMIT_ENABLED=false

Q: Slow performance

  • Enable caching (on by default)
  • Reduce AI_MAX_OUTPUT_TOKENS
  • Lower timeout values
  • Use simple HTTP instead of browser

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass (npm test)
  5. Submit a pull request

License

MIT License - see LICENSE file for details

Support


Built with ❤️ using Cloudflare Workers AI, Node.js, Cheerio, and @mozilla/readability
