Transform any URL into structured JSON data using intelligent scraping and AI parsing.
- Web Playground - Interactive browser-based UI for testing
- HTTP API - RESTful JSON API for programmatic access
- Command Line - Full-featured CLI for terminal usage
- Client Libraries - Example implementations in Node.js, Python, Shell
- Simple HTTP - Fast, efficient fetching for static pages
- Browser Rendering - Cloudflare Browser API for JavaScript-heavy sites
- Proxy Support - Bright Data and Apify integration for complex scenarios
- Smart Fallback - Automatic strategy selection with retry logic (see the sketch below)
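In spirit, the fallback loop is simple: try each strategy in order and move on when one fails. Here is a minimal sketch (illustrative only; the real orchestration lives in src/scraper/index.js, and `canRun`/`fetch` match the strategy interface shown later in this README):

```js
// Minimal smart-fallback sketch (illustrative; the real orchestrator
// lives in src/scraper/index.js).
async function fetchWithFallback(url, strategies, options = {}) {
  const failures = [];
  for (const strategy of strategies) {       // e.g. simple -> browser -> proxy -> apify
    if (!strategy.canRun(options)) continue; // skip strategies missing credentials
    try {
      return await strategy.fetch(url, options);
    } catch (err) {
      failures.push(`${strategy.name}: ${err.message}`); // record the failure, try the next
    }
  }
  throw new Error(`All strategies failed (${failures.join('; ')})`);
}
```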
- Cloudflare Workers AI - Fast, cost-effective LLM parsing ($0.01 per 1k tokens)
- Intelligent Classification - Auto-detect articles, products, or general content
- Schema-Based Output - Structured JSON tailored to content type
- Graceful Degradation - Heuristic fallback when AI is unavailable
- SEO Content Generation - Generate 1000+ word SEO-optimized content following E-E-A-T principles
- Articles - Title, author, date, body, tags, images
- Products - Name, price, brand, availability, ratings, images (sample output below)
- General Pages - Headlines, summaries, key points, sections
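Concretely, a product page might come back shaped roughly like this (keys inferred from the field lists above; the values are made up, and API.md has the authoritative schema):

```json
{
  "type": "product",
  "name": "Acme Widget",
  "price": "19.99",
  "brand": "Acme",
  "availability": "in_stock",
  "ratings": { "value": 4.5, "count": 128 },
  "images": ["https://example.com/widget.jpg"]
}
```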
- Rate Limiting - Protect against abuse (100 req/min default)
- SSRF Protection - Block access to private/local addresses (see the sketch after this list)
- Request Validation - Input sanitization and security headers
- Error Handling - Comprehensive error codes and retries
- Caching - In-memory LRU cache (1 hour TTL)
- Logging - Structured logging with configurable levels
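The SSRF guard boils down to resolving each target host and refusing private or loopback addresses. A minimal sketch (the project's real checks live in its validators and may differ):

```js
// SSRF guard sketch: resolve the host and reject private/loopback IPv4
// ranges (IPv6 and link-local ranges omitted for brevity).
import { lookup } from 'node:dns/promises';
import { isIP } from 'node:net';

function isPrivateIPv4(ip) {
  const [a, b] = ip.split('.').map(Number);
  return a === 10 || a === 127 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

async function assertPublicUrl(rawUrl) {
  const { hostname } = new URL(rawUrl);
  const address = isIP(hostname) ? hostname : (await lookup(hostname)).address;
  if (isIP(address) === 4 && isPrivateIPv4(address)) {
    throw new Error(`Blocked private/local address: ${address}`);
  }
}
```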
- Health check endpoint with service status (example below)
- Performance metrics and processing times
- Detailed error reporting with retry hints
- Cache statistics and hit rates
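For example, polling the service from Node 18+ (the /health path is an assumption here; API.md documents the exact route and response shape):

```js
// Health probe sketch (the /health route is an assumption; see API.md).
const res = await fetch('http://localhost:4000/health');
console.log(res.status, await res.json()); // service status, metrics, cache stats
```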
- Node.js 18.18 or higher
- (Optional) Cloudflare account for AI parsing
1. Clone and install:

   ```bash
   git clone <repository-url>
   cd datafromurl
   npm install
   ```

2. Configure environment:

   ```bash
   cp .env.example .env
   # Edit .env and add your Cloudflare credentials (optional)
   ```

3. Start the server:

   ```bash
   npm start
   ```

   The server runs on http://localhost:4000.

4. Open the Playground (optional):

   Open your browser and visit http://localhost:4000/. The interactive Playground allows you to test extractions directly in your browser with a beautiful UI!
```bash
npm test               # Run all tests
npm run test:watch     # Watch mode
npm run test:coverage  # With coverage
```

All tests use mocked dependencies, so no external API calls are made (see the sketch below).
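As a flavor of the mocked approach, here is a minimal sketch using Node's built-in node:test runner and a stub fetcher (the actual suites in tests/ may use a different framework and fixtures):

```js
import test from 'node:test';
import assert from 'node:assert/strict';

// Stub fetcher so no network call is made (illustrative; the real
// fixtures live in tests/fixtures/).
const fakeFetcher = async () => ({
  status: 200,
  html: '<html><body><h1>Hello</h1></body></html>',
});

test('stubbed fetch returns HTML without touching the network', async () => {
  const result = await fakeFetcher('https://example.com');
  assert.equal(result.status, 200);
  assert.match(result.html, /Hello/);
});
```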
DataFromURL includes a beautiful, interactive web playground for testing extractions directly in your browser!
1. Start the server:

   ```bash
   npm start
   ```

2. Open your browser and visit http://localhost:4000/.
- 🎨 Beautiful UI - Modern, responsive design with gradient backgrounds
- 🚀 Single URL Extraction - Test individual URLs with full option control
- 📦 Batch Processing - Process multiple URLs at once
- 📊 Real-time Results - See extracted data, raw responses, and metadata
- 📋 Metadata Cards - Quick view of key metrics (type, confidence, time, source)
- 🎯 Example URLs - Pre-loaded examples for quick testing
- 📱 Mobile Friendly - Works on all devices
Single URL Extraction:
- Input any URL
- Configure options (content type, timeout, browser rendering)
- View structured results in multiple formats
Batch Processing:
- Process multiple URLs at once
- View aggregated results
- Export-ready JSON output
See public/README.md for detailed Playground documentation.
The CLI provides a convenient way to extract data from URLs directly from the command line.
```bash
# Install CLI globally
npm install -g datafromurl

# Or link locally
npm link
```

```bash
# Extract data from a URL
datafromurl https://example.com/article

# Extract product data with browser rendering
datafromurl https://shop.com/product --type product --browser

# Save output to file
datafromurl https://example.com --output result.json

# Process multiple URLs from a file
datafromurl --batch urls.txt --format csv --output results.csv
```

See CLI.md for complete CLI documentation.
Basic extraction:

```bash
curl -X POST http://localhost:4000/api/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'
```

Extraction with options:

```bash
curl -X POST http://localhost:4000/api/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "options": {
      "contentType": "product",
      "preferBrowser": true,
      "includeMetadata": true
    }
  }'
```

Batch extraction:

```bash
curl -X POST http://localhost:4000/api/extract/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }'
```

Generate 1000+ word SEO-optimized content following E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) principles:

```bash
curl -X POST http://localhost:4000/api/extract/seo \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "keywords": ["product", "review", "guide"],
    "minWords": 1000,
    "includeStatistics": false
  }'
```

Features:
- 1000+ words of high-quality, engaging content
- Elon Musk writing style - Direct, conversational, explains complex topics simply
- E-E-A-T compliant - Follows Google's quality guidelines
- SEO optimized - Proper H1/H2/H3 structure with keywords
- Schema.org markup - Structured data for search engines
- Meta tags - Complete SEO meta tags (title, description, Open Graph, Twitter Card)
See API.md for complete API documentation.
Example client implementations are provided in the examples/ directory (a minimal inline sketch follows the list):
- Node.js: examples/client-nodejs.js - Full-featured client with retry logic
- Python: examples/client-python.py - Complete Python implementation
- Shell: examples/client-shell.sh - Shell script with batch processing
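For a quick programmatic call without pulling in the example clients, Node 18's built-in fetch is enough. A minimal sketch (request shape as in the curl examples above; the retry logic from examples/client-nodejs.js is omitted):

```js
// Minimal client sketch using Node 18+ global fetch (see
// examples/client-nodejs.js for a full retry-capable client).
async function extract(url, options = {}) {
  const res = await fetch('http://localhost:4000/api/extract', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, options }),
  });
  if (!res.ok) throw new Error(`Extraction failed: HTTP ${res.status}`);
  return res.json();
}

const result = await extract('https://example.com/article', { contentType: 'article' });
console.log(result);
```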
```bash
# Node.js client
node examples/client-nodejs.js

# Python client
python3 examples/client-python.py

# Shell client
bash examples/client-shell.sh
```

Environment variables (.env):

```bash
# Server
PORT=4000
NODE_ENV=development
# Logging
LOG_LEVEL=info # debug, info, warn, error
ENABLE_REQUEST_LOGGING=true
# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_MAX_REQUESTS=100
RATE_LIMIT_WINDOW_MS=60000
# Scraping
DEFAULT_USER_AGENT=DataFromURLBot/1.0
SIMPLE_FETCH_TIMEOUT_MS=8000
BROWSER_FETCH_TIMEOUT_MS=20000
MAX_CONTENT_LENGTH=2097152 # 2MB
RESPECT_ROBOTS=true
# Cloudflare (Optional - for AI parsing)
CLOUDFLARE_ACCOUNT_ID=your_account_id
CLOUDFLARE_API_TOKEN=your_api_token
CLOUDFLARE_MODEL=@cf/meta/llama-3.1-8b-instruct
CLOUDFLARE_BROWSER_ENDPOINT=
# AI Settings
AI_MAX_OUTPUT_TOKENS=2048
AI_TEMPERATURE=0.1
# Bright Data (Optional)
BRIGHTDATA_API_TOKEN=
BRIGHTDATA_DC_PROXY=
BRIGHTDATA_SERP_PROXY=
# Apify (Optional)
APIFY_API_TOKEN=
APIFY_DEFAULT_ACTOR=apify/website-content-crawler
```

```
┌─────────────┐
│ Client │
└──────┬──────┘
│
▼
┌─────────────────────────┐
│ HTTP Server │
│ - Rate Limiter │
│ - Validator │
│ - Request Logger │
└──────┬──────────────────┘
│
▼
┌─────────────────────────┐
│ Extraction Service │
│ - Cache Layer (LRU) │
│ - Error Handler │
└──────┬──────────────────┘
│
├─────────────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Scraper │ │ Processing │
│ Orchestrator │ │ Pipeline │
└──────┬───────┘ └──────┬───────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Strategies │ │ HTML Cleaner │
│ - Simple │ │ - Cheerio │
│ - Browser │ │ - Readability│
│ - Proxy │ │ Classifier │
│ - Apify │ │ - Heuristic │
└──────────────┘ └──────┬───────┘
│
▼
┌──────────────┐
│ AI Parser │
│ - Cloudflare│
│ - Prompts │
│ - Retry │
└──────────────┘
```

```
datafromurl/
├── src/
│ ├── ai/ # AI parsing with Cloudflare Workers AI
│ │ ├── cloudflare-client.js
│ │ ├── parser.js
│ │ └── index.js
│ ├── http/ # HTTP server and routing
│ │ ├── server.js
│ │ └── validators.js
│ ├── processing/ # HTML cleaning and classification
│ │ ├── html-cleaner.js
│ │ ├── content-classifier.js
│ │ └── index.js
│ ├── scraper/ # Multi-strategy scraper
│ │ ├── strategies.js
│ │ └── index.js
│ ├── services/ # High-level extraction service
│ │ └── extraction-service.js
│ ├── utils/ # Utilities
│ │ ├── cache.js # LRU cache implementation
│ │ ├── logger.js # Structured logging
│ │ ├── rate-limiter.js # Rate limiting
│ │ ├── proxy.js # Proxy configuration
│ │ ├── html.js # HTML utilities
│ │ └── json.js # JSON utilities
│ ├── config.js # Configuration loader
│ ├── errors.js # Custom error classes
│ ├── index.js # Module exports
│ └── server.js # Server entry point
├── tests/ # Test suites
│ ├── integration.test.js
│ ├── extraction-service.test.js
│ ├── http-server.test.js
│ ├── strategies.test.js
│ └── fixtures/
├── spec/
│ └── openspec.yaml # OpenAPI specification
├── task/ # Planning docs
├── .env.example # Environment template
├── API.md # API documentation
├── CLAUDE.md # Development guide
└── package.json
```
- Graceful Degradation - Works without AI, proxies, or browser rendering
- Strategy Pattern - Easy to add new scraping methods
- Dependency Injection - All components are testable
- Error Recovery - Retry logic with exponential backoff (see the sketch below)
- Security First - SSRF protection, input validation, rate limiting
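As an illustration of the error-recovery principle, here is a generic retry-with-exponential-backoff sketch (not the project's exact implementation, which lives under src/):

```js
// Generic exponential-backoff retry sketch (illustrative only).
async function withRetry(fn, { retries = 3, baseDelayMs = 250 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delayMs = baseDelayMs * 2 ** attempt; // 250ms, 500ms, 1000ms, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```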
To add a new scraping strategy, extend BaseStrategy:

```js
import { BaseStrategy } from './strategies.js';

export class CustomStrategy extends BaseStrategy {
  constructor(fetcher) {
    super('custom-strategy', fetcher);
  }

  canRun(options = {}) {
    // Return true if this strategy can run (e.g. required credentials are present)
    return Boolean(someCondition);
  }

  async fetch(url, options = {}) {
    // Implement scraping logic here and return the standard result shape
    return {
      source: this.name,
      status: 200,
      finalUrl: url,
      contentType: 'text/html',
      html: htmlContent,
      durationMs: elapsedTime,
      headers: {},
    };
  }
}
```

```bash
npm run dev  # Enables debug logging and request logging
```

Typical processing times:

- Simple HTTP fetch: ~200-500ms
- Browser rendering: ~2-5s
- AI parsing: ~500-2000ms
- Total (cached): ~50-100ms
- Total (uncached, simple): ~1-3s
- Total (uncached, browser): ~3-7s
- Enable caching for repeated URLs
- Use the `contentType` hint to skip classification
- Avoid `preferBrowser` unless necessary
- Set appropriate timeouts
- Use the batch endpoint for multiple URLs
- Set `NODE_ENV=production`
- Configure Cloudflare credentials
- Enable rate limiting (`RATE_LIMIT_ENABLED=true`)
- Set up monitoring/alerting
- Configure reverse proxy (nginx, Cloudflare)
- Set up SSL/TLS certificates
- Review and adjust rate limits
- Configure log aggregation
- Set up health check monitoring
```bash
# Build image
docker build -t datafromurl .

# Run container
docker run -p 4000:4000 --env-file .env datafromurl

# Or use Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down
```

See docker-compose.yml for configuration options.
Q: AI parsing not working

- Ensure `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` are set
- Check that the API token has Workers AI permissions
- Review logs for API errors

Q: Browser rendering fails

- Verify `CLOUDFLARE_BROWSER_ENDPOINT` is configured
- Check that the Cloudflare Browser Rendering API is enabled
- The service falls back to simple HTTP automatically

Q: Rate limiting too strict

- Adjust `RATE_LIMIT_MAX_REQUESTS` and `RATE_LIMIT_WINDOW_MS`
- Disable it for development: `RATE_LIMIT_ENABLED=false`

Q: Slow performance

- Enable caching (on by default)
- Reduce `AI_MAX_OUTPUT_TOKENS`
- Lower timeout values
- Use simple HTTP instead of browser rendering
Contributions are welcome! Please:

- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass (`npm test`)
- Submit a pull request
MIT License - see LICENSE file for details
- Issues: GitHub Issues
- Documentation: API.md
- Planning: task/
Built with ❤️ using Cloudflare Workers AI, Node.js, Cheerio, and @mozilla/readability