Universal Scraper

The Python package for scraping data from any website

An AI Agent that handles all kinds of scraping work for you - just tell it what fields you want and point it at a URL.

Under the hood the agent writes a custom BeautifulSoup4 extractor for your target page, caches it against a structural hash of the HTML, and reuses that same code on every subsequent run.

The AI is only ever called once per unique page layout - not on every scrape - so your token spend stays in the single-digit cents range even across thousands of requests.

When the page layout changes the agent detects it automatically and regenerates the extractor, then caches the new version.
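
The caching idea above can be illustrated with a small sketch. This is not the package's actual implementation: `structural_hash`, `get_extractor`, and the in-memory `extractor_cache` are hypothetical names, and the choice to hash only tag names and class attributes is an assumption made for the example.

```python
import hashlib
from html.parser import HTMLParser


class _StructureCollector(HTMLParser):
    """Records tag names and class attributes, ignoring all text content."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep the tag name and its sorted class list; drop ids, hrefs, and text.
        classes = dict(attrs).get("class") or ""
        self.tokens.append(f"<{tag} class={' '.join(sorted(classes.split()))}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def structural_hash(html: str) -> str:
    """Return a SHA-256 digest that depends only on the page's tag structure."""
    collector = _StructureCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256("".join(collector.tokens).encode("utf-8")).hexdigest()


# Hypothetical in-memory cache: structural hash -> generated extractor source.
extractor_cache = {}


def get_extractor(html: str, generate) -> str:
    """Reuse a cached extractor when the layout is unchanged, else regenerate."""
    key = structural_hash(html)
    if key not in extractor_cache:
        extractor_cache[key] = generate(html)  # stands in for the one-off AI call
    return extractor_cache[key]
```

Two pages that differ only in their text produce the same hash, so `generate` runs once for both. A page with different tags or classes produces a new hash and triggers regeneration. Note that this naive version is also sensitive to the number of repeated items on a page, so a production hash would likely normalise repeated blocks first.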

Web UI - No-Code Mode
How Universal Scraper Works
Installation (Recommended)
Installation
Quick Start
- 1. Set up your API key
- 2. Basic Usage
- 3. Convenience Function
Export Formats
CLI Usage
MCP Server Usage
Cache Management
Advanced Usage
API Reference
Output Format
Common Field Examples
Multi-Provider AI Support
Troubleshooting
Roadmap
Contributors
Contributing
License
Changelog

Web UI - No-Code Mode

The fastest way to use Universal Scraper - no Python required. Install the package and launch the local web UI with one command:

pip install universal-scraper
universal-scraper-ui

Your browser opens automatically at http://127.0.0.1:5000.

What you can do in the UI

Feature	Details
Provider & Model	Select Google Gemini, OpenAI, Anthropic, or Ollama. Models are fetched live from the provider's API when you enter a key - always current, never hardcoded. Falls back to 1,700+ LiteLLM models when no key is entered. Only text/chat models are listed.
API Key auto-fill	`GEMINI_API_KEY`, `OPENAI_API_KEY`, and `ANTHROPIC_API_KEY` environment variables are pre-filled on page load.
Extraction fields	Add fields as interactive chips (`product_name`, `price`, `rating` …). Press Enter or comma to add; click × to remove.
Output formats	JSON → syntax-highlighted result. CSV → rendered as an HTML table in the browser; download exports a proper `.csv` file.
Real-time logs	Live terminal-style stream (Server-Sent Events) showing every internal step - fetch, clean, AI call, cache hit - as the scrape runs.
Token usage	After each scrape a token bar shows total tokens used, prompt/completion split, and cache-hit count. Click Breakdown → for a per-API-call modal.

CLI options

universal-scraper-ui --port 8080        # custom port
universal-scraper-ui --host 0.0.0.0    # bind to all interfaces
universal-scraper-ui --no-browser      # skip auto-opening the browser

Why Universal Scraper?

Traditional scraping is brittle

Writing a scraper the old way - requests / cloudscraper / selenium in Python, or Axios / Cheerio / Puppeteer in JS - means hand-crafting BeautifulSoup4 selectors by reading raw HTML. The moment a website updates its layout, every selector breaks. Teams end up spending more time maintaining scrapers than using the data they collect.

Universal Scraper fixes this

Instead of hard-coded selectors, the agent generates a custom BeautifulSoup4 extractor on the fly by analysing a compressed snapshot of the page:

What happens	The numbers
HTML cleaned before AI sees it	98%+ size reduction (e.g. 163 KB → 2.3 KB)
AI called to write the extractor	once per unique page layout
Same extractor reused on repeat runs	$0.00786 per scrape (~0.8 cents)
What would cost with raw HTML sent to AI	57.5× more tokens, ~$0.45 per call
Time to extract hundreds of items	~5 seconds
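
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes that cost scales linearly with token count, which is a simplification. The helper names are made up for this example.

```python
def size_reduction(raw_kb: float, cleaned_kb: float) -> float:
    """Fraction of the original HTML removed by cleaning."""
    return 1 - cleaned_kb / raw_kb


def scaled_cost(base_cost_usd: float, token_multiplier: float) -> float:
    """Estimated cost if the prompt were token_multiplier times larger."""
    return base_cost_usd * token_multiplier


# Values taken from the table above.
print(f"Size reduction: {size_reduction(163, 2.3):.1%}")          # Size reduction: 98.6%
print(f"Raw-HTML cost:  ${scaled_cost(0.00786, 57.5):.2f}")       # Raw-HTML cost:  $0.45
```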

When the page layout changes the agent detects the structural difference, regenerates the extractor automatically, and caches the new version - so you never touch the code.

How Universal Scraper Works

For a full technical breakdown — pipeline diagram, live working example, HTML cleaner internals, and token cost comparisons — see HOW_IT_WORKS.md.

Installation (Recommended)

pip install universal-scraper

Installation (Global level on Mac)

brew install pipx
sudo pipx install "universal-scraper[mcp]" --global

Installation

Clone the repository:

git clone <repository-url>
cd Universal_Scrapper

Install dependencies:

pip install -r requirements.txt

Or install manually:

pip install google-genai beautifulsoup4 requests selenium lxml fake-useragent flask

Install the module:
```
pip install -e .
```

Quick Start

1. Set up your API key

Option A: Use Gemini (Default - Recommended)

Get a Gemini API key from Google AI Studio:

export GEMINI_API_KEY="your_gemini_api_key_here"

Option B: Use OpenAI

export OPENAI_API_KEY="your_openai_api_key_here"

Option C: Use Anthropic Claude

export ANTHROPIC_API_KEY="your_anthropic_api_key_here"

Option D: Pass API key directly

# For any provider - just pass the API key directly
from universal_scraper import UniversalScraper

scraper = UniversalScraper(api_key="your_api_key")

2. Basic Usage

from universal_scraper import UniversalScraper

# Option 1: Auto-detect provider (uses Gemini by default)
scraper = UniversalScraper(api_key="your_gemini_api_key")

# Option 2: Specify Gemini model explicitly
scraper = UniversalScraper(api_key="your_gemini_api_key", model_name="gemini-2.5-flash")

# Option 3: Use OpenAI
scraper = UniversalScraper(api_key="your_openai_api_key", model_name="gpt-4")

# Option 4: Use Anthropic Claude
scraper = UniversalScraper(api_key="your_anthropic_api_key", model_name="claude-3-sonnet-20240229")

# Option 5: Use any other provider supported by LiteLLM
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")

# Set the fields you want to extract
scraper.set_fields([
    "company_name", 
    "job_title", 
    "apply_link", 
    "salary_range",
    "location"
])

# Check current model
print(f"Using model: {scraper.get_model_name()}")

# Scrape a URL (default JSON format)
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)

print(f"Extracted {result['metadata']['items_extracted']} items")
print(f"Data saved to: {result.get('saved_to')}")

# Scrape and save as CSV
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')
print(f"CSV data saved to: {result.get('saved_to')}")

3. Convenience Function

For quick one-off scraping:

from universal_scraper import scrape

# Quick scraping with default JSON format
data = scrape(
    url="https://example.com/jobs",
    api_key="your_gemini_api_key",
    fields=["company_name", "job_title", "apply_link"]
)

# Quick scraping with CSV format
data = scrape(
    url="https://example.com/jobs",
    api_key="your_gemini_api_key",
    fields=["company_name", "job_title", "apply_link"],
    format="csv"
)

# Quick scraping with OpenAI
data = scrape(
    url="https://example.com/jobs",
    api_key="your_openai_api_key",
    fields=["company_name", "job_title", "apply_link"],
    model_name="gpt-4"
)

# Quick scraping with Anthropic Claude
data = scrape(
    url="https://example.com/jobs",
    api_key="your_anthropic_api_key",
    fields=["company_name", "job_title", "apply_link"],
    model_name="claude-3-haiku-20240307"
)

print(data['data'])  # The extracted data

Export Formats

Universal Scraper supports multiple output formats to suit your data processing needs:

JSON Export (Default)

# JSON is the default format
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
# or explicitly specify
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='json')

JSON Output Structure:

{
  "url": "https://example.com",
  "timestamp": "2025-01-01T12:00:00",
  "fields": ["company_name", "job_title", "apply_link"],
  "data": [
    {
      "company_name": "Example Corp",
      "job_title": "Software Engineer", 
      "apply_link": "https://example.com/apply/123"
    }
  ],
  "metadata": {
    "raw_html_length": 50000,
    "cleaned_html_length": 15000,
    "items_extracted": 1
  }
}

CSV Export

# Export as CSV for spreadsheet analysis
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')

CSV Output:

Clean tabular format with headers
All fields as columns, missing values filled with empty strings
Perfect for Excel, Google Sheets, or pandas processing
Automatically handles varying field structures across items
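
The behaviour described in the list above (one column per field, empty strings for missing values, items with differing keys) can be reproduced with the standard library. This is a sketch of the idea, not the package's own exporter, and `items_to_csv` is a hypothetical helper name.

```python
import csv
import io


def items_to_csv(items, fields=None):
    """Serialise a list of dicts to CSV text, filling missing values with ''."""
    if fields is None:
        # Union of all keys in first-seen order, so no item's field is lost.
        fields = list(dict.fromkeys(key for item in items for key in item))
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        restval="",             # value written when an item lacks a field
        extrasaction="ignore",  # skip keys not listed in `fields`
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


items = [
    {"company_name": "Example Corp", "job_title": "Software Engineer"},
    {"company_name": "Acme", "salary_range": "$100k"},
]
print(items_to_csv(items))
```

The second item has no `job_title` and the first has no `salary_range`, so both rows contain an empty cell rather than raising an error.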

Multiple URLs with Format Choice

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

# Save all as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)

# Save all as CSV
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')

CLI Usage

# Gemini (default) - auto-detects from environment
universal-scraper https://example.com/jobs --output jobs.json

# OpenAI GPT models
universal-scraper https://example.com/products --api-key YOUR_OPENAI_KEY --model gpt-4 --format csv

# Anthropic Claude models  
universal-scraper https://example.com/data --api-key YOUR_ANTHROPIC_KEY --model claude-3-haiku-20240307

# Custom fields extraction
universal-scraper https://example.com/listings --fields product_name product_price product_rating

# Batch processing multiple URLs
universal-scraper --urls urls.txt --output-dir results --format csv --model gpt-4o-mini

# Verbose logging with any provider
universal-scraper https://example.com --api-key YOUR_KEY --model gpt-4 --verbose

🔧 Advanced CLI Options:

# Set custom extraction fields
universal-scraper URL --fields title price description availability

# Use environment variables (auto-detected)
export OPENAI_API_KEY="your_key"
universal-scraper URL --model gpt-4

# Multiple output formats
universal-scraper URL --format json    # Default
universal-scraper URL --format csv     # Spreadsheet-ready

# Batch processing
echo -e "https://site1.com\nhttps://site2.com" > urls.txt
universal-scraper --urls urls.txt --output-dir batch_results

🔗 Provider Support: All 100+ models supported by LiteLLM work in the CLI. See LiteLLM Providers for the complete list.

Development Usage (from cloned repo):

python main.py https://example.com/jobs --api-key YOUR_KEY --model gpt-4

MCP Server Usage

Universal Scraper works as an MCP (Model Context Protocol) server, allowing AI assistants to scrape websites directly.

Quick Setup

Install with MCP support:

pip install "universal-scraper[mcp]"

Set your AI API key:

export GEMINI_API_KEY="your_key"  # or OPENAI_API_KEY, ANTHROPIC_API_KEY

Claude Code Setup

Add this to your Claude Code MCP settings:

{
  "mcpServers": {
    "universal-scraper": {
      "command": "universal-scraper-mcp"
    }
  }
}

Or run this command in your terminal:

claude mcp add universal-scraper universal-scraper-mcp

Cursor Setup

Add this to your Cursor MCP configuration:

{
  "mcpServers": {
    "universal-scraper": {
      "command": "universal-scraper-mcp"
    }
  }
}

Available Tools

scrape_url: Scrape a single URL
scrape_multiple_urls: Scrape multiple URLs
configure_scraper: Set API keys and models
get_scraper_info: Check current settings
clear_cache: Clear cached data

Example Usage

Once configured, just ask your AI assistant:

"Scrape https://news.ycombinator.com and extract the top story titles and links"

"Scrape this product page and get the price, name, and reviews"

Cache Management

scraper = UniversalScraper(api_key="your_key")

# View cache statistics
stats = scraper.get_cache_stats()
print(f"Cached entries: {stats['total_entries']}")
print(f"Total cache hits: {stats['total_uses']}")

# Clear old entries (30+ days)
removed = scraper.cleanup_old_cache(30)
print(f"Removed {removed} old entries")

# Clear entire cache
scraper.clear_cache()

# Disable/enable caching
scraper.disable_cache()  # For testing
scraper.enable_cache()   # Re-enable
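
To make the age-based cleanup concrete, here is a minimal sketch of how entries older than a cutoff could be evicted. It does not reflect the package's internal storage: the dict layout, the `created_at` key, and the `cleanup_old_entries` name are all assumptions. Whether the real cache measures age from creation or from last use is not stated in this README.

```python
from datetime import datetime, timedelta


def cleanup_old_entries(cache, max_age_days, now=None):
    """Delete entries created max_age_days or more ago; return how many were removed."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    stale = [key for key, entry in cache.items() if entry["created_at"] <= cutoff]
    for key in stale:
        del cache[key]
    return len(stale)


# Hypothetical cache keyed by structural hash.
cache = {
    "hash_a": {"created_at": datetime(2025, 1, 1), "extractor": "..."},
    "hash_b": {"created_at": datetime(2025, 2, 20), "extractor": "..."},
}
removed = cleanup_old_entries(cache, 30, now=datetime(2025, 3, 1))
print(removed, list(cache))  # 1 ['hash_b']
```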

Advanced Usage

Multiple URLs

scraper = UniversalScraper(api_key="your_api_key")
scraper.set_fields(["title", "price", "description"])

urls = [
    "https://site1.com/products",
    "https://site2.com/items", 
    "https://site3.com/listings"
]

# Scrape all URLs and save as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)

# Scrape all URLs and save as CSV for analysis
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')

for result in results:
    if result.get('error'):
        print(f"Failed {result['url']}: {result['error']}")
    else:
        print(f"Success {result['url']}: {result['metadata']['items_extracted']} items")

Custom Configuration

import logging

from universal_scraper import UniversalScraper

scraper = UniversalScraper(
    api_key="your_api_key",
    temp_dir="custom_temp",      # Custom temporary directory
    output_dir="custom_output",  # Custom output directory  
    log_level=logging.DEBUG,     # Enable debug logging
    model_name="gpt-4"           # Custom model (OpenAI, Gemini, Claude, etc.)
)

# Configure for e-commerce scraping
scraper.set_fields([
    "product_name",
    "product_price", 
    "product_rating",
    "product_reviews_count",
    "product_availability",
    "product_description"
])

# Check and change model dynamically
print(f"Current model: {scraper.get_model_name()}")
scraper.set_model_name("gpt-4")  # Switch to OpenAI
print(f"Switched to: {scraper.get_model_name()}")

# Or switch to Claude
scraper.set_model_name("claude-3-sonnet-20240229")
print(f"Switched to: {scraper.get_model_name()}")

result = scraper.scrape_url("https://ecommerce-site.com", save_to_file=True)

API Reference

UniversalScraper Class

Constructor

UniversalScraper(api_key=None, temp_dir="temp", output_dir="output", log_level=logging.INFO, model_name=None)

api_key: AI provider API key (auto-detects provider, or set specific env vars)
temp_dir: Directory for temporary files
output_dir: Directory for output files
log_level: Logging level
model_name: AI model name (default: 'gemini-2.5-flash', supports 100+ models via LiteLLM)
- See LiteLLM Providers for complete model list and setup

Methods

set_fields(fields: List[str]): Set the fields to extract
get_fields() -> List[str]: Get current fields configuration
get_model_name() -> str: Get the current AI model name
set_model_name(model_name: str): Change the AI model (any provider)
scrape_url(url: str, save_to_file=False, output_filename=None, format='json') -> Dict: Scrape a single URL
scrape_multiple_urls(urls: List[str], save_to_files=True, format='json') -> List[Dict]: Scrape multiple URLs

Convenience Function

scrape(url: str, api_key: str, fields: List[str], model_name: Optional[str] = None, format: str = 'json') -> Dict

Quick scraping function for simple use cases. Auto-detects AI provider from API key pattern.

Note: For model names and provider-specific setup, refer to the LiteLLM Providers Documentation.

Output Format

The scraped data is returned in a structured format:

{
  "url": "https://example.com",
  "timestamp": "2025-01-01T12:00:00",
  "fields": ["company_name", "job_title", "apply_link"],
  "data": [
    {
      "company_name": "Example Corp",
      "job_title": "Software Engineer", 
      "apply_link": "https://example.com/apply/123"
    }
  ],
  "metadata": {
    "raw_html_length": 50000,
    "cleaned_html_length": 15000,
    "items_extracted": 1
  }
}
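
Because the result is a plain dictionary, it can be inspected with the standard library alone. The sketch below parses the sample payload shown above and derives two quick checks from it. `html_reduction` and `incomplete_items` are illustrative helpers, not part of the package.

```python
import json

# Sample payload copied from the structure documented above.
SAMPLE = """
{
  "url": "https://example.com",
  "timestamp": "2025-01-01T12:00:00",
  "fields": ["company_name", "job_title", "apply_link"],
  "data": [
    {
      "company_name": "Example Corp",
      "job_title": "Software Engineer",
      "apply_link": "https://example.com/apply/123"
    }
  ],
  "metadata": {
    "raw_html_length": 50000,
    "cleaned_html_length": 15000,
    "items_extracted": 1
  }
}
"""


def html_reduction(result):
    """Fraction of the raw HTML removed by cleaning, from the metadata block."""
    meta = result["metadata"]
    return 1 - meta["cleaned_html_length"] / meta["raw_html_length"]


def incomplete_items(result):
    """Map item index -> requested fields that are absent or empty in that item."""
    gaps = {}
    for index, item in enumerate(result["data"]):
        missing = [field for field in result["fields"] if not item.get(field)]
        if missing:
            gaps[index] = missing
    return gaps


result = json.loads(SAMPLE)
print(f"{html_reduction(result):.0%} of the HTML was stripped")  # 70% of the HTML was stripped
print(incomplete_items(result))  # {}
```

A non-empty return value from `incomplete_items` is a quick signal that a field name may need to be more specific, as the Troubleshooting section suggests.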

Common Field Examples

Job Listings

scraper.set_fields([
    "company_name",
    "job_title", 
    "apply_link",
    "salary_range",
    "location",
    "job_description",
    "employment_type",
    "experience_level"
])

E-commerce Products

scraper.set_fields([
    "product_name",
    "product_price",
    "product_rating", 
    "product_reviews_count",
    "product_availability",
    "product_image_url",
    "product_description"
])

News Articles

scraper.set_fields([
    "article_title",
    "article_content",
    "article_author",
    "publish_date", 
    "article_url",
    "article_category"
])

Multi-Provider AI Support

Universal Scraper now supports multiple AI providers through LiteLLM integration:

Supported Providers

Google Gemini (Default): gemini-2.5-flash, gemini-1.5-pro, etc.
OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo, etc.
Anthropic: claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307
100+ Other Models: Via LiteLLM including Llama, PaLM, Cohere, and more

For complete model names and provider setup: See LiteLLM Providers Documentation

Usage Examples

# Gemini (Default - Free tier available)
scraper = UniversalScraper(api_key="your_gemini_key")
# Auto-detects as gemini-2.5-flash

# OpenAI
scraper = UniversalScraper(api_key="sk-...", model_name="gpt-4")

# Anthropic Claude
scraper = UniversalScraper(api_key="sk-ant-...", model_name="claude-3-haiku-20240307")

# Environment variable approach
# Set GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY
scraper = UniversalScraper()  # Auto-detects from env vars

# Any other provider from LiteLLM (see link above for model names)
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")

Model Configuration Guide

Quick Reference for Popular Models:

# Gemini Models
model_name="gemini-2.5-flash"        # Fast, efficient
model_name="gemini-1.5-pro"          # More capable

# OpenAI Models  
model_name="gpt-4"                   # Most capable
model_name="gpt-4o-mini"             # Fast, cost-effective
model_name="gpt-3.5-turbo"           # Legacy but reliable

# Anthropic Models
model_name="claude-3-opus-20240229"      # Most capable
model_name="claude-3-sonnet-20240229"    # Balanced
model_name="claude-3-haiku-20240307"     # Fast, efficient

# Other Popular Models (see LiteLLM docs for setup)
model_name="llama-2-70b-chat"        # Meta Llama
model_name="command-nightly"          # Cohere
model_name="palm-2-chat-bison"        # Google PaLM

🔗 Complete Model List: Visit LiteLLM Providers Documentation for:

All available model names
Provider-specific API key setup
Environment variable configuration
Rate limits and pricing information

Model Auto-Detection

If you don't specify a model, the scraper automatically selects:

Gemini: If GEMINI_API_KEY is set or API key contains "AIza"
OpenAI: If OPENAI_API_KEY is set or API key starts with "sk-"
Anthropic: If ANTHROPIC_API_KEY is set or API key starts with "sk-ant-"
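
The key patterns listed above overlap: every Anthropic key that starts with "sk-ant-" also starts with "sk-". A detector therefore has to test the more specific prefix first. The sketch below shows one way to do that. `detect_provider` is a hypothetical function, and the order in which environment variables are consulted is an assumption. The package's real logic may differ.

```python
import os


def detect_provider(api_key=None, env=None):
    """Guess the provider from an API key pattern, else from environment variables."""
    env = os.environ if env is None else env
    if api_key:
        # Check "sk-ant-" before "sk-" because the former is a special case of the latter.
        if api_key.startswith("sk-ant-"):
            return "anthropic"
        if api_key.startswith("sk-"):
            return "openai"
        if "AIza" in api_key:
            return "gemini"
        return None
    # Assumed precedence when no key is passed: Gemini, then OpenAI, then Anthropic.
    for variable, provider in (
        ("GEMINI_API_KEY", "gemini"),
        ("OPENAI_API_KEY", "openai"),
        ("ANTHROPIC_API_KEY", "anthropic"),
    ):
        if env.get(variable):
            return provider
    return None


print(detect_provider("sk-ant-example"))  # anthropic
print(detect_provider("sk-example"))      # openai
print(detect_provider("AIzaExample"))     # gemini
```

If the key matches none of the patterns, passing `model_name` explicitly avoids relying on detection at all.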

Troubleshooting

Common Issues

API Key Error: Make sure your API key is valid and set correctly:
- Gemini: Set GEMINI_API_KEY or pass directly
- OpenAI: Set OPENAI_API_KEY or pass directly
- Anthropic: Set ANTHROPIC_API_KEY or pass directly
Model Not Found: Ensure you're using the correct model name for your provider
Empty Results: The AI might need more specific field names or the page might not contain the expected data
Network Errors: Some sites block scrapers - the tool uses cloudscraper to handle most cases
Model Name Issues: Check LiteLLM Providers for correct model names and setup instructions

Debug Mode

Enable debug logging to see what's happening:

import logging
scraper = UniversalScraper(api_key="your_key", log_level=logging.DEBUG)

Roadmap

See ROADMAP.md for planned features and improvements.

Contributors

Contributors List

Contributing

Fork the repository
Create a feature branch
Make your changes
Run pytest to execute the test suite
Check PEP 8 compliance with flake8:

flake8 universal_scraper/ --count --select=E9,F63,F7,F82 --show-source --statistics

flake8 universal_scraper/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

Submit a pull request

License

GPL 3.0 License - see LICENSE file for details.

Changelog

See CHANGELOG.md for detailed version history and release notes.
