An AI Agent that handles all kinds of scraping work for you - just tell it what fields you want and point it at a URL.
Under the hood the agent writes a custom BeautifulSoup4 extractor for your target page, caches it against a structural hash of the HTML, and reuses that same code on every subsequent run.
The AI is only ever called once per unique page layout - not on every scrape - so your token spend stays in the single-digit cents range even across thousands of requests.
When the page layout changes the agent detects it automatically and regenerates the extractor, then caches the new version.
- Web UI - No-Code Mode
- How Universal Scraper Works
- Installation (Recommended)
- Installation
- Quick Start
- Export Formats
- CLI Usage
- MCP Server Usage
- Cache Management
- Advanced Usage
- API Reference
- Output Format
- Common Field Examples
- Multi-Provider AI Support
- Troubleshooting
- Roadmap
- Contributors
- Contributing
- License
- Changelog
The fastest way to use Universal Scraper - no Python required. Install the package and launch the local web UI with one command:
pip install universal-scraper
universal-scraper-ui

Your browser opens automatically at http://127.0.0.1:5000.
| Feature | Details |
|---|---|
| Provider & Model | Select Google Gemini, OpenAI, Anthropic, or Ollama. Models are fetched live from the provider's API when you enter a key - always current, never hardcoded. Falls back to 1,700+ LiteLLM models when no key is entered. Only text/chat models are listed. |
| API Key auto-fill | GEMINI_API_KEY, OPENAI_API_KEY, and ANTHROPIC_API_KEY environment variables are pre-filled on page load. |
| Extraction fields | Add fields as interactive chips (product_name, price, rating …). Press Enter or comma to add; click × to remove. |
| Output formats | JSON → syntax-highlighted result. CSV → rendered as an HTML table in the browser; download exports a proper .csv file. |
| Real-time logs | Live terminal-style stream (Server-Sent Events) showing every internal step - fetch, clean, AI call, cache hit - as the scrape runs. |
| Token usage | After each scrape a token bar shows total tokens used, prompt/completion split, and cache-hit count. Click Breakdown → for a per-API-call modal. |
universal-scraper-ui --port 8080 # custom port
universal-scraper-ui --host 0.0.0.0 # bind to all interfaces
universal-scraper-ui --no-browser # skip auto-opening the browser

Writing a scraper the old way - requests / cloudscraper / selenium in Python, or Axios / Cheerio / Puppeteer in JS - means hand-crafting BeautifulSoup4 selectors by reading raw HTML. The moment a website updates its layout, every selector breaks. Teams end up spending more time maintaining scrapers than using the data they collect.
Instead of hard-coded selectors, the agent generates a custom BeautifulSoup4 extractor on the fly by analysing a compressed snapshot of the page:
| What happens | The numbers |
|---|---|
| HTML cleaned before AI sees it | 98%+ size reduction (e.g. 163 KB → 2.3 KB) |
| AI called to write the extractor | once per unique page layout |
| Same extractor reused on repeat runs | $0.00786 per scrape (~0.8 cents) |
| What would cost with raw HTML sent to AI | 57.5× more tokens, ~$0.45 per call |
| Time to extract hundreds of items | ~5 seconds |
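
The first row of the table can be illustrated with a rough sketch. This is not the project's actual cleaner (HOW_IT_WORKS.md covers that); it only assumes that most of the size reduction comes from dropping scripts, styles, comments, and noisy attributes. The names `SkeletonCleaner`, `clean_html`, and the tag and attribute lists are hypothetical choices for this example.

```python
import html
from html.parser import HTMLParser

# Hypothetical choices for this sketch, not the project's real rules.
DROPPED_TAGS = {"script", "style", "noscript", "svg"}
KEPT_ATTRS = {"class", "id", "href"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SkeletonCleaner(HTMLParser):
    """Re-emit HTML without scripts, styles, comments, or noisy attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{html.escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEPT_ATTRS and value
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace; comments are dropped because
        # handle_comment is left at its default no-op.
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.parts.append(html.escape(text))


def clean_html(raw_html: str) -> str:
    cleaner = SkeletonCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.parts)
```

Calling `clean_html(page_source)` returns a much smaller string that still carries the tag structure and class names an extractor needs. The real reduction ratio depends entirely on the page.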
When the page layout changes the agent detects the structural difference, regenerates the extractor automatically, and caches the new version - so you never touch the code.
For a full technical breakdown (pipeline diagram, live working example, HTML cleaner internals, and token cost comparisons), see HOW_IT_WORKS.md.
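
To make the caching idea concrete, here is a minimal sketch of "one AI call per unique layout". It assumes the structural hash is built only from the distinct tag and class combinations on the page, ignoring text and the number of repeated items. The project's real hashing scheme may differ, and `structural_hash` and `ExtractorCache` are hypothetical names used only for illustration.

```python
import hashlib
from html.parser import HTMLParser


class _StructureCollector(HTMLParser):
    """Collect tag names and class attributes, ignoring all text content."""

    def __init__(self):
        super().__init__()
        self.tokens = set()

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.tokens.add(f"{tag}.{'.'.join(sorted(classes.split()))}")


def structural_hash(html: str) -> str:
    """Hash the set of tag/class combinations, so text changes do not matter."""
    collector = _StructureCollector()
    collector.feed(html)
    collector.close()
    fingerprint = "|".join(sorted(collector.tokens))
    return hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()


class ExtractorCache:
    """Map layout hashes to generated extractor source code."""

    def __init__(self, generate):
        self._generate = generate  # stand-in for the AI call; runs only on a miss
        self._store = {}
        self.ai_calls = 0

    def get_extractor(self, html: str) -> str:
        key = structural_hash(html)
        if key not in self._store:
            self.ai_calls += 1
            self._store[key] = self._generate(html)
        return self._store[key]
```

With this shape, two pages that share a layout but differ in content reuse the same cached extractor, while a renamed class or a new wrapper element produces a new hash and triggers regeneration.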
pip install universal-scraper
brew install pipx
sudo pipx install "universal-scraper[mcp]" --global
- Clone the repository:
  git clone <repository-url>
  cd Universal_Scrapper
- Install dependencies:
  pip install -r requirements.txt
  Or install manually:
  pip install google-genai beautifulsoup4 requests selenium lxml fake-useragent flask
- Install the module:
  pip install -e .
Option A: Use Gemini (Default - Recommended)
Get a Gemini API key from Google AI Studio:
export GEMINI_API_KEY="your_gemini_api_key_here"

Option B: Use OpenAI
export OPENAI_API_KEY="your_openai_api_key_here"

Option C: Use Anthropic Claude
export ANTHROPIC_API_KEY="your_anthropic_api_key_here"

Option D: Pass API key directly
# For any provider - just pass the API key directly
scraper = UniversalScraper(api_key="your_api_key")

from universal_scraper import UniversalScraper
# Option 1: Auto-detect provider (uses Gemini by default)
scraper = UniversalScraper(api_key="your_gemini_api_key")
# Option 2: Specify Gemini model explicitly
scraper = UniversalScraper(api_key="your_gemini_api_key", model_name="gemini-2.5-flash")
# Option 3: Use OpenAI
scraper = UniversalScraper(api_key="your_openai_api_key", model_name="gpt-4")
# Option 4: Use Anthropic Claude
scraper = UniversalScraper(api_key="your_anthropic_api_key", model_name="claude-3-sonnet-20240229")
# Option 5: Use any other provider supported by LiteLLM
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")
# Set the fields you want to extract
scraper.set_fields([
"company_name",
"job_title",
"apply_link",
"salary_range",
"location"
])
# Check current model
print(f"Using model: {scraper.get_model_name()}")
# Scrape a URL (default JSON format)
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
print(f"Extracted {result['metadata']['items_extracted']} items")
print(f"Data saved to: {result.get('saved_to')}")
# Scrape and save as CSV
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')
print(f"CSV data saved to: {result.get('saved_to')}")

For quick one-off scraping:
from universal_scraper import scrape
# Quick scraping with default JSON format
data = scrape(
url="https://example.com/jobs",
api_key="your_gemini_api_key",
fields=["company_name", "job_title", "apply_link"]
)
# Quick scraping with CSV format
data = scrape(
url="https://example.com/jobs",
api_key="your_gemini_api_key",
fields=["company_name", "job_title", "apply_link"],
format="csv"
)
# Quick scraping with OpenAI
data = scrape(
url="https://example.com/jobs",
api_key="your_openai_api_key",
fields=["company_name", "job_title", "apply_link"],
model_name="gpt-4"
)
# Quick scraping with Anthropic Claude
data = scrape(
url="https://example.com/jobs",
api_key="your_anthropic_api_key",
fields=["company_name", "job_title", "apply_link"],
model_name="claude-3-haiku-20240307"
)
print(data['data']) # The extracted data

Universal Scraper supports multiple output formats to suit your data processing needs:
# JSON is the default format
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
# or explicitly specify
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='json')

JSON Output Structure:
{
"url": "https://example.com",
"timestamp": "2025-01-01T12:00:00",
"fields": ["company_name", "job_title", "apply_link"],
"data": [
{
"company_name": "Example Corp",
"job_title": "Software Engineer",
"apply_link": "https://example.com/apply/123"
}
],
"metadata": {
"raw_html_length": 50000,
"cleaned_html_length": 15000,
"items_extracted": 1
}
}

# Export as CSV for spreadsheet analysis
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')

CSV Output:
- Clean tabular format with headers
- All fields as columns, missing values filled with empty strings
- Perfect for Excel, Google Sheets, or pandas processing
- Automatically handles varying field structures across items
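
The CSV behaviour described above (one column per field, empty strings for missing values, tolerance for items with differing keys) can be approximated with the standard library. This is a sketch of the idea, not the package's exporter; `items_to_csv` is a hypothetical helper, and the input mirrors the `data` and `fields` keys of the JSON structure shown earlier.

```python
import csv
import io


def items_to_csv(items: list[dict], fields: list[str]) -> str:
    """Render extracted items as CSV text, one column per field."""
    # Start with the requested fields, then append extra keys in first-seen order.
    columns = list(fields)
    for item in items:
        for key in item:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval="" fills any column an item does not have with an empty string.
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


result = {
    "fields": ["company_name", "job_title", "apply_link"],
    "data": [
        {
            "company_name": "Example Corp",
            "job_title": "Software Engineer",
            "apply_link": "https://example.com/apply/123",
        },
        {"company_name": "Other Inc", "location": "Remote"},
    ],
}
print(items_to_csv(result["data"], result["fields"]))
```

The second item lacks `job_title` and `apply_link` and adds `location`, so the output gains a fourth column and leaves the missing cells empty.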
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
# Save all as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)
# Save all as CSV
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')

# Gemini (default) - auto-detects from environment
universal-scraper https://example.com/jobs --output jobs.json
# OpenAI GPT models
universal-scraper https://example.com/products --api-key YOUR_OPENAI_KEY --model gpt-4 --format csv
# Anthropic Claude models
universal-scraper https://example.com/data --api-key YOUR_ANTHROPIC_KEY --model claude-3-haiku-20240307
# Custom fields extraction
universal-scraper https://example.com/listings --fields product_name product_price product_rating
# Batch processing multiple URLs
universal-scraper --urls urls.txt --output-dir results --format csv --model gpt-4o-mini
# Verbose logging with any provider
universal-scraper https://example.com --api-key YOUR_KEY --model gpt-4 --verbose

🔧 Advanced CLI Options:
# Set custom extraction fields
universal-scraper URL --fields title price description availability
# Use environment variables (auto-detected)
export OPENAI_API_KEY="your_key"
universal-scraper URL --model gpt-4
# Multiple output formats
universal-scraper URL --format json # Default
universal-scraper URL --format csv # Spreadsheet-ready
# Batch processing
echo -e "https://site1.com\nhttps://site2.com" > urls.txt
universal-scraper --urls urls.txt --output-dir batch_results

🔗 Provider Support: All 100+ models supported by LiteLLM work in the CLI. See LiteLLM Providers for the complete list.
Development Usage (from cloned repo):
python main.py https://example.com/jobs --api-key YOUR_KEY --model gpt-4

Universal Scraper works as an MCP (Model Context Protocol) server, allowing AI assistants to scrape websites directly.
- Install with MCP support:
  pip install universal-scraper
- Set your AI API key:
  export GEMINI_API_KEY="your_key" # or OPENAI_API_KEY, ANTHROPIC_API_KEY

Add this to your Claude Code MCP settings:
{
"mcpServers": {
"universal-scraper": {
"command": "universal-scraper-mcp"
}
}
}

Or run this command in your terminal:
claude mcp add universal-scraper universal-scraper-mcp
Add this to your Cursor MCP configuration:
{
"mcpServers": {
"universal-scraper": {
"command": "universal-scraper-mcp"
}
}
}

- scrape_url: Scrape a single URL
- scrape_multiple_urls: Scrape multiple URLs
- configure_scraper: Set API keys and models
- get_scraper_info: Check current settings
- clear_cache: Clear cached data
Once configured, just ask your AI assistant:
"Scrape https://news.ycombinator.com and extract the top story titles and links"
"Scrape this product page and get the price, name, and reviews"
scraper = UniversalScraper(api_key="your_key")
# View cache statistics
stats = scraper.get_cache_stats()
print(f"Cached entries: {stats['total_entries']}")
print(f"Total cache hits: {stats['total_uses']}")
# Clear old entries (30+ days)
removed = scraper.cleanup_old_cache(30)
print(f"Removed {removed} old entries")
# Clear entire cache
scraper.clear_cache()
# Disable/enable caching
scraper.disable_cache() # For testing
scraper.enable_cache() # Re-enable

scraper = UniversalScraper(api_key="your_api_key")
scraper.set_fields(["title", "price", "description"])
urls = [
"https://site1.com/products",
"https://site2.com/items",
"https://site3.com/listings"
]
# Scrape all URLs and save as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)
# Scrape all URLs and save as CSV for analysis
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')
for result in results:
    if result.get('error'):
        print(f"Failed {result['url']}: {result['error']}")
    else:
        print(f"Success {result['url']}: {result['metadata']['items_extracted']} items")

import logging

scraper = UniversalScraper(
api_key="your_api_key",
temp_dir="custom_temp", # Custom temporary directory
output_dir="custom_output", # Custom output directory
log_level=logging.DEBUG, # Enable debug logging
model_name="gpt-4" # Custom model (OpenAI, Gemini, Claude, etc.)
)
# Configure for e-commerce scraping
scraper.set_fields([
"product_name",
"product_price",
"product_rating",
"product_reviews_count",
"product_availability",
"product_description"
])
# Check and change model dynamically
print(f"Current model: {scraper.get_model_name()}")
scraper.set_model_name("gpt-4") # Switch to OpenAI
print(f"Switched to: {scraper.get_model_name()}")
# Or switch to Claude
scraper.set_model_name("claude-3-sonnet-20240229")
print(f"Switched to: {scraper.get_model_name()}")
result = scraper.scrape_url("https://ecommerce-site.com", save_to_file=True)

UniversalScraper(api_key=None, temp_dir="temp", output_dir="output", log_level=logging.INFO, model_name=None)
- api_key: AI provider API key (auto-detects provider, or set specific env vars)
- temp_dir: Directory for temporary files
- output_dir: Directory for output files
- log_level: Logging level
- model_name: AI model name (default: 'gemini-2.5-flash', supports 100+ models via LiteLLM)
- See LiteLLM Providers for the complete model list and setup
- set_fields(fields: List[str]): Set the fields to extract
- get_fields() -> List[str]: Get the current fields configuration
- get_model_name() -> str: Get the current model name
- set_model_name(model_name: str): Change the model
- scrape_url(url: str, save_to_file=False, output_filename=None, format='json') -> Dict: Scrape a single URL
- scrape_multiple_urls(urls: List[str], save_to_files=True, format='json') -> List[Dict]: Scrape multiple URLs

scrape(url: str, api_key: str, fields: List[str], model_name: Optional[str] = None, format: str = 'json') -> Dict
Quick scraping function for simple use cases. Auto-detects the AI provider from the API key pattern.
Note: For model names and provider-specific setup, refer to the LiteLLM Providers Documentation.
The scraped data is returned in a structured format:
{
"url": "https://example.com",
"timestamp": "2025-01-01T12:00:00",
"fields": ["company_name", "job_title", "apply_link"],
"data": [
{
"company_name": "Example Corp",
"job_title": "Software Engineer",
"apply_link": "https://example.com/apply/123"
}
],
"metadata": {
"raw_html_length": 50000,
"cleaned_html_length": 15000,
"items_extracted": 1
}
}

scraper.set_fields([
"company_name",
"job_title",
"apply_link",
"salary_range",
"location",
"job_description",
"employment_type",
"experience_level"
])

scraper.set_fields([
"product_name",
"product_price",
"product_rating",
"product_reviews_count",
"product_availability",
"product_image_url",
"product_description"
])

scraper.set_fields([
"article_title",
"article_content",
"article_author",
"publish_date",
"article_url",
"article_category"
])

Universal Scraper now supports multiple AI providers through LiteLLM integration:
- Google Gemini (Default): gemini-2.5-flash, gemini-1.5-pro, etc.
- OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo, etc.
- Anthropic: claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307
- 100+ Other Models: Via LiteLLM including Llama, PaLM, Cohere, and more
For complete model names and provider setup: See LiteLLM Providers Documentation
# Gemini (Default - Free tier available)
scraper = UniversalScraper(api_key="your_gemini_key")
# Auto-detects as gemini-2.5-flash
# OpenAI
scraper = UniversalScraper(api_key="sk-...", model_name="gpt-4")
# Anthropic Claude
scraper = UniversalScraper(api_key="sk-ant-...", model_name="claude-3-haiku-20240307")
# Environment variable approach
# Set GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY
scraper = UniversalScraper() # Auto-detects from env vars
# Any other provider from LiteLLM (see link above for model names)
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")

Quick Reference for Popular Models:
# Gemini Models
model_name="gemini-2.5-flash" # Fast, efficient
model_name="gemini-1.5-pro" # More capable
# OpenAI Models
model_name="gpt-4" # Most capable
model_name="gpt-4o-mini" # Fast, cost-effective
model_name="gpt-3.5-turbo" # Legacy but reliable
# Anthropic Models
model_name="claude-3-opus-20240229" # Most capable
model_name="claude-3-sonnet-20240229" # Balanced
model_name="claude-3-haiku-20240307" # Fast, efficient
# Other Popular Models (see LiteLLM docs for setup)
model_name="llama-2-70b-chat" # Meta Llama
model_name="command-nightly" # Cohere
model_name="palm-2-chat-bison" # Google PaLM

🔗 Complete Model List: Visit LiteLLM Providers Documentation for:
- All available model names
- Provider-specific API key setup
- Environment variable configuration
- Rate limits and pricing information
If you don't specify a model, the scraper automatically selects:
- Gemini: If GEMINI_API_KEY is set or API key contains "AIza"
- OpenAI: If OPENAI_API_KEY is set or API key starts with "sk-"
- Anthropic: If ANTHROPIC_API_KEY is set or API key starts with "sk-ant-"
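
The rules above can be sketched as a small function. One detail worth noting: every key that starts with "sk-ant-" also starts with "sk-", so the Anthropic check has to run before the OpenAI one. `detect_provider` is a hypothetical helper, and the order in which environment variables are consulted is an assumption of this sketch, not documented behaviour.

```python
import os
from typing import Mapping, Optional


def detect_provider(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Guess the provider from an explicit key, else from environment variables."""
    env = os.environ if env is None else env

    if api_key:
        if api_key.startswith("sk-ant-"):  # must be tested before the "sk-" prefix
            return "anthropic"
        if api_key.startswith("sk-"):
            return "openai"
        if "AIza" in api_key:
            return "gemini"
        return None

    # Assumed precedence for this sketch: Gemini first, since it is the default.
    for provider, variable in (
        ("gemini", "GEMINI_API_KEY"),
        ("openai", "OPENAI_API_KEY"),
        ("anthropic", "ANTHROPIC_API_KEY"),
    ):
        if env.get(variable):
            return provider
    return None
```

If detection picks the wrong provider for your key, passing `model_name` explicitly sidesteps the guesswork.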
- API Key Error: Make sure your API key is valid and set correctly:
  - Gemini: Set GEMINI_API_KEY or pass directly
  - OpenAI: Set OPENAI_API_KEY or pass directly
  - Anthropic: Set ANTHROPIC_API_KEY or pass directly
- Model Not Found: Ensure you're using the correct model name for your provider
- Empty Results: The AI might need more specific field names or the page might not contain the expected data
- Network Errors: Some sites block scrapers - the tool uses cloudscraper to handle most cases
- Model Name Issues: Check LiteLLM Providers for correct model names and setup instructions
Enable debug logging to see what's happening:
import logging
scraper = UniversalScraper(api_key="your_key", log_level=logging.DEBUG)

See ROADMAP.md for planned features and improvements.
- Fork the repository
- Create a feature branch
- Make your changes
- Run pytest to execute the test cases
- Check PEP 8 style with flake8:
flake8 universal_scraper/ --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 universal_scraper/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- Submit a pull request
GPL 3.0 License - see LICENSE file for details.
See CHANGELOG.md for detailed version history and release notes.

