Skip to content

Rumblingb/web-scraper-mcp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Web Scraper MCP Server

A lightweight, dependency-minimal web scraping server implementing the Model Context Protocol (MCP). Uses httpx for HTTP requests and regex for HTML parsing — no BeautifulSoup, no lxml, no heavy dependencies.

$19/month — Subscribe at stripe.com

Tools

1. scrape_url(url, selector?)

Fetch a single URL and extract structured content:

  • Title — content of the <title> tag
  • Meta description — content of <meta name="description">
  • Text content — all visible text with HTML tags stripped
  • Links — all <a href="..."> with absolute URLs and link text
  • Images — all <img src="..."> with absolute URLs and alt text

If a CSS selector is provided, returns matching elements with their text, attributes, nested links, and nested images instead.

Supported selector patterns:

Pattern Example Matches
Tag h1, div, p All elements with that tag
Class .content, .article Elements with the given class
ID #main, #header Element with the given ID
Attribute [href], [data-type] Elements with the attribute
Attr = value [type="submit"] Elements with exact attribute value
Combined div.content#main All of the above together

Examples:

# Get everything
scrape_url("https://example.com")

# Extract specific elements
scrape_url("https://example.com", "h1")
scrape_url("https://example.com", ".article-body")
scrape_url("https://example.com", "div.content a")

2. scrape_urls(urls[])

Batch scrape multiple URLs concurrently. Returns title, text snippet (first 500 chars), link count, and image count for each URL.

Example:

scrape_urls(["https://example.com", "https://example.org"])

3. extract_emails(url)

Extract all email addresses from a web page. Scans both visible text and mailto: links. Deduplicates results.

Example:

extract_emails("https://example.com/contact")

4. extract_links(url, domain_filter?)

Extract all hyperlinks from a page. Optionally filter by domain (substring match on hostname).

Examples:

# All links
extract_links("https://example.com")

# Only links pointing to a specific domain
extract_links("https://example.com", "docs.python.org")

Installation

# Clone the repository
git clone https://github.com/nousresearch/web-scraper-mcp.git
cd web-scraper-mcp

# Install dependencies
pip install -r requirements.txt

Usage

As a standalone MCP server

python server.py

This starts a stdio-based MCP server. Connect it to any MCP client (Claude Desktop, VS Code extension, etc.).

With Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["/path/to/web-scraper-mcp/server.py"]
    }
  }
}

With VS Code (Cline / Roo Code)

Add to your MCP settings:

{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["/path/to/web-scraper-mcp/server.py"]
    }
  }
}

Deployment

This server is compatible with Smithery.ai. Deploy with one click or via the provided smithery.yaml.

Dependencies

Minimal and intentionally slim:

Package Version Purpose
mcp >= 1.0.0 Model Context Protocol SDK
httpx >= 0.27.0 Async HTTP client with connection pooling

Pricing

$19/monthSubscribe on Stripe

License

MIT

About

Web scraper via MCP. Extract text, links, images, emails. Free tier.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages