From e8708995c0688e782d2505e9152098822de54d12 Mon Sep 17 00:00:00 2001
From: Vikrant-Khedkar
Date: Thu, 30 Apr 2026 16:01:02 +0530
Subject: [PATCH] docs: update MCP docs for scrapegraph-mcp v3 rename
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

scrapegraph-mcp v3.0.0 renamed six MCP tools to match the v2 API canonical
names and removed one. This brings the docs in sync.

Changes:
- services/mcp-server/introduction.mdx: replace tool table with v3 names
  (smartscraper→extract, searchscraper→search,
  smartcrawler_initiate→crawl_start,
  smartcrawler_fetch_results→crawl_get_status, sgai_history→history,
  generate_schema→schema; markdownify removed). Tool count 18→17.
  Added v2→v3 migration callout.
- services/schema.mdx: NEW canonical page for the schema tool /
  POST /schema endpoint (was missing).
- docs.json: add services/schema to v2 Services nav.
- services/mcp-server.mdx: replaced 363-line stale duplicate (referenced
  a 404 openapi.json) with a slim redirect to introduction.mdx + setup
  guide cards.
- Deprecation banners on legacy v1 service pages pointing to v2/v3
  canonical pages: markdownify→scrape, smartscraper→extract,
  searchscraper→search, smartcrawler→crawl. agenticscraper and sitemap
  marked as removed in API v2.

Out of scope: SDK method-name references (x402, agno,
additional-parameters examples) — those track the Python SDK separately.
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs.json | 1 + services/agenticscraper.mdx | 4 + services/markdownify.mdx | 4 + services/mcp-server.mdx | 369 ++------------------------- services/mcp-server/introduction.mdx | 21 +- services/schema.mdx | 172 +++++++++++++ services/searchscraper.mdx | 4 + services/sitemap.mdx | 4 + services/smartcrawler.mdx | 4 + services/smartscraper.mdx | 4 + 10 files changed, 223 insertions(+), 364 deletions(-) create mode 100644 services/schema.mdx diff --git a/docs.json b/docs.json index 3764147..45bb85a 100644 --- a/docs.json +++ b/docs.json @@ -53,6 +53,7 @@ "services/search", "services/crawl", "services/monitor", + "services/schema", "services/history", { "group": "Additional Parameters", diff --git a/services/agenticscraper.mdx b/services/agenticscraper.mdx index b2167d6..c351ad5 100644 --- a/services/agenticscraper.mdx +++ b/services/agenticscraper.mdx @@ -4,6 +4,10 @@ description: 'Automate browser actions and scrape data with or without AI, even icon: 'bolt' --- + + **Removed in API v2 — no direct replacement.** For multi-step browser automation, see [`Crawl`](/services/crawl) (with `cookies`, `headers`, and `wait`) or contact us about custom workflows. This page is kept for reference. + + Agentic Scraper Service diff --git a/services/markdownify.mdx b/services/markdownify.mdx index b3eb13b..d95113b 100644 --- a/services/markdownify.mdx +++ b/services/markdownify.mdx @@ -4,6 +4,10 @@ description: 'Convert web content to clean, structured markdown' icon: 'markdown' --- + + **Legacy v1 service.** Use [`Scrape`](/services/scrape) with `output_format="markdown"` instead — it is the canonical v2 API and the equivalent MCP tool. This page is kept for reference. 
+ + Markdownify Service diff --git a/services/mcp-server.mdx b/services/mcp-server.mdx index b272e09..9fa1e01 100644 --- a/services/mcp-server.mdx +++ b/services/mcp-server.mdx @@ -1,363 +1,26 @@ --- title: '🔌 MCP Server' description: 'Use ScrapeGraphAI through the Model Context Protocol (MCP)' -openapi: 'https://raw.githubusercontent.com/ScrapeGraphAI/scrapegraph-mcp/main/openapi.json' --- -# ScrapeGraph MCP Server - -[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -[![Python 3.13+](https://img.shields.io/badge/python-3.13+-blue.svg)](https://www.python.org/downloads/) -[![smithery badge](https://smithery.ai/badge/@ScrapeGraphAI/scrapegraph-mcp)](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp) - -A production‑ready Model Context Protocol (MCP) server that connects LLMs to the ScrapeGraph AI API for AI‑powered web scraping, research, and crawling. +The MCP Server documentation has moved. - If this server is helpful, a star goes a long way. Thanks! + Tool reference, setup guides, and quick start. -## Key Features - -- Full v2 API coverage: scrape, extract, search, crawl (+ stop/resume), monitor lifecycle (+ activity polling), credits, history, and schema generation -- Uses the v2 API base URL (`https://api.scrapegraphai.com/api/v2`) with the `SGAI-APIKEY` header — wire format matches [scrapegraph-py v2](https://github.com/ScrapeGraphAI/scrapegraph-py/pull/84) -- Remote HTTP MCP endpoint and local Python server support -- Works with Cursor, Claude Desktop, and any MCP‑compatible client -- Robust error handling, timeouts, and production‑tested reliability - - -The MCP server is now on **v2** (`scrapegraph-mcp@2.0.0`). The v1 tools `sitemap`, `agentic_scrapper`, `markdownify_status`, and `smartscraper_status` have been removed. See [scrapegraph-mcp#16](https://github.com/ScrapeGraphAI/scrapegraph-mcp/pull/16) for the migration details. 
- - -## Get Your API Key - -Create an account and copy your API key from the [ScrapeGraph Dashboard](https://scrapegraphai.com/dashboard). - ---- - -## Recommended: Use the Remote MCP Server - -Endpoint: - -``` -https://mcp.scrapegraphai.com/mcp -``` - -Follow the instructions below: - -### Cursor (HTTP MCP) - -Add this to your Cursor MCP settings (`~/.cursor/mcp.json`): - -```json -{ - "mcpServers": { - "scrapegraph-mcp": { - "url": "https://mcp.scrapegraphai.com/mcp", - "headers": { - "X-API-Key": "YOUR_API_KEY" - } - } - } -} -``` - - -### Claude Desktop (via mcp-remote) - -Claude Desktop connects to HTTP MCP via a lightweight proxy. Add the following to `~/Library/Application Support/Claude/claude_desktop_config.json` on macOS (adjust path on Windows): - -```json -{ - "mcpServers": { - "scrapegraph-mcp": { - "command": "npx", - "args": [ - "mcp-remote@0.1.25", - "https://mcp.scrapegraphai.com/mcp", - "--header", - "X-API-Key:YOUR_API_KEY" - ] - } - } -} -``` - -### Smithery (optional) - -```bash -npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude -``` - ---- - -## Local Usage (Python) - -Prefer running locally? Install and wire the server via stdio. - -### Install - -```bash -pip install -e . -# or -uv pip install -e . -``` - -Set your key: - -```bash -# macOS/Linux -export SGAI_API_KEY=your-api-key-here -# Windows (PowerShell) -$env:SGAI_API_KEY="your-api-key-here" -``` - -### Run the server - -```bash -scrapegraph-mcp -# or -python -m scrapegraph_mcp.server -``` - -
-Cursor (Local stdio) - -`~/.cursor/mcp.json`: - -```json -{ - "mcpServers": { - "scrapegraph-mcp-local": { - "command": "python", - "args": ["-m", "scrapegraph_mcp.server"], - "env": { - "SGAI_API_KEY": "YOUR_API_KEY" - } - } - } -} -``` - -
- -
-Claude Desktop (Local stdio) - -`~/Library/Application Support/Claude/claude_desktop_config.json`: - -```json -{ - "mcpServers": { - "scrapegraph-mcp-local": { - "command": "python", - "args": ["-m", "scrapegraph_mcp.server"], - "env": { - "SGAI_API_KEY": "YOUR_API_KEY" - } - } - } -} -``` - -
- ---- - ---- - -## Configuration - -The server reads the ScrapeGraph API key from `SGAI_API_KEY` (local) or the `X-API-Key` header (remote). Environment variables align 1:1 with the Python SDK: - -| Variable | Description | Default | -|---|---|---| -| `SGAI_API_KEY` | ScrapeGraph API key | — | -| `SGAI_API_URL` | Override the v2 API base URL | `https://api.scrapegraphai.com/api/v2` | -| `SGAI_TIMEOUT` | Request timeout in seconds | `120` | -| `SCRAPEGRAPH_API_BASE_URL` | Legacy alias for `SGAI_API_URL` (still honored) | — | -| `SGAI_TIMEOUT_S` | Legacy alias for `SGAI_TIMEOUT` (still honored) | — | - -## Available Tools - -The server exposes the full v2 API surface. - -### Content tools - -All content tools accept the same `FetchConfig` passthrough parameters: `mode` (`auto | fast | js`), `stealth`, `timeout`, `wait`, `scrolls`, `country`, `headers`, `cookies`, `mock`. - -#### markdownify -Convert a webpage to clean markdown (wraps v2 `POST /scrape` with a markdown format entry). - -```python -markdownify(website_url: str, **fetch_config) -``` - -#### scrape -Fetch a URL via v2 `POST /scrape` with a single format entry. - -```python -scrape( - website_url: str, - output_format: str = "markdown", # markdown | html | screenshot | branding | links | images | summary - screenshot_full_page: bool = False, - content_type: str | None = None, - **fetch_config, -) -``` - -#### smartscraper -AI‑powered structured extraction (v2 `POST /extract`). - -```python -smartscraper( - user_prompt: str, - website_url: str, - output_schema: dict | str | None = None, - **fetch_config, -) -``` - -#### searchscraper -Search the web and optionally extract structured results (v2 `POST /search`). - -```python -searchscraper( - user_prompt: str, # maps to the v2 `query` field - num_results: int | None = None, # 1–20 - output_schema: dict | str | None = None, - prompt: str | None = None, # required when output_schema is set - country_search: str | None = None, # locationGeoCode (e.g. 
"us", "it") - time_range: str | None = None, # past_hour | past_24_hours | past_week | past_month | past_year - search_format: str = "markdown", # markdown | html - search_mode: str = "prune", # prune | normal - **fetch_config, -) -``` - -#### generate_schema -Generate or augment a JSON Schema from a prompt (v2 `POST /schema`). - -```python -generate_schema( - prompt: str, - existing_schema: dict | str | None = None, - model: str | None = None, -) -``` - -### Crawl tools - -#### smartcrawler_initiate -Start a multi‑page crawl. `extraction_mode` defaults to `markdown` (also: `html`, `links`, `images`, `summary`, `branding`, `screenshot`). - -```python -smartcrawler_initiate( - url: str, - extraction_mode: str = "markdown", # markdown | html | links | images | summary | branding | screenshot - depth: int | None = None, # v2 maxDepth - max_pages: int | None = None, - max_links_per_page: int | None = None, - allow_external: bool = False, - include_patterns: list[str] | None = None, - exclude_patterns: list[str] | None = None, - content_types: list[str] | None = None, - # FetchConfig passthrough - mode: str | None = None, # auto | fast | js - stealth: bool | None = None, - timeout: int | None = None, - wait: int | None = None, - scrolls: int | None = None, - country: str | None = None, - headers: dict | None = None, - cookies: dict | None = None, -) -``` - -#### smartcrawler_fetch_results -Poll status / results for a crawl. - -```python -smartcrawler_fetch_results(request_id: str) -``` - -#### crawl_stop -Stop a running crawl. - -```python -crawl_stop(request_id: str) -``` - -#### crawl_resume -Resume a paused / stopped crawl. - -```python -crawl_resume(request_id: str) -``` - -### Monitor tools - -Replace v1 "scheduled jobs". `monitor_create` wraps the supplied `prompt` (+ optional `output_schema`) into a v2 `{type: "json", ...}` format entry for you. 
- -```python -monitor_create( - url: str, - prompt: str, # what to extract on each run - interval: str, # 5-field cron expression - name: str | None = None, - webhook_url: str | None = None, - output_schema: dict | str | None = None, - **fetch_config, -) -monitor_list() -monitor_get(monitor_id: str) -monitor_pause(monitor_id: str) -monitor_resume(monitor_id: str) -monitor_delete(monitor_id: str) -monitor_activity( - monitor_id: str, - limit: int | None = None, # 1–100, default 20 - cursor: str | None = None, # pagination cursor -) -``` - -`monitor_activity` returns the tick history (`id`, `createdAt`, `status`, `changed`, `elapsedMs`, `diffs`) plus a `nextCursor` when more results are available — mirrors `sgai.monitor.activity()` in the SDKs. - -### Account tools - -#### credits -Get the remaining credit balance. - -```python -credits() -``` - -#### sgai_history -Browse paginated request history, optionally filtered by service. - -```python -sgai_history( - service: str | None = None, # scrape | extract | search | monitor | crawl - page: int | None = None, - limit: int | None = None, -) -``` - ---- - -## Troubleshooting - -- Verify your key is present in config (`X-API-Key` for remote, `SGAI_API_KEY` for local). -- Claude Desktop logs: - - macOS: `~/Library/Logs/Claude/` - - Windows: `%APPDATA%\\Claude\\Logs\\` -- If a long crawl is “still running”, keep polling `smartcrawler_fetch_results`. - -## License - -MIT License – see LICENSE file for details. - - + + + Set up ScrapeGraph MCP in Cursor. + + + Set up ScrapeGraph MCP in Claude Desktop. + + + Install via the Smithery registry. 
+ + diff --git a/services/mcp-server/introduction.mdx b/services/mcp-server/introduction.mdx index ba3fc97..959e2d2 100644 --- a/services/mcp-server/introduction.mdx +++ b/services/mcp-server/introduction.mdx @@ -22,7 +22,7 @@ The Model Context Protocol (MCP) is a standardized way for AI assistants to acce ## Key Features - + Scrape, extract, search, crawl, generate schemas, monitor scheduled jobs (with activity polling), and manage your account @@ -42,17 +42,16 @@ The MCP server exposes the following tools via API v2: | Tool | Description | |---|---| -| **markdownify** | Convert webpages to clean markdown (POST /scrape) | -| **scrape** | Fetch page content in any single format: markdown, html, screenshot, branding, links, images, summary (POST /scrape) | -| **smartscraper** | AI-powered structured extraction from a URL (POST /extract) | -| **searchscraper** | Search the web and extract structured results (POST /search) | -| **smartcrawler_initiate** | Start async multi-page crawl — markdown, html, links, images, summary, branding, or screenshot (POST /crawl) | -| **smartcrawler_fetch_results** | Poll crawl results (GET /crawl/:id) | +| **scrape** | Fetch page content in any format: markdown (default), html, screenshot, branding, links, images, summary (POST /scrape) | +| **extract** | AI-powered structured extraction from a URL (POST /extract) | +| **search** | Search the web and extract structured results (POST /search) | +| **crawl_start** | Start async multi-page crawl — markdown, html, links, images, summary, branding, or screenshot (POST /crawl) | +| **crawl_get_status** | Poll crawl results (GET /crawl/:id) | | **crawl_stop** | Stop a running crawl job (POST /crawl/:id/stop) | | **crawl_resume** | Resume a stopped crawl job (POST /crawl/:id/resume) | -| **generate_schema** | Generate or augment a JSON Schema from a prompt (POST /schema) | +| **schema** | Generate or augment a JSON Schema from a prompt (POST /schema) | | **credits** | Check your credit 
balance (GET /credits) | -| **sgai_history** | Browse request history with pagination (GET /history) | +| **history** | Browse request history with pagination (GET /history) | | **monitor_create** | Create a scheduled extraction job (POST /monitor) | | **monitor_list** | List all monitors (GET /monitor) | | **monitor_get** | Get monitor details (GET /monitor/:id) | @@ -62,7 +61,7 @@ The MCP server exposes the following tools via API v2: | **monitor_activity** | Poll tick history for a monitor with pagination (GET /monitor/:id/activity) | - Removed from v1: `sitemap`, `agentic_scrapper`, `markdownify_status`, `smartscraper_status` (no v2 API equivalents). + **Migrating from v2 (scrapegraph-mcp ≤ 2.x)?** Tools were renamed in v3.0.0 to match the v2 API canonical names: `smartscraper` → `extract`, `searchscraper` → `search`, `smartcrawler_initiate` → `crawl_start`, `smartcrawler_fetch_results` → `crawl_get_status`, `sgai_history` → `history`, `generate_schema` → `schema`. `markdownify` was removed — use `scrape` with `output_format="markdown"` instead. See the [v3.0.0 release notes](https://github.com/ScrapeGraphAI/scrapegraph-mcp/releases/tag/v3.0.0) for full details. ## Quick Start @@ -131,7 +130,7 @@ Prefer running locally? You can install the Python package and run it via stdio. - Read the detailed setup guide for Cursor - Read the detailed setup guide for Claude Desktop -- Explore the full MCP Server documentation for advanced features +- Browse the [GitHub repo](https://github.com/ScrapeGraphAI/scrapegraph-mcp) for source, advanced configuration, and release notes Choose your client and start scraping with AI! 
diff --git a/services/schema.mdx b/services/schema.mdx new file mode 100644 index 0000000..433e013 --- /dev/null +++ b/services/schema.mdx @@ -0,0 +1,172 @@ +--- +title: 'Schema' +description: 'Generate or augment a JSON Schema from a natural-language prompt' +icon: 'sitemap' +--- + +## Overview + +Schema turns a plain-English description of the data you want into a valid JSON Schema you can pass to **Extract**, **Search**, or **Monitor** as `output_schema`. Optionally seed it with an `existing_schema` to extend rather than start from scratch. + +Use it when you want strongly-typed output but don't want to hand-write the schema. + +## Pricing + +Each Schema call costs **1 credit**. See the [pricing page](https://scrapegraphai.com/pricing) for the full breakdown. + +## Getting Started + +### Quick Start + + + +```python Python +from scrapegraph_py import ScrapeGraphAI + +sgai = ScrapeGraphAI() + +res = sgai.schema( + prompt="A product listing on an e-commerce site. Include name, price (number), currency, in_stock (boolean), rating (0-5), and a list of review excerpts." +) + +print(res.data.schema) +``` + +```javascript JavaScript +import { ScrapeGraphAI } from "scrapegraph-js"; + +const sgai = new ScrapeGraphAI(); + +const res = await sgai.schema({ + prompt: "A product listing on an e-commerce site. Include name, price (number), currency, in_stock (boolean), rating (0-5), and a list of review excerpts.", +}); + +console.log(res.data?.schema); +``` + +```bash cURL +curl -X POST https://api.scrapegraphai.com/api/v2/schema \ + -H "SGAI-APIKEY: $SGAI_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "prompt": "A product listing on an e-commerce site. Include name, price (number), currency, in_stock (boolean), rating (0-5), and a list of review excerpts."
+ }' +``` + + + +#### Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `prompt` | string | Yes | Natural-language description of the schema to generate. | +| `existing_schema` | object \| string | No | Existing JSON Schema (object or JSON string) to extend with the new fields described in `prompt`. | +| `model` | string | No | Optional LLM model override. | + +#### Response + +```json +{ + "refinedPrompt": "Extract all product listings with their name, price, currency, stock status, rating, and review excerpts from the e-commerce site", + "schema": { + "$defs": { + "ItemSchema": { + "title": "ItemSchema", + "type": "object", + "properties": { + "name": { "title": "Name", "description": "Name of the product", "type": "string" }, + "price": { "title": "Price", "description": "Price of the product as a number", "type": "number" }, + "currency": { "title": "Currency", "description": "Currency code for the price (e.g., USD, EUR)", "type": "string" }, + "in_stock": { "title": "In Stock", "description": "Whether the product is currently in stock", "type": "boolean" }, + "rating": { "title": "Rating", "description": "Product rating on a scale from 0 to 5", "type": "number", "minimum": 0, "maximum": 5 }, + "review_excerpts": { "title": "Review Excerpts", "description": "List of short review excerpts for the product", "type": "array", "items": { "type": "string" } } + }, + "required": ["name", "price", "currency", "in_stock", "rating", "review_excerpts"] + } + }, + "title": "MainSchema", + "type": "object", + "properties": { + "items": { + "title": "Items", + "description": "Array of product listings", + "type": "array", + "items": { "$ref": "#/$defs/ItemSchema" } + } + }, + "required": ["items"] + }, + "usage": { "promptTokens": 1160, "completionTokens": 743 } +} +``` + +## Extending an existing schema + +Pass `existing_schema` to grow a schema you already have rather than regenerating from scratch: + + + 
+```python Python +existing = { + "title": "Product", + "type": "object", + "properties": { + "name": {"type": "string"}, + "price": {"type": "number"} + }, + "required": ["name", "price"] +} + +res = sgai.schema( + prompt="Add brand, sku, and a list of category tags.", + existing_schema=existing, +) +``` + +```javascript JavaScript +const existing = { + title: "Product", + type: "object", + properties: { + name: { type: "string" }, + price: { type: "number" } + }, + required: ["name", "price"] +}; + +const res = await sgai.schema({ + prompt: "Add brand, sku, and a list of category tags.", + existing_schema: existing, +}); +``` + + + +## Using the generated schema + +Pipe the returned schema directly into **Extract**, **Search**, or **Monitor** as `output_schema`: + +```python +schema_res = sgai.schema(prompt="A blog post with title, author, published_at (ISO date), and tags[].") +generated_schema = schema_res.data.schema + +extract_res = sgai.extract( + "Extract the post details.", + url="https://example.com/blog/post-slug", + output_schema=generated_schema, +) +print(extract_res.data.json_data) +``` + +## When to use Schema + +- ✅ You want structured output but don't have a hand-written schema yet +- ✅ You're prototyping and want a quick starting point you'll refine +- ✅ You have a partial schema and want to grow it +- ❌ You already have a finalized JSON Schema — pass it directly to Extract/Search and skip Schema + +## See also + +- [Extract](/services/extract) — Use `output_schema` for typed extraction +- [Search](/services/search) — Use `output_schema` for typed search results +- [Monitor](/services/monitor) — Use `output_schema` on scheduled jobs diff --git a/services/searchscraper.mdx b/services/searchscraper.mdx index cdede96..f807a89 100644 --- a/services/searchscraper.mdx +++ b/services/searchscraper.mdx @@ -4,6 +4,10 @@ description: 'Search and extract information from multiple web sources using AI' icon: 'magnifying-glass' --- + + **Legacy v1 service.** 
Use [`Search`](/services/search) instead — it is the canonical v2 API and the equivalent MCP tool. This page is kept for reference. + + SearchScraper Service diff --git a/services/sitemap.mdx b/services/sitemap.mdx index 5b1fbac..8c8fa2d 100644 --- a/services/sitemap.mdx +++ b/services/sitemap.mdx @@ -4,6 +4,10 @@ description: 'Extract all URLs from a website sitemap automatically' icon: 'sitemap' --- + + **Removed in API v2 — no direct replacement.** For multi-page URL discovery, use [`Crawl`](/services/crawl) with `extraction_mode="links"`, or fetch a `sitemap.xml` directly via [`Scrape`](/services/scrape). This page is kept for reference. + + Sitemap Service diff --git a/services/smartcrawler.mdx b/services/smartcrawler.mdx index 0a7dbb2..7336c97 100644 --- a/services/smartcrawler.mdx +++ b/services/smartcrawler.mdx @@ -4,6 +4,10 @@ description: 'AI-powered website crawling and multi-page extraction' icon: 'spider' --- + + **Legacy v1 service.** Use [`Crawl`](/services/crawl) instead — it is the canonical v2 API. The equivalent MCP tools are `crawl_start`, `crawl_get_status`, `crawl_stop`, and `crawl_resume`. This page is kept for reference. + + ## Overview SmartCrawler is our advanced web crawling service that offers two modes: diff --git a/services/smartscraper.mdx b/services/smartscraper.mdx index d10eb70..6ee5f6c 100644 --- a/services/smartscraper.mdx +++ b/services/smartscraper.mdx @@ -4,6 +4,10 @@ description: 'AI-powered web scraping for any website' icon: 'robot' --- + + **Legacy v1 service.** Use [`Extract`](/services/extract) instead — it is the canonical v2 API and the equivalent MCP tool. This page is kept for reference. + + SmartScraper Service