# Smartcrawl

Natural-language → structured data web extraction with FastAPI, optional Playwright JS rendering, and a JSON-Schema-driven LLM extractor (with a BeautifulSoup fallback).
- What this is
- Features
- Prerequisites
- Install & Setup
- Running the server
- Testing in Swagger UI
- API Reference
- Examples (copy/paste)
- Troubleshooting
- Notes on ethics & robots.txt
- Roadmap (post-MVP)
## What this is

Smartcrawl turns a natural-language prompt plus a JSON Schema into structured JSON extracted from web pages. It can:
- Fetch HTML (optionally render JavaScript)
- Ask an LLM to extract only what the schema describes
- Validate the output against the schema
- Crawl: follow internal links with bounded concurrency
- Dedupe pages via URL canonicalization and content fingerprinting
The LLM path falls back to a heuristic extractor (BeautifulSoup) when no API key is provided, so you can test without credentials on simple sites (e.g., books.toscrape.com).
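The end-to-end flow is roughly fetch → extract → validate. Below is a minimal sketch of the no-key fallback path, assuming `httpx`, `beautifulsoup4`, and `jsonschema` are installed; the function names and the schema are illustrative, not Smartcrawl's actual internals:

```python
import httpx
from bs4 import BeautifulSoup
from jsonschema import validate

# Illustrative schema: extract just a page title.
SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}

def fetch_html(url: str) -> str:
    # render_js=false path: plain HTTP fetch, no headless browser.
    resp = httpx.get(url, follow_redirects=True, timeout=30)
    resp.raise_for_status()
    return resp.text

def heuristic_extract(html: str) -> dict:
    # BeautifulSoup fallback: pull only what the schema describes.
    soup = BeautifulSoup(html, "html.parser")
    return {"title": soup.title.string.strip() if soup.title and soup.title.string else ""}

html = fetch_html("https://books.toscrape.com/")
data = heuristic_extract(html)
validate(instance=data, schema=SCHEMA)  # raises if the output doesn't conform
print(data)
```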
## Features

- FastAPI service with interactive docs at `/docs`
- Endpoints:
  - `POST /smartcrawl/scrape` — single URL → structured JSON
  - `POST /smartcrawl/crawl` — multi-page, same-origin crawl
- Fetcher:
  - `render_js=false` → HTTP client (httpx) — no browser
  - `render_js=true` → Playwright (Chromium) — JS rendering
- Extractor:
  - LLM (OpenAI SDK v1) with JSON Schema enforcement
  - Heuristic fallback (BeautifulSoup) for demos and no-key runs
- Crawl:
  - Same-origin link discovery
  - Concurrency control & polite jitter
  - URL canonicalization (treat `/` and `/index.html` as the same)
  - Content fingerprinting to avoid duplicate results
- Robots:
  - By default, the code respects robots.txt in both the crawler and the `/scrape` handler (a minimal robots-check sketch follows below)
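A robots.txt check like the one described above can be done with Python's standard `urllib.robotparser`; this is a hedged sketch, not Smartcrawl's actual implementation, and the user-agent string is illustrative:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "SmartcrawlBot") -> bool:
    """Return True if the target origin's robots.txt permits fetching url."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()        # fetch and parse robots.txt
    except OSError:
        return True      # assumption: treat an unreachable robots.txt as allow
    return rp.can_fetch(user_agent, url)

# Usage: skip (or report) URLs the site disallows.
print(is_allowed("https://books.toscrape.com/"))
```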
## Prerequisites

- Python 3.11+ recommended (3.10 works on many setups)
- Windows, macOS, or Linux
- Optional for JS rendering: Playwright Chromium browser
## Install & Setup

Create a virtual environment and install dependencies.
Windows (PowerShell/CMD):

```
python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
```
macOS/Linux (bash/zsh):

```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Install the Playwright browser only if you plan to use `render_js=true`:

```
python -m playwright install chromium
```
Environment (optional, for real LLM extraction):

```
# Copy .env.example → .env and set:
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
```
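For reference, these variables are typically picked up at startup along the lines of the sketch below (assuming `python-dotenv`; the actual config loading in Smartcrawl may differ):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory, if present

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")              # None → heuristic fallback
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")   # default model name is an assumption

if OPENAI_API_KEY is None:
    print("No OPENAI_API_KEY set; using the BeautifulSoup fallback extractor.")
```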
## Running the server

Start the API:

```
uvicorn app.api:app --port 8000
# Add --reload for hot reload during development
```
Windows tip (JS rendering): if you plan to use Playwright (`render_js=true`), the provided `app/api.py` sets a Windows-specific asyncio event loop policy at startup, so no extra action is needed.
Health & home:

- `http://127.0.0.1:8000/` (banner JSON)
- `http://127.0.0.1:8000/healthz` → `{"status":"ok"}`
- `http://127.0.0.1:8000/docs` (Swagger UI)
- `http://127.0.0.1:8000/redoc` (ReDoc UI)
## Testing in Swagger UI

Open http://127.0.0.1:8000/docs → expand `POST /smartcrawl/scrape` or `/smartcrawl/crawl` → Try it out → paste a JSON body (see examples below) → Execute.
## API Reference

### POST /smartcrawl/scrape

Single URL → structured JSON.

Request body:
```json
{
  "url": "https://example.com",
  "prompt": "Natural language goal for extraction",
  "json_schema": { "... JSON Schema ..." },
  "render_js": true
}
```
Notes:

- `render_js=false` uses httpx (fast, no headless browser).
- `render_js=true` uses Playwright (Chromium) for JS-heavy pages.
- The extractor enforces the provided JSON Schema. If the extracted data is invalid, you'll get a 500 with a validation message.
Response 200:

```json
{
  "url": "...",
  "elapsed_s": 1.234,
  "data": { "... schema-conformant JSON ..." }
}
```
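If you'd rather call the endpoint from code than from Swagger UI, a minimal client sketch (assuming the server is running locally on port 8000 and `httpx` is installed; the prompt and schema here are illustrative):

```python
import httpx  # any HTTP client works

payload = {
    "url": "https://example.com",
    "prompt": "Extract the page title.",
    "json_schema": {
        "type": "object",
        "properties": {"title": {"type": "string"}},
        "required": ["title"],
    },
    "render_js": False,
}

resp = httpx.post("http://127.0.0.1:8000/smartcrawl/scrape", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["data"])  # e.g. {"title": "Example Domain"}
```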
### POST /smartcrawl/crawl

Same-origin multi-page crawl.

Request body:
```json
{
  "url": "https://example.com",
  "prompt": "Goal for extraction",
  "json_schema": { "... JSON Schema ..." },
  "max_pages": 10,
  "concurrency": 3,
  "render_js": false
}
```
Response 200:

```json
{
  "root": "...",
  "elapsed_s": 12.345,
  "count": 3,
  "results": [
    {"url": "...", "data": { "..." }}
  ],
  "errors": [
    {"url": "...", "error": "blocked_by_robots"}
  ]
}
```
Implementation details:

- The crawler canonicalizes URLs (treating `/` and `/index.html` as the same) and also computes a content fingerprint of each page, so different URLs serving identical content don't produce duplicate results (a sketch of both ideas follows below).
- Concurrency and polite jitter are applied to avoid hammering servers.
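A minimal sketch of URL canonicalization plus content fingerprinting using only the standard library; the exact normalization rules Smartcrawl applies may differ:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different forms dedupe to one key."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/index.html"):      # treat /index.html like /
        path = path[: -len("index.html")]
    if not path.endswith("/"):
        path += "/"
    # Drop fragments; lowercase scheme and host; keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def fingerprint(html: str) -> str:
    """Hash page content so identical pages at different URLs dedupe."""
    return hashlib.sha256(html.encode("utf-8", errors="ignore")).hexdigest()

seen_urls, seen_fps = set(), set()

def is_new(url: str, html: str) -> bool:
    cu, fp = canonicalize(url), fingerprint(html)
    if cu in seen_urls or fp in seen_fps:
        return False
    seen_urls.add(cu)
    seen_fps.add(fp)
    return True
```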
## Examples (copy/paste)

A) `/smartcrawl/scrape` (books.toscrape.com — no JS needed):
```json
{
  "url": "https://books.toscrape.com/",
  "prompt": "Extract all book titles shown on the page and their prices.",
  "json_schema": {
    "type": "object",
    "properties": {
      "titles": { "type": "array", "items": { "type": "string" } },
      "prices": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["titles", "prices"]
  },
  "render_js": false
}
```
B) `/smartcrawl/crawl` (same site):
```json
{
  "url": "https://books.toscrape.com/",
  "prompt": "Extract book titles and prices on each page.",
  "json_schema": {
    "type": "object",
    "properties": {
      "titles": { "type": "array", "items": { "type": "string" } },
      "prices": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["titles", "prices"]
  },
  "max_pages": 3,
  "concurrency": 2,
  "render_js": false
}
```
C) `/smartcrawl/scrape` with links (title + all anchors):
```json
{
  "url": "https://example.com",
  "prompt": "Extract the page title and all hyperlinks (text + href).",
  "json_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "links": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "text": {"type": "string"},
            "href": {"type": "string"}
          },
          "required": ["text", "href"]
        }
      }
    },
    "required": ["title", "links"]
  },
  "render_js": false
}
```
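To run example B from a script instead of Swagger UI, a small client sketch (assumes the server is up on port 8000 and `httpx` is installed):

```python
import httpx

payload = {
    "url": "https://books.toscrape.com/",
    "prompt": "Extract book titles and prices on each page.",
    "json_schema": {
        "type": "object",
        "properties": {
            "titles": {"type": "array", "items": {"type": "string"}},
            "prices": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["titles", "prices"],
    },
    "max_pages": 3,
    "concurrency": 2,
    "render_js": False,
}

resp = httpx.post("http://127.0.0.1:8000/smartcrawl/crawl", json=payload, timeout=300)
resp.raise_for_status()
body = resp.json()
print(f"crawled {body['count']} pages in {body['elapsed_s']}s")
for item in body["results"]:
    print(item["url"], "→", len(item["data"].get("titles", [])), "titles")
```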
## Troubleshooting

- GET / returns 404
  - The root route is missing from your app; the provided code includes it.
- 500 on `/scrape` or `/crawl`
  - If using the LLM path: ensure `OPENAI_API_KEY` is set and the `openai` package is installed. The error message is surfaced back in the response.
  - Without a key: the fallback extractor handles simple patterns but won't magically parse complex structures.
- Playwright errors (Windows `NotImplementedError`)
  - Install the browser bundle: `python -m playwright install chromium`.
  - The code sets `WindowsSelectorEventLoopPolicy` at startup (see the Running the server section).
  - If you don't need JS rendering, set `"render_js": false` to avoid Playwright entirely.
- 403 Blocked by robots.txt
  - By default, the service respects robots.txt. Only crawl allowed paths.
  - If you've added the optional robots override flag, understand the risks (see the ethics section below).
- 422 Unprocessable Entity
  - Your JSON body doesn't match the request model (missing required fields), or your `data` didn't match the `json_schema`.
- Duplicate pages in results
  - The crawler canonicalizes URLs and fingerprints content to dedupe. If you still see duplicates, raise `max_pages` or use a stricter schema.
## Notes on ethics & robots.txt

Respect each site's Terms of Service and robots.txt. Use official APIs when available. Apply rate limits and be polite. Avoid collecting PII or sensitive data.

Optional patch: you can add a `respect_robots: "ignore"` flag and a custom `user_agent` if you own the site or have explicit permission (a hedged sketch follows below). The default should remain strict.
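If you do add that patch, the request-model change might look roughly like this Pydantic sketch; the surrounding fields are an illustrative subset, so adapt it to the project's actual models:

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class ScrapeRequest(BaseModel):
    # Illustrative subset of the existing request fields.
    url: str
    prompt: str
    json_schema: dict
    render_js: bool = False

    # Optional robots override — keep the strict default.
    respect_robots: Literal["strict", "ignore"] = "strict"
    user_agent: Optional[str] = Field(
        default=None,
        description="Custom User-Agent; only for sites you own or have permission to crawl.",
    )
```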
## Roadmap (post-MVP)

- Queue/worker (Redis + RQ/Celery) and `/status/{job_id}` for long crawls
- Proxy rotation & CAPTCHA handling (where permitted)
- Provider adapters (local LLMs, structured outputs/functions)
- DOM targeting heuristics & content normalization
- `sitemap.xml` seeding and pagination strategies
- Schema registry & versioning
- Observability: structured logs, metrics, tracing
- Security/compliance: PII redaction, TTL for raw HTML, opt-out lists
Happy crawling!