
Smartcrawl MVP

Natural-language → structured data web extraction with FastAPI, optional Playwright JS rendering, and a JSON-Schema-driven LLM (with a BeautifulSoup fallback).


Contents

  1. What this is
  2. Features
  3. Prerequisites
  4. Install & Setup
  5. Running the server
  6. Testing in Swagger UI
  7. API Reference
  8. Examples (copy/paste)
  9. Troubleshooting
  10. Notes on ethics & robots.txt
  11. Roadmap (post-MVP)

What this is

Smartcrawl turns a natural-language prompt plus a JSON Schema into structured JSON extracted from web pages. It can:

  • Fetch HTML (optionally render JavaScript)
  • Ask an LLM to extract only what the schema describes
  • Validate the output against the schema
  • Follow internal links with concurrency (crawl mode)
  • Dedupe pages via URL canonicalization and content fingerprinting

The LLM path falls back to a heuristic extractor (BeautifulSoup) when no API key is provided, so you can test without credentials on simple sites (e.g., books.toscrape.com).
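
As a rough illustration, a heuristic fallback can look like the sketch below (the helper name heuristic_extract and the CSS selectors are illustrative, not copied from this repo):

from bs4 import BeautifulSoup

def heuristic_extract(html: str, json_schema: dict) -> dict:
    # Fill string-array properties from obvious page elements.
    soup = BeautifulSoup(html, "html.parser")
    result: dict = {}
    for name in json_schema.get("properties", {}):
        if name == "titles":
            # books.toscrape.com keeps full titles in <h3><a title="..."> tags
            result[name] = [a["title"] for a in soup.select("h3 a[title]")]
        elif name == "prices":
            result[name] = [p.get_text(strip=True) for p in soup.select(".price_color")]
    return result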


Features

  • FastAPI service with interactive docs at /docs
  • Endpoints:
    • POST /smartcrawl/scrape — single URL → structured JSON
    • POST /smartcrawl/crawl — multi-page, same-origin crawl
  • Fetcher:
    • render_js=false → HTTP client (httpx) — no browser
    • render_js=true → Playwright (Chromium) — JS rendering
  • Extractor:
    • LLM (OpenAI SDK v1) with JSON Schema enforcement
    • Heuristic fallback (BeautifulSoup) for demos and no-key runs
  • Crawl:
    • Same-origin link discovery
    • Concurrency control & polite jitter (see the sketch after this list)
    • URL canonicalization (treat / and /index.html as the same)
    • Content fingerprinting to avoid duplicate results
  • Robots:
    • By default, robots.txt is respected in both the crawler and the /scrape handler
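
The crawl controls above map onto standard asyncio primitives. A minimal sketch of the concurrency-plus-jitter idea (fetch_politely and the delay range are illustrative, not this repo's code):

import asyncio
import random

async def fetch_politely(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:                                    # cap simultaneous requests
        await asyncio.sleep(random.uniform(0.2, 0.8))  # polite jitter between hits
        # ... perform the actual fetch here (httpx or Playwright) ...
        return url

async def crawl_all(urls: list[str], concurrency: int = 3) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch_politely(u, sem) for u in urls))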

Prerequisites

  • Python 3.11+ recommended (3.10 works on many setups)
  • Windows, macOS, or Linux
  • Optional for JS rendering: Playwright Chromium browser

Install & Setup

Create a virtual environment and install dependencies.

Windows (PowerShell/CMD):

python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt

macOS/Linux (bash/zsh):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Install the Playwright browser only if you plan to use render_js=true:

python -m playwright install chromium

Environment (optional, for real LLM extraction):

# Copy .env.example → .env and set:
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
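
For reference, a typical way these settings reach the app, assuming python-dotenv (the provided code may wire this differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")             # unset → heuristic fallback
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")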

Running the server

Start the API:

uvicorn app.api:app --port 8000
# Add --reload for hot reload during development

Windows tip (JS rendering): Playwright launches Chromium as a subprocess, and on Windows only the Proactor event loop supports subprocesses, so the provided app/api.py sets the matching event loop policy. No extra action is needed if you use that file as-is.
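
The policy switch amounts to a few lines; a sketch of it (check app/api.py for the exact code):

import asyncio
import sys

if sys.platform == "win32":
    # Only the Proactor loop can spawn subprocesses on Windows,
    # which Playwright needs in order to launch Chromium.
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())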

Health & home:

  • http://127.0.0.1:8000/ (banner JSON)
  • http://127.0.0.1:8000/healthz → {"status":"ok"}
  • http://127.0.0.1:8000/docs (Swagger UI)
  • http://127.0.0.1:8000/redoc (ReDoc UI)

Testing in Swagger UI

Open http://127.0.0.1:8000/docs → expand POST /smartcrawl/scrape or POST /smartcrawl/crawl → Try it out → paste a JSON body (see examples below) → Execute.
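
The same request also works from a short script; a sketch using httpx, which the fetcher already depends on (run the server first):

import httpx

body = {
    "url": "https://books.toscrape.com/",
    "prompt": "Extract all book titles.",
    "json_schema": {
        "type": "object",
        "properties": {"titles": {"type": "array", "items": {"type": "string"}}},
        "required": ["titles"],
    },
    "render_js": False,
}
resp = httpx.post("http://127.0.0.1:8000/smartcrawl/scrape", json=body, timeout=60.0)
print(resp.json())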


API Reference

POST /smartcrawl/scrape

Single URL → structured JSON

Request body:

{
  "url": "https://example.com",
  "prompt": "Natural language goal for extraction",
  "json_schema": { "... JSON Schema ..." },
  "render_js": true
}

Notes

  • render_js=false uses httpx (fast, no headless browser).
  • render_js=true uses Playwright (Chromium) for JS-heavy pages.
  • The extractor enforces the provided JSON Schema; if the extracted data doesn't validate, you'll get a 500 with a validation message.
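
The enforcement step boils down to something like this sketch (assuming the jsonschema package; the service's exact error wrapping may differ):

from jsonschema import ValidationError, validate

def enforce_schema(data: dict, json_schema: dict) -> dict:
    try:
        validate(instance=data, schema=json_schema)
    except ValidationError as exc:
        # Surfaced to the client as a 500 with the validation message
        raise RuntimeError(f"schema validation failed: {exc.message}") from exc
    return data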

Response 200:

{
  "url": "...",
  "elapsed_s": 1.234,
  "data": { "... schema-conformant JSON ..." }
}

POST /smartcrawl/crawl

Same-origin multi-page crawl.

Request body:

{
  "url": "https://example.com",
  "prompt": "Goal for extraction",
  "json_schema": { "... JSON Schema ..." },
  "max_pages": 10,
  "concurrency": 3,
  "render_js": false
}

Response 200:

{
  "root": "...",
  "elapsed_s": 12.345,
  "count": 3,
  "results": [
    {"url": "...", "data": { "... " }}
  ],
  "errors": [
    {"url": "...", "error": "blocked_by_robots"}
  ]
}

Implementation details

  • The crawler canonicalizes URLs (treat / and /index.html as the same) and also computes a content fingerprint of the page to avoid duplicate results when different URLs show identical content.
  • Concurrency and polite jitter are applied to avoid hammering servers.
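
In code, the two dedupe ideas reduce to roughly the following sketch (helper names are illustrative, not the repo's):

import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    # Treat "/" and "/index.html" as the same page; drop fragments.
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def fingerprint(html: str) -> str:
    # Hash of whitespace-normalized content: identical pages collide
    # even when reached through different URLs.
    return hashlib.sha256(" ".join(html.split()).encode()).hexdigest()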

Examples (copy/paste)

A) /smartcrawl/scrape (books.toscrape.com — no JS needed):

{
  "url": "https://books.toscrape.com/",
  "prompt": "Extract all book titles shown on the page and their prices.",
  "json_schema": {
    "type": "object",
    "properties": {
      "titles": { "type": "array", "items": { "type": "string" } },
      "prices": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["titles", "prices"]
  },
  "render_js": false
}

B) /smartcrawl/crawl (same site):

{
  "url": "https://books.toscrape.com/",
  "prompt": "Extract book titles and prices on each page.",
  "json_schema": {
    "type": "object",
    "properties": {
      "titles": { "type": "array", "items": { "type": "string" } },
      "prices": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["titles", "prices"]
  },
  "max_pages": 3,
  "concurrency": 2,
  "render_js": false
}

C) /smartcrawl/scrape with links (title + all anchors):

{
  "url": "https://example.com",
  "prompt": "Extract the page title and all hyperlinks (text + href).",
  "json_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "links": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "text": {"type": "string"},
            "href": {"type": "string"}
          },
          "required": ["text","href"]
        }
      }
    },
    "required": ["title","links"]
  },
  "render_js": false
}

Troubleshooting

  • GET / returns 404
    The root route is missing; the provided app/api.py includes it, so make sure you're running that version.

  • 500 on /scrape or /crawl

    • If using the LLM path: ensure OPENAI_API_KEY is set and the openai package is installed. The error message is surfaced back in the response.
    • Without a key: the fallback extractor handles simple patterns but won’t magically parse complex structures.
  • Playwright errors (Windows NotImplementedError)

    • Install browser bundle: python -m playwright install chromium.
    • The code sets WindowsProactorEventLoopPolicy, since only the Proactor event loop supports subprocesses on Windows.
    • If you don’t need JS rendering, set "render_js": false to avoid Playwright entirely.
  • 403 Blocked by robots.txt

    • By default, the service respects robots. Only crawl allowed paths.
    • If you’ve added the optional robots override flag, understand the risks (see section below).
  • 422 Unprocessable Entity

    • Your JSON body doesn't match the request model (e.g., missing required fields). Data that fails your json_schema surfaces as a 500, not a 422.
  • Duplicate pages in results

    • The crawler canonicalizes URLs and fingerprints content to dedupe. If duplicates still appear, the pages likely differ slightly in markup, so their fingerprints differ; a stricter schema helps normalize what you extract.

Notes on ethics & robots.txt

Respect each site’s Terms of Service and robots.txt. Use official APIs when available. Apply rate limits and be polite. Avoid collecting PII or sensitive data.

Optional patch: you can add a respect_robots: "ignore" flag and a custom user_agent if you own the site or have explicit permission. Default should remain strict.
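
If you add that patch, keep the default strict. A sketch of such a check using only the standard library (the respect_robots and user_agent parameters mirror the optional flags above and are illustrative):

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "SmartcrawlBot",
                      respect_robots: str = "strict") -> bool:
    if respect_robots == "ignore":  # only with explicit permission
        return True
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)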


Roadmap (post-MVP)

  • Queue/worker (Redis + RQ/Celery) and /status/{job_id} for long crawls
  • Proxy rotation & CAPTCHA handling (where permitted)
  • Provider adapters (local LLMs, structured outputs/functions)
  • DOM targeting heuristics & content normalization
  • sitemap.xml seeding and pagination strategies
  • Schema registry & versioning
  • Observability: structured logs, metrics, tracing
  • Security/compliance: PII redaction, TTL for raw HTML, opt-out lists

Happy crawling!
