
Smartcrawl MVP

Natural-language → structured data web extraction with FastAPI, optional Playwright JS rendering, and a JSON-Schema-driven LLM (with a BeautifulSoup fallback).


Contents

  1. What this is
  2. Features
  3. Prerequisites
  4. Install & Setup
  5. Running the server
  6. Testing in Swagger UI
  7. API Reference
  8. Examples (copy/paste)
  9. Troubleshooting
  10. Notes on ethics & robots.txt
  11. Roadmap (post-MVP)

What this is

Smartcrawl turns a natural-language prompt plus a JSON Schema into structured JSON extracted from web pages. It can:

  • Fetch HTML (optionally render JavaScript)
  • Ask an LLM to extract only what the schema describes
  • Validate the output against the schema
  • Follow internal links with concurrency (crawl mode)
  • Dedupe pages via URL canonicalization and content fingerprinting

The LLM path falls back to a heuristic extractor (BeautifulSoup) when no API key is provided, so you can test without credentials on simple sites (e.g., books.toscrape.com).
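
As a rough illustration, a heuristic fallback can look like the sketch below (the helper name heuristic_extract and the CSS selectors are illustrative, not copied from this repo):

from bs4 import BeautifulSoup

def heuristic_extract(html: str, json_schema: dict) -> dict:
    # Fill string-array properties from obvious page elements.
    soup = BeautifulSoup(html, "html.parser")
    result: dict = {}
    for name in json_schema.get("properties", {}):
        if name == "titles":
            # books.toscrape.com keeps full titles in <h3><a title="..."> tags
            result[name] = [a["title"] for a in soup.select("h3 a[title]")]
        elif name == "prices":
            result[name] = [p.get_text(strip=True) for p in soup.select(".price_color")]
    return result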


Features

  • FastAPI service with interactive docs at /docs
  • Endpoints:
    • POST /smartcrawl/scrape — single URL → structured JSON
    • POST /smartcrawl/crawl — multi-page, same-origin crawl
  • Fetcher:
    • render_js=false → HTTP client (httpx) — no browser
    • render_js=true → Playwright (Chromium) — JS rendering
  • Extractor:
    • LLM (OpenAI SDK v1) with JSON Schema enforcement
    • Heuristic fallback (BeautifulSoup) for demos and no-key runs
  • Crawl:
    • Same-origin link discovery
    • Concurrency control & polite jitter (see the sketch after this list)
    • URL canonicalization (treat / and /index.html as the same)
    • Content fingerprinting to avoid duplicate results
  • Robots:
    • By default, robots.txt is respected in both the crawler and the /scrape handler
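
The crawl controls above map onto standard asyncio primitives. A minimal sketch of the concurrency-plus-jitter idea (fetch_politely and the delay range are illustrative, not this repo's code):

import asyncio
import random

async def fetch_politely(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:                                    # cap simultaneous requests
        await asyncio.sleep(random.uniform(0.2, 0.8))  # polite jitter between hits
        # ... perform the actual fetch here (httpx or Playwright) ...
        return url

async def crawl_all(urls: list[str], concurrency: int = 3) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch_politely(u, sem) for u in urls))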

Prerequisites

  • Python 3.11+ recommended (3.10 works on many setups)
  • Windows, macOS, or Linux
  • Optional for JS rendering: Playwright Chromium browser

Install & Setup

Create a virtual environment and install dependencies.

Windows (PowerShell/CMD):

python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt

macOS/Linux (bash/zsh):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Install the Playwright browser only if you plan to use render_js=true:

python -m playwright install chromium

Environment (optional, for real LLM extraction):

# Copy .env.example → .env and set:
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
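
For reference, a typical way these settings reach the app, assuming python-dotenv (the provided code may wire this differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")             # unset → heuristic fallback
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")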

Running the server

Start the API:

uvicorn app.api:app --port 8000
# Add --reload for hot reload during development

Windows tip (JS rendering): Playwright launches Chromium as a subprocess, and on Windows only the Proactor event loop supports subprocesses, so the provided app/api.py sets the matching event loop policy. No extra action is needed if you use that file as-is.
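
The policy switch amounts to a few lines; a sketch of it (check app/api.py for the exact code):

import asyncio
import sys

if sys.platform == "win32":
    # Only the Proactor loop can spawn subprocesses on Windows,
    # which Playwright needs in order to launch Chromium.
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())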

Health & home:

  • http://127.0.0.1:8000/ (banner JSON)
  • http://127.0.0.1:8000/healthz → {"status":"ok"}
  • http://127.0.0.1:8000/docs (Swagger UI)
  • http://127.0.0.1:8000/redoc (ReDoc UI)

Testing in Swagger UI

Open http://127.0.0.1:8000/docs → expand POST /smartcrawl/scrape or POST /smartcrawl/crawl → Try it out → paste a JSON body (see examples below) → Execute.
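
The same request also works from a short script; a sketch using httpx, which the fetcher already depends on (run the server first):

import httpx

body = {
    "url": "https://books.toscrape.com/",
    "prompt": "Extract all book titles.",
    "json_schema": {
        "type": "object",
        "properties": {"titles": {"type": "array", "items": {"type": "string"}}},
        "required": ["titles"],
    },
    "render_js": False,
}
resp = httpx.post("http://127.0.0.1:8000/smartcrawl/scrape", json=body, timeout=60.0)
print(resp.json())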


API Reference

POST /smartcrawl/scrape

Single URL → structured JSON

Request body:

{
  "url": "https://example.com",
  "prompt": "Natural language goal for extraction",
  "json_schema": { "... JSON Schema ..." },
  "render_js": true
}

Notes

  • render_js=false uses httpx (fast, no headless browser).
  • render_js=true uses Playwright (Chromium) for JS-heavy pages.
  • The extractor enforces the provided JSON Schema; if the extracted data doesn't validate, you'll get a 500 with a validation message.
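
The enforcement step boils down to something like this sketch (assuming the jsonschema package; the service's exact error wrapping may differ):

from jsonschema import ValidationError, validate

def enforce_schema(data: dict, json_schema: dict) -> dict:
    try:
        validate(instance=data, schema=json_schema)
    except ValidationError as exc:
        # Surfaced to the client as a 500 with the validation message
        raise RuntimeError(f"schema validation failed: {exc.message}") from exc
    return data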

Response 200:

{
  "url": "...",
  "elapsed_s": 1.234,
  "data": { "... schema-conformant JSON ..." }
}

POST /smartcrawl/crawl

Same-origin multi-page crawl.

Request body:

{
  "url": "https://example.com",
  "prompt": "Goal for extraction",
  "json_schema": { "... JSON Schema ..." },
  "max_pages": 10,
  "concurrency": 3,
  "render_js": false
}

Response 200:

{
  "root": "...",
  "elapsed_s": 12.345,
  "count": 3,
  "results": [
    {"url": "...", "data": { "... " }}
  ],
  "errors": [
    {"url": "...", "error": "blocked_by_robots"}
  ]
}

Implementation details

  • The crawler canonicalizes URLs (treat / and /index.html as the same) and also computes a content fingerprint of the page to avoid duplicate results when different URLs show identical content.
  • Concurrency and polite jitter are applied to avoid hammering servers.
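
In code, the two dedupe ideas reduce to roughly the following sketch (helper names are illustrative, not the repo's):

import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    # Treat "/" and "/index.html" as the same page; drop fragments.
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def fingerprint(html: str) -> str:
    # Hash of whitespace-normalized content: identical pages collide
    # even when reached through different URLs.
    return hashlib.sha256(" ".join(html.split()).encode()).hexdigest()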

Examples (copy/paste)

A) /smartcrawl/scrape (books.toscrape.com — no JS needed):

{
  "url": "https://books.toscrape.com/",
  "prompt": "Extract all book titles shown on the page and their prices.",
  "json_schema": {
    "type": "object",
    "properties": {
      "titles": { "type": "array", "items": { "type": "string" } },
      "prices": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["titles", "prices"]
  },
  "render_js": false
}

B) /smartcrawl/crawl (same site):

{
  "url": "https://books.toscrape.com/",
  "prompt": "Extract book titles and prices on each page.",
  "json_schema": {
    "type": "object",
    "properties": {
      "titles": { "type": "array", "items": { "type": "string" } },
      "prices": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["titles", "prices"]
  },
  "max_pages": 3,
  "concurrency": 2,
  "render_js": false
}

C) /smartcrawl/scrape with links (title + all anchors):

{
  "url": "https://example.com",
  "prompt": "Extract the page title and all hyperlinks (text + href).",
  "json_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "links": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "text": {"type": "string"},
            "href": {"type": "string"}
          },
          "required": ["text","href"]
        }
      }
    },
    "required": ["title","links"]
  },
  "render_js": false
}

Troubleshooting

  • GET / returns 404
    The root route is missing; the provided app/api.py includes it, so make sure you're running that version.

  • 500 on /scrape or /crawl

    • If using the LLM path: ensure OPENAI_API_KEY is set and the openai package is installed. The error message is surfaced back in the response.
    • Without a key: the fallback extractor handles simple patterns but won’t magically parse complex structures.
  • Playwright errors (Windows NotImplementedError)

    • Install browser bundle: python -m playwright install chromium.
    • The code sets WindowsProactorEventLoopPolicy, since only the Proactor event loop supports subprocesses on Windows.
    • If you don’t need JS rendering, set "render_js": false to avoid Playwright entirely.
  • 403 Blocked by robots.txt

    • By default, the service respects robots. Only crawl allowed paths.
    • If you’ve added the optional robots override flag, understand the risks (see section below).
  • 422 Unprocessable Entity

    • Your JSON body doesn't match the request model (e.g., missing required fields). Data that fails your json_schema surfaces as a 500, not a 422.
  • Duplicate pages in results

    • The crawler canonicalizes URLs and fingerprints content to dedupe. If duplicates still appear, the pages likely differ slightly in markup, so their fingerprints differ; a stricter schema helps normalize what you extract.

Notes on ethics & robots.txt

Respect each site’s Terms of Service and robots.txt. Use official APIs when available. Apply rate limits and be polite. Avoid collecting PII or sensitive data.

Optional patch: you can add a respect_robots: "ignore" flag and a custom user_agent if you own the site or have explicit permission. Default should remain strict.
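
If you add that patch, keep the default strict. A sketch of such a check using only the standard library (the respect_robots and user_agent parameters mirror the optional flags above and are illustrative):

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "SmartcrawlBot",
                      respect_robots: str = "strict") -> bool:
    if respect_robots == "ignore":  # only with explicit permission
        return True
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)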


Roadmap (post-MVP)

  • Queue/worker (Redis + RQ/Celery) and /status/{job_id} for long crawls
  • Proxy rotation & CAPTCHA handling (where permitted)
  • Provider adapters (local LLMs, structured outputs/functions)
  • DOM targeting heuristics & content normalization
  • sitemap.xml seeding and pagination strategies
  • Schema registry & versioning
  • Observability: structured logs, metrics, tracing
  • Security/compliance: PII redaction, TTL for raw HTML, opt-out lists

Happy crawling!
