Lightweight tool to convert raw HTML into a machine-readable JSON schema: page type, product cards, buttons, forms, links.

Flowdefi/schema-extractor

 
 


@threvo/schema-extractor

A minimal yet practical JavaScript SDK for extracting meaningful page schemas that agents can use while browsing the web. Give it a URL (or raw HTML) and it will fetch, parse, and map the document into a simple JSON structure containing page type, inputs, products, buttons, forms, and links.

Features

  • 🚀 Accepts either remote URLs or inline HTML strings
  • 🌐 Fetches pages via axios with sensible defaults
  • 🧠 Uses lightweight DOM heuristics built with cheerio
  • ⚙️ Returns a deterministic schema object ready for downstream agents
  • 🖥️ Ships with an Express demo server + static UI for quick experiments

Installation

npm install @threvo/schema-extractor

This repository already includes the SDK source. Publishing to npm is optional.

Usage

import { extractSchema } from '@threvo/schema-extractor';

const schema = await extractSchema('https://example.com/products');

console.log(schema);
// {
//   page_type: 'product',
//   search_input: '.search input',
//   product_cards: ['.product-card'],
//   product_title: '.product-title',
//   product_price: '.price',
//   product_image: '.product img',
//   buttons: ['button.primary'],
//   forms: ['form[action*="search"]'],
//   links: ['a[href*="/details"]']
// }
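
Because the result is plain JSON, a downstream agent can flatten it into a list of selectors to try. The sketch below assumes the field names shown in the sample output above; `listSelectors` is a hypothetical helper for illustration and is not part of the SDK.

```javascript
// Sample schema copied from the usage output above.
const schema = {
  page_type: 'product',
  search_input: '.search input',
  product_cards: ['.product-card'],
  product_title: '.product-title',
  product_price: '.price',
  product_image: '.product img',
  buttons: ['button.primary'],
  forms: ['form[action*="search"]'],
  links: ['a[href*="/details"]']
};

// Hypothetical helper: collect every selector as a { field, selector } pair.
// page_type is skipped because it is a label, not a selector.
function listSelectors(schema) {
  const pairs = [];
  for (const [field, value] of Object.entries(schema)) {
    if (field === 'page_type' || value == null) continue;
    const selectors = Array.isArray(value) ? value : [value];
    for (const selector of selectors) pairs.push({ field, selector });
  }
  return pairs;
}

console.log(listSelectors(schema).length); // 8
```

Normalizing single strings and arrays into one list keeps the consuming code free of per-field special cases.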

API

extractSchema(urlOrHtml: string): Promise<Schema>

  • When the argument begins with http, the SDK fetches the page via axios before parsing.
  • For plain HTML strings, the DOM is parsed directly without a network request.
  • Errors are swallowed and resolve to an unknown schema, so callers do not need to guard with try/catch.
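
The three rules above can be sketched as a small wrapper. This is an illustration under stated assumptions, not the SDK's actual source: `isRemoteUrl`, `extractSafely`, and `UNKNOWN_SCHEMA` are hypothetical names, the injected `fetchHtml` and `parseHtml` callbacks stand in for the axios and cheerio steps, and the exact shape of the unknown schema is a guess.

```javascript
// Hypothetical fallback; the SDK's real "unknown" schema may have a different shape.
const UNKNOWN_SCHEMA = Object.freeze({ page_type: 'unknown' });

// Inputs that begin with "http" are treated as URLs, as described above.
function isRemoteUrl(input) {
  return typeof input === 'string' && input.startsWith('http');
}

// fetchHtml and parseHtml are injected stand-ins for the fetch and parse steps.
async function extractSafely(input, { fetchHtml, parseHtml }) {
  try {
    const html = isRemoteUrl(input) ? await fetchHtml(input) : input;
    return parseHtml(html);
  } catch (err) {
    // Any failure is swallowed and replaced by a copy of the fallback schema.
    return { ...UNKNOWN_SCHEMA };
  }
}
```

If the fallback does look like this, a caller would check `page_type` for `'unknown'` rather than catching an exception.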

Demo Server

npm run demo

  • Starts an Express server on http://localhost:4000.
  • GET /api/extract?url=... proxies to the SDK and returns the schema JSON.
  • Static demo UI at / lets you paste any URL and pretty-print the results.
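
The endpoint can also be called from a script. The sketch below assumes the demo server is running locally on the port listed above and that Node 18+ is available for the global `fetch`; `buildExtractUrl` and `fetchSchema` are hypothetical helper names, and treating a non-2xx status as an error is an assumption about how the demo server behaves.

```javascript
// Build the demo endpoint URL; the default base matches the port stated above.
function buildExtractUrl(targetUrl, base = 'http://localhost:4000') {
  const endpoint = new URL('/api/extract', base);
  // searchParams.set percent-encodes the target so its own "?" and "&" survive.
  endpoint.searchParams.set('url', targetUrl);
  return endpoint.toString();
}

// Requires a running demo server (npm run demo).
async function fetchSchema(targetUrl) {
  const response = await fetch(buildExtractUrl(targetUrl));
  if (!response.ok) {
    // Assumption: the demo server signals failures through the HTTP status.
    throw new Error(`Demo server responded with ${response.status}`);
  }
  return response.json();
}

console.log(buildExtractUrl('https://example.com/products'));
// http://localhost:4000/api/extract?url=https%3A%2F%2Fexample.com%2Fproducts
```

Encoding the target through `URLSearchParams` avoids the common mistake of concatenating a raw URL into the query string.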

Project Structure

/src
  extractSchema.js
  detectors/
    detectPageType.js
    detectSearchInput.js
    detectProducts.js
    detectButtons.js
    detectForms.js
    detectLinks.js
/demo
  index.html
  server.js

Development

  1. Install dependencies with npm install (already done if you are here).
  2. Update heuristics in src/detectors/* to support new page types.
  3. Run npm run demo and open the UI to validate changes live.
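
As a rough idea of what a new page-type rule for step 2 might look like, here is a minimal sketch. It is not taken from src/detectors/: the real detectors work on a cheerio DOM, whereas this hypothetical `guessPageType` works on already-detected selectors, and the `'search'` and `'form'` labels are invented for illustration (only `'product'` and the unknown fallback appear in this README).

```javascript
// Hypothetical page-type heuristic; rules and labels are assumptions.
// Checks run from most specific to least specific, so the order matters.
function guessPageType({ productCards = [], forms = [], searchInput = null } = {}) {
  if (productCards.length > 0) return 'product';
  if (searchInput) return 'search'; // invented label
  if (forms.length > 0) return 'form'; // invented label
  return 'unknown';
}

console.log(guessPageType({ productCards: ['.product-card'] })); // product
console.log(guessPageType({})); // unknown
```

Keeping each rule a pure function of its inputs makes the output deterministic and lets a heuristic be tested without starting the demo server.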

License

MIT © Threvo
