A minimal yet practical JavaScript SDK for extracting meaningful page schemas that agents can use while browsing the web. Give it a URL (or raw HTML) and it will fetch, parse, and map the document into a simple JSON structure containing page type, inputs, products, buttons, forms, and links.
- 🚀 Accepts either remote URLs or inline HTML strings
- 🌐 Fetches pages via
axioswith sensible defaults - 🧠 Uses lightweight DOM heuristics built with
cheerio - ⚙️ Returns a deterministic schema object ready for downstream agents
- 🖥️ Ships with an Express demo server + static UI for quick experiments
npm install @threvo/schema-extractorThis repository already includes the SDK source. Publishing to npm is optional.
import { extractSchema } from '@threvo/schema-extractor';
const schema = await extractSchema('https://example.com/products');
console.log(schema);
// {
// page_type: 'product',
// search_input: '.search input',
// product_cards: ['.product-card'],
// product_title: '.product-title',
// product_price: '.price',
// product_image: '.product img',
// buttons: ['button.primary'],
// forms: ['form[action*="search"]'],
// links: ['a[href*="/details"]']
// }extractSchema(urlOrHtml: string): Promise<Schema>
- When the argument begins with
httpthe SDK fetches the page viaaxiosbefore parsing. - For plain HTML strings the DOM is parsed directly.
- Errors are swallowed and yield an
unknownschema so the caller never has to guard with try/catch.
npm run demo- Starts an Express server on
http://localhost:4000. GET /api/extract?url=...proxies to the SDK and returns the schema JSON.- Static demo UI at
/lets you paste any URL and pretty-print the results.
/src
extractSchema.js
detectors/
detectPageType.js
detectSearchInput.js
detectProducts.js
detectButtons.js
detectForms.js
detectLinks.js
/demo
index.html
server.js
- Install dependencies with
npm install(already done if you are here). - Update heuristics in
src/detectors/*to support new page types. - Run
npm run demoand open the UI to validate changes live.
MIT © Threvo
