HTML Mapping
HTMLMapper.toComponents(html, params?) turns an HTML string into a typed Component[]. This page explains the pipeline and the default rules; configuration is on Custom Mappings.
import { HTMLMapper } from '@canvasflow/feed';
const components = HTMLMapper.toComponents(html, params /* optional */);
const root = HTMLMapper.getRootElement(html, rootMapping); // string | null

| Method | Returns | Purpose |
|---|---|---|
| toComponents(html, params?) | Component[] | The full HTML → components conversion. |
| getRootElement(html, mapping) | string \| null | Serialize the first element matching mapping (used to scope extraction). |
1. Pre-process the HTML string:
   - remove line breaks;
   - sanitize invalid hrefs (replace with #);
   - lift <a> wrappers around images out of <p>/heading tags;
   - split <p>/h1–h6 tags that contain <img> so the image becomes its own block.
2. Parse with himalaya into a Node[] AST.
3. Reduce the AST via reduceComponents(params) from mapping/Mapping.ts into Component[].
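
The pre-processing step can be pictured with a few string transforms. This is a minimal sketch, not the library's code: the helper names are hypothetical, "invalid href" is assumed to mean an empty or javascript: URL, only attribute-less <p> tags are handled, and lifting <a> wrappers is left out.

```typescript
// Hypothetical helpers that illustrate step 1; the real pre-processing may differ.

/** Remove line breaks so they do not turn into stray text nodes. */
function removeLineBreaks(html: string): string {
  return html.replace(/\r?\n/g, '');
}

/** Replace hrefs assumed to be invalid (empty or javascript:) with "#". */
function sanitizeHrefs(html: string): string {
  return html.replace(/href="([^"]*)"/g, (match: string, value: string) => {
    const trimmed = value.trim();
    const invalid = trimmed === '' || /^javascript:/i.test(trimmed);
    return invalid ? 'href="#"' : match;
  });
}

/** Split <p>...<img>...</p> so each image becomes its own block. */
function splitParagraphImages(html: string): string {
  return html.replace(/<p>(.*?)<\/p>/g, (match: string, inner: string) => {
    if (!/<img\b/.test(inner)) return match; // nothing to split
    return inner
      .split(/(<img\b[^>]*>)/) // the capture group keeps the <img> tags
      .filter((part) => part.trim() !== '')
      .map((part) => (/^<img\b/.test(part) ? part : `<p>${part}</p>`))
      .join('');
  });
}

/** Run the sketched transforms in the order the list gives. */
function preprocess(html: string): string {
  return splitParagraphImages(sanitizeHrefs(removeLineBreaks(html)));
}
```

Under these assumptions, preprocess('<p>Intro<img src="a.png">Outro</p>') yields two paragraphs with the image between them as a standalone block.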
For each element the reducer tries, in order:
1. Exclusion: matches an excludes mapping, or has data-cf-ignore → element and children skipped.
2. Built-in detection: social embeds (Instagram, Twitter/X, TikTok, YouTube, Vimeo, Dailymotion, Infogram, Apple Podcasts), <table>, <video>, <audio>, <iframe>, buttons, images (<img>, <picture>), <figure> (always produces a FigureContainerComponent), and role="gallery"/role="mosaic".
3. Custom mappings: each mappings entry, in order; the first match wins.
4. Default text rules: the tag → text-component table below.
5. Descend: otherwise ignore the element and evaluate its children.
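
The precedence above can be sketched as a small recursive reducer. The node shape, the Params type, and the matcher functions here are simplified assumptions for illustration; the actual reduceComponents works on himalaya nodes and emits full components, not type names.

```typescript
// Simplified, hypothetical node shape (himalaya's real nodes differ).
interface El {
  tagName: string;
  attributes: Record<string, string>;
  children: El[];
}

type Matcher = (el: El) => boolean;

// Hypothetical params shape: only the two lists the precedence refers to.
interface Params {
  excludes: Matcher[];
  mappings: { match: Matcher; component: string }[];
}

// Two rows of the default text table, enough for the demo.
const DEFAULT_TEXT = new Map<string, string>([
  ['h1', 'headline'],
  ['p', 'body'],
]);

function reduceElement(el: El, params: Params, out: string[]): void {
  // 1. Exclusion: skip the element and everything below it.
  if ('data-cf-ignore' in el.attributes || params.excludes.some((m) => m(el))) return;
  // 2. Built-in detection (only <img> is modelled here).
  if (el.tagName === 'img') { out.push('image'); return; }
  // 3. Custom mappings, in order; the first match wins.
  for (const mapping of params.mappings) {
    if (mapping.match(el)) { out.push(mapping.component); return; }
  }
  // 4. Default text rules.
  const textType = DEFAULT_TEXT.get(el.tagName);
  if (textType !== undefined) { out.push(textType); return; }
  // 5. Descend: ignore the element itself and evaluate its children.
  for (const child of el.children) reduceElement(child, params, out);
}

// Small helper for building demo trees.
const node = (
  tagName: string,
  attributes: Record<string, string> = {},
  children: El[] = [],
): El => ({ tagName, attributes, children });
```

Because custom mappings run before the default text rules in this sketch, a <p class="promo"> matched by a mapping never becomes a body component.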
| HTML | Component type |
|---|---|
| h1 | headline |
| h2 | title |
| h3 | subtitle |
| h4 | intro |
| p | body |
| blockquote | blockquote |
| footer | footer |
Any text element's role attribute overrides the default (e.g. <p role="crosshead"> → crosshead, <p role="text12"> → text12).
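
The table and the role override combine into a single lookup. The sketch below assumes that "text element" means one of the tags in the table and that a blank role is ignored; textComponentType is a hypothetical name, not part of the library.

```typescript
// Default tag → component type table, copied from the rules above.
const DEFAULT_TEXT_TYPES = new Map<string, string>([
  ['h1', 'headline'],
  ['h2', 'title'],
  ['h3', 'subtitle'],
  ['h4', 'intro'],
  ['p', 'body'],
  ['blockquote', 'blockquote'],
  ['footer', 'footer'],
]);

/**
 * Resolve the component type for a text element.
 * Returns undefined for tags outside the table (assumed not to be text elements).
 */
function textComponentType(tagName: string, role?: string): string | undefined {
  const fallback = DEFAULT_TEXT_TYPES.get(tagName.toLowerCase());
  if (fallback === undefined) return undefined;
  const trimmed = role?.trim();
  // A non-empty role attribute overrides the default type.
  return trimmed ? trimmed : fallback;
}
```

With this sketch, textComponentType('p') resolves to body and textComponentType('p', 'crosshead') resolves to crosshead.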
Text components keep only phrasing content; styles and classes are stripped. On <a> elements only href, target, and rel survive. Whitespace-only text between inline elements is preserved as a non-breaking space so spacing in markup like <b>foo</b> <i>bar</i> is not collapsed.
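
Two of these rules are easy to show in isolation: the attribute allow-list on <a> and the non-breaking space between inline elements. The types and function names below are hypothetical, and the sketch assumes "between inline elements" means a whitespace-only text node with an element on each side.

```typescript
// Attributes that survive on <a>, per the rule above.
const ALLOWED_ANCHOR_ATTRS = ['href', 'target', 'rel'];

/** Keep only href, target and rel; drop class, style and everything else. */
function cleanAnchorAttributes(attrs: Record<string, string>): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const name of ALLOWED_ANCHOR_ATTRS) {
    const value = attrs[name];
    if (value !== undefined) kept[name] = value;
  }
  return kept;
}

// Hypothetical flat model of a text component's inline content.
type InlineNode =
  | { kind: 'text'; value: string }
  | { kind: 'element'; html: string };

/** Turn whitespace-only text that sits between two elements into U+00A0. */
function preserveInlineSpacing(nodes: InlineNode[]): InlineNode[] {
  return nodes.map((current, i): InlineNode => {
    const betweenElements =
      nodes[i - 1]?.kind === 'element' && nodes[i + 1]?.kind === 'element';
    if (current.kind === 'text' && /^\s+$/.test(current.value) && betweenElements) {
      return { kind: 'text', value: '\u00a0' };
    }
    return current;
  });
}

/** Serialize the inline nodes back to a string. */
function renderInline(nodes: InlineNode[]): string {
  return nodes.map((n) => (n.kind === 'text' ? n.value : n.html)).join('');
}
```

Applied to the <b>foo</b> <i>bar</i> example, the single space between the two elements becomes a non-breaking space, so the words stay separated after rendering.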
| Content | Detected from |
|---|---|
| Image | <img>, <picture> (uses the fallback <img>). |
| Figure | <figure>: always produces a FigureContainerComponent (component: 'container', type: 'figure'). Caption and credit are extracted from a <figcaption>; credit nodes are identified by the <small> tag, role="credit", or class="credit". The contained media components (image, video, audio) are nested under components. |
| Gallery | role="gallery"/role="mosaic" container, or a custom gallery mapping. |
| Video | <video> (src or first <source>); YouTube/Vimeo/Dailymotion via <iframe>. |
| Audio | <audio>; Apple Podcasts via <iframe>. |
| Social | blockquote/a markers for Instagram, Twitter/X, TikTok. |
| Table | <table> → htmltable (restricted tag allow-list). |
| Button | <a role="button"> or <button><a></button>. |
To recognise content that does not follow these conventions, define a custom mapping.
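
As an illustration of the Figure row, the sketch below separates credit from caption inside a <figcaption> and builds the container. The node shape, the caption and credit property names, and the token-based class match are assumptions; only component: 'container', type: 'figure', and the nested components list come from the description above.

```typescript
// Hypothetical flattened view of a <figcaption> child.
interface CaptionNode {
  tagName: string;
  attributes: Record<string, string | undefined>;
  text: string;
}

// Media nested inside the figure (image, video, audio).
interface MediaComponent {
  component: 'image' | 'video' | 'audio';
}

// Assumed shape; caption and credit are illustrative property names.
interface FigureContainer {
  component: 'container';
  type: 'figure';
  caption: string;
  credit: string;
  components: MediaComponent[];
}

/** A node is a credit if it is <small>, has role="credit", or has the class "credit". */
function isCreditNode(n: CaptionNode): boolean {
  const classes = (n.attributes['class'] ?? '').split(/\s+/);
  return (
    n.tagName === 'small' ||
    n.attributes['role'] === 'credit' ||
    classes.indexOf('credit') !== -1
  );
}

/** Build the figure container from figcaption children and the detected media. */
function buildFigure(captionNodes: CaptionNode[], media: MediaComponent[]): FigureContainer {
  const joinText = (nodes: CaptionNode[]): string =>
    nodes.map((n) => n.text.trim()).filter((t) => t !== '').join(' ');
  return {
    component: 'container',
    type: 'figure',
    caption: joinText(captionNodes.filter((n) => !isCreditNode(n))),
    credit: joinText(captionNodes.filter(isCreditNode)),
    components: media,
  };
}
```

Markup that puts the credit somewhere else, for example in a sibling <div>, would not be picked up by a rule like this and is a case for a custom mapping.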