Skip to content

Article-mode oEmbed extraction for video pages + V1 payload parity#658

Open
wikirby wants to merge 5 commits into
masterfrom
user/wikirby/article-oembed
Open

Article-mode oEmbed extraction for video pages + V1 payload parity#658
wikirby wants to merge 5 commits into
masterfrom
user/wikirby/article-oembed

Conversation

@wikirby
Copy link
Copy Markdown
Contributor

@wikirby wikirby commented May 14, 2026

Summary

V1's article-mode flow on YouTube/Vimeo produced a saved OneNote page with an embedded video iframe, citation, title/author caption, and AutoPageTags* meta tags that OneNote's renderer uses to recognize the result as an article-style clip with playable embeds. V2 shipped without that machinery — clicking Article on a YouTube page just ran Readability, which strips iframes and produces a text-only result with no player and no description. Users reported the regression.

This PR restores V1's user-visible behavior using the oEmbed standard instead of V1's per-provider URL parsing. Both YouTube and Vimeo publish CORS-enabled oEmbed endpoints that the chrome-extension origin can fetch directly under existing <all_urls> host_permissions.

Provider scope

Narrowed to V1 parity: YouTube + Vimeo only.

V1 also handled Khan Academy via regex scrape of lesson-page HTML for embedded YouTube IDs — markup likely no longer matches modern Khan Academy pages; skipped per maintainer direction. Additional providers may be added as a focused capability-expansion PR in the future.

Changes

Area Change
src/scripts/contentCapture/oembedExtractor.ts (new) 3-entry provider table + hostname-pattern matching + fetch/parse + sanitizeProviderHtml helper. Returns raw OEmbedData (renderer composes preview & save separately).
renderer.ts extractArticle Try oEmbed first; on null fall through to existing Readability path (no behavior change for non-matching domains).
renderer.ts composeOEmbedForPreview Thumbnail at fixed 600x338 (16:9 — matches saved iframe size), title <h3>, "author · provider" attribution, og:description, CSS-only overlay on video-type thumbnails. No nested iframe in preview because preview-frame is sandboxed (allow-same-origin) and the player can't execute JS inside it.
renderer.ts composeOEmbedForSave Sanitized provider iframe with data-original-src=<pageUrl> injected and dimensions normalized to 600x338 — the marker OneNote's renderer uses to recognize the embed (matches V1's YoutubeVideoExtractor.createEmbeddedVideoFromId).
renderer.ts save dispatch Threads new pageMetadata map through the save port message. Article paths (oEmbed + Readability) populate AutoPageTagsCodes=Article, AutoPageTags=Article, plus title/author/siteName + description.
webExtensionWorker.ts buildPage HTML output realigned to V1 OneNoteApi.OneNotePage.getEntireOnml: no <!DOCTYPE>, <html xmlns="http://www.w3.org/1999/xhtml" lang=<locale>> (no quotes around lang — matches V1 literally), iterates pageMetadata to emit one <meta> per entry. Same shape applied to the parallel distHtml builder for distributed-PDF saves.
renderer.ts imageToDataUrl Drive-by bookmark fix: initial PNG encode + iterative JPEG-quality step-down when encoded data URL exceeds OneNote's per-MIME-part limit. Matches V1's deleted DomUtils.adjustImageQualityIfNecessary. Surfaced because bookmark mode on YouTube was hitting 400 Maximum request size exceeded (1280x720 og:image PNG-encoded ~2.5MB).

Follow-up fixes (separate commits)

  • Article-mode highlighter (9bc0549): Pre-existing regression from PR Client-side migration: unified renderer window with self-contained sign-in #651's sandbox hardening. Library now loads as a parent-window script (same pattern as pdf.js) and operates on the iframe DOM via allow-same-origin rather than executing inside it. Small textHighlighter.js patch routes event bindings through el.ownerDocument.defaultView. Sandbox stays tight.
  • Sidebar group spacing (ca6ac23 + a292926): Per designer, more breathing room around the Fluent dividers — .sidebar-group padding 8px → 12px. Window-height cap stays at 900 per owning team (PDF-mode success-banner button may sit under the fold on caps-bound displays; UX trade-off accepted).
  • Segoe UI @font-face restored for Mac/Linux (fe40b3d): Re-adds the V1 @font-face declaration so non-Windows users get Segoe UI from OneNote's CDN instead of falling through to -apple-system (San Francisco). Local() cascade keeps Windows on the installed font, no network request. No CSP change — font loading was already unrestricted.

Test plan

  • npm run build:prod clean (zero TS errors, tslint passes)
  • YouTube watch page → article mode → preview shows thumbnail + ▶ overlay + title/author/description; save produces OneNote page with playable embedded video
  • Vimeo video page → same flow
  • Non-matching domain (regular article) → falls through to Readability with no regression
  • Bookmark mode on YouTube → saves successfully (was 400 before the imageToDataUrl JPEG fallback)
  • Article-mode highlighter: select text → yellow span with red-circle delete button; click delete removes
  • Segoe UI loads from network on machines without it locally (verified via local()-strip dry run)
  • PDF mode, full-page mode, region mode otherwise unaffected

Out of scope (deferred)

  • Image-size fallback on PDF mode / full-page mode / region mode — those paths already use JPEG and aren't reported as failing; add if needed
  • Broader provider support — separate, focused PR if appetite arises
  • Pixel-perfect Mac/Windows parity on Semibold rendering (browser synthesizes from Regular today; add @font-face for Segoe UI Semibold if designer flags it)

Demo

image image

🤖 Generated with Claude Code

V1's article-mode flow on video pages (YouTube, Vimeo) produced a save
payload with an embedded video iframe, citation, title/author caption,
and `<meta name="AutoPageTagsCodes" content="Article" />` /
`<meta name="AutoPageTags" content="Article" />` tags that OneNote's
page renderer uses to recognize the result as an article-style clip
with playable embeds. V2 shipped without any of that machinery -- the
article mode just ran Readability over the YouTube DOM, which strips
iframes and produces a text-only result with no player and no
description. Users reported the regression on YouTube specifically;
the same gap applied to Vimeo as well.

oEmbed standard provides exactly the shape we need (iframe `html`,
title, author_name, thumbnail_url, dimensions) without any
provider-specific scraping. Both YouTube and Vimeo publish CORS-enabled
oEmbed endpoints that the chrome-extension origin can fetch directly
under our existing `<all_urls>` host_permissions.

Changes:

  - New `src/scripts/contentCapture/oembedExtractor.ts` -- thin module
    with a provider table (YouTube + Vimeo only, matching V1's
    SupportedVideoDomains), hostname-pattern matching, fetch + JSON
    parse, and a small `sanitizeProviderHtml` helper that strips
    script-execution surfaces from provider-supplied HTML.
  - `extractArticle` in renderer now tries oEmbed first; on no-match
    or fetch failure it falls through to the existing Readability
    path with zero behavior change.
  - Preview vs save are decoupled:
      - Preview shows the `thumbnail_url` at the same 600x338 (16:9)
        box the saved iframe uses, with title / "author . provider"
        attribution, page description (og:description fallback chain
        same as bookmark mode), and a CSS-only play-glyph overlay
        when `type === "video"`. No iframe in preview because the
        renderer's `preview-frame` is sandboxed (allow-same-origin)
        and the YouTube/Vimeo player can't run JS inside it -- which
        is why earlier attempts produced a broken "Unable to execute
        JavaScript" placeholder.
      - Save uses the provider's iframe HTML (sanitized), with
        `data-original-src=<pageUrl>` injected and dimensions
        normalized to 600x338 -- the marker OneNote's renderer uses
        to recognize and render the embedded player on the saved
        page, matching V1's YoutubeVideoExtractor behavior exactly.
  - PageMetadata plumbing: renderer threads a `pageMetadata` map
    through the save port message; worker's `buildPage` iterates
    and emits `<meta name="K" content="V" />` for each entry.
    Mirrors V1's `OneNoteApi.OneNotePage.getPageMetadataAsHtml`
    behavior. Article mode (both oEmbed and Readability paths)
    populates `AutoPageTagsCodes=Article`, `AutoPageTags=Article`,
    plus title/author/siteName (oEmbed) or
    title/description/author/siteName/publishedTime (Readability,
    matching V1 augmentationHelper).
  - `buildPage` HTML output realigned to V1
    `OneNoteApi.OneNotePage.getEntireOnml` shape: no `<!DOCTYPE>`,
    `<html xmlns="http://www.w3.org/1999/xhtml" lang=<locale>>`
    (no quotes around lang -- matches V1 output literally), locale
    via `chrome.i18n.getUILanguage()`. Same change applied to the
    parallel `distHtml` builder for distributed-PDF saves so all
    save paths emit the same shape.
  - Bookmark thumbnail size fallback restored: `imageToDataUrl`
    initial-encode is PNG (good for icons/logos), with iterative
    JPEG-quality step-down when the encoded data URL exceeds the
    OneNote API per-MIME-part limit (~2MB minus padding). Matches
    V1's deleted `DomUtils.adjustImageQualityIfNecessary` behavior
    including the 0.1 step size. Surfaced because the user was
    hitting "400 Maximum request size exceeded" on bookmark-mode
    saves of YouTube pages whose 1280x720 og:image PNG-encoded to
    ~2.5MB.

Provider scope is intentionally narrow (YouTube + Vimeo only) to
match V1's effective surface and avoid accidentally enabling capture
on sites V1 never supported. V1 also handled Khan Academy via regex
scrape for embedded YouTube IDs in lesson-page HTML; that markup
likely no longer matches modern Khan Academy pages and is skipped
here per maintainer direction.

Verified manually: YouTube watch page and Vimeo video page produce
saved OneNote pages with the embedded player, title/author caption,
and og:description text below; non-matching domains fall through to
Readability with no regression; bookmark mode on YouTube saves
successfully without the 400 limit error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wikirby wikirby force-pushed the user/wikirby/article-oembed branch from a066275 to 60aea08 Compare May 14, 2026 06:48
wikirby and others added 4 commits May 14, 2026 03:54
The article-mode highlighter has been non-functional since commit
16f62c0 (Security hardening) added `sandbox="allow-same-origin"` to
the preview-frame iframe. The sandbox blocks script execution inside
the iframe document, so `setupHighlighter`'s runtime injection of
`<script src="textHighlighter.js">` into the iframe head silently
no-opped -- the script tag existed but the library code never ran,
so `pWin.TextHighlighter` stayed undefined and selection never
produced highlight spans.

Fix mirrors how pdf.js is structured in this codebase: load the
trusted script in the parent renderer.html window (chrome-extension://
origin, hardened CSP), then operate on the iframe's DOM from the
parent. `allow-same-origin` permits cross-frame DOM access and
selection reads from the parent; only inside-iframe script execution
is blocked, which we don't need.

Changes:

  - renderer.html: add `<script src="textHighlighter.js">` before
    renderer.js so `window.TextHighlighter` is available in parent
    context, same loading pattern as pdf.combined.js.
  - renderer.ts initHighlighter / createHighlighterInstance: drop
    the runtime script injection path entirely. Construct directly
    from `(window as any).TextHighlighter` against
    `previewFrame.contentDocument.body` -- same constructor signature
    the library was designed for, just invoked from a different
    script context.
  - textHighlighter.js bindEvents / unbindEvents: route the
    keyup / mouseup / touchend listeners through
    `el.ownerDocument.defaultView` (the iframe's window) instead of
    bare `window` (the parent's window). When the highlighter is
    constructed from the parent and operates on a same-origin iframe,
    bare `window` resolves to the parent, missing events that fire
    inside the iframe. All other DOM access in the library already
    routes through `dom(el).get{Window,Document,Selection}()` helpers
    -- this is the only outlier, six lines patched + a clarifying
    comment.

No sandbox change. Security hardening from PR #651 remains intact:
script execution inside preview-frame is still blocked, so untrusted
content (page-author HTML in bookmark mode, provider-supplied iframe
HTML in oEmbed mode) cannot run JS. The highlighter is trusted code
that lives outside the sandbox and reaches in.

Verified manually: article mode on a regular page -- click highlight
button, drag-select text, span appears with #fefe56 background and
red-circle delete button; click delete button removes the highlight;
mode-switch and save preserve highlights via the existing
articleWorkingHtml path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per designer: more breathing room around the 1px Fluent dividers
between sidebar groups (mode buttons, meta fields, section picker).
Bumped `.sidebar-group` padding from 8px/8px to 12px/12px so each
gap goes from 8+8=16px around the divider to 12+12=24px. Designer
asked for 16/16 originally; 12/12 was the landing point after
checking PDF mode -- 16/16 pushed the "View in OneNote" success
banner against the window bottom and required scrolling. 12/12 is
still a clear improvement over 8/8 while leaving the post-clip
button in view.

Bumped the window-height cap from 900px to 980px to absorb the
+24px sidebar growth and keep the PDF-mode success banner comfortably
visible. Floor of 600px unchanged; smaller browsers still shrink to
fit. 980 stays well clear of typical 1080p displays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owning team prefers to keep the renderer-window height cap at the
original 900px and let the UX consequence of the roomier divider
spacing surface (PDF-mode "View in OneNote" success-banner button
sits below the fold and requires scrolling on caps-bound displays).
Sidebar group padding stays at 12px/12px from the prior commit so
the design ask is still in place; only the height-cap accommodation
is reverted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 had an @font-face declaration that loaded Segoe UI from OneNote's
own CDN, explicitly so Apple/Android users without local Segoe UI
would still render the intended Fluent typography. V2 dropped that
declaration and never replaced it -- Mac users have been falling
through the font-family stack to -apple-system (San Francisco)
since V2 launch, which renders noticeably differently from the
intended look. Designer flagged it.

Re-added the declaration verbatim from V1 (same
www.onenote.com/styles/segoeui.{eot,woff} URLs V1 used; verified
currently serving 200). Local() cascade hits first on Windows (no
network), remote fallback only on machines that don't have Segoe UI
installed.

V1 also declared @font-face for `Segoe UI Light`; V2's CSS never
references the Light variant -- skipped here as dead code. V1
referenced `Segoe UI Semibold` as a font-family value without ever
declaring an @font-face for it; V2 instead uses `font-weight: 600`
on the Regular face throughout, which the browser synthesizes
(matches V1's effective behavior on Mac since Mac doesn't have
Segoe UI Semibold locally either).

No manifest CSP change: the prior CSP omitted font-src entirely,
so font loading was already unrestricted -- the @font-face just
works under existing policy. Article-preview pages with their own
@font-face declarations (e.g. FontAwesome on a Gentoo page)
continue to render their icon fonts in the sandboxed preview iframe
as before.

Verified on Windows: with local() in place, no network request
(uses installed Segoe UI). Stripping local() temporarily forced
the remote fetch path; renderer DevTools' "Rendered Font:
Network resource" confirmed the CDN-served woff was used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant