Article-mode oEmbed extraction for video pages + V1 payload parity#658
Open
wikirby wants to merge 5 commits into
Open
Article-mode oEmbed extraction for video pages + V1 payload parity#658wikirby wants to merge 5 commits into
wikirby wants to merge 5 commits into
Conversation
V1's article-mode flow on video pages (YouTube, Vimeo) produced a save
payload with an embedded video iframe, citation, title/author caption,
and `<meta name="AutoPageTagsCodes" content="Article" />` /
`<meta name="AutoPageTags" content="Article" />` tags that OneNote's
page renderer uses to recognize the result as an article-style clip
with playable embeds. V2 shipped without any of that machinery -- the
article mode just ran Readability over the YouTube DOM, which strips
iframes and produces a text-only result with no player and no
description. Users reported the regression on YouTube specifically;
the same gap applied to Vimeo as well.
oEmbed standard provides exactly the shape we need (iframe `html`,
title, author_name, thumbnail_url, dimensions) without any
provider-specific scraping. Both YouTube and Vimeo publish CORS-enabled
oEmbed endpoints that the chrome-extension origin can fetch directly
under our existing `<all_urls>` host_permissions.
Changes:
- New `src/scripts/contentCapture/oembedExtractor.ts` -- thin module
with a provider table (YouTube + Vimeo only, matching V1's
SupportedVideoDomains), hostname-pattern matching, fetch + JSON
parse, and a small `sanitizeProviderHtml` helper that strips
script-execution surfaces from provider-supplied HTML.
- `extractArticle` in renderer now tries oEmbed first; on no-match
or fetch failure it falls through to the existing Readability
path with zero behavior change.
- Preview vs save are decoupled:
- Preview shows the `thumbnail_url` at the same 600x338 (16:9)
box the saved iframe uses, with title / "author . provider"
attribution, page description (og:description fallback chain
same as bookmark mode), and a CSS-only play-glyph overlay
when `type === "video"`. No iframe in preview because the
renderer's `preview-frame` is sandboxed (allow-same-origin)
and the YouTube/Vimeo player can't run JS inside it -- which
is why earlier attempts produced a broken "Unable to execute
JavaScript" placeholder.
- Save uses the provider's iframe HTML (sanitized), with
`data-original-src=<pageUrl>` injected and dimensions
normalized to 600x338 -- the marker OneNote's renderer uses
to recognize and render the embedded player on the saved
page, matching V1's YoutubeVideoExtractor behavior exactly.
- PageMetadata plumbing: renderer threads a `pageMetadata` map
through the save port message; worker's `buildPage` iterates
and emits `<meta name="K" content="V" />` for each entry.
Mirrors V1's `OneNoteApi.OneNotePage.getPageMetadataAsHtml`
behavior. Article mode (both oEmbed and Readability paths)
populates `AutoPageTagsCodes=Article`, `AutoPageTags=Article`,
plus title/author/siteName (oEmbed) or
title/description/author/siteName/publishedTime (Readability,
matching V1 augmentationHelper).
- `buildPage` HTML output realigned to V1
`OneNoteApi.OneNotePage.getEntireOnml` shape: no `<!DOCTYPE>`,
`<html xmlns="http://www.w3.org/1999/xhtml" lang=<locale>>`
(no quotes around lang -- matches V1 output literally), locale
via `chrome.i18n.getUILanguage()`. Same change applied to the
parallel `distHtml` builder for distributed-PDF saves so all
save paths emit the same shape.
- Bookmark thumbnail size fallback restored: `imageToDataUrl`
initial-encode is PNG (good for icons/logos), with iterative
JPEG-quality step-down when the encoded data URL exceeds the
OneNote API per-MIME-part limit (~2MB minus padding). Matches
V1's deleted `DomUtils.adjustImageQualityIfNecessary` behavior
including the 0.1 step size. Surfaced because the user was
hitting "400 Maximum request size exceeded" on bookmark-mode
saves of YouTube pages whose 1280x720 og:image PNG-encoded to
~2.5MB.
Provider scope is intentionally narrow (YouTube + Vimeo only) to
match V1's effective surface and avoid accidentally enabling capture
on sites V1 never supported. V1 also handled Khan Academy via regex
scrape for embedded YouTube IDs in lesson-page HTML; that markup
likely no longer matches modern Khan Academy pages and is skipped
here per maintainer direction.
Verified manually: YouTube watch page and Vimeo video page produce
saved OneNote pages with the embedded player, title/author caption,
and og:description text below; non-matching domains fall through to
Readability with no regression; bookmark mode on YouTube saves
successfully without the 400 limit error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a066275 to
60aea08
Compare
The article-mode highlighter has been non-functional since commit 16f62c0 (Security hardening) added `sandbox="allow-same-origin"` to the preview-frame iframe. The sandbox blocks script execution inside the iframe document, so `setupHighlighter`'s runtime injection of `<script src="textHighlighter.js">` into the iframe head silently no-opped -- the script tag existed but the library code never ran, so `pWin.TextHighlighter` stayed undefined and selection never produced highlight spans. Fix mirrors how pdf.js is structured in this codebase: load the trusted script in the parent renderer.html window (chrome-extension:// origin, hardened CSP), then operate on the iframe's DOM from the parent. `allow-same-origin` permits cross-frame DOM access and selection reads from the parent; only inside-iframe script execution is blocked, which we don't need. Changes: - renderer.html: add `<script src="textHighlighter.js">` before renderer.js so `window.TextHighlighter` is available in parent context, same loading pattern as pdf.combined.js. - renderer.ts initHighlighter / createHighlighterInstance: drop the runtime script injection path entirely. Construct directly from `(window as any).TextHighlighter` against `previewFrame.contentDocument.body` -- same constructor signature the library was designed for, just invoked from a different script context. - textHighlighter.js bindEvents / unbindEvents: route the keyup / mouseup / touchend listeners through `el.ownerDocument.defaultView` (the iframe's window) instead of bare `window` (the parent's window). When the highlighter is constructed from the parent and operates on a same-origin iframe, bare `window` resolves to the parent, missing events that fire inside the iframe. All other DOM access in the library already routes through `dom(el).get{Window,Document,Selection}()` helpers -- this is the only outlier, six lines patched + a clarifying comment. No sandbox change. Security hardening from PR #651 remains intact: script execution inside preview-frame is still blocked, so untrusted content (page-author HTML in bookmark mode, provider-supplied iframe HTML in oEmbed mode) cannot run JS. The highlighter is trusted code that lives outside the sandbox and reaches in. Verified manually: article mode on a regular page -- click highlight button, drag-select text, span appears with #fefe56 background and red-circle delete button; click delete button removes the highlight; mode-switch and save preserve highlights via the existing articleWorkingHtml path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per designer: more breathing room around the 1px Fluent dividers between sidebar groups (mode buttons, meta fields, section picker). Bumped `.sidebar-group` padding from 8px/8px to 12px/12px so each gap goes from 8+8=16px around the divider to 12+12=24px. Designer asked for 16/16 originally; 12/12 was the landing point after checking PDF mode -- 16/16 pushed the "View in OneNote" success banner against the window bottom and required scrolling. 12/12 is still a clear improvement over 8/8 while leaving the post-clip button in view. Bumped the window-height cap from 900px to 980px to absorb the +24px sidebar growth and keep the PDF-mode success banner comfortably visible. Floor of 600px unchanged; smaller browsers still shrink to fit. 980 stays well clear of typical 1080p displays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owning team prefers to keep the renderer-window height cap at the original 900px and let the UX consequence of the roomier divider spacing surface (PDF-mode "View in OneNote" success-banner button sits below the fold and requires scrolling on caps-bound displays). Sidebar group padding stays at 12px/12px from the prior commit so the design ask is still in place; only the height-cap accommodation is reverted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 had an @font-face declaration that loaded Segoe UI from OneNote's own CDN, explicitly so Apple/Android users without local Segoe UI would still render the intended Fluent typography. V2 dropped that declaration and never replaced it -- Mac users have been falling through the font-family stack to -apple-system (San Francisco) since V2 launch, which renders noticeably differently from the intended look. Designer flagged it. Re-added the declaration verbatim from V1 (same www.onenote.com/styles/segoeui.{eot,woff} URLs V1 used; verified currently serving 200). Local() cascade hits first on Windows (no network), remote fallback only on machines that don't have Segoe UI installed. V1 also declared @font-face for `Segoe UI Light`; V2's CSS never references the Light variant -- skipped here as dead code. V1 referenced `Segoe UI Semibold` as a font-family value without ever declaring an @font-face for it; V2 instead uses `font-weight: 600` on the Regular face throughout, which the browser synthesizes (matches V1's effective behavior on Mac since Mac doesn't have Segoe UI Semibold locally either). No manifest CSP change: the prior CSP omitted font-src entirely, so font loading was already unrestricted -- the @font-face just works under existing policy. Article-preview pages with their own @font-face declarations (e.g. FontAwesome on a Gentoo page) continue to render their icon fonts in the sandboxed preview iframe as before. Verified on Windows: with local() in place, no network request (uses installed Segoe UI). Stripping local() temporarily forced the remote fetch path; renderer DevTools' "Rendered Font: Network resource" confirmed the CDN-served woff was used. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
V1's article-mode flow on YouTube/Vimeo produced a saved OneNote page with an embedded video iframe, citation, title/author caption, and
AutoPageTags*meta tags that OneNote's renderer uses to recognize the result as an article-style clip with playable embeds. V2 shipped without that machinery — clicking Article on a YouTube page just ran Readability, which strips iframes and produces a text-only result with no player and no description. Users reported the regression.This PR restores V1's user-visible behavior using the oEmbed standard instead of V1's per-provider URL parsing. Both YouTube and Vimeo publish CORS-enabled oEmbed endpoints that the chrome-extension origin can fetch directly under existing
<all_urls>host_permissions.Provider scope
Narrowed to V1 parity: YouTube + Vimeo only.
V1 also handled Khan Academy via regex scrape of lesson-page HTML for embedded YouTube IDs — markup likely no longer matches modern Khan Academy pages; skipped per maintainer direction. Additional providers may be added as a focused capability-expansion PR in the future.
Changes
src/scripts/contentCapture/oembedExtractor.ts(new)sanitizeProviderHtmlhelper. Returns rawOEmbedData(renderer composes preview & save separately).renderer.tsextractArticlerenderer.tscomposeOEmbedForPreview<h3>, "author · provider" attribution, og:description, CSS-only▶overlay on video-type thumbnails. No nested iframe in preview because preview-frame is sandboxed (allow-same-origin) and the player can't execute JS inside it.renderer.tscomposeOEmbedForSavedata-original-src=<pageUrl>injected and dimensions normalized to 600x338 — the marker OneNote's renderer uses to recognize the embed (matches V1'sYoutubeVideoExtractor.createEmbeddedVideoFromId).renderer.tssave dispatchpageMetadatamap through the save port message. Article paths (oEmbed + Readability) populateAutoPageTagsCodes=Article,AutoPageTags=Article, plus title/author/siteName + description.webExtensionWorker.tsbuildPageOneNoteApi.OneNotePage.getEntireOnml: no<!DOCTYPE>,<html xmlns="http://www.w3.org/1999/xhtml" lang=<locale>>(no quotes around lang — matches V1 literally), iteratespageMetadatato emit one<meta>per entry. Same shape applied to the paralleldistHtmlbuilder for distributed-PDF saves.renderer.tsimageToDataUrlDomUtils.adjustImageQualityIfNecessary. Surfaced because bookmark mode on YouTube was hitting400 Maximum request size exceeded(1280x720 og:image PNG-encoded ~2.5MB).Follow-up fixes (separate commits)
9bc0549): Pre-existing regression from PR Client-side migration: unified renderer window with self-contained sign-in #651's sandbox hardening. Library now loads as a parent-window script (same pattern as pdf.js) and operates on the iframe DOM viaallow-same-originrather than executing inside it. SmalltextHighlighter.jspatch routes event bindings throughel.ownerDocument.defaultView. Sandbox stays tight.ca6ac23+a292926): Per designer, more breathing room around the Fluent dividers —.sidebar-grouppadding 8px → 12px. Window-height cap stays at 900 per owning team (PDF-mode success-banner button may sit under the fold on caps-bound displays; UX trade-off accepted).fe40b3d): Re-adds the V1@font-facedeclaration so non-Windows users get Segoe UI from OneNote's CDN instead of falling through to-apple-system(San Francisco). Local() cascade keeps Windows on the installed font, no network request. No CSP change — font loading was already unrestricted.Test plan
npm run build:prodclean (zero TS errors, tslint passes)imageToDataUrlJPEG fallback)Out of scope (deferred)
@font-faceforSegoe UI Semiboldif designer flags it)Demo
🤖 Generated with Claude Code