v1.3.0 — COLRv1 colour emoji, USE-lite shaper integration, true constant-memory streaming, UAX #9 X4–X5 overrides, pixel-diff visual regression, #48 CP-1252 fix

@Nizoka Nizoka released this 08 Jun 17:53
· 1 commit to main since this release
309d515

Closes issue #48 (CP-1252 extended characters not extractable under base-14
Helvetica) and delivers the
complete v1.3.0 roadmap with zero deferrals: COLRv1 colour emoji,
USE-lite shaper integration, true constant-memory streaming, UAX #9 X4–X5
character-level overrides, and a dual-mode pixel-diff visual-regression suite.
100% backward-compatible — every new feature is additive or opt-in, and
pre-existing PDFs are byte-identical for unchanged code paths.

Still zero runtime dependencies. 71 test files / 1982 tests, all green.

Highlights

  • feat(shaping): Telugu script (te). A new pure-JS GSUB/GPOS
    mini-shaper (src/shaping/telugu-shaper.ts) brings Telugu (~95 M speakers,
    ISO 15924 Telu, U+0C00–U+0C7F) to pdfnative’s 16 existing scripts. It builds
    virama-mediated conjunct clusters, forms subjoined-consonant ligatures via the
    shared gsub-driver, and positions above/below vowel signs and modifiers via
    the shared gpos-positioner — with no reph and no pre-base reordering
    (Telugu specifics). Bundled font pdfnative/fonts/noto-telugu-data.js (Noto
    Sans Telugu, OFL-1.1). Real-font shaping of తెలుగు / నమస్తే / క్షి /
    శ్రీ / జ్ఞ produces zero .notdef and correct conjuncts. Opt-in via
    registerFont('te', () => import('pdfnative/fonts/noto-telugu-data.js')).
    (src/shaping/telugu-shaper.ts)
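
As a rough illustration of what "virama-mediated conjunct clusters" means, the sketch below groups code points so that a consonant following a virama (U+0C4D) stays in the same cluster as the consonant before it. This is not the code in telugu-shaper.ts: splitTeluguClusters and the simplified code-point classes are hypothetical and cover common text only.

```typescript
// Simplified Telugu code-point classes (assumption: common text only).
const TELUGU_VIRAMA = 0x0c4d;

function isTeluguConsonant(cp: number): boolean {
  return cp >= 0x0c15 && cp <= 0x0c39;
}

function isTeluguMark(cp: number): boolean {
  // Candrabindu..visarga signs, nukta, dependent vowel signs, virama, length marks.
  return (cp >= 0x0c00 && cp <= 0x0c04) || cp === 0x0c3c || (cp >= 0x0c3e && cp <= 0x0c56);
}

// Hypothetical helper: groups code points into virama-mediated clusters.
function splitTeluguClusters(text: string): string[] {
  const clusters: string[] = [];
  let prev = -1;
  for (const ch of Array.from(text)) {
    const cp = ch.codePointAt(0) ?? 0;
    // A mark always attaches; a consonant attaches only right after a virama.
    const joinsPrevious =
      clusters.length > 0 &&
      (isTeluguMark(cp) || (isTeluguConsonant(cp) && prev === TELUGU_VIRAMA));
    if (joinsPrevious) {
      clusters[clusters.length - 1] += ch;
    } else {
      clusters.push(ch);
    }
    prev = cp;
  }
  return clusters;
}
```

Under these assumptions, క్షి and శ్రీ each form a single cluster, while తెలుగు splits into three.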

  • feat(shaping): Five underserved scripts — Amharic/Ethiopic (am),
    Sinhala (si), Tibetan (bo), Khmer (km), Myanmar (my).
    Five new pure-JS mini-shapers extend pdfnative from 17 to 22 Unicode scripts,
    following the Telugu model (shared gsub-driver + gpos-positioner,
    zero-dependency, pure functions). Ethiopic (U+1200–U+137F) is a syllabic
    abugida needing no reordering — detection + font routing only.
    Sinhala (src/shaping/sinhala-shaper.ts, U+0D80–U+0DFF) builds
    virama conjuncts, reorders the pre-base kombuva (U+0DD9-class), and
    decomposes two-part vowels. Tibetan
    (src/shaping/tibetan-shaper.ts, U+0F00–U+0FFF) performs vertical
    subjoined-consonant stacking. Khmer
    (src/shaping/khmer-shaper.ts, U+1780–U+17FF) is USE-lite — coeng
    subscripts, pre-base vowels, two-part vowel decomposition. Myanmar
    (src/shaping/myanmar-shaper.ts, U+1000–U+109F) is USE-lite — medials,
    pre-base medial-ra (U+103C) and e-vowel (U+1031), virama stacking. Khmer and
    Myanmar are pragmatic USE-lite implementations with documented limitations
    (two-part-vowel MultipleSubst is handled JS-side via shaper decomposition
    tables, not the OpenType extractor). Bundled fonts (all OFL-1.1):
    noto-ethiopic-data.js, noto-sinhala-data.js, noto-tibetan-data.js
    (Noto Serif Tibetan), noto-khmer-data.js, noto-myanmar-data.js. Opt-in via
    registerFont('am'|'si'|'bo'|'km'|'my', () => import('pdfnative/fonts/...')).
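
Pre-base reordering can be pictured as follows: a vowel sign that is stored after its consonant in logical order is moved in front of it for display. The sketch is a simplification and not the shapers' implementation; reorderPreBase is a hypothetical helper, it handles a single cluster, and it deliberately leaves out the Myanmar medial-ra (U+103C) and two-part vowel decomposition.

```typescript
// Vowel signs stored after the consonant but drawn before it.
const PRE_BASE_VOWELS = new Set<number>([
  0x0dd9, // Sinhala vowel sign kombuva
  0x1031, // Myanmar vowel sign E
  0x17c1, // Khmer vowel sign E
]);

// Hypothetical helper: converts one cluster from logical to visual order
// by moving any pre-base vowel sign to the front.
function reorderPreBase(cluster: number[]): number[] {
  const preBase = cluster.filter((cp) => PRE_BASE_VOWELS.has(cp));
  const rest = cluster.filter((cp) => !PRE_BASE_VOWELS.has(cp));
  return [...preBase, ...rest];
}
```

For example, the Myanmar cluster [U+1019, U+1031] becomes [U+1031, U+1019]; clusters without a pre-base vowel are returned in their original order.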

  • feat(core): Opt-in Unicode normalization (layout.normalize).
    PdfLayoutOptions.normalize?: 'NFC'|'NFD'|'NFKC'|'NFKD'|false (default
    false) applies native String.prototype.normalize to text before encoding,
    so decomposed input (e.g. combining diacritics) composes to the form the
    embedded font expects. Off by default → byte-identical output for existing
    callers. (src/core/encoding-context.ts)
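
The effect of the option can be seen with String.prototype.normalize alone. In this minimal sketch normalizeText is a hypothetical stand-in for whatever pdfnative does internally; only the option values and the default of false come from the notes above.

```typescript
type NormalizationForm = "NFC" | "NFD" | "NFKC" | "NFKD";

// Hypothetical helper mirroring the documented option values.
// `false` (the default) leaves the text untouched.
function normalizeText(text: string, form: NormalizationForm | false = false): string {
  return form === false ? text : text.normalize(form);
}

const decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT
const composed = normalizeText(decomposed, "NFC"); // single code point U+00E9
```

With NFC the two-code-point sequence collapses to the precomposed U+00E9, which is the form most Latin fonts map directly in their cmap.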

  • fix(crypto): CSPRNG-only randomness. PDF encryption now throws
    if no cryptographically secure random source (crypto.getRandomValues) is
    available, instead of silently falling back to Math.random. Encryption keys
    and IVs are never derived from a non-CSPRNG source.
    (src/core/pdf-encrypt.ts)
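
The fail-closed behaviour described above might look roughly like this. The helper names (requireCsprng, secureRandomBytes) and the error text are assumptions for illustration, not the contents of pdf-encrypt.ts; the random source is passed in explicitly so the sketch does not depend on a particular runtime.

```typescript
interface RandomSource {
  getRandomValues(array: Uint8Array): Uint8Array;
}

// Hypothetical helper: returns the source or throws.
// There is deliberately no Math.random fallback.
function requireCsprng(candidate: unknown): RandomSource {
  const source = candidate as Partial<RandomSource> | null | undefined;
  if (!source || typeof source.getRandomValues !== "function") {
    throw new Error("No cryptographically secure random source available");
  }
  return source as RandomSource;
}

// Hypothetical helper: fills a fresh buffer from the verified source.
function secureRandomBytes(length: number, candidate: unknown): Uint8Array {
  const source = requireCsprng(candidate);
  return source.getRandomValues(new Uint8Array(length));
}
```

In a browser or a recent Node.js release the caller would pass globalThis.crypto as the candidate.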

  • feat(core): Configurable document block limit (layout.maxBlocks).
    The previously hard-coded 10 000-block safety cap in assembleDocumentParts()
    is now configurable and the default raised to 100 000 (DEFAULT_MAX_BLOCKS).
    Large reports (e.g. multi-thousand-page medical or financial documents) no
    longer hit a spurious ceiling; callers can raise or lower it per document via
    layout.maxBlocks. The over-limit error now names the active limit and how to
    change it. (src/core/pdf-document.ts, src/core/pdf-layout.ts)

  • feat(parser): validatePdfUA() — PDF/UA structural validator. A new
    read-only, zero-byte-risk developer gate (ISO 14289-1) that parses a PDF and
    checks /MarkInfo /Marked, /StructTreeRoot + /ParentTree, /Metadata,
    /Lang, and per-page /MCID uniqueness. Returns
    { valid, errors, warnings }. Complements (does not replace) veraPDF.
    (src/parser/pdf-ua-validator.ts)

  • fix(shaping, colour emoji): No more tofu from selectors/joiners.
    Emoji variation selectors (VS-15/VS-16), ZWJ/ZWNJ joiners, and Fitzpatrick
    skin-tone modifiers that no registered font covers are now dropped during
    run-splitting instead of resolving to .notdef (the "tofu" box). Joiners are
    still preserved when an Indic shaper font maps them. New isZeroWidthFormat()
    predicate. (src/shaping/multi-font.ts, src/shaping/script-registry.ts)
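
A minimal sketch of the dropping rule, under the assumption that it applies to exactly the code points listed above. isDroppableEmojiFormat and dropUncoveredFormats are hypothetical names; the real isZeroWidthFormat() predicate may cover a different set.

```typescript
// Hypothetical predicate: selectors, joiners and skin-tone modifiers
// that are safe to drop when no font has a glyph for them.
function isDroppableEmojiFormat(cp: number): boolean {
  return (
    cp === 0x200c || // ZWNJ
    cp === 0x200d || // ZWJ
    cp === 0xfe0e || // VS-15 (text presentation)
    cp === 0xfe0f || // VS-16 (emoji presentation)
    (cp >= 0x1f3fb && cp <= 0x1f3ff) // Fitzpatrick skin-tone modifiers
  );
}

// Keeps a format code point only when some registered font covers it.
function dropUncoveredFormats(
  codepoints: number[],
  isCovered: (cp: number) => boolean,
): number[] {
  return codepoints.filter((cp) => !isDroppableEmojiFormat(cp) || isCovered(cp));
}
```

The isCovered callback stands in for the font lookup, which is how a joiner mapped by an Indic shaper font would survive the filter.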

  • fix(core, colour emoji): Computed Form /BBox (no clipping).
    renderColorGlyph() now derives each colour-glyph Form /BBox from the
    transformed contour bounds rather than the hard-coded em box, so colour emoji
    that dip below the baseline are no longer clipped. (src/core/pdf-color-glyph.ts)
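
Deriving a box from transformed contour bounds amounts to mapping every contour point through the glyph's affine matrix and taking the min/max. The sketch below shows that calculation only; transformedBBox, Point and Matrix are hypothetical names, and including off-curve control points (a conservative over-estimate) is an assumption rather than a statement about renderColorGlyph().

```typescript
type Point = { x: number; y: number };

// PDF-style affine matrix [a b c d e f]:
//   x' = a*x + c*y + e,  y' = b*x + d*y + f
type Matrix = [number, number, number, number, number, number];

// Hypothetical helper: returns [minX, minY, maxX, maxY] of the transformed points.
function transformedBBox(points: Point[], m: Matrix): [number, number, number, number] {
  if (points.length === 0) return [0, 0, 0, 0];
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const p of points) {
    const x = m[0] * p.x + m[2] * p.y + m[4];
    const y = m[1] * p.x + m[3] * p.y + m[5];
    minX = Math.min(minX, x);
    minY = Math.min(minY, y);
    maxX = Math.max(maxX, x);
    maxY = Math.max(maxY, y);
  }
  return [minX, minY, maxX, maxY];
}
```

A contour point at y = -200 font units yields a negative lower bound, which a fixed em box starting at the baseline would cut off.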

  • feat(fonts): COLRv1 colour emoji. Noto Color Emoji (OFL-1.1) is
    bundleable as a curated subset (pdfnative/fonts/noto-color-emoji-data.js,
    221 colour glyphs, 936 KB). COLR v0 solid layers and COLR v1 linear / radial
    gradients render as native PDF Form XObjects (/Shading Type 2/3 +
    /ExtGState constant-alpha), one indirect object per unique glyph,
    forward-referenced into every page /XObject. The COLR / CPAL / glyf
    parsers are self-written and zero-dependency. Opt-in:
    registerFont('emoji', () => import('pdfnative/fonts/noto-color-emoji-data.js')).
    When not registered, monochrome emoji (Noto Emoji, v1.1.0) is unchanged and
    documents are byte-identical. Sweep gradients + Porter-Duff compositing fall
    back gracefully to monochrome (documented limitation).
    (src/core/color-emoji.ts, src/fonts/colr-parser.ts, src/fonts/glyf-outline.ts)

  • feat(core): True constant-memory streaming. buildPDFStreamTrue()
    and buildDocumentPDFStreamTrue() assemble the PDF into its raw object/
    framing parts and yield fixed-size Uint8Array chunks while freeing each
    part as it is emitted — the fully-joined PDF binary never materialises in
    memory. Peak memory is bounded by the chunk size plus the single largest
    part (a content stream or embedded font subset). Byte-identical output to
    buildDocumentPDFBytes() / buildPDFBytes(). The v1.2.0 object-boundary
    variants (*StreamPageByPage) and fixed-size variants (*Stream) are
    retained. (src/core/pdf-stream-writer.ts)
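
The chunking idea can be sketched as a generator that walks the parts, copies bytes into a fixed-size buffer, and drops each part's reference once it has been consumed. This is an illustration, not pdf-stream-writer.ts: streamParts is a hypothetical name, it is synchronous for brevity (the real entry points are async generators), and it assumes each part is a binary string with one byte per character.

```typescript
// Hypothetical sketch: yields fixed-size chunks without joining the parts.
// Note: the input array is emptied in place as parts are consumed.
function* streamParts(parts: string[], chunkSize = 65536): Generator<Uint8Array> {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  let chunk = new Uint8Array(chunkSize);
  let filled = 0;
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i];
    parts[i] = ""; // release the reference so the part can be collected
    for (let j = 0; j < part.length; j++) {
      chunk[filled++] = part.charCodeAt(j) & 0xff;
      if (filled === chunkSize) {
        yield chunk;
        chunk = new Uint8Array(chunkSize); // never reuse a buffer already handed out
        filled = 0;
      }
    }
  }
  if (filled > 0) yield chunk.subarray(0, filled); // final, possibly short chunk
}
```

At any moment the generator holds one chunk buffer plus the part currently being copied, which matches the memory bound stated above.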

  • feat(shaping): UAX #9 X4–X5 overrides. resolveBidiRuns() now
    performs full character-level direction overrides inside LRO / RLO scopes —
    every codepoint within the scope is forced to strong L (LRO) or strong R
    (RLO) before the W/N/L rules run, not merely the base paragraph direction
    (the v1.2.0 behaviour). Nested embeddings and isolates recurse correctly.
    (src/shaping/bidi.ts)
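
The character-level override can be pictured as a stack walk: LRO and RLO push a forced direction, PDF pops it, and every other code point inherits whatever is on top. The sketch is a deliberate simplification of UAX #9 (no isolates, no depth limit, no level computation), and applyOverrides is a hypothetical name, not part of resolveBidiRuns().

```typescript
type Override = "L" | "R" | null;

const LRE = 0x202a;
const RLE = 0x202b;
const PDF = 0x202c;
const LRO = 0x202d;
const RLO = 0x202e;

// Hypothetical helper: tags each visible code point with the direction
// forced by the innermost enclosing LRO/RLO scope (null = not overridden).
function applyOverrides(text: string): Array<{ char: string; forced: Override }> {
  const stack: Override[] = [];
  const out: Array<{ char: string; forced: Override }> = [];
  for (const char of Array.from(text)) {
    const cp = char.codePointAt(0) ?? 0;
    if (cp === LRO) stack.push("L");
    else if (cp === RLO) stack.push("R");
    else if (cp === LRE || cp === RLE) stack.push(null); // embedding: no override
    else if (cp === PDF) stack.pop();
    else out.push({ char, forced: stack[stack.length - 1] ?? null });
  }
  return out;
}
```

Characters tagged "L" or "R" would then enter the later resolution rules as strong types, regardless of their own bidi class.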

  • feat(shaping): USE-lite shaper integration. The v1.2.0 cluster
    classifier (classifyUseCategory()) is now the joiner-classification
    authority across the Devanagari, Bengali, and Tamil shapers. Orphan ZWJ /
    ZWNJ no longer reach the cmap as .notdef; ZWJ between a halant/pulli and
    the next consonant continues a conjunct (half-form, eyelash-ra, ya-phalaa)
    while ZWNJ breaks it keeping a visible virama. (src/shaping/use-lite.ts)
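
The joiner rule reads naturally as a three-way decision. The sketch below encodes it for Devanagari only, with a simplified consonant range; classifyJoiner and the JoinerAction names are hypothetical and are not the classifyUseCategory() API, and treating an orphan joiner as dropped is an assumption about how it is kept away from the cmap.

```typescript
const ZWNJ = 0x200c;
const ZWJ = 0x200d;
const DEVANAGARI_VIRAMA = 0x094d;

function isDevanagariConsonant(cp: number): boolean {
  return cp >= 0x0915 && cp <= 0x0939;
}

type JoinerAction = "continue-conjunct" | "break-with-visible-virama" | "drop";

// Hypothetical helper: decides what a ZWJ/ZWNJ does given its neighbours.
function classifyJoiner(
  prev: number | undefined,
  joiner: number,
  next: number | undefined,
): JoinerAction {
  if (joiner !== ZWJ && joiner !== ZWNJ) {
    throw new RangeError("joiner must be ZWJ (U+200D) or ZWNJ (U+200C)");
  }
  const betweenViramaAndConsonant =
    prev === DEVANAGARI_VIRAMA && next !== undefined && isDevanagariConsonant(next);
  if (!betweenViramaAndConsonant) return "drop"; // orphan joiner
  return joiner === ZWJ ? "continue-conjunct" : "break-with-visible-virama";
}
```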

  • test(visual): Dual-mode pixel-diff visual regression. Two
    complementary guards over self-contained extreme-script fixtures (Tamil,
    Bengali + Devanagari, Arabic) built with the real bundled fonts:

    1. a glyph-position snapshot that extracts every show operator's font,
      size, baseline x/y, and glyph IDs into a committed JSON baseline; and
    2. a rendered-glyph pixel diff that parses the embedded FontFile2
      outlines, scan-fills the shaped glyphs at their positions into a
      grayscale bitmap, and compares against a committed PNG baseline
      (≤1% pixel tolerance) using a self-written, zero-dependency grayscale PNG
      encoder/decoder.

    A CI workflow (visual-regression.yml) runs both, gated on
    src/shaping/**, src/fonts/**, src/core/**, and
    fonts/**. (tests/visual/)
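
One plausible reading of the ≤1% tolerance is "at most 1% of pixels may differ from the baseline". The comparison below follows that reading; it is an assumption about the metric, and diffRatio / withinTolerance are hypothetical names rather than the helpers in tests/visual/.

```typescript
// Hypothetical helper: fraction of pixels whose grey value differs by more
// than `threshold` between two equally sized grayscale bitmaps.
function diffRatio(a: Uint8Array, b: Uint8Array, threshold = 0): number {
  if (a.length !== b.length) throw new RangeError("bitmap sizes differ");
  if (a.length === 0) return 0;
  let differing = 0;
  for (let i = 0; i < a.length; i++) {
    if (Math.abs(a[i] - b[i]) > threshold) differing++;
  }
  return differing / a.length;
}

// Passes when no more than `tolerance` (default 1%) of the pixels differ.
function withinTolerance(a: Uint8Array, b: Uint8Array, tolerance = 0.01): boolean {
  return diffRatio(a, b) <= tolerance;
}
```

A size mismatch is treated as a hard failure rather than a diff, since it usually means the fixture layout itself changed.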

  • fix(fonts, #48): CP-1252 extended characters. Base-14 Helvetica text now
    carries a /ToUnicode CMap, so the Windows-1252 high range (€ ‚ ƒ „ … † ‡ ˆ ‰
    Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ) is correctly extractable and searchable
    in any viewer. When a latin font is registered (Noto Sans VF), these glyphs
    additionally embed and render so the Euro sign et al. are visible, not
    viewer-tofu. Falls back to the correct WinAnsi byte (byte-stable) when no
    latin font is registered. (src/fonts/encoding.ts)
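
For orientation, the byte-to-Unicode pairs involved are the 27 assigned Windows-1252 code points in 0x80–0x9F. The sketch below turns that table into the bfchar section of a ToUnicode CMap; toUnicodeBfchar is a hypothetical helper, and the surrounding CMap boilerplate that a real /ToUnicode stream needs is omitted.

```typescript
// Windows-1252 bytes 0x80-0x9F and their Unicode code points
// (0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned).
const CP1252_HIGH: ReadonlyArray<[number, number]> = [
  [0x80, 0x20ac], [0x82, 0x201a], [0x83, 0x0192], [0x84, 0x201e], [0x85, 0x2026],
  [0x86, 0x2020], [0x87, 0x2021], [0x88, 0x02c6], [0x89, 0x2030], [0x8a, 0x0160],
  [0x8b, 0x2039], [0x8c, 0x0152], [0x8e, 0x017d], [0x91, 0x2018], [0x92, 0x2019],
  [0x93, 0x201c], [0x94, 0x201d], [0x95, 0x2022], [0x96, 0x2013], [0x97, 0x2014],
  [0x98, 0x02dc], [0x99, 0x2122], [0x9a, 0x0161], [0x9b, 0x203a], [0x9c, 0x0153],
  [0x9e, 0x017e], [0x9f, 0x0178],
];

// Upper-case hex, left-padded with zeros to `width` digits.
function hex(value: number, width: number): string {
  return ("00000000" + value.toString(16).toUpperCase()).slice(-width);
}

// Hypothetical helper: emits only the bfchar section of a ToUnicode CMap.
function toUnicodeBfchar(map: ReadonlyArray<[number, number]>): string {
  const lines = map.map(([byte, cp]) => `<${hex(byte, 2)}> <${hex(cp, 4)}>`);
  return [`${lines.length} beginbfchar`, ...lines, "endbfchar"].join("\n");
}
```

The Euro sign, for instance, comes out as the line `<80> <20AC>`, which is what lets a viewer map the WinAnsi byte back to U+20AC on copy or search.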

  • fix(core, tagged PDF): Per-line MCID allocation in wrapped table cells
    and multi-line table captions. Previously a single MCID was allocated per
    cell/caption block and reused on every wrapped line, so a wrapped cell emitted
    duplicate /MCID values inside one /TD / /TH / /Caption — a PDF/UA
    (ISO 14289-1 §7.10) structure violation flagged by downstream validators.
    Each wrapped line now receives a distinct, monotonically increasing MCID.
    Single-line cells, the legacy buildPDF() table path, headers, footers, and
    the TOC were already single-MCID and are byte-identical.
    (src/core/pdf-renderers.ts)
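
The violation is easy to detect in a content stream: the same /MCID number appears more than once. The check below is a minimal sketch over a plain-text stream using a regular expression; duplicateMcids is a hypothetical helper, not the validatePdfUA() implementation, and it does no real PDF tokenising.

```typescript
// Hypothetical helper: returns the MCID values that occur more than once
// in a single (already decoded) page content stream, in ascending order.
function duplicateMcids(contentStream: string): number[] {
  const seen = new Set<number>();
  const dupes = new Set<number>();
  const pattern = /\/MCID\s+(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(contentStream)) !== null) {
    const id = Number(match[1]);
    if (seen.has(id)) dupes.add(id);
    else seen.add(id);
  }
  return Array.from(dupes).sort((a, b) => a - b);
}
```

A wrapped cell that emits `/TD <</MCID 3>> BDC` for each of its lines would be reported as [3]; with per-line allocation the same cell carries distinct values and the result is empty.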

Added

  • API: validatePdfUA(bytes) => PdfUAValidationResult — read-only PDF/UA
    (ISO 14289-1) structural checker; PdfUAValidationResult exported from the
    root.

  • API: layout.maxBlocks?: number on PdfLayoutOptions and the exported
    DEFAULT_MAX_BLOCKS (100 000) constant.

  • API (shaping): shapeTeluguText, isTeluguCodepoint, containsTelugu,
    TELUGU_START, TELUGU_END, and isZeroWidthFormat exported from the root.

  • API (shaping): shapeSinhalaText, shapeTibetanText, shapeKhmerText,
    shapeMyanmarText, plus is{Ethiopic,Sinhala,Tibetan,Khmer,Myanmar}Codepoint,
    contains{Ethiopic,Sinhala,Tibetan,Khmer,Myanmar}, and the corresponding
    *_START / *_END range constants exported from the root.

  • API (core): layout.normalize?: 'NFC'|'NFD'|'NFKC'|'NFKD'|false on
    PdfLayoutOptions (default false).

  • fonts: bundled pdfnative/fonts/noto-telugu-data.{js,d.ts} (Noto Sans
    Telugu, OFL-1.1); scripts/download-fonts.ts gains the Noto Sans Telugu entry.

  • fonts: bundled pdfnative/fonts/noto-{ethiopic,sinhala,tibetan,khmer,myanmar}-data.{js,d.ts}
    (Noto Sans Ethiopic / Sinhala / Khmer / Myanmar + Noto Serif Tibetan, all
    OFL-1.1); scripts/download-fonts.ts gains the five corresponding entries.

  • samples: scripts/generators/currency-symbols.ts (base-14 €£¥¢ +
    embedded ₹₩₪₫₺₽₿ + Thai baht ฿ routed to the embedded Thai font +
    multi-currency table) verifies the #48 Euro fix end to end;
    alphabet-telugu script-coverage sample plus new Telugu document
    (doc-telugu.pdf), shaping (shaping-telugu.pdf) and multi-script
    font-subsetting coverage; the colour-emoji showcase now emits a third
    real-world document (color-emoji-real.pdf); plus five new
    script-coverage samples (alphabet-ethiopic, alphabet-sinhala,
    alphabet-tibetan, alphabet-khmer, alphabet-myanmar). Each of the five
    new scripts also gains a dedicated per-language document
    (doc-sinhala.pdf, doc-tibetan.pdf, doc-khmer.pdf, doc-myanmar.pdf,
    doc-amharic.pdf) at full parity with doc-telugu.pdf; the four shaper-backed
    scripts add text-shaping deep-dives (shaping-sinhala.pdf,
    shaping-tibetan.pdf, shaping-khmer.pdf, shaping-myanmar.pdf); and all
    five appear in the multi-script font-subsetting and 22-script multi-language
    showcases. npm run test:generate now produces 187 sample PDFs across 32
    generators.

  • API: buildPDFStreamTrue(params, layoutOptions?, streamOptions?) and
    buildDocumentPDFStreamTrue(params, layoutOptions?, streamOptions?)
    => AsyncGenerator<Uint8Array>. Both honour the existing
    StreamOptions.chunkSize (default 65 536 bytes) and reject TOC blocks and
    {pages} templates at the boundary (same constraints as the other streaming
    entry points).

  • API: the curated colour-emoji font module
    pdfnative/fonts/noto-color-emoji-data.js and its .d.ts. FontData gains
    an optional colorGlyphs field; new public types CpalColor, ColorStop,
    GradientExtend, SolidPaint, LinearGradientPaint, RadialGradientPaint,
    ColorPaint, ColorLayer, ColorGlyph exported from the root.

  • tooling: scripts/build-color-emoji-data.ts — converts the full
    NotoColorEmoji-Regular.ttf into a curated-subset data module via the COLR /
    CPAL parser and glyf subsetter (composite-aware, stable GIDs).
    scripts/download-fonts.ts gains the Noto Color Emoji entry.

  • samples: scripts/generators/color-emoji-showcase.ts (three colour-emoji
    PDFs: basic palette, mixed Latin+emoji, real-world status report) wired into
    npm run test:generate. scripts/generators/use-lite-showcase.ts
    renders the public classifyClusters() / classifyUseCategory() output for
    Indic clusters; streaming-showcase.ts gains buildPDFStreamTrue() /
    buildDocumentPDFStreamTrue() demos; bidi-embeddings-showcase.ts documents
    the X4–X5 LRO/RLO overrides.

  • docs: version numbers outside the live npm badges are now agnostic
    (single source of truth: docs/assets/versions.js + [data-pn-badge]); the
    nav logo no longer overflows the brand link (dedicated 1024 px breakpoint);
    the extreme-scripts playground gains UAX #9 embeddings and COLRv1
    colour-emoji presets; the medical playground is recalibrated to ~3.875
    pages/patient and adds 5 000- and 10 000-page stress options. A new
    all-scripts playground (docs/playgrounds/all-scripts.html) generates a
    single PDF containing all 22 Unicode scripts plus native COLRv1 colour emoji
    in the browser, showcasing automatic per-code-point font routing, BiDi,
    GSUB/GPOS shaping, and subsetting.

  • tests: tests/visual/ — fixtures, content/font extractor, glyf
    rasteriser, grayscale PNG codec, and the two visual-regression test files;
    baselines committed under tests/visual/baselines/.

Changed

  • package.json: added "./fonts/*" and "./package.json" to exports
    (the documented import('pdfnative/fonts/...') subpaths were previously not
    resolvable under Node's ESM exports map). Version bumped to 1.3.0. Still
    zero runtime dependencies.
  • refactor(core): buildPDF() and buildDocumentPDF() now delegate to
    new internal assembleTableParts() / assembleDocumentParts() helpers that
    return the raw string[] parts; the public builders simply .join('') the
    result. Byte-identical output; enables true streaming without a second
    assembly path.
  • test: 71 test files / 1982 tests, all green. New coverage: colour-emoji
    integration + module-shape, true-streaming byte-parity + constraints,
    Indic ZWJ/ZWNJ/eyelash/ya-phalaa edge cases, UAX #9 X4–X5 overrides,
    per-line MCID uniqueness in wrapped table cells/captions, the
    visual-regression suite, the Telugu mini-shaper, validatePdfUA, the
    configurable maxBlocks limit, and the colour-emoji selector/joiner-drop and
    computed-BBox fixes.

Downstream integration notes

  • New public APIs: buildPDFStreamTrue, buildDocumentPDFStreamTrue, the
    colour-emoji FontData.colorGlyphs field + colour paint types, the
    pdfnative/fonts/noto-color-emoji-data.js subpath, validatePdfUA (+
    PdfUAValidationResult), layout.maxBlocks / DEFAULT_MAX_BLOCKS, the Telugu
    shaper surface (shapeTeluguText, isTeluguCodepoint, containsTelugu,
    TELUGU_START/TELUGU_END), isZeroWidthFormat, and the
    pdfnative/fonts/noto-telugu-data.js subpath. No APIs were removed or
    changed in a breaking way.
  • pdfnative-mcp and pdfnative-cli reach 1.0.0 alongside this
    release; both pin pdfnative@^1.3.0. Colour emoji is opt-in in both via the
    existing font-registration surface.
  • Behaviour shifts: none for existing code paths. Colour emoji only
    activates when an emoji font with colorGlyphs is registered; the
    /ToUnicode addition for base-14 fonts is additive (improves extraction,
    does not change rendered glyphs).