fix(text): preserve stream order for RTL-placed design-tool PDFs #25

Merged
Mythie merged 1 commit into LibPDF-js:main from ljagiello:fix/rtl-placed-text-extraction
Feb 12, 2026
@ljagiello

Summary

  • Add sequenceIndex field to ExtractedChar to preserve content stream extraction order
  • Detect RTL-placed text lines (design tools like Figma/Canva place LTR characters right-to-left via TJ positioning adjustments) and use stream order instead of position-based sorting
  • Add regression test with synthetic fixture PDF that reproduces the bug

Problem

Design tools export PDFs where characters have near-zero glyph widths and all positioning is done via positive TJ adjustments (which move the pen left). Characters appear in correct reading order in the content stream, but their x-positions decrease monotonically. The line grouper unconditionally sorted characters by x-position, reversing the text.

For example, "Lorem ipsum dolor sit amet" would be extracted as "tema tis rolod muspi meroL".
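
A minimal sketch of the failure mode, not the library's code: `PlacedChar`, `placeRightToLeft`, `joinByPosition`, and `joinByStreamOrder` are hypothetical names, and the coordinates are made up. It lays out characters in reading order with decreasing x-positions, then compares position-based sorting with stream order.

```typescript
// Hypothetical minimal shape; the real ExtractedChar carries more fields.
interface PlacedChar {
  char: string;
  x: number;
  sequenceIndex: number;
}

// Simulate a design-tool export: characters arrive in reading order,
// but each one is placed to the LEFT of the previous one.
function placeRightToLeft(text: string, startX = 500, step = 6): PlacedChar[] {
  return [...text].map((char, i) => ({
    char,
    x: startX - i * step,
    sequenceIndex: i,
  }));
}

// What an unconditional x-sort does: ascending x reverses the line.
function joinByPosition(chars: PlacedChar[]): string {
  return [...chars]
    .sort((a, b) => a.x - b.x)
    .map((c) => c.char)
    .join("");
}

// Preserving content stream order keeps the text readable.
function joinByStreamOrder(chars: PlacedChar[]): string {
  return [...chars]
    .sort((a, b) => a.sequenceIndex - b.sequenceIndex)
    .map((c) => c.char)
    .join("");
}

const line = placeRightToLeft("Lorem ipsum dolor sit amet");
console.log(joinByPosition(line)); // "tema tis rolod muspi meroL"
console.log(joinByStreamOrder(line)); // "Lorem ipsum dolor sit amet"
```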

Approach

Rather than removing position-based sorting entirely (which could break multi-column or out-of-order PDFs), we detect RTL-placed lines by checking if x-positions in stream order are predominantly decreasing (>= 80% of consecutive pairs). When detected, we preserve content stream order. For normal PDFs where stream order and position order agree, behavior is unchanged.
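
The detection described above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: `CharLike` and `isRtlPlaced` are hypothetical names, while `sequenceIndex` and the 0.8 threshold come from this PR.

```typescript
// Hypothetical minimal shape; only sequenceIndex is a field named in this PR.
interface CharLike {
  char: string;
  x: number;
  sequenceIndex?: number;
}

// Share of consecutive stream-order pairs that must move leftward.
const RTL_PLACED_THRESHOLD = 0.8;

function isRtlPlaced(chars: CharLike[]): boolean {
  // A single character has no direction.
  if (chars.length < 2) return false;

  // Without a sequenceIndex on every char we cannot judge stream order.
  const indexed: { x: number; seq: number }[] = [];
  for (const c of chars) {
    if (c.sequenceIndex === undefined) return false;
    indexed.push({ x: c.x, seq: c.sequenceIndex });
  }
  indexed.sort((a, b) => a.seq - b.seq);

  // Count consecutive pairs whose x-position decreases in stream order.
  let decreasing = 0;
  for (let i = 1; i < indexed.length; i++) {
    if (indexed[i].x < indexed[i - 1].x) decreasing++;
  }
  return decreasing / (indexed.length - 1) >= RTL_PLACED_THRESHOLD;
}
```

Using a ratio rather than requiring strictly monotonic positions tolerates the occasional kerning adjustment that nudges a glyph the other way, while normal left-to-right lines stay far below the threshold.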

This approach is informed by how other PDF libraries handle this:

  • pdf.js: Always preserves stream order (never reorders)
  • PDFBox: Defaults to stream order, position sort is opt-in (sortByPosition)
  • MuPDF: Defaults to stream order, opt-in position sorting

Test plan

  • Regression test: src/integration/text/rtl-placed-text.test.ts verifies correct extraction from RTL-placed fixture
  • All 2852 existing tests pass with zero regressions
  • Typecheck clean
  • Lint clean

Fixes #24

@vercel

vercel bot commented Feb 11, 2026

@ljagiello is attempting to deploy a commit to the mythie's projects Team on Vercel.

A member of the Team first needs to authorize it.

@ljagiello force-pushed the fix/rtl-placed-text-extraction branch 2 times, most recently from 0630cef to d4634cb on February 11, 2026 21:31

Design tools like Figma and Canva emit LTR text with right-to-left TJ
placement. Sorting by x-position reverses the text. Detect this pattern
via sequenceIndex and preserve content stream order instead.

- Add RTL_PLACED_THRESHOLD constant (0.8) with documentation
- Return OrderedLine { chars, rtlPlaced } from orderLineChars
- Fix gap calculation in groupIntoSpans for RTL-placed lines
- Fix createSpaceChar bbox positioning for RTL-placed lines
- Use fractional sequenceIndex (n + 0.5) for synthetic spaces
- Make sequenceIndex optional on ExtractedChar
- Guard against missing sequenceIndex (fall back to x-sort)
- Document that heuristic correctly handles genuine RTL text
- Document mixed bidi limitation (needs full bidi algorithm)
- Add 12 unit tests for RTL-placed detection edge cases
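
A rough sketch of how the items above might fit together, assuming a simplified character shape. `OrderedLine`, `orderLineChars`, `rtlPlaced`, `sequenceIndex`, and `RTL_PLACED_THRESHOLD` are names from this commit message; `Char`, `gapBetween`, `syntheticSpace`, and the exact gap and bbox arithmetic are assumptions for illustration only.

```typescript
// Hypothetical minimal shape; the real ExtractedChar carries a full bbox.
interface Char {
  char: string;
  x: number;
  width: number;
  sequenceIndex?: number;
}

interface OrderedLine {
  chars: Char[];
  rtlPlaced: boolean;
}

const RTL_PLACED_THRESHOLD = 0.8;

function orderLineChars(line: Char[]): OrderedLine {
  const byX = (): OrderedLine => ({
    chars: [...line].sort((a, b) => a.x - b.x),
    rtlPlaced: false,
  });

  // Guard: if any sequenceIndex is missing, fall back to x-sort.
  if (line.length < 2 || line.some((c) => c.sequenceIndex === undefined)) {
    return byX();
  }

  const byStream = [...line].sort(
    (a, b) => (a.sequenceIndex ?? 0) - (b.sequenceIndex ?? 0),
  );
  let decreasing = 0;
  for (let i = 1; i < byStream.length; i++) {
    if (byStream[i].x < byStream[i - 1].x) decreasing++;
  }
  const rtlPlaced = decreasing / (byStream.length - 1) >= RTL_PLACED_THRESHOLD;
  return rtlPlaced ? { chars: byStream, rtlPlaced: true } : byX();
}

// Assumed gap rule: on an RTL-placed line the next char sits to the left,
// so the gap runs from the next char's right edge to the previous char's left edge.
function gapBetween(prev: Char, next: Char, rtlPlaced: boolean): number {
  return rtlPlaced
    ? prev.x - (next.x + next.width)
    : next.x - (prev.x + prev.width);
}

// A synthetic space takes a fractional sequenceIndex (n + 0.5) so it sorts
// between its neighbours in stream order.
function syntheticSpace(prev: Char, next: Char, rtlPlaced: boolean): Char {
  return {
    char: " ",
    x: rtlPlaced ? next.x + next.width : prev.x + prev.width,
    width: gapBetween(prev, next, rtlPlaced),
    sequenceIndex: (prev.sequenceIndex ?? 0) + 0.5,
  };
}
```

Returning `rtlPlaced` alongside the ordered characters lets later stages such as span grouping choose the matching gap direction without re-running the detection.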

@ljagiello force-pushed the fix/rtl-placed-text-extraction branch from d4634cb to dc35d21 on February 11, 2026 21:50

@Mythie left a comment

Looks awesome! 🙌🏻

@Mythie merged commit e5f0671 into LibPDF-js:main on Feb 12, 2026
1 check failed
@ljagiello deleted the fix/rtl-placed-text-extraction branch on February 12, 2026 04:01
Linked issue: bug(text): extractText() reverses text from design-tool PDFs (Figma, Canva)