skip zero-width characters when rendering text to a page #372
/cc @sebbASF
The test file in question has \LFCR in the middle of a string. I'm not sure how an embedded LF is supposed to be displayed, but macOS Preview shows a space.
A spec added in #370 included a PDF that visually included the following text:

> aaaabbbb

... but the content stream included a zero-width LF character between aaaa and bbbb, and pdf-reader text extraction looked like this:

> aaaa
> bbbb

This filters the zero-width LF out, so text output matches the visual appearance of the PDF:

> aaaabbbb
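As a rough illustration of the idea described above, the sketch below drops zero-width glyphs before the remaining characters are joined into text. The `Glyph` struct, the `skip_zero_width` helper, and the fixed width of 5.0 are hypothetical names and values chosen for this example; they are not taken from pdf-reader's code.

```ruby
# Hypothetical stand-in for a positioned glyph; not pdf-reader's actual class.
Glyph = Struct.new(:text, :x, :width)

# Drop glyphs whose computed width is zero so they never reach text layout.
def skip_zero_width(glyphs)
  glyphs.reject { |glyph| glyph.width.zero? }
end

# Build a run resembling the spec PDF: "aaaa", a zero-width LF, then "bbbb".
# The width of 5.0 per visible character is an arbitrary assumption.
glyphs = []
x = 0.0
"aaaa\nbbbb".each_char do |char|
  width = char == "\n" ? 0.0 : 5.0
  glyphs << Glyph.new(char, x, width)
  x += width
end

extracted = skip_zero_width(glyphs).map(&:text).join
puts extracted # prints "aaaabbbb"
```

Filtering on width rather than on character value means any glyph that is computed as zero-width is dropped, whatever character it represents.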
Well, I'm confident that rendering it as two lines isn't the best option! For now I think I'll merge this to get the suite green.

There is a small chance that skipping zero-width characters will also skip other characters that should be displayed, particularly if there are bugs in the character width calculation code. If that happens, the other option we could explore is adding LF (and maybe other whitespace?) to the ignore logic here: pdf-reader/lib/pdf/reader/page_text_receiver.rb Lines 117 to 119 in c849c06
However, dropping spaces causes issues with some PDFs, which can then be rendered without the necessary gaps between words. Unfortunately the only example I have cannot be made public.
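The trade-off raised in the two comments above can be sketched as follows. This is only an illustration: `IGNORED_CHARS` and `drop_ignored` are hypothetical names, and the code is not the logic at the referenced lines of `page_text_receiver.rb`.

```ruby
# Hypothetical ignore list keyed on character value; the real project's logic may differ.
IGNORED_CHARS = ["\n"].freeze

# Remove any character that appears in the ignore list.
def drop_ignored(chars, ignored = IGNORED_CHARS)
  chars.reject { |char| ignored.include?(char) }
end

chars = "aaaa\nbb cc".chars

# Ignoring only LF keeps the gap between words.
puts drop_ignored(chars).join              # prints "aaaabb cc"

# Adding the space to the list glues the words together.
puts drop_ignored(chars, ["\n", " "]).join # prints "aaaabbcc"
```

An ignore list based on character value does not depend on the width calculation, but widening it to include spaces removes gaps that some documents rely on.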
I've just tried opening textwraplfcr.pdf with the macOS Skim app. This also shows a space between the words.
Adobe Acrobat Reader and Foxit also show a space (on macOS).