skip zero-width characters when rendering text to a page #372

Merged
merged 1 commit into main from skip-zero-width-characters on Oct 23, 2021

Conversation

yob
Owner

@yob yob commented Oct 22, 2021

A spec added in #370 included a PDF that visually included the following text:

aaaabbbb

... but the content stream included a zero-width LF character between aaaa and bbbb, so pdf-reader's text extraction looked like this:

aaaa
bbbb

This filters the zero-width LF out, so text output matches the visual appearance of the PDF:

aaaabbbb
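
A minimal sketch of the filtering idea, assuming a hypothetical `Glyph` struct and `visible_glyphs` helper. Neither is pdf-reader's actual API; they only illustrate dropping any glyph whose computed width is zero before the text is assembled:

```ruby
# Hypothetical stand-in for a positioned glyph; not pdf-reader's TextRun.
Glyph = Struct.new(:x, :y, :width, :text)

# Drop glyphs whose computed width is zero, since they occupy no space
# on the rendered page.
def visible_glyphs(glyphs)
  glyphs.reject { |glyph| glyph.width.zero? }
end

glyphs = [
  Glyph.new(0, 0, 5.0, "aaaa"),
  Glyph.new(20, 0, 0.0, "\n"), # the zero-width LF from the content stream
  Glyph.new(20, 0, 5.0, "bbbb"),
]

puts visible_glyphs(glyphs).map(&:text).join
```

The trade-off is that any glyph with a miscalculated width of zero would be dropped as well.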

@yob
Owner Author

yob commented Oct 22, 2021

/cc @sebbASF

@sebbASF
Contributor

sebbASF commented Oct 23, 2021

The test file in question has \LFCR in the middle of a string.
The \LF is a permitted line-wrap, so removing that leaves a bare CR.
This in turn must be converted to LF according to the spec.
That's what the test was trying to check, but it perhaps should not have relied on how the page renderer would treat it.
The removal of \LF and replacement of CR by LF is tested in parser_spec.
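
A rough sketch of those two rules, assuming a hypothetical `normalise_literal_string` helper. This is not pdf-reader's parser code, and it ignores every other escape sequence (including an escaped backslash):

```ruby
# Hypothetical helper illustrating two literal-string rules:
#   1. a backslash followed by an end-of-line marker is a line continuation
#      and is removed entirely;
#   2. any remaining end-of-line marker (CR, LF or CRLF) is read as a single LF.
def normalise_literal_string(raw)
  raw
    .gsub(/\\(\r\n|\n|\r)/, "") # rule 1: strip backslash + EOL
    .gsub(/\r\n|\r/, "\n")      # rule 2: bare CR or CRLF becomes LF
end

# Backslash, LF, CR in the middle of the string, as in the test file.
raw = "aaaa\\\n\rbbbb"
p normalise_literal_string(raw)
```

For this input the backslash and LF are consumed as the line wrap, which leaves a bare CR that is then converted to LF.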

I'm not sure how an embedded LF is supposed to be displayed, but macOS Preview shows a space.
So the test is currently wrong, but not in the way the PR currently suggests.

@yob yob force-pushed the skip-zero-width-characters branch from 200ac72 to d282fe9 Compare October 23, 2021 00:29
@yob
Owner Author

yob commented Oct 23, 2021

Yeah, interesting. This is how evince renders the page (using libpoppler):

Screenshot from 2021-10-23 11-32-48

pdftotext (also using libpoppler) extracts it as a single line as well.

@yob
Owner Author

yob commented Oct 23, 2021

Firefox (using pdf.js) renders like this:

Screenshot from 2021-10-23 11-34-24

Chrome renders it as a space though:

Screenshot from 2021-10-23 11-36-01

@sebbASF
Contributor

sebbASF commented Oct 23, 2021

Screenshot 2021-10-23 at 01 35 45

This is what Preview shows

@yob
Owner Author

yob commented Oct 23, 2021

Well, I'm confident that rendering it as two lines isn't the best option!

For now I think I'll merge this to get the suite green. There is a small chance that skipping zero-width characters will also skip other characters that should be displayed, particularly if there are bugs in the character width calculation code.

If that happens, the other option we could explore is adding LF (and maybe other whitespace?) to the ignore logic here:

unless utf8_chars == SPACE
  @characters << TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
end
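
A sketch of what that alternative might look like, assuming a hypothetical `IGNORED_CHARS` list and `ignorable?` helper (neither exists in pdf-reader; the snippet above only compares against `SPACE`):

```ruby
# Hypothetical list of characters to skip when collecting text runs.
# Only the space check is taken from the existing code; LF is the proposed addition.
IGNORED_CHARS = [" ", "\n"].freeze

# True when a decoded character should not produce a text run.
def ignorable?(utf8_chars)
  IGNORED_CHARS.include?(utf8_chars)
end

# Simulate collecting runs from decoded characters.
runs = ["aaaa", "\n", "bbbb"].reject { |chars| ignorable?(chars) }
puts runs.join
```

This matches on character value rather than glyph width, so it would not be affected by errors in the width calculation.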

@yob yob merged commit be4dddb into main Oct 23, 2021
@yob yob deleted the skip-zero-width-characters branch October 23, 2021 00:43
@sebbASF
Contributor

sebbASF commented Oct 23, 2021

However, dropping spaces causes issues with some PDFs, which can then be rendered without the necessary gaps between words. Unfortunately, the only example I have cannot be made public.

@sebbASF
Contributor

sebbASF commented Oct 24, 2021

I've just tried opening textwraplfcr.pdf with the macOS Skim app.

This also shows a space between the words.

@sebbASF
Contributor

sebbASF commented Oct 24, 2021

Adobe Acrobat Reader and Foxit also show a space (on macOS)
