Encode PDFString containing non-Latin characters #162

PlushBeaver · 2019-08-17T20:30:52Z

The ASCII representation of PDFString cannot handle code points above 127 (that is, beyond 7-bit ASCII). Such strings have to be encoded as Unicode (UTF-16 with a BOM). This implementation represents them as hex strings for ease of encoding.
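
As a rough illustration of the encoding described above (not the code from this PR), a string can be written as a PDF hex string by emitting the big-endian byte order mark `FEFF` followed by each UTF-16 code unit as four hex digits. The helper name `encodeAsPdfHexString` is hypothetical and is not part of pdf-lib.

```typescript
// Hypothetical helper for illustration only; not a pdf-lib API.
// Encodes text as UTF-16BE with a leading BOM, written as a PDF hex string.
function encodeAsPdfHexString(text: string): string {
  // FEFF is the big-endian byte order mark that flags the string as UTF-16.
  let hex = 'FEFF';
  for (let i = 0; i < text.length; i++) {
    // JavaScript strings are already sequences of UTF-16 code units, so
    // characters outside the BMP arrive here as two surrogate code units.
    const unit = text.charCodeAt(i).toString(16).toUpperCase();
    hex += ('000' + unit).slice(-4); // left-pad to four hex digits
  }
  return '<' + hex + '>';
}

// encodeAsPdfHexString('A')      // → "<FEFF0041>"
// encodeAsPdfHexString('\u042F') // → "<FEFF042F>" (Cyrillic "Я")
```

Because every code unit has a fixed width of four hex digits, no escaping rules are needed, which is presumably the "ease of encoding" mentioned above.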

Hopding · 2019-09-03T00:03:48Z

Hello @PlushBeaver! I just had a chance to review this. My apologies for the late response.

Thank you very much for the time and effort you spent on this! I have a couple of questions/concerns:

The PDFString class is meant to represent PDF literal strings, specifically. The PDFHexString class exists to represent hex strings. So I would prefer not to have a PDFString class encode itself as a hex string.
However, the PDF spec does allow for literal strings to contain characters outside the ASCII set via octal character codes. E.g. (Foo\053Bar) represents Foo+Bar. What do you think about implementing things this way?
PDF files should not contain non-ASCII characters within string literals (this would violate the spec). And checking whether they do every time a PDFString is created will negatively impact parsing performance (parsing is already a very intensive operation). So if we adopt these changes, we'll need a way to ensure they do not run when creating strings during document parsing.
Aside from checking for non-ASCII characters when parsing, what use cases are you thinking of that these changes would address?
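
A minimal sketch of the octal-escape idea raised above, under some assumptions: octal escapes in literal strings encode single bytes (0 to 255), so the encoder below works on bytes rather than Unicode code points, and the decoder handles only `\ddd` escapes (not `\n`, `\(`, and so on). The names `toPdfLiteral` and `decodeOctalEscapes` are hypothetical and are not pdf-lib APIs.

```typescript
// Hypothetical helpers for illustration only; not pdf-lib APIs.

// Writes bytes as a PDF literal string, using \ddd octal escapes for
// anything outside printable ASCII and backslash escapes for ( ) and \.
function toPdfLiteral(bytes: Uint8Array): string {
  let out = '(';
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    if (b === 0x28 || b === 0x29 || b === 0x5c) {
      out += '\\' + String.fromCharCode(b); // \( \) \\
    } else if (b < 0x20 || b > 0x7e) {
      out += '\\' + ('00' + b.toString(8)).slice(-3); // three octal digits
    } else {
      out += String.fromCharCode(b);
    }
  }
  return out + ')';
}

// Replaces \ddd escapes (one to three octal digits) with the byte they name.
// Values above 255 are masked to one byte.
function decodeOctalEscapes(body: string): string {
  return body.replace(/\\([0-7]{1,3})/g, (_match: string, digits: string) =>
    String.fromCharCode(parseInt(digits, 8) & 0xff),
  );
}

// decodeOctalEscapes('Foo\\053Bar')                       // → "Foo+Bar"
// toPdfLiteral(new Uint8Array([0x46, 0x6f, 0x6f, 0xe9])) // → "(Foo\351)"
```

Each escaped byte costs four characters, so this form stays compact only when non-ASCII bytes are rare.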

PlushBeaver · 2019-09-03T01:03:54Z

Thanks for the concerns raised, @Hopding.

I find the need for library users to choose between PDFString and PDFHexString impractical, because it exposes PDF internals and limitations (while providing flexibility if needed). Would you recommend using PDFHexString when there's any chance that the string contains characters outside ASCII? If so, this PR should not exist at all.

Otherwise, I must indeed take care of octal representation when decoding strings and consider using it when encoding. Hex seems preferable for non-English text (Cyrillic, CJK, etc.), while octal may be better for occasional special characters in mostly Latin text (diacritics, punctuation, etc.).
Parsing completely slipped my mind (see below). Are the parsing tests used as benchmarks? Note that decoding need not be done while parsing; it can be a lazy operation on a string stored "as is" (encoded).
The primary use case is inserting Unicode metadata and outline entries. I stumbled into this with pagedjs-cli. That tool uses an outdated version of pdf-lib, and it uses PDFStrings. However, you raised an interesting question about loading documents with such data.
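
The lazy-decoding idea mentioned above could be sketched as follows. This is an assumption about how it might be structured, not code from this PR or from pdf-lib: the class `LazyPdfString` and its members are hypothetical, and the decoder handles only octal escapes to keep the example short.

```typescript
// Hypothetical class for illustration only; not a pdf-lib API.
// The parser stores the raw (still encoded) text and does no scanning;
// decoding happens once, on first access.
class LazyPdfString {
  private readonly raw: string;
  private decoded: string | undefined = undefined;
  // Exposed only so the example can show that decoding runs once.
  decodeCount = 0;

  constructor(raw: string) {
    this.raw = raw; // stored "as is", so construction is O(1)
  }

  asString(): string {
    if (this.decoded === undefined) {
      this.decodeCount++;
      // Simplified decoder: resolves \ddd octal escapes only.
      this.decoded = this.raw.replace(
        /\\([0-7]{1,3})/g,
        (_match: string, digits: string) =>
          String.fromCharCode(parseInt(digits, 8) & 0xff),
      );
    }
    return this.decoded;
  }
}

// const s = new LazyPdfString('Foo\\053Bar'); // nothing decoded yet
// s.asString();                               // → "Foo+Bar"
```

With this shape, strings that are parsed but never read cost nothing beyond storing the raw text.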

Hopding · 2019-09-27T17:32:00Z

Hello @PlushBeaver! My apologies for the delayed response - I've been quite busy. But I haven't forgotten about this. I've been thinking about the best way to handle all of this. I finally made some decisions, and have been working on implementing things in #204 over the past few days. The changes I made in #204 are all based on your work here, but with a few differences. Please take a look at #204 and let me know if I missed anything that you solved here. If it looks good to you, then I'll merge it (closing both #162 and #204) and the changes will go out in the next pdf-lib release!

PlushBeaver · 2019-09-30T20:27:28Z

Hello, @Hopding. The specialized facilities from #204 are both handy and flexible enough not only to solve metadata encoding issues, but also to implement Unicode outlines using the new PDFHexString.fromText(string). This PR can indeed be closed in favor of #204. Thank you for the excellent library you provide!

Hopding · 2019-10-09T02:36:41Z

Version 1.2.0 is now published. It contains the changes from #204. The full release notes are available here.

You can install this new version with npm:

npm install pdf-lib@1.2.0

It's also available on unpkg:

Encode PDFString containing non-Latin characters

06218d5

PlushBeaver force-pushed the unicode-string branch from b4dbbbe to 06218d5 Compare August 17, 2019 23:33

Hopding mentioned this pull request Sep 27, 2019

Add metadata methods to PDFDocument #204

Merged

Hopding closed this in #204 Oct 1, 2019
