Add Encoding to Font Dictionary #28

jerp · 2018-08-28T14:54:14Z

Hopding · 2018-08-30T00:39:13Z

Hello @jerp. Thank you for submitting this PR! I've done a little bit of research on this and it looks like you're on the right track. However, I think a few more things will need to be done before this is complete.

pdf-lib currently writes text using Literal Strings. However, this means only ASCII characters can be used. This is obviously a severe restriction. What pdf-lib should do is convert strings of text into their character codes, and write them as Hex Strings. This is necessary to fully comply with the WinAnsiEncoding.

For example, this is how the text Hello World! would look as a Literal String:

(Hello World!)

And this is how it would look as a Hex String:

<48656C6C6F20576F726C6421>

Notice that each pair of numbers in the hex string is the hexadecimal representation of the corresponding character in the Literal String:

'Hello World!'
  .split('')
  .map(x => x.charCodeAt(0))
  .map(c => c.toString(16))
> [ '48', '65', '6c', '6c', '6f', '20', '57', '6f', '72', '6c', '64', '21' ]

However, JavaScript's charCodeAt method does not return WinAnsiEncoding character codes. It returns UTF-16 code points. Therefore, some conversion is necessary. This webpage shows a comparison of the two encodings. pdfkit also has a nice mapping object that could be reused in pdf-lib as well: https://github.com/foliojs/pdfkit/blob/master/lib/font/afm.coffee#L58-L85.

Please let me know if you're interested in working on these additional changes. I'm happy to help point you in the right direction and find your way around the pdf-lib codebase. Of course, I also understand if it is more than you're interested in doing.

Here are some other useful bits of pdfkit code:

Here are some of the relevant sections of the PDF specification:

7.3.4.2: Literal Strings
7.3.4.3: Hexadecimal Strings
9.6.2: Type 1 Fonts
Annex D: Character Sets and Encodings

jerp · 2018-08-30T11:07:23Z

I committed a version that uses the mapping of PDFKit. I did not check this mapping but being a user of this library for many years, I trust it better than myself.

The way i implemented it, requires to send a PDFHexString instance to drawText instead of a string. This Hex String is encoded properly. See example where I modified the following lines:

 const [helveticaFontRef, helveticaFont] = pdfDoc.embedStandardFont(HELVETIVA_FONT);
 ...
 helveticaFont.encode('Olé! - Œ')[0]

embedStandardFont now returns a tuple where the second argument is a text encoder, in this case, helveticaFont.

helveticaFont.encode should return a tuple, the first element is the encoded text. I believe, a second argument should give information about position advances (see pdfkit).

Although, I'm an experience Javascript developer, I know little about Typescript, so your input is most than welcome.

The next thing I'm looking into is Font subsetting. This is an important requirement for me (file size).

Hopding · 2018-08-30T12:49:11Z

Wow, this is great Jerome! You churned this out very quickly. I'm impressed.

I'll do a full review later this evening. It looks good overall. There are a few stylistic and linting related things that will need tweaked. I'd also like for the drawText operator to automatically encode text if possible. But I haven't fully thought through how that might work. Perhaps there are reasons this would be difficult that I haven't considered?

P.S. You may have already found this, but the easiest way to try out code changes is by running the integration tests with yarn it:node. You can control which of the tests get run by commenting out the exports of tests you don't wish to run in the index.ts file.

jerp · 2018-08-30T13:45:21Z

I didn't know yarn it:node, I like it! All tests passed.

About the above, implementing it within drawText, would require this function to know what font is being used. For now, it only knows about the font name given at page level. When subsetting fonts, this gets more important since glyphs need to be collected prior to encode the font.

Hopding · 2018-08-30T23:29:41Z

What do you think about passing the font object to the drawText operator, and encoding text "behind the scenes"?

For example, this is how it works right now:

const [helveticaFontRef, helveticaFont] = pdfDoc.embedStandardFont(HELVETIVA_FONT); 
// ...
drawText(helveticaFont.encode('Olé! - Œ')[0], {
  x: PAGE_1_WIDTH * 0.5 -  30,
  y: PAGE_1_HEIGHT - 48 - 30,
  font: HELVETIVA_FONT,
  size: 12,
}),

But it could potentially work like this:

const [helveticaFontRef, helveticaFont] = pdfDoc.embedStandardFont(HELVETIVA_FONT); 
// ...
drawText('Olé! - Œ', {
  x: PAGE_1_WIDTH * 0.5 -  30,
  y: PAGE_1_HEIGHT - 48 - 30,
  font: HELVETIVA_FONT,
  fontObject: helveticaFont,
  size: 12,
}),

My primary concern here is not technical. Clearly, it works either way. I'm just thinking about the API design. It's a balancing act between creating APIs that are too low-level and confusing for people to quickly grok and use, vs creating a super slick high-level API that requires more effort to maintain.

In this case, neither of these APIs is super elegant and easy to use. However, I think passing the font object will be easier to document, and users will be less likely to make mistakes this way. It also paves the way for supporting automatic text wrapping by the drawText operator, because it would now have access to the font object required to measure strings.

I'd like to hear your thoughts on this. And of course, I'm open to alternative solutions as well.

Hopding · 2018-08-30T23:34:16Z

Can you please run yarn lint on this as well? This will check the files to ensure they comply with the project's linting standards. It will also automatically fix any formatting changes that it can, so be sure to commit any changes you've made before running it.

P.S. You can prevent the linter from reformatting a block of code by adding a // prettier-ignore comment. But this should be used very sparingly. The linter is usually right.

Hopding

Can this comment be moved in the PDFStandardFontFactory along with the font dictionary?

Erronous Comments

src/core/pdf-document/PDFDocument.ts

Hopding · 2018-08-30T23:41:49Z

src/core/pdf-structures/factories/PDFStandardFontFactory.ts

+
+import PDFFontEncoder from 'core/pdf-structures/factories/PDFFontEncoder'
+
+const toWinAnsi= (charCode: number): number => {


Can this be converted from a switch statement to a mapping object? The switch statement feels a little heavyweight for this purpose:

const toWinAnsi = (charCode: number): number => ({ 402: 131, // ƒ 8211: 150, // – ..., 381: 142, // Ž 381: 158, // ž }[charCode] || charCode);

I was a bit puzzled by index signature message, but now, I did it like this:

const UnicodeToWinAnsiMap:{ [index:number] : number } = { 402: 131, // ƒ ..., } const toWinAnsi= (charCode: number): number => UnicodeToWinAnsiMap[charCode]

I think you'll also need to include the default value (|| charCode) if the map doesn't contain a mapping?

const toWinAnsi = (charCode: number): number => UnicodeToWinAnsiMap[charCode] || charCode

Also, while your type definition for the UnicodeToWinAnsiMap is correct, there's no need to include it for a simple mapping object. TypeScript should infer the type for you:

const UnicodeToWinAnsiMap = { 402: 131, // ƒ ..., }

Type inference is one of TypeScript's strong points and it helps avoid a lot of type definition boilerplate. It puts languages like Java or C++ to shame 😄

I did try several options, I get an error message if I omit type definition of UnicodeToWinAnsiMap

see a simplified case on TS Playground

When the flag noImplicitAny is on, this triggers an error message. This saw that this option is on in tsconfig.json

src/core/pdf-structures/factories/PDFStandardFontFactory.ts

jerp · 2018-09-01T06:56:37Z

Sure for the default charCode, unit tests detected it! About inference, got error messages, but I’ll give it another go.

…

On 31 Aug 2018, at 23:58, Andrew Dillon ***@***.***> wrote: @Hopding commented on this pull request. In src/core/pdf-structures/factories/PDFStandardFontFactory.ts: > +class PDFStandardFontFactory implements PDFFontEncoder { + static for = (fontName: IStandard14FontsUnion): PDFStandardFontFactory => + new PDFStandardFontFactory(fontName); + + fontName: IStandard14FontsUnion; + + constructor(fontName: IStandard14FontsUnion) { + validate( + fontName, + oneOf(...Standard14Fonts), + 'PDFDocument.embedStandardFont: "fontName" must be one of the Standard 14 Fonts: ' + + values(Standard14Fonts).join(', '), + ); + this.fontName = fontName + } + encodeText(text: string): PDFHexString { I'd suggest including the type for the method arrow function's argument (I think it's required, actually), but not for the .map lambda arrow functions: // Include the param type here encodeText = (text: string): PDFHexString => PDFHexString.fromString(text .split('') // but not here .map(char => char.charCodeAt(0)) .map(toWinAnsi) // or here .map(charCode => charCode.toString(16)) .join('')) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Hopding · 2018-09-01T12:53:01Z

Just tried defining that UnicodeToWinAnsiMap without a type definition myself and got an error as well. I wouldn't worry about trying to get TS to infer it. Doesn't seem like that will work in this case.

Hopding · 2018-09-03T20:26:48Z

@jerp Is there anything else you plan to do on this PR before it's merged? If not, then I'll go ahead and merge it.

jerp · 2018-09-03T20:30:58Z

If you’re statisfied with the latest change go ahead. I’m looking at fontkit, font subsetting and how to lighten it all up. Fancy a font-lib repo?

…

On Mon, 3 Sep 2018 at 22:26, Andrew Dillon ***@***.***> wrote: @jerp <https://github.com/jerp> Is there anything else you plan to do on this PR before it's merged? If not, then I'll go ahead and merge it. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#28 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AGSuxAuT6Uzz53EKSG4FwhIaQABLGvHKks5uXZCIgaJpZM4WPybC> .

Hopding · 2018-09-03T21:01:32Z

Yeah, this looks great! I'll go ahead and merge it. I still think I might want to change the API for the drawText operator, but that doesn't need to be done in this PR. I'll just look at doing it before I cut the next release.

Thanks again for the work you did on this! It's a real help to me, and provides a much needed feature for pdf-lib. Congrats on being the first contributor to the project other than myself! 🎉

Regarding fontkit: That dependency is a pretty heavy one (see #14 and #13). I was working on decreasing its bundle size and publishing my forked version to npm a few weeks ago. Right now, pdf-lib points directly at my fontkit fork's GitHub repo. Also, the build is rather icky and brittle, which contributes to the size problem. I've been planning to pick that back up soon, but other issues have kept me from it. The branch I was working on is here.

The reason I forked it (and the unicode-properties repo) was to make the library run in the browser without using browserify (see foliojs/unicode-properties#2). I would have preferred to just make a PR to the original repo and avoid forking, but the maintainer didn't seem to be open to the changes I made. Though perhaps this is changing (see foliojs/unicode-properties#5)?

Are you planning on making any changes in particular? Or still just exploring the codebase and what it supports?

Add Encoding to Font Dictionary

199ecf2

jerp and others added 2 commits August 30, 2018 09:00

Merge branch 'master' into master

9a06691

Encodeing Standard Font with WinAnsiEncoding

9b8e5ba

Hopding mentioned this pull request Aug 30, 2018

Problem with PDFDocumentFactory.load some usual non-encrypted PDF documents #25

Closed

Hopding previously requested changes Aug 30, 2018

View reviewed changes

Hopding reviewed Aug 31, 2018

View reviewed changes

src/core/pdf-structures/factories/PDFStandardFontFactory.ts Outdated Show resolved Hide resolved

Fork review

348cc98

Jerome Proost added 2 commits September 1, 2018 14:39

lint fixes

1b8aaaf

use of fat arrows

994fd74

cleanup unused

7058aa5

Merge branch 'master' into master

0719c6c

Hopding merged commit 1705aeb into Hopding:master Sep 3, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Encoding to Font Dictionary #28

Add Encoding to Font Dictionary #28

jerp commented Aug 28, 2018

Hopding commented Aug 30, 2018

jerp commented Aug 30, 2018

Hopding commented Aug 30, 2018

jerp commented Aug 30, 2018

Hopding commented Aug 30, 2018

Hopding commented Aug 30, 2018

Hopding left a comment

Hopding Aug 30, 2018

jerp Aug 31, 2018

Hopding Aug 31, 2018

jerp Sep 1, 2018

jerp commented Sep 1, 2018 via email

Hopding commented Sep 1, 2018

Hopding commented Sep 3, 2018

jerp commented Sep 3, 2018 via email

Hopding commented Sep 3, 2018


		import PDFFontEncoder from 'core/pdf-structures/factories/PDFFontEncoder'

		const toWinAnsi= (charCode: number): number => {

Add Encoding to Font Dictionary #28

Add Encoding to Font Dictionary #28

Conversation

jerp commented Aug 28, 2018

Hopding commented Aug 30, 2018

jerp commented Aug 30, 2018

Hopding commented Aug 30, 2018

jerp commented Aug 30, 2018

Hopding commented Aug 30, 2018

Hopding commented Aug 30, 2018

Hopding left a comment

Choose a reason for hiding this comment

Hopding Aug 30, 2018

Choose a reason for hiding this comment

jerp Aug 31, 2018

Choose a reason for hiding this comment

Hopding Aug 31, 2018

Choose a reason for hiding this comment

jerp Sep 1, 2018

Choose a reason for hiding this comment

jerp commented Sep 1, 2018 via email

Hopding commented Sep 1, 2018

Hopding commented Sep 3, 2018

jerp commented Sep 3, 2018 via email

Hopding commented Sep 3, 2018