Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Encoding to Font Dictionary #28

Merged
merged 8 commits into from Sep 3, 2018
Merged

Add Encoding to Font Dictionary #28

merged 8 commits into from Sep 3, 2018

Conversation

jerp
Copy link
Contributor

@jerp jerp commented Aug 28, 2018

See #17

@Hopding
Copy link
Owner

Hopding commented Aug 30, 2018

Hello @jerp. Thank you for submitting this PR! I've done a little bit of research on this and it looks like you're on the right track. However, I think a few more things will need to be done before this is complete.

pdf-lib currently writes text using Literal Strings. However, this means only ASCII characters can be used. This is obviously a severe restriction. What pdf-lib should do is convert strings of text into their character codes, and write them as Hex Strings. This is necessary to fully comply with the WinAnsiEncoding.

For example, this is how the text Hello World! would look as a Literal String:

(Hello World!)

And this is how it would look as a Hex String:

<48656C6C6F20576F726C6421>

Notice that each pair of numbers in the hex string is the hexadecimal representation of the corresponding character in the Literal String:

'Hello World!'
  .split('')
  .map(x => x.charCodeAt(0))
  .map(c => c.toString(16))
> [ '48', '65', '6c', '6c', '6f', '20', '57', '6f', '72', '6c', '64', '21' ]

However, JavaScript's charCodeAt method does not return WinAnsiEncoding character codes. It returns UTF-16 code points. Therefore, some conversion is necessary. This webpage shows a comparison of the two encodings. pdfkit also has a nice mapping object that could be reused in pdf-lib as well: https://github.com/foliojs/pdfkit/blob/master/lib/font/afm.coffee#L58-L85.


Please let me know if you're interested in working on these additional changes. I'm happy to help point you in the right direction and find your way around the pdf-lib codebase. Of course, I also understand if it is more than you're interested in doing.

Here are some other useful bits of pdfkit code:

Here are some of the relevant sections of the PDF specification:

  • 7.3.4.2: Literal Strings
  • 7.3.4.3: Hexadecimal Strings
  • 9.6.2: Type 1 Fonts
  • Annex D: Character Sets and Encodings

@jerp
Copy link
Contributor Author

jerp commented Aug 30, 2018

I committed a version that uses the mapping of PDFKit. I did not check this mapping but being a user of this library for many years, I trust it better than myself.

The way i implemented it, requires to send a PDFHexString instance to drawText instead of a string. This Hex String is encoded properly. See example where I modified the following lines:

 const [helveticaFontRef, helveticaFont] = pdfDoc.embedStandardFont(HELVETIVA_FONT);
 ...
 helveticaFont.encode('Olé! - Œ')[0]

embedStandardFont now returns a tuple where the second argument is a text encoder, in this case, helveticaFont.

helveticaFont.encode should return a tuple, the first element is the encoded text. I believe, a second argument should give information about position advances (see pdfkit).

Although, I'm an experience Javascript developer, I know little about Typescript, so your input is most than welcome.

The next thing I'm looking into is Font subsetting. This is an important requirement for me (file size).

@Hopding
Copy link
Owner

Hopding commented Aug 30, 2018

Wow, this is great Jerome! You churned this out very quickly. I'm impressed.

I'll do a full review later this evening. It looks good overall. There are a few stylistic and linting related things that will need tweaked. I'd also like for the drawText operator to automatically encode text if possible. But I haven't fully thought through how that might work. Perhaps there are reasons this would be difficult that I haven't considered?

P.S. You may have already found this, but the easiest way to try out code changes is by running the integration tests with yarn it:node. You can control which of the tests get run by commenting out the exports of tests you don't wish to run in the index.ts file.

@jerp
Copy link
Contributor Author

jerp commented Aug 30, 2018

I didn't know yarn it:node, I like it! All tests passed.

About the above, implementing it within drawText, would require this function to know what font is being used. For now, it only knows about the font name given at page level. When subsetting fonts, this gets more important since glyphs need to be collected prior to encode the font.

@Hopding
Copy link
Owner

Hopding commented Aug 30, 2018

What do you think about passing the font object to the drawText operator, and encoding text "behind the scenes"?

For example, this is how it works right now:

const [helveticaFontRef, helveticaFont] = pdfDoc.embedStandardFont(HELVETIVA_FONT); 
// ...
drawText(helveticaFont.encode('Olé! - Œ')[0], {
  x: PAGE_1_WIDTH * 0.5 -  30,
  y: PAGE_1_HEIGHT - 48 - 30,
  font: HELVETIVA_FONT,
  size: 12,
}),

But it could potentially work like this:

const [helveticaFontRef, helveticaFont] = pdfDoc.embedStandardFont(HELVETIVA_FONT); 
// ...
drawText('Olé! - Œ', {
  x: PAGE_1_WIDTH * 0.5 -  30,
  y: PAGE_1_HEIGHT - 48 - 30,
  font: HELVETIVA_FONT,
  fontObject: helveticaFont,
  size: 12,
}),

My primary concern here is not technical. Clearly, it works either way. I'm just thinking about the API design. It's a balancing act between creating APIs that are too low-level and confusing for people to quickly grok and use, vs creating a super slick high-level API that requires more effort to maintain.

In this case, neither of these APIs is super elegant and easy to use. However, I think passing the font object will be easier to document, and users will be less likely to make mistakes this way. It also paves the way for supporting automatic text wrapping by the drawText operator, because it would now have access to the font object required to measure strings.

I'd like to hear your thoughts on this. And of course, I'm open to alternative solutions as well.

@Hopding
Copy link
Owner

Hopding commented Aug 30, 2018

Can you please run yarn lint on this as well? This will check the files to ensure they comply with the project's linting standards. It will also automatically fix any formatting changes that it can, so be sure to commit any changes you've made before running it.

P.S. You can prevent the linter from reformatting a block of code by adding a // prettier-ignore comment. But this should be used very sparingly. The linter is usually right.

Copy link
Owner

@Hopding Hopding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this comment be moved in the PDFStandardFontFactory along with the font dictionary?

@Hopding Hopding dismissed their stale review August 30, 2018 23:37

Erronous Comments

src/core/pdf-document/PDFDocument.ts Outdated Show resolved Hide resolved

import PDFFontEncoder from 'core/pdf-structures/factories/PDFFontEncoder'

const toWinAnsi= (charCode: number): number => {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be converted from a switch statement to a mapping object? The switch statement feels a little heavyweight for this purpose:

const toWinAnsi = (charCode: number): number => ({
  402: 131, // ƒ
  8211: 150, // –
  ...,
  381: 142, // Ž
  381: 158, // ž
}[charCode] || charCode);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit puzzled by index signature message, but now, I did it like this:

    const UnicodeToWinAnsiMap:{ [index:number] : number } = {
        402: 131, // ƒ
        ...,
    }
    const toWinAnsi= (charCode: number): number => UnicodeToWinAnsiMap[charCode]

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll also need to include the default value (|| charCode) if the map doesn't contain a mapping?

const toWinAnsi = (charCode: number): number => UnicodeToWinAnsiMap[charCode] || charCode

Also, while your type definition for the UnicodeToWinAnsiMap is correct, there's no need to include it for a simple mapping object. TypeScript should infer the type for you:

const UnicodeToWinAnsiMap = {
    402: 131, // ƒ
    ...,
}

Type inference is one of TypeScript's strong points and it helps avoid a lot of type definition boilerplate. It puts languages like Java or C++ to shame 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try several options, I get an error message if I omit type definition of UnicodeToWinAnsiMap

see a simplified case on TS Playground

When the flag noImplicitAny is on, this triggers an error message. This saw that this option is on in tsconfig.json

@jerp
Copy link
Contributor Author

jerp commented Sep 1, 2018 via email

@Hopding
Copy link
Owner

Hopding commented Sep 1, 2018

Just tried defining that UnicodeToWinAnsiMap without a type definition myself and got an error as well. I wouldn't worry about trying to get TS to infer it. Doesn't seem like that will work in this case.

@Hopding
Copy link
Owner

Hopding commented Sep 3, 2018

@jerp Is there anything else you plan to do on this PR before it's merged? If not, then I'll go ahead and merge it.

@jerp
Copy link
Contributor Author

jerp commented Sep 3, 2018 via email

@Hopding
Copy link
Owner

Hopding commented Sep 3, 2018

Yeah, this looks great! I'll go ahead and merge it. I still think I might want to change the API for the drawText operator, but that doesn't need to be done in this PR. I'll just look at doing it before I cut the next release.

Thanks again for the work you did on this! It's a real help to me, and provides a much needed feature for pdf-lib. Congrats on being the first contributor to the project other than myself! 🎉


Regarding fontkit: That dependency is a pretty heavy one (see #14 and #13). I was working on decreasing its bundle size and publishing my forked version to npm a few weeks ago. Right now, pdf-lib points directly at my fontkit fork's GitHub repo. Also, the build is rather icky and brittle, which contributes to the size problem. I've been planning to pick that back up soon, but other issues have kept me from it. The branch I was working on is here.

The reason I forked it (and the unicode-properties repo) was to make the library run in the browser without using browserify (see foliojs/unicode-properties#2). I would have preferred to just make a PR to the original repo and avoid forking, but the maintainer didn't seem to be open to the changes I made. Though perhaps this is changing (see foliojs/unicode-properties#5)?

Are you planning on making any changes in particular? Or still just exploring the codebase and what it supports?

@Hopding Hopding merged commit 1705aeb into Hopding:master Sep 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants