Text handling oddities #193

Open
PhilterPaper opened this issue Jan 23, 2023 · 6 comments
Labels
documentation (principally a documentation issue), help wanted (we could use some help from you guys)

Comments

@PhilterPaper
Owner

PhilterPaper commented Jan 23, 2023

I just ran across something odd with TrueType fonts ($pdf->ttfont(...)). It appears that word spacing ($text->wordspace(n)) is ignored for TrueType fonts. The PDF::Builder call itself seems to work OK, leaving an n Tw command in the stream. However, Adobe Acrobat Reader seems to ignore the Tw command -- I need to find some other readers to test on. The character spacing command n Tc ($text->charspace(n)) appears to work properly for TrueType fonts.

I tested with corefonts and psfonts (Type 1 fonts) and they both work properly with both word and character spacing. I wonder if the problem is that Tw is implemented to look for an ASCII space (x20) only, to adjust its size, and misses the boat on the glyph ID hex codes used with ttfonts? Certainly, the hex code for a space glyph can vary widely!

I need to find out if this is something peculiar to Adobe, or if it's widespread. Either way, the wordspace() method's limitation will have to be documented. I haven't checked yet to see if the order of commands matters.

Add: A workaround for this, assuming it isn't a bug in PDF::Builder itself, would be to output words individually, using some multiplier on the actual space width:

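# close up a sentence with 40% width spaces, for a TTF font (in lieu of wordspace)
# assumes $text already has its font set, and that $starting_x and $y are defined
my $ws = $text->advancewidth(' ') * 0.4;
my $x = $starting_x;
my $phrase = 'The';
my $w = $text->advancewidth($phrase);
$text->translate($x, $y);  # position the word explicitly
$text->text($phrase);      # outputs
$x += $w + $ws;
$phrase = 'New';
$w = $text->advancewidth($phrase);
$text->translate($x, $y);
$text->text($phrase);
$x += $w + $ws;
# ... etc. ...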

There might be more elegant ways, if I think about it for a bit. And of course, it could be a loop to split up a single run of words and spaces, or even a built-in method to do this. Something like this may need to be added to all the text output methods, including column(). I'd appreciate hearing from others if they've also seen this problem, and suggestions on what to do about it. Is there a mechanism for reporting this to Adobe? The Reader might not know which glyph corresponds to a space, but it could potentially see a character with no ink (not just x20) and apply a multiplier to it if Tw is in use.
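The loop idea mentioned above could look something like the following. This is only a sketch, not anything in PDF::Builder: word_positions() is a hypothetical helper, and the width function is passed in as a code ref so the arithmetic can be tried without a PDF. The 5-units-per-character width is a stand-in for demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::Builder): returns a list of
# [word, x] pairs. $width_of is a code ref giving the width of a string,
# e.g. sub { $text->advancewidth($_[0]) } for a PDF::Builder text object.
sub word_positions {
    my ($sentence, $start_x, $factor, $width_of) = @_;
    my $ws = $width_of->(' ') * $factor;    # emulated space width
    my $x  = $start_x;
    my @placed;
    for my $word (split ' ', $sentence) {   # split on runs of whitespace
        push @placed, [$word, $x];
        $x += $width_of->($word) + $ws;
    }
    return @placed;
}

# Stand-in width function for demonstration: 5 units per character.
my @placed = word_positions('The New Text', 100, 0.4, sub { 5 * length $_[0] });
printf "%-5s at x=%g\n", @$_ for @placed;
```

With a real text object, each pair would presumably be output with $text->translate($x, $y) followed by $text->text($word), where $y is the current baseline.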

PhilterPaper added the "help wanted" and "documentation" labels on Jan 23, 2023
@PhilterPaper
Owner Author

I learned something else today about fonts. While it's true that Linux etc. variants place their fonts in all sorts of locations, Windows isn't as pure as I thought it was. When you add a new font, say, by dragging and dropping a .ttf file into \Windows\Fonts, there's no guarantee that it will end up there! Its name also will often be changed. This knowledge is important for knowing the font path and file name for using a TrueType font.

To find out where your TTF or OTF file ended up, if you don't see an obvious entry in \Windows\Fonts, you need to look in \Users\XXXX\AppData\Local\Microsoft\Windows\Fonts, depending on what user you were signed on as when you installed the font. Even then, you may not be done, as the name may have been changed to something unrecognizable. You may need to look at Windows' mapping of font name to filename.

In the command shell (command line), or whatever equivalent you like to use, enter "regedit" to bring up the registry editor. For the top level, choose (click on) either HKEY_LOCAL_MACHINE (for global font settings, in \Windows\Fonts) or HKEY_CURRENT_USER (for fonts installed by whoever is currently signed on, in \Users\XXXX\AppData...). From there, both have the same path: SOFTWARE > Microsoft > Windows NT > CurrentVersion > Fonts. This should bring up a listing of all the installed fonts (full name, e.g. "Papyrus Regular") and their actual filename ("PAPYRUS.TTF"). For instance, I just installed a blackletter "Gothic" font English Towne Medium. It ended up in the \Users\Phil... directory as EnglishTowne.ttf.

You don't need to change anything in the registry, just look. You do have the capability to change things, including hiding/showing the font, if you care to get into those things.

Anyway, this should give you the information you need to get the proper path and file name for TTF fonts you install (and even those that come with Windows). Other font types don't seem to jump through these hoops. At some point, this should probably go into the ttfont() method documentation, and perhaps a mention in FontManager.
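For anyone who would rather script the lookup than click through regedit, the same registry key can be dumped with the Windows reg query command and parsed. The sketch below is an assumption-laden illustration: parse_font_registry() is a hypothetical helper, the sample text only imitates the general layout of reg query output, and it handles REG_SZ values only.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the text printed by
#   reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts"
# into a hash of registry font name => file name (or path).
sub parse_font_registry {
    my ($output) = @_;
    my %fonts;
    for my $line (split /\r?\n/, $output) {
        # value lines look like: <indent>Name<spaces>REG_SZ<spaces>Data
        next unless $line =~ /^\s+(.+?)\s+REG_SZ\s+(.+?)\s*$/;
        $fonts{$1} = $2;
    }
    return %fonts;
}

# Illustrative sample only (made up to mimic the layout of the real output).
my $sample = join "\n",
    'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts',
    '    Papyrus Regular    REG_SZ    PAPYRUS.TTF',
    '';
my %fonts = parse_font_registry($sample);
print "$_ => $fonts{$_}\n" for sort keys %fonts;

# On Windows, something like this would feed it real data
# (swap HKLM for HKCU to list fonts installed for the current user):
# my $key   = 'HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts';
# my %fonts = parse_font_registry(scalar `reg query "$key"`);
```

The resulting file name still has to be joined with the right directory (\Windows\Fonts or the per-user Fonts directory) before being handed to $pdf->ttfont().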

Credit: much of this information came from https://superuser.com/questions/1658678/detect-path-of-font-on-windows

@mkl-public

As discussed on the Adobe Support Community site, this is a matter of the encoding the PDF creator uses for the font in question:

> Word spacing shall be applied to every occurrence of the single-byte character code 32 in a string when using a simple font (including Type 3) or a composite font that defines code 32 as a single-byte code. It shall not apply to occurrences of the byte value 32 in multiple-byte codes.
> (ISO 32000-2:2020 section 9.3.3 Word spacing)

Thus, if you want to use the Tw instruction to manipulate the spacing between words, you have to use an encoding for your font which uses the single-byte 32 character code for the space glyph.
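The rule quoted above can be sketched as a small counting function. This is a simplification for illustration, not viewer code: tw_applies_count() is a hypothetical name, and it assumes every character code in the string has the same fixed length (1 byte for a simple font, 2 bytes for a composite font such as one using Identity-H), which ignores CMaps with mixed code lengths.

```perl
use strict;
use warnings;

# Hypothetical helper: count the character codes in a content-stream
# string that Tw would apply to, assuming fixed-length codes.
sub tw_applies_count {
    my ($bytes, $code_len) = @_;
    return 0 if $code_len != 1;            # byte 32 inside a multi-byte code is ignored
    return scalar($bytes =~ tr/\x20//);    # count single-byte code 32
}

my $simple    = 'The New Text';                     # single-byte codes
my $composite = pack 'n*', 0x0037, 0x0020, 0x0031;  # hypothetical 2-byte codes

printf "simple font:    %d\n", tw_applies_count($simple, 1);
printf "composite font: %d\n", tw_applies_count($composite, 2);
```

Note that the second string does contain a byte with value 32 (the low byte of 0x0020), yet it is not counted, because that byte is only part of a two-byte code.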

@PhilterPaper
Owner Author

Regarding the Tw/wordspace issue, follow along here. The bottom line (so far) is that Tw will never be honored when glyph IDs are used for TrueType fonts. Plus, it will only ever apply to ASCII spaces (x20), not to required blanks (xA0) or other kinds of spaces.

I will have to think about adding a hack to split up a $text->text($sentence) call into individual words, and place each one with an emulated space of adjusted width. Until then, the wordspace() method needs a warning.

  1. Should this be done for all flavors of Unicode space? PDF's Tw is hard-wired to handle only ASCII space (x20), so required blanks/non-breaking spaces and various sizes of spaces could be proportionately adjusted. There could certainly be an option to apply only to x20 and xA0. Maybe xA0 should be changed to x20 anyway?
  2. Should this be done for all font types, and not just TTF? If so, non-ASCII spaces would all be handled the same way, and PDF would never see an ASCII space character (unless wordspace is set to 0). I would have to query the font type, if not.
  3. Presumably this should be built in to all text output routines (I think they all eventually come to $text->text()), including the new column(). They would have to check if the Tw value requested is non-zero, before going through all the bother.
  4. If column() supports it, I will need a new fake-HTML tag and/or CSS to change Tw (and Tc) on the fly (as well as recognizing their being set upon entry).

@PhilterPaper
Owner Author

> Thus, if you want to use the Tw instruction to manipulate the spacing between words, you have to use an encoding for your font which uses the single-byte 32 character code for the space glyph.

I don't know why you keep insisting (here and on the Adobe forum) that I am using a multibyte character encoding for the text. I'm not. The original "space" is a single byte x20. For TTF support in PDF::Builder, the Reader is presented with a list of glyph IDs, which will vary by the particular font being used. A 'space' (x20) may end up 0003 in one font file and 00b7 in another. If the Reader is searching for an actual byte of x20, it ain't gonna find it. This is a limitation of the Reader implementation, in that it doesn't go looking for inkless glyphs (spaces) when presented with a glyph ID list rather than a text string (where a space is x20). My complaint is that I don't see this limitation documented, except in a very round-about way.

@mkl-public

> I don't know why you keep insisting (here and on the Adobe forum) that I am using a multibyte character encoding for the text. It's not. The original "space" is a single byte x20.

You misunderstand what the PDF specification means when it talks about multibyte character codes.

It does not talk about the character encoding you use in your application before you transform some text strings into content streams. It doesn't care what encoding you use in your application code.

What it talks about is what you eventually store in the strings (literal or hexadecimal) in the content streams. And as you use Identity-H as font encoding, you store double-byte codes there.

With this misunderstanding cleared up, the excerpt from the specification I quoted above requires a PDF viewer to operate like Adobe Acrobat does in this regard, and it does so in a clear way.
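To make the distinction concrete, here is a rough sketch of what ends up in the content stream when glyph IDs are written with Identity-H. The helper name gid_hex_string() is hypothetical, and the glyph IDs are examples only (0003 is one of the space glyph IDs mentioned earlier in the thread; the others are invented).

```perl
use strict;
use warnings;

# Hypothetical helper: write a list of glyph IDs as a PDF hexadecimal
# string, two bytes (four hex digits) per glyph, as Identity-H expects.
sub gid_hex_string {
    my @gids = @_;
    return '<' . join('', map { sprintf '%04X', $_ } @gids) . '>';
}

# One glyph, a space glyph (0003 in this made-up font), another glyph.
print gid_hex_string(0x0037, 0x0003, 0x0031), " Tj\n";
```

Every code in that string is two bytes long, so there is no single-byte code 32 for Tw to act on, whatever the text looked like in the application before it was converted.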

@PhilterPaper
Owner Author

PhilterPaper commented Jan 31, 2023

I have updated PDF::Builder to honor the Tw setting when using a TrueType font. This will hit CPAN with the 3.026 release. It splits out x20 ASCII spaces and gives them their own kerning, to adjust their width. Note that $text->textHS() and $text->advancewidthHS() (both for HarfBuzz::Shaper use) do not yet (?) honor Tw. Perhaps in the future...
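One way such a per-space adjustment could be expressed in the content stream is with the TJ operator. The sketch below is not PDF::Builder's actual code; both function names are hypothetical, and it assumes horizontal scaling is left at 100%. It relies on the PDF text-positioning rule that a TJ number is in thousandths of a text-space unit, is subtracted, and is scaled by the font size, whereas Tw is added unscaled.

```perl
use strict;
use warnings;

# An extra $tw units of space after a glyph corresponds to a TJ number of
# -1000 * $tw / $font_size (assuming 100% horizontal scaling).
sub tj_adjust_for_tw {
    my ($tw, $font_size) = @_;
    return -1000 * $tw / $font_size;
}

# Build a TJ operation from hex-encoded words, putting the hex-encoded
# space glyph plus the emulated word-space adjustment between them.
sub tj_with_wordspace {
    my ($tw, $font_size, $space_hex, @word_hex) = @_;
    my $adj = tj_adjust_for_tw($tw, $font_size);
    my $sep = sprintf '<%s> %g ', $space_hex, $adj;
    return '[ ' . join($sep, map { "<$_> " } @word_hex) . '] TJ';
}

# Word spacing of 2 at font size 10, space glyph 0003, two one-glyph "words".
print tj_with_wordspace(2, 10, '0003', '0037', '0031'), "\n";
```

A negative TJ number moves the next glyph further along the line, which is why a positive word spacing turns into a negative adjustment.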
