[RT 57248] Cyrillic letters #7

Closed
PhilterPaper opened this issue Jul 9, 2017 · 4 comments
Labels
bug something not working to spec, or causes crash

Comments

@PhilterPaper
Owner

Thu May 06 02:28:32 2010 kuzvesov [...] list.ru Ticket created
Subject: Cyrillic letters

  1. The following Cyrillic glyphs (names according to http://www.adobe.com/devnet/font/pdfs/5013.Cyrillic_Font_Spec.pdf)

afii10047 (uppercase 'Э')
afii10049 (uppercase 'Я')
afii10095 (lowercase 'э')

are not displayed when using TrueType fonts. I tried different encodings (CP1251, UTF8) with the same result.

  2. When using core fonts, all the Cyrillic characters are displayed overlapping each other with CP1251 encoding, and are not displayed at all with UTF8 encoding.

Perl version v5.10.1 built for MSWin32-x86-multi-thread
Binary build 1007 [291969] provided by ActiveState
Operating system Windows Vista Home Premium, Service Pack 1 (ver. 6.0.6001)
Subject: test-utf8.pdf

  use locale;
  use POSIX;
  use PDF::Builder;

  my $encoding = 'cp1251';

  POSIX::setlocale($encoding)
    or die 'cannot set locale';

  my $pdf = PDF::Builder->new();
  $pdf->mediabox( 'A4' );

  my $page = $pdf->page();
  my $txt = $page->text;

  my $font = $pdf->ttfont('Times.ttf', '-encode' => $encoding );
  my $fontsize = 12;
  $txt->font($font,$fontsize);
  $txt->translate(10,700);
  $txt->text("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
  $txt->translate(10,650);
  $txt->text("abcdefghijklmnopqrstuvwxyz");
  $txt->translate(10,600);
  $txt->text("àáâãäå¸æçèéêëìíîïðñòóôõö÷øùüûúýþÿ");
  $txt->translate(10,550);
  $txt->text("ÀÁÂÃÄŨÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÜÛÚÝÞß");

  # reuse the variables declared above instead of redeclaring them with "my"
  $font = $pdf->corefont('Times', '-encode' => $encoding );
  $fontsize = 12;
  $txt->font($font,$fontsize);
  $txt->translate(10,400);
  $txt->text("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
  $txt->translate(10,350);
  $txt->text("abcdefghijklmnopqrstuvwxyz");
  $txt->translate(10,300);
  $txt->text("àáâãäå¸æçèéêëìíîïðñòóôõö÷øùüûúýþÿ");
  $txt->translate(10,250);
  $txt->text("ÀÁÂÃÄŨÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÜÛÚÝÞß");

  # labels for the two sections are drawn with the core font
  $font = $pdf->corefont('Times', '-encode' => $encoding );
  $fontsize = 12;
  $txt->font($font,$fontsize);
  $txt->translate(10,750);
  $txt->text("Using true type font:");
  $txt->translate(10,450);
  $txt->text("Using core font:");

  $pdf->saveas( 'test.pdf' );

Subject: [rt.cpan.org #57248]
Date: Mon, 15 Feb 2016 16:40:51 -0500
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry

I modified the example text file to display x40 through xFF for both TrueType and Core fonts. I ran it for CP1251 (Cyrillic), CP1252 (Latin 1), CP1253 (Greek), and CP1254 (Turkish). This is Windows XP SP3, PDF::API2 2.025, Adobe Reader 11.0.08. All four character sets have some variety of MS "Smart Quotes" in the x80 - x9F range. I have not yet tried UTF-8 encoded text.

In all cases, the TTF displays perfectly, even the unassigned characters in the Smart Quotes range. The three Cyrillic characters reported missing in the original bug report are present and in the right place. All the CoreFont displays have problems with the Smart Quotes unassigned characters still displaying the empty box, but evidently having a near-zero width (so that the following character mostly overprints it).

Core Font only problems:
CP1251: All Cyrillic and possibly some other characters print correctly, but apparently have about 33% width and are overprinted by following characters.
CP1252: The unassigned characters in the Smart Quotes range get overprinted, but the rest of the Latin-1 characters look OK.
CP1253: The Greek letters behave just like the Cyrillic letters in 1251.
CP1254: The Turkish letters behave just like the Latin-1 letters in 1252.

The bottom line is that TTF looks OK from here (at least for CP125x encoding), but Core Fonts have trouble with unassigned ("box") characters and non-Latin characters, where the characters look OK, but the text location is not advanced far enough and we get overprinting. Perhaps the font data (especially character width) isn't being read correctly? Since it works for (e.g.) CP1252, it seems odd that it would fail for non-Latin sets (note that Turkish is Latin). That would imply that the font files themselves are defective or non-standard in some way.
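The byte-range test described above can be approximated with a short sketch. It only builds the rows of bytes from x40 through xFF and decodes the three characters from the original report; it does not produce a PDF. The helper name `byte_rows` is made up for this illustration, and the use of the core `Encode` module is an assumption about how one might check the mapping, not a description of the modified test file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Build one string of up to 16 single-byte characters per row,
# covering the inclusive range $first..$last.
sub byte_rows {
    my ($first, $last) = @_;
    my @rows;
    for (my $start = $first; $start <= $last; $start += 16) {
        my $end = $start + 15 > $last ? $last : $start + 15;
        push @rows, join '', map { chr($_) } $start .. $end;
    }
    return @rows;
}

my @rows = byte_rows(0x40, 0xFF);
printf "%d rows of test bytes\n", scalar @rows;

# Decode the three bytes that correspond to the glyphs reported missing,
# to confirm which Unicode characters CP1251 assigns to them.
for my $byte (0xDD, 0xDF, 0xFD) {
    my $char = decode('cp1251', chr($byte));
    printf "0x%02X -> U+%04X\n", $byte, ord($char);
}
```

Each row could then be passed to `$txt->text()` once for the TrueType font and once for the core font, as in the test script in the ticket.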

test-cp1251.pdf
test-utf8.pdf
57248.zip

@PhilterPaper
Owner Author

PhilterPaper commented Nov 3, 2017

The current situation is:

1: TTF does not appear to be missing any characters, including the three listed.

2a: The characters overlap because any glyph not listed in PDF::Builder::Resource::Font::CoreFont::[fontname].pm is given that file's "missingwidth" value of 250, which is as little as a quarter of the width needed. Only the standard Latin-1 glyphs, and their widths, are listed. Everything else is "missing". Possibly this could be fixed by extending the [fontname].pm glyph and width tables, but that will be quite a bit of work. It's also possible that, instead of using fixed .pm files, PDF::Builder could read the local copy of the core font files.

Reading the local core font files for metrics and embedding the fonts (see #80) would ensure that all glyphs are always properly rendered.
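The width fallback described in 2a can be illustrated with a small stand-alone sketch. The `%metrics` hash is a simplified stand-in for the data in a [fontname].pm file, the `advance` helper is hypothetical, and the widths are illustrative values in 1/1000 em rather than figures taken from the real tables.

```perl
use strict;
use warnings;

# Simplified stand-in for a core font metrics table (assumption):
# glyph name => width in 1/1000 em, plus the fallback width.
my %metrics = (
    missingwidth => 250,
    wx           => { A => 722, B => 667, space => 250 },
);

# Total advance in points for a run of glyph names at a given font size.
# Glyphs absent from the 'wx' table fall back to 'missingwidth'.
sub advance {
    my ($m, $size, @glyphs) = @_;
    my $units = 0;
    for my $glyph (@glyphs) {
        my $w = $m->{wx}{$glyph};
        $units += defined $w ? $w : $m->{missingwidth};
    }
    return $units * $size / 1000;
}

my $latin    = advance(\%metrics, 12, qw(A B A));
my $cyrillic = advance(\%metrics, 12, qw(afii10047 afii10049 afii10095));
printf "Latin: %.2f pt, Cyrillic: %.2f pt\n", $latin, $cyrillic;
```

With these illustrative numbers the three Cyrillic glyphs advance the text position by only 9 points in total, against roughly 25 points for three Latin capitals, which is the kind of shortfall that shows up as overprinting.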

2b: Core fonts do not support UTF-8 -- only single byte encodings at this time. UTF-8 support for core and Type1 fonts would certainly be desirable, but I don't know if it's feasible to add it (see #81).

To access core font glyphs outside of Latin-1, consider using automap() to break the font up into multiple planes of up to 256 characters each (the 020_corefonts example does this). However, this still does not appear to give correct character widths.

@PhilterPaper
Owner Author

Update to RT 57248:

  1. TTF does not appear to be missing any characters, including the three listed, when I tested it.
  2. The characters overlap because any glyph not listed in PDF::API2::Resource::Font::CoreFont::[fontname].pm is given that file's "missingwidth" value of 250, which is as little as a quarter of the width needed. Only the standard Latin-1 glyphs, and their widths, are listed. Everything else is "missing". Possibly this could be fixed by extending the [fontname].pm glyph and width tables, but that will be quite a bit of work.
  3. Core fonts do not support UTF-8 -- only single byte encodings at this time. UTF-8 support for core and Type1 fonts would certainly be desirable, but I don't know if it's feasible to add it.

I think the best resolution of this is to switch to TTF (ttfont) rather than using core fonts.
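A minimal sketch of the ttfont route follows, assuming PDF::Builder is installed and that the text is held as Unicode strings rather than CP1251 bytes. The font path is a placeholder, the output file name is arbitrary, and the wrapper `write_cyrillic_sample` is a hypothetical name used only here.

```perl
use strict;
use warnings;
use utf8;            # the string literal below is Unicode text
use PDF::Builder;

# Write one line of Cyrillic text using a TrueType font.
sub write_cyrillic_sample {
    my ($font_file, $out_file) = @_;

    my $pdf  = PDF::Builder->new();
    my $page = $pdf->page();
    $page->mediabox('A4');
    my $text = $page->text();

    my $font = $pdf->ttfont($font_file);
    $text->font($font, 12);
    $text->translate(10, 700);
    $text->text('ЭЯэ');      # the three characters from the original report

    $pdf->saveas($out_file);
    return $out_file;
}

# Placeholder path (assumption): point this at a real .ttf file that
# contains Cyrillic glyphs before running.
my $font_file = '/path/to/font.ttf';
write_cyrillic_sample($font_file, 'ttf-cyrillic.pdf') if -e $font_file;
```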

@PhilterPaper
Owner Author

The missing widths problem is more general than just Cyrillic: it appears that even if the encoding is supported by Perl, only those characters appearing in Latin-1 (ISO-8859-1) will get proper widths; everything else gets "missingwidth".

If the original source font is TTF, possibly we could cheat and read widths from the font file. However, as the font file on the Reader's machine will always be used (no embedding of the font), there is no guarantee that widths will match. The same goes for extending the glyph width list to handle all supported Perl single byte encodings. It looks like the only real solution is to use TTF fonts instead of Core Fonts, if you want to use non-Latin-1 characters. Possibly allow only Latin-1/ISO-8859-1 encoding for Core Fonts? How many applications would this break? How about giving a warning for all other encodings, with a switch to shut the warning off?
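The warning idea floated above might look something like the sketch below. Nothing here exists in PDF::Builder: the function name `corefont_encoding_warning`, the `nowarn` switch, and the list of accepted encoding names are all hypothetical, chosen only to show the shape of such a check.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a warning message when a core font is
# requested with an encoding other than Latin-1, or nothing when the
# encoding is acceptable or the caller has switched the check off.
sub corefont_encoding_warning {
    my ($encoding, %opts) = @_;
    return if $opts{nowarn};                      # hypothetical switch
    my %latin1 = map { $_ => 1 } qw(latin1 iso-8859-1);
    return if $latin1{ lc $encoding };
    return "core font widths may be wrong for encoding '$encoding'; "
         . "consider ttfont() instead";
}

for my $enc (qw(latin1 cp1251)) {
    my $msg = corefont_encoding_warning($enc);
    warn "$msg\n" if defined $msg;
}
```

Returning the message instead of warning directly keeps the check easy to test and leaves the decision about how to report it to the caller.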

@PhilterPaper
Owner Author

Corefonts already give the correct glyph names in the PDF file, so all that appears to be needed is to add all the missing glyph widths to the [typeface].pm file. Added tools/TTFdump.pl, which creates a "wx" (glyph widths) section in a file, and you can edit it into the [typeface].pm file (just replace the old 'wx' section with this one). I renamed the old 'wx' section (hash) to 'wxold', so it's there if you need to go back to it for some reason. All the [typeface].pm files that I could find .ttf files for have been updated (all but Bank Gothic, and various symbology typefaces), so no action is needed by users. The TTFdump tool is available in case you need it.
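The 'wx' replacement described above can be pictured with a stand-alone sketch. The `$data` structure is a trimmed-down, hypothetical stand-in for the contents of a [typeface].pm file, and `%new_wx` holds made-up widths standing in for whatever a dump tool would generate; the actual output format of tools/TTFdump.pl is not shown here.

```perl
use strict;
use warnings;

# Hypothetical, trimmed-down stand-in for a [typeface].pm data structure.
my $data = {
    missingwidth => 250,
    wx           => { A => 722, B => 667 },
};

# Made-up widths (1/1000 em) standing in for a freshly generated table.
my %new_wx = (
    A         => 722,
    B         => 667,
    afii10047 => 660,
    afii10049 => 667,
    afii10095 => 423,
);

# Keep the previous table under 'wxold', then install the new one.
$data->{wxold} = $data->{wx};
$data->{wx}    = {%new_wx};

# List the glyphs that no longer fall back to missingwidth.
my @gained = grep { !exists $data->{wxold}{$_} } sort keys %{ $data->{wx} };
print "now covered: @gained\n";
```

Keeping the old hash around under a different key means reverting is a matter of swapping the two names back.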

This appears to fix the problems described in the ticket, so I'm closing it.

@PhilterPaper PhilterPaper added bug something not working to spec, or causes crash and removed stalled things have ground to a halt on this one labels Mar 5, 2020