[RT 113700] Khmer script incorrectly rendered #35

Closed
PhilterPaper opened this issue Jul 10, 2017 · 18 comments

@PhilterPaper
Owner

Tue Apr 12 05:28:49 2016 jeromekampot [...] gmail.com - Ticket created
Subject: Khmer script incorrectly rendered
Date: Tue, 12 Apr 2016 16:28:39 +0700
To: bug-PDF-API2 [...] rt.cpan.org
From: Jerome B <jeromekampot [...] gmail.com>

It seems subscripts (footer letters) are not rendered for Khmer script. PDF-API2 v2.0.27, perl 5.18.2

The code below should render 2 consonants, "under" each other but instead renders them next to each other with the "Coeng" placeholder. I tested with different fonts including KhmerOS.ttf (http://sourceforge.net/projects/khmer/files/Fonts%20-%20KhmerOS/KhmerOS%20Fonts%205.0-%20LGPL%20Licence/)

Copy/paste of the text seems to work correctly so I guess it has something to do with the way the font is generated.

use strict;
use warnings;
use PDF::Builder;

my $pdf = PDF::Builder->new(-file=>"testkhmer.pdf", -encode => 'utf8');
my $page = $pdf->page;
my $font = $pdf->ttfont('../font/khmerOS.ttf');

my $text = $page->text();
$text->font($font, 20);
$text->translate(200, 700);
# KA + COENG + KA: the second KA should be drawn as a subscript under the first
$text->text("\x{1780}\x{17D2}\x{1780}");
$pdf->save();
@PhilterPaper added the "help wanted" label Nov 3, 2017
@PhilterPaper
Owner Author

PhilterPaper commented Nov 3, 2017

I don't know Khmer, so I'll add a "help wanted" label to this issue. I will attempt again to contact the original reporter, to see if they can try it.

The original code above (113700.pl) produces two KA's side-by-side, with a COENG under the left one. I have no idea if that's correct or not. Jerome says one consonant should be "under" the other -- does that mean "stacked vertically"?

I sort of copied a phrase ("a dog") from a Wikipedia article on the Khmer language, matching up characters as best I could. It seems to work as expected (vowel marks over or under the preceding consonant), but maybe this is something totally different from the COENG example.

Anyway, in a month or two I will close this issue, unless someone shows up who knows Khmer and how the text should appear.

113700.pl.txt
testkhmer.pdf

@jbenezech

Thanks for looking into this Phil. I didn't expect this issue to be on anybody's list.
Khmer is written kind of horizontally, left to right, but has some tricks. I think the word Dog is a perfect example of these.

This word is composed of 3 letters:

  • Chha : consonant (\x{1786})
  • Ka : consonant (\x{1780})
  • Ae : vowel (\x{17C2})

You would read it something like Chkae
https://translate.google.com/translate_tts?ie=UTF-8&q=%E1%9E%86%E1%9F%92%E1%9E%80%E1%9F%82&tl=km&total=1&idx=0&textlen=4&tk=185153.308593&client=t

The written form obeys these 2 rules:

  • The vowel (AE here) is written before (to the left of) the consonants
  • Each consonant can have 2 forms, the "normal" form and the "subscript" form. When two consonants form a consonant cluster, the second one takes the subscript form and is placed underneath.
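For reference, the word as stored in Unicode is four code points, not three, because the COENG sign (U+17D2) sits between the two consonants to request the subscript form. This is the sequence encoded in the Google Translate URL above. A minimal sketch that lists the code points with their official Unicode names (the `describe` helper is only for illustration):

```perl
use strict;
use warnings;
use charnames ();

# Code points of the Khmer word for "dog", in logical (storage) order.
my @dog = (0x1786, 0x17D2, 0x1780, 0x17C2);

# Return "U+XXXX NAME" for each code point, using Perl's Unicode name table.
sub describe {
    my @cps = @_;
    return map { sprintf('U+%04X %s', $_, charnames::viacode($_)) } @cps;
}

print "$_\n" for describe(@dog);
```

Note that the vowel comes last in storage order even though it is drawn first.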

In your attached example, the letter under the Chha is actually the vowel "ou" so this is not the word dog and I think is just not correct.

I have corrected the testcase. See attached. Note that I omitted the last 3 letters which are just the word for "a" (a dog).

I also attach a pdf with the correct rendering of the word. Note that I produced this pdf by copy pasting from testkhmer.pdf into LibreOffice writer then exporting as pdf.

Khmer - Dog.pdf
testkhmer.pdf
113700.pl.txt

@jbenezech

I played around a bit and tested with Tamil script. It seems to have the same issue. So I guess this is related to the Virama sign in general and probably affects most Indic (Brahmi-derived) scripts.

@PhilterPaper
Owner Author

Hi, and thanks for participating in this issue. Are you the original reporter of this problem?

The sample from Wiki was "dog a (one)", so that was my intent. The glyphs didn't seem to quite match what my Unicode book shows for Khmer, but it's close-ish.

What this all comes down to is one question: is PDF::Builder outputting the correct sequence of bytes (and any surrounding information), with various PDF readers messing up the presentation, or is PDF::Builder putting out incorrectly sequenced bytes (or other information) in the first place? I can hopefully do something about the latter case, but I can't do anything about the former. Are PDF readers depending on other information to tell them how to properly render this script, and it's missing or incorrect?

@PhilterPaper
Owner Author

PhilterPaper commented Nov 4, 2017

Let me see if I understand what you sent. You revised 113700.pl to produce the correct form of "dog" (and dropped the "(one)" word). You took the PDF produced, which wasn't rendering correctly, ran it through LibreOffice and re-exported as a PDF that renders correctly? Does that imply that the byte order is correct, but there is something about the PDF file that's different? I will try to disassemble the PDF file to see how it compares to what is directly produced by PDF::Builder.

If the testkhmer.pdf you sent (revised) is correctly rendered for both words of text, does that satisfy the original complaint filed with bug 113700? If so, all I need to do is figure out what LibreOffice did to fix the PDF, and incorporate that into PDF::Builder. Or is it still incorrect in some way?

Update: I ran your revised 113700.pl to produce a new testkhmer.pdf, and it appears to render exactly the same as the testkhmer.pdf you attached above (no trip through LibreOffice). Please clarify what's in the files you sent.

@jbenezech

jbenezech commented Nov 5, 2017

I am the original reporter indeed.

Let me try to be more clear.
Attached are 4 files:

  • khmer-fail.pdf : output of the sample perl script
  • khmer-fail.jpg : previous pdf saved as image
  • khmer-expected.pdf : pdf exported by LibreOffice Writer after copy/pasting the text from khmer-fail.pdf
  • khmer-expected.jpg : previous pdf saved as image

I have opened the failing pdf in several viewers/OS with the same display.

Since copy/pasting the text from the failing PDF seems to render the correct text in other editors, I would guess the byte order is correct. So there must be something different in the properties of the PDF file itself.

What I can see about the properties of the pdf font

Failing pdf:
KhmerOS
TrueType (CID)
Encoding: Identity-H
Embedded

Expected pdf:
Khmer OS
TrueType
Encoding: WinAnsi
Embedded Subset

Digging further, it seems that this might not be an issue with this library but possibly something to do with Ubuntu 14.04 as well. I ran a test in Python which produces the exact same (failing) result. Unfortunately, I don't have other OS I can test on at the moment.

I guess this might have something to do with CIDtoGIDMap but really have no idea what this should be or where it comes from.

khmer-expected.pdf
khmer-fail.pdf
khmer-fail
khmer-expected

@PhilterPaper
Owner Author

OK, I can see that the 'failed' PDF has 3 characters, and the 'expected' (correct) version combined the first two and moved them after the third.

I fed the two PDFs into WinMerge, and unfortunately it appears that LibreOffice did massive changes to the file. It's not simply a matter of a record or two added or deleted. I will look at disassembling both files to see how they differ, but this unfortunately promises to be a lengthy process. I'm hoping that in the end it's just a minor change to the PDF::Builder output, and that most of the differences are unimportant with regards to the rendering of the text.

Thank you for helping with this, and don't hesitate to add more information if you can. I don't know if I can get to this any time soon, but I'll try.

@PhilterPaper
Owner Author

This is going to be worse than I thought. I was able to uncompress the PDF files (deflate compressed streams) using PDFtk, but it looks like even that rearranged and renumbered a lot of stuff. I will try to find another way to uncompress everything so I can look at it. It's also possible that the uncompressed code was corrupted -- for instance, the section which maps a subset of the TTF shows only three glyphs, with a Unicode value of U+1786 for all of them (CHA). The content object (the actual text output) shows just three glyphs, but until I can decode the subsetting stream, I don't know which they are. Shouldn't there be four?

Also, the glyph order (given as four Unicode entries) is U+1786 (CHA) U+17D2 (COENG) U+1780 (KA) U+17C2 (AE). I see that this seems to be Left-to-Right on "failed" and Right-to-Left on "expected" (AE is on the right on "failed", and on the left on "expected"). Can you confirm that this is the correct ordering? Is this something to do with the syntax of the language, that glyphs are taken out of order, or is there a problem somewhere?

Is there any chance of your running both 113700.pl and LibreOffice without any compression (no "Flate" compression used)? I attach the 113700.pl file with compression turned off:
113700.pl.txt
LibreOffice messed with the text position on the page and the font size, so who knows what else it changed.

To confirm: is the only difference between "failed" and "expected" that "expected" is "failed" run through LibreOffice?

@jbenezech

Please find attached

  • testkhmer.pdf : output of the script without compression
  • pdf-from-printer.pdf : I could not find a way to disable compression in LibreOffice. I printed a text file to the system pdf printer, which hopefully doesn't add too much garbage to the file
  • pdf-create.py.txt : a python script that demonstrates the same problem in python
  • pdf-from-python.pdf : output of the python script

Regarding ordering, AE should be the last character in the unicode sequence but the glyph should be rendered first (as shown in the "expected" pdf).
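The reordering described here can be sketched as a string rewrite from logical (Unicode) order to visual (glyph) order. This is only an illustration under stated assumptions: it treats U+17C1 to U+17C3 as the left-side vowels and U+1780 to U+17B3 as the base letters, it ignores every other Khmer shaping rule, and `visual_order` is a hypothetical helper, not part of PDF::Builder.

```perl
use strict;
use warnings;

# ASSUMPTION: only U+17C1..U+17C3 are handled as left-side vowels, and a
# cluster is a base letter followed by zero or more COENG + letter pairs.
sub visual_order {
    my ($logical) = @_;
    # Move a trailing left-side vowel in front of the cluster it follows.
    (my $visual = $logical) =~
        s/([\x{1780}-\x{17B3}](?:\x{17D2}[\x{1780}-\x{17B3}])*)([\x{17C1}-\x{17C3}])/$2$1/g;
    return $visual;
}

my $dog = "\x{1786}\x{17D2}\x{1780}\x{17C2}";    # CHA COENG KA AE
print join(' ', map { sprintf 'U+%04X', ord } split //, visual_order($dog)), "\n";
```

For the "dog" sequence this moves AE to the front and leaves the CHA, COENG, KA cluster untouched.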

testkhmer.pdf
pdf-from-printer.pdf
pdf-from-python.pdf
pdf-create.py.txt

@PhilterPaper
Owner Author

Curiouser and curiouser. I don't have Python installed, so I can't run your script, but it appears to be using a different PDF library, and not PDF::Builder or PDF::API2. Maybe their library was translated from PDF::API2? Anyway, like PDF::Builder, it outputs four glyphs in the same order, producing the same "fail" result. Also note that it's producing PDF version 1.3 output, which may be missing a lot of functionality.

The printer version I could uncompress, and like your "expected" version, it has three glyphs output, so someone is doing some processing to consolidate and rearrange the glyphs, something which PDF::Builder may need to do. I can see the three glyphs are (in order) U+17C2 (AE), U+1786 (CHA), and U+FFFD. That last one may be a problem with decompression, as U+FFFD is the Unicode "replacement character", normally substituted for invalid or undecodable input. I'm not sure what happened to U+17D2 (COENG) and U+1780 (KA)... did they get preprocessed and consolidated, or something else? I thought the whole intent was for the reader to do such processing. By the way, was this printer version output from LibreOffice, or from a PDF reader?

Regarding ordering, if AE (U+17C2) is the last glyph in the sequence, under whose rules is it moved to the front? Do vowels always get moved ahead of consonants? Again, that's something that should reasonably be left to the reader, but apparently this reordering needs to be done in producing the PDF! You mentioned that this seems to be a problem with other Indic scripts in PDF::Builder.

I'll keep chipping away at this, but for the moment I'm stuck. The first thing is to find a clean decompression (deflate) utility to uncompress the Flate-compressed streams. PDFtk is doing a lot more than just that, and I fear it may be breaking something in the process. I don't need a working PDF created, just something with human-readable uncompressed streams. Finally, I appreciate your efforts on this, and welcome any further help you can give.

@PhilterPaper
Owner Author

PhilterPaper commented Nov 6, 2017

Attached below is a dump of the KhmerOS.ttf font:
022_truefonts.KhmerOS_.pdf
113700.pl.txt

It shows the desired KA.sub character at G+382. My suspicion is that there is logic in LibreOffice and elsewhere that knows, if it sees COENG and a following character (e.g., KA), to get the glyph name of that character ("uni1780"), add ".sub" ("uni1780.sub"), get the CID (382 decimal, 017e hex), and replace the COENG and KA code points with this CID. It may not be that simple in general (there are other subscripts that can be added to base characters). Anyway, I manually edited the testkhmer.pdf file produced by 113700.pl, and it fails to show KA.sub. Maybe it needs to be explicitly listed in the mapping, or maybe the zero width is suppressing it.
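The suspected substitution can be sketched as follows. This is a rough illustration, not what LibreOffice or PDF::Builder actually does: `glyph_names` and `gid_hex` are hypothetical helpers, the "uniXXXX" naming for glyphs other than uni1780 is an assumption, and the only glyph ID taken from the font dump is 382 for uni1780.sub.

```perl
use strict;
use warnings;

use constant COENG => 0x17D2;

# Turn code points into glyph names, folding "COENG + letter" into a single
# "<name>.sub" glyph. ASSUMPTION: letters are U+1780..U+17B3 and every glyph
# is named "uniXXXX" after its code point.
sub glyph_names {
    my @cps = @_;
    my @names;
    while (@cps) {
        my $cp = shift @cps;
        if ($cp == COENG && @cps && $cps[0] >= 0x1780 && $cps[0] <= 0x17B3) {
            push @names, sprintf('uni%04X.sub', shift @cps);
        } else {
            push @names, sprintf('uni%04X', $cp);
        }
    }
    return @names;
}

# Only this entry comes from the font dump; a real table would be read from the font.
my %gid_for = ('uni1780.sub' => 382);

# With Identity-H encoding, each glyph ID is written as a 2-byte big-endian value.
sub gid_hex { return sprintf '%04x', $_[0] }

print join(' ', glyph_names(0x1786, 0x17D2, 0x1780, 0x17C2)), "\n";
print gid_hex($gid_for{'uni1780.sub'}), "\n";
```

Four code points collapse to three glyph names here, which is consistent with the three glyphs seen in the "expected" PDF.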

The AE vowel is shown as combining to the right, so why is it to be specified as the last Unicode point? Shouldn't it be specified first? Or is it intended to combine with the consonant given immediately before it? I tried AE as both the first glyph code, and the last. I'm confused.

@PhilterPaper
Owner Author

I've made some progress, but it's very, very ugly code. Attached are

  • a revised PDF/Builder/Resource/CIDFont.pm which looks for a COENG (U+17D2) and a following consonant or independent vowel (U+1780 .. U+17B3) and changes the generated CID string to replace the pair of CIDs (G+nnn) with a new CID of the subscripted character.
  • a revised 113700.pl to generate a test PDF with 3 examples
  • testkhmer.pdf output of 113700.pl

CIDFont.pm.txt
113700.pl.txt
testkhmer.pdf

This is hard coded to work with a narrow range of Khmer alphabet characters. At this point, I want to see if I'm more or less on the right track, before doing a lot more work. I'm not sure if it will work for other Khmer TTF files (if they have different CID assignments), how subscripts etc. other than "*.sub" names work, and I need to write a name-to-CID function for PDF::Builder. The AE vowel needed to be manually moved to before the consonant -- what are the rules for that (it needs to be automated)? And of course, more work would have to be done for other Indic-family scripts.

Anyway, please temporarily replace your CIDFont.pm file with the new one, and try it out on some Khmer text. I will need to know if the rule set needs to be greatly extended to handle all the other .sub.alt, .a, .sub.a, .sub.alt.a, .au, .sub.au, .sub.alt.au, .sub.alt1, .sub.alt2, .alt1, .sub.alt3, and 4 or so special cases. That doesn't even consider whether CIDs (numbers and names) are constant across different font files.

I'm working on the assumption that LibreOffice, etc. are doing something like this substitution and rearrangement internally, rather than the PDF reader doing it. That could explain why (for the "dog" example) there are only 3 glyphs being output, rather than the 4 you would expect from naively changing Unicode points to CIDs.

Thanks!

@jbenezech

I can confirm that the subscript is now rendered correctly.
AE is still misplaced on the right side using the initial test script but, as you mentioned, you did move it manually in the second line of your latest script. That second line now renders the word "dog" correctly, but AE should not have to be manually moved.

I tried a Java program using iText with the same failed result, so this seems to be a very common issue, although iText seems to say they support Indic scripts in their commercial edition.
I found this link, http://palashray.com/making-itext-work-with-indic-scripts/, which seems to indicate that there should be a glyph substitution table within the font file. Could it be that PDF::Builder skips reading this table?

@PhilterPaper
Owner Author

PhilterPaper commented Nov 9, 2017

If this stuff is built into the font, all the better. I don't see anything in PDF::Builder that claims to be doing "glyph substitution" or anything similar, so it probably doesn't read or process this table. I will look around for documentation on this. If you happen to know a good online source for documentation (and even better, some code), I'd appreciate hearing about it.

The article about iText seems to be talking about ligatures (substituting one glyph for a stream of two or more glyphs: AB -> C), and not subscripted glyphs, although it might be applicable to that, too (not a ligature: A+B -> Ab, where + is COENG).

Can you point me to any documentation on these tables, as well as rules for moving the AE vowel? Is it always a move left by one glyph? It seems silly to put it after the glyph in the stream, but expect it to render before that glyph. It would be simple enough to look ahead for AE and any other vowels to which this applies.

If a given font does not contain glyph substitution rules, would it be reasonable to hard code fallback rules? If so, what are all the rules we would need to support? I suspect that there will be a lot of them, and not just for Khmer. If full support for glyph substitution is common, special code like I just wrote could just be left out (handle as is done today) if the font lacks the table(s).

Add: I've done some searching, and I see lots of description of external files to define ligatures and other operations you want to do with glyphs, but so far, nothing built into a font. Still looking.

@PhilterPaper
Owner Author

PhilterPaper commented Nov 11, 2017

I'm making some progress here. I found that the GSUB rules are in the font file, but need to be explicitly read in to a huge hash array. Then I have to figure out how to interpret all these rules and lists of affected glyphs, etc. They're huge, and quite complicated. I have found the sections that handle the consonants and vowels with COENG, and I think I understand what they're doing, so I can get the same results as my code of a few days ago. Hopefully there are rules in there about moving vowels like AE.

Add: I don't see anything that looks like a simple move of AE from the right side of a consonant to the left, at least not in the GSUB (glyph substitution) data. Do you know if it might be lurking somewhere else in the font? Perhaps in a GPOS table somewhere?

If I go ahead and read the GSUB data for all TTF files, I have to see if there are any cases where we would not want to implement some of the rules (e.g., always replacing f and i by fi for Latin alphabet text).

Add: This might be something like $font->glyph_sub(1); to turn on glyph substitution when desired. You would always do this for Indic-family languages, but optionally for others.
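The opt-in idea can be illustrated with the Latin "fi" case on its own. This is a toy sketch with a hypothetical `apply_fi_ligature` helper working on Unicode text; it is not PDF::Builder code and does not read any GSUB data.

```perl
use strict;
use warnings;

# Hypothetical helper: replace "f" + "i" with the single ligature character
# U+FB01 (LATIN SMALL LIGATURE FI), but only when the caller asks for it.
sub apply_fi_ligature {
    my ($text, $enabled) = @_;
    return $text unless $enabled;
    (my $out = $text) =~ s/fi/\x{FB01}/g;
    return $out;
}

# With substitution on, "file" shrinks from 4 characters to 3.
printf "%d -> %d characters\n", length('file'), length(apply_fi_ligature('file', 1));
```

Leaving the flag off returns the text unchanged, which matches the current behavior for fonts or scripts where substitution is not wanted.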

@PhilterPaper
Owner Author

PhilterPaper commented Mar 22, 2019

Update to RT 113700:

This one is a major undertaking, because GSUB and GPOS processing needs to be added to TTF font processing. Indic family languages such as Khmer (also Arabic family languages) do a lot of ligatures, character substitution, and tweaking of glyph positions, and none of this is currently handled in PDF::API2 or PDF::Builder. You can't blindly implement ligatures, for example, because there are times (in English) when a ligature is inappropriate (such as across morpheme boundaries).

I have yet to find a clear definition of what a morpheme is, as opposed to a syllable. TeX just throws up its hands and says "I'm going to replace all ligatures that I can; tell me any I'm not supposed to."

@PhilterPaper
Owner Author

Per discussion in https://www.catskilltech.com/forum/resolved-bugs/rt-128674-error-requested-cmap-not-installed-with-many-cjk-fonts/30/ (#98)

Indic languages (which do a lot of ligatures, character substitutions, and moving stuff around) might also have a major impact. One thing on my plate is RT 113700, which requires implementing GSUB and GPOS.

The only code that I know of that implements this correctly is either "harfbuzz" or "libicu" (both C/C++).

HarfBuzz supposedly will accept a Unicode string (or CID string?) and return the CIDs in a revised order (with substitutions), with positioning information. HarfBuzz does not appear to have a Perl implementation or wrapper (in CPAN). There is Pango (in CPAN), which makes use of HarfBuzz internally, but at the moment I don't know if it will insist on rendering the adjusted text, too. I also don't know what is involved in installing HarfBuzz and Pango — if that's something that can be handled from the cpan utility interface. It's really a non-starter if cpan cannot handle the full installation of a HarfBuzz (or Pango) capability. I suppose that any HarfBuzz/Pango prereq would be optional, with a dummy stub library module so that things are no worse off than today!

I don't see a libicu/icu module on CPAN.

@PhilterPaper
Owner Author

PDF::Builder 3.018 will be released shortly, including changes to permit the use of HarfBuzz::Shaper for complex scripts (among other things). Although my test cases for Khmer are very limited, it appears that these changes resolve this issue. Therefore I am closing this ticket.
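For anyone wanting to see what the shaper produces for the "dog" sequence, here is a minimal sketch using HarfBuzz::Shaper by itself. It assumes the module is installed and that the font path from the original report exists; both are checked at run time, and the helper names `shape_text` and `shaped_names` are made up for this example. How PDF::Builder consumes the resulting array is covered in its own documentation and is not shown here.

```perl
use strict;
use warnings;

# Shape a string with HarfBuzz::Shaper. Returns nothing if the font file is
# unreadable or the module cannot be loaded.
sub shape_text {
    my ($fontfile, $size, $string) = @_;
    return unless -r $fontfile;
    return unless eval { require HarfBuzz::Shaper; 1 };
    my $hb = HarfBuzz::Shaper->new;
    $hb->set_font($fontfile);
    $hb->set_size($size);
    $hb->set_text($string);
    return $hb->shaper;    # array ref of hashes, one per output glyph
}

# Collect the glyph names reported for each shaped glyph.
sub shaped_names {
    my ($info) = @_;
    return map { $_->{name} } @$info;
}

# ASSUMPTION: font path as used in the original test script.
my $info = shape_text('../font/khmerOS.ttf', 20, "\x{1786}\x{17D2}\x{1780}\x{17C2}");
if ($info) { print join(' ', shaped_names($info)), "\n" }
else       { print "HarfBuzz::Shaper or the font file is not available\n" }
```

If shaping works as described in this thread, the output should list fewer glyphs than input code points, with the vowel first.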

@PhilterPaper removed the "help wanted" label Apr 24, 2020