ColumnText + Arabic Reshaping causes arabic characters to no longer appear. Removing column text makes characters appear. #940

bfryer-snap · 2023-08-10T18:58:33Z

Describe the bug
Arabic reshaping leads to characters not being rendered in the PDF with some fonts when ColumnText is used. If I do not use ColumnText, the characters appear.

To Reproduce

We can use a modified version of the RightToLeft.java example to show the issue:

Here it is working:

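import java.awt.Color;
import java.io.FileOutputStream;

import com.lowagie.text.Chunk;
import com.lowagie.text.Document;
import com.lowagie.text.Element;
import com.lowagie.text.Font;
import com.lowagie.text.PageSize;
import com.lowagie.text.pdf.BaseFont;
import com.lowagie.text.pdf.ColumnText;
import com.lowagie.text.pdf.PdfContentByte;
import com.lowagie.text.pdf.PdfWriter;

// The method and the ar1..ar4 fields below belong inside a class (RightToLeft in the original example).
    public static void main(String[] args) {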
        try {
            // step 1
            Document document = new Document(PageSize.A4, 50, 50, 50, 50);
            // step 2
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
            // step 3
            document.open();
            // step 4
            PdfContentByte cb = writer.getDirectContent();

            // Font can be found here:
            // https://fonts.google.com/noto/specimen/Noto+Sans+Arabic?sort=popularity&subset=arabic
            BaseFont bf = BaseFont.createFont("NotoSansArabic-regular.ttf", BaseFont.IDENTITY_H, true);

            ColumnText ct = new ColumnText(cb);
            ct.setSimpleColumn(100, 100, 500, 800, 24, Element.ALIGN_LEFT);
            ct.setSpaceCharRatio(PdfWriter.NO_SPACE_CHAR_RATIO);
            ct.setLeading(0, 1);
            ct.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
            ct.setAlignment(Element.ALIGN_CENTER);
            ct.addText(new Chunk(ar1, new Font(bf, 16)));
            ct.addText(new Chunk(ar2, new Font(bf, 16, Font.NORMAL, Color.red)));
            ct.go();
            ct.setAlignment(Element.ALIGN_JUSTIFIED);
            ct.addText(new Chunk(ar3, new Font(bf, 12)));
            ct.go();
            ct.setAlignment(Element.ALIGN_CENTER);
            ct.addText(new Chunk(ar4, new Font(bf, 14)));
            ct.go();

            // step 5
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * arabic text
     */
    public static String ar1 = "\u0623\u0648\u0631\u0648\u0628\u0627, \u0628\u0631\u0645\u062c\u064a\u0627\u062a "
            + "\u0627\u0644\u062d\u0627\u0633\u0648\u0628 + \u0627\u0646\u062a\u0631\u0646\u064a\u062a :\n\n";
    /**
     * arabic text
     */
    public static String ar2 = "\u062a\u0635\u0628\u062d \u0639\u0627\u0644\u0645\u064a\u0627 \u0645\u0639 "
            + "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";
    /**
     * arabic text
     */
    public static String ar3 = "\u062a\u0633\u062c\u0651\u0644 \u0627\u0644\u0622\u0646 \u0644\u062d\u0636\u0648\u0631 "
            + "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0627\u0644\u062f\u0648\u0644\u064a "
            + "\u0627\u0644\u0639\u0627\u0634\u0631 \u0644\u064a\u0648\u0646\u064a\u0643\u0648\u062f, "
            + "\u0627\u0644\u0630\u064a \u0633\u064a\u0639\u0642\u062f \u0641\u064a 10-12 \u0622\u0630\u0627\u0631 "
            + "1997 \u0628\u0645\u062f\u064a\u0646\u0629 \u0645\u0627\u064a\u0646\u062a\u0633, "
            + "\u0623\u0644\u0645\u0627\u0646\u064a\u0627. \u0648\u0633\u064a\u062c\u0645\u0639 "
            + "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0628\u064a\u0646 \u062e\u0628\u0631\u0627\u0621 "
            + "\u0645\u0646  \u0643\u0627\u0641\u0629 \u0642\u0637\u0627\u0639\u0627\u062a "
            + "\u0627\u0644\u0635\u0646\u0627\u0639\u0629 \u0639\u0644\u0649 \u0627\u0644\u0634\u0628\u0643\u0629 "
            + "\u0627\u0644\u0639\u0627\u0644\u0645\u064a\u0629 \u0627\u0646\u062a\u0631\u0646\u064a\u062a "
            + "\u0648\u064a\u0648\u0646\u064a\u0643\u0648\u062f, \u062d\u064a\u062b \u0633\u062a\u062a\u0645, "
            + "\u0639\u0644\u0649 \u0627\u0644\u0635\u0639\u064a\u062f\u064a\u0646 "
            + "\u0627\u0644\u062f\u0648\u0644\u064a \u0648\u0627\u0644\u0645\u062d\u0644\u064a \u0639\u0644\u0649 "
            + "\u062d\u062f \u0633\u0648\u0627\u0621 \u0645\u0646\u0627\u0642\u0634\u0629 \u0633\u0628\u0644 "
            + "\u0627\u0633\u062a\u062e\u062f\u0627\u0645 \u064a\u0648\u0646\u0643\u0648\u062f  \u0641\u064a "
            + "\u0627\u0644\u0646\u0638\u0645 \u0627\u0644\u0642\u0627\u0626\u0645\u0629 "
            + "\u0648\u0641\u064a\u0645\u0627 \u064a\u062e\u0635 "
            + "\u0627\u0644\u062a\u0637\u0628\u064a\u0642\u0627\u062a "
            + "\u0627\u0644\u062d\u0627\u0633\u0648\u0628\u064a\u0629, \u0627\u0644\u062e\u0637\u0648\u0637, "
            + "\u062a\u0635\u0645\u064a\u0645 \u0627\u0644\u0646\u0635\u0648\u0635  "
            + "\u0648\u0627\u0644\u062d\u0648\u0633\u0628\u0629 \u0645\u062a\u0639\u062f\u062f\u0629 "
            + "\u0627\u0644\u0644\u063a\u0627\u062a.\n\n";
    /**
     * arabic text
     */
    public static String ar4 = "\u0639\u0646\u062f\u0645\u0627 \u064a\u0631\u064a\u062f "
            + "\u0627\u0644\u0639\u0627\u0644\u0645 \u0623\u0646 \u064a\u062a\u0643\u0644\u0651\u0645, "
            + "\u0641\u0647\u0648 \u064a\u062a\u062d\u062f\u0651\u062b \u0628\u0644\u063a\u0629 "
            + "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";

Output:

If we leave everything exactly the same as above, except that we change "NotoSansArabic-regular.ttf" to a different font, such as "GraphikArabic-Regular.ttf", then we get the following output:

The problem can be seen most easily by looking at the lower left section of the main paragraph. In the NotoSansArabic font, we can see a word that looks like it spans multiple characters. In the GraphikArabic font, the right half of the word is missing and only the last two characters seem to be present.

A specific character that seems to be rendered by NotoSansArabic and not GraphikArabic is \u0627.

I thought that GraphikArabic was missing the \u0627 character altogether, but if I use the following code, I can generate it just fine:

    public static void main(String[] args) {
        try {
            // step 1
            Document document = new Document(PageSize.A4, 50, 50, 50, 50);
            // step 2
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
            // step 3
            document.open();
            // step 4
            PdfContentByte cb = writer.getDirectContent();
            BaseFont bf = BaseFont.createFont("GraphikArabic-regular.ttf", BaseFont.IDENTITY_H, true);
            Font font = new Font(bf);
            document.add(new Paragraph("\u0627   \u0627   \u0627", font));

            // step 5
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

screenshot of the output being as expected

System.out.println(bf.charExists('\u0627')); also outputs true when using the GraphikArabic font. I assume that BaseFont::charExists(char) is the way to determine if the given char should show on the PDF.

I believe the issue is that characters like \u0627 are being reshaped into quite different characters in another Unicode block (Arabic Presentation Forms-B, U+FE70 to U+FEFF) and that the font does not support the reshaped characters. I believe this because, when debugging, I can see that some characters such as 0x0627 become 0x0FE8E. This transformation happens here: https://github.com/LibrePDF/OpenPDF/blob/master/openpdf/src/main/java/com/lowagie/text/pdf/BidiLine.java#L197
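To make the relationship between the two code points concrete, here is a minimal sketch that uses only the JDK (no OpenPDF calls). It checks which Unicode block a character belongs to and uses compatibility normalization (NFKC) to map a presentation form back to its base letter. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

final class PresentationFormDemo {

    /** Returns true if the char lies in Arabic Presentation Forms-A or Forms-B. */
    static boolean isPresentationForm(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    /** Maps a character to its base letter(s) via compatibility normalization (NFKC). */
    static String toBaseLetters(char c) {
        return Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        char base = '\u0627';   // ARABIC LETTER ALEF
        char shaped = '\uFE8E'; // ARABIC LETTER ALEF FINAL FORM
        for (char c : new char[] {base, shaped}) {
            System.out.printf("U+%04X block=%s presentationForm=%b%n",
                    (int) c, Character.UnicodeBlock.of(c), isPresentationForm(c));
        }
        // The final form decomposes back to the base letter.
        System.out.printf("NFKC(U+%04X) = U+%04X%n",
                (int) shaped, (int) toBaseLetters(shaped).charAt(0));
    }
}
```

A font that only covers U+0600 to U+06FF has a glyph for the first character but not necessarily for the second, which would match the behaviour described above.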

Expected behavior

I expect ColumnText and adding elements to a document directly to have the same output OR I expect to be able to skip the "reshaping" process so that I can continue to use a font which supports the 0x0600 to 0x06FF character range.
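As a diagnostic aid, the following sketch lists the code points of a string for which a font reports no glyph. The `fontHasGlyph` predicate is a hypothetical stand-in for a check such as the `bf.charExists(...)` call above. The helper is not part of OpenPDF, and it has to be applied to the text after reshaping to be meaningful for this issue.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.IntPredicate;

final class GlyphCoverage {

    /**
     * Returns the distinct non-whitespace code points of the text for which
     * the predicate reports no glyph, in order of first appearance.
     */
    static Set<Integer> missingCodePoints(String text, IntPredicate fontHasGlyph) {
        Set<Integer> missing = new LinkedHashSet<>();
        text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .filter(cp -> !fontHasGlyph.test(cp))
                .forEach(cp -> missing.add(cp));
        return missing;
    }
}
```

For example, with a predicate that accepts only U+0600 to U+06FF, the string "\u0627 \uFE8E" yields a set containing just 0xFE8E.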

Screenshots
screenshots added above.

System (please complete the following information):

OS: MacOS Ventura 13.4.1
Used Font: NotoSansArabic-Regular.ttf, GraphikArabic-Regular.ttf

Additional context

andreasrosdal · 2024-02-14T18:07:22Z

Thank you for reporting this bug. Please submit a pull request with a solution to this problem if you can.

vk-github18 · 2024-03-14T19:37:48Z

Neither 0x0627 nor 0x0FE8E can be found in BidiLine.java

Seems to be a problem with the commercial font GraphikArabic.

asturio · 2024-03-16T19:26:43Z

May be related to #938

vk-github18 · 2024-05-01T11:11:17Z

See also https://github.com/LibrePDF/OpenPDF/wiki/Accents,-DIN-91379,-non-Latin-scripts

bfryer-snap · 2024-05-01T16:48:43Z

Hi @vk-github18, I was able to print 0x0627 with GraphikArabic. I could not print 0x0FE8E. It seems that the library was converting characters from form A -> form B based on the surrounding characters in ArabicLigaturizer.java. We can see 0x0627 is defined here: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L100

We can also see that 0x0627 is included in a row of charTable: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L132

And we can see that charTable is used in two methods that seem to be doing some sort of transformation: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L212-L253

So the question is:

When a font does not support 100% of the Arabic character set but does include the "base" character set (I am not familiar enough with Arabic character sets to use the proper terms), should the library still try to convert a character into something that the font does not support?

Or even more generally speaking:

Should the openPDF library transform the given characters to another set of characters even if the resulting characters are not supported by the font?

An ArabicLigaturizer transformation bypass flag/option would make it so that it's up to the client to know the limitations of their font. At the moment, the only option is to switch fonts entirely.
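One way such an option could behave is sketched below, using only the JDK. After shaping, any presentation form that the font cannot display is mapped back to its base letter(s) with NFKC normalization. This is a hypothetical fallback, not an existing OpenPDF feature: `unshapeUnsupported` and the `fontHasGlyph` predicate are made-up names. It also has known limits: the fallback letters lose their contextual joining shape, and a ligature such as U+FEFB decomposes into two letters whose order may need attention if the text is already in visual order.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

final class ShapingFallback {

    /**
     * Replaces every Arabic presentation form that the font cannot display
     * with its NFKC decomposition; all other characters are kept as they are.
     */
    static String unshapeUnsupported(String shaped, IntPredicate fontHasGlyph) {
        StringBuilder out = new StringBuilder(shaped.length());
        for (int i = 0; i < shaped.length(); i++) {
            char c = shaped.charAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
            boolean presentationForm =
                    block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                    || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
            if (presentationForm && !fontHasGlyph.test(c)) {
                // Fall back to the base letter(s) the font is more likely to contain.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With a predicate that accepts only U+0600 to U+06FF, the shaped input "\uFE8E\u0628" comes back as "\u0627\u0628". With a predicate that accepts everything, the input is returned unchanged.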

vk-github18 · 2024-05-01T19:06:17Z

What is the result if you use
import com.lowagie.text.pdf.LayoutProcessor; ... LayoutProcessor.enableKernLiga();

as explained in https://github.com/LibrePDF/OpenPDF/wiki/Accents,-DIN-91379,-non-Latin-scripts ?

vk-github18 · 2024-05-01T19:10:00Z

To your question: if the commercial font you used does not support Arabic properly, you should open an issue with the producer of the font.
There is not much a library can do if the font is not correct. Introducing special handling for incomplete or incorrect fonts is not the way to go. The transformations for Arabic scripts are mandatory.

bfryer-snap added the bug label Aug 10, 2023

andreasrosdal added the CRITICAL Important issue label Feb 14, 2024
