ColumnText + Arabic reshaping causes Arabic characters to no longer appear. Removing ColumnText makes the characters appear. #940

Open
bfryer-snap opened this issue Aug 10, 2023 · 7 comments
Labels
bug CRITICAL Important issue

Comments

bfryer-snap commented Aug 10, 2023

Describe the bug
Arabic reshaping leads to characters not being rendered in the PDF when using some fonts. If I do not use ColumnText, the characters appear.

To Reproduce

We can use a modified version of the RightToLeft.java example to show the issue:

Here it is working:

    public static void main(String[] args) {
        try {
            // step 1
            Document document = new Document(PageSize.A4, 50, 50, 50, 50);
            // step 2
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
            // step 3
            document.open();
            // step 4
            PdfContentByte cb = writer.getDirectContent();

            // Font can be found here:
            // https://fonts.google.com/noto/specimen/Noto+Sans+Arabic?sort=popularity&subset=arabic
            BaseFont bf = BaseFont.createFont("NotoSansArabic-regular.ttf", BaseFont.IDENTITY_H, true);

            ColumnText ct = new ColumnText(cb);
            ct.setSimpleColumn(100, 100, 500, 800, 24, Element.ALIGN_LEFT);
            ct.setSpaceCharRatio(PdfWriter.NO_SPACE_CHAR_RATIO);
            ct.setLeading(0, 1);
            ct.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
            ct.setAlignment(Element.ALIGN_CENTER);
            ct.addText(new Chunk(ar1, new Font(bf, 16)));
            ct.addText(new Chunk(ar2, new Font(bf, 16, Font.NORMAL, Color.red)));
            ct.go();
            ct.setAlignment(Element.ALIGN_JUSTIFIED);
            ct.addText(new Chunk(ar3, new Font(bf, 12)));
            ct.go();
            ct.setAlignment(Element.ALIGN_CENTER);
            ct.addText(new Chunk(ar4, new Font(bf, 14)));
            ct.go();

            // step 5
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * arabic text
     */
    public static String ar1 = "\u0623\u0648\u0631\u0648\u0628\u0627, \u0628\u0631\u0645\u062c\u064a\u0627\u062a "
            + "\u0627\u0644\u062d\u0627\u0633\u0648\u0628 + \u0627\u0646\u062a\u0631\u0646\u064a\u062a :\n\n";
    /**
     * arabic text
     */
    public static String ar2 = "\u062a\u0635\u0628\u062d \u0639\u0627\u0644\u0645\u064a\u0627 \u0645\u0639 "
            + "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";
    /**
     * arabic text
     */
    public static String ar3 = "\u062a\u0633\u062c\u0651\u0644 \u0627\u0644\u0622\u0646 \u0644\u062d\u0636\u0648\u0631 "
            + "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0627\u0644\u062f\u0648\u0644\u064a "
            + "\u0627\u0644\u0639\u0627\u0634\u0631 \u0644\u064a\u0648\u0646\u064a\u0643\u0648\u062f, "
            + "\u0627\u0644\u0630\u064a \u0633\u064a\u0639\u0642\u062f \u0641\u064a 10-12 \u0622\u0630\u0627\u0631 "
            + "1997 \u0628\u0645\u062f\u064a\u0646\u0629 \u0645\u0627\u064a\u0646\u062a\u0633, "
            + "\u0623\u0644\u0645\u0627\u0646\u064a\u0627. \u0648\u0633\u064a\u062c\u0645\u0639 "
            + "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0628\u064a\u0646 \u062e\u0628\u0631\u0627\u0621 "
            + "\u0645\u0646  \u0643\u0627\u0641\u0629 \u0642\u0637\u0627\u0639\u0627\u062a "
            + "\u0627\u0644\u0635\u0646\u0627\u0639\u0629 \u0639\u0644\u0649 \u0627\u0644\u0634\u0628\u0643\u0629 "
            + "\u0627\u0644\u0639\u0627\u0644\u0645\u064a\u0629 \u0627\u0646\u062a\u0631\u0646\u064a\u062a "
            + "\u0648\u064a\u0648\u0646\u064a\u0643\u0648\u062f, \u062d\u064a\u062b \u0633\u062a\u062a\u0645, "
            + "\u0639\u0644\u0649 \u0627\u0644\u0635\u0639\u064a\u062f\u064a\u0646 "
            + "\u0627\u0644\u062f\u0648\u0644\u064a \u0648\u0627\u0644\u0645\u062d\u0644\u064a \u0639\u0644\u0649 "
            + "\u062d\u062f \u0633\u0648\u0627\u0621 \u0645\u0646\u0627\u0642\u0634\u0629 \u0633\u0628\u0644 "
            + "\u0627\u0633\u062a\u062e\u062f\u0627\u0645 \u064a\u0648\u0646\u0643\u0648\u062f  \u0641\u064a "
            + "\u0627\u0644\u0646\u0638\u0645 \u0627\u0644\u0642\u0627\u0626\u0645\u0629 "
            + "\u0648\u0641\u064a\u0645\u0627 \u064a\u062e\u0635 "
            + "\u0627\u0644\u062a\u0637\u0628\u064a\u0642\u0627\u062a "
            + "\u0627\u0644\u062d\u0627\u0633\u0648\u0628\u064a\u0629, \u0627\u0644\u062e\u0637\u0648\u0637, "
            + "\u062a\u0635\u0645\u064a\u0645 \u0627\u0644\u0646\u0635\u0648\u0635  "
            + "\u0648\u0627\u0644\u062d\u0648\u0633\u0628\u0629 \u0645\u062a\u0639\u062f\u062f\u0629 "
            + "\u0627\u0644\u0644\u063a\u0627\u062a.\n\n";
    /**
     * arabic text
     */
    public static String ar4 = "\u0639\u0646\u062f\u0645\u0627 \u064a\u0631\u064a\u062f "
            + "\u0627\u0644\u0639\u0627\u0644\u0645 \u0623\u0646 \u064a\u062a\u0643\u0644\u0651\u0645, "
            + "\u0641\u0647\u0648 \u064a\u062a\u062d\u062f\u0651\u062b \u0628\u0644\u063a\u0629 "
            + "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";

Output:
Screenshot 2023-08-10 at 11 41 24 AM

If we leave everything exactly the same as above, except that we change "NotoSansArabic-regular.ttf" to a different font, such as "GraphikArabic-Regular.ttf", then we get the following output:
Screenshot 2023-08-10 at 11 43 03 AM

The problem is easiest to see in the lower left section of the main paragraph. With the NotoSansArabic font, we can see a word that spans multiple characters. With the GraphikArabic font, the right half of that word is missing and only the last two characters seem to remain.

A specific character that seems to be rendered by NotoSansArabic and not GraphikArabic is \u0627.

I thought that GraphikArabic was missing the \u0627 character altogether, but with the following code I can generate it just fine:

    public static void main(String[] args) {
        try {
            // step 1
            Document document = new Document(PageSize.A4, 50, 50, 50, 50);
            // step 2
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
            // step 3
            document.open();
            // step 4
            PdfContentByte cb = writer.getDirectContent();
            BaseFont bf = BaseFont.createFont("GraphikArabic-regular.ttf", BaseFont.IDENTITY_H, true);
            Font font = new Font(bf);
            document.add(new Paragraph("\u0627   \u0627   \u0627", font));

            // step 5
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Screenshot of the output, which is as expected:
Screenshot 2023-08-10 at 11 47 13 AM

System.out.println(bf.charExists('\u0627')); also outputs true when using the GraphikArabic font. I assume that BaseFont::charExists(char) is the way to determine whether a given char should show up in the PDF.
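The same kind of check can be run over every code point in a string rather than a single char. The following is a minimal sketch, not part of OpenPDF: `GlyphCoverage` and `missing` are hypothetical names, and the font is stood in for by a plain `IntPredicate` that pretends only U+0600 to U+06FF are covered.

```java
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not an OpenPDF class.
class GlyphCoverage {

    /** Returns the distinct code points of text that hasGlyph rejects, formatted as U+XXXX. */
    static List<String> missing(String text, IntPredicate hasGlyph) {
        return text.codePoints()
                .distinct()
                .filter(cp -> !hasGlyph.test(cp))
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: a stand-in for a font that only covers the basic Arabic block.
        IntPredicate basicArabicOnly = cp -> cp >= 0x0600 && cp <= 0x06FF;
        // The string holds ALEF (U+0627) followed by ALEF FINAL FORM (U+FE8E).
        System.out.println(missing("\u0627\uFE8E", basicArabicOnly)); // prints [U+FE8E]
    }
}
```

With a real font, the predicate could presumably be replaced by `bf::charExists`, so that the answer comes from the loaded BaseFont instead of a hard-coded range.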

I believe the issue is that characters like \u0627 are being reshaped into quite different characters in a whole other Unicode block, and that the font does not support the reshaped characters. I believe this because, when debugging, I can see that some characters such as 0x0627 become 0x0FE8E. This transformation happens here: https://github.com/LibrePDF/OpenPDF/blob/master/openpdf/src/main/java/com/lowagie/text/pdf/BidiLine.java#L197
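The "whole other Unicode block" observation can be checked with the JDK alone. This is a small sketch (the class name `ArabicBlocks` is made up for illustration): U+0627 sits in the Arabic block (U+0600 to U+06FF), while U+FE8E sits in Arabic Presentation Forms-B (U+FE70 to U+FEFF).

```java
// Illustrative only: looks up the Unicode block of the two code points from this issue.
class ArabicBlocks {

    /** Returns the Unicode block that contains the given code point. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        int base = 0x0627;   // ARABIC LETTER ALEF
        int shaped = 0xFE8E; // ARABIC LETTER ALEF FINAL FORM
        System.out.printf("U+%04X is in %s%n", base, blockOf(base));
        System.out.printf("U+%04X is in %s%n", shaped, blockOf(shaped));
    }
}
```

A font that only ships glyphs for the first block would therefore pass a charExists check on the input text and still have nothing to draw after reshaping.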

Expected behavior

I expect ColumnText and adding elements to a document directly to have the same output OR I expect to be able to skip the "reshaping" process so that I can continue to use a font which supports the 0x0600 to 0x06FF character range.

Screenshots
screenshots added above.


@andreasrosdal
Contributor

Thank you for reporting this bug. Please submit a pull request with a solution to this problem if you can.

@andreasrosdal andreasrosdal added the CRITICAL Important issue label Feb 14, 2024
@vk-github18
Contributor

Neither 0x0627 nor 0x0FE8E can be found in BidiLine.java

Seems to be a problem with the commercial font GraphikArabic.

@asturio
Member

asturio commented Mar 16, 2024

May be related to #938

@vk-github18
Contributor

@bfryer-snap
Author

Hi @vk-github18, I was able to print 0x0627 with GraphikArabic, but I could not print 0x0FE8E. It seems that the library converts characters from one form to another based on the surrounding characters, and that this happens in ArabicLigaturizer.java. We can see that 0x0627 is defined here: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L100

We can also see that 0x0627 appears in a row of charTable: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L132

And we can see that charTable is used in two methods that seem to perform some sort of transformation: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L212-L253

So the question is:

  • When a font does not support 100% of the Arabic character set but does include the "base" character set (I am not familiar enough with Arabic character sets to use the proper terms), should the library still try to convert a character into something that the font does not support?

Or even more generally speaking:

  • Should the OpenPDF library transform the given characters into another set of characters even if the resulting characters are not supported by the font?

An ArabicLigaturizer bypass flag/option would leave it up to the client to know the limitations of their font. At the moment, the only option is to switch fonts entirely.
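To make the question concrete, here is a rough sketch of what a font-aware fallback could look like. It is purely hypothetical and not an existing OpenPDF option: `PresentationFormFallback` and `fallback` are invented names, the font is again modelled as an `IntPredicate`, and the mapping back to base letters relies on the JDK's NFKC normalization.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

// Hypothetical sketch; OpenPDF does not expose a hook like this.
class PresentationFormFallback {

    /**
     * Keeps every code point the font can draw and replaces the others
     * with their NFKC compatibility form (for U+FE8E that is U+0627).
     */
    static String fallback(String shaped, IntPredicate hasGlyph) {
        StringBuilder out = new StringBuilder(shaped.length());
        shaped.codePoints().forEach(cp -> {
            if (hasGlyph.test(cp)) {
                out.appendCodePoint(cp);
            } else {
                String single = new String(Character.toChars(cp));
                out.append(Normalizer.normalize(single, Normalizer.Form.NFKC));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: a font that only covers the basic Arabic block.
        IntPredicate basicArabicOnly = cp -> cp >= 0x0600 && cp <= 0x06FF;
        String result = fallback("\uFE8E", basicArabicOnly);
        System.out.printf("U+%04X%n", result.codePointAt(0)); // prints U+0627
    }
}
```

The trade-off is that a base letter carries no contextual form, so text that falls back this way would show up but would likely look unjoined unless the font or renderer shapes it some other way.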

@vk-github18
Contributor

What is the result if you use
import com.lowagie.text.pdf.LayoutProcessor; ... LayoutProcessor.enableKernLiga();

as explained in https://github.com/LibrePDF/OpenPDF/wiki/Accents,-DIN-91379,-non-Latin-scripts ?

@vk-github18
Contributor

vk-github18 commented May 1, 2024

To answer your question: if the commercial font you used does not support Arabic properly, you should open an issue with the font's vendor.
There is not much a library can do if the font is not correct. Introducing special handling for incomplete or incorrect fonts is not the way to go. The transformations for Arabic scripts are mandatory.
