page.drawText() inserts spaces when using Thai font #1010

Open
4 tasks done
robin-dunn opened this issue Oct 1, 2021 · 10 comments


@robin-dunn

robin-dunn commented Oct 1, 2021

What were you trying to do?

I am trying to use the page.drawText() function to render text in the Thai language

Why were you trying to do this?

To build an application that creates PDF files containing text written in the Thai language

How did you attempt to do it?

The steps I followed are:

  • Download Google Noto Sans Thai font
  • Embed the font in the pdf-lib PDF document
  • Invoke the page.drawText() function passing in the text in Thai

See code example provided in reproduction steps section below.

What actually happened?

The PDF file was successfully created but it seems some large spaces have been inserted into the Thai text in the PDF.

I've copied the text from the PDF and pasted it below. Notice the strange block characters that have been inserted.

แห่งได้เป􏰀ดขึ􏰁นแล้วในการขยายรถไฟใต้ดินลอนดอนครั􏰁งใหญ่ครั􏰁งแรกในศตวรรษนี

Those strange characters appear visually as large blank spaces in the PDF, e.g. like this:

แห่งได้เป ดขึ นแล้วในการขยายรถไฟใต้ดินลอนดอนครั งใหญ่ครั งแรกในศตวรรษนี

What did you expect to happen?

I expected the Thai text to be rendered as one continuous string without any strange characters or spaces inserted:

แห่งได้เปดขึนแล้วในการขยายรถไฟใต้ดินลอนดอนครังใหญ่ครังแรกในศตวรรษนี

How can we reproduce the issue?

  • Create a Node JS project folder e.g. called 'pdf-test'
  • cd pdf-test
  • npm init -y
  • npm i pdf-lib
  • npm i @pdf-lib/fontkit
  • Download Noto Sans Thai font from https://fonts.google.com/download?family=Noto%20Sans%20Thai
  • Unzip the font, copy the TTF file from Noto_Sans_Thai/static/NotoSansThai/NotoSansThai-Regular.ttf, and paste it into the project folder pdf-test so it can be loaded by the index.js script below
  • Create a file called index.js and paste the code from below
  • Run the index.js file using the command node index.js which will create the PDF file containing some Thai text
  • Use a PDF viewer/browser e.g. Google Chrome to view the rendered PDF
  • Notice the spacing between some of the Thai text
const fs = require('fs');
const path = require('path');
const { PDFDocument, rgb } = require('pdf-lib');
const fontkit = require('@pdf-lib/fontkit');

(async function run() {

    const pdfDoc = await PDFDocument.create()
    pdfDoc.registerFontkit(fontkit)
    
    // Font downloaded from https://fonts.google.com/download?family=Noto%20Sans%20Thai
    // See also https://fonts.google.com/noto/specimen/Noto+Sans+Thai?query=thai
    const thaiFontBytes = fs.readFileSync(path.join(__dirname, './NotoSansThai-Regular.ttf'))

    const thaiFont = await pdfDoc.embedFont(thaiFontBytes)
    const page = pdfDoc.addPage()
    const { width, height } = page.getSize()

    const fontSize = 11
    page.drawText('แห่งได้เปิดขึ้นแล้วในการขยายรถไฟใต้ดินลอนดอนครั้งใหญ่ครั้งแรกในศตวรรษนี้', {
        x: 50,
        y: height - 2 * fontSize,
        size: fontSize,
        font: thaiFont,
        color: rgb(0, 0.53, 0.71),
    })

    const pdfBytes = await pdfDoc.save()
    fs.writeFile('thai-test.pdf', pdfBytes, () => console.log('PDF file saved.'))
})()

Version

1.16.0

What environment are you running pdf-lib in?

Node

Required Reading

Additional Notes

No response

@robin-dunn robin-dunn changed the title from "page.drawText inserts spaces when using Thai font" to "page.drawText() inserts spaces when using Thai font" on Oct 1, 2021
@hlab-pawat

hlab-pawat commented Oct 2, 2021

I'm facing this problem too. I suspect the bug is in the UnicodeLayoutEngine class in the @pdf-lib/fontkit library.

@chacal88

Same for me, with many fonts.

@pfmartins

pfmartins commented Oct 11, 2021

Hey,
I see the same issue here. When I write to the document using fonts from the Google API, extra spaces " " are sometimes added to my text,
like this:
image

I'm looking for light 💡

@cassilup

@tudor-sandu, is this the issue you guys are experiencing?

@akomm

akomm commented Nov 11, 2021

Same here with Helvetica Neue Roman and Helvetica Neue Condensed. It inserts spaces, for example after the sequence fi, but not after i or f by itself. For example, Backoffice becomes Backoffi ce and fifi becomes fi fi.

@MetheeS
Copy link

MetheeS commented Nov 12, 2021

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true });
I don't know why this helps.

@akomm

akomm commented Nov 12, 2021

The effect in the first post comes from extra bytes being added to the text that fall outside the valid range for the character set. In a PDF, if there is no character for a byte sequence (UTF-8 is multi-byte with variable length), a reader renders it as a space. When you copy the text, however, the actual data including the added bytes is copied. A program that renders invalid or non-printable "chars" as placeholder "glyphs" (the squares in the first post) then shows the data as hex (for example 10F0C1) instead of rendering a space.

Also, neither the examples here nor my case look like the font simply lacks a proper glyph for a character.

I also ruled out non-printable bytes being present in the source beforehand. They are being added when the PDF is rendered.

https://unicode-table.com/en/search/?q=10F0C1

https://www.unicode.org/charts/PDF/U100000.pdf
Quote:

The Supplementary Private Use Area-B block encompasses the entire range of Plane 16. The range U+100000..U+10FFFD is entirely designated for private use. The last two code points on the plane, U+10FFFE..U+10FFFF, are designated noncharacters. Consequently, no character code charts or names lists are provided for the majority of this block, except that a chart and names list are provided for the last 128 code points, to show the location of the noncharacters.
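
To check whether text copied out of a generated PDF contains code points of this kind, a scan along the following lines may help. This is only a sketch: findPrivateUseCodePoints is a hypothetical helper, not part of pdf-lib or fontkit, and the sample string simply splices the 10F0C1 value mentioned above into a Thai fragment for illustration.

```javascript
// Code point ranges that Unicode designates for private use.
const PRIVATE_USE_RANGES = [
  [0xe000, 0xf8ff],     // Private Use Area (Basic Multilingual Plane)
  [0xf0000, 0xffffd],   // Supplementary Private Use Area-A (Plane 15)
  [0x100000, 0x10fffd], // Supplementary Private Use Area-B (Plane 16)
];

// Hypothetical helper (an assumption for illustration, not a pdf-lib API):
// returns one entry per private-use code point found in `text`.
function findPrivateUseCodePoints(text) {
  const hits = [];
  let index = 0; // UTF-16 code unit offset of the current code point
  for (const ch of text) { // for...of iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    if (PRIVATE_USE_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi)) {
      hits.push({ index, hex: cp.toString(16).toUpperCase() });
    }
    index += ch.length; // 2 for code points outside the BMP, otherwise 1
  }
  return hits;
}

// Illustrative input: a Thai fragment with U+10F0C1 spliced in.
const sample = 'เป\u{10F0C1}ด';
console.log(findPrivateUseCodePoints(sample));
// → [ { index: 2, hex: '10F0C1' } ]
```

Running the same check on the source string before it is passed to page.drawText() should return an empty array if the extra code points are only introduced during rendering.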

@ponnreay

(for Thai font) the issue can be resolved when we use .embedFont(fontBytes, { subset: true }); Don't know why this help.

This solution also works for Khmer fonts.

@AgileEduLabs

@akomm

same here with helvetica neue roman and helvetica neue condensed It inserts spaces, for example after the sequence of fi, but not after i or f by itself. For example Backoffice becomes Backoffi ce and fifi becomes fi fi

Try the following:
await pdfDoc.embedFont(YOURFONT, { features: { liga: false } });

It is definitely a bug, and in my opinion it should be fixed: #490

@c-sanchez-fd

(for Thai font) the issue can be resolved when we use .embedFont(fontBytes, { subset: true }); Don't know why this help.

This solution also works for Calibri fonts
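
The two workarounds reported in this thread ({ subset: true } and { features: { liga: false } }) can be kept in one place with a small wrapper such as the sketch below. The name embedFontWithWorkarounds and its subset and disableLigatures parameters are made up for illustration. The thread does not establish whether both options are needed together or how they interact, so treat the defaults as an assumption to verify against your own font.

```javascript
// Hypothetical wrapper (an assumption, not part of pdf-lib) around
// PDFDocument.embedFont(). `pdfDoc` is expected to be a pdf-lib PDFDocument
// on which registerFontkit(fontkit) has already been called.
function embedFontWithWorkarounds(
  pdfDoc,
  fontBytes,
  { subset = true, disableLigatures = true } = {},
) {
  // Subsetting is the workaround reported above for Thai, Khmer and Calibri.
  const options = { subset };
  if (disableLigatures) {
    // Workaround reported above for "fi"-style ligature gaps.
    options.features = { liga: false };
  }
  // Returns whatever pdfDoc.embedFont returns (a Promise in pdf-lib).
  return pdfDoc.embedFont(fontBytes, options);
}
```

In the reproduction script from the first post, this would replace the embedFont call, for example const thaiFont = await embedFontWithWorkarounds(pdfDoc, thaiFontBytes).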
