Commit

Replace pdf.ttf with sharp2.ttf, keep name the same
As discussed at length in issue #182, the existing pdf.ttf causes difficulties
for certain PDF viewers, in part because the old file had zero advance width.

With testing, sharp2.ttf seems to be the best available compromise, although
it's not perfect and causes some visual difficulties in Evince.  It does
seem to fix Kindle and OS X Preview.
James R. Barlow committed Feb 11, 2016
1 parent b68be44 commit b30930b
Showing 1 changed file with 0 additions and 0 deletions.
Binary file modified tessdata/pdf.ttf
Binary file not shown.

6 comments on commit b30930b

@LeoFCardoso

Considering the Ghostscript bug and the related discussion at http://bugs.ghostscript.com/show_bug.cgi?id=696116, can pdf.ttf somehow be adjusted to work around the "gs" behavior?

@jbreiden
Contributor

@jbreiden commented on b30930b Nov 26, 2016

I updated the font metrics in pdf.ttf for Firefox compatibility. That change should have reached GitHub recently as part of Tesseract 4.x, so probably the first thing to do is retest when the dust settles. [EDIT: I am going to sit down and figure out the current state of affairs on Monday before I confuse myself and everybody else]

@jbreiden
Contributor

PS. Don't commit compatibility changes to PDF generation without my involvement. It is very easy to break one thing while fixing another.

@jbreiden
Contributor

Okay, I've confirmed that the GitHub pdf.ttf needs updating. Talking to Ray...

@jbreiden
Contributor

This is what we ultimately want.

$ md5sum pdf.ttf
e436074b54ed9cc5bf4789f79059b01b pdf.ttf
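
For anyone who cannot run md5sum, the same check can be sketched in Python. This is only an illustration: the helper names `md5_of_file` and `is_expected_font` are hypothetical, and the expected digest is simply the value quoted above.

```python
import hashlib

# Checksum quoted in the comment above for the desired pdf.ttf.
EXPECTED_MD5 = "e436074b54ed9cc5bf4789f79059b01b"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_expected_font(path, expected=EXPECTED_MD5):
    """True if the file's MD5 matches `expected` (hypothetical helper)."""
    return md5_of_file(path) == expected.lower()
```

Calling `is_expected_font("tessdata/pdf.ttf")` from the repository root would then report whether the local copy matches the digest above.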

@jbarlow83

This issue I reported to Ghostscript is related:
http://bugs.ghostscript.com/show_bug.cgi?id=696874

OCR output produced by Tesseract may not survive Ghostscript pdfwrite in versions earlier than 9.20. Versions <= 9.19 have a bug that can corrupt the character mapping if characters above U+00FF appear. That can easily happen even for "plain English" text if Tesseract misdetects a diacritic, or picks up a ligature or special character.
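
To see whether a given piece of OCR text contains characters above U+00FF, a rough check like the following could be used. It is a sketch only: the function names are made up for illustration, and it inspects plain text, not the PDF itself.

```python
def chars_above_latin1(text):
    """Return the sorted unique characters in `text` above U+00FF."""
    return sorted({ch for ch in text if ord(ch) > 0xFF})


def describe(chars):
    """Format characters as U+XXXX code point labels."""
    return [f"U+{ord(ch):04X}" for ch in chars]


if __name__ == "__main__":
    # "\ufb01" is the "fi" ligature, which OCR can emit for ordinary English.
    sample = "o\ufb03ce and o\ufb01ce, caf\u00e9"
    print(describe(chars_above_latin1(sample)))
```

Accented Latin-1 letters such as U+00E9 fall below the threshold, while ligatures such as U+FB01 and U+FB03 are above it.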
