@mietek The garbage in these cases is related to invalid positions and dimensions of the characters in the files, which, as you noticed in issue #14, might be the outcome of some OCR process. The best way to solve this is indeed to add OCR functionality to CERMINE, which is on our long-term TODO list.
Thanks. I note that the PDFs in question haven’t been scanned, so I doubt the garbage is the result of an OCR process. The glyph data in the vector layer seems intact, while the character data in the text layer seems garbled. I conjecture these PDFs were not produced directly, say, by pdflatex, but are instead the result of some conversion process gone badly wrong.
@mietek Thanks for the clarification. As I am unable to solve this directly, I am closing this issue. I will leave issue #14 open to discuss everything related to the OCR feature request.
For some PDFs, CERMINE fails to determine the structure of the document and outputs garbage.
Examples
Tested with 1.8-SNAPSHOT.
Input
Output