Garbage output #11

mietek · 2016-01-24T03:20:47Z

For some PDFs, Cermine fails to determine the structure of the document, and outputs garbage.

Examples

Tested with 1.8-SNAPSHOT.

Input

Output

dtkaczyk · 2016-01-26T11:20:30Z

@mietek The garbage in these cases is related to invalid positions and dimensions of the characters in the files, which, as you noticed in issue #14, might be the outcome of some OCR process. The best way to solve this is indeed to add OCR functionality to CERMINE, which is on our long-term TODO list.
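To illustrate what "invalid positions and dimensions" could mean in practice, here is a minimal sketch of a sanity check over extracted character boxes. It is not CERMINE code: `CharBox`, `is_plausible`, and `implausible_fraction` are hypothetical names, and the rule (finite values, positive size, inside the page) is only an assumed heuristic.

```python
import math
from dataclasses import dataclass


@dataclass
class CharBox:
    """Hypothetical character record: text plus bounding box in page units."""
    text: str
    x: float
    y: float
    width: float
    height: float


def is_plausible(box, page_width, page_height):
    """Return True if the box is finite, has positive size, and fits on the page."""
    values = (box.x, box.y, box.width, box.height)
    if not all(math.isfinite(v) for v in values):
        return False
    if box.width <= 0 or box.height <= 0:
        return False
    if box.x < 0 or box.y < 0:
        return False
    return (box.x + box.width <= page_width
            and box.y + box.height <= page_height)


def implausible_fraction(boxes, page_width, page_height):
    """Return the share of boxes that fail the plausibility check (0.0 if empty)."""
    if not boxes:
        return 0.0
    bad = sum(1 for b in boxes if not is_plausible(b, page_width, page_height))
    return bad / len(boxes)
```

A page where a large share of characters fails such a check could be flagged as unreliable instead of being passed on to structure extraction; any cutoff for "large" would have to be chosen empirically.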

mietek · 2016-01-26T16:44:31Z

Thanks. I note that the PDFs in question haven’t been scanned, so I doubt the garbage is the result of an OCR process. The glyph data in the vector layer seems intact, while the character data in the text layer seems garbled. I conjecture these PDFs were not produced directly, say, by pdflatex, but are instead the result of some conversion process gone badly wrong.

dtkaczyk · 2016-02-15T19:57:33Z

@mietek Thanks for the clarification. As I am unable to solve this directly, I am closing this issue. I will leave issue #14 open to discuss everything related to the OCR feature request.

mietek mentioned this issue Jan 24, 2016

Feature request: OCR #14

Closed

dtkaczyk closed this as completed Feb 15, 2016

dtkaczyk added the wontfix label Feb 15, 2016
