Garbage output #11

Closed
mietek opened this issue Jan 24, 2016 · 3 comments
mietek commented Jan 24, 2016

For some PDFs, CERMINE fails to determine the structure of the document and outputs garbage.

Examples

Tested with 1.8-SNAPSHOT.

Input

Output

@dtkaczyk

@mietek The garbage in these cases is related to invalid positions and dimensions of the characters in the files, which, as you noticed in issue #14, might be the outcome of some OCR process. The best way to solve this is indeed to add OCR functionality to CERMINE, which is on our long-term TODO list.
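
To illustrate what "invalid positions and dimensions of the characters" could look like in practice, here is a minimal sketch of a sanity check over extracted character boxes. This is not CERMINE's code or API: `CharBox`, `is_suspicious`, `suspicious_fraction`, and the page dimensions are all hypothetical names and values introduced for the example, and it assumes each character comes with a bounding box in page units.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CharBox:
    """Hypothetical record: one extracted character and its bounding box."""
    text: str
    x: float
    y: float
    width: float
    height: float


def is_suspicious(box: CharBox, page_width: float, page_height: float) -> bool:
    """Return True if the box geometry is non-finite, non-positive, or off-page."""
    values = (box.x, box.y, box.width, box.height)
    if not all(math.isfinite(v) for v in values):
        return True
    if box.width <= 0 or box.height <= 0:
        return True
    # The box must lie entirely inside the page rectangle.
    return (
        box.x < 0
        or box.y < 0
        or box.x + box.width > page_width
        or box.y + box.height > page_height
    )


def suspicious_fraction(boxes, page_width: float, page_height: float) -> float:
    """Fraction of boxes on a page that fail the geometry check (0.0 if empty)."""
    if not boxes:
        return 0.0
    bad = sum(is_suspicious(b, page_width, page_height) for b in boxes)
    return bad / len(boxes)
```

A page where a large share of boxes fails such a check could be flagged as a candidate for OCR rather than passed to structure analysis; the threshold to use is a design choice not discussed in this thread.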

mietek commented Jan 26, 2016

Thanks. I note that the PDFs in question haven’t been scanned, so I doubt the garbage is the result of an OCR process. The glyph data in the vector layer seems intact, while the character data in the text layer seems garbled. I conjecture these PDFs were not produced directly by, say, pdflatex, but are rather the result of some conversion process gone badly wrong.

@dtkaczyk

@mietek Thanks for the clarification. As I am unable to solve this directly, I am closing this issue. I will leave issue #14 open to discuss everything related to the OCR feature request.
