Fixing the OCR on server-side. #19

christian-oreilly · 2018-08-10T13:52:37Z

For some reason, the behavior of ocrmypdf seems to have changed. Whereas before we expected it to produce the .txt file directly, it now generates a PDF with the OCR-ed text overlaid on it. This commit fixes the issue by overwriting the original scan PDF with the text-overlaid PDF and then running the usual pdftotext on this new PDF.
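A minimal sketch of the workflow described above, not the code from this commit. It assumes `ocrmypdf` and `pdftotext` are on the PATH; the helper names (`ocr_command`, `text_command`, `ocr_to_text`) and the temporary `.ocr.pdf` file name are hypothetical. The sketch writes the OCR output to a temporary file first and only then replaces the scan, so a failed OCR run leaves the original untouched.

```python
import os
import subprocess
from pathlib import Path


def ocr_command(scan_pdf, ocr_pdf):
    """Command line that writes a text-overlaid copy of the scan."""
    return ["ocrmypdf", str(scan_pdf), str(ocr_pdf)]


def text_command(pdf, txt):
    """Command line that extracts the text layer of a PDF to a .txt file."""
    return ["pdftotext", str(pdf), str(txt)]


def ocr_to_text(scan_pdf, run=subprocess.check_call):
    """OCR the scan, overwrite it with the overlaid PDF, then extract text.

    `run` is injectable so the flow can be exercised without the tools.
    """
    scan_pdf = Path(scan_pdf)
    tmp_pdf = scan_pdf.with_name(scan_pdf.stem + ".ocr.pdf")  # hypothetical name
    txt = scan_pdf.with_suffix(".txt")

    run(ocr_command(scan_pdf, tmp_pdf))   # 1. produce the PDF with a text layer
    os.replace(tmp_pdf, scan_pdf)         # 2. overwrite the original scan
    run(text_command(scan_pdf, txt))      # 3. usual pdftotext on the new PDF
    return txt
```

Calling `ocr_to_text("scan.pdf")` would leave `scan.pdf` replaced by its text-overlaid version and write `scan.txt` next to it.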

pafonta

The --sidecar feature (v5.0+) should be used instead, but it would require upgrading ocrmypdf, which is not currently on the roadmap. This fix is therefore OK for now. See: #4 (comment).
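For reference, a hedged sketch of what the `--sidecar` variant could look like once ocrmypdf is upgraded to v5.0 or later. The helper name `sidecar_command` and the file names are hypothetical; the sketch only builds the command line, so that a single ocrmypdf call yields both the overlaid PDF and the text file, with no separate pdftotext step.

```python
import subprocess


def sidecar_command(scan_pdf, ocr_pdf, txt):
    """Command line asking ocrmypdf (v5.0+) to also write the OCR text to `txt`."""
    return ["ocrmypdf", "--sidecar", str(txt), str(scan_pdf), str(ocr_pdf)]


# Usage (requires ocrmypdf >= 5.0 on the PATH):
# subprocess.check_call(sidecar_command("scan.pdf", "scan_ocr.pdf", "scan.txt"))
```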

pafonta added the bug label Aug 21, 2018

pafonta self-requested a review August 21, 2018 09:53

pafonta approved these changes Aug 21, 2018

pafonta merged commit 710d283 into master Aug 21, 2018

pafonta deleted the christian-oreilly-orc-fix branch August 21, 2018 09:57
