[Bug]: The file size increases significantly by OCR even without image recompression #1278

Open
ybeltukov opened this issue Mar 17, 2024 · 2 comments
ybeltukov commented Mar 17, 2024

Describe the bug

I'm trying to OCR scanned books while preserving the small size of the input file. To do so, I use the --output-type pdf option. However, the file size increases by about 40% even without image recompression.

Moreover, the size increases even further after a second pass, despite the --redo-ocr flag.

My current version is 16.1.1, installed on Arch Linux from the AUR.
With a previous version (16.0.4 or so), I did not notice such an increase in file size.

I observe this problem with various files that are already highly compressed. A part of one such book is attached below as an example.

Steps to reproduce

ocrmypdf --output-type pdf --redo-ocr -v1 Watson1.pdf Watson2.pdf
ocrmypdf --output-type pdf --redo-ocr -v1 Watson2.pdf Watson3.pdf

For the given small part of the book the file sizes are:
251 KB → 349 KB → 447 KB
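To put numbers on the growth per pass, here is a minimal sketch that works only from the three sizes reported above. The helper name `growth` is made up for illustration and is not part of ocrmypdf.

```shell
# Print the absolute and relative growth between two sizes given in KB.
# Usage: growth BEFORE_KB AFTER_KB
growth() {
    awk -v b="$1" -v a="$2" \
        'BEGIN { printf "+%d KB (%.0f%%)\n", a - b, (a - b) * 100 / b }'
}

growth 251 349   # first pass:  +98 KB (39%)
growth 349 447   # second pass: +98 KB (28%)
```

Both passes add the same 98 KB, which suggests that each run appends a complete new OCR layer rather than replacing the previous one.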

Files

Here is the part of one book.
Watson1.pdf
Watson2.pdf
Watson3.pdf

How did you download and install the software?

No response

OCRmyPDF version

16.1.1

Relevant log output

First pass: https://pastebin.com/UjuiU3E7
Second pass: https://pastebin.com/bWgs185r

ybeltukov commented Mar 17, 2024

It seems to me that this is related to the hocr PDF renderer, which is now enabled by default. It produces better visual quality (see e.g. #1131), but the OCR layer it writes is almost twice as large.

With the option --pdf-renderer sandwich I obtain the following sizes for the same file:
251 KB → 310 KB → 369 KB

So each pass adds an OCR layer of about 59 KB with sandwich and 98 KB with hocr.
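For anyone who wants to repeat the comparison on their own files, a rough sketch follows. It reuses the flags from the steps to reproduce and adds an explicit --pdf-renderer value. The function name `ocr_size` and the output file names are hypothetical.

```shell
# Run one OCR pass with the given renderer and print the output size in bytes.
# Usage: ocr_size RENDERER INPUT.pdf OUTPUT.pdf
ocr_size() {
    renderer=$1
    input=$2
    output=$3
    # Same flags as in the steps to reproduce, plus an explicit renderer.
    ocrmypdf --output-type pdf --redo-ocr --pdf-renderer "$renderer" \
        "$input" "$output" >/dev/null 2>&1 || return 1
    wc -c < "$output" | tr -d ' '
}

# Example (requires ocrmypdf to be installed):
#   for r in hocr sandwich; do
#       echo "$r: $(ocr_size "$r" Watson1.pdf "Watson2-$r.pdf") bytes"
#   done
```

Subtracting the size of the input file from each result gives the size of the layer that the renderer added.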

So the questions are:

  • Is it possible to optimize the hocr renderer?
  • Is it possible to remove a previously added OCR layer without image recompression?
    (--force-ocr is not suitable for this task)


Jmuccigr commented Apr 1, 2024

I wrote a script that redoes the OCR on PDFs by deleting any original text from the file and then using ocrmypdf to generate new OCR, which I then add to the original file. I use it mainly to replace the often bad OCR in JSTOR files. It uses Ghostscript to remove the text and relies on some other tools that you can see in the code.
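The script itself is not reproduced in this thread, so the following is only a guess at what the Ghostscript step might look like, not the script's actual code. It assumes a Ghostscript build that supports the -dFILTERTEXT switch; the function name `strip_text` is hypothetical.

```shell
# Write a copy of INPUT.pdf with all text objects dropped.
# Usage: strip_text INPUT.pdf OUTPUT.pdf
strip_text() {
    # -dFILTERTEXT asks Ghostscript to omit text while interpreting the file.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dFILTERTEXT \
        -sOutputFile="$2" "$1"
}

# Example (requires Ghostscript to be installed):
#   strip_text Watson2.pdf Watson2-notext.pdf
```

Note that pdfwrite regenerates the whole file, so it is worth checking the output size and image quality before assuming the images were left untouched.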
