Allow user to set the compression level of the output pdf #5

Open
ElectricRCAircraftGuy opened this issue Jan 1, 2020 · 3 comments
Assignees
Labels
enhancement New feature or request highest priority

Comments

Owner

ElectricRCAircraftGuy commented Jan 1, 2020

See my notes inside this document in the repo: pdf2searchablepdf - what to work on next - Gabriel.odt.

Essentially, I need to take the input images or PDF and generate two image sets: high-quality TIF images for OCR, and a duplicate set of lower-quality, compressed JPEG images. Perform OCR with tesseract on the high-quality images, and output hOCR instead of PDF. Then, use hocr2pdf to combine the hOCR output with the lower-quality JPEG images, which have been compressed to the user-specified compression level. The result is high-quality OCR text metadata overlaid onto lower-quality images, giving a smaller (compressed) output PDF which is still searchable!
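A minimal sketch of the pipeline described above, assuming the pages already exist as high-quality TIF files. This is not the tool's actual implementation: the function names are hypothetical, and ImageMagick's `convert` (for the JPEG copies) and poppler's `pdfunite` (for joining the per-page PDFs) are my own choices that the issue does not name. It also assumes the JPEG keeps the same pixel dimensions as the TIF, so the hOCR coordinates still line up with the compressed image.

```python
import shutil
import subprocess
from pathlib import Path


def jpeg_command(tif: Path, jpg: Path, quality: int) -> list[str]:
    """ImageMagick: recompress the page as JPEG without resizing it."""
    if not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    # Assumption: ImageMagick's `convert` binary is on the PATH.
    return ["convert", str(tif), "-quality", str(quality), str(jpg)]


def ocr_command(tif: Path, out_base: Path) -> list[str]:
    """tesseract: the `hocr` config writes <out_base>.hocr instead of a PDF."""
    return ["tesseract", str(tif), str(out_base), "hocr"]


def overlay_command(jpg: Path, page_pdf: Path) -> list[str]:
    """hocr2pdf: reads hOCR on stdin, embeds the image, writes a one-page PDF."""
    return ["hocr2pdf", "-i", str(jpg), "-o", str(page_pdf)]


def build_page(tif: Path, work_dir: Path, quality: int) -> Path:
    """OCR the high-quality TIF, then overlay the text on a compressed JPEG."""
    jpg = work_dir / f"{tif.stem}.jpg"
    hocr = work_dir / f"{tif.stem}.hocr"
    page_pdf = work_dir / f"{tif.stem}.pdf"
    subprocess.run(jpeg_command(tif, jpg, quality), check=True)
    subprocess.run(ocr_command(tif, work_dir / tif.stem), check=True)
    with hocr.open("rb") as hocr_file:
        subprocess.run(overlay_command(jpg, page_pdf), stdin=hocr_file, check=True)
    return page_pdf


def build_pdf(tifs: list[Path], work_dir: Path, quality: int, output: Path) -> None:
    """Build one searchable page per TIF, then join the pages into `output`."""
    pages = [build_page(tif, work_dir, quality) for tif in tifs]
    if len(pages) == 1:
        shutil.copyfile(pages[0], output)
    else:
        # Assumption: poppler's `pdfunite` is used to merge the page PDFs.
        subprocess.run(["pdfunite", *map(str, pages), str(output)], check=True)
```

Calling something like `build_pdf(sorted(Path("pages").glob("*.tif")), Path("work"), 40, Path("out.pdf"))` would then trade image quality for file size through the single `quality` argument, while the OCR itself always sees the uncompressed TIF.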

Update: also take a look at this: https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-integrate-original-image-file-and-detected-text-into-pdf

Once you've done this, post to the following places to tell them about this new tool, since they don't seem to have this full feature-set/capability either:

  1. https://unix.stackexchange.com/questions/29869/converting-multiple-image-files-from-jpeg-to-pdf-format
  2. https://stackoverflow.com/questions/26775306/how-to-reduce-the-size-of-the-pdf-generated-by-tesseract
  3. https://superuser.com/questions/1077256/how-to-compress-tesseract-encoded-pdfs-while-maintaining-embedded-text-from-ocr
@ElectricRCAircraftGuy ElectricRCAircraftGuy self-assigned this Jan 1, 2020
@ElectricRCAircraftGuy ElectricRCAircraftGuy added the enhancement New feature or request label Jan 1, 2020
Owner Author

ElectricRCAircraftGuy commented Jan 24, 2020

I just discovered a couple of other PDF-to-searchable-PDF tools, listed here (I also updated this wiki page to its current form): https://github.com/tesseract-ocr/tesseract/wiki/User-Projects-%E2%80%93-3rdParty#4-others-utilities-tools-command-line-interfaces-cli-etc.

Try out these tools and see whether they already let the user specify the compression level of the output PDF. If they do, consider whether continuing development on my pdf2searchablepdf tool is worth it. If they do not, let's continue with my project here.

@LeoFCardoso

Congratulations on your project. I'm the pdf2pdfocr developer. Thank you for referencing my project in your issue. :-)

With pdf2pdfocr you can use "-r" to set the resolution used for OCR, and the "-f" and "-g" flags to keep the final PDF files small.

But I suggest you keep going, since we can always use more high-quality open source projects!

Owner Author

ElectricRCAircraftGuy commented Mar 9, 2021

Note to self: DO THIS ASAP! I HAVE A BUNCH OF VERY LONG HOME REFINANCE DOCS TAKING UP >125 MB each when they should take up <10 MB each instead!

This is ridiculous! :)

Delete all of those big files when done fixing this issue and getting it to work.
