Allow user to set the compression level of the output pdf #5

ElectricRCAircraftGuy · 2020-01-01T02:48:38Z

See my notes inside this document in the repo: pdf2searchablepdf - what to work on next - Gabriel.odt.

Essentially, I need to take images or an input PDF, generate high-quality TIFF images for OCR, and generate a duplicate image set of lower-quality, compressed JPEG images. Perform OCR with tesseract on the high-quality images, and output hOCR instead of PDF. Then, use hocr2pdf to combine the hOCR output with the lower-quality JPEG images, which have been compressed to the user-specified compression level. The result is high-quality OCR text metadata overlaid onto lower-quality images, giving a smaller (compressed) output PDF which is still searchable!
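The steps above could be sketched roughly as follows. This is only a minimal sketch, not the tool's actual implementation: the function names are hypothetical, and it assumes `pdftoppm` (poppler-utils), ImageMagick's legacy `convert` command, `tesseract`, and `hocr2pdf` (ExactImage) are all on the PATH. It also assumes the JPEG copy keeps the same pixel dimensions as the TIFF, so the hOCR coordinates still line up with the compressed image.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def tiff_cmd(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    """pdftoppm: render every page of the PDF as a TIFF named <prefix>-N.tif."""
    return ["pdftoppm", "-tiff", "-r", str(dpi), pdf, prefix]


def jpeg_cmd(tif: str, jpg: str, quality: int) -> list[str]:
    """ImageMagick: write a JPEG copy at the user-chosen quality (1-100)."""
    if not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    return ["convert", tif, "-quality", str(quality), jpg]


def ocr_cmd(tif: str, out_base: str) -> list[str]:
    """tesseract: the trailing 'hocr' config makes it write <out_base>.hocr."""
    return ["tesseract", tif, out_base, "hocr"]


def overlay_cmd(jpg: str, pdf: str) -> list[str]:
    """hocr2pdf: reads hOCR on stdin and embeds the given image in the PDF."""
    return ["hocr2pdf", "-i", jpg, "-o", pdf]


def process_page(tif: Path, quality: int) -> Path:
    """OCR the high-quality TIFF, then overlay the text on a compressed JPEG."""
    jpg = tif.with_suffix(".jpg")
    hocr = tif.with_suffix(".hocr")
    pdf = tif.with_suffix(".pdf")
    subprocess.run(jpeg_cmd(str(tif), str(jpg), quality), check=True)
    subprocess.run(ocr_cmd(str(tif), str(tif.with_suffix(""))), check=True)
    with open(hocr, "rb") as hocr_file:
        subprocess.run(overlay_cmd(str(jpg), str(pdf)), stdin=hocr_file, check=True)
    return pdf
```

Each call to `process_page` yields a one-page searchable PDF; merging the pages into a single output file is left out of this sketch.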

Update: also take a look at this: https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-integrate-original-image-file-and-detected-text-into-pdf

Once you've done this, post to the following places to tell them about this new tool, since they don't seem to have this full feature-set/capability either:


ElectricRCAircraftGuy · 2020-01-24T02:15:26Z

I just discovered a couple of other PDF-to-searchable-PDF tools, listed here (I also updated this wiki page to its current form): https://github.com/tesseract-ocr/tesseract/wiki/User-Projects-%E2%80%93-3rdParty#4-others-utilities-tools-command-line-interfaces-cli-etc.

Try out these tools and see whether they already let the user specify the compression level of the output PDF. If they do, consider whether continuing development on my pdf2searchablepdf tool is worth it. If they do not, let's continue with my project here.

LeoFCardoso · 2020-06-04T22:13:35Z

Congratulations on your project. I'm the pdf2pdfocr developer. Thank you for referencing my project in your issue. :-)

With pdf2pdfocr you can use "-r" to set the resolution used for OCR, and the "-f" and "-g" flags to keep the final PDF files small.

But I suggest you keep going, so we can have more high-quality open source projects!

ElectricRCAircraftGuy · 2021-03-09T06:40:05Z

Note to self: DO THIS ASAP! I HAVE A BUNCH OF VERY LONG HOME REFINANCE DOCS TAKING UP >125 MB each when they should take up <10 MB each instead!

This is ridiculous! :)

Delete all of those big files when done fixing this issue and getting it to work.

ElectricRCAircraftGuy self-assigned this Jan 1, 2020

ElectricRCAircraftGuy added the enhancement (New feature or request) label Jan 1, 2020

ElectricRCAircraftGuy added the highest priority label Mar 9, 2021
