See my notes inside this document in the repo: pdf2searchablepdf - what to work on next - Gabriel.odt.
Essentially, I need to take images or an input PDF, generate high-quality TIFF images for OCR, and generate a duplicate image set of lower-quality, compressed JPEG images. Perform OCR with tesseract on the high-quality images, and output hOCR instead of PDF. Then, use hocr2pdf to combine the hOCR output with the lower-quality JPEG images, which have been compressed to the user-specified compression level. The result is high-quality OCR text metadata overlaid onto lower-quality images, which gives a smaller (compressed) output PDF that is still searchable!
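Here is a rough Python sketch of that pipeline, just to pin down the order of the steps. It is not what pdf2searchablepdf currently does. The function names (`rasterize_command`, `page_commands`, `make_searchable_pdf`) are made up for illustration, and it assumes `pdftoppm` and `pdfunite` (poppler-utils), `tesseract`, ImageMagick's `convert` (`magick` on ImageMagick 7), and `hocr2pdf` (ExactImage) are all installed. The JPEGs keep the same pixel dimensions as the TIFFs and only the compression quality changes, on the assumption that the hOCR bounding boxes then still line up with the image.

```python
import re
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple

# A command is (argv, file to feed on stdin or None).
Command = Tuple[List[str], Optional[Path]]


def rasterize_command(pdf: Path, prefix: Path, dpi: int = 300) -> Command:
    # pdftoppm writes one TIFF per page, named <prefix>-<N>.tif.
    return (["pdftoppm", "-tiff", "-r", str(dpi), str(pdf), str(prefix)], None)


def page_commands(tif: Path, jpeg_quality: int) -> List[Command]:
    """Build the three per-page commands; nothing is executed here."""
    base = tif.with_suffix("")
    jpg, hocr, pdf = (tif.with_suffix(ext) for ext in (".jpg", ".hocr", ".pdf"))
    return [
        # OCR the high-quality TIFF; tesseract writes <base>.hocr.
        (["tesseract", str(tif), str(base), "hocr"], None),
        # Same pixels, stronger JPEG compression (user-specified quality).
        (["convert", str(tif), "-quality", str(jpeg_quality), str(jpg)], None),
        # hocr2pdf reads the hOCR on stdin and overlays it on the JPEG.
        (["hocr2pdf", "-i", str(jpg), "-o", str(pdf)], hocr),
    ]


def page_number(tif: Path) -> int:
    # Sort pages numerically so page-10 does not land before page-2.
    match = re.search(r"-(\d+)$", tif.stem)
    if match is None:
        raise ValueError(f"unexpected page file name: {tif}")
    return int(match.group(1))


def run(cmd: Command) -> None:
    argv, stdin_path = cmd
    if stdin_path is None:
        subprocess.run(argv, check=True)
    else:
        with open(stdin_path, "rb") as fh:
            subprocess.run(argv, stdin=fh, check=True)


def make_searchable_pdf(pdf_in: Path, pdf_out: Path,
                        jpeg_quality: int, dpi: int = 300) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        run(rasterize_command(pdf_in, Path(tmp) / "page", dpi))
        pages = sorted(Path(tmp).glob("page-*.tif"), key=page_number)
        for tif in pages:
            for cmd in page_commands(tif, jpeg_quality):
                run(cmd)
        # Join the single-page PDFs into the final output.
        page_pdfs = [str(p.with_suffix(".pdf")) for p in pages]
        run((["pdfunite", *page_pdfs, str(pdf_out)], None))
```

Something like `make_searchable_pdf(Path("in.pdf"), Path("out.pdf"), jpeg_quality=40)` would then be the whole job. Keeping the command construction separate from `run` makes it easy to print the commands for a dry run before touching any files.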
Once you've done this, post to the following places to tell them about this new tool, since they don't seem to have this full feature-set/capability either:
Try out these tools and see if they already have this feature to specify compression level of the output pdf. If they do, consider whether continuing development on my pdf2searchablepdf tool is worth it. If they do not, let's continue with my project here.
Update: also take a look at this: https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-integrate-original-image-file-and-detected-text-into-pdf