[FEATURE] Image-only PDF files (as Digital Document) should get OCR #910

dmer · 2022-11-09T16:04:40Z

Overview of feature request
All PDF files ingested as media for a Digital Document model should have Extracted Text (OCR) derivatives that contain the OCR'd text from the files.

What kind of user is the feature intended for?
Collections Manager, User

What inspired the request?
Migrating PDFs created in Islandora 7 (by Ghostscript) and discovering that all the Extracted Text derivatives are blank.

What existing behavior do you want changed?
PDF files used as media for a Digital Document model currently get their Extracted Text media by copying the text layer embedded in the PDF. If the PDF is "image-only" (has no text layer), the Extracted Text media is created as a blank/empty text file.

Any brand new behavior you want to add to Islandora?
Not sure how this should be implemented. Maybe pages are broken out and individually OCR'd as image files, and that OCR is fed back to the original object's media. Maybe PDFs can be processed directly somehow?
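
To make the first idea concrete, here is a rough sketch of a "try the text layer, fall back to per-page OCR" flow. It is only an illustration, not how Islandora generates derivatives today. It assumes the Poppler tools (`pdftotext`, `pdftoppm`) and `tesseract` are installed and on the PATH, and every function name below is hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def has_text_layer(extracted: str) -> bool:
    """True if the text pulled from the PDF is more than whitespace.

    pdftotext emits form feeds between pages, so an image-only PDF
    typically yields only whitespace characters.
    """
    return bool(extracted.strip())


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path) -> list[str]:
    # Using "stdout" as the output base makes tesseract print the text
    return ["tesseract", str(image), "stdout"]


def extract_text(pdf: Path) -> str:
    """Return the embedded text layer, or per-page OCR if it is blank."""
    embedded = subprocess.run(
        ["pdftotext", str(pdf), "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    if has_text_layer(embedded):
        return embedded

    # Hypothetical fallback: break the PDF into page images and OCR each
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order
        pages = sorted(Path(tmp).glob("page-*.png"))
        texts = [
            subprocess.run(
                ocr_cmd(page), capture_output=True, text=True, check=True
            ).stdout
            for page in pages
        ]
    # Join pages with form feeds, mirroring pdftotext's page separator
    return "\f".join(texts)
```

Calling `extract_text(Path("scan.pdf"))` on an image-only file would then return OCR'd text instead of an empty string. Whether something like this belongs in the existing derivative action or in a separate step is an open question.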

Any related open or closed issues to this feature request?
Couldn't find any!

adam-vessey mentioned this issue Nov 9, 2022

OCR image-only PDFs Islandora/documentation#1583

Open
