[FEATURE] Image-only PDF files (as Digital Document) should get OCR #910

Open
dmer opened this issue Nov 9, 2022 · 0 comments

dmer commented Nov 9, 2022

Overview of feature request
All PDF files ingested as media for a Digital Document model should have Extracted Text (OCR) derivatives that contain the OCR'd text from the files.

What kind of user is the feature intended for?
Collections Manager, User

What inspired the request?
Migrating PDFs created in Islandora 7 (by Ghostscript) and discovering that all the Extracted Text derivatives are blank.

What existing behavior do you want changed?
PDF files as media for a Digital Document model currently have their Extracted Text media generated by copying the embedded text layer in the PDF. If the PDF is "image-only" (does not have a text layer), the Extracted Text media is created as a blank/empty text file.
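
To illustrate the current behavior, here is a rough sketch of how an image-only PDF could be detected. It assumes the text layer is read with Poppler's `pdftotext` command-line tool, which may or may not match what Islandora actually runs; the function names `extract_text_layer` and `is_effectively_empty` are hypothetical.

```python
import subprocess


def extract_text_layer(pdf_path: str) -> str:
    """Return the embedded text layer of a PDF.

    Assumption: Poppler's `pdftotext` is installed; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def is_effectively_empty(extracted_text: str) -> bool:
    """True when the extracted text holds nothing but whitespace.

    Page-break form feeds (\\f) count as whitespace, so a PDF whose pages
    yield only page breaks is still treated as having no text.
    """
    return extracted_text.strip() == ""
```

A check like `is_effectively_empty(extract_text_layer("scan.pdf"))` could then decide whether the file should be routed to OCR instead of producing a blank Extracted Text file.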

Any brand new behavior do you want to add to Islandora?
I'm not sure how this should be implemented. Maybe the pages could be broken out and individually OCR'd as image files, with that OCR fed back to the original object's media. Or maybe PDFs can be processed directly somehow?
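
The page-by-page idea could look something like the sketch below. This is only an illustration under stated assumptions, not a proposal for how Islandora should wire it in: it assumes Poppler's `pdftoppm` and the `tesseract` CLI are installed, and every function name here is hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def rasterize_cmd(pdf_path: str, out_prefix: str, dpi: int = 300) -> List[str]:
    """Command that renders each PDF page to a PNG (assumes Poppler's pdftoppm)."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_cmd(image_path: str) -> List[str]:
    """Command that OCRs one image and prints the text to stdout (assumes Tesseract)."""
    return ["tesseract", image_path, "stdout"]


def page_number(image_path: Path) -> int:
    """Read the page number from names like page-1.png or page-01.png."""
    return int(image_path.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path: str) -> str:
    """Rasterize every page, OCR each one in page order, and join the text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(rasterize_cmd(pdf_path, prefix), check=True)
        # Sort numerically so page-10 does not come before page-2.
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        texts = [
            subprocess.run(
                ocr_cmd(str(page)), capture_output=True, text=True, check=True
            ).stdout
            for page in pages
        ]
    return "\n".join(texts)
```

The string returned by `ocr_pdf` would be what gets saved as the Extracted Text media for the original object.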

Any related open or closed issues to this feature request?
Couldn't find any!
