Add support for PDF Form Parsing #13

Closed
afolarin opened this issue Feb 9, 2017 · 1 comment
afolarin commented Feb 9, 2017

The current Tika options don't include parsing PDF forms to text.

It would be useful to add the Apache PDFBox (https://pdfbox.apache.org/) form-parsing capability so that it is available in CogStack, since the clinical record in some hospitals may contain a significant number of PDF forms. A first implementation could simply parse the whole form tree; this would at least make the form content queryable in Elasticsearch.
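As a rough illustration of the "parse the whole form tree" idea, here is a minimal sketch assuming the PDFBox 2.x API and AcroForm-based forms only (XFA forms are not handled). The class name `PdfFormText` and the "name: value" output layout are illustrative assumptions, not anything already in CogStack.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField;

// Hypothetical helper class; the name is an assumption for this sketch.
public class PdfFormText {

    /** Flattens every terminal AcroForm field into "name: value" lines. */
    public static String formToText(PDDocument document) {
        PDAcroForm form = document.getDocumentCatalog().getAcroForm();
        if (form == null) {
            return ""; // the PDF has no interactive (AcroForm) form
        }
        StringBuilder text = new StringBuilder();
        // getFieldTree() iterates over all fields, including nested ones.
        for (PDField field : form.getFieldTree()) {
            // Only terminal fields carry a value; non-terminal fields are containers.
            if (field instanceof PDTerminalField) {
                text.append(field.getFullyQualifiedName())
                    .append(": ")
                    .append(field.getValueAsString())
                    .append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // PDFBox 2.x loader; PDFBox 3.x uses Loader.loadPDF(File) instead.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            System.out.print(formToText(document));
        }
    }
}
```

The resulting string could be indexed alongside the regular Tika text output, which would be enough for basic Elasticsearch queries over form content.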

A later refinement could add limited configuration to specify which parts of a form are parsed, as we've seen PDFBox produce very large documents.
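One possible shape for that configuration, sketched in plain Java: an allow-list of field-name prefixes plus a cap on value length. It assumes the form fields have already been extracted into a map of fully qualified field names to values. `FormFieldFilter`, `includedPrefixes` and `maxValueLength` are hypothetical names for illustration, not existing CogStack or PDFBox options.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical filter; the class and parameter names are assumptions.
public class FormFieldFilter {
    private final List<String> includedPrefixes;
    private final int maxValueLength;

    /**
     * @param includedPrefixes field-name prefixes to keep; an empty list keeps everything
     * @param maxValueLength   values longer than this are truncated
     */
    public FormFieldFilter(List<String> includedPrefixes, int maxValueLength) {
        this.includedPrefixes = includedPrefixes;
        this.maxValueLength = maxValueLength;
    }

    /** Returns only the configured fields, with over-long values truncated. */
    public Map<String, String> apply(Map<String, String> fields) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            if (!isIncluded(entry.getKey())) {
                continue;
            }
            String value = entry.getValue() == null ? "" : entry.getValue();
            if (value.length() > maxValueLength) {
                value = value.substring(0, maxValueLength);
            }
            kept.put(entry.getKey(), value);
        }
        return kept;
    }

    private boolean isIncluded(String fieldName) {
        if (includedPrefixes.isEmpty()) {
            return true;
        }
        for (String prefix : includedPrefixes) {
            if (fieldName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Filtering by prefix fits the dotted, hierarchical names that nested form fields usually have, and the length cap is a blunt way to keep a single oversized field from inflating the indexed document.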


2 participants