The current Tika options do not include parsing PDF forms to text.
It would be useful to add the Apache PDFBox (https://pdfbox.apache.org/) capability so that it is available in CogStack, since the clinical record in some hospitals may contain a significant number of these forms. A first implementation could simply parse the whole form tree; this would at least make the form content queryable in Elasticsearch.
A further refinement could add limited configuration to specify which parts of the forms are parsed, as we have seen PDFBox produce very large documents.
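
As a rough illustration of that refinement, here is a minimal sketch in plain Java. It assumes the form fields have already been extracted into an ordered map of field name to value (for example by walking the PDFBox form tree; that extraction step is not shown). `FormTextFilter`, its `toText` method, and the `includePrefixes` and `maxChars` settings are hypothetical names used only for this example, not existing CogStack or Tika options.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: flattens already-extracted form fields into indexable text. */
final class FormTextFilter {

    private FormTextFilter() {}

    /**
     * @param fields          field name -> value, in document order (assumed input)
     * @param includePrefixes field-name prefixes to keep; an empty list keeps every field
     * @param maxChars        upper bound on the length of the returned text
     */
    static String toText(Map<String, String> fields, List<String> includePrefixes, int maxChars) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            String name = entry.getKey();
            String value = entry.getValue();
            // Skip fields that carry no text.
            if (value == null || value.trim().isEmpty()) {
                continue;
            }
            // Skip fields outside the configured parts of the form.
            if (!isIncluded(name, includePrefixes)) {
                continue;
            }
            String line = name + ": " + value.trim() + "\n";
            // Stop before the output would exceed the size cap.
            if (out.length() + line.length() > maxChars) {
                break;
            }
            out.append(line);
        }
        return out.toString();
    }

    private static boolean isIncluded(String name, List<String> prefixes) {
        if (prefixes.isEmpty()) {
            return true;
        }
        for (String prefix : prefixes) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With an empty prefix list and a large `maxChars`, this behaves like the "parse the whole form tree" first attempt; a prefix such as `patient.` (a made-up field name) would restrict the output to that branch of the form, and the size cap guards against the very large documents mentioned above.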