Add support for PDF Form Parsing #13

afolarin · 2017-02-09T17:09:20Z

The current Tika options do not include parsing PDF forms to text.

It would be useful to add the Apache PDFBox (https://pdfbox.apache.org/) form-parsing capability so that it is available in CogStack, as the clinical record in some hospitals may contain a significant number of PDF forms. A first implementation could simply parse the whole form tree; this would at least make the form content queryable in Elasticsearch.

A later refinement could add limited configuration to specify which parts of a form are parsed, as we have seen PDFBox produce very large documents.
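
As a rough illustration of the configurable filtering idea, here is a minimal Java sketch. It assumes the form fields have already been read into a map of field name to value (for example from the PDFBox form tree); the class name `FormFieldFilter`, the prefix list, and the `maxChars` cap are all hypothetical and not part of CogStack or PDFBox.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: flattens already-extracted form fields into text.
final class FormFieldFilter {

    /**
     * Builds "name: value" lines from the given fields.
     * Only fields whose name starts with one of includePrefixes are kept
     * (an empty prefix list keeps every field). Empty values are skipped.
     * Output stops before the line that would push it past maxChars.
     */
    static String toText(Map<String, String> fields,
                         List<String> includePrefixes,
                         int maxChars) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = e.getKey();
            String value = e.getValue();
            if (value == null || value.isEmpty()) {
                continue; // nothing worth indexing
            }
            boolean included = includePrefixes.isEmpty()
                    || includePrefixes.stream().anyMatch(name::startsWith);
            if (!included) {
                continue; // field is outside the configured parts of the form
            }
            String line = name + ": " + value + "\n";
            if (out.length() + line.length() > maxChars) {
                break; // cap the document size
            }
            out.append(line);
        }
        return out.toString();
    }
}
```

Passing an empty prefix list corresponds to the "parse the whole form tree" first attempt, while a non-empty list plus a size cap corresponds to the later refinement. Using an insertion-ordered map such as `LinkedHashMap` keeps the output in the order the fields were read.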

afolarin · 2017-06-13T09:56:34Z

This issue was moved to CogStack/CogStack-Pipeline#23

afolarin added the enhancement label Feb 9, 2017

afolarin assigned hkkenneth Feb 9, 2017

hkkenneth mentioned this issue Feb 10, 2017

Add support for PDF Form Parsing RichJackson/cogstack#38

Closed

hkkenneth added a commit that referenced this issue Mar 13, 2017

#13 RichJackson/cogstack#38 Initial PDF form parsing with PDF BOX

18f68c2

hkkenneth closed this as completed Apr 23, 2017

afolarin mentioned this issue Jun 13, 2017

Add support for PDF Form Parsing CogStack/CogStack-Pipeline#23

Closed
