allowing doc process toolkit to parse other forms of documents #3

geramirez · 2015-03-03T20:50:34Z

No description provided.

coveralls · 2015-03-03T21:02:38Z

Coverage increased (+0.24%) to 48.0% when pulling e938374 on excel into 5ce8438 on master.

khandelwal · 2015-03-16T15:47:00Z

textextraction/doc_process_toolkit.py

Wouldn't a simpler check be to try and extract the text from a document and see if that returns anything?

I actually run two checks for text. The first is a quick scan to make sure that text even exists in the document. The second occurs after Tika extracts the document text, to make sure that the extraction worked. I figured the quick scan would save some Tika processing time.
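A minimal sketch of that two-check flow, for illustration only. The names `extract_if_text`, `quick_scan`, and `extract` are hypothetical and are not taken from `doc_process_toolkit.py`; the scan and the Tika call are passed in as plain callables so the sketch makes no assumptions about how either is implemented.

```python
from typing import Callable, Optional


def extract_if_text(
    path: str,
    quick_scan: Callable[[str], bool],
    extract: Callable[[str], str],
) -> Optional[str]:
    """Return the extracted text, or None if either check fails.

    `quick_scan` and `extract` are hypothetical stand-ins for the cheap
    pre-scan and the (expensive) Tika extraction described above.
    """
    # Check 1: cheap pre-scan. If it fails, the extractor is never called.
    if not quick_scan(path):
        return None

    # Check 2: confirm the extractor actually produced non-blank text.
    text = extract(path)
    if not text or not text.strip():
        return None
    return text
```

The point of the ordering is that the expensive call only happens for documents that pass the cheap scan.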

cmc333333 · 2015-03-17T04:27:37Z

General suggestion: it looks like you've got lots of "if it's this type of document, do X; if it's that, do Y". That's really what standard interfaces are for -- consider having a PDFTextExtractor, an XLSTextExtractor, etc., so that those type checks are made implicit.
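One way this could look, as a rough sketch rather than the design that landed in the follow-up PR. Only the class names `PDFTextExtractor` and `XLSTextExtractor` come from the comment above; the `TextExtractor` base class, the `extensions` attribute, the `extractor_for` helper, and the placeholder method bodies are all assumptions made for illustration.

```python
import os
from abc import ABC, abstractmethod


class TextExtractor(ABC):
    """Hypothetical common interface for all document types."""

    # File extensions this extractor handles (assumed convention).
    extensions: tuple = ()

    @abstractmethod
    def extract(self, path: str) -> str:
        """Return the text content of the document at `path`."""


class PDFTextExtractor(TextExtractor):
    extensions = (".pdf",)

    def extract(self, path: str) -> str:
        # Placeholder body: a real version would call the PDF pipeline.
        return f"<pdf text from {path}>"


class XLSTextExtractor(TextExtractor):
    extensions = (".xls", ".xlsx")

    def extract(self, path: str) -> str:
        # Placeholder body: a real version would read the spreadsheet.
        return f"<xls text from {path}>"


EXTRACTORS = [PDFTextExtractor(), XLSTextExtractor()]


def extractor_for(path: str) -> TextExtractor:
    """Pick an extractor by file extension; the only type check left."""
    ext = os.path.splitext(path)[1].lower()
    for extractor in EXTRACTORS:
        if ext in extractor.extensions:
            return extractor
    raise ValueError(f"no extractor registered for {ext!r}")
```

Callers would then write `extractor_for(path).extract(path)` and never branch on document type themselves; supporting a new format means adding one class to `EXTRACTORS`.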

khandelwal · 2015-03-17T14:14:53Z

That's a good idea @cmc333333 and would likely lead to a cleaner design of this.

geramirez · 2015-03-17T15:49:25Z

@cmc333333 @khandelwal I've added the suggested changes to PR #4

allowing doc process toolkit to parse other forms of documents

b148446

geramirez mentioned this pull request Mar 3, 2015

Update Doc Processing Toolkit work with excel/csv 18F/2015-foia-hub#571

Closed

updating readme

0bc1b2c

geramirez self-assigned this Mar 3, 2015

fixing tests

e938374

geramirez mentioned this pull request Mar 12, 2015

Krang: update import pipeline 18F/2015-foia-hub#643

Open

3 tasks

khandelwal reviewed Mar 16, 2015

khandelwal added a commit that referenced this pull request Mar 17, 2015

Merge pull request #3 from 18F/excel

4611c3a

allowing doc process toolkit to parse other forms of documents

khandelwal merged commit 4611c3a into master Mar 17, 2015

khandelwal deleted the excel branch March 17, 2015 19:27
