Skip to content
This repository was archived by the owner on Nov 7, 2018. It is now read-only.

Conversation

geramirez
Copy link
Contributor

No description provided.

@geramirez geramirez self-assigned this Mar 3, 2015
@coveralls
Copy link

Coverage Status

Coverage increased (+0.24%) to 48.0% when pulling e938374 on excel into 5ce8438 on master.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't a simpler check be to try and extract the text from a document and see if that returns anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually run, 2 checks for text. The first is a quick scan to make sure that text even exists in the document. The second occurs after Tika extracts the document text to make sure that the text extraction worked. I figured it would save some Tika processing time.

@cmc333333
Copy link

General suggestion: it looks like you've got lots of "if it's this type of document, do X, if it's that, do Y". That's really what standard interfaces are for -- consider having a PDFTextExtractor, an XLSTextExtractor, etc. so that those type checks are made implicit

@khandelwal
Copy link
Contributor

That's a good idea @cmc333333 and would likely lead to a cleaner design of this.

@geramirez
Copy link
Contributor Author

@cmc333333 @khandelwal I've added the suggested changes to PR #4

khandelwal added a commit that referenced this pull request Mar 17, 2015
allowing doc process toolkit to parse other forms of documents
@khandelwal khandelwal merged commit 4611c3a into master Mar 17, 2015
@khandelwal khandelwal deleted the excel branch March 17, 2015 19:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants