Fix for TIKA-1787 : NamedEntityParser #61

Closed
wants to merge 11 commits into from
Conversation

thammegowda
Member

UPDATE : Wiki URL : https://wiki.apache.org/tika/TikaAndNER

Added NamedEntityParser that supports loading of different NER implementations at runtime.
The default NER implementation based on OpenNLP is supplied.

Another implementation based on StanfordCoreNLP is located here. It is licensed under GNU GPL 3, so it is kept separate. See UPDATE 2 below.

@chrismattmann This is not 100% complete; here are a few TODOs:

  1. The NER implementation class name should be read from the Tika config if possible/available. It currently relies on Java properties. Please suggest how to resolve this TODO.
  2. EDIT: Looking for the best way to read parsed text from non-text streams within the NamedEntityParser (not sure whether a parser can read the output of previous parsers such as HTML or PDF). Resolved: using a secondary parser to get the text content.
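
For readers following TODO 1, here is a minimal sketch of picking an NER implementation by class name from a Java system property at runtime. It is not the code in this PR: the `Recogniser` interface, the `NoOpRecogniser` fallback, and the property key `ner.impl.class` are all hypothetical names chosen for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

public class RecogniserLoader {

    /** Hypothetical stand-in for the NER contract; not the actual Tika interface. */
    public interface Recogniser {
        /** Returns entity type -> set of names found in the text. */
        Map<String, Set<String>> recognise(String text);
    }

    /** Trivial fallback used only to demonstrate loading by name. */
    public static class NoOpRecogniser implements Recogniser {
        @Override
        public Map<String, Set<String>> recognise(String text) {
            return Collections.emptyMap();
        }
    }

    /** Hypothetical property key; the real key may differ. */
    public static final String IMPL_PROPERTY = "ner.impl.class";

    /** Instantiates the class named by the system property, or the fallback if unset. */
    public static Recogniser load() {
        String className = System.getProperty(IMPL_PROPERTY, NoOpRecogniser.class.getName());
        try {
            Class<?> clazz = Class.forName(className);
            // asSubclass fails fast if the named class does not implement the contract
            return clazz.asSubclass(Recogniser.class).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load NER implementation: " + className, e);
        }
    }

    public static void main(String[] args) {
        Recogniser recogniser = load();
        System.out.println("Loaded " + recogniser.getClass().getName());
    }
}
```

Under these assumptions, running with `-Dner.impl.class=<fully.qualified.ClassName>` would swap the implementation without recompiling.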

UPDATE 1 : Added regex-based NER. It can recognize many more patterns than just names (I am using it to recognise weapon names and weapon types).
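
A rough sketch of how a regex-based recogniser can work, assuming each entity type maps to one pattern. The class name `RegexRecogniserSketch`, the entity type labels, and the patterns are illustrative assumptions, not the implementation added in this PR.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexRecogniserSketch {

    // entity type -> compiled pattern, kept in registration order
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    /** Registers a pattern for an entity type; returns this for chaining calls. */
    public RegexRecogniserSketch add(String entityType, String regex) {
        patterns.put(entityType, Pattern.compile(regex));
        return this;
    }

    /** Returns entity type -> distinct matches; types with no match are omitted. */
    public Map<String, Set<String>> recognise(String text) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) {
                result.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>())
                      .add(matcher.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical patterns, for illustration only
        RegexRecogniserSketch ner = new RegexRecogniserSketch()
                .add("WEAPON_TYPE", "(?i)\\b(rifle|pistol|shotgun)\\b")
                .add("PHONE_NUMBER", "\\b\\d{3}-\\d{3}-\\d{4}\\b");
        System.out.println(ner.recognise("A rifle and a Pistol; call 555-123-4567."));
    }
}
```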

UPDATE 2 : Added CoreNLP NER with runtime class binding. This one still uses the Java binding instead of command-line invocation, because:

  • The command-line binding is not portable across environments (or would require maintaining a separate port for each one).
  • It adds setup overhead in distributed environments like Hadoop.

UPDATE 3 : Chaining support :
Now we can chain many NER Implementations (OpenNLP, CoreNLP, RegEx) to the NamedEntityParser.
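
One way chaining could be sketched, assuming every recogniser returns a map from entity type to names and the chain simply merges those maps. The `Recogniser` interface and `recogniseAll` helper here are hypothetical and may not match how the PR combines results.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ChainSketch {

    /** Hypothetical stand-in for the NER contract; not the actual Tika interface. */
    public interface Recogniser {
        Map<String, Set<String>> recognise(String text);
    }

    /** Runs every recogniser on the same text and unions the names per entity type. */
    public static Map<String, Set<String>> recogniseAll(List<Recogniser> chain, String text) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Recogniser recogniser : chain) {
            for (Map.Entry<String, Set<String>> entry : recogniser.recognise(text).entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two toy recognisers standing in for, say, an OpenNLP-backed and a regex-backed one
        Recogniser first = t -> Collections.singletonMap("PERSON", Collections.singleton("Ada"));
        Recogniser second = t -> Collections.singletonMap("PERSON", Collections.singleton("Grace"));
        System.out.println(recogniseAll(Arrays.asList(first, second), "any text"));
    }
}
```

Merging into sets means a name found by more than one recogniser appears only once per entity type.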


import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
Contributor

please remove star imports

@chrismattmann
Contributor

@thammegowda great work! See my comments and please update. Thank you.

+ Removed star imports
+ Removed dead code / commented code
+ Added License header to missing files
@thammegowda
Member Author

@chrismattmann Thanks for the feedback. Issues Resolved!

@thammegowda changed the title from "NamedEntityParser" to "Fix for TIKA-1787 : NamedEntityParser" on Nov 11, 2015
file.getParentFile().mkdirs()
inStream = urlConn.getInputStream()
outStream = new FileOutputStream(file)
//IOUtils.copyLarge(inStream, outStream)
Contributor

@thammegowda can you remove this line? commented code.

@chrismattmann
Contributor

one more minor update @thammegowda and this is ready to go!

@chrismattmann
Contributor

@thammegowda can you also write up a quick tutorial at http://wiki.apache.org/tika/TikaAndNER that shows how to install Stanford NER and run this?

@chrismattmann
Contributor

you will need wiki karma so let me know your username and I'll grant you karma.

@thammegowda
Member Author

@chrismattmann Sure thing. I might have missed a few such comments. I will review one more time.

Please give me permission to create/edit NER wiki page, my username is "ThammeGowda".

@chrismattmann
Contributor

@asfgit asfgit closed this in 48151b4 Nov 17, 2015
asfgit pushed a commit that referenced this pull request Nov 18, 2015
…ontributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714931 13f79535-47bb-0310-9956-ffa450edef68
tballison pushed a commit to tballison/tika that referenced this pull request Feb 26, 2016
…ontributed by Thamme Gowda N and Yueheng He this closes apache#61 this closes apache#62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714835 13f79535-47bb-0310-9956-ffa450edef68
tballison pushed a commit to tballison/tika that referenced this pull request Feb 26, 2016
…ontributed by Thamme Gowda N and Yueheng He this closes apache#61 this closes apache#62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714931 13f79535-47bb-0310-9956-ffa450edef68