Tokenizer improvements #94
Merged
Description
Our current RegexTokenizer is fairly limited. The new Tokenizer computes results closer to the state of the art and offers good customization.
It supports Prefix, Infix and Suffix rules, plus a list of composite-word exceptions and control over how the user partitions target words. The provided defaults are quite accurate by Stanford NLP standards.
This branch also carries a few side effects: improvements to the RuleFactory, lazy evaluation of variables, and renaming of inconsistently named annotator models.
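To illustrate the idea behind prefix/infix/suffix rules and composite-word exceptions, here is a minimal, self-contained sketch. The patterns, the `EXCEPTIONS` set, and the `tokenize` helper are hypothetical stand-ins for illustration only; the actual rules and API live in the library's Tokenizer and RuleFactory.

```python
import re

# Hypothetical example patterns -- not the library's actual defaults.
PREFIX = re.compile(r'^[\("\[]')        # punctuation split off the front of a word
SUFFIX = re.compile(r'[\)"\].,!?]$')    # punctuation split off the end of a word
INFIX = re.compile(r'(--|/)')           # separators that split a word internally
EXCEPTIONS = {"e.g.", "U.S.A."}         # composite words kept whole

def tokenize(text):
    tokens = []
    for word in text.split():
        if word in EXCEPTIONS:
            tokens.append(word)          # exceptions bypass all rules
            continue
        # Peel prefix punctuation, one match at a time.
        while (m := PREFIX.match(word)):
            tokens.append(m.group())
            word = word[m.end():]
        # Peel suffix punctuation from the end, preserving order.
        suffixes = []
        while (m := SUFFIX.search(word)):
            suffixes.insert(0, word[m.start():])
            word = word[:m.start()]
        if word:
            # Split the remaining core on infix separators, keeping them.
            tokens.extend(t for t in INFIX.split(word) if t)
        tokens.extend(suffixes)
    return tokens
```

For example, `tokenize('(Hello world), e.g. test.')` splits the surrounding punctuation into separate tokens while leaving the exception `e.g.` intact.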
Motivation and Context
The Tokenizer has a big impact on the rest of the library. With these changes we can measure against other libraries, and accuracy should improve.
How Has This Been Tested?
New unit tests included