
@saif-ellafi saif-ellafi commented Jan 27, 2018

Description

Our current RegexTokenizer is quite limited. The new Tokenizer is able to get much closer to the state of the art, with good customization.

It includes Prefix, Infix, and Suffix rules, plus a list of composite-word exceptions and control over how the user wants to partition target words. The provided defaults are quite accurate by Stanford NLP standards.

This branch also brings a few side improvements: changes to the RuleFactory, lazy variable evaluation, and the renaming of inconsistent annotator models.
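As a rough illustration of the rule scheme described above (a sketch only, not the actual implementation; all patterns, names, and defaults below are hypothetical):

```python
import re

# Hypothetical rule set: peel leading/trailing punctuation into their own
# tokens, split the remaining core on an infix separator, and keep
# composite-word exceptions whole.
PREFIX = re.compile(r"^[\(\[\"']+")
SUFFIX = re.compile(r"[\)\]\"'.,;:!?]+$")
INFIX = "--"
EXCEPTIONS = {"e.g.", "U.S.A."}  # never split these

def tokenize(text):
    tokens = []
    for word in text.split():
        if word in EXCEPTIONS:
            tokens.append(word)
            continue
        # Peel off a prefix match, if any, as its own token.
        pre = PREFIX.match(word)
        if pre:
            tokens.append(pre.group())
            word = word[pre.end():]
        # Peel off a suffix match, keeping it for after the core tokens.
        post = SUFFIX.search(word)
        if post:
            core, tail = word[:post.start()], post.group()
        else:
            core, tail = word, None
        # Split the core on the infix separator.
        tokens.extend(t for t in core.split(INFIX) if t)
        if tail:
            tokens.append(tail)
    return tokens

# tokenize("(Hello--world), see e.g. U.S.A.")
# → ['(', 'Hello', 'world', '),', 'see', 'e.g.', 'U.S.A.']
```

The real annotator drives this logic through configurable rules in the RuleFactory rather than hard-coded patterns.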

Motivation and Context

The Tokenizer has a big impact on the rest of the library. We can now measure against other libraries, and accuracy should improve.

How Has This Been Tested?

New unit tests are included.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@saif-ellafi saif-ellafi merged commit 2b3f340 into master Jan 27, 2018
@maziyarpanahi maziyarpanahi deleted the tokenizer-improvements branch March 29, 2021 15:16