Tokenizer improvements #94
Merged
Description
Our current RegexTokenizer is fairly limited. The new Tokenizer computes results closer to the state of the art and offers good customization.
It supports Prefix, Infix and Suffix rules, plus a list of composite-word exceptions and control over how the user partitions target words. The provided defaults are quite accurate by Stanford NLP standards.
This branch also carries a few side effects: improvements to the RuleFactory, lazy evaluation of variables, and renaming of inconsistently named annotator models.
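To illustrate the idea behind prefix/infix/suffix rules and composite-word exceptions, here is a minimal, self-contained sketch. The patterns, the `EXCEPTIONS` set, and the `tokenize` helper are hypothetical stand-ins for illustration only; the actual rules and API live in the library's Tokenizer and RuleFactory.

```python
import re

# Hypothetical example patterns -- not the library's actual defaults.
PREFIX = re.compile(r'^[\("\[]')        # punctuation split off the front of a word
SUFFIX = re.compile(r'[\)"\].,!?]$')    # punctuation split off the end of a word
INFIX = re.compile(r'(--|/)')           # separators that split a word internally
EXCEPTIONS = {"e.g.", "U.S.A."}         # composite words kept whole

def tokenize(text):
    tokens = []
    for word in text.split():
        if word in EXCEPTIONS:
            tokens.append(word)          # exceptions bypass all rules
            continue
        # Peel prefix punctuation, one match at a time.
        while (m := PREFIX.match(word)):
            tokens.append(m.group())
            word = word[m.end():]
        # Peel suffix punctuation from the end, preserving order.
        suffixes = []
        while (m := SUFFIX.search(word)):
            suffixes.insert(0, word[m.start():])
            word = word[:m.start()]
        if word:
            # Split the remaining core on infix separators, keeping them.
            tokens.extend(t for t in INFIX.split(word) if t)
        tokens.extend(suffixes)
    return tokens
```

For example, `tokenize('(Hello world), e.g. test.')` splits the surrounding punctuation into separate tokens while leaving the exception `e.g.` intact.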
Motivation and Context
The Tokenizer has a big impact on the rest of the library. With these changes we can measure against other libraries, and accuracy should improve.
How Has This Been Tested?
New unit tests included