add toktok tokenizer #18
Conversation
Nice start.
Probably get @MikeInnes to take a look at this later.
Codecov Report

```
@@            Coverage Diff             @@
##           master      #18      +/-   ##
==========================================
+ Coverage    80.5%   80.86%   +0.36%
==========================================
  Files           8        9       +1
  Lines         200      277      +77
==========================================
+ Hits          161      224      +63
- Misses         39       53      +14
```

Continue to review the full report at Codecov.
I took a closer look at how you've added regex support. Rather than trying to make regex work, it's best to just introduce more fast.jl rules that are suitable.
Which regex rules do you think would be hard to translate into the current, regex-free fast.jl rules? I think the rules are all fairly simple and we can work out how to write them without regex.
@oxinabox I was having trouble with these rules:
```julia
# Pad more funky punctuation.
const FUNKY_PUNCT_2 = string.(Tuple("[“‘„‚«‹「『"))
# Pad En dash and em dash
const EN_EM_DASHES = ("–—")
```
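As context for the regex-vs-rules discussion above, here is a rough, language-agnostic sketch (in Python; the package itself is Julia) of how a regex "pad punctuation" rule can be replaced by a simple per-character membership check, which is the style of rule fast.jl uses. The punctuation set and text are illustrative, not taken from the actual tokenizer:

```python
import re

text = "Hello!How are you?"

# Regex version: pad each punctuation mark with spaces.
regex_padded = re.sub(r"([!?])", r" \1 ", text)

# Regex-free version: a plain membership test per character,
# in the spirit of the fast.jl rules discussed above.
PUNCT = frozenset("!?")
pieces = []
for ch in text:
    pieces.append(f" {ch} " if ch in PUNCT else ch)
rule_padded = "".join(pieces)

# Both approaches produce the same padded output.
assert regex_padded == rule_padded
assert rule_padded == "Hello ! How are you ? "
```

The membership-test version avoids regex-engine overhead entirely and makes the rule table explicit data rather than pattern syntax.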
Suggested change:

```diff
-const EN_EM_DASHES = ("–—")
+const EN_EM_DASHES = ("–—",)
```
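The suggested trailing comma matters because parentheses alone do not make a one-element tuple; the comma does. This holds in Julia, where the code under review lives, and the same behaviour can be demonstrated in Python:

```python
# Parentheses alone don't create a tuple; the trailing comma does.
not_a_tuple = ("–—")    # just the string "–—"
a_tuple = ("–—",)       # a 1-element tuple containing that string

assert isinstance(not_a_tuple, str)
assert isinstance(a_tuple, tuple) and len(a_tuple) == 1

# Iterating the bare string yields the two dash characters separately,
# which is usually not what a rule table expects:
assert list(not_a_tuple) == ["–", "—"]
assert list(a_tuple) == ["–—"]
```

Without the comma, any code that iterates the constant would see individual dash characters instead of the intended dash string.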
Should there be an en-dash here too?
Yeah, there should be. It wasn't in the NLTK version, so I missed it!
It is actually missing from the original too.
https://github.com/jonsafari/tok-tok/blob/fd0029cad4f3eed2ae229d835c2c77ea39931fd1/tok-tok.pl#L72
Maybe @jonsafari can comment?
I think getting the tests generated sooner rather than later is a good idea.
Take a good look at the coverage report, and try to bring it up to near 100% on the TokTok.jl file. It is very easy to have code that doesn't work on the sides of if branches that are not tested.
src/words/TokTok.jl (Outdated)
```julia
# Left/Right strip, i.e. remove leading/trailing spaces.
const LSTRIP = (" ",) => ""
const RSTRIP = ("\r", "\n", "\t", "\f", "\v",) => "\n"
```
These rules were declared but weren't used in the original tok-tok Perl script, so I removed them.
src/words/TokTok.jl (Outdated)
```julia
const LSTRIP = (" ",) => ""
const RSTRIP = ("\r", "\n", "\t", "\f", "\v",) => "\n"
# Merge multiple spaces.
const ONE_SPACE = (" ",) => " "
```
This was erroring when used along with `rules_replaces`, so it's written as a separate case in the main body.
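The ONE_SPACE behaviour handled as a separate pass can be sketched roughly like this (a Python illustration of the idea, not the actual Julia implementation): collapse runs of spaces into a single space without regex, in one scan.

```python
# Illustrative sketch: merge repeated spaces in a single pass,
# mirroring the separate ONE_SPACE case described above.
def merge_spaces(text):
    out = []
    prev_space = False
    for ch in text:
        if ch == " ":
            if not prev_space:   # keep only the first space of a run
                out.append(ch)
            prev_space = True
        else:
            out.append(ch)
            prev_space = False
    return "".join(out)

assert merge_spaces("a  b   c") == "a b c"
assert merge_spaces("no runs here") == "no runs here"
```

Treating this as its own pass avoids re-running a `" " => " "` replacement rule to a fixed point, which is presumably why it clashed with the generic replacement machinery.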
```julia
ts.input[ts.idx] = NON_BREAKING[2]
end
```

```julia
url_handler4(ts) || # these url handlers have priority over others
```
I found that this should be the correct priority when using this approach instead of the regex version.
```julia
# In the below functions, flush!() is used when some given string needs to be a separate token
```
Documented the use of flush! and push! as asked.
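The flush!/push! pattern described above can be sketched roughly as follows (a Python illustration; the class and method names are stand-ins, not the actual WordTokenizers.jl API): characters are pushed into a buffer, and flushing emits the buffered characters as one completed token.

```python
# Illustrative sketch of a push/flush token buffer.
class TokenBuffer:
    def __init__(self):
        self.buffer = []   # characters of the token being built
        self.tokens = []   # completed tokens

    def push(self, ch):
        self.buffer.append(ch)

    def flush(self):
        # Emit the buffered characters as one token, if any.
        if self.buffer:
            self.tokens.append("".join(self.buffer))
            self.buffer = []

ts = TokenBuffer()
for ch in "ab,c":
    if ch == ",":
        ts.flush()   # end the current token before the comma
        ts.push(ch)
        ts.flush()   # the comma itself is a separate token
    else:
        ts.push(ch)
ts.flush()
assert ts.tokens == ["ab", ",", "c"]
```

Flushing at rule boundaries is what lets a character-at-a-time scanner emit multi-character tokens.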
Excellent
```diff
@@ -1,4 +1,9 @@
 # WordTokenizers
+[](https://github.com/JuliaText/WordTokenizers.jl/releases/)
```
Added some badges for WordTokenizers.jl
```julia
repeated_character_seq(ts, '.', 2) ||
number(ts) ||
spaces(ts) ||                    # Handles ONE_SPACE rules from original toktok perl script
replaces(ts, rules_replaces) ||  # most expensive steps, keep low priority
```
Profiling showed that these were the most expensive steps, so I reduced their priority to improve the performance of the code. Changing their priority won't affect the tokenizer's output.
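The ordering trick above works because the rules are chained with short-circuiting `||`: each rule returns true when it consumed input, so cheap rules placed first mean the expensive pass only runs when nothing else matched. A rough Python sketch (illustrative rule names, not the actual fast.jl rules):

```python
# Sketch of a short-circuit rule chain: cheap checks first,
# the costly fallback only when everything cheaper rejected.
calls = []

def cheap_rule(ch):
    calls.append("cheap")
    return ch == " "        # trivially cheap membership check

def expensive_rule(ch):
    calls.append("expensive")
    return True             # stand-in for the costly replacement pass

for ch in " x":
    cheap_rule(ch) or expensive_rule(ch)   # `or` short-circuits like Julia's ||

# The expensive rule ran only for the character the cheap rule rejected.
assert calls == ["cheap", "cheap", "expensive"]
```

Since each input position is handled by exactly one matching rule, reordering by cost changes only which rule is tried first, not the final token stream, which is why the output is unaffected.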
@oxinabox I have tried my best to increase the code coverage, but I still can't work out why I can't reach 100% (currently 80.51%). Locally, using Coverage.jl, I was able to reach 100%, but on Codecov some lines are red that I don't think should be. Altogether, though, coverage has increased compared to master.
ping @oxinabox
@aquatiko I think it's a good idea to update this branch from the code at JuliaText:master, even though there are no merge conflicts. See the 2-space indentation I recently fixed: WordTokenizers.jl/src/words/nltk_word.jl, line 52 in 16e08ea.
Yeah, done. Thanks for the pointer.
Thanks, good work, I hope it was educational. I'm going to assume something funny is happening with the Codecov website if you can get to 100% locally.
I will add the license details once I merge. This will be a breaking change.
I haven't tested all of it, but this will be the approach to modify the `atoms` function in fast.jl to handle the regex rules of the TokTok tokenizer.

Closes: #15