add toktok tokenizer #18

Merged: 12 commits into JuliaText:master, Mar 26, 2019
Conversation

@aquatiko (Contributor) commented Feb 23, 2019

I haven't tested all of it, but this is the approach for modifying the atoms function in fast.jl to handle the regex rules of the TokTok tokenizer.

  • Modify new API to handle regex
  • Add tokenizer using new API
  • Add all regex rules
  • Tests
  • Documentation
  • Profiling

Closes: #15

@oxinabox (Member) left a comment

Nice start.

I'll probably get @MikeInnes to take a look at this later.

@codecov-io commented Feb 24, 2019

Codecov Report

Merging #18 into master will increase coverage by 0.36%.
The diff coverage is 80.51%.


@@            Coverage Diff             @@
##           master      #18      +/-   ##
==========================================
+ Coverage    80.5%   80.86%   +0.36%     
==========================================
  Files           8        9       +1     
  Lines         200      277      +77     
==========================================
+ Hits          161      224      +63     
- Misses         39       53      +14
Impacted Files           Coverage Δ
src/set_method_api.jl    100% <ø> (ø) ⬆️
src/words/fast.jl        81.25% <ø> (+1.56%) ⬆️
src/words/TokTok.jl      80.51% <80.51%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7159f20...414bc76.

@oxinabox (Member) commented:

I took a closer look at how you've added regex support.
It is going to be very slow, as it re-converts up to the whole buffer into a String on each match.
(I'm also not sure its logic is right.)
I did some deeper digging into our PCRE options, and it is not good.
It isn't possible to operate on a Vector{Char} directly using PCRE.
(I got hopeful when I saw that one can operate on a Vector{UInt8}, but alas our Char type is not that. Probably for the best, really.)

Rather than trying to make regex work, the best approach is to just introduce more fast.jl rules that are suitable.
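For context, a fast.jl rule is just a function over the TokenBuffer that inspects characters at ts.idx directly, with no regex engine involved. A minimal sketch of the shape such a rule takes, following the TokenBuffer conventions used later in this thread (ellipsis_dots is a hypothetical example, not part of this PR):

# Hypothetical regex-free rule in the fast.jl style: match a run of three
# or more '.' characters and emit it as a single token.
function ellipsis_dots(ts)
  i = ts.idx
  while i <= length(ts.input) && ts.input[i] == '.'
    i += 1
  end
  i - ts.idx < 3 && return false            # fewer than three dots: no match
  flush!(ts, String(ts.input[ts.idx:i-1]))  # flush pending token, then the dots
  ts.idx = i
  return true
end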

@oxinabox (Member) commented:

Which regex rules do you think would be hard to translate into the current, regex-free fast.jl rules, or into similar rules?

I think the rules are all fairly simple and we can work out how to write them without regex.
Post the ones you are having trouble with and I will give it a shot.

@aquatiko (Contributor, Author) commented:

@oxinabox I was having trouble with these rules:

# Assertion types
FINAL_PERIOD_1 = r"(?<!\.)\.$", " ."

# Continuous types
MULTI_COMMAS = r"(,{2,})"

@oxinabox (Member) commented:

FINAL_PERIOD_1 = r"(?<!\.)\.$", " ."

This is similar to (but slightly more complex than) the case in nltk_word_tokenize.

Because it is the final character, we can handle it by checking the contents of the TokenBuffer relative to the end directly:

function toktok_tokenize(input)
  ts = TokenBuffer(input)
  isempty(input) && return ts.tokens

  # Handle checking for a final period.
  # Rule 1: it is the last character and is not preceded by another period,
  # so as to split `foo.` but not `foo...`
  stop = length(ts.input) > 1 && ts.input[end] == '.' && ts.input[end-1] != '.'
  stop && pop!(ts.input)

  while !isdone(ts)
    # ...
    # other rules
    # ...
  end
  stop && push!(ts.tokens, ".")
  return ts.tokens
end

I know there is a second case for the final period, so it might be worth breaking the handling out into a separate function.
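A minimal sketch of how that might be broken out, reusing the fields from the snippet above (the name handle_final_period! is hypothetical):

# Hypothetical helper: strip a splittable final period before the main loop,
# returning whether "." must be re-appended as its own token afterwards.
function handle_final_period!(ts)
  stop = length(ts.input) > 1 && ts.input[end] == '.' && ts.input[end-1] != '.'
  stop && pop!(ts.input)
  return stop
end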

@oxinabox (Member) commented:

MULTI_COMMAS = r"(,{2,})"

So this one is similar to the number rule.

"""
    repeated_character_seq(::TokenBuffer, char, min_repeats=2)
Matches sequences of characters that are repreated at least `min_repeats` times.
Flushes them.
"""
function repeated_character_seq(ts, char, min_repeats=2)
  i = ts.idx
  while i <= length(ts.input) && (ts[i]==char)
    i += 1
  end
  seq_end_ind = i - 1  # remove last failing step

  seq_end_ind - ts.ind < min_repeats && return false  # not enough repeats.
  flush!(ts, String(ts[ts.idx : seq_end_ind]))
  ts.idx = seq_end_ind + 1 
  return true
end

I've not checked this, and it might have an off-by-one error, but that is the general gist of it.
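For illustration, the rule would then slot into the tokenizer's rule chain something like this (a sketch; character(ts) is the usual fast.jl fallback that consumes a single character):

while !isdone(ts)
  repeated_character_seq(ts, ',', 2) ||  # MULTI_COMMAS: ",,", ",,,", ... as one token
  repeated_character_seq(ts, '.', 2) ||  # runs of periods
  character(ts)                          # fallback: consume one character
end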

# Pad more funky punctuation.
const FUNKY_PUNCT_2 = string.(Tuple("[“‘„‚«‹「『"))
# Pad En dash and em dash
const EN_EM_DASHES = ("–—")
Member commented on the diff:

Suggested change:
- const EN_EM_DASHES = ("–—")
+ const EN_EM_DASHES = ("–—",)

Should there be an en-dash here too?

@aquatiko (Contributor, Author) replied:

Yeah, there should be. It wasn't in NLTK's version, so I missed it!


@oxinabox (Member) commented Mar 6, 2019

I think getting the tests generated sooner rather than later is a good idea, even if they fail or are slow.
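Even a couple of smoke tests would be a start; a sketch (the expected outputs are my assumption of the obvious cases, not checked against the reference TokTok):

using Test
@testset "TokTok basics" begin
  @test toktok_tokenize("Hello world") == ["Hello", "world"]
  @test isempty(toktok_tokenize(""))
end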

@oxinabox (Member) commented:

Take a good look at the coverage report, and try to bring coverage up to near 100% on the TokTok.jl file: #18 (comment)

It is very easy to have code that doesn't work on the sides of if branches that are not tested.
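One way to spot those untested branches locally, assuming Coverage.jl (a sketch, not how Codecov computes its numbers):

# Run the test suite with coverage enabled, e.g.
#   julia -e 'using Pkg; Pkg.test("WordTokenizers"; coverage=true)'
# then summarise the generated .cov files:
using Coverage
covdata = process_folder("src")
covered, total = get_summary(covdata)
println("covered $covered of $total lines")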


# Left/Right strip, i.e. remove heading/trailing spaces.
const LSTRIP = (" ",) => ""
const RSTRIP = ("\r", "\n", "\t", "\f", "\v",) => "\n"
@aquatiko (Contributor, Author) commented Mar 20, 2019

These rules were declared but weren't used in the original toktok Perl script, so I removed them.

const LSTRIP = (" ",) => ""
const RSTRIP = ("\r", "\n", "\t", "\f", "\v",) => "\n"
# Merge multiple spaces.
const ONE_SPACE = (" ",) => " "
@aquatiko (Contributor, Author) commented:

This was erroring when used together with rules_replaces, so it is handled as a separate case in the main body.
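Presumably the separate case is something like the spaces rule that appears in the final diff below; a sketch under that assumption:

# Collapse a run of spaces: flush the pending token, then skip every
# consecutive space, so only a single token boundary is produced.
function spaces(ts)
  ts.input[ts.idx] == ' ' || return false
  flush!(ts)
  while ts.idx <= length(ts.input) && ts.input[ts.idx] == ' '
    ts.idx += 1
  end
  return true
end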

ts.input[ts.idx] = NON_BREAKING[2]
end

url_handler4(ts) || # these url handlers have priority over others
@aquatiko (Contributor, Author) commented Mar 20, 2019

I found that this is the correct priority when using this approach instead of the regex version.

end

# In the functions below, flush!() is used when a given string needs to be a separate token
@aquatiko (Contributor, Author) commented:

Documented the use of flush! and push! as asked.
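For readers following along, the distinction under the fast.jl API (a sketch; the example input and intermediate state are illustrative):

ts = TokenBuffer("ab,,")
# Suppose the main loop has consumed "ab" into the pending buffer.
# flush!(ts, ",,") first emits the pending "ab", then ",," as its own token,
# so ts.tokens == ["ab", ",,"].
# By contrast, push!(ts.tokens, ".") appends a token directly, bypassing the
# buffer; toktok_tokenize uses it for the final period because the buffer is
# already flushed once the main loop finishes.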

Member replied:

Excellent

@@ -1,4 +1,9 @@
# WordTokenizers
[![GitHub release](https://img.shields.io/github/release/JuliaText/WordTokenizers.jl.svg)](https://github.com/JuliaText/WordTokenizers.jl/releases/)
@aquatiko (Contributor, Author) commented:

Added some badges for WordTokenizers.jl

repeated_character_seq(ts, '.', 2) ||
number(ts) ||
spaces(ts) || # Handles ONE_SPACE rules from original toktok perl script
replaces(ts, rules_replaces) || # most expensive steps, keep low priority
@aquatiko (Contributor, Author) commented:

Profiling showed these were the most expensive steps, so I reduced their priority to improve performance. The change in priority won't affect the tokenizer's output.
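A quick way to sanity-check the ordering's effect, assuming BenchmarkTools.jl:

using BenchmarkTools
# Time the tokenizer on a representative input before and after reordering.
@btime toktok_tokenize("Don't blame it on me,, see https://julialang.org...")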

@aquatiko (Contributor, Author) commented Mar 23, 2019

@oxinabox I have tried my best to increase the code coverage, but I still don't know why I can't reach 100% (currently 80.51%). Locally, using Coverage.jl, I was able to reach 100%, but on Codecov I can't understand why some lines are red (which, as far as I can tell, they shouldn't be). Altogether, though, coverage has increased over master.
As for adding a license, I'm not aware of the procedure for extending an Apache 2.0 license to one's own codebase.

@aquatiko (Contributor, Author) commented:

ping @oxinabox
Can this be merged now?

@Ayushk4 (Member) commented Mar 26, 2019

@aquatiko I think it's a good idea to update this branch from the code at JuliaText:master, even though there are no merge conflicts.

See the 2-space indentation that I recently fixed.

@aquatiko (Contributor, Author) commented:

Yeah, done. Thanks for the pointer.

@oxinabox (Member) commented:

Thanks, good work. I hope it was educational.

I'm going to assume something funny is happening with the Codecov website if you can get to 100% locally.

"As for adding a license, I'm not aware of the procedure for extending an Apache 2.0 license to one's own codebase."

I will add the license details once I merge.

This will be a breaking change.
I'll probably hold off merging until #13 is in too.

@oxinabox oxinabox merged commit c7e2c32 into JuliaText:master Mar 26, 2019
@aquatiko aquatiko mentioned this pull request Mar 26, 2019
@oxinabox oxinabox mentioned this pull request Apr 3, 2019