add toktok tokenizer #18

Merged: 12 commits into JuliaText:master, Mar 26, 2019
Conversation

@aquatiko (Contributor) commented Feb 23, 2019

I haven't tested all of it, but this is the approach for modifying the atoms function in fast.jl to handle the regex rules of the TokTok tokenizer.

  • Modify new API to handle regex
  • Add tokenizer using new API
  • Add all regex rules
  • Tests
  • Documentation
  • Profiling

Closes: #15

@oxinabox (Member) left a comment

Nice start.

I'll probably get @MikeInnes to take a look at this later.

@codecov-io commented Feb 24, 2019

Codecov Report

Merging #18 into master will increase coverage by 0.36%.
The diff coverage is 80.51%.


@@            Coverage Diff             @@
##           master      #18      +/-   ##
==========================================
+ Coverage    80.5%   80.86%   +0.36%     
==========================================
  Files           8        9       +1     
  Lines         200      277      +77     
==========================================
+ Hits          161      224      +63     
- Misses         39       53      +14
Impacted Files           Coverage Δ
src/set_method_api.jl    100% <ø> (ø) ⬆️
src/words/fast.jl        81.25% <ø> (+1.56%) ⬆️
src/words/TokTok.jl      80.51% <80.51%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7159f20...414bc76.

@oxinabox (Member) commented:

I took a closer look at how you've added regex support.
It is going to be very slow, as it re-converts up to the whole buffer into a String on each match.
(I'm also not sure its logic is right.)
I did some deeper digging into our PCRE options, and it is not good.
It isn't possible to operate on a Vector{Char} directly using PCRE.
(I got hopeful when I saw that one can operate on a Vector{UInt8}, but alas our Char type is not that. Probably for the best, really.)

Rather than trying to make regex work, the best approach is to just introduce more fast.jl rules that are suitable.
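For context, a fast.jl rule is just a function over the TokenBuffer that inspects characters at ts.idx directly, with no regex engine involved. A minimal sketch of the shape such a rule takes, following the TokenBuffer conventions used later in this thread (ellipsis_dots is a hypothetical example, not part of this PR):

# Hypothetical regex-free rule in the fast.jl style: match a run of three
# or more '.' characters and emit it as a single token.
function ellipsis_dots(ts)
  i = ts.idx
  while i <= length(ts.input) && ts.input[i] == '.'
    i += 1
  end
  i - ts.idx < 3 && return false            # fewer than three dots: no match
  flush!(ts, String(ts.input[ts.idx:i-1]))  # flush pending token, then the dots
  ts.idx = i
  return true
end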

@oxinabox (Member) commented:

Which regex rules do you think would be hard to translate into the current, regex-free fast.jl rules, or into similar rules?

I think the rules are all fairly simple and we can work out how to write them without regex.
Post the ones you are having trouble with and I will give it a shot.

@aquatiko (Contributor, Author) commented:

@oxinabox I was having trouble with these rules:

# Assertion types
FINAL_PERIOD_1 = r"(?<!\.)\.$", " ."

# Continuous types
MULTI_COMMAS = r"(,{2,})"

@oxinabox (Member) commented:

FINAL_PERIOD_1 = r"(?<!\.)\.$", " ."

This is similar to (but slightly more complex than) the case in nltk_word_tokenize.

Because it is the final character, we can handle it by checking the contents of the TokenBuffer relative to the end directly:

function toktok_tokenize(input)
  ts = TokenBuffer(input)
  isempty(input) && return ts.tokens

  # Handle checking for a final period.
  # Rule 1: it is the last character and is not preceded by another period,
  # so as to split `foo.` but not `foo...`
  stop = length(ts.input) > 1 && ts.input[end] == '.' && ts.input[end-1] != '.'
  stop && pop!(ts.input)

  while !isdone(ts)
    # ...
    # other rules
    # ...
  end
  stop && push!(ts.tokens, ".")
  return ts.tokens
end

I know there is a second case for the final period, so it might be worth breaking the handling out into a separate function.
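A minimal sketch of how that might be broken out, reusing the fields from the snippet above (the name handle_final_period! is hypothetical):

# Hypothetical helper: strip a splittable final period before the main loop,
# returning whether "." must be re-appended as its own token afterwards.
function handle_final_period!(ts)
  stop = length(ts.input) > 1 && ts.input[end] == '.' && ts.input[end-1] != '.'
  stop && pop!(ts.input)
  return stop
end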

@oxinabox (Member) commented:

MULTI_COMMAS = r"(,{2,})"

So this one is similar to the number rule.

"""
    repeated_character_seq(::TokenBuffer, char, min_repeats=2)
Matches sequences of characters that are repreated at least `min_repeats` times.
Flushes them.
"""
function repeated_character_seq(ts, char, min_repeats=2)
  i = ts.idx
  while i <= length(ts.input) && (ts[i]==char)
    i += 1
  end
  seq_end_ind = i - 1  # remove last failing step

  seq_end_ind - ts.ind < min_repeats && return false  # not enough repeats.
  flush!(ts, String(ts[ts.idx : seq_end_ind]))
  ts.idx = seq_end_ind + 1 
  return true
end

I've not checked this, and it might have an off-by-one error, but that is the general gist of it.
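For illustration, the rule would then slot into the tokenizer's rule chain something like this (a sketch; character(ts) is the usual fast.jl fallback that consumes a single character):

while !isdone(ts)
  repeated_character_seq(ts, ',', 2) ||  # MULTI_COMMAS: ",,", ",,,", ... as one token
  repeated_character_seq(ts, '.', 2) ||  # runs of periods
  character(ts)                          # fallback: consume one character
end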

# Pad more funky punctuation.
const FUNKY_PUNCT_2 = string.(Tuple("[“‘„‚«‹「『"))
# Pad En dash and em dash
const EN_EM_DASHES = ("–—")
Member commented on the diff:

Suggested change:
- const EN_EM_DASHES = ("–—")
+ const EN_EM_DASHES = ("–—",)

Should there be an en-dash here too?

@aquatiko (Contributor, Author) replied:

Yeah, there should be. It wasn't in NLTK's version, so I missed it!


@oxinabox (Member) commented Mar 6, 2019

I think getting the tests generated sooner rather than later is a good idea, even if they fail or are slow.
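Even a couple of smoke tests would be a start; a sketch (the expected outputs are my assumption of the obvious cases, not checked against the reference TokTok):

using Test
@testset "TokTok basics" begin
  @test toktok_tokenize("Hello world") == ["Hello", "world"]
  @test isempty(toktok_tokenize(""))
end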

@oxinabox (Member) commented:

Take a good look at the coverage report, and try to bring coverage up to near 100% on the TokTok.jl file: #18 (comment)

It is very easy to have code that doesn't work on the sides of if branches that are not tested.
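One way to spot those untested branches locally, assuming Coverage.jl (a sketch, not how Codecov computes its numbers):

# Run the test suite with coverage enabled, e.g.
#   julia -e 'using Pkg; Pkg.test("WordTokenizers"; coverage=true)'
# then summarise the generated .cov files:
using Coverage
covdata = process_folder("src")
covered, total = get_summary(covdata)
println("covered $covered of $total lines")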


# Left/Right strip, i.e. remove heading/trailing spaces.
const LSTRIP = (" ",) => ""
const RSTRIP = ("\r", "\n", "\t", "\f", "\v",) => "\n"
@aquatiko (Contributor, Author) commented Mar 20, 2019

These rules were declared but weren't used in the original toktok Perl script, so I removed them.

const LSTRIP = (" ",) => ""
const RSTRIP = ("\r", "\n", "\t", "\f", "\v",) => "\n"
# Merge multiple spaces.
const ONE_SPACE = (" ",) => " "
@aquatiko (Contributor, Author) commented:

This was erroring when used together with rules_replaces, so it is handled as a separate case in the main body.
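Presumably the separate case is something like the spaces rule that appears in the final diff below; a sketch under that assumption:

# Collapse a run of spaces: flush the pending token, then skip every
# consecutive space, so only a single token boundary is produced.
function spaces(ts)
  ts.input[ts.idx] == ' ' || return false
  flush!(ts)
  while ts.idx <= length(ts.input) && ts.input[ts.idx] == ' '
    ts.idx += 1
  end
  return true
end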

ts.input[ts.idx] = NON_BREAKING[2]
end

url_handler4(ts) || # these url handlers have priority over others
@aquatiko (Contributor, Author) commented Mar 20, 2019

I found that this is the correct priority when using this approach instead of the regex version.

end

# In the functions below, flush!() is used when a given string needs to be a separate token
@aquatiko (Contributor, Author) commented:

Documented the use of flush! and push! as asked.
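For readers following along, the distinction under the fast.jl API (a sketch; the example input and intermediate state are illustrative):

ts = TokenBuffer("ab,,")
# Suppose the main loop has consumed "ab" into the pending buffer.
# flush!(ts, ",,") first emits the pending "ab", then ",," as its own token,
# so ts.tokens == ["ab", ",,"].
# By contrast, push!(ts.tokens, ".") appends a token directly, bypassing the
# buffer; toktok_tokenize uses it for the final period because the buffer is
# already flushed once the main loop finishes.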

Member replied:

Excellent

@@ -1,4 +1,9 @@
# WordTokenizers
[![GitHub release](https://img.shields.io/github/release/JuliaText/WordTokenizers.jl.svg)](https://github.com/JuliaText/WordTokenizers.jl/releases/)
@aquatiko (Contributor, Author) commented:

Added some badges for WordTokenizers.jl

repeated_character_seq(ts, '.', 2) ||
number(ts) ||
spaces(ts) || # Handles ONE_SPACE rules from original toktok perl script
replaces(ts, rules_replaces) || # most expensive steps, keep low priority
@aquatiko (Contributor, Author) commented:

Profiling showed these were the most expensive steps, so I reduced their priority to improve performance. The change in priority won't affect the tokenizer's output.
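A quick way to sanity-check the ordering's effect, assuming BenchmarkTools.jl:

using BenchmarkTools
# Time the tokenizer on a representative input before and after reordering.
@btime toktok_tokenize("Don't blame it on me,, see https://julialang.org...")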

@aquatiko (Contributor, Author) commented Mar 23, 2019

@oxinabox I have tried my best to increase the code coverage, but I still don't know why I can't reach 100% (currently 80.51%). Locally, using Coverage.jl, I was able to reach 100%, but on Codecov I can't understand why some lines are red (which, as far as I can tell, they shouldn't be). Altogether, though, coverage has increased over master.
As for adding a license, I'm not aware of the procedure for extending an Apache 2.0 license to one's own codebase.

@aquatiko (Contributor, Author) commented:

ping @oxinabox
Can this be merged now?

@Ayushk4 (Member) commented Mar 26, 2019

@aquatiko I think it's a good idea to update this branch from the code at JuliaText:master, even though there are no merge conflicts.

See the 2-space indentation that I recently fixed.

@aquatiko (Contributor, Author) commented:

Yeah, done. Thanks for the pointer.

@oxinabox (Member) commented:

Thanks, good work. I hope it was educational.

I'm going to assume something funny is happening with the Codecov website if you can get to 100% locally.

"As for adding a license, I'm not aware of the procedure for extending an Apache 2.0 license to one's own codebase."

I will add the license details once I merge.

This will be a breaking change.
I'll probably hold off merging until #13 is in too.

@oxinabox oxinabox merged commit c7e2c32 into JuliaText:master Mar 26, 2019
@aquatiko aquatiko mentioned this pull request Mar 26, 2019
@oxinabox oxinabox mentioned this pull request Apr 3, 2019