Add TokTok tokenizer #15

oxinabox · 2019-02-07T18:00:50Z

An earlier, incomplete hack exists at #5, but it was never tested.

We should port it, and use the new TokenBuffer API.
https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl

Summary (From #5)

Source: https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl
It is licensed under Apache 2.0.

See also NLTK's implementation https://www.nltk.org/_modules/nltk/tokenize/toktok.html

When this is done, I think it should be the default tokenizer,
because it is multilingual and doesn't screw up URLs.

This code is untested, and not yet linked in.
I just ported the Perl to sed.
Well, to the PCRE-extended sed that we actually use,
which doesn't involve much.

We will want to port over NLTK's tests, which are hopefully comprehensive enough to check that I didn't mess anything up.
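
For readers unfamiliar with this style of tokenizer, the general shape is an ordered list of regex substitutions followed by a whitespace split. The following is a minimal Python sketch of that shape only. The three rules and the name `simple_tokenize` are illustrative assumptions and are not the actual rule list from tok-tok.pl or NLTK.

```python
import re

# Illustrative rules only (assumption): the real tok-tok.pl rule list is
# longer and differs in detail. Order matters, as in a sed script.
RULES = [
    # Pad commas, semicolons, brackets, and double quotes with spaces.
    (re.compile(r'([,;()\[\]{}"])'), r' \1 '),
    # Split off a period only when it ends the input, so periods inside
    # URLs and abbreviations are left alone.
    (re.compile(r'\.\s*$'), ' .'),
    # Collapse runs of whitespace.
    (re.compile(r'\s+'), ' '),
]

def simple_tokenize(text):
    """Apply each substitution in order, then split on whitespace."""
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text.split()

print(simple_tokenize("See https://example.com/a.b, then stop."))
```

With these rules the URL in the example survives as a single token, because only the final period and the comma are padded.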

aquatiko · 2019-02-19T05:13:49Z

@oxinabox could you link the NLTK tests that you mentioned?

oxinabox · 2019-02-19T10:49:37Z

I thought they would be in https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py
but they are not.
We may have to write our own,
probably by using NLTK to generate reference tokenizations.
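
One way this could look, as a rough sketch: run a fixed list of sentences through NLTK's `ToktokTokenizer` and save the input/token pairs as JSON, which the Julia tests could then load and compare against. The helper names (`build_reference`, `dump_reference`, `nltk_toktok`) and the JSON layout are assumptions made for illustration; only `nltk.tokenize.toktok.ToktokTokenizer` and its `tokenize` method come from NLTK.

```python
import json

def build_reference(sentences, tokenize):
    """Pair each input sentence with the tokens produced by `tokenize`."""
    return [{"input": s, "tokens": list(tokenize(s))} for s in sentences]

def nltk_toktok():
    """Return NLTK's Tok-tok tokenize function (requires nltk to be installed)."""
    # Imported lazily so the helpers above work without NLTK.
    from nltk.tokenize.toktok import ToktokTokenizer
    return ToktokTokenizer().tokenize

def dump_reference(sentences, tokenize, path):
    """Write the reference tokenizations to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_reference(sentences, tokenize), f,
                  ensure_ascii=False, indent=2)
```

A call such as `dump_reference(sentences, nltk_toktok(), "toktok_reference.json")` would produce the reference file; the file name here is hypothetical.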

aquatiko · 2019-02-19T11:00:59Z

Also, I looked at fast.jl. It would be great if there were a regex-based lookahead function. If that seems right, maybe I can work on that first.
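
To make the idea concrete, here is a small Python sketch of what a regex-based lookahead on a cursor-style buffer might look like. The `Buffer` class is a hypothetical stand-in and does not mirror the actual TokenBuffer in fast.jl; the `URL` pattern is likewise only an example.

```python
import re

class Buffer:
    """Hypothetical stand-in for a token buffer: text, a cursor, emitted tokens."""

    def __init__(self, text):
        self.text = text
        self.idx = 0
        self.tokens = []

    def lookahead(self, pattern):
        """Return the text matching `pattern` at the cursor, or None.

        Does not move the cursor.
        """
        m = pattern.match(self.text, self.idx)
        return m.group(0) if m else None

    def consume(self, pattern):
        """If `pattern` matches at the cursor, emit the match and advance."""
        tok = self.lookahead(pattern)
        if tok:
            self.tokens.append(tok)
            self.idx += len(tok)
            return True
        return False

# Example pattern (assumption): anything that looks like an http(s) URL.
URL = re.compile(r'https?://\S+')
```

For instance, `Buffer("http://a.b/c rest").consume(URL)` would emit the whole URL as one token and leave the cursor at the following space.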

oxinabox · 2019-02-19T11:32:12Z

Work on it as part of the same PR?

aquatiko · 2019-02-19T11:37:08Z

No, in a separate PR

oxinabox · 2019-02-19T11:42:10Z

In general, most PRs either:

- add a feature that is exposed to the user
- clean up code with no change in behavior

Adding a feature to the purely internal TokenBuffer would be unusual, particularly when there are not multiple PRs waiting on such a feature.
But I am not against it. A small PR is an easy-to-review PR.
It will want good unit tests.

aquatiko · 2019-02-19T14:28:37Z

Oh, I see! No worries, I will add all of it in a single PR :)

oxinabox added the "help wanted" and "good first issue" labels Feb 7, 2019

aquatiko mentioned this issue Feb 23, 2019

add toktok tokenizer #18

Merged

6 tasks

oxinabox closed this as completed in #18 Mar 26, 2019
