Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TokTok tokenizer #15

Closed
oxinabox opened this issue Feb 7, 2019 · 7 comments
Closed

Add TokTok tokenizer #15

oxinabox opened this issue Feb 7, 2019 · 7 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@oxinabox
Copy link
Member

oxinabox commented Feb 7, 2019

An earlier incomplete hack at #5
exists but was never tested.

We should port it, and use the new TokenBuffer API.
https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl

Summary (From #5)

Source https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl
It is Apache2

See also NLTK's implementation https://www.nltk.org/_modules/nltk/tokenize/toktok.html

When this is done I think that it should be the default tokenizer.
Because multilingual and doesn't screw up URLs

This code is untested, and not yet linked in.
I just ported the perl to sed.
Well to the PCRE extended sed that we actually use.
Which doesn't involve much.

Will want to port over NLTK's tests, which are hopefully comprehensive enough to check that I didn't mess anything up

@oxinabox oxinabox added help wanted Extra attention is needed good first issue Good for newcomers labels Feb 7, 2019
@aquatiko
Copy link
Contributor

@oxinabox could you link the reference to the nltk's tests that you mentioned

@oxinabox
Copy link
Member Author

I thought they would be in https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py
but they are not.
May have to write our own.
Probably by using NLTK to generate reference tokenizations

@aquatiko
Copy link
Contributor

aquatiko commented Feb 19, 2019

Also, I looked in fast.jl, It would be great if there could be some regex based lookahead function. If seems right, maybe I can work on that before

@oxinabox
Copy link
Member Author

work on it as part of the same PR?

@aquatiko
Copy link
Contributor

No, in a separate PR

@oxinabox
Copy link
Member Author

In general most PRs either:

  • add a feature that is exposed to the user
  • cleanup code with no change

Adding a feature to the purely internal TokenBuffer would be unusual, particularly when there are not multiple PRs waiting on such a feature.
But I am not against it. A small PR is an easy to review PR.
It will want good unit tests

@aquatiko
Copy link
Contributor

Oh I see!! No worries I will add all of it in a single PR :)

@aquatiko aquatiko mentioned this issue Feb 23, 2019
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants