Add TokTok tokenizer #15
Comments
@oxinabox could you link to the NLTK tests that you mentioned?
I thought they would be in https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py
Also, I looked at fast.jl. It would be great to have a regex-based lookahead function. If that seems right, maybe I can work on that before
work on it as part of the same PR?
No, in a separate PR.
In general, most PRs are either:
Adding a feature to the purely internal
Oh, I see! No worries, I will add all of it in a single PR :)
An earlier incomplete hack at #5 exists but was never tested.
We should port it, and use the new TokenBuffer API.
https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl
Summary (From #5)