East Asian support #11

Closed
justinormont opened this issue Apr 18, 2019 · 1 comment

@justinormont

The readme says, "Currently released model supports most of the languages except East Asian (Chinese Simplified, Traditional, Japanese, Korean, Thai)."

Is East Asian support on the roadmap?

This could be useful for ML.NET, where only a whitespace tokenizer has been open sourced. One potential issue is that ML.NET prefers pure C# code.

Whitespace tokenization (or splitting on any specific character) is not suitable for East Asian languages; see dotnet/machinelearning#325.
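
To illustrate the problem, here is a minimal Python sketch. The sample sentences are made up for this example and do not come from either project:

```python
# Whitespace splitting works for English but not for text written without spaces.
english = "machine learning is fun"
chinese = "我喜欢机器学习"  # "I like machine learning", written without spaces

english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # ['machine', 'learning', 'is', 'fun']
print(chinese_tokens)  # ['我喜欢机器学习'], the whole sentence comes back as one token
```

Because there is no delimiter character to split on, any tokenizer that relies on one returns the entire sentence as a single token.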

@SergeiAlonichau
Member

SergeiAlonichau commented Apr 18, 2019

Hi, the library is language agnostic: it supports Chinese, Japanese, and any other language covered by Unicode. As of May 1, 2020, the library supports four tokenization algorithms:

- rule-based tokenization (like NLTK), either preserving Chinese character sequences or breaking them apart
- the WordPiece algorithm used in various BERT models (you can provide your own list of segments)
- the SentencePiece Unigram LM algorithm (like the one used in XLNet)
- the SentencePiece BPE algorithm
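
For readers unfamiliar with the first two approaches, the following is a rough Python sketch of the general ideas only. It is not this library's API or implementation: `rule_tokenize`, `wordpiece`, `is_cjk`, the toy vocabulary, and the decision to treat only the basic CJK Unified Ideographs block as "Chinese characters" are all assumptions made for illustration.

```python
import re

def is_cjk(ch):
    """Rough check covering only the basic CJK Unified Ideographs block (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"

def rule_tokenize(text, break_cjk=True):
    """Split into word runs and punctuation; optionally emit each CJK character separately."""
    tokens = []
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        if not break_cjk:
            tokens.append(chunk)
            continue
        buf = ""
        for ch in chunk:
            if is_cjk(ch):
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation in the style used by BERT vocabularies."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no vocabulary entry covers this position
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, standing in for a user-provided list of segments.
toy_vocab = {"token", "##ization", "##izer"}

broken = rule_tokenize("我喜欢 BERT!")
kept = rule_tokenize("我喜欢 BERT!", break_cjk=False)
pieces = wordpiece("tokenization", toy_vocab)

print(broken)  # ['我', '喜', '欢', 'BERT', '!']
print(kept)    # ['我喜欢', 'BERT', '!']
print(pieces)  # ['token', '##ization']
```

Breaking Chinese characters apart sidesteps the missing-whitespace problem by making every character its own token, while a subword vocabulary lets the segmenter fall back to shorter pieces when a full word is not listed.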
