The readme says, "Currently released model supports most of the languages except East Asian (Chinese Simplified, Traditional, Japanese, Korean, Thai)."
Is East Asian support on the roadmap?
This could be useful for ML.NET, where only a whitespace tokenizer has been open-sourced. One potential issue is that ML.NET has a preference for pure C# code.
Whitespace tokenization (or breaking on any specific character) is not suitable for East Asian languages: dotnet/machinelearning#325
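To illustrate the problem concretely (a minimal sketch, not code from either project): whitespace splitting works for English but yields a single token for unsegmented Chinese text, since Chinese is written without spaces between words.

```python
# Whitespace splitting works for English but produces one giant "token"
# for Chinese, because Chinese text has no spaces between words.
english = "machine learning is fun".split()
chinese = "机器学习很有趣".split()  # no word-delimiting spaces

print(len(english))  # 4
print(len(chinese))  # 1 - the whole sentence comes back as one token
```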
Hi, the library is language agnostic: it supports Chinese, Japanese, and any other language covered by the Unicode character set. As of May 1, 2020, the library supports four tokenization algorithms:

- rule-based tokenization (like NLTK), either preserving or breaking Chinese characters apart
- the WordPiece algorithm used in various BERT models (you can provide your own list of segments)
- the SentencePiece Unigram LM algorithm (like the one used in XLNet)
- the SentencePiece BPE algorithm
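The "breaking Chinese characters apart" rule from the first item can be sketched in pure Python. This is an illustration of the idea only, not the library's actual implementation, and the Unicode ranges below are a simplification I've chosen for the example:

```python
import re

# Illustrative sketch (NOT the library's real code): a rule-based tokenizer
# that emits each CJK character as its own token while grouping runs of
# Latin letters/digits, so no whitespace is needed to find token boundaries.
CJK = r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]"  # CJK ideographs, kana, hangul
TOKEN = re.compile(CJK + r"|[A-Za-z0-9]+|[^\sA-Za-z0-9]")

def rule_tokenize(text):
    """Isolate each CJK character; keep alphanumeric runs together."""
    return TOKEN.findall(text)

print(rule_tokenize("BERT处理中文!"))  # ['BERT', '处', '理', '中', '文', '!']
```

Splitting CJK text into single characters is a common fallback (BERT's original tokenizer does the same for Chinese); subword models like Unigram LM or BPE can then merge characters back into learned segments.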