East Asian support #11

justinormont · 2019-04-18T22:07:10Z

The readme says, "Currently released model supports most of the languages except East Asian (Chinese Simplified, Traditional, Japanese, Korean, Thai)."

Is East Asian support on the roadmap?

This could be useful for ML.NET, where only a whitespace tokenizer has been open sourced. One potential issue is that ML.NET prefers pure C# code.

Whitespace tokenization (or splitting on any specific character) is not suitable for East Asian languages: dotnet/machinelearning#325
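To illustrate the problem, here is a minimal sketch in Python (not ML.NET code; the function name `whitespace_tokenize` and the sample sentences are made up for illustration). It shows that splitting on whitespace yields word-level tokens for English but returns a whole Chinese sentence as one token, because the text contains no spaces.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of whitespace, as a plain whitespace tokenizer would."""
    return text.split()


english = "I like natural language processing"
chinese = "我喜欢自然语言处理"  # roughly the same sentence, written without spaces

english_tokens = whitespace_tokenize(english)
chinese_tokens = whitespace_tokenize(chinese)

print(english_tokens)  # one token per English word
print(chinese_tokens)  # the whole sentence comes back as a single token
```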

SergeiAlonichau · 2019-04-18T22:56:36Z

Hi, the library is language-agnostic: it supports Chinese, Japanese, and any other language covered by the Unicode character set. As of May 1, 2020, the library supports four tokenization algorithms:

rule-based tokenization (like NLTK), either preserving Chinese characters together or breaking them apart
WordPiece algorithm used in various BERT models (you can provide your own list of segments)
SentencePiece Unigram LM algorithm (like the one used in XLNet)
SentencePiece BPE algorithm
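For readers unfamiliar with the first two approaches, the following is a rough Python sketch of the general ideas, not this library's implementation or API. The names `rule_tokenize`, `wordpiece`, and `is_cjk` are hypothetical, the CJK check covers only the basic CJK Unified Ideographs block as a simplifying assumption, and the WordPiece part is the commonly described greedy longest-match-first procedure with a user-supplied segment list.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def rule_tokenize(text: str, break_cjk: bool = True) -> list[str]:
    """Split on whitespace and punctuation; optionally emit each CJK character alone."""
    tokens: list[str] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch.isspace():
            flush()
        elif break_cjk and is_cjk(ch):
            flush()
            tokens.append(ch)  # each ideograph becomes its own token
        elif not ch.isalnum():
            flush()
            tokens.append(ch)  # punctuation becomes its own token
        else:
            current.append(ch)
    flush()
    return tokens


def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation against a user-supplied segment list."""
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker, as in BERT vocabularies
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no segment fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


broken = rule_tokenize("BERT 模型 works!", break_cjk=True)
kept = rule_tokenize("BERT 模型 works!", break_cjk=False)
segments = wordpiece("tokenization", {"token", "##ization", "##s"})
print(broken)
print(kept)
print(segments)
```

The `break_cjk` flag mirrors the "preserving or breaking apart" choice mentioned above: breaking ideographs apart gives a downstream subword model single-character units to work with, while preserving them leaves segmentation to a later stage.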

SergeiAlonichau closed this as completed May 1, 2020
