newmm tokenization

Wannaphong Phatthiyaphaibun edited this page Dec 14, 2020 · 3 revisions

newmm is the code name of the "next maximal matching" word tokenizer engine in PyThaiNLP (it is not the engine's real name). It is the default engine of pythainlp.word_tokenize. Currently, newmm refers to the onecut engine.

newmm versions

  • multi_cut (PyThaiNLP 1.4 - 1.5): Thai word segmentation with maximum matching. The original source code is by Korakot Chaovavanich. It is now the mm engine in PyThaiNLP.
  • onecut (PyThaiNLP 1.6 - now): Dictionary-based maximal matching word segmentation, constrained by Thai Character Cluster (TCC) boundaries. Created by Korakot Chaovavanich.
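
To illustrate what "dictionary-based matching constrained by cluster boundaries" means, here is a minimal sketch. It is a plain greedy longest-match written for this page, not PyThaiNLP's onecut code, and it may resolve ambiguity differently from the real engine. The function name `longest_match_segment`, the `allowed_cuts` parameter (a stand-in for TCC boundaries), and the Latin-letter toy input are all assumptions made for the example.

```python
from typing import Iterable, List, Set


def longest_match_segment(
    text: str,
    dictionary: Iterable[str],
    allowed_cuts: Set[int],
) -> List[str]:
    """Greedy longest-match segmentation restricted to allowed cut positions.

    allowed_cuts holds the indices at which a token is permitted to end.
    It is a hypothetical stand-in for TCC boundaries.
    """
    words = set(dictionary)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    start = 0
    while start < len(text):
        match_end = None
        # Try the longest candidate first and shrink until a word fits.
        for end in range(min(len(text), start + max_len), start, -1):
            if end in allowed_cuts and text[start:end] in words:
                match_end = end
                break
        if match_end is None:
            # No dictionary word starts here: cut at the next allowed boundary.
            match_end = next(
                (e for e in range(start + 1, len(text) + 1) if e in allowed_cuts),
                len(text),
            )
        tokens.append(text[start:match_end])
        start = match_end
    return tokens


# Toy data (not Thai): every position is a legal boundary at first.
toy_dict = {"the", "cat", "cats", "sat", "at"}
every_cut = set(range(1, len("thecatsat") + 1))

unconstrained = longest_match_segment("thecatsat", toy_dict, every_cut)
# Forbid a cut at index 7, as if "sa" were an indivisible cluster.
constrained = longest_match_segment("thecatsat", toy_dict, every_cut - {7})
```

With every position allowed, the greedy pass picks "cats" and is left with "at". Once the cut at index 7 is forbidden, "cats" can no longer end there, so the segmentation becomes "the", "cat", "sat". This is the kind of effect the boundary constraint is meant to have: dictionary matches that would split a character cluster are ruled out.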