PyThaiNLP/mudyom

MudYom (มัดย้อม)

MudYom is a module for pre- and post-processing text. It merges (มัด, "to tie together") tokens that belong together into a single token, according to a user-defined dictionary.


Installation

MudYom is still in beta, so it has to be installed directly from the GitHub repository:

$ pip install git+https://github.com/pythainlp/mudyom.git

Usage (WIP)

$ mudyom-cli --input "..." --dictionary "..." --output "..."

Note: Entries in the dictionary should be sorted from longest to shortest.

If your dictionary is not sorted this way, you can sort it with the command below:

$ awk '{ print length, $0 }' dictionary.txt | sort -n -r | cut -d" " -f2- > sorted_dictionary.txt
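If awk is not available, the same longest-first ordering can be produced in Python. This is a minimal sketch, not part of MudYom itself: the helper name sort_by_length is hypothetical, and the file names are simply the ones used in the command above. Length is counted in Unicode code points, so Thai vowel and tone marks each count as one character.

```python
from pathlib import Path


def sort_by_length(words):
    """Return the words ordered from longest to shortest.

    Hypothetical helper, not part of MudYom. Blank entries are dropped,
    and entries of equal length keep their original relative order.
    """
    return sorted((w for w in words if w), key=len, reverse=True)


def sort_dictionary_file(src="dictionary.txt", dst="sorted_dictionary.txt"):
    """Read one entry per line from src and write the sorted list to dst."""
    words = Path(src).read_text(encoding="utf-8").splitlines()
    Path(dst).write_text("\n".join(sort_by_length(words)) + "\n", encoding="utf-8")
```

Calling sort_dictionary_file() with no arguments reads dictionary.txt and writes sorted_dictionary.txt in the current directory.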

Example

# input.txt
ฉัน|ขวัญ|หนี|ตี|ฝ่อ|ใจ|สลาย

# dictionary.txt
หลบลี้
คิดถึง
ตีฝ่อ

# output.txt
ฉัน|ขวัญ|หนี|ตีฝ่อ|ใจ|สลาย
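The merge shown above can be reproduced with a short Python sketch. It is only an illustration of the idea and makes no claim about how MudYom is implemented internally: the function name merge_tokens is hypothetical, and the strategy (greedily taking the longest run of adjacent tokens whose concatenation is a dictionary entry) is an assumption.

```python
def merge_tokens(tokens, vocab):
    """Greedily merge adjacent tokens whose concatenation is in vocab.

    Hypothetical illustration, not MudYom's actual code. At each position
    the longest matching run of tokens wins.
    """
    vocab = set(vocab)
    max_len = max((len(w) for w in vocab), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        best_end = i + 1  # default: keep the single token unchanged
        candidate = tokens[i]
        for j in range(i + 1, len(tokens)):
            candidate += tokens[j]
            if len(candidate) > max_len:
                break  # no dictionary entry can be this long
            if candidate in vocab:
                best_end = j + 1  # remember the longest match so far
        merged.append("".join(tokens[i:best_end]))
        i = best_end
    return merged


tokens = "ฉัน|ขวัญ|หนี|ตี|ฝ่อ|ใจ|สลาย".split("|")
vocab = ["หลบลี้", "คิดถึง", "ตีฝ่อ"]
print("|".join(merge_tokens(tokens, vocab)))
# prints: ฉัน|ขวัญ|หนี|ตีฝ่อ|ใจ|สลาย
```

Because this sketch looks entries up in a set, it does not depend on the order of the dictionary. The longest-to-shortest requirement noted earlier applies to MudYom's own command-line tool.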

Public Dictionaries

Name | Vocabulary Size | Author
Food and Restaurant Menus | ~400k | Wongnai
Names and Acronyms | ~2k | Thachaparn Bunditlurdruk
Named Entities in BEST | .. | ..

Acknowledgements
