PyThaiNLP/Han-solo

🪿 Han-solo: Thai syllable segmenter

This project aims to build a Thai syllable segmenter that works well on text from the Thai social media domain.

Dataset: Han-solo: Thai syllable segmenter

Google Colab: Demo

Dataset

This project uses two datasets:

  1. Nutcha Dataset (Thai news domain). See data_nutcha/ for details.
  2. Han-solo: Thai syllable segmenter dataset (Thai social media domain). See Han-solo: Thai syllable segmenter for details.

Model

The model is a conditional random field (CRF) trained with the same features as ssg.

The training notebook is train.ipynb.

The model file: han_solo.crfsuite
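
To make the character-level setup more concrete, here is a minimal sketch of how syllable-segmented text could be turned into per-character labels and features for a CRF. It is not the feature set used by ssg or by train.ipynb: the function names, the window size, and the convention that "1" marks the first character of a syllable are all assumptions made for illustration.

```python
def syllables_to_labels(syllables):
    """Flatten a list of syllables into (characters, labels).

    Assumption: "1" marks the first character of each syllable (a split
    before it) and "0" marks every other character.
    """
    chars, labels = [], []
    for syllable in syllables:
        for i, ch in enumerate(syllable):
            chars.append(ch)
            labels.append("1" if i == 0 else "0")
    return chars, labels


def char_window_features(chars, i, window=2):
    """Hypothetical feature template: the characters around position i."""
    feats = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the text are filled with a padding symbol.
        feats[f"char[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<pad>"
    return feats


# Example: two syllables become one label and one feature dict per character.
chars, labels = syllables_to_labels(["สวัส", "ดี"])
features = [char_window_features(chars, i) for i in range(len(chars))]
```

A sequence of feature dictionaries like `features`, paired with `labels`, is the usual input shape for CRF toolkits; see train.ipynb for the features actually used.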

F1-score

Scores are computed per character: label 1 means split, and label 0 means no split.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     61078
           1       1.00      0.99      0.99     29468

    accuracy                           1.00     90546
   macro avg       1.00      1.00      1.00     90546
weighted avg       1.00      1.00      1.00     90546
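
For readers unfamiliar with the columns, the sketch below shows how precision, recall, and F1 for one label can be computed from gold and predicted label sequences. The function name and the toy sequences are made up for illustration; the numbers in the table above come from the project's own evaluation, not from this code.

```python
def precision_recall_f1(gold, pred, positive):
    """Compute precision, recall, and F1 for one label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data: the prediction misses one of the three splits.
gold = [1, 0, 0, 1, 0, 1]
pred = [1, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(gold, pred, positive=1)
```

In this toy case every predicted split is correct (precision 1.0) but one split is missed (recall 2/3), the same pattern that the table shows, far more mildly, for label 1.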

How to use?

  • See using.ipynb
  • Use PyThaiNLP v4.1 or later
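
As a rough sketch of the PyThaiNLP route: this README does not state the exact call, so the use of `subword_tokenize` with the engine name `"han_solo"` is an assumption to check against the PyThaiNLP documentation. The wrapper name `segment_syllables` is hypothetical.

```python
def segment_syllables(text):
    """Segment Thai text into syllables through PyThaiNLP.

    Assumption: the Han-solo model is exposed as the "han_solo" engine of
    pythainlp.tokenize.subword_tokenize.
    """
    # Imported lazily so this module loads even without PyThaiNLP installed.
    from pythainlp.tokenize import subword_tokenize

    return subword_tokenize(text, engine="han_solo")
```

With PyThaiNLP installed, `segment_syllables("สวัสดีครับ")` should return a list of syllable strings. See using.ipynb for the author's own usage.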

License

  • CC-BY 4.0 license (for Dataset)
  • Apache License Version 2.0 (for Source code and model)

Cite as

Wannaphong Phatthiyaphaibun. (2023). Han-solo: Thai syllable segmenter (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8196608

or BibTeX entry:

@dataset{wannaphong_phatthiyaphaibun_2023_8196608,
  author       = {Wannaphong Phatthiyaphaibun},
  title        = {Han-solo: Thai syllable segmenter},
  month        = jul,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.8196608},
  url          = {https://doi.org/10.5281/zenodo.8196608}
}