Word Tokenisation Benchmark for Thai (obsolete)

Warning: The code of this project has been migrated to PyThaiNLP (PyThaiNLP/pythainlp#248). Please check PyThaiNLP's documentation for further details.

This repository is a framework for benchmarking tokenisation algorithms for Thai. It provides a command-line interface that allows users to conveniently execute the benchmarks, as well as a module interface for use in development pipelines.

Metrics

Character-Level (CL)

  • True Positive (TP): no. of starting characters that are correctly predicted.
  • True Negative (TN): no. of non-starting characters that are correctly predicted.
  • False Positive (FP): no. of non-starting characters that are wrongly predicted as starting characters.
  • False Negative (FN): no. of starting characters that are wrongly predicted as non-starting characters.
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1: 2 * Precision * Recall / (Precision + Recall), the harmonic mean of precision and recall (see the sketch after this list)
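
To make these definitions concrete, the following is a minimal sketch (not the benchmark's own implementation) of how the character-level counts can be derived from two tokenisations of the same text. The helper name and the use of "|" as the word separator are assumptions made for this example:

    def char_level_metrics(ref, pred):
        """Character-level TP/TN/FP/FN, precision, recall and F1.

        `ref` and `pred` are the same text tokenised with "|" as the
        word separator (an assumed convention for this sketch).
        """
        def start_labels(tokenised):
            # 1 if the character starts a word, 0 otherwise
            labels = []
            for word in tokenised.split("|"):
                if word:
                    labels.append(1)
                    labels.extend([0] * (len(word) - 1))
            return labels

        ref_labels = start_labels(ref)
        pred_labels = start_labels(pred)
        assert len(ref_labels) == len(pred_labels), "both tokenisations must cover the same text"

        pairs = list(zip(ref_labels, pred_labels))
        tp = sum(r == 1 and p == 1 for r, p in pairs)
        tn = sum(r == 0 and p == 0 for r, p in pairs)
        fp = sum(r == 0 and p == 1 for r, p in pairs)
        fn = sum(r == 1 and p == 0 for r, p in pairs)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
                "precision": precision, "recall": recall, "f1": f1}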

Word-Level (WL)

  • Correctly Tokenised Words (CTW): no. of words in the reference that are correctly tokenised.
  • Precision: CTW / no. of words in the reference solution
  • Recall: CTW / no. of words in the tokenised sample
  • F1: 2 * Precision * Recall / (Precision + Recall) (see the sketch after this list)
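
Similarly, a rough sketch of the word-level metrics (again not the project's actual code, again assuming "|"-separated tokens and a hypothetical helper name): a word counts as correctly tokenised when its character span in the sample matches a span in the reference.

    def word_level_metrics(ref, pred):
        """Word-level CTW, precision, recall and F1 for "|"-separated tokenisations."""
        def spans(tokenised):
            # set of (start, end) character offsets, one per word
            result, offset = set(), 0
            for word in tokenised.split("|"):
                if word:
                    result.add((offset, offset + len(word)))
                    offset += len(word)
            return result

        ref_spans = spans(ref)
        pred_spans = spans(pred)
        ctw = len(ref_spans & pred_spans)  # correctly tokenised words

        # following the definitions above: precision is taken over the
        # reference words, recall over the words in the tokenised sample
        precision = ctw / len(ref_spans) if ref_spans else 0.0
        recall = ctw / len(pred_spans) if pred_spans else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return {"ctw": ctw, "precision": precision, "recall": recall, "f1": f1}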

Benchmark Results

Vendor           Approach            Datasets
DeepCut          CNN
PyThaiNLP-newmm  dictionary-based
Sertis-BiGRU     Bi-directional RNN

Installation (WIP)

pip ...

Usage (to be updated)

  1. Command-line Interface

    PYTHONPATH=`pwd` python scripts/thai-tokenisation-benchmark.py \
    --test-file ./data/best-2010/TEST_100K_ANS.txt \
    --input ./data/best-2010-deepcut.txt
    
    # Sample output
    Benchmarking ./data/best-2010-deepcut.txt against ./data/best-2010/TEST_100K_ANS.txt with 2252 samples in total
    ============== Benchmark Result ==============
                    metric       mean±std       min    max
             char_level:tp    47.82±47.22  1.000000  354.0
             char_level:tn  144.19±145.97  1.000000  887.0
             char_level:fp      1.34±2.02  0.000000   23.0
             char_level:fn      0.70±1.19  0.000000   14.0
      char_level:precision      0.96±0.08  0.250000    1.0
         char_level:recall      0.98±0.04  0.500000    1.0
             char_level:f1      0.97±0.06  0.333333    1.0
      word_level:precision      0.92±0.14  0.000000    1.0
         word_level:recall      0.93±0.12  0.000000    1.0
             word_level:f1      0.93±0.13  0.000000    1.0
    
  2. Module Interface

    from pythainlp.benchmarks import word_tokenisation as bwt
    
    # list of reference (gold-standard) tokenised samples
    ref_samples = [...]
    # list of tokenised samples produced by your algorithm
    tokenised_samples = [...]
    
    # dataframe contains metrics for each sample
    df = bwt.benchmark(ref_samples, tokenised_samples)
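
    The returned dataframe has one row of metrics per sample, so the summary statistics printed by the command-line interface can be reproduced by aggregating it; a minimal follow-up sketch, assuming df is a pandas DataFrame of numeric per-sample metrics:

    # per-metric mean, std, min and max across all samples
    summary = df.describe().loc[["mean", "std", "min", "max"]]
    print(summary)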
    

Related Work

Development

# unit tests
$ TEST_VERBOSE=1 PYTHONPATH=. python tests/__init__.py

Acknowledgements

This project was initially started by Pattarawat Chormai while he was interning at Dr. Attapol Thamrongrattanarit's lab.
