Tokenisation Benchmark Visualization
This tool contains a collection of tokenisation benchmarks for Thai. We aim to include benchmarks for all major algorithms. It also has features that allow one to compare algorithms and investigate the cases in which each one fails.
- Thai Tokenisers Docker: a collection of pre-built Docker containers for Thai tokenisers.
- Tokenisation Benchmark for Thai: script for evaluating tokenisation results.
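As a rough illustration of what evaluating a tokenisation result can involve, the sketch below scores a predicted segmentation against a reference by comparing the character offsets at which tokens end. This is an assumption made for illustration only, not the metric or code used by the Tokenisation Benchmark for Thai script; the function names `boundary_offsets` and `boundary_scores` are hypothetical.

```python
from typing import List, Set, Tuple


def boundary_offsets(tokens: List[str]) -> Set[int]:
    """Return the character offsets at which each token ends."""
    offsets: Set[int] = set()
    position = 0
    for token in tokens:
        position += len(token)
        offsets.add(position)
    return offsets


def boundary_scores(reference: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Compute boundary-level precision, recall and F1 (hypothetical helper)."""
    if "".join(reference) != "".join(predicted):
        raise ValueError("both segmentations must cover the same text")
    ref = boundary_offsets(reference)
    pred = boundary_offsets(predicted)
    correct = len(ref & pred)  # boundaries present in both segmentations
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # "ตากลม" can be segmented as ตา|กลม or ตาก|ลม; the reference here is illustrative.
    print(boundary_scores(["ตา", "กลม"], ["ตาก", "ลม"]))
```

Because the final boundary always coincides when both segmentations cover the same text, short inputs score higher than they might under a word-level metric; a stricter variant could exclude that last offset.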
| Dataset | Link | Description |
|---|---|---|
| BEST Validation Set | Link | A validation set that I randomly selected from BEST's training set. |
| Thai National Historical Corpus (TNHC) | Link | Classical Thai literature texts. Some preprocessing steps were applied. |
| Orchid | Link | Thai academic articles. Some preprocessing steps were applied. |
| กลอนตากลม | Link | By คมเพชร เชิงกลอน ภาค สายลม |
How to obtain the benchmark result?
The data sources for this tool are artifacts produced by the Tokenisation Benchmark for Thai script, i.e.
Please create an issue if you want to include your benchmark in this tool.
- NodeJS v11.4.0
This project is built with Gatsby. Start a development server with the command below:
$ npm run develop
For production deployment, run scripts/deploy.sh, which prepares a production build and synchronises it with GitHub Pages.