Code for the paper Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Python packages are listed in `requirements.txt`. This code does not require a GPU/TPU.
The benchmark supports tokenizers serialized in the HuggingFace JSON format. In addition, we've added support for several custom inference methods (greedy longest suffix, greedy longest token, etc.). The JSON files used in the paper will be added soon as examples.
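To illustrate what a custom inference method looks like, here is a minimal sketch of "greedy longest token" segmentation, assuming a plain Python set as the vocabulary. The function name `greedy_longest_token` and the character-level fallback are illustrative assumptions, not the repo's actual API:

```python
def greedy_longest_token(word, vocab):
    """Greedily pick the longest in-vocab substring anywhere in `word`,
    then recurse on the remainders to its left and right."""
    if not word:
        return []
    # Scan candidate lengths from longest to shortest.
    for length in range(len(word), 0, -1):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in vocab:
                return (greedy_longest_token(word[:start], vocab)
                        + [piece]
                        + greedy_longest_token(word[start + length:], vocab))
    return list(word)  # fall back to single characters

vocab = {"un", "token", "iz", "able"}
print(greedy_longest_token("untokenizable", vocab))  # ['un', 'token', 'iz', 'able']
```

The greedy longest prefix/suffix variants differ only in where they look for a match: always at the left (or right) edge of the remaining string rather than anywhere inside it.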
The resources we've used for the evaluation are in the `resources` folder.
| Resource | Reference |
|---|---|
| LADEC | paper |
| MorphoLex | paper |
| MorphyNet | paper |
| DagoBert | paper |
| UniMorph | paper |
| UnBlend | paper |
| CompoundPiece | paper |
| Cognitive data | paper |
| tokenization-scorer | paper |
Execute `main.py` from its working directory.
Arguments:
- `--tokenizers`: a path to a txt file containing paths to tokenizer config files in JSON format. Defaults to `tokenizers.txt` in the working directory.
- `--compare`: a boolean flag for comparing segmentation differences between inference methods. Defaults to `False`. If enabled, make sure the default segmentation is the first path in the tokenizers paths file (and that all tokenizers share the same vocabulary).
Example:

```shell
python main.py \
    --tokenizers tokenizers.txt
```
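The tokenizers file is expected to be a plain list of paths to tokenizer JSON configs, one per line; for example (hypothetical paths):

```
tokenizers/bpe_default.json
tokenizers/bpe_greedy.json
```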
Citation:

```bibtex
@misc{uzan2024greed,
    title={Greed is All You Need: An Evaluation of Tokenizer Inference Methods},
    author={Omri Uzan and Craig W. Schmidt and Chris Tanner and Yuval Pinter},
    year={2024},
    eprint={2403.01289},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```