Tokenizer Efficiency

Leonard edited this page Apr 23, 2024 · 6 revisions

Japanese tokenizer efficiency, measured by sampling 50K items (~85M characters) from the JA subset of the CulturaX dataset:

| LLM | Tokenizer | Vocab Size | Avg Char/Token |
| --- | --- | --- | --- |
| Shisa 7B (AUGMXNT) | augmxnt/shisa-base-7b-v1 | 120073 | 2.31 |
| OpenCALM (CyberAgent) | cyberagent/open-calm-7b | 52000 | 2.17 |
| Japanese LargeLM (LINE) | line-corporation/japanese-large-lm-3.6b | 51200 | 2.14 |
| CALM2-7B (CyberAgent) | cyberagent/calm2-7b | 65000 | 2.00 |
| Bilingual-GPT-NeoX-4B (Rinna) | rinna/bilingual-gpt-neox-4b | 65536 | 1.88 |
| Japanese StableLM Alpha (Stability AI) | novelai/nerdstash-tokenizer-v1 | 65535 | 1.85 |
| Japanese-GPT-NeoX-3.6B (Rinna) | rinna/japanese-gpt-neox-3.6b | 32000 | 1.83 |
| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b | 49247 | 1.79 |
| ELYZA 13B fast | ELYZA-japanese-Llama-2-13b-fast | 44581 | 1.77 |
| Orion 14B (OrionStarAI) | OrionStarAI/Orion-14B-Base | 84608 | 1.71 |
| llm-jp-13b (LLM-jp) | llm-jp/llm-jp-13b-v1.0 | 50570 | 1.65 |
| RakutenAI-7B | Rakuten/RakutenAI-7B | 48000 | 1.61 |
| Swallow 7B (TokyoTech-LLM) | tokyotech-llm/Swallow-7b-hf | 43176 | 1.55 |
| Japanese-Llama-2-7b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-7b-fast | 45043 | 1.53 |
| Qwen 14B (Qwen) | Qwen/Qwen-14B | 151851 | 1.48 |
| XVERSE 65B (xverse) | xverse/XVERSE-65B | 100534 | 1.10 |
| weblab-10b (Matsuo Lab) | EleutherAI/gpt-neox-20b | 50254 | 1.00 |
| Japanese StableLM Gamma (Stability AI) | mistralai/Mistral-7B-v0.1 | 32000 | 0.95 |
| Youri 7B (Rinna) | meta-llama/Llama-2-7B | 32000 | 0.88 |
| DeepSeek LLM 7B (DeepSeek) | deepseek-ai/deepseek-llm-7b-base | 102400 | 0.85 |
| Yi 34B (01.ai) | 01-ai/Yi-34B | 64000 | 0.83 |
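The metric above is simply total characters divided by total tokens over the sample. A minimal sketch of how it can be computed is below; the page's actual benchmarking code is not shown here, so the `tokenize` callable and the Hugging Face usage in the note afterward are illustrative assumptions:

```python
# Sketch of the chars-per-token efficiency metric used in the tables.
# `tokenize` is any callable mapping a string to a list of token IDs
# (e.g. a Hugging Face tokenizer's encode method).
def avg_chars_per_token(tokenize, texts):
    """Average number of characters encoded per token across `texts`."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens
```

With `transformers` installed, this could be driven by e.g. `AutoTokenizer.from_pretrained("augmxnt/shisa-base-7b-v1").encode`; note that whether special tokens (BOS/EOS) are counted will slightly shift the numbers, so a consistent setting matters when comparing tokenizers.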

As a sanity check (and to see how other tokenizers fare), we also test English efficiency on a sample of 50K items (~177M characters) from the EN subset of the CulturaX dataset:

| LLM | Tokenizer | Vocab Size | Avg Char/Token |
| --- | --- | --- | --- |
| Qwen 14B (Qwen) | Qwen/Qwen-14B | 151851 | 4.47 |
| weblab-10b (Matsuo Lab) | EleutherAI/gpt-neox-20b | 50254 | 4.45 |
| DeepSeek LLM 7B (DeepSeek) | deepseek-ai/deepseek-llm-7b-base | 102400 | 4.33 |
| Orion 14B (OrionStarAI) | OrionStarAI/Orion-14B-Base | 84608 | 4.25 |
| Yi 34B (01.ai) | 01-ai/Yi-34B | 64000 | 4.19 |
| Japanese StableLM Alpha (Stability AI) | novelai/nerdstash-tokenizer-v1 | 65535 | 4.15 |
| Shisa 7B (AUGMXNT) | augmxnt/shisa-base-7b-v1 | 120073 | 4.12 |
| CALM2-7B (CyberAgent) | cyberagent/calm2-7b | 65000 | 4.12 |
| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b | 49247 | 4.01 |
| Japanese StableLM Gamma (Stability AI) | mistralai/Mistral-7B-v0.1 | 32000 | 4.01 |
| Swallow 7B (TokyoTech-LLM) | tokyotech-llm/Swallow-7b-hf | 43176 | 3.86 |
| ELYZA 13B fast | ELYZA-japanese-Llama-2-13b-fast | 44581 | 3.86 |
| Japanese-Llama-2-7b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-7b-fast | 45043 | 3.86 |
| Youri 7B (Rinna) | meta-llama/Llama-2-7B | 32000 | 3.86 |
| llm-jp-13b (LLM-jp) | llm-jp/llm-jp-13b-v1.0 | 50570 | 3.79 |
| XVERSE 65B (xverse) | xverse/XVERSE-65B | 100534 | 2.96 |
| OpenCALM (CyberAgent) | cyberagent/open-calm-7b | 52000 | 2.83 |
| Japanese LargeLM (LINE) | line-corporation/japanese-large-lm-3.6b | 51200 | 2.49 |
| Japanese-GPT-NeoX-3.6B (Rinna) | rinna/japanese-gpt-neox-3.6b | 32000 | 2.42 |
| Bilingual-GPT-NeoX-4B (Rinna) | rinna/bilingual-gpt-neox-4b | 65536 | 2.42 |