TokenFlux++ is a fast tokenizer toolkit (C++ core + Python bindings) for:
- Training tokenizer models (byte_bpe, bpe, wordpiece, unigram)
- High-throughput encoding and dataset pre-tokenization
Latest release: 0.3.4
Releases: https://github.com/TabNahida/TokenFluxPlusPlus/releases
From PyPI:
pip install tokenflux
From source (local repo):
pip install .
Editable source install:
pip install -e .
If no prebuilt wheel is available for your platform, installation falls back to a source build and requires xmake + a C++ toolchain.
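Because a source build needs xmake and a C++ toolchain, a quick pre-flight check can save a failed install. The sketch below only looks for executables on PATH using the standard library. The compiler names in `CXX_CANDIDATES` and both helper functions are illustrative assumptions, not part of TokenFlux++, and finding a binary on PATH does not prove the toolchain is usable (MSVC's `cl`, for example, is often only visible inside a developer prompt).

```python
import shutil
from typing import Dict, Optional

# Assumed compiler executable names; adjust for your platform.
CXX_CANDIDATES = ("c++", "g++", "clang++", "cl")


def find_build_tools() -> Dict[str, Optional[str]]:
    """Return the resolved path (or None) for xmake and the first C++ compiler found."""
    compiler = next((p for p in map(shutil.which, CXX_CANDIDATES) if p), None)
    return {"xmake": shutil.which("xmake"), "cxx": compiler}


def missing_tools(tools: Dict[str, Optional[str]]) -> list:
    """List the tool names that could not be located on PATH."""
    return sorted(name for name, path in tools.items() if path is None)


if __name__ == "__main__":
    missing = missing_tools(find_build_tools())
    if missing:
        print("missing for source build: " + ", ".join(missing))
    else:
        print("xmake and a C++ compiler were found on PATH")
```

If anything is reported missing, install it before running `pip install .` on a platform without a prebuilt wheel.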
import tokenflux as tf
# train
cfg = tf.TrainConfig()
cfg.trainer = tf.TrainerKind.byte_bpe
cfg.vocab_size = 16000
cfg.output_json = "tokenizer.json"
cfg.output_vocab = "vocab.json"
cfg.output_merges = "merges.txt"
tf.train(cfg, ["data/train.jsonl"])
# encode
tok = tf.Tokenizer("tokenizer.json")
ids = tok.encode("hello TokenFlux++")
print(ids[:10], len(ids))
- C++ API: docs/cpp_api.md
- Python API: docs/python_api.md
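Building on the Python snippet above, the following sketch encodes several texts in a loop and reports the total token count. It takes any `encode` callable, so it runs without TokenFlux++ installed. `encode_corpus` and the byte-level stand-in `byte_encode` are hypothetical helpers for illustration, and the sketch assumes the value returned by `encode` is a plain sequence of integers (the snippet above only shows that it supports slicing and `len`).

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def encode_corpus(
    encode: Callable[[str], Sequence[int]],
    texts: Iterable[str],
) -> Tuple[List[List[int]], int]:
    """Encode each text and return (per-text id lists, total token count)."""
    encoded = [list(encode(text)) for text in texts]
    total = sum(len(ids) for ids in encoded)
    return encoded, total


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder (UTF-8 bytes) so the sketch runs without a trained tokenizer."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    encoded, total = encode_corpus(byte_encode, ["hello", "TokenFlux++"])
    print(len(encoded), "texts,", total, "tokens")
```

To use a trained tokenizer instead, pass `tf.Tokenizer("tokenizer.json").encode` in place of `byte_encode`.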
python benchmarks/tokenfluxpp_vs_tiktoken_vs_tokenizer.py
Install comparison dependencies:
python -m pip install tiktoken
python -m pip install tokenizers
Snapshot: encode throughput (docs/s) by thread count (higher is better).
Latest encode latency speedup:
- 4.32x vs OpenAI tiktoken
- 11.89x vs HuggingFace tokenizers
Full benchmark report:
benchmarks/BENCHMARK_RESULTS_2026-03-01.md
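For readers who want to sanity-check numbers like these on their own data, here is a minimal timing sketch. It is not the methodology of the benchmark script, and both helpers are hypothetical. It assumes throughput is documents divided by wall-clock time, and that a latency speedup is the baseline's time divided by the candidate's time for the same workload.

```python
import time
from typing import Callable, Sequence


def docs_per_second(
    encode: Callable[[str], object],
    docs: Sequence[str],
    repeats: int = 3,
) -> float:
    """Best-of-N single-threaded throughput in documents per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            encode(doc)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very small workloads.
    return len(docs) / max(best, 1e-12)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    sample = ["hello TokenFlux++"] * 10_000
    # str.split is only a placeholder workload; pass a real encode function here.
    print(f"{docs_per_second(str.split, sample):.0f} docs/s")
```

Pass each library's encode function over the same `docs` to compare them. Small samples and a cold first run can skew results, which is why the sketch keeps the best of several repeats.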
Train:
xmake run TokenFluxTrain \
--data-list "data/inputs.list" \
--trainer byte_bpe \
--vocab-size 16000 \
--threads 8 \
--output tokenizer.json \
--vocab vocab.json
Tokenize:
xmake run TokenFluxTokenize \
--data-list "data/inputs.list" \
--tokenizer tokenizer.json \
--out-dir data/tokens \
--add-eos \
--threads 8 \
--max-tokens-per-shard 50000000
Tokenization output is a concatenated document stream in shard files.
--add-eos is enabled by default; use --no-eos to disable automatic EOS append per document.
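To make the idea of a concatenated document stream concrete, the sketch below packs already-encoded documents into shards, appending an EOS id after each one. It illustrates the concept only, not TokenFluxTokenize's actual on-disk format or rollover policy. `pack_into_shards` and `eos_id` are hypothetical names, and the sketch assumes documents are never split across shards, so a new shard starts whenever the next document would exceed the cap.

```python
from typing import Iterable, Iterator, List, Sequence


def pack_into_shards(
    docs: Iterable[Sequence[int]],
    eos_id: int,
    max_tokens_per_shard: int,
    add_eos: bool = True,
) -> Iterator[List[int]]:
    """Yield flat token lists, each holding whole documents back to back."""
    shard: List[int] = []
    for ids in docs:
        doc = list(ids) + ([eos_id] if add_eos else [])
        # Roll over before the cap would be exceeded (assumed policy).
        if shard and len(shard) + len(doc) > max_tokens_per_shard:
            yield shard
            shard = []
        # A single document longer than the cap still gets its own shard.
        shard.extend(doc)
    if shard:
        yield shard


if __name__ == "__main__":
    docs = [[1, 2], [3], [4, 5, 6]]
    for i, shard in enumerate(pack_into_shards(docs, eos_id=0, max_tokens_per_shard=5)):
        print(i, shard)
```

With EOS appended, document boundaries can be recovered from the flat stream by splitting on the EOS id. With `add_eos=False` the boundaries are lost unless they are stored separately.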
xmake
Python extension output is typically under build/.../tokenflux_cpp.pyd (Windows) or build/.../tokenflux_cpp.so (Linux/macOS).
- To pick up binding updates (for example encode threading changes), rebuild:
xmake f -y -m release --pybind=y
xmake build -y tokenflux_cpp
.env defaults are supported for CLI workflows.
