TokenFlux++

TokenFlux++ is a fast tokenizer toolkit (C++ core + Python bindings) for:

  • Training tokenizer models (byte_bpe, bpe, wordpiece, unigram)
  • High-throughput encoding and dataset pre-tokenization

Latest release: 0.3.4
Releases: https://github.com/TabNahida/TokenFluxPlusPlus/releases

Install

From PyPI:

pip install tokenflux

From source (local repo):

pip install .

Editable source install:

pip install -e .

If no prebuilt wheel is available for your platform, installation falls back to a source build, which requires xmake and a C++ toolchain.

Quickstart (Python)

import tokenflux as tf

# Train a byte-level BPE tokenizer with a 16k vocabulary.
cfg = tf.TrainConfig()
cfg.trainer = tf.TrainerKind.byte_bpe
cfg.vocab_size = 16000
cfg.output_json = "tokenizer.json"
cfg.output_vocab = "vocab.json"
cfg.output_merges = "merges.txt"
tf.train(cfg, ["data/train.jsonl"])

# Load the trained model and encode text to token ids.
tok = tf.Tokenizer("tokenizer.json")
ids = tok.encode("hello TokenFlux++")
print(ids[:10], len(ids))
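To make the byte_bpe trainer's job concrete, here is a minimal, illustrative sketch of one byte-level BPE merge step in plain Python. This is not TokenFlux++'s implementation, and the helper names are ours; a real trainer repeats this step until the target vocab_size is reached.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent token pairs across all sequences; return the top one."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Byte-level BPE starts from raw UTF-8 bytes (ids 0-255), so there is no
# out-of-vocabulary problem; merged tokens get ids from 256 upward.
docs = [list(b"hello hello"), list(b"hello world")]
pair = most_frequent_pair(docs)                  # most frequent adjacent byte pair
docs = [merge_pair(d, pair, 256) for d in docs]  # first merged token gets id 256
```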

API Docs

Performance

Install comparison dependencies:

python -m pip install tiktoken
python -m pip install tokenizers

Run the benchmark:

python benchmarks/tokenfluxpp_vs_tiktoken_vs_tokenizer.py

Snapshot: encode throughput (docs/s) by thread count (higher is better).

[Figure: encode throughput by thread count]

Latest encode latency speedup:

  • 4.32x vs OpenAI tiktoken
  • 11.89x vs HuggingFace tokenizers

Full benchmark report:
benchmarks/BENCHMARK_RESULTS_2026-03-01.md

CLI

Train:

xmake run TokenFluxTrain \
  --data-list "data/inputs.list" \
  --trainer byte_bpe \
  --vocab-size 16000 \
  --threads 8 \
  --output tokenizer.json \
  --vocab vocab.json

Tokenize:

xmake run TokenFluxTokenize \
  --data-list "data/inputs.list" \
  --tokenizer tokenizer.json \
  --out-dir data/tokens \
  --add-eos \
  --threads 8 \
  --max-tokens-per-shard 50000000

Tokenization output is a concatenated document stream written to shard files.
--add-eos is enabled by default; use --no-eos to disable appending an EOS token after each document.
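The shard binary layout is not documented here, but assuming each shard is a flat sequence of token ids with an EOS id appended per document (as --add-eos produces), recovering document boundaries from the stream can be sketched as:

```python
def split_documents(ids, eos_id):
    """Split a concatenated token stream back into per-document lists.

    Assumes each document ends with `eos_id`; a trailing run without an
    EOS (e.g. at a shard boundary) is kept as a partial document.
    """
    docs, cur = [], []
    for t in ids:
        cur.append(t)
        if t == eos_id:
            docs.append(cur)
            cur = []
    if cur:
        docs.append(cur)
    return docs

stream = [5, 9, 2, 7, 2, 3]        # toy stream with eos_id = 2
print(split_documents(stream, 2))  # → [[5, 9, 2], [7, 2], [3]]
```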

Build

xmake

Python extension output is typically under build/.../tokenflux_cpp.pyd (Windows) or build/.../tokenflux_cpp.so (Linux/macOS).
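Since the exact subdirectory under build/ depends on platform and build mode, a small helper can search for the compiled extension. Only the tokenflux_cpp.pyd/.so filenames come from the text above; the function name and glob patterns are our assumption.

```python
from pathlib import Path

def find_extension(build_root="build"):
    """Search the xmake build tree for the compiled Python extension.

    Looks for tokenflux_cpp*.pyd (Windows) or tokenflux_cpp*.so
    (Linux/macOS) anywhere under `build_root`; returns the first hit
    or None if nothing has been built yet.
    """
    root = Path(build_root)
    for pattern in ("tokenflux_cpp*.pyd", "tokenflux_cpp*.so"):
        hits = sorted(root.rglob(pattern))
        if hits:
            return hits[0]
    return None
```

Appending the returned path's parent directory to sys.path makes the extension importable without installing the package.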

Notes

  • To pick up binding updates (for example encode threading changes), rebuild:
xmake f -y -m release --pybind=y
xmake build -y tokenflux_cpp
  • .env defaults are supported for CLI workflows.
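As a sketch of what ".env defaults" typically means for a CLI workflow, a minimal KEY=VALUE parser might look like the following. The exact keys TokenFlux++ reads are not documented here, so treat this as illustrative only.

```python
def load_env_defaults(path=".env"):
    """Parse simple KEY=VALUE lines; '#' comments and blank lines are skipped.

    Returns an empty dict if the file does not exist, so CLI flags can
    fall back to these values without special-casing a missing .env.
    """
    defaults = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                defaults[key.strip()] = value.strip().strip('"')
    except FileNotFoundError:
        pass
    return defaults
```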
