TokenFlux++

TokenFlux++ is a fast tokenizer toolkit (C++ core + Python bindings) for:

  • Training tokenizer models (byte_bpe, bpe, wordpiece, unigram)
  • High-throughput encoding and dataset pre-tokenization

Latest release: 0.3.4
Releases: https://github.com/TabNahida/TokenFluxPlusPlus/releases

Install

From PyPI:

pip install tokenflux

From source (local repo):

pip install .

Editable source install:

pip install -e .

If no prebuilt wheel is available for your platform, installation falls back to a source build, which requires xmake and a C++ toolchain.

Quickstart (Python)

import tokenflux as tf

# Train a byte-level BPE tokenizer with a 16k vocabulary.
cfg = tf.TrainConfig()
cfg.trainer = tf.TrainerKind.byte_bpe
cfg.vocab_size = 16000
cfg.output_json = "tokenizer.json"
cfg.output_vocab = "vocab.json"
cfg.output_merges = "merges.txt"
tf.train(cfg, ["data/train.jsonl"])

# Load the trained model and encode text to token ids.
tok = tf.Tokenizer("tokenizer.json")
ids = tok.encode("hello TokenFlux++")
print(ids[:10], len(ids))
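To make the byte_bpe trainer's job concrete, here is a minimal, illustrative sketch of one byte-level BPE merge step in plain Python. This is not TokenFlux++'s implementation, and the helper names are ours; a real trainer repeats this step until the target vocab_size is reached.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent token pairs across all sequences; return the top one."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Byte-level BPE starts from raw UTF-8 bytes (ids 0-255), so there is no
# out-of-vocabulary problem; merged tokens get ids from 256 upward.
docs = [list(b"hello hello"), list(b"hello world")]
pair = most_frequent_pair(docs)                  # most frequent adjacent byte pair
docs = [merge_pair(d, pair, 256) for d in docs]  # first merged token gets id 256
```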

API Docs

Performance

Install comparison dependencies:

python -m pip install tiktoken
python -m pip install tokenizers

Run the benchmark:

python benchmarks/tokenfluxpp_vs_tiktoken_vs_tokenizer.py

Snapshot: encode throughput (docs/s) by thread count (higher is better).

[Figure: encode throughput by thread count]

Latest encode latency speedup:

  • 4.32x vs OpenAI tiktoken
  • 11.89x vs HuggingFace tokenizers

Full benchmark report:
benchmarks/BENCHMARK_RESULTS_2026-03-01.md

CLI

Train:

xmake run TokenFluxTrain \
  --data-list "data/inputs.list" \
  --trainer byte_bpe \
  --vocab-size 16000 \
  --threads 8 \
  --output tokenizer.json \
  --vocab vocab.json

Tokenize:

xmake run TokenFluxTokenize \
  --data-list "data/inputs.list" \
  --tokenizer tokenizer.json \
  --out-dir data/tokens \
  --add-eos \
  --threads 8 \
  --max-tokens-per-shard 50000000

Tokenization output is a concatenated document stream written to shard files.
--add-eos is enabled by default; use --no-eos to disable appending an EOS token after each document.
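The shard binary layout is not documented here, but assuming each shard is a flat sequence of token ids with an EOS id appended per document (as --add-eos produces), recovering document boundaries from the stream can be sketched as:

```python
def split_documents(ids, eos_id):
    """Split a concatenated token stream back into per-document lists.

    Assumes each document ends with `eos_id`; a trailing run without an
    EOS (e.g. at a shard boundary) is kept as a partial document.
    """
    docs, cur = [], []
    for t in ids:
        cur.append(t)
        if t == eos_id:
            docs.append(cur)
            cur = []
    if cur:
        docs.append(cur)
    return docs

stream = [5, 9, 2, 7, 2, 3]        # toy stream with eos_id = 2
print(split_documents(stream, 2))  # → [[5, 9, 2], [7, 2], [3]]
```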

Build

xmake

Python extension output is typically under build/.../tokenflux_cpp.pyd (Windows) or build/.../tokenflux_cpp.so (Linux/macOS).
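Since the exact subdirectory under build/ depends on platform and build mode, a small helper can search for the compiled extension. Only the tokenflux_cpp.pyd/.so filenames come from the text above; the function name and glob patterns are our assumption.

```python
from pathlib import Path

def find_extension(build_root="build"):
    """Search the xmake build tree for the compiled Python extension.

    Looks for tokenflux_cpp*.pyd (Windows) or tokenflux_cpp*.so
    (Linux/macOS) anywhere under `build_root`; returns the first hit
    or None if nothing has been built yet.
    """
    root = Path(build_root)
    for pattern in ("tokenflux_cpp*.pyd", "tokenflux_cpp*.so"):
        hits = sorted(root.rglob(pattern))
        if hits:
            return hits[0]
    return None
```

Appending the returned path's parent directory to sys.path makes the extension importable without installing the package.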

Notes

  • To pick up binding updates (for example encode threading changes), rebuild:
xmake f -y -m release --pybind=y
xmake build -y tokenflux_cpp
  • .env defaults are supported for CLI workflows.
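As a sketch of what ".env defaults" typically means for a CLI workflow, a minimal KEY=VALUE parser might look like the following. The exact keys TokenFlux++ reads are not documented here, so treat this as illustrative only.

```python
def load_env_defaults(path=".env"):
    """Parse simple KEY=VALUE lines; '#' comments and blank lines are skipped.

    Returns an empty dict if the file does not exist, so CLI flags can
    fall back to these values without special-casing a missing .env.
    """
    defaults = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                defaults[key.strip()] = value.strip().strip('"')
    except FileNotFoundError:
        pass
    return defaults
```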
