Bump tokenizers crate from 0.14.0 to 0.22.2 #7

Merged
Anush008 merged 2 commits into Anush008:main from MayCXC:bump-tokenizers-0.22 on Mar 19, 2026

Contributor

@MayCXC MayCXC commented Mar 19, 2026

Updates the Rust tokenizers crate dependency from 0.14.0 (September 2023) to 0.22.2 (December 2025), picking up over two years of upstream improvements from huggingface/tokenizers.

Changes

Cargo.toml:

  • tokenizers 0.14.0 -> 0.22.2
  • Added ahash = { version = "0.8", features = ["serde"] } (required by newer tokenizers crate for vocab hash maps)
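A map with a different hasher is a different type from the default `HashMap<String, u32>`, so an existing vocab has to be re-collected rather than passed through. The following is a minimal std-only sketch of that conversion. It uses a std `HashMap` with a non-default hasher as a stand-in for ahash's `AHashMap`; the `FastVocab` alias and the `to_fast_vocab` helper are hypothetical names for illustration and are not part of tokenizers or ahash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

/// Stand-in for `AHashMap<String, u32>`: same key and value types, different hasher.
/// (Hypothetical alias, for illustration only.)
type FastVocab = HashMap<String, u32, BuildHasherDefault<DefaultHasher>>;

/// Re-collect a default-hasher vocab into the map type that uses the other hasher.
fn to_fast_vocab(vocab: HashMap<String, u32>) -> FastVocab {
    vocab.into_iter().collect()
}

fn main() {
    let mut vocab: HashMap<String, u32> = HashMap::new();
    vocab.insert("hello".to_string(), 0);
    vocab.insert("world".to_string(), 1);

    // The entries are moved, not cloned; only the hashing strategy changes.
    let fast = to_fast_vocab(vocab);
    assert_eq!(fast.get("world"), Some(&1));
    println!("converted {} entries", fast.len());
}
```

With the real crate the same `into_iter().collect()` pattern is expected to apply, but the exact conversion traits should be checked against the ahash documentation.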

Source changes (required by API changes in tokenizers 0.15+):

  • with_pre_tokenizer, with_decoder, with_post_processor, with_normalizer now take Option<T> instead of T
  • BPE Vocab type changed from re-export to HashMap<String, u32>, requires AHashMap conversion
  • Metaspace decoder/pre-tokenizer: add_prefix_space: bool replaced by prepend_scheme: String + split: bool parameters
  • format!("{}", e) replaced with format!("{e}") (inlined format arguments, available since Rust 1.58)
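The call-site impact of these changes can be sketched with stand-in types. This is an illustration under stated assumptions, not the code from this PR: `Tokenizer` and `Normalizer` below are hypothetical stand-ins rather than the real tokenizers structs, and the mapping of the old `add_prefix_space` flag to the strings `"always"` and `"never"` is an assumption about how the legacy flag corresponds to the new scheme.

```rust
/// Stand-in for a tokenizer pipeline component (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Normalizer(String);

/// Stand-in for the tokenizer; not the real `tokenizers` struct.
#[derive(Debug, Default)]
struct Tokenizer {
    normalizer: Option<Normalizer>,
}

impl Tokenizer {
    /// New-style setter: takes `Option<T>`, so `None` clears the component.
    fn with_normalizer(&mut self, normalizer: Option<Normalizer>) -> &mut Self {
        self.normalizer = normalizer;
        self
    }
}

/// Map the legacy `add_prefix_space` flag onto a prepend-scheme string.
/// Assumption: `true` corresponds to "always" and `false` to "never".
fn legacy_prepend_scheme(add_prefix_space: bool) -> String {
    let scheme = if add_prefix_space { "always" } else { "never" };
    scheme.to_string()
}

/// Error formatting with an inlined format argument.
fn describe_error(e: &dyn std::error::Error) -> String {
    format!("{e}")
}

fn main() {
    let mut tok = Tokenizer::default();

    // Call sites that used to pass `T` now wrap the value in `Some(..)`.
    tok.with_normalizer(Some(Normalizer("lowercase".to_string())));
    assert!(tok.normalizer.is_some());

    // Passing `None` removes the component again.
    tok.with_normalizer(None);
    assert!(tok.normalizer.is_none());

    assert_eq!(legacy_prepend_scheme(true), "always");

    // `format!("{e}")` produces the same text as `format!("{}", e)`.
    let err = "abc".parse::<u32>().unwrap_err();
    assert_eq!(describe_error(&err), format!("{}", err));
    println!("{}", describe_error(&err));
}
```

For existing callers the `Option<T>` change is mostly mechanical: each argument gets wrapped in `Some(..)`, and `None` becomes available for clearing a component.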

Verification

  • Built and tested on aarch64-unknown-linux-gnu with Rust 1.94.0
  • API surface (encode, getOffsets, getTokens, getIds, decode) verified working with nomic-embed-text-v1.5 tokenizer

Context

The source changes follow the approach used by @inference-net/tokenizers (which bumped to 0.21.4), combined with the platform targets from v0.5.0. This PR goes one step further, to 0.22.2.

MayCXC and others added 2 commits March 19, 2026 07:15
Updates the Rust tokenizers dependency to pick up two years of
upstream improvements. Adds ahash dependency required by newer
crate versions. Source changes adapt to API changes in the
with_* methods (now take Option<T>), BPE vocab type (AHashMap),
and metaspace prepend_scheme parameter.

Verified building on aarch64-unknown-linux-gnu with Rust 1.94.0.
Signed-off-by: Anush008 <mail@anush.sh>
Owner

@Anush008 Anush008 left a comment

Thanks for taking the time to contribute @MayCXC
LGTM!

@Anush008 Anush008 merged commit 7e02f44 into Anush008:main Mar 19, 2026
26 checks passed
@Anush008
Owner

Now available with v0.6.0.

https://github.com/Anush008/tokenizers/releases/tag/v0.6.0
