fix: Incorrect added token can cause issues when adding token as multiword token by stephantul · Pull Request #319 · MinishLab/model2vec

stephantul · 2026-04-14T13:47:47Z

In the current version of model2vec, it is possible to attempt to add a multiword token even though that token is already present in the vocabulary. This happens when both of the following hold:

1. The tokenizer's vocabulary contains an incorrect token, i.e. a token that is pre-tokenized into multiple subwords.
2. We attempt to add precisely that token as a multiword unit.

The logic bug is as follows: we don't check whether multiword tokens are already present in the original tokenizer's vocabulary, so we try to add the same token twice. Skipping that check rests on an assumption that is normally sound, because a tokenizer's vocabulary usually can't contain tokens that are split into multiple tokens during preprocessing.
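The missing check can be illustrated with a minimal, self-contained sketch. This is not the code from this PR or model2vec's actual API: `split_new_tokens`, its parameters, and the use of `str.split` as a stand-in pre-tokenizer are all hypothetical names chosen for illustration.

```python
from typing import Callable


def split_new_tokens(
    vocabulary: set[str],
    candidates: list[str],
    pre_tokenize: Callable[[str], list[str]],
) -> tuple[list[str], list[str]]:
    """Sort candidates into single-word and multiword tokens to add.

    Hypothetical helper: any candidate already in the vocabulary is skipped,
    including one that the pre-tokenizer would split into several pieces.
    """
    seen = set(vocabulary)
    single: list[str] = []
    multi: list[str] = []
    for token in candidates:
        # The check the bug was missing: never re-add a token that exists,
        # regardless of whether it is a single word or a multiword unit.
        if token in seen:
            continue
        seen.add(token)
        if len(pre_tokenize(token)) > 1:
            multi.append(token)
        else:
            single.append(token)
    return single, multi


# "new york" is the "incorrect" vocabulary entry: whitespace pre-tokenization
# splits it into two pieces, yet it is already present in the vocabulary.
vocab = {"hello", "world", "new york"}
single, multi = split_new_tokens(
    vocab, ["new york", "san francisco", "cat"], str.split
)
```

With the membership check in place, "new york" is skipped instead of being queued as a multiword token a second time, while "san francisco" is still added as a multiword unit.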

codecov · 2026-04-14T13:54:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| model2vec/distill/distillation.py | `90.00% <100.00%> (-0.15%)` | ⬇️ |
| model2vec/tokenizer/tokenizer.py | `100.00% <100.00%> (ø)` | |

stephantul added 2 commits April 14, 2026 15:43

fix tokenizer

6ca7931

fix issue with reassignment

20d74ce

stephantul requested a review from Pringled April 14, 2026 13:51

stephantul added 2 commits April 14, 2026 16:00

update lock

3920625

fix issue with import

c9cf4b5

Pringled approved these changes Apr 14, 2026

stephantul merged commit 4f02616 into main Apr 14, 2026
9 checks passed

stephantul deleted the fix-tokenizer-againnnnn branch April 14, 2026 17:17

stephantul merged 4 commits into main from fix-tokenizer-againnnnn
