fix: Incorrect added token can cause issues when adding token as multiword token #319

Merged
stephantul merged 4 commits into main from fix-tokenizer-againnnnn on Apr 14, 2026
@stephantul
Contributor

In the current version of model2vec, it is possible to attempt to add a multiword token as an added token even though that token is already present in the vocabulary. This happens under the following circumstances:

  1. The tokenizer's vocabulary contains an incorrect token: one that is pre-tokenized into multiple subwords.
  2. We attempt to add precisely that token as a multiword unit.
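
To make the first condition concrete, here is a minimal sketch of how such "incorrect" vocabulary entries could be spotted. It is not model2vec's code: `find_split_vocab_tokens` and `whitespace_pre_tokenize` are hypothetical names, and a plain whitespace split stands in for whatever pre-tokenizer the real tokenizer uses.

```python
from typing import Callable, Iterable


def whitespace_pre_tokenize(text: str) -> list[str]:
    """Stand-in pre-tokenizer (an assumption): split on whitespace only."""
    return text.split()


def find_split_vocab_tokens(
    vocab: Iterable[str],
    pre_tokenize: Callable[[str], list[str]] = whitespace_pre_tokenize,
) -> list[str]:
    """Return vocabulary entries that the pre-tokenizer breaks into several pieces."""
    return [token for token in vocab if len(pre_tokenize(token)) > 1]


# Toy vocabulary: "new york" is stored as one entry but pre-tokenizes into two pieces.
vocab = ["hello", "world", "new york", "##ing"]
print(find_split_vocab_tokens(vocab))  # ['new york']
```

An entry flagged this way can never be produced by normal tokenization, which is why adding it again as a multiword unit is the problematic case.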

The logic bug is as follows: we don't check whether multiword tokens are already present in the original tokenizer's vocabulary, so we try to add the same token twice. The assumption behind skipping this check is normally sound, because a tokenizer's vocabulary normally can't contain tokens that are preprocessed into multiple tokens.
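
The missing check can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `split_new_and_existing` is a hypothetical helper, and the vocabulary is modelled as a plain collection of strings rather than a real tokenizer object.

```python
from typing import Iterable


def split_new_and_existing(
    candidates: Iterable[str], vocab: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Separate multiword candidates into tokens to add and tokens to skip.

    A candidate is skipped if it is already in the vocabulary or if it was
    already queued earlier in the same call, so nothing is added twice.
    """
    existing = set(vocab)
    queued: set[str] = set()
    to_add: list[str] = []
    skipped: list[str] = []
    for token in candidates:
        if token in existing or token in queued:
            skipped.append(token)
        else:
            queued.add(token)
            to_add.append(token)
    return to_add, skipped


vocab = ["hello", "new york"]
candidates = ["new york", "machine learning", "machine learning"]
to_add, skipped = split_new_and_existing(candidates, vocab)
print(to_add)   # ['machine learning']
print(skipped)  # ['new york', 'machine learning']
```

Checking membership before adding keeps the operation idempotent even when the vocabulary contains an entry that should not have been there in the first place.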

@stephantul stephantul requested a review from Pringled April 14, 2026 13:51
@codecov

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ |
|---|---|
| model2vec/distill/distillation.py | 90.00% <100.00%> (-0.15%) ⬇️ |
| model2vec/tokenizer/tokenizer.py | 100.00% <100.00%> (ø) |
@stephantul stephantul merged commit 4f02616 into main Apr 14, 2026
9 checks passed
@stephantul stephantul deleted the fix-tokenizer-againnnnn branch April 14, 2026 17:17