
Word piece tokenizer never exits if a sub-word token doesn't exist #26

matteocontrini opened this issue Aug 1, 2023 · 1 comment

@matteocontrini

The following code:

var res = vocabulary.Tokenize("point™");

never returns if a character (here, ™) cannot be matched in the vocabulary.

The issue was introduced in this commit:

0f29cef#diff-82215a359c504385d48356d59d6635f3b968278cca935c73977e16cea13f4174

Specifically this line:

while (subwordLength >= 1) // was initially 2, which prevents using "character encoding"

Changing it back to 2 fixes it. Otherwise the code keeps looping, adding and removing "##" prefixes, and never terminates.
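For reference, a minimal Python sketch of greedy longest-match-first WordPiece tokenization, the behavior the fix restores. The vocabulary and the `[UNK]` token name here are illustrative assumptions, not taken from the library; the point is the explicit bail-out when no subword matches, which is what prevents the infinite loop:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece (illustrative sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:  # try the longest remaining prefix first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No subword matches (e.g. '™' is not in the vocabulary):
            # emit [UNK] for the whole word and stop. Without this exit,
            # the loop would never make progress.
            return [unk]
        tokens.append(match)
        start = end
    return tokens

vocab = {"point", "##s"}
print(wordpiece("points", vocab))   # ['point', '##s']
print(wordpiece("point™", vocab))   # ['[UNK]']
```

The key invariant is that each iteration either consumes at least one character or exits; the buggy version could do neither.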

Plus, I think the following two lines might cause additional issues:

var regex = new Regex(prefix);
remaining = regex.Replace(remaining, "##", 1);

The matched token shouldn't be treated as a regular expression: if it contains regex metacharacters (such as . or +), the replacement will fail or match the wrong text. Those two lines can be replaced by:

remaining = "##" + remaining[prefix.Length..];

This is also most likely much more efficient and more closely resembles Google's original tokenizer implementation.
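To illustrate the pitfall, a small Python sketch (the token `c++` is a hypothetical example, chosen only because it contains regex metacharacters): compiling the matched token as a pattern can blow up outright, while slicing by length never interprets the token at all.

```python
import re

prefix = "c++"       # hypothetical matched token with regex metacharacters
remaining = "c++11"

# Regex approach: 'c++' is an invalid pattern ('+' applied to a
# quantifier), so compiling it raises re.error.
try:
    re.sub(re.compile(prefix), "##", remaining, count=1)
except re.error as e:
    print("regex failed:", e)

# Slice approach (the suggested fix): drop the matched prefix by
# length and prepend "##" -- no pattern interpretation involved.
remaining = "##" + remaining[len(prefix):]
print(remaining)  # ##11
```

Even for tokens that happen to form valid patterns, the regex version can match at the wrong position or the wrong characters, so the slice is both safer and faster.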

@georg-jung

Inspired by this library (thanks for the great work @NMZivkovic!), I built FastBertTokenizer (nuget). It shouldn't suffer from this issue. If you give it a try and stumble upon an issue, please let me know.
