
Word piece tokenizer never exits if a sub-word token doesn't exist #26

matteocontrini opened this issue Aug 1, 2023 · 1 comment

@matteocontrini

The following code:

var res = vocabulary.Tokenize("point™");

never returns if a character (here, ™) cannot be matched in the vocabulary.

The issue was introduced in this commit:

0f29cef#diff-82215a359c504385d48356d59d6635f3b968278cca935c73977e16cea13f4174

Specifically this line:

while (subwordLength >= 1) // was initially 2, which prevents using "character encoding"

Changing it back to 2 fixes it. Otherwise the code keeps looping, adding and removing "##" prefixes, and never terminates.
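For reference, a minimal Python sketch of greedy longest-match-first WordPiece tokenization, the behavior the fix restores. The vocabulary and the `[UNK]` token name here are illustrative assumptions, not taken from the library; the point is the explicit bail-out when no subword matches, which is what prevents the infinite loop:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece (illustrative sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:  # try the longest remaining prefix first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No subword matches (e.g. '™' is not in the vocabulary):
            # emit [UNK] for the whole word and stop. Without this exit,
            # the loop would never make progress.
            return [unk]
        tokens.append(match)
        start = end
    return tokens

vocab = {"point", "##s"}
print(wordpiece("points", vocab))   # ['point', '##s']
print(wordpiece("point™", vocab))   # ['[UNK]']
```

The key invariant is that each iteration either consumes at least one character or exits; the buggy version could do neither.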

Plus, I think the following two lines might cause additional issues:

var regex = new Regex(prefix);
remaining = regex.Replace(remaining, "##", 1);

The matched token shouldn't be treated as a regular expression: if it contains regex metacharacters (such as . or +), the replacement will fail or match the wrong text. Those two lines can be replaced by:

remaining = "##" + remaining[prefix.Length..];

This is also most likely much more efficient and more closely resembles Google's original tokenizer implementation.
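To illustrate the pitfall, a small Python sketch (the token `c++` is a hypothetical example, chosen only because it contains regex metacharacters): compiling the matched token as a pattern can blow up outright, while slicing by length never interprets the token at all.

```python
import re

prefix = "c++"       # hypothetical matched token with regex metacharacters
remaining = "c++11"

# Regex approach: 'c++' is an invalid pattern ('+' applied to a
# quantifier), so compiling it raises re.error.
try:
    re.sub(re.compile(prefix), "##", remaining, count=1)
except re.error as e:
    print("regex failed:", e)

# Slice approach (the suggested fix): drop the matched prefix by
# length and prepend "##" -- no pattern interpretation involved.
remaining = "##" + remaining[len(prefix):]
print(remaining)  # ##11
```

Even for tokens that happen to form valid patterns, the regex version can match at the wrong position or the wrong characters, so the slice is both safer and faster.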

@georg-jung

Inspired by this library (thanks for the great work @NMZivkovic!), I built FastBertTokenizer (nuget). It shouldn't suffer from this issue. If you give it a try and stumble upon an issue, please let me know.
