BertTokenizer may not be the optimal choice for conversion #10

MarkusSagen · 2022-03-12T20:37:06Z

TensorFlow supports two (or three) different types of WordPiece tokenizers.
It could be worth testing the FastWordPiece tokenizer, since it can build the model directly from a vocab and is claimed to be faster.

But it will likely also require a bit more setup (https://www.tensorflow.org/text/guide/subwords_tokenizer#overview), as WordPiece only seems to split words, whereas the BertTokenizer splits sentences.
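To make the word-versus-sentence distinction concrete, here is a minimal pure-Python sketch. It is not the tensorflow_text implementation: `wordpiece` does greedy longest-match-first splitting of a single word, and `bert_style_tokenize` is a hypothetical helper that adds the lowercasing and whitespace/punctuation pre-splitting a sentence would need first. The toy vocabulary is invented for illustration.

```python
import re

# Toy vocabulary, invented for illustration only.
VOCAB = {"[UNK]", "token", "##izer", "##s", "split", "words", ",", "."}


def wordpiece(word, vocab=VOCAB, unk="[UNK]", suffix="##"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = suffix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def bert_style_tokenize(sentence, vocab=VOCAB):
    """Hypothetical helper: pre-split a sentence into words, then apply WordPiece."""
    words = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return [piece for word in words for piece in wordpiece(word, vocab)]


word_level = wordpiece("tokenizers")
whole_sentence_as_word = wordpiece("tokenizers split words.")
sentence_level = bert_style_tokenize("Tokenizers split words.")
```

In this sketch, feeding a whole sentence to the word-level function collapses it to the unknown token, which is the kind of extra pre-splitting setup a word-level tokenizer would need before it could replace BertTokenizer.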

Goal

Compare the different tokenizers and check whether they yield the same results
Check whether the new tokenizer can be saved as a reusable SavedModel
Test whether the models that previously failed now work (see "Tokenizers do not convert tokens correctly", #4)
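The first goal could be approached with a small comparison harness. The following is only a sketch under assumptions: `compare_tokenizers` is a hypothetical helper, and the two stand-in tokenizers are plain Python functions rather than the real BertTokenizer or FastWordPiece tokenizer, whose outputs would first have to be converted to Python lists.

```python
def compare_tokenizers(tokenize_a, tokenize_b, sentences):
    """Return (sentence, output_a, output_b) for each sentence where outputs differ."""
    mismatches = []
    for sentence in sentences:
        out_a = list(tokenize_a(sentence))
        out_b = list(tokenize_b(sentence))
        if out_a != out_b:
            mismatches.append((sentence, out_a, out_b))
    return mismatches


# Stand-in tokenizers (assumptions for illustration, not real library calls).
def whitespace_tokenize(sentence):
    return sentence.split()


def lowercase_whitespace_tokenize(sentence):
    return sentence.lower().split()


sentences = ["hello world", "Hello World"]
mismatches = compare_tokenizers(
    whitespace_tokenize, lowercase_whitespace_tokenize, sentences
)

for sentence, out_a, out_b in mismatches:
    print(f"{sentence!r}: {out_a} != {out_b}")
```

Running the same harness over the inputs that failed in #4 would show whether a replacement tokenizer changes the outcome for those cases.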

MarkusSagen added the bug and enhancement labels on Mar 12, 2022
