
Tokenizer's model_max_length is not consistent #3

Open
MarkusSagen opened this issue Mar 12, 2022 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@MarkusSagen
Collaborator

MarkusSagen commented Mar 12, 2022

Most tokenizers define their maximum model length as 510 tokens or more, based on:

  • Model max token length minus the number of special tokens needed to frame a sentence (start and end), e.g. 512 − 2 = 510

Example

Most tokenizers follow this convention, but some have an effectively unbounded length, with tokenizer.model_max_length=1000000000000000019884624838656

This means that when converting the tokenizer max length in TensorFlow, most values are assumed to fit in an int; but with an effectively unbounded model length, the value needs a tf.long or larger type for the conversion not to fail.


Initially, the tokenizer's model_max_length was set dynamically, but it is now hardcoded to 510 tokens. This should be changed to reflect each tokenizer's actual value.
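A minimal sketch of one possible fix (the helper name and fallback choice are hypothetical, not from this repo): read the dynamic value from the tokenizer, but clamp the "effectively unbounded" sentinel to a value that fits in a 32-bit int, which is what the conversion appears to assume.

```python
# Hypothetical helper: clamp "effectively unbounded" tokenizer lengths
# to a value that fits in an int32, falling back to the 510-token default.

INT32_MAX = 2**31 - 1  # 2147483647

def effective_max_length(model_max_length: int, fallback: int = 510) -> int:
    """Return model_max_length if it fits in an int32, else a safe fallback."""
    if model_max_length > INT32_MAX:
        return fallback
    return model_max_length

# The sentinel value reported above overflows int32 (and even int64),
# so it would be replaced by the fallback:
print(effective_max_length(1000000000000000019884624838656))  # 510
print(effective_max_length(512))                              # 512
```

This keeps the dynamic behaviour for well-behaved tokenizers while avoiding the overflow for the unbounded ones.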

@MarkusSagen MarkusSagen self-assigned this Mar 12, 2022
@MarkusSagen MarkusSagen added the bug Something isn't working label Mar 12, 2022
@MarkusSagen MarkusSagen changed the title [Bug] Tokenizer's model_max_length is not consistent Tokenizer's model_max_length is not consistent Mar 12, 2022