Tokenizer remove max length flag #152
Conversation
The changes look good to me! I only found a single typing issue.
However, I think that the unit tests (`test_hf_tokenize`) do not suffice. Maybe it would be a good idea to explicitly check the tokenization of a multi-document example, and vary parameters like `truncation`, `padding`, and `max_length`?
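A test along these lines could be sketched as follows. This is an illustrative example only, not the project's actual `test_hf_tokenize`: the `tokenize` function below is a toy whitespace stand-in that merely mimics the Hugging Face call signature, so the combinations of `truncation`, `padding`, and `max_length` can be exercised on a multi-document input.

```python
# Illustrative sketch (toy stand-in, not the project's code): a whitespace
# "tokenizer" mimicking the HF-style signature, used to show how truncation,
# padding, and max_length can be varied across a multi-document example.

def tokenize(docs, truncation=False, padding=False, max_length=None, pad_id=0):
    """Encode each document, optionally truncate to max_length and
    pad all sequences to a common length."""
    ids = [[hash(w) % 1000 + 1 for w in d.split()] for d in docs]
    if truncation and max_length is not None:
        ids = [seq[:max_length] for seq in ids]
    if padding:
        target = max_length if max_length is not None else max(len(s) for s in ids)
        ids = [seq + [pad_id] * (target - len(seq)) for seq in ids]
    return ids

docs = ["a b c d e", "a b"]

# truncation + padding to a fixed max_length: all sequences end up equal
out = tokenize(docs, truncation=True, padding=True, max_length=3)
assert all(len(seq) == 3 for seq in out)

# no truncation, no padding: lengths follow the documents
out = tokenize(docs)
assert [len(seq) for seq in out] == [5, 2]
```

A real test would call the project's tokenizer instead of the stand-in, ideally parametrized over the same argument combinations.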
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
I added more test cases covering the relevant combinations of these parameters.
Nice work! Looks much better than before. I left a few minor comments and suggestions for improvement.
By default, we no longer specify the tokenizer's `max_length`, and we set `truncation` and `padding` to `False`.
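The effect of these defaults can be sketched with a small toy encoder (an assumption for illustration, not the project's code): with `truncation` and `padding` disabled and no `max_length` given, each document keeps its natural length.

```python
# Sketch of the new default behavior (toy stand-in, not the project's code):
# with no max_length and truncation/padding disabled, nothing is cut
# and no pad tokens are appended.

def encode(docs, truncation=False, padding=False, max_length=None):
    # Toy "token ids": one id per whitespace-separated word.
    ids = [[i + 1 for i, _ in enumerate(d.split())] for d in docs]
    # With the new defaults, neither branch below runs.
    if truncation and max_length is not None:
        ids = [seq[:max_length] for seq in ids]
    if padding:
        target = max_length or max(len(s) for s in ids)
        ids = [seq + [0] * (target - len(seq)) for seq in ids]
    return ids

out = encode(["one two three", "four"])
assert [len(seq) for seq in out] == [3, 1]  # natural lengths preserved
assert all(0 not in seq for seq in out)     # no pad ids introduced
```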