
Tokenizer remove max length flag #152

Merged: 11 commits from tokenizer_remove_max_length_flag into main on Jun 14, 2024
Conversation

@le1nux (Member) commented Jun 11, 2024

By default, we no longer specify the tokenizer's max_length, and truncation and padding are now set to False.
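The effect of the new defaults can be sketched with a minimal, self-contained stand-in (the class name, constructor arguments, and toy tokenizer below are hypothetical illustrations, not the actual code in tokenizer_wrapper.py):

```python
from typing import List


class PreTrainedTokenizerWrapper:
    """Hypothetical sketch of the wrapper's new defaults: no max_length,
    truncation and padding disabled unless explicitly requested."""

    def __init__(self, tokenizer, max_length=None, truncation=False, padding=False):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.truncation = truncation
        self.padding = padding

    def tokenize(self, text: str) -> List[int]:
        token_ids = self.tokenizer(text)
        if self.truncation and self.max_length is not None:
            token_ids = token_ids[: self.max_length]
        if self.padding and self.max_length is not None:
            token_ids = token_ids + [0] * (self.max_length - len(token_ids))
        return token_ids


# Toy tokenizer for illustration: one id per whitespace-separated word.
def toy(text):
    return list(range(len(text.split())))


wrapper = PreTrainedTokenizerWrapper(toy)  # new defaults: nothing cut or padded
print(wrapper.tokenize("a b c d e"))  # full 5-token sequence
```

With the old behavior, passing a max_length together with truncation would silently cut every document to that length; the new defaults return the full token sequence unless the caller opts in.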

@le1nux self-assigned this Jun 11, 2024
@le1nux added the labels bug (Something isn't working) and enhancement (New feature or request) Jun 11, 2024
@le1nux requested review from mali-git and flxst June 11, 2024 17:02
@flxst (Member) left a comment

The changes look good to me! I only found a single typing issue.

However, I think that the unit tests (test_hf_tokenize) do not suffice. Maybe it would be a good idea to explicitly check the tokenization of a multi-document example, and vary parameters like truncation, padding and max_length?
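The suggested parameter sweep could be sketched roughly as follows (a hedged, stdlib-only illustration with a toy tokenizer function; the real tests in tests/test_tokenization.py would exercise the actual wrapper):

```python
from itertools import product


# Toy stand-in for a real tokenizer: one id per whitespace-separated word.
def toy_tokenize(text, max_length=None, truncation=False, padding=False):
    ids = list(range(len(text.split())))
    if truncation and max_length is not None:
        ids = ids[:max_length]
    if padding and max_length is not None:
        ids += [0] * (max_length - len(ids))
    return ids


# Exercise every combination of the three parameters, as suggested above.
for max_length, truncation, padding in product([None, 3], [False, True], [False, True]):
    ids = toy_tokenize("a b c d e", max_length, truncation, padding)
    if truncation and max_length is not None:
        assert len(ids) <= max_length  # truncation caps the length
    elif padding and max_length is not None:
        assert len(ids) >= max_length  # padding fills up to max_length
    else:
        assert len(ids) == 5  # defaults: sequence untouched
```

In a real test suite the same grid would typically be expressed with `pytest.mark.parametrize` so each combination reports as a separate test case.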

src/modalities/tokenization/tokenizer_wrapper.py (review comment, outdated; resolved)
@le1nux (Member, Author) commented Jun 12, 2024

I added more test cases covering the relevant combinations of max_length, padding, and truncation for the single-document case.
The multi-document case is currently not supported and has not been needed so far; see def tokenize(self, text: str) -> List[int]:
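The single-document constraint follows directly from the quoted signature: one string in, one flat id sequence out. A minimal sketch (placeholder tokenization, not the real wrapper) shows why multiple documents must currently be tokenized one at a time:

```python
from typing import List


class TokenizerWrapper:
    """Sketch only: the quoted interface accepts one document per call."""

    def tokenize(self, text: str) -> List[int]:
        # One string in, one flat id sequence out -- there is no batch
        # dimension, so multi-document input needs an outer loop.
        return [ord(c) for c in text]  # placeholder tokenization


docs = ["first doc", "second doc"]
per_doc_ids = [TokenizerWrapper().tokenize(d) for d in docs]  # per-document loop
```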

@le1nux requested a review from flxst June 12, 2024 17:36
@le1nux changed the base branch from main to dev_experiments June 13, 2024 17:57
@le1nux changed the base branch from dev_experiments to main June 13, 2024 17:58
@flxst (Member) left a comment

Nice work! Looks much better than before. I left a few minor comments and suggestions for improvement.

tests/test_tokenization.py (6 review comments, outdated; resolved)
tests/test_tokenization.py (1 review comment; resolved)
src/modalities/tokenization/tokenizer_wrapper.py (review comment, outdated; resolved)
le1nux and others added 5 commits June 14, 2024 14:13
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
@le1nux merged commit ed3fb62 into main Jun 14, 2024
@le1nux deleted the tokenizer_remove_max_length_flag branch June 14, 2024 12:27