
Tokenizer remove max length flag #152

Merged: 11 commits from tokenizer_remove_max_length_flag into main on Jun 14, 2024
Conversation

@le1nux (Member) commented Jun 11, 2024

By default, we no longer specify the tokenizer's max_length, and truncation and padding are now set to False.
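The effect of the new defaults can be sketched with a minimal, self-contained stand-in (the class name, constructor arguments, and toy tokenizer below are hypothetical illustrations, not the actual code in tokenizer_wrapper.py):

```python
from typing import List


class PreTrainedTokenizerWrapper:
    """Hypothetical sketch of the wrapper's new defaults: no max_length,
    truncation and padding disabled unless explicitly requested."""

    def __init__(self, tokenizer, max_length=None, truncation=False, padding=False):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.truncation = truncation
        self.padding = padding

    def tokenize(self, text: str) -> List[int]:
        token_ids = self.tokenizer(text)
        if self.truncation and self.max_length is not None:
            token_ids = token_ids[: self.max_length]
        if self.padding and self.max_length is not None:
            token_ids = token_ids + [0] * (self.max_length - len(token_ids))
        return token_ids


# Toy tokenizer for illustration: one id per whitespace-separated word.
def toy(text):
    return list(range(len(text.split())))


wrapper = PreTrainedTokenizerWrapper(toy)  # new defaults: nothing cut or padded
print(wrapper.tokenize("a b c d e"))  # full 5-token sequence
```

With the old behavior, passing a max_length together with truncation would silently cut every document to that length; the new defaults return the full token sequence unless the caller opts in.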

@le1nux self-assigned this Jun 11, 2024
@le1nux added the labels bug (Something isn't working) and enhancement (New feature or request) Jun 11, 2024
@le1nux requested review from mali-git and flxst June 11, 2024 17:02
@flxst (Member) left a comment

The changes look good to me! I only found a single typing issue.

However, I think that the unit tests (test_hf_tokenize) do not suffice. Maybe it would be a good idea to explicitly check the tokenization of a multi-document example, and vary parameters like truncation, padding and max_length?
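The suggested parameter sweep could be sketched roughly as follows (a hedged, stdlib-only illustration with a toy tokenizer function; the real tests in tests/test_tokenization.py would exercise the actual wrapper):

```python
from itertools import product


# Toy stand-in for a real tokenizer: one id per whitespace-separated word.
def toy_tokenize(text, max_length=None, truncation=False, padding=False):
    ids = list(range(len(text.split())))
    if truncation and max_length is not None:
        ids = ids[:max_length]
    if padding and max_length is not None:
        ids += [0] * (max_length - len(ids))
    return ids


# Exercise every combination of the three parameters, as suggested above.
for max_length, truncation, padding in product([None, 3], [False, True], [False, True]):
    ids = toy_tokenize("a b c d e", max_length, truncation, padding)
    if truncation and max_length is not None:
        assert len(ids) <= max_length  # truncation caps the length
    elif padding and max_length is not None:
        assert len(ids) >= max_length  # padding fills up to max_length
    else:
        assert len(ids) == 5  # defaults: sequence untouched
```

In a real test suite the same grid would typically be expressed with `pytest.mark.parametrize` so each combination reports as a separate test case.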

src/modalities/tokenization/tokenizer_wrapper.py (review comment, outdated; resolved)
@le1nux (Member, Author) commented Jun 12, 2024

I added more test cases covering the relevant combinations of max_length, padding, and truncation for the single-document case.
The multi-document case is currently not supported and has not been needed so far; see def tokenize(self, text: str) -> List[int]:
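The single-document constraint follows directly from the quoted signature: one string in, one flat id sequence out. A minimal sketch (placeholder tokenization, not the real wrapper) shows why multiple documents must currently be tokenized one at a time:

```python
from typing import List


class TokenizerWrapper:
    """Sketch only: the quoted interface accepts one document per call."""

    def tokenize(self, text: str) -> List[int]:
        # One string in, one flat id sequence out -- there is no batch
        # dimension, so multi-document input needs an outer loop.
        return [ord(c) for c in text]  # placeholder tokenization


docs = ["first doc", "second doc"]
per_doc_ids = [TokenizerWrapper().tokenize(d) for d in docs]  # per-document loop
```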

@le1nux requested a review from flxst June 12, 2024 17:36
@le1nux changed the base branch from main to dev_experiments June 13, 2024 17:57
@le1nux changed the base branch from dev_experiments to main June 13, 2024 17:58
@flxst (Member) left a comment

Nice work! Looks much better than before. I left a few minor comments and suggestions for improvement.

tests/test_tokenization.py (6 review comments, outdated; resolved)
tests/test_tokenization.py (1 review comment; resolved)
src/modalities/tokenization/tokenizer_wrapper.py (review comment, outdated; resolved)
le1nux and others added 5 commits June 14, 2024 14:13
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
Co-authored-by: Felix Stollenwerk <felix.stollenwerk@ai.se>
@le1nux merged commit ed3fb62 into main Jun 14, 2024
@le1nux deleted the tokenizer_remove_max_length_flag branch June 14, 2024 12:27