proof of concept for using dataset of test cases for tokenizer tests #37994
To refactor the slow == fast comparisons, we need to 'freeze' the current working behavior of the tokenizers. We move the test strings to the hub (https://huggingface.co/datasets/hf-internal-testing/tokenization_test_data), together with the expected tokens, encoded ids, and encoded ids with special tokens. I collected the test strings for all SentencePiece-based models, ran all sp-based models on them, and then uploaded the results to the dataset.
This allows us to test tokenizers against the frozen expected outputs instead of re-running the slow tokenizer at test time.
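A minimal sketch of what such a test could look like. The column names (`tokenizer_id`, `input_text`, `tokens`, `encoded_ids`, `encoded_ids_special`) and the split name are assumptions for illustration, not the actual schema of the hub dataset:

```python
# Hypothetical sketch: compare a fast tokenizer against frozen expectations stored on the hub.
# Column and split names below are assumed; the real dataset schema may differ.
from datasets import load_dataset
from transformers import AutoTokenizer


def check_against_frozen_cases(checkpoint: str):
    ds = load_dataset("hf-internal-testing/tokenization_test_data", split="train")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    for case in ds:
        if case["tokenizer_id"] != checkpoint:  # assumed column
            continue
        text = case["input_text"]  # assumed column
        # Compare against the frozen outputs instead of a slow tokenizer run.
        assert tokenizer.tokenize(text) == case["tokens"]
        assert tokenizer.encode(text, add_special_tokens=False) == case["encoded_ids"]
        assert tokenizer.encode(text, add_special_tokens=True) == case["encoded_ids_special"]
```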
*Note: I will convert the other slow == fast cases! This is a POC for `test_rust_and_python_full_tokenizers`, shared to get some feedback :). Then we can easily move tests like `test_tokenization_python_rust_equals` and many others.*

More ideas / considerations