
proof of concept for using dataset of test cases for tokenizer tests #37994


Open · wants to merge 5 commits into main from test_fast_only_refactor

Conversation

@itazap (Collaborator) commented May 7, 2025

To refactor the slow == fast comparisons, we need to 'freeze' the current working behavior of the tokenizers. We move the test strings to the hub (https://huggingface.co/datasets/hf-internal-testing/tokenization_test_data), along with the expected tokens, encoded_ids, and encoded_ids with special tokens. I gathered the test strings for all SentencePiece-based models, ran each of those models on them, and uploaded the results to the dataset.

This allows us to

  1. preserve the expected test behavior without relying on comparing fast == slow
  2. reduce the overriding of tests per class
  3. add future tests here!

Note: I will convert the other slow == fast cases as well! This is a POC for test_rust_and_python_full_tokenizers to get some feedback :). Then we can easily migrate tests like test_tokenization_python_rust_equals and many others.

More ideas / considerations

  • need to check whether this slows down running these tests
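
As a rough illustration, a fast-only test consuming the dataset could look something like the sketch below (the split and column names are assumptions, not necessarily what the uploaded dataset uses):

```python
# Rough sketch only; the dataset's actual split and column names may differ.
from datasets import load_dataset
from transformers import AutoTokenizer

def check_against_frozen_outputs(model_id: str):
    # One record per (model, input string) with the frozen expected outputs.
    records = load_dataset(
        "hf-internal-testing/tokenization_test_data", split="train"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    for record in records:
        if record["model"] != model_id:
            continue
        text = record["input_text"]
        # Compare the fast tokenizer against the frozen values
        # instead of against a slow tokenizer.
        assert tokenizer.tokenize(text) == record["tokens"]
        assert tokenizer.encode(text, add_special_tokens=False) == record["encoded_ids"]
        assert tokenizer.encode(text, add_special_tokens=True) == record["encoded_ids_with_special_tokens"]
```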

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap marked this pull request as ready for review May 8, 2025 14:05
@github-actions github-actions bot requested a review from ydshieh May 8, 2025 14:06
@itazap itazap force-pushed the test_fast_only_refactor branch from 19a0594 to 6cd261d on May 8, 2025 14:09
@ydshieh (Collaborator) commented May 13, 2025

Hi @itazap

To refactor comparing slow == fast, we need to 'freeze' current working behavior of tokenizers.

Could you elaborate on what refactoring you plan to achieve, and explain why this approach is "needed" for it?

preserve testing expected behavior without relying on comparing fast == slow

There is still test_rust_and_python_full_tokenizers in the common file. So IIRC, we no longer run it with the slow tokenizer, but just the fast one against its expected values?

reduce overwriting of tests per class

This is very nice.

future tests can be added here!

I am not in favor of putting the inputs and expected values in a hub repository like this. It adds friction to working on the tests (writing them, and maintaining them through debugging and updates). It also prevents the community from contributing directly, IIRC.

I would be happy if these test values were kept in each test file, and the tests in tests/test_tokenization_common.py used them from there. I think this is still doable and would still achieve what you are trying to do.
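
A minimal sketch of what I mean (class, attribute, and value names below are purely illustrative, not the real suite):

```python
# Hypothetical sketch: expected values kept in each model's test file,
# consumed by a shared test in tests/test_tokenization_common.py.
import unittest

# In a model-specific test file, e.g. tests/models/<model>/test_tokenization_<model>.py:
class SomeModelTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    expected_tokenization_cases = [
        {
            "text": "Hello world!",
            "tokens": ["▁Hello", "▁world", "!"],  # placeholder values
            "encoded_ids": [35378, 8999, 38],     # placeholder values
        },
    ]

# In tests/test_tokenization_common.py:
class TokenizerTesterMixin:
    expected_tokenization_cases = []

    def test_fast_tokenizer_expected_values(self):
        tokenizer = self.get_rust_tokenizer()
        for case in self.expected_tokenization_cases:
            self.assertEqual(tokenizer.tokenize(case["text"]), case["tokens"])
            self.assertEqual(
                tokenizer.encode(case["text"], add_special_tokens=False),
                case["encoded_ids"],
            )
```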
