proof of concept for using dataset of test cases for tokenizer tests #37994
To refactor the slow == fast comparisons, we need to 'freeze' the current working behavior of the tokenizers. We move the test strings to the hub (https://huggingface.co/datasets/hf-internal-testing/tokenization_test_data), together with the expected tokens, encoded ids, and encoded ids with special tokens. I collected the test strings for all SentencePiece-based models, ran all sp-based models on them, and then uploaded the results to the dataset.
This allows us to test tokenizers against the frozen expected outputs instead of re-running the slow tokenizer at test time.
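A minimal sketch of what such a test could look like. The column names (`tokenizer_id`, `input_text`, `tokens`, `encoded_ids`, `encoded_ids_special`) and the split name are assumptions for illustration, not the actual schema of the hub dataset:

```python
# Hypothetical sketch: compare a fast tokenizer against frozen expectations stored on the hub.
# Column and split names below are assumed; the real dataset schema may differ.
from datasets import load_dataset
from transformers import AutoTokenizer


def check_against_frozen_cases(checkpoint: str):
    ds = load_dataset("hf-internal-testing/tokenization_test_data", split="train")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    for case in ds:
        if case["tokenizer_id"] != checkpoint:  # assumed column
            continue
        text = case["input_text"]  # assumed column
        # Compare against the frozen outputs instead of a slow tokenizer run.
        assert tokenizer.tokenize(text) == case["tokens"]
        assert tokenizer.encode(text, add_special_tokens=False) == case["encoded_ids"]
        assert tokenizer.encode(text, add_special_tokens=True) == case["encoded_ids_special"]
```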
*Note: I will convert the other slow == fast cases! This is a POC for `test_rust_and_python_full_tokenizers`, shared to get some feedback :). Then we can easily move tests like `test_tokenization_python_rust_equals` and many others.*

More ideas / considerations