
Fix loading tokenizers from s3 #71

Merged
meilame-tayebjee merged 2 commits into main from dev on Feb 4, 2026

Conversation

@meilame-tayebjee
Member

This pull request introduces improvements to the tokenizer loading and saving logic, particularly around loading from S3 and configuration-based instantiation. The changes add a unified load_from_s3 interface to the tokenizer base class, refactor the S3 loading implementation for HuggingFace and NGram tokenizers, and introduce a build_from_config method for NGram tokenizers to streamline instantiation from configuration files.

Tokenizer loading improvements:

  • Added an abstract load_from_s3 class method to BaseTokenizer to enforce a consistent S3 loading interface across all tokenizer subclasses.
  • Refactored the HuggingFaceTokenizer.load_from_s3 method to simplify error messages and ensure the correct tokenizer object is instantiated from S3.
  • Implemented load_from_s3 for NGramTokenizer, allowing it to be loaded from S3 using the provided filesystem and configuration, and refactored the loading logic for clarity (a sketch of the interface follows this list).
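
The PR description does not show the code itself, so the following is only a minimal sketch of what the unified interface could look like, assuming the classmethod receives an s3fs-style filesystem object and a configuration dict. The exact signature, argument names, and the AutoTokenizer-based reconstruction are assumptions for illustration, not the repository's actual implementation.

```python
import tempfile
from abc import ABC, abstractmethod

from transformers import AutoTokenizer


class BaseTokenizer(ABC):
    """Shared tokenizer interface; every subclass must know how to load itself from S3."""

    @classmethod
    @abstractmethod
    def load_from_s3(cls, s3_path: str, fs, config: dict) -> "BaseTokenizer":
        """Return a ready-to-use tokenizer rebuilt from artifacts stored at s3_path."""
        ...


class HuggingFaceTokenizer(BaseTokenizer):
    def __init__(self, hf_tokenizer):
        self.tokenizer = hf_tokenizer

    @classmethod
    def load_from_s3(cls, s3_path: str, fs, config: dict) -> "HuggingFaceTokenizer":
        if not fs.exists(s3_path):
            raise FileNotFoundError(f"No tokenizer artifacts found at {s3_path}")
        # Copy the serialized tokenizer files to a local directory, then let
        # HuggingFace rebuild the tokenizer object from them.
        with tempfile.TemporaryDirectory() as local_dir:
            fs.get(s3_path.rstrip("/") + "/", local_dir + "/", recursive=True)
            return cls(AutoTokenizer.from_pretrained(local_dir))
```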

Configuration-based instantiation:

  • Added a build_from_config class method to NGramTokenizer to streamline tokenizer creation from a configuration dictionary, and refactored the existing loading methods to use it (see the sketch after this list).
  • Removed a redundant print statement from the NGram tokenizer's loading logic for cleaner output.
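
Continuing the sketch above, build_from_config could be the single place where a configuration dictionary is turned into constructor arguments, with load_from_s3 reusing it before restoring any serialized state. The configuration keys and the JSON serialization shown here are placeholders; the actual NGramTokenizer fields are not listed in this PR description.

```python
import json


class NGramTokenizer(BaseTokenizer):
    def __init__(self, min_n: int, max_n: int, num_tokens: int):
        # Placeholder constructor arguments; the real tokenizer has its own set.
        self.min_n = min_n
        self.max_n = max_n
        self.num_tokens = num_tokens

    @classmethod
    def build_from_config(cls, config: dict) -> "NGramTokenizer":
        # Single place where configuration keys map to constructor arguments,
        # shared by local loading and S3 loading alike.
        return cls(
            min_n=config["min_n"],
            max_n=config["max_n"],
            num_tokens=config["num_tokens"],
        )

    @classmethod
    def load_from_s3(cls, s3_path: str, fs, config: dict) -> "NGramTokenizer":
        # Instantiate from the configuration, then restore whatever fitted state
        # (e.g. a learned vocabulary) was serialized alongside it on S3.
        tokenizer = cls.build_from_config(config)
        with fs.open(s3_path, "r") as f:
            state = json.load(f)  # hypothetical serialization format
        tokenizer.__dict__.update(state)
        return tokenizer
```

Routing both loading paths through build_from_config keeps the mapping from configuration keys to constructor arguments in one place, so a configuration change only needs to be reflected once.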

meilame-tayebjee merged commit 593ce32 into main on Feb 4, 2026
5 checks passed
meilame-tayebjee deleted the dev branch on February 4, 2026 at 15:41