
Fix loading tokenizers from s3 #71

Merged
meilame-tayebjee merged 2 commits into main from dev on Feb 4, 2026

Conversation

@meilame-tayebjee
Member

This pull request introduces improvements to the tokenizer loading and saving logic, particularly around loading from S3 and configuration-based instantiation. The changes add a unified load_from_s3 interface to the tokenizer base class, refactor the S3 loading implementation for HuggingFace and NGram tokenizers, and introduce a build_from_config method for NGram tokenizers to streamline instantiation from configuration files.

Tokenizer loading improvements:

  • Added an abstract load_from_s3 class method to BaseTokenizer to enforce a consistent S3 loading interface across all tokenizer subclasses.
  • Refactored the HuggingFaceTokenizer.load_from_s3 method to simplify error messages and ensure the correct tokenizer object is instantiated from S3.
  • Implemented load_from_s3 for NGramTokenizer, allowing it to be loaded from S3 using the provided filesystem and configuration, and refactored the loading logic for clarity (a sketch of the interface follows this list).
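
The PR description does not show the code itself, so the following is only a minimal sketch of what the unified interface could look like, assuming the classmethod receives an s3fs-style filesystem object and a configuration dict. The exact signature, argument names, and the AutoTokenizer-based reconstruction are assumptions for illustration, not the repository's actual implementation.

```python
import tempfile
from abc import ABC, abstractmethod

from transformers import AutoTokenizer


class BaseTokenizer(ABC):
    """Shared tokenizer interface; every subclass must know how to load itself from S3."""

    @classmethod
    @abstractmethod
    def load_from_s3(cls, s3_path: str, fs, config: dict) -> "BaseTokenizer":
        """Return a ready-to-use tokenizer rebuilt from artifacts stored at s3_path."""
        ...


class HuggingFaceTokenizer(BaseTokenizer):
    def __init__(self, hf_tokenizer):
        self.tokenizer = hf_tokenizer

    @classmethod
    def load_from_s3(cls, s3_path: str, fs, config: dict) -> "HuggingFaceTokenizer":
        if not fs.exists(s3_path):
            raise FileNotFoundError(f"No tokenizer artifacts found at {s3_path}")
        # Copy the serialized tokenizer files to a local directory, then let
        # HuggingFace rebuild the tokenizer object from them.
        with tempfile.TemporaryDirectory() as local_dir:
            fs.get(s3_path.rstrip("/") + "/", local_dir + "/", recursive=True)
            return cls(AutoTokenizer.from_pretrained(local_dir))
```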

Configuration-based instantiation:

  • Added a build_from_config class method to NGramTokenizer to streamline tokenizer creation from a configuration dictionary, and refactored the existing loading methods to use it (see the sketch after this list).
  • Removed a redundant print statement from the NGram tokenizer's loading logic for cleaner output.
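
Continuing the sketch above, build_from_config could be the single place where a configuration dictionary is turned into constructor arguments, with load_from_s3 reusing it before restoring any serialized state. The configuration keys and the JSON serialization shown here are placeholders; the actual NGramTokenizer fields are not listed in this PR description.

```python
import json


class NGramTokenizer(BaseTokenizer):
    def __init__(self, min_n: int, max_n: int, num_tokens: int):
        # Placeholder constructor arguments; the real tokenizer has its own set.
        self.min_n = min_n
        self.max_n = max_n
        self.num_tokens = num_tokens

    @classmethod
    def build_from_config(cls, config: dict) -> "NGramTokenizer":
        # Single place where configuration keys map to constructor arguments,
        # shared by local loading and S3 loading alike.
        return cls(
            min_n=config["min_n"],
            max_n=config["max_n"],
            num_tokens=config["num_tokens"],
        )

    @classmethod
    def load_from_s3(cls, s3_path: str, fs, config: dict) -> "NGramTokenizer":
        # Instantiate from the configuration, then restore whatever fitted state
        # (e.g. a learned vocabulary) was serialized alongside it on S3.
        tokenizer = cls.build_from_config(config)
        with fs.open(s3_path, "r") as f:
            state = json.load(f)  # hypothetical serialization format
        tokenizer.__dict__.update(state)
        return tokenizer
```

Routing both loading paths through build_from_config keeps the mapping from configuration keys to constructor arguments in one place, so a configuration change only needs to be reflected once.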

meilame-tayebjee merged commit 593ce32 into main on Feb 4, 2026
5 checks passed
meilame-tayebjee deleted the dev branch on February 4, 2026 at 15:41