Merged
This pull request improves the tokenizer loading and saving logic, particularly loading from S3 and configuration-based instantiation. The changes add a unified `load_from_s3` interface to the tokenizer base class, refactor the S3 loading implementation for the HuggingFace and NGram tokenizers, and introduce a `build_from_config` method for NGram tokenizers to streamline instantiation from configuration files.

Tokenizer loading improvements:
- Added a `load_from_s3` class method to `BaseTokenizer` to enforce a consistent S3 loading interface across all tokenizer subclasses.
- Refactored the `HuggingFaceTokenizer.load_from_s3` method to simplify error messages and ensure the correct tokenizer object is instantiated from S3.
- Implemented `load_from_s3` for `NGramTokenizer`, which loads from S3 using the provided filesystem and configuration, and refactored the loading logic for clarity.

Configuration-based instantiation:
- Added a `build_from_config` class method to `NGramTokenizer` to streamline tokenizer creation from a configuration dictionary, and refactored the existing loading methods to use this new method.
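
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the method signatures, the `n` and `vocab` config keys, the JSON file layout, and the `InMemoryFS` stand-in for an S3 filesystem are all assumptions made for the example. Only the names `BaseTokenizer`, `NGramTokenizer`, `load_from_s3`, and `build_from_config` come from the description.

```python
import io
import json
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseTokenizer(ABC):
    """Hypothetical base class; the real one may define more methods."""

    @classmethod
    @abstractmethod
    def load_from_s3(cls, fs: Any, path: str) -> "BaseTokenizer":
        """Load a tokenizer from `path` using a filesystem object `fs`.

        Assumed signature: `fs` only needs an `open(path, mode)` method.
        """


class NGramTokenizer(BaseTokenizer):
    def __init__(self, n: int, vocab: Dict[str, int]) -> None:
        self.n = n
        self.vocab = vocab

    @classmethod
    def build_from_config(cls, config: Dict[str, Any]) -> "NGramTokenizer":
        # Single place that maps config keys to constructor arguments,
        # so every loading path builds the tokenizer the same way.
        # The keys "n" and "vocab" are assumed for this sketch.
        return cls(n=int(config["n"]), vocab=dict(config.get("vocab", {})))

    @classmethod
    def load_from_s3(cls, fs: Any, path: str) -> "NGramTokenizer":
        # Read the (assumed JSON) config through the provided filesystem,
        # then delegate construction to build_from_config.
        with fs.open(path, "r") as f:
            config = json.load(f)
        return cls.build_from_config(config)


class InMemoryFS:
    """Stand-in for an S3 filesystem, used only to make the sketch runnable."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = files

    def open(self, path: str, mode: str = "r") -> io.StringIO:
        return io.StringIO(self.files[path])


# Usage: load a tokenizer from a fake "S3" path.
fs = InMemoryFS({"s3://bucket/tok.json": json.dumps({"n": 2, "vocab": {"ab": 0}})})
tokenizer = NGramTokenizer.load_from_s3(fs, "s3://bucket/tok.json")
```

Because `load_from_s3` is declared abstract on the base class, a subclass that does not implement it cannot be instantiated, which is one way to enforce the consistent interface. Routing every loader through `build_from_config` keeps the mapping from configuration to constructor arguments in one place.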