USE 113 - OSNeuralSparseDocV3GTE download and load #14
Conversation
Why these changes are being introduced:

Each embedding class needs a way to download the model assets (e.g. weights and related files) locally, such that it can be loaded without calls to the HuggingFace API. Some models may require work beyond just HF's `snapshot_download()` function, e.g. cloning dependency models or configurations. To confirm a model was downloaded and configured correctly, you must then also load it; ideally by creating a test embedding, but even a load without errors is a good first step.

How this addresses that need:

The base class is extended to include a `load()` method. Our first embedding class `OSNeuralSparseDocV3GTE` has a first pass at `download()` and `load()` methods. The model we are using has some unusual dependency requirements that would normally trigger additional HuggingFace calls on load. To avoid this, we include some manual work to clone the model `Alibaba-NLP/new-impl` and copy required files into our local model clone. The `load()` method confirms that the model loads successfully, without making any HuggingFace API calls.

Side effects of this change:

* None

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-113
Why these changes are being introduced:

We have opted to include the model weights and assets inside the Docker image. When the model is small-ish like ours, this avoids the need to save the model to S3 and download it each time the CLI is invoked.

How this addresses that need:

The CLI command `download-model` is used within the Dockerfile itself to download the model. As noted via inline comments, the env vars `TE_MODEL_URI` and `TE_MODEL_DOWNLOAD_PATH` are also set in the Dockerfile. This has a dual purpose. First, these env vars inform the `download-model` CLI invocation within the Dockerfile. Second, they persist into the container and establish the model as the default for all calls.

Side effects of this change:

* On Docker image build, the model will be downloaded from HuggingFace, configured, and included in the final Docker image.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-113
```python
def _patch_local_model_with_alibaba_new_impl(self, model_temp_path: Path) -> None:
    """Patch downloaded model with required assets from Alibaba-NLP/new-impl.

    Our main model, opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte,
    has configurations that attempt dynamic downloading of another model for files.
    This can be seen here: https://huggingface.co/opensearch-project/opensearch-
    neural-sparse-encoding-doc-v3-gte/blob/main/config.json#L6-L14.

    To avoid our deployed CLI application making requests to the HuggingFace API to
    retrieve these required files, which is problematic during high concurrency, we
    manually download these files and patch the model during our local download and
    save.

    This allows us to load the primary model without any HuggingFace API calls.
    """
```
It was surprising that we need to do this, and it's possible we may discover a more idiomatic way to handle it in the future.
The long and the short of it is: sometimes a HuggingFace model -- like ours -- will reference other models in its internal configurations. When the model is first loaded, the transformers library will automatically download and cache those other models for future invocations. For local development and testing, this is fine, but in our production environment, where we'll likely run the CLI with concurrent instances, we don't want each container hitting the HuggingFace API to download the model. Not only for performance reasons, but because we'd also start to get 429 (too many requests) errors, which we have encountered.
This method downloads a required model in advance, and copies small configuration files from it into our model, which is all that's needed. Then our model's config.json is updated to not point to an external model anymore.
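To make the approach concrete, here is a minimal sketch of what this patching could look like, assuming `snapshot_download()` is used to fetch the dependency model. The specific file names and `config.json` keys below are assumptions for illustration, not the project's actual implementation:

```python
import json
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download


def patch_with_new_impl(model_temp_path: Path) -> None:
    """Illustrative sketch: copy Alibaba-NLP/new-impl assets into a local model clone.

    The file names and config keys below are assumptions; the real method may differ.
    """
    # download the dependency model that our model's config.json references
    new_impl_path = Path(snapshot_download("Alibaba-NLP/new-impl"))

    # copy the remote-code files into our local model directory (assumed file names)
    for filename in ("configuration.py", "modeling.py"):
        shutil.copy(new_impl_path / filename, model_temp_path / filename)

    # rewrite config.json so its auto_map entries reference the local copies rather
    # than the external Alibaba-NLP/new-impl repo (key layout is an assumption)
    config_path = model_temp_path / "config.json"
    config = json.loads(config_path.read_text())
    config["auto_map"] = {
        key: value.split("--")[-1] for key, value in config.get("auto_map", {}).items()
    }
    config_path.write_text(json.dumps(config, indent=2))
```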
```python
# load local model and tokenizer
self._model = AutoModelForMaskedLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    local_files_only=True,
)
self._tokenizer = AutoTokenizer.from_pretrained(  # type: ignore[no-untyped-call]
    model_path,
    local_files_only=True,
)

# setup special tokens
self._special_token_ids = [
    self._tokenizer.vocab[str(token)]
    for token in self._tokenizer.special_tokens_map.values()
]

# setup id_to_token mapping
self._id_to_token = ["" for _ in range(self._tokenizer.vocab_size)]
for token, token_id in self._tokenizer.vocab.items():
    self._id_to_token[token_id] = token
```
This was ported almost directly from the model's HuggingFace "card" (website): https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte#usage-huggingface.

It's likely that we'll return to this for tweaks and updates, so it's probably not worth a deep review now.

Big picture, we load two things:

* `self._model`: the core of the model that creates embeddings
* `self._tokenizer`: the part of the model that turns an input text into tokens, which we then feed to the model to get embeddings

Just to reiterate: this was added in this PR mostly to confirm that the download process was working correctly. Once we get into creating embeddings, I anticipate updates to this simplistic model loading.
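For context, here is a rough sketch of how these pieces can be combined to create a sparse embedding. It loosely follows the pattern on the model card linked above; the function name and exact pooling details are illustrative and may not match what the project eventually implements:

```python
import torch


def sparse_embed(model, tokenizer, special_token_ids, texts: list[str]) -> torch.Tensor:
    """Illustrative sketch of sparse embedding creation, following the model card."""
    # tokenize the input texts into a padded batch
    features = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # token-level logits over the vocabulary
    output = model(**features)[0]

    # max-pool logits across the sequence, masking out padding tokens
    values, _ = torch.max(output * features["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log1p(torch.relu(values))

    # zero out special tokens ([CLS], [SEP], etc.) so they don't appear as terms
    values[:, special_token_ids] = 0
    return values
```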
Context is appreciated as well as acknowledging this is likely to be in flux
Works as expected, some clarifying questions but no blockers
```python
@main.command()
def test_model_load() -> None:
```
Clarifying: is this method part of the expected workflow or just for testing? Not fully clear based on the name and docstring
This is not part of a normal workflow, just for testing. It's akin to a ping command, allowing us to confirm in a deployed or Docker context that the model can load successfully.
I can update the docstring to be a bit clearer.
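For reference, the clearer docstring might be as small as something along these lines (the wording, and the stand-in CLI group, are only suggestions):

```python
import click


@click.group()
def main() -> None:
    """Stand-in for the project's CLI entrypoint group."""


@main.command()
def test_model_load() -> None:
    """Smoke test: load the configured model from local files and emit 'OK'.

    Not part of the normal embedding workflow; intended to confirm that a deployed
    or Docker environment can load the model without any HuggingFace API calls.
    """
```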
Oops! Just merged without that update. I'll make sure it makes it into the next PR! Sorry about that.
```
HF_HUB_DISABLE_PROGRESS_BARS=# boolean to disable progress bars for HuggingFace model downloads; defaults to 'true' in deployed contexts
```

## Configuring an Embedding Model
Great overview
Purpose and background context
NOTE: the majority of line count diffs are updated dependencies. 😅
This PR adds the ability for our first embedding class `OSNeuralSparseDocV3GTE` to both download and load the machine learning model. Additionally, this updates the `Dockerfile` to perform model download and configuration, such that the model is included in the resulting Docker image.

It was not originally planned to include the `load()` command in this PR as well, but it felt necessary to confirm that the `download()` work is done correctly by actually loading the local, configured model snapshot. While not present in this PR, I have confirmed that with the model loaded it is possible to create embeddings, further suggesting it is both downloaded and loaded correctly.

Additionally, as noted in this commit, this PR updates the `Dockerfile` to use the CLI's `download-model` command to download the model during the Docker image build. This ensures that the model is baked into the Docker image, and no HuggingFace API requests are needed to actually create embeddings.

The overall flow can be summarized as:

* a `download()` method that maps out how to download the machine learning model locally, and configure it accordingly
* a `download-model` CLI command that is used during the Docker image build
* a `load()` method to load the model already present in the Docker container
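To make that download/load split concrete, here is a hypothetical sketch of the base class interface described above; the class name, constructor, and signatures are assumptions, not the project's actual code:

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseEmbeddingModel(ABC):
    """Hypothetical sketch of the embedding base class; names and signatures assumed."""

    def __init__(self, model_uri: str, model_path: Path) -> None:
        self.model_uri = model_uri  # e.g. a HuggingFace repo id
        self.model_path = model_path  # local directory holding the downloaded assets

    @abstractmethod
    def download(self) -> Path:
        """Download model assets locally and apply any required configuration."""

    @abstractmethod
    def load(self) -> None:
        """Load the previously downloaded model without HuggingFace API calls."""
```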
Lastly, a new section in the README briefly outlines that the `Dockerfile` also sets a couple of env vars that are used for the build process, but which also persist into the container and establish the default model.

How can a reviewer manually see the effects of these changes?
At this time, we can do some testing!

Non-Docker model download + load

1. Set the env vars `TE_MODEL_URI` and `TE_MODEL_DOWNLOAD_PATH`.
2. Download the model to `/tmp/te-model`. At this point, you can inspect `/tmp/te-model` and see the model assets:
   * `model.safetensors` is the model "weights" and by far the largest file
   * `vocab.txt` is neat, showing the ~30k tokens that make up our model
3. Test that the model can be loaded: `OK` is emitted to stdout.
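A hypothetical programmatic equivalent of the download and load steps above, under the assumption that the class exposes `download()` and `load()` as described in this PR; the import path and constructor signature are guesses for illustration only:

```python
from pathlib import Path

# import path and constructor signature are assumptions for illustration
from embeddings.models import OSNeuralSparseDocV3GTE

model = OSNeuralSparseDocV3GTE(
    model_uri="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte",
    model_path=Path("/tmp/te-model"),
)
model.download()  # one-time: fetch weights and patch with Alibaba-NLP/new-impl assets
model.load()      # should succeed using local files only (no HuggingFace API calls)
print("OK")
```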
Docker build + model load test

1. Build the image: `RUN python -m embeddings.cli --verbose download-model` is performed during the build.
2. Invoke the Docker container with the model testing CLI command: `TE_MODEL_URI` and `TE_MODEL_DOWNLOAD_PATH` are already set in the container via the build.

Includes new or updated dependencies?
YES: addition of `torch` and `transformers`, both fairly significant libraries.

Changes expectations for external applications?
NO
What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/USE-113
Code review