Conversation

@ghukill ghukill commented Oct 28, 2025

Purpose and background context

NOTE: the majority of line count diffs are updated dependencies. 😅

This PR adds the ability for our first embedding class OSNeuralSparseDocV3GTE to both download and load the machine learning model. Additionally, this updates the Dockerfile to perform model download and configuration, such that the model is included in the resulting Docker image.

It was not originally planned to include the load() method in this PR as well, but it felt necessary to confirm that the download() work is correct by actually loading the local, configured model snapshot. While not included in this PR, I have confirmed that with the model loaded it is possible to create embeddings, further suggesting the model is both downloaded and loaded correctly.

Additionally, as noted in this commit, this PR updates the Dockerfile to use the CLI's download-model command to download the model during the Docker image build. This ensures that the model is baked into the Docker image, and no HuggingFace API requests are needed to actually create embeddings.

The overall flow can be summarized as follows (a rough sketch of the class interface appears after the list):

  1. Each class has a download() method that maps out how to download the machine learning model locally, and configure it accordingly.
  2. This method and the associated download-model CLI command is used during Docker image build.
  3. When the CLI is invoked for real, we can use the load() method to load the model already present in the Docker container.
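
To make this concrete, here is a minimal sketch of what that interface could look like. This is illustrative only; the base class name, constructor arguments, and docstrings are assumptions, not the PR's actual code.

```python
# Illustrative sketch only; names like BaseEmbeddingModel and the constructor
# arguments are assumptions, not necessarily the PR's actual implementation.
from abc import ABC, abstractmethod
from pathlib import Path


class BaseEmbeddingModel(ABC):
    """Base interface for embedding model classes (hypothetical)."""

    def __init__(self, model_uri: str, model_path: Path) -> None:
        self.model_uri = model_uri
        self.model_path = model_path

    @abstractmethod
    def download(self) -> None:
        """Download and configure model assets locally (used at Docker build time)."""

    @abstractmethod
    def load(self) -> None:
        """Load the already-downloaded model (used when the CLI runs for real)."""
```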

Lastly, a new section in the README briefly outlines that the Dockerfile also sets a couple of env vars that are used during the build process but also persist into the container and establish the default model.

How can a reviewer manually see the effects of these changes?

At this time, we can do some testing!

Non-Docker model download + load

1- Set env vars:

HF_HUB_DISABLE_PROGRESS_BARS=true
TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
TE_MODEL_DOWNLOAD_PATH=/tmp/te-model

2- Download the model to /tmp/te-model:

uv run --env-file .env embeddings --verbose download-model

At this point, you can inspect /tmp/te-model and see the model assets:

--- /tmp/te-model 
  524.1 MiB [###################################]  model.safetensors
    1.0 MiB [                                   ] /query_0_SparseStaticEmbedding
  872.0 KiB [                                   ]  idf.json
  724.0 KiB [                                   ]  query_token_weights.txt
  696.0 KiB [                                   ]  tokenizer.json
  228.0 KiB [                                   ]  vocab.txt
   88.0 KiB [                                   ] /.cache
   60.0 KiB [                                   ]  modeling.py
   12.0 KiB [                                   ]  README.md
    8.0 KiB [                                   ]  configuration.py
    4.0 KiB [                                   ]  .gitattributes
    4.0 KiB [                                   ]  tokenizer_config.json
    4.0 KiB [                                   ]  config.json
    4.0 KiB [                                   ]  router_config.json
    4.0 KiB [                                   ]  config_sentence_transformers.json
    4.0 KiB [                                   ] /document_1_SpladePooling
    4.0 KiB [                                   ]  special_tokens_map.json
    4.0 KiB [                                   ]  modules.json
    4.0 KiB [                                   ]  sentence_bert_config.json
  • model.safetensors is the model's "weights" and by far the largest file
  • vocab.txt is neat, showing the ~30k tokens that make up our model's vocabulary

3- Test the model can be loaded:

uv run --env-file .env embeddings --verbose test-model-load
  • Should see some logging and OK emitted to stdout

Docker build + model load test

1- Build image

make docker-build
  • If you watch closely, you'll notice RUN python -m embeddings.cli --verbose download-model is performed during build

2- Invoke docker container with model testing CLI command:

docker run -it timdex-embeddings:latest test-model-load
  • Confirms the model can be loaded inside the Docker container.
  • No env vars or CLI arguments are needed due to TE_MODEL_URI and TE_MODEL_DOWNLOAD_PATH already set in the container via the build.

Includes new or updated dependencies?

YES: addition of torch and transformers, both fairly significant libraries

Changes expectations for external applications?

NO

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-113

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

Each embedding class needs a way to download the model assets (e.g. weights and related files) locally, such
that it can be loaded without calls to the HuggingFace API.  Some models may require work beyond just HF's
`snapshot_download()` function, e.g. cloning dependency models or configurations.
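
For reference, the basic `snapshot_download()` call looks roughly like this; the values mirror the env vars used elsewhere in this PR, but this is not the exact code.

```python
# Basic huggingface_hub usage for pulling model assets to a local directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte",  # TE_MODEL_URI
    local_dir="/tmp/te-model",  # TE_MODEL_DOWNLOAD_PATH
)
```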

To test whether a model is downloaded and configured correctly, you must then also load it.  Ideally
this would include creating a test embedding, but even a load without errors is a good step.

How this addresses that need:

The base class is extended to include a `load()` method.

Our first embedding class `OSNeuralSparseDocV3GTE` has a first pass at `download()`
and `load()` methods.  The model we are using has some unusual dependency requirements that
most commonly rely on additional HuggingFace calls on load.  To avoid this, we include
some manual work to clone the model `Alibaba-NLP/new-impl` and copy required files into our
local model clone.

The `load()` function confirms that the model loads successfully, and without making any
HuggingFace API calls.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-113

Why these changes are being introduced:

We have opted to include the model weights and assets inside the Docker image.  When the model is
small-ish like ours, this avoids the need to save the model to S3 and download each time the CLI is
invoked.

How this addresses that need:

The CLI command `download-model` is used within the Dockerfile itself to download the model.  As noted
via inline comments, the env vars `TE_MODEL_URI` and `TE_MODEL_DOWNLOAD_PATH` are also set in the Dockerfile.
This has a dual purpose.  First, these env vars inform the `download-model` CLI invocation within the
Dockerfile.  Second, they persist to the container and establish the model as default for all calls.
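
As a sketch of that dual purpose, the CLI can resolve its model settings from these env vars when no explicit arguments are passed. The function and parameter names below are hypothetical, not the PR's actual code.

```python
# Hypothetical helper illustrating env-var fallback; not the PR's actual code.
import os
from pathlib import Path


def resolve_model_settings(
    model_uri: str | None = None,
    download_path: str | None = None,
) -> tuple[str, Path]:
    """Prefer explicit values, then fall back to env vars baked into the image."""
    uri = model_uri or os.environ["TE_MODEL_URI"]
    path = Path(download_path or os.environ["TE_MODEL_DOWNLOAD_PATH"])
    return uri, path
```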

Side effects of this change:
* On Docker image build, the model will be downloaded from HuggingFace, configured, and included
in the final Docker image.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-113
Comment on lines +77 to +91
def _patch_local_model_with_alibaba_new_impl(self, model_temp_path: Path) -> None:
    """Patch downloaded model with required assets from Alibaba-NLP/new-impl.

    Our main model, opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte,
    has configurations that attempt dynamic downloading of another model for files.
    This can be seen here: https://huggingface.co/opensearch-project/opensearch-
    neural-sparse-encoding-doc-v3-gte/blob/main/config.json#L6-L14.

    To avoid our deployed CLI application making requests to the HuggingFace API to
    retrieve these required files, which is problematic during high concurrency, we
    manually download these files and patch the model during our local download and
    save.

    This allows us to load the primary model without any HuggingFace API calls.
    """
ghukill (Collaborator Author) commented:

It was surprising that we needed to do this, and it's possible we may discover a more idiomatic approach in the future.

The long and short of it is: sometimes a HuggingFace model -- like ours -- will reference other models in its internal configuration. When the model is first loaded, the transformers library will automatically cache those other models for future invocations. For local development and testing this is fine, but in our production environment, where we'll likely run the CLI with concurrent instances, we don't want each container hitting the HuggingFace API to download the model. This is not only a performance concern; we would also start to get 429 Too Many Requests errors (we have encountered this).

This method downloads a required model in advance, and copies small configuration files from it into our model, which is all that's needed. Then our model's config.json is updated to not point to an external model anymore.
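
A rough sketch of that patching idea is below. The specific files copied and the exact `config.json` keys rewritten are assumptions for illustration; the PR's `_patch_local_model_with_alibaba_new_impl()` is the authoritative version.

```python
# Illustrative only: pre-fetch the dependency model and localize config.json
# references so no HuggingFace API calls are needed at load time.
import json
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download


def patch_with_new_impl(model_path: Path) -> None:
    # download the dependency model that config.json would otherwise fetch lazily
    dep_path = Path(snapshot_download(repo_id="Alibaba-NLP/new-impl"))

    # copy the small remote-code files into our local model clone
    # (filenames here are assumptions for this sketch)
    for filename in ("configuration.py", "modeling.py"):
        shutil.copy(dep_path / filename, model_path / filename)

    # rewrite auto_map entries of the form "Alibaba-NLP/new-impl--module.Class"
    # so they reference the local files instead of the external repo
    config_file = model_path / "config.json"
    config = json.loads(config_file.read_text())
    config["auto_map"] = {
        key: value.split("--")[-1] for key, value in config.get("auto_map", {}).items()
    }
    config_file.write_text(json.dumps(config, indent=2))
```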

Comment on lines +141 to +161
# load local model and tokenizer
self._model = AutoModelForMaskedLM.from_pretrained(
model_path,
trust_remote_code=True,
local_files_only=True,
)
self._tokenizer = AutoTokenizer.from_pretrained( # type: ignore[no-untyped-call]
model_path,
local_files_only=True,
)

# setup special tokens
self._special_token_ids = [
self._tokenizer.vocab[str(token)]
for token in self._tokenizer.special_tokens_map.values()
]

# setup id_to_token mapping
self._id_to_token = ["" for _ in range(self._tokenizer.vocab_size)]
for token, token_id in self._tokenizer.vocab.items():
self._id_to_token[token_id] = token
ghukill (Collaborator Author) commented:

This was ported almost directly from the model's HuggingFace "card" (website): https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte#usage-huggingface.

It's likely that we'll return to this for tweaks and updates, so it's probably not worth a deep review now.

Big picture, we load two things:

  • self._model: the core of the model that creates embeddings
  • self._tokenizer: the component that converts input text into tokens, which we then feed to the model to get embeddings

Just to reiterate: this was added in this PR mostly to confirm that the download process was working correctly. Once we get into creating embeddings, I anticipate updates to this simplistic model loading.
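
For context only, here is roughly how the loaded model and tokenizer could later be combined into a sparse embedding, following the usage pattern on the model's HuggingFace card linked above. This is not code from this PR, and the details will likely change.

```python
# Sketch of downstream usage; `model`, `tokenizer`, and `special_token_ids`
# are assumed to be the objects produced by the load() code shown above.
import torch


def create_sparse_embeddings(model, tokenizer, special_token_ids, texts: list[str]) -> torch.Tensor:
    features = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**features)[0]  # (batch, seq_len, vocab_size)
    # max-pool token logits over the sequence, masking padded positions
    values, _ = torch.max(logits * features["attention_mask"].unsqueeze(-1), dim=1)
    sparse_vectors = torch.log1p(torch.relu(values))
    sparse_vectors[:, special_token_ids] = 0.0  # ignore special tokens
    return sparse_vectors
```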

ehanson8 commented:

Context is appreciated as well as acknowledging this is likely to be in flux

@ghukill ghukill marked this pull request as ready for review October 28, 2025 17:45
@ghukill ghukill requested a review from a team October 28, 2025 17:46
@ehanson8 ehanson8 self-assigned this Oct 28, 2025

@ehanson8 ehanson8 left a comment

Works as expected, some clarifying questions but no blockers

@main.command()
def test_model_load() -> None:

ehanson8 commented:

Clarifying: is this method part of the expected workflow or just for testing? Not fully clear based on the name and docstring

ghukill (Collaborator Author) commented:

This is not part of a normal workflow, just for testing. It's akin to a ping command, allowing us to confirm in a deployed or Docker context that the model can load successfully.

I can update the docstring to be a bit clearer.
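
For example, a clearer docstring might read something like the sketch below. This assumes a click-based CLI (matching the `@main.command()` decorator in the diff) and the wording is only a suggestion, not the PR's final text.

```python
# Suggested docstring wording only; the real command lives in embeddings.cli.
import click


@click.group()
def main() -> None:
    """timdex-embeddings CLI (stub for illustration)."""


@main.command()
def test_model_load() -> None:
    """Diagnostic command: confirm the configured model loads from local files.

    Not part of the normal embedding workflow; intended as a ping-style check
    that a deployed or Docker context can load the model without HuggingFace
    API calls. Emits "OK" on success.
    """
    # model loading would happen here in the real command before emitting OK
    click.echo("OK")
```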

ghukill (Collaborator Author) commented:

Oops! Just merged without that update. I'll make sure it makes it into the next PR! Sorry about that.

HF_HUB_DISABLE_PROGRESS_BARS=#boolean to disable progress bars for HuggingFace model downloads; defaults to 'true' in deployed contexts
```

## Configuring an Embedding Model

ehanson8 commented:

Great overview

@ghukill ghukill merged commit 8db533a into main Oct 28, 2025
2 checks passed