Conversation

@ghukill ghukill commented Oct 28, 2025

Purpose and background context

NOTE: the majority of line count diffs are updated dependencies. 😅

This PR adds the ability for our first embedding class OSNeuralSparseDocV3GTE to both download and load the machine learning model. Additionally, this updates the Dockerfile to perform model download and configuration, such that the model is included in the resulting Docker image.

It was not originally planned to include the load() method in this PR as well, but it felt necessary to confirm that the download() work is correct by actually loading the local, configured model snapshot. While not included in this PR, I have confirmed that with the model loaded it is possible to create embeddings, further suggesting the model is both downloaded and loaded correctly.

Additionally, as noted in this commit, this PR updates the Dockerfile to use the CLI's download-model command to download the model during the Docker image build. This ensures that the model is baked into the Docker image, and no HuggingFace API requests are needed to actually create embeddings.

The overall flow can be summarized as follows (a rough sketch of the class interface appears after the list):

  1. Each class has a download() method that maps out how to download the machine learning model locally, and configure it accordingly.
  2. This method and the associated download-model CLI command is used during Docker image build.
  3. When the CLI is invoked for real, we can use the load() method to load the model already present in the Docker container.
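
To make this concrete, here is a minimal sketch of what that interface could look like. This is illustrative only; the base class name, constructor arguments, and docstrings are assumptions, not the PR's actual code.

```python
# Illustrative sketch only; names like BaseEmbeddingModel and the constructor
# arguments are assumptions, not necessarily the PR's actual implementation.
from abc import ABC, abstractmethod
from pathlib import Path


class BaseEmbeddingModel(ABC):
    """Base interface for embedding model classes (hypothetical)."""

    def __init__(self, model_uri: str, model_path: Path) -> None:
        self.model_uri = model_uri
        self.model_path = model_path

    @abstractmethod
    def download(self) -> None:
        """Download and configure model assets locally (used at Docker build time)."""

    @abstractmethod
    def load(self) -> None:
        """Load the already-downloaded model (used when the CLI runs for real)."""
```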

Lastly, a new section in the README briefly outlines that the Dockerfile also sets a couple of env vars that are used during the build process but also persist into the container and establish the default model.

How can a reviewer manually see the effects of these changes?

At this time, we can do some testing!

Non-Docker model download + load

1- Set env vars:

HF_HUB_DISABLE_PROGRESS_BARS=true
TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
TE_MODEL_DOWNLOAD_PATH=/tmp/te-model

2- Download the model to /tmp/te-model:

uv run --env-file .env embeddings --verbose download-model

At this point, you can inspect /tmp/te-model and see the model assets:

--- /tmp/te-model 
  524.1 MiB [###################################]  model.safetensors
    1.0 MiB [                                   ] /query_0_SparseStaticEmbedding
  872.0 KiB [                                   ]  idf.json
  724.0 KiB [                                   ]  query_token_weights.txt
  696.0 KiB [                                   ]  tokenizer.json
  228.0 KiB [                                   ]  vocab.txt
   88.0 KiB [                                   ] /.cache
   60.0 KiB [                                   ]  modeling.py
   12.0 KiB [                                   ]  README.md
    8.0 KiB [                                   ]  configuration.py
    4.0 KiB [                                   ]  .gitattributes
    4.0 KiB [                                   ]  tokenizer_config.json
    4.0 KiB [                                   ]  config.json
    4.0 KiB [                                   ]  router_config.json
    4.0 KiB [                                   ]  config_sentence_transformers.json
    4.0 KiB [                                   ] /document_1_SpladePooling
    4.0 KiB [                                   ]  special_tokens_map.json
    4.0 KiB [                                   ]  modules.json
    4.0 KiB [                                   ]  sentence_bert_config.json
  • model.safetensors is the model's "weights" and by far the largest file
  • vocab.txt is neat, showing the ~30k tokens that make up our model's vocabulary

3- Test the model can be loaded:

uv run --env-file .env embeddings --verbose test-model-load
  • Should see some logging and OK emitted to stdout

Docker build + model load test

1- Build image

make docker-build
  • If you watch closely, you'll notice RUN python -m embeddings.cli --verbose download-model is performed during build

2- Invoke docker container with model testing CLI command:

docker run -it timdex-embeddings:latest test-model-load
  • Confirms the model can be loaded inside the Docker container.
  • No env vars or CLI arguments are needed due to TE_MODEL_URI and TE_MODEL_DOWNLOAD_PATH already set in the container via the build.

Includes new or updated dependencies?

YES: addition of torch and transformers, both fairly significant libraries

Changes expectations for external applications?

NO

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-113

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

Each embedding class needs a way to download the model assets (e.g. weights and related files) locally, such
that it can be loaded without calls to the HuggingFace API.  Some models may require work beyond just HF's
`snapshot_download()` function, e.g. cloning dependency models or configurations.
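
For reference, the basic `snapshot_download()` call looks roughly like this; the values mirror the env vars used elsewhere in this PR, but this is not the exact code.

```python
# Basic huggingface_hub usage for pulling model assets to a local directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte",  # TE_MODEL_URI
    local_dir="/tmp/te-model",  # TE_MODEL_DOWNLOAD_PATH
)
```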

To test whether a model is downloaded and configured correctly, you must then also load it.  Ideally
this would include creating a test embedding, but even a load without errors is a good step.

How this addresses that need:

The base class is extended to include a `load()` method.

Our first embedding class `OSNeuralSparseDocV3GTE` has a first pass at `download()`
and `load()` methods.  The model we are using has some unusual dependency requirements that
most commonly rely on additional HuggingFace calls on load.  To avoid this, we include
some manual work to clone the model `Alibaba-NLP/new-impl` and copy required files into our
local model clone.

The `load()` function confirms that the model loads successfully, and without making any
HuggingFace API calls.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-113

Why these changes are being introduced:

We have opted to include the model weights and assets inside the Docker image.  When the model is
small-ish like ours, this avoids the need to save the model to S3 and download each time the CLI is
invoked.

How this addresses that need:

The CLI command `download-model` is used within the Dockerfile itself to download the model.  As noted
via inline comments, the env vars `TE_MODEL_URI` and `TE_MODEL_DOWNLOAD_PATH` are also set in the Dockerfile.
This has a dual purpose.  First, these env vars inform the `download-model` CLI invocation within the
Dockerfile.  Second, they persist to the container and establish the model as default for all calls.
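
As a sketch of that dual purpose, the CLI can resolve its model settings from these env vars when no explicit arguments are passed. The function and parameter names below are hypothetical, not the PR's actual code.

```python
# Hypothetical helper illustrating env-var fallback; not the PR's actual code.
import os
from pathlib import Path


def resolve_model_settings(
    model_uri: str | None = None,
    download_path: str | None = None,
) -> tuple[str, Path]:
    """Prefer explicit values, then fall back to env vars baked into the image."""
    uri = model_uri or os.environ["TE_MODEL_URI"]
    path = Path(download_path or os.environ["TE_MODEL_DOWNLOAD_PATH"])
    return uri, path
```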

Side effects of this change:
* On Docker image build, the model will be downloaded from HuggingFace, configured, and included
in the final Docker image.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-113
Comment on lines +77 to +91
def _patch_local_model_with_alibaba_new_impl(self, model_temp_path: Path) -> None:
    """Patch downloaded model with required assets from Alibaba-NLP/new-impl.

    Our main model, opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte,
    has configurations that attempt dynamic downloading of another model for files.
    This can be seen here: https://huggingface.co/opensearch-project/opensearch-
    neural-sparse-encoding-doc-v3-gte/blob/main/config.json#L6-L14.

    To avoid our deployed CLI application making requests to the HuggingFace API to
    retrieve these required files, which is problematic during high concurrency, we
    manually download these files and patch the model during our local download and
    save.

    This allows us to load the primary model without any HuggingFace API calls.
    """
ghukill (Collaborator Author) commented:

It was surprising that we needed to do this, and it's possible we may discover a more idiomatic approach in the future.

The long and short of it is: sometimes a HuggingFace model -- like ours -- will reference other models in its internal configuration. When the model is first loaded, the transformers library will automatically cache those other models for future invocations. For local development and testing this is fine, but in our production environment, where we'll likely run the CLI with concurrent instances, we don't want each container hitting the HuggingFace API to download the model. This is not only a performance concern; we would also start to get 429 Too Many Requests errors (we have encountered this).

This method downloads a required model in advance, and copies small configuration files from it into our model, which is all that's needed. Then our model's config.json is updated to not point to an external model anymore.
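
A rough sketch of that patching idea is below. The specific files copied and the exact `config.json` keys rewritten are assumptions for illustration; the PR's `_patch_local_model_with_alibaba_new_impl()` is the authoritative version.

```python
# Illustrative only: pre-fetch the dependency model and localize config.json
# references so no HuggingFace API calls are needed at load time.
import json
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download


def patch_with_new_impl(model_path: Path) -> None:
    # download the dependency model that config.json would otherwise fetch lazily
    dep_path = Path(snapshot_download(repo_id="Alibaba-NLP/new-impl"))

    # copy the small remote-code files into our local model clone
    # (filenames here are assumptions for this sketch)
    for filename in ("configuration.py", "modeling.py"):
        shutil.copy(dep_path / filename, model_path / filename)

    # rewrite auto_map entries of the form "Alibaba-NLP/new-impl--module.Class"
    # so they reference the local files instead of the external repo
    config_file = model_path / "config.json"
    config = json.loads(config_file.read_text())
    config["auto_map"] = {
        key: value.split("--")[-1] for key, value in config.get("auto_map", {}).items()
    }
    config_file.write_text(json.dumps(config, indent=2))
```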

Comment on lines +141 to +161
# load local model and tokenizer
self._model = AutoModelForMaskedLM.from_pretrained(
model_path,
trust_remote_code=True,
local_files_only=True,
)
self._tokenizer = AutoTokenizer.from_pretrained( # type: ignore[no-untyped-call]
model_path,
local_files_only=True,
)

# setup special tokens
self._special_token_ids = [
self._tokenizer.vocab[str(token)]
for token in self._tokenizer.special_tokens_map.values()
]

# setup id_to_token mapping
self._id_to_token = ["" for _ in range(self._tokenizer.vocab_size)]
for token, token_id in self._tokenizer.vocab.items():
self._id_to_token[token_id] = token
ghukill (Collaborator Author) commented:

This was ported almost directly from the model's HuggingFace "card" (website): https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte#usage-huggingface.

It's likely that we'll return to this for tweaks and updates, so it's probably not worth a deep review now.

Big picture, we load two things:

  • self._model: the core of the model that creates embeddings
  • self._tokenizer: the component that converts input text into tokens, which we then feed to the model to get embeddings

Just to reiterate: this was added in this PR mostly to confirm that the download process was working correctly. Once we get into creating embeddings, I anticipate updates to this simplistic model loading.
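
For context only, here is roughly how the loaded model and tokenizer could later be combined into a sparse embedding, following the usage pattern on the model's HuggingFace card linked above. This is not code from this PR, and the details will likely change.

```python
# Sketch of downstream usage; `model`, `tokenizer`, and `special_token_ids`
# are assumed to be the objects produced by the load() code shown above.
import torch


def create_sparse_embeddings(model, tokenizer, special_token_ids, texts: list[str]) -> torch.Tensor:
    features = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**features)[0]  # (batch, seq_len, vocab_size)
    # max-pool token logits over the sequence, masking padded positions
    values, _ = torch.max(logits * features["attention_mask"].unsqueeze(-1), dim=1)
    sparse_vectors = torch.log1p(torch.relu(values))
    sparse_vectors[:, special_token_ids] = 0.0  # ignore special tokens
    return sparse_vectors
```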

ehanson8 commented:

Context is appreciated as well as acknowledging this is likely to be in flux

@ghukill ghukill marked this pull request as ready for review October 28, 2025 17:45
@ghukill ghukill requested a review from a team October 28, 2025 17:46
@ehanson8 ehanson8 self-assigned this Oct 28, 2025

@ehanson8 ehanson8 left a comment

Works as expected, some clarifying questions but no blockers

@main.command()
def test_model_load() -> None:

ehanson8 commented:

Clarifying: is this method part of the expected workflow or just for testing? Not fully clear based on the name and docstring

ghukill (Collaborator Author) commented:

This is not part of a normal workflow, just for testing. It's akin to a ping command, allowing us to confirm in a deployed or Docker context that the model can load successfully.

I can update the docstring to be a bit clearer.
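
For example, a clearer docstring might read something like the sketch below. This assumes a click-based CLI (matching the `@main.command()` decorator in the diff) and the wording is only a suggestion, not the PR's final text.

```python
# Suggested docstring wording only; the real command lives in embeddings.cli.
import click


@click.group()
def main() -> None:
    """timdex-embeddings CLI (stub for illustration)."""


@main.command()
def test_model_load() -> None:
    """Diagnostic command: confirm the configured model loads from local files.

    Not part of the normal embedding workflow; intended as a ping-style check
    that a deployed or Docker context can load the model without HuggingFace
    API calls. Emits "OK" on success.
    """
    # model loading would happen here in the real command before emitting OK
    click.echo("OK")
```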

ghukill (Collaborator Author) commented:

Oops! Just merged without that update. I'll make sure it makes it into the next PR! Sorry about that.

HF_HUB_DISABLE_PROGRESS_BARS=#boolean to disable progress bars for HuggingFace model downloads; defaults to 'true' in deployed contexts
```

## Configuring an Embedding Model

ehanson8 commented:

Great overview

@ghukill ghukill merged commit 8db533a into main Oct 28, 2025
2 checks passed