@ghukill ghukill commented Nov 4, 2025

Purpose and background context

This PR updates our first embedding class OSNeuralSparseDocV3GTE to have a fully functional create_embedding() method.

Most of the business logic for this was ported directly from the HuggingFace model card; the main addition in our code is the method _decode_sparse_vectors(), which converts the sparse vector into the decoded {token: weight, ...} format that we will pass directly to OpenSearch.
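
To make the decoding step concrete, here is a rough sketch of the idea (illustrative only, not the actual class internals): keep the nonzero entries of a vocabulary-sized sparse vector and map each index back to its token string via the tokenizer.

# Illustrative sketch only: decode_sparse_vector and id_to_token are hypothetical names.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
)
id_to_token = {idx: token for token, idx in tokenizer.get_vocab().items()}


def decode_sparse_vector(sparse_vector: list[float]) -> dict[str, float]:
    """Convert a vocabulary-length sparse vector into {token: weight}."""
    return {
        id_to_token[idx]: weight
        for idx, weight in enumerate(sparse_vector)
        if weight > 0
    }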

Note that a meeting has been scheduled to discuss this PR prior to any reviews, to provide a space to demo and discuss how this works (at least to some degree of detail).

This should be considered an initial, untuned implementation, but one that is producing the sparse vectors and decoded token weights for inputs that we know we'll need. There are currently a couple of tickets targeting improvements of this first implementation, and likely more will be opened:

  • USE-137: introduce multiprocessing / parallel processes for creating embeddings of a batch of input records or texts
  • USE-165: tune the model loading + inference for ECS Fargate arm64 containers

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS credentials

2- Set env vars (e.g., in a local .env file, since the commands below use --env-file .env):

HF_HUB_DISABLE_PROGRESS_BARS=true
TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
TE_MODEL_PATH=/tmp/te-model
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset

3- (Re)Download model:

uv run --env-file .env embeddings download-model

Manual testing in IPython shell

4- Start ipython:

uv run --env-file .env ipython

5- Load model for creating embeddings:

import logging
import os

from embeddings.config import configure_logger
from embeddings.embedding import EmbeddingInput
from embeddings.models.registry import get_model_class

logger = logging.getLogger()
configure_logger(logger, verbose=True)

# load embedder class
model_class = get_model_class(os.environ["TE_MODEL_URI"])
model = model_class(os.environ["TE_MODEL_PATH"])
model.load()

Manual Creation of EmbeddingInput objects
manual_embedding_inputs = [
    EmbeddingInput(
        timdex_record_id="abc123",
        run_id="def456",
        run_record_offset=42,
        embedding_strategy="full_record",
        text="Coffee is both a plant and a bean, usually grown at high elevation.",
    )
]

manual_embeddings = list(model.create_embeddings(iter(manual_embedding_inputs)))

Look at the decoded token:weight pairs for the single embedding created:

me1 = manual_embeddings[0]
me1.embedding_token_weights
"""
Out[13]: 
{'a': 0.5694160461425781,
 'and': 0.48665890097618103,
 'is': 0.7656334638595581,
 'as': 0.21244272589683533,
 'an': 0.2317156195640564,
 'but': 0.11755359917879105,
 'or': 0.29590004682540894,
 'also': 0.3128999173641205,
 'what': 0.19876547157764435,
 'where': 0.3369053602218628,
 'between': 0.06990399211645126,
 'both': 0.7280949354171753,
 'high': 0.3671952784061432,
 'same': 0.2660926580429077,
 'come': 0.12210105359554291,
 'land': 0.16826504468917847,
 ...,
 ...,
"""

Now we can look at the top 10 tokens by weight:

sorted(me1.embedding_token_weights.items(), key=lambda item: item[1], reverse=True)[:10]
"""
Out[12]: 
[('coffee', 1.260346531867981),
 ('is', 0.7656334638595581),
 ('both', 0.7280949354171753),
 ('plant', 0.6733094453811646),
 ('caf', 0.613446056842804),
 ('elevation', 0.5893899202346802),
 ('a', 0.5694160461425781),
 ('plants', 0.5592532157897949),
 ('bean', 0.5522097945213318),
 ('beans', 0.5464577078819275)]
"""

EmbeddingInput objects from TIMDEX records
from timdex_dataset_api import TIMDEXDataset
from embeddings.strategies.processor import create_embedding_inputs

# init TIMDEXDataset and retrieve some records
td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])
records = td.read_dicts_iter(source="libguides", limit=3)

# create embedding inputs from TIMDEX records with utility
timdex_embedding_inputs = list(create_embedding_inputs(records, ["full_record"]))

timdex_embeddings = list(model.create_embeddings(iter(timdex_embedding_inputs)))
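
As with the manual example, we can peek at the top tokens for each record (this uses only the embedding_strategy and embedding_token_weights attributes already shown above):

for embedding in timdex_embeddings:
    top_tokens = sorted(
        embedding.embedding_token_weights.items(),
        key=lambda item: item[1],
        reverse=True,
    )[:5]
    print(embedding.embedding_strategy, top_tokens)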

create-embeddings CLI command

Run the following:

uv run --env-file .env embeddings \
--verbose \
create-embeddings \
-d s3://timdex-extract-dev-222053980223/dataset \
--run-id e758d6c4-6ee4-4862-a00f-b9da4d3758ad \
--record-limit 10 \
--strategy full_record \
--output-jsonl output/use136.jsonl

Look at the results at output/use136.jsonl. Note that for only 10 records the file is a pretty sizable 1.6 MB; this is a result of the sparse vectors in JSON form. We may decide not to write these back to the TIMDEX dataset given the unknown usage at this time.
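
A quick, purely illustrative way to peek at that output from Python; the field names below are guesses based on the embedding object's attributes, not a documented schema:

import json

with open("output/use136.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # field names here are assumptions; adjust to the actual JSONL schema
        token_weights = record.get("embedding_token_weights", {})
        vector = record.get("embedding_vector", [])
        print(len(token_weights), "tokens,", len(vector), "vector dimensions")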

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • USE-136: https://mitlibraries.atlassian.net/browse/USE-136

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill changed the base branch from main to USE-131-embedding-input-transform-framework November 4, 2025 19:32
@ghukill ghukill force-pushed the USE-136-implement-create-embeddings branch 2 times, most recently from 3d7062b to d9dba96 on November 4, 2025 20:40
Why these changes are being introduced:

Our first embedding class OSNeuralSparseDocV3GTE was ready for a real
create_embedding() method with the rest of the moving parts mostly complete at this time.

How this addresses that need:

The HuggingFace model card,
https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte,
contains a section of example code for using this model with the transformers library.
This logic was ported over to our class, with model downloading and loading already
handled elsewhere.
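
For orientation, doc-side sparse encoding with transformers generally follows a
SPLADE-style pattern like the rough sketch below. This is a generic illustration, not
the exact model-card code: pooling, activation, special-token handling, and whether
trust_remote_code is needed should all be taken from the model card itself.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

texts = ["Coffee is both a plant and a bean, usually grown at high elevation."]
features = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**features).logits  # (batch, seq_len, vocab_size)

# max-pool token logits over the sequence, ignoring padding positions
values, _ = torch.max(logits * features["attention_mask"].unsqueeze(-1), dim=1)
sparse_vectors = torch.log1p(torch.relu(values))
# zero out weights for special tokens (e.g. [CLS], [SEP])
sparse_vectors[:, tokenizer.all_special_ids] = 0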

The biggest addition is the method _decode_sparse_vectors(), which converts the
numerical sparse vector into a dictionary of token:weight pairs. This decoded token
weight form is what we'll pass directly to OpenSearch.

With the new functionality in place, the tests associated with this class were also
updated.  Fixtures were moved into the testing file, a pattern we could adopt for any
future models to keep them out of the shared conftest.py.

Side effects of this change:
* CLI is capable of producing embeddings for our first model OSNeuralSparseDocV3GTE

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-136
@ghukill ghukill force-pushed the USE-136-implement-create-embeddings branch from d9dba96 to fc0bdea on November 4, 2025 21:20
@ghukill ghukill changed the base branch from USE-131-embedding-input-transform-framework to main November 4, 2025 21:20
@ghukill ghukill marked this pull request as ready for review November 4, 2025 21:30
@ghukill ghukill requested a review from a team November 5, 2025 13:59
@ghukill ghukill requested review from a team and removed request for a team November 5, 2025 19:45
@ghukill ghukill force-pushed the USE-136-implement-create-embeddings branch from b8021a0 to 296b870 on November 5, 2025 20:20
@jonavellecuerdo jonavellecuerdo self-requested a review November 6, 2025 13:53

@ehanson8 ehanson8 left a comment

Works as expected and looks great! One typo and a question from our earlier discussion, but I'm comfortable approving regardless of the answer.

run_record_offset=embedding_input.run_record_offset,
model_uri=self.model_uri,
embedding_strategy=embedding_input.embedding_strategy,
embedding_vector=sparse_vector_list,

Just adding this here from our earlier discussion: even if the vector compresses to nothing, I think we should consider why we're passing it through if we don't have a use case for it. Obviously it's needed for generating token weights but if OpenSearch doesn't use it, I'm not sure why we're storing it on the object. I won't press beyond this comment but it feels like we're keeping an unnecessary precursor in addition to the useful output. Happy to be corrected if that's not the case!

@ghukill ghukill Nov 6, 2025

I'd wager to say it's kind of an endless topic. Your point and hesitance are well founded and registered.

To me, it's a bit of a gamble.

It takes time and money to produce embeddings, even as we tune for performance, and the sparse vectors arguably have more information, given it's a representation of the embedding across the entire vocabulary. Theoretically, we could perform some mathematical operations on the sparse vectors we can't do on the decoded token weights. Other folks, myself included, have some interest in this. Storing them keeps that option on the table.

Additionally, there might be a good argument for only storing those sparse vectors in the future and decoding the data on the way out. This could be much cheaper to store in the long term.

Lastly, hopefully, we'll use a model in the future that produces true dense vectors, which will require storage in that form.

If any of these pan out, I think it'd be nice to have some repetitions and tested schemas for storing this data. Certainly not opposed to removing it a few months down the road as we tune the pipeline, but I'd lobby for keeping it in these early days as we develop our understanding.

But, as I lead with, I don't think there is a right or wrong answer here. We may very well decide it's not useful at some point and cease to store it for this model.

That is all fine, appreciate the discussion! 🙂

@jonavellecuerdo jonavellecuerdo left a comment

This is looking good to me. Expecting to take a closer look with the upcoming tickets. Just one question for you.

def __repr__(self) -> str:  # noqa: D105
    return (
        f"<EmbeddingInput - record:'{self.timdex_record_id}', "
        f"strategy:'{self.embedding_strategy}', text length:{len(self.text)}>"

Hmm, why text length? 🤔

Collaborator Author

Keeping it brief here, but in short, it's just a handy little indicator of how long the text used for the embedding was! This might help when we don't see the original text, maybe unearth instances where it's zero? or huge?

But purely a guess at helpful data for the interactive python/shell environment. Won't have any bearing otherwise.
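
For illustration only, the repr for the manual input created earlier would render roughly like the comment below (the trailing number is just len(text)):

print(repr(manual_embedding_inputs[0]))
# <EmbeddingInput - record:'abc123', strategy:'full_record', text length:67>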

@ghukill ghukill merged commit faca712 into main Nov 6, 2025
2 checks passed