@ghukill ghukill commented Nov 4, 2025

Purpose and background context

This PR updates our first embedding class OSNeuralSparseDocV3GTE to have a fully functional create_embedding() method.

Most of the business logic for this was ported directly from the HuggingFace model card; the main addition in our code is the method _decode_sparse_vectors(), which converts the sparse vector into the decoded {token: weight, ...} format that we will pass directly to OpenSearch.
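
To make the decoding step concrete, here is a rough sketch of the idea (illustrative only, not the actual class internals): keep the nonzero entries of a vocabulary-sized sparse vector and map each index back to its token string via the tokenizer.

# Illustrative sketch only: decode_sparse_vector and id_to_token are hypothetical names.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
)
id_to_token = {idx: token for token, idx in tokenizer.get_vocab().items()}


def decode_sparse_vector(sparse_vector: list[float]) -> dict[str, float]:
    """Convert a vocabulary-length sparse vector into {token: weight}."""
    return {
        id_to_token[idx]: weight
        for idx, weight in enumerate(sparse_vector)
        if weight > 0
    }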

Note that a meeting has been scheduled to discuss this PR prior to any reviews, to provide a space to demo and discuss how this works (at least to some degree of detail).

This should be considered an initial, untuned implementation, but one that is producing the sparse vectors and decoded token weights for inputs that we know we'll need. There are currently a couple of tickets targeting improvements of this first implementation, and likely more will be opened:

  • USE-137: introduce multiprocessing / parallel processes for creating embeddings of a batch of input records or texts
  • USE-165: tune the model loading + inference for ECS Fargate arm64 containers

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS credentials

2- Set env vars (e.g., in a local .env file, since the commands below use --env-file .env):

HF_HUB_DISABLE_PROGRESS_BARS=true
TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
TE_MODEL_PATH=/tmp/te-model
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset

3- (Re)Download model:

uv run --env-file .env embeddings download-model

Manual testing in IPython shell

4- Start ipython:

uv run --env-file .env ipython

5- Load model for creating embeddings:

import logging
import os

from embeddings.config import configure_logger
from embeddings.embedding import EmbeddingInput
from embeddings.models.registry import get_model_class

logger = logging.getLogger()
configure_logger(logger, verbose=True)

# load embedder class
model_class = get_model_class(os.environ["TE_MODEL_URI"])
model = model_class(os.environ["TE_MODEL_PATH"])
model.load()

Manual Creation of EmbeddingInput objects
manual_embedding_inputs = [
    EmbeddingInput(
        timdex_record_id="abc123",
        run_id="def456",
        run_record_offset=42,
        embedding_strategy="full_record",
        text="Coffee is both a plant and a bean, usually grown at high elevation.",
    )
]

manual_embeddings = list(model.create_embeddings(iter(manual_embedding_inputs)))

Look at the decoded token:weight pairs for the single embedding created:

me1 = manual_embeddings[0]
me1.embedding_token_weights
"""
Out[13]: 
{'a': 0.5694160461425781,
 'and': 0.48665890097618103,
 'is': 0.7656334638595581,
 'as': 0.21244272589683533,
 'an': 0.2317156195640564,
 'but': 0.11755359917879105,
 'or': 0.29590004682540894,
 'also': 0.3128999173641205,
 'what': 0.19876547157764435,
 'where': 0.3369053602218628,
 'between': 0.06990399211645126,
 'both': 0.7280949354171753,
 'high': 0.3671952784061432,
 'same': 0.2660926580429077,
 'come': 0.12210105359554291,
 'land': 0.16826504468917847,
 ...,
 ...,
"""

Now we can look at the top 10 tokens by weight:

sorted(me1.embedding_token_weights.items(), key=lambda item: item[1], reverse=True)[:10]
"""
Out[12]: 
[('coffee', 1.260346531867981),
 ('is', 0.7656334638595581),
 ('both', 0.7280949354171753),
 ('plant', 0.6733094453811646),
 ('caf', 0.613446056842804),
 ('elevation', 0.5893899202346802),
 ('a', 0.5694160461425781),
 ('plants', 0.5592532157897949),
 ('bean', 0.5522097945213318),
 ('beans', 0.5464577078819275)]
"""

EmbeddingInput objects from TIMDEX records
from timdex_dataset_api import TIMDEXDataset
from embeddings.strategies.processor import create_embedding_inputs

# init TIMDEXDataset and retrieve some records
td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])
records = td.read_dicts_iter(source="libguides", limit=3)

# create embedding inputs from TIMDEX records with utility
timdex_embedding_inputs = list(create_embedding_inputs(records, ["full_record"]))

timdex_embeddings = list(model.create_embeddings(iter(timdex_embedding_inputs)))
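
As with the manual example, we can peek at the top tokens for each record (this uses only the embedding_strategy and embedding_token_weights attributes already shown above):

for embedding in timdex_embeddings:
    top_tokens = sorted(
        embedding.embedding_token_weights.items(),
        key=lambda item: item[1],
        reverse=True,
    )[:5]
    print(embedding.embedding_strategy, top_tokens)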

create-embeddings CLI command

Run the following:

uv run --env-file .env embeddings \
--verbose \
create-embeddings \
-d s3://timdex-extract-dev-222053980223/dataset \
--run-id e758d6c4-6ee4-4862-a00f-b9da4d3758ad \
--record-limit 10 \
--strategy full_record \
--output-jsonl output/use136.jsonl

Look at the results at output/use136.jsonl. Note that for only 10 records the file is a pretty sizable 1.6 MB; this is a result of the sparse vectors in JSON form. We may decide not to write these back to the TIMDEX dataset given the unknown usage at this time.
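
A quick, purely illustrative way to peek at that output from Python; the field names below are guesses based on the embedding object's attributes, not a documented schema:

import json

with open("output/use136.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # field names here are assumptions; adjust to the actual JSONL schema
        token_weights = record.get("embedding_token_weights", {})
        vector = record.get("embedding_vector", [])
        print(len(token_weights), "tokens,", len(vector), "vector dimensions")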

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • USE-136: https://mitlibraries.atlassian.net/browse/USE-136

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill changed the base branch from main to USE-131-embedding-input-transform-framework November 4, 2025 19:32
@ghukill ghukill force-pushed the USE-136-implement-create-embeddings branch 2 times, most recently from 3d7062b to d9dba96 on November 4, 2025 20:40
Why these changes are being introduced:

Our first embedding class OSNeuralSparseDocV3GTE was ready for a real
create_embedding() method with the rest of the moving parts mostly complete at this time.

How this addresses that need:

The HuggingFace model card,
https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte,
contains a section of example code for using this model with the transformers library.
This logic was ported over to our class, with model downloading and loading already
handled elsewhere.
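
For orientation, doc-side sparse encoding with transformers generally follows a
SPLADE-style pattern like the rough sketch below. This is a generic illustration, not
the exact model-card code: pooling, activation, special-token handling, and whether
trust_remote_code is needed should all be taken from the model card itself.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

texts = ["Coffee is both a plant and a bean, usually grown at high elevation."]
features = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**features).logits  # (batch, seq_len, vocab_size)

# max-pool token logits over the sequence, ignoring padding positions
values, _ = torch.max(logits * features["attention_mask"].unsqueeze(-1), dim=1)
sparse_vectors = torch.log1p(torch.relu(values))
# zero out weights for special tokens (e.g. [CLS], [SEP])
sparse_vectors[:, tokenizer.all_special_ids] = 0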

The biggest addition is the method _decode_sparse_vectors(), which converts the
numerical sparse vector into a dictionary of token:weight pairs. This decoded token
weight form is what we'll pass directly to OpenSearch.

With the new functionality in place, the tests associated with this class were also
updated.  Fixtures were moved into the testing file, a pattern we could adopt for any
future models to keep them out of the shared conftest.py.

Side effects of this change:
* CLI is capable of producing embeddings for our first model OSNeuralSparseDocV3GTE

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-136
@ghukill ghukill force-pushed the USE-136-implement-create-embeddings branch from d9dba96 to fc0bdea on November 4, 2025 21:20
@ghukill ghukill changed the base branch from USE-131-embedding-input-transform-framework to main November 4, 2025 21:20
@ghukill ghukill marked this pull request as ready for review November 4, 2025 21:30
@ghukill ghukill requested a review from a team November 5, 2025 13:59
@ghukill ghukill requested review from a team and removed request for a team November 5, 2025 19:45
@ghukill ghukill force-pushed the USE-136-implement-create-embeddings branch from b8021a0 to 296b870 on November 5, 2025 20:20
@jonavellecuerdo jonavellecuerdo self-requested a review November 6, 2025 13:53

@ehanson8 ehanson8 left a comment

Works as expected and looks great! One typo and a question from our earlier discussion, but I'm comfortable approving regardless of the answer.

run_record_offset=embedding_input.run_record_offset,
model_uri=self.model_uri,
embedding_strategy=embedding_input.embedding_strategy,
embedding_vector=sparse_vector_list,

Just adding this here from our earlier discussion: even if the vector compresses to nothing, I think we should consider why we're passing it through if we don't have a use case for it. Obviously it's needed for generating token weights but if OpenSearch doesn't use it, I'm not sure why we're storing it on the object. I won't press beyond this comment but it feels like we're keeping an unnecessary precursor in addition to the useful output. Happy to be corrected if that's not the case!

@ghukill ghukill Nov 6, 2025

I'd wager to say it's kind of an endless topic. Your point and hesitance are well founded and registered.

To me, it's a bit of a gamble.

It takes time and money to produce embeddings, even as we tune for performance, and the sparse vectors arguably have more information, given it's a representation of the embedding across the entire vocabulary. Theoretically, we could perform some mathematical operations on the sparse vectors we can't do on the decoded token weights. Other folks, myself included, have some interest in this. Storing them keeps that option on the table.

Additionally, there might be a good argument for only storing those sparse vectors in the future and decoding the data on the way out. This could be much cheaper to store in the long term.

Lastly, hopefully, we'll use a model in the future that produces true dense vectors, which will require storage in that form.

If any of these pan out, I think it'd be nice to have some repetitions and tested schemas for storing this data. Certainly not opposed to removing it a few months down the road as we tune the pipeline, but I'd lobby for keeping it in these early days as we develop our understanding.

But, as I lead with, I don't think there is a right or wrong answer here. We may very well decide it's not useful at some point and cease to store it for this model.

That is all fine, appreciate the discussion! 🙂

@jonavellecuerdo jonavellecuerdo left a comment

This is looking good to me. Expecting to take a closer look with the upcoming tickets. Just one question for you.

def __repr__(self) -> str:  # noqa: D105
    return (
        f"<EmbeddingInput - record:'{self.timdex_record_id}', "
        f"strategy:'{self.embedding_strategy}', text length:{len(self.text)}>"

Hmm, why text length? 🤔

Collaborator Author

Keeping it brief here, but in short, it's just a handy little indicator of how long the text used for the embedding was! This might help when we don't see the original text, maybe unearth instances where it's zero? or huge?

But purely a guess at helpful data for the interactive python/shell environment. Won't have any bearing otherwise.
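
For illustration only, the repr for the manual input created earlier would render roughly like the comment below (the trailing number is just len(text)):

print(repr(manual_embedding_inputs[0]))
# <EmbeddingInput - record:'abc123', strategy:'full_record', text length:67>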

@ghukill ghukill merged commit faca712 into main Nov 6, 2025
2 checks passed