Commit 3810f03
Do not save sparse vectors for OSNeuralSparseDocV3GTE
Why these changes are being introduced:

Our initial pass with the embedding class OSNeuralSparseDocV3GTE saved both the sparse vector and the decoded token:weights. Each sparse vector was the length of the model vocabulary, about 30k entries, mostly zeros. While this could technically support analysis beyond the decoded token:weights given to OpenSearch, the data transfer and storage overhead exceeds any known use case at the moment.

How this addresses that need:

The OSNeuralSparseDocV3GTE embedding model is updated to exclude the sparse vector from the Embedding.embedding_vector property on output. This can easily be turned back on later; an inline code comment shows how to toggle it.

Side effects of this change:

* No sparse vectors are stored for now, so storage is decreased.

Relevant ticket(s):

* None
1 parent 3dd6529 commit 3810f03
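The storage tradeoff described in the commit message can be sketched as follows. This is an illustrative example, not code from the repository: the vocabulary, function, and token names are made up, but it shows why a mostly-zero dense vector over a ~30k vocabulary is far larger than the compact token:weights mapping that is actually sent to OpenSearch.

```python
# Illustrative sketch (names are hypothetical, not from the repository).
# A learned-sparse model scores every term in a ~30k-entry vocabulary,
# but only a handful of terms have nonzero weight for a given document.

VOCAB_SIZE = 30_000


def to_token_weights(dense_vector: list[float], vocab: list[str]) -> dict[str, float]:
    """Keep only nonzero entries as a compact token -> weight mapping."""
    return {vocab[i]: w for i, w in enumerate(dense_vector) if w != 0.0}


vocab = [f"tok_{i}" for i in range(VOCAB_SIZE)]
dense = [0.0] * VOCAB_SIZE
dense[7] = 1.25  # only a few active terms
dense[4321] = 0.4

weights = to_token_weights(dense, vocab)
print(len(dense), len(weights))  # 30000 entries vs 2 entries
```

Persisting only the dict avoids serializing tens of thousands of zeros per record, which is the overhead this commit removes.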

File tree

3 files changed (+10, -7 lines)


embeddings/embedding.py

Lines changed: 2 additions & 2 deletions
@@ -48,8 +48,8 @@ class Embedding:
     run_record_offset: int
     model_uri: str
     embedding_strategy: str
-    embedding_vector: list[float]
-    embedding_token_weights: dict
+    embedding_vector: list[float] | None
+    embedding_token_weights: dict | None
 
     timestamp: datetime.datetime = field(
         default_factory=lambda: datetime.datetime.now(datetime.UTC)

embeddings/models/os_neural_sparse_doc_v3_gte.py

Lines changed: 5 additions & 2 deletions
@@ -247,8 +247,11 @@ def _get_embedding_from_sparse_vector(
     decoded_token_weights = cast("list[tuple[str, float]]", decoded_token_weights)
     embedding_token_weights = dict(decoded_token_weights)
 
-    # prepare sparse vector for JSON serialization
-    embedding_vector = sparse_vector.to_dense().tolist()
+    # # prepare sparse vector for JSON serialization
+    # NOTE: at this time we are NOT including the sparse vector for output. This
+    # block can be uncommented in the future to include it when wanted.
+    # embedding_vector = sparse_vector.to_dense().tolist()  # noqa: ERA001
+    embedding_vector = None
 
     return Embedding(
         timdex_record_id=embedding_input.timdex_record_id,

tests/test_os_neural_sparse_doc_v3_gte.py

Lines changed: 3 additions & 3 deletions
@@ -217,7 +217,7 @@ def test_create_embedding_returns_embedding_object(tmp_path):
     assert embedding.run_record_offset == 42
     assert embedding.model_uri == model.model_uri
     assert embedding.embedding_strategy == "title_only"
-    assert embedding.embedding_vector == pytest.approx([0.1, 0.2])
+    assert embedding.embedding_vector is None
     assert embedding.embedding_token_weights == {"sum": pytest.approx(0.3)}
 
 
@@ -257,6 +257,6 @@ def test_create_embeddings_consumes_iterator_and_returns_embeddings(
     assert len(embeddings) == 2
     assert embeddings[0].timdex_record_id == "id-1"
-    assert embeddings[0].embedding_vector == pytest.approx([0.1, 0.2])
+    assert embeddings[0].embedding_vector is None
     assert embeddings[1].timdex_record_id == "id-2"
-    assert embeddings[1].embedding_vector == pytest.approx([0.3, 0.4])
+    assert embeddings[1].embedding_vector is None
