@jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Nov 19, 2025

Purpose and background context

When the create-embeddings CLI command is called, the default behavior is to write the embeddings back to the same dataset the records were read from. Now that the TDA library includes a method for writing embeddings associated with TIMDEX records to the TIMDEX dataset, the create-embeddings CLI command must be updated to use that method.

How can a reviewer manually see the effects of these changes?

Review the added unit test.

💡 Note: Although the existing tests in the tests/test_cli.py module focus primarily on parameter checking, I felt it was important to confirm that records are written to the TIMDEX dataset.

Includes new or updated dependencies?

YES - Update to timdex-dataset-api==3.6.1

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@jonavellecuerdo jonavellecuerdo force-pushed the USE-138-write-embeddings-to-timdex-dataset branch from 5e3849a to 21c8ccd Compare November 19, 2025 16:22
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review November 19, 2025 16:25
@jonavellecuerdo jonavellecuerdo force-pushed the USE-138-write-embeddings-to-timdex-dataset branch from 21c8ccd to 5751bca Compare November 19, 2025 16:30
Collaborator

@ghukill ghukill left a comment

Good start! Left a couple of questions / requests in the cli.py for a first review pass.

Comment on lines 276 to 291
embeddings_iter = iter(
    [
        DatasetEmbedding(
            timdex_record_id=embedding.timdex_record_id,
            run_id=embedding.run_id,
            run_record_offset=embedding.run_record_offset,
            embedding_model=embedding.model_uri,
            embedding_strategy=embedding.embedding_strategy,
            embedding_vector=embedding.embedding_vector,
            embedding_object=json.dumps(
                embedding.embedding_token_weights
            ).encode(),
        )
        for embedding in embeddings
    ]
)
Collaborator

@ghukill ghukill Nov 19, 2025

I think this might need a small rework to preserve true iteration. It's likely not an issue memory-wise, at least at first, but it feels worth attempting to maintain throughout so it scales well.

Although iter([...]) returns an iterator, the list comprehension inside it runs first and fully consumes the iterator that create_embeddings() returned.
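
To make the difference concrete, here is a minimal, self-contained sketch. The `source()` generator is a hypothetical stand-in for a lazy producer such as `create_embeddings()`; it is not the app's actual code. It records each item it produces so we can see when consumption happens.

```python
consumed = []


def source():
    """Hypothetical stand-in for a lazy producer; records what has been pulled."""
    for i in range(3):
        consumed.append(i)
        yield i


# Eager: the list comprehension drains source() before iter() is even called.
eager = iter([x * 2 for x in source()])
eager_consumed_before_next = list(consumed)

consumed.clear()

# Lazy: a generator expression pulls one item from source() per next() call.
lazy = (x * 2 for x in source())
lazy_consumed_before_next = list(consumed)
first = next(lazy)
lazy_consumed_after_one = list(consumed)
```

In the eager case every item has been pulled before the first `next()`; in the lazy case nothing is pulled until an item is requested.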

This might require a standalone function that:

  • accepts an input iterator of EmbeddingInput's
  • loops through them and yields instances of DatasetEmbedding

You could pass the output of that function, itself an iterator, to TE.write() I believe.
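
A rough sketch of what such a function might look like follows. Everything here is an assumption for illustration: the two dataclasses are simplified stand-ins (the real `DatasetEmbedding` comes from the TDA library, and the source class name `Embedding` and all field types are guesses based on the attributes read in the snippet above), and `to_dataset_embeddings` is a hypothetical name.

```python
import json
from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass
class Embedding:
    """Stand-in for the app's embedding object (field types are assumed)."""

    timdex_record_id: str
    run_id: str
    run_record_offset: int
    model_uri: str
    embedding_strategy: str
    embedding_vector: list[float]
    embedding_token_weights: dict[str, float]


@dataclass
class DatasetEmbedding:
    """Stand-in for the TDA class; not the real timdex-dataset-api definition."""

    timdex_record_id: str
    run_id: str
    run_record_offset: int
    embedding_model: str
    embedding_strategy: str
    embedding_vector: list[float]
    embedding_object: bytes


def to_dataset_embeddings(
    embeddings: Iterable[Embedding],
) -> Iterator[DatasetEmbedding]:
    """Yield one DatasetEmbedding per input item, without building a list."""
    for embedding in embeddings:
        yield DatasetEmbedding(
            timdex_record_id=embedding.timdex_record_id,
            run_id=embedding.run_id,
            run_record_offset=embedding.run_record_offset,
            embedding_model=embedding.model_uri,
            embedding_strategy=embedding.embedding_strategy,
            embedding_vector=embedding.embedding_vector,
            # Token weights are serialized to JSON bytes, as in the reviewed code.
            embedding_object=json.dumps(
                embedding.embedding_token_weights
            ).encode(),
        )
```

Because the function body contains `yield`, calling it returns a generator immediately; each input item is only read when the consumer (for example, a write method that accepts an iterator) asks for the next output item.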

Contributor Author

See changes in here!

jonavellecuerdo added a commit that referenced this pull request Nov 20, 2025
@jonavellecuerdo
Contributor Author

The last two commits are to fix up the uv.lock and pyproject.toml to avoid setting rev (revision) to the main branch!

Collaborator

@ghukill ghukill left a comment

Embedding iterator update looks great.

Left a comment re: TIMDEXDataset initialization; I think it could still be problematic. I continue to be surprised that ruff doesn't surface this...

I'd propose keeping it simple for now. Undoubtedly, this app will get more touches as we go, and we can take some optimization passes at it then.

Collaborator

@ghukill ghukill left a comment

Approved! Great foundation to build from.

Why these changes are being introduced:
* Now that the TDA library includes a method
to write embeddings associated with TIMDEX records
in the TIMDEX dataset, the 'create-embeddings'
CLI command must be updated to use the method.

How this addresses that need:
* Create an iter of TDA DatasetEmbedding objects from
model Embedding objects
* Use TDA TIMDEXEmbeddings.write method

Side effects of this change:
*

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-138
@jonavellecuerdo jonavellecuerdo force-pushed the USE-138-write-embeddings-to-timdex-dataset branch from f30cd8b to d46682a Compare November 20, 2025 16:53
@jonavellecuerdo jonavellecuerdo merged commit 57387fd into main Nov 20, 2025
4 checks passed
@jonavellecuerdo jonavellecuerdo deleted the USE-138-write-embeddings-to-timdex-dataset branch November 20, 2025 17:00