-
Notifications
You must be signed in to change notification settings - Fork 0
Write embeddings to timdex dataset using TDA library #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write embeddings to timdex dataset using TDA library #25
Conversation
5e3849a to
21c8ccd
Compare
21c8ccd to
5751bca
Compare
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start! Left a couple of questions / requests in the cli.py for a first review pass.
embeddings/cli.py
Outdated
| embeddings_iter = iter( | ||
| [ | ||
| DatasetEmbedding( | ||
| timdex_record_id=embedding.timdex_record_id, | ||
| run_id=embedding.run_id, | ||
| run_record_offset=embedding.run_record_offset, | ||
| embedding_model=embedding.model_uri, | ||
| embedding_strategy=embedding.embedding_strategy, | ||
| embedding_vector=embedding.embedding_vector, | ||
| embedding_object=json.dumps( | ||
| embedding.embedding_token_weights | ||
| ).encode(), | ||
| ) | ||
| for embedding in embeddings | ||
| ] | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might need a small rework to preserve true iteration. It's likely not an issue memory-wise, at least at first, but it feels worth attempting to maintain throughout so it scales well.
When iter([...]) is called, while it results in an iterator, the list comprehension does fully consume the iterator that create_embeddings() returned.
This might require a standalone function that:
- accepts an input iterator of
EmbeddingInput's - loops through them and yields instances of
DatasetEmbedding
You could pass the output of that function, itself an iterator, to TE.write() I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See changes in here!
|
The last two commits are to fix up the |
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Embedding iterator update looks great.
Left a comment re: TIMDEXDataset initialization, think it could still be problematic. I continue to be surprised that ruff doesn't surface this...
I'd propsoe just simple for now. Undoubtedly, this app will get more touches as we go and we could take some optimization passes at it.
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved! Great foundation to build from.
Why these changes are being introduced: * Now that the TDA library includes a method to write embeddings associated with TIMDEX records in the TIMDEX dataset, the 'create-embeddings' CLI command must be updated to use the method. How this addresses that need: * Create an iter of TDA DatasetEmbedding objects from model Embedding objects * Use TDA TIMDEXEmbeddings.write method Side effects of this change: * Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-138
f30cd8b to
d46682a
Compare
Purpose and background context
When CLI command create-embeddings is called, the default behavior is to write the embeddings back to the dataset (which was already used to read the records). Now that the TDA library includes a method to write embeddings associated with TIMDEX records in the TIMDEX dataset, the
create-embeddingsCLI command must be updated to use the method.How can a reviewer manually see the effects of these changes?
Review the added unit test.
💡 Note: While I saw that the existing tests
tests/test_cli.pymodule primarily focusing on parameter checking, I felt it was important to confirm that records are written to the TIMDEX datasetIncludes new or updated dependencies?
YES - Update to
timdex-dataset-api==3.6.1Changes expectations for external applications?
NO
What are the relevant tickets?
Code review