-
Notifications
You must be signed in to change notification settings - Fork 0
Add support to read embeddings from the TIMDEX dataset #178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 19945064186Details
💛 - Coveralls |
dc6f63f to
813b950
Compare
|
Did some cleanup as well: Update PR template and add |
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I dig into the actual DuckDB views, particularly around "current" embeddings -- globally and within a run -- I noticed a few logging and docstrings changes I think we might need. Opting to share this as a quick request for changes while continuing my review.
Overall, looking great though!
| logger.debug( | ||
| f"DuckDB data context created, {round(time.perf_counter()-start_time,2)}s" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar story here to a comment below (this was added later), where I think this logging message could be improved.
Perhaps we should say something like,
DEBUG:DuckDB context created for TIMDEXEmbeddings
Then we could also update TIMDEXDataset and TIMDEXDatasetMetadata as well. It'd be clear when each class method finished, what DuckDB context had completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion!
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INFO:timdex_dataset_api.dataset:Dataset successfully loaded: '/tmp/use143-dataset/data/records', 0.01s
DEBUG:timdex_dataset_api.metadata:Attaching to static database file: /tmp/use143-dataset/metadata/metadata.duckdb
DEBUG:timdex_dataset_api.metadata:creating view metadata.append_deltas
DEBUG:timdex_dataset_api.metadata:4 append deltas found
DEBUG:timdex_dataset_api.metadata:creating view metadata.records
DEBUG:timdex_dataset_api.metadata:creating view metadata.current_records
DEBUG:timdex_dataset_api.metadata:DuckDB context created for TIMDEXDatasetMetadata, 0.05s
DEBUG:timdex_dataset_api.utils:SQLAlchemy reflection elapsed: 0.136s
DEBUG:timdex_dataset_api.dataset:DuckDB context created for TIMDEXDataset, 0.0s
DEBUG:timdex_dataset_api.embeddings:creating view data.embeddings
DEBUG:timdex_dataset_api.embeddings:creating view data.current_embeddings
DEBUG:timdex_dataset_api.embeddings:creating view data.current_run_embeddings
DEBUG:timdex_dataset_api.embeddings:DuckDB context created for TIMDEXEmbeddings, 0.01s
DEBUG:timdex_dataset_api.utils:SQLAlchemy reflection elapsed: 0.088s
😍 Thanks for cleaning up the logging, which was a problem that pre-dated this PR. Looks great, and easy to follow what's happening at a DEBUG level.
As mentioned in an off-PR discussion, I think there could be some room for some additional unit tests that look at the edge cases of "current" embeddings vs "current run" embeddings, but it feels like it might be better served as follow-up, targeted work.
Because, this PR takes on a really big and important piece of work of adding a new data source -- embeddings -- to our informal TIMDEX data lake! Thanks for all the discussions along the way and the addition here. This unblocks downstream work like TIM reading embeddings, and gives us lots of new things to test and iterate on.
|
Ack! I should have mentioned / requested sooner, but can we bump it a semantic minor version Jonavelle? I say we bump this TDA library to Maybe we save major versions for how it operates, broken backwards compatibility, etc. |
7e87f89 to
ba3b1a5
Compare
Why these changes are being introduced: The TDA library will be used to read embeddings associated with TIMDEX records in the TIMDEX dataset. How this addresses that need: * Set up DuckDB connection for embeddings query and retrieval * Add read methods that mirror follow the same pattern implemented in the TIMDEXDataset class * Attach TIMDEXEmbeddings to TIMDEXDataset * Add unit tests Side effects of this change: * None Relevant ticket(s): https://mitlibraries.atlassian.net/browse/USE-143
ba3b1a5 to
2ca7b07
Compare
|
Thanks for all the discussion! Learned a lot through this work. 🚀 |
Purpose and background context
This PR introduces the ability to read embeddings from TIMDEX dataset using TDA. In support of this, a series of read methods --following the same pattern of the read methods defined for
TIMDEXDataset(e.g., yielding/returningpyarrow.RecordBatches,dict,pd.DataFrame)--have been implemented for theEmbeddingsclass.Note: In contrast to the
TIMDEXDataset, which uses a metadata layer to pre-filter records to identify parquet filenames and record offsets (indices) to improve performance, theEmbeddingsclass examines all the embedding parquet files when filtering records. It is our intention to revisit the structure of these modules and ensure read methods continue to perform efficiently as the dataset (and embeddings) expand.How can a reviewer manually see the effects of these changes?
Includes new or updated dependencies?
YES - dependencies were updated to resolve the vulnerability:
Changes expectations for external applications?
NO
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/USE-143
Code review
Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.