-
Notifications
You must be signed in to change notification settings - Fork 0
USE 142 - embeddings source and write #175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 19471397498Details
💛 - Coveralls |
Why these changes are being introduced: We will begin storing embeddings associated with TIMDEX records in the TIMDEX dataset, where read and write functionality is managed by this library. We will need to ability to write and read embeddings, but will start with establishing the new data source structure and write methods. Read methods are to follow, and may require some additional linkages with the `TIMDEXDataset` class and read methods. How this addresses that need: This commit adds a new data source for the dataset in the form of the new `TIMDEXEmbeddings` class. This class will encapsulate write and read methods for embeddings, with a composite key of (timdex_record_id, run_id, run_record_offset) tethering the embeddings to specific TIMDEX record versions in the dataset. At this time only a write() method exists, writing embeddings to a `/embeddings` folder in the dataset. Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-142
| pa.field("embedding_model", pa.string()), | ||
| pa.field("embedding_strategy", pa.string()), | ||
| pa.field("embedding_vector", pa.list_(pa.float32())), | ||
| pa.field("embedding_object", pa.binary()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonavellecuerdo - this deviates from the proposed schema. I realized during this work, and on timdex-embeddings, that "object" is a bit more flexible for any non-vector form we may want to store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible we would want to store multiple non-vector forms?
| ( | ||
| pa.field("timdex_record_id", pa.string()), | ||
| pa.field("run_id", pa.string()), | ||
| pa.field("run_record_offset", pa.int32()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
source is removed from the proposed schema. This doubles down on (timdex_record_id, run_id, run_record_offset) being sufficient to tether to record data. Ideally, we don't want to duplicate much of anything available there, here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart!
|
Thanks for the version bump reminder @jonavellecuerdo! |
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! The changes were straightforward and easy to follow, which speaks to great architectural decisions made previously with TIMDEXDataset and TIMDEXMetadata. ✨
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, one non-blocking question
| ( | ||
| pa.field("timdex_record_id", pa.string()), | ||
| pa.field("run_id", pa.string()), | ||
| pa.field("run_record_offset", pa.int32()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart!
| pa.field("embedding_model", pa.string()), | ||
| pa.field("embedding_strategy", pa.string()), | ||
| pa.field("embedding_vector", pa.list_(pa.float32())), | ||
| pa.field("embedding_object", pa.binary()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible we would want to store multiple non-vector forms?
Purpose and background context
This PR introduces the ability to write embeddings to a TIMDEX dataset using TDA. In support of this, this introduces a new
timdex_dataset_api/embeddings.pyfile that will encapsulate most of this functionality. Inside that file, a newTIMDEXEmbeddingsclass.At this time, this class is fairly minimal. It requires an instance of
TIMDEXDataseton init, saved to self, whichi provides functionality and configurations from the "root" dataset class. As we progress into read methods this may be revisited, but works well for now.The only meaningful function at this time is
.write(), which likeTIMDEXDataset.write(), accepts an iterator of items to write. For this new method, it's expecting an iterator ofDatasetEmbeddingwhich is akin to theDatasetRecordclass. Big picture, the mechanics are designed to be largely identical:DatasetRecordorDatasetEmbeddingand prepare an iterator for writing..write()method with that iterator and let TDA handle the rest.As noted, as we move into read methods there may be some fairly substantial changes. We may want a tighter coupling with
TIMDEXDataset, e.g. something likeTIMDEXDataset.embeddingswhich is a composed instance of this newTIMDEXEmbeddingsclass. We may even perform reading with a first step of record reading to get(timdex_record_id, run_id, run_record_offset)data, then re-use that for querying the embeddings themselves. TBD. Mentioning this now only to contextualize this PR which establishes the ability to write embeddings, but some structure is subject to change.How can a reviewer manually see the effects of these changes?
1- Download the following JSONLines file with embeddings created by the new timdex-embeddings CLI application:
s3://timdex-extract-dev-222053980223/dataset/data/sandbox/dspace-100-embeddings.jsonl. Save tooutput/folder in this project (git ignored).2- Open ipython shell:
3- Load
TIMDEXDatasetandTIMDEXEmbeddingsinstances:4- Prepare embeddings for writing:
5- Write embeddings to dataset:
6- With the
TIMDEXDatasetduckdb connection, confirm rows are written to parquet file:The final results at
/tmp/tda-use142looks similar to this, showing a successful write of embeddings as parquet file(s) to a new/data/embeddingslocation:Includes new or updated dependencies?
YES: notably,
duckdbis back on a stable versionChanges expectations for external applications?
NO
What are the relevant tickets?