@jonavellecuerdo jonavellecuerdo commented Dec 3, 2025

Purpose and background context

This PR introduces the ability to read embeddings from the TIMDEX dataset using TDA. In support of this, a series of read methods has been implemented for the Embeddings class. They follow the same pattern as the read methods defined for TIMDEXDataset (e.g., yielding or returning pyarrow.RecordBatch objects, dicts, or a pd.DataFrame).

Note: The TIMDEXDataset uses a metadata layer to pre-filter records, identifying the parquet filenames and record offsets (indices) to read, which improves performance. In contrast, the Embeddings class examines all the embedding parquet files when filtering records. We intend to revisit the structure of these modules and ensure the read methods continue to perform efficiently as the dataset (and its embeddings) grow.
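
To make the trade-off in the note above concrete, here is a minimal, library-free sketch. It is not TDA code: the in-memory `FILES` mapping stands in for parquet files, and the function names, filenames, and record ids are all hypothetical. It only illustrates why a metadata index lets a reader open fewer files than a full scan.

```python
from collections import defaultdict

# Hypothetical stand-in for parquet files: filename -> list of rows.
FILES = {
    "run-a.parquet": [
        {"timdex_record_id": "alma:1", "run_id": "run-a"},
        {"timdex_record_id": "alma:2", "run_id": "run-a"},
    ],
    "run-b.parquet": [
        {"timdex_record_id": "alma:1", "run_id": "run-b"},
        {"timdex_record_id": "dspace:9", "run_id": "run-b"},
    ],
    "run-c.parquet": [
        {"timdex_record_id": "dspace:9", "run_id": "run-c"},
    ],
}


def build_metadata_index(files):
    """Map each record id to the (filename, row offset) pairs that hold it."""
    index = defaultdict(list)
    for filename, rows in files.items():
        for offset, row in enumerate(rows):
            index[row["timdex_record_id"]].append((filename, offset))
    return index


def read_with_full_scan(files, record_id):
    """Open every file and filter its rows (no pre-filtering)."""
    opened, matches = [], []
    for filename, rows in files.items():
        opened.append(filename)
        matches.extend(r for r in rows if r["timdex_record_id"] == record_id)
    return matches, opened


def read_with_metadata(files, index, record_id):
    """Use the index to open only the files and offsets known to match."""
    opened, matches = [], []
    for filename, offset in index.get(record_id, []):
        if filename not in opened:
            opened.append(filename)
        matches.append(files[filename][offset])
    return matches, opened


index = build_metadata_index(FILES)
scan_rows, scan_opened = read_with_full_scan(FILES, "dspace:9")
meta_rows, meta_opened = read_with_metadata(FILES, index, "dspace:9")
```

Both readers return the same rows, but the full scan opens all three files while the index-based reader opens only the two that contain the record.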

How can a reviewer manually see the effects of these changes?

  1. Review unit tests [High-level review]
  2. Follow the instructions in Explore the TIMDEX Dataset Locally
    te.conn.query("select * from data.current_run_embeddings") # retrieves full set of records (n=30)
    te.conn.query("select * from data.current_embeddings") # retrieves subset of records (n=20)
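
The two views above return different row counts. The PR does not spell out their definitions, so the following sketch rests on an assumption: that "current run" keeps the latest embedding per record within each run, while "current" keeps only the latest embedding per record overall. The column names and sample rows are hypothetical and are not the dataset's schema.

```python
# Hypothetical rows; field names are assumptions for illustration only.
rows = [
    {"timdex_record_id": "alma:1", "run_id": "run-1", "timestamp": 1},
    {"timdex_record_id": "alma:1", "run_id": "run-1", "timestamp": 2},
    {"timdex_record_id": "alma:1", "run_id": "run-2", "timestamp": 3},
    {"timdex_record_id": "alma:2", "run_id": "run-1", "timestamp": 1},
]


def latest_by(rows, key_fields):
    """Keep the row with the highest timestamp for each distinct key."""
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())


# Assumed "current run" semantics: latest embedding per record within each run.
current_run = latest_by(rows, ["timdex_record_id", "run_id"])

# Assumed "current" semantics: latest embedding per record across all runs.
current = latest_by(rows, ["timdex_record_id"])
```

Under these assumed semantics the "current" set is never larger than the "current run" set, which is at least consistent with the counts noted in the queries above.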

Includes new or updated dependencies?

YES - dependencies were updated to resolve the vulnerability:

Found 1 known vulnerability in 1 package
Name     Version ID                  Fix Versions
-------- ------- ------------------- ------------
werkzeug 3.1.3   GHSA-hgf8-39gv-g3f2 3.1.4

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-143

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@jonavellecuerdo jonavellecuerdo changed the title from "Add functionality to read embeddings" to "Add support to read embeddings" Dec 3, 2025
coveralls commented Dec 3, 2025

Pull Request Test Coverage Report for Build 19945064186

Details

  • 83 of 90 (92.22%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 93.503%

Changes Missing Coverage         Covered Lines Changed/Added Lines %
-------------------------------- ------------- ------------------- ------
timdex_dataset_api/embeddings.py 72            79                  91.14%

Totals
Change from base Build 19505892476: -0.3%
Covered Lines: 662
Relevant Lines: 708

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo changed the title from "Add support to read embeddings" to "Add support to read embeddings from the TIMDEX dataset" Dec 3, 2025
@jonavellecuerdo jonavellecuerdo force-pushed the USE-143-embeddings-read branch from dc6f63f to 813b950 Compare December 4, 2025 13:50
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review December 4, 2025 14:57
@jonavellecuerdo jonavellecuerdo requested review from a team and ghukill December 4, 2025 14:57
@jonavellecuerdo
Contributor Author

Did some cleanup as well: Update PR template and add CODEOWNERS file.
@ghukill If we want to automatically assign mitlibraries/dataeng as the reviewer(s): Settings > Branches > main (Edit) > Check 'Require review from Code Owners' 🤔

@ghukill ghukill self-assigned this Dec 4, 2025

@ghukill ghukill left a comment

While I dig into the actual DuckDB views, particularly around "current" embeddings (globally and within a run), I noticed a few logging and docstring changes I think we might need. I'm sharing this as a quick request for changes while I continue my review.

Overall, looking great though!

Comment on lines 170 to 173
logger.debug(
    f"DuckDB data context created, {round(time.perf_counter()-start_time,2)}s"
)

Similar story here to a comment below (this was added later), where I think this logging message could be improved.

Perhaps we should say something like,

DEBUG:DuckDB context created for TIMDEXEmbeddings

Then we could update TIMDEXDataset and TIMDEXDatasetMetadata the same way. It would be clear, as each class's setup finishes, which DuckDB context had been created.
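
One way to get a class-specific message like this is to build it from the class name, so every class that sets up a DuckDB context logs in the same format. This is only a sketch: the mixin, its method names, and the logger name are hypothetical and not part of the library, and the view creation is a placeholder.

```python
import logging
import time

# Hypothetical logger name, used for illustration only.
logger = logging.getLogger("timdex_dataset_api.sketch")


class DuckDBContextMixin:
    """Hypothetical mixin that logs which class finished its DuckDB setup."""

    def setup_duckdb_context(self) -> str:
        start_time = time.perf_counter()
        self._create_views()
        elapsed = round(time.perf_counter() - start_time, 2)
        # The class name makes the message unambiguous across classes.
        message = f"DuckDB context created for {type(self).__name__}, {elapsed}s"
        logger.debug(message)
        return message

    def _create_views(self) -> None:
        """Placeholder: a real class would create its DuckDB views here."""


class TIMDEXEmbeddings(DuckDBContextMixin):
    """Stand-in class; only the name matters for this sketch."""


class TIMDEXDatasetMetadata(DuckDBContextMixin):
    """Stand-in class; only the name matters for this sketch."""


embeddings_message = TIMDEXEmbeddings().setup_duckdb_context()
metadata_message = TIMDEXDatasetMetadata().setup_duckdb_context()
```

Returning the message is a convenience for the sketch so the text can be inspected without configuring a log handler.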

Contributor Author

Thanks for the suggestion!

jonavellecuerdo added a commit that referenced this pull request Dec 4, 2025

@ghukill ghukill left a comment

INFO:timdex_dataset_api.dataset:Dataset successfully loaded: '/tmp/use143-dataset/data/records', 0.01s
DEBUG:timdex_dataset_api.metadata:Attaching to static database file: /tmp/use143-dataset/metadata/metadata.duckdb
DEBUG:timdex_dataset_api.metadata:creating view metadata.append_deltas
DEBUG:timdex_dataset_api.metadata:4 append deltas found
DEBUG:timdex_dataset_api.metadata:creating view metadata.records
DEBUG:timdex_dataset_api.metadata:creating view metadata.current_records
DEBUG:timdex_dataset_api.metadata:DuckDB context created for TIMDEXDatasetMetadata, 0.05s
DEBUG:timdex_dataset_api.utils:SQLAlchemy reflection elapsed: 0.136s
DEBUG:timdex_dataset_api.dataset:DuckDB context created for TIMDEXDataset, 0.0s
DEBUG:timdex_dataset_api.embeddings:creating view data.embeddings
DEBUG:timdex_dataset_api.embeddings:creating view data.current_embeddings
DEBUG:timdex_dataset_api.embeddings:creating view data.current_run_embeddings
DEBUG:timdex_dataset_api.embeddings:DuckDB context created for TIMDEXEmbeddings, 0.01s
DEBUG:timdex_dataset_api.utils:SQLAlchemy reflection elapsed: 0.088s

😍 Thanks for cleaning up the logging, which was a problem that pre-dated this PR. Looks great, and easy to follow what's happening at a DEBUG level.

As mentioned in an off-PR discussion, I think there could be some room for some additional unit tests that look at the edge cases of "current" embeddings vs "current run" embeddings, but it feels like it might be better served as follow-up, targeted work.

That's because this PR takes on a really big and important piece of work: adding a new data source, embeddings, to our informal TIMDEX data lake! Thanks for all the discussions along the way and the addition here. This unblocks downstream work like TIM reading embeddings, and gives us lots of new things to test and iterate on.

ghukill commented Dec 4, 2025

Ack! I should have mentioned / requested this sooner, but can we bump it a semantic minor version, Jonavelle? I say we bump this TDA library to 3.7.0 given this is an entirely new data source. Since we don't have any hard and fast rules on semantic versioning for this library, it feels open to debate, but I think it deserves a minor version bump at the least.

Maybe we save major versions for how it operates, broken backwards compatibility, etc.

@jonavellecuerdo jonavellecuerdo force-pushed the USE-143-embeddings-read branch from 7e87f89 to ba3b1a5 Compare December 4, 2025 21:48
Why these changes are being introduced:
The TDA library will be used to read embeddings associated
with TIMDEX records in the TIMDEX dataset.

How this addresses that need:
* Set up DuckDB connection for embeddings query and retrieval
* Add read methods that follow the same pattern implemented
in the TIMDEXDataset class
* Attach TIMDEXEmbeddings to TIMDEXDataset
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
https://mitlibraries.atlassian.net/browse/USE-143
@jonavellecuerdo jonavellecuerdo force-pushed the USE-143-embeddings-read branch from ba3b1a5 to 2ca7b07 Compare December 4, 2025 21:49
@jonavellecuerdo
Contributor Author

Thanks for all the discussion! Learned a lot through this work. 🚀

@jonavellecuerdo jonavellecuerdo merged commit 2ff19da into main Dec 4, 2025
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the USE-143-embeddings-read branch December 4, 2025 21:53