Add support to read embeddings from the TIMDEX dataset #178

jonavellecuerdo · 2025-12-03T20:11:46Z

Purpose and background context

This PR introduces the ability to read embeddings from the TIMDEX dataset using TDA. In support of this, a series of read methods has been implemented for the Embeddings class, following the same pattern as the read methods defined for TIMDEXDataset (e.g., yielding or returning pyarrow.RecordBatch objects, dicts, or a pd.DataFrame).
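
As a rough illustration of the layered read pattern described above, here is a minimal sketch. It is not the TDA implementation: the class name `SketchEmbeddingsReader`, its method names, and the use of plain lists of dicts in place of pyarrow.RecordBatch and pd.DataFrame are all assumptions made so the example runs with the standard library only.

```python
from collections.abc import Iterator


class SketchEmbeddingsReader:
    """Hypothetical stand-in for an embeddings reader (not the TDA class)."""

    def __init__(self, rows: list[dict], batch_size: int = 2) -> None:
        self.rows = rows
        self.batch_size = batch_size

    def read_batches_iter(self) -> Iterator[list[dict]]:
        # Lowest-level method: yield fixed-size batches of rows.
        # A list of dicts stands in for a pyarrow.RecordBatch here.
        for start in range(0, len(self.rows), self.batch_size):
            yield self.rows[start : start + self.batch_size]

    def read_dicts_iter(self) -> Iterator[dict]:
        # Built on the batch iterator: flatten batches into single rows.
        for batch in self.read_batches_iter():
            yield from batch

    def read_all(self) -> list[dict]:
        # Built on the dict iterator: materialize everything at once.
        # A list stands in for a pd.DataFrame here.
        return list(self.read_dicts_iter())
```

The point of the layering is that only the batch iterator touches the underlying data, so filtering or performance changes made there would carry through to the higher-level methods.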

Note: In contrast to the TIMDEXDataset, which uses a metadata layer to pre-filter records to identify parquet filenames and record offsets (indices) to improve performance, the Embeddings class examines all the embedding parquet files when filtering records. It is our intention to revisit the structure of these modules and ensure read methods continue to perform efficiently as the dataset (and embeddings) expand.

How can a reviewer manually see the effects of these changes?

Review unit tests [High-level review]

Follow the instructions in Explore the TIMDEX Dataset Locally

Create the TIMDEX Dataset Locally
Useful Code Snippets | Write and read sample embeddings to TIMDEX Dataset
Run additional queries:

te.conn.query("select * from data.current_run_embeddings") # retrieves full set of records (n=30)
te.conn.query("select * from data.current_embeddings") # retrieves subset of records (n=20)
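
To make the difference between the two views easier to reason about, here is a small standard-library sketch. It rests on an assumption that is not confirmed anywhere in this PR: that "current run" means the most recent embedding per record within each run, while "current" means the most recent embedding per record across all runs. The column names and sample rows are hypothetical as well.

```python
# Hypothetical sample rows; column names are assumptions, not the real schema.
rows = [
    {"timdex_record_id": "a", "run_id": "run-1", "embedding_timestamp": "2025-12-01T00:00:00"},
    {"timdex_record_id": "b", "run_id": "run-1", "embedding_timestamp": "2025-12-01T00:00:00"},
    {"timdex_record_id": "a", "run_id": "run-1", "embedding_timestamp": "2025-12-01T06:00:00"},
    {"timdex_record_id": "a", "run_id": "run-2", "embedding_timestamp": "2025-12-02T00:00:00"},
]


def latest_by(rows: list[dict], key_fields: list[str]) -> list[dict]:
    """Keep only the newest row for each distinct combination of key_fields."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if key not in latest or row["embedding_timestamp"] > latest[key]["embedding_timestamp"]:
            latest[key] = row
    return list(latest.values())


# Assumed "current run" semantics: newest embedding per record within each run.
current_run = latest_by(rows, ["timdex_record_id", "run_id"])

# Assumed "current" semantics: newest embedding per record across all runs.
current = latest_by(rows, ["timdex_record_id"])
```

Under these assumed semantics the "current run" result can never be smaller than the "current" result, which is at least consistent with the 30 versus 20 row counts noted in the comments above.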

Includes new or updated dependencies?

YES - dependencies were updated to resolve the vulnerability:

Found 1 known vulnerability in 1 package
Name     Version ID                  Fix Versions
-------- ------- ------------------- ------------
werkzeug 3.1.3   GHSA-hgf8-39gv-g3f2 3.1.4

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-143

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

coveralls · 2025-12-03T20:14:06Z

Pull Request Test Coverage Report for Build 19945064186

Details

83 of 90 (92.22%) changed or added relevant lines in 5 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.3%) to 93.503%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
timdex_dataset_api/embeddings.py	72	79	91.14%

Totals
Change from base Build 19505892476:	-0.3%
Covered Lines:	662
Relevant Lines:	708

💛 - Coveralls

jonavellecuerdo · 2025-12-04T15:05:47Z

Did some cleanup as well: Update PR template and add CODEOWNERS file.
@ghukill If we want to automatically assign mitlibraries/dataeng as the reviewer(s): Settings > Branches > main (Edit) > Check 'Require review from Code Owners' 🤔

ghukill

While I dig into the actual DuckDB views, particularly around "current" embeddings -- globally and within a run -- I noticed a few logging and docstring changes I think we might need. Opting to share this as a quick request for changes while continuing my review.

Overall, looking great though!

timdex_dataset_api/embeddings.py

ghukill · 2025-12-04T15:51:16Z

timdex_dataset_api/embeddings.py

+        logger.debug(
+            f"DuckDB data context created, {round(time.perf_counter()-start_time,2)}s"
+        )


Similar story here to a comment below (this was added later), where I think this logging message could be improved.

Perhaps we should say something like,

DEBUG:DuckDB context created for TIMDEXEmbeddings

Then we could also update TIMDEXDataset and TIMDEXDatasetMetadata as well. That way, as each class finished its setup, it would be clear which DuckDB context had completed.

Thanks for the suggestion!
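
For reference, a minimal sketch of the suggested message format, in which the class name is included alongside the elapsed time. The helper `log_context_created` and the stand-in class are hypothetical; only the message text mirrors what is proposed in this thread.

```python
import logging
import time

logger = logging.getLogger(__name__)


def log_context_created(obj: object, start_time: float) -> str:
    """Log (and return) a DEBUG message naming the class whose context finished."""
    elapsed = round(time.perf_counter() - start_time, 2)
    message = f"DuckDB context created for {obj.__class__.__name__}, {elapsed}s"
    logger.debug(message)
    return message


class TIMDEXEmbeddings:
    """Stand-in class for illustration only; not the real TDA class."""

    def __init__(self) -> None:
        start_time = time.perf_counter()
        # ... view creation would happen here in a real implementation ...
        self.last_message = log_context_created(self, start_time)
```

Deriving the name from `obj.__class__.__name__` means the same helper could be reused by TIMDEXDataset and TIMDEXDatasetMetadata without hard-coding each class name.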

ghukill

INFO:timdex_dataset_api.dataset:Dataset successfully loaded: '/tmp/use143-dataset/data/records', 0.01s
DEBUG:timdex_dataset_api.metadata:Attaching to static database file: /tmp/use143-dataset/metadata/metadata.duckdb
DEBUG:timdex_dataset_api.metadata:creating view metadata.append_deltas
DEBUG:timdex_dataset_api.metadata:4 append deltas found
DEBUG:timdex_dataset_api.metadata:creating view metadata.records
DEBUG:timdex_dataset_api.metadata:creating view metadata.current_records
DEBUG:timdex_dataset_api.metadata:DuckDB context created for TIMDEXDatasetMetadata, 0.05s
DEBUG:timdex_dataset_api.utils:SQLAlchemy reflection elapsed: 0.136s
DEBUG:timdex_dataset_api.dataset:DuckDB context created for TIMDEXDataset, 0.0s
DEBUG:timdex_dataset_api.embeddings:creating view data.embeddings
DEBUG:timdex_dataset_api.embeddings:creating view data.current_embeddings
DEBUG:timdex_dataset_api.embeddings:creating view data.current_run_embeddings
DEBUG:timdex_dataset_api.embeddings:DuckDB context created for TIMDEXEmbeddings, 0.01s
DEBUG:timdex_dataset_api.utils:SQLAlchemy reflection elapsed: 0.088s

😍 Thanks for cleaning up the logging, which was a problem that pre-dated this PR. Looks great, and easy to follow what's happening at a DEBUG level.

As mentioned in an off-PR discussion, I think there could be some room for some additional unit tests that look at the edge cases of "current" embeddings vs "current run" embeddings, but it feels like it might be better served as follow-up, targeted work.

That's because this PR takes on a really big and important piece of work: adding a new data source, embeddings, to our informal TIMDEX data lake! Thanks for all the discussions along the way and the addition here. This unblocks downstream work like TIM reading embeddings, and gives us lots of new things to test and iterate on.

ghukill · 2025-12-04T20:51:15Z

Ack! I should have mentioned / requested sooner, but can we bump it by a semantic minor version, Jonavelle? I say we bump this TDA library to 3.7.0 given this is an entirely new data source. Given we don't have any hard and fast rules on semantic versioning for this library, it feels open to debate, but I think it deserves a minor version bump at the least.

Maybe we save major versions for how it operates, broken backwards compatibility, etc.

Why these changes are being introduced:
The TDA library will be used to read embeddings associated with TIMDEX records in the TIMDEX dataset.

How this addresses that need:
* Set up DuckDB connection for embeddings query and retrieval
* Add read methods that follow the same pattern implemented in the TIMDEXDataset class
* Attach TIMDEXEmbeddings to TIMDEXDataset
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
https://mitlibraries.atlassian.net/browse/USE-143

jonavellecuerdo · 2025-12-04T21:50:24Z

Thanks for all the discussion! Learned a lot through this work. 🚀

jonavellecuerdo changed the title ~~Add functionality to read embeddings~~ Add support to read embeddings Dec 3, 2025

jonavellecuerdo changed the title ~~Add support to read embeddings~~ Add support to read embeddings from the TIMDEX dataset Dec 3, 2025

jonavellecuerdo force-pushed the USE-143-embeddings-read branch from dc6f63f to 813b950 Compare December 4, 2025 13:50

jonavellecuerdo marked this pull request as ready for review December 4, 2025 14:57

jonavellecuerdo requested review from a team and ghukill December 4, 2025 14:57

ghukill self-assigned this Dec 4, 2025

ghukill requested changes Dec 4, 2025

View reviewed changes

jonavellecuerdo added a commit that referenced this pull request Dec 4, 2025

Address comments in PR #178

b84343c

jonavellecuerdo requested a review from ghukill December 4, 2025 18:59

ghukill approved these changes Dec 4, 2025

View reviewed changes

jonavellecuerdo force-pushed the USE-143-embeddings-read branch from 7e87f89 to ba3b1a5 Compare December 4, 2025 21:48

jonavellecuerdo added 5 commits December 4, 2025 16:49

Update dependencies

bd7677b

Update PR template

6d1f45a

Add CODEOWNERS file

fa846bc

Update version number to 3.7.0

2ca7b07

jonavellecuerdo force-pushed the USE-143-embeddings-read branch from ba3b1a5 to 2ca7b07 Compare December 4, 2025 21:49

jonavellecuerdo merged commit 2ff19da into main Dec 4, 2025
2 checks passed

jonavellecuerdo deleted the USE-143-embeddings-read branch December 4, 2025 21:53
