USE 142 - embeddings source and write #175

ghukill · 2025-11-17T16:47:39Z

Purpose and background context

This PR introduces the ability to write embeddings to a TIMDEX dataset using TDA. In support of this, it adds a new timdex_dataset_api/embeddings.py file that will encapsulate most of this functionality, and inside that file a new TIMDEXEmbeddings class.

At this time, this class is fairly minimal. It requires an instance of TIMDEXDataset on init, saved to self, which provides functionality and configurations from the "root" dataset class. As we progress into read methods this may be revisited, but it works well for now.

The only meaningful method at this time is .write(), which, like TIMDEXDataset.write(), accepts an iterator of items to write. This new method expects an iterator of DatasetEmbedding instances, a dataclass akin to DatasetRecord. Big picture, the mechanics are designed to be largely identical:

Init a class that represents that data source in the dataset (this being only our second, but more will come, e.g. fulltext).
Users of the library will import dataclasses like DatasetRecord or DatasetEmbedding and prepare an iterator for writing.
Call the appropriate .write() method with that iterator and let TDA handle the rest.
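The three steps above can be sketched with stand-in classes. This is only an illustration of the pattern, not TDA's implementation: `EmbeddingRow`, `EmbeddingsSource`, the `batch_size` parameter, and the in-memory `batches` list are all hypothetical names introduced here, and the row values are placeholders.

```python
from collections.abc import Iterator
from dataclasses import dataclass
from itertools import islice


@dataclass
class EmbeddingRow:
    """Hypothetical stand-in for DatasetEmbedding (fields trimmed for brevity)."""

    timdex_record_id: str
    run_id: str
    run_record_offset: int


class EmbeddingsSource:
    """Hypothetical stand-in for a data source class such as TIMDEXEmbeddings."""

    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size
        self.batches: list[list[EmbeddingRow]] = []

    def write(self, rows: Iterator[EmbeddingRow]) -> int:
        """Consume the iterator in batches and return the number of rows seen."""
        total = 0
        while batch := list(islice(rows, self.batch_size)):
            # A real writer would serialize each batch to parquet here.
            self.batches.append(batch)
            total += len(batch)
        return total


def row_iter(count: int) -> Iterator[EmbeddingRow]:
    """Step 2: the caller prepares an iterator of dataclass instances."""
    for offset in range(count):
        yield EmbeddingRow(f"example:{offset}", "run-placeholder", offset)


source = EmbeddingsSource(batch_size=2)  # step 1: init the data source
written = source.write(row_iter(5))      # step 3: hand the iterator to write()
```

Because write() only ever pulls from an iterator, the caller never needs to hold the full set of rows in memory.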

As noted, as we move into read methods there may be some fairly substantial changes. We may want a tighter coupling with TIMDEXDataset, e.g. something like TIMDEXDataset.embeddings which is a composed instance of this new TIMDEXEmbeddings class. We may even perform reading with a first step of record reading to get (timdex_record_id, run_id, run_record_offset) data, then re-use that for querying the embeddings themselves. TBD. Mentioning this now only to contextualize this PR which establishes the ability to write embeddings, but some structure is subject to change.

How can a reviewer manually see the effects of these changes?

1- Download the following JSONLines file with embeddings created by the new timdex-embeddings CLI application: s3://timdex-extract-dev-222053980223/dataset/data/sandbox/dspace-100-embeddings.jsonl. Save to output/ folder in this project (git ignored).

2- Open ipython shell:

pipenv run ipython

3- Load TIMDEXDataset and TIMDEXEmbeddings instances:

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.embeddings import TIMDEXEmbeddings
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset("/tmp/tda-use142")  # expected and okay this does not exist yet
te = TIMDEXEmbeddings(td)

4- Prepare embeddings for writing:

import json

from timdex_dataset_api.embeddings import TIMDEXEmbeddings, DatasetEmbedding

embeddings = []
with open("output/dspace-100-embeddings.jsonl") as f:
    for line in f:  # iterate lazily instead of loading every line with readlines()
        record = json.loads(line)
        embeddings.append(
            DatasetEmbedding(
                timdex_record_id=record["timdex_record_id"],
                run_id=record["run_id"],
                run_record_offset=record["run_record_offset"],
                embedding_model=record["model_uri"],
                embedding_strategy=record["embedding_strategy"],
                timestamp=record["timestamp"],
                embedding_vector=record.get("embedding_vector"),
                embedding_object=json.dumps(  # note the renamed "embedding_object" field here
                    record.get("embedding_token_weights")
                ).encode(),
            )
        )

This simulates what the timdex-embeddings CLI will do as it prepares to write, without the embeddings ever existing in a file as an interim step.
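Since .write() accepts an iterator, the same preparation could also be done lazily with a generator rather than building a list first. The following is a minimal sketch under that assumption; `iter_embedding_kwargs` is a hypothetical helper, it yields plain keyword-argument dicts (which could be passed as `DatasetEmbedding(**kwargs)`) so that it runs without the library installed, and the sample line contains made-up placeholder values.

```python
import io
import json
from collections.abc import Iterator
from typing import IO, Any


def iter_embedding_kwargs(lines: IO[str]) -> Iterator[dict[str, Any]]:
    """Yield one dict of DatasetEmbedding keyword arguments per JSONLines row."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        yield {
            "timdex_record_id": row["timdex_record_id"],
            "run_id": row["run_id"],
            "run_record_offset": row["run_record_offset"],
            "embedding_model": row["model_uri"],
            "embedding_strategy": row["embedding_strategy"],
            "timestamp": row["timestamp"],
            "embedding_vector": row.get("embedding_vector"),
            # token weights are serialized to JSON bytes for the binary column
            "embedding_object": json.dumps(
                row.get("embedding_token_weights")
            ).encode(),
        }


# Placeholder row for illustration only; not taken from the real file.
sample = io.StringIO(
    json.dumps(
        {
            "timdex_record_id": "example:1",
            "run_id": "run-placeholder",
            "run_record_offset": 0,
            "model_uri": "example-model",
            "embedding_strategy": "full_record",
            "timestamp": "2025-11-17T13:51:00+00:00",
            "embedding_token_weights": {"token": 0.5},
        }
    )
    + "\n\n"
)
prepared = list(iter_embedding_kwargs(sample))
```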

5- Write embeddings to dataset:

files_written = te.write(iter(embeddings))

6- With the TIMDEXDataset duckdb connection, confirm rows are written to parquet file:

td.conn.query(f"""select * from '{files_written[0].path}';""")
"""
┌──────────────────────┬──────────────────────┬───────────────────┬──────────────────────┬──────────────────────┬────────────────────┬──────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────┬───────┬───────┐
│   timdex_record_id   │        run_id        │ run_record_offset │      timestamp       │   embedding_model    │ embedding_strategy │ embedding_vector │                                                 embedding_object                                                 │  day  │ month │ year  │
│       varchar        │       varchar        │       int32       │ timestamp with tim…  │       varchar        │      varchar       │     float[]      │                                                       blob                                                       │ int64 │ int64 │ int64 │
├──────────────────────┼──────────────────────┼───────────────────┼──────────────────────┼──────────────────────┼────────────────────┼──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┼───────┼───────┤
│ dspace:1721.1-2712   │ 60b76094-5412-4f4b…  │            100000 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22sticky\x22: 0.8922109603881836, \x22local\x22: 0.7325811982154846, \x22ds\x22: 0.720859169960022, \x22mit…  │    17 │    11 │  2025 │
│ dspace:1721.1-106575 │ 60b76094-5412-4f4b…  │            100001 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22distance\x22: 0.8653345704078674, \x22lb\x22: 0.7897465825080872, \x22loop\x22: 0.7712590098381042, \x22t…  │    17 │    11 │  2025 │
│ dspace:1721.1-130098 │ 60b76094-5412-4f4b…  │            100002 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22stroke\x22: 1.0930368900299072, \x22strokes\x22: 0.7778793573379517, \x22nl\x22: 0.676250696182251, \x22g…  │    17 │    11 │  2025 │
│ dspace:1721.1-116230 │ 60b76094-5412-4f4b…  │            100003 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22f\x22: 0.9276513457298279, \x226\x22: 0.7267107367515564, \x22u\x22: 0.7204970121383667, \x22charge\x22: …  │    17 │    11 │  2025 │
│ dspace:1721.1-15162  │ 60b76094-5412-4f4b…  │            100004 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22time\x22: 0.7713623642921448, \x22acceleration\x22: 0.7129517197608948, \x22ga\x22: 0.7012690305709839, \…  │    17 │    11 │  2025 │
│ dspace:1721.1-63397  │ 60b76094-5412-4f4b…  │            100005 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22international\x22: 0.763080894947052, \x22ds\x22: 0.7044110298156738, \x22mit\x22: 0.6805474758148193, \x…  │    17 │    11 │  2025 │
│ dspace:1721.1-114893 │ 60b76094-5412-4f4b…  │            100006 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22brain\x22: 0.8636446595191956, \x22cancer\x22: 0.8300579190254211, \x22##ic\x22: 0.8218355774879456, \x22…  │    17 │    11 │  2025 │
│ dspace:1721.1-78915  │ 60b76094-5412-4f4b…  │            100007 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22radiation\x22: 0.8831801414489746, \x22interaction\x22: 0.6725037693977356, \x22matter\x22: 0.66096931695…  │    17 │    11 │  2025 │
│ dspace:1721.1-157848 │ 60b76094-5412-4f4b…  │            100008 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22voice\x22: 0.7316490411758423, \x22anonymous\x22: 0.7303279042243958, \x22##ony\x22: 0.706677258014679, \…  │    17 │    11 │  2025 │
│ dspace:1721.1-38845  │ 60b76094-5412-4f4b…  │            100009 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22##rm\x22: 0.6709635853767395, \x22ds\x22: 0.661644458770752, \x22massachusetts\x22: 0.6469870805740356, \…  │    17 │    11 │  2025 │
├──────────────────────┴──────────────────────┴───────────────────┴──────────────────────┴──────────────────────┴────────────────────┴──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┴───────┴───────┤
│ 10 rows                                                                                                                                                                                                                                                                               11 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
"""
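In the output above, DuckDB renders the blob column with `\x22` escapes; 0x22 is the byte for a double quote, so each embedding_object value is the JSON produced in step 4. A minimal sketch of decoding such a value back into token weights, using the first two weights shown for dspace:1721.1-2712 (the variable names are illustrative only):

```python
import json

# Bytes as they would be stored in embedding_object (truncated to two tokens).
blob = b'{"sticky": 0.8922109603881836, "local": 0.7325811982154846}'

# Decode the bytes and parse the JSON back into a token -> weight mapping.
token_weights = json.loads(blob.decode("utf-8"))

# Example use: find the highest-weighted token.
top_token = max(token_weights, key=token_weights.get)
```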

The final result at /tmp/tda-use142 looks similar to this, showing a successful write of embeddings as parquet file(s) to a new /data/embeddings location:

tda-use142
├── data
│   ├── embeddings
│   │   └── year=2025
│   │       └── month=11
│   │           └── day=17
│   │               └── 51ead940-71a8-4157-8cfd-268bab7e41fb-0.parquet
│   └── records
└── metadata
    └── append_deltas

Includes new or updated dependencies?

YES: notably, duckdb is back on a stable version

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-142

coveralls · 2025-11-17T19:12:44Z

Pull Request Test Coverage Report for Build 19471397498

Details

64 of 64 (100.0%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.7%) to 93.78%

Totals
Change from base Build 18653028932:	0.7%
Covered Lines:	588
Relevant Lines:	627

💛 - Coveralls

Why these changes are being introduced:
We will begin storing embeddings associated with TIMDEX records in the TIMDEX dataset, where read and write functionality is managed by this library. We will need the ability to write and read embeddings, but will start with establishing the new data source structure and write methods. Read methods are to follow, and may require some additional linkages with the `TIMDEXDataset` class and read methods.

How this addresses that need:
This commit adds a new data source for the dataset in the form of the new `TIMDEXEmbeddings` class. This class will encapsulate write and read methods for embeddings, with a composite key of (timdex_record_id, run_id, run_record_offset) tethering the embeddings to specific TIMDEX record versions in the dataset. At this time only a write() method exists, writing embeddings to a `/embeddings` folder in the dataset.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-142

ghukill · 2025-11-17T19:58:46Z

timdex_dataset_api/embeddings.py

+        pa.field("embedding_model", pa.string()),
+        pa.field("embedding_strategy", pa.string()),
+        pa.field("embedding_vector", pa.list_(pa.float32())),
+        pa.field("embedding_object", pa.binary()),


@jonavellecuerdo - this deviates from the proposed schema. I realized during this work, and on timdex-embeddings, that "object" is a bit more flexible for any non-vector form we may want to store.

Is it possible we would want to store multiple non-vector forms?

ghukill · 2025-11-17T19:59:57Z

timdex_dataset_api/embeddings.py

+    (
+        pa.field("timdex_record_id", pa.string()),
+        pa.field("run_id", pa.string()),
+        pa.field("run_record_offset", pa.int32()),


source is removed from the proposed schema. This doubles down on (timdex_record_id, run_id, run_record_offset) being sufficient to tether to record data. Ideally, we don't want to duplicate much of anything available there, here.
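To illustrate why the composite key is enough, here is a small sketch that joins embedding rows back to record rows on (timdex_record_id, run_id, run_record_offset) using plain dicts. The record ids and offsets come from the query output in the PR description; the run_id placeholder, the `source` value, and the `composite_key` helper are assumptions made for this example.

```python
def composite_key(row: dict) -> tuple[str, str, int]:
    """Key that tethers an embedding to one specific record version."""
    return (row["timdex_record_id"], row["run_id"], row["run_record_offset"])


# Record rows carry fields such as source (values here are illustrative).
records = [
    {"timdex_record_id": "dspace:1721.1-2712", "run_id": "run-placeholder",
     "run_record_offset": 100000, "source": "dspace"},
    {"timdex_record_id": "dspace:1721.1-106575", "run_id": "run-placeholder",
     "run_record_offset": 100001, "source": "dspace"},
]

# Embedding rows carry only the key plus embedding-specific fields.
embeddings = [
    {"timdex_record_id": "dspace:1721.1-2712", "run_id": "run-placeholder",
     "run_record_offset": 100000, "embedding_strategy": "full_record"},
]

records_by_key = {composite_key(record): record for record in records}

# Look up each embedding's record; record fields like source come along for free.
joined = [
    {**records_by_key[composite_key(embedding)], **embedding}
    for embedding in embeddings
    if composite_key(embedding) in records_by_key
]
```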

ghukill · 2025-11-18T15:24:34Z

Thanks for the version bump reminder @jonavellecuerdo!

jonavellecuerdo

Looks good to me! The changes were straightforward and easy to follow, which speaks to great architectural decisions made previously with TIMDEXDataset and TIMDEXMetadata. ✨

ehanson8

Looks good, one non-blocking question

ehanson8 · 2025-11-18T15:35:50Z

timdex_dataset_api/embeddings.py

+    (
+        pa.field("timdex_record_id", pa.string()),
+        pa.field("run_id", pa.string()),
+        pa.field("run_record_offset", pa.int32()),


ehanson8 · 2025-11-18T15:40:48Z

timdex_dataset_api/embeddings.py

+        pa.field("embedding_model", pa.string()),
+        pa.field("embedding_strategy", pa.string()),
+        pa.field("embedding_vector", pa.list_(pa.float32())),
+        pa.field("embedding_object", pa.binary()),


Is it possible we would want to store multiple non-vector forms?

ghukill added 2 commits November 10, 2025 16:45

Update dependencies and comment pinned libs

998fc2e

Update dependencies and unpin duckdb

425b664

ghukill marked this pull request as ready for review November 17, 2025 19:55

ghukill requested review from a team and jonavellecuerdo November 17, 2025 19:55

ghukill commented Nov 17, 2025


ehanson8 self-assigned this Nov 17, 2025

Bump version to 3.6

c438bce

jonavellecuerdo approved these changes Nov 18, 2025


ehanson8 approved these changes Nov 18, 2025


ghukill merged commit db73ba0 into main Nov 18, 2025
2 checks passed


5 participants
