@ghukill ghukill commented Nov 17, 2025

Purpose and background context

This PR introduces the ability to write embeddings to a TIMDEX dataset using TDA. To support this, it adds a new timdex_dataset_api/embeddings.py file that will encapsulate most of this functionality, and inside that file a new TIMDEXEmbeddings class.

At this time, the class is fairly minimal. It requires an instance of TIMDEXDataset on init, saved to self, which provides functionality and configurations from the "root" dataset class. As we progress into read methods this may be revisited, but it works well for now.

The only meaningful method at this time is .write(), which, like TIMDEXDataset.write(), accepts an iterator of items to write. This new method expects an iterator of DatasetEmbedding instances, which are akin to the DatasetRecord class. Big picture, the mechanics are designed to be largely identical:

  1. Init a class that represents that data source in the dataset (this being only our second, but more will come, e.g. fulltext).
  2. Users of the library will import dataclasses like DatasetRecord or DatasetEmbedding and prepare an iterator for writing.
  3. Call the appropriate .write() method with that iterator and let TDA handle the rest.

As noted, as we move into read methods there may be some fairly substantial changes. We may want a tighter coupling with TIMDEXDataset, e.g. something like TIMDEXDataset.embeddings which is a composed instance of this new TIMDEXEmbeddings class. We may even perform reading with a first step of record reading to get (timdex_record_id, run_id, run_record_offset) data, then re-use that for querying the embeddings themselves. TBD. Mentioning this now only to contextualize this PR which establishes the ability to write embeddings, but some structure is subject to change.
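
To make the possible two-step read a bit more concrete, here is a rough plain-Python sketch. It is not TDA code: `records`, `embeddings`, `composite_key`, and `embeddings_for_records` are hypothetical names, and a real implementation would presumably query parquet files through DuckDB rather than loop over dicts. It only illustrates using (timdex_record_id, run_id, run_record_offset) as the link between the two data sources.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for rows read from the records and
# embeddings data sources; the real data lives in parquet files.
records = [
    {"timdex_record_id": "dspace:1", "run_id": "run-a", "run_record_offset": 0},
    {"timdex_record_id": "dspace:2", "run_id": "run-a", "run_record_offset": 1},
]
embeddings = [
    {"timdex_record_id": "dspace:1", "run_id": "run-a", "run_record_offset": 0,
     "embedding_strategy": "full_record"},
    {"timdex_record_id": "dspace:9", "run_id": "run-b", "run_record_offset": 0,
     "embedding_strategy": "full_record"},
]

KEY_FIELDS = ("timdex_record_id", "run_id", "run_record_offset")


def composite_key(row):
    """Build the composite key that tethers an embedding to a record version."""
    return tuple(row[field] for field in KEY_FIELDS)


def embeddings_for_records(records, embeddings):
    """Step 1: collect keys from the record read. Step 2: filter embeddings by key."""
    wanted = {composite_key(record) for record in records}
    matched = defaultdict(list)
    for embedding in embeddings:
        key = composite_key(embedding)
        if key in wanted:
            matched[key].append(embedding)
    return dict(matched)


result = embeddings_for_records(records, embeddings)
```

Only the embedding whose key matches a selected record is returned; the embedding for `dspace:9` is ignored because no selected record carries that key.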

How can a reviewer manually see the effects of these changes?

1- Download the following JSONLines file with embeddings created by the new timdex-embeddings CLI application: s3://timdex-extract-dev-222053980223/dataset/data/sandbox/dspace-100-embeddings.jsonl. Save to output/ folder in this project (git ignored).

2- Open ipython shell:

pipenv run ipython

3- Load TIMDEXDataset and TIMDEXEmbeddings instances:

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.embeddings import TIMDEXEmbeddings
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset("/tmp/tda-use142")  # expected and okay this does not exist yet
te = TIMDEXEmbeddings(td)

4- Prepare embeddings for writing:

import json

from timdex_dataset_api.embeddings import TIMDEXEmbeddings, DatasetEmbedding

embeddings = []
with open("output/dspace-100-embeddings.jsonl") as f:
    for record in f.readlines():
        record = json.loads(record)
        embeddings.append(
            DatasetEmbedding(
                timdex_record_id=record["timdex_record_id"],
                run_id=record["run_id"],
                run_record_offset=record["run_record_offset"],
                embedding_model=record["model_uri"],
                embedding_strategy=record["embedding_strategy"],
                timestamp=record["timestamp"],
                embedding_vector=record.get("embedding_vector"),
                embedding_object=json.dumps(  # note the renamed "embedding_object" field here
                    record.get("embedding_token_weights")
                ).encode(),
            )
        )
  • this simulates what timdex-embeddings CLI will do as it prepares to write, skipping the embeddings ever existing in a file as an interim step
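
Since .write() accepts an iterator, the list in step 4 is not strictly needed. A generator could yield items one at a time, as sketched below. To keep the sketch self-contained it uses a hypothetical `EmbeddingRow` dataclass as a stand-in for DatasetEmbedding (the field names are copied from step 4, the types are guesses), and `iter_embedding_rows` is likewise a made-up helper name. One deliberate difference from step 4: a missing `embedding_token_weights` value becomes `None` here rather than the encoded string `null`.

```python
import json
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbeddingRow:
    """Hypothetical stand-in for DatasetEmbedding; not the library's class."""

    timdex_record_id: str
    run_id: str
    run_record_offset: int
    embedding_model: str
    embedding_strategy: str
    timestamp: str
    embedding_vector: Optional[list] = None
    embedding_object: Optional[bytes] = None


def iter_embedding_rows(lines: Iterable[str]) -> Iterator[EmbeddingRow]:
    """Lazily turn JSON Lines text into rows, one per non-empty line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        record = json.loads(line)
        token_weights = record.get("embedding_token_weights")
        yield EmbeddingRow(
            timdex_record_id=record["timdex_record_id"],
            run_id=record["run_id"],
            run_record_offset=record["run_record_offset"],
            embedding_model=record["model_uri"],
            embedding_strategy=record["embedding_strategy"],
            timestamp=record["timestamp"],
            embedding_vector=record.get("embedding_vector"),
            # Serialize the token weights to bytes only when they are present.
            embedding_object=(
                json.dumps(token_weights).encode()
                if token_weights is not None
                else None
            ),
        )
```

With the real DatasetEmbedding class swapped in, an open file handle could be passed straight through, e.g. `te.write(iter_embedding_rows(f))`, so that the full set of embeddings never has to sit in memory at once.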

5- Write embeddings to dataset:

files_written = te.write(iter(embeddings))

6- With the TIMDEXDataset duckdb connection, confirm rows are written to parquet file:

td.conn.query(f"""select * from '{files_written[0].path}';""")
"""
┌──────────────────────┬──────────────────────┬───────────────────┬──────────────────────┬──────────────────────┬────────────────────┬──────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────┬───────┬───────┐
│   timdex_record_id   │        run_id        │ run_record_offset │      timestamp       │   embedding_model    │ embedding_strategy │ embedding_vector │                                                 embedding_object                                                 │  day  │ month │ year  │
│       varchar        │       varchar        │       int32       │ timestamp with tim…  │       varchar        │      varchar       │     float[]      │                                                       blob                                                       │ int64 │ int64 │ int64 │
├──────────────────────┼──────────────────────┼───────────────────┼──────────────────────┼──────────────────────┼────────────────────┼──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┼───────┼───────┤
│ dspace:1721.1-2712   │ 60b76094-5412-4f4b…  │            100000 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22sticky\x22: 0.8922109603881836, \x22local\x22: 0.7325811982154846, \x22ds\x22: 0.720859169960022, \x22mit…  │    17 │    11 │  2025 │
│ dspace:1721.1-106575 │ 60b76094-5412-4f4b…  │            100001 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22distance\x22: 0.8653345704078674, \x22lb\x22: 0.7897465825080872, \x22loop\x22: 0.7712590098381042, \x22t…  │    17 │    11 │  2025 │
│ dspace:1721.1-130098 │ 60b76094-5412-4f4b…  │            100002 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22stroke\x22: 1.0930368900299072, \x22strokes\x22: 0.7778793573379517, \x22nl\x22: 0.676250696182251, \x22g…  │    17 │    11 │  2025 │
│ dspace:1721.1-116230 │ 60b76094-5412-4f4b…  │            100003 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22f\x22: 0.9276513457298279, \x226\x22: 0.7267107367515564, \x22u\x22: 0.7204970121383667, \x22charge\x22: …  │    17 │    11 │  2025 │
│ dspace:1721.1-15162  │ 60b76094-5412-4f4b…  │            100004 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22time\x22: 0.7713623642921448, \x22acceleration\x22: 0.7129517197608948, \x22ga\x22: 0.7012690305709839, \…  │    17 │    11 │  2025 │
│ dspace:1721.1-63397  │ 60b76094-5412-4f4b…  │            100005 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22international\x22: 0.763080894947052, \x22ds\x22: 0.7044110298156738, \x22mit\x22: 0.6805474758148193, \x…  │    17 │    11 │  2025 │
│ dspace:1721.1-114893 │ 60b76094-5412-4f4b…  │            100006 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22brain\x22: 0.8636446595191956, \x22cancer\x22: 0.8300579190254211, \x22##ic\x22: 0.8218355774879456, \x22…  │    17 │    11 │  2025 │
│ dspace:1721.1-78915  │ 60b76094-5412-4f4b…  │            100007 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22radiation\x22: 0.8831801414489746, \x22interaction\x22: 0.6725037693977356, \x22matter\x22: 0.66096931695…  │    17 │    11 │  2025 │
│ dspace:1721.1-157848 │ 60b76094-5412-4f4b…  │            100008 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22voice\x22: 0.7316490411758423, \x22anonymous\x22: 0.7303279042243958, \x22##ony\x22: 0.706677258014679, \…  │    17 │    11 │  2025 │
│ dspace:1721.1-38845  │ 60b76094-5412-4f4b…  │            100009 │ 2025-11-17 13:51:4…  │ opensearch-project…  │ full_record        │ NULL             │ {\x22##rm\x22: 0.6709635853767395, \x22ds\x22: 0.661644458770752, \x22massachusetts\x22: 0.6469870805740356, \…  │    17 │    11 │  2025 │
├──────────────────────┴──────────────────────┴───────────────────┴──────────────────────┴──────────────────────┴────────────────────┴──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┴───────┴───────┤
│ 10 rows                                                                                                                                                                                                                                                                               11 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
"""

The final result at /tmp/tda-use142 looks similar to this, showing a successful write of embeddings as parquet file(s) to a new /data/embeddings location:

tda-use142
├── data
│   ├── embeddings
│   │   └── year=2025
│   │       └── month=11
│   │           └── day=17
│   │               └── 51ead940-71a8-4157-8cfd-268bab7e41fb-0.parquet
│   └── records
└── metadata
    └── append_deltas
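
The year=/month=/day= folders line up with the year, month, and day columns in the query output above. A minimal sketch of how such a partition directory could be derived from a row's timestamp follows. This is an assumption about the layout, not TDA's actual code: `embeddings_partition_dir` is a hypothetical helper, and whether single-digit months and days are zero-padded cannot be told from this example, so the sketch leaves them unpadded.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def embeddings_partition_dir(timestamp: datetime) -> PurePosixPath:
    """Build a hive-style partition directory (year=/month=/day=) from a timestamp."""
    return PurePosixPath(
        "data",
        "embeddings",
        f"year={timestamp.year}",
        f"month={timestamp.month}",
        f"day={timestamp.day}",
    )


example = embeddings_partition_dir(
    datetime(2025, 11, 17, 13, 51, tzinfo=timezone.utc)
)
print(example)  # data/embeddings/year=2025/month=11/day=17
```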

Includes new or updated dependencies?

YES: notably, duckdb is back on a stable version

Changes expectations for external applications?

NO

What are the relevant tickets?

coveralls commented Nov 17, 2025

Pull Request Test Coverage Report for Build 19471397498

Details

  • 64 of 64 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.7%) to 93.78%

Totals Coverage Status
Change from base Build 18653028932: 0.7%
Covered Lines: 588
Relevant Lines: 627

💛 - Coveralls

Why these changes are being introduced:

We will begin storing embeddings associated with TIMDEX records in
the TIMDEX dataset, where read and write functionality is managed by
this library.

We will need the ability to write and read embeddings,
but will start with establishing the new data source structure and
write methods.  Read methods are to follow, and may require some
additional linkages with the `TIMDEXDataset` class and read methods.

How this addresses that need:

This commit adds a new data source for the dataset in the form of
the new `TIMDEXEmbeddings` class.  This class will encapsulate
write and read methods for embeddings, with a composite key of
(timdex_record_id, run_id, run_record_offset) tethering the embeddings
to specific TIMDEX record versions in the dataset.

At this time only a write() method exists, writing embeddings to
a `/embeddings` folder in the dataset.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-142
@ghukill ghukill marked this pull request as ready for review November 17, 2025 19:55
@ghukill ghukill requested review from a team and jonavellecuerdo November 17, 2025 19:55
pa.field("embedding_model", pa.string()),
pa.field("embedding_strategy", pa.string()),
pa.field("embedding_vector", pa.list_(pa.float32())),
pa.field("embedding_object", pa.binary()),

@jonavellecuerdo - this deviates from the proposed schema. I realized during this work, and on timdex-embeddings, that "object" is a bit more flexible for any non-vector form we may want to store.

Is it possible we would want to store multiple non-vector forms?
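
For context on the embedding_object field in this thread: because the column is binary, any serializable non-vector form can be stored as bytes. The sketch below shows the JSON round trip used in the PR description's step 4, with two token weights taken from the sample output. The variable names are illustrative only, and JSON is just the encoding used in that example, not something the schema requires.

```python
import json

# Two token weights copied from the sample query output in the PR description.
token_weights = {"sticky": 0.8922109603881836, "local": 0.7325811982154846}

# Write side: serialize to JSON text, then encode to bytes for the binary column.
blob = json.dumps(token_weights).encode()

# Read side: decode the bytes and parse the JSON back into a dict.
restored = json.loads(blob.decode())
```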

(
pa.field("timdex_record_id", pa.string()),
pa.field("run_id", pa.string()),
pa.field("run_record_offset", pa.int32()),

source is removed from the proposed schema. This doubles down on (timdex_record_id, run_id, run_record_offset) being sufficient to tether to record data. Ideally, we don't want to duplicate much of anything available there, here.

Smart!

@ehanson8 ehanson8 self-assigned this Nov 17, 2025
ghukill commented Nov 18, 2025

Thanks for the version bump reminder @jonavellecuerdo!

@jonavellecuerdo jonavellecuerdo left a comment

Looks good to me! The changes were straightforward and easy to follow, which speaks to great architectural decisions made previously with TIMDEXDataset and TIMDEXMetadata. ✨

@ehanson8 ehanson8 left a comment

Looks good, one non-blocking question

@ghukill ghukill merged commit db73ba0 into main Nov 18, 2025
2 checks passed