USE 131 - Framework for embedding input strategies #17

ghukill · 2025-11-03T20:16:53Z

Purpose and background context

This PR establishes a framework for "strategies" that extract and transform data from TIMDEX records into the single string from which an embedding is created.

As noted in the commit message, we currently have a single strategy, which uses the full record, but we want to ensure we can support other strategies in the future, and potentially multiple strategies for a record in a single run (resulting in the cartesian product of records * strategies requested).
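To make the "records * strategies" point concrete, here is a small illustrative sketch. It is not code from this PR: the record ids are made up, and "title_only" is a hypothetical second strategy name, since "full_record" is the only strategy that exists so far.

```python
from itertools import product

# Hypothetical stand-ins: real TIMDEX records carry many more fields.
records = [
    {"timdex_record_id": "libguides:guides-1"},
    {"timdex_record_id": "libguides:guides-2"},
    {"timdex_record_id": "libguides:guides-3"},
]

# "full_record" is the only strategy in this PR; "title_only" is invented here.
strategy_names = ["full_record", "title_only"]

# One (record, strategy) pair per embedding input that a run would produce.
pairs = [
    (record["timdex_record_id"], strategy_name)
    for record, strategy_name in product(records, strategy_names)
]

print(len(pairs))  # 3 records * 2 strategies = 6
```

Each record is paired with every requested strategy, so the number of embedding inputs grows as the product of the two counts.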

This establishes a new embeddings/strategies module, with a base class BaseStrategy, and our first implementation class FullRecordStrategy.

Overall, it's quite simple: a bit of boilerplate to get down to the strategy's extract_text() method, which is opinionated about how text is extracted to create an embedding.
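For orientation, a minimal sketch of the shape described above. The class and method names come from this PR, but the bodies and the extract_text() signature (taking the record as a dict argument) are assumptions for illustration, not the code in embeddings/strategies.

```python
import json
from abc import ABC, abstractmethod


class BaseStrategy(ABC):
    """Sketch only; the real base class lives in embeddings/strategies/base.py."""

    STRATEGY_NAME: str  # declared only; each subclass is expected to set it

    @abstractmethod
    def extract_text(self, timdex_record: dict) -> str:
        """Return the single string that an embedding will be created from."""


class FullRecordStrategy(BaseStrategy):
    STRATEGY_NAME = "full_record"

    def extract_text(self, timdex_record: dict) -> str:
        # The whole record, serialized as JSON, is the text to embed.
        return json.dumps(timdex_record)


strategy = FullRecordStrategy()
text = strategy.extract_text({"title": "Research guide", "source": "libguides"})
print(text)
```

A future strategy would only need to set its own STRATEGY_NAME and supply a different extract_text().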

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS credentials

2- Set env vars:

TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset

3- Start an ipython shell:

uv run --env-file .env ipython

4- Prepare an iterator of TIMDEX records:

import os
from timdex_dataset_api import TIMDEXDataset

# init TIMDEXDataset and get an iterator of records
td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])
records = td.read_dicts_iter(source="libguides", limit=3)

5- Create EmbeddingInput instances:

from embeddings.strategies.processor import create_embedding_inputs

# create EmbeddingInputs
embedding_inputs = list(create_embedding_inputs(records, ["full_record"]))

The following is an example EmbeddingInput returned:

EmbeddingInput(timdex_record_id='libguides:guides-175846', run_id='370abbb9-fcfc-4356-bff9-7b000dff109e', run_record_offset=0, embedding_strategy='full_record', text='<full record JSON dump here...>')

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

A core requirement of this application is the ability to take a TIMDEX JSON record and "transform" all or parts of it into a single string for which an embedding can be created. We are calling these "embedding strategies" in the context of this app. While our first strategy will likely be a very simple, full record approach, we want to support multiple strategies in the application, and even multiple strategies for a single record in a single invocation.

How this addresses that need:
* A new 'strategies' module is created
* A base 'BaseStrategy' class, with a required 'extract_text()' method for implementations
* Our first strategy, represented in class 'FullRecordStrategy', which JSON dumps the entire TIMDEX JSON record
* A registry of strategies, similar to our models, that allows CLI level validation

Side effects of this change:
* None really, but this further solidifies that this application contains the opinionation about how text is prepared for the embedding process.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-131
* https://mitlibraries.atlassian.net/browse/USE-132
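As a rough illustration of the registry idea in the commit message, a sketch of how a name-to-class mapping can back early validation of requested strategy names. STRATEGY_REGISTRY and validate_strategy_names() are hypothetical names chosen for this example, and the stand-in class omits the real base class; the actual module in embeddings/strategies/registry.py may look different.

```python
class FullRecordStrategy:
    """Stand-in for the real strategy class, reduced to its name."""

    STRATEGY_NAME = "full_record"


# Hypothetical registry: maps each strategy name to its class.
STRATEGY_REGISTRY: dict[str, type] = {
    cls.STRATEGY_NAME: cls for cls in [FullRecordStrategy]
}


def validate_strategy_names(names: list[str]) -> list[str]:
    """Raise early if any requested strategy name is not registered."""
    unknown = sorted(set(names) - set(STRATEGY_REGISTRY))
    if unknown:
        available = sorted(STRATEGY_REGISTRY)
        raise ValueError(f"Unknown strategies: {unknown}; available: {available}")
    return names


validated = validate_strategy_names(["full_record"])
```

Running a check like this at the CLI boundary means a typo in a strategy name fails before any records are read.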

README.md

ehanson8

ipython testing worked as expected, a few minor comments and questions but nothing blocking. Great work, this is clean and tight!

README.md

ehanson8 · 2025-11-04T16:03:57Z

embeddings/strategies/base.py

+        super().__init_subclass__(**kwargs)
+
+        # require class level STRATEGY_NAME to be set
+        if not hasattr(cls, "STRATEGY_NAME"):


Is this check necessary since the attribute exists in the base class even if it's not defined?

Yes, I think it is. Without this, you could leave out the class level STRATEGY_NAME attribute on a real strategy class.

Another pattern could be something like this:

from abc import ABC, abstractmethod

class BaseStrategy(ABC):
    @classmethod
    @abstractmethod
    def strategy_name(cls) -> str:
        # return strategy name...
        ...

Which would allow getting the strategy name for an uninstantiated class, which is important. We do something similar in other projects I think.

But this pattern, with a little logic in the base class, enforces that child classes define it.
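A small sketch of why the hasattr() check does real work. It assumes the base class only annotates STRATEGY_NAME without assigning a value (an assumption about base.py, not confirmed in this thread), in which case no attribute exists on the class and the check fires for any subclass that forgets it. The TypeError and its message are choices made for this example.

```python
from abc import ABC, abstractmethod


class BaseStrategy(ABC):
    STRATEGY_NAME: str  # annotation only: no class attribute is created

    def __init_subclass__(cls, **kwargs: object) -> None:
        super().__init_subclass__(**kwargs)

        # require class level STRATEGY_NAME to be set
        if not hasattr(cls, "STRATEGY_NAME"):
            raise TypeError(f"{cls.__name__} must define 'STRATEGY_NAME'")

    @abstractmethod
    def extract_text(self, timdex_record: dict) -> str: ...


class FullRecordStrategy(BaseStrategy):
    STRATEGY_NAME = "full_record"

    def extract_text(self, timdex_record: dict) -> str:
        return str(timdex_record)


try:

    class NamelessStrategy(BaseStrategy):  # forgot STRATEGY_NAME
        def extract_text(self, timdex_record: dict) -> str:
            return ""

except TypeError as exc:
    # The failure happens when the class is defined, before any instance exists.
    error_message = str(exc)
```

Because __init_subclass__ runs at class-definition time, the missing name surfaces on import rather than on first use.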

embeddings/strategies/base.py

embeddings/strategies/registry.py

jonavellecuerdo

Hmm, I'm curious if you had an idea of other methods that could be added to a child class of BaseStrategy? 🤔 When I think of this class, I understand it as a "collection of methods that could be applied to TIMDEX records", and while hard to explain, I was surprised by the number of required variables for the __init__ method:

self.timdex_record_id = timdex_record_id
self.run_id = run_id
self.run_record_offset = run_record_offset
self.transformed_record = transformed_record

It didn't feel clear to me as to why all these fields were required to instantiate a "strategy", if that makes sense. I do understand why it is important information for the resulting EmbeddingInput returned by the BaseStrategy.to_embedding_input() method, which is why I find it fits more as parameters for that method specifically, instead of the __init__ method. However, perhaps there is something I'm not thinking of re: other methods that you expect to define for BaseStrategy classes that will heavily rely on these fields. 🤔

embeddings/strategies/base.py

ghukill · 2025-11-04T20:14:58Z

It's an excellent question, and one I don't have an excellent answer for. I'm glad you caught and surfaced this, as this was something I had intended to return to but did not. I think what we have here is a classic case of a major reshuffling partway through the work, with this as an awkward leftover that needs attention.

One thing is known: we want strategies to emit fully formed EmbeddingInput instances that can be passed directly to an embedding class's create_embedding() method. What I think is wrong with the current implementation, and I think you touch on this, is instantiating the transformer strategy class in a per-record fashion, hence those record-level details.

I'm going to take another pass at this with this discussion in mind, and will re-tag everyone for a review. Thanks for raising this @jonavellecuerdo!

Why these changes are being introduced:

Formerly, a transformer strategy class was instantiated in a per-record fashion, where things like the timdex_record_id and other record-level values were passed. This ultimately felt awkward, when we could just as easily instantiate it once in a more generic fashion, then build EmbeddingInput instances with the *result* of the strategy extracting text from the TIMDEX JSON record.

How this addresses that need:

All record-level details are removed as arguments for initializing a transformer strategy. Instead, the helper function create_embedding_inputs() is responsible for passing the TIMDEX JSON record to the transformer strategies, and then building an EmbeddingInput object before yielding. This keeps the init of those strategies much simpler and keeps properties they don't really need out of the class.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-131
* https://mitlibraries.atlassian.net/browse/USE-132

ghukill · 2025-11-04T20:38:21Z

@jonavellecuerdo, @ehanson8 - I've added a new commit per the discussion above: 421ba71.

Thanks again @jonavellecuerdo, that rough edge completely slipped my mind and/or I couldn't see it anymore when working through this. This further simplifies the transformer strategies as just that: accept a TIMDEX JSON record, be opinionated about how to return a string representation based on the strategy, and then get out of the way.

Because we are utilizing iterators most everywhere in the CLI, there was a need for the embeddings.strategies.processor.create_embedding_inputs() function to yield EmbeddingInput instances. I think this change makes that function "do" a bit more, and that also feels correct. It is now actually creating EmbeddingInput objects, by combining the record details with the text-to-embed from the transformer strategy.
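A sketch of the reshaped flow, for readers who want to see it without opening the diff. The EmbeddingInput field names follow the example in the PR description; everything else here is an assumption for illustration, including the registry name, the record dict keys, and treating transformed_record as an already-parsed dict. It is not the code from commit 421ba71.

```python
import json
from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass
class EmbeddingInput:
    # Field names follow the example EmbeddingInput in the PR description.
    timdex_record_id: str
    run_id: str
    run_record_offset: int
    embedding_strategy: str
    text: str


class FullRecordStrategy:
    STRATEGY_NAME = "full_record"

    def extract_text(self, timdex_record: dict) -> str:
        return json.dumps(timdex_record)


# Hypothetical registry mapping strategy names to classes.
STRATEGY_REGISTRY = {FullRecordStrategy.STRATEGY_NAME: FullRecordStrategy}


def create_embedding_inputs(
    timdex_records: Iterable[dict], strategy_names: list[str]
) -> Iterator[EmbeddingInput]:
    # Strategies are instantiated once, with no record-level arguments.
    strategies = [STRATEGY_REGISTRY[name]() for name in strategy_names]
    for record in timdex_records:
        for strategy in strategies:
            # Record-level details are combined with the extracted text here.
            yield EmbeddingInput(
                timdex_record_id=record["timdex_record_id"],
                run_id=record["run_id"],
                run_record_offset=record["run_record_offset"],
                embedding_strategy=strategy.STRATEGY_NAME,
                text=strategy.extract_text(record["transformed_record"]),
            )


records = iter(
    [
        {
            "timdex_record_id": "libguides:guides-175846",
            "run_id": "example-run",
            "run_record_offset": 0,
            "transformed_record": {"title": "Guide"},
        }
    ]
)
inputs = list(create_embedding_inputs(records, ["full_record"]))
```

The generator stays lazy, so it composes with the iterator returned by read_dicts_iter() without loading all records into memory.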

ehanson8

It got even cleaner and tighter! Great comment @jonavellecuerdo and great code in response @ghukill !

jonavellecuerdo

Changes look great and code flows better! Great work. :D

ghukill requested a review from a team November 3, 2025 20:17

ghukill added 2 commits November 3, 2025 15:22

Strategies unit tests

70d8394

ghukill force-pushed the USE-131-embedding-input-transform-framework branch from 7fa20b0 to 70d8394 Compare November 3, 2025 20:22

ghukill marked this pull request as ready for review November 3, 2025 20:23

jonavellecuerdo self-requested a review November 4, 2025 14:25

ehanson8 approved these changes Nov 4, 2025

Streamline exception messages

145cd81

ghukill mentioned this pull request Nov 4, 2025

USE 136 - implement create embeddings for OSNeuralSparseDocV3GTE #18

Merged

jonavellecuerdo reviewed Nov 4, 2025

ghukill requested review from ehanson8 and jonavellecuerdo November 4, 2025 20:38

ehanson8 approved these changes Nov 4, 2025

jonavellecuerdo approved these changes Nov 4, 2025

ghukill merged commit a725a58 into main Nov 4, 2025
2 checks passed
