Skip to content

Conversation

@ghukill
Copy link
Collaborator

@ghukill ghukill commented Nov 3, 2025

Purpose and background context

This PR establishes a framework for "strategies" used to extract and transform data from TIMDEX records into single strings used to create an embedding from.

As noted in the commit message, while we only have a single strategy at the moment of using the full record, we want to ensure we can support other strategies in the future and potentially multiple strategies for a record, in a single run (resulting in the cartesian product of records * strategies requested).

This establishes a new embeddings/strategies module, with a base class BaseStrategy, and our first implementation class FullRecordStrategy.

Overall, it's quite simple: a bit of boilerplate to get down to the strategy's extract_text() method which is opinionated for how text is extracted to create an embedding from.

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS credentials

2- Set env vars:

TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset

3- Start an ipython shell:

uv run --env-file .env ipython

4- Prepare an iterator of TIMDEX records:

import os
from timdex_dataset_api import TIMDEXDataset

# init TIMDEXDataset and get an iterator or records
td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])
records = td.read_dicts_iter(source="libguides", limit=3)

5- Create EmbeddingInput instances:

from embeddings.strategies.processor import create_embedding_inputs

# create EmbeddingInputs
embedding_inputs = list(create_embedding_inputs(records, ["full_record"]))

The following is an example EmbeddingInput returned:

EmbeddingInput(timdex_record_id='libguides:guides-175846', run_id='370abbb9-fcfc-4356-bff9-7b000dff109e', run_record_offset=0, embedding_strategy='full_record', text='<full record JSON dump here...>')

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill requested a review from a team November 3, 2025 20:17
Why these changes are being introduced:

A core requirement of this application is the ability to take a TIMDEX
JSON record and "transform" all or parts of it into a single string for
which an embedding can be created.  We are calling these "embedding
strategies" in the context of this app.

While our first strategy will likely be a very simple, full record approach,
we want to support multiple strategies in the application, and even
multiple strategies for a single record in a single invocation.

How this addresses that need:
* A new 'strategies' module is created
* A base 'BaseStrategy' class, with a required 'extract_text()' method
for implementations
* Our first strategy represented in class 'FullRecordStrategy', which
JSON dumps the entire TIMDEX JSON record.
* A registry of strategies, similar to our models, that allow CLI
level validation.

Side effects of this change:
* None really, but further solidifies that this application is contains
the opinionation about how text is prepared for the embedding process.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-131
* https://mitlibraries.atlassian.net/browse/USE-132
@ghukill ghukill force-pushed the USE-131-embedding-input-transform-framework branch from 7fa20b0 to 70d8394 Compare November 3, 2025 20:22
@ghukill ghukill marked this pull request as ready for review November 3, 2025 20:23
@jonavellecuerdo jonavellecuerdo self-requested a review November 4, 2025 14:25
Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ipython testing worked as expected, a few minor comments and questions but nothing blocking. Great work, this is clean and tight!

super().__init_subclass__(**kwargs)

# require class level STRATEGY_NAME to be set
if not hasattr(cls, "STRATEGY_NAME"):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check necessary since the attribute exists in the base class even if it's not defined?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is. Without this, you could leave out the class level STRATEGY_NAME attribute on a real strategy class.

Another pattern could be something like this:

class BaseStrategy(ABC):
  @absractmethod
  @classmethod
  def strategy_name(self) -> str:
    # return strategy name...

Which would allow getting the strategy name for an uninstantiated class, which is important. We do something similar in other projects I think.

But this pattern, with a little logic in the base class, enforces that child classes define it.

Copy link

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm curious if you had an idea of other methods that could be added to a child class of BaseStrategy? 🤔 When I think of this class, I understand it as a "collection of methods that could be applied to TIMDEX records", and while hard to explain, I was surprised by the number of required variables for the __init__ method:

self.timdex_record_id = timdex_record_id
self.run_id = run_id
self.run_record_offset = run_record_offset
self.transformed_record = transformed_record

It didn't feel clear to me as to why all these fields were required to instantiate a "strategy", if that makes sense. I do understand why it is important information for the resulting EmbeddingInput returned by the BaseStrategy.to_embedding_input() method, which is why I find it fits more as parameters for that method specifically, instead of the __init__ method. However, perhaps there is something I'm not thinking of re: other methods that you expect to define for BaseStrategy classes that will heavily rely on these fields. 🤔

@ghukill
Copy link
Collaborator Author

ghukill commented Nov 4, 2025

Hmm, I'm curious if you had an idea of other methods that could be added to a child class of BaseStrategy? 🤔 When I think of this class, I understand it as a "collection of methods that could be applied to TIMDEX records", and while hard to explain, I was surprised by the number of required variables for the __init__ method:

self.timdex_record_id = timdex_record_id
self.run_id = run_id
self.run_record_offset = run_record_offset
self.transformed_record = transformed_record

It didn't feel clear to me as to why all these fields were required to instantiate a "strategy", if that makes sense. I do understand why it is important information for the resulting EmbeddingInput returned by the BaseStrategy.to_embedding_input() method, which is why I find it fits more as parameters for that method specifically, instead of the __init__ method. However, perhaps there is something I'm not thinking of re: other methods that you expect to define for BaseStrategy classes that will heavily rely on these fields. 🤔

It's an excellent question, and one I don't have an excellent answer for. I'm glad you caught and surfaced this, as this was something I had intended to return to but did not. I think what we have here is a pretty classic major reshuffling during work on this, and this is an awkward leftover that needs attention.

One thing is known: we want strategies to emit fully formed EmbeddingInput's that can be passed directly to an embedding class's create_embedding() method. What I think is wrong with the current implementation -- and I think you touch on this -- is instantiating the transformer strategy class in a per record fashion, hence those record level details.

I'm going to take another pass at this with this discussion in mind, and will re-tag everyone for a review. Thanks for raising this @jonavellecuerdo!

Why these changes are being introduced:

Formerly, a transformer strategy class was instantiated in a per-record fashion,
where things like the timdex_record_id and other record-level values were passed.
This ultimately felt awkward, when we could just as easily instantiate it once in
a more generic fashion, then build EmbeddingInput instances with the *result* of
the strategy extracting text from the TIMDEX JSON record.

How this addresses that need:

All record-level details are removed as arguments for initializing a transformer
strategy.  Instead, the helper function create_embedding_inputs() is responsible
for passing the TIMDEX JSON record to the transformer strategies, and then
building an EmbeddingInput object before yielding.  This keeps the init of those
strategies much simpler, and preventing properties in the class they don't really
need.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-131
* https://mitlibraries.atlassian.net/browse/USE-132
@ghukill
Copy link
Collaborator Author

ghukill commented Nov 4, 2025

@jonavellecuerdo, @ehanson8 - I've added a new commit per the discussion above: 421ba71.

Thanks again @jonavellecuerdo, that rough edge completely slipped my mind and/or I couldn't see it anymore when working through this. This further simplifies the transformer strategies as just that: accept a TIMDEX JSON record, be opinionated about how to return a string representation based on the strategy, and then get out of the way.

Because we are utilizing iterators most everywhere in the CLI, there was a need for the embeddings.strategies.processor.create_embedding_inputs() function to yield EmbeddingInput's. I think this change makes that function "do" a bit more, and that also feels correct. It actually is creating EmbeddingInput's objects now, by combining the record details with the text-to-embed from the transformer strategy.

Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It got even cleaner and tighter! Great comment @jonavellecuerdo and great code in response @ghukill !

Copy link

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great and code flows better! Great work. :D

@ghukill ghukill merged commit a725a58 into main Nov 4, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants