
Conversation

@ghukill (Collaborator) commented Oct 30, 2025

Purpose and background context

This PR stubs the following:

  • CLI create-embeddings command
    • main entrypoint for the CLI application
    • creates embeddings and writes them somewhere
  • embedding class base method create_embedding()
    • this is what each model must define, specific to the model's business logic for actually creating embeddings
    • expects a single input text and gives a single embedding back
  • embedding class base method create_embeddings()
    • driver of creating multiple embeddings
    • called by the CLI
    • may be the location of parallelism in the future (both methods are sketched just below)
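
As a rough sketch of how these two base methods could fit together (the class name and signatures are assumptions for illustration, not code from this PR):

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator


class BaseEmbeddingClass(ABC):
    """Hypothetical base class; this PR only names the two methods."""

    @abstractmethod
    def create_embedding(self, text: str) -> dict:
        """Model-specific business logic: one input text in, one embedding back."""

    def create_embeddings(self, texts: Iterable[str]) -> Iterator[dict]:
        """Driver of creating multiple embeddings; called by the CLI.

        May become the location of parallelism in the future.
        """
        for text in texts:
            yield self.create_embedding(text)
```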

Please note that in the CLI create-embeddings command there are a fair number of #WIP: ... and # DEBUG ... comments. The goal here is to sketch what will happen in this CLI without completing those pieces yet. The two primary pieces missing at this time:

  1. the transformation of TIMDEX records into embeddable texts via one or more "strategies"
  2. actually creating the embedding, i.e. from our first and only embedding class OSNeuralSparseDocV3GTE

There was a bit of churn in the unit tests as well, refactoring to try to utilize the mocked embedding class a bit more.

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS TimdexManagers credentials

2- Set env vars:

HF_HUB_DISABLE_PROGRESS_BARS=true
TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
TE_MODEL_PATH=/tmp/te-model
TDA_LOG_LEVEL=DEBUG

3- Ensure model is downloaded and tested:

uv run --env-file .env embeddings --verbose download-model
uv run --env-file .env embeddings --verbose test-model-load

4- Run the create-embeddings CLI command, relying on stubbed, debug functionality in the CLI for parts that are not yet implemented:

uv run --env-file .env embeddings --verbose \
create-embeddings \
-d s3://timdex-extract-dev-222053980223/dataset \
--run-id 16a97964-4c61-45f0-87fa-7363d1af01c2 \
--strategy full-record \
--output-jsonl=output/use-112-test.jsonl

The --run-id is a real run in dev, with 350 records. As a result, we should see 350 lines in the output file output/use-112-test.jsonl. This is simulating creating embeddings for the records from the --run-id. As we get into a real implementation, the number of records would also be multiplied by the number of --strategy options passed (though most likely just one for the foreseeable future).

What does this demonstrate?

  • the CLI command create-embeddings is using TDA to read records
    • currently only scopable to a run_id, much like TIM is for indexing into OpenSearch
    • we could expand the possible ways of scoping records to include in the future
  • a chain of iterators is generated that results in output (sketched below):
    • iterator of records via TDA
    • iterator of records when we transform those records into embeddable text (stubbed)
    • iterator of embeddings from the ML model (stubbed)
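
For illustration, the chain could be wired roughly like this; the function names and signatures below are made up for the sketch, not the actual TDA or CLI API:

```python
from collections.abc import Callable, Iterable, Iterator


def embeddings_pipeline(
    records: Iterable[dict],                    # iterator of records via TDA
    to_embeddable_text: Callable[[dict], str],  # stubbed "strategy" transform
    create_embedding: Callable[[str], dict],    # stubbed model call
) -> Iterator[dict]:
    # Each stage is lazy: nothing is read or embedded until the caller
    # consumes the returned iterator (e.g. the CLI writing JSONL lines).
    texts = (to_embeddable_text(record) for record in records)
    return (create_embedding(text) for text in texts)
```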

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-112

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

How this addresses that need:
* CLI command create-embeddings created
  * args and some functionality in place
  * WIP comments and DEBUG code temporarily added to demonstrate how it will work
* class RecordText added to encapsulate text that is ready for an embedding
  * this will support future functionality of pre-embedding "strategies" applied
    to records
* class Embedding created to encapsulate the embedding result
  * this captures the TIMDEX record the embedding was associated with,
    and the model + strategy used to prepare the text

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112

Why these changes are being introduced:

Previously, when --verbose was set for the CLI all loggers inherited this.
In other applications, we have used a 'WARNING_ONLY_LOGGERS' env var that
would limit them to WARNING level.  This worked, but was perhaps not ideal.
Without that env var, it's a bit of whack-a-mole to figure out which loggers
to quiet.

How this addresses that need:

Instead of defaulting all loggers to DEBUG in verbose mode, we target only
the libraries we expect this application to log at DEBUG.  By default, all
other logger families will still be at WARNING.

This may be a pattern we want to explore in other repositories. Potentially
even further inverting the pattern and supporting a 'DEBUG_LOGGERS' env var
list that would explicitly toggle on DEBUG logging for those libraries.
That would allow troubleshooting in deployed environments just by setting
an env var.

This is NOT applied in this commit, but noting for future consideration.

Side effects of this change:
* Both 'embeddings' and 'timdex_dataset_api' are logged as DEBUG
in verbose mode, but only those libraries.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112
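
For illustration, the opt-in pattern this commit message describes could look roughly like the following sketch (names are assumptions, not the committed code):

```python
import logging

# Logger families we explicitly opt in to DEBUG under --verbose; everything
# else keeps the root default of WARNING.
VERBOSE_DEBUG_LOGGERS = ("embeddings", "timdex_dataset_api")


def configure_logging(*, verbose: bool) -> None:
    logging.basicConfig(level=logging.WARNING)
    if verbose:
        for name in VERBOSE_DEBUG_LOGGERS:
            logging.getLogger(name).setLevel(logging.DEBUG)
```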
@ghukill force-pushed the USE-112-scaffold-creating-embeddings branch from c42bd5e to 0024b8f on October 30, 2025 19:36
run_record_offset: int
model_uri: str
embedding_strategy: str
embedding: dict | list[float]
@ghukill (Collaborator Author) commented Oct 30, 2025:

This single embedding field is probably one of the trickier things to solve right now.

The schema proposed in this comment from USE-114 shows there will be two embedding fields:

  • embedding_token_weights
  • embedding_vector

It's possible, and perhaps desirable, that our current embedding class OSNeuralSparseDocV3GTE produces both when creating embeddings. As such, we may find that this Embedding class needs to have slots for both, more closely reflecting the TIMDEX dataset schema where this data will go. This might look like the following, decoupling the dict and list[float] types:

embedding_token_weights: dict
embedding_vector: list[float]

Even typing this now, it seems like a pretty decent option. That said, I'd propose leaving this for now and solving for this in the following tickets that focus on the creation of embeddings and the writing of the embeddings to the dataset:

@ghukill (Collaborator Author) commented:

Opting to make this change now. We should engineer for our first, known use case that will want to save both of these on output. We can decide during write if we include both, but most likely will.
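
For reference, the updated shape might look roughly like the following, combining the fields quoted above with the two proposed fields; timdex_record_id and run_id are assumed from elsewhere in this thread, and methods like to_dict() are omitted:

```python
from dataclasses import dataclass


@dataclass
class Embedding:
    timdex_record_id: str    # assumed; implied elsewhere in this thread
    run_id: str              # assumed; implied elsewhere in this thread
    run_record_offset: int
    model_uri: str
    embedding_strategy: str
    embedding_token_weights: dict
    embedding_vector: list[float]
```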

@ghukill requested a review from a team October 30, 2025 19:43
@ghukill marked this pull request as ready for review October 31, 2025 13:34
Comment on lines +272 to +275
dumps=lambda obj: json.dumps(
obj,
default=str,
),
@ghukill (Collaborator Author) commented:

This was new to me: when using jsonlines.open() to get a writer, you can define a custom dumps= serializer. We needed the option default=str to coerce datetime objects into strings on serialization.
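
For example, default=str is only invoked for values the encoder can't serialize natively:

```python
import json
from datetime import datetime, timezone

# Without default=str this raises TypeError; with it, datetime is coerced via str().
json.dumps({"ts": datetime(2025, 10, 30, tzinfo=timezone.utc)}, default=str)
# -> '{"ts": "2025-10-30 00:00:00+00:00"}'
```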

@ehanson8 left a comment:

Looks good and the test worked as expected; a few questions.

Comment on lines +269 to +278
with jsonlines.open(
output_jsonl,
mode="w",
dumps=lambda obj: json.dumps(
obj,
default=str,
),
) as writer:
for embedding in embeddings:
writer.write(embedding.to_dict())

@ehanson8:

Optional: this block could be a method to improve readability

@ghukill (Collaborator Author) replied:

I like the thinking of encapsulating this somehow, but I'd like to wait on that until the other pieces are more established.

For example, I'm unsure if it makes sense for the embedding classes to perform writing; I'm thinking not. Therefore, it's basically the CLI that does the writing. So there is nowhere for a method per se, but we could have some utility functions? But if we go the utility function route, I'm unsure if a free-floating function at the bottom of the file, or hopping around to another file, is better than these couple of steps here.

Duly noted, but opting to wait for now.
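
For the record, if we did later go the utility-function route, it could be as small as this (hypothetical, not in the PR):

```python
import json
from collections.abc import Iterable

import jsonlines


def write_embeddings_jsonl(embeddings: Iterable, output_jsonl: str) -> None:
    """Write embeddings as JSON lines, coercing non-serializable values via str()."""
    with jsonlines.open(
        output_jsonl,
        mode="w",
        dumps=lambda obj: json.dumps(obj, default=str),
    ) as writer:
        for embedding in embeddings:
            writer.write(embedding.to_dict())
```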

Comment on lines -14 to -15
for handler in logging.root.handlers:
handler.addFilter(logging.Filter("embeddings"))

@ehanson8:

Why was this removed?

@ghukill (Collaborator Author) commented Oct 31, 2025:

It's a great question, one that I felt deserved an entire commit 😅: 0024b8f.

Not sharing the commit to be glib, and happy to elaborate more. In short, for any application that installs our timdex_dataset_api library, it's helpful to get logs from TDA as well. Unfortunately, some of our other conventions for setting up logging make that difficult. I would argue they are over-aggressively "only this app shall log". But I'd also argue we went too hard in the other direction with, "every library shall log, unless directed otherwise via WARNING_ONLY_LOGGERS".

To me, and noted in the commit, this could be a happy medium:

  • in the application, explicitly configure which libraries you want --verbose to bump to DEBUG
  • all other libraries keep their default WARNING
  • not implemented, but opens the door for a DEBUG_LOGGERS env var that could toggle other libraries to DEBUG logging

TL;DR: this moves to an opt-in pattern for debug logging, while putting TDA on the same footing as the application it's part of.

@ehanson8:

My bad, since this was in the first commit, I didn't associate it with those changes!

@ghukill (Collaborator Author) replied:

Not at all - kind of sloppy on my part how this happened. Snuck the removal into an earlier PR, then this update is basically building on that.


@dataclass
class RecordText:
"""Input record for creating an embedding for.

@ehanson8:

Phrasing is a little awkward, maybe Input record used to generate embedding or Input record from which an embedding is generated but those aren't much better 🙃

@ghukill (Collaborator Author) commented Oct 31, 2025:

Agreed. I even changed the name of the class a few times. I see your other comments above about somewhat confusing docstrings.

Thanks for raising this up; this is the time to get the mental model and wording right.

I'll take another pass at class names and docstrings for this important class.

Comment on lines +12 to +13
embedding_strategy: strategy used to create text for embedding
text: text to embed, created from the TIMDEX record via the embedding_strategy

@ehanson8:

These docstrings and names could be a little clearer, is there a more descriptive name than text?

@ghukill (Collaborator Author) replied:

While I agree that the class names and docstrings may need some touches, I do feel like text is a succinct and accurate property for this class. The class and docstring should clearly communicate that an instance of this class:

  1. came from a specific TIMDEX record
  2. carries a string of text we prepared, via strategy XYZ, to create the embedding from
  3. holds the actual string of text we'll send to the embedding model at .text

I'm unsure if we benefit from something like .text_to_embed, as I think .text on this class should kind of imply that. It's like a Meal class with a .dessert property, where the relationship feels implied and you wouldn't expect .dessert_to_eat.

Comment on lines 30 to 31
embedding_strategy: strategy used to create text for embedding
embedding: model embedding created from text

@ehanson8:

Same note here for making the docstrings a little clearer; names are fine.



@dataclass
class RecordText:
@ghukill (Collaborator Author) commented:

Instead of multiple commits to get it right, opting to try to hash this out in a dedicated comment thread. I feel like some of your comments below, @ehanson8, which I agree with, could be addressed by renaming this class and attributes.

After a bit of thinking on it, what about EmbeddingInput?

  • the presence of (timdex_record_id, run_id, run_record_offset) implies this is associated with a specific TIMDEX record
  • still not 1000% happy with embedding_strategy, but I think it communicates pretty broadly that a) this was a strategy for preparing this "embedding input" object, and b) it's the strategy of the embedding itself to represent those things
  • lastly, hopefully text is a bit clearer now as the "text" that will be used as the "embedding input"?
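
In that shape, the class might look roughly like this (field types are assumptions):

```python
from dataclasses import dataclass


@dataclass
class EmbeddingInput:
    timdex_record_id: str
    run_id: str
    run_record_offset: int
    embedding_strategy: str
    text: str
```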

Thoughts @ehanson8?

@ehanson8:

Beautiful! That seems like a better framing for the app as well. It doesn't care that it's a "Record", just an input for an embedding

@ghukill (Collaborator Author) replied:

@ehanson8 - agreed. Thanks for the comments and a bit of friction on the first pass of names.

In theory, we could use this CLI for creating embeddings for anything. Obviously it's wired to read TIMDEX records as input currently, but deeper in the data model this EmbeddingInput class would work equally well for anything.

I'll work on a commit with some renamings and docstring updates.

Why these changes are being introduced:

Code review suggested that 'RecordText' was a confusing name
for the object that we prepare to then create an embedding from.

How this addresses that need:

Renaming to 'EmbeddingInput' makes it crystal clear that we are
preparing an object that will be used to create an embedding.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112
Why these changes are being introduced:

Formerly, our 'Embedding' class only had an 'embedding' property for the output.
However, our first model in the pipeline,
opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte, produces two
representations of the embedding that are useful to store: a sparse vector and
decoded token weights.

How this addresses that need:

Updates the 'Embedding' class to explicitly store both representations of the
embedding.

We may decide that we don't store both, or some future models may not produce
decoded token weights of any kind, but this matches our first proposed model
and pipeline.  Better to be explicit and opinionated in these early days, then
adjust later if needed.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112
@ghukill requested a review from ehanson8 November 3, 2025 14:24
@ghukill (Collaborator Author) commented Nov 3, 2025:

Re-requesting a review @ehanson8. Thanks again for the feedback.

Changes include:

  • rename RecordText to EmbeddingInput
  • explicit properties on Embedding: embedding_token_weights and embedding_vector, which map to our proposed schema in USE-114

@ehanson8 left a comment:

Looks good to me!

Comment on lines +7 to +12
class EmbeddingInput:
"""Encapsulates the inputs for an embedding.
When creating an embedding, we need to note what TIMDEX record the embedding is
associated with and what strategy was used to prepare the embedding input text from
the record itself.
@ehanson8:

Much better!

@ehanson8:

Agree with the approach in the commit message!

@ghukill merged commit bd9de2b into main Nov 3, 2025
2 checks passed