
Conversation

@ghukill (Collaborator) commented Oct 30, 2025

Purpose and background context

This PR stubs the following:

  • CLI create-embeddings command
    • main entrypoint for the CLI application
    • creates embeddings and writes them somewhere
  • embedding class base method create_embedding()
    • this is what each model must define, specific to the model's business logic for actually creating embeddings
    • expects a single input text and gives a single embedding back
  • embedding class base method create_embeddings()
    • driver of creating multiple embeddings
    • called by the CLI
    • may be the location of parallelism in the future (both methods are sketched just below)
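
As a rough sketch of how these two base methods could fit together (the class name and signatures are assumptions for illustration, not code from this PR):

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator


class BaseEmbeddingClass(ABC):
    """Hypothetical base class; this PR only names the two methods."""

    @abstractmethod
    def create_embedding(self, text: str) -> dict:
        """Model-specific business logic: one input text in, one embedding back."""

    def create_embeddings(self, texts: Iterable[str]) -> Iterator[dict]:
        """Driver of creating multiple embeddings; called by the CLI.

        May become the location of parallelism in the future.
        """
        for text in texts:
            yield self.create_embedding(text)
```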

Please note that in the CLI create-embeddings command there are a fair number of #WIP: ... and # DEBUG ... comments. The goal here is to sketch what will happen in this CLI without completing those pieces yet. The two primary pieces missing at this time:

  1. the transformation of TIMDEX records into embeddable texts via one or more "strategies"
  2. actually creating the embedding, i.e. from our first and only embedding class OSNeuralSparseDocV3GTE

There was a bit of churn in the unit tests as well, refactoring to try to utilize the mocked embedding class a bit more.

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS TimdexManagers credentials

2- Set env vars:

HF_HUB_DISABLE_PROGRESS_BARS=true
TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
TE_MODEL_PATH=/tmp/te-model
TDA_LOG_LEVEL=DEBUG

3- Ensure model is downloaded and tested:

uv run --env-file .env embeddings --verbose download-model
uv run --env-file .env embeddings --verbose test-model-load

4- Run the create-embeddings CLI command, relying on stubbed, debug functionality in the CLI for parts that are not yet implemented:

uv run --env-file .env embeddings --verbose \
create-embeddings \
-d s3://timdex-extract-dev-222053980223/dataset \
--run-id 16a97964-4c61-45f0-87fa-7363d1af01c2 \
--strategy full-record \
--output-jsonl=output/use-112-test.jsonl

The --run-id is a real run in dev, with 350 records. As a result, we should see 350 lines in the output file output/use-112-test.jsonl. This is simulating creating embeddings for the records from the --run-id. As we get into a real implementation, the number of records would also be multiplied by the number of --strategy options passed (though most likely just one for the foreseeable future).

What does this demonstrate?

  • the CLI command create-embeddings is using TDA to read records
    • currently only scopable to a run_id, much like TIM is for indexing into OpenSearch
    • we could expand the possible ways of scoping records to include in the future
  • a chain of iterators is generated that results in output (sketched below):
    • iterator of records via TDA
    • iterator of records when we transform those records into embeddable text (stubbed)
    • iterator of embeddings from the ML model (stubbed)
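
For illustration, the chain could be wired roughly like this; the function names and signatures below are made up for the sketch, not the actual TDA or CLI API:

```python
from collections.abc import Callable, Iterable, Iterator


def embeddings_pipeline(
    records: Iterable[dict],                    # iterator of records via TDA
    to_embeddable_text: Callable[[dict], str],  # stubbed "strategy" transform
    create_embedding: Callable[[str], dict],    # stubbed model call
) -> Iterator[dict]:
    # Each stage is lazy: nothing is read or embedded until the caller
    # consumes the returned iterator (e.g. the CLI writing JSONL lines).
    texts = (to_embeddable_text(record) for record in records)
    return (create_embedding(text) for text in texts)
```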

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-112

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

How this addresses that need:
* CLI command create-embeddings created
  * args and some functionality in place
  * WIP comments and DEBUG code temporarily added to demonstrate how it will work
* class RecordText added to encapsulate text that is ready for an embedding
  * this will support future functionality of pre-embedding "strategies" applied
    to records
* class Embedding created to encapsulate the embedding result
  * this captures the TIMDEX record the embedding was associated with,
    and the model + strategy used to prepare the text

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112

Why these changes are being introduced:

Previously, when --verbose was set for the CLI all loggers inherited this.
In other applications, we have used a 'WARNING_ONLY_LOGGERS' env var that
would limit them to WARNING level.  This worked, but was perhaps not ideal.
Without that env var, it's a bit of whack-a-mole to figure out which loggers
to quiet.

How this addresses that need:

Instead of defaulting all loggers to DEBUG in verbose mode, we target only
the libraries we expect this application to log at DEBUG.  By default, all
other logger families will still be at WARNING.

This may be a pattern we want to explore in other repositories. Potentially
even further inverting the pattern and supporting a 'DEBUG_LOGGERS' env var
list that would explicitly toggle on DEBUG logging for those libraries.
That would allow troubleshooting in deployed environments just by setting
an env var.

This is NOT applied in this commit, but noting for future consideration.

Side effects of this change:
* Both 'embeddings' and 'timdex_dataset_api' are logged as DEBUG
in verbose mode, but only those libraries.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112
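
For illustration, the opt-in pattern this commit message describes could look roughly like the following sketch (names are assumptions, not the committed code):

```python
import logging

# Logger families we explicitly opt in to DEBUG under --verbose; everything
# else keeps the root default of WARNING.
VERBOSE_DEBUG_LOGGERS = ("embeddings", "timdex_dataset_api")


def configure_logging(*, verbose: bool) -> None:
    logging.basicConfig(level=logging.WARNING)
    if verbose:
        for name in VERBOSE_DEBUG_LOGGERS:
            logging.getLogger(name).setLevel(logging.DEBUG)
```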
@ghukill force-pushed the USE-112-scaffold-creating-embeddings branch from c42bd5e to 0024b8f on October 30, 2025 19:36
run_record_offset: int
model_uri: str
embedding_strategy: str
embedding: dict | list[float]
@ghukill (Collaborator Author) commented Oct 30, 2025:

This single embedding field is probably one of the trickier things to solve right now.

The schema proposed in this comment from USE-114 shows there will be two embedding fields:

  • embedding_token_weights
  • embedding_vector

It's possible, and perhaps desirable, that our current embedding class OSNeuralSparseDocV3GTE produces both when creating embeddings. As such, we may find that this Embedding class needs to have slots for both, more closely reflecting the TIMDEX dataset schema where this data will go. This might look like the following, decoupling the dict and list[float] types:

embedding_token_weights: dict
embedding_vector: list[float]

Even typing this now, it seems like a pretty decent option. That said, I'd propose leaving this for now and solving for this in the following tickets that focus on the creation of embeddings and the writing of the embeddings to the dataset:

@ghukill (Collaborator Author) commented:

Opting to make this change now. We should engineer for our first, known use case that will want to save both of these on output. We can decide during write if we include both, but most likely will.
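
For reference, the updated shape might look roughly like the following, combining the fields quoted above with the two proposed fields; timdex_record_id and run_id are assumed from elsewhere in this thread, and methods like to_dict() are omitted:

```python
from dataclasses import dataclass


@dataclass
class Embedding:
    timdex_record_id: str    # assumed; implied elsewhere in this thread
    run_id: str              # assumed; implied elsewhere in this thread
    run_record_offset: int
    model_uri: str
    embedding_strategy: str
    embedding_token_weights: dict
    embedding_vector: list[float]
```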

@ghukill requested a review from a team October 30, 2025 19:43
@ghukill marked this pull request as ready for review October 31, 2025 13:34
Comment on lines +272 to +275
dumps=lambda obj: json.dumps(
obj,
default=str,
),
@ghukill (Collaborator Author) commented:

This was new to me: when using jsonlines.open() to get a writer, you can define a custom dumps= serializer. We needed the option default=str to coerce datetime objects into strings on serialization.
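
For example, default=str is only invoked for values the encoder can't serialize natively:

```python
import json
from datetime import datetime, timezone

# Without default=str this raises TypeError; with it, datetime is coerced via str().
json.dumps({"ts": datetime(2025, 10, 30, tzinfo=timezone.utc)}, default=str)
# -> '{"ts": "2025-10-30 00:00:00+00:00"}'
```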

@ehanson8 left a comment:

Looks good and the test worked as expected; a few questions.

Comment on lines +269 to +278
with jsonlines.open(
output_jsonl,
mode="w",
dumps=lambda obj: json.dumps(
obj,
default=str,
),
) as writer:
for embedding in embeddings:
writer.write(embedding.to_dict())

@ehanson8:

Optional: this block could be a method to improve readability

@ghukill (Collaborator Author) replied:

I like the thinking of encapsulating this somehow, but I'd like to wait on that until the other pieces are more established.

For example, I'm unsure if it makes sense for the embedding classes to perform writing; I'm thinking not. Therefore, it's basically the CLI that does the writing. So there is nowhere for a method per se, but we could have some utility functions? But if we go the utility function route, I'm unsure if a free-floating function at the bottom of the file, or hopping around to another file, is better than these couple of steps here.

Duly noted, but opting to wait for now.
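
For the record, if we did later go the utility-function route, it could be as small as this (hypothetical, not in the PR):

```python
import json
from collections.abc import Iterable

import jsonlines


def write_embeddings_jsonl(embeddings: Iterable, output_jsonl: str) -> None:
    """Write embeddings as JSON lines, coercing non-serializable values via str()."""
    with jsonlines.open(
        output_jsonl,
        mode="w",
        dumps=lambda obj: json.dumps(obj, default=str),
    ) as writer:
        for embedding in embeddings:
            writer.write(embedding.to_dict())
```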

Comment on lines -14 to -15
for handler in logging.root.handlers:
handler.addFilter(logging.Filter("embeddings"))

@ehanson8:

Why was this removed?

@ghukill (Collaborator Author) commented Oct 31, 2025:

It's a great question, one that I felt deserved an entire commit 😅: 0024b8f.

Not sharing the commit to be glib, and happy to elaborate more. In short, for any application that installs our timdex_dataset_api library, it's helpful to get logs from TDA as well. Unfortunately, some of our other conventions for setting up logging make that difficult. I would argue they are over-aggressively "only this app shall log". But I'd also argue we went too hard in the other direction with, "every library shall log, unless directed otherwise via WARNING_ONLY_LOGGERS".

To me, and noted in the commit, this could be a happy medium:

  • in the application, explicitly configure which libraries you want --verbose to bump to DEBUG
  • all other libraries keep their default WARNING
  • not implemented, but opens the door for a DEBUG_LOGGERS env var that could toggle other libraries to DEBUG logging

TL;DR: this moves to an opt-in pattern for debug logging, while putting TDA on the same footing as the application it's part of.

@ehanson8:

My bad, since this was in the first commit, I didn't associate it with those changes!

@ghukill (Collaborator Author) replied:

Not at all - kind of sloppy on my part how this happened. Snuck the removal into an earlier PR, then this update is basically building on that.


@dataclass
class RecordText:
"""Input record for creating an embedding for.

@ehanson8:

Phrasing is a little awkward, maybe Input record used to generate embedding or Input record from which an embedding is generated but those aren't much better 🙃

@ghukill (Collaborator Author) commented Oct 31, 2025:

Agreed. I even changed the name of the class a few times. I see your other comments above about somewhat confusing docstrings.

Thanks for raising this up; this is the time to get the mental model and wording right.

I'll take another pass at class names and docstrings for this important class.

Comment on lines +12 to +13
embedding_strategy: strategy used to create text for embedding
text: text to embed, created from the TIMDEX record via the embedding_strategy

@ehanson8:

These docstrings and names could be a little clearer, is there a more descriptive name than text?

@ghukill (Collaborator Author) replied:

While I agree that the class names and docstrings may need some touches, I do feel like text is a succinct and accurate property for this class. The class and docstring should clearly communicate that an instance of this class:

  1. came from a specific TIMDEX record
  2. carries a string of text we prepared, via strategy XYZ, to create the embedding from
  3. holds the actual string of text we'll send to the embedding model at .text

I'm unsure if we benefit from something like .text_to_embed, as I think .text on this class should kind of imply that. It's like a Meal class with a .dessert property, where the relationship feels implied and you wouldn't expect .dessert_to_eat.

Comment on lines 30 to 31
embedding_strategy: strategy used to create text for embedding
embedding: model embedding created from text

@ehanson8:

Same note here for making the docstrings a little clearer; names are fine.



@dataclass
class RecordText:
@ghukill (Collaborator Author) commented:

Instead of multiple commits to get it right, opting to try to hash this out in a dedicated comment thread. I feel like some of your comments below, @ehanson8, which I agree with, could be addressed by renaming this class and attributes.

After a bit of thinking on it, what about EmbeddingInput?

  • the presence of (timdex_record_id, run_id, run_record_offset) implies this is associated with a specific TIMDEX record
  • still not 1000% happy with embedding_strategy, but I think it communicates pretty broadly that a) this was a strategy for preparing this "embedding input" object, and b) it's the strategy of the embedding itself to represent those things
  • lastly, hopefully text is a bit clearer now as the "text" that will be used as the "embedding input"?
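
In that shape, the class might look roughly like this (field types are assumptions):

```python
from dataclasses import dataclass


@dataclass
class EmbeddingInput:
    timdex_record_id: str
    run_id: str
    run_record_offset: int
    embedding_strategy: str
    text: str
```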

Thoughts @ehanson8?

@ehanson8:

Beautiful! That seems like a better framing for the app as well. It doesn't care that it's a "Record", just an input for an embedding

@ghukill (Collaborator Author) replied:

@ehanson8 - agreed. Thanks for the comments and a bit of friction on the first pass of names.

In theory, we could use this CLI for creating embeddings for anything. Obviously it's wired to read TIMDEX records as input currently, but deeper in the data model this EmbeddingInput class would work equally well for anything.

I'll work on a commit with some renamings and docstring updates.

Why these changes are being introduced:

Code review suggested that 'RecordText' was a confusing name
for the object that we prepare to then create an embedding from.

How this addresses that need:

Renaming to 'EmbeddingInput' makes it crystal clear that we are
preparing an object that will be used to create an embedding.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112
Why these changes are being introduced:

Formerly, our 'Embedding' class only had an 'embedding' property for the output.
However, our first model in the pipeline,
opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte, produces two
representations of the embedding that are useful to store: a sparse vector and
decoded token weights.

How this addresses that need:

Updates the 'Embedding' class to explicitly store both representations of the
embedding.

We may decide that we don't store both, or some future models may not produce
decoded token weights of any kind, but this matches our first proposed model
and pipeline.  Better to be explicit and opinionated in these early days, then
adjust later if needed.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112
@ghukill requested a review from ehanson8 November 3, 2025 14:24
@ghukill (Collaborator Author) commented Nov 3, 2025:

Re-requesting a review @ehanson8. Thanks again for the feedback.

Changes include:

  • rename RecordText to EmbeddingInput
  • explicit properties on Embedding: embedding_token_weights and embedding_vector, which map to our proposed schema in USE-114

@ehanson8 left a comment:

Looks good to me!

Comment on lines +7 to +12
class EmbeddingInput:
"""Encapsulates the inputs for an embedding.
When creating an embedding, we need to note what TIMDEX record the embedding is
associated with and what strategy was used to prepare the embedding input text from
the record itself.
@ehanson8:

Much better!

@ehanson8:

Agree with the approach in the commit message!

@ghukill merged commit bd9de2b into main Nov 3, 2025
2 checks passed