-
Notifications
You must be signed in to change notification settings - Fork 0
Stub CLI command methods to create embeddings #16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
How this addresses that need:
* CLI command create-embeddings created
* args and some functionality in place
* WIP comments and DEBUG code temporarily added to demonstrate how it will work
* class RecordText added to encapsulate text that is ready for an embedding
* this will support future functionality of pre-embedding "strategies" applied
to records
* class Embedding created to encapsulate the embedding result
* this captures the TIMDEX record the embedding was assocaited with,
and the model + strategy used to prepare the text
Side effects of this change:
* None
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-112
Why these changes are being introduced: Previously, when --verbose was set for the CLI all loggers inherited this. In other applications, we have used a 'WARNING_ONLY_LOGGERS' env var that would limit them to WARNING level. This worked, but was perhaps not ideal. Without that env var, it's a bit of whack-a-mole to figure out which loggers to quiet. How this addresses that need: Instead of defaulting all loggers to DEBUG in verbose mode, we target only libraries we expect this application to log in DEBUG. By default, all other logger families will still have WARNING. This may be a pattern we want to explore in other repositories. Potentially even further inverting the pattern and supporting a 'DEBUG_LOGGERS' env var list that would explicitly toggle on DEBUG logging for those libraries. That would allow troubleshooting in deployed environments just by setting an env var. This is NOT applied in this commit, but noting for future consideration. Side effects of this change: * Both 'embeddings' and 'timdex_dataset_api' are logged as DEBUG in verbose mode, but only those libraries. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-112
c42bd5e to
0024b8f
Compare
embeddings/embedding.py
Outdated
| run_record_offset: int | ||
| model_uri: str | ||
| embedding_strategy: str | ||
| embedding: dict | list[float] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This single embedding field is probably one of the trickier things to solve right now.
The schema proposed in this comment from USE-114 shows there will be two embedding fields:
embedding_token_weightsembedding_vector
It's possible, and perhaps desirable, that our current embedding class OSNeuralSparseDocV3GTE produces both when creating embeddings. As such, we may find that this Embedding class needs to have slots for both, more closely reflecting the TIMDEX dataset schema where this data will go. This might look like the following, decoupling the dict and list[float] types:
embedding_token_weights: dict
embedding_vector: list[float]Even typing this now, it seems like a pretty decent option. That said, I'd propose leaving this for now and solving for this in the following tickets that focus on the creation of embeddings and the writing of the embeddings to the dataset:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opting to make this change now. We should engineer for our first, known use case that will want to save both of these on output. We can decide during write if we include both, but most likely will.
| dumps=lambda obj: json.dumps( | ||
| obj, | ||
| default=str, | ||
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was new to me: when using jsonlines.open() to get a writer, you can define a custom dumps= serializer. We needed the option default=str to coerce datetime objects into strings on serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and the test worked as expected, a few questions
| with jsonlines.open( | ||
| output_jsonl, | ||
| mode="w", | ||
| dumps=lambda obj: json.dumps( | ||
| obj, | ||
| default=str, | ||
| ), | ||
| ) as writer: | ||
| for embedding in embeddings: | ||
| writer.write(embedding.to_dict()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: this block could be a method to improve readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the thinking of encapsulating this somehow, but I'd like to wait on that until the other pieces are more established.
For example, I'm unsure if it makes sense for the embedding classes to perform writing, I'm thinking not. Therefore, it's basically the CLI that does the writing. So there is nowhere for a method per se, but we could have some utility functions? But if we go the utility function route, I'm unsure if a free floating function at the bottom of the file or the hopping around to another file is better than these couple of steps here.
Duly noted, but opting to wait for now.
| for handler in logging.root.handlers: | ||
| handler.addFilter(logging.Filter("embeddings")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a great question, one that I felt deserved an entire commit 😅: 0024b8f.
Not sharing the commit to be glib and happy to elaborate more. In short, any applications that install our timdex_dataset_api library, it's helpful to get logs from TDA as well. Unfortunately, some of our other conventions for setting up logging make that difficult. I would argue they are over aggressively "only this app shall log". But I'd also argue went too hard the other direction with, "every library shall log, unless directed otherwise via WARNING_ONLY_LOGGERS".
To me, and noted in the commit, this could be a happy medium:
- in the application, explicitly configure whic libraries you want
--verboseto bump toDEBUG - all other libraries keep their default
WARNING - not implemented, but opens the door for a
DEBUG_LOGGERSenv var that could toggle other libraries toDEBUGlogging
TL/DR: moves to an opt-in pattern for debug logging, while putting TDA on the same footing as the application it's part of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, since this was in the first commit, I didn't associate it with those changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all - kind of sloppy on my part how this happened. Snuck the removal in an earlier PR, then this update is basically building on that.
embeddings/embedding.py
Outdated
|
|
||
| @dataclass | ||
| class RecordText: | ||
| """Input record for creating an embedding for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phrasing is a little awkward, maybe Input record used to generate embedding or Input record from which an embedding is generated but those aren't much better 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I even changed the name of the class a few times. I see your other comments above somewhat confusing docstrings.
Thanks for raising this up, this is the time to get the mental model and wording right.
I'll take another pass at class names and docstrings for this important class.
| embedding_strategy: strategy used to create text for embedding | ||
| text: text to embed, created from the TIMDEX record via the embedding_strategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docstrings and names could be a little clearer, is there a more descriptive name than text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree that the class names and docstrings may need some touches, I do feel like text is a succinct and accurate property for this class. The class and docstring should clearly communicate that an instance of this class is:
- came from a specific TIMDEX record
- we prepared a string of text to create the embedding from via strategy
XYZ - the actual string of text we'll send to the embedding model is found at
.text
I'm unsure if we benefit from something like .text_to_embed, as I think .text on this class should kind of imply that. It's like a Meal class with a .desert property, where the relationship feels implied and wouldn't expect .desert_to_eat.
embeddings/embedding.py
Outdated
| embedding_strategy: strategy used to create text for embedding | ||
| embedding: model embedding created from text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note here for making the docstrings a littler clearer, names are fine
embeddings/embedding.py
Outdated
|
|
||
|
|
||
| @dataclass | ||
| class RecordText: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of multiple commits to get it right, opting to try and hash this out in a dedicated comment thread. I feel like some of your comments below @ehanson8, which I agree with, could be addressed by renaming this class and attributes.
After a bit of thinking on it, what about EmbeddingInput?
- the presence of
(timdex_record_id, run_id, run_record_offset)implies this is associated with a specific TIMDEX record - still not 1000% happy with
embedding_strategy, but I think it communicates pretty broadly that a) this was a strategy for preparing this "embedding input" object, and b) it's the strategy of the embedding itself to represent those things - lastly, hopefully
textis a bit clearer now as the "text" that will be used as the "embedding input"?
Thoughts @ehanson8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beautiful! That seems like a better framing for the app as well. It doesn't care that it's a "Record", just an input for an embedding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehanson8 - agreed. Thanks for the comments and a bit of friction on the first pass of names.
In theory, we could use this CLI for creating embeddings for anything. Obviously it's wired to read TIMDEX records as input currently, but deeper in the data model this EmbeddingInput class would work equally well for anything.
I'll work on a commit with some renamings and docstring updates.
Why these changes are being introduced: Code review suggested that 'RecordText' was a confusing name for the object that we prepare to then create an embedding from. How this addresses that need: Renamign to 'EmbeddingInput' makes it crystal clear that we are preparing an object that will be used to create an embedding. Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-112
Why these changes are being introduced: Formerly, our 'Embedding' class only had an 'embedding' property for the output. However, for our first model in the pipeline, opensearch-project/ opensearch-neural-sparse-encoding-doc-v3-gte, it produces two representations of the embedding that are useful to store: a sparse vector and decoded token weights. How this addresses that need: Updates the 'Embedding' class to explicitly store both representations of the embedding. We may decide that we don't store both, or some futures models may not produce decoded token weights of any kind, but this matches our first proposed model and pipeline. Better to be explicit and opinionated in these early days, then adjust later if needed. Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-112
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
| class EmbeddingInput: | ||
| """Encapsulates the inputs for an embedding. | ||
| When creating an embedding, we need to note what TIMDEX record the embedding is | ||
| associated with and what strategy was used to prepare the embedding input text from | ||
| the record itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the approach in the commit message!
Purpose and background context
This PR stubs the following:
create-embeddingscommandcreate_embedding()create_embeddings()Please note in the CLI
create-embeddingscommand there is a fair amount of#WIP: ...and# DEBUG ...comments. The goal here is to sketch what will happen in this CLI without completing those pieces yet. The two primary pieces missing at this time:OSNeuralSparseDocV3GTEThere was a bit of churn in the unit tests as well, refactoring to try and utilize the mocked embedding class a bit more.
How can a reviewer manually see the effects of these changes?
1- Set Dev1 AWS
TimdexManagerscredentials2- Set env vars:
3- Ensure model is downloaded and tested:
4- Run the
create-embeddingsCLI command, relying on stubbed, debug functionality in the CLI for parts that are not yet implemented:The
--run-idis a real run in dev, with 350 records. As a result, we should see 350 lines in the output fileoutput/use-112.test.jsonl. This is simulating creating embeddings for the records from the--run-id. As we get into a real implementation, the number of records would be multipled by the number of--strategy's passed as well (though most likely just one for the foreseeable future).What does this demonstrate?
create-embeddingsis using TDA to read recordsrun_id, much like TIM is for indexing into OpenSearchIncludes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Code review