USE 136 - implement create embeddings for OSNeuralSparseDocV3GTE #18
Conversation
Why these changes are being introduced:

Our first embedding class OSNeuralSparseDocV3GTE was ready for a real create_embedding() method, with the rest of the moving parts mostly complete at this time.

How this addresses that need:

The HuggingFace model card, https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte, contains example code for using this model with the transformers library. This logic was ported over to our class, with model downloading and loading already handled. The biggest addition is the method _decode_sparse_vectors(), which converts the numerical sparse vector into a dictionary of token:weight pairs. This decoded token-weight form is what we'll pass directly to OpenSearch.

With the new functionality in place, the tests associated with this class were also updated. Fixtures were moved into the testing file, a pattern we could adopt for any future models to keep them out of the shared conftest.py.

Side effects of this change:
* CLI is capable of producing embeddings for our first model OSNeuralSparseDocV3GTE

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-136
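For readers outside this PR, here is a minimal sketch of the decoding step described above. The method name `_decode_sparse_vectors()` comes from this PR, but the body below is an illustration adapted from the model card's transformers pattern, not the project's actual implementation:

```python
# Illustrative sketch only -- not the project's actual implementation.
# Assumes a 1-D torch tensor whose length equals the tokenizer vocabulary size,
# as in the HuggingFace model card example.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
)

def _decode_sparse_vectors(sparse_vector: torch.Tensor) -> dict[str, float]:
    """Convert a vocab-sized sparse vector into a {token: weight} dictionary."""
    # Map vocabulary index -> token string once, then pull out the nonzero weights.
    id_to_token = {token_id: token for token, token_id in tokenizer.get_vocab().items()}
    nonzero_indices = torch.nonzero(sparse_vector, as_tuple=False).squeeze(-1)
    return {id_to_token[int(i)]: float(sparse_vector[int(i)]) for i in nonzero_indices}
```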
ehanson8 left a comment
Works as expected and looks great! One typo and a question from our earlier discussion, but I'm comfortable approving regardless of the answer.
```python
run_record_offset=embedding_input.run_record_offset,
model_uri=self.model_uri,
embedding_strategy=embedding_input.embedding_strategy,
embedding_vector=sparse_vector_list,
```
Just adding this here from our earlier discussion: even if the vector compresses to nothing, I think we should consider why we're passing it through if we don't have a use case for it. Obviously it's needed for generating token weights but if OpenSearch doesn't use it, I'm not sure why we're storing it on the object. I won't press beyond this comment but it feels like we're keeping an unnecessary precursor in addition to the useful output. Happy to be corrected if that's not the case!
I'd wager to say it's kind of an endless topic. Your point and hesitance are well founded and registered.
To me, it's a bit of a gamble.
It takes time and money to produce embeddings, even as we tune for performance, and the sparse vectors arguably have more information, given each is a representation of the embedding across the entire vocabulary. Theoretically, we could perform some mathematical operations on the sparse vectors that we can't do on the decoded token weights. Other folks, myself included, have some interest in this. Storing them keeps that option on the table.
Additionally, there might be a good argument for only storing those sparse vectors in the future and decoding the data on the way out. This could be much cheaper to store in the long term.
Lastly, hopefully, we'll use a model in the future that produces true dense vectors, which will require storage in that form.
If any of these pan out, I think it'd be nice to have some repetitions and tested schemas for storing this data. Certainly not opposed to removing it a few months down the road as we tune the pipeline, but I'd lobby for keeping it in these early days as we develop our understanding.
But, as I led with, I don't think there is a right or wrong answer here. We may very well decide it's not useful at some point and cease to store it for this model.
That is all fine, appreciate the discussion! 🙂
jonavellecuerdo left a comment
This is looking good to me. Expecting to take a closer look with the upcoming tickets. Just one question for you.
```python
def __repr__(self) -> str:  # noqa: D105
    return (
        f"<EmbeddingInput - record:'{self.timdex_record_id}', "
        f"strategy:'{self.embedding_strategy}', text length:{len(self.text)}>"
    )
```
Hmm, why text length? 🤔
Keeping it brief here, but in short, it's just a handy little indicator of how long the text used for the embedding was! This might help when we don't see the original text, and maybe unearth instances where it's zero, or huge.
But purely a guess at helpful data for the interactive python/shell environment. Won't have any bearing otherwise.
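As a purely illustrative example, an interactive session might then show something like this (the record id, strategy, and length below are made up):

```python
# Hypothetical IPython output; all values here are invented for illustration.
>>> embedding_input
<EmbeddingInput - record:'example:record-123', strategy:'full_record', text length:4812>
```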
Purpose and background context
This PR updates our first embedding class `OSNeuralSparseDocV3GTE` to have a fully functional `create_embedding()` method. Most of the business logic for this was ported directly from the HuggingFace model card, with the main addition in our code being the method `_decode_sparse_vectors()`, which converts the sparse vector into the decoded `{token: weight, ...}` format that we will pass directly to OpenSearch.

Noting that a meeting has been scheduled to discuss this PR prior to any reviews, to provide a space to demo and discuss how this works (at least to some degree of detail).

This should be considered an initial, untuned implementation, but one that is producing the sparse vectors and decoded token weights for inputs we know we'll need. There are currently a couple of tickets targeting improvements to this first implementation, and likely more will be opened:
* `arm64` containers

How can a reviewer manually see the effects of these changes?
1- Set Dev1 AWS credentials
2- Set env vars:
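The exact variables were not captured in this page; as an illustration only, this step looks something like the following, where every variable name and value is a placeholder rather than the project's real configuration:

```shell
# Placeholder names and values -- the real env vars were not captured here.
export WORKSPACE=dev
export TE_MODEL_URI=opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
```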
3- (Re)Download model:
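The download command itself was also not captured; a hedged sketch of what this step typically looks like, where the entry point and option names are assumptions rather than the project's actual CLI:

```shell
# Hypothetical invocation -- command and option names are illustrative only.
pipenv run embeddings download-model \
    --model-uri opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte
```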
Manual testing in IPython shell
4- Start ipython:
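Assuming the project uses pipenv for its environment (an assumption on my part), this is simply:

```shell
# Adjust to the project's actual environment tooling if it isn't pipenv.
pipenv run ipython
```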
5- Load model for creating embeddings:
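A sketch of loading the model inside IPython; the module path and the `load()` call are assumptions for illustration, not necessarily the project's actual API:

```python
# Illustrative only: module path and method name are assumed, not confirmed.
from embeddings.models import OSNeuralSparseDocV3GTE

model = OSNeuralSparseDocV3GTE()
model.load()  # assumes the model files were already downloaded in step 3
```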
Manual creation of `EmbeddingInput`s

Look at decoded `token:weight` for the single embedding created:

Now we can look at the top 10 tokens by weight:
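The original code cells for these steps were not captured in this page. Below is a hedged sketch of what they plausibly look like, continuing from the `model` loaded above; the `EmbeddingInput` field names are taken from snippets elsewhere in this PR, while the import path, constructor usage, and the attribute holding the decoded weights are assumptions:

```python
# Sketch only: imports, constructor arguments, and attribute names are partly assumed.
from embeddings.models import EmbeddingInput  # assumed import path

embedding_input = EmbeddingInput(
    timdex_record_id="example:record-123",   # placeholder id
    run_record_offset=0,
    embedding_strategy="full_record",        # placeholder strategy name
    text="A short example passage of TIMDEX record text.",
)

embedding = model.create_embedding(embedding_input)

# Decoded {token: weight} mapping produced via _decode_sparse_vectors()
token_weights = embedding.embedding_token_weights  # attribute name assumed

# Top 10 tokens by weight
top_10 = sorted(token_weights.items(), key=lambda kv: kv[1], reverse=True)[:10]
for token, weight in top_10:
    print(f"{token}: {weight:.4f}")
```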
`EmbeddingInput`s from TIMDEX records

CLI

`create-embeddings` CLI command

Run the following:
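The exact command was not captured here; a hedged sketch based only on the `create-embeddings` command name and the `output/use136.jsonl` path mentioned below, with the entry point and all option names being assumptions:

```shell
# Hypothetical entry point and flags -- only the command name and output path
# appear in the captured PR text.
pipenv run embeddings create-embeddings \
    --output output/use136.jsonl \
    --limit 10
```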
Look at the results at `output/use136.jsonl`. Note that for only 10 records this is a pretty sizable 1.6 MB file; that size is a result of the sparse vectors in JSON form. We may decide not to write these back to the TIMDEX dataset, given the unknown usage at this time.

Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
* USE-136: https://mitlibraries.atlassian.net/browse/USE-136
Code review