Embedding pr #369
Conversation
The title says it is the "Embedding pr", but I see quite a bit of code for the chunking as well. Can we remove that to keep the review manageable? The goal for this PR should be to only implement the embedding model base class plus one concrete class for the embedding model that we are currently using. The concrete implementation should simply hardcode the chunking as we are currently doing it for the source storages. Everything else, e.g. factoring out the chunking logic, new chunking models, new embedding models, etc., should only be implemented in follow-up PRs.
Force-pushed from beced90 to f179b69.
You are correct, I forgot to push a rebase from yesterday that dropped the chunking commits. It should be fixed now.
Thanks @Tengal-Teemo for moving forward with this. I did a first pass mostly about moving all the pieces into the right place. I think we are making good progress here. When everything is resolved here, I'll do another pass.
One thing that I don't really understand yet is #354 (comment). But we'll come to that later.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Force-pushed from aeb91d5 to e664a1b.
Force-pushed from bfce5e7 to 2735298.
@pmeier I believe I've either implemented or commented on all your changes.
@Tengal-Teemo I did another pass. Still mostly moving stuff around. I'll come to the meaty part after that. All open comments new and old still need your attention.
…structor from Embedding
…es by re-importing source_storages
@pmeier The error message from changing the chromadb retrieve function is as follows:
Apologies for my lack of experience with pydantic.
Additionally @pmeier, I will finish up the rest of the comments tomorrow and try to work out why dropping commit ed4f91a causes the following error:
I'm not 100% sure, but I believe it's caused by the introduction of the
Re #369 (comment): This is on me. In contrast to what I said earlier, we do actually check for the name (see `ragna/core/_components.py`, lines 52 to 70 in a5e4795).
Leave it as is. This is something I wanted to refactor anyway. I'll fix it in a follow-up PR.
Re #369 (comment): Fixed in 1d5d4a8. Note that this also includes a few automatic formatting changes. They have to be applied before merge anyway, so I didn't mess with the pre-commit hooks for this.
Thanks Toby, this is going pretty well. I took the liberty of pushing a few cleanup commits about some minor stuff that would just slow us down. Mostly moving the parts into their final position as well as linting. Now we can finally get to the interesting part.
Review comment on `def embed_text(`:
After looking at the PR in more detail, I again have my doubts about this function. As discussed before, this only exists for our benefit of embedding the prompt. However, it comes at the cost of:

- We need to mess with the type annotations quite a bit, because the typing system doesn't really support output types based on the input, e.g. `str -> str` and `list[str] -> list[str]`.
- Users have to implement two methods instead of one.
- We need to integrate two very similar methods into the Ragna protocol.

I think the costs outweigh the benefit here.
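To illustrate the first cost: supporting both `str -> str`-style and `list[str] -> list[str]`-style calls on one method forces `typing.overload` gymnastics like the sketch below. The class and method names here are illustrative, not the actual Ragna API.

```python
from typing import overload


class EmbeddingModel:
    """Sketch of the typing gymnastics a single input-dependent method needs."""

    @overload
    def embed_text(self, text: str) -> list[float]: ...

    @overload
    def embed_text(self, text: list[str]) -> list[list[float]]: ...

    def embed_text(self, text):
        # The runtime implementation has to dispatch on the input type,
        # which the overloads above only describe, not enforce.
        if isinstance(text, str):
            return self._embed_one(text)
        return [self._embed_one(t) for t in text]

    def _embed_one(self, text: str) -> list[float]:
        # Dummy embedding; a real implementation would call a model here.
        return [float(len(text))]
```

Every concrete model that wants precise types has to repeat this overload boilerplate, which is part of why the two-method design feels costly.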
Right now `embed_documents` takes `Document`s. But this will change with the chunking work to take `Chunk`s. If we move the chunking into `Chat.prepare` for now to emulate the future case for the embedding model, couldn't we create a private `_embed_text` method that stuffs the text into a dummy `Chunk`, calls the user-defined `embed_documents`, and returns the embedding? That way we should be able to eliminate all the downsides above while keeping the benefit.
Thoughts?
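The proposed helper could look roughly like this. This is only a sketch of the idea under discussion; the `Chunk` dataclass and `embed_chunks` name are assumptions standing in for the eventual Ragna types, not the merged design.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Stand-in for the future Ragna Chunk type.
    text: str


class EmbeddingModel:
    def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        # The single method users implement; dummy embedding for the sketch.
        return [[float(len(chunk.text))] for chunk in chunks]

    def _embed_text(self, text: str) -> list[float]:
        # Wrap the prompt in a dummy Chunk, reuse the user-defined
        # embed_chunks, and unwrap the single resulting embedding.
        return self.embed_chunks([Chunk(text=text)])[0]
```

With this, users implement exactly one method and the prompt path is handled privately, avoiding both the overloads and the protocol duplication.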
After reading your analysis, I think you are correct. I will implement your solution.
The main thing swaying it for me is that the user (the Ragna developer) would otherwise have to implement two different methods. I think this warrants the roundabout method.
I'd also like to point out that pretty much no matter what you do, you'll still encounter the issue of passing through a batch vs passing through a single object. We really want to be able to batch chunks because it is simply more efficient to batch-encode these things. I will show you the difference after finishing up these comments.
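The efficiency argument for batching can be made concrete with a toy model that counts backend calls: one batched call amortizes the fixed per-call overhead that a per-item loop pays repeatedly. `CountingModel` and its methods are hypothetical names for illustration only.

```python
class CountingModel:
    """Toy embedding model that counts backend invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def _backend_encode(self, texts: list[str]) -> list[list[float]]:
        # In a real model, each call carries fixed overhead
        # (tokenization setup, GPU kernel launch, HTTP round trip, ...).
        self.calls += 1
        return [[float(len(t))] for t in texts]

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # One backend call for the whole batch.
        return self._backend_encode(texts)

    def embed_one_by_one(self, texts: list[str]) -> list[list[float]]:
        # One backend call per item: same result, N times the overhead.
        return [self._backend_encode([t])[0] for t in texts]
```

Both paths produce identical embeddings; the batched path just reaches the backend once instead of once per chunk, which is the reason the API should accept lists of chunks.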
You can take a look at what I've done in commit 8c9fb84; I don't particularly like it, but it does work.
Review comment on `class EmbeddingModel(Component, ABC):`
Open for opinions here. Chroma calls this `EmbeddingFunction`; should we align with that? Are there other references that we should look at?
`EmbeddingFunction` makes more sense if you're "calling" it on some text/chunks. The user treats this as a model, so I think it should be named as such. Although, now that `embed_chunks` is its only method, we could rename it to `EmbeddingFunction` as ChromaDB does and make it callable. The issue with that is that something that simple shouldn't require a `Chunk` input; it should just be text. I've made the change in a new branch here so you can have a look.
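The Chroma-style callable alternative being weighed here can be sketched as below. This is illustrative only, not the merged Ragna design; the `Chunk` stand-in and method names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Stand-in for the Ragna Chunk type.
    text: str


class EmbeddingFunction:
    """Chroma-style callable wrapper: calling the instance delegates to
    the single user-implemented embed_chunks method."""

    def __call__(self, chunks: list[Chunk]) -> list[list[float]]:
        return self.embed_chunks(chunks)

    def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        # Subclasses override this with a real model; dummy values here.
        return [[float(len(c.text))] for c in chunks]
```

The tension noted above is visible in the signature: a callable this thin arguably wants plain `list[str]` input rather than `Chunk`s.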
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
@pmeier Sorry I've been away for a while. I believe I've caught up with your comments and made the suggested changes. Let me know when you get time to do another pass.
@Tengal-Teemo I haven't forgotten about this. I've just been rather busy the past days / weeks.
@pmeier Any chance of reviving this? I'm happy to continue with it.
I'm really sorry this is taking so long. I didn't expect to get so swamped the past few weeks. TBH, this PR is already in a fairly good state. Instead of going through yet another round of review, let me just finish it later today and merge. From there on, you can continue with the chunking PR if you like.
@pmeier Ok, no problem, I'm happy with that.
I've cleaned everything up to a state where we can move on. However, since there is much work needed on the API / UI side to make this and the chunking refactor work, I've decided we are not going to merge into `main` just yet. I've started a new branch, `chunking-embedding-dev`, that we'll use for now. We can do PRs against that branch, and when we are happy with the implementation for the Python API / REST API / web UI we can then merge into `main`.
Thanks for the hard work and patience with me Toby! Looking forward to the chunking PR.
@pmeier This is the first of the two ongoing pull requests that will represent the embedding and chunking models. All suggested changes have been implemented.