embedding model component? #191
So on another project, we were looking into a more mature RAG pipeline that has been in development for a while, and the approach it implemented is actually even more fine-grained. I don't have the chart in front of me, but document-to-embedding-to-storage was a multi-step pipeline with configurable components, including a summarization and metadata-extraction step to annotate the chunks. To be fair, they were also looking at supporting extracting structure from PowerPoint, tables, etc. Overall, I think it makes sense to go in this direction. Would it enable the embed-once-and-use-multiple-LLMs workflow, or is that orthogonal?
It is orthogonal. The point is to make the embedding model a "first class citizen". Right now, everything is hardcoded in `ragna/ragna/source_storages/_vector_database.py` (lines 74 to 78 in `8d14f87`).
We could potentially add a flag for it on the `retrieve` and `store` methods so users could set it at runtime through the additional chat parameters (lines 62 to 70 in `8d14f87`).
But that has the downside that all embedding models have to be specified by us. Or, if the user wants to have a different embedding model, they would need to create their own source storage (subclassing will do the heavy lifting, but still). By adding a separate `EmbeddingModel` component instead, users could plug in their own model without touching the source storage.
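As a rough sketch of what that separate component could look like (all names here, `EmbeddingModel`, `Embedding`, and `DummyEmbeddingModel`, are hypothetical illustrations, not Ragna's actual API):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Embedding:
    # hypothetical shape: the vector plus the text it represents
    values: list[float]
    text: str


class EmbeddingModel(ABC):
    # hypothetical first-class component, parallel to SourceStorage/Assistant
    @abstractmethod
    def embed(self, texts: list[str]) -> list[Embedding]:
        ...


class DummyEmbeddingModel(EmbeddingModel):
    # toy model: "embeds" each text as its character count, interface only
    def embed(self, texts: list[str]) -> list[Embedding]:
        return [Embedding(values=[float(len(t))], text=t) for t in texts]


embeddings = DummyEmbeddingModel().embed(["hello", "world!"])
print([e.values for e in embeddings])  # [[5.0], [6.0]]
```

A `SourceStorage.store()` could then accept `list[Embedding]` instead of raw documents, which is the interface change being discussed.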
tl;dr Independent configurability for both chunking and embedding behaviour, along with support for hybrid search (semantic + lexical), would hopefully bring 80% of the performance with 20% of the work. Details:
This has been my experience to date. I suspect it may be necessary to get the level of quality that most users will expect. A useful graphic from OpenAI on their RAG experience was recently posted on Twitter/X (see the image in the tweet if it doesn't appear through the link). The numbers can be taken with a grain of salt, but from a qualitative perspective, configurability around embeddings and chunking seems very important.

My personal preference (and I also think that this is reasonably in line with other RAG tools currently available) is to have at least 3 largely-independent pieces: document loading, document embedding and document querying. I think this is quite doable with the current configuration, and not too far off what @pmeier is suggesting.

- document loading - This includes both the data connectors and the data chunking strategies. I would recommend the chunking live here (rather than on the embedding), as users will likely want to vary both independently, i.e. "is it the chunking or the embedding that is degrading my performance?". It also means a "document store" can be populated, of documents that have been loaded and chunked but not embedded.
- document embedding - @pmeier's suggestions seem sensible to me. An additional point to keep in mind is that I suspect a lot of embedding models will be hosted elsewhere, i.e. this will transform the current in-process, compute-bound approach into a network call that is more like I/O work. A queue is not really needed for this, I think; perhaps this can be leveraged in some way.
- document querying - hybrid search that combines lexical and semantic search is a feature that has long been used in RecSys. Supported storages like

Other
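The three-piece split can be sketched with minimal toy implementations (the class and function names below are illustrative assumptions, not an existing API):

```python
from typing import Protocol


class Loader(Protocol):
    # loading owns chunking, so chunking and embedding can vary independently
    def load(self, text: str) -> list[str]: ...


class Embedder(Protocol):
    def embed(self, chunks: list[str]) -> list[list[float]]: ...


class SentenceLoader:
    # toy chunking strategy: one chunk per sentence
    def load(self, text: str) -> list[str]:
        return [s.strip() for s in text.split(".") if s.strip()]


class LengthEmbedder:
    # toy embedder: a one-dimensional "vector" per chunk
    def embed(self, chunks: list[str]) -> list[list[float]]:
        return [[float(len(c))] for c in chunks]


def ingest(loader: Loader, embedder: Embedder, text: str) -> list[list[float]]:
    # the stages compose, but either one can be swapped in isolation
    return embedder.embed(loader.load(text))


vectors = ingest(SentenceLoader(), LengthEmbedder(), "First part. Second.")
print(vectors)  # [[10.0], [6.0]]
```

The point of the structural typing is that answering "is it the chunking or the embedding that is degrading my performance?" only requires swapping one of the two pieces.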
In #176, @pmeier thought I might like to chime in here based on my use-case. I think it would be great to have a separate `EmbeddingModel` component.
Makes sense to me. One thing that I don't want to support right now is to have a separate "chunk storage". Meaning, we only store documents and chunking will always be calculated on the fly. This ties into the "chunk / embed" once use case mentioned in #191 (comment) and #176 (comment) for which we don't have a good solution yet.
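Computing chunks on the fly is viable because chunking is deterministic: given the same text and parameters, the same chunks come back every time. A minimal sliding-window sketch (parameter names borrowed from the thread, the implementation itself is assumed):

```python
def chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # deterministic sliding window: re-running always yields the same chunks,
    # so chunks never need to be stored, only recomputed
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]


print(chunk("abcdefghij", chunk_size=4, chunk_overlap=1))  # ['abcd', 'defg', 'ghij']
```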
See #204 for a write-up of potentially removing the task queue from the Python API. For the REST API / UI though it will stay.
Not sure I understand. Isn't
IMO no. If users want that, they can write a
I don't understand the use case. Why would you want to do that and not just start another chat?
Yeah, the UI unfortunately makes a lot of hardcoded assumptions right now. This is something that we urgently need to address. However, to make this truly dynamic, we need to annotate the parameters on the components' methods with some extra metadata. I have some ideas for that, but no proper design yet.
They will not be ripped out, but just move to a different component. Will answer on the discussion in detail.
Yes, poorly worded on my part. I was pointing out that I considered this to be the relevant abstraction; I should also have mentioned that the abstraction already exists! Like you said, it can be configured for further use-cases in the future if this is a path that is considered useful for
It's probably a bit contrived, but I was imagining a case where an org has multiple embedding models for distinct sets of data. For example, they have an embedding model for emails and another for Slack messages (as both will be written in different styles, lengths, etc.). If I wanted to explore my emails and Slack messages at the same time, then ideally I would be able to have both embedding models in the same chat. On reflection though, it does seem like this is not really a high priority at the current stage of development. Perhaps best to pretend I never mentioned it. ;)
Just my $0.02 here, but I think we should be careful about ever-expanding method signatures. I would love to see an abstraction like the following (with appropriate UI support):

```python
class HybridSourceStorage(SourceStorage):
    def __init__(self, config: Config, *args):
        # stuff
        ...


hybrid = HybridSourceStorage(
    config,
    SemanticSourceStorage(…),
    LexicalSourceStorage(…),
)
```
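One way a `HybridSourceStorage` could merge results from its two children is reciprocal rank fusion, a standard technique for combining lexical and semantic rankings (this is my own illustrative sketch, not anything the thread commits to):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # each child storage contributes a ranked list; documents are scored by
    # 1 / (k + rank), so agreement across rankings floats items to the top
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)


semantic = ["a", "b", "c"]  # e.g. results from the semantic child
lexical = ["b", "d", "a"]   # e.g. results from the lexical child
print(reciprocal_rank_fusion([semantic, lexical]))  # ['b', 'a', 'd', 'c']
```

Because fusion only needs ranked lists, the parent never has to look inside the children's signatures, which is exactly the "keep methods clean" argument.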
Although currently marked experimental,

Meaning, if we went with your abstraction, this would be quite a bit harder, since we now have two different source storages that need to hit the same table.
My gut tells me you should keep the methods on `SourceStorage` as clean as possible. That's going to allow you to do composition (i.e., `HybridSourceStorage`) and other cool stuff so much more easily. The implementation details are left as an exercise for the reader 😄 You could do something as follows, though:

```python
class LanceDB(SourceStorage):
    def __init__(self, config, values: list[Value]):
        value_dict = {value.name: value.value for value in values}
        self.chunk_size = value_dict["chunk_size"]
        self.chunk_overlap = value_dict["chunk_overlap"]
        self.num_tokens = value_dict["num_tokens"]
        self.query_type = value_dict["query_type"]

    def store(self, documents: list[Document]) -> None:
        # Implementation for storing documents
        ...

    def retrieve(self, documents: list[Document], prompt: str) -> list[Source]:
        # Implementation for retrieving sources
        ...

    @staticmethod
    def get_params():
        return [
            Param("chunk_size", int, default=1024, values=range(512, 4096), description="Size of data chunks"),
            Param("chunk_overlap", int, default=100, values=range(50, 500), description="Overlap size between chunks"),
            Param("num_tokens", int, default=500, values=range(100, 1000), description="Number of tokens to process"),
            Param("query_type", str, default="semantic", values=["semantic", "lexical"], description="Type of query: semantic or lexical"),
        ]
```

Then, to create a hybrid:

```python
hybrid_storage = HybridSourceStorage(
    config,
    Chroma(config, [Value("param1", "value1"), ...]),        # Chroma for semantic search
    LanceDB(config, [Value("query_type", "lexical"), ...]),  # LanceDB for lexical search
)
```

Now the question is: what if you want to do

```python
hybrid_storage = HybridSourceStorage(
    config,
    LanceDB(config, [Value("query_type", "semantic"), ...]),  # LanceDB for semantic search
    LanceDB(config, [Value("query_type", "lexical"), ...]),   # LanceDB for lexical search
)
```

That's a good question, and I haven't studied LanceDB closely enough to know what it does under the covers. But my opinion is that the code should be structured such that this works (even if the implementation requires redundant tables for now).
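The `get_params()` idea pairs naturally with a small resolver that applies user-supplied `Value`s over the declared defaults and validates them. A sketch, with `Param` and `Value` defined here as plain dataclasses since their real shape is not pinned down in the thread:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Param:
    name: str
    type: type
    default: Any
    values: Any = None  # allowed choices or range, if any
    description: str = ""


@dataclass
class Value:
    name: str
    value: Any


def resolve(params: list[Param], values: list[Value]) -> dict[str, Any]:
    # start from declared defaults, overlay user values, and reject
    # anything outside the declared choices
    provided = {v.name: v.value for v in values}
    resolved: dict[str, Any] = {}
    for p in params:
        value = provided.get(p.name, p.default)
        if p.values is not None and value not in p.values:
            raise ValueError(f"{p.name}={value!r} not in {p.values!r}")
        resolved[p.name] = value
    return resolved


params = [Param("query_type", str, default="semantic", values=["semantic", "lexical"])]
print(resolve(params, [Value("query_type", "lexical")]))  # {'query_type': 'lexical'}
```

A UI could render the same `Param` declarations as form fields, which is one way the "appropriate UI support" could fall out of the declarations for free.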
@pmeier I would be interested in working on this as it really, really needs to be done. I propose a component called an `EmbeddingModel`.
This system would be expandable for different chunking methods as well, should you want to edit them.
Agreed.
Let's start with just the design for the Python API. If we have that down, we can deal with the rest. This also hinges on #313.
I think we should not have a default value for the

Assuming we now have an

`ragna/ragna/core/_components.py`, line 103 in `8b0b0f3`
When the chat is created and we find

Based on the same information, we could also communicate whether or not the chat requires an embedding model when using the REST API / web UI.

With this proposal, the

What do you think?
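Deciding "based on the same information" whether a chat needs an embedding model could work by inspecting the annotation of the source storage's `store()` parameter. The class names and helper below are hypothetical; the mechanism is just `typing.get_type_hints`:

```python
import typing


class Document: ...
class Embedding: ...


class VectorStorage:
    # stores embeddings, so a chat using it needs an embedding model
    def store(self, embeddings: list[Embedding]) -> None: ...


class FullDocumentStorage:
    # takes raw documents and embeds internally (Kendra-style)
    def store(self, documents: list[Document]) -> None: ...


def requires_embedding_model(source_storage_cls: type) -> bool:
    # look at the annotation of store()'s first real parameter:
    # list[Embedding] means the chat must also configure an embedding model
    hints = typing.get_type_hints(source_storage_cls.store)
    first = next(hint for name, hint in hints.items() if name != "return")
    return typing.get_args(first)[0] is Embedding


print(requires_embedding_model(VectorStorage))        # True
print(requires_embedding_model(FullDocumentStorage))  # False
```

The same check could drive the REST API / web UI, surfacing an embedding-model picker only for storages whose `store()` consumes embeddings.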
Indeed. See also #297. With my proposal above, we could add another option for the
@pmeier I've read through your proposal. I think I understand. I've been working on this problem today, and what I have ended up doing (which I think is in line with your vision) is completely separating the

I'll continue working on this until I'm happy with it, and then I'll submit a pull request. Let me know if I've misunderstood what your vision for this feature is.
Yes, that sounds correct. If you are unsure about anything, feel free to post a draft PR so we can have an early look.
If that is the case, there is no real way of separating the

That being said, why do we need to know this upfront? In our builtin source storages we have two different cases:
Do you have a different use case in mind?
I've changed LanceDB to use the method you describe, where the schema is created when you store a list of
Currently, the `SourceStorage` is responsible for extracting the text from the document, embedding it, and finally storing it. In this issue, I want to discuss whether it makes sense for us to factor out the extraction and embedding logic into a separate `EmbeddingModel` component on the same level as `SourceStorage` and `Assistant`. Here is what would change:

- An `EmbeddingModel` class would have an `.embed()` method that takes a list of documents as inputs and returns a list of `Embedding`s.
- `Embedding` is a dataclass that holds two things, among them the vector (`list[float]`) of the embedded text.
- The `SourceStorage.store()` method would no longer take a list of documents, but rather a list of embeddings as input.
- `chunk_size` and `chunk_overlap` would move from the `SourceStorage` to the `EmbeddingModel`. That means that for all currently implemented source storages, there is no configuration anymore. This is not a bad thing, just pointing it out.

We have not implemented it that way from the start to allow source storages where the user has no control over the extraction and embedding. For example, AWS Kendra needs to take the full document and does the extraction and embedding in the background. Meaning, it will not fit the schema above.

That being said, AWS Kendra will under the current policy not land in Ragna, since it requires additional infrastructure. And for custom solutions it is always possible to fall back to a custom `class NoEmbedding(Embedding)` that doesn't do any embedding, but just stores the document so the source storage can do its thing.

So the question here is: do we want to have a separate `EmbeddingModel` component? It would make it easier for users to try new things, but makes it potentially harder on source storages that don't operate like a "regular" vector database.