Embedding pr #369

Merged
merged 35 commits into from
Apr 30, 2024

Tengal-Teemo
Contributor

@pmeier
This is the first of the two ongoing pull requests that will represent the embedding and chunking models. All suggested changes have been implemented.

@pmeier
Member

pmeier commented Mar 19, 2024

The title says it is the "Embedding pr", but I see quite a bit of code for the chunking as well. Can we remove that to keep the review manageable?

The goal for this PR should be to only implement the embedding model base class plus one concrete class for the embedding model that we are currently using. The concrete implementation should simply hardcode the chunking as we are currently doing it for the source storages.

Everything else, e.g. factoring out the chunking logic, new chunking models, new embedding models, etc., should only be implemented in follow-up PRs.

@Tengal-Teemo
Contributor Author

You are correct, I forgot to push a rebase from yesterday that dropped the chunking commits. It should be fixed now.

Member

@pmeier pmeier left a comment

Thanks @Tengal-Teemo for moving forward with this. I did a first pass mostly about moving all the pieces into the right place. I think we are making good progress here. When everything is resolved here, I'll do another pass.

One thing that I don't really understand yet is #354 (comment). But we'll come to that later.

ragna/core/_components.py Show resolved Hide resolved
ragna/core/_components.py Outdated Show resolved Hide resolved
ragna/core/_components.py Outdated Show resolved Hide resolved
ragna/core/_components.py Outdated Show resolved Hide resolved
ragna/core/_components.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
Tengal-Teemo and others added 2 commits March 22, 2024 16:22
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
@Tengal-Teemo
Contributor Author

@pmeier I believe I've either implemented or commented on all your changes.

ragna/core/_components.py Outdated Show resolved Hide resolved
ragna/core/_components.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
ragna/core/_rag.py Outdated Show resolved Hide resolved
ragna/source_storages/_chroma.py Show resolved Hide resolved
ragna/embedding_models/_embedding.py Outdated Show resolved Hide resolved
ragna/embedding_models/_embedding.py Outdated Show resolved Hide resolved
ragna/embedding_models/_embedding.py Outdated Show resolved Hide resolved
ragna/embedding_models/_embedding.py Outdated Show resolved Hide resolved
Member

@pmeier pmeier left a comment

@Tengal-Teemo I did another pass. Still mostly moving stuff around. I'll come to the meaty part after that. All open comments, new and old, still need your attention.

ragna/embedding_models/_embedding.py Outdated Show resolved Hide resolved
ragna/source_storages/_vector_database.py Outdated Show resolved Hide resolved
@Tengal-Teemo
Contributor Author

@pmeier The error message from changing the chromadb retrieve function is as follows:

Traceback (most recent call last):
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/###/Work/ragna-dev-embedding/ui_entrypoint.py", line 28, in main
    async with rag.chat(
  File "/home/###/Work/ragna-dev-embedding/ragna/core/_rag.py", line 100, in chat
    return Chat(
  File "/home/###/Work/ragna-dev-embedding/ragna/core/_rag.py", line 178, in __init__
    self._unpacked_params = self._unpack_chat_params(params)
  File "/home/###/Work/ragna-dev-embedding/ragna/core/_rag.py", line 293, in _unpack_chat_params
    chat_params = ChatModel.model_validate(params, strict=True).model_dump(
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/site-packages/pydantic/main.py", line 503, in model_validate
    return cls.__pydantic_validator__.validate_python(
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/site-packages/pydantic/_internal/_mock_val_ser.py", line 41, in __getattr__
    val_ser = self._attempt_rebuild()
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/site-packages/pydantic/_internal/_mock_val_ser.py", line 73, in attempt_rebuild_validator
    if cls.model_rebuild(raise_errors=False, _parent_namespace_depth=5) is not False:
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/site-packages/pydantic/main.py", line 457, in model_rebuild
    _model_construction.unpack_lenient_weakvaluedict(cls.__pydantic_parent_namespace__) or {}
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/site-packages/pydantic/_internal/_model_construction.py", line 215, in __getattr__
    raise AttributeError(item)
AttributeError: __pydantic_parent_namespace__
python-BaseException

Apologies for my lack of experience with pydantic.

@Tengal-Teemo
Contributor Author

Additionally, @pmeier, I will finish up the rest of the comments tomorrow and try to work out why dropping commit ed4f91a causes the following error:

Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/###/Work/ragna-dev-embedding/ragna/__init__.py", line 15, in <module>
    from . import assistants, core, deploy, source_storages
  File "/home/###/Work/ragna-dev-embedding/ragna/source_storages/__init__.py", line 7, in <module>
    from ._chroma import Chroma
  File "/home/###/Work/ragna-dev-embedding/ragna/source_storages/_chroma.py", line 13, in <module>
    class Chroma(VectorDatabaseSourceStorage):
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/abc.py", line 106, in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
  File "/home/###/Work/ragna-dev-embedding/ragna/core/_components.py", line 233, in __init_subclass__
    valid_input_types = get_args(get_type_hints(cls)["__ragna_input_type__"])
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/typing.py", line 1459, in get_type_hints
    value = _eval_type(value, base_globals, localns)
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/typing.py", line 292, in _eval_type
    return t._evaluate(globalns, localns, recursive_guard)
  File "/home/###/miniconda3/envs/ragna-dev/lib/python3.9/typing.py", line 554, in _evaluate
    eval(self.__forward_code__, globalns, localns),
  File "<string>", line 1, in <module>
NameError: name 'Union' is not defined
python-BaseException

I'm not 100% sure, but I believe it's caused by the introduction of the __ragna_input_type__ variable somehow.
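
A minimal sketch of one way this kind of NameError can arise, assuming (not confirmed anywhere in this thread) that the annotation is stored as a string and that the module defining the class never imports Union. The class name Base is hypothetical; only the attribute name __ragna_input_type__ and the get_type_hints / get_args calls come from the traceback above.

```python
import typing


class Base:
    # String (forward-reference) annotation. `Union` is deliberately NOT
    # imported into this module's namespace.
    __ragna_input_type__: "Union[int, str]"


# get_type_hints() evaluates the string in the namespace of the module that
# defines the class, so the missing name surfaces as a NameError.
try:
    typing.get_type_hints(Base)
    error = None
except NameError as exc:
    error = str(exc)

# Supplying the missing name explicitly (or importing it in the defining
# module) lets the annotation resolve.
hints = typing.get_type_hints(Base, globalns={"Union": typing.Union})
valid_input_types = typing.get_args(hints["__ragna_input_type__"])
```

If this is indeed the cause, importing Union in the module that declares the annotation would be the simpler fix; passing globalns is only shown here to make the lookup rule visible.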

@pmeier
Member

pmeier commented Mar 26, 2024

Re #369 (comment): This is on me. In contrast to what I said earlier, we do actually check for the name:

method = getattr(cls, method_name)
concrete_params = inspect.signature(method).parameters
protocol_params = inspect.signature(
    getattr(protocol_cls, method_name)
).parameters
extra_param_names = concrete_params.keys() - protocol_params.keys()
models[(cls, method_name)] = pydantic.create_model(  # type: ignore[call-overload]
    f"{cls.__name__}.{method_name}",
    **{
        (param := concrete_params[param_name]).name: (
            param.annotation,
            param.default
            if param.default is not inspect.Parameter.empty
            else ...,
        )
        for param_name in extra_param_names
    },
)

Leave it as is. This is something I wanted to refactor anyway. I'll fix it in a follow-up PR.
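
The effect of comparing by name can be illustrated with a stdlib-only sketch. ProtocolStorage, ConcreteStorage, and their retrieve signatures are hypothetical stand-ins rather than Ragna's real classes; only the set difference mirrors the extra_param_names line quoted above.

```python
import inspect


class ProtocolStorage:
    # Hypothetical protocol method: these parameter names are "known".
    def retrieve(self, documents, prompt):
        ...


class ConcreteStorage(ProtocolStorage):
    # Renaming `prompt` to `query` and adding `num_tokens` makes both names
    # count as extra parameters, because the comparison is purely by name.
    def retrieve(self, documents, query, num_tokens=1024):
        ...


def extra_param_names(cls, protocol_cls, method_name):
    concrete_params = inspect.signature(getattr(cls, method_name)).parameters
    protocol_params = inspect.signature(
        getattr(protocol_cls, method_name)
    ).parameters
    # Key views support set operations, so this yields the names that exist
    # on the concrete method but not on the protocol method.
    return concrete_params.keys() - protocol_params.keys()


extras = extra_param_names(ConcreteStorage, ProtocolStorage, "retrieve")
```

Under these assumptions, a renamed protocol parameter is treated like any other additional parameter, which is presumably why changing the signature of the Chroma retrieve method had side effects.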

@pmeier
Member

pmeier commented Mar 26, 2024

Re #369 (comment): Fixed in 1d5d4a8. Note that this also includes a few automatic formatting changes. They have to be applied before merge anyway, so I didn't mess with the pre-commit hooks for this.

Member

@pmeier pmeier left a comment

Thanks Toby, this is going pretty well. I took the liberty of pushing a few cleanup commits for minor stuff that would just slow us down, mostly moving the parts into their final position as well as linting. Now we can finally get to the interesting part.

ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/embedding_models/_all_minilm_l6_v2.py Outdated Show resolved Hide resolved
)
]

def embed_text(
Member

After looking at the PR in more detail, I again have my doubts about this function. As discussed before, it only exists for our benefit of embedding the prompt. However, it comes at a cost:

  1. We need to mess with the type annotations quite a bit, because the typing system doesn't really support output types based on the input, e.g. str -> str and list[str] -> list[str].
  2. Users have to implement two methods instead of one.
  3. We need to integrate two very similar methods into the Ragna protocol.

I think the costs outweigh the benefit here.

Right now embed_documents takes Documents. But this will change with the chunking work to take Chunks. If we move the chunking into Chat.prepare for now to emulate the future case for the embedding model, couldn't we create a private _embed_text method that stuffs the text into a dummy Chunk, calls the user-defined embed_documents, and returns the embedding? That way we should be able to eliminate all the downsides above while keeping the benefit.

Thoughts?
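
A rough sketch of the proposal above, under stated assumptions: the Chunk dataclass with a single text field, the return type, and ToyEmbeddingModel are hypothetical, and the batch method is named embed_chunks (the name used later in this thread) rather than embed_documents. It is not the code that was merged.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical minimal chunk; the real class may carry more fields.
    text: str


class EmbeddingModel(ABC):
    @abstractmethod
    def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        """The single method a user has to implement (batch in, batch out)."""

    def _embed_text(self, text: str) -> list[float]:
        # Wrap the raw text in a dummy chunk, reuse the batch method, and
        # unwrap the single result.
        return self.embed_chunks([Chunk(text=text)])[0]


class ToyEmbeddingModel(EmbeddingModel):
    # Stand-in "model": embeds text as [character count, word count].
    def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        return [[float(len(c.text)), float(len(c.text.split()))] for c in chunks]


prompt_embedding = ToyEmbeddingModel()._embed_text("What is Ragna?")
```

With this shape, the user-facing protocol keeps one batch method with one signature, and the prompt is simply treated as a batch of size one.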

Contributor Author

After reading your analysis, I think you are correct. I will implement your solution.

Contributor Author

The main thing swaying it for me is that the user (ragna developer) has to implement two different methods. I think this warrants the roundabout method.

Contributor Author

I'd also like to point out that pretty much no matter what you do, you'll still encounter the issue of passing through a batch vs passing through a single object. We really want to be able to batch chunks because it is simply more efficient to batch-encode these things. I will show you the difference after finishing up these comments.

Contributor Author

You can take a look at what I've done in commit 8c9fb84. I don't particularly like it, but it does work.

ragna/core/_components.py Outdated Show resolved Hide resolved
chunk: Chunk


class EmbeddingModel(Component, ABC):
Member

Open for opinions here. Chroma calls this EmbeddingFunction. Should we align with that? Are there other references that we should look at?

Contributor Author

EmbeddingFunction makes more sense if you're "calling" it on some text/chunks. The user treats this as a model, so I think it should be named as such. That said, now that embed_chunks is its only method, we could rename it to EmbeddingFunction as ChromaDB does and make it callable. The issue with that is that something that simple shouldn't require a chunk input; it should just take text. I've made the change in a new branch here so you can have a look.

@Tengal-Teemo
Contributor Author

@pmeier Sorry I've been away for a while, I believe I've caught up with your comments and made the suggested changes. Let me know when you get time to do another passthrough.

@pmeier
Member

pmeier commented Apr 15, 2024

@Tengal-Teemo I haven't forgotten about this. Just been rather busy the past days / weeks.

@Tengal-Teemo
Contributor Author

@pmeier Any chance of reviving this? I'm happy to continue with it.

@pmeier
Member

pmeier commented Apr 29, 2024

I'm really sorry this is taking so long. I didn't expect to get so swamped the past few weeks. TBH, this PR is already in a fairly good state. Instead of going through yet another round of review, let me just finish it later today and merge. From there on, you can continue with the chunking PR if you like.

@Tengal-Teemo
Contributor Author

@pmeier Ok no problem, I'm happy with that.

@pmeier pmeier marked this pull request as ready for review April 30, 2024 07:26
@pmeier pmeier changed the base branch from main to chunking-embedding-dev April 30, 2024 07:26
Member

@pmeier pmeier left a comment

I've cleaned everything up to a state where we can move on. However, since there is much work needed on the API / UI side to make this and the chunking refactor work, I've decided we are not going to merge into main just yet. I've started a new branch chunking-embedding-dev that we'll use for now. We can do PRs against that branch and when we are happy with the implementation for Python API / REST API / web UI we can then merge into main.

Thanks for the hard work and patience with me, Toby! Looking forward to the chunking PR.

@pmeier pmeier merged commit 9f0051d into Quansight:chunking-embedding-dev Apr 30, 2024
12 checks passed
@pmeier pmeier linked an issue May 7, 2024 that may be closed by this pull request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

embedding model component?
2 participants