Embedding pr #369
Conversation
The title says it is the "Embedding pr", but I see quite a bit of code for the chunking as well. Can we remove that to keep the review manageable? The goal for this PR should be to only implement the embedding model base class plus one concrete class for the embedding model that we are currently using. The concrete implementation should simply hardcode the chunking as we are currently doing it for the source storages. Everything else, e.g. factoring out the chunking logic, new chunking models, new embedding models, etc., should only be implemented in follow-up PRs.
Force-pushed from beced90 to f179b69.
You are correct, I forgot to push a rebase from yesterday that dropped the chunking commits. It should be fixed now.
Thanks @Tengal-Teemo for moving forward with this. I did a first pass mostly about moving all the pieces into the right place. I think we are making good progress here. When everything is resolved here, I'll do another pass.
One thing that I don't really understand yet is #354 (comment). But we'll come to that later.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Force-pushed from aeb91d5 to e664a1b.
Force-pushed from bfce5e7 to 2735298.
@pmeier I believe I've either implemented or commented on all your changes.
@Tengal-Teemo I did another pass. Still mostly moving stuff around. I'll come to the meaty part after that. All open comments new and old still need your attention.
…structor from Embedding
…es by re-importing source_storages
@pmeier The error message from changing the chromadb retrieve function is as follows:
Apologies for my lack of experience with pydantic.
Additionally @pmeier, I will finish up the rest of the comments tomorrow and try to work out why dropping commit ed4f91a causes the following error:
I'm not 100% sure, but I believe it's caused by the introduction of the
Re #369 (comment): This is on me. In contrast to what I said earlier, we do actually check for the name (see `ragna/core/_components.py`, lines 52 to 70 in a5e4795).
Leave it as is. This is something I wanted to refactor anyway. I'll fix it in a follow-up PR.
Re #369 (comment): Fixed in 1d5d4a8. Note that this also includes a few automatic formatting changes. They have to be applied before merge anyway, so I didn't mess with the pre-commit hooks for this.
Thanks Toby, this is going pretty well. I took the liberty of pushing a few cleanup commits about some minor stuff that would just slow us down. Mostly moving the parts into their final position as well as linting. Now we can finally get to the interesting part.
Review comment on `def embed_text(`:
After looking at the PR in more detail, I again have my doubts about this function. As discussed before, this only exists for our benefit of embedding the prompt. However, it comes at the cost of:

- We need to mess with the type annotations quite a bit, because the typing system doesn't really support output types based on the input, e.g. `str -> str` and `list[str] -> list[str]`.
- Users have to implement two methods instead of one.
- We need to integrate two very similar methods into the Ragna protocol.

I think the costs outweigh the benefit here.
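To illustrate the first cost: supporting both `str -> str`-style and `list[str] -> list[str]`-style calls on one method forces `typing.overload` gymnastics like the sketch below. The class and method names here are illustrative, not the actual Ragna API.

```python
from typing import overload


class EmbeddingModel:
    """Sketch of the typing gymnastics a single input-dependent method needs."""

    @overload
    def embed_text(self, text: str) -> list[float]: ...

    @overload
    def embed_text(self, text: list[str]) -> list[list[float]]: ...

    def embed_text(self, text):
        # The runtime implementation has to dispatch on the input type,
        # which the overloads above only describe, not enforce.
        if isinstance(text, str):
            return self._embed_one(text)
        return [self._embed_one(t) for t in text]

    def _embed_one(self, text: str) -> list[float]:
        # Dummy embedding; a real implementation would call a model here.
        return [float(len(text))]
```

Every concrete model that wants precise types has to repeat this overload boilerplate, which is part of why the two-method design feels costly.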
Right now `embed_documents` takes `Document`s. But this will change with the chunking work to take `Chunk`s. If we move the chunking into `Chat.prepare` for now to emulate the future case for the embedding model, couldn't we create a private `_embed_text` method that stuffs the text into a dummy `Chunk`, calls the user-defined `embed_documents`, and returns the embedding? That way we should be able to eliminate all the downsides above while keeping the benefit.
Thoughts?
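The proposed helper could look roughly like this. This is only a sketch of the idea under discussion; the `Chunk` dataclass and `embed_chunks` name are assumptions standing in for the eventual Ragna types, not the merged design.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Stand-in for the future Ragna Chunk type.
    text: str


class EmbeddingModel:
    def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        # The single method users implement; dummy embedding for the sketch.
        return [[float(len(chunk.text))] for chunk in chunks]

    def _embed_text(self, text: str) -> list[float]:
        # Wrap the prompt in a dummy Chunk, reuse the user-defined
        # embed_chunks, and unwrap the single resulting embedding.
        return self.embed_chunks([Chunk(text=text)])[0]
```

With this, users implement exactly one method and the prompt path is handled privately, avoiding both the overloads and the protocol duplication.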
After reading your analysis, I think you are correct. I will implement your solution.
The main thing swaying it for me is that the user (the Ragna developer) would otherwise have to implement two different methods. I think this warrants the roundabout method.
I'd also like to point out that pretty much no matter what you do, you'll still encounter the issue of passing through a batch vs passing through a single object. We really want to be able to batch chunks because it is simply more efficient to batch-encode these things. I will show you the difference after finishing up these comments.
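The efficiency argument for batching can be made concrete with a toy model that counts backend calls: one batched call amortizes the fixed per-call overhead that a per-item loop pays repeatedly. `CountingModel` and its methods are hypothetical names for illustration only.

```python
class CountingModel:
    """Toy embedding model that counts backend invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def _backend_encode(self, texts: list[str]) -> list[list[float]]:
        # In a real model, each call carries fixed overhead
        # (tokenization setup, GPU kernel launch, HTTP round trip, ...).
        self.calls += 1
        return [[float(len(t))] for t in texts]

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # One backend call for the whole batch.
        return self._backend_encode(texts)

    def embed_one_by_one(self, texts: list[str]) -> list[list[float]]:
        # One backend call per item: same result, N times the overhead.
        return [self._backend_encode([t])[0] for t in texts]
```

Both paths produce identical embeddings; the batched path just reaches the backend once instead of once per chunk, which is the reason the API should accept lists of chunks.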
You can take a look at what I've done in commit 8c9fb84; I don't particularly like it, but it does work.
Review comment on `class EmbeddingModel(Component, ABC):`
Open for opinions here. Chroma calls this `EmbeddingFunction`; should we align with that? Are there other references that we should look at?
`EmbeddingFunction` makes more sense if you're "calling" it on some text/chunks. The user treats this as a model, so I think it should be named as such. Although, now that `embed_chunks` is its only method, we could rename it to `EmbeddingFunction` as ChromaDB does and make it callable. The issue with that is that something that simple shouldn't require a `Chunk` input; it should just be text. I've made the change in a new branch here so you can have a look.
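The Chroma-style callable alternative being weighed here can be sketched as below. This is illustrative only, not the merged Ragna design; the `Chunk` stand-in and method names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Stand-in for the Ragna Chunk type.
    text: str


class EmbeddingFunction:
    """Chroma-style callable wrapper: calling the instance delegates to
    the single user-implemented embed_chunks method."""

    def __call__(self, chunks: list[Chunk]) -> list[list[float]]:
        return self.embed_chunks(chunks)

    def embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        # Subclasses override this with a real model; dummy values here.
        return [[float(len(c.text))] for c in chunks]
```

The tension noted above is visible in the signature: a callable this thin arguably wants plain `list[str]` input rather than `Chunk`s.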
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
@pmeier Sorry I've been away for a while. I believe I've caught up with your comments and made the suggested changes. Let me know when you get time to do another pass.
@Tengal-Teemo I haven't forgotten about this. I've just been rather busy the past days / weeks.
@pmeier Any chance of reviving this? I'm happy to continue with it.
I'm really sorry this is taking so long. I didn't expect to get so swamped the past few weeks. TBH, this PR is already in a fairly good state. Instead of going through yet another round of review, let me just finish it later today and merge. From there on, you can continue with the chunking PR if you like.
@pmeier Ok, no problem, I'm happy with that.
I've cleaned everything up to a state where we can move on. However, since there is much work needed on the API / UI side to make this and the chunking refactor work, I've decided we are not going to merge into `main` just yet. I've started a new branch, `chunking-embedding-dev`, that we'll use for now. We can do PRs against that branch, and when we are happy with the implementation for the Python API / REST API / web UI we can then merge into `main`.
Thanks for the hard work and patience with me Toby! Looking forward to the chunking PR.
@pmeier This is the first of the two ongoing pull requests that will represent the embedding and chunking models. All suggested changes have been implemented.