feat: Qdrant support #730

Anush008 · 2024-11-28T09:32:35Z

Description

This PR adds support for Qdrant (https://qdrant.tech) as an external database for vector search.

Qdrant can be run with:

docker run -p 6333:6333 qdrant/qdrant

A dashboard will be accessible at http://localhost:6333/dashboard.
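Once the container is running, a quick reachability check can confirm the port mapping before pointing paper-qa at it. The following is a minimal sketch using only the standard library; the helper name `qdrant_is_up` is hypothetical, and it assumes the server exposes a `/healthz` endpoint on the same port as the dashboard, which may not hold for older Qdrant versions.

```python
import urllib.error
import urllib.request


def qdrant_is_up(base_url: str = "http://localhost:6333", timeout: float = 2.0) -> bool:
    """Return True if the server answers the (assumed) /healthz endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP response
        return False
```

Calling `qdrant_is_up()` with no arguments targets the port published by the `docker run` command above.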

Testing

I've Q&A-tested the QdrantVectorStore implementation externally.

Signed-off-by: Anush008 <anushshetty90@gmail.com>

Anush008 · 2024-12-05T16:02:09Z

Hey @jamesbraza. Could you please approve the CI?

Anush008 · 2024-12-06T18:17:52Z

Weird that the mailmap pre-commit hook doesn't complain locally. I've tried a patch.

Anush008 · 2024-12-06T18:49:20Z

Alright. That's through.
I guess the OpenAI errors are unrelated.

jamesbraza

Few more comments, looking good so far

Anush008 · 2024-12-07T19:29:23Z

I believe these are the same OpenAI failures.

jamesbraza

Hi @Anush008, it looks great. Can you merge or rebase atop main for #752, and confirm whether QdrantVectorStore needs any changes?

Anush008 · 2024-12-10T13:22:12Z

> confirm if QdrantVectorStore needs any changes?

Should be fine.

jamesbraza

Nice work @Anush008, thanks for this

ThomasRochefortB · 2024-12-20T16:53:36Z

@Anush008 I managed to successfully create a docs object and push it to Qdrant using the following:

from paperqa import QdrantVectorStore, Docs
from qdrant_client import QdrantClient
import nest_asyncio

nest_asyncio.apply()

client = QdrantClient(url="localhost", port=6333)
vectorstore = QdrantVectorStore(client=client,
                                collection_name="test-collection")

docs = Docs(texts_index=vectorstore)
docs.add("testpaper.pdf")
docs.texts_index.add_texts_and_embeddings(docs.texts)

My question now is:

Is there a clever way to rebuild the Docs() object from the QdrantVectorStore directly? There seems to be everything we need persisted in the Qdrant collection.
Documenting this could be a useful addition to the README.md.

Anush008 · 2024-12-20T16:57:01Z

> Is there a clever way to rebuild the Docs() object from the QdrantVectorStore directly? There seems to be everything we need persisted in the Qdrant collection.

Not as of yet, I think. We could add something like from_existing(...) for this purpose.

ThomasRochefortB · 2024-12-20T18:26:21Z

For now I am using this, which seems to work:

from paperqa import QdrantVectorStore, Docs, Text, Doc
from qdrant_client import QdrantClient
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def recreate_docs_from_qdrant(client: QdrantClient, collection_name: str) -> Docs:
    # Initialize empty Docs with the existing vector store
    vectorstore = QdrantVectorStore(
        client=client,
        collection_name=collection_name
    )
    
    docs = Docs(texts_index=vectorstore)
    
    # Get all points from the collection
    points = client.scroll(
        collection_name=collection_name,
        with_payload=True,
        with_vectors=True,
        limit=100  # adjust based on your needs
    )[0]
    
    # Reconstruct the texts and docs
    for point in points:
        payload = point.payload
        doc = payload['doc']
        
        if doc['dockey'] not in docs.docs:
            docs.docs[doc['dockey']] = Doc(
                docname=doc['docname'],
                citation=doc['citation'],
                dockey=doc['dockey']
            )
            docs.docnames.add(doc['docname'])
        
        # Reconstruct Text object
        text = Text(
            text=payload['text'],
            name=payload['name'],
            doc=docs.docs[doc['dockey']],
            embedding=point.vector
        )
        docs.texts.append(text)
    
    return docs

# Usage:
client = QdrantClient(url="localhost", port=6333)
docs = asyncio.run(recreate_docs_from_qdrant(client, "test-collection"))

However, I think it's clunky to reload the entire vector store into RAM. I wonder if we could just use the Qdrant store as the Docs() object itself.
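Note that the `scroll` call in the snippet above reads a single page of at most 100 points, so larger collections would be truncated. A hedged sketch of following the pagination offset is below. `iter_points` and `FakeClient` are hypothetical names introduced here for illustration; the only assumption about the real client is that `scroll` returns a `(points, next_page_offset)` pair and accepts an `offset` argument.

```python
from collections.abc import Iterator
from typing import Any


def iter_points(client: Any, collection_name: str, page_size: int = 100) -> Iterator[Any]:
    """Yield every point in a collection, one scroll page at a time."""
    offset = None
    while True:
        # Assumed contract: scroll returns (points, next_page_offset),
        # and next_page_offset is None once the last page has been read.
        points, offset = client.scroll(
            collection_name=collection_name,
            with_payload=True,
            with_vectors=True,
            limit=page_size,
            offset=offset,
        )
        yield from points
        if offset is None:
            break


class FakeClient:
    """Hypothetical in-memory stand-in for a Qdrant client, only to exercise the loop."""

    def __init__(self, items):
        self._items = list(items)

    def scroll(self, collection_name, with_payload, with_vectors, limit, offset=None):
        start = offset or 0
        page = self._items[start:start + limit]
        next_offset = start + limit if start + limit < len(self._items) else None
        return page, next_offset


# Three pages (100 + 100 + 50) are consumed from the stand-in client.
all_points = list(iter_points(FakeClient(range(250)), "test-collection"))
```

In the reconstruction function, `for point in iter_points(client, collection_name):` could replace the loop over the single page. This only bounds the size of each request; the rebuilt Docs() object still holds every text in memory.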

ThomasRochefortB · 2024-12-21T00:09:34Z

@Anush008 I have created a Docs.load_docs_from_qdrant function in #776 to reconstruct the Docs() object from Qdrant. Would love your opinion!

Anush008 added 2 commits November 28, 2024 14:56

feat: Qdrant support

1084f70

Signed-off-by: Anush008 <anushshetty90@gmail.com>

ci: Configure tests

87cdc86

Signed-off-by: Anush008 <anushshetty90@gmail.com>

dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and enhancement (New feature or request) labels Nov 28, 2024

jamesbraza reviewed Dec 2, 2024

paperqa/llms.py (outdated)

paperqa/llms.py (outdated)

pyproject.toml

Anush008 added 2 commits December 3, 2024 23:45

Merge remote-tracking branch 'origin/HEAD' into qdrant

e347a0b

Signed-off-by: Anush008 <anushshetty90@gmail.com>

chore: Review updates

d15f3f3

Signed-off-by: Anush008 <anushshetty90@gmail.com>

jamesbraza reviewed Dec 3, 2024

.github/workflows/tests.yml (outdated)

tests/test_paperqa.py

chore: Missed stashed commit

fa1bc94

Signed-off-by: Anush008 <anushshetty90@gmail.com>

jamesbraza reviewed Dec 3, 2024

pyproject.toml (outdated)

chore: Fix tests

cfa6cfc

Signed-off-by: Anush008 <anushshetty90@gmail.com>

Anush008 force-pushed the qdrant branch from febc0bb to cfa6cfc Compare December 3, 2024 21:55

fix: unique _ids identifier

9836e09

Signed-off-by: Anush008 <anushshetty90@gmail.com>

Anush008 force-pushed the qdrant branch 2 times, most recently from dadc7d4 to 9836e09 Compare December 5, 2024 07:59

Anush008 added 2 commits December 5, 2024 13:30

Merge remote-tracking branch 'origin/main' into qdrant

654ff1b

Signed-off-by: Anush008 <anushshetty90@gmail.com>

chore: lockfile

fdd769b

Signed-off-by: Anush008 <anushshetty90@gmail.com>

chore: That one refurb

f87fea6

Signed-off-by: Anush008 <anushshetty90@gmail.com>

Anush008 force-pushed the qdrant branch from 8d575c0 to f87fea6 Compare December 6, 2024 06:36

fix: try .mailmap

9143b5c

Signed-off-by: Anush008 <anushshetty90@gmail.com>

jamesbraza reviewed Dec 6, 2024

paperqa/llms.py (outdated)

paperqa/llms.py (outdated)

refactor: _point_ids

8b59afd

Signed-off-by: Anush008 <anushshetty90@gmail.com>

jamesbraza reviewed Dec 10, 2024

Merge branch 'main' into qdrant

20b0831

jamesbraza approved these changes Dec 11, 2024

dosubot bot added the lgtm (This PR has been approved by a maintainer) label Dec 11, 2024

jamesbraza merged commit 0f5c494 into Future-House:main Dec 11, 2024
3 of 5 checks passed

Anush008 deleted the qdrant branch December 11, 2024 01:22
