Contributor

@Anush008 Anush008 commented Nov 28, 2024

Description

This PR adds support for Qdrant (https://qdrant.tech) as an external database for vector search.

Qdrant can be run with:

docker run -p 6333:6333 qdrant/qdrant

A dashboard will be accessible at http://localhost:6333/dashboard.
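
To confirm the container is actually serving requests, something like the following could be used. It is a minimal sketch using only the Python standard library. It assumes the REST API is reachable on the published port 6333 and that GET /collections returns the usual {"result": {"collections": [...]}} envelope. The helper names qdrant_base_url and list_collections are made up for illustration and are not part of this PR.

```python
import json
import urllib.request
from typing import List, Optional


def qdrant_base_url(host: str = "localhost", port: int = 6333) -> str:
    """Build the REST base URL; 6333 is the port published by the docker command."""
    return f"http://{host}:{port}"


def list_collections(base_url: str, timeout: float = 2.0) -> Optional[List[str]]:
    """Return collection names from GET /collections, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/collections", timeout=timeout) as resp:
            body = json.load(resp)
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return None
    # Assumed response shape: {"result": {"collections": [{"name": ...}, ...]}}
    return [c["name"] for c in body["result"]["collections"]]


if __name__ == "__main__":
    names = list_collections(qdrant_base_url())
    print("Qdrant is not reachable" if names is None else f"Collections: {names}")
```

A fresh container should report an empty list of collections.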

Testing

I've QA-tested the QdrantVectorStore implementation externally.

Signed-off-by: Anush008 <anushshetty90@gmail.com>
@dosubot added the size:L and enhancement labels Nov 28, 2024
@Anush008 force-pushed the qdrant branch 2 times, most recently from dadc7d4 to 9836e09, December 5, 2024 07:59
@Anush008
Contributor Author

Anush008 commented Dec 5, 2024

Hey @jamesbraza. Could you please approve the CI?

@Anush008
Contributor Author

Anush008 commented Dec 6, 2024

Weird that the mailman pre-commit hook doesn't complain locally. I've attempted a patch.

@Anush008
Contributor Author

Anush008 commented Dec 6, 2024

Alright. That's through.
I guess the OpenAI errors are unrelated.

Collaborator

@jamesbraza jamesbraza left a comment

Few more comments, looking good so far

@Anush008
Contributor Author

Anush008 commented Dec 7, 2024

I believe these are the same OpenAI failures.

Collaborator

@jamesbraza jamesbraza left a comment

Hi @Anush008, it looks great. Can you merge or rebase atop main for #752, and confirm if QdrantVectorStore needs any changes?

@Anush008
Contributor Author

> confirm if QdrantVectorStore needs any changes?

Should be fine.

Collaborator

@jamesbraza jamesbraza left a comment

Nice work @Anush008, thanks for this

@dosubot added the lgtm label Dec 11, 2024
@jamesbraza jamesbraza merged commit 0f5c494 into Future-House:main Dec 11, 2024
3 of 5 checks passed
@Anush008 Anush008 deleted the qdrant branch December 11, 2024 01:22
@ThomasRochefortB
Contributor

ThomasRochefortB commented Dec 20, 2024

@Anush008 I managed to successfully create a docs object and push it to Qdrant using the following:

from paperqa import QdrantVectorStore, Docs
from qdrant_client import QdrantClient
import nest_asyncio

nest_asyncio.apply()

client = QdrantClient(url="localhost", port=6333)
vectorstore = QdrantVectorStore(client=client,
                                collection_name="test-collection")

docs = Docs(texts_index=vectorstore)
docs.add("testpaper.pdf")
docs.texts_index.add_texts_and_embeddings(docs.texts)

My question now is:

  • Is there a clever way to rebuild the Docs() object from the QdrantVectorStore directly? There seems to be everything we need persisted in the Qdrant collection.
  • Documenting this in the README.md could be a useful addition.

@Anush008
Contributor Author

> Is there a clever way to rebuild the Docs() object from the QdrantVectorStore directly? There seems to be everything we need persisted in the Qdrant collection.

Not as of yet, I think. We could add something like from_existing(...) for this purpose.

@ThomasRochefortB
Contributor

For now I am using this, which seems to work:

from paperqa import QdrantVectorStore, Docs, Text, Doc
from qdrant_client import QdrantClient
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def recreate_docs_from_qdrant(client: QdrantClient, collection_name: str) -> Docs:
    # Initialize empty Docs with the existing vector store
    vectorstore = QdrantVectorStore(
        client=client,
        collection_name=collection_name
    )
    
    docs = Docs(texts_index=vectorstore)
    
    # Get all points from the collection
    points = client.scroll(
        collection_name=collection_name,
        with_payload=True,
        with_vectors=True,
        limit=100  # only the first page (up to 100 points) is fetched; adjust based on your needs
    )[0]
    
    # Reconstruct the texts and docs
    for point in points:
        payload = point.payload
        doc = payload['doc']
        
        if doc['dockey'] not in docs.docs:
            docs.docs[doc['dockey']] = Doc(
                docname=doc['docname'],
                citation=doc['citation'],
                dockey=doc['dockey']
            )
            docs.docnames.add(doc['docname'])
        
        # Reconstruct Text object
        text = Text(
            text=payload['text'],
            name=payload['name'],
            doc=docs.docs[doc['dockey']],
            embedding=point.vector
        )
        docs.texts.append(text)
    
    return docs

# Usage:
client = QdrantClient(url="localhost", port=6333)
docs = asyncio.run(recreate_docs_from_qdrant(client, "test-collection"))

However, I think it's clunky to reload the entire vector store into RAM. I wonder if we could just use the Qdrant store as the Docs() object itself.
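
One caveat with the snippet above: a single scroll call returns at most limit points plus an offset for the next page, so collections larger than 100 points would be only partially reloaded. Below is a rough sketch of following that offset until it comes back as None. scroll_all is a hypothetical helper name, and FakeClient is an in-memory stand-in (not QdrantClient) so the example runs without a server. It only mimics the (points, next_offset) return shape of scroll.

```python
from typing import Any, List, Optional


def scroll_all(client: Any, collection_name: str, page_size: int = 100) -> List[Any]:
    """Collect every point in a collection by following scroll offsets."""
    points: List[Any] = []
    offset: Optional[Any] = None
    while True:
        page, offset = client.scroll(
            collection_name=collection_name,
            with_payload=True,
            with_vectors=True,
            limit=page_size,
            offset=offset,
        )
        points.extend(page)
        if offset is None:  # no further pages to fetch
            return points


class FakeClient:
    """Hypothetical stand-in for QdrantClient, for illustration only."""

    def __init__(self, n: int) -> None:
        self._ids = list(range(n))

    def scroll(self, collection_name, with_payload, with_vectors, limit, offset=None):
        start = 0 if offset is None else offset
        page = self._ids[start : start + limit]
        next_offset = start + limit if start + limit < len(self._ids) else None
        return page, next_offset


all_points = scroll_all(FakeClient(250), "test-collection")
print(len(all_points))
```

With a real client, the result of scroll_all(client, collection_name) could replace the points = client.scroll(...)[0] assignment. This still loads everything into RAM, so it addresses completeness only, not the memory concern.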

@ThomasRochefortB
Contributor

ThomasRochefortB commented Dec 21, 2024

@Anush008 I have created a Docs.load_docs_from_qdrant function to reconstruct the Docs() object from Qdrant in #776. Would love your opinion!
