# Using Qdrant as a vector database for OpenAI embeddings

This notebook guides you step by step through using **`Qdrant`** as a vector database for OpenAI embeddings. [Qdrant](https://qdrant.tech) is a high-performance vector search database written in Rust. It offers RESTful and gRPC APIs to manage your embeddings. There is an official Python [qdrant-client](https://github.com/qdrant/qdrant_client) that eases the integration with your apps.

This notebook presents an end-to-end process of:
1. Using precomputed embeddings created by the OpenAI API.
2. Storing the embeddings in a local instance of Qdrant.
3. Converting a raw text query to an embedding with the OpenAI API.
4. Using Qdrant to perform a nearest neighbour search in the created collection.

### What is Qdrant

[Qdrant](https://qdrant.tech) is an open-source vector database that stores neural embeddings along with their metadata, a.k.a. [payload](https://qdrant.tech/documentation/payload/). Payloads can hold additional attributes of a particular point and can also be used for filtering. [Qdrant](https://qdrant.tech) applies these filters during the vector search phase itself, which makes filtered search efficient.
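
To make the filtering idea concrete, here is a minimal sketch of a search request body that combines a query vector with a payload condition, written as a plain Python dictionary in the shape used by Qdrant's REST search endpoint. The helper name `build_filtered_search`, the payload key `title`, the value `"Art"` and the three-element query vector are all assumptions chosen for illustration.

```python
import json


def build_filtered_search(query_vector, vector_name, key, value, limit=5):
    """Build a search body: nearest neighbours restricted by a payload match."""
    return {
        # Named vector to search against, plus the query embedding itself
        "vector": {"name": vector_name, "vector": list(query_vector)},
        # Only points whose payload field `key` equals `value` are considered
        "filter": {"must": [{"key": key, "match": {"value": value}}]},
        "limit": limit,
        "with_payload": True,
    }


# Hypothetical query: a tiny vector and an exact match on the payload title
body = build_filtered_search([0.1, 0.2, 0.3], "title", "title", "Art", limit=3)
print(json.dumps(body, indent=2))
```

A body like this would be sent with `POST` to `/collections/<collection_name>/points/search`, assuming a Qdrant version that exposes that endpoint. A real query vector must have the same dimensionality as the collection's vectors.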

### Deployment options

[Qdrant](https://qdrant.tech) can be launched in several ways. Depending on the target load of the application, it can be hosted:

- Locally or on premise, with Docker containers
- On Kubernetes cluster, with the [Helm chart](https://github.com/qdrant/qdrant-helm)
- Using [Qdrant Cloud](https://cloud.qdrant.io/)

### Integration

[Qdrant](https://qdrant.tech) provides both RESTful and gRPC APIs, which makes integration easy no matter which programming language you use. There are also official clients for the most popular languages, and if you use Python, the [Python Qdrant client library](https://github.com/qdrant/qdrant_client) is likely the best choice.

## Prerequisites

For the purposes of this exercise we need to prepare a couple of things:

1. Qdrant server instance. In our case a local Docker container.
2. The [qdrant-client](https://github.com/qdrant/qdrant_client) library to interact with the vector database.

### Start Qdrant server

We're going to use a local Qdrant instance running in a Docker container. The easiest way to launch it is to use the attached [docker-compose.yaml] file and run the following command:

In [1]:
! docker-compose up -d

! sudo docker run -d -p 6333:6333 -p 6334:6334 --name qdrant -v /userdata/temp:/share qdrant/qdrant

qdrant_qdrant_1 is up-to-date


We can verify that the server started successfully by running a simple curl command:

In [29]:
! curl http://localhost:6333

! export http_proxy=http://127.0.0.1:8888

curl: (7) Failed to connect to localhost port 6333: Connection refused
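
The same check can be done from Python, which is handy when a later cell should only run if the server is reachable. The following is a minimal sketch using only the standard library; the function name `is_qdrant_up` and the default timeout are assumptions, and the base URL mirrors the port mapping used above.

```python
import urllib.error
import urllib.request


def is_qdrant_up(base_url="http://localhost:6333", timeout=2.0):
    """Return True if an HTTP GET on the base URL succeeds, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


print(is_qdrant_up())
```

A `False` result corresponds to the `Connection refused` message in the curl output: nothing is listening on that port, so the container is probably not running or the port is not published.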


### Install requirements

The cells below need the `qdrant-client` and `pandas` packages, plus `openai` if you re-enable the commented-out embedding call in the search section. The following command installs the first two:


In [1]:
! pip install qdrant-client pandas

Collecting qdrant-client
  Downloading qdrant_client-1.1.6-py3-none-any.whl (126 kB)
Collecting grpcio-tools>=1.41.0
  Downloading grpcio_tools-1.54.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting pydantic<2.0,>=1.8
  Downloading pydantic-1.10.7-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Collecting httpx[http2]>=0.14.0
  Downloading httpx-0.24.0-py3-none-any.whl (75 kB)
Collecting grpcio>=1.41.0
  Downloading grpcio-1.54.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB)
Collecting urllib3<2.0.0,>=1.26.14
  Downloading urllib3-1.26.15-py2.py3-none-any.wh

## Connect to Qdrant

Connecting to a running instance of Qdrant server is easy with the official Python library:

In [1]:
import qdrant_client

# import os
# os.environ["http_proxy"] = "http://192.168.0.106:8888"
# os.environ["https_proxy"] = "http://127.0.0.1:8888"


client = qdrant_client.QdrantClient(
    host="localhost",
    #  port=6333
    prefer_grpc=True,
)

We can test the connection by running any available method:

In [2]:
client.get_collections()

CollectionsResponse(collections=[])

## Load data

In this section we load data that was prepared ahead of this session, so you don't have to recompute the embeddings of the Wikipedia articles with your own credits.

We load it from the provided CSV file:

In [3]:
import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

| | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |
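
The `literal_eval` step matters because a CSV file can only hold text, so each vector arrives as a string that merely looks like a list. The sketch below shows the conversion on a single shortened value; the two numbers are the first components of the April title vector shown above, and the variable names are illustrative.

```python
from ast import literal_eval

# A shortened stand-in for one cell of the `title_vector` column
raw_cell = "[0.001009464613161981, -0.020700545981526375]"

# literal_eval parses Python literals only, so it cannot run arbitrary code
vector = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(vector).__name__, len(vector))
# prints: str -> list 2
```

Without this step the columns would contain strings, and Qdrant would reject them as vectors.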


## Index data

Qdrant stores data in __collections__, where each object is described by at least one vector and may carry additional metadata called a __payload__. Our collection will be called **Articles**, and each object will be described by both **title** and **content** vectors. Qdrant does not require you to set up any kind of schema beforehand, so you can freely add points to the collection after only a simple setup.

We will start with creating a collection, and then we will fill it with our precomputed embeddings.

In [4]:
from qdrant_client.http import models as rest

vector_size = len(article_df["content_vector"][0])

client.recreate_collection(
    collection_name="Articles",
    vectors_config={
        "title": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        "content": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

True
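
Both named vectors are configured with `Distance.COSINE`, so search results are ranked by cosine similarity. As a reminder of what that score measures, here is a minimal pure-Python sketch; the function name `cosine_similarity` is illustrative, and this is not a description of how Qdrant computes the score internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal -> 0.0
```

The score depends only on direction, not on vector length, which is why the first pair scores 1.0 even though the magnitudes differ.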

In [11]:
import numpy as np
from qdrant_client.http import models as rest

# Random placeholder vectors standing in for real embeddings
count = 40000
dim = 1536
vectors = np.random.rand(count, dim).astype(np.float32)
vectors2 = np.random.rand(count, dim).astype(np.float32)

# Offset added to every generated point id
id_base = 25000

client.upsert(
    collection_name="Articles",
    points=[
        rest.PointStruct(
            id=i + id_base,
            vector={
                "title": vector.tolist(),
                "content": vectors2[i].tolist(),
            },
            payload={
                "id": i + id_base,
                "text": f"text {i + id_base}",
                "title": f"title {i + id_base}",
                "url": f"http://www.baidu.com/?{i + id_base}",
                "vector_id": i + id_base,
            },
        )
        for i, vector in enumerate(vectors)
    ],
)

SyntaxError: invalid syntax (4270993373.py, line 29)

In [5]:
client.upsert(
    collection_name="Articles",
    points=[
        rest.PointStruct(
            id=k,
            vector={
                "title": v["title_vector"],
                "content": v["content_vector"],
            },
            payload=v.to_dict(),
        )
        for k, v in article_df.iterrows()
    ],
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [6]:
# Check the collection size to make sure all the points have been stored
client.count(collection_name="Articles")

CountResult(count=25000)
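
The upsert above sends all points in a single call. For larger datasets it can be gentler on memory and request size to send the points in chunks. The helper below is a minimal sketch of that idea; the name `batched` and the batch size of 1000 are assumptions, not values recommended by Qdrant.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


# 25,000 stand-in items split into chunks of 1,000
sizes = [len(chunk) for chunk in batched(range(25000), 1000)]
print(len(sizes), sizes[0], sizes[-1])
# prints: 25 1000 1000
```

With real data, each chunk would be a list of `rest.PointStruct` objects passed to `client.upsert(collection_name="Articles", points=chunk)`, the same call used in the cell above.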

## Search data

Once the data is stored in Qdrant, we can query the collection for the closest vectors. The optional `vector_name` parameter switches between title-based and content-based search. Since the precomputed embeddings were created with the `text-embedding-ada-002` OpenAI model, the query has to be embedded with the same model. In the cell below the OpenAI call is commented out and a random vector stands in for the query embedding.


In [6]:
# import openai
import numpy as np


def query_qdrant(query, collection_name, vector_name="title", top_k=20):
    # Creates embedding vector from user query
    # embedded_query = openai.Embedding.create(
    #     input=query,
    #     model="text-embedding-ada-002",
    # )["data"][0]["embedding"]

    # Generate a random float array as a placeholder for the query embedding
    arr = np.random.rand(1536).astype(np.float32).tolist()


    query_results = client.search(
        collection_name=collection_name,
        query_vector=(
            vector_name,  arr #embedded_query
        ),
        limit=top_k,
    )

    return query_results

In [17]:
# import os
# os.environ["http_proxy"] = ""
# "http://127.0.0.1:1231"
# os.environ["https_proxy"] = ""
# "http://127.0.0.1:1231"


query_results = query_qdrant("modern art in Europe", "Articles")
for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")

1. title 18862 (Score: 0.78)
2. title 28128 (Score: 0.78)
3. title 7307 (Score: 0.779)
4. title 32357 (Score: 0.778)
5. title 35346 (Score: 0.777)
6. title 22915 (Score: 0.776)
7. title 10949 (Score: 0.775)
8. title 16399 (Score: 0.775)
9. title 20487 (Score: 0.775)
10. title 10643 (Score: 0.775)
11. title 18689 (Score: 0.775)
12. title 36154 (Score: 0.774)
13. title 36977 (Score: 0.774)
14. title 24123 (Score: 0.774)
15. title 44680 (Score: 0.774)
16. title 42819 (Score: 0.774)
17. title 3273 (Score: 0.774)
18. title 34529 (Score: 0.774)
19. title 36683 (Score: 0.774)
20. title 17964 (Score: 0.773)
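
These titles and the narrow band of scores come from random placeholder vectors, so the ranking carries no semantic meaning. The sketch below, which assumes components drawn uniformly from [0, 1) as `np.random.rand` does, shows why such scores cluster around 0.75; the seed and variable names are arbitrary.

```python
import math
import random

random.seed(0)  # arbitrary seed, only to make the run repeatable
dim = 1536

a = [random.random() for _ in range(dim)]
b = [random.random() for _ in range(dim)]

dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))

# For uniform [0, 1) components: E[x*y] = 1/4 and E[x*x] = 1/3,
# so the cosine concentrates near (1/4) / (1/3) = 0.75
similarity = dot / (norm_a * norm_b)
print(round(similarity, 3))
```

The top-20 scores listed above sit slightly higher than 0.75 because they are the best matches out of tens of thousands of random points. Real embeddings would spread the scores out and return related articles.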


In [41]:

# import os
# os.environ["http_proxy"] = "http://192.168.0.106:8888"

# This time we'll query using content vector
query_results = query_qdrant("Famous battles in Scottish history", "Articles", "content")
for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")

ValidationError: 1 validation error for NamedVector
vector
  value is not a valid list (type=type_error.list)
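
The message says that the `vector` field received something other than a list. One plausible cause, stated here as an assumption because the state of the failing run is not shown, is that a NumPy array was passed as the query vector without calling `.tolist()`. A defensive conversion could look like the sketch below; `to_plain_list` and `FakeArray` are hypothetical names, and `FakeArray` only stands in for an array type so the example runs without NumPy.

```python
def to_plain_list(vector):
    """Convert array-like input to a list of Python floats."""
    if hasattr(vector, "tolist"):  # e.g. a NumPy array
        vector = vector.tolist()
    return [float(x) for x in vector]


class FakeArray:
    """Hypothetical stand-in for an array type exposing `tolist()`."""

    def __init__(self, values):
        self._values = values

    def tolist(self):
        return list(self._values)


print(to_plain_list(FakeArray([1, 2.5])))  # [1.0, 2.5]
print(to_plain_list((0.1, 0.2)))           # [0.1, 0.2]
```

Applying such a conversion to the query vector inside `query_qdrant` before calling `client.search` would rule out this class of error.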