# [Langchain with Qdrant](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/qdrant.html)

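## 0. Introduction to [Qdrant](https://qdrant.tech/documentation/quick-start/)

`Qdrant` is a `vector database` implemented in `Rust`; it supports cloud hosting and distributed storage.

It offers several storage modes:

+ In-memory database: memory only, so the data is lost when the program exits;
+ On-disk database: implemented with `SQLite`, easy to deploy, and suited to small datasets;
+ Standard client/server storage service
    - Self-hosted server: set up with `Docker`
    - Official [Qdrant Cloud](https://cloud.qdrant.io/), which requires logging in and creating an `API_KEY`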

## 1. Environment

+ Runtime environment: Windows 11
+ An OpenAI API key, stored in the `OPENAI_API_KEY` environment variable on your machine
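
Before running the cells below, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch using only the standard library; the helper name `read_api_key` is hypothetical and is not part of LangChain or the OpenAI SDK.

```python
import os
from typing import Mapping, Optional


def read_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "OPENAI_API_KEY",
) -> str:
    """Return the key stored under `name`, or raise a clear error.

    `read_api_key` is a hypothetical helper used only for illustration.
    """
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

Calling `read_api_key()` with no arguments reads the real environment and fails early with a readable message if the variable is missing.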

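Steps to set up a `Qdrant` server with `Docker`:

+ Pull the image: `docker pull qdrant/qdrant`
+ Run the container: `docker run -p 6333:6333 -p 6334:6334 -e QDRANT__SERVICE__GRPC_PORT="6334" qdrant/qdrant`
+ Test in a browser: http://127.0.0.1:6333

Notes:

+ General form for running a `Docker` container: `docker run -p host_port:container_port -e VARIABLE=value image`
+ 6333 is the `RESTful` HTTP interface
+ 6334 is the `gRPC` interface (binary)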

In [None]:
# Install / upgrade the Python libraries
!pip3 install --upgrade tiktoken openai langchain qdrant-client



## 2. Initializing Qdrant

#### 2.1. Importing the Python modules

In [12]:
# Qdrant Python client
from qdrant_client import QdrantClient
from qdrant_client.http import models as rest

# LangChain wrapper for Qdrant
from langchain.vectorstores import Qdrant

# LangChain wrapper for embeddings
from langchain.embeddings.openai import OpenAIEmbeddings

# LangChain text splitter and document loader
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader


#### 2.2. Creating the Qdrant Python client object

[Qdrant Python client documentation](https://github.com/qdrant/qdrant-client)


In [13]:
print("+++++++++++++++++++ Begin: create Qdrant client")

# Database: in-memory version
# qdrant_client = QdrantClient(location=":memory:")

# Database: on-disk version, sqlite
# path = "qdrant_data_1"
# qdrant_client = QdrantClient(path=path, prefer_grpc=True)

# Database: server version
qdrant_client = QdrantClient(host="localhost", port=6333, grpc_port=6334, prefer_grpc=True)

print("+++++++++++++++++++ End: create Qdrant client")

+++++++++++++++++++ Begin: create Qdrant client
+++++++++++++++++++ End: create Qdrant client


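#### 2.3. Creating a Qdrant `Collection`

A `Collection` is similar to a table in a traditional database.

All vectors within one `Collection` must have the same dimension; different `Collection`s may use different dimensions.

**Note:** If you get a 502/503 Bad Gateway error, check that the `Docker` container is running, and do **not** enable a system-wide proxy.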

In [15]:
print("+++++++++++++++++++ Begin: create Qdrant Collection")

collection_name = 'MyCollection'

# The OpenAI embedding vectors used here have 1536 dimensions
vector_size = 1536

# Metric for how close two vectors are: cosine similarity, dot product, or Euclidean distance
distance = rest.Distance.COSINE  # Note: cosine similarity is used here; the closer the score is to 1, the more similar

# Delete the old Collection (if any), then create a new Collection with the given parameters

qdrant_client.recreate_collection(
    collection_name=collection_name,

    vectors_config=rest.VectorParams(
        size=vector_size,   # dimension of the OpenAI embedding vectors
        distance=distance,
    ),
)

print("+++++++++++++++++++ End: create Qdrant Collection")

+++++++++++++++++++ Begin: create Qdrant Collection
+++++++++++++++++++ End: create Qdrant Collection
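
The comment in the cell above names three ways to compare vectors: cosine similarity, dot product, and Euclidean distance. As a rough illustration of how they differ, here is a pure-Python sketch on toy 2-dimensional vectors. The function names are illustrative and this is not how Qdrant computes them internally.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; grows with vector length as well as alignment."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the normalized vectors; 1.0 means same direction."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; 0.0 means identical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way, but twice as long
orthogonal = [0.0, 3.0]       # unrelated direction
```

With these toy vectors, cosine similarity treats `same_direction` as a perfect match despite its different length, whereas Euclidean distance still reports a gap of 1.0 between the two.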


#### 2.4. Creating the LangChain `Qdrant` wrapper

If there are multiple `Collection`s, create one LangChain wrapper per `Collection`.

In [16]:
print("+++++++++++++++++++ Begin: create Langchain Qdrant")

# Note: the embedding dimension must match vector_size above
embedding = OpenAIEmbeddings(client="davinci")

qdrant = Qdrant(
    client=qdrant_client,
    collection_name=collection_name,
    embeddings=embedding,
)

print("+++++++++++++++++++ End: create Langchain Qdrant")

+++++++++++++++++++ Begin: create Langchain Qdrant
+++++++++++++++++++ End: create Langchain Qdrant


#### 2.5. Splitting the text

Load the text and split it into `Document`s.

This example uses state_of_the_union.txt.

In [17]:
print("+++++++++++++++++++ Begin: split documents")

loader = TextLoader('./state_of_the_union.txt', encoding="utf-8")

documents = loader.load()

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=256, chunk_overlap=0)

docs = text_splitter.split_documents(documents)

print("+++++++++++++++++++ End: split documents")

Created a chunk of size 304, which is longer than the specified 256
Created a chunk of size 332, which is longer than the specified 256
Created a chunk of size 281, which is longer than the specified 256
Created a chunk of size 325, which is longer than the specified 256


+++++++++++++++++++ Begin: split documents
+++++++++++++++++++ End: split documents


In [18]:
print(f"len(docs) = {len(docs)}")

print(f"doc 0: text size = {len(docs[0].page_content)}, meta data = {docs[0].metadata}")

print(f"doc 0: text = {docs[0].page_content}")

len(docs) = 193
doc 0: text size = 239, meta data = {'source': './state_of_the_union.txt'}
doc 0: text = Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  
Last year COVID-19 kept us apart. This year we are finally together again.
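
The "Created a chunk of size ..." warnings in the splitting output appear because a single line of the input can be longer than `chunk_size`, and splitting only happens at the separator. The sketch below is a simplified, pure-Python approximation of that split-then-merge idea. It is not LangChain's actual implementation, it ignores `chunk_overlap` (which is 0 in the cell above), and the name `split_and_merge` is hypothetical.

```python
from typing import List


def split_and_merge(text: str, separator: str = "\n", chunk_size: int = 256) -> List[str]:
    """Greedy sketch: split on `separator`, then pack pieces into chunks."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for piece in pieces:
        # Cost of appending `piece`, counting the separator that would join it.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The running chunk is full: emit it and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        # A piece longer than chunk_size is kept whole, so it ends up oversized.
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "a" * 100 + "\n" + "b" * 100 + "\n" + "c" * 300 + "\n" + "d" * 50
chunk_lengths = [len(chunk) for chunk in split_and_merge(sample, "\n", 256)]
# chunk_lengths == [201, 300, 50]: the 300-character line exceeds 256 but is not cut
```

The first two short lines are merged into one chunk, while the 300-character line stays intact, which mirrors the situation the warnings describe.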


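#### 2.6. Uploading the text to `Qdrant`

Note: the LangChain `Qdrant` wrapper converts only the text into embedding vectors; the metadata is stored in the payload without being embedded.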

In [19]:
print("+++++++++++++++++++ Begin: upload text to Qdrant")

batch_size = 64
succ_ids = qdrant.add_documents(docs, batch_size=batch_size)

print(f"+++++++++++++++++++ End: Upload Document To Qdrant, succ_ids's len = {len(succ_ids)}")


+++++++++++++++++++ Begin: upload text to Qdrant
+++++++++++++++++++ End: Upload Document To Qdrant, succ_ids's len = 193
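
To make the effect of `batch_size` concrete, here is a small standard-library sketch of slicing a list into batches. It only illustrates the batching arithmetic; `iter_batches` is a hypothetical helper, not a LangChain or Qdrant function.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 193 items (the number of chunks uploaded above) with batch_size = 64
batch_sizes = [len(batch) for batch in iter_batches(list(range(193)), 64)]
# batch_sizes == [64, 64, 64, 1]
```

Under this scheme, 193 documents with a batch size of 64 are sent in four requests, the last containing a single document.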


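## 3. Search


#### 3.1. Similarity search

The simplest scenario for using the Qdrant vector store is performing a similarity search.

Behind the scenes, the query is encoded with the `embedding_function` and then used to find similar documents in the Qdrant collection.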

In [11]:
query = "What did the president say about Ketanji Brown Jackson"

found_docs = qdrant.similarity_search(query)

for i, doc in enumerate(found_docs):
    print(f"{i + 1}.", doc.page_content, "\n")

1. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

2. Vice President Harris and I ran for office with a new economic vision for America. 
Invest in America. Educate Americans. Grow the workforce. Build the economy from the bottom up  
and the middle out, not from the top down. 

3. As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 

4. Powered by people I’ve met like JoJo Burgess, from generations of union steelworkers from Pittsburgh, who’s here with us tonight. 
As Ohio Senator Sherrod Brown says, “It’s time to bury the label “Rust Belt.” 
It’s time. 



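#### 3.2. Similarity search with `Score`

Sometimes you want the relevance score as well, to judge how good a particular result is.

With the `COSINE` metric, the returned score is a cosine similarity, so a higher score is better; the results are listed in descending order of score.

In [12]:
query = "What did the president say about Ketanji Brown Jackson"

s_found_docs = qdrant.similarity_search_with_score(query)

In [13]:
for i, info in enumerate(s_found_docs):
    doc, score = info
    # For cosine similarity, a higher score is better
    print(f"{i + 1}. score = {score}, ", doc.page_content, "\n")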

1. score = 0.8249364495277405,  And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

2. score = 0.7833905220031738,  Vice President Harris and I ran for office with a new economic vision for America. 
Invest in America. Educate Americans. Grow the workforce. Build the economy from the bottom up  
and the middle out, not from the top down. 

3. score = 0.7775577306747437,  As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 

4. score = 0.7758106589317322,  Powered by people I’ve met like JoJo Burgess, from generations of union steelworkers from Pittsburgh, who’s here with us tonight. 
As Ohio Senator Sherrod Brown says, “It’s time to bury the label “Rust Belt.” 
It’s time. 
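
To illustrate what a scored similarity search does conceptually, the following pure-Python sketch scores every stored vector against the query and keeps the best `k`. It is a brute-force illustration on toy vectors; a real Qdrant server uses an approximate index rather than scanning everything, and the names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Higher cosine similarity means more similar, so sort in descending order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], vectors, k=2)
# hits[0] is vector 0 (score 1.0); hits[1] is vector 2 (score about 0.707)
```

The ordering matches the output above: the first hit has the highest score and later hits have progressively lower ones.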



## 4. Using Qdrant as a LangChain Retriever

The `Qdrant` wrapper can act as a LangChain Retriever; with the collection created above, retrieval uses cosine similarity.

In [16]:
retriever = qdrant.as_retriever()

retriever

VectorStoreRetriever(vectorstore=<langchain.vectorstores.qdrant.Qdrant object at 0x000001A40DB20F90>, search_type='similarity', search_kwargs={})

Running a query:

In [18]:
query = "What did the president say about Ketanji Brown Jackson"

retriever.get_relevant_documents(query)[0]

Document(page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': './state_of_the_union.txt'})
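
Conceptually, a retriever is a thin adapter: it stores a search function together with fixed keyword arguments and exposes a single query method. The sketch below shows that idea with the standard library only. `SimpleRetriever` and `keyword_search` are hypothetical stand-ins and do not reproduce LangChain's `VectorStoreRetriever`.

```python
from typing import Any, Callable, Dict, List, Optional


class SimpleRetriever:
    """Hypothetical adapter: wraps a search callable behind one query method."""

    def __init__(
        self,
        search: Callable[..., List[str]],
        search_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.search = search
        self.search_kwargs = dict(search_kwargs or {})

    def get_relevant_documents(self, query: str) -> List[str]:
        # Forward the query plus the stored keyword arguments to the search function.
        return self.search(query, **self.search_kwargs)


def keyword_search(query: str, k: int = 4) -> List[str]:
    """Toy stand-in for a vector search: rank texts by shared words."""
    corpus = [
        "the president nominated a judge",
        "invest in america",
        "the economy is growing",
    ]
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda text: len(words & set(text.lower().split())), reverse=True)
    return ranked[:k]


retriever_sketch = SimpleRetriever(keyword_search, {"k": 1})
best = retriever_sketch.get_relevant_documents("president judge")
```

The `search_kwargs={}` shown in the retriever's repr above plays the same role as the stored keyword arguments in this sketch.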

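## 5. [Filters](https://qdrant.tech/documentation/concepts/filtering/)

Qdrant has an extensive filtering system with rich type support. Filters can also be used from LangChain by passing an additional parameter to the `similarity_search_with_score` and `similarity_search` methods.

With the LangChain Qdrant wrapper:

+ **Note:** to filter on metadata, use key = "metadata.<key to query>"
+ **Note:** to filter on the content, use key = "page_content" with a FullText (full-text) string match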
In [None]:
query = "What did the president say about Ketanji Brown Jackson"

filter_docs = qdrant.similarity_search_with_score(
    query, 
    filter=rest.Filter(
        must=[
            rest.FieldCondition(
                key="metadata.source",
                match=rest.MatchValue(value="./state_of_the_union.txt"),
            ),
        ]
    )
)

print(f"====================== filter_docs: {filter_docs}")
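
The dotted key `metadata.source` addresses a nested field inside each point's payload. As a rough mental model, the pure-Python sketch below resolves such a key against a nested dictionary and applies an exact-match `must` condition. It only imitates the semantics on local data and does not call Qdrant; `get_by_path` and `matches_all` are hypothetical helpers.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel for "key not present in the payload"


def get_by_path(payload: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a key such as 'metadata.source' through nested dictionaries."""
    current: Any = payload
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def matches_all(payload: Dict[str, Any], conditions: Dict[str, Any]) -> bool:
    """True when every dotted key equals its expected value (like a `must` list)."""
    return all(get_by_path(payload, key) == value for key, value in conditions.items())


payloads = [
    {"page_content": "A", "metadata": {"source": "./state_of_the_union.txt"}},
    {"page_content": "B", "metadata": {"source": "./other.txt"}},
    {"page_content": "C"},  # no metadata at all
]
kept = [
    p["page_content"]
    for p in payloads
    if matches_all(p, {"metadata.source": "./state_of_the_union.txt"})
]
# kept == ["A"]
```

Payloads that lack the key entirely simply fail the condition, which is why the third entry is excluded.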

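## 6. The LangChain Qdrant wrapper

#### 6.1. Payload structure

The keys below are the default values of the LangChain Qdrant wrapper's parameters; they can be changed through the constructor.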

``` python
payload = {
    "page_content": "", # text content, a string
    "metadata": { # document metadata; contains source by default
        "source": "./aaaa.txt", 
    }
}
```
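
Reading a payload of this shape back into a document object is a matter of picking out the two keys. The sketch below uses a small dataclass in place of LangChain's `Document`; the `Doc` class, the `document_from_payload` function, and its `content_key` / `metadata_key` parameter names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Doc:
    """Minimal stand-in for a document: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def document_from_payload(
    payload: Dict[str, Any],
    content_key: str = "page_content",
    metadata_key: str = "metadata",
) -> Doc:
    """Build a Doc from a payload dict, tolerating missing keys."""
    return Doc(
        page_content=payload.get(content_key, ""),
        metadata=payload.get(metadata_key) or {},
    )


example = document_from_payload(
    {"page_content": "hello", "metadata": {"source": "./aaaa.txt"}}
)
```

Passing different key names covers the case where the payload was written with non-default keys.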

#### 6.2. Constructor

``` python
qdrant = Qdrant(
    client,
    collection_name,
    embeddings: Optional[Embeddings] = None,
)
```

#### 6.3. Adding texts

``` python
qdrant.add_texts(
    texts: Iterable[str],
    metadatas: Optional[List[dict]] = None,
    ids: Optional[Sequence[str]] = None,
    batch_size: int = 64,
    **kwargs: Any,
)

payloads = self._build_payloads(
    batch_texts,
    batch_metadatas,
)

self.client.upsert(
    collection_name=self.collection_name,
    points=rest.Batch.construct(
        ids=batch_ids,
        vectors=self._embed_texts(batch_texts),
        payloads=payloads,
    ),
)
```

#### 6.4. Building the payload

``` python
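from typing import Iterable, List, Optional


def _build_payloads(
    texts: Iterable[str],
    metadatas: Optional[List[dict]],
) -> List[dict]:
    """Simplified version of Qdrant._build_payloads."""
    payloads = []

    for i, text in enumerate(texts):
        payloads.append({
            "page_content": text,
            # Use the matching metadata entry, or None when no metadata was given
            "metadata": metadatas[i] if metadatas is not None else None,
        })

    return payloads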
```