# Building a Semantic Search Engine with LangChain
## Documents and Document Loaders
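LangChain implements a Document abstraction that represents a unit of text together with its associated metadata. It has three attributes:
1. page_content: a string holding the text content of the document
2. metadata: a dict holding the document's metadata
3. id: a string identifier for the document (optional)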

In [None]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

LangChain provides document loaders for loading documents from a variety of sources.

For example, a PDF can be loaded into a sequence of Document objects.

In [4]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./example_data/概率分水岭.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

19


When PyPDFLoader loads a PDF, it stores each page in its own Document object.

In [None]:
print(f"{docs[0].page_content[:200]}\n")  # print the first 200 characters of the first page
print(docs[0].metadata)

Probabilistic Watershed:
Sampling all spanning forests
for seeded segmentation and semi-supervised learning
Enrique Fita Sanmartín, Sebastian Damrich, Fred A. Hamprecht
HCI/IWR at Heidelberg Universit

{'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-11-11T02:03:16+00:00', 'author': '', 'keywords': '', 'moddate': '2019-11-11T02:03:16+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': './example_data/概率分水岭.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}


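One Document per page is clearly too coarse for retrieval, so the text has to be split further to suit the task. LangChain provides TextSplitter classes for this purpose.

Setting add_start_index=True records the starting character index of each chunk in its metadata (under the start_index key).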

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Split the documents into chunks of up to 1000 characters with a 200-character overlap to preserve context across chunk boundaries
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))

79
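
To make the roles of chunk_size, chunk_overlap, and the start index concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a deliberate simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this sketch does not attempt. The function name `split_with_overlap` is hypothetical and not part of LangChain.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Return (start_index, chunk) pairs of fixed-width, overlapping chunks.

    Simplified illustration only; it ignores separators entirely.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each chunk begins with the last four characters of the previous one, which is how the overlap carries context across a chunk boundary.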


## 嵌入
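Vector search is a common way to store and search unstructured data (such as unstructured text). The idea is to store a numeric vector associated with each piece of text. Given a query, we embed it as a vector of the same dimension and use a vector similarity measure (such as cosine similarity) to identify relevant text.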
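
As a rough illustration of the similarity step, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The three-dimensional "embeddings" and their labels are invented for the example; real embedding vectors have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
corpus = {
    "dogs": [0.9, 0.1, 0.0],
    "cats": [0.8, 0.3, 0.1],
    "stocks": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

# Sort labels from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda name: cosine_similarity(query, corpus[name]),
    reverse=True,
)
print(ranked)
```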

LangChain supports embedding models from many providers; these notes only cover Alibaba Cloud's embedding service (DashScope).


In [6]:
from langchain_community.embeddings import DashScopeEmbeddings
import os
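# To set the environment variable inside a conda environment, run:
#   conda env config vars set DASHSCOPE_API_KEY="your API key"
# and then reactivate the environment so the change takes effect.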
my_api_key = os.environ.get("DASHSCOPE_API_KEY")
embeddings = DashScopeEmbeddings(
    model="text-embedding-v1", dashscope_api_key=my_api_key
)
text = "This is a test document."
query_result = embeddings.embed_query(text)
print(query_result)
doc_results = embeddings.embed_documents(["foo"])
print(doc_results)

[0.3107011616230011, 0.9425862431526184, -0.9483913779258728, 2.639361619949341, -2.1982150077819824, -0.12639787793159485, 1.0644768476486206, -1.463064432144165, 0.4155646562576294, 0.9022487998008728, -0.9056667685508728, 0.2552490234375, 1.5458712577819824, -0.2592027485370636, 1.7698636054992676, 0.1977505087852478, -0.6691555380821228, -3.3732097148895264, 0.8468695878982544, 0.5063815712928772, 1.02392578125, 1.0923393964767456, 0.9745347499847412, 0.0087415911257267, -0.3915913999080658, -0.1623435616493225, -0.9985283613204956, -0.5642293095588684, -1.1784464120864868, 0.10965368151664734, -0.8466660976409912, -0.1585591584444046, -0.2599538266658783, 2.209174156188965, 2.269822835922241, -0.2519938051700592, 0.0421498604118824, -0.6886189579963684, 1.639203429222107, 0.1796434223651886, 2.9693570137023926, 1.4052726030349731, 4.72607421875, 0.1977708637714386, 0.1952701210975647, -0.9378458857536316, 0.7993571162223816, 0.16235817968845367, -2.1394450664520264, -1.26909720897

## 向量存储
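Once the text has been embedded, the natural next step is to store the embedding vectors, so that the same text is not embedded again at extra cost in time and money.

Let's start with a vector store backed by Chroma.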

In [7]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

Once the vector store is instantiated, we can index the documents.

In [8]:
ids = vector_store.add_documents(documents=all_splits)

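After indexing comes querying. VectorStore includes several kinds of query methods:
- Synchronous or asynchronous
- By string query or by vector
- With or without similarity scores
- By similarity or by maximal marginal relevance (MMR), which balances similarity to the query against diversity among the retrieved results

These methods generally return a list of Document objects.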
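
The maximal marginal relevance idea can be sketched in a few lines of plain Python. This is an illustrative greedy selection under assumed toy vectors, not LangChain's implementation (LangChain exposes the technique as `max_marginal_relevance_search`). The function name `mmr_select` and the data are hypothetical; `lambda_mult` is meant in the usual sense, where 1.0 means pure relevance and smaller values favor diversity.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_index, best_score = None, None
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if best_score is None or score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        candidates.remove(best_index)
    return selected


# Hypothetical 2-D vectors: docs 0 and 1 are near-duplicates, doc 2 is different.
doc_vecs = [[1.0, 0.1], [1.0, 0.05], [0.5, 1.0]]
query_vec = [1.0, 0.2]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # relevance and diversity
```

With `lambda_mult=1.0` the two near-duplicates are returned; with `lambda_mult=0.5` the second slot goes to the less similar but more distinct document.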

The simplest example is a similarity search with a string query:

In [10]:
results = vector_store.similarity_search(
    "What is Probabilistic Watershed?"
)

print(results[0])

page_content='(a) Image with seeds
 (b) Watershed
 (c) Probabilistic Watershed
 (d) Uncertainty
Figure 2: The Probabilistic Watershed proﬁts from using all spanning forests instead of only the
minimum cost one. (2a) Crop of a CREMI image [19] with marked seeds. (2b) and (2c) show results
of Watershed and multiple seed Probabilistic Watershed (end of section 3) applied to edge-weights
from [12]. (2d) shows the entropy of the label probabilities of the Probabilistic Watershed (white high,
black low). The Watershed errs in an area where the Probabilistic Watershed expresses uncertainty
but is correct.
[18]. The Random Walker [28, 52, 53, 5] calculates the probability that a random walker starting at a
query node reaches a certain seed before the other ones. Both algorithms are related in [16] by a limit
consideration termed Power Watershed algorithm. In this work, we establish a different link between
the Watershed and the Random Walker. The Watershed’s and Random Walker’s recent combinat

Asynchronous query (top-level await works inside a notebook cell):

In [13]:
results = await vector_store.asimilarity_search(
    "What is Probabilistic Watershed?"
)

print(results[0])

page_content='(a) Image with seeds
 (b) Watershed
 (c) Probabilistic Watershed
 (d) Uncertainty
Figure 2: The Probabilistic Watershed proﬁts from using all spanning forests instead of only the
minimum cost one. (2a) Crop of a CREMI image [19] with marked seeds. (2b) and (2c) show results
of Watershed and multiple seed Probabilistic Watershed (end of section 3) applied to edge-weights
from [12]. (2d) shows the entropy of the label probabilities of the Probabilistic Watershed (white high,
black low). The Watershed errs in an area where the Probabilistic Watershed expresses uncertainty
but is correct.
[18]. The Random Walker [28, 52, 53, 5] calculates the probability that a random walker starting at a
query node reaches a certain seed before the other ones. Both algorithms are related in [16] by a limit
consideration termed Power Watershed algorithm. In this work, we establish a different link between
the Watershed and the Random Walker. The Watershed’s and Random Walker’s recent combinat

Returning results together with their scores:

In [14]:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What is Probabilistic Watershed?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 4363.193359375

page_content='(a) Image with seeds
 (b) Watershed
 (c) Probabilistic Watershed
 (d) Uncertainty
Figure 2: The Probabilistic Watershed proﬁts from using all spanning forests instead of only the
minimum cost one. (2a) Crop of a CREMI image [19] with marked seeds. (2b) and (2c) show results
of Watershed and multiple seed Probabilistic Watershed (end of section 3) applied to edge-weights
from [12]. (2d) shows the entropy of the label probabilities of the Probabilistic Watershed (white high,
black low). The Watershed errs in an area where the Probabilistic Watershed expresses uncertainty
but is correct.
[18]. The Random Walker [28, 52, 53, 5] calculates the probability that a random walker starting at a
query node reaches a certain seed before the other ones. Both algorithms are related in [16] by a limit
consideration termed Power Watershed algorithm. In this work, we establish a different link between
the Watershed and the Random Walker. The Watershed’s and Random W
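
A score in the thousands can look odd next to cosine similarities in [-1, 1]. The sketch below assumes the score is a squared Euclidean distance, which to my understanding is Chroma's default metric, although that is an assumption rather than something the output above confirms. Under a distance metric, a smaller score means a closer match, and unnormalized embedding vectors can produce large values. The vectors are toy data.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query_vec = [2.0, 1.0]
candidates = {
    "near": [2.5, 1.0],
    "far": [-3.0, 4.0],
}

# Sort ascending: with a distance metric the best match has the lowest score.
scored = sorted(
    ((name, squared_l2(query_vec, vec)) for name, vec in candidates.items()),
    key=lambda pair: pair[1],
)
for name, score in scored:
    print(name, score)
```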

Querying by embedding vector:

In [15]:
embedding = embeddings.embed_query("What is Probabilistic Watershed?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='(a) Image with seeds
 (b) Watershed
 (c) Probabilistic Watershed
 (d) Uncertainty
Figure 2: The Probabilistic Watershed proﬁts from using all spanning forests instead of only the
minimum cost one. (2a) Crop of a CREMI image [19] with marked seeds. (2b) and (2c) show results
of Watershed and multiple seed Probabilistic Watershed (end of section 3) applied to edge-weights
from [12]. (2d) shows the entropy of the label probabilities of the Probabilistic Watershed (white high,
black low). The Watershed errs in an area where the Probabilistic Watershed expresses uncertainty
but is correct.
[18]. The Random Walker [28, 52, 53, 5] calculates the probability that a random walker starting at a
query node reaches a certain seed before the other ones. Both algorithms are related in [16] by a limit
consideration termed Power Watershed algorithm. In this work, we establish a different link between
the Watershed and the Random Walker. The Watershed’s and Random Walker’s recent combinat

## Retrievers
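VectorStore objects do not subclass Runnable, whereas LangChain Retrievers do, so retrievers implement the standard synchronous and asynchronous methods (such as invoke and batch). Although retrievers can be constructed from vector stores, they can also interface with non-vector-store data sources, such as external APIs.

We can create a simple version of this ourselves without subclassing Retriever. Once we choose which method to use for retrieving documents, it is easy to wrap it as a runnable. Below we build one around the similarity_search method: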

In [16]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "What is Probabilistic Watershed?",
        "What is the difference between Probabilistic Watershed and Watershed?",
    ],
)

[[Document(id='f7914e7c-b10f-4d17-ad9d-539532bd1807', metadata={'total_pages': 19, 'moddate': '2019-11-11T02:03:16+00:00', 'author': '', 'page': 2, 'subject': '', 'source': './example_data/概率分水岭.pdf', 'creationdate': '2019-11-11T02:03:16+00:00', 'creator': 'LaTeX with hyperref package', 'trapped': '/False', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'title': '', 'start_index': 0, 'producer': 'pdfTeX-1.40.17', 'page_label': '3', 'keywords': ''}, page_content='(a) Image with seeds\n (b) Watershed\n (c) Probabilistic Watershed\n (d) Uncertainty\nFigure 2: The Probabilistic Watershed proﬁts from using all spanning forests instead of only the\nminimum cost one. (2a) Crop of a CREMI image [19] with marked seeds. (2b) and (2c) show results\nof Watershed and multiple seed Probabilistic Watershed (end of section 3) applied to edge-weights\nfrom [12]. (2d) shows the entropy of the label probabilities of the Probabilistic Watershed 

Vector stores implement an as_retriever() method that returns a Retriever, specifically a VectorStoreRetriever. These retrievers carry search_type and search_kwargs attributes that identify which method of the underlying vector store to call and how to parameterize it.

In [17]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "What is Probabilistic Watershed?",
        "What is the difference between Probabilistic Watershed and Watershed?",
    ],
)

[[Document(id='f7914e7c-b10f-4d17-ad9d-539532bd1807', metadata={'moddate': '2019-11-11T02:03:16+00:00', 'producer': 'pdfTeX-1.40.17', 'start_index': 0, 'trapped': '/False', 'keywords': '', 'creationdate': '2019-11-11T02:03:16+00:00', 'title': '', 'subject': '', 'total_pages': 19, 'creator': 'LaTeX with hyperref package', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'source': './example_data/概率分水岭.pdf', 'page': 2, 'page_label': '3', 'author': ''}, page_content='(a) Image with seeds\n (b) Watershed\n (c) Probabilistic Watershed\n (d) Uncertainty\nFigure 2: The Probabilistic Watershed proﬁts from using all spanning forests instead of only the\nminimum cost one. (2a) Crop of a CREMI image [19] with marked seeds. (2b) and (2c) show results\nof Watershed and multiple seed Probabilistic Watershed (end of section 3) applied to edge-weights\nfrom [12]. (2d) shows the entropy of the label probabilities of the Probabilistic Watershed 

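VectorStoreRetriever supports the search types "similarity" (the default), "mmr" (maximal marginal relevance, described above), and "similarity_score_threshold". The last one filters the documents returned by the retriever against a similarity-score threshold.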
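
Conceptually, thresholding is a filter over scored results. The sketch below shows that filter in plain Python, assuming relevance scores normalized to [0, 1] where higher means more similar. The function name `filter_by_threshold` and the scored chunks are made up for illustration. In LangChain itself the threshold is supplied through `search_kwargs`, for example `search_kwargs={"score_threshold": 0.5}`.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (doc, score) pairs whose score meets the threshold, best first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical retrieval results: (chunk description, relevance score in [0, 1]).
scored_docs = [
    ("chunk about Probabilistic Watershed", 0.82),
    ("chunk from the reference list", 0.31),
    ("chunk about the Random Walker", 0.67),
]

kept = filter_by_threshold(scored_docs, score_threshold=0.5)
for doc, score in kept:
    print(f"{score:.2f}  {doc}")
```

Unlike a fixed k, a threshold can return any number of documents, including none when nothing is relevant enough.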

Retrievers also support more complex applications, such as retrieval-augmented generation (RAG).