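## **Multi-Vector Retriever**
It is often useful to store multiple vectors per document.

LangChain provides a base MultiVectorRetriever that makes querying this kind of setup straightforward.

Most of the complexity lies in how the multiple vectors for each document are created. This notebook covers some common ways to create those vectors and to use MultiVectorRetriever.

Ways to create multiple vectors per document include:
- Smaller chunks: split the document into smaller chunks and embed those chunks (this is what ParentDocumentRetriever does).
- Summaries: create a summary of each document and embed it along with the document (or instead of it).
- Hypothetical questions: create hypothetical questions that each document would be well suited to answer, and embed those questions along with the document (or instead of it).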
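
What these approaches share is that every embedded item carries the ID of its parent document, and a search hit is resolved back to that parent. The sketch below illustrates only this lookup step in plain Python, with no LangChain dependency. The `resolve_parents` helper, the `hits` list, and the in-memory `parent_store` dict are hypothetical stand-ins for what a vector store search and a document store would supply.

```python
from typing import Dict, List


def resolve_parents(hits: List[dict], parent_store: Dict[str, str],
                    id_key: str = "doc_id") -> List[str]:
    """Map ranked search hits to their parent documents, without duplicates."""
    seen_ids: List[str] = []
    for hit in hits:
        parent_id = hit.get(id_key)
        # Skip hits that carry no parent ID; keep the first-seen order
        if parent_id is not None and parent_id not in seen_ids:
            seen_ids.append(parent_id)
    # Drop IDs that are missing from the store
    return [parent_store[i] for i in seen_ids if i in parent_store]


# Hypothetical data: three hits of different kinds, two pointing to the same parent
parent_store = {
    "full_doc0": "Full text of document 0",
    "full_doc1": "Full text of document 1",
}
hits = [
    {"kind": "chunk", "doc_id": "full_doc1"},
    {"kind": "summary", "doc_id": "full_doc0"},
    {"kind": "question", "doc_id": "full_doc1"},
]
parents = resolve_parents(hits, parent_store)
print(parents)
```

Because the chunk and the question both point to `full_doc1`, that parent is returned once, in the position of its best-ranked hit.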

**Load documents from URLs**

In [1]:
from langchain_community.document_loaders import UnstructuredURLLoader

# Add custom request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
}

# Load each URL with UnstructuredURLLoader
loaders = [UnstructuredURLLoader(
    urls=["https://www.gutenberg.org/cache/epub/75833/pg75833-images.html"],
    headers=headers
),UnstructuredURLLoader(
    urls=["https://www.gutenberg.org/cache/epub/75832/pg75832-images.html"],
    headers=headers
)]
# Load the documents (one Document per loader)
docs = [loader.load()[0] for loader in loaders]

print(f"Number of documents loaded: {len(docs)}")

Number of documents loaded: 2


**Load the text embedding model and create the vector store**

In [2]:
import os
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_core.documents import Document

# Initialize the text embedding model
embeddings = DashScopeEmbeddings(
    model="text-embedding-v2",
    dashscope_api_key=os.getenv("API_KEY")
)

# Initialize the vector store (seeded with one empty placeholder document so the FAISS index exists)
vectorstore = FAISS.from_documents(
    documents=[Document(page_content="",metadata = {})],
    embedding=embeddings
)

**Store the documents**

Step 1: Initialize the persistent store and the multi-vector retriever

In [3]:
from langchain.storage import LocalFileStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Resolve the absolute path of the storage directory
store_path = os.path.abspath("./langchain_store")

print(f"Persistent store path: {store_path}")

# Initialize the LocalFileStore
docstore = LocalFileStore(store_path)

# Initialize the retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=docstore,
    id_key="doc_id",
)

Persistent store path: d:\AI_Project\LangChainTutorials\LangChain\langchain_store


Step 2: Give each parent document an ID

In [12]:
# Index-based IDs, one per parent document
doc_ids = [f"full_doc{i}" for i in range(len(docs))]

print(f"Document IDs: {doc_ids}")


Document IDs: ['full_doc0', 'full_doc1']
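
The index-based IDs above depend on the order in which the documents were loaded. Since the store is persisted to disk, IDs that stay the same across runs may be preferable. One option, sketched here as an assumption rather than something this notebook does, is to derive each ID from the document's source URL with `uuid.uuid5`. The `stable_doc_id` helper is hypothetical.

```python
import uuid


def stable_doc_id(source_url: str) -> str:
    """Derive a deterministic ID from a source URL (same URL gives the same ID)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, source_url))


sources = [
    "https://www.gutenberg.org/cache/epub/75833/pg75833-images.html",
    "https://www.gutenberg.org/cache/epub/75832/pg75832-images.html",
]
stable_ids = [stable_doc_id(url) for url in sources]
print(stable_ids)
```

Random IDs from `uuid.uuid4()` would also avoid collisions, but they change on every run, so re-running the notebook would store the same parent under a new key.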


Step 3: Split each parent document into smaller child chunks and store them in the vector store

In [13]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
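    # Split on blank lines ("\n\n")
    separator="\n\n",
    # Target chunk size of 100 characters; the splitter only cuts at the
    # separator, so longer paragraphs are kept whole (hence the warnings below)
    chunk_size=100,
    # Overlap of up to 10 characters between adjacent chunks
    chunk_overlap=10,
    # Measure length in characters
    length_function=len,
    # Treat the separator as a literal string, not a regular expression
    is_separator_regex=False
)

# Split the documents into smaller chunks
docs_chunks = text_splitter.split_documents(docs)

# Store the chunks in the vector store
retriever.vectorstore.add_documents(docs_chunks)

# Store the parent documents in the persistent store, keyed by document ID
retriever.docstore.mset(list(zip(doc_ids, docs)))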

# Search the small chunks directly (the Chinese query means "What does the first lesson cover?")
query_results = retriever.vectorstore.similarity_search("第一课讲了什么")

print(f"Query result: {query_results[0]}")



Created a chunk of size 442, which is longer than the specified 100
Created a chunk of size 225, which is longer than the specified 100
Created a chunk of size 137, which is longer than the specified 100
Created a chunk of size 2159, which is longer than the specified 100
Created a chunk of size 378, which is longer than the specified 100
Created a chunk of size 563, which is longer than the specified 100
Created a chunk of size 592, which is longer than the specified 100
Created a chunk of size 733, which is longer than the specified 100
Created a chunk of size 876, which is longer than the specified 100
Created a chunk of size 1023, which is longer than the specified 100
Created a chunk of size 531, which is longer than the specified 100
Created a chunk of size 707, which is longer than the specified 100
Created a chunk of size 594, which is longer than the specified 100
Created a chunk of size 213, which is longer than the specified 100
Created a chunk of size 581, which is longer t

Query result: page_content='FIRST LESSONS

IN THE

PRINCIPLES OF COOKING

PART I. INTRODUCTORY.' metadata={'source': 'https://www.gutenberg.org/cache/epub/75832/pg75832-images.html'}
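
Note that the chunk returned above carries only a `source` entry in its metadata. For the retriever itself (rather than a direct vector store search) to return parent documents, each chunk would also need its parent's ID under the key configured as `id_key` (`"doc_id"` in this notebook). The plain-Python sketch below shows one way to attach it, by splitting each parent separately. The `Doc` dataclass and the `split_paragraphs` and `tag_chunks` functions are simplified, hypothetical stand-ins for LangChain's `Document` and text splitter, used only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_paragraphs(doc: Doc) -> List[Doc]:
    """Split on blank lines, copying the parent's metadata to each chunk."""
    parts = [p.strip() for p in doc.page_content.split("\n\n") if p.strip()]
    return [Doc(page_content=p, metadata=dict(doc.metadata)) for p in parts]


def tag_chunks(parents: List[Doc], parent_ids: List[str],
               id_key: str = "doc_id") -> List[Doc]:
    """Split each parent on its own and record the parent's ID on every chunk."""
    tagged: List[Doc] = []
    for parent_id, parent in zip(parent_ids, parents):
        for chunk in split_paragraphs(parent):
            chunk.metadata[id_key] = parent_id
            tagged.append(chunk)
    return tagged


# Hypothetical parents: the first has two paragraphs, the second has one
parents = [
    Doc("Title A\n\nFirst paragraph.", {"source": "a.html"}),
    Doc("Title B", {"source": "b.html"}),
]
chunks = tag_chunks(parents, ["full_doc0", "full_doc1"])
for chunk in chunks:
    print(chunk.metadata["doc_id"], "|", chunk.page_content)
```

With LangChain, the same loop would presumably call `text_splitter.split_documents([doc])` once per parent and set the metadata key before `add_documents`.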


Step 4: Summarize each parent document and store the summaries in the vector store

In [16]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize the model
llm = ChatOpenAI(
    model="qwen-turbo",
    api_key=os.getenv("API_KEY"),
    base_url=os.getenv("API_BASEURL"),
)

# Initialize the prompt template (the Chinese prompt means "Please summarize the following document:")
prompt = ChatPromptTemplate.from_template(
    "请总结以下文档：\n\n{doc}"
)

# Initialize the output parser
parser = StrOutputParser()

# Initialize the chain
chain = (
    {"doc": lambda x: x.page_content}
    | prompt
    | llm
    | parser
)

# Summarize the documents (up to 5 concurrent requests)
summary = chain.batch(docs, {"max_concurrency": 5})

print(f"Summary: {summary[0]}")

summary_docs = [Document(page_content=summary[i], metadata={"source": docs[i].metadata["source"]}) for i in range(len(docs))]

# Store the summaries in the vector store
retriever.vectorstore.add_documents(summary_docs)

summary_ids = [f"summary_doc{i}" for i in range(len(docs))]

# Store the summaries in the persistent store, keyed by summary ID
retriever.docstore.mset(list(zip(summary_ids, summary_docs)))

# Search the summaries directly
query_results = retriever.vectorstore.similarity_search("Summary of Fairy Tales from South Africa")

print(f"Query result: {query_results[0]}")

Summary: page_content='**Summary of "Fairy Tales from South Africa"**

"Fairy Tales from South Africa" is a collection of traditional tales compiled from original native sources by Sarah F. Bourhill and Beatrice L. Drake. The book contains a series of narratives, each rooted in the culture and traditions of the Kafir people, featuring themes of magic, transformation, and moral lessons. Below is a summary of the key stories and elements:

1. **Introduction**: The book highlights that these tales are passed down orally and are as timeless as classic Western fairy tales. The Kafir people are cautious about sharing them, especially with outsiders, believing that disrespect or laughter could lead to misfortune.

2. **Setuli; or, the King of the Birds**: A deaf and dumb man gains the ability to speak and understand after meeting a Fairy. Using magical birds, Setuli gathers an army and conquers a kingdom, eventually marrying a beautiful princess.

3. **The Story of the King's Son and the Magic S

Step 5: Generate a list of hypothetical questions for specific documents and store the questions and answers in the vector store

In [37]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Initialize the model
llm = ChatOpenAI(
    model="qwen-turbo",
    api_key=os.getenv("API_KEY"),
    base_url=os.getenv("API_BASEURL"),
)

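# Prompt template (in Chinese): asks for 3 question-and-answer pairs about the document, returned as a JSON list in the structure shown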
template = """
    '''
    {doc}
    '''
    
    根据上面的文档,生成3个相关问题和回答。响应以json列表的结构返回,返回的结构参考如下

    '''
    [
        {{"question":"问题1","answer":"回答1"}},
        {{"question":"问题2","answer":"回答2"}},
        {{"question":"问题3","answer":"回答3"}}
    ]
    '''
"""
# Initialize the prompt template
prompt = ChatPromptTemplate.from_template(template)

# Initialize the output parser
parser = JsonOutputParser()

# Initialize the chain
chain = (
    {"doc": lambda x: x.page_content}
    | prompt
    | llm
    | parser
)

# Generate questions for the first two chunks
questions = chain.batch(docs_chunks[0:2], {"max_concurrency": 5})

print(f"Questions: {questions[0]}")

# Build one document per chunk from its first generated question and answer
questions_docs = [Document(page_content=f"问题：{questions[i][0]['question']}\n答案：{questions[i][0]['answer']}",metadata={"source":docs_chunks[i].metadata["source"]}) for i in range(len(docs_chunks[0:2]))]

# Store the question documents in the vector store
retriever.vectorstore.add_documents(questions_docs)

# Search the question documents directly (the Chinese query means "What is the title of this book?")
query_results = retriever.vectorstore.similarity_search("这本书的标题是什么？")

print(f"Query result: {query_results[0]}")

Questions: [{'question': '这本书的标题是什么？', 'answer': '这本书的标题是《Fairy tales from South Africa》。'}, {'question': '这本书的来源是什么？', 'answer': '这本书来源于Project Gutenberg，是一个电子书资源库。'}, {'question': '这本书的主题是什么？', 'answer': '这本书的主题是南非的童话故事。'}]
Vector database: 16
Query result: page_content='问题：这本书的标题是什么？
答案：这本书的标题是《Fairy tales from South Africa》。' metadata={'source': 'https://www.gutenberg.org/cache/epub/75833/pg75833-images.html'}
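
The cell above indexes only the first of the three generated pairs for each chunk (`questions[i][0]`). If every pair should be searchable, the parsed output can be flattened into one entry per pair. The sketch below does this in plain Python on hypothetical data shaped like the parser output. The `flatten_qa` helper is an assumption, and it skips malformed items because model output is not guaranteed to match the requested structure.

```python
from typing import Dict, List


def flatten_qa(batches: List[list], sources: List[str]) -> List[Dict[str, str]]:
    """Turn per-chunk lists of Q&A dicts into one flat entry per valid pair."""
    entries: List[Dict[str, str]] = []
    for pairs, source in zip(batches, sources):
        for pair in pairs:
            # Skip items that lack either field instead of raising KeyError
            if not isinstance(pair, dict) or "question" not in pair or "answer" not in pair:
                continue
            entries.append({
                "page_content": f"Question: {pair['question']}\nAnswer: {pair['answer']}",
                "source": source,
            })
    return entries


# Hypothetical parser output for two chunks; the last item is malformed
batches = [
    [{"question": "Q1", "answer": "A1"}, {"question": "Q2", "answer": "A2"}],
    [{"question": "Q3", "answer": "A3"}, {"question": "broken"}],
]
entries = flatten_qa(batches, ["chunk0.html", "chunk1.html"])
print(len(entries))
```

Each resulting entry could then be wrapped in a `Document` and added to the vector store in the same way as `questions_docs`.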


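## **A Practical Use Case for the Multi-Vector Retriever** 🌟
Medical literature retrieval system

Vector types:
1. Paragraph vectors (500-character chunks)
2. Medical entity vectors (extracted entities such as diseases and drugs)
3. Question vectors (automatically generated questions such as "What are the side effects of this drug?")

Retrieval strategy:
- Weights: entity vectors 0.5 + paragraph vectors 0.3 + question vectors 0.2
- Result: a 37% improvement in recall compared with single-vector retrieval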
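
One way to read the weighting above is as a weighted sum of per-type similarity scores for each document. The following is a minimal sketch under that reading, assuming each vector type yields a score between 0 and 1 per document and that a missing score counts as 0. The `fuse_scores` helper and the sample scores are hypothetical and are not taken from any real system.

```python
from typing import Dict

# Weights from the retrieval strategy described above
WEIGHTS = {"entity": 0.5, "paragraph": 0.3, "question": 0.2}


def fuse_scores(scores_by_type: Dict[str, Dict[str, float]],
                weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Combine per-type scores into one weighted score per document."""
    fused: Dict[str, float] = {}
    for vec_type, weight in weights.items():
        for doc_id, score in scores_by_type.get(vec_type, {}).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    return fused


# Hypothetical similarity scores per vector type
scores_by_type = {
    "entity": {"doc_a": 0.9, "doc_b": 0.2},
    "paragraph": {"doc_a": 0.4, "doc_b": 0.8},
    "question": {"doc_b": 0.5},
}
fused = fuse_scores(scores_by_type)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

In this sample, `doc_a` ranks first because its strong entity match outweighs `doc_b`'s better paragraph and question scores.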