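# Introduction
Ensemble Retriever is a retrieval tool that combines multiple retrievers. It collects the results of their get_relevant_documents() methods and re-ranks them with the Reciprocal Rank Fusion algorithm. By drawing on the strengths of different algorithms, the Ensemble Retriever can achieve better performance than any single algorithm.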
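
To make the re-ranking step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. It is not the library's implementation: the function name `weighted_rrf`, the document IDs, and the smoothing constant `c=60` (a value commonly used for RRF) are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    Hypothetical helper for illustration only. Each document receives
    weight / (rank + c) from every list it appears in, with ranks
    starting at 1, and the contributions are summed.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")

    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)

    # sorted() is stable, so documents with equal scores keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Made-up rankings from two imaginary retrievers.
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([sparse_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear near the top of several lists accumulate the highest scores, which is why `doc_b` (ranked 2nd and 1st) ends up ahead of `doc_a` (ranked 1st and 3rd).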

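# Combining algorithms
The most common pattern is to combine a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), because their strengths are complementary. This is also known as "hybrid search". Sparse retrievers are good at finding relevant documents based on keywords, while dense retrievers are good at finding them based on semantic similarity.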
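
The difference between the two signals can be sketched with a toy example. This is only an illustration under strong assumptions: `keyword_score` is a crude stand-in for a sparse scorer (real BM25 also weights term frequency and document length), and the two-dimensional "embeddings" are invented by hand rather than produced by any model.

```python
import math


def keyword_score(query, text):
    """Sparse-style signal: how many query terms literally occur in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms)


def cosine_similarity(u, v):
    """Dense-style signal: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical hand-made vectors; a real system would get these from an embedding model.
query, query_vector = "apples", [1.0, 0.0]
documents = {
    "I like apples": [0.9, 0.1],
    "Granny Smith is my favourite fruit": [0.8, 0.3],
}

for text, vector in documents.items():
    sparse = keyword_score(query, text)
    dense = cosine_similarity(query_vector, vector)
    print(f"{text!r}: keyword={sparse}, cosine={dense:.2f}")
```

With these assumed vectors, the second document shares no keyword with the query and gets a keyword score of 0, yet its cosine similarity remains high. An ensemble can surface such a document through the dense retriever while still rewarding exact keyword matches through the sparse one.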

In [1]:
%pip install --upgrade --quiet rank_bm25

In [2]:
%pip install langchain langchain_community langchain_openai faiss-cpu


In [4]:
# Set the OpenAI API key as an environment variable
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key:··········


In [7]:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# List of documents
doc_list_1 = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_texts(
    doc_list_1, metadatas=[{"source": 1}] * len(doc_list_1)
)
bm25_retriever.k = 2

doc_list_2 = [
    "You like apples",
    "You like oranges",
]

embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(
    doc_list_2, embedding, metadatas=[{"source": 2}] * len(doc_list_2)
)

# Initialize the FAISS retriever
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# Initialize the Ensemble Retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

# Use the Ensemble Retriever
docs = ensemble_retriever.invoke("apples")
print(docs)

[Document(page_content='I like apples', metadata={'source': 1}), Document(page_content='You like apples', metadata={'source': 2}), Document(page_content='Apples and oranges are fruits', metadata={'source': 1}), Document(page_content='You like oranges', metadata={'source': 2})]


# Runtime configuration
We can also configure the retrievers at runtime. To do so, we need to mark the relevant fields as configurable:

In [8]:
from langchain_core.runnables import ConfigurableField

# Configure the FAISS retriever and mark its search_kwargs as configurable
faiss_retriever = faiss_vectorstore.as_retriever(
    search_kwargs={"k": 2}
).configurable_fields(
    search_kwargs=ConfigurableField(
        id="search_kwargs_faiss",
        name="Search Kwargs",
        description="The search kwargs to use",
    )
)

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

# Override the configuration at runtime
config = {"configurable": {"search_kwargs_faiss": {"k": 1}}}
docs = ensemble_retriever.invoke("apples", config=config)
print(docs)

[Document(page_content='I like apples', metadata={'source': 1}), Document(page_content='You like apples', metadata={'source': 2}), Document(page_content='Apples and oranges are fruits', metadata={'source': 1})]
