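# How to create a custom retriever
## Overview
Many LLM applications retrieve information from external data sources using a [retriever](/docs/concepts/retrievers/).
A retriever is responsible for returning a list of relevant [Documents](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) for a given user `query`.
The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use that information to generate an appropriate response (e.g., answering a user question based on a knowledge base).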
## Interface
To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:
| Method                         | Description                                      | Required/Optional |
|--------------------------------|--------------------------------------------------|-------------------|
| `_get_relevant_documents`      | Get documents relevant to a query.               | Required          |
| `_aget_relevant_documents`     | Implement to provide async native support.       | Optional          |

The logic inside `_get_relevant_documents` can involve arbitrary calls to a database or to the web, for example using the `requests` library.
:::tip
By inheriting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/concepts/runnables) and gains the standard `Runnable` functionality out of the box!
:::

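:::info
You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.

The main benefit of implementing a retriever as a `BaseRetriever` rather than a `RunnableLambda` (a custom [runnable function](/docs/how_to/functions)) is that a `BaseRetriever` is a well-known LangChain entity, so some monitoring tooling may implement specialized behavior for retrievers. Another difference is that a `BaseRetriever` behaves slightly differently from a `RunnableLambda` in some APIs; for example, the `start` event in the `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.
:::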

## Example
Let's implement a toy retriever that returns documents whose text contains the text of the user query.

In [26]:
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class ToyRetriever(BaseRetriever):
    """A toy retriever that returns up to k documents that contain the user query.

    This retriever only implements the sync method _get_relevant_documents.

    If the retriever were to involve file access or network access, it could benefit
    from a native async implementation of `_aget_relevant_documents`.

    As usual, with Runnables, there's a default async implementation that's provided
    that delegates to the sync implementation running on another thread.
    """

    documents: List[Document]
    """List of documents to retrieve from."""
    k: int
    """Number of top results to return"""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
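    ) -> List[Document]:
        """Sync implementation: return up to k documents containing the query."""
        matching_documents = []
        # Iterate over the documents stored on the retriever, not a global.
        for document in self.documents:
            # Stop as soon as k matches have been collected.
            if len(matching_documents) >= self.k:
                return matching_documents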

            if query.lower() in document.page_content.lower():
                matching_documents.append(document)
        return matching_documents

    # Optional: Provide a more efficient native implementation by overriding
    # _aget_relevant_documents
    # async def _aget_relevant_documents(
    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
    # ) -> List[Document]:
    #     """Asynchronously get documents relevant to a query.

    #     Args:
    #         query: String to find relevant documents for
    #         run_manager: The callbacks handler to use

    #     Returns:
    #         List of relevant documents
    #     """

## Test it 🧪

In [21]:
documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"type": "dog", "trait": "loyalty"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"type": "cat", "trait": "independence"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"type": "fish", "trait": "low maintenance"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"type": "bird", "trait": "intelligence"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"type": "rabbit", "trait": "social"},
    ),
]
retriever = ToyRetriever(documents=documents, k=3)

In [22]:
retriever.invoke("that")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

It's a **runnable**, so it benefits from the standard Runnable interface! 🤩

In [23]:
await retriever.ainvoke("that")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

In [24]:
retriever.batch(["dog", "cat"])

[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],
 [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]

In [25]:
async for event in retriever.astream_events("bar", version="v1"):
    print(event)

{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}
{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}
{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}


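## Contributing
We appreciate contributions of interesting retrievers!
Here's a checklist to help make sure your contribution gets added to LangChain:
Documentation:
* The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://python.langchain.com/api_reference/langchain/index.html).
* The class doc-string for the retriever contains a link to any relevant APIs it uses (e.g., if the retriever fetches data from Wikipedia, it's good to link to the Wikipedia API!)
Tests:
* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.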
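Optimizations:
If the retriever connects to an external data source (e.g., an API or a file), it will almost certainly benefit from a native async optimization!
* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)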