# Custom Retriever

## Overview

Many LLM applications involve retrieving information from external data sources using a `Retriever`. 

The responsibility of a retriever is to return a list of the most relevant `Documents` that correspond to a given user `query`.

Downstream `Document` objects are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the `Document` to generate a desired response (e.g., answering a user question).

## Interface

To create your own retriever, extend the `BaseRetriever` class and implement the methods below. Only `_get_relevant_documents` is required:

| Method                         | Description                                      | Required/Optional |
|--------------------------------|--------------------------------------------------|-------------------|
| `_get_relevant_documents`      | Get documents relevant to a query.               | Required          |
| `_aget_relevant_documents`     | Implement to provide async native support.       | Optional          |


The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.

By inheriting from `BaseRetriever`, your retriever automatically becomes a LangChain `Runnable` and gains the standard `Runnable` functionality out of the box.

:::{.callout-tip}

If you wish, you can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever as well.

The main benefits of subclassing `BaseRetriever` are:
1) It's a well-known entity in LangChain, so some integrations may implement specialized behavior for retrievers.
2) It'll appear as a retriever in monitoring applications (e.g., LangSmith).
3) It'll appear explicitly as a retriever in some APIs; e.g., it'll yield `on_retriever_[x]` events rather than `on_chain_[x]` events with `astream_events`.
:::

## Example

Let's implement a toy retriever that returns documents that contain the user query.

In [1]:
from typing import List

from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document


class ToyRetriever(BaseRetriever):
    documents: List[Document]
    """List of documents to retrieve from."""
    k: int
    """Number of top results to return"""
    
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
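        """Sync implementation of the retriever."""
        matching_documents = []
        # Iterate over the retriever's own documents, not a module-level global.
        for document in self.documents:
            # Stop as soon as k matches have been collected.
            if len(matching_documents) >= self.k:
                return matching_documents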

            if query.lower() in document.page_content.lower():
                matching_documents.append(document)
        return matching_documents


    # Optional: Provide a more efficient native implementation by overriding
    # _aget_relevant_documents (this requires importing
    # AsyncCallbackManagerForRetrieverRun from langchain_core.callbacks).
    # async def _aget_relevant_documents(
    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
    # ) -> List[Document]:
    #     """Asynchronously get documents relevant to a query.
    #
    #     Args:
    #         query: String to find relevant documents for
    #         run_manager: The callbacks handler to use
    #
    #     Returns:
    #         List of relevant documents
    #     """


# Creating 10 sample Document objects about pets
documents = [
    Document(page_content="Dogs are great companions, known for their loyalty and friendliness.", metadata={"type": "dog", "trait": "loyalty"}),
    Document(page_content="Cats are independent pets that often enjoy their own space.", metadata={"type": "cat", "trait": "independence"}),
    Document(page_content="Goldfish are popular pets for beginners, requiring relatively simple care.", metadata={"type": "fish", "trait": "low maintenance"}),
    Document(page_content="Parrots are intelligent birds capable of mimicking human speech.", metadata={"type": "bird", "trait": "intelligence"}),
    Document(page_content="Rabbits are social animals that need plenty of space to hop around.", metadata={"type": "rabbit", "trait": "social"}),
    Document(page_content="Hamsters are nocturnal animals, making them active during the night.", metadata={"type": "hamster", "trait": "nocturnal"}),
    Document(page_content="Turtles are known for their long lifespan, making them a long-term commitment as pets.", metadata={"type": "turtle", "trait": "longevity"}),
    Document(page_content="Snakes can be fascinating pets, but they require specific care and handling.", metadata={"type": "snake", "trait": "unique care"}),
    Document(page_content="Guinea pigs are friendly and easy to handle, making them great pets for children.", metadata={"type": "guinea pig", "trait": "friendly"}),
    Document(page_content="Horses require a lot of space and care, but they can form strong bonds with their owners.", metadata={"type": "horse", "trait": "strong bonding"}),
]

## Test it 🧪

In [2]:
retriever = ToyRetriever(documents=documents, k=3)
retriever.invoke("that")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

Because it's a runnable, it also benefits from the standard Runnable interface! 🤩

In [3]:
await retriever.ainvoke('that')

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

In [4]:
retriever.batch(['dog', 'cat'])

[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],
 [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]

In [5]:
async for event in retriever.astream_events('bar', version='v1'):
    print(event)

{'event': 'on_retriever_start', 'run_id': '5acd4b11-45d9-4984-806d-7c35b1774108', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}
{'event': 'on_retriever_stream', 'run_id': '5acd4b11-45d9-4984-806d-7c35b1774108', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}
{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': '5acd4b11-45d9-4984-806d-7c35b1774108', 'tags': [], 'metadata': {}, 'data': {'output': []}}


## Contributing

We appreciate contributions for retrievers!

Here's a checklist to help make sure your contribution gets added to LangChain:

Documentation:

* [ ] The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).
* [ ] The class doc-string for the retriever contains a link to any relevant APIs it uses (e.g., if the retriever is retrieving from Wikipedia, it'll be good to link to the Wikipedia API!)

Tests:

* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.

Optimizations:

If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!
 
* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)