# Parent Document Retriever
When splitting documents for retrieval, there are often conflicting desires:

1. You may want small documents, so that their embeddings most accurately reflect their meaning. If a document is too long, its embedding can lose meaning.
2. You want to have long enough documents that the context of each chunk is retained.

The `ParentDocumentRetriever` strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents.

Note that “parent document” refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.
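To make the lookup concrete, here is a minimal, library-free sketch of the idea. It is not how `ParentDocumentRetriever` is implemented: the helper names (`split_text`, `index_parents`, `retrieve_parent`) are hypothetical, the fixed-width splitter is a simplification, and a substring match stands in for embedding similarity.

```python
import uuid


def split_text(text, chunk_size):
    """Naive fixed-width splitter; a stand-in for a real text splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def index_parents(parents, chunk_size):
    """Store each parent under a fresh id and tag its child chunks with that id."""
    docstore = {}
    children = []
    for parent in parents:
        doc_id = str(uuid.uuid4())
        docstore[doc_id] = parent
        for chunk in split_text(parent, chunk_size):
            children.append({"text": chunk, "doc_id": doc_id})
    return docstore, children


def retrieve_parent(query, docstore, children):
    """Find a child containing the query (substring match stands in for
    embedding similarity) and return the parent it came from."""
    for child in children:
        if query in child["text"]:
            return docstore[child["doc_id"]]
    return None


parents = [
    "The escrow amount is held by the agent until closing. " * 3,
    "The advisor is paid an hourly fee for consulting services. " * 3,
]
docstore, children = index_parents(parents, chunk_size=40)
result = retrieve_parent("hourly fee", docstore, children)
```

The search runs over the small chunks, but what comes back is the full parent text stored under the matching chunk's `doc_id`.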

In [1]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import SentenceTransformerEmbeddings


In [2]:
from IPython.display import display, Markdown
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [23]:
from langchain_community.document_loaders import PyPDFLoader

loader1 = PyPDFLoader("../data/Raptor Contract.docx.pdf")
loader2 = PyPDFLoader("../data/Robinson Advisory.docx.pdf")

loaders = [loader1, loader2]
docs = []

for loader in loaders:
    docs.extend(loader.load())

In [None]:
from langchain_community.document_loaders import DirectoryLoader

# DirectoryLoader expects a directory, not a single file; the glob selects the PDFs in it
loader = DirectoryLoader("../data/", glob="*.pdf")
docs = loader.load()

# Retrieving full documents
In this mode, we want to retrieve the full documents. Therefore, we only specify a child splitter.

In [3]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedder = SentenceTransformerEmbeddings()



In [25]:
# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=SentenceTransformerEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

In [26]:
retriever.add_documents(docs, ids=None)

You might expect this to yield two keys, because we added two files. It does not, because `PyPDFLoader` returns one document per page rather than one per file, so the store holds one key per page.

In [31]:
list(store.yield_keys())

['6e8fba82-d38c-41f6-a5e8-f692b47ee18b',
 'faba2abf-d22b-461e-be34-12928f08c89a',
 '2cab842e-ae93-40ca-b497-c8f1779ef275',
 'd133bfd1-9f0f-4935-8e24-e810971ed9c3',
 '0f86b0df-fbc4-4653-8a3d-31dae19cc149',
 'b3400ef9-ed74-4906-a153-80bc0ee742f0',
 'da081d35-7dc1-4b18-94ce-7a49643eda48',
 'f7fce3c1-9ce6-4e92-847f-c354e335c410',
 'df44fd81-1e3f-4661-a4a8-febfdbf09568',
 'ddfa4e18-3a03-4d20-9ce5-2ec0ea3d7c88',
 'c86cb567-a71f-4a12-a3c2-0353357995b3',
 '02a7e119-c9bb-4794-8371-4a0eaeb778a5',
 'b301373e-240d-4651-81d4-0261850c1068',
 'f4fc54fb-43ec-46a0-9778-f333b71635f0',
 'acad0c8d-1a44-4504-a274-7ed990b229b6',
 '37dbef44-92fc-484b-95c9-a01d45122137',
 'fde058e6-2318-49fa-bc4f-01a276d22c30',
 'f1766d16-3dac-40de-8b6a-8362cd21a726',
 '220790fa-54ee-4d19-9d72-9973a22d646b',
 '07d2fe20-7a74-41ac-8627-715cecc9a699',
 'f32ab620-47c7-418b-94dd-261e787b2404',
 '7fbbcadd-513e-4204-a5c5-53d0ec34d957',
 '3b0d5565-d7d5-4a5c-8960-a83978a8783b',
 '2b9a4f1b-731c-4516-afbb-d813f4fb3292',
 '90195431-85b2-
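If one parent per file is what you want, one option is to merge the page-level documents by their `source` metadata before adding them. The sketch below shows the grouping logic only; `Page` is a hypothetical stand-in for a loaded document (text plus a metadata dict with `source` and `page` keys, as seen in the outputs of this notebook), and `merge_pages_by_source` is an illustrative helper, not a LangChain function.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_pages_by_source(pages):
    """Concatenate page-level records into one record per source file."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page.metadata["source"]].append(page)

    merged = []
    for source, items in grouped.items():
        # Keep pages in their original order within each file
        items.sort(key=lambda p: p.metadata.get("page", 0))
        text = "\n".join(p.page_content for p in items)
        merged.append(Page(page_content=text, metadata={"source": source}))
    return merged


pages = [
    Page("page one", {"source": "a.pdf", "page": 0}),
    Page("page two", {"source": "a.pdf", "page": 1}),
    Page("only page", {"source": "b.pdf", "page": 0}),
]
merged = merge_pages_by_source(pages)
```

With three pages from two files, `merged` holds two records, so a store built from it would hold two keys.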

In [32]:
sub_docs = vectorstore.similarity_search("how much is the escrow amount")

In [34]:
sub_docs

[Document(page_content='Escrow \nAmount;\n(iv)\nrepay ,\nor\ncause\nto\nbe\nrepaid,\nin\naccordance\nwith\nthe\npayof f\nletters\ndelivered\nby\nthe \nSellers’\nRepresentative\npursuant\nto\nSection\n2.05(b)(iv)\nand\nthe\nAllocation\nStatement,\non\nbehalf \nof\nthe\nCompany ,\nthe\namounts\nas\nset\nforth\nin\nsuch\npayof f\nletters;\n(v)\npay,\nor\ncause\nto\nbe\npaid\nin\naccordance\nwith\nthe\nAllocation\nStatement,\non\nbehalf\nof \nthe\nCompany ,\nthe\nSeller\nTransaction', metadata={'doc_id': 'b1fa2a89-8245-40fa-8840-38bb732f39c8', 'page': 21, 'source': '../data/Raptor Contract.docx.pdf'}),
 Document(page_content='Escrow \nAmount;\n(iv)\nrepay ,\nor\ncause\nto\nbe\nrepaid,\nin\naccordance\nwith\nthe\npayof f\nletters\ndelivered\nby\nthe \nSellers’\nRepresentative\npursuant\nto\nSection\n2.05(b)(iv)\nand\nthe\nAllocation\nStatement,\non\nbehalf \nof\nthe\nCompany ,\nthe\namounts\nas\nset\nforth\nin\nsuch\npayof f\nletters;\n(v)\npay,\nor\ncause\nto\nbe\npaid\nin\naccordance\nwit

In [36]:
display(sub_docs[0].page_content)

'Escrow \nAmount;\n(iv)\nrepay ,\nor\ncause\nto\nbe\nrepaid,\nin\naccordance\nwith\nthe\npayof f\nletters\ndelivered\nby\nthe \nSellers’\nRepresentative\npursuant\nto\nSection\n2.05(b)(iv)\nand\nthe\nAllocation\nStatement,\non\nbehalf \nof\nthe\nCompany ,\nthe\namounts\nas\nset\nforth\nin\nsuch\npayof f\nletters;\n(v)\npay,\nor\ncause\nto\nbe\npaid\nin\naccordance\nwith\nthe\nAllocation\nStatement,\non\nbehalf\nof \nthe\nCompany ,\nthe\nSeller\nTransaction'

Let’s now retrieve from the overall retriever. This should return larger documents, because it returns the parent documents that contain the matching small chunks.

In [37]:
retrieved_docs = retriever.get_relevant_documents("how much is the escrow amount?")

In [38]:
len(retrieved_docs[0].page_content)

3046
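Several child chunks can point at the same parent, so the step from child hits to parent documents usually needs de-duplication. The sketch below shows one way to do that while preserving rank order. It assumes each hit is a plain dict with a `metadata` entry that carries a `doc_id`, mirroring the metadata shown in the outputs above; `parents_for_hits` is a hypothetical helper and not the library's own code.

```python
def parents_for_hits(child_hits, docstore, id_key="doc_id"):
    """Map ranked child hits to their parents, keeping each parent once."""
    ids = []
    for hit in child_hits:
        parent_id = hit["metadata"].get(id_key)
        # Skip hits without a parent id and parents we have already seen
        if parent_id is not None and parent_id not in ids:
            ids.append(parent_id)
    # Ignore ids that are missing from the store
    return [docstore[i] for i in ids if i in docstore]


docstore = {"p1": "full text of parent one", "p2": "full text of parent two"}
child_hits = [
    {"text": "chunk a", "metadata": {"doc_id": "p2"}},
    {"text": "chunk b", "metadata": {"doc_id": "p1"}},
    {"text": "chunk c", "metadata": {"doc_id": "p2"}},
    {"text": "chunk d", "metadata": {}},
]
parents = parents_for_hits(child_hits, docstore)
```

Four child hits collapse to two parents here, ordered by the rank of their best-matching child.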

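# Retrieving larger chunks
In this mode, we first split the raw documents into larger parent chunks and then split those into smaller child chunks. The child chunks are indexed, and retrieval returns the parent chunks they came from.

In [44]: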
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()

In [42]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [None]:
retriever.add_documents(docs)
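The two-level split configured above can be pictured with a small, library-free sketch. It uses a naive fixed-width splitter, which is an assumption made for brevity: `RecursiveCharacterTextSplitter` prefers to break on separators, so real chunk boundaries and counts will differ. The helper names are hypothetical.

```python
import uuid


def split_text(text, chunk_size):
    """Naive fixed-width splitter; a stand-in for a real text splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_two_levels(text, parent_size, child_size):
    """Split text into parent chunks, then split each parent into children.

    Returns a docstore mapping parent id -> parent chunk, and a list of
    (parent_id, child_chunk) pairs that would be embedded and indexed.
    """
    docstore = {}
    children = []
    for parent in split_text(text, parent_size):
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = parent
        for child in split_text(parent, child_size):
            children.append((parent_id, child))
    return docstore, children


text = "x" * 5000
docstore, children = split_two_levels(text, parent_size=2000, child_size=400)
```

For a 5,000-character input this produces three parent chunks (2,000, 2,000, and 1,000 characters) and thirteen child chunks. The stored keys now correspond to parent chunks rather than to whole documents.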