# Step-by-step implementation
The steps for implementing parent document retrieval (PDR) are as follows:
  1. Prepare the data

    i) Import necessary modules

    ii) Set up the OpenAI API key

    iii) Define the text embedding function

    iv) Load text data

  2. Retrieve full documents

    i) Full document splitting

    ii) Vector store and storage setup

    iii) Parent document retriever

    iv) Adding documents

    v) Similarity search and retrieval

  3. Retrieve larger chunks

    i) Parent document retriever

    ii) Similarity search and retrieval

  4. Integrate with `RetrievalQA`

## 1. Prepare the data

### i) Import necessary modules

In [0]:
import os

from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.retrievers import ParentDocumentRetriever
from langchain.schema import Document
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_openai import OpenAI, OpenAIEmbeddings

### ii) Set up the OpenAI API key

In [0]:
# Read the key from the environment instead of overwriting it with an empty string
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
if OPENAI_API_KEY == "":
    raise ValueError("Please set the OPENAI_API_KEY environment variable")

### iii) Define the text embedding function

In [0]:
embeddings = OpenAIEmbeddings()

### iv) Load text data

In [0]:
loaders = [
    TextLoader('blog.langchain.dev_announcing-langsmith_.txt'),
    TextLoader('blog.langchain.dev_automating-web-research_.txt'),
]

# Load every file and collect the resulting documents in one list
docs = []
for loader in loaders:
    docs.extend(loader.load())

## 2. Retrieve full documents

### i) Full document splitting

In [0]:
# Child splitter: each document is split into chunks of up to 400 characters for embedding
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

### ii) Vector store and storage setup

In [0]:
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embeddings  # Reuse the embedding function defined earlier
)

store = InMemoryStore()

### iii) Parent document retriever

In [0]:
full_doc_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter
)

### iv) Adding documents

In [0]:
full_doc_retriever.add_documents(docs)

print(list(store.yield_keys()))  # List document IDs in the store
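
To make the indexing step more concrete, the following is a minimal pure-Python sketch of the bookkeeping that happens conceptually when documents are added: each parent gets an ID, the full text goes into a docstore, and every child chunk keeps a reference to its parent ID. The names `split_text` and `index_documents` are hypothetical helpers, and the fixed-width splitter is a simplified stand-in for `RecursiveCharacterTextSplitter`, not the library's actual algorithm.

```python
import uuid


def split_text(text, chunk_size):
    """Naive fixed-width splitter (a stand-in, not RecursiveCharacterTextSplitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def index_documents(documents, chunk_size=400):
    """Return (child_index, docstore) for a list of document strings."""
    docstore = {}     # parent ID -> full parent text
    child_index = []  # (child chunk, parent ID) pairs; a real setup would embed the chunk
    for text in documents:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for chunk in split_text(text, chunk_size):
            child_index.append((chunk, parent_id))
    return child_index, docstore


# Placeholder documents of 1000 and 450 characters
documents = ["a" * 1000, "b" * 450]
child_index, docstore = index_documents(documents)

print(len(docstore))     # number of parent documents
print(len(child_index))  # number of child chunks
```

Only the small chunks would be embedded and searched; the docstore exists so that a matching chunk can be traced back to the larger text it came from.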

### v) Similarity search and retrieval

In [0]:
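# Searching the vector store directly returns the small child chunks
sub_docs = vectorstore.similarity_search("What is LangSmith?", k=2)
print(len(sub_docs))

print(sub_docs[0].page_content)

# Searching through the retriever returns the full parent documents instead
retrieved_docs = full_doc_retriever.invoke("What is LangSmith?")

print(len(retrieved_docs[0].page_content))
print(retrieved_docs[0].page_content)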
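
The difference between the two searches above can be sketched without any external services. In this hedged toy version, a word-overlap `score` function stands in for embedding similarity, and `retrieve_parents` is a hypothetical helper that ranks child chunks, then maps the top hits back to unique parents. The sample texts and IDs are made up for illustration.

```python
def score(query, text):
    """Count shared lowercase words (a crude stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, child_index, docstore, k=2):
    """Rank child chunks, then return the unique parents of the top-k hits."""
    ranked = sorted(child_index, key=lambda pair: score(query, pair[0]), reverse=True)
    parent_ids = []
    for _chunk, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]


# Made-up sample data: two parents, each split into two child chunks
docstore = {
    "doc-1": "LangSmith is a topic of the first file. It has a second sentence.",
    "doc-2": "Web research is a topic of the other file. It ends here.",
}
child_index = [
    ("LangSmith is a topic of the first file.", "doc-1"),
    ("It has a second sentence.", "doc-1"),
    ("Web research is a topic of the other file.", "doc-2"),
    ("It ends here.", "doc-2"),
]

results = retrieve_parents("what is langsmith", child_index, docstore, k=2)
print(len(results))
print(results[0])
```

Because the ranking happens on child chunks but the return value is built from parent IDs, the number of results can be smaller than `k` when several top chunks belong to the same parent.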

## 3. Retrieve larger chunks

### i) Parent document retriever

In [0]:
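# Parent chunks (up to 2000 characters) are what the retriever returns
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# Child chunks (up to 400 characters) are what gets embedded and searched
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)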

vectorstore = Chroma(
    collection_name="split_parents",
    embedding_function=embeddings  # Reuse the embedding function defined earlier
)

store = InMemoryStore()

big_chunks_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add the documents; the docstore now holds parent chunks rather than whole files
big_chunks_retriever.add_documents(docs)
print(len(list(store.yield_keys())))  # Number of parent chunks in the store
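
The two-level split can be illustrated with simple arithmetic. The sketch below assumes a naive fixed-width splitter (the hypothetical `split_text` helper, not the recursive splitter used above) and a placeholder 5000-character document, and shows how parent chunks of 2000 characters are each divided into child chunks of 400 characters.

```python
def split_text(text, chunk_size):
    """Naive fixed-width splitter (a stand-in, not RecursiveCharacterTextSplitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


document = "x" * 5000  # placeholder text

# First level: parent chunks, which are what retrieval would return
parents = split_text(document, 2000)

# Second level: child chunks per parent, which are what would be embedded
children_per_parent = [split_text(parent, 400) for parent in parents]

print([len(parent) for parent in parents])
print([len(children) for children in children_per_parent])
```

With these sizes, the docstore would hold one entry per parent chunk, which is why the key count printed in this step is typically larger than the number of source files.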

### ii) Similarity search and retrieval

In [0]:
# Searching the vector store directly still returns the small child chunks
sub_docs = vectorstore.similarity_search("What is LangSmith?", k=2)
print(len(sub_docs))

print(sub_docs[0].page_content)

# Searching through the retriever returns the larger parent chunks
retrieved_docs = big_chunks_retriever.invoke("What is LangSmith?")
print(len(retrieved_docs))

print(len(retrieved_docs[0].page_content))
print(retrieved_docs[0].page_content)

## 4. Integrate with `RetrievalQA`

In [0]:
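qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",  # Put all retrieved chunks into a single prompt
    retriever=big_chunks_retriever
)

query = "What is LangSmith?"

# The chain returns a dict; the generated answer is under the "result" key
response = qa.invoke(query)
print(response["result"])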
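
As a rough illustration of what the `"stuff"` chain type does with the retrieved chunks, the sketch below concatenates them into one prompt together with the question. The helper `build_stuff_prompt` and the prompt wording are assumptions made for this example; they are not LangChain's actual template, and the chunk strings are placeholders.

```python
def build_stuff_prompt(question, documents):
    """Join all retrieved texts into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder texts standing in for retrieved parent chunks
retrieved = ["First retrieved parent chunk.", "Second retrieved parent chunk."]
prompt = build_stuff_prompt("What is LangSmith?", retrieved)
print(prompt)
```

Since every retrieved chunk ends up in the same prompt, larger parent chunks give the model more context per query but also consume more of its context window.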