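# Parent Document Retriever

## Overview

This tutorial focuses on implementing `ParentDocumentRetriever`, a tool designed to balance the needs of document retrieval and text chunking.

When splitting documents for search, we face two competing needs:

1. **Small Chunks**: help embeddings represent meaning accurately  
2. **Context Preservation**: helps keep the document coherent and fully understandable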

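> 📌 How It Works

`ParentDocumentRetriever` works as follows:

1. Splits documents into small chunks that are easy to search  
2. Keeps a link from each small chunk to its original parent document (via an ID)  
3. Supports loading and processing multiple documents through `TextLoader`  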
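
The linking step can be sketched in plain Python. This is a minimal illustration and not LangChain's implementation: `split_text`, `index_document`, and `retrieve_parent` are hypothetical helpers, fixed-width slicing stands in for a real text splitter, and substring matching stands in for embedding similarity.

```python
import uuid


def split_text(text, chunk_size):
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def index_document(text, chunk_size, docstore, child_index):
    """Store the parent under a fresh ID and tag every child chunk with that ID."""
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = text
    for chunk in split_text(text, chunk_size):
        child_index.append({"doc_id": doc_id, "text": chunk})
    return doc_id


def retrieve_parent(query, docstore, child_index):
    """Find the first child chunk containing the query, then return its parent."""
    for child in child_index:
        if query.lower() in child["text"].lower():
            return docstore[child["doc_id"]]
    return None


docstore = {}      # parent ID -> full parent text
child_index = []   # small chunks, each carrying its parent's ID
parent_text = (
    "Embedding converts text into vectors. "
    "Word2Vec maps words to a vector space."
)
doc_id = index_document(parent_text, 50, docstore, child_index)
result = retrieve_parent("Word2Vec", docstore, child_index)
```

The search touches only the small chunks, but what comes back is the full parent text stored under the shared ID.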

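> ✅ Benefits

1. **Efficient Retrieval**: quickly finds content relevant to the query  
2. **Context Awareness**: can return a more complete passage of the document when needed  
3. **Flexible Structure**: the parent can be the entire original document or a larger chunk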

---

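### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Full Document Retrieval](#full-document-retrieval)
- [Adjusting Larger Chunk Sizes](#adjusting-larger-chunk-sizes)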

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- You can check out [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial langchain_chroma

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_community",
        "langchain_openai",
        "chromadb",
    ],
    verbose=False,
    upgrade=False,
)

from langchain.storage import InMemoryStore
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Parent-Document-Retriever",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

First, let's load the documents that we'll use as data.

In [5]:
loaders = [
    # Load the file (multiple files can be listed here)
    TextLoader("./data/appendix-keywords.txt"),
    # On Windows, you may need to pass the encoding explicitly instead:
    # TextLoader("./data/appendix-keywords.txt", encoding="utf-8"),
]

docs = []
for loader in loaders:
    # Load the document using the loader and add it to the docs list.
    docs.extend(loader.load())


In [6]:
docs

[Document(metadata={'source': './data/appendix-keywords.txt'}, page_content='Semantic Search\n\nDefinition: Semantic search refers to a search method that understands the meaning behind user queries, going beyond simple keyword matching to return relevant results.  \nExample: When a user searches for "solar system planets," the search returns information about related planets like Jupiter and Mars.  \nRelated Keywords: Natural Language Processing, Search Algorithms, Data Mining  \n\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This enables computers to understand and process text.  \nExample: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].  \nRelated Keywords: Natural Language Processing, Vectorization, Deep Learning  \n\n\nToken\n\nDefinition: A token refers to smaller units of text obtained by breaking it into parts, such as words, sentences, or phrases.  

## Full Document Retrieval

In this mode, we want to retrieve **full documents**, so we specify only the `child_splitter`.

Later, we will also add a `parent_splitter` and compare the retrieval results.

In [7]:
# Define Child Splitter with chunk size
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Create a Chroma DB collection -- in memory version
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()

# Create Retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

Documents are added using the ```retriever.add_documents(docs, ids=None)``` method:
* If ```ids``` is ```None```, they will be automatically generated.
* Setting ```add_to_docstore=False``` skips writing the parent documents to the docstore, which avoids adding them a second time if they are already stored. In that case, ```ids``` must be provided so the chunks can be linked to the existing documents.
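
The ID rules above can be written out as a small function. This is a sketch of the rules as described here, not the library's code: `resolve_ids` is a hypothetical helper, and the one-ID-per-document check is an assumption added for illustration.

```python
import uuid


def resolve_ids(num_docs, ids=None, add_to_docstore=True):
    """Return one ID per document, following the rules described above."""
    if ids is None:
        if not add_to_docstore:
            # Without IDs there is no way to point chunks at existing parents.
            raise ValueError("ids are required when add_to_docstore is False")
        # No IDs supplied: generate a fresh unique ID for each document.
        return [str(uuid.uuid4()) for _ in range(num_docs)]
    if len(ids) != num_docs:
        # Assumption for this sketch: exactly one ID per document.
        raise ValueError("expected one id per document")
    return list(ids)


generated = resolve_ids(2)
explicit = resolve_ids(2, ids=["doc-a", "doc-b"], add_to_docstore=False)
```

Passing stable IDs of your own makes it possible to re-index the same source files later without creating a second copy of each parent.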

In [8]:
# Add documents to the retriever. 'docs' is a list of documents, and 'ids' is a list of unique document identifiers.
retriever.add_documents(docs, ids=None, add_to_docstore=True)

This code should return one key because we added one document.

- Convert the keys returned by the ```store``` object's ```yield_keys()``` method into a list.

In [9]:
# Return all keys from the store as a list.
list(store.yield_keys())

['739c2480-1aac-4090-ae21-ceebc416d099']

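Let's try calling the vector store's search function.

Because we stored the **small chunks**, the search results should return those small chunks.

You can run a similarity search with the vector store object's `similarity_search` method.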
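
To make the idea of a similarity search concrete, here is a toy ranking by cosine similarity in plain Python. It is only a sketch: the 3-dimensional vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and the metric a real vector store uses depends on how its collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["vector"]),
        reverse=True,
    )
    return [c["text"] for c in ranked[:k]]


# Made-up vectors for illustration; real embeddings have far more dimensions.
toy_chunks = [
    {"text": "Word2Vec", "vector": [0.9, 0.1, 0.0]},
    {"text": "CSV", "vector": [0.0, 0.2, 0.9]},
    {"text": "Embedding", "vector": [0.7, 0.6, 0.1]},
]
query_vector = [1.0, 0.0, 0.0]
best = top_k(query_vector, toy_chunks, k=2)
```

Here the query vector points in nearly the same direction as the "Word2Vec" chunk, so that chunk ranks first.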

In [10]:
# Perform similarity search
sub_docs = vectorstore.similarity_search("Word2Vec")

# Print the page_content property of the first element in the sub_docs list.
print(sub_docs[0].page_content)

Word2Vec


Now let's query the retriever itself. Because it **returns the parent documents** that contain the matching small chunks, the results are much larger than the chunks returned by the vector store.

Use the ```invoke()``` method of the ```retriever``` object to retrieve documents related to the query.
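
Conceptually, the lookup from matching chunks back to their parents might look like the sketch below. It is a simplified illustration, not the library's code: `parents_for_hits` is a hypothetical helper, and the metadata key `doc_id` is an assumption made for this example.

```python
def parents_for_hits(child_hits, docstore, id_key="doc_id"):
    """Collect each hit's parent ID once, in ranking order, then fetch the parents."""
    ordered_ids = []
    for hit in child_hits:
        parent_id = hit["metadata"][id_key]
        if parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Skip IDs that are missing from the docstore instead of failing.
    return [docstore[pid] for pid in ordered_ids if pid in docstore]


docstore = {"p1": "full text of parent 1", "p2": "full text of parent 2"}
child_hits = [
    {"text": "chunk A", "metadata": {"doc_id": "p2"}},
    {"text": "chunk B", "metadata": {"doc_id": "p1"}},
    {"text": "chunk C", "metadata": {"doc_id": "p2"}},  # same parent as chunk A
]
parents = parents_for_hits(child_hits, docstore)
```

Several chunks from the same parent collapse into a single result, so the number of returned documents can be smaller than the number of chunk hits.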

In [11]:
# Retrieve and fetch documents
retrieved_docs = retriever.invoke("Word2Vec")

In [12]:
# Print the length of the page content of the retrieved document
print(
    f"Document length: {len(retrieved_docs[0].page_content)}",
    end="\n\n=====================\n\n",
)

# Print a portion of the document
print(retrieved_docs[0].page_content[2000:2500])

Document length: 10044


 old.  
Related Keywords: Database, Query, Data Management  


CSV

Definition: CSV (Comma-Separated Values) is a file format used to store data, where each value is separated by a comma. It is often used for saving and exchanging tabular data.  
Example: A CSV file with headers "Name, Age, Job" could contain data like "John, 30, Developer."  
Related Keywords: Data Format, File Handling, Data Exchange  


JSON

Definition: JSON (JavaScript Object Notation) is a lightweight data interchange form


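## Adjusting Larger Chunk Sizes

As the previous result shows, **an entire document may be too large to be used directly for search**.

In that case, what we actually want is to:

1. First split the original documents into **larger chunks (parent chunks)**.
2. Then split those larger chunks into **smaller chunks (child chunks)**.

Then we:
- **Index the small chunks** in the vector store
- Return the **larger parent chunks** at retrieval time (rather than the whole document)

### Implementation

Use `RecursiveCharacterTextSplitter` to create the parent and child splitters:

- Parent documents: `chunk_size` set to `1000`
- Child documents: `chunk_size` set to `200`, so they are smaller than the parents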
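
The two-level split can be illustrated in plain Python before wiring it into LangChain. This is a rough sketch under simplifying assumptions: `split_text` and `build_two_level_index` are hypothetical helpers, and fixed-width slicing replaces the separator-aware logic of `RecursiveCharacterTextSplitter`.

```python
def split_text(text, chunk_size):
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_two_level_index(text, parent_size=1000, child_size=200):
    """Split into parent chunks, then split each parent into child chunks."""
    docstore = {}      # parent ID -> parent chunk (returned at retrieval time)
    child_index = []   # child chunks (the part that would be embedded and searched)
    for n, parent in enumerate(split_text(text, parent_size)):
        parent_id = f"parent-{n}"
        docstore[parent_id] = parent
        for child in split_text(parent, child_size):
            child_index.append({"doc_id": parent_id, "text": child})
    return docstore, child_index


sample = "x" * 2500  # placeholder text, 2500 characters long
docstore, child_index = build_two_level_index(sample)
num_parents = len(docstore)
num_children = len(child_index)
```

With 2500 characters, this produces three parent chunks (1000, 1000, and 500 characters) and thirteen child chunks, each child pointing at the parent it was cut from.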

In [13]:
# Text splitter used to generate parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

# Text splitter used to generate child documents
# Should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Vector store to be used for indexing child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# Storage layer for parent documents
store = InMemoryStore()

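The parameters used to initialize ```ParentDocumentRetriever``` are:

- `vectorstore`: the vector store that holds the document vectors.
- `docstore`: the document store that holds the document contents.
- `child_splitter`: the splitter used to create child documents.
- `parent_splitter`: the splitter used to create parent documents.

```ParentDocumentRetriever``` handles documents hierarchically, splitting and storing parent and child documents separately.

This design lets retrieval combine the parent document's context with the child document's precise match, improving search quality and contextual understanding.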

In [14]:
retriever = ParentDocumentRetriever(
    # Specify the vector store
    vectorstore=vectorstore,
    # Specify the document store
    docstore=store,
    # Specify the child document splitter
    child_splitter=child_splitter,
    # Specify the parent document splitter
    parent_splitter=parent_splitter,
)

Add docs to the ```retriever``` object. This adds new documents to the set of documents that ```retriever``` can search through.

In [15]:
# Add documents to the retriever
retriever.add_documents(docs)

Now you can see there are many more documents. These are the larger chunks.

In [16]:
# Generate keys from the store, convert to list, and return the length
len(list(store.yield_keys()))

12

In [17]:
# Perform similarity search
sub_docs = vectorstore.similarity_search("Word2Vec")
# Print the page_content property of the first element in the sub_docs list
print(sub_docs[0].page_content)

Word2Vec


Now let's use the ```invoke()``` method of the ```retriever``` object to search for documents.

In [18]:
# Retrieve and fetch documents
retrieved_docs = retriever.invoke("Word2Vec")

# Print the page content of the first retrieved document
print(retrieved_docs[0].page_content)

Crawling

Definition: Crawling is the automated process of visiting web pages to collect data. It is commonly used for search engine optimization and data analysis.  
Example: Google’s search engine crawls websites to collect and index content.  
Related Keywords: Data Collection, Web Scraping, Search Engine  


Word2Vec

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces, representing semantic relationships between words.  
Example: In a Word2Vec model, "king" and "queen" are represented as vectors close to each other in the vector space.  
Related Keywords: Natural Language Processing, Embedding, Semantic Similarity  


LLM (Large Language Model)
