# Document Splitting

Split a whole document into smaller chunks so that it can be stored in a vector database and retrieved by a language model more efficiently.

Example:
```
The Toyota Camry has a head-snapping 80 HP and an eight-speed automatic transmission...
```
Chunk 1:
on this model. The Toyota Camry has a head-snapping

Chunk 2:
80 HP and an eight-speed automatic transmission that will

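Why split?

- LLMs have a token limit (for example, LLaMA 2 served through Ollama has a context window of about 4096 tokens)
- Retrieval can return only the relevant passages instead of the whole file
- Smaller, focused chunks help improve answer accuracy and performance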

The figure below illustrates how a text is split into chunks.

![](https://r.anikit.app/i/nQ9XLQbo14)


- https://python.langchain.com/docs/concepts/text_splitters/#approaches

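1. CharacterTextSplitter()
    - Splits on a single separator and measures chunk length in characters
    - Common for basic chunking setups, e.g. 500 characters per chunk
2. MarkdownHeaderTextSplitter()
    - Splits on Markdown heading levels (such as #, ##)
    - Well suited to .md technical documents or content organised by chapter
3. TokenTextSplitter()
    - Splits by token count
    - Tokens are counted with a model-specific tokenizer (e.g. OpenAI's or a HuggingFace tokenizer)
    - Useful for keeping prompt length under control (e.g. at most 512 tokens)
4. SentenceTransformersTokenTextSplitter()
    - Also token-based
    - Uses the tokenizer of a Sentence Transformers model
    - A better fit for semantic-embedding pipelines built on sentence-transformers
5. RecursiveCharacterTextSplitter() ✅ most commonly used
    - The recommended default
    - Tries a sequence of separators (paragraph breaks, line breaks, spaces)
    - Falls back to progressively finer separators when a piece is still too large (recursive fallback)
    - ✅ Suited to general text; balances semantic structure against chunk length
6. Language (⚠️ less commonly used)
    - An enum rather than a splitter; pass it to RecursiveCharacterTextSplitter.from_language()
    - Used for splitting code files (e.g. C++, Python, Ruby, Markdown)
    - Selects separators that follow the syntax of the chosen language
7. NLTKTextSplitter()
    - Uses NLTK's sentence tokenizer
    - Suited to English documents; each piece is a complete sentence
    - Requires the nltk package and its data (corpora)
8. SpacyTextSplitter()
    - Uses spaCy's sentence segmentation
    - Good for more semantically meaningful segments
    - Requires a spaCy model (e.g. en_core_web_sm)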

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [3]:
chunk_size = 26
chunk_overlap = 4

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
"""
CharacterTextSplitter splits on a single `separator`, which defaults to `\n\n`.
"""
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [5]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [7]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']: the leading 'wxyz' of the second element is the chunk_overlap

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
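To make the role of `chunk_overlap` concrete, here is a minimal pure-Python sketch of a fixed-size sliding window. It is not LangChain's implementation; `split_with_overlap` is a hypothetical helper that only reproduces the behaviour seen above for text that contains no separators.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghijklmnopqrstuvwxyz", 26, 4))
# ['abcdefghijklmnopqrstuvwxyz']
print(split_with_overlap("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second chunk starts `chunk_size - chunk_overlap = 22` characters into the text, which is why it begins with `wxyz`.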

In [8]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [9]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [10]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [11]:
c_splitter_with_separator = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter_with_separator.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
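Once a separator actually occurs in the text, both splitters follow the same basic idea: split on the separator, then greedily pack the pieces back into chunks while carrying a few trailing pieces over as overlap. Below is a simplified sketch of that packing step for a single separator. `pack_pieces` is a hypothetical name, and the real splitters handle more edge cases.

```python
def pack_pieces(text: str, chunk_size: int, chunk_overlap: int, separator: str = " ") -> list[str]:
    """Split on one separator, then merge pieces into chunks of at most chunk_size characters."""
    sep_len = len(separator)
    chunks: list[str] = []
    current: list[str] = []  # pieces of the chunk being built
    total = 0                # length of separator.join(current)
    for piece in text.split(separator):
        extra = sep_len if current else 0
        if current and total + len(piece) + extra > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remainder fits in the overlap
            # budget and leaves room for the incoming piece.
            while current and (total > chunk_overlap
                               or total + len(piece) + sep_len > chunk_size):
                total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                current.pop(0)
        total += len(piece) + (sep_len if current else 0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = " ".join("abcdefghijklmnopqrstuvwxyz")
print(pack_pieces(text3, chunk_size=26, chunk_overlap=4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Each new chunk starts with the pieces retained from the previous one (`l m`, then `w x`), which together take up no more than `chunk_overlap` characters.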

## Recursive splitting details

In [21]:
some_text = """
Why you need Kubernetes and what it can do
Containers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if this behavior was handled by a system? \n\n

That's how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling and failover for your application, provides deployment patterns, and more. For example: Kubernetes can easily manage a canary deployment for your system. \n\n  \

Kubernetes provides you with:

- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.
- Storage orchestration Kubernetes allows you to automatically mount a storage system of your choice, such as local storages, public cloud providers, and more.
- Automated rollouts and rollbacks You can describe the desired state for your deployed containers using Kubernetes, and it can change the actual state to the desired state at a controlled rate. For example, you can automate Kubernetes to create new containers for your deployment, remove existing containers and adopt all their resources to the new container.
- Automatic bin packing You provide Kubernetes with a cluster of nodes that it can use to run containerized tasks. You tell Kubernetes how much CPU and memory (RAM) each container needs. Kubernetes can fit containers onto your nodes to make the best use of your resources.
- Self-healing Kubernetes restarts containers that fail, replaces containers, kills containers that don't respond to your user-defined health check, and doesn't advertise them to clients until they are ready to serve.
- Secret and configuration management Kubernetes lets you store and manage sensitive information, such as passwords, OAuth tokens, and SSH keys. You can deploy and update secrets and application configuration without rebuilding your container images, and without exposing secrets in your stack configuration.
- Batch execution In addition to services, Kubernetes can manage your batch and CI workloads, replacing containers that fail, if desired.
- Horizontal scaling Scale your application up and down with a simple command, with a UI, or automatically based on CPU usage.
IPv4/IPv6 dual-stack Allocation of IPv4 and IPv6 addresses to Pods and Services
- Designed for extensibility Add features to your Kubernetes cluster without changing upstream source code.
"""

In [22]:
c_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
    separator = ' '
)
"""
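The `separators` argument of RecursiveCharacterTextSplitter is an ordered list of separators to try (earlier entries take priority):
1. Split on paragraph breaks (\n\n) first
2. If a piece is still too long, split on line breaks (\n)
3. If it is still too long, split on spaces (words)
4. As a last resort, cut between individual characters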
"""
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)
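The fallback order described in the cell above can be sketched in plain Python. This is a simplified illustration without overlap handling, not LangChain's actual code; `recursive_split` and `merge_parts` are hypothetical names.

```python
def merge_parts(parts: list[str], sep: str, chunk_size: int) -> list[str]:
    """Greedily join small parts with sep while the result stays within chunk_size."""
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split with the coarsest separator that occurs; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # First non-empty separator that actually appears in the text.
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    finer = separators[separators.index(sep) + 1:]
    chunks, small = [], []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                small.append(part)
        else:
            chunks.extend(merge_parts(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(merge_parts(small, sep, chunk_size))
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee\n\nff"
print(recursive_split(sample, 10, ["\n\n", "\n", " ", ""]))
# ['aaaa bbbb', 'cccc dddd', 'eeee', 'ff']
```

The first paragraph fits as a whole, so it is never broken at its space. Only the second paragraph, which exceeds the limit, is split again at the next separator level.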

In [14]:
c_splitter.split_text(some_text)

["Why you need Kubernetes and what it can do\nContainers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if this behavior was handled by a system?\n\nThat's how Kubernetes comes to the rescue! Kubernetes provides you with a",
 'framework to run distributed systems resiliently. It takes care of scaling and failover for your application, provides deployment patterns, and more. For example: Kubernetes can easily manage a canary deployment for your system.\n\nKubernetes provides you with:\n\n- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container is high, Kubernetes is able to load',
 'balance and distribute the network traffic so that the deployment is stable.\n- Storage orch

In [23]:
r_splitter.split_text(some_text)

["Why you need Kubernetes and what it can do\nContainers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if this behavior was handled by a system?",
 "That's how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling and failover for your application, provides deployment patterns, and more. For example: Kubernetes can easily manage a canary deployment for your system. \n\n  \nKubernetes provides you with:",
 '- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.\n- Storage orche

In [25]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
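    # NOTE: separators are matched as literal strings unless is_separator_regex=True,
    # so the pattern "\. " never matches here and the splitter falls through to " ".
    separators=["\n\n", "\n", r"\. ", " ", ""]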
)
r = r_splitter.split_text(some_text)
print(r)
print('\n\n')
print(len(r))

['Why you need Kubernetes and what it can do', 'Containers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the', "applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if", 'this behavior was handled by a system?', "That's how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling", 'and failover for your application, provides deployment patterns, and more. For example: Kubernetes can easily manage a canary deployment for your', 'system.', 'Kubernetes provides you with:', '- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container', 'is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.', '- S

In [26]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
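    # NOTE: the lookbehind only acts as a regex when is_separator_regex=True is passed;
    # without it the string is matched literally, so sentences are not kept together.
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]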
)
r = r_splitter.split_text(some_text)
print(r)
print('\n\n')
print(len(r))

['Why you need Kubernetes and what it can do', 'Containers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the', "applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if", 'this behavior was handled by a system?', "That's how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling", 'and failover for your application, provides deployment patterns, and more. For example: Kubernetes can easily manage a canary deployment for your', 'system.', 'Kubernetes provides you with:', '- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container', 'is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.', '- S
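The two sentence-level patterns used in the cells above differ in what happens to the period when they are applied as regular expressions. The plain-Python sketch below uses `re.split` from the standard library (not LangChain) to show the difference: `\. ` consumes the period and the space, while the lookbehind `(?<=\. )` splits at the position after them and so keeps the punctuation attached to its sentence.

```python
import re

sentence_text = "Containers are useful. They need managing. Kubernetes helps"

# Plain pattern: the matched ". " is removed from the result.
print(re.split(r"\. ", sentence_text))
# ['Containers are useful', 'They need managing', 'Kubernetes helps']

# Lookbehind: zero-width match after ". ", so nothing is removed (Python 3.7+).
print(re.split(r"(?<=\. )", sentence_text))
# ['Containers are useful. ', 'They need managing. ', 'Kubernetes helps']
```

Raw strings (`r"..."`) are used so that Python does not try to interpret `\.` as a string escape.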

## Token splitting

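Token splitting is especially useful when you need precise control over how many tokens are sent to the LLM. TokenTextSplitter splits text by token count rather than by character count or sentence length.

When to use it:
1. You need to control the number of tokens passed to the LLM
2. You want embedding chunks to stay under a given token count
3. Character counts are a poor proxy for length, as with Chinese or other non-English text
4. You want to make sure chunks sent to the LLM are not truncated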

> A token is often about 4 characters of English text.
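As a back-of-the-envelope illustration of that rule of thumb, the sketch below estimates a token count from the character count. `estimate_tokens` is a hypothetical helper and the default of 4 characters per token is only the heuristic quoted above; an exact count requires the model's own tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed average token length."""
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens("foo bar bazzyfoo"))  # 16 characters -> 4
```

The estimate is only a starting point for picking `chunk_size`; real tokenizers often produce more tokens for rare words and for non-English text.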

In [4]:
%pip install tiktoken

Defaulting to user installation because normal site-packages is not writeable
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0
Note: you may need to restart the kernel to use updated packages.

In [8]:
text1 = "foo bar bazzyfoo"

In [9]:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
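Conceptually, token splitting encodes the text into tokens, slides a window of `chunk_size` tokens over them, and decodes each window back to text. The sketch below shows that windowing with a toy tokenizer so it runs without extra packages. `toy_tokenize` and `split_on_tokens` are hypothetical names, and the toy tokenizer (one word plus its leading whitespace per token) is an assumption that does not match the subword tokens produced by tiktoken above.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Stand-in for a real tokenizer: each token is a word with its leading whitespace."""
    return re.findall(r"\s*\S+", text)


def split_on_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size tokens, advancing by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))  # "decode" the window
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the final token
    return chunks


tokens = toy_tokenize("foo bar bazzyfoo")
print(tokens)                         # ['foo', ' bar', ' bazzyfoo']
print(split_on_tokens(tokens, 1, 0))  # ['foo', ' bar', ' bazzyfoo']
print(split_on_tokens(tokens, 2, 1))  # ['foo bar', ' bar bazzyfoo']
```

Because chunk boundaries fall between tokens rather than characters, the chunk length is measured in the same unit as the model's limit.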

In [10]:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("./kubernetes-cheatsheet.pdf")
pages = loader.load()

In [12]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)

In [13]:
len(docs)

141

In [14]:
len(pages)

1

In [15]:
print(docs[0])
print('\n')
print(docs[1])

page_content='Kubernetes Cheatsheet
What is' metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250518155345', 'source': './kubernetes-cheatsheet.pdf', 'file_path': './kubernetes-cheatsheet.pdf', 'total_pages': 1, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': 'D:20250518155345', 'page': 0}


page_content=' Kubernetes 
Kapsule and' metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250518155345', 'source': './kubernetes-cheatsheet.pdf', 'file_path': './kubernetes-cheatsheet.pdf', 'total_pages': 1, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': 'D:20250518155345', 'page': 0}


## MarkdownHeaderTextSplitter

In [16]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [17]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [18]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

In [19]:
md_header_splits[1]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}, page_content='Hi this is Lance')
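The way each chunk inherits the headers above it can be sketched in plain Python. This is a simplified illustration under stated assumptions (ATX-style `#` headers followed by a space, no handling of code fences), not the library's implementation; `split_by_headers` is a hypothetical name and plain dicts stand in for `Document` objects.

```python
def split_by_headers(markdown: str, headers_to_split_on: list[tuple[str, str]]) -> list[dict]:
    """Group body lines under the headers currently in effect."""
    # Check longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    sections: list[dict] = []
    metadata: dict[str, str] = {}
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"metadata": dict(metadata), "page_content": text})
        body.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                # A new header ends every header at the same or a deeper level.
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        metadata.pop(other_name, None)
                metadata[name] = line[len(marker):].strip()
                break
        else:
            if line:
                body.append(line)
    flush()
    return sections


doc = "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for section in split_by_headers(doc, headers):
    print(section)
```

The last section carries only `Header 1` and `Header 2` (`Chapter 2`), because the new `##` header clears the earlier `Chapter 1` and `Section` entries.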