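# Character Text Splitter

## Overview

Text splitting is a key step when processing documents with LangChain.

```CharacterTextSplitter``` splits text into chunks on a separator. Chunking helps with:

- **Token limits:** keeps each piece within the context window of a large language model (LLM)
- **Search optimization:** allows retrieval at the level of individual chunks, which is more precise
- **Memory efficiency:** lets large documents be processed piece by piece, reducing memory usage
- **Context preservation:** ```chunk_overlap``` keeps consecutive chunks semantically connected

This tutorial walks through text splitting in practice, covering the core methods ```split_text()``` and ```create_documents()``` as well as metadata handling.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [CharacterTextSplitter Example](#charactertextsplitter-example)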

### References

- [LangChain TextSplitter API reference](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TextSplitter.html)
- [LangChain CharacterTextSplitter API reference](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html)

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup, helper functions, and utilities for tutorials.
- See the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_text_splitters",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "CharacterTextSplitter",  # set this to match the tutorial title
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

False
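The `False` shown above is the return value of ```load_dotenv```, which usually means no variables were loaded (for example, because no ```.env``` file was found). As a minimal sketch for checking which keys are actually available, the snippet below uses only the standard library; ```missing_env_keys``` is a hypothetical helper introduced here for illustration and is not part of ```python-dotenv``` or ```langchain-opentutorial```.

```python
import os


# Hypothetical helper for illustration; not part of any library used in this tutorial.
def missing_env_keys(required_keys, env=None):
    """Return the names in required_keys that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in required_keys if not env.get(key)]


required = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY"]
missing = missing_env_keys(required)
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All required keys are set.")
```

Note that ```CharacterTextSplitter``` itself runs locally, so the splitting examples in this tutorial should work even if these keys are empty.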

## CharacterTextSplitter Example

Read the keywords file and store its contents:
* Open the ```./data/appendix-keywords.txt``` file and read its contents.
* Store the contents in the ```file``` variable.

In [5]:
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

Print the first 500 characters of the file contents.

In [6]:
print(file[:500])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders


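## CharacterTextSplitter Parameters and Usage Example

`CharacterTextSplitter` splits long text into smaller chunks that are easier for a language model to process downstream.

### 📌 Parameters

| Parameter | Description |
|-----------|-------------|
| `separator` | String on which the text is split, for example a newline `\n`, a space `" "`, or a custom delimiter. |
| `chunk_size` | Maximum length of a single chunk, as measured by `length_function` (characters by default). |
| `chunk_overlap` | Number of characters shared by adjacent chunks, which helps preserve context. |
| `length_function` | Function used to measure text length; defaults to `len`. It can be replaced with custom logic, for example counting tokens. |
| `is_separator_regex` | When `True`, `separator` is interpreted as a regular expression (regex) rather than a literal string. |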
---

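### 🧪 Example

```python
from langchain_text_splitters import CharacterTextSplitter

# Create the text splitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",          # split on blank lines (paragraph breaks)
    chunk_size=300,            # at most 300 characters per chunk
    chunk_overlap=50,          # adjacent chunks overlap by up to 50 characters
    length_function=len,       # measure length with Python's built-in len()
    is_separator_regex=False,  # treat the separator as a literal string, not a regex
)

# Sample text
text = """LangChain is a powerful framework that helps developers build applications based on large language models.
It provides modular tools and abstraction layers that make complex tasks such as multi-turn conversation and RAG easier to implement.

TextSplitter is one of its key components, particularly suited to splitting long documents into short passages.
This helps the model understand context and also facilitates embedding-based retrieval and memory operations."""

# Split the text
chunks = text_splitter.split_text(text)

# Print each chunk
for i, chunk in enumerate(chunks, start=1):
    print(f"=== Chunk {i} ===")
    print(chunk)
```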

In [7]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
   separator=" ",           # Splits whenever a space is encountered in text
   chunk_size=250,          # Each chunk contains maximum 250 characters
   chunk_overlap=50,        # Two consecutive chunks share 50 characters
   length_function=len,     # Counts total characters in each chunk
   is_separator_regex=False # Uses space as literal separator, not as regex
)
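
To make the interplay of `separator`, `chunk_size`, and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a greedy split-then-merge strategy. It is a simplified model under stated assumptions, not LangChain's actual implementation; ```sketch_split``` is a hypothetical name, and the small sizes are chosen only to keep the output readable.

```python
# Hypothetical, simplified model of split-then-merge chunking.
def sketch_split(text, separator=" ", chunk_size=14, chunk_overlap=4, length_function=len):
    """Split on separator, then greedily merge pieces into chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and length_function(separator.join(current + [piece])) > chunk_size:
            # The next piece does not fit, so emit the current chunk.
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits within
            # chunk_overlap and still leaves room for the next piece.
            while current and (
                length_function(separator.join(current)) > chunk_overlap
                or length_function(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo = "aaaa bbbb cccc dddd eeee"
with_overlap = sketch_split(demo, chunk_size=14, chunk_overlap=4)
without_overlap = sketch_split(demo, chunk_size=14, chunk_overlap=0)

print(with_overlap)     # ['aaaa bbbb cccc', 'cccc dddd eeee']
print(without_overlap)  # ['aaaa bbbb cccc', 'dddd eeee']
```

In this model the overlap is carried over in whole pieces, so the shared text can be shorter than `chunk_overlap` but never longer. A single piece longer than `chunk_size` is emitted as an oversized chunk rather than cut mid-piece.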

Split the file into chunks, wrap each chunk in a document object, and display the first one.

In [8]:
chunks = text_splitter.create_documents([file])
print(chunks[0])

page_content='Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick'


Demonstrate metadata handling during document creation:

* ```create_documents``` accepts both text data and metadata lists
* Each chunk inherits metadata from its source document

In [9]:
# Define metadata for each document
metadatas = [
   {"document": 1},
   {"document": 2},
]

# Create documents with metadata
documents = text_splitter.create_documents(
   [file, file],  # List of texts to split
   metadatas=metadatas,  # Corresponding metadata
)

print(documents[0])  # Display first document with metadata

page_content='Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick' metadata={'document': 1}
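
The way each chunk inherits metadata from its source text can be sketched without LangChain. Everything below is hypothetical and for illustration only: ```SketchDocument``` stands in for LangChain's document class, ```fixed_width_split``` is a toy splitter, and ```sketch_create_documents``` assumes that each chunk receives its own copy of the metadata entry at the same index as its source text.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fixed_width_split(text, width=10):
    """Toy splitter: cut text into consecutive pieces of `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def sketch_create_documents(texts, metadatas=None, split=fixed_width_split):
    """Split each text and attach a copy of that text's metadata to every chunk."""
    metadatas = metadatas or [{} for _ in texts]
    if len(metadatas) != len(texts):
        raise ValueError("metadatas must have one entry per text")
    documents = []
    for text, metadata in zip(texts, metadatas):
        for chunk in split(text):
            documents.append(SketchDocument(chunk, deepcopy(metadata)))
    return documents


docs = sketch_create_documents(
    ["first source text", "second one"],
    metadatas=[{"document": 1}, {"document": 2}],
)
for doc in docs:
    print(doc)
```

The first text yields two chunks tagged `{"document": 1}` and the second yields one chunk tagged `{"document": 2}`. Copying the metadata per chunk means that editing one chunk's metadata later does not affect its siblings.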


Split text using the ```split_text()``` method.
* ```text_splitter.split_text(file)[0]``` returns the first chunk of the split text

In [10]:
# Split the file text and return the first chunk
text_splitter.split_text(file)[0]

'Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick'
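
Because ```split_text()``` returns plain strings, it is easy to inspect how much text consecutive chunks share. The helper below is a hypothetical utility, not a LangChain function, and the sample chunks are made up for illustration; the same loop could be run over the result of ```text_splitter.split_text(file)```.

```python
# Hypothetical utility for inspecting overlap between consecutive chunks.
def shared_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


# Made-up sample chunks; replace with the output of split_text() to inspect real data.
sample_chunks = ["aaaa bbbb cccc", "cccc dddd eeee", "eeee ffff"]

for index, (left, right) in enumerate(zip(sample_chunks, sample_chunks[1:]), start=1):
    print(f"chunk {index} -> {index + 1}: {shared_overlap(left, right)} shared characters")
```

With a space separator the overlap falls on word boundaries, so the measured value is expected to be at most `chunk_overlap` rather than exactly equal to it.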