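# How to recursively split text by characters
This [text splitter](/docs/concepts/text_splitters/) is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the resulting chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. The effect is to keep paragraphs (then sentences, then words) together for as long as possible, since those are generally the most strongly semantically related pieces of text.

1. How the text is split: by a list of characters.
2. How the chunk size is measured: by number of characters.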
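
The "try each separator in order" strategy can be illustrated with a short pure-Python sketch. This is a simplified illustration, not the library's implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it drops empty pieces.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified sketch: split on the first separator and recurse on
    oversized pieces with the remaining separators. No overlap handling."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small pieces back together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph here.\n\nSecond one is a bit longer than the first.\n\nThird."
chunks = recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch, paragraphs that fit are kept whole, and only the oversized second paragraph is broken further at spaces.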
Below we show example usage.
To obtain the string content directly, use `.split_text`.
To create LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) objects (e.g., for use in downstream tasks), use `.create_documents`.

In [None]:
%pip install -qU langchain-text-splitters

In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'


In [2]:
text_splitter.split_text(state_of_the_union)[:2]

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']

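Let's go through the parameters set above for `RecursiveCharacterTextSplitter`:

- `chunk_size`: The maximum size of a chunk, where size is determined by the `length_function`.
- `chunk_overlap`: Target overlap between chunks. Overlapping chunks help mitigate the loss of information when context is divided between chunks.
- `length_function`: Function that determines the chunk size.
- `is_separator_regex`: Whether the separator list (defaulting to `["\n\n", "\n", " ", ""]`) should be interpreted as regex.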
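
To make the `is_separator_regex` distinction concrete, here is a minimal sketch using only Python's `re` module. It assumes that a non-regex separator is matched literally (for example by escaping it first); `split_on` is a hypothetical helper for illustration and not part of the library.

```python
import re


def split_on(separator: str, text: str, is_separator_regex: bool) -> list[str]:
    # Assumption: a literal separator is escaped so regex metacharacters
    # lose their special meaning; a regex separator is used as written.
    pattern = separator if is_separator_regex else re.escape(separator)
    return [piece for piece in re.split(pattern, text) if piece]


text = "a\n\n\nb"
separator = r"\n{2,}"  # as a regex: any run of two or more newlines

as_regex = split_on(separator, text, is_separator_regex=True)
as_literal = split_on(separator, text, is_separator_regex=False)

print(as_regex)    # split at the run of newlines
print(as_literal)  # the literal characters \n{2,} never occur, so no split
```

The same string splits the text when read as a pattern and leaves it untouched when read as literal characters.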

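## Splitting text from languages without word boundaries
Some writing systems do not have [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries), for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `["\n\n", "\n", " ", ""]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:

* Add ASCII full stop "`.`", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop "`．`" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) "`。`" (used in Japanese and Chinese)
* Add the [zero-width space](https://zh.wikipedia.org/wiki/%E9%9B%B6%E5%AE%BD%E7%A9%BA%E6%A0%BC) used in Thai, Myanmar, Khmer, and Japanese.
* Add ASCII comma "`,`", Unicode fullwidth comma "`，`", and Unicode ideographic comma "`、`"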

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
    # Existing args
)
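
To see why the extra punctuation matters, the following pure-Python sketch compares a fixed-width cut (roughly what falling back to the `""` separator amounts to on text without spaces) with a cut after clause-ending punctuation. It does not call the library; the sample sentence and the width of 8 are arbitrary choices for illustration.

```python
import re

text = "今天天气很好。我们去公园散步，然后回家。"

# Fixed-width cut every 8 characters: words such as 我们 and 然后
# end up straddling two chunks.
hard_cut = [text[i:i + 8] for i in range(0, len(text), 8)]

# Cut after an ideographic full stop or fullwidth comma instead,
# so each chunk ends at a clause boundary.
by_punctuation = re.findall(r"[^。，]+[。，]?", text)

print(hard_cut)
print(by_punctuation)
```

With punctuation available as a separator, chunks can end at clause boundaries rather than in the middle of a word.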