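# How to split text by tokens
Language models have a [token](/docs/concepts/tokens/) limit, and you should not exceed it. When you [split your text](/docs/concepts/text_splitters/) into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers; when you count the tokens in your text, use the same tokenizer that the language model uses.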
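
To see why the tokenizer has to match the model, the sketch below counts the same text with two stand-in tokenizers. Both functions are hypothetical illustrations, not real model tokenizers: one counts whitespace-separated words, the other counts UTF-8 bytes. The `token_limit` value is likewise made up.

```python
def count_word_tokens(text: str) -> int:
    # Hypothetical tokenizer A: one token per whitespace-separated word.
    return len(text.split())


def count_byte_tokens(text: str) -> int:
    # Hypothetical tokenizer B: one token per UTF-8 byte.
    return len(text.encode("utf-8"))


text = "Language models have a token limit."
token_limit = 10  # made-up budget for illustration

words = count_word_tokens(text)
byte_count = count_byte_tokens(text)

# The same text fits the budget under one tokenizer and not under the other.
print(words, words <= token_limit)
print(byte_count, byte_count <= token_limit)
```

A count taken with the wrong tokenizer can therefore report that a chunk fits when the model would see it as too long.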

## tiktoken
:::note
[tiktoken](https://github.com/openai/tiktoken) is a fast `BPE` tokenizer created by `OpenAI`.
:::

We can use `tiktoken` to estimate the number of tokens used. The estimate will probably be more accurate for OpenAI models.

1. How the text is split: by the character passed in.
2. How the chunk size is measured: by the `tiktoken` tokenizer.

[CharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html), [RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html), and [TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html) can be used with `tiktoken` directly.

In [None]:
%pip install --upgrade --quiet langchain-text-splitters tiktoken

In [1]:
from langchain_text_splitters import CharacterTextSplitter

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

To split with a [CharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html) and then merge chunks with `tiktoken`, use its `.from_tiktoken_encoder()` method. Note that chunks produced by this method can be larger than the chunk size measured by the `tiktoken` tokenizer.

The `.from_tiktoken_encoder()` method takes either `encoding_name` (e.g. `cl100k_base`) or `model_name` (e.g. `gpt-4`) as an argument. All other arguments, such as `chunk_size`, `chunk_overlap`, and `separator`, are used to instantiate `CharacterTextSplitter`.

In [6]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

In [3]:
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.
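
Why can a chunk exceed `chunk_size` with this approach? The following is a simplified sketch of the split-then-merge idea, not LangChain's actual implementation. It assumes a stand-in `count_tokens` function that counts whitespace-separated words instead of calling `tiktoken`, and a hypothetical helper `split_then_merge`.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


def split_then_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` tokens."""
    chunks: list[str] = []
    current: list[str] = []
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        if current and count_tokens(candidate) > chunk_size:
            # Adding this piece would overflow: close the current chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# The third paragraph alone has 12 tokens and contains no separator.
doc = "one two\n\nthree four\n\n" + " ".join(["long"] * 12) + "\n\nfive"
chunks = split_then_merge(doc, separator="\n\n", chunk_size=5)
sizes = [count_tokens(c) for c in chunks]
print(sizes)
```

Because the text is only ever cut at the separator, a piece that is already longer than `chunk_size` is emitted as it is; the token count decides how pieces are merged, not where they are cut.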


To enforce a hard limit on chunk size, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, which recursively splits any chunk that is larger than the specified size:

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
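
The recursive idea can be illustrated with a small standalone sketch. This is not the library's code: `count_tokens` is a word-counting stand-in for `tiktoken`, `recursive_split` is a hypothetical helper, and for brevity the sketch drops separators and does not merge small pieces back together.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split `text` until every piece has at most `chunk_size` tokens."""
    if count_tokens(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-size windows of words.
        words = text.split()
        return [
            " ".join(words[i : i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]
    first, *finer = separators
    chunks: list[str] = []
    for piece in text.split(first):
        if piece.strip():
            # Oversized pieces are split again with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    return chunks


doc = "one two three. four five six seven eight nine\n\nten eleven"
chunks = recursive_split(doc, separators=["\n\n", ". "], chunk_size=4)
print(chunks)
```

Each oversized piece is retried with a finer separator, so no chunk in the result exceeds the limit.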

We can also load a `TokenTextSplitter`, which works with `tiktoken` directly and ensures that each split is smaller than the chunk size.

In [8]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our
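
Splitting directly on tokens amounts to sliding a fixed-size window over the token sequence. The sketch below shows that windowing under stated assumptions: whitespace-separated words stand in for `tiktoken` token IDs, and `split_on_tokens` is a hypothetical helper rather than the library function.

```python
def split_on_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Cut a token sequence into windows of at most `chunk_size` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    windows = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        windows.append(tokens[start:end])
        if end == len(tokens):
            break
        # Step forward, keeping `chunk_overlap` tokens from the previous window.
        start += chunk_size - chunk_overlap
    return windows


# Stand-in tokenization: whitespace-separated words instead of token IDs.
tokens = "Madam Speaker, Madam Vice President, our First Lady".split()
windows = split_on_tokens(tokens, chunk_size=3, chunk_overlap=1)
chunks = [" ".join(window) for window in windows]
print(chunks)
```

Since the window is cut purely by position, a chunk boundary can fall anywhere in the text, including in the middle of a sentence.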


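Some written languages (e.g. Chinese and Japanese) have characters that encode to two or more tokens. Using `TokenTextSplitter` directly can split the tokens of a single character across two chunks, producing malformed Unicode characters. Use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` or `CharacterTextSplitter.from_tiktoken_encoder` to ensure that each chunk contains a valid Unicode string.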
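
The effect can be reproduced with the standard library alone. The sketch below is an analogy that assumes a byte-level tokenizer, where a token may hold only part of a character's UTF-8 bytes; here raw bytes stand in for tokens.

```python
text = "你好"  # two characters, three UTF-8 bytes each
data = text.encode("utf-8")

# Cut in the middle of the second character, as a token boundary might.
left, right = data[:4], data[4:]

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
left_text = left.decode("utf-8", errors="replace")
right_text = right.decode("utf-8", errors="replace")
print(repr(left_text), repr(right_text))

# Cutting on a character boundary keeps both halves valid.
print(data[:3].decode("utf-8"), data[3:].decode("utf-8"))
```

Both halves of the mid-character cut contain the replacement character U+FFFD, which is the kind of malformed output the character-based splitters avoid.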

## spaCy
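:::note
[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
:::

LangChain implements a text splitter based on the [spaCy tokenizer](https://spacy.io/api/tokenizer).

1. How the text is split: by the `spaCy` tokenizer.
2. How the chunk size is measured: by number of characters.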

In [None]:
%pip install --upgrade --quiet  spacy

In [1]:
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [4]:
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.  



Last year COVID-19 kept us apart.

This year we are finally together again. 



Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans. 



With a duty to one another to the American people to the Constitution. 



And with an unwavering resolve that freedom will always triumph over tyranny. 



Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated. 



He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined. 



He met the Ukrainian people. 



From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
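
The combination of "split by sentence, measure by characters" can be sketched without spaCy. The example assumes a naive regular expression as a stand-in for a real sentence tokenizer, and `split_sentences` and `pack_by_characters` are hypothetical helpers, not part of LangChain.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive stand-in for a sentence tokenizer: break after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_by_characters(
    sentences: list[str], chunk_size: int, separator: str = "\n\n"
) -> list[str]:
    """Greedily join sentences while the joined length stays within `chunk_size` characters."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Last year COVID-19 kept us apart. "
    "This year we are finally together again. "
    "Tonight, we meet as Americans."
)
chunks = pack_by_characters(split_sentences(text), chunk_size=80)
print(chunks)
```

Sentences are never cut internally; only the number of sentences packed into each chunk depends on the character budget.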


## SentenceTransformers
The [SentenceTransformersTokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html) is a specialized text splitter for use with sentence-transformer models. Its default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you would like to use.

To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a `SentenceTransformersTokenTextSplitter`. You can optionally specify:

- `chunk_overlap`: integer count of token overlap;
- `model_name`: sentence-transformer model name, defaulting to `"sentence-transformers/all-mpnet-base-v2"`;
- `tokens_per_chunk`: desired token count per chunk.

In [2]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)

2


In [4]:
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

tokens in text to split: 514


In [5]:
text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])

lorem


## NLTK
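:::note
[The Natural Language Toolkit](https://en.wikipedia.org/wiki/Natural_Language_Toolkit), or more commonly [NLTK](https://www.nltk.org/), is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.
:::

Rather than just splitting on "\n\n", we can use `NLTK` to split based on [NLTK tokenizers](https://www.nltk.org/api/nltk.tokenize.html).

1. How the text is split: by the `NLTK` tokenizer.
2. How the chunk size is measured: by number of characters.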

In [None]:
# pip install nltk

In [1]:
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [2]:
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)

In [3]:
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.

Last year COVID-19 kept us apart.

This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.

He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies.


## KoNLPY
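:::note
[KoNLPy: Korean NLP in Python](https://konlpy.org/en/latest/) is a Python package for natural language processing (NLP) of the Korean language.
:::

Token splitting segments text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements that are crucial for further processing and analysis. In languages like English, token splitting typically separates words by spaces and punctuation marks. How well it works depends largely on the tokenizer's understanding of the language structure, which is what ensures that the tokens are meaningful. Tokenizers designed for English are not equipped to understand the unique semantic structures of other languages, such as Korean, so they cannot be used effectively for Korean language processing.

### Token splitting for Korean with KoNLPy's Kkma Analyzer

For Korean text, KoNLPY includes a morphological analyzer called `Kkma` (Korean Knowledge Morpheme Analyzer). `Kkma` provides detailed morphological analysis of Korean text. It breaks sentences down into words and words into their respective morphemes, identifying the part of speech for each token. It can also segment a block of text into individual sentences, which is particularly useful when processing long texts.

### Usage Considerations

While `Kkma` is renowned for its detailed analysis, this precision may affect processing speed. `Kkma` is therefore best suited to applications where analytical depth is prioritized over rapid text processing.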

In [28]:
# pip install konlpy

In [23]:
# This is a long Korean document that we want to split up into its component sentences.
with open("./your_korean_doc.txt") as f:
    korean_document = f.read()

In [24]:
from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()

In [37]:
texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])

춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.

그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.

한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.

춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.

어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.

두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.

하지만 좋은 날들은 오래가지 않았다.

도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.

이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.

그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.

춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.

이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.

이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.

두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.

- 춘향전 (The Tale of Chunhyang)


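## Hugging Face tokenizer
[Hugging Face](https://huggingface.co/docs/tokenizers/index) has many tokenizers.

We use the Hugging Face tokenizer [GPT2TokenizerFast](https://huggingface.co/Ransaka/gpt2-tokenizer-fast) to count the text length in tokens.

1. How the text is split: by the character passed in.
2. How the chunk size is measured: by the number of tokens calculated by the `Hugging Face` tokenizer.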

In [1]:
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [2]:
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter

In [3]:
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

In [4]:
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.
