## Text Chunking

### 1. Text Extraction

In [None]:
import pypdf

pdfreader = pypdf.PdfReader('../docs/deepseek-r1.pdf')
num_pages = len(pdfreader.pages)
print(f'Number of pages: {num_pages}')

page = pdfreader.pages[0]
text = page.extract_text()
print(f'First page text:\n{text[:1000]}...')

Number of pages: 22
First page text:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the
research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models
(1.5B, 7B, 8B, 14B, 32B, 70

In [None]:
all_pages_text = "".join([page.extract_text() for page in pdfreader.pages])
print(f'All pages text:\n{all_pages_text[:1000]}...')

All pages text:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the
research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models
(1.5B, 7B, 8B, 14B, 32B, 70B) distilled from Dee

### 2. Text Splitting

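Split the input text into chunks, each shorter than a preset chunk size. The splitting proceeds in stages:
1. Split the text into paragraphs on the paragraph separator, `"\n\n\n"` by default.
2. Split each paragraph into sentences with the NLTK sentence tokenizer.
3. Split the text at punctuation marks with a regular expression. The usual default pattern is `"[^,\.;]+[,\.;]?"`; the `CHUNKING_REGEX` defined in the next cell extends it with the Chinese punctuation marks `。？！`.
4. Where the default separator, a single space `" "`, is present, split further on it.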
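
The staged fallback described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation. It assumes chunk size is measured in characters, skips the NLTK sentence step to stay dependency-free, and drops the paragraph and space separators for brevity. The names `SPLITTERS` and `split_to_fit` are hypothetical.

```python
import re
from typing import Callable, List

PARAGRAPH_SEP = "\n\n\n"
CHUNKING_REGEX = "[^,.;。？！]+[,.;。？！]?|[,.;。？！]"

# Ordered from coarsest to finest. Step 2 (sentence tokenizer) is omitted here.
SPLITTERS: List[Callable[[str], List[str]]] = [
    lambda t: t.split(PARAGRAPH_SEP),          # step 1: paragraphs
    lambda t: re.findall(CHUNKING_REGEX, t),   # step 3: punctuation
    lambda t: t.split(" "),                    # step 4: spaces
]

def split_to_fit(text: str, chunk_size: int, level: int = 0) -> List[str]:
    """Split text with progressively finer splitters until pieces fit.

    A piece that still exceeds chunk_size after the last splitter is
    returned unchanged, since this sketch has no finer rule to apply.
    """
    if len(text) <= chunk_size or level >= len(SPLITTERS):
        return [text]
    pieces: List[str] = []
    for part in SPLITTERS[level](text):
        if part:  # skip empty strings produced by adjacent separators
            pieces.extend(split_to_fit(part, chunk_size, level + 1))
    return pieces

print(split_to_fit("alpha, beta; gamma.\n\n\nshort", 8))
print(split_to_fit("one two three four", 5))
```

A finer splitter runs only on pieces that are still too long, so a short paragraph stays intact while a long one is broken at punctuation and, failing that, at spaces.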

In [3]:
DEFAULT_PARAGRAPH_SEP = "\n\n\n"
CHUNKING_REGEX = "[^,.;。？！]+[,.;。？！]?|[,.;。？！]"

In [4]:
from typing import Callable, List, Optional, Tuple

def split_by_paragraph_sep(text: str, sep: str, keep_sep: bool = True) -> List[str]:
    """
    Splits text by a separator.

    Args:
        text (str): The text to split.
        sep (str): The separator to split on.
        keep_sep (bool, optional): Whether to keep the separator in the output. Defaults to True.

    Returns:
       List[str]: A list of split strings.

    """

    if keep_sep:
        parts = text.split(sep)
        result = [sep + s if i > 0 else s for i, s in enumerate(parts)]
        return [s for s in result if s]
    else:
        return text.split(sep)

In [8]:
paragraphs = split_by_paragraph_sep(all_pages_text, DEFAULT_PARAGRAPH_SEP, keep_sep=True)
# print(paragraphs[0])

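Use NLTK's sentence tokenizer to split paragraphs into sentences. After installing NLTK, download the `popular` data package; otherwise the code below raises an error.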

pip install --user -U nltk

python -m nltk.downloader popular

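NLTK's sentence tokenizer reports the start and end character indices `[start_idx, end_idx]` of each sentence. These intervals are called spans, and they can be used to slice the text into pieces.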
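
To make the idea of spans concrete, here is a small dependency-free sketch. `toy_span_tokenize` is a hypothetical regex-based stand-in, not NLTK's Punkt algorithm. It only mimics the shape of the result, a list of `(start, end)` character offsets that exclude the whitespace between sentences. The example compares slicing by each span's own end with slicing up to the next span's start.

```python
import re
from typing import List, Tuple

def toy_span_tokenize(text: str) -> List[Tuple[int, int]]:
    """Hypothetical stand-in for a span tokenizer: one span per run of
    text ending in '.', '!' or '?', excluding leading whitespace."""
    return [m.span() for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text)]

text = "First sentence.  Second one!\nThird"
spans = toy_span_tokenize(text)

# Slicing by each span's own end drops the whitespace between sentences.
tight = [text[start:end] for start, end in spans]

# Slicing up to the next span's start keeps that whitespace attached to
# the preceding sentence, so the pieces concatenate back to the original.
starts = [start for start, _ in spans]
ends = starts[1:] + [len(text)]
padded = [text[start:end] for start, end in zip(starts, ends)]

print(spans)
print(tight)
print(padded)
print("".join(padded) == text)
```

The second slicing style loses no characters, which matters if the pieces are later merged back into chunks.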

In [9]:
from nltk.corpus import stopwords
from nltk.tokenize import PunktSentenceTokenizer

stop_words = set(stopwords.words('english'))
tokenizer = PunktSentenceTokenizer()

def split_by_sentence_tokenizer(text: str, tokenizer: PunktSentenceTokenizer) -> List[str]:
    """
    Splits text by a sentence tokenizer.

    Args:
        text (str): The text to split.
        tokenizer (PunktSentenceTokenizer): The tokenizer to use.

    Returns:
       List[str]: A list of split strings.

    """
    
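    # Each span is a (start, end) pair of character offsets for one sentence.
    spans = list(tokenizer.span_tokenize(text))
    sentences = []
    for i, span in enumerate(spans):
        start = span[0]
        if i < len(spans) - 1:
            # End at the next sentence's start so the whitespace between
            # sentences stays attached to the current sentence.
            end = spans[i + 1][0]
        else:
            # The last sentence runs to the end of the text.
            end = len(text)
        sentences.append(text[start:end])
    return sentences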

In [12]:
sentences = split_by_sentence_tokenizer(paragraphs[0], tokenizer)
print(sentences[0])
print('\n'+sentences[1])

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.


DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.

