## Semantic Chunking
The code below is adapted from LlamaIndex's `llama-index-core/llama_index/core/node_parser/text/semantic_splitter.py`.

### 1. Text Extraction

In [92]:
import pypdf

pdfreader = pypdf.PdfReader('../docs/deepseek-r1.pdf')
num_pages = len(pdfreader.pages)
print(f'Number of pages: {num_pages}')

page = pdfreader.pages[0]
text = page.extract_text()
print(f'First page text:\n{text[:1000]}...')

Number of pages: 22
First page text:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the
research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models
(1.5B, 7B, 8B, 14B, 32B, 70

In [93]:
all_pages_text = "".join([page.extract_text() for page in pdfreader.pages])
print(f'All pages text:\n{all_pages_text[:1000]}...')

All pages text:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the
research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models
(1.5B, 7B, 8B, 14B, 32B, 70B) distilled from Dee

### 2. Sentence Splitting

Use NLTK's `PunktSentenceTokenizer` to split the text into sentences.

In [94]:
from typing import List  # needed for the List[str] annotation below

from nltk.corpus import stopwords
from nltk.tokenize import PunktSentenceTokenizer

stop_words = set(stopwords.words('english'))
tokenizer = PunktSentenceTokenizer()

def split_by_sentence_tokenizer(text: str, tokenizer: PunktSentenceTokenizer) -> List[str]:
    """
    Splits text by a sentence tokenizer.

    Args:
        text (str): The text to split.
        tokenizer (PunktSentenceTokenizer): The tokenizer to use.

    Returns:
       List[str]: A list of split strings.

    """
    
    spans = list(tokenizer.span_tokenize(text))
    sentences = []
    for i, span in enumerate(spans):
        start = span[0]
        if i < len(spans) - 1:
            end = spans[i + 1][0]
        else:
            end = len(text)
        sentences.append(text[start:end])
    return sentences
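
The function above ends each sentence at the start of the next one instead of at its own span end, so the whitespace between sentences stays attached to the preceding sentence and concatenating the pieces reproduces the text from the first sentence onward. The sketch below illustrates that idea only. It assumes a regular expression as a rough stand-in for `tokenizer.span_tokenize`, and `sentences_from_spans` is a hypothetical helper name, not part of NLTK or LlamaIndex.

```python
import re
from typing import List, Tuple


def sentences_from_spans(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Cut text so that each piece runs from its span start to the next span start."""
    pieces = []
    for i, (start, _end) in enumerate(spans):
        # The last piece runs to the end of the text.
        next_start = spans[i + 1][0] if i < len(spans) - 1 else len(text)
        pieces.append(text[start:next_start])
    return pieces


text = "First sentence.  Second one!\nThird?"

# Assumption: a crude regex replaces the Punkt tokenizer here; like span_tokenize,
# it yields (start, end) spans that exclude the whitespace between sentences.
spans = [(m.start(), m.end()) for m in re.finditer(r"\S[^.!?]*[.!?]", text)]

pieces = sentences_from_spans(text, spans)
print(pieces)
print("".join(pieces) == text)
```

Keeping the separators inside the pieces is what lets the later steps rebuild chunks with a plain `"".join(...)`.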

In [107]:
sentences = split_by_sentence_tokenizer(all_pages_text, tokenizer)
print(sentences[0])
print('\n' + sentences[1])

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.


DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.



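### 3. Sentence Grouping

Each sentence is combined with the sentences before and after it to give it wider context. For example, sentence 2 is grouped with sentences 1 and 3. This implementation defaults to `BUFFER_SIZE = 1`, i.e. one sentence on each side, so that the similarity between neighbouring sentences can be assessed independently at each position.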

In [110]:
from typing import List

BUFFER_SIZE = 1

def build_sentence_groups(sentences: List[str], buffer_size: int=BUFFER_SIZE) -> List[str]:
    """ 
    Create a buffer by combining each sentence with its previous and next sentence 
    to provide a wider context. 

    Args:
        sentences (List[str]): The list of sentences to be combined
        buffer_size (int, optional): The number of sentences to include on each side
                                     of the current sentence. Defaults to 1.
    
    Returns:
        list[str]: The list of combined sentences
    """

    combined_sentences = []
    for i in range(len(sentences)):
        combined_sentence = ""

        for j in range(i - buffer_size, i):
            if j >= 0:
                combined_sentence += sentences[j]
        
        combined_sentence += sentences[i]

        for j in range(i + 1, i + buffer_size + 1):
            if j < len(sentences):
                combined_sentence += sentences[j]

        combined_sentences.append(combined_sentence)
    return combined_sentences
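
To make the windowing concrete, here is a toy run of the same idea written with list slices. `window_groups` is a hypothetical name used only for this illustration; it is meant to produce the same groups as `build_sentence_groups` above.

```python
from typing import List


def window_groups(sentences: List[str], buffer_size: int = 1) -> List[str]:
    """Join each sentence with up to buffer_size neighbours on each side."""
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)                    # clamp at the first sentence
        end = min(len(sentences), i + buffer_size + 1)     # clamp at the last sentence
        groups.append("".join(sentences[start:end]))
    return groups


toy = ["S1. ", "S2. ", "S3. ", "S4. "]
groups = window_groups(toy)
for g in groups:
    print(repr(g))
```

The first and last groups are shorter because there is no neighbour on one side, and consecutive groups share sentences.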

In [112]:
combined_sentences = build_sentence_groups(sentences)
print(combined_sentences[0])

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.



### 4. Text Embedding
Qwen3-Embedding-0.6B-Q8_0 converts the text into vectors so that cosine similarity can be computed. LM Studio serves as the inference engine.

In [113]:
from openai import OpenAI
from tqdm import tqdm

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def get_text_embedding(text: str, model: str="model-identifier") -> List[float]:
    """
    Get the embedding of the text using Qwen3-Embedding-0.6B-Q8_0.

    Args:
        text (str): The text to be embedded
        model (str): The model identifier passed to the embeddings endpoint

    Returns:
        list[float]: The embedding of the text
    """

    # Newlines are replaced with spaces before the text is sent to the model
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def get_text_embedding_batch(texts: List[str], model: str="model-identifier") -> List[List[float]]:
    """
    Get the text embeddings of the whole text batch

    Args:
        texts (list[str]): The text batch to be embedded
        model (str): The model identifier passed to the embeddings endpoint

    Returns:
        list[list[float]]: The text embeddings of the whole text batch
    """

    # One request per text, with a progress bar
    return [get_text_embedding(text, model) for text in tqdm(texts, desc="Text Embedding...")]

In [114]:
embs = get_text_embedding_batch(sentences)

Text Embedding...: 100%|██████████| 912/912 [00:34<00:00, 26.70it/s]
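
The loop above sends one request per text (912 requests in this run). The OpenAI embeddings endpoint accepts a list of inputs, so several texts can in principle share one request. The sketch below shows that batching pattern under two assumptions: that the local LM Studio server accepts list input in the same way, and that results come back in input order. `embed_in_batches`, `embed_fn`, and `fake_embed` are hypothetical names for illustration.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        # Same newline handling as get_text_embedding
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        vectors.extend(embed_fn(batch))
    return vectors


# Assumption: with the notebook's client, embed_fn could look like this:
#   lambda batch: [d.embedding for d in
#                  client.embeddings.create(input=batch, model="model-identifier").data]

def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder so the sketch runs without a server."""
    return [[float(len(t))] for t in batch]


vectors = embed_in_batches(["a", "bb\ncc", "ddd"], fake_embed, batch_size=2)
print(vectors)
```

Whether batching is actually faster depends on the server, so it is worth timing both variants before switching.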


In [115]:
print(f'Total number of embeddings: {len(embs)}')
print(f'The size of embeddings: {len(embs[0])}')
print(f'\nThe initial 10 entries in the first embedding: {embs[0][:10]}')
print(f'\nThe initial 10 entries in the second embedding: {embs[1][:10]}')

Total number of embeddings: 912
The size of embeddings: 1024

The initial 10 entries in the first embedding: [0.017176339402794838, -0.00018442852888256311, -0.00408159988000989, -0.030826974660158157, -0.020394911989569664, -0.008923179470002651, -0.03861364722251892, 0.02715856023132801, -0.041088756173849106, 0.01844685897231102]

The initial 10 entries in the second embedding: [0.0025928406976163387, -0.008726261556148529, -0.0059125348925590515, -0.0328126922249794, -0.011781608685851097, 0.025124231353402138, -0.033701092004776, 0.02491258829832077, -0.04278669133782387, 0.04132683202624321]


### 5. Computing Similarity

In [116]:
import numpy as np

def get_similarity(emb1: List[float], emb2: List[float]) -> float:
    """
    Get the similarity between two embeddings using the cosine similarity

    Args:
        emb1 (list[float]): The first embedding
        emb2 (list[float]): The second embedding

    Returns:
        float: The similarity between the two embeddings
    """

    # np.asarray accepts plain lists and existing arrays alike
    emb1 = np.asarray(emb1)
    emb2 = np.asarray(emb2)

    product = np.dot(emb1, emb2)
    norm = np.linalg.norm(emb1) * np.linalg.norm(emb2)
    return product / norm

def calculate_distances_between_embeddings(embs: List[List[float]]) -> List[float]:
    """
    Calculate the distances between two consecutive embeddings in the list

    Args:
        embs (list[list[float]]): The list of embeddings to be compared

    Returns:
        list[float]: The list of distances between all pairs of consecutive embeddings
    """

    distances = []
    for i in tqdm(range(len(embs) - 1), desc="Calculating distances between pairs of embeddings..."):
        emb_current = embs[i]
        emb_next = embs[i + 1]

        similarity = get_similarity(emb_current, emb_next)
        distance = 1 - similarity # using the 1 - cosine similarity to measure distance

        distances.append(distance)
    return distances
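
The distance used above is `1 - cosine similarity`, i.e. one minus the dot product divided by the product of the two norms. It is 0 for vectors pointing in the same direction, 1 for orthogonal vectors, and 2 for opposite vectors. The sketch below works through those three cases in pure Python, without NumPy, to make the arithmetic explicit; `cosine_distance` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(a, b) for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])     # same direction, different length
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])   # no overlap
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])    # opposite direction

print(parallel, orthogonal, opposite)
```

Because only the direction matters, the length of an embedding does not affect the distance.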

In [117]:
embs_distances = calculate_distances_between_embeddings(embs)

Calculating distances between pairs of embeddings...: 100%|██████████| 911/911 [00:00<00:00, 11245.87it/s]


In [118]:
print(f'Total number of distances calculated: {len(embs_distances)}')
print(f'\nThe distance between the first and the second embedding is {embs_distances[0]}')
print(f'\nThe distance between the second and the third embedding is {embs_distances[1]}')

Total number of distances calculated: 911

The distance between the first and the second embedding is 0.15076271604693836

The distance between the second and the third embedding is 0.13765858062665015


### 6. Merging Text Chunks by Similarity

In [123]:
def build_text_chunk(combined_sentences: List[str], 
                     embs_distances: List[float], 
                     threshold: float) -> List[str]:
    """
    Build text chunks based on the combined sentences and the distances between embeddings.
    If the distance between two consecutive embeddings is greater than a threshold,
    it means that the two sentences are not similar enough to be merged into one chunk.
    Therefore, we need to split the combined sentences into separate chunks.
    
    Args:
        combined_sentences (List[str]): The combined sentences.
        embs_distances (List[float]): The distances between embeddings.
        threshold (float): The percentile (0-100) of the distances used as the breakpoint threshold; a distance above that percentile value ends the current chunk.

    Returns:
        List[str]: The text chunks.
    """

    chunks = []
    if len(embs_distances) > 0:
        breakpoint_distance_threshold = np.percentile(embs_distances, threshold)

        indices_above_threshold = [
            i for i, x in enumerate(embs_distances) if x > breakpoint_distance_threshold
        ]

        start_idx = 0
        for idx in tqdm(indices_above_threshold, desc="Building Text Chunks..."):
            group = combined_sentences[start_idx : idx + 1]
            combined_text = "".join(group)
            chunks.append(combined_text)

            start_idx = idx + 1

        if start_idx < len(combined_sentences):
            combined_text = "".join(combined_sentences[start_idx:])
            chunks.append(combined_text)
    else:
        chunks.append("".join(combined_sentences))

    return chunks

In [124]:
THRESHOLD = 95

text_chunks = build_text_chunk(combined_sentences, embs_distances, threshold=THRESHOLD)

Building Text Chunks...: 100%|██████████| 45/45 [00:00<?, ?it/s]
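
Note that `threshold` is a percentile, not a raw distance: `np.percentile(embs_distances, 95)` returns the value below which 95% of the distances fall, and every distance above that value marks a breakpoint. In the run above, 45 breakpoints yield 46 chunks. The toy sketch below shows how the breakpoints are found. `percentile_linear` is a hypothetical pure-Python helper that is assumed to mirror the default linear interpolation of `np.percentile`, and the distances are made-up numbers.

```python
from typing import List, Sequence


def percentile_linear(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between sorted neighbours."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0      # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


# Made-up distances between 8 consecutive sentences (7 gaps)
distances: List[float] = [0.10, 0.12, 0.80, 0.11, 0.09, 0.75, 0.10]

cutoff = percentile_linear(distances, 70)
breakpoints = [i for i, d in enumerate(distances) if d > cutoff]

print(f"cutoff = {cutoff:.3f}")
print(f"breakpoints = {breakpoints}")
```

A breakpoint at index `i` means the chunk ends after sentence `i`, which is why the loop in `build_text_chunk` slices up to `idx + 1`.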


In [125]:
print(f'Total number of Chunks: {len(text_chunks)}')
print(f'\nThe chunk: \n{text_chunks[0]}')

Total number of Chunks: 46

The chunk: 
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. DeepSeek-R1-Zero, a model trained via large-scale rein
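
The chunk above repeats text because each entry of `combined_sentences` already contains its neighbours, so joining consecutive entries emits the same sentence up to three times. If the goal is non-overlapping chunks, one option is to use the same breakpoints but join the original `sentences` instead. The sketch below assumes that goal; `join_sentences_at_breakpoints` is a hypothetical helper and the inputs are toy data.

```python
from typing import List


def join_sentences_at_breakpoints(sentences: List[str], breakpoints: List[int]) -> List[str]:
    """Split sentences after each breakpoint index and join every group once."""
    chunks = []
    start = 0
    for idx in breakpoints:
        chunks.append("".join(sentences[start:idx + 1]))
        start = idx + 1
    if start < len(sentences):
        # Remaining sentences after the last breakpoint
        chunks.append("".join(sentences[start:]))
    return chunks


toy_sentences = ["A. ", "B. ", "C. ", "D. ", "E."]
toy_chunks = join_sentences_at_breakpoints(toy_sentences, [1, 3])
print(toy_chunks)
```

With this variant, concatenating all chunks gives back the input text exactly once.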

This completes the semantic splitting of the text and the construction of the text chunks.

For easier reuse, the code above will be wrapped in a class and added to `nanoidx/semantic_splitter_node_parser.py`.