## Semantic Embedding

In [1]:
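import pypdf

# Load the DeepSeek-R1 paper from the local docs directory.
pdfreader = pypdf.PdfReader('../docs/deepseek-r1.pdf')

# Extract the text of the first page on its own.
page = pdfreader.pages[0]
text = page.extract_text()

# Concatenate the text of every page into one string and preview the first 1000 characters.
all_pages_text = "".join([page.extract_text() for page in pdfreader.pages])
print(f'All pages text:\n{all_pages_text[:1000]}...')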

All pages text:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the
research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models
(1.5B, 7B, 8B, 14B, 32B, 70B) distilled from Dee

Split the text into chunks with the pre-built `SemanticSplitterNodeParser` wrapper.

In [2]:
import sys
sys.path.append('..')
from nanoidx.semantic_splitter_node_parser import SemanticSplitterNodeParser

In [3]:
semantic_parser = SemanticSplitterNodeParser()
chunks = semantic_parser.split_text(all_pages_text)

Text Embedding...: 100%|██████████| 912/912 [00:44<00:00, 20.41it/s]
Calculating distances between pairs of embeddings...: 100%|██████████| 911/911 [00:00<00:00, 10322.15it/s]
Building Text Chunks...: 100%|██████████| 46/46 [00:00<00:00, 45730.74it/s]
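
The progress bars above report 912 embedded pieces of text, 911 pairwise distances, and 46 built chunks. Those counts are consistent with a common semantic-splitting scheme: embed each sentence, measure the distance between neighbouring sentences, and start a new chunk wherever that distance is unusually large. The internals of `nanoidx` are not shown here, so the following is only a minimal sketch of that general idea. The function names, the cosine distance, and the percentile threshold are all assumptions, not the library's actual code.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_split(sentences, embeddings, breakpoint_percentile=95.0):
    """Group consecutive sentences, breaking where neighbours are far apart.

    Hypothetical helper: the threshold rule is an assumption for illustration.
    """
    if len(sentences) != len(embeddings):
        raise ValueError("sentences and embeddings must have the same length")
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # N sentences give N - 1 distances between adjacent pairs.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, breakpoint_percentile)

    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            # A large jump in meaning closes the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

With four sentences whose embeddings form two tight pairs, `semantic_split` would place the single break between the pairs.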


### 1. Obtain embeddings with an embedding model

As before, we use the Qwen3-Embedding-0.6B-Q8_0 model, with LM Studio as the inference engine.
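
For orientation, LM Studio can expose a local OpenAI-compatible HTTP server, so an embedding request can be sketched with the standard library alone. How `get_text_embedding_batch` talks to LM Studio is not shown in this notebook. The URL below is the usual local default, and both the model identifier and the helper names are assumptions to adjust to your own setup.

```python
import json
import urllib.request

# Assumed values: check the server tab in LM Studio for the real ones.
LMSTUDIO_URL = "http://localhost:1234/v1/embeddings"
MODEL_NAME = "Qwen3-Embedding-0.6B-Q8_0"


def build_embedding_request(texts, model=MODEL_NAME, url=LMSTUDIO_URL):
    """Build a POST request in the OpenAI-style embeddings format."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body):
    """Return the embedding vectors ordered by their input index."""
    data = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda item: item["index"])]


def embed_texts(texts, timeout=60):
    """Send the texts to the local server and return one vector per text."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding_response(response.read())
```

Calling `embed_texts(["some text"])` requires the LM Studio server to be running with an embedding model loaded. The request-building and parsing helpers work offline.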

In [4]:
embs = semantic_parser.get_text_embedding_batch(chunks)

Text Embedding...: 100%|██████████| 47/47 [00:22<00:00,  2.11it/s]


### 2. Use dictionaries to associate each text chunk with its embedding

In [6]:
assert len(chunks) == len(embs)

In [7]:
chunk_embs_dict_list = [
    {
        'id': i,
        'chunk': chunk,
        'embedding': emb,
    }
    for i, (chunk, emb) in enumerate(zip(chunks, embs))
]
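
One benefit of keeping `id`, `chunk`, and `embedding` together in each record is that a similarity lookup can return the matching text directly. The sketch below ranks records of that shape against a query vector by cosine similarity. `top_k_chunks` and `cosine_similarity` are hypothetical helpers written for illustration, and the query embedding is assumed to come from the same embedding model as the chunks.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding, records, k=3):
    """Return (score, id, chunk) for the k records most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, record["embedding"]), record)
        for record in records
    ]
    # Sort on the score only, so the record dictionaries are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, record["id"], record["chunk"]) for score, record in scored[:k]]
```

With the list built above, a call would look like `top_k_chunks(query_emb, chunk_embs_dict_list, k=3)`, where `query_emb` is the embedding of the question text.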

In [8]:
chunk_embs_dict_list[0]

{'id': 0,
 'chunk': 'DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via\nReinforcement Learning\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.\nDeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-\nvised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.\nDeepSeek-R1: Incentivizing Reasoning Capability in LLMs via\nReinforcement Learning\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.\nDeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-\nvised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.\nThrough RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing\nreasoning behaviors. DeepSeek-R1-Zero, a model trained via large-scale reinfo