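# Introduction to Semantic Chunking
Text chunking is a key step in retrieval-augmented generation (RAG). Unlike fixed-length chunking, semantic chunking splits the text at points where the meaning shifts, so that each chunk is semantically coherent. The aim is to improve retrieval accuracy.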

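## Methods for Finding Breakpoints
*   **Percentile**: finds the Xth percentile of all similarity differences and splits wherever the drop in similarity is greater than that value.
*   **Standard Deviation**: splits wherever similarity falls more than X standard deviations below the mean.
*   **Interquartile Range (IQR)**: uses the interquartile range (Q3 - Q1, the third quartile minus the first) to determine the split points.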
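
The three rules above can be sketched on a toy list of similarities. This is a minimal illustration rather than the notebook's implementation: the `percentile` and `breakpoints_below` helpers and the sample values are hypothetical, the drop is read as `1 - similarity`, and only the standard library is used so the snippet runs without the notebook's dependencies.

```python
import statistics

def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def breakpoints_below(similarities, cutoff):
    """Indices whose similarity falls below the cutoff."""
    return [i for i, s in enumerate(similarities) if s < cutoff]

# Hypothetical similarities between consecutive sentences (two clear dips)
similarities = [0.80, 0.75, 0.30, 0.78, 0.82, 0.79, 0.25, 0.81, 0.77, 0.80, 0.76]

# Percentile with X = 80: a drop (1 - s) above its 80th percentile is the same
# as a similarity below its 20th percentile.
pct_cutoff = percentile(similarities, 100 - 80)

# Standard deviation with X = 1: one (population) standard deviation below the mean.
std_cutoff = statistics.mean(similarities) - 1 * statistics.pstdev(similarities)

# IQR: the usual lower outlier fence, Q1 - 1.5 * (Q3 - Q1).
q1, q3 = percentile(similarities, 25), percentile(similarities, 75)
iqr_cutoff = q1 - 1.5 * (q3 - q1)

for name, cutoff in [("percentile", pct_cutoff),
                     ("standard_deviation", std_cutoff),
                     ("interquartile", iqr_cutoff)]:
    print(name, round(cutoff, 4), breakpoints_below(similarities, cutoff))
```

On this toy data all three rules flag positions 2 and 6, the two dips. On real embeddings the rules usually disagree, so the choice of method and X is a tuning decision.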

Import the required libraries

In [2]:
import pymupdf
import os
import numpy as np
import json
import openai

Extract text from the PDF

In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text from every page of a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: The text extracted from the PDF.

    """
    # Open the PDF file
    mypdf = pymupdf.open(pdf_path)
    all_text = ""  # Accumulates the extracted text

    # Iterate over every page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract the page's plain text
        all_text += text  # Append it to all_text

    return all_text  # Return the extracted text

pdf_path = "data/AI_Information.pdf"


extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


Configure the client

In [4]:
client = openai.OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If the environment variable is not set, put your API key here instead
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Bailian (DashScope) service
)

Create sentence-level embeddings

In [5]:
def create_embeddings_in_batches(text_chunks, model="text-embedding-v3", batch_size_limit=10):
    """
    Create embeddings for a list of texts through the OpenAI-compatible embeddings API,
    sending the texts in batches to respect the API's batch-size limit.

    Args:
    text_chunks (List[str]): Text strings to embed.
    model (str): Embedding model to use.
    batch_size_limit (int): Maximum number of inputs per request. Per the API's error message, the limit here is 10.

    Returns:
    List[List[float]]: Embedding vectors for all texts.
    """
    all_embeddings = []
    if not text_chunks:
        return []

    if not isinstance(text_chunks, list):  # Make sure the input is a list
        text_chunks = [text_chunks]

    for i in range(0, len(text_chunks), batch_size_limit):
        batch = text_chunks[i:i + batch_size_limit]
        try:
            print(f"Processing batch {i//batch_size_limit + 1}, size: {len(batch)}")
            response = client.embeddings.create(
                input=batch,
                model=model,
                encoding_format="float"
            )
            # Extract this batch's embedding vectors from the response
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error processing batch starting with chunk: '{batch[0][:50]}...'")
            print(f"API Error: {e}")
            raise  # Re-raise with the original traceback

    return all_embeddings

# Naive sentence split on ". "; the delimiter itself is dropped from each sentence
sentences = extracted_text.split(". ")
embeddings = create_embeddings_in_batches(sentences)
print(f"Generated {len(embeddings)} sentence embeddings.")

Processing batch 1, size: 10
Processing batch 2, size: 10
Processing batch 3, size: 10
Processing batch 4, size: 10
Processing batch 5, size: 10
Processing batch 6, size: 10
Processing batch 7, size: 10
Processing batch 8, size: 10
Processing batch 9, size: 10
Processing batch 10, size: 10
Processing batch 11, size: 10
Processing batch 12, size: 10
Processing batch 13, size: 10
Processing batch 14, size: 10
Processing batch 15, size: 10
Processing batch 16, size: 10
Processing batch 17, size: 10
Processing batch 18, size: 10
Processing batch 19, size: 10
Processing batch 20, size: 10
Processing batch 21, size: 10
Processing batch 22, size: 10
Processing batch 23, size: 10
Processing batch 24, size: 10
Processing batch 25, size: 10
Processing batch 26, size: 8
Generated 258 sentence embeddings.
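
The batch counts in the log above follow directly from stepping through the sentence list in strides of `batch_size_limit`. A small sketch of that arithmetic, using a hypothetical `batch_sizes` helper that makes no API calls:

```python
def batch_sizes(n_items, batch_size_limit=10):
    """Size of each batch when n_items are sliced in strides of batch_size_limit."""
    return [min(batch_size_limit, n_items - start)
            for start in range(0, n_items, batch_size_limit)]

sizes = batch_sizes(258)
print(len(sizes), sizes[-1])  # 26 8
```

With 258 sentences this gives 25 full batches of 10 plus a final batch of 8, matching the 26 batches logged.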


Compute cosine similarity

In [6]:
def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Args:
    vec1 (numpy.ndarray): The first vector.
    vec2 (numpy.ndarray): The second vector.

    Returns:
    float: The cosine similarity.

    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compute similarity between consecutive sentences
similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]
print(similarities)

[np.float64(0.6013919200673946), np.float64(0.523915742179365), np.float64(0.5992367054894817), np.float64(0.5681652381873313), np.float64(0.6467175552026007), np.float64(0.6481387924480176), np.float64(0.4759750128504935), np.float64(0.5732421982853909), np.float64(0.5287207011143132), np.float64(0.697532561722569), np.float64(0.733507284579057), np.float64(0.43262204394390386), np.float64(0.28890282858367256), np.float64(0.6558550595232097), np.float64(0.5844338512977106), np.float64(0.6107934390675273), np.float64(0.5436887130330174), np.float64(0.4622488446977352), np.float64(0.5016057216882477), np.float64(0.44435236530646904), np.float64(0.7079794035709691), np.float64(0.5919520924416597), np.float64(0.5037783024518371), np.float64(0.47439976357390534), np.float64(0.5013646767771311), np.float64(0.5819812641586012), np.float64(0.6175757205905621), np.float64(0.5956586973274686), np.float64(0.5947336486016098), np.float64(0.6077561880239367), np.float64(0.5509006531578936), np.flo
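
To make the scale of these values concrete, here is a small worked example of the same formula on vectors whose answer is known in advance. It is only a sketch: `cosine_similarity_plain` is a hypothetical standard-library stand-in for the NumPy version above, and the vectors are made up.

```python
import math

def cosine_similarity_plain(vec1, vec2):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

parallel = cosine_similarity_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])          # unrelated directions
opposite = cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0])           # opposite directions

print(parallel, orthogonal, opposite)
```

Vectors pointing the same way score about 1, orthogonal vectors score 0, and opposite vectors score -1. Because each similarity compares sentence `i` with sentence `i + 1`, a list of n embeddings produces n - 1 similarities.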

Chunk by semantic similarity

In [None]:
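def compute_breakpoints(similarities, method="percentile", threshold=90):
    """
    Compute breakpoints from a list of consecutive-sentence similarities.

    Args:
    similarities (List[float]): Similarity between each pair of adjacent sentences.
    method (str): How to derive the cutoff: "percentile", "standard_deviation", or "interquartile".
    threshold (float): The X value. A percentile (0-100) for "percentile"; the number of
        standard deviations below the mean for "standard_deviation" (pass a small value,
        not the default); unused for "interquartile".

    Returns:
    List[int]: Indices i where the similarity between sentence i and sentence i + 1
        falls below the cutoff.

    """
    if method == "percentile":
        # Splitting where the drop (1 - similarity) exceeds its Xth percentile is the
        # same as splitting where similarity is below its (100 - X)th percentile,
        # so X = 90 flags roughly the 10% largest drops.
        threshold_value = np.percentile(similarities, 100 - threshold)
    elif method == "standard_deviation":
        # Cutoff is X standard deviations below the mean similarity
        mean = np.mean(similarities)
        std_dev = np.std(similarities)
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Cutoff is the lower outlier fence: Q1 - 1.5 * IQR
        q1, q3 = np.percentile(similarities, [25, 75])
        threshold_value = q1 - 1.5 * (q3 - q1)
    else:
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # A breakpoint is any position whose similarity falls below the cutoff
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]

# Compute breakpoints using the percentile method with a threshold of 90
breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)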
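
Once the breakpoints are known, the sentences can be grouped into chunks. The following is a minimal sketch of one way to do that, not necessarily the approach this notebook takes: `split_into_chunks` and the toy data are hypothetical, and it assumes a breakpoint `i` means the chunk ends after sentence `i`, since similarity `i` compares sentences `i` and `i + 1`.

```python
def split_into_chunks(sentences, breakpoints):
    """Group sentences into chunks, closing a chunk after each breakpoint index."""
    chunks = []
    start = 0
    for bp in sorted(breakpoints):
        # Rejoin with ". " because splitting on ". " dropped the delimiter
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1
    if start < len(sentences):
        # Remaining sentences form the final chunk; the last sentence keeps
        # whatever punctuation it had in the source text
        chunks.append(". ".join(sentences[start:]))
    return chunks

# Hypothetical sentences and breakpoints, for illustration only
toy_sentences = ["Cats purr", "Cats nap often", "Stocks fell today",
                 "Bonds rallied", "Rain is expected"]
toy_breakpoints = [1, 3]

chunks = split_into_chunks(toy_sentences, toy_breakpoints)
for chunk in chunks:
    print(chunk)
```

With breakpoints at 1 and 3, the five toy sentences become three chunks of two, two, and one sentence. In the notebook, the same call would take `sentences` and `breakpoints` as arguments.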