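## Introduction to Semantic Chunking
Text chunking is a key step in retrieval-augmented generation (RAG): a large body of text is divided into meaningful segments to improve retrieval accuracy.
Unlike fixed-length chunking, semantic chunking splits the text based on the content similarity between sentences.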

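### Key Terms

- **Percentile**: Find the Xth percentile of all similarity differences and split chunks wherever the drop in similarity exceeds this value.
- **Standard deviation**: Split wherever similarity falls more than X standard deviations below the mean.
- **Interquartile range (IQR)**: Use the interquartile range (Q3 - Q1) to determine the split points.

This notebook implements semantic chunking with the **percentile method** and evaluates its performance on a sample text.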
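
As a rough illustration of the three rules listed above, the sketch below applies each of them to a toy list of similarity scores. It works on the similarity *drop* (`1 - similarity`), which is one possible reading of "similarity differences". The function name `breakpoints_from_drops`, the toy values, and the `1.5 * IQR` outlier rule are assumptions made for this example, not part of the notebook's own code.

```python
import numpy as np

def breakpoints_from_drops(similarities, method="percentile", threshold=90):
    """Return the gap indices whose similarity drop exceeds the chosen cutoff.

    Hypothetical helper for illustration only.
    """
    sims = np.asarray(similarities, dtype=float)
    drops = 1.0 - sims  # a larger drop means the neighbouring sentences are less alike

    if method == "percentile":
        # Split only at drops above the Xth percentile of all drops
        cutoff = np.percentile(drops, threshold)
    elif method == "standard_deviation":
        # Split at drops more than X standard deviations above the mean drop
        cutoff = drops.mean() + threshold * drops.std()
    elif method == "interquartile":
        # Split at drops that are outliers under the usual 1.5 * IQR rule
        q1, q3 = np.percentile(drops, [25, 75])
        cutoff = q3 + 1.5 * (q3 - q1)
    else:
        raise ValueError(f"Unknown method: {method}")

    return [int(i) for i in np.flatnonzero(drops > cutoff)]

# Toy scores: the third gap (index 2) is a clear topic shift
toy_similarities = [0.90, 0.85, 0.30, 0.88, 0.92]
for name, value in [("percentile", 90), ("standard_deviation", 1.0), ("interquartile", 0)]:
    print(name, breakpoints_from_drops(toy_similarities, name, value))
```

On well-separated data like this, all three rules tend to agree. They diverge when the drops are spread more evenly, which is where the choice of method and threshold starts to matter.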


## Setting Up the Environment
We begin by importing the necessary libraries.

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File
To implement RAG, we first need a source of text data. In this example, we use the PyMuPDF library to extract text from a PDF file.

In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page in mypdf:
        # Extract text from the current page and add spacing
        all_text += page.get_text("text") + " "

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# Define the path to the PDF file
pdf_path = "data/AI_Information.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Print the first 500 characters of the extracted text
print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f
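
The printed text keeps the PDF's hard line breaks, so a sentence can end with a period followed by a newline rather than a space. The sketch below shows one way to collapse such whitespace before sentence splitting. `normalize_whitespace` is a hypothetical helper and is not applied anywhere in this notebook, so the outputs shown here reflect the un-normalized text.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = (
    "The term is frequently applied to \n"
    "the project of developing systems endowed with the intellectual processes.\n"
    "Over the past few decades"
)
print(normalize_whitespace(sample))
```

After normalization, every sentence boundary is a period followed by a single space, which is the pattern a simple `split(". ")` looks for.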


## Setting Up the LLM Client
Initialize the LLM client used to generate embeddings and responses.

In [3]:
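# We use the model service provided by SiliconFlow. Register an account and request an API key first. SiliconFlow website: https://www.siliconflow.cn/
client = OpenAI(
    base_url="https://api.siliconflow.cn/v1/",
    api_key=os.getenv("SILICONFLOW_API_KEY")  # Read your API key from the environment (the variable name is an assumption; avoid hardcoding keys)
)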

## Creating Sentence-Level Embeddings
We split the text into sentences and generate an embedding for each one.

In [4]:
def get_embedding(text, model="BAAI/bge-m3"):
    """
    Creates an embedding for the given text using the bge-m3 model (via the OpenAI-compatible client).

    Args:
    text (str): Input text.
    model (str): Embedding model name.

    Returns:
    np.ndarray: The embedding vector.
    """
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)


# Split the text into sentences (basic split on ". ")
sentences = extracted_text.split(". ")

# Generate an embedding for each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]

print(f"Generated {len(embeddings)} sentence embeddings.")

Generated 257 sentence embeddings.
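
The basic `split(". ")` drops the terminating period and ignores sentences that end in `?` or `!`. A slightly more careful splitter might look like the sketch below. `split_sentences` is a hypothetical name, and the regex is an assumption that will still mis-split abbreviations such as "e.g. this"; a dedicated sentence tokenizer would be more robust.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

print(split_sentences("AI is a broad field. What is XAI? It aims at transparency!"))
```

Because the punctuation stays attached to each sentence, chunks rebuilt from this output should be joined with a plain space rather than with `". "`.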


## Calculating Similarity Differences
We compute the cosine similarity between consecutive sentences.

In [5]:
def cosine_similarity(vec1, vec2):
    """
    Computes cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): First vector.
    vec2 (np.ndarray): Second vector.

    Returns:
    float: Cosine similarity.
    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compute similarity between consecutive sentences
similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]
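
For longer documents, the same consecutive similarities can be computed in one vectorized pass instead of a Python loop. The following is a minimal sketch; `consecutive_similarities` is a hypothetical helper, and it assumes no embedding is an all-zero vector.

```python
import numpy as np

def consecutive_similarities(embeddings):
    """Cosine similarity between each embedding and the one that follows it."""
    matrix = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length so a dot product equals cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    # Row-wise dot product of rows 0..n-2 with rows 1..n-1
    return np.sum(unit[:-1] * unit[1:], axis=1)

toy_embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
print(consecutive_similarities(toy_embeddings))
```

The result has one entry fewer than the number of embeddings, matching the length of the `similarities` list built above.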

## Implementing Semantic Chunking
We implement three different methods for finding breakpoints.

In [6]:
def compute_breakpoints(similarities, method="percentile", threshold=90):
    """
    Computes chunking breakpoints based on similarity drops.

    Args:
    similarities (List[float]): List of similarity scores between sentences.
    method (str): 'percentile', 'standard_deviation', or 'interquartile'.
    threshold (float): Threshold value (percentile for 'percentile', std devs for 'standard_deviation'; unused for 'interquartile').

    Returns:
    List[int]: Indices where chunk splits should occur.
    """
    # Determine the threshold value based on the selected method
    if method == "percentile":
        # Calculate the Xth percentile of the similarity scores.
        # The cut is taken on the scores themselves, so every gap whose similarity
        # falls below that percentile becomes a breakpoint (threshold=90 marks
        # roughly 90% of the gaps).
        threshold_value = np.percentile(similarities, threshold)
    elif method == "standard_deviation":
        # Calculate the mean and standard deviation of the similarity scores
        mean = np.mean(similarities)
        std_dev = np.std(similarities)
        # Set the threshold value to the mean minus X standard deviations
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Calculate the first and third quartiles (Q1 and Q3)
        q1, q3 = np.percentile(similarities, [25, 75])
        # Set the threshold value using the IQR rule for low outliers
        threshold_value = q1 - 1.5 * (q3 - q1)
    else:
        # Raise an error if an invalid method is provided
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # Identify the indices where similarity falls below the threshold value
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]

# Compute breakpoints using the percentile method with a threshold of 90
breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)

## Splitting Text into Semantic Chunks
We split the text at the computed breakpoints.

In [7]:
def split_into_chunks(sentences, breakpoints):
    """
    Splits sentences into semantic chunks.

    Args:
    sentences (List[str]): List of sentences.
    breakpoints (List[int]): Indices where chunking should occur.

    Returns:
    List[str]: List of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    start = 0  # Initialize the start index

    # Iterate through each breakpoint to create chunks
    for bp in breakpoints:
        # Append the sentences from start up to and including the breakpoint as one chunk
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1  # Move the start index to the sentence after the breakpoint

    # Append the remaining sentences as the last chunk
    chunks.append(". ".join(sentences[start:]))
    return chunks  # Return the list of chunks

# Create chunks using the split_into_chunks function
text_chunks = split_into_chunks(sentences, breakpoints)

# Print the number of chunks created
print(f"Number of semantic chunks: {len(text_chunks)}")

# Print the first chunk to verify the result
print("\nFirst text chunk:")
print(text_chunks[0])


Number of semantic chunks: 231

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings.
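
To make the index convention concrete, here is a small worked example on toy data. The function is restated in compact form only so that the snippet runs on its own, and the single-letter "sentences" are placeholders. A breakpoint at index `i` refers to the gap between sentence `i` and sentence `i + 1`, so the chunk ends with sentence `i`.

```python
def split_into_chunks(sentences, breakpoints):
    """Compact restatement of the notebook's splitting logic."""
    chunks, start = [], 0
    for bp in breakpoints:
        # Sentences start..bp (inclusive) form one chunk
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1
    # Whatever is left after the last breakpoint becomes the final chunk
    chunks.append(". ".join(sentences[start:]))
    return chunks

toy_sentences = ["A", "B", "C", "D", "E"]
toy_chunks = split_into_chunks(toy_sentences, [1, 2])
print(toy_chunks)
print(len(toy_chunks) == len([1, 2]) + 1)  # chunks = breakpoints + 1
```

The number of chunks is always one more than the number of breakpoints, so a breakpoint list that covers most gaps yields chunks that are close to single sentences.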


## Creating Embeddings for Semantic Chunks
We create an embedding for each chunk for later retrieval.

In [8]:
def create_embeddings(text_chunks):
    """
    Creates embeddings for each text chunk.

    Args:
    text_chunks (List[str]): List of text chunks.

    Returns:
    List[np.ndarray]: List of embedding vectors.
    """
    # Generate an embedding for each text chunk using the get_embedding function
    return [get_embedding(chunk) for chunk in text_chunks]

# Create chunk embeddings using the create_embeddings function
chunk_embeddings = create_embeddings(text_chunks)

## Performing Semantic Search
We use cosine similarity to retrieve the most relevant chunks.

In [9]:
def semantic_search(query, text_chunks, chunk_embeddings, k=5):
    """
    Finds the most relevant text chunks for a query.

    Args:
    query (str): Search query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (List[np.ndarray]): List of chunk embeddings.
    k (int): Number of top results to return.

    Returns:
    List[str]: Top-k relevant chunks.
    """
    # Generate an embedding for the query
    query_embedding = get_embedding(query)

    # Calculate cosine similarity between the query embedding and each chunk embedding
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]

    # Get the indices of the top-k most similar chunks, most similar first
    top_indices = np.argsort(similarities)[-k:][::-1]

    # Return the top-k most relevant text chunks
    return [text_chunks[i] for i in top_indices]
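
The ranking step can also be written as a single matrix operation, which avoids calling the similarity function once per chunk. The sketch below assumes the embeddings are already available as arrays; `top_k_indices` is a hypothetical helper and the vectors are toy values rather than real model output.

```python
import numpy as np

def top_k_indices(query_vec, chunk_matrix, k=5):
    """Return (indices, scores) of the k chunks most similar to the query."""
    chunk_matrix = np.asarray(chunk_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every chunk row at once
    norms = np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec)
    scores = chunk_matrix @ query_vec / norms
    # Sort by descending score and keep the first k positions
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order], scores[order]

toy_chunks_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_indices([1.0, 0.0], toy_chunks_matrix, k=2)
print(indices, scores)
```

The returned indices can be used to look up the matching entries in `text_chunks`, just as `semantic_search` does.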

In [10]:
# Load the validation data from a JSON file
with open('data/val.json') as f:
    data = json.load(f)

# Extract the first query from the validation data
query = data[0]['question']

# Get the top 2 relevant chunks
top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)

# Print the query
print(f"Query: {query}")

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i+1}:\n{chunk}\n{'='*40}")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:

Transparency and Explainability 
Transparency and explainability are essential for building trust in AI systems. Explainable AI (XAI) 
techniques aim to make AI decisions more understandable, enabling users to assess their 
fairness and accuracy.
Context 2:

Explainable AI (XAI) 
Explainable AI (XAI) aims to make AI systems more transparent and understandable. Research in 
XAI focuses on developing methods for explaining AI decisions, enhancing trust, and improving 
accountability.


## Generating a Response Based on the Retrieved Chunks

In [11]:
# Define the system prompt for the AI assistant
system_prompt = """
You are an AI assistant that strictly answers based on the given context. 
If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'
"""

def generate_response(system_prompt, user_message, model="THUDM/GLM-4-9B-0414"):
    """
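    Generates a response from the AI model based on the system prompt and the user message.

    Args:
    system_prompt (str): The system prompt that guides the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to use.

    Returns:
    ChatCompletion: The raw response object from the client; the text is in `response.choices[0].message.content`.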
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Build the user prompt from the top retrieved chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate the AI response
ai_response = generate_response(system_prompt, user_prompt)

## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [12]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = """
You are an intelligent evaluation system tasked with assessing the AI assistant's responses. 
If the AI assistant's response is very close to the true response, assign a score of 1. 
If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. 
If the response is partially aligned with the true response, assign a score of 0.5. give a detailed explanation of your evaluation."""

# Build the evaluation prompt by combining the user query, the AI response, the true answer, and the evaluation system prompt.
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and the evaluation prompt.
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)


### Evaluation Score: 1

#### Detailed Explanation:

The AI assistant's response aligns very closely with the true response in both content and structure. Here’s a breakdown of the comparison:

1. **Definition of Explainable AI (XAI):**
   - **AI Response:** "Explainable AI (XAI) is a field of research that aims to make AI systems more transparent and understandable."
   - **True Response:** "Explainable AI (XAI) aims to make AI systems more transparent and understandable, providing insights into how they make decisions."
   
   The AI response accurately captures the essence of XAI as making AI systems transparent and understandable. While the true response adds the detail of "providing insights into how they make decisions," the core meaning is preserved, and the omission does not significantly detract from the overall accuracy.

2. **Importance of XAI:**
   - **AI Response:** "It focuses on developing methods for explaining AI decisions, which enhances trust and improves accountabi