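# Effect of Text Chunk Size on Simple RAG
In retrieval-augmented generation (RAG), choosing a suitable chunk size is critical to retrieval precision. Chunks that are too small can split a fact across chunk boundaries, while chunks that are too large can dilute the embedding with unrelated text. The goal is to balance retrieval performance against response quality.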

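The following steps evaluate how different chunk sizes affect RAG:

1. Preprocessing: extract the text from the PDF.
2. Chunking: split the text into chunks of different sizes.
3. Embedding generation: convert each chunk into a vector.
4. Retrieval: use the embedding of the user query to search the chunks.
5. Response generation: generate a response from the retrieved chunks.
6. Evaluation: compare response quality across chunk sizes.
7. Conclusion: analyze how chunk size affects RAG.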

Import the required libraries

In [1]:
import pymupdf
import os
import numpy as np
import json
import openai

Configure the client

In [2]:
client = openai.OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If the environment variable is not set, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Bailian (DashScope) service
)

Extract the text

In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: The text extracted from the PDF.
    """
    # Open the PDF file
    mypdf = pymupdf.open(pdf_path)
    all_text = ""  # Accumulates the extracted text

    # Iterate over every page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract plain text from the page
        all_text += text  # Append it to all_text

    return all_text  # Return the extracted text

pdf_path = "data/AI_Information.pdf"


extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


Chunking

In [4]:
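def chunk_text(text, n, overlap):
    """
    Split text into chunks of n characters, with `overlap` characters shared by adjacent chunks.

    Args:
        text (str): The input text.
        n (int): Size of each chunk, in characters.
        overlap (int): Number of characters shared by adjacent chunks; must be smaller than n.

    Returns:
        List[str]: The list of text chunks.
    """
    # A stride of zero or less would raise an error or silently return no chunks
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")

    chunks = []
    # Advance by n - overlap so that consecutive chunks share `overlap` characters
    for i in range(0, len(text), n - overlap):
        chunks.append(text[i:i + n])

    return chunks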


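# Chunk sizes to compare, in characters
chunk_sizes = [128, 256, 512]

# Chunk the text at each size, using an overlap of one fifth of the chunk size
text_chunks_dict = {size: chunk_text(extracted_text, size, size // 5) for size in chunk_sizes}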


for size, chunks in text_chunks_dict.items():
    print(f"Chunk Size: {size}, Number of Chunks: {len(chunks)}")

Chunk Size: 128, Number of Chunks: 326
Chunk Size: 256, Number of Chunks: 164
Chunk Size: 512, Number of Chunks: 82
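The chunk counts above follow from the stride: `chunk_text` starts a new chunk every `n - overlap` characters, so a text of `L` characters yields `ceil(L / (n - overlap))` chunks. The sketch below checks that relationship on a synthetic string. `expected_chunk_count` is a hypothetical helper introduced only for illustration, and the 1,000-character input is made up rather than taken from the PDF.

```python
import math

def chunk_text(text, n, overlap):
    """Same sliding-window chunker as above, repeated so the example is self-contained."""
    return [text[i:i + n] for i in range(0, len(text), n - overlap)]

def expected_chunk_count(text_length, n, overlap):
    """Hypothetical helper: number of chunks produced for a text of the given length."""
    return math.ceil(text_length / (n - overlap))

sample = "x" * 1000  # synthetic text, not the PDF content
for size in [128, 256, 512]:
    overlap = size // 5
    chunks = chunk_text(sample, size, overlap)
    # The observed count should match the closed-form count
    assert len(chunks) == expected_chunk_count(len(sample), size, overlap)
    print(f"Chunk Size: {size}, Overlap: {overlap}, Number of Chunks: {len(chunks)}")
```

Doubling the chunk size roughly halves the number of chunks, which is consistent with the 326, 164, and 82 chunks reported above.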


Generate embeddings

In [6]:
from tqdm import tqdm
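def create_embeddings_in_batches(text_chunks, model="text-embedding-v3", batch_size_limit=10):
    """
    Create embeddings for a list of texts through the OpenAI-compatible Embedding API, respecting the batch size limit.

    Args:
        text_chunks (List[str]): The text strings to embed.
        model (str): The embedding model to use.
        batch_size_limit (int): Maximum number of inputs the API accepts per request. According to the API error message, this is 10.

    Returns:
        List[List[float]]: The embedding vectors for all texts, in input order.
    """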
    all_embeddings = []
    if not text_chunks:
        return []

    if not isinstance(text_chunks, list):  # Ensure the input is a list
        text_chunks = [text_chunks]

    for i in range(0, len(text_chunks), batch_size_limit):
        batch = text_chunks[i:i + batch_size_limit]
        try:
            print(f"Processing batch {i//batch_size_limit + 1}, size: {len(batch)}")
            response = client.embeddings.create(
                input=batch,
                model=model,
                encoding_format="float"
            )
            # Extract this batch's embedding vectors from the response
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)


        except Exception as e:
            print(f"Error processing batch starting with chunk: '{batch[0][:50]}...'")
            print(f"API Error: {e}")

            raise  # Re-raise the original exception with its traceback

    return all_embeddings


chunk_embeddings_dict = {size: create_embeddings_in_batches(chunks) for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings")}



Processing batch 1, size: 10
Processing batch 2, size: 10
Processing batch 3, size: 10
Processing batch 4, size: 10
Processing batch 5, size: 10
Processing batch 6, size: 10
Processing batch 7, size: 10
Processing batch 8, size: 10
Processing batch 9, size: 10
Processing batch 10, size: 10
Processing batch 11, size: 10
Processing batch 12, size: 10
Processing batch 13, size: 10
Processing batch 14, size: 10
Processing batch 15, size: 10
Processing batch 16, size: 10
Processing batch 17, size: 10
Processing batch 18, size: 10
Processing batch 19, size: 10
Processing batch 20, size: 10
Processing batch 21, size: 10
Processing batch 22, size: 10
Processing batch 23, size: 10
Processing batch 24, size: 10
Processing batch 25, size: 10
Processing batch 26, size: 10
Processing batch 27, size: 10
Processing batch 28, size: 10
Processing batch 29, size: 10
Processing batch 30, size: 10
Processing batch 31, size: 10
Processing batch 32, size: 10
Processing batch 33, size: 6




Processing batch 1, size: 10
Processing batch 2, size: 10
Processing batch 3, size: 10
Processing batch 4, size: 10
Processing batch 5, size: 10
Processing batch 6, size: 10
Processing batch 7, size: 10
Processing batch 8, size: 10
Processing batch 9, size: 10
Processing batch 10, size: 10
Processing batch 11, size: 10
Processing batch 12, size: 10
Processing batch 13, size: 10
Processing batch 14, size: 10
Processing batch 15, size: 10
Processing batch 16, size: 10
Processing batch 17, size: 4




Processing batch 1, size: 10
Processing batch 2, size: 10
Processing batch 3, size: 10
Processing batch 4, size: 10
Processing batch 5, size: 10
Processing batch 6, size: 10
Processing batch 7, size: 10
Processing batch 8, size: 10
Processing batch 9, size: 2


Generating Embeddings: 100%|██████████| 3/3 [00:31<00:00, 10.42s/it]
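The batch sizes in the log follow directly from slicing the chunk list in steps of `batch_size_limit`. A minimal sketch, using a hypothetical `split_into_batches` helper and placeholder strings instead of real chunks, reproduces the 33 batches (the last holding 6 items) logged for the 326 chunks of size 128. No API call is made.

```python
def split_into_batches(items, batch_size_limit=10):
    """Hypothetical helper: slice a list into consecutive batches of at most batch_size_limit items."""
    return [items[i:i + batch_size_limit] for i in range(0, len(items), batch_size_limit)]

placeholder_chunks = [f"chunk {i}" for i in range(326)]  # stand-ins for real text chunks
batches = split_into_batches(placeholder_chunks)
print(len(batches), len(batches[-1]))  # 33 6
```

The same arithmetic gives 17 batches for 164 chunks and 9 batches for 82 chunks, so smaller chunk sizes cost proportionally more embedding requests.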


Retrieval

In [7]:
def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Args:
        vec1 (numpy.ndarray): The first vector.
        vec2 (numpy.ndarray): The second vector.

    Returns:
        float: The cosine similarity.
    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
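As a quick sanity check of the formula, cosine similarity depends only on the angle between two vectors, not on their lengths. The toy vectors below are illustrative assumptions, not real embeddings.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors (same formula as above)."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 2.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # -1.0
```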

In [8]:
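def retrieve(query, text_chunks, embeddings, model="text-embedding-v3", k=5):
    """
    Retrieve the chunks most similar to the query by cosine similarity of their embeddings.

    Args:
        query (str): The query text.
        text_chunks (List[str]): The list of text chunks.
        embeddings (List[List[float]]): The embeddings of the text chunks, in the same order.
        model (str): The embedding model; it should match the model used for the chunks.
        k (int): Number of chunks to return.

    Returns:
        List[str]: The k chunks most similar to the query, most similar first.
    """
    # Embed the query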
    response = client.embeddings.create(
                input=query,
                model=model,
                encoding_format="float"
            )
    query_embedding = response.data[0].embedding
    similarity_scores = []  # List of (chunk index, similarity score) pairs
    for i, embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(query_embedding, embedding)
        similarity_scores.append((i, similarity_score))
    # Sort by similarity, highest first
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Return the top-k chunks
    top_k_indices = [score[0] for score in similarity_scores[:k]]
    top_k_chunks = [text_chunks[i] for i in top_k_indices]
    return top_k_chunks
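The ranking step in `retrieve` can also be expressed with NumPy array operations instead of a Python loop. The following is a sketch under assumptions: `top_k_by_cosine` is a hypothetical name, and the 3-dimensional vectors stand in for real embeddings so that the example runs without an API call.

```python
import numpy as np

def top_k_by_cosine(query_embedding, embeddings, k=5):
    """Hypothetical helper: indices of the k rows most similar to the query, best first."""
    matrix = np.asarray(embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Cosine similarity of the query against every row at once
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so negate the scores to rank from highest to lowest
    return np.argsort(-scores, kind="stable")[:k].tolist()

# Toy data for illustration only
toy_chunks = ["cats", "dogs", "stocks"]
toy_embeddings = [[1.0, 0.1, 0.0], [0.9, 0.3, 0.0], [0.0, 0.1, 1.0]]
toy_query = [1.0, 0.0, 0.0]

indices = top_k_by_cosine(toy_query, toy_embeddings, k=2)
print([toy_chunks[i] for i in indices])  # ['cats', 'dogs']
```

With a few hundred chunks the loop version is fast enough; the array form mainly pays off when the number of chunks grows.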

In [9]:
with open('data/val.json') as f:
    data = json.load(f)

query = data[3]['question']
retrieved_chunks_dict = {size: retrieve(query, text_chunks_dict[size], chunk_embeddings_dict[size]) for size in chunk_sizes}

print(retrieved_chunks_dict[256])

['ng biological data, predicting drug \nefficacy, and identifying potential drug candidates. AI-powered systems reduce the time and cost \nof bringing new treatments to market. \nPersonalized Medicine \nAI enables personalized medicine by analyzing individual pa', 'es personalized medicine by analyzing individual patient data, predicting treatment \nresponses, and tailoring interventions. Personalized medicine enhances treatment effectiveness \nand reduces adverse effects. \nRobotic Surgery \nAI-powered robotic surgery s', 'ains. \nThese applications include: \nHealthcare \nAI is transforming healthcare through applications such as medical diagnosis, drug discovery, \npersonalized medicine, and robotic surgery. AI-powered tools can analyze medical images, \npredict patient outcome', 'nt outcomes, and assisting in treatment planning. AI-powered tools enhance accuracy, \nefficiency, and patient care. \nDrug Discovery and Development \nAI accelerates drug discovery and development by anal

Generate a response from the retrieved text

In [25]:
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

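def generate_response(query, system_prompt, retrieved_chunks, model="qwen3-4b"):
    """
    Generate an answer to the query using the retrieved chunks as context.
    """
    # Label each retrieved chunk so the model can tell them apart
    context = "\n".join([f"Context {i+1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)])

    # Place the question after the context
    user_prompt = f"{context}\n\nQuestion: {query}"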

    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        extra_body={"enable_thinking": False}
    )


    return response.choices[0].message.content


ai_responses_dict = {size: generate_response(query, system_prompt, retrieved_chunks_dict[size]) for size in chunk_sizes}
print(ai_responses_dict[256])

AI contributes to personalized medicine by analyzing individual patient data, predicting treatment responses, and tailoring interventions. This allows for more effective treatments tailored to individual needs, enhancing treatment effectiveness and reducing adverse effects.
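To make the prompt format concrete, the sketch below reproduces only the string assembly from `generate_response`, without calling the model. `build_user_prompt` is a hypothetical helper, and the two chunks are made-up examples rather than retrieved text.

```python
def build_user_prompt(query, retrieved_chunks):
    """Hypothetical helper: number the chunks as context, then append the question."""
    context = "\n".join(f"Context {i + 1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks))
    return f"{context}\n\nQuestion: {query}"

prompt = build_user_prompt(
    "What is personalized medicine?",
    ["AI enables personalized medicine.", "AI accelerates drug discovery."],
)
print(prompt)
```

Because `k` is fixed at 5, larger chunk sizes make each `Context` entry longer, so the prompt grows roughly in proportion to the chunk size.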


Evaluation

In [26]:
# Define evaluation scoring system constants
SCORE_FULL = 1.0     # Complete match or fully satisfactory
SCORE_PARTIAL = 0.5  # Partial match or somewhat satisfactory
SCORE_NONE = 0.0     # No match or unsatisfactory
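Because only these three values are valid, any text returned by a judge model can be normalized against them before it is used. The sketch below is one possible approach, not part of the original notebook: `parse_score` and `VALID_SCORES` are hypothetical names, and the rule of falling back to the lowest score for anything unrecognized is an assumption.

```python
import re

SCORE_FULL = 1.0
SCORE_PARTIAL = 0.5
SCORE_NONE = 0.0
VALID_SCORES = (SCORE_FULL, SCORE_PARTIAL, SCORE_NONE)  # hypothetical constant

def parse_score(raw_text, default=SCORE_NONE):
    """Hypothetical helper: take the first number in raw_text and accept it only if it is a valid score."""
    match = re.search(r"\d+(?:\.\d+)?", raw_text)
    if match is None:
        return default
    value = float(match.group())
    return value if value in VALID_SCORES else default

print(parse_score("1.0"))         # 1.0
print(parse_score("Score: 0.5"))  # 0.5
print(parse_score("excellent"))   # 0.0
print(parse_score("0.7"))         # 0.0, because 0.7 is not an allowed value
```

Compared with calling `float()` on the whole reply, this tolerates replies such as `Score: 0.5` and rejects numbers outside the scoring scheme.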

In [28]:
# Define strict evaluation prompt templates
FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the faithfulness of the AI response compared to the true answer.
User Query: {question}
AI Response: {response}
True Answer: {true_answer}

Faithfulness measures how well the AI response aligns with facts in the true answer, without hallucinations.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely faithful, no contradictions with true answer
    * {partial} = Partially faithful, minor contradictions
    * {none} = Not faithful, major contradictions or hallucinations
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""

RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the relevancy of the AI response to the user query.
User Query: {question}
AI Response: {response}

Relevancy measures how well the response addresses the user's question.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely relevant, directly addresses the query
    * {partial} = Partially relevant, addresses some aspects
    * {none} = Not relevant, fails to address the query
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""



In [None]:
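def evaluate_response(question, response, true_answer):
        """
        Evaluate the faithfulness and relevancy of an AI-generated response.

        Args:
            question (str): The user's question.
            response (str): The AI-generated response.
            true_answer (str): The reference answer.

        Returns:
            tuple: The faithfulness score and the relevancy score, in that order.
        """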
        # Format the evaluation prompts
        faithfulness_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
                question=question, 
                response=response, 
                true_answer=true_answer,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )
        
        relevancy_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
                question=question, 
                response=response,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )

        # Request faithfulness evaluation from the model
        faithfulness_response = client.chat.completions.create(
                model="qwen-plus",
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": faithfulness_prompt}
                ]
        )
        
        # Request relevancy evaluation from the model
        relevancy_response = client.chat.completions.create(
                model="qwen-plus",
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": relevancy_prompt}
                ]
        )
        
        # Extract scores and handle potential parsing errors
        try:
                faithfulness_score = float(faithfulness_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse faithfulness score, defaulting to 0")
                faithfulness_score = 0.0
                
        try:
                relevancy_score = float(relevancy_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse relevancy score, defaulting to 0")
                relevancy_score = 0.0

        return faithfulness_score, relevancy_score

# True answer for the validation example at index 3 (the same example the query came from)
true_answer = data[3]['ideal_answer']

# Evaluate response for chunk size 256 and 128
faithfulness, relevancy = evaluate_response(query, ai_responses_dict[256], true_answer)
faithfulness2, relevancy2 = evaluate_response(query, ai_responses_dict[128], true_answer)

# print the evaluation scores
print(f"Faithfulness Score (Chunk Size 256): {faithfulness}")
print(f"Relevancy Score (Chunk Size 256): {relevancy}")

print("\n")

print(f"Faithfulness Score (Chunk Size 128): {faithfulness2}")
print(f"Relevancy Score (Chunk Size 128): {relevancy2}")

Faithfulness Score (Chunk Size 256): 1.0
Relevancy Score (Chunk Size 256): 1.0


Faithfulness Score (Chunk Size 128): 1.0
Relevancy Score (Chunk Size 128): 1.0
