# Context-Enriched Retrieval in RAG

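Retrieval-augmented generation (RAG) improves a model's responses by retrieving relevant knowledge from external sources. Conventional retrieval returns isolated text chunks, which can leave answers incomplete.

**Context-enriched retrieval** addresses this by returning each retrieved chunk together with its neighboring chunks, so the model receives a more coherent passage.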

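**Steps:**

*   **Data Ingestion**: extract text from a PDF file.
*   **Chunking with Overlapping Context**: split the text into overlapping chunks so that context is preserved across chunk boundaries.
*   **Embedding Creation**: convert each chunk into a numeric vector.
*   **Context-Aware Retrieval**: retrieve the most relevant chunk along with its neighbors, so the retrieved information is more complete.
*   **Response Generation**: use a language model to generate a response from the retrieved context.
*   **Evaluation**: assess the accuracy of the model's response.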
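
The context-aware retrieval step comes down to index arithmetic over the ordered chunk list. A minimal sketch, assuming chunks are stored in document order; `neighbor_window` is a hypothetical helper name used only for illustration:

```python
def neighbor_window(top_index, context_size, n_chunks):
    """Return the chunk indices to pass to the model for one retrieved hit."""
    # Clamp the window so it never runs past either end of the document
    start = max(0, top_index - context_size)
    end = min(n_chunks, top_index + context_size + 1)
    return list(range(start, end))


print(neighbor_window(5, 1, 10))  # hit in the middle: one neighbor on each side
print(neighbor_window(0, 1, 10))  # hit at the start: no left neighbor exists
print(neighbor_window(9, 2, 10))  # hit at the end: no right neighbor exists
```

A hit at either end of the document simply gets a shorter window rather than an error.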

Import the required packages

In [1]:
import pymupdf
import os
import numpy as np
import json
import openai

Extract text from the PDF

In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text of every page in a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: The text extracted from the PDF, with pages concatenated in order.
    """
    # Open the PDF; the context manager closes the file when done
    with pymupdf.open(pdf_path) as mypdf:
        # Extract plain text page by page and join it into one string
        return "".join(page.get_text("text") for page in mypdf)

pdf_path = "data/AI_Information.pdf"


extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


Chunking

In [3]:
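def chunk_text(text, n, overlap):
    """
    Split text into chunks of n characters, with consecutive chunks sharing overlap characters.

    Args:
        text (str): The text to split.
        n (int): Number of characters per chunk.
        overlap (int): Number of characters shared by consecutive chunks.

    Returns:
        list: The list of text chunks.
    """
    # A step of zero or less would make range() fail or produce no chunks
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")

    chunks = []
    # Advance by n - overlap so each chunk repeats the tail of the previous one
    for i in range(0, len(text), n - overlap):
        chunks.append(text[i:i + n])

    return chunks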
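
To see how chunk size and overlap interact, here is a small self-contained sketch on a toy string. `sliding_chunks` is a hypothetical stand-in that mirrors the slicing above so the example runs on its own:

```python
import math


def sliding_chunks(text, n, overlap):
    """Slice text into windows of n characters that advance by n - overlap."""
    step = n - overlap
    return [text[i:i + n] for i in range(0, len(text), step)]


toy_text = "abcdefghij"
chunks = sliding_chunks(toy_text, n=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']

# The number of chunks is the text length divided by the step, rounded up
expected_count = math.ceil(len(toy_text) / (4 - 2))
print(len(chunks) == expected_count)  # True
```

Note that the final chunk (`'ij'`) is shorter than `n` and is already fully contained in the chunk before it; with this slicing scheme the tail of a document can produce such a short, redundant chunk.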

In [4]:
pdf_path = "data/AI_Information.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
text_chunks = chunk_text(extracted_text, 1000, 200)
print("Number of text chunks:", len(text_chunks))
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 42

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and 

Configure the client

In [5]:
client = openai.OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If the environment variable is not set, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Alibaba Cloud Bailian service
)

Create embeddings

In [11]:
def create_embeddings_in_batches(text_chunks, model="text-embedding-v3", batch_size_limit=10):
    """
    Create embeddings for a list of texts through the OpenAI-compatible Embedding API,
    sending at most batch_size_limit texts per request.

    Args:
        text_chunks (List[str]): The texts to embed.
        model (str): The embedding model to use.
        batch_size_limit (int): Maximum number of texts per request; 10 here, according to the API's error message.

    Returns:
        List[List[float]]: One embedding vector per input text, in input order.
    """
    all_embeddings = []
    if not text_chunks:
        return []

    if not isinstance(text_chunks, list):  # Wrap a single string in a list
        text_chunks = [text_chunks]

    for i in range(0, len(text_chunks), batch_size_limit):
        batch = text_chunks[i:i + batch_size_limit]
        try:
            print(f"Processing batch {i//batch_size_limit + 1}, size: {len(batch)}")
            response = client.embeddings.create(
                input=batch,
                model=model,
                encoding_format="float"
            )
            # Collect this batch's embedding vectors from the response
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)

        except Exception as e:
            print(f"Error processing batch starting with chunk: '{batch[0][:50]}...'")
            print(f"API Error: {e}")
            raise  # Re-raise with the original traceback

    return all_embeddings

response = create_embeddings_in_batches(text_chunks)

Processing batch 1, size: 10
Processing batch 2, size: 10
Processing batch 3, size: 10
Processing batch 4, size: 10
Processing batch 5, size: 2
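
The batch sizes in the log follow directly from slicing the chunk list in steps of the batch limit. A minimal sketch with no API calls; `batched` is a hypothetical helper, and the 42 strings are placeholders standing in for the text chunks:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder data: only the count matters for this illustration
placeholder_chunks = [f"chunk {i}" for i in range(42)]
sizes = [len(batch) for batch in batched(placeholder_chunks, 10)]
print(sizes)  # [10, 10, 10, 10, 2]
```

Only the last batch can be smaller than the limit, which is why the fifth request carries two chunks.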


Cosine similarity

In [7]:
def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Args:
        vec1 (numpy.ndarray): The first vector.
        vec2 (numpy.ndarray): The second vector.

    Returns:
        float: The cosine similarity.
    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
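
As a quick sanity check of the formula, the sketch below computes cosine similarity for a few toy 2-D vectors using only the standard library. `cosine` is a hypothetical helper; unlike the NumPy one-liner above, it returns 0.0 for a zero vector instead of dividing by zero, which is an assumption about the desired behavior rather than part of the original function:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumed convention: a zero vector is treated as similar to nothing
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine([1, 0], [2, 0]))   # same direction, different length
print(cosine([1, 0], [0, 3]))   # orthogonal vectors
print(cosine([1, 0], [-1, 0]))  # opposite directions
```

The three calls cover the extremes of the range: 1 for parallel vectors, 0 for orthogonal ones, and -1 for opposite ones. Vector length does not affect the result, only direction.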

Context-aware retrieval

In [15]:
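def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1, model="text-embedding-v3"):
    """
    Retrieve the most relevant chunks together with their neighboring chunks.

    Args:
        query (str): The user's query.
        text_chunks (list): The list of text chunks.
        embeddings (list): The chunk embeddings, in the same order as text_chunks.
        k (int): Number of top-ranked chunks to expand with context.
        context_size (int): Number of neighboring chunks to include on each side.
        model (str): The embedding model used for the query.

    Returns:
        list: The selected chunks and their neighbors, in document order.
    """
    # Embed the query with the same model used for the chunks
    response = client.embeddings.create(
        input=query,
        model=model,
        encoding_format="float"
    )
    query_embedding = np.array(response.data[0].embedding)

    # Score every chunk against the query
    similarity_scores = []
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(query_embedding, np.array(chunk_embedding))
        similarity_scores.append((i, similarity_score))

    # Rank chunks from most to least similar
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # For each of the top k chunks, include context_size neighbors on each side
    selected = set()
    for top_index, _ in similarity_scores[:k]:
        start = max(0, top_index - context_size)
        end = min(len(text_chunks), top_index + context_size + 1)
        selected.update(range(start, end))

    return [text_chunks[i] for i in sorted(selected)]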
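
The retrieval logic can be exercised without calling the embedding API by supplying hand-made vectors. This is a sketch under that assumption: the 2-D vectors, the chunk labels, and the `search_with_context` name are all made up for illustration, and the query vector stands in for whatever the embedding model would return:

```python
import math


def cosine(a, b):
    """Cosine similarity for non-zero 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search_with_context(query_vec, chunks, vectors, context_size=1):
    """Return the best-matching chunk plus its neighbors, in document order."""
    # Index of the chunk whose vector is most similar to the query
    best = max(range(len(vectors)), key=lambda i: cosine(query_vec, vectors[i]))
    start = max(0, best - context_size)
    end = min(len(chunks), best + context_size + 1)
    return chunks[start:end]


# Toy data: labels stand in for chunk text, vectors for their embeddings
chunks = ["intro", "history", "ethics", "bias", "future"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.3, 0.7], [0.6, 0.4]]

result = search_with_context([0.0, 1.0], chunks, vectors, context_size=1)
print(result)  # ['history', 'ethics', 'bias']
```

The best match is `"ethics"`, and the window pulls in `"history"` even though its own similarity to the query is low. That is the trade-off of this approach: neighbors are included for continuity, not because they scored well.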

Retrieve context for a single query

In [16]:

with open('data/val.json') as f:
    data = json.load(f)


query = data[0]['question']

top_chunks = context_enriched_search(query, text_chunks, response, k=1, context_size=1)


print("Query:", query)

for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:
nt aligns with societal values. Education and awareness campaigns inform the public 
about AI, its impacts, and its potential. 
Chapter 19: AI and Ethics 
Principles of Ethical AI 
Ethical AI principles guide the development and deployment of AI systems to ensure they are fair, 
transparent, accountable, and beneficial to society. Key principles include respect for human 
rights, privacy, non-discrimination, and beneficence. 
 
 
Addressing Bias in AI 
AI systems can inherit and amplify biases present in the data they are trained on, leading to unfair 
or discriminatory outcomes. Addressing bias requires careful data collection, algorithm design, 
and ongoing monitoring and evaluation. 
Transparency and Explainability 
Transparency and explainability are essential for building trust in AI systems. Explainable AI (XAI) 
techniques aim to make AI decisions more understandable, enabling users to assess their 
f

Generate a response from the retrieved context

In [None]:
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_prompt, model="qwen3-4b"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        extra_body={"enable_thinking": False}  # Extra request field that turns off the model's thinking mode
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
print(ai_response.choices[0].message.content)

Explainable AI (XAI) refers to techniques that aim to make AI decisions more understandable to users, enabling them to assess the fairness and accuracy of these decisions. It is considered important because transparency and explainability are key to building trust in AI systems. By making AI systems understandable and providing insights into their decision-making processes, users can better evaluate the reliability and fairness of AI, which is essential for its acceptance and ethical use.
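
The prompt sent to the model is plain string assembly, so it can be inspected before spending an API call. A minimal sketch with placeholder chunks; `build_user_prompt` and `SEPARATOR` are hypothetical names, and the layout only approximates the one used in the cell above:

```python
SEPARATOR = "=" * 37  # visual divider between context blocks


def build_user_prompt(chunks, question):
    """Number each context chunk, then append the question at the end."""
    blocks = [
        f"Context {i + 1}:\n{chunk}\n{SEPARATOR}\n"
        for i, chunk in enumerate(chunks)
    ]
    return "\n".join(blocks) + f"\nQuestion: {question}"


# Placeholder inputs for illustration only
prompt = build_user_prompt(["first chunk", "second chunk"], "What is XAI?")
print(prompt)
```

Placing the question after the context keeps it closest to where the model starts generating, and the numbered labels make it easy to check which chunks were actually included.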


Evaluation

In [None]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)
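
The evaluator replies with free text, so turning its verdict into a number takes a parsing step. A hedged sketch: `parse_score` is a hypothetical helper, the sample replies are invented, and the code assumes the reply contains one of the three allowed scores (0, 0.5, 1) as a standalone number, which the model is not guaranteed to honor:

```python
import re

# Matches 0, 0.5, or 1 when not embedded in a longer number such as 10 or 0.75
SCORE_PATTERN = re.compile(r"(?<![\d.])(0\.5|1\.0|0\.0|1|0)(?!\d|\.\d)")


def parse_score(reply):
    """Return the first allowed score found in the reply, or None."""
    match = SCORE_PATTERN.search(reply)
    return float(match.group(1)) if match else None


# Invented evaluator replies, for illustration only
replies = [
    "Score: 1",
    "I would assign 0.5.",
    "The answer is incorrect. Score: 0",
    "no verdict",
]
scores = [parse_score(reply) for reply in replies]
valid = [score for score in scores if score is not None]

print(scores)
print(sum(valid) / len(valid))  # mean over the replies that could be parsed
```

Taking the first matching number is fragile: a reply such as "Step 1: ..." would be read as a score of 1. Asking the evaluator to reply with only the number, or in a fixed format, would make parsing more reliable.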