# Contextual Chunk Headers (CCH) in Simple RAG

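Retrieval-augmented generation (RAG) improves the factual accuracy of a language model by retrieving relevant external knowledge before generating a response. Standard chunking, however, often strips away important context, which makes retrieval less effective.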

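Contextual chunk headers (CCH) address this by attaching higher-level context, such as the document title or a section heading, to each chunk before it is embedded. This improves retrieval quality and helps prevent out-of-context responses.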
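As a minimal sketch of the idea, the snippet below builds the string that would be sent to the embedding model by placing a header in front of the chunk body. The helper name `prepend_header` and the `>` separator are assumptions made for illustration. The cells in this notebook take a slightly different route: they embed the header and the chunk text separately and average the two similarity scores at query time.

```python
def prepend_header(header: str, chunk_text: str) -> str:
    """Return the text to embed: the header first, then the chunk body."""
    return f"{header.strip()}\n\n{chunk_text.strip()}"


# Illustrative values taken from the sample document used in this notebook.
doc_title = "Understanding Artificial Intelligence"
section = "Chapter 1: Introduction to Artificial Intelligence"
chunk = (
    "Over the past few decades, advancements in computing power and data "
    "availability have significantly accelerated the development and deployment of AI."
)

# The header carries context that the chunk body alone does not mention.
embedding_input = prepend_header(f"{doc_title} > {section}", chunk)
print(embedding_input)
```

Without the header, nothing in this chunk says which document or chapter it belongs to. With it, a query about the introduction chapter has something to match against.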

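## Steps

1. **Data extraction**: Load and preprocess the text data.
2. **Chunking with contextual headers**: Split the text into overlapping chunks and attach a header to each one (this notebook generates the header with an LLM).
3. **Embedding creation**: Convert the context-enriched chunks into numerical representations.
4. **Semantic search**: Retrieve the chunks most relevant to the user query.
5. **Response generation**: Use a language model to generate a response from the retrieved text.
6. **Evaluation**: Assess the accuracy of the response with a scoring system.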

Import the required libraries

In [1]:
import pymupdf
import os
import numpy as np
import json
import openai

Extract text from the PDF

In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: The text extracted from the PDF.

    """
    # Open the PDF file
    mypdf = pymupdf.open(pdf_path)
    all_text = ""  # Accumulates the extracted text

    # Iterate over every page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract the page's plain text
        all_text += text  # Append it to the accumulated text

    return all_text  # Return the extracted text

pdf_path = "data/AI_Information.pdf"


extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


Configure the client

In [3]:
client = openai.OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If the environment variable is not set, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Bailian (DashScope) service
)

Chunk the text and attach headers

In [4]:
def generate_chunk_header(chunk, model="qwen-plus"):
    """
    Generate a title for a text chunk with an LLM.

    Args:
    chunk (str): The text chunk.
    model (str): The model used to generate the title.

    Returns:
    str: The generated title.
    """

    system_prompt = "Generate a concise and informative title for the given text."
    

    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": chunk}
        ]
    )


    return response.choices[0].message.content.strip()

In [5]:
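def chunk_text_with_headers(text, n, overlap):
    """
    Split the text into chunks and generate a header for each chunk.

    Args:
        text (str): The text to chunk.
        n (int): Maximum number of characters per chunk.
        overlap (int): Number of characters shared by consecutive chunks.
    Returns:
        list: A list of dicts, each holding a chunk's header and text.
    """
    chunks = []

    # Advance by n - overlap characters so consecutive chunks share `overlap` characters
    for i in range(0, len(text), n - overlap):
        chunk = text[i:i + n]
        header = generate_chunk_header(chunk)  # One LLM call per chunk
        chunks.append({"header": header, "text": chunk})

    return chunks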
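The sliding-window arithmetic can be checked without calling the LLM. The sketch below computes only the `(start, end)` character offsets that the loop above would produce. The helper name `chunk_spans` and the input validation are additions for illustration, not part of the notebook.

```python
def chunk_spans(text_length: int, n: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) character offsets of each chunk."""
    if n <= 0 or not 0 <= overlap < n:
        # With overlap == n the step is 0 and range() raises; with overlap > n
        # the step is negative and the loop silently yields no chunks.
        raise ValueError("require n > 0 and 0 <= overlap < n")
    step = n - overlap
    return [(start, min(start + n, text_length)) for start in range(0, text_length, step)]


# A 2500-character text with the notebook's settings (n=1000, overlap=200).
spans = chunk_spans(2500, 1000, 200)
print(spans)  # [(0, 1000), (800, 1800), (1600, 2500), (2400, 2500)]
```

The step is 800 characters, so a text of length `L` yields `ceil(L / 800)` chunks. The last span in this example covers only 100 characters that the previous chunk already contains, so a trailing chunk can be redundant and still costs one header-generation call.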

In [6]:
pdf_path = "data/AI_Information.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
text_chunks = chunk_text_with_headers(extracted_text, 1000, 200)
print("Sample Chunk:")
print("Header:", text_chunks[0]['header'])
print("Content:", text_chunks[0]['text'])

Sample Chunk:
Header: "Introduction to Artificial Intelligence: Historical Context and Development"
Content: Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birt

Create embeddings

In [8]:
def create_embeddings_in_batches(text_chunks, model="text-embedding-v3", batch_size_limit=10):
    """
    Create embeddings for a list of texts through the OpenAI-compatible
    embeddings API, splitting the input into batches to respect the batch-size limit.

    Args:
    text_chunks (List[str]): The text strings to embed.
    model (str): The embedding model to use.
    batch_size_limit (int): Maximum batch size the API allows; 10 here, per the limit reported in the API's error message.

    Returns:
    List[List[float]]: The embedding vectors for all texts, in input order.
    """
    all_embeddings = []
    if not text_chunks:
        return []

    if not isinstance(text_chunks, list):  # Make sure the input is a list
        text_chunks = [text_chunks]

    for i in range(0, len(text_chunks), batch_size_limit):
        batch = text_chunks[i:i + batch_size_limit]
        try:
            print(f"Processing batch {i//batch_size_limit + 1}, size: {len(batch)}")
            response = client.embeddings.create(
                input=batch,
                model=model,
                encoding_format="float"
            )
            # Extract this batch's embedding vectors from the response
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)

        except Exception as e:
            print(f"Error processing batch starting with chunk: '{batch[0][:50]}...'")
            print(f"API Error: {e}")
            raise  # Re-raise with the original traceback

    return all_embeddings

def create_embeddings(text, model="text-embedding-v3"):
    """
    Create the embedding for a single string.

    Args:
    text (str): The text string to embed.
    model (str): The embedding model to use.

    Returns:
    List[float]: The embedding vector for the text.
    """
    response = client.embeddings.create(
        model=model,
        input=text
    )

    return response.data[0].embedding
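The batching pattern used by `create_embeddings_in_batches` can be exercised offline. In the sketch below, `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake embedder returns two toy numbers per text in place of a real API call, so only the batching logic is demonstrated.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 10,
) -> List[List[float]]:
    """Embed texts in slices of at most batch_size, preserving input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_batch(batch))
    if len(vectors) != len(texts):
        raise RuntimeError("embedder returned a different number of vectors than inputs")
    return vectors


call_sizes: List[int] = []  # Records how many texts each call received


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for the API call: [character count, space count] per text."""
    call_sizes.append(len(batch))
    return [[float(len(t)), float(t.count(" "))] for t in batch]


texts = [f"chunk number {i}" for i in range(23)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=10)
print(len(vectors), call_sizes)  # 23 [10, 10, 3]
```

In the notebook itself, passing `create_embeddings_in_batches` once with all chunk texts and once with all headers would replace the one-request-per-string loop in the next cell, which should reduce the number of API round trips.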

In [10]:
from tqdm import tqdm  
embeddings = []  


for chunk in tqdm(text_chunks, desc="Generating embeddings"):
    
    text_embedding = create_embeddings(chunk["text"])
    
    header_embedding = create_embeddings(chunk["header"])
    
    embeddings.append({"header": chunk["header"], "text": chunk["text"], "embedding": text_embedding, "header_embedding": header_embedding})

Generating embeddings: 100%|██████████| 42/42 [00:15<00:00,  2.69it/s]


Semantic search

In [11]:
def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Args:
    vec1 (numpy.ndarray): The first vector.
    vec2 (numpy.ndarray): The second vector.

    Returns:
    float: The cosine similarity.

    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
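A few hand-checkable vectors show what the score means. The variant below, under the hypothetical name `cosine_similarity_safe`, applies the same formula and additionally returns 0.0 when either vector has zero length, a case in which the plain formula divides by zero. That guard is an assumption about the desired behavior, not something the notebook specifies.

```python
import numpy as np


def cosine_similarity_safe(vec1, vec2) -> float:
    """Cosine similarity that returns 0.0 if either vector has zero norm."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return 0.0
    return float(np.dot(v1, v2) / denom)


a = [1.0, 0.0]
examples = {
    "same direction": [2.0, 0.0],   # length differs, direction does not
    "45 degrees": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
    "zero vector": [0.0, 0.0],
}
results = {name: round(cosine_similarity_safe(a, v), 4) for name, v in examples.items()}
print(results)
```

The scores are 1.0, 0.7071, 0.0, -1.0, and 0.0 respectively: the measure depends only on the angle between the vectors, not on their magnitudes.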

In [12]:
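def semantic_search(query, chunks, k=5):
    """
    Retrieve the most relevant text chunks.

    Args:
        query (str): The user's query.
        chunks (list): Chunk dicts containing "embedding" and "header_embedding".
        k (int): Number of most relevant chunks to return.
    Returns:
        list: The k most relevant chunks.

    """
    # Embed the query once and reuse the array for every comparison
    query_embedding = np.array(create_embeddings(query))

    similarities = []

    for chunk in chunks:
        # Similarity between the query and the chunk body
        sim_text = cosine_similarity(query_embedding, np.array(chunk["embedding"]))
        # Similarity between the query and the chunk header
        sim_header = cosine_similarity(query_embedding, np.array(chunk["header_embedding"]))
        # Score each chunk by the mean of the two similarities
        avg_similarity = (sim_text + sim_header) / 2
        similarities.append((chunk, avg_similarity))

    # Sort by score, highest first
    similarities.sort(key=lambda x: x[1], reverse=True)

    return [x[0] for x in similarities[:k]]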
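The same ranking can be computed for all chunks at once with matrix operations. The sketch below uses toy 2-D vectors in place of real embeddings. The function name `rank_chunks` and the `header_weight` parameter are assumptions added for illustration; `header_weight=0.5` reproduces the equal averaging used by `semantic_search`.

```python
import numpy as np


def rank_chunks(query_vec, text_vecs, header_vecs, k=5, header_weight=0.5):
    """Return (chunk index, score) pairs for the k highest combined scores."""
    q = np.asarray(query_vec, dtype=float)
    text_matrix = np.asarray(text_vecs, dtype=float)
    header_matrix = np.asarray(header_vecs, dtype=float)

    def cos_to_query(matrix):
        # Row-wise cosine similarity between each row and the query vector
        return (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))

    scores = (1 - header_weight) * cos_to_query(text_matrix) \
        + header_weight * cos_to_query(header_matrix)
    order = np.argsort(-scores)[:k]  # Highest score first
    return [(int(i), float(scores[i])) for i in order]


query = [1.0, 0.0]
text_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
header_vecs = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

top_equal = rank_chunks(query, text_vecs, header_vecs, k=2)
top_text_only = rank_chunks(query, text_vecs, header_vecs, k=2, header_weight=0.0)
print([i for i, _ in top_equal])      # [2, 0]
print([i for i, _ in top_text_only])  # [0, 2]
```

Chunk 0 matches the query perfectly on its body but not at all on its header, so equal weighting ranks it below chunk 2, whose header matches. How much weight the header should carry is a tuning decision that is best checked against a validation set.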

Run a sample query

In [13]:
with open('data/val.json') as f:
    data = json.load(f)

query = data[0]['question']
top_chunks = semantic_search(query, embeddings, k=2)

print("Query:", query)
for i, chunk in enumerate(top_chunks):
    print(f"Header {i+1}: {chunk['header']}")
    print(f"Content:\n{chunk['text']}\n")

Query: What is 'Explainable AI' and why is it considered important?
Header 1: "Building Trustworthy AI: Transparency, Privacy, and Accountability"
Content:
systems. Explainable AI (XAI) 
techniques aim to make AI decisions more understandable, enabling users to assess their 
fairness and accuracy. 
Privacy and Data Protection 
AI systems often rely on large amounts of data, raising concerns about privacy and data 
protection. Ensuring responsible data handling, implementing privacy-preserving techniques, 
and complying with data protection regulations are crucial. 
Accountability and Responsibility 
Establishing accountability and responsibility for AI systems is essential for addressing potential 
harms and ensuring ethical behavior. This includes defining roles and responsibilities for 
developers, deployers, and users of AI systems. 
Chapter 20: Building Trust in AI 
Transparency and Explainability 
Transparency and explainability are key to building trust in AI. Making AI systems u

Generate an answer from the retrieved chunks

In [16]:
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

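def generate_response(system_prompt, user_message, model="qwen3-4b"):
    """
    Generate a response from the AI model.

    Args:
        system_prompt (str): The system prompt.
        user_message (str): The user message.
        model (str): The model to use.
    Returns:
        ChatCompletion: The full API response; the generated text is in
        response.choices[0].message.content.
    """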
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        extra_body={"enable_thinking": False}
    )
    return response


user_prompt = "\n".join([f"Header: {chunk['header']}\nContent:\n{chunk['text']}" for chunk in top_chunks])
user_prompt = f"{user_prompt}\nQuestion: {query}"


ai_response = generate_response(system_prompt, user_prompt)

Evaluate the response

In [17]:

evaluate_system_prompt = """You are an intelligent evaluation system. 
Assess the AI assistant's response based on the provided context. 
- Assign a score of 1 if the response is very close to the true answer. 
- Assign a score of 0.5 if the response is partially correct. 
- Assign a score of 0 if the response is incorrect.
Return only the score (0, 0.5, or 1)."""


true_answer = data[0]['ideal_answer']


evaluation_prompt = f"""
User Query: {query}
AI Response: {ai_response.choices[0].message.content}
True Answer: {true_answer}
{evaluate_system_prompt}
"""

evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

print("Evaluation Score:", evaluation_response.choices[0].message.content)

Evaluation Score: 1
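To score more than one question, the evaluator's reply has to be converted from text to a number, and a model does not always return the bare score. The sketch below shows one hedged way to parse and aggregate replies. The helper names `parse_score` and `summarize_scores` are hypothetical, and the sample replies are made-up strings for illustration, not results from this notebook.

```python
import re
from typing import List, Optional, Tuple

ALLOWED_SCORES = {0.0, 0.5, 1.0}


def parse_score(raw: str) -> Optional[float]:
    """Return the first number in raw that is a valid score, else None."""
    for token in re.findall(r"\d+(?:\.\d+)?", raw):
        value = float(token)
        if value in ALLOWED_SCORES:
            return value
    return None


def summarize_scores(replies: List[str]) -> Tuple[Optional[float], int]:
    """Return (mean of the parseable scores, number of unparseable replies)."""
    parsed = [parse_score(r) for r in replies]
    valid = [s for s in parsed if s is not None]
    mean = sum(valid) / len(valid) if valid else None
    return mean, len(parsed) - len(valid)


# Made-up evaluator replies, for illustration only.
sample_replies = ["1", "Score: 0.5", "0", "n/a"]
mean_score, unparsed = summarize_scores(sample_replies)
print(mean_score, unparsed)  # 0.5 1
```

Counting the unparseable replies separately keeps a malformed evaluator output from being silently treated as a score of 0.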
