# 用于增强 RAG 系统的查询转换

通过三种查询转换技术，在不依赖 LangChain 等专门库的情况下，提升 RAG 系统中的检索性能。通过修改用户查询，我们可以显著提高检索信息的关联性和全面性。

## 关键转换方法

1.  **查询重写**：使查询更具体、更详细，以提高搜索精度。
2.  **退后提示**：生成更广泛的查询，以检索有用的上下文信息。
3.  **子查询分解**：将复杂查询分解为更简单的组件，以实现全面检索。

导入必要的库

In [1]:
import pymupdf
import os
import numpy as np
import json
import openai
from tqdm import tqdm
import re

配置client

In [2]:
client = openai.OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # 如果您没有配置环境变量，请在此处用您的API Key进行替换
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # 百炼服务的base_url
)

请求转换方法

In [3]:
# 方法一：请求重写
def rewrite_query(original_query, model="qwen-plus"):
    """
    重写用户查询以提高搜索结果的准确性。
    参数：
    original_query (str)：用户的原始查询。
    model (str)：用于重写查询的模型名称。

    返回：
    str：重写后的查询。
    """
    system_prompt = "You are an AI assistant specialized in improving search queries. Your task is to rewrite user queries to be more specific, detailed, and likely to retrieve relevant information."
    
    user_prompt = f"""
    Rewrite the following query to make it more specific and detailed. Include relevant terms and concepts that might help in retrieving accurate information.
    
    Original query: {original_query}
    
    Rewritten query:
    """
    
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    return response.choices[0].message.content.strip()

In [4]:
# 方法二：后退提示词
def generate_step_back_query(original_query, model="qwen-plus"):
    """
    生成后退提示词
    参数：
    original_query (str): 原始用户查询
    model (str): 用于生成后退提示词的模型

    返回：
    str: 后退提示词
    """
    
    system_prompt = "You are an AI assistant specialized in search strategies. Your task is to generate broader, more general versions of specific queries to retrieve relevant background information."
    
    
    user_prompt = f"""
    Generate a broader, more general version of the following query that could help retrieve useful background information.
    
    Original query: {original_query}
    
    Step-back query:
    """
    
    
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,  
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    return response.choices[0].message.content.strip()

In [5]:
# 方法三：分解为子查询

def decompose_query(original_query, num_subqueries=4, model="qwen-plus"):
    """
    分解复杂查询为子查询
    参数：
    original_query (str): 原始查询
    num_subqueries (int): 生成的子查询数量
    model (str): 用于生成子查询的模型

    返回：
    list: 生成的子查询列表
    """

    system_prompt = "You are an AI assistant specialized in breaking down complex questions. Your task is to decompose complex queries into simpler sub-questions that, when answered together, address the original query."
    
    user_prompt = f"""
    Break down the following complex query into {num_subqueries} simpler sub-queries. Each sub-query should focus on a different aspect of the original question.
    
    Original query: {original_query}
    
    Generate {num_subqueries} sub-queries, one per line, in this format:
    1. [First sub-query]
    2. [Second sub-query]
    And so on...
    """


    response = client.chat.completions.create(
        model=model,
        temperature=0.2,  
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    

    content = response.choices[0].message.content.strip()
    
    lines = content.split("\n")
    sub_queries = []
    
    for line in lines:
        if line.strip() and any(line.strip().startswith(f"{i}.") for i in range(1, 10)):

            query = line.strip()
            query = query[query.find(".")+1:].strip()
            sub_queries.append(query)
    
    return sub_queries

使用查询转换

In [6]:

original_query = "What are the impacts of AI on job automation and employment?"


print("Original Query:", original_query)


rewritten_query = rewrite_query(original_query)
print("\n1. Rewritten Query:")
print(rewritten_query)


step_back_query = generate_step_back_query(original_query)
print("\n2. Step-back Query:")
print(step_back_query)


sub_queries = decompose_query(original_query, num_subqueries=4)
print("\n3. Sub-queries:")
for i, query in enumerate(sub_queries, 1):
    print(f"   {i}. {query}")

Original Query: What are the impacts of AI on job automation and employment?

1. Rewritten Query:
What are the specific impacts of artificial intelligence (AI) on job automation across various industries, and how does it affect employment rates, job displacement, skill requirements, and workforce adaptation? Additionally, what roles do machine learning, natural language processing, and robotics play in accelerating or mitigating these effects, and what strategies are being implemented to address potential economic and social challenges?

2. Step-back Query:
What are the economic, social, and technological implications of automation and artificial intelligence on workforce dynamics and employment trends?

3. Sub-queries:
   1. How does AI contribute to job automation in various industries?
   2. Which types of jobs are most at risk due to AI-driven automation?
   3. How might AI create new employment opportunities or roles?
   4. What are the societal and economic implications of AI on 

简易向量库

In [7]:
class SimpleVectorStore:
    """
    简易的向量存储库。
    """
    def __init__(self):
        
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        """
        添加一个新的项到存储库。

        参数:
        text (str): 文本内容。
        embedding (List[float]): 文本的嵌入向量。
        metadata (Dict, optional): 与文本相关的元数据。
        """
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
    def similarity_search(self, query_embedding, k=5):
        """
        查找与查询嵌入向量最相似的文本。

        参数:
        query_embedding (List[float]): 查询的嵌入向量。
        k (int, optional): 返回最相似的k个结果。

        返回:
        List[Dict]: 最相似的文本及其相关信息。
        """
        if not self.vectors:
            return []
        

        query_vector = np.array(query_embedding)
        

        similarities = []
        for i, vector in enumerate(self.vectors):
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))
        

        similarities.sort(key=lambda x: x[1], reverse=True)
        

        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata[idx],
                "similarity": score
            })
        
        return results

向量化

In [19]:
def create_embeddings_in_batches(text_chunks, model="text-embedding-v3", batch_size_limit=10): # 我改成了官方模型名，你可以换回 "text-embedding-v3"
    """
    调用 OpenAI 的 Embedding API 来创建文本列表的嵌入向量，处理批处理大小限制。

    参数:
    text_chunks (List[str]): 需要创建嵌入的文本字符串列表。
    model (str): 使用的嵌入模型。
    batch_size_limit (int): API 允许的最大批处理大小。根据错误信息，这里是10。

    返回:
    List[List[float]]: 所有文本的嵌入向量列表。
    """
    all_embeddings = []
    if not text_chunks:
        return []

    if not isinstance(text_chunks, list): # 确保输入是列表
        text_chunks = [text_chunks]

    for i in range(0, len(text_chunks), batch_size_limit):
        batch = text_chunks[i:i + batch_size_limit]
        try:
            #print(f"Processing batch {i//batch_size_limit + 1}, size: {len(batch)}")
            response = client.embeddings.create(
                input=batch,
                model=model,
                encoding_format="float"
            )
            # 从响应中提取该批次的嵌入向量
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)


        except Exception as e:
            print(f"Error processing batch starting with chunk: '{batch[0][:50]}...'")
            print(f"API Error: {e}")

            raise e 

    return all_embeddings

def create_embeddings(text, model="text-embedding-v3"):
    """
    字符串向量化
    参数:
    text (str): 需要创建嵌入的文本字符串。
    model (str): 使用的嵌入模型。

    返回:
    List[float]: 文本的嵌入向量。
    """
    response = client.embeddings.create(
        model=model,
        input=text
    )

    return response.data[0].embedding

从pdf提取文本

In [9]:
def extract_text_from_pdf(pdf_path):
    """
    提取PDF文件中的文本并打印前`num_chars`个字符。

    参数：
    pdf_path (str): PDF文件的路径。

    返回：
    str: 从PDF中提取的文本。

    """
    # 打开PDF文件
    mypdf = pymupdf.open(pdf_path)
    all_text = ""  # 初始化一个空字符串来存储提取的文本

    # 迭代PDF中的每个页面
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # 获取页面
        text = page.get_text("text")  # 从页面中提取文本
        all_text += text  # 将提取的文本附加到all_text字符串

    return all_text  # 返回提取的文本

pdf_path = "data/AI_Information.pdf"


extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


分块

In [10]:
def chunk_text(text, n, overlap):
    """
    将文本分割为多个块，每个块的大小为n，重叠部分为overlap。
    参数：
    text: 输入的文本
    n: 每个块的大小
    overlap: 相邻块之间的重叠部分大小

    返回：
    文本块列表
    """
    chunks = []  
    for i in range(0, len(text), n - overlap):
        
        chunks.append(text[i:i + n])
    
    return chunks  

向量化并存储文本

In [None]:
def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    处理文档以进行 RAG。

    参数：
    pdf_path (str): PDF 文件的路径。
    chunk_size (int): 每个块的字符数。
    chunk_overlap (int): 块之间的字符重叠。

    返回：
    SimpleVectorStore: 包含文档块及其嵌入的向量存储。
    """
    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)
    
    print("Chunking text...")
    chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Created {len(chunks)} text chunks")
    
    print("Creating embeddings for chunks...")
    
    chunk_embeddings = create_embeddings_in_batches(chunks)
    
    
    store = SimpleVectorStore()
    
    
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        store.add_item(
            text=chunk,
            embedding=embedding,
            metadata={"index": i, "source": pdf_path}
        )
    
    print(f"Added {len(chunks)} chunks to the vector store")
    return store

使用查询变换进行检索

In [None]:
def transformed_search(query, vector_store, transformation_type, top_k=3):
    """
    查询转换后的query
    """
    print(f"Transformation type: {transformation_type}")
    print(f"Original query: {query}")
    
    results = []
    
    if transformation_type == "rewrite":
        
        transformed_query = rewrite_query(query)
        print(f"Rewritten query: {transformed_query}")
        
        query_embedding = create_embeddings(transformed_query)
        
        results = vector_store.similarity_search(query_embedding, k=top_k)
        
    elif transformation_type == "step_back":
        
        transformed_query = generate_step_back_query(query)
        print(f"Step-back query: {transformed_query}")
    
        query_embedding = create_embeddings(transformed_query)
        
        results = vector_store.similarity_search(query_embedding, k=top_k)
        
    elif transformation_type == "decompose":
        
        sub_queries = decompose_query(query)
        print("Decomposed into sub-queries:")
        for i, sub_q in enumerate(sub_queries, 1):
            print(f"{i}. {sub_q}")
        
        print("sub_queries:\n", sub_queries)
        sub_query_embeddings = create_embeddings(sub_queries)
        print("sub_query_embeddings:\n", sub_query_embeddings)
        
        all_results = []
        for i, embedding in enumerate(sub_query_embeddings):
            print(f"Sub-query {i+1} embedding: {embedding}")
            sub_results = vector_store.similarity_search(embedding, k=2) 
            all_results.extend(sub_results)
        
        
        seen_texts = {}
        for result in all_results:
            text = result["text"]
            if text not in seen_texts or result["similarity"] > seen_texts[text]["similarity"]:
                seen_texts[text] = result
        
        
        results = sorted(seen_texts.values(), key=lambda x: x["similarity"], reverse=True)[:top_k]
        
    else:
        
        query_embedding = create_embeddings(query)
        results = vector_store.similarity_search(query_embedding, k=top_k)
    
    return results

In [13]:
def generate_response(query, context, model="qwen3-4b"):
    
    system_prompt = "You are a helpful AI assistant. Answer the user's question based only on the provided context. If you cannot find the answer in the context, state that you don't have enough information."
    
    user_prompt = f"""
        Context:
        {context}

        Question: {query}

        Please provide a comprehensive answer based only on the context above.
    """
    
    response = client.chat.completions.create(
        model=model,
        temperature=0,  
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
         extra_body={"enable_thinking": False}
    )

    return response.choices[0].message.content.strip()

完整RAG流程

In [14]:
def rag_with_query_transformation(pdf_path, query, transformation_type=None):
    
    vector_store = process_document(pdf_path)
    
    if transformation_type:
        results = transformed_search(query, vector_store, transformation_type)
    else:

        query_embedding = create_embeddings(query)
        results = vector_store.similarity_search(query_embedding, k=3)

    context = "\n\n".join([f"PASSAGE {i+1}:\n{result['text']}" for i, result in enumerate(results)])
    
    response = generate_response(query, context)
    
    return {
        "original_query": query,
        "transformation_type": transformation_type,
        "context": context,
        "response": response
    }

评估

In [15]:

def compare_responses(results, reference_answer, model="qwen-plus"):
    """
    比较不同查询转换技术的响应并评估它们的性能。

    参数：
    results (dict): 包含不同查询转换技术的响应的字典。
    reference_answer (str): 参考答案。
    model (str): 用于生成比较结果的模型。

    返回：
    None
    """

    system_prompt = """You are an expert evaluator of RAG systems. 
    Your task is to compare different responses generated using various query transformation techniques 
    and determine which technique produced the best response compared to the reference answer."""
    

    comparison_text = f"""Reference Answer: {reference_answer}\n\n"""
    
    for technique, result in results.items():
        comparison_text += f"{technique.capitalize()} Query Response:\n{result['response']}\n\n"
    

    user_prompt = f"""
    {comparison_text}
    
    Compare the responses generated by different query transformation techniques to the reference answer.
    
    For each technique (original, rewrite, step_back, decompose):
    1. Score the response from 1-10 based on accuracy, completeness, and relevance
    2. Identify strengths and weaknesses
    
    Then rank the techniques from best to worst and explain which technique performed best overall and why.
    """
    

    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    

    print("\n===== EVALUATION RESULTS =====")
    print(response.choices[0].message.content)
    print("=============================")

In [16]:
def evaluate_transformations(pdf_path, query, reference_answer=None):
    """
    评估不同的查询转换技术。

    参数：
        pdf_path (str): PDF文档的路径
        query (str): 要评估的查询
        reference_answer (str): 可选的参考答案用于比较

    返回：
        Dict: 评估结果
    
    """

    transformation_types = [None, "rewrite", "step_back", "decompose"]
    results = {}
    

    for transformation_type in transformation_types:
        type_name = transformation_type if transformation_type else "original"
        print(f"\n===== Running RAG with {type_name} query =====")
        

        result = rag_with_query_transformation(pdf_path, query, transformation_type)
        results[type_name] = result
        

        print(f"Response with {type_name} query:")
        print(result["response"])
        print("=" * 50)
    

    if reference_answer:
        compare_responses(results, reference_answer)
    
    return results

In [25]:
with open('data/val.json') as f:
    data = json.load(f)

# Extract the first query from the validation data
query = data[0]['question']

# Extract the reference answer from the validation data
reference_answer = data[0]['ideal_answer']

# pdf_path
pdf_path = "data/AI_Information.pdf"

# Run evaluation
evaluation_results = evaluate_transformations(pdf_path, query, reference_answer)


===== Running RAG with original query =====
Extracting text from PDF...
Chunking text...
Created 42 text chunks
Creating embeddings for chunks...
Added 42 chunks to the vector store
Response with original query:
Explainable AI (XAI) is a set of techniques designed to make AI decision-making processes more understandable and transparent. The goal of XAI is to provide insights into how AI models arrive at their decisions, which helps users assess the reliability, fairness, and accuracy of these decisions. This transparency is crucial for building trust in AI systems, as it allows users to understand the rationale behind AI outputs and verify that they are making fair and accurate judgments.

XAI is considered important for several reasons. Firstly, it enhances accountability and responsibility by making AI systems more transparent, which is essential for addressing potential harms and ensuring ethical behavior. Secondly, it supports the principles of fairness and accuracy by enabling us

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()