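# Reranking for Enhanced RAG Systems

This notebook implements reranking techniques to improve retrieval quality in RAG systems. Reranking acts as a second filtering step after the initial retrieval, so that only the most relevant content is used to generate the response.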

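## Key Concepts of Reranking

1.  **Initial retrieval**: a first pass using basic similarity search (faster, but less precise).
2.  **Document scoring**: evaluate how relevant each retrieved document is to the query.
3.  **Reranking**: sort the documents by their relevance scores.
4.  **Selection**: use only the most relevant documents for response generation.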
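
The four steps can be sketched end to end on toy data. This is a minimal illustration, not the notebook's implementation: the two-dimensional vectors, the `retrieve_then_rerank` helper, and the word-overlap scorer are all hypothetical stand-ins for real embeddings and a real relevance model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_then_rerank(query_vec, doc_vecs, docs, score_fn, k=3, top_n=2):
    # 1. Initial retrieval: rank by cosine similarity and keep the k best.
    sims = [cosine(query_vec, v) for v in doc_vecs]
    candidates = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:k]
    # 2. Document scoring: run the slower scorer on the candidates only.
    scored = [(i, score_fn(docs[i])) for i in candidates]
    # 3. Reranking: sort the candidates by the new score.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # 4. Selection: keep only the top_n documents for generation.
    return [docs[i] for i, _ in scored[:top_n]]

docs = ["cats purr", "AI transforms work", "AI in art", "stock prices"]
doc_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.8, 0.6], [0.0, 1.0]])
query_vec = np.array([1.0, 1.0])

# Hypothetical scorer: number of query words that appear in the document.
query_words = {"ai", "work"}
score_fn = lambda text: len(query_words & set(text.lower().split()))

selected = retrieve_then_rerank(query_vec, doc_vecs, docs, score_fn)
print(selected)
```

The expensive scorer only ever sees `k` candidates, which is why a cheap first pass followed by a precise second pass scales better than scoring every document.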

Import the necessary libraries

In [1]:
import pymupdf
import os
import numpy as np
import json
import openai
from tqdm import tqdm
import re

Extract text from the PDF

In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: The text extracted from the PDF.

    """
    # Open the PDF file
    mypdf = pymupdf.open(pdf_path)
    all_text = ""  # Accumulates the extracted text

    # Iterate over every page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract the page's text
        all_text += text  # Append it to all_text

    return all_text  # Return the extracted text

pdf_path = "data/AI_Information.pdf"


extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


Chunking

In [3]:
def chunk_text(text, n, overlap):
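    """
    Split text into chunks of n characters, with `overlap` characters shared by adjacent chunks.

    Args:
    text: The input text.
    n: Size of each chunk, in characters.
    overlap: Number of characters shared by adjacent chunks.

    Returns:
    A list of text chunks.
    """
    if not 0 <= overlap < n:
        # A step of n - overlap <= 0 would make range() fail or produce no chunks
        raise ValueError("overlap must satisfy 0 <= overlap < n")

    chunks = []
    # Advance by n - overlap so consecutive chunks share `overlap` characters
    for i in range(0, len(text), n - overlap):
        chunks.append(text[i:i + n])

    return chunks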
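
To make the stride concrete, here is a small sketch of the same sliding-window logic on a ten-character string. The helper is re-implemented under the hypothetical name `sliding_chunks` so the snippet runs by itself without touching `chunk_text`.

```python
def sliding_chunks(text, n, overlap):
    """Same windowing as chunk_text: windows of n characters, stepping by n - overlap."""
    return [text[i:i + n] for i in range(0, len(text), n - overlap)]

sample = "abcdefghij"  # 10 characters
chunks = sliding_chunks(sample, n=4, overlap=2)
print(chunks)

# Each window starts n - overlap = 2 characters after the previous one,
# so neighbouring chunks share 2 characters.
starts = list(range(0, len(sample), 4 - 2))
print(starts)
```

Note that the final window can be shorter than `n` and may lie entirely inside the previous chunk, as happens with the trailing `'ij'` here.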

Configure the client

In [4]:
client = openai.OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If the environment variable is not set, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Bailian (DashScope) service
)

A simple vector store

In [5]:
class SimpleVectorStore:
    """
    A simple in-memory vector store.
    """
    def __init__(self):
        
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        """
        Add a new item to the store.

        Args:
        text (str): The text content.
        embedding (List[float]): The embedding vector of the text.
        metadata (Dict, optional): Metadata associated with the text.
        """
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
    def similarity_search(self, query_embedding, k=5):
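        """
        Find the texts most similar to the query embedding.

        Args:
        query_embedding (List[float]): The embedding vector of the query.
        k (int, optional): Number of most similar results to return.

        Returns:
        List[Dict]: The most similar texts with their metadata and similarity scores.
        """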
        if not self.vectors:
            return []
        

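        # Convert the query embedding to a NumPy array
        query_vector = np.array(query_embedding)

        # Cosine similarity between the query and every stored vector
        similarities = []
        for i, vector in enumerate(self.vectors):
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))

        # Sort by similarity, highest first
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Collect the top-k results
        results = []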
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata[idx],
                "similarity": score
            })
        
        return results
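
The per-vector loop in `similarity_search` is easy to follow, but the same cosine ranking can also be computed in one matrix product. The sketch below is an assumed alternative, not part of the notebook; `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np

def top_k_cosine(query_embedding, vectors, k=5):
    """Return (index, similarity) pairs for the k rows of `vectors` closest to the query."""
    matrix = np.asarray(vectors, dtype=float)         # shape (num_items, dim)
    query = np.asarray(query_embedding, dtype=float)  # shape (dim,)
    # Cosine similarity of every row against the query in a single product.
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]                     # indices, highest similarity first
    return [(int(i), float(sims[i])) for i in order]

vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
results = top_k_cosine([1.0, 0.0], vectors, k=2)
print(results)
```

For the few dozen chunks in this notebook the loop is fast enough; the vectorized form mainly matters once the store holds thousands of vectors.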

Embedding


In [6]:
def create_embeddings_in_batches(text_chunks, model="text-embedding-v3", batch_size_limit=10):
    """
    Create embeddings for a list of texts through the OpenAI-compatible Embedding API, respecting the batch size limit.

    Args:
    text_chunks (List[str]): The text strings to embed.
    model (str): The embedding model to use.
    batch_size_limit (int): Maximum batch size accepted by the API; 10 here, based on the error the API returned.

    Returns:
    List[List[float]]: Embedding vectors for all texts.
    """
    all_embeddings = []
    if not text_chunks:
        return []

    if not isinstance(text_chunks, list):  # Make sure the input is a list
        text_chunks = [text_chunks]

    for i in range(0, len(text_chunks), batch_size_limit):
        batch = text_chunks[i:i + batch_size_limit]
        try:
            #print(f"Processing batch {i//batch_size_limit + 1}, size: {len(batch)}")
            response = client.embeddings.create(
                input=batch,
                model=model,
                encoding_format="float"
            )
            # Extract this batch's embedding vectors from the response
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)


        except Exception as e:
            print(f"Error processing batch starting with chunk: '{batch[0][:50]}...'")
            print(f"API Error: {e}")

            raise

    return all_embeddings

def create_embeddings(text, model="text-embedding-v3"):
    """
    Create an embedding for a single string.

    Args:
    text (str): The text string to embed.
    model (str): The embedding model to use.

    Returns:
    List[float]: The embedding vector of the text.
    """
    response = client.embeddings.create(
        model=model,
        input=text
    )

    return response.data[0].embedding
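
The batching in `create_embeddings_in_batches` can be checked without calling the API. In this sketch `fake_embed` is a hypothetical stand-in for `client.embeddings.create`, and `batched` is an illustrative helper, so only the slicing logic is exercised.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(batch):
    # Hypothetical embedder: returns one 2-D vector per input text.
    return [[float(len(text)), 1.0] for text in batch]

texts = [f"chunk {i}" for i in range(23)]
batch_sizes = []
embeddings = []
for batch in batched(texts, 10):
    batch_sizes.append(len(batch))
    embeddings.extend(fake_embed(batch))

print(batch_sizes, len(embeddings))
```

With 23 texts and a limit of 10, the last batch carries the remaining 3 items, and the embeddings come back in the same order as the inputs, which is what lets `process_document` pair them with `zip`.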

Document processing pipeline

In [7]:
def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    Process a PDF for RAG: extract its text, chunk it, embed the chunks, and store them.
    """

    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)
    

    print("Chunking text...")
    chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Created {len(chunks)} text chunks")
    

    print("Creating embeddings for chunks...")
    chunk_embeddings = create_embeddings_in_batches(chunks)
    
    # Initialize a simple vector store
    store = SimpleVectorStore()
    
    # Add each chunk and its embedding to the vector store
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        store.add_item(
            text=chunk,
            embedding=embedding,
            metadata={"index": i, "source": pdf_path}
        )
    
    print(f"Added {len(chunks)} chunks to the vector store")
    return store

Reranking with an LLM

In [8]:
def rerank_with_llm(query, results, top_n=3, model="qwen-turbo"):
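    """
    Rerank search results using an LLM.

    Args:
    query (str): The user's query.
    results (List[Dict]): Search results containing document text, metadata, and similarity.
    top_n (int): Number of reranked results to return.
    model (str): The LLM to use for scoring.

    Returns:
    List[Dict]: The reranked search results.
    """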
    print(f"Reranking {len(results)} documents...")  
    
    scored_results = []  
    
    
    system_prompt = """You are an expert at evaluating document relevance for search queries.
Your task is to rate documents on a scale from 0 to 10 based on how well they answer the given query.

Guidelines:
- Score 0-2: Document is completely irrelevant
- Score 3-5: Document has some relevant information but doesn't directly answer the query
- Score 6-8: Document is relevant and partially answers the query
- Score 9-10: Document is highly relevant and directly answers the query

You MUST respond with ONLY a single integer score between 0 and 10. Do not include ANY other text."""
    
    
    for i, result in enumerate(results):
        
        if i % 5 == 0:
            print(f"Scoring document {i+1}/{len(results)}...")
        
        
        user_prompt = f"""Query: {query}

Document:
{result['text']}

Rate this document's relevance to the query on a scale from 0 to 10:"""
        
        
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        )
        
        
        score_text = response.choices[0].message.content.strip()
        
        
        score_match = re.search(r'\b(10|[0-9])\b', score_text)
        if score_match:
            score = float(score_match.group(1))
        else:
            
            print(f"Warning: Could not extract score from response: '{score_text}', using similarity score instead")
            score = result["similarity"] * 10
        
        
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result["similarity"],
            "relevance_score": score
        })
    
    
    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)
    
    
    return reranked_results[:top_n]
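
The score extraction step is worth testing in isolation, since the model does not always reply with a bare integer. The sketch below reuses the regular expression from `rerank_with_llm`; the `parse_score` helper and the sample replies are hypothetical.

```python
import re

# Same pattern as in rerank_with_llm: a standalone integer from 0 to 10.
SCORE_PATTERN = re.compile(r'\b(10|[0-9])\b')

def parse_score(score_text, fallback):
    """Return the first standalone integer 0-10 in `score_text`, else `fallback`."""
    match = SCORE_PATTERN.search(score_text)
    return float(match.group(1)) if match else fallback

replies = ["8", "Score: 10", "I would rate this 7 out of 10.", "not sure", "100"]
for reply in replies:
    print(repr(reply), "->", parse_score(reply, fallback=-1.0))

# The first match wins, so a reply that restates the scale is misread.
print(parse_score("On a scale of 0 to 10, this is a 9", fallback=-1.0))
```

Because `re.search` returns the first match, a verbose reply that repeats the scale before giving its rating is parsed as the scale's lower bound. The strict "ONLY a single integer" instruction in the system prompt is what keeps this pattern safe in practice.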

Simple keyword-based reranking

In [9]:
def rerank_with_keywords(query, results, top_n=3):
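    """
    A simple reranker that uses keyword matching to promote documents related to the query.
    Each document receives a score; the higher the score, the more likely the document is relevant.
    The top_n highest-scoring documents are returned.

    Args:
    query (str): The user's query.
    results (List[dict]): Search results containing document text, metadata, and similarity.
    top_n (int): Number of results to return.

    Returns:
    List[dict]: The documents most relevant to the query.
    """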
    
    keywords = [word.lower() for word in query.split() if len(word) > 3]
    
    scored_results = []  
    
    for result in results:
        document_text = result["text"].lower()  
        
        base_score = result["similarity"] * 0.5
        
        keyword_score = 0
        for keyword in keywords:
            if keyword in document_text:
                
                keyword_score += 0.1
                
                first_position = document_text.find(keyword)
                if first_position < len(document_text) / 4:  
                    keyword_score += 0.1
                
                
                frequency = document_text.count(keyword)
                keyword_score += min(0.05 * frequency, 0.2)  
        
        
        final_score = base_score + keyword_score
        
        
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result["similarity"],
            "relevance_score": final_score
        })
    

    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)
    

    return reranked_results[:top_n]
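
A worked example helps show how the three keyword bonuses add up. The sketch below recomputes the score for a single document using the same constants as `rerank_with_keywords`; the `keyword_score` helper, the query, and the document text are made up for illustration.

```python
def keyword_score(query, text, similarity):
    """Recompute the keyword reranker's score for one document."""
    keywords = [w.lower() for w in query.split() if len(w) > 3]
    doc = text.lower()
    score = similarity * 0.5                         # base score from vector similarity
    for kw in keywords:
        if kw in doc:
            score += 0.1                             # keyword is present
            if doc.find(kw) < len(doc) / 4:
                score += 0.1                         # first hit falls in the first quarter
            score += min(0.05 * doc.count(kw), 0.2)  # frequency bonus, capped at 0.2
    return score

query = "How does machine learning work?"
text = "Machine learning lets machine models improve from data."
keywords = [w.lower() for w in query.split() if len(w) > 3]
print(keywords)
print(round(keyword_score(query, text, similarity=0.8), 2))
```

Here the base score is 0.4, "machine" contributes 0.1 + 0.1 + 0.1, and "learning" contributes 0.1 + 0.1 + 0.05, for a total of about 0.95. Splitting on whitespace leaves punctuation attached, so the keyword `work?` would not match a document containing "work", and the substring test also counts partial-word hits.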

Generate the response

In [11]:
def generate_response(query, context, model="qwen3-4b"):
    """
    Generate a response to the query based only on the given context.
    """
    
    system_prompt = "You are a helpful AI assistant. Answer the user's question based only on the provided context. If you cannot find the answer in the context, state that you don't have enough information."
    
    
    user_prompt = f"""
        Context:
        {context}

        Question: {query}

        Please provide a comprehensive answer based only on the context above.
    """
    
    
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        extra_body={"enable_thinking": False}
    )
    
    
    return response.choices[0].message.content

The complete RAG pipeline

In [13]:
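def rag_with_reranking(query, vector_store, reranking_method="llm", top_n=3, model="qwen-turbo"):
    """
    Run the complete RAG pipeline with optional reranking.

    Args:
    query (str): The user's query.
    vector_store (SimpleVectorStore): The vector store to search.
    reranking_method (str): "llm", "keywords", or any other value to skip reranking.
    top_n (int): Number of documents to keep after reranking.
    model (str): The LLM used for response generation; the LLM reranker keeps its own default.

    Returns:
    Dict: The query, reranking method, initial and reranked results, context, and response.
    """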
    query_embedding = create_embeddings(query)
    
    initial_results = vector_store.similarity_search(query_embedding, k=10)
    
    if reranking_method == "llm":
        reranked_results = rerank_with_llm(query, initial_results, top_n=top_n)
    elif reranking_method == "keywords":
        reranked_results = rerank_with_keywords(query, initial_results, top_n=top_n)
    else:
        
        reranked_results = initial_results[:top_n]
    
    context = "\n\n===\n\n".join([result["text"] for result in reranked_results])
    
    response = generate_response(query, context, model)
    
    return {
        "query": query,
        "reranking_method": reranking_method,
        "initial_results": initial_results[:top_n],
        "reranked_results": reranked_results,
        "context": context,
        "response": response
    }
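
One way to see what reranking actually changed is to compare each chunk's position before and after. The sketch below relies on the `metadata["index"]` field that `process_document` attaches to every chunk; the `rank_changes` helper and the toy result lists are hypothetical.

```python
def rank_changes(initial_results, reranked_results):
    """Map each reranked chunk index to (old_rank, new_rank), with ranks starting at 1."""
    old_rank = {r["metadata"]["index"]: pos for pos, r in enumerate(initial_results, 1)}
    return {
        r["metadata"]["index"]: (old_rank.get(r["metadata"]["index"]), new_pos)
        for new_pos, r in enumerate(reranked_results, 1)
    }

# Toy results carrying only the field the helper reads.
initial = [{"metadata": {"index": 4}}, {"metadata": {"index": 7}}, {"metadata": {"index": 1}}]
reranked = [{"metadata": {"index": 7}}, {"metadata": {"index": 9}}, {"metadata": {"index": 4}}]

changes = rank_changes(initial, reranked)
print(changes)
```

An old rank of `None` means the chunk was not in the initial list that was passed in. Because `rag_with_reranking` returns only the first `top_n` initial results, any chunk promoted from positions beyond `top_n` shows up this way.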

Evaluation

In [14]:
with open('data/val.json') as f:
    data = json.load(f)

query = data[0]['question']

reference_answer = data[0]['ideal_answer']

pdf_path = "data/AI_Information.pdf"

In [15]:

vector_store = process_document(pdf_path)


query = "Does AI have the potential to transform the way we live and work?"


print("Comparing retrieval methods...")


print("\n=== STANDARD RETRIEVAL ===")
standard_results = rag_with_reranking(query, vector_store, reranking_method="none")
print(f"\nQuery: {query}")
print(f"\nResponse:\n{standard_results['response']}")


print("\n=== LLM-BASED RERANKING ===")
llm_results = rag_with_reranking(query, vector_store, reranking_method="llm")
print(f"\nQuery: {query}")
print(f"\nResponse:\n{llm_results['response']}")


print("\n=== KEYWORD-BASED RERANKING ===")
keyword_results = rag_with_reranking(query, vector_store, reranking_method="keywords")
print(f"\nQuery: {query}")
print(f"\nResponse:\n{keyword_results['response']}")

Extracting text from PDF...
Chunking text...
Created 42 text chunks
Creating embeddings for chunks...
Added 42 chunks to the vector store
Comparing retrieval methods...

=== STANDARD RETRIEVAL ===

Query: Does AI have the potential to transform the way we live and work?

Response:
Yes, AI has the potential to significantly transform the way we live and work, as indicated by the context provided. The transformative impact of AI spans multiple domains, including business, industry, creative fields, and the future of work.

In business and industry, AI is already driving improvements in operational efficiency, decision-making, and cost reduction. It enables automation of routine tasks, analysis of large datasets to uncover valuable insights, and optimization of processes such as supply chain management and customer relationship management. AI-powered tools like chatbots and recommendation engines enhance customer engagement and satisfaction, while predictive analytics improves demand fore

In [None]:
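def evaluate_reranking(query, standard_results, reranked_results, reference_answer=None):
    """
    Ask an LLM to compare standard retrieval against reranked retrieval.

    Args:
    query (str): The user's query.
    standard_results (Dict): Output of rag_with_reranking without reranking.
    reranked_results (Dict): Output of rag_with_reranking with reranking.
    reference_answer (str, optional): Reference answer to include in the comparison.

    Returns:
    str: The LLM's written evaluation.
    """
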
    system_prompt = """You are an expert evaluator of RAG systems.
    Compare the retrieved contexts and responses from two different retrieval methods.
    Assess which one provides better context and a more accurate, comprehensive answer."""
    
    comparison_text = f"""Query: {query}

    Standard Retrieval Context:
    {standard_results['context'][:1000]}... [truncated]

    Standard Retrieval Answer:
    {standard_results['response']}

    Reranked Retrieval Context:
    {reranked_results['context'][:1000]}... [truncated]

    Reranked Retrieval Answer:
    {reranked_results['response']}"""


    if reference_answer:
        comparison_text += f"""
        
        Reference Answer:
        {reference_answer}"""


    user_prompt = f"""
        {comparison_text}

        Please evaluate which retrieval method provided:
        1. More relevant context
        2. More accurate answer
        3. More comprehensive answer
        4. Better overall performance

        Provide a detailed analysis with specific examples.
        """
    

    response = client.chat.completions.create(
        model="qwen-plus",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    

    return response.choices[0].message.content

In [None]:

evaluation = evaluate_reranking(
    query=query,  
    standard_results=standard_results,  
    reranked_results=llm_results,  
    reference_answer=reference_answer  
)


print("\n=== EVALUATION RESULTS ===")
print(evaluation)


=== EVALUATION RESULTS ===
### Evaluation of the Two Retrieval Methods

#### **1. More Relevant Context**

- **Standard Retrieval Context**:  
  The context provided by the standard retrieval method is broad and focuses on several aspects of AI's impact, including automation in finance, job displacement, reskilling/upskilling, human-AI collaboration, and new job roles. While it touches on many important points, it lacks depth in certain areas, such as ethical considerations and creativity, which are crucial for understanding AI's full transformative potential.

- **Reranked Retrieval Context**:  
  The reranked retrieval context provides more focused information on specific topics like human-AI collaboration, new job roles, ethical considerations, creativity, and innovation. It goes deeper into areas such as AI-generated art, creative tools, and the social impact of AI. This makes it more relevant to the query because it addresses both the practical and philosophical implications of A