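# Introduction to Simple RAG
Retrieval-augmented generation (RAG) is a hybrid approach that combines information retrieval with a generative model. It improves a language model's accuracy and factual correctness by supplying it with external knowledge.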

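The basic steps of Simple RAG are:

1. **Preprocessing**: load and preprocess the text data.
2. **Chunking**: split the data into smaller chunks to improve retrieval performance.
3. **Embedding**: use an embedding model to convert the text chunks into numerical representations.
4. **Indexing**: store the embedding vectors in an index for fast retrieval.
5. **Retrieval**: retrieve the text chunks relevant to the user query.
6. **Response generation**: use a language model to generate a response from the retrieved text.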

The notebook below implements a simple RAG pipeline, evaluates the model's response, and explores possible improvements.

# Import the required libraries

In [11]:
import pymupdf
import os
import numpy as np
import json
import openai

# Preprocessing

Extract the text from the PDF with the PyMuPDF library.

In [12]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: The text extracted from the PDF.
    """
    # Open the PDF file
    mypdf = pymupdf.open(pdf_path)
    all_text = ""  # Accumulates the extracted text

    # Iterate over every page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract the page's plain text
        all_text += text  # Append it to all_text

    mypdf.close()  # Release the file handle
    return all_text  # Return the extracted text



# Chunking

After preprocessing, split the text into chunks for the retrieval and generation steps that follow.

In [13]:
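def chunk_text(text, chunk_size, overlap):
    """
    Split the preprocessed text into chunks of a given size.

    Args:
        text (str): The text to split.
        chunk_size (int): Size of each chunk, in characters.
        overlap (int): Number of characters shared by consecutive chunks.

    Returns:
        list: The text chunks.
    """
    # A non-positive step would make range() fail or yield nothing
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    # Each chunk starts chunk_size - overlap characters after the previous one
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    return chunks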
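
To make the effect of `overlap` concrete, here is a small worked example. It repeats the same slicing logic under the hypothetical name `chunk_text_demo` so that it runs on its own, and it uses a 26-character string rather than the PDF text. With `chunk_size=10` and `overlap=3`, each chunk starts 7 characters after the previous one and repeats the last 3 characters of its predecessor.

```python
def chunk_text_demo(text, chunk_size, overlap):
    # Same sliding-window slicing as chunk_text above (hypothetical standalone copy)
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
demo_chunks = chunk_text_demo(demo_text, chunk_size=10, overlap=3)

for chunk in demo_chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

In general the loop yields `ceil(len(text) / (chunk_size - overlap))` chunks, and the last one can be shorter than `chunk_size`. With a `chunk_size` of 1000 and an `overlap` of 200 the step is 800, so a result of 42 chunks would correspond to a text of between 32,801 and 33,600 characters.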



Extract the text and split it into chunks

In [14]:
pdf_path = "./data/AI_Information.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
text_chunks = chunk_text(extracted_text,1000,200)
print('number of text chunks:',len(text_chunks))
print('first chunk:\n',text_chunks[0])

number of text chunks: 42
first chunk:
 Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and symbo

# Initialize the embedding model

In [15]:
client = openai.OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If the environment variable is not set, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Bailian service
)
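
`os.getenv` returns `None` when the variable is not set, so a missing key may only surface later as a less specific error. A minimal sketch of a fail-fast check follows; the helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and not part of the notebook or of the `openai` package.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Example with a throwaway variable so the sketch runs anywhere
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the client setup, `api_key=require_env("DASHSCOPE_API_KEY")` could then stand in for the plain `os.getenv` call.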

# Generate embeddings
Convert the text chunks into vectors.

In [16]:
def create_embeddings_in_batches(text_chunks, model="text-embedding-v3", batch_size_limit=10):
    """
    Create embeddings for a list of texts through the OpenAI-compatible
    Embedding API, respecting the API's batch size limit.

    Args:
        text_chunks (List[str]): The text strings to embed.
        model (str): The embedding model to use.
        batch_size_limit (int): Maximum batch size the API accepts. It is 10 here, based on the error message the API returned.

    Returns:
        List[List[float]]: The embedding vectors for all texts.
    """
    all_embeddings = []
    if not text_chunks:
        return []

    if not isinstance(text_chunks, list):  # Make sure the input is a list
        text_chunks = [text_chunks]

    for i in range(0, len(text_chunks), batch_size_limit):
        batch = text_chunks[i:i + batch_size_limit]
        try:
            print(f"Processing batch {i//batch_size_limit + 1}, size: {len(batch)}")
            response = client.embeddings.create(
                input=batch,
                model=model,
                encoding_format="float"
            )
            # Extract this batch's embedding vectors from the response
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error processing batch starting with chunk: '{batch[0][:50]}...'")
            print(f"API Error: {e}")
            raise  # Re-raise with the original traceback

    return all_embeddings

response = create_embeddings_in_batches(text_chunks)



Processing batch 1, size: 10
Processing batch 2, size: 10
Processing batch 3, size: 10
Processing batch 4, size: 10
Processing batch 5, size: 2
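
The log above follows from the batching arithmetic: 42 chunks with a limit of 10 per request give four full batches and one batch of 2. The sketch below reproduces just that slicing without calling the API; the generator name `iter_batches` and the placeholder strings are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

placeholder_chunks = [f"chunk {n}" for n in range(42)]  # stand-ins for the 42 text chunks
batch_sizes = [len(batch) for batch in iter_batches(placeholder_chunks, 10)]
print(batch_sizes)  # [10, 10, 10, 10, 2]
```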


# Similarity search
Use cosine similarity to find the text vectors most similar to the query.


In [17]:
def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity of two vectors.
    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
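
For intuition, cosine similarity depends only on the angle between two vectors, not on their lengths. The sketch below recomputes it with the standard library on tiny hand-made vectors (not real embeddings). The name `cosine_similarity_plain` and the choice to return 0.0 for a zero vector are assumptions of this sketch, not behavior of the function above, which would divide by zero and produce `nan` in that case.

```python
import math

def cosine_similarity_plain(vec1, vec2):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm1 * norm2)

same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular vectors score 0, and opposite vectors score -1, regardless of their magnitudes.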

Retrieval

In [20]:
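def retrieve(query, text_chunks, embeddings, model="text-embedding-v3", k=5):
    """
    Retrieve text chunks by the cosine similarity of their vectors.

    Args:
        query: The query text.
        text_chunks: The list of text chunks.
        embeddings: The list of embeddings for the text chunks.
        model: The embedding model used for the query; it should match the one used for the chunks.
        k: The number of text chunks to return.

    Returns:
        The list of the k text chunks most similar to the query.
    """
    # Embed the query
    response = client.embeddings.create(
        input=query,
        model=model,
        encoding_format="float"
    )
    query_embedding = response.data[0].embedding
    similarity_scores = []  # (chunk index, similarity score) pairs
    for i, embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(query_embedding, embedding)
        similarity_scores.append((i, similarity_score))
    # Sort by score, highest first
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Return the chunks behind the k highest scores
    top_k_indices = [score[0] for score in similarity_scores[:k]]
    top_k_chunks = [text_chunks[i] for i in top_k_indices]
    return top_k_chunks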
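
The Python loop in `retrieve` can also be written as one matrix-vector product. The sketch below is an alternative ranking step, not the notebook's implementation: it assumes the embeddings fit in memory as a 2-D NumPy array and that no embedding is a zero vector, and the function name `top_k_indices_by_cosine` is hypothetical. It takes an already computed query embedding, so it runs without calling the API.

```python
import numpy as np

def top_k_indices_by_cosine(query_embedding, embeddings, k=5):
    """Return the indices of the k rows of `embeddings` most similar to the query."""
    matrix = np.asarray(embeddings, dtype=float)      # shape: (num_chunks, dim)
    query = np.asarray(query_embedding, dtype=float)  # shape: (dim,)
    # Cosine similarity of every row with the query in one pass
    scores = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Sort descending by score and keep the first k indices
    return np.argsort(-scores)[:k].tolist()

# Toy 2-D vectors standing in for real embeddings
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.1]
print(top_k_indices_by_cosine(toy_query, toy_embeddings, k=2))  # [0, 2]
```

The returned indices can be used to look up the matching entries of `text_chunks`, exactly as `top_k_indices` is used in `retrieve`.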



Test a single query


In [21]:
with open('data/val.json') as f:
    data = json.load(f)

query = data[0]['question']

top_chunks = retrieve(query, text_chunks, response, k=2)

# Print the query
print("Query:", query)

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:
systems. Explainable AI (XAI) 
techniques aim to make AI decisions more understandable, enabling users to assess their 
fairness and accuracy. 
Privacy and Data Protection 
AI systems often rely on large amounts of data, raising concerns about privacy and data 
protection. Ensuring responsible data handling, implementing privacy-preserving techniques, 
and complying with data protection regulations are crucial. 
Accountability and Responsibility 
Establishing accountability and responsibility for AI systems is essential for addressing potential 
harms and ensuring ethical behavior. This includes defining roles and responsibilities for 
developers, deployers, and users of AI systems. 
Chapter 20: Building Trust in AI 
Transparency and Explainability 
Transparency and explainability are key to building trust in AI. Making AI systems understandable 
and providing insights into their decision-making processes he

# Generate a response from the retrieved text

In [22]:
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_prompt, model="qwen3-4b"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        extra_body={"enable_thinking": False}
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
print(ai_response.choices[0].message.content)

Explainable AI (XAI) is a set of techniques aimed at making AI decisions more understandable. It seeks to provide insights into how AI models make decisions, which helps users assess the reliability and fairness of these decisions. XAI is considered important because it enhances trust in AI systems, ensures accountability, and allows users to evaluate the accuracy and fairness of AI decisions. By making AI more transparent, XAI supports responsible data handling and ethical behavior in the development and deployment of AI systems.
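
The prompt assembly above can be wrapped in a small function so the same format is reused for every query. This is a sketch of one way to do it; the name `build_user_prompt` is hypothetical, and the layout simply mirrors the f-strings in the cell above.

```python
SEPARATOR = "====================================="

def build_user_prompt(chunks, query):
    """Number each retrieved chunk as a context block, then append the question."""
    context_blocks = [
        f"Context {i + 1}:\n{chunk}\n{SEPARATOR}\n"
        for i, chunk in enumerate(chunks)
    ]
    return "\n".join(context_blocks) + f"\nQuestion: {query}"

# Placeholder chunks and question, not taken from the PDF
demo_prompt = build_user_prompt(["first chunk", "second chunk"], "What is XAI?")
print(demo_prompt)
```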


# Evaluate the AI response
Compare the AI response with the expected answer and assign a score.

In [23]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

1.0

The AI response is very close to the true response. Both definitions accurately describe Explainable AI (XAI) as aiming to make AI systems more transparent and understandable, providing insights into how they make decisions. Both responses highlight the importance of XAI for building trust, accountability, and ensuring fairness in AI systems. The AI response includes additional context about reliability, ethical behavior, and responsible data handling, which complement the true response rather than contradict it. The core message and key points are essentially identical.
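
The evaluator replies in free text, here with the score on the first line followed by a justification. To aggregate scores over many questions, the number has to be pulled out of that text. The sketch below is one hedged way to do so: the name `parse_score` is hypothetical, and it assumes the first standalone 0, 0.5, or 1 in the reply is the score, a format the model is not guaranteed to follow.

```python
import re

# Matches 0, 0.0, 0.5, 1 or 1.0 when not part of a longer number
SCORE_PATTERN = re.compile(r"(?<![\d.])(?:0(?:\.[05])?|1(?:\.0)?)(?!\.?\d)")

def parse_score(evaluation_text):
    """Return the first 0 / 0.5 / 1 score found in the text, or None if there is none."""
    match = SCORE_PATTERN.search(evaluation_text)
    return float(match.group()) if match else None

print(parse_score("1.0\n\nThe AI response is very close to the true response."))
print(parse_score("Score: 0.5 because the answer is partially aligned."))
print(parse_score("No numeric score given."))
```

Returning `None` instead of a default score keeps unparseable replies visible, so they can be counted or re-evaluated rather than silently treated as 0.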
