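# Simple RAG Overview

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines information retrieval with a generative model. It improves a language model's accuracy and factual correctness by supplying it with external knowledge.

A simple RAG setup follows these steps:

1. **Data ingestion**: Load and preprocess the text data.
2. **Chunking**: Split the data into smaller chunks to improve retrieval performance.
3. **Embedding creation**: Convert the text chunks into numerical representations with an embedding model.
4. **Semantic search**: Retrieve the chunks most relevant to the user's query.
5. **Response generation**: Use a language model to generate a response based on the retrieved text.

This notebook implements a simple RAG approach, evaluates the model's response, and explores various improvements.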

## Setting Up the Environment
We start by importing the necessary libraries.

In [1]:
import fitz  # PyMuPDF
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File
To implement RAG, we first need a source of text data. In this example, we use the PyMuPDF library to extract text from a PDF file.


In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: The text extracted from the PDF file.
    """

    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Accumulates the extracted text

    # Iterate over every page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the current page
        text = page.get_text("text")  # Extract the page's text
        all_text += text  # Append it to all_text

    return all_text  # Return the extracted text

## Chunking the Extracted Text
Once the text is extracted, we split it into smaller, overlapping chunks to improve retrieval accuracy.

In [3]:
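def chunk_text(text, n, overlap):
    """
    Split the given text into chunks of n characters with overlap.

    Args:
    text (str): The text to split.
    n (int): Number of characters per chunk.
    overlap (int): Number of characters shared by consecutive chunks; must be smaller than n.

    Returns:
    List[str]: The list of text chunks.
    """
    # Guard against a zero or negative stride, which would fail or return nothing
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")

    chunks = []  # Holds the resulting chunks

    # Step through the text in strides of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append the slice from index i to i + n
        chunks.append(text[i:i + n])

    return chunks  # Return the list of chunks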
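
To make the sliding window concrete, here is a small sketch on a toy string. The 10-character `sample` is an invented input (not taken from the PDF), and the function is restated compactly so the block runs on its own. Each chunk starts `n - overlap` characters after the previous one, so the chunk count works out to `ceil(len(text) / (n - overlap))`.

```python
import math

def chunk_text(text, n, overlap):
    """Split text into n-character chunks; consecutive chunks share `overlap` characters."""
    step = n - overlap  # distance between consecutive chunk starts
    return [text[i:i + n] for i in range(0, len(text), step)]

sample = "abcdefghij"  # toy input for illustration only
chunks = chunk_text(sample, 4, 2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']

# A new chunk starts every (n - overlap) characters
expected_count = math.ceil(len(sample) / (4 - 2))
print(len(chunks), expected_count)  # 5 5
```

Note that the final chunk (`'ij'`) is shorter than `n` and is already fully contained in the chunk before it; with this stride-based approach a short trailing chunk like that is expected.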

## Setting Up the LLM Client
Initialize the LLM client used to generate embeddings and responses.

In [None]:
# We use the model service provided by SiliconFlow; register an account and request an API key first. SiliconFlow website: https://www.siliconflow.cn/
client = OpenAI(
    base_url="https://api.siliconflow.cn/v1/",
    api_key="************"  # Replace with your API key
)

## Extracting and Chunking the Text from the PDF
Now we load the PDF, extract its text, and split it into chunks.

In [5]:
# Define the path to the PDF file
pdf_path = "data/AI_Information.pdf"

# Extract the text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Split the text into 1000-character chunks, with 200 characters of overlap between consecutive chunks
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 42

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and 

## Creating Embeddings for the Text Chunks
Embeddings convert text into numerical vectors, which makes efficient similarity search possible.

In [6]:
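def create_embeddings(text, model="BAAI/bge-m3"):
    """
    Create embeddings for the given text with the specified embedding model; here we use a BAAI model.

    Args:
    text (str or List[str]): The input text, or list of texts, to embed.
    model (str): Name of the embedding model. Defaults to "BAAI/bge-m3".

    Returns:
    The API response object, which contains the embeddings in `response.data`.
    """
    # Request embeddings for the input from the specified model
    response = client.embeddings.create(
        model=model,
        input=text
    )

    return response  # Return the response containing the embeddings

# Create embeddings for the text chunks
response = create_embeddings(text_chunks)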

## Semantic Search
We use cosine similarity to find the text chunks most relevant to the user's query.

In [7]:
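def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Dot product of the two vectors divided by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))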
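
As a quick sanity check of the formula, the sketch below evaluates it on three hand-picked 2-D vectors (invented for illustration, not real embeddings). The function is restated so the block runs on its own.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Dot product divided by the product of the two Euclidean norms."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 3.0])

sim_ab = cosine_similarity(a, b)           # 45 degrees apart: 1 / sqrt(2)
sim_ac = cosine_similarity(a, c)           # orthogonal vectors: 0
sim_scaled = cosine_similarity(b, 10 * b)  # same direction, different length: 1

print(round(float(sim_ab), 4))      # 0.7071
print(round(float(sim_ac), 4))      # 0.0
print(round(float(sim_scaled), 4))  # 1.0
```

The score depends only on direction, not magnitude, which is why scaling `b` by 10 leaves it unchanged. One caveat: if either vector is all zeros the denominator is zero and the result is `nan`, so a guard may be worth adding if empty inputs are possible.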

In [8]:
def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Run a semantic search over the text chunks using the given query and embeddings.

    Args:
    query (str): The query to search for.
    text_chunks (List[str]): The text chunks to search over.
    embeddings (list): Embedding objects for the text chunks (e.g. `response.data`), each with an `.embedding` attribute.
    k (int): Number of most relevant chunks to return. Defaults to 5.

    Returns:
    List[str]: The k most relevant text chunks.
    """
    # Create the embedding for the query
    query_embedding = create_embeddings(query).data[0].embedding
    similarity_scores = []  # Stores (index, similarity score) pairs

    # Score the query embedding against every chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Record the index and its score

    # Sort by similarity score in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Take the indices of the k most similar chunks
    top_indices = [index for index, _ in similarity_scores[:k]]
    # Return the most relevant text chunks
    return [text_chunks[index] for index in top_indices]
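
The loop above scores one chunk at a time. As an alternative sketch (not the notebook's implementation), the same ranking can be computed in one matrix operation: once the rows are L2-normalized, a dot product equals cosine similarity. `top_k_indices` and the toy 2-D vectors are hypothetical names and data introduced only for this illustration.

```python
import numpy as np

def top_k_indices(query_vec, chunk_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_matrix, dtype=float)
    # Normalize so that a plain dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q  # one cosine score per chunk
    # argsort is ascending, so negate the scores for descending order
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist(), scores[order].tolist()

# Toy data: three "chunks" with made-up 2-D embeddings
toy_chunks = ["cats", "dogs", "stock prices"]
toy_embeddings = [[1.0, 0.1], [0.8, 0.6], [0.0, 1.0]]
toy_query = [1.0, 0.0]

indices, scores = top_k_indices(toy_query, toy_embeddings, k=2)
print(indices)                           # [0, 1]
print([toy_chunks[i] for i in indices])  # ['cats', 'dogs']
```

With the objects used in this notebook, the matrix would presumably be built as `np.array([item.embedding for item in response.data])`. For a few dozen chunks the difference from the loop is negligible; the vectorized form mainly pays off as the number of chunks grows.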


## Running a Query on the Extracted Chunks

In [9]:
# Load the validation data from a JSON file
with open('data/val.json') as f:
    data = json.load(f)

# Take the first query from the validation data
query = data[0]['question']

# Run semantic search to find the 2 chunks most relevant to the query
top_chunks = semantic_search(query, text_chunks, response.data, k=2)

# Print the query
print("Query:", query)

# Print the 2 most relevant chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:
systems. Explainable AI (XAI) 
techniques aim to make AI decisions more understandable, enabling users to assess their 
fairness and accuracy. 
Privacy and Data Protection 
AI systems often rely on large amounts of data, raising concerns about privacy and data 
protection. Ensuring responsible data handling, implementing privacy-preserving techniques, 
and complying with data protection regulations are crucial. 
Accountability and Responsibility 
Establishing accountability and responsibility for AI systems is essential for addressing potential 
harms and ensuring ethical behavior. This includes defining roles and responsibilities for 
developers, deployers, and users of AI systems. 
Chapter 20: Building Trust in AI 
Transparency and Explainability 
Transparency and explainability are key to building trust in AI. Making AI systems understandable 
and providing insights into their decision-making processes he

## Generating a Response from the Retrieved Chunks

In [10]:
# Define the system prompt for the AI assistant
system_prompt = """
You are an AI assistant that strictly answers based on the given context. 
If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'
"""

def generate_response(system_prompt, user_message, model="THUDM/GLM-4-9B-0414"):
    """
    Generate a response from the AI model based on the system prompt and the user message.

    Args:
    system_prompt (str): The system prompt that guides the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to use. Defaults to "THUDM/GLM-4-9B-0414".

    Returns:
    The chat completion response object from the AI model.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Build the user prompt from the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate the AI response
ai_response = generate_response(system_prompt, user_prompt)
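
The prompt above is assembled inline. If the same layout is needed for many queries, it could be wrapped in a small helper. The sketch below is one possible shape: `build_user_prompt` and `SEPARATOR` are hypothetical names, and the layout only approximates the inline version (numbered context blocks followed by the question).

```python
SEPARATOR = "=" * 37  # visual divider between context blocks (hypothetical constant)

def build_user_prompt(chunks, question):
    """Format retrieved chunks as numbered context blocks, then append the question."""
    blocks = [
        f"Context {i}:\n{chunk}\n{SEPARATOR}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n\n".join(blocks) + f"\n\nQuestion: {question}"

# Toy inputs for illustration only
demo_prompt = build_user_prompt(["alpha text", "beta text"], "What is alpha?")
print(demo_prompt)
```

Keeping the formatting in one function makes it easier to change the context layout later without touching the retrieval or generation code.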

## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.
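
Because the evaluator replies in free text, aggregating results over many queries usually means pulling the numeric score out of that reply. The sketch below shows one hedged way to do it. `extract_score` is a hypothetical helper, and the regular expression assumes the reply contains the word "score" followed closely by 0, 0.5, or 1; replies that phrase the verdict differently would need a different pattern.

```python
import re

VALID_SCORES = {0.0, 0.5, 1.0}  # the three scores the evaluation prompt allows

def extract_score(evaluation_text):
    """Return the first number that closely follows the word 'score', or None."""
    match = re.search(r"score[^0-9]{0,20}(\d+(?:\.\d+)?)", evaluation_text, flags=re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    # Reject numbers outside the rubric instead of guessing
    return value if value in VALID_SCORES else None

print(extract_score("### Evaluation Score: 1"))                 # 1.0
print(extract_score("I assign a score of 0.5 because ..."))     # 0.5
print(extract_score("The answer is wrong, without a verdict"))  # None
```

Returning `None` rather than a default value keeps unparseable replies visible, so they can be reviewed by hand instead of silently skewing an average.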

In [11]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = """
You are an intelligent evaluation system tasked with assessing the AI assistant's responses. 
If the AI assistant's response is very close to the true response, assign a score of 1. 
If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. 
If the response is partially aligned with the true response, assign a score of 0.5. give a detailed explanation of your evaluation."""

# Build the evaluation prompt by combining the user query, the AI response, the true answer, and the evaluation system prompt.
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and the evaluation prompt.
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)


### Evaluation Score: 1

#### Detailed Explanation:

The AI assistant's response is very close to the true response. Here’s a breakdown of the alignment:

1. **Definition of Explainable AI (XAI):**
   - **AI Response:** "Explainable AI (XAI) is a set of techniques aimed at making AI decisions more understandable, enabling users to assess their fairness and accuracy."
   - **True Response:** "Explainable AI (XAI) aims to make AI systems more transparent and understandable, providing insights into how they make decisions."
   
   The AI response accurately captures the essence of XAI by describing it as a set of techniques that enhance understandability and allow users to assess fairness and accuracy. The true response also emphasizes transparency and providing insights into decision-making processes, which aligns well with the AI response.

2. **Importance of XAI:**
   - **AI Response:** "It is considered important because it enhances transparency and explainability, which are key to b