[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/HyPE_Hypothetical_Prompt_Embeddings.ipynb)

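# Hypothetical Prompt Embeddings (HyPE)

## Overview

This notebook implements a Retrieval-Augmented Generation (RAG) system enhanced with Hypothetical Prompt Embeddings (HyPE). Traditional RAG pipelines struggle with the style mismatch between queries and documents; HyPE instead precomputes hypothetical questions at indexing time. This turns retrieval into a question-to-question matching problem and removes the need for expensive query expansion at runtime.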

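## Key Components of the Notebook

1. PDF processing and text extraction
2. Text chunking that keeps coherent units of information together
3. **Hypothetical prompt embedding generation**, using an LLM to create multiple proxy questions for each chunk
4. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings
5. Retriever setup for querying the processed documents
6. Evaluation of the RAG system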

## Method Details

### Document Preprocessing

1. The PDF is loaded with `PyPDFLoader`.
2. The text is split into chunks with `RecursiveCharacterTextSplitter`, using a specified chunk size and overlap.

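### Hypothetical Question Generation

Instead of embedding the raw text chunks, HyPE **generates multiple hypothetical prompts for each chunk**. These **precomputed questions** mimic user queries, which improves alignment with real-world searches. This removes the need for the runtime synthetic-answer generation that techniques such as HyDE require.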

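### Vector Store Creation

1. Each hypothetical question is embedded with OpenAI embeddings.
2. A FAISS vector store is built that **associates each question embedding with its original chunk**.
3. This approach **stores multiple representations per chunk**, which increases retrieval flexibility.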

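### Retriever Setup

1. The retriever is optimized for **question-to-question matching** rather than direct document retrieval.
2. The FAISS index enables **efficient nearest-neighbor** search over the hypothetical prompt embeddings.
3. The retrieved chunks provide **richer and more precise context** for downstream LLM generation.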
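
To make the question-to-question idea concrete, here is a minimal, dependency-free sketch. It is not the notebook's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the proxy questions are written by hand rather than generated by an LLM, and a linear scan replaces the FAISS index. What it does show is the core structure: several question vectors per chunk, each pointing back to its parent chunk.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, drop question marks, split on whitespace."""
    return text.lower().replace("?", "").split()


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: a word-count vector."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocab]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


chunks = {
    "c1": "Greenhouse gases such as carbon dioxide trap heat in the atmosphere.",
    "c2": "Rising sea levels threaten coastal cities and island nations.",
}

# Hand-written proxy questions; in HyPE an LLM generates these at indexing time.
proxy_questions = {
    "c1": [
        "What gases trap heat in the atmosphere?",
        "Why is carbon dioxide a greenhouse gas?",
    ],
    "c2": ["Which places are threatened by rising sea levels?"],
}

vocab = sorted(
    {w for qs in proxy_questions.values() for q in qs for w in tokenize(q)}
)

# One index entry per question, each pointing back to its parent chunk.
index = [
    (toy_embed(q, vocab), chunk_id)
    for chunk_id, qs in proxy_questions.items()
    for q in qs
]


def retrieve(query, k=1):
    """Match the query against stored questions, return the parent chunks."""
    query_vec = toy_embed(query, vocab)
    ranked = sorted(index, key=lambda entry: l2_distance(query_vec, entry[0]))
    return [chunks[chunk_id] for _, chunk_id in ranked[:k]]


result = retrieve("What traps heat in the atmosphere?")
```

The query is compared only with stored questions, never with the chunk text itself; the chunk text is what gets returned.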

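## Key Features

1. **Precomputed hypothetical prompts** – improve query alignment without runtime overhead.
2. **Multi-vector representation** – each chunk is indexed multiple times for broader semantic coverage.
3. **Efficient retrieval** – FAISS provides fast similarity search over the enhanced embeddings.
4. **Modular design** – the pipeline adapts easily to different datasets and retrieval settings. It is also compatible with most optimizations, such as reranking.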

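## Evaluation

HyPE was evaluated on multiple datasets, showing:

- An improvement in retrieval precision of up to 42 percentage points
- An improvement in claim recall of up to 45 percentage points

See the [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335) for the full evaluation results.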

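## Benefits of This Approach

1. **Eliminates query-time generation overhead** – all hypothetical question generation is done offline at indexing time.
2. **Improves retrieval precision** – queries align better with the stored content.
3. **Scalable and efficient** – no additional per-query computation; retrieval is as fast as in standard RAG.
4. **Flexible and extensible** – can be combined with advanced RAG techniques such as reranking.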

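## Conclusion

HyPE offers a scalable and efficient alternative to traditional RAG systems. It overcomes the query-document style mismatch while avoiding the computational cost of runtime query expansion. By moving hypothetical prompt generation to the indexing stage, it substantially improves retrieval precision and efficiency, which makes it a practical solution for real-world applications.

For more details, see the full paper: [preprint](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335)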


<div style="text-align: center;">

<img src="../images/hype.svg" alt="HyPE" style="width:70%; height:auto;">
</div>

# Package Installation and Imports

The cell below installs all packages needed to run this notebook.


In [None]:
# Install required packages
!pip install faiss-cpu futures langchain-community python-dotenv tqdm

In [None]:
# Clone the repository to access helper functions and evaluation modules
!git clone https://github.com/NirDiamant/RAG_TECHNIQUES.git
import sys
sys.path.append('RAG_TECHNIQUES')
# If you need to run with the latest data
# !cp -r RAG_TECHNIQUES/data .

In [63]:
import os
import sys
import faiss
from tqdm import tqdm
from dotenv import load_dotenv
from concurrent.futures import ThreadPoolExecutor, as_completed
from langchain_community.docstore.in_memory import InMemoryDocstore


# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key environment variable (comment out if not using OpenAI)
if not os.getenv('OPENAI_API_KEY'):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")

# Replaces the original path append, for Colab compatibility
from helper_functions import *
from evaluation.evalute_rag import *


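### Define constants

- `PATH`: path to the data to embed into the RAG pipeline

This tutorial uses OpenAI endpoints ([available models](https://platform.openai.com/docs/pricing)).
- `LANGUAGE_MODEL_NAME`: the name of the language model to use.
- `EMBEDDING_MODEL_NAME`: the name of the embedding model to use.

This tutorial uses the `RecursiveCharacterTextSplitter` chunking method, with Python's built-in `len` as the length function. The chunking variables to tune here are:
- `CHUNK_SIZE`: the maximum length of a chunk, in characters
- `CHUNK_OVERLAP`: the overlap between two consecutive chunks, in characters.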
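
To illustrate how the two chunking parameters interact, here is a simplified fixed-window sketch. It is an assumption-laden stand-in, not the LangChain splitter: `sliding_window_chunks` is a hypothetical helper that cuts at exact character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraph and line breaks, so its chunks are often shorter than `CHUNK_SIZE` and its overlaps are not always exactly `CHUNK_OVERLAP`.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

With `chunk_size=10` and `chunk_overlap=4`, each window advances by 6 characters, so the last 4 characters of one chunk reappear at the start of the next.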

In [None]:
# Download required data files
import os
os.makedirs('data', exist_ok=True)

# Download the PDF document used in this notebook
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf


In [64]:
PATH = "data/Understanding_Climate_Change.pdf"
LANGUAGE_MODEL_NAME = "gpt-4o-mini"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

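### Define generation of hypothetical prompt embeddings

The code block below generates hypothetical questions for each text chunk and embeds them for retrieval.

- The LLM extracts key questions from the input chunk.
- These questions are embedded with OpenAI's embedding model.
- The function returns the original chunk together with its prompt embeddings, which are later used for retrieval.

To keep the output clean, extra newlines are removed, and regex parsing can be used to improve list formatting where needed.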

In [65]:
def generate_hypothetical_prompt_embeddings(chunk_text: str):
    """
    Uses the LLM to generate multiple hypothetical questions for a single chunk.
    These questions will be used as 'proxies' for the chunk during retrieval.

    Parameters:
    chunk_text (str): Text contents of the chunk

    Returns:
    chunk_text (str): Text contents of the chunk. This is done to make the 
        multithreading easier
    hypothetical prompt embeddings (List[List[float]]): A list of embedding
        vectors, one per generated question
    """
    llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)
    embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)

    question_gen_prompt = PromptTemplate.from_template(
        "Analyze the input text and generate essential questions that, when answered, \
        capture the main points of the text. Each question should be one line, \
        without numbering or prefixes.\n\n \
        Text:\n{chunk_text}\n\nQuestions:\n"
    )
    question_chain = question_gen_prompt | llm | StrOutputParser()

    # Parse questions from the response.
    # Notes:
    # - gpt-4o tends to separate questions with \n\n, so we collapse that to a single \n
    # - for production, or when using smaller models from ollama, it's beneficial to use
    #   regex to strip (un)ordered list markers, e.g.:
    #   r"^\s*[\-\*\•]|\s*\d+\.\s*|\s*[a-zA-Z]\)\s*|\s*\(\d+\)\s*|\s*\([a-zA-Z]\)\s*|\s*\([ivxlcdm]+\)\s*"
    response = question_chain.invoke({"chunk_text": chunk_text})

    # Drop blank lines so that no empty strings are sent to the embedding model
    questions = [q.strip() for q in response.replace("\n\n", "\n").split("\n") if q.strip()]

    return chunk_text, embedding_model.embed_documents(questions)
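
The regex-based cleanup mentioned above could look roughly like the following sketch. The helper name `parse_questions` and the exact pattern are assumptions for illustration; the pattern is a simplified variant of the one quoted in the code comment and only strips a marker when it is followed by whitespace.

```python
import re

# Leading list markers: bullets (-, *, •), "1." or "1)", "a)", "(1)", "(a)".
# Simplified, hypothetical pattern; adjust it to the output style of your model.
_LIST_MARKER = re.compile(
    r"^\s*(?:[-*•]|\d+[.)]|[a-zA-Z]\)|\(\d+\)|\([a-zA-Z]\))\s+"
)


def parse_questions(raw: str) -> list[str]:
    """Split raw LLM output into clean, non-empty question lines."""
    questions = []
    for line in raw.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        if cleaned:  # skip blank lines
            questions.append(cleaned)
    return questions


raw_output = (
    "1. What causes climate change?\n"
    "\n"
    "- How do greenhouse gases trap heat?\n"
    "  (a) Why are sea levels rising?\n"
)
parsed = parse_questions(raw_output)
print(parsed)
```

Lines without a marker pass through unchanged, so the helper is safe to apply even when the model follows the "no numbering or prefixes" instruction.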


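### Define creation and population of the FAISS vector store

The code block below builds a FAISS vector store by embedding the text chunks in parallel.

What happens?
- Parallel processing – uses threads to generate embeddings faster.
- FAISS initialization – sets up an L2 index for efficient similarity search.
- Chunk embedding – each chunk is stored multiple times, once per generated question embedding.
- In-memory storage – uses InMemoryDocstore for fast lookups.

This keeps retrieval efficient and improves query alignment through the precomputed question embeddings.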

In [66]:
def prepare_vector_store(chunks: List["Document"]):
    """
    Creates and populates a FAISS vector store from a list of document chunks.

    This function processes the chunks in parallel, generating 
    hypothetical prompt embeddings for each chunk.
    The embeddings are stored in a FAISS index for efficient similarity search.

    Parameters:
    chunks (List[Document]): LangChain Document chunks to be embedded and stored.

    Returns:
    FAISS: A FAISS vector store containing the embedded text chunks.
    """

    # Wait with initialization to see vector lengths
    vector_store = None

    with ThreadPoolExecutor() as pool:
        # Use threading to speed up generation of prompt embeddings.
        # Only the text is passed on, so Document metadata does not leak into the LLM prompt.
        futures = [pool.submit(generate_hypothetical_prompt_embeddings, c.page_content) for c in chunks]

        # Process embeddings as they complete
        for f in tqdm(as_completed(futures), total=len(chunks)):

            chunk_text, vectors = f.result()  # Retrieve the chunk text and its embeddings
            if not vectors:
                continue  # No questions were generated for this chunk

            # Initialize the FAISS vector store on the first chunk
            if vector_store is None:
                vector_store = FAISS(
                    embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME),  # Define embedding model
                    index=faiss.IndexFlatL2(len(vectors[0])),  # Define an L2 index for similarity search
                    docstore=InMemoryDocstore(),  # Use in-memory document storage
                    index_to_docstore_id={}  # Maintain index-to-document mapping
                )

            # Pair the chunk's content with each generated embedding vector.
            # Each chunk is inserted multiple times, once for each prompt vector
            chunks_with_embedding_vectors = [(chunk_text, vec) for vec in vectors]

            # Add embeddings to the store
            vector_store.add_embeddings(chunks_with_embedding_vectors)

    return vector_store  # Return the populated vector store
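
The worker function returns the chunk alongside its vectors because `as_completed` yields futures in completion order, not submission order. The standalone sketch below shows that pattern using only the standard library; `fake_embed` is a hypothetical stand-in for the LLM and embedding calls, with a random sleep simulating variable API latency.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_embed(chunk_text: str):
    """Hypothetical stand-in for the LLM + embedding call.

    Returns the input together with its result so the caller can pair them
    without relying on the order in which tasks finish.
    """
    time.sleep(random.uniform(0.0, 0.05))  # simulate variable latency
    vectors = [[float(len(chunk_text))], [float(chunk_text.count(" "))]]
    return chunk_text, vectors


chunks = ["alpha beta", "gamma", "delta epsilon zeta"]
paired = {}

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fake_embed, c) for c in chunks]
    for f in as_completed(futures):  # completion order, not submission order
        chunk_text, vectors = f.result()
        paired[chunk_text] = vectors

print(paired)
```

An alternative is to keep a `dict` that maps each future to its input; returning the input from the worker simply avoids that extra bookkeeping.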


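### Encode the PDF into a FAISS vector store

The code block below processes a PDF file and stores its content as embeddings for retrieval.

What happens?
- PDF loading – extracts the text from the document.
- Chunking – splits the text into overlapping segments to better preserve context.
- Preprocessing – cleans the text to improve embedding quality.
- Vector store creation – generates the embeddings and stores them in FAISS for retrieval.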

In [70]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    vectorstore = prepare_vector_store(cleaned_texts)

    return vectorstore

### Create the HyPE vector store

Now we process the PDF and store its embeddings.
This step initializes the FAISS vector store with the encoded document.

In [71]:
# Chunk size can be quite large with HyPE because we are not losing precision with more
# information. For production, test how exhaustive your model is in generating a sufficient
# number of questions per chunk. This will mostly depend on your information density.
chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

100%|██████████| 97/97 [00:22<00:00,  4.40it/s]


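### Create the retriever

Now we set up the retriever to fetch relevant chunks from the vector store.

It retrieves the top `k=3` matches by query similarity. Because each chunk is indexed once per generated question, several of these matches can point to the same chunk.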

In [79]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 3})

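### Test the retriever

Now we test retrieval with a sample query.

- Query the vector store to find the most relevant chunks.
- Deduplicate the results to remove chunks that may appear more than once.
- Display the retrieved context for inspection.

This step verifies that the retriever returns meaningful and diverse information for a given question.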

In [80]:
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
context = list(set(context))
show_context(context)

Context 1:
Most of these climate changes are attributed to very small variations in Earth's orbit that 
change the amount of solar energy our planet receives. During the Holocene epoch, which 
began at the end of the last ice age, human societies f lourished, but the industrial era has seen 
unprecedented changes.  
Modern Observations  
Modern scientific observations indicate a rapid increase in global temperatures, sea levels, 
and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has 
documented these changes extensively. Ice core samples, tree rings, and ocean sediments 
provide a historical record that scientists use to understand past climate conditions and 
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhou se gases.  
Chapter 2: Causes of Climate Change  
Greenhouse Gases


Context 2:
driven by human activities, particularly the emission of green
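
The test cell deduplicates with `list(set(context))`, which removes repeats but does not keep the retriever's ranking, since sets are unordered. If the ranking matters, one possible alternative is an order-preserving deduplication. The helper name below is hypothetical and the strings are placeholders for retrieved chunk texts.

```python
def dedupe_preserving_order(items):
    """Remove duplicates while keeping the first-seen (best-ranked) order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Placeholder retrieval result: the best-ranked chunk appears twice because
# two of its proxy questions matched the query.
retrieved = ["chunk B", "chunk A", "chunk B"]
unique = dedupe_preserving_order(retrieved)
print(unique)
```

This keeps the best-matching chunk first, which is useful when the context is truncated or when the position in the prompt matters.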

### Evaluate the results

In [76]:
evaluate_rag(chunks_query_retriever)

{'questions': ['1. **Multiple Choice: Causes of Climate Change**',
  '   - What is the primary cause of the current climate change trend?',
  '     A) Solar radiation variations',
  '     B) Natural cycles of the Earth',
  '     C) Human activities, such as burning fossil fuels',
  '     D) Volcanic eruptions',
  '',
  '2. **True or False: Impact on Biodiversity**',
  '   - True or False: Climate change does not have any significant impact on the migration patterns and extinction rates of various species.',
  '',
  '3. **Short Answer: Mitigation Strategies**',
  '   - What are two effective strategies that can be implemented at a community level to mitigate the effects of climate change?',
  '',
  '4. **Matching: Climate Change Effects**',
  '   - Match the following effects of climate change (numbered) with their likely consequences (lettered).',
  '     1. Rising sea levels',
  '     2. Increased frequency of extreme weather events',
  '     3. Melting polar ice caps',
  '     4. Oce