[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/context_enrichment_window_around_chunk.ipynb)

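# Context Enrichment Window for Document Retrieval

## Overview

This notebook implements a context enrichment window technique for document retrieval from a vector database. It enhances the standard retrieval process by adding the surrounding context to each retrieved chunk, which improves the coherence and completeness of the returned information.

## Motivation

Traditional vector search often returns isolated chunks of text that may lack the context needed to understand them fully. This approach aims to give a more complete view of the retrieved information by including the neighboring text chunks.

## Key Components

1. PDF processing and text chunking
2. Vector store creation using FAISS and OpenAI embeddings
3. Custom retrieval function with a context window
4. Comparison between standard and context-enriched retrieval

## Method Details

### Document Preprocessing

1. The PDF is read and converted to a string.
2. The text is split into overlapping chunks, each tagged with its index.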
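
As a rough illustration of the splitting step, the sketch below slices a toy string into overlapping chunks and tags each one with its index. The function name `sliding_chunks` and the sample values are assumptions made for this example only; it is not the notebook's own splitter.

```python
from typing import List, Tuple


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, str]]:
    """Return (index, chunk) pairs; consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # how far each new chunk starts after the previous one
    return [
        (i, text[start:start + chunk_size])
        for i, start in enumerate(range(0, len(text), stride))
    ]


# Hypothetical toy input: 10 characters, chunks of 4 with an overlap of 2
pairs = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
for index, chunk in pairs:
    print(index, chunk)
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the stored index is enough to locate a chunk's neighbors later.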

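### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the chunks.
2. A FAISS vector store is created from these embeddings.

### Context-Enriched Retrieval

1. The `retrieve_with_context_overlap` function performs the following steps:
   - Retrieves the relevant chunks for the query
   - For each relevant chunk, fetches its neighboring chunks
   - Concatenates the chunks, accounting for the overlap between them
   - Returns the expanded context for each relevant chunk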
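
The steps above can be sketched without a vector store at all. The snippet below is a minimal illustration on toy chunks, assuming every chunk overlaps its predecessor by exactly `chunk_overlap` characters; the helper names `window_indices` and `stitch_chunks` are hypothetical and exist only for this example.

```python
from typing import List


def window_indices(hit_index: int, num_neighbors: int, total_chunks: int) -> range:
    """Indices of the hit plus its neighbors, clamped to the document bounds."""
    start = max(0, hit_index - num_neighbors)
    end = min(total_chunks, hit_index + num_neighbors + 1)
    return range(start, end)


def stitch_chunks(chunks: List[str], chunk_overlap: int) -> str:
    """Concatenate consecutive chunks, dropping the characters they share."""
    if not chunks:
        return ""
    merged = chunks[0]
    for chunk in chunks[1:]:
        # The tail of `merged` duplicates the head of `chunk`, so trim it first
        keep = max(0, len(merged) - chunk_overlap)
        merged = merged[:keep] + chunk
    return merged


# Toy chunks of size 4 with overlap 2, taken from the string "abcdefghij"
toy_chunks = ["abcd", "cdef", "efgh", "ghij", "ij"]
indices = window_indices(hit_index=2, num_neighbors=1, total_chunks=len(toy_chunks))
window = stitch_chunks([toy_chunks[i] for i in indices], chunk_overlap=2)
print(list(indices), window)
```

Trimming the overlap before appending each neighbor is what keeps the enriched window free of repeated text.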

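### Retrieval Comparison

The notebook includes a section that compares standard retrieval with the context-enriched approach.

## Benefits of this Approach

1. Provides more coherent and contextually rich results
2. Keeps the advantages of vector search while mitigating its tendency to return isolated text fragments
3. Allows flexible adjustment of the context window size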
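
To make the window-size trade-off concrete, here is a small back-of-the-envelope sketch. It assumes full-sized chunks away from the start and end of the document, and the function name `enriched_window_length` is hypothetical. The values 400 and 200 are the chunk size and overlap used later in this notebook.

```python
def enriched_window_length(chunk_size: int, chunk_overlap: int, num_neighbors: int) -> int:
    """Characters in a stitched window of one hit plus num_neighbors chunks on each side."""
    stride = chunk_size - chunk_overlap  # new characters each extra chunk contributes
    return chunk_size + 2 * num_neighbors * stride


for n in (0, 1, 2):
    print(n, enriched_window_length(chunk_size=400, chunk_overlap=200, num_neighbors=n))
```

Because overlapping characters are counted once, each additional neighbor on either side adds only `chunk_size - chunk_overlap` characters to the window.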

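## Conclusion

This context enrichment window technique offers a promising way to improve the quality of retrieved information in vector-based document search systems. By supplying the surrounding context, it helps preserve the coherence and completeness of the retrieved information, which may lead to better understanding and more accurate responses in downstream tasks such as question answering.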

<div style="text-align: center;">

<img src="../images/vector-search-comparison_context_enrichment.svg" alt="context enrichment window" style="width:70%; height:auto;">
</div>

<div style="text-align: center;">

<img src="../images/context_enrichment_window.svg" alt="context enrichment window" style="width:70%; height:auto;">
</div>

# Package Installation and Imports

The cell below installs the packages required to run this notebook.


In [None]:
# Install required packages
!pip install langchain python-dotenv

In [None]:
# Clone the repository to access helper functions and evaluation modules
!git clone https://github.com/NirDiamant/RAG_TECHNIQUES.git
import sys
sys.path.append('RAG_TECHNIQUES')
# If you need to run with the latest data
# !cp -r RAG_TECHNIQUES/data .

In [1]:
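import os
import sys
from typing import List
from dotenv import load_dotenv
from langchain.docstore.document import Document


# Original path append replaced for Colab compatibility
from helper_functions import *
from evaluation.evalute_rag import *

# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key environment variable
# (guarded, because assigning None to os.environ raises a TypeError)
openai_api_key = os.getenv('OPENAI_API_KEY')
if openai_api_key:
    os.environ["OPENAI_API_KEY"] = openai_api_key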



### Define the path to the PDF

In [None]:
# Download the required data files
import os
os.makedirs('data', exist_ok=True)

# Download the PDF document used in this notebook
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf


In [2]:
path = "data/Understanding_Climate_Change.pdf"

### Read the PDF into a string

In [3]:
content = read_pdf_to_string(path)

### Function to split text into chunks, storing each chunk's sequential index as metadata

In [4]:
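def split_text_to_chunks_with_indices(text: str, chunk_size: int, chunk_overlap: int) -> List[Document]:
    """Split text into overlapping chunks, storing each chunk's position as metadata["index"]."""
    if chunk_overlap >= chunk_size:
        # Otherwise the start position would never advance
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # The index records the chunk's position in the original document order
        chunks.append(Document(page_content=chunk, metadata={"index": len(chunks)}))
        # Advance by the stride so consecutive chunks share chunk_overlap characters
        start += chunk_size - chunk_overlap
    return chunks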

### Split our document accordingly

In [18]:
chunks_size = 400
chunk_overlap = 200
docs = split_text_to_chunks_with_indices(content, chunks_size, chunk_overlap)

### Create the vector store and retriever

In [20]:
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

### Function to fetch the k<sup>th</sup> chunk (in the original order) from the vector store


In [53]:
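from typing import Optional

def get_chunk_by_index(vectorstore, target_index: int) -> Optional[Document]:
    """
    Retrieve a chunk from the vector store based on the index in its metadata.
    
    Args:
    vectorstore (VectorStore): The vector store containing the chunks.
    target_index (int): The index of the chunk to retrieve.
    
    Returns:
    Optional[Document]: The retrieved chunk as a Document object, or None if not found.
    """
    # This is a simplified version: it fetches every stored chunk and scans them linearly.
    # In practice, you might need a more efficient method to retrieve chunks by index,
    # depending on your vector store implementation.
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
    for doc in all_docs:
        if doc.metadata.get('index') == target_index:
            return doc
    return None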
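
One possible way to avoid the full scan, sketched here under the assumption that the list of chunks produced at split time is still in memory, is to build a dictionary keyed by chunk index once and look chunks up directly. `ToyDocument` and `build_index_lookup` are hypothetical names used only for illustration. The stand-in class mirrors the `page_content` and `metadata` attributes of LangChain's `Document` so the snippet runs without any dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyDocument:
    """Hypothetical stand-in for LangChain's Document, used only in this sketch."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_index_lookup(docs: List[ToyDocument]) -> Dict[int, ToyDocument]:
    """Map each chunk's metadata["index"] to the chunk for constant-time lookups."""
    return {doc.metadata["index"]: doc for doc in docs if "index" in doc.metadata}


toy_docs = [ToyDocument(text, {"index": i}) for i, text in enumerate(["first", "second", "third"])]
lookup = build_index_lookup(toy_docs)

hit: Optional[ToyDocument] = lookup.get(1)       # found
missing: Optional[ToyDocument] = lookup.get(99)  # not found, so None
print(hit.page_content if hit else None, missing)
```

The dictionary is built once, so fetching the neighbors of a retrieved chunk no longer requires a similarity search per lookup.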

### Check the function

In [54]:
chunk = get_chunk_by_index(vectorstore, 0)
print(chunk.page_content)

Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human 
activities, particularly the burning of fossil fuels and 


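### Function that retrieves from the vector store by semantic similarity, then pads each retrieved chunk with its num_neighbors preceding and following chunks, accounting for the chunk overlap to build a meaningful wide window around it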

In [55]:
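def retrieve_with_context_overlap(vectorstore, retriever, query: str, num_neighbors: int = 1, chunk_size: int = 200, chunk_overlap: int = 20) -> List[str]:
    """
    Retrieve chunks based on a query, then fetch neighboring chunks and concatenate them,
    accounting for overlap and correct indexing.

    Args:
    vectorstore (VectorStore): The vector store containing the chunks.
    retriever: The retriever object used to get relevant documents.
    query (str): The query used to search for relevant chunks.
    num_neighbors (int): The number of chunks to retrieve before and after each relevant chunk.
    chunk_size (int): The size of each chunk when originally split (not used by the stitching logic below).
    chunk_overlap (int): The overlap between chunks when originally split; it must match the value used for splitting.

    Returns:
    List[str]: A list of concatenated chunk sequences, each centered on a relevant chunk.
    """
    relevant_chunks = retriever.get_relevant_documents(query)
    result_sequences = []

    for chunk in relevant_chunks:
        current_index = chunk.metadata.get('index')
        if current_index is None:
            continue

        # Determine the range of chunks to retrieve
        start_index = max(0, current_index - num_neighbors)
        end_index = current_index + num_neighbors + 1  # +1 because range is exclusive at the end

        # Retrieve all chunks in the range (indices past the last chunk return None and are skipped)
        neighbor_chunks = []
        for i in range(start_index, end_index):
            neighbor_chunk = get_chunk_by_index(vectorstore, i)
            if neighbor_chunk:
                neighbor_chunks.append(neighbor_chunk)

        # Sort chunks by their index to ensure the correct order
        neighbor_chunks.sort(key=lambda x: x.metadata.get('index', 0))

        # Concatenate chunks, accounting for overlap
        concatenated_text = neighbor_chunks[0].page_content
        for i in range(1, len(neighbor_chunks)):
            current_chunk = neighbor_chunks[i].page_content
            # Drop the tail that the next chunk repeats before appending it
            overlap_start = max(0, len(concatenated_text) - chunk_overlap)
            concatenated_text = concatenated_text[:overlap_start] + current_chunk

        result_sequences.append(concatenated_text)

    return result_sequences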

### Comparing regular retrieval with context-window retrieval

In [None]:
# Baseline approach
query = "Explain the role of deforestation and fossil fuels in climate change."
# The retriever was already configured with k=1 through search_kwargs
baseline_chunk = chunks_query_retriever.get_relevant_documents(query)
# Context-enriched approach
enriched_chunks = retrieve_with_context_overlap(
    vectorstore,
    chunks_query_retriever,
    query,
    num_neighbors=1,
    chunk_size=400,
    chunk_overlap=200
)

print("Baseline Chunk:")
print(baseline_chunk[0].page_content)
print("\nEnriched Chunks:")
print(enriched_chunks[0])

### An example showing the benefit of the additional context window

In [49]:

document_content = """
Artificial Intelligence (AI) has a rich history dating back to the mid-20th century. The term "Artificial Intelligence" was coined in 1956 at the Dartmouth Conference, marking the field's official beginning.

In the 1950s and 1960s, AI research focused on symbolic methods and problem-solving. The Logic Theorist, created in 1955 by Allen Newell and Herbert A. Simon, is often considered the first AI program.

The 1960s saw the development of expert systems, which used predefined rules to solve complex problems. DENDRAL, created in 1965, was one of the first expert systems, designed to analyze chemical compounds.

However, the 1970s brought the first "AI Winter," a period of reduced funding and interest in AI research, largely due to overpromised capabilities and underdelivered results.

The 1980s saw a resurgence with the popularization of expert systems in corporations. The Japanese government's Fifth Generation Computer Project also spurred increased investment in AI research globally.

Neural networks gained prominence in the 1980s and 1990s. The backpropagation algorithm, although discovered earlier, became widely used for training multi-layer networks during this time.

The late 1990s and 2000s marked the rise of machine learning approaches. Support Vector Machines (SVMs) and Random Forests became popular for various classification and regression tasks.

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning methods in the ImageNet competition.

Since then, deep learning has revolutionized many AI applications, including image and speech recognition, natural language processing, and game playing. In 2016, Google's AlphaGo defeated a world champion Go player, a landmark achievement in AI.

The current era of AI is characterized by the integration of deep learning with other AI techniques, the development of more efficient and powerful hardware, and the ethical considerations surrounding AI deployment.

Transformers, introduced in 2017, have become a dominant architecture in natural language processing, enabling models like GPT (Generative Pre-trained Transformer) to generate human-like text.

As AI continues to evolve, new challenges and opportunities arise. Explainable AI, robust and fair machine learning, and artificial general intelligence (AGI) are among the key areas of current and future research in the field.
"""

chunks_size = 250
chunk_overlap = 20
document_chunks = split_text_to_chunks_with_indices(document_content, chunks_size, chunk_overlap)
document_vectorstore = FAISS.from_documents(document_chunks, embeddings)
document_retriever = document_vectorstore.as_retriever(search_kwargs={"k": 1})

query = "When did deep learning become prominent in AI?"
context = document_retriever.get_relevant_documents(query)
context_pages_content = [doc.page_content for doc in context]

print("Regular retrieval:\n")
show_context(context_pages_content)

sequences = retrieve_with_context_overlap(document_vectorstore, document_retriever, query, num_neighbors=1)
print("\nRetrieval with context enrichment:\n")
show_context(sequences)

Regular retrieval:

Context 1:

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning method



Retrieval with context enrichment:

Context 1:
ng multi-layer networks during this time.

The late 1990s and 2000s marked the rise of machine learning approaches. Support Vector Machines (SVMs) and Random Forests became popular for various classification and regression tasks.

Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning methods in the ImageNet competition.

Since then, deep learning has revolutionized many AI applications, including image and speech recognition, natural language processing, and game playing. In 20