## 🧠 Context Window Enrichment for Better Retrieval | RAG100X

This notebook implements **Context Window Enrichment** — a technique that augments vector-retrieved chunks by including their **neighboring context**, improving semantic coherence for downstream LLM generation.

Instead of treating chunks as isolated units, we reconstruct **semantic windows** around each top-retrieved chunk by including adjacent text spans. This mimics how humans read: understanding a paragraph by seeing what came before and after.

The result? Richer context, fewer hallucinations, and answers grounded in complete ideas.

---

### ✅ What You’ll Learn

- Why single chunks often miss important supporting context  
- How adding “pre” and “post” neighbors helps complete the meaning  
- How to reconstruct enriched windows using chunk IDs and overlaps  
- When this strategy improves performance over plain Top-k chunk retrieval  

---

### 🔍 Real-world Analogy

Suppose someone gives you just this line from a textbook:

> *"...and thus, the algorithm converges in O(log n) time."*

You’re left wondering: *Which algorithm? What was the problem setup?*  
By seeing the sentences before and after, the meaning becomes clear.

✅ **Context Window Enrichment ensures LLMs get that full picture — not just a floating quote.**

---

### 🔬 How Context Window Enrichment Works

Let’s say your document is split into overlapping chunks using LangChain. When you run a query:

| Step                     | What Happens                                                                   |
|--------------------------|----------------------------------------------------------------------------------|
| 1. Chunking              | Document split into overlapping chunks and indexed into a FAISS vector store    |
| 2. Retrieval             | Top-k most relevant chunks are retrieved via OpenAI embeddings                  |
| 3. Context Windowing     | For each top chunk, we identify and include N neighbors before and after        |
| 4. Deduplication         | Text repeated across overlapping chunks is trimmed so each passage appears once |
| 5. Output                | The enriched chunk windows are sent to the LLM for final generation             |

🔁 These padded windows ensure more complete answers — especially for queries that relate to topics spread across multiple adjacent chunks.
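Steps 3 and 4 can be sketched with plain integers, before any vector store is involved. The helper names `window_bounds` and `merge_windows` are hypothetical, and merging whole windows across several hits is an assumption that goes beyond what this notebook implements (it retrieves with `k=1` and only trims character overlap inside one window).

```python
from typing import List, Tuple

def window_bounds(hit_index: int, num_neighbors: int, total_chunks: int) -> Tuple[int, int]:
    """Return the inclusive (first, last) chunk indices of the window around a hit."""
    first = max(0, hit_index - num_neighbors)
    last = min(total_chunks - 1, hit_index + num_neighbors)
    return first, last

def merge_windows(windows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge windows that overlap or touch, so no chunk is sent to the LLM twice."""
    merged: List[Tuple[int, int]] = []
    for first, last in sorted(windows):
        if merged and first <= merged[-1][1] + 1:
            prev_first, prev_last = merged[-1]
            merged[-1] = (prev_first, max(prev_last, last))
        else:
            merged.append((first, last))
    return merged

# Hypothetical example: 50 chunks, top-3 hits at indices 10, 0 and 12, one neighbor per side
hits = [10, 0, 12]
windows = [window_bounds(i, num_neighbors=1, total_chunks=50) for i in hits]
print(windows)                 # [(9, 11), (0, 1), (11, 13)]
print(merge_windows(windows))  # [(0, 1), (9, 13)]
```

Clamping at both ends keeps the window valid for hits near the start or end of the document, and merging matters mainly once `k` is larger than 1.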

---

### 🧪 Why This Works So Well

- 🧩 **Captures semantic flow**: Ideas span multiple chunks — windowing reconnects them  
- 📚 **Boosts factual grounding**: More complete segments reduce missing links and ambiguity  
- 🧠 **Great for retrieval + rerankers**: Windows improve reranking and answer fluency  

---

### 🏗️ Why This Matters in Production

Imagine retrieving this chunk alone:

> *"In 2015, Google introduced TensorFlow..."*

Helpful? Yes. But isolated? Also yes.

Now, with windowed context:

1. *"The rise of deep learning frameworks..."*  
2. *"In 2015, Google introduced TensorFlow..."*  
3. *"This marked a major shift in accessible model training..."*

✅ This enriched passage supports better reasoning — not just fact extraction.

**Context Windowing bridges the gap between chunk retrieval and story comprehension.**

---

### 🔄 Where This Fits in RAG100X

So far in RAG100X, you’ve seen:

1. Simple PDF QA with FAISS  
2. CSV semantic search  
3. Blog-based web RAG  
4. Chunk size sensitivity studies  
5. Proposition-aware chunking  
6. Query rewriting and transformation  
7. HyDE (imagined documents)  
8. HyPE (imagined questions)  
9. CCH (contextual headers)  
10. RSE (segment optimization)

Now in **Day 11**, we zoom in on a **low-cost, high-impact enhancement**:  
> 💡 **Context Window Enrichment gives each retrieved chunk the support it needs to shine.**


## 📦 Installation & Setup

In [None]:
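# Install required packages (langchain-openai and faiss-cpu are needed by the imports below)
!pip install langchain langchain-community langchain-openai faiss-cpu python-dotenv PyMuPDF

import os
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load environment variables (including OPENAI_API_KEY) from a .env file
load_dotenv()

# Fail early with a clear message if the key is missing;
# assigning None to os.environ would raise a TypeError instead
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")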

### Path to PDF

In [None]:
path = "data/Understanding_Climate_Change.pdf"

### 📄 Step 1: Load and Chunk the PDF Document with Index Metadata

Before we can perform any kind of intelligent document retrieval, we first need to **read the PDF content** and split it into smaller units called **chunks**. Chunking helps large documents become searchable and easier to process by language models.

But instead of just splitting the text blindly, we also assign each chunk a **chronological index**. This index captures the original order of chunks in the document.


### 🧠 Why Track Chunk Indices?

When we apply techniques like **context enrichment windows**, we often want to include not just one retrieved chunk but also the chunks that **surround it** — either before, after, or both.

To do this effectively, we need to know the position of each chunk in the document. That’s where the index comes in.


### 🧪 Example

If your document has `10,000` characters, and you choose:

- `chunk_size = 500`
- `chunk_overlap = 100`

Then the function will produce overlapping chunks like:

- **Chunk 0** → characters `0–500`  
- **Chunk 1** → characters `400–900`  
- **Chunk 2** → characters `800–1300`  
- ... and so on.

Each chunk is stored along with:

- ✅ `index` → its position in the document  
- ✅ `text` → the full original document (for reference)

This index is critical when later applying **context window enrichment**, where we retrieve not just the top chunk, but also its **neighbors** to preserve coherence and improve answer quality.
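The offsets in the example above can be reproduced with a few lines of arithmetic. This is a minimal sketch; `chunk_spans` is a hypothetical helper that only computes character ranges and does not touch the PDF or LangChain.

```python
from typing import List, Tuple

def chunk_spans(text_length: int, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each chunk, with end exclusive."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk starts after the previous one
    return [(start, min(start + chunk_size, text_length)) for start in range(0, text_length, step)]

spans = chunk_spans(text_length=10_000, chunk_size=500, chunk_overlap=100)
print(spans[:3])   # [(0, 500), (400, 900), (800, 1300)]
print(len(spans))  # 25
print(spans[-1])   # (9600, 10000)
```

The position of each span in the list is the value stored as `index` in the chunk metadata. Note that the final chunk is shorter than `chunk_size`, which matters later when overlapping chunks are merged.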


In [None]:

import fitz

def read_pdf_to_string(path):
    """
    Read a PDF document from the specified path and return its content as a string.

    Args:
        path (str): The file path to the PDF document.

    Returns:
        str: The concatenated text content of all pages in the PDF document.

    The function uses the 'fitz' library (PyMuPDF) to open the PDF document, iterate over each page,
    extract the text content from each page, and append it to a single string.
    """
    # Open the PDF document located at the specified path
    doc = fitz.open(path)
    content = ""
    # Iterate over each page in the document
    for page_num in range(len(doc)):
        # Get the current page
        page = doc[page_num]
        # Extract the text content from the current page and append it to the content string
        content += page.get_text()
    return content

# Read the PDF into a string
content = read_pdf_to_string(path)

from typing import List
from langchain_core.documents import Document

def split_text_to_chunks_with_indices(text: str, chunk_size: int, chunk_overlap: int) -> List[Document]:
    """
    Splits a long text into overlapping chunks and records each chunk's position for context window enrichment.
    
    Args:
        text: The full document content as a string.
        chunk_size: Number of characters per chunk.
        chunk_overlap: Number of characters shared between consecutive chunks.

    Returns:
        A list of LangChain Document objects with:
            - page_content: the chunked text
            - metadata: index (position in doc) and original text
    """
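    # Guard against a non-positive step, which would make the loop below run forever
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks = []
    start = 0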

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Each chunk has the chunk text and metadata:
        # - 'index' stores its position in the document
        # - 'text' stores the full original text for reference
        #   (not used by the retrieval code in this notebook)
        chunks.append(Document(
            page_content=chunk,
            metadata={
                "index": len(chunks),  # chronological position
                "text": text           # full doc, kept for reference
            }
        ))

        start += chunk_size - chunk_overlap  # move forward with overlap

    return chunks


### Split our document accordingly

In [None]:
chunk_size = 400
chunk_overlap = 200
docs = split_text_to_chunks_with_indices(content, chunk_size, chunk_overlap)

### Create vector store and retriever

In [None]:
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

### Retrieve a Specific Chunk from the Vector Store by Index

Once all chunks are embedded and stored in the vector store, there may be situations where you want to **retrieve a specific chunk by its original order** in the document — not by semantic similarity.

This function allows us to fetch the **kᵗʰ chunk** (based on its original position, i.e., the `index` stored during chunking). It’s especially useful when reconstructing **contiguous segments** around a highly relevant chunk — for example, in **context enrichment** or **Relevant Segment Extraction (RSE)**.


### 🔍 Why This Matters

Most vector stores return **top-k semantically similar** chunks.

But for techniques like RSE or when applying **sliding windows**, we often need:

- A specific chunk (e.g., the one most relevant to the query)
- Chunks surrounding it in the original document

This utility function helps with that by looking up chunks using the `index` stored in the metadata.


### ⚠️ Limitations

This implementation **scans all documents** in the vector store, which is inefficient for large datasets.

For production systems, it's better to:

- Maintain a direct index-to-document mapping (e.g., a dictionary)
- Or pre-load all docs with indices into memory
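A minimal sketch of the dictionary approach is shown here. To keep it runnable without LangChain, it uses a small stand-in class, `Chunk`, that mimics the two `Document` attributes used in this notebook; `Chunk`, `build_index_map`, and `get_neighbors` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Chunk:
    """Stand-in for a LangChain Document: same two attributes, no dependency."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_index_map(docs: List[Chunk]) -> Dict[int, Chunk]:
    """Map each chunk's metadata['index'] to the chunk itself; built once, reused for every lookup."""
    return {doc.metadata["index"]: doc for doc in docs}

def get_neighbors(index_map: Dict[int, Chunk], center: int, num_neighbors: int) -> List[Chunk]:
    """Return the existing chunks from center - num_neighbors to center + num_neighbors, in order."""
    wanted = range(center - num_neighbors, center + num_neighbors + 1)
    return [index_map[i] for i in wanted if i in index_map]

docs = [Chunk(page_content=f"chunk {i}", metadata={"index": i}) for i in range(50)]
index_map = build_index_map(docs)

print([c.page_content for c in get_neighbors(index_map, 42, 1)])  # ['chunk 41', 'chunk 42', 'chunk 43']
print([c.page_content for c in get_neighbors(index_map, 0, 1)])   # ['chunk 0', 'chunk 1']
```

Each lookup is a constant-time dictionary access, so no similarity search is needed to fetch a neighbor. The same map could be built from the `docs` list produced during chunking, assuming that list is still in memory.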


### 🧪 Example Use Case

Let’s say:

- You already used semantic search to identify that chunk with index `42` is highly relevant.
- Now you want to fetch its neighboring chunks (e.g., `41` and `43`) to reconstruct a more complete context window.

This function lets you fetch **any such chunk on demand by index**.


In [None]:
from typing import Optional

def get_chunk_by_index(vectorstore, target_index: int) -> Optional[Document]:
    """
    Retrieve a chunk from the vectorstore based on its index in the metadata.

    Args:
    - vectorstore: The vectorstore object containing embedded chunks.
    - target_index: The index of the chunk to retrieve (as stored in metadata).

    Returns:
    - The Document object that matches the given index, or None if not found.
    """
    
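    # Perform a similarity search with an empty string just to retrieve all docs
    # k = vectorstore.index.ntotal ensures all documents are returned
    # Note: every call embeds the empty query (one embeddings request) and ranks
    # the whole store, so repeated lookups are slow
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)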

    # Loop through all documents and return the one with matching metadata index
    for doc in all_docs:
        if doc.metadata.get('index') == target_index:
            return doc

    # Return None if no matching chunk found
    return None


### 🔁 Retrieve with Context Window (Overlap-Aware)

This function improves standard semantic retrieval by adding a few chunks before and after each relevant chunk — forming a broader context window.

Instead of returning isolated chunks, we fetch neighboring ones based on their original position, and merge them carefully to avoid duplication due to chunk overlaps.

---

### 🛠️ How it works (in short)

- Run semantic search to find top relevant chunks.
- For each, fetch a few neighbors before and after (based on `num_neighbors`).
- Merge the text while trimming repeated regions caused by chunk overlap.
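The merge step can be checked in isolation on plain strings. This sketch drops the first `chunk_overlap` characters of every chunk after the first, under the assumption that the chunks are consecutive and were produced with a fixed step of `chunk_size - chunk_overlap`; `merge_overlapping_chunks` is a hypothetical helper name.

```python
from typing import List

def merge_overlapping_chunks(chunks: List[str], chunk_overlap: int) -> str:
    """Join consecutive chunks, skipping the part of each later chunk that repeats the previous one."""
    if not chunks:
        return ""
    merged = chunks[0]
    for chunk in chunks[1:]:
        # The first chunk_overlap characters were already contributed by the previous chunk
        merged += chunk[chunk_overlap:]
    return merged

# Toy document split the same way as in this notebook (fixed size, fixed overlap)
text = "abcdefghijklmnopqrstuvwxyz"
chunk_size, chunk_overlap = 10, 4
step = chunk_size - chunk_overlap
chunks = [text[start:start + chunk_size] for start in range(0, len(text), step)]

print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz', 'yz']
print(merge_overlapping_chunks(chunks, chunk_overlap) == text)  # True
```

Skipping the head of each later chunk, rather than cutting the tail of the text merged so far, also gives the right result when the chunks near the end of the document are shorter than `chunk_size`, as the last two are here.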


In [None]:
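def retrieve_with_context_overlap(vectorstore, retriever, query: str, num_neighbors: int = 1, chunk_size: int = 200, chunk_overlap: int = 20) -> List[str]:
    """
    Retrieve semantically relevant chunks and expand them with neighboring chunks
    to build wider, more meaningful context windows.

    chunk_overlap must equal the overlap used when the document was split;
    otherwise the merged text will repeat or drop characters. chunk_size is
    accepted for symmetry with the splitter but is not used here.
    """

    # Step 1: Get the top relevant chunks based on semantic similarity
    # (invoke is the current retriever entry point; get_relevant_documents is deprecated)
    relevant_chunks = retriever.invoke(query)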
    result_sequences = []

    for chunk in relevant_chunks:
        # Step 2: Get the index of the current relevant chunk
        current_index = chunk.metadata.get('index')
        if current_index is None:
            continue  # Skip if no index found

        # Step 3: Determine the window of surrounding chunks
        start_index = max(0, current_index - num_neighbors)
        end_index = current_index + num_neighbors + 1  # +1 because Python range is exclusive

        # Step 4: Collect the neighbor chunks in that range
        neighbor_chunks = []
        for i in range(start_index, end_index):
            neighbor_chunk = get_chunk_by_index(vectorstore, i)
            if neighbor_chunk:
                neighbor_chunks.append(neighbor_chunk)

        # Step 5: Sort chunks to maintain their original order in the document
        neighbor_chunks.sort(key=lambda x: x.metadata.get('index', 0))

        # Step 6: Concatenate chunk contents with proper handling of overlap
        concatenated_text = neighbor_chunks[0].page_content
        for i in range(1, len(neighbor_chunks)):
            current_chunk = neighbor_chunks[i].page_content
            # Skip the first chunk_overlap characters, which the previous chunk already
            # contributed; unlike trimming the tail of the accumulated text, this also
            # stays correct when the previous chunk is shorter than chunk_size
            concatenated_text += current_chunk[chunk_overlap:]

        # Step 7: Add the final enriched context window to the result
        result_sequences.append(concatenated_text)

    return result_sequences


### 🔍 Comparing Regular Retrieval vs. Context-Enriched Retrieval

In this step, we compare:

1. **Baseline Retrieval** — returns the top semantically relevant chunk as-is.
2. **Context-Enriched Retrieval** — expands the top chunk by adding its neighboring chunks from the original document, forming a wider and more coherent window.

This helps illustrate how enriching the context can provide a fuller and more connected passage for generation.


In [None]:
# Baseline approach
query = "Explain the role of deforestation and fossil fuels in climate change."
# The retriever was created with k=1, so it returns only the top chunk
baseline_chunk = chunks_query_retriever.invoke(query)

# Focused context enrichment approach
enriched_chunks = retrieve_with_context_overlap(
    vectorstore,
    chunks_query_retriever,
    query,
    num_neighbors=1,
    chunk_size=400,
    chunk_overlap=200
)

print("Baseline Chunk:")
print(baseline_chunk[0].page_content)

print("\nEnriched Chunks:")
print(enriched_chunks[0])


---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 🔍 Why Use Context Enrichment in RAG?

Standard top-k retrieval returns isolated chunks, which can miss important surrounding context.

**Context Enrichment** addresses this by:
- 🔗 Retrieving **neighboring chunks** around relevant ones
- 🧱 Reconstructing **coherent local windows** from the original document
- 🎯 Improving grounding and reducing hallucinations in generated answers

---

## 🧠 What’s New in This Version?

This implementation includes:

- ➕ **Neighbor-aware chunk retrieval** by index  
- 🔄 **Overlap-aware concatenation** to reduce repetition and preserve flow  
- ⚙️ Plug-and-play utility that wraps around any vector store + retriever  

---

## 📈 Inferences & Key Takeaways

- ✅ Adding nearby chunks boosts context **without increasing retrieval k**  
- 📚 Especially useful for **longer-form or structured content**  
- 🔍 Retains local coherence, which helps LLMs better understand and generate  

---

## 🚀 What Could Be Added Next?

- 🧪 Evaluate impact on QA performance vs. baseline top-k  
- ⚖️ Add dynamic neighbor selection based on chunk length or doc structure  
- 🔌 Turn into a custom retriever class for easier integration with LangChain or LlamaIndex  

---

## 💡 Final Word

This notebook is part of my larger personal project, **RAG100X**: a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.
