## 🧠 Contextual Chunk Headers (CCH) for Better Retrieval | RAG100X

This notebook implements **Contextual Chunk Headers (CCH)** — a simple yet powerful technique to improve retrieval quality by giving each document chunk its own meaningful, LLM-generated title.

Instead of storing raw text chunks in your vectorstore (which may lack clear intent or context), CCH uses an LLM to **generate a short, descriptive header** for every chunk — like a topic label or section title. This header is then prepended to the chunk before embedding.

The result? A more semantically rich vector that better aligns with user queries — especially when queries are short, vague, or topically phrased.

---

### ✅ What You’ll Learn

- Why plain text chunks often fail to capture the *intent* of the content  
- How CCH uses an LLM (`gpt-4o-mini` in this notebook) to generate a contextual title that is prepended to each chunk  
- How prepending this header improves semantic similarity in retrieval  
- When this helps — and how it compares to baseline RAG  

---

### 🔍 Real-world Analogy

Imagine you have a box of textbook pages, all cut into random paragraphs. You’re asked:

> *"Do you have anything about 'causes of the French Revolution'?"*

Now you have to go through each paragraph and **guess** what it’s about — because there's no section title or chapter heading.

But what if each paragraph had a header like:
- “Economic Inequality and Tax Burden”
- “Role of Enlightenment Ideas”
- “Monarchy and Political Structure”

Suddenly, matching a user’s question becomes much easier — you can filter fast based on the header, then read the details.

✅ **CCH adds that helpful title to every chunk — so your retriever doesn’t have to guess.**

---

### 🔬 How CCH Works Under the Hood

Let’s say we have this chunk from a long report:

> “In pre-revolutionary France, the Third Estate bore the majority of tax burden while the clergy and nobility were largely exempt. This economic disparity led to growing unrest…”

With **standard RAG**:

| Step    | What Happens                            |
|---------|-----------------------------------------|
| Embed   | Only this paragraph is embedded as-is   |
| Retrieve| User query is matched to paragraph text |
| Issue   | Context is missing — vague match        |

With **CCH**, we do this:

| Step              | What Happens                                                            |
|-------------------|-------------------------------------------------------------------------|
| 1. Chunk          | We extract a paragraph from the document                               |
| 2. Prompting      | GPT-4 is asked: *“Write a short header or title summarizing this text”*|
| 3. Output         | → “Tax Inequality in Pre-Revolutionary France”                         |
| 4. Combine        | Prepend the header to the chunk: *“Tax Inequality... \n\n In pre…”*    |
| 5. Embedding      | The combined text is embedded and stored in a vector store such as FAISS |
| 6. Retrieval      | User query is now matched to the richer representation                 |

✅ This gives better similarity when the user asks:
> *"Why was the tax system unfair before the French Revolution?"*
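The "Combine" step in the table above can be sketched as a small helper. This is a minimal illustration, not a library API: the function name `build_contextual_chunk` and the `separator` parameter are hypothetical, and embedding and storage are left out.

```python
def build_contextual_chunk(header: str, chunk_text: str, separator: str = "\n\n") -> str:
    """Prepend a header to a chunk so both are embedded as one text.

    Hypothetical helper for illustration; if the header is empty,
    the chunk is returned unchanged.
    """
    header = header.strip()
    if not header:
        return chunk_text
    return f"{header}{separator}{chunk_text}"


# Example values taken from the walkthrough above
header = "Tax Inequality in Pre-Revolutionary France"
chunk = (
    "In pre-revolutionary France, the Third Estate bore the majority of tax burden "
    "while the clergy and nobility were largely exempt."
)

contextual_chunk = build_contextual_chunk(header, chunk)
print(contextual_chunk)
```

The combined string, rather than the raw chunk, is what would be passed to the embedding model.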

---

### 🧪 Why This Works So Well

- 🏷️ **Topic priming**: The header acts like a title, helping the embedding capture what the chunk is *about*  
- 🧠 **Better alignment**: User queries often resemble headers — not full paragraphs  
- ⚡ **Low overhead**: Just one extra LLM call per chunk, done offline  

---

### 🏗️ Why This Matters in Production

In most corpora — like reports, transcripts, or long articles — individual chunks don’t explain themselves well.

For example:
> *“We experimented with a 3-layer CNN followed by max pooling…”*

Without context, this tells you *what* was done, but not *why*. A good header like:

> **“Model Architecture for Toxic Comment Classification”**

…adds missing intent. Now, if a user searches for “toxic comment classifier architecture”, retrieval becomes much more effective.

**CCH injects semantic clarity into every chunk.**

---

### 🔄 Where This Fits in RAG100X

So far in RAG100X, we’ve explored:

1. PDF-based QA  
2. CSV-based search  
3. Web-based hallucination grading  
4. Chunk-size optimization  
5. Proposition-aware chunking  
6. Query rewriting + decomposition  
7. HyDE: Imagine answers before retrieval  
8. HyPE: Store likely questions during indexing  

Now in **Day 9**, we flip the focus back to chunks — and **make them smarter by adding titles**.

> 💡 **CCH gives every chunk a name — so it’s easier to find, even when the user isn’t specific.**


## 📦 Installation & Setup

In [None]:
# Install required packages
!pip install langchain langchain-text-splitters openai cohere python-dotenv tiktoken

import cohere
# Tokenizer used to count tokens in chunks (important for consistent chunking)
import tiktoken

# Typing hint for cleaner code and better autocompletion
from typing import List
from openai import OpenAI
import os
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter

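# Load environment variables from a .env file
load_dotenv()

# Fail early with a clear message if a key is missing
# (assigning None to os.environ would raise a less helpful TypeError)
for key in ("CO_API_KEY", "OPENAI_API_KEY"):  # Cohere and OpenAI API keys
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your .env file")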

### 📌 Chunk the Input Document into Manageable Segments

In this step, we **split the raw document text into smaller, non-overlapping chunks** to make it retrievable and LLM-friendly:

- **RecursiveCharacterTextSplitter**: A smart splitter from LangChain that breaks the text while trying to keep natural boundaries (like paragraphs or sentences).
- **Chunk Size**: Controls the maximum amount of text in each chunk. Here we use up to 800 characters per chunk.
- **No Overlap**: We set `chunk_overlap=0` for simplicity, but this can be tuned to preserve context across chunk boundaries.

We load the input file (in this case, a plain `.txt` version of Nike’s 2023 Annual Report), split it into chunks, and store the results for further processing (e.g., embedding).
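To make the effect of `chunk_overlap` concrete, here is a deliberately simplified sketch using a fixed-size character window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators); the function name `sliding_window_chunks` is hypothetical and only illustrates what "overlap" means.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return windows


demo_text = "abcdefghij"
no_overlap = sliding_window_chunks(demo_text, chunk_size=4, chunk_overlap=0)
with_overlap = sliding_window_chunks(demo_text, chunk_size=4, chunk_overlap=2)

print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With overlap, the end of one chunk reappears at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk, at the cost of more chunks to embed.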


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Function to split large text into smaller chunks
def split_into_chunks(text: str, chunk_size: int = 800) -> list[str]:
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,      # Max length per chunk
        chunk_overlap=0,            # No overlap between chunks
        length_function=len         # Use character count as length metric
    )
    documents = text_splitter.create_documents([text])
    return [document.page_content for document in documents]

# Path to input text document
FILE_PATH = "data/nike_2023_annual_report.txt"

# Read the raw text and split into chunks
with open(FILE_PATH, "r", encoding="utf-8") as file:
    document_text = file.read()

chunks = split_into_chunks(document_text, chunk_size=800)


### 🧠 Generate a Descriptive Document Title

To add context to each chunk during indexing, we generate a **document-level title** using `gpt-4o-mini`. This acts as a "header" for each chunk and improves retrieval quality.

- **Why this matters**: A good title makes it easier to rank and retrieve relevant chunks by adding semantic meaning at the document level.
- **Prompt Template**: We design a structured prompt, sent as a single user message, instructing the LLM to output *only* the title and nothing else.
- **Truncation Handling**: If the document is too long, we use `tiktoken` to safely truncate it to fit within model limits.
- **Model Used**: `gpt-4o-mini` via OpenAI's chat API.

This step returns a clean, high-signal title like *"Nike Inc. 2023 Annual Financial Report"* that will later be prepended to each chunk.


In [None]:
# Constants used in the prompt to extract the document title
DOCUMENT_TITLE_PROMPT = """
INSTRUCTIONS
What is the title of the following document?

Your response MUST be the title of the document, and nothing else. DO NOT respond with anything else.

{document_title_guidance}

{truncation_message}

DOCUMENT
{document_text}
""".strip()

TRUNCATION_MESSAGE = """
Also note that the document text provided below is just the first ~{num_words} words of the document. That should be plenty for this task. Your response should still pertain to the entire document, not just the text provided below.
""".strip()

MAX_CONTENT_TOKENS = 4000
MODEL_NAME = "gpt-4o-mini"

# Tokenizer to manage truncation before sending text to the LLM
# (o200k_base is the encoding used by the gpt-4o model family)
TOKEN_ENCODER = tiktoken.get_encoding("o200k_base")

# Wrapper to call OpenAI's chat model
def make_llm_call(chat_messages: list[dict]) -> str:
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=chat_messages,
        max_tokens=MAX_CONTENT_TOKENS,
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()

# Truncate text to stay within token limits
def truncate_content(content: str, max_tokens: int) -> tuple[str, int]:
    tokens = TOKEN_ENCODER.encode(content, disallowed_special=())
    truncated_tokens = tokens[:max_tokens]
    return TOKEN_ENCODER.decode(truncated_tokens), min(len(tokens), max_tokens)

# Generate a descriptive title for the document using an LLM
def get_document_title(document_text: str, document_title_guidance: str = "") -> str:
    document_text, num_tokens = truncate_content(document_text, MAX_CONTENT_TOKENS)
    truncation_message = TRUNCATION_MESSAGE.format(num_words=3000) if num_tokens >= MAX_CONTENT_TOKENS else ""

    prompt = DOCUMENT_TITLE_PROMPT.format(
        document_title_guidance=document_title_guidance,
        document_text=document_text,
        truncation_message=truncation_message
    )
    chat_messages = [{"role": "user", "content": prompt}]
    return make_llm_call(chat_messages)

# Example usage
if __name__ == "__main__":
    document_title = get_document_title(document_text)
    print(f"Document Title: {document_title}")
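Even with a strict prompt, a model may occasionally wrap the title in quotes or add a label such as `Title:`. The sketch below shows one possible post-processing guard. The helper `clean_title` is hypothetical and not part of the original pipeline; the prefixes it strips are assumptions about likely output, not observed results.

```python
def clean_title(raw_title: str) -> str:
    """Normalize an LLM-generated title (hypothetical helper, for illustration).

    Keeps the first non-empty line, drops a leading "Title:" style label,
    and removes one pair of wrapping quotes.
    """
    lines = [line.strip() for line in raw_title.splitlines() if line.strip()]
    if not lines:
        return ""
    title = lines[0]

    # Drop a leading label if the model added one (assumed prefixes)
    for prefix in ("document title:", "title:"):
        if title.lower().startswith(prefix):
            title = title[len(prefix):].strip()
            break

    # Remove a single pair of matching wrapping quotes
    if len(title) >= 2 and title[0] == title[-1] and title[0] in "\"'":
        title = title[1:-1].strip()
    return title


print(clean_title('Title: "Nike Inc. 2023 Annual Financial Report"\n'))
```

It could be applied as `clean_title(get_document_title(document_text))` before the title is prepended to any chunk.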


### 🧪 Add Contextual Chunk Header and Measure Impact

One powerful technique to improve chunk relevance is **adding a contextual header**—usually the document title—to each chunk before retrieval. This provides the retriever or reranker with additional semantic clues, which can be especially helpful when the chunk content is ambiguous on its own.

In this section, we:

- Define a function `rerank_documents()` that uses **Cohere’s Rerank API** to assign relevance scores to each chunk based on a query.
- Define `compare_chunk_similarities()` to test a specific chunk with and without the **document title prepended**.
- Use similarity scores to show how headers can enhance retrieval quality by grounding the chunk in its broader context.

This evaluation gives us a measurable way to **quantify the improvement** from adding a chunk header using LLM-powered reranking.


In [None]:
def rerank_documents(query: str, chunks: List[str]) -> List[float]:
    """
    Rerank document chunks using Cohere Rerank API.

    Returns a list of relevance scores for each chunk, based on similarity to the query.
    The scores are returned in the original chunk order.
    """
    MODEL = "rerank-english-v3.0"
    client = cohere.Client(api_key=os.environ["CO_API_KEY"])

    reranked_results = client.rerank(model=MODEL, query=query, documents=chunks)
    results = reranked_results.results

    reranked_indices = [result.index for result in results]
    reranked_similarity_scores = [result.relevance_score for result in results]

    # Reorder scores to match original chunk order
    similarity_scores = [0.0] * len(chunks)
    for i, index in enumerate(reranked_indices):
        similarity_scores[index] = reranked_similarity_scores[i]

    return similarity_scores


def compare_chunk_similarities(chunk_index: int, chunks: List[str], document_title: str, query: str) -> None:
    """
    Compare similarity scores for a chunk with and without a contextual document title.

    Prints both scores side-by-side and helps visualize the impact of header-based grounding.
    """
    chunk_text = chunks[chunk_index]
    chunk_wo_header = chunk_text
    chunk_w_header = f"Document Title: {document_title}\n\n{chunk_text}"

    similarity_scores = rerank_documents(query, [chunk_wo_header, chunk_w_header])

    print(f"\n📄 Chunk header:\nDocument Title: {document_title}")
    print(f"\n📄 Chunk text:\n{chunk_text}")
    print(f"\n🔍 Query: {query}")
    print(f"\n❌ Similarity without header: {similarity_scores[0]:.4f}")
    print(f"✅ Similarity with header:    {similarity_scores[1]:.4f}")


# Example run — compare relevance with vs. without header
CHUNK_INDEX_TO_INSPECT = 86
QUERY = "Nike climate change impact"

compare_chunk_similarities(CHUNK_INDEX_TO_INSPECT, chunks, document_title, QUERY)
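The reordering step inside `rerank_documents()` can be checked offline, without an API key. The sketch below assumes, as the function above does, that the rerank response lists results sorted by relevance and that each result carries the `index` of the original document. `FakeRerankResult` is a hypothetical stand-in for the real response objects, and the scores are made-up values used only to show the reordering.

```python
from dataclasses import dataclass


@dataclass
class FakeRerankResult:
    """Hypothetical stand-in for one rerank result (index + relevance score)."""
    index: int
    relevance_score: float


def scores_in_original_order(results: list[FakeRerankResult], num_documents: int) -> list[float]:
    """Map relevance scores back to the order in which documents were submitted."""
    scores = [0.0] * num_documents
    for result in results:
        scores[result.index] = result.relevance_score
    return scores


# Made-up scores, sorted by relevance the way a reranker would return them
fake_results = [
    FakeRerankResult(index=2, relevance_score=0.91),
    FakeRerankResult(index=0, relevance_score=0.40),
    FakeRerankResult(index=1, relevance_score=0.05),
]

ordered_scores = scores_in_original_order(fake_results, num_documents=3)
print(ordered_scores)  # [0.4, 0.05, 0.91]
```

Because `compare_chunk_similarities()` reads `similarity_scores[0]` and `similarity_scores[1]` by position, this mapping is what guarantees the "without header" and "with header" scores are not swapped.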


### 📊 Inference from External Benchmarks

While this notebook focuses on the implementation and intuition behind Contextual Chunk Headers (CCH), it's worth noting how this technique performs in broader benchmarks.

The original KITE benchmark—spanning diverse corpora like legal opinions, financial filings, and open-source docs—shows consistent gains when using CCH. Average performance improved by **~28%** across datasets, with the most dramatic gains seen in long, complex documents like 10-Ks and legal texts.

This reinforces our intuition: adding contextual metadata like document titles or summaries **helps the retriever and reranker better understand the purpose of each chunk**—especially in ambiguous or multi-topic scenarios.

While we haven’t run these full-scale benchmarks ourselves, this external evidence strengthens the case for including lightweight context (e.g., titles or section headers) in real-world RAG pipelines.


---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 🔍 Why Use CCH in RAG?

Traditional RAG systems chunk documents without preserving **contextual information** like section titles or summaries. This leads to chunks that are semantically isolated and harder for retrievers to interpret.

**Contextual Chunk Headers (CCH)** solves this by:
- 🧩 **Adding lightweight context (e.g., title + summary) to each chunk during indexing**
- 🔍 **Helping retrievers and rerankers disambiguate and score chunks more accurately**

This enriches each chunk with purpose-driven metadata, making retrieval **more targeted and semantically aware**.

---

## 🧠 What’s New in This Version?

This CCH implementation includes:

- 🏷️ **Dynamic header construction**: The document title is added to each chunk as a structured header  
- 🧠 **`gpt-4o-mini`-based title generation**: Generates a concise title per document  
- 🧱 **Chunking with injected context**: Each chunk is paired with its respective header  
- 📦 **End-to-end self-contained notebook**: Designed for clarity, reproducibility, and modularity  

It’s built to be easily adapted to other RAG pipelines or document types.
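As a rough sketch of how this could be adapted for indexing, the helper below attaches the same header to every chunk and keeps the original text alongside the text to embed. The function name `attach_header` and the record keys are hypothetical, and the sample chunks are placeholders, not quotes from the report. Keeping both fields is a design assumption: it lets you embed the header-enriched text while still passing the clean chunk to the generator.

```python
def attach_header(chunks: list[str], document_title: str) -> list[dict[str, str]]:
    """Build one record per chunk with the header-enriched text used for embedding."""
    header = f"Document Title: {document_title}"  # same format as the comparison above
    return [
        {
            "header": header,
            "text": chunk,                              # original chunk, unchanged
            "text_to_embed": f"{header}\n\n{chunk}",    # what the embedder would see
        }
        for chunk in chunks
    ]


# Placeholder chunks for illustration only
sample_chunks = ["First placeholder chunk.", "Second placeholder chunk."]
records = attach_header(sample_chunks, "Nike Inc. 2023 Annual Financial Report")

for record in records:
    print(record["text_to_embed"])
    print("---")
```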

---

## 📈 Inferences & Key Takeaways

While we haven’t benchmarked this on large datasets, insights from public evaluations (e.g., KITE) suggest:

- 🧠 CCH improves both **retrieval quality** and **answer grounding**  
- 📄 It especially helps on long or noisy documents (e.g., 10-Ks, legal texts)  
- 🎯 Even a simple title+summary header yields significant gains in chunk relevance  

Contextualizing chunks helps retrieval **without modifying the query or increasing model size**.

---

## 🚀 What Could Be Added Next?

To make this more production-ready or to extend it further:

- 🔁 **Pair with rerankers** — Use LLMs like GPT-4 or Cohere ReRank for post-retrieval scoring  
- 📚 **Evaluate on real benchmarks** — Try on datasets like KITE or FinanceBench  
- 🧪 **Test different header strategies** — Try including section paths, TOC hierarchy, etc.  
- ⚡ **Add fast local embedding options** — Swap in BGE or Instructor models for cost efficiency  
- 🧠 **Explore retrieval fusion** — Combine raw chunk and CCH-based retrieval for hybrid gains  
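For the retrieval fusion idea, one common option is reciprocal rank fusion (RRF), which merges several ranked lists using only rank positions. The sketch below is an assumption about how raw-chunk and CCH-based rankings could be combined, not something this notebook implements. The chunk IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of IDs: each ID scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from two retrievers over the same corpus
raw_ranking = ["c3", "c1", "c2"]   # retrieval over raw chunks
cch_ranking = ["c1", "c4", "c3"]   # retrieval over header-enriched chunks

fused = reciprocal_rank_fusion([raw_ranking, cch_ranking])
print(fused)  # ['c1', 'c3', 'c4', 'c2']
```

Chunks that rank well in both lists rise to the top, while chunks found by only one retriever are still kept as candidates.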

---


## 💡 Final Word

This notebook is part of my larger personal project, **RAG100X**: a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.