# Contextual Chunk Headers (CCH)

## Overview

Contextual chunk headers (CCH) is a method of generating document-level and section-level context and prepending it, as a chunk header, to each chunk before embedding. This gives the embeddings a more accurate and complete representation of the content and meaning of the text. In our testing, CCH substantially improved retrieval quality (see the eval results at the end of this notebook). In addition to increasing the rate at which the correct information is retrieved, CCH also reduces the rate at which irrelevant results appear in the search results. This in turn reduces the rate at which the LLM misinterprets a piece of text in downstream chat and generation applications.

## Motivation

A large percentage of the problems developers face with RAG come down to this: individual chunks often do not contain sufficient context to be properly used by the retrieval system or the LLM. This leads to questions that cannot be answered and, more worryingly, to hallucinations.

Examples of this problem:
- Chunks often refer to their subject via implicit references and pronouns. This causes them to not be retrieved when they should be, or to not be properly understood by the LLM.
- Naive chunking can split text “mid-thought,” leaving neither chunk with useful context.
- Individual chunks often only make sense in the context of the entire section or document, and can be misleading when read on their own.

## Key Components

#### Contextual chunk headers
The idea here is to add higher-level context to each chunk by prepending a chunk header. This chunk header could be as simple as the document title, or it could combine the document title, a concise document summary, and the full hierarchy of section and sub-section titles.
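
As a rough illustration of the richer header variant, the sketch below assembles a header from whichever pieces of context are available. The `Document Title:` prefix matches the format used later in this notebook. The `Document Summary:` and `Section:` labels, and the helper names `build_chunk_header` and `add_chunk_header`, are assumptions made for this example.

```python
def build_chunk_header(document_title, document_summary="", section_titles=None):
    """Assemble a chunk header from whatever context is available."""
    lines = [f"Document Title: {document_title}"]
    if document_summary:
        # Hypothetical label; the notebook itself only uses the title
        lines.append(f"Document Summary: {document_summary}")
    if section_titles:
        # Full hierarchy, outermost section first (hypothetical format)
        lines.append("Section: " + " > ".join(section_titles))
    return "\n".join(lines)


def add_chunk_header(chunk_text, header):
    """Prepend the header to the chunk, separated by a blank line."""
    return f"{header}\n\n{chunk_text}"


header = build_chunk_header(
    "NIKE, INC. ANNUAL REPORT ON FORM 10-K",
    section_titles=["Risk Factors", "Climate"],
)
print(add_chunk_header("Given the broad and global scope of our operations, ...", header))
```

The text produced this way is what gets embedded or reranked. The original chunk text can still be stored separately if the header should not appear in the final context passed to the LLM.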

## Method Details

#### Document title and summary generation
We use an LLM to generate a title and a descriptive one-sentence summary of what the document is about. The code in this notebook generates only the title.
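
Summary generation can follow the same pattern as title generation. The sketch below is one possible shape, not the prompt actually used in our testing: `DOCUMENT_SUMMARY_PROMPT` and `get_document_summary` are hypothetical names, and the LLM call is passed in as a function so the example does not depend on a particular client.

```python
from typing import Callable

# Hypothetical prompt, modeled on the title prompt used later in this notebook
DOCUMENT_SUMMARY_PROMPT = """
INSTRUCTIONS
Write a one-sentence summary of what the following document is about.

Your response MUST be a single sentence, and nothing else.

DOCUMENT TITLE
{document_title}

DOCUMENT
{document_text}
""".strip()


def get_document_summary(
    document_text: str,
    document_title: str,
    llm_call: Callable[[list], str],
) -> str:
    """Build the summary prompt and send it through the supplied LLM function."""
    prompt = DOCUMENT_SUMMARY_PROMPT.format(
        document_title=document_title, document_text=document_text
    )
    chat_messages = [{"role": "user", "content": prompt}]
    return llm_call(chat_messages).strip()
```

Any function that takes a list of chat messages and returns a string can be supplied as `llm_call`, including the `make_llm_call` helper defined later in this notebook. Long documents would need to be truncated first, as is done for title generation.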

#### Break into semantically similar sections (optional)
Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting and ending lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. The sections then get broken into smaller chunks if needed. The LLM is also prompted to generate descriptive titles for each section.
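
The two mechanical steps around the LLM call can be sketched as follows: annotating the document with line numbers before prompting, and slicing the document into sections once the LLM has returned line ranges. The `[n]` annotation format, zero-based inclusive line ranges, and the shape of the `sections` list are assumptions for this example. The prompt and response parsing are omitted.

```python
def annotate_with_line_numbers(document_text: str) -> str:
    """Prefix every line with its index so the LLM can refer to lines by number."""
    lines = document_text.split("\n")
    return "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))


def extract_sections(document_text: str, sections: list[dict]) -> list[dict]:
    """
    Slice the document into sections.
    Each entry in `sections` is assumed to look like
    {"title": str, "start": int, "end": int}, with inclusive line indices.
    """
    lines = document_text.split("\n")
    result = []
    for section in sections:
        body = "\n".join(lines[section["start"]: section["end"] + 1])
        result.append({"title": section["title"], "text": body})
    return result


doc = "Intro line\nMore intro\nRisk factor one\nRisk factor two"
print(annotate_with_line_numbers(doc))
# Line ranges as they might come back from the LLM (made-up values)
llm_sections = [
    {"title": "Introduction", "start": 0, "end": 1},
    {"title": "Risk Factors", "start": 2, "end": 3},
]
print(extract_sections(doc, llm_sections))
```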

#### Chunk the document
If semantic sectioning is used, each section is split into chunks. Otherwise, the document as a whole is chunked directly, as this notebook does.
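
When sections are available, chunking per section keeps every chunk inside a single section, so the section title can go into its header. Here is a minimal sketch of that path. It assumes sections are dicts with `title` and `text` keys, and it uses a fixed-size character splitter (`simple_split`, a hypothetical stand-in) in place of the `RecursiveCharacterTextSplitter` used later in this notebook.

```python
def simple_split(text: str, chunk_size: int = 800) -> list[str]:
    """Stand-in splitter: fixed-size, non-overlapping character windows."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_sections(sections: list[dict], chunk_size: int = 800, split_fn=simple_split) -> list[dict]:
    """Split each section separately so no chunk spans a section boundary."""
    chunks = []
    for section in sections:
        for piece in split_fn(section["text"], chunk_size):
            chunks.append({"chunk_text": piece, "section_title": section["title"]})
    return chunks


sections = [
    {"title": "Introduction", "text": "x" * 10},
    {"title": "Risk Factors", "text": "y" * 3},
]
for chunk in chunk_sections(sections, chunk_size=4):
    print(chunk)
```

Any splitter with the same `(text, chunk_size)` signature can be passed as `split_fn`.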


## Setup

First, you'll need to set the API keys (`CO_API_KEY` for Cohere and `OPENAI_API_KEY` for OpenAI) as environment variables.

In [52]:
import cohere
import tiktoken
from openai import OpenAI
import os
from scipy.stats import beta
import numpy as np
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter

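# Load environment variables from a .env file
load_dotenv()
# Both keys are read later: CO_API_KEY (Cohere rerank) and OPENAI_API_KEY (title generation).
# Fail early with a clear message; assigning a missing (None) value to os.environ raises a TypeError.
for key in ("CO_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your .env file or environment")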

#### Read the document

In [32]:
file_path = "../data/nike_2023_annual_report.txt"
doc_id = os.path.basename(file_path).split(".")[0] # grab the file name without the extension so we can use it as the doc_id

kb_id = "nike_10k"

with open(file_path, "r") as f:
    document_text = f.read()

In [45]:
# Define the functions for generating the document title and summary

DOCUMENT_TITLE_PROMPT = """
INSTRUCTIONS
What is the title of the following document?

Your response MUST be the title of the document, and nothing else. DO NOT respond with anything else.

{document_title_guidance}

{truncation_message}

DOCUMENT
{document_text}
""".strip()


TRUNCATION_MESSAGE = """
Also note that the document text provided below is just the first ~{num_words} words of the document. That should be plenty for this task. Your response should still pertain to the entire document, not just the text provided below.
""".strip()


def make_llm_call(chat_messages: list[dict]) -> str:
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=chat_messages,
        max_tokens=4000,
        temperature=0.2,
    )
    llm_output = response.choices[0].message.content.strip()
    return llm_output

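def truncate_content(content: str, max_tokens: int) -> tuple[str, int]:
    """Truncate content to at most max_tokens tokens; return the truncated text and its token count."""
    # The gpt-3.5-turbo encoding is only used to approximate length here
    token_encoder = tiktoken.encoding_for_model('gpt-3.5-turbo')
    tokens = token_encoder.encode(content, disallowed_special=())
    truncated_tokens = tokens[:max_tokens]
    return token_encoder.decode(truncated_tokens), min(len(tokens), max_tokens)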

def get_document_title(document_text: str, document_title_guidance: str = ""):
    # truncate the content if it's too long
    max_content_tokens = 4000 # if this number changes, also update num_words in the truncation message below
    document_text, num_tokens = truncate_content(document_text, max_content_tokens)
    if num_tokens < max_content_tokens:
        truncation_message = ""
    else:
        truncation_message = TRUNCATION_MESSAGE.format(num_words=3000)

    # get document title
    prompt = DOCUMENT_TITLE_PROMPT.format(document_title_guidance=document_title_guidance, document_text=document_text, truncation_message=truncation_message)
    chat_messages = [{"role": "user", "content": prompt}]
    document_title = make_llm_call(chat_messages)
    return document_title


In [46]:
def split_into_chunks(text: str, chunk_size: int = 800):
    """
    Note: it's very important that chunk overlap is set to 0 here, since results are created by concatenating chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=0, length_function=len
    )
    texts = text_splitter.create_documents([text])
    chunks = [text.page_content for text in texts]
    return chunks

In [47]:
def transform(x):
    """
    Transformation function to map the absolute relevance value to a value that is more uniformly distributed between 0 and 1
    - This is critical for the new version of RSE to work properly, because it utilizes the absolute relevance values to calculate the similarity scores
    - The relevance values given by the Cohere reranker tend to be very close to 0 or 1. This beta function used here helps to spread out the values more uniformly.
    """
    a, b = 0.4, 0.4  # These can be adjusted to change the distribution shape
    return beta.cdf(x, a, b)

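def rerank_documents(query: str, documents: list) -> tuple[list, list]:
    """
    Use Cohere Rerank API to rerank the search results.
    Returns (similarity_scores, chunk_values), both in the order of the original documents.
    """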
    model = "rerank-english-v3.0"
    client = cohere.Client(api_key=os.environ["CO_API_KEY"])
    decay_rate = 30

    reranked_results = client.rerank(model=model, query=query, documents=documents)
    results = reranked_results.results
    reranked_indices = [result.index for result in results]
    reranked_similarity_scores = [result.relevance_score for result in results] # in order of reranked_indices
    
    # convert back to order of original documents and calculate the chunk values
    similarity_scores = [0] * len(documents)
    chunk_values = [0] * len(documents)
    for i, index in enumerate(reranked_indices):
        absolute_relevance_value = transform(reranked_similarity_scores[i])
        similarity_scores[index] = absolute_relevance_value
        v = np.exp(-i/decay_rate)*absolute_relevance_value # decay the relevance value based on the rank
        chunk_values[index] = v

    return similarity_scores, chunk_values

In [48]:
# Get the document title
document_title = get_document_title(document_text)

In [49]:
# Split into chunks

chunks = []

section_chunks = split_into_chunks(
    document_text, chunk_size=800
)
for chunk in section_chunks:
    chunks.append(
        {
            "chunk_text": chunk,
            "document_title": document_title
        }
    )

In [50]:
documents = []
documents_no_context = [] # baseline for comparison
for i in range(len(chunks)):
    chunk_text = chunks[i]["chunk_text"]
    document = f"Document Title: {document_title}\n\n{chunk_text}"
    documents.append(document)
    documents_no_context.append(chunk_text)

chunk_index_to_inspect = 86
print(documents[chunk_index_to_inspect])

Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K

Given the broad and global scope of our operations, we are particularly vulnerable to the physical risks of climate change, such 
as shifts in weather patterns. Extreme weather conditions in the areas in which our retail stores, suppliers, manufacturers, 
customers, distribution centers, offices, headquarters and vendors are located could adversely affect our operating results and 
financial condition. Moreover, natural disasters such as earthquakes, hurricanes, wildfires, tsunamis, floods or droughts, whether 
occurring in the United States or abroad, and their related consequences and effects, including energy shortages and public 
health issues, have in the past temporarily disrupted, and could in the future disrupt, our operations, the operations of our


In [51]:
query = "Nike climate change impact"

similarity_scores, chunk_values = rerank_documents(query, [documents[chunk_index_to_inspect], documents_no_context[chunk_index_to_inspect]])

print(f"Similarity with contextual chunk header: {similarity_scores[0]}")
print(f"Similarity without contextual chunk header: {similarity_scores[1]}")

Similarity with contextual chunk header: 0.783890004625817
Similarity without contextual chunk header: 0.24543648666467205
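
Besides the similarity scores compared above, `rerank_documents` also returns `chunk_values`, in which each transformed relevance is multiplied by `exp(-rank / decay_rate)`. The following is a minimal sketch of just that decay step, using the standard library instead of NumPy. `decayed_chunk_values` is a hypothetical helper name and the input scores are made up for illustration.

```python
import math


def decayed_chunk_values(relevance_by_rank: list[float], decay_rate: float = 30) -> list[float]:
    """
    Apply rank-based exponential decay to relevance values.
    `relevance_by_rank` is ordered best-ranked first, so index 0 is rank 0.
    """
    return [
        math.exp(-rank / decay_rate) * relevance
        for rank, relevance in enumerate(relevance_by_rank)
    ]


# Three chunks with identical relevance: lower-ranked ones get smaller values
values = decayed_chunk_values([0.9, 0.9, 0.9])
print(values)
```

With `decay_rate = 30`, the top-ranked chunk keeps its full relevance, and a chunk at rank 30 is scaled by `exp(-1)`, roughly 0.37.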


## Eval results

#### KITE

We evaluated CCH on an end-to-end RAG benchmark we created, called KITE (Knowledge-Intensive Task Evaluation).

KITE currently consists of 4 datasets and a total of 50 questions.
- **AI Papers** - ~100 academic papers about AI and RAG, downloaded from arXiv in PDF form.
- **BVP Cloud 10-Ks** - 10-Ks for all companies in the Bessemer Cloud Index (~70 of them), in PDF form.
- **Sourcegraph Company Handbook** - ~800 markdown files, with their original directory structure, downloaded from Sourcegraph's publicly accessible company handbook GitHub [page](https://github.com/sourcegraph/handbook/tree/main/content).
- **Supreme Court Opinions** - All Supreme Court opinions from Term Year 2022 (delivered from January '23 to June '23), downloaded from the official Supreme Court [website](https://www.supremecourt.gov/opinions/slipopinion/22) in PDF form.

Ground truth answers are included with each sample. Most samples also include grading rubrics. Grading is done on a scale of 0-10 for each question, with a strong LLM doing the grading.

We compare CCH with standard Top-k retrieval (k=20). All other parameters remain the same between the two configurations. We use the Cohere 3 reranker, and we use GPT-4o for response generation. The average length of the relevant knowledge string is roughly the same between the two configurations, so cost and latency are similar.

|                         | Top-k    | CCH+Top-k    |
|-------------------------|----------|--------------|
| AI Papers               | 4.5      | 4.7          |
| BVP Cloud               | 2.6      | 6.3          |
| Sourcegraph             | 5.7      | 5.8          |
| Supreme Court Opinions  | 6.1      | 7.4          |
| **Average**             | 4.72     | 6.04         |

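CCH improves performance on each of the four datasets. The gains are small on AI Papers (+0.2) and Sourcegraph (+0.1) and large on BVP Cloud (+3.7) and Supreme Court Opinions (+1.3). The overall average score increases from 4.72 to 6.04, an increase of roughly 28%.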
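
For reference, the relative improvement computed from the rounded averages in the table is:

$$
\frac{6.04 - 4.72}{4.72} = \frac{1.32}{4.72} \approx 0.280
$$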

#### FinanceBench

We've also evaluated CCH on FinanceBench, where it contributed to a score of 83%, compared to a baseline score of 19%. For that benchmark, we tested CCH and relevant segment extraction (RSE) jointly, so we can't say exactly how much CCH contributed to that result. But the combination of CCH and RSE clearly leads to substantial accuracy improvements on FinanceBench.