# Late Chunking

This notebook implements late chunking as described in the paper [Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models](https://arxiv.org/abs/2409.04701). Late chunking embeds the whole document first and only then splits the token embeddings into chunks, so each chunk embedding retains contextual information from the whole document.

The notebook walks through the method step by step before wrapping it in a function.
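
To make the idea concrete before any model is loaded, here is a minimal sketch of the pooling step on its own. The token embeddings are made-up toy values standing in for the per-token output of a long-context model, and `mean_pool` and `pool_spans` are hypothetical helper names used only for this illustration.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def pool_spans(token_embeddings, spans):
    """Pool token embeddings over inclusive (start, end) token spans."""
    return [mean_pool(token_embeddings[start:end + 1]) for start, end in spans]


# Toy stand-in for the per-token output of the model:
# four tokens, two dimensions each (values are made up for illustration).
toy_token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]

# Two chunks: tokens 0-1 and tokens 2-3.
toy_chunk_embeddings = pool_spans(toy_token_embeddings, [(0, 1), (2, 3)])
print(toy_chunk_embeddings)  # [[2.0, 1.0], [1.0, 3.0]]
```

In the real method, every token embedding has already attended to the whole document, which is what lets the pooled chunk vectors carry context from outside the chunk.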

In [33]:
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

import torch

import numpy as np

In [5]:
model_name = "Qwen/Qwen3-Embedding-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

## Sample Text

Below is some sample text, followed by three different ways of creating chunks from it:
* Simple Chunks - Keep only the body of text under each header. These strings are later located in the full text with a text search.
* Title Chunks - Keep the body of text together with its Markdown header. This gives a slightly more enriched chunk that can still be found verbatim within the main text.
* Enriched Chunks - Add the main header and subheader as metadata. This text does not appear verbatim in the source document, so it cannot be located by text search and is used only as a baseline.
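
The chunks in this notebook are written out by hand, but the same three variants could be generated from the Markdown structure. The sketch below assumes a single `#` title and `##` section headers; `build_chunk_variants` is a hypothetical helper, and its whitespace handling differs slightly from the hand-written chunks (it strips leading and trailing newlines).

```python
def build_chunk_variants(markdown_text):
    """Split markdown on '##' headers and build three chunk styles per section."""
    title = ""
    sections = []  # each entry: (header_line, subtitle, body_lines)
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            sections.append((line, stripped[3:].strip(), []))
        elif stripped.startswith("# "):
            title = stripped[2:].strip()
        elif sections:
            sections[-1][2].append(line)

    variants = []
    for header_line, subtitle, body_lines in sections:
        body = "\n".join(body_lines).strip()
        variants.append({
            "simple": body,
            "title": f"{header_line}\n\n{body}",
            "enriched": f"Title: {title}\nSubtitle: {subtitle}\nText: {body}",
        })
    return variants


sample = "# Guide\n\n## First Part\n\nAlpha line.\nBeta line.\n\n## Second Part\n\nGamma line.\n"
variants = build_chunk_variants(sample)
print(variants[0]["enriched"])
```

The `simple` and `title` variants stay verbatim substrings of the source as long as each header is followed by exactly one blank line, so they can still be found with a plain text search. The `enriched` variant cannot.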

In [6]:
text = """
# Coffee Machine Maintenance

##   Daily Cleaning Routine

After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.

##  Descaling and Internal Maintenance

The machine should be descaled approximately once a month, or more frequently if hard water is used.
Use a descaling solution or a mixture of water and white vinegar, following the manufacturer’s instructions.
This process removes mineral deposits from internal components, which helps maintain water flow and brewing temperature.

## Filter and Component Checks

Regularly inspect and clean or replace the coffee filter, whether it’s reusable or disposable.
Additionally, check the water reservoir and any removable parts for signs of mould or buildup.
Keeping these components clean ensures consistent coffee quality and extends the lifespan of the machine.
"""

In [7]:
simple_chunk_1 = """After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.
"""

simple_chunk_2 = """The machine should be descaled approximately once a month, or more frequently if hard water is used.
Use a descaling solution or a mixture of water and white vinegar, following the manufacturer’s instructions.
This process removes mineral deposits from internal components, which helps maintain water flow and brewing temperature.
"""

simple_chunk_3 = """Regularly inspect and clean or replace the coffee filter, whether it’s reusable or disposable.
Additionally, check the water reservoir and any removable parts for signs of mould or buildup.
Keeping these components clean ensures consistent coffee quality and extends the lifespan of the machine.
"""

simple_split_chunks = [simple_chunk_1, simple_chunk_2, simple_chunk_3]

In [8]:
title_chunk_1 = """
##   Daily Cleaning Routine

After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.
"""

title_chunk_2 = """
##  Descaling and Internal Maintenance

The machine should be descaled approximately once a month, or more frequently if hard water is used.
Use a descaling solution or a mixture of water and white vinegar, following the manufacturer’s instructions.
This process removes mineral deposits from internal components, which helps maintain water flow and brewing temperature.
"""

title_chunk_3 = """
## Filter and Component Checks

Regularly inspect and clean or replace the coffee filter, whether it’s reusable or disposable.
Additionally, check the water reservoir and any removable parts for signs of mould or buildup.
Keeping these components clean ensures consistent coffee quality and extends the lifespan of the machine.
"""

title_split_chunks = [title_chunk_1, title_chunk_2, title_chunk_3]

In [9]:
enriched_chunk_1 = """
Title: Coffee Machine Maintenance
Subtitle: Daily Cleaning Routine
Text: After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.
"""

enriched_chunk_2 = """
Title: Coffee Machine Maintenance
Subtitle: Descaling and Internal Maintenance
Text: The machine should be descaled approximately once a month, or more frequently if hard water is used.
Use a descaling solution or a mixture of water and white vinegar, following the manufacturer’s instructions.
This process removes mineral deposits from internal components, which helps maintain water flow and brewing temperature.
"""

enriched_chunk_3 = """
Title: Coffee Machine Maintenance
Subtitle: Filter and Component Checks
Text: Regularly inspect and clean or replace the coffee filter, whether it’s reusable or disposable.
Additionally, check the water reservoir and any removable parts for signs of mould or buildup.
Keeping these components clean ensures consistent coffee quality and extends the lifespan of the machine.
"""

enriched_split_chunks = [enriched_chunk_1, enriched_chunk_2, enriched_chunk_3]

## Simple Embedding Function

Here we define a function that embeds text (a single string or a list of strings) with `SentenceTransformer`.

In [35]:
def simple_embedding(text, model_name=model_name):
    """Embed a string or a list of strings; always returns a list of embedding vectors"""

    # Load the model (reloaded on every call, which is slow but keeps the function self-contained)
    model = SentenceTransformer(model_name)

    # Encode the input text
    query_embeddings = model.encode(text)

    # Convert to list
    embedding_list = query_embeddings.tolist()

    # A single string yields one flat vector, so wrap it to get a list of lists
    if isinstance(embedding_list[0], (int, float)):
        embedding_list = [embedding_list]

    return embedding_list

### Perform Simple Embedding on the Text Chunks Above

In [36]:
simple_chunk_embedding = simple_embedding(simple_split_chunks)
title_chunk_embedding = simple_embedding(title_split_chunks)
enriched_chunk_embedding = simple_embedding(enriched_split_chunks)

## Late Chunking Function

Below is a simple late chunking function that embeds the whole text and then splits the token embeddings into chunks of a fixed size (50 tokens by default).

In [10]:
def late_chunking(text, chunk_size=50, model=model, tokenizer=tokenizer):
    """Function to perform the Late Chunking method of embedding"""

    # Encode the whole text and ensure the offsets for tokens are stored
    encodings = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,
        padding=True,
        truncation=False
    )

    # Convert offsets from Tensor to numpy array
    offsets = encodings["offset_mapping"][0].numpy()
    num_tokens = len(offsets)

    token_embeddings = []

    # Embed the whole text in a single forward pass to get one contextual embedding per token
    with torch.no_grad():
        outputs = model(**encodings)
        embs = outputs.last_hidden_state.squeeze()
        token_embeddings = embs.tolist()

    chunks = []

    # Walk over the token embeddings in steps of chunk_size tokens.
    # end_token is inclusive, so each chunk shares its boundary token with the next chunk.
    # Each chunk embedding is the element-wise mean of the token embeddings in the chunk.
    for i in range(0, num_tokens, chunk_size):
    
        start_token = i
        end_token = min(i + chunk_size, num_tokens - 1)
    
        start_char = offsets[start_token][0]
        end_char = offsets[end_token][1] if end_token < num_tokens else offsets[-1][1]

        chunk_text = text[start_char:end_char].strip()
        chunk_embs = token_embeddings[start_token:end_token+1]
        # Take the mean for each position in all embedding vectors for each chunk
        chunk_emb = torch.tensor(chunk_embs).mean(dim=0).tolist()
        
        chunks.append({
            "text": chunk_text,
            "embedding": chunk_emb
        })

    return chunks
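
The windowing logic in the function above can be checked in isolation, without a model. The first helper below reproduces the ranges exactly as the function computes them; because `end_token` is used inclusively, consecutive chunks share one boundary token. The second helper is an alternative I am assuming here, not part of the original function, that yields strictly non-overlapping ranges. Both helper names are hypothetical.

```python
def fixed_size_ranges(num_tokens, chunk_size=50):
    """Inclusive (start, end) ranges as computed in late_chunking above."""
    return [
        (i, min(i + chunk_size, num_tokens - 1))
        for i in range(0, num_tokens, chunk_size)
    ]


def non_overlapping_ranges(num_tokens, chunk_size=50):
    """Alternative (assumption): inclusive ranges holding at most chunk_size tokens."""
    return [
        (i, min(i + chunk_size, num_tokens) - 1)
        for i in range(0, num_tokens, chunk_size)
    ]


print(fixed_size_ranges(120))       # [(0, 50), (50, 100), (100, 119)]
print(non_overlapping_ranges(120))  # [(0, 49), (50, 99), (100, 119)]
```

Switching to the non-overlapping variant would change the chunk text and embeddings slightly, so the outputs recorded later in this notebook correspond to the first variant.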

## Custom Late Chunking Function

What if we want chunks that are not a fixed length? Here I define two slightly more sophisticated functions: the first searches the full text for a target string and converts its character positions into a token range, and the second uses those token ranges as the chunk boundaries.

In [11]:
def get_token_range_for_text(text, target_text, tokenizer=tokenizer):
    """Helper function to find the token range corresponding to a target text substring"""
    
    # Encode the entire text to get the original token offsets
    encodings = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,
        padding=True,
        truncation=False
    )
    offsets = encodings["offset_mapping"][0].numpy()

    # Find the character start and end of the target text in the original text
    target_start = text.find(target_text)
    if target_start == -1:
        raise ValueError("Target text not found in the original text.")
    target_end = target_start + len(target_text)

    # Locate the tokens that cover the target character range
    start_token = None
    end_token = None
    for i, (char_start, char_end) in enumerate(offsets):
        if char_start <= target_start < char_end:
            start_token = i
        if char_start < target_end <= char_end:
            end_token = i
            break

    if start_token is None or end_token is None:
        raise ValueError("Target text spans beyond token boundaries.")

    return (start_token, end_token)
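
The character-to-token lookup above can be tried on toy data without loading a tokenizer. The offsets below are hypothetical (one token per word, with the spaces not covered by any token); real tokenizers differ, for example in how they attach spaces to tokens. `token_range_from_offsets` is a hypothetical name that mirrors the loop in the helper above but returns `None` instead of raising.

```python
def token_range_from_offsets(offsets, target_start, target_end):
    """Return inclusive token indices covering characters [target_start, target_end)."""
    start_token = None
    end_token = None
    for i, (char_start, char_end) in enumerate(offsets):
        if char_start <= target_start < char_end:
            start_token = i
        if char_start < target_end <= char_end:
            end_token = i
            break
    return start_token, end_token


# Hypothetical offsets for "clean the machine", one token per word.
toy_text = "clean the machine"
toy_offsets = [(0, 5), (6, 9), (10, 17)]

target = "the machine"
start = toy_text.find(target)
print(token_range_from_offsets(toy_offsets, start, start + len(target)))  # (1, 2)

# A target starting in a gap that no token covers has no start token.
print(token_range_from_offsets(toy_offsets, 5, 9))  # (None, 1)
```

The second call shows the case in which the original helper raises its `ValueError`: the target begins at a character that falls between token offsets.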

In [12]:
def late_chunking_with_ranges(
    text, 
    chunk_size=50, 
    custom_chunk_ranges=None,
    model=model, 
    tokenizer=tokenizer
):
    """Function to perform the Late Chunking method of embedding with optional custom chunk boundaries"""

    # Encode the whole text and ensure the offsets for tokens are stored
    encodings = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,
        padding=True,
        truncation=False
    )

    # Convert offsets from Tensor to numpy array
    offsets = encodings["offset_mapping"][0].numpy()
    num_tokens = len(offsets)

    token_embeddings = []

    # Embed the whole text in a single forward pass to get one contextual embedding per token
    with torch.no_grad():
        outputs = model(**encodings)
        embs = outputs.last_hidden_state.squeeze()
        token_embeddings = embs.tolist()

    chunks = []

    # Determine chunk ranges
    if custom_chunk_ranges is not None:
        # Use custom defined chunk ranges
        chunk_ranges = custom_chunk_ranges
    else:
        # Default to chunking by fixed size
        chunk_ranges = [(i, min(i + chunk_size, num_tokens - 1)) for i in range(0, num_tokens, chunk_size)]

    # Now process each defined chunk range
    for start_token, end_token in chunk_ranges:
        start_char = offsets[start_token][0]
        end_char = offsets[end_token][1] if end_token < num_tokens else offsets[-1][1]

        chunk_text = text[start_char:end_char].strip()
        chunk_embs = token_embeddings[start_token:end_token+1]
        # Take the mean for each position in all embedding vectors for each chunk
        chunk_emb = torch.tensor(chunk_embs).mean(dim=0).tolist()
        
        chunks.append({
            "text": chunk_text,
            "embedding": chunk_emb
        })

    return chunks

## Get the Ranges for Chunks within Text

### Simple Text

In [14]:
chunk_1_range = get_token_range_for_text(text, simple_chunk_1)
chunk_2_range = get_token_range_for_text(text, simple_chunk_2)
chunk_3_range = get_token_range_for_text(text, simple_chunk_3)

chunk_ranges = [chunk_1_range, chunk_2_range, chunk_3_range]
chunk_ranges

[(12, 72), (81, 138), (145, 195)]

### Title Text

In [15]:
chunk_title_1_range = get_token_range_for_text(text, title_chunk_1)
chunk_title_2_range = get_token_range_for_text(text, title_chunk_2)
chunk_title_3_range = get_token_range_for_text(text, title_chunk_3)

chunk_title_ranges = [chunk_title_1_range, chunk_title_2_range, chunk_title_3_range]
chunk_title_ranges

[(5, 72), (72, 138), (138, 195)]
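
Since the ranges are inclusive, it is worth checking whether neighbouring chunks share a token. The sketch below runs that check on the two outputs recorded above; `shared_boundary_tokens` is a hypothetical helper. One plausible cause of the shared tokens in the title ranges is that the newlines between sections are covered by a single token, but I have not verified this against the tokenizer.

```python
def shared_boundary_tokens(ranges):
    """Return token indices that end one inclusive range and start the next."""
    return [
        prev_end
        for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:])
        if prev_end == next_start
    ]


simple_ranges = [(12, 72), (81, 138), (145, 195)]  # output recorded above
title_ranges = [(5, 72), (72, 138), (138, 195)]    # output recorded above

print(shared_boundary_tokens(simple_ranges))  # []
print(shared_boundary_tokens(title_ranges))   # [72, 138]
```

A shared token contributes to the mean of both neighbouring chunks. With chunks of roughly 60 tokens the effect on the pooled embedding is probably small, but it is a detail to keep in mind when comparing methods.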

## Perform Late Chunking

### Without Specific Ranges

In [31]:
result = late_chunking(text)

print(result[0]['text'])
print(result[0]['embedding'][0:5])

# Coffee Machine Maintenance

##   Daily Cleaning Routine

After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the
[-0.1366955190896988, -6.695122718811035, -0.15915973484516144, 0.11862444132566452, -2.3169548511505127]


### With Simple Text Range (No Headers)

In [27]:
simple_result = late_chunking_with_ranges(text, custom_chunk_ranges=chunk_ranges)

print(simple_result[0]['text'])
print(simple_result[0]['embedding'][0:5])

After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.
[-0.17096972465515137, -6.924577713012695, -0.03905126079916954, 1.0475043058395386, -1.8993895053863525]


### With Title Text Ranges

In [28]:
title_result = late_chunking_with_ranges(text, custom_chunk_ranges=chunk_title_ranges)

print(title_result[0]['text'])
print(title_result[0]['embedding'][0:5])

##   Daily Cleaning Routine

After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.
[-0.18933133780956268, -6.56510591506958, -0.11924945563077927, 0.9724718928337097, -1.7468141317367554]


## Cosine Similarity

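Below is the function to calculate the cosine similarity, which is the cosine of the angle between two vectors: 1 means they point in the same direction, 0 means they are perpendicular, and -1 means they point in opposite directions.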

In [34]:
def cosine_similarity(a: list[float], b: list[float]) -> float:
    arr1 = np.array(a)
    arr2 = np.array(b)
    return float(arr1.dot(arr2) / (np.linalg.norm(arr1) * np.linalg.norm(arr2)))
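
As a quick sanity check, the same formula can be evaluated by hand on vectors where the answer is obvious. The sketch below is a pure-Python equivalent under a different name (`cosine_similarity_pure`, hypothetical) so it does not shadow the NumPy version above.

```python
import math


def cosine_similarity_pure(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity_pure([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_similarity_pure([1.0, 0.0], [0.0, 3.0]))   # perpendicular: 0.0
print(cosine_similarity_pure([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

The first case shows that vector length does not matter: only the direction of the two vectors affects the score.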

## Similarity Scoring with Question Example

Below I use the embeddings calculated above and a sample question to check the similarity between the question and each chunk.

In [37]:
question = "How do i clean the drinks machine?"
question_embed = simple_embedding(question)[0]

### Simple Embedding Similarity

In [38]:
for chunk in simple_chunk_embedding:
    sim = cosine_similarity(question_embed, chunk)
    print(sim)

0.6121795947891503
0.5415360480186737
0.5336931622388909


In [39]:
for chunk in title_chunk_embedding:
    sim = cosine_similarity(question_embed, chunk)
    print(sim)

0.5889322870180265
0.5167286511227129
0.5083043711352284


In [40]:
for chunk in enriched_chunk_embedding:
    sim = cosine_similarity(question_embed, chunk)
    print(sim)

0.5611430670921256
0.5320672727309541
0.4838873749405817


### Late Chunking Similarity

In [41]:
for res in simple_result:
    embedding = res['embedding']
    sim = cosine_similarity(question_embed, embedding)
    print(sim)

0.6087372591271948
0.4980847456652066
0.5025916297033763


In [42]:
for res in title_result:
    embedding = res['embedding']
    sim = cosine_similarity(question_embed, embedding)
    print(sim)

0.6235854485507689
0.5277800851847356
0.5138187483650648
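
To compare the methods side by side, the scores printed above can be ranked per method. The sketch below copies the recorded scores, rounded to four decimal places, and reports the chunk ranking along with the margin between the best and second-best chunk. `rank_by_similarity` is a hypothetical helper, and a single question on a single document is far too little evidence to draw general conclusions from.

```python
def rank_by_similarity(scores):
    """Return chunk indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Scores recorded above, rounded to four decimal places.
recorded_scores = {
    "simple embedding": [0.6122, 0.5415, 0.5337],
    "title embedding": [0.5889, 0.5167, 0.5083],
    "enriched embedding": [0.5611, 0.5321, 0.4839],
    "late chunking (simple ranges)": [0.6087, 0.4981, 0.5026],
    "late chunking (title ranges)": [0.6236, 0.5278, 0.5138],
}

for method, scores in recorded_scores.items():
    order = rank_by_similarity(scores)
    margin = scores[order[0]] - scores[order[1]]
    print(f"{method}: ranking={order}, top margin={margin:.4f}")
```

Every method ranks the first chunk (the daily cleaning routine) highest for this question. The methods differ in how far ahead of the other chunks it scores.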
