# Late Chunking

Below is the implementation of late chunking from the paper - [Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models](https://arxiv.org/abs/2409.04701). The chunking method allows us to retain contextual information from the whole document within the chunks.

The code runs through it step by step before wrapping in a function.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
model_name = "Qwen/Qwen3-Embedding-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

In [3]:
text= """
# Coffee Machine Maintenance

##   Daily Cleaning Routine

After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.

##  Descaling and Internal Maintenance

The machine should be descaled approximately once a month, or more frequently if hard water is used.
Use a descaling solution or a mixture of water and white vinegar, following the manufacturer’s instructions.
This process removes mineral deposits from internal components, which helps maintain water flow and brewing temperature.

## Filter and Component Checks

Regularly inspect and clean or replace the coffee filter, whether it’s reusable or disposable.
Additionally, check the water reservoir and any removable parts for signs of mould or buildup.
Keeping these components clean ensures consistent coffee quality and extends the lifespan of the machine.
"""


In [4]:
chunk_1 = """
Title: Coffee Machine Maintenance
Subtitle: Daily Cleaning Routine
Text: After each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.
This includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.
Ensuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.
"""

chunk_2 = """
Title: Coffee Machine Maintenance
Subtitle: Descaling and Internal Maintenance
Text: The machine should be descaled approximately once a month, or more frequently if hard water is used.
Use a descaling solution or a mixture of water and white vinegar, following the manufacturer’s instructions.
This process removes mineral deposits from internal components, which helps maintain water flow and brewing temperature.
"""

chunk_3 = """
Title: Coffee Machine Maintenance
Subtitle: Filter and Component Checks
Text: Regularly inspect and clean or replace the coffee filter, whether it’s reusable or disposable.
Additionally, check the water reservoir and any removable parts for signs of mould or buildup.
Keeping these components clean ensures consistent coffee quality and extends the lifespan of the machine.
"""

split_chunks = [chunk_1, chunk_2, chunk_3]

In [5]:
def simple_embedding(text, model_name=model_name):
    from sentence_transformers import SentenceTransformer
    
    # Load the model
    model = SentenceTransformer(model_name)

    # Encode the input text
    query_embeddings = model.encode(text, prompt_name="query")

    # Convert to list
    embedding_list = query_embeddings.tolist()

    # Ensure embedding_list is a list of lists
    if isinstance(embedding_list[0], (int, float)):
        embedding_list = [embedding_list]

    # Helper to format numbers
    def format_item(x):
        if isinstance(x, (int, float)):
            return f"{x:.4f}"
        return str(x)

    # Print first 5 elements of each embedding
    for sublist in embedding_list:
        items = sublist[:5]
        formatted = [format_item(x) for x in items]
        if len(sublist) > 5:
            formatted.append("...")
        print(f"[{', '.join(formatted)}]")

    return embedding_list

In [6]:
def late_chunking(text, chunk_size = 50, model=model, tokenizer=tokenizer):
    """Function to perform the Late Chunking method of embedding"""

    # Encode the whole text and ensure the offsets for tokens are stored
    encodings = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,
        padding=True,
        truncation=False
    )

    # Convert offsets from Tensor to numpy array
    offsets = encodings["offset_mapping"][0].numpy()
    num_tokens = len(offsets)

    token_embeddings = []

    # Loop through tokens and embed token by token
    with torch.no_grad():
        outputs = model(**encodings)
        embs = outputs.last_hidden_state.squeeze()
        token_embeddings = embs.tolist()

    chunks = []

    # Now loop through the token embeddings and create the chunks by number of tokens.
    # Then pool the embeddings and take the average point wise across the embeddings for each chunk
    for i in range(0, num_tokens, chunk_size):
    
        start_token = i
        end_token = min(i + chunk_size, num_tokens - 1)
    
        start_char = offsets[start_token][0]
        end_char = offsets[end_token][1] if end_token < num_tokens else offsets[-1][1]

        chunk_text = text[start_char:end_char].strip()
        chunk_embs = token_embeddings[start_token:end_token+1]
        # Take the mean for each position in all embedding vectors for each chunk
        chunk_emb = torch.tensor(chunk_embs).mean(dim=0).tolist()
        
        chunks.append({
            "text": chunk_text,
            "embedding": chunk_emb
        })

    return chunks

In [7]:
result = late_chunking(text, chunk_size=100)

In [8]:
result

[{'text': '# Coffee Machine Maintenance\n\n##   Daily Cleaning Routine\n\nAfter each use, the coffee machine should be cleaned to prevent buildup of coffee oils and residue.\nThis includes emptying and rinsing the carafe, removing used coffee grounds, and wiping down the exterior and drip tray.\nEnsuring these steps are done daily helps maintain hygiene and keeps the machine functioning efficiently.\n\n##  Descaling and Internal Maintenance\n\nThe machine should be descaled approximately once a month, or more frequently if hard water is used.',
  'embedding': [-0.21350090205669403,
   -6.131389141082764,
   -0.2711300551891327,
   1.5745975971221924,
   -1.2774486541748047,
   -2.6624135971069336,
   5.511046886444092,
   -14.986359596252441,
   -6.01763916015625,
   12.031206130981445,
   -4.3222808837890625,
   -4.662331581115723,
   4.790397644042969,
   -0.1141764298081398,
   -4.645014762878418,
   5.283888816833496,
   8.794221878051758,
   -3.3818359375,
   -1.0572710037231445,
