# **How Late Chunking Can Enhance Your Retrieval Systems**

In modern information retrieval systems, how text is split into chunks strongly affects the quality of search and question answering. Traditional chunking methods split the text before encoding, so each chunk is embedded without the context around it. **Late Chunking** reverses the order: the full text is encoded first, and the token embeddings are then pooled per chunk, so each chunk embedding still reflects the surrounding text.

This tutorial will guide you through implementing Late Chunking with Hugging Face's `transformers` library and the `jinaai/jina-embeddings-v2-base-en` model. You will install the required library, split text into sentence chunks, apply Late Chunking, and then compare both chunking techniques on a short example to see where Late Chunking helps.

## **Step 1: Installing Required Libraries**

First, let's install the `transformers` library, which provides access to pre-trained models for various natural language processing (NLP) tasks:

In [None]:
!pip install transformers==4.43.4

Collecting transformers==4.43.4
  Downloading transformers-4.43.4-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.43.4-py3-none-any.whl (9.4 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.44.2
    Uninstalling transformers-4.44.2:
      Successfully uninstalled transformers-4.44.2
Successfully installed transformers-4.43.4


This ensures that the correct version of `transformers` is used, which supports the functionality required for this tutorial.
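Because the pinned version replaces whatever the environment shipped with, it can be worth confirming which version is actually installed before moving on. The following is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name introduced here for illustration, not part of `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


EXPECTED = "4.43.4"  # the version pinned in the install cell above
found = installed_version("transformers")
if found != EXPECTED:
    # Only a hint; the tutorial itself does not require this check
    print(f"Expected transformers {EXPECTED}, found {found}; consider re-running the install cell.")
```

If nothing is printed, the installed version matches the pin.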

## **Step 2: Loading the Pre-trained Model and Tokenizer**

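To get started, we'll load the model and tokenizer. Here, we use `jinaai/jina-embeddings-v2-base-en`, an embedding model that uses mean pooling and, according to its model card, accepts inputs of up to 8,192 tokens. Both properties matter for Late Chunking: the long context window lets the whole document be encoded in one pass, and mean pooling over a chunk's tokens matches how the model builds its own text embeddings. You can replace it with another model that has similar capabilities.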

In [None]:
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

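This step initializes the tokenizer and model that we will use to process the input text and generate embeddings. Note that `trust_remote_code=True` runs model code downloaded from the Hugging Face Hub (here, from `jinaai/jina-bert-implementation`); pin a revision if you want to avoid picking up new versions of that code.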

## **Step 3: Defining a Text Chunking Function**

The next step involves splitting the text into smaller chunks based on sentence boundaries. Here's how the `chunk_by_sentences` function works:

In [None]:
def chunk_by_sentences(input_text: str, tokenizer: callable):
    """
    Split the input text into sentences using the tokenizer
    :param input_text: The text snippet to split into sentences
    :param tokenizer: The tokenizer to use
    :return: A tuple containing the list of text chunks and their corresponding token spans
    """
    inputs = tokenizer(input_text, return_tensors='pt', return_offsets_mapping=True)
    punctuation_mark_id = tokenizer.convert_tokens_to_ids('.')
    sep_id = tokenizer.convert_tokens_to_ids('[SEP]')
    token_offsets = inputs['offset_mapping'][0]
    token_ids = inputs['input_ids'][0]

    # A '.' token ends a sentence when the next token starts after a gap in the
    # character offsets (whitespace) or when the next token is [SEP].
    # Each entry is (token index of the period, character index just past it).
    chunk_positions = [
        (i, int(start + 1))
        for i, (token_id, (start, end)) in enumerate(zip(token_ids, token_offsets))
        if token_id == punctuation_mark_id
        and (
            token_offsets[i + 1][0] - token_offsets[i][1] > 0
            or token_ids[i + 1] == sep_id
        )
    ]

    # Pair each boundary with the previous one. (1, 0) starts the first chunk
    # at token 1 (skipping [CLS]) and at character 0.
    chunks = [
        input_text[x[1] : y[1]]
        for x, y in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    span_annotations = [
        (x[0], y[0]) for (x, y) in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    return chunks, span_annotations

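This function tokenizes the input text and treats a `.` token as a sentence boundary when it is followed by whitespace or by the closing `[SEP]` token, so the period inside `3.85` does not split a chunk. It returns the chunk strings together with their span annotations, the `(start, end)` token indices of each chunk, which the Late Chunking step uses for pooling. Any text after the last sentence-ending period is not included in a chunk.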
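To see the boundary rule in isolation, here is a small sketch that applies the same test to hand-written tokens. The token list and character offsets below are made up for illustration and are not real tokenizer output; `find_sentence_spans` is a hypothetical name.

```python
def find_sentence_spans(tokens):
    """
    tokens: list of (text, char_start, char_end) tuples, with special tokens
    written as ("[CLS]", 0, 0) and ("[SEP]", 0, 0) to mimic an offset mapping.
    Returns (start_token_index, period_token_index) pairs, one per sentence.
    """
    boundaries = []
    for i, (text, start, end) in enumerate(tokens[:-1]):
        if text != ".":
            continue
        next_text, next_start, _ = tokens[i + 1]
        # Sentence end: whitespace follows the period (gap in offsets),
        # or the period is the last token before [SEP].
        if next_start - end > 0 or next_text == "[SEP]":
            boundaries.append(i)
    # Token 0 is [CLS], so the first span starts at token 1; later spans
    # start at the previous boundary, following the tutorial's convention.
    starts = [1] + boundaries[:-1]
    return list(zip(starts, boundaries))


# Hand-tokenized version of "It is 3.5 km. Far."
toy_tokens = [
    ("[CLS]", 0, 0),
    ("It", 0, 2), ("is", 3, 5),
    ("3", 6, 7), (".", 7, 8), ("5", 8, 9),   # decimal point: no gap after it
    ("km", 10, 12), (".", 12, 13),           # followed by a space: boundary
    ("Far", 14, 17), (".", 17, 18),          # followed by [SEP]: boundary
    ("[SEP]", 0, 0),
]
print(find_sentence_spans(toy_tokens))  # [(1, 7), (7, 9)]
```

The decimal point at token 4 is skipped because the next token starts exactly where it ends, while the periods at tokens 7 and 9 each close a span.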

### **Example Input Text**

In [None]:
input_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')

Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by population."
- " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."


The text is split into chunks, and the span annotations are stored for later use.

## **Step 4: Implementing Late Chunking**

Traditional chunking splits the text into chunks before encoding, so each chunk is embedded in isolation and loses the context around it. Late Chunking instead encodes the full text once and then pools the token embeddings inside each chunk span. Here's how you can implement Late Chunking:
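The difference in the order of operations can be sketched with a toy example. `toy_encode` below is a made-up stand-in for a transformer (it simply adds the mean of its input to every value) and the "tokens" are plain numbers; it only illustrates why encoding before pooling lets each chunk see the rest of the document.

```python
def toy_encode(token_values):
    """Toy stand-in for a contextual encoder: each output mixes in the mean of the whole input."""
    context = sum(token_values) / len(token_values)
    return [value + context for value in token_values]


def mean(values):
    return sum(values) / len(values)


document = [1.0, 3.0, 10.0, 20.0]  # four toy "tokens"
spans = [(0, 2), (2, 4)]           # two chunks of two tokens each

# Traditional chunking: split first, encode each chunk on its own, then pool.
naive = [mean(toy_encode(document[start:end])) for start, end in spans]

# Late chunking: encode the whole document once, then pool per span.
encoded = toy_encode(document)
late = [mean(encoded[start:end]) for start, end in spans]

print(naive)  # [4.0, 30.0]
print(late)   # [10.5, 23.5]
```

With traditional chunking each chunk only reflects its own tokens, whereas the late-chunked values are shifted by the document-wide context (8.5 in this toy).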

In [None]:
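def late_chunking(model_output, span_annotation, max_length=None):
    # model_output[0] holds the token-level embeddings, one row per input token
    token_embeddings = model_output[0]
    outputs = []

    # One iteration per document in the batch
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if max_length is not None:
            # Clip spans to the truncated input and drop spans that start beyond it
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        # Mean-pool the token embeddings inside each non-empty span
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        # Convert to NumPy arrays for the similarity computation later on
        pooled_embeddings = [embedding.detach().cpu().numpy() for embedding in pooled_embeddings]
        outputs.append(pooled_embeddings)

    return outputs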

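This function mean-pools the token embeddings inside each span after the full text has been encoded, so every chunk embedding is built from tokens that were computed with the whole text in view. When `max_length` is set, spans are clipped to the truncated input and spans that start beyond it are dropped.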
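The pooling and clipping arithmetic can be checked by hand on a tiny matrix. This is a sketch in plain Python lists rather than tensors; `pool_spans` is a hypothetical name, and the two-dimensional vectors are invented values, not model output.

```python
def pool_spans(token_vectors, spans, max_length=None):
    """Mean-pool rows of `token_vectors` over each (start, end) span, end exclusive."""
    if max_length is not None:
        # Same clipping rule as late_chunking: cap the end at max_length - 1
        # and drop spans that start at or beyond that position.
        spans = [
            (start, min(end, max_length - 1))
            for start, end in spans
            if start < max_length - 1
        ]
    pooled = []
    for start, end in spans:
        if end - start < 1:
            continue  # skip empty spans
        rows = token_vectors[start:end]
        dim = len(rows[0])
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled


token_vectors = [
    [0.0, 0.0],  # position 0, e.g. [CLS]
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
    [0.0, 0.0],  # position 5, e.g. [SEP]
]
spans = [(1, 3), (3, 5)]

print(pool_spans(token_vectors, spans))                # [[2.0, 3.0], [6.0, 7.0]]
print(pool_spans(token_vectors, spans, max_length=5))  # [[2.0, 3.0], [5.0, 6.0]]
print(pool_spans(token_vectors, spans, max_length=4))  # [[2.0, 3.0]]
```

With `max_length=5` the second span is clipped to a single token, and with `max_length=4` it is dropped entirely because it starts at the clipping position.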

## **Step 5: Encoding the Chunks**

Now, let's encode the chunks using both traditional chunking and Late Chunking:

In [None]:
# Encode using traditional chunking
embeddings_traditional_chunking = model.encode(chunks)

# Encode using Late Chunking
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = late_chunking(model_output, [span_annotations])[0]

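In this code, the text is embedded twice: once with traditional chunking, where each chunk string is passed to `model.encode` on its own, and once with Late Chunking, where the full text is encoded in a single forward pass and the token embeddings are then pooled per chunk span.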

## **Step 6: Comparing Similarities**

Finally, we'll compare the similarity of the word "Berlin" with the chunks, both for traditional and Late Chunking methods. The cosine similarity function is used to measure how close the embeddings are:

In [None]:
import numpy as np

def cos_sim(x, y):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

berlin_embedding = model.encode('Berlin')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))

similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.85316974
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.8486219
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.8365807
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.70843387


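In this run, Late Chunking scores higher for both chunks. The first chunk changes little (about 0.853 vs. 0.849) because it names Berlin explicitly. The gap is much larger for the second chunk (about 0.837 vs. 0.708): that sentence refers to the city only as "Its", so an embedding computed from the isolated sentence cannot link it to Berlin, while the late-chunked embedding is pooled from tokens that were encoded together with the first sentence.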
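For reference, the cosine similarity used above can be worked through by hand, and the same score can be used to rank chunks against a query. The sketch below uses plain Python and invented two-dimensional vectors; `cosine_similarity` and `rank_chunks` are hypothetical helper names, not part of any library.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def rank_chunks(query_vector, chunk_vectors):
    """Return chunk indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


query = [3.0, 4.0]
chunk_vectors = [
    [4.0, 3.0],    # similar direction: (12 + 12) / (5 * 5) = 0.96
    [6.0, 8.0],    # same direction, twice the length: 1.0
    [-3.0, -4.0],  # opposite direction: -1.0
]

print(rank_chunks(query, chunk_vectors))  # [1, 0, 2]
```

Because cosine similarity ignores vector length, the second vector scores 1.0 even though it is twice as long as the query.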

## **Conclusion**

By following this tutorial, you have implemented Late Chunking and compared it with traditional chunking. Because Late Chunking pools chunk embeddings from token embeddings that were computed over the full text, chunks that depend on earlier context keep that context; in the example above, this raised the similarity between "Berlin" and a sentence that only refers to the city by pronoun.

You can expand this method by integrating it into a full retrieval system, for example by storing the chunk embeddings in a vector database such as ChromaDB or by wiring it into a framework such as LangChain.