# Hybrid chunking

## Overview

Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.

For more details, see the [hybrid chunker concept documentation](../../concepts/chunking#hybrid-chunker).

## Setup

In [1]:
%pip install -qU pip docling transformers

Note: you may need to restart the kernel to use updated packages.


In [1]:
from pathlib import Path

DOC_SOURCE = Path("../tests/data/pdf/2206.01062.pdf")

## Basic usage

We first convert the document:

In [2]:
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert(source=DOC_SOURCE).document


For a basic chunking scenario, we can just instantiate a `HybridChunker`, which will use
the default parameters.

In [14]:
from docling.chunking import HybridChunker

chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)

Token indices sequence length is longer than the specified maximum sequence length for this model (2914 > 512). Running this sequence through the model will result in indexing errors


> 👉 **NOTE**: As shown above, using the `HybridChunker` can sometimes trigger a warning from the transformers library. This is a "false alarm"; for details, see [the FAQ entry](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model).

Note that the text you would typically want to embed is the context-enriched one as
returned by the `contextualize()` method:

In [15]:
for i, chunk in enumerate(chunk_iter):
    print(f"=== {i} ===")
    print(f"chunk.text:\n{f'{chunk.text[:300]}…'!r}")

    enriched_text = chunker.contextualize(chunk=chunk)
    print(f"chunker.contextualize(chunk):\n{f'{enriched_text[:300]}…'!r}")

    print()

=== 0 ===
chunk.text:
'Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com\nChristoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com\nAhmed S. Nassar IBM Research\nRueschlikon, Switzerland ahn@zurich.ibm.com\nMichele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com\nPeter Staa…'
chunker.contextualize(chunk):
'DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis\nBirgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com\nChristoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com\nAhmed S. Nassar IBM Research\nRueschlikon, Switzerland ahn@zurich.ibm.com\nMichele D…'

=== 1 ===
chunk.text:
'PDF document conversion, layout segmentation, object-detection, data set, Machine Learning…'
chunker.contextualize(chunk):
'KEYWORDS\nPDF document conversion, layout segmentation, object-detection, data set, Machine Learning…'

=== 2 ===
chunk.text:
"Birgit Pfitzmann, Christoph Auer, Michele Dolfi

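To make the difference between `chunk.text` and the enriched text concrete, here is a minimal sketch of the idea, using the values printed for chunk 1 above. The helper `contextualize_sketch` is hypothetical and is not part of docling; it only assumes that enrichment prepends the section headings to the chunk text, which matches the output shown. The real `contextualize()` method may include further metadata.

```python
def contextualize_sketch(headings: list[str], text: str, delim: str = "\n") -> str:
    """Toy stand-in (hypothetical): prefix the chunk text with its headings."""
    return delim.join([*headings, text])


chunk_text = (
    "PDF document conversion, layout segmentation, "
    "object-detection, data set, Machine Learning"
)
enriched = contextualize_sketch(headings=["KEYWORDS"], text=chunk_text)
print(repr(enriched))
```

Embedding the enriched form keeps the heading available to the retriever even when the chunk body alone does not mention its section.
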
## Configuring tokenization

For more control over chunking, we can parametrize the tokenization as shown below.

In a RAG / retrieval context, it is important to make sure that the chunker and
embedding model are using the same tokenizer.
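
The following toy sketch illustrates why the tokenizers should match. Both counting functions are hypothetical stand-ins, not real tokenizers: one counts whitespace-separated words, the other also counts punctuation marks separately. A chunk sized with the first one can exceed the budget as seen by the second.

```python
import re


def count_words(text: str) -> int:
    """Toy tokenizer A (hypothetical): one token per whitespace-separated word."""
    return len(text.split())


def count_word_pieces(text: str) -> int:
    """Toy tokenizer B (hypothetical): words and punctuation marks are separate tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))


TOY_MAX_TOKENS = 8  # illustrative budget, unrelated to any real model
text = "PDF document conversion, layout segmentation, object-detection, data set"

chunker_count = count_words(text)  # what the chunker would measure
embedder_count = count_word_pieces(text)  # what the embedding model would see

fits_for_chunker = chunker_count <= TOY_MAX_TOKENS
fits_for_embedder = embedder_count <= TOY_MAX_TOKENS
print(chunker_count, embedder_count, fits_for_chunker, fits_for_embedder)
```

With mismatched tokenizers, a chunk that looks within budget to the chunker may be silently truncated by the embedding model.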

👉 HuggingFace transformers tokenizers can be used as shown in the following example:

In [16]:
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

from docling.chunking import HybridChunker

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 64  # set to a small number for illustrative purposes

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
)

👉 Alternatively, [OpenAI tokenizers](https://github.com/openai/tiktoken) can be used as shown in the example below (uncomment to use; requires installing `docling-core[chunking-openai]`):

In [7]:
# import tiktoken

# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

# tokenizer = OpenAITokenizer(
#     tokenizer=tiktoken.encoding_for_model("gpt-4o"),
#     max_tokens=128 * 1024,  # context window length required for OpenAI tokenizers
# )

We can now instantiate our chunker:

In [17]:
chunker = HybridChunker(
    tokenizer=tokenizer,
    merge_peers=True,  # optional, defaults to True
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)

Token indices sequence length is longer than the specified maximum sequence length for this model (2914 > 512). Running this sequence through the model will result in indexing errors


Points to notice looking at the output chunks below:
- Where possible, chunks fill the limit of 64 tokens, measured on the metadata-enriched serialization form (see chunk 2)
- Where needed, a chunk stops before the limit, e.g. at 63 tokens, because it would otherwise run into a comma (see chunk 6)
- Where possible, undersized peer chunks are merged (see chunk 0)
- "Tail" chunks trailing right after merges may still be undersized (see chunk 8)
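
The merging behavior described in these points can be approximated with a small greedy sketch. This is not docling's implementation: `ToyChunk`, `serialized_tokens`, and `merge_peers_sketch` are hypothetical names, tokens are counted as whitespace-separated words, and peers are assumed to be consecutive chunks with the same heading.

```python
from dataclasses import dataclass


@dataclass
class ToyChunk:
    heading: str
    text: str


def serialized_tokens(chunk: ToyChunk) -> int:
    """Count tokens on the heading-enriched form (toy: whitespace words)."""
    return len(f"{chunk.heading}\n{chunk.text}".split())


def merge_peers_sketch(chunks: list[ToyChunk], max_tokens: int) -> list[ToyChunk]:
    """Greedily merge consecutive same-heading chunks while the result fits."""
    merged: list[ToyChunk] = []
    for chunk in chunks:
        if merged and merged[-1].heading == chunk.heading:
            candidate = ToyChunk(chunk.heading, f"{merged[-1].text}\n{chunk.text}")
            if serialized_tokens(candidate) <= max_tokens:
                merged[-1] = candidate
                continue
        merged.append(ToyChunk(chunk.heading, chunk.text))
    return merged


toy_chunks = [
    ToyChunk("Authors", "a b c"),
    ToyChunk("Authors", "d e f"),  # merges with the previous peer
    ToyChunk("Authors", "g h i"),  # would exceed the limit, so it stays a tail
    ToyChunk("Keywords", "j k"),  # different heading, never merged
]
result = merge_peers_sketch(toy_chunks, max_tokens=8)
token_counts = [serialized_tokens(c) for c in result]
print(token_counts)
```

In this toy run the third chunk remains undersized because merging it would exceed the limit, which mirrors the "tail" case in the last point.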

In [18]:
for i, chunk in enumerate(chunks):
    print(f"=== {i} ===")
    txt_tokens = tokenizer.count_tokens(chunk.text)
    print(f"chunk.text ({txt_tokens} tokens):\n{chunk.text!r}")

    ser_txt = chunker.contextualize(chunk=chunk)
    ser_tokens = tokenizer.count_tokens(ser_txt)
    print(f"chunker.contextualize(chunk) ({ser_tokens} tokens):\n{ser_txt!r}")

    print()

=== 0 ===
chunk.text (42 tokens):
'Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com\nChristoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com'
chunker.contextualize(chunk) (60 tokens):
'DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis\nBirgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com\nChristoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com'

=== 1 ===
chunk.text (41 tokens):
'Ahmed S. Nassar IBM Research\nRueschlikon, Switzerland ahn@zurich.ibm.com\nMichele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com'
chunker.contextualize(chunk) (59 tokens):
'DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis\nAhmed S. Nassar IBM Research\nRueschlikon, Switzerland ahn@zurich.ibm.com\nMichele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com'

=== 2 ===
chunk.text (33 tokens):
'Peter Staar IBM Research Rueschlikon, Switzerland taa@zurich.ibm.com\nF