# 2.b Chunking

In this notebook you will see:
- How to chunk documents
- How to include metadata in the chunks

We will use as document parser `docling_core` and only consider text.

Here, simple chunking is considered, but more advanced one exist (based on semantic or LLMs, or hierarchical, specialized for given data format, ...)

# Setup

In [1]:
import os
import re

from docling.document_converter import DocumentConverter

from conversational_toolkit.chunking.base import Chunker, Chunk

  from .autonotebook import tqdm as notebook_tqdm


Consider using the pymupdf_layout package for a greatly improved page layout analysis.


In [2]:
path_to_docs = "data/docs"
path_to_document = os.path.join(path_to_docs, "alexnet_paper.pdf")

In [3]:
doc_converter = DocumentConverter()

conv_res = doc_converter.convert(path_to_document)
md = conv_res.document.export_to_markdown()

# replace \n per " ", as often just new lines
md = re.sub(r"(?<!\n)\n(?!\n)", " ", md)

doc_title_to_document = {"alexnet_paper.pdf": md}

[32m[INFO] 2026-02-26 15:19:13,874 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-02-26 15:19:13,875 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-02-26 15:19:13,882 [RapidOCR] main.py:53: Using C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-02-26 15:19:13,964 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-02-26 15:19:13,967 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_ppocr_mobile_v2.0_cls_infer.onnx[0m
[32m[INFO] 2026-02-26 15:19:13,967 [RapidOCR] main.py:53: Using C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\mod

# Chunking

A chunk has been defined as a `content` (here, text), a `title`, a `mine_type` and related `metadata`:

```python
class Chunk(BaseModel):
    title: str
    content: str
    mime_type: str
    metadata: dict[str, Any] = Field(default_factory=dict)
```

A chunker converts an input (i.e. a text), into a list of chunks.`

```python
class Chunker(ABC):
    @abstractmethod
    def make_chunks(self, *args: Any, **kwargs: Any) -> list[Chunk]:
        pass

```

## Number of Char Based

In [4]:
class NumberOfCharChunker(Chunker):
    def make_chunks(
        self,
        document_to_text: dict[str, str],
        max_number_of_characters: int,
        chunk_overlap: int,
    ) -> list[Chunk]:
        """Splits the text into chunks of a specified maximum number of characters, with a specified overlap between chunks."""

        chunk_cnt = 0
        chunks: list[Chunk] = []

        for doc_title, text in document_to_text.items():
            start = 0
            end = max_number_of_characters

            while start < len(text):
                chunk_text = text[start:end]
                chunks.append(
                    Chunk(
                        title=str(chunk_cnt),
                        mime_type="text/markdown",
                        content=chunk_text,
                        metadata={"start": start, "end": end, "doc_title": doc_title},
                    )
                )

                start += max_number_of_characters - chunk_overlap
                end += max_number_of_characters - chunk_overlap
                chunk_cnt += 1

            return chunks

In [5]:
chunker = NumberOfCharChunker()
chunks = chunker.make_chunks(
    document_to_text=doc_title_to_document,
    max_number_of_characters=1024,
    chunk_overlap=128,
)
print(len(chunks))

41


In [6]:
print(chunks[26].title)
print(chunks[26].mime_type)
print(chunks[26].metadata, "\n")
print(chunks[26].content)

26
text/markdown
{'start': 23296, 'end': 24320, 'doc_title': 'alexnet_paper.pdf'} 

ion error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and

reduced three times prior to termination. We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.

## 6 Results

Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].

We also entered our model in the IL

In [7]:
# Let's check the overlap
print(chunks[27].content[:256])

 our model in the ILSVRC-2012 competition and report our results in Table 2. Since the ILSVRC-2012 test set labels are not publicly available, we cannot report test error rates for all the models that we tried. In the remainder of this paragraph, we use va


## Specific Chars Split

In [8]:
class SpecificCharChunker(Chunker):
    def split_on_character(self, text: str, split_character: str) -> list[str]:
        """Splits the text on the specified character."""
        return text.split(split_character)

    def split_on_nb_characters(
        self, text: str, max_number_of_characters: int
    ) -> list[str]:
        """Splits the text into chunks of a specified maximum number of characters."""
        return [
            text[i : i + max_number_of_characters]
            for i in range(0, len(text), max_number_of_characters)
        ]

    def make_chunks(
        self,
        split_characters: list[str],
        document_to_text: dict[str, str],
        max_number_of_characters: int,
    ) -> list[Chunk]:
        """Splits the text into chunks of a specified maximum number of characters, with a specified overlap between chunks."""

        chunk_cnt = 0

        chunks_to_split: list[Chunk] = []
        chunks: list[Chunk] = []

        chunks_to_split.append(
            Chunk(
                title="0",
                mime_type="text/markdown",
                content=document_to_text["alexnet_paper.pdf"],
                metadata={"doc_title": "alexnet_paper.pdf"},
            )
        )

        for split_character in split_characters:
            # for each chunk in chunks_to_split, split it using the split_char
            # then, if it is too small, remove it from chunks_to_split and add it to chunks
            for chunk in chunks_to_split:
                split_chunks = self.split_on_character(chunk.content, split_character)

                for split_chunk in split_chunks:
                    if len(split_chunk) <= max_number_of_characters:
                        chunks.append(
                            Chunk(
                                title=str(chunk_cnt),
                                mime_type="text/markdown",
                                content=split_chunk,
                                metadata={"doc_title": chunk.metadata["doc_title"]},
                            )
                        )
                        chunk_cnt += 1
                    else:
                        chunks_to_split.append(
                            Chunk(
                                title=str(chunk_cnt),
                                mime_type="text/markdown",
                                content=split_chunk,
                                metadata={"doc_title": chunk.metadata["doc_title"]},
                            )
                        )
                        chunk_cnt += 1

                # remove the original chunk from chunks_to_split
                chunks_to_split.remove(chunk)

        # for each chunk still in chunks_to_split, split it using the split_char
        # and move them to chunks
        for chunk in chunks_to_split:
            split_chunks = self.split_on_nb_characters(
                chunk.content, max_number_of_characters
            )

            for split_chunk in split_chunks:
                chunks.append(
                    Chunk(
                        title=str(chunk_cnt),
                        mime_type="text/markdown",
                        content=split_chunk,
                        metadata={"doc_title": chunk.metadata["doc_title"]},
                    )
                )
                chunk_cnt += 1

        return chunks

In [9]:
chunker = SpecificCharChunker()
chunks = chunker.make_chunks(
    split_characters=["\n\n\n", "\n\n", "\n"],
    document_to_text=doc_title_to_document,
    max_number_of_characters=1024,
)
print(len(chunks))

96


In [10]:
md[23618:24974]

'Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].\n\nWe also entered our model in the ILSVRC-2012 competition and report our results in Table 2. Since the ILSVRC-2012 test set labels are not publicly available, we cannot report test error rates for all the models that we tried. In the remainder of this paragraph, we use validation and test error rates interchangeably because in our experience they do not differ by more than 0.1% (see Table 2). The CNN described in this paper achieves a 

In [11]:
print(chunks[62].title)
print(chunks[62].mime_type)
print(chunks[62].metadata, "\n")
print(chunks[62].content)

67
text/markdown
{'doc_title': 'alexnet_paper.pdf'} 

Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].


------------