In [1]:
# !pip3 install "chonkie[all]"

In [2]:
story = """
A Purple Dragon's Dawn

In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.

On the evening of the great hatching, while other dragonets emerged gracefully from their shells, the purple egg practically exploded with energy, sending sparkles in every direction. Out tumbled a tiny purple dragon with bright orange wings and golden horns - Spyro. While other newly hatched dragons were still finding their footing, Spyro was already charging around the nursery, bouncing off walls and accidentally setting tapestries ablaze with his surprisingly powerful flame breath.

It was during these first chaotic moments that a young dragonfly named Sparx, attracted by the commotion, flew into the nursery. Instead of being frightened by the energetic purple dragonet, Sparx seemed fascinated. The two locked eyes, and an immediate bond formed - one that would last a lifetime.

The elder dragons quickly discovered that raising Spyro would be unlike raising any other young dragon. While his peers were content to study the ancient arts carefully and methodically, Spyro preferred learning by doing - usually at full speed. He would glide before he was taught how, dive-bomb practice dummies before understanding proper form, and somehow manage to succeed through sheer determination and natural talent.

Nestor, the Artisans' leader, often found himself both impressed and exasperated by Spyro's antics. "That young dragon," he would say, "is either going to be the realm's greatest hero or its biggest troublemaker." As it turned out, he would become both.

What truly set Spyro apart wasn't just his color or his abilities - it was his unwavering courage and his willingness to help others, even when the odds seemed impossible. While other young dragons stayed within the safety of their home worlds, Spyro would venture out to help anyone in need, with faithful Sparx always by his side.

This inherent bravery would serve him well, for none of his teachers could have predicted that this small, headstrong purple dragon would one day face the likes of Gnasty Gnorc, protect the realms from Ripto's rage, and save the Dragon Worlds time and time again. But that's exactly what made Spyro special - he never backed down from a challenge, no matter how daunting it seemed.

As he grew, Spyro maintained that same infectious enthusiasm and determination that had characterized his hatching. He transformed from an energetic hatchling into a legendary hero, proving that great things often come in small, purple packages.
"""

## Basic
https://docs.chonkie.ai/getting-started/introduction

In [3]:
from chonkie import TokenChunker

In [4]:
chunker = TokenChunker()  # Defaults to the GPT-2 tokenizer

In [5]:
text = """
Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.
"""

In [6]:
chunks = chunker(text)

In [7]:
chunks

[Chunk(text=
 Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.
 , start_index=0, end_index=77, token_count=26, context=None)]

## Exp 1 (TokenChunker)
Split text into fixed-size token chunks with configurable overlap

The TokenChunker splits text into chunks based on token count, ensuring each chunk stays within specified token limits.

https://docs.chonkie.ai/chunkers/token-chunker
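
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size windowing with overlap. It is not Chonkie's implementation; the helper name `token_windows` is hypothetical, and a list of integers stands in for real token ids.

```python
def token_windows(tokens, chunk_size, chunk_overlap):
    """Return fixed-size windows; consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


tokens = list(range(10))  # stand-in for token ids (an assumption for illustration)
windows = token_windows(tokens, chunk_size=4, chunk_overlap=2)
print(windows)  # -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With `chunk_size=128` and `chunk_overlap=64`, as in the cell below, each window would advance by 64 tokens under this scheme, so roughly half of every chunk repeats the previous one.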

In [8]:
from chonkie import TokenChunker

In [9]:
chunker = TokenChunker(
    tokenizer="gpt2",  # Supports string identifiers
    chunk_size=128,    # Maximum tokens per chunk
    chunk_overlap=64  # Overlap between chunks
)

In [10]:
chunks = chunker.chunk(story)

In [11]:
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print("==="*36)

Chunk text: 
A Purple Dragon's Dawn

In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.
Token count: 128
Start index: 0
End index: 623
Chunk text:  an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just

## Exp 2 (WordChunker)
Split text into chunks while maintaining word boundaries

The WordChunker splits text into chunks while preserving word boundaries, ensuring that words stay intact and readable.

https://docs.chonkie.ai/chunkers/word-chunker
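
The idea of keeping words intact can be sketched as greedy packing: whole words are added to a chunk until the next one would exceed the token budget. This is an illustrative sketch only, not Chonkie's implementation. The helper `word_chunks` and the one-token-per-word counter are assumptions, and overlap is left out for brevity.

```python
import re


def word_chunks(text, chunk_size, count_tokens):
    """Greedily pack whole words into chunks of at most chunk_size tokens."""
    # Each piece is one word plus the whitespace in front of it, so joining
    # the chunks reproduces the text (trailing whitespace is not captured).
    pieces = re.findall(r"\s*\S+", text)
    chunks, current, used = [], [], 0
    for piece in pieces:
        n = count_tokens(piece)
        if current and used + n > chunk_size:
            chunks.append("".join(current))  # budget exceeded: close the chunk
            current, used = [], 0
        current.append(piece)
        used += n
    if current:
        chunks.append("".join(current))
    return chunks


text = "the tiny hippo loves chunking text"
# Toy counter (an assumption): every word costs exactly one token.
chunks = word_chunks(text, chunk_size=2, count_tokens=lambda piece: 1)
print(chunks)  # -> ['the tiny', ' hippo loves', ' chunking text']
```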

In [12]:
from chonkie import WordChunker

In [13]:
chunker = WordChunker(
    tokenizer="gpt2",        # Supports string identifiers
    chunk_size=128,         # Maximum tokens per chunk
    chunk_overlap=64        # Overlap between chunks
)

In [14]:
chunks = chunker.chunk(story)

In [15]:
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print("==="*36)

Chunk text: 
A Purple Dragon's Dawn

In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.
Token count: 128
Start index: 0
End index: 623
Chunk text:  an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just

## Exp 3 (SentenceChunker)
Split text into chunks while preserving sentence boundaries

The SentenceChunker splits text into chunks while preserving complete sentences, ensuring that each chunk maintains proper sentence boundaries and context.

https://docs.chonkie.ai/chunkers/sentence-chunker
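
A rough sketch of sentence-preserving chunking: split on sentence-ending punctuation, then pack complete sentences until the budget would be exceeded. This is not Chonkie's implementation. The helper `sentence_chunks`, the regex-based sentence splitter, and the word-count stand-in for a tokenizer are all assumptions, and overlap is omitted.

```python
import re


def sentence_chunks(text, chunk_size, min_sentences_per_chunk=1):
    """Group complete sentences into chunks of at most chunk_size toy tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # toy token count: whitespace-separated words
        # Close the chunk only if it already holds the minimum number of sentences.
        if len(current) >= min_sentences_per_chunk and used + n > chunk_size:
            chunks.append(current)
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(current)
    return chunks


text = "Spyro hatched. He ran around the nursery. Sparx flew in. They became friends."
chunks = sentence_chunks(text, chunk_size=7)
print(chunks)
# -> [['Spyro hatched.', 'He ran around the nursery.'], ['Sparx flew in.', 'They became friends.']]
```

Because a sentence is never cut, a chunk can end well short of the limit, which is consistent with the first chunk in the output below reporting 95 tokens against a limit of 128.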

In [16]:
from chonkie import SentenceChunker

In [17]:
chunker = SentenceChunker(
    tokenizer="gpt2",                # Supports string identifiers
    chunk_size=128,                  # Maximum tokens per chunk
    chunk_overlap=64,               # Overlap between chunks
    min_sentences_per_chunk=1        # Minimum sentences in each chunk
)

In [18]:
chunks = chunker.chunk(story)

In [19]:
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print(f"Number of sentences: {len(chunk.sentences)}")
    print("==="*36)

Chunk text: 
A Purple Dragon's Dawn

In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one.
Token count: 95
Start index: 0
End index: 435
Number of sentences: 5
Chunk text:  While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.

On the evening of the great hatch

## Exp 4 (RecursiveChunker)
Recursively chunk documents into smaller chunks.

The RecursiveChunker recursively splits a document into smaller chunks according to a hierarchy of rules. It is a good choice for documents that are long but well structured, such as a book or a research paper.

https://docs.chonkie.ai/chunkers/recursive-chunker
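
The recursive idea can be sketched as follows: split on the coarsest separator first, and fall back to finer separators only for pieces that are still too long. This is a simplified illustration, not Chonkie's implementation. The helper `recursive_split` and its separator list are assumptions, sizes are measured in characters rather than tokens, and small neighbouring pieces are not merged back together.

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator, recursing with finer ones when needed."""
    if len(text) <= chunk_size or not separators:
        return [text]  # small enough, or no finer separator left to try
    head, *rest = separators
    parts = text.split(head)
    # Re-attach the separator so that joining the pieces reproduces the input.
    pieces = [part + head for part in parts[:-1]] + [parts[-1]]
    chunks = []
    for piece in pieces:
        if piece:
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    return chunks


text = "Para one is short.\n\nPara two is a bit longer. It has two sentences."
chunks = recursive_split(text, chunk_size=30)
print(chunks)
# -> ['Para one is short.\n\n', 'Para two is a bit longer. ', 'It has two sentences.']
```

The first paragraph fits as a whole, so it is never split at the sentence level; only the second, longer paragraph descends to the next separator.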

In [20]:
from chonkie import RecursiveChunker, RecursiveRules

In [21]:
chunker = RecursiveChunker(
    tokenizer="gpt2",
    chunk_size=128,
    rules=RecursiveRules(),  # Default rules
    min_characters_per_chunk=1,
)

In [22]:
chunks = chunker.chunk(story)

In [23]:
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print("==="*36)

Chunk text: 
A Purple Dragon's Dawn

In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.


Token count: 78
Start index: 0
End index: 351
Chunk text: Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.


Token count: 52
Start index: 351
End index: 625
Chunk text: On the evening of the great hatching, while other dragonets emerged gracefully from their shells, the purple egg practically exploded with energy, sending sparkles in every direction. Out tumbled a tiny purple dragon with bright orange wings a

## Exp 5 (SemanticChunker)
Split text into chunks based on semantic similarity

The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk. This approach is particularly useful for RAG applications where context preservation is crucial.

https://docs.chonkie.ai/chunkers/semantic-chunker
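
As a rough illustration of similarity-based splitting, the sketch below starts a new group whenever the cosine similarity between consecutive sentence embeddings falls below a threshold. It is not Chonkie's implementation. The helper names and the hand-made two-dimensional vectors are assumptions; a real run would take its vectors from an embedding model and also enforce `chunk_size`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_groups(sentences, embeddings, threshold):
    """Keep consecutive sentences together while their similarity stays >= threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, curr, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev, curr) >= threshold:
            groups[-1].append(sentence)  # similar enough: extend the current group
        else:
            groups.append([sentence])    # topic shift: start a new group
    return groups


sentences = ["Dragons hatch from eggs.", "The purple egg glowed.", "Sparx is a dragonfly."]
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]  # toy vectors, not model output
groups = semantic_groups(sentences, embeddings, threshold=0.5)
print(groups)
# -> [['Dragons hatch from eggs.', 'The purple egg glowed.'], ['Sparx is a dragonfly.']]
```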

In [24]:
from chonkie import SemanticChunker

In [25]:
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold="auto",                            # Similarity threshold: float in 0-1, percentile in 1-100, or "auto"
    chunk_size=128,                              # Maximum tokens per chunk
    min_sentences=1                              # Initial sentences per chunk
)

In [26]:
chunks = chunker.chunk(story)

In [27]:
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Chunk text: 
A Purple Dragon's Dawn


Token count: 6
Number of sentences: 2
Chunk text: In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.


Token count: 115
Number of sentences: 4
Chunk text: On the evening of the great hatching, while other dragonets emerged gracefully from their shells, the purple egg practically exploded with energy, sending sparkles in every direction. Out tumbled a tiny purple dragon with bright orange wings and golden horns 

## Exp 6 (SDPMChunker)

Split text using Semantic Double-Pass Merging for improved context preservation

The SDPMChunker extends semantic chunking by using a double-pass merging approach. It first groups content by semantic similarity, then merges similar groups within a skip window, allowing it to connect related content that may not be consecutive in the text. This technique is particularly useful for documents with recurring themes or concepts spread apart.

https://docs.chonkie.ai/chunkers/sdpm-chunker
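
The second pass can be sketched as follows: each group is compared with the group `skip_window + 1` positions ahead, and if the two are similar, everything in between is merged into one chunk. This is one plausible reading of the description above, not Chonkie's actual implementation. The helper `merge_with_skip` and the toy group embeddings are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merge_with_skip(groups, group_embeddings, threshold, skip_window=1):
    """Merge group i through group j when i and j (skip_window + 1 ahead) are similar."""
    merged, i = [], 0
    while i < len(groups):
        j = min(i + skip_window + 1, len(groups) - 1)
        if j > i and cosine(group_embeddings[i], group_embeddings[j]) >= threshold:
            # Similar groups found across the window: merge everything in between.
            merged.append([s for group in groups[i:j + 1] for s in group])
            i = j + 1
        else:
            merged.append(groups[i])
            i += 1
    return merged


groups = [["A1"], ["B1"], ["A2"], ["C1"]]
group_embeddings = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.1), (-1.0, 0.0)]  # toy vectors
merged = merge_with_skip(groups, group_embeddings, threshold=0.8, skip_window=1)
print(merged)  # -> [['A1', 'B1', 'A2'], ['C1']]
```

In this sketch, `skip_window=0` only ever compares adjacent groups, so related content separated by an unrelated group (`A1` and `A2` above) stays apart.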

In [28]:
from chonkie import SDPMChunker

In [29]:
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                              # Similarity threshold (0-1)
    chunk_size=128,                             # Maximum tokens per chunk
    min_sentences=1,                            # Initial sentences per chunk
    skip_window=0                               # Number of chunks to skip when looking for similarities
)

In [30]:
chunks = chunker.chunk(story)

In [31]:
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Chunk text: 
A Purple Dragon's Dawn

In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.


Token count: 121
Number of sentences: 6
Chunk text: On the evening of the great hatching, while other dragonets emerged gracefully from their shells, the purple egg practically exploded with energy, sending sparkles in every direction. Out tumbled a tiny purple dragon with bright orange wings and golden horns - Spyro. While other newly hatched dragons were sti

## Exp 7 (LateChunker)
Split text into chunks based on a late-bound token count

LateChunker is based on the Late Chunking paper. It first embeds the document with a long-context embedding model, so that the entire document is within the context window. It then splits the resulting embeddings apart into chunks of a specified size, using either token chunking or sentence chunking.

https://docs.chonkie.ai/chunkers/late-chunker
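
The "embed first, chunk later" order can be illustrated with a small sketch: token-level embeddings are computed once for the whole document, and each chunk's embedding is then pooled from the tokens it covers. This is a conceptual sketch, not Chonkie's implementation. The helper `late_chunk_embeddings`, the use of mean pooling, and the toy vectors are assumptions.

```python
def late_chunk_embeddings(token_embeddings, spans):
    """Mean-pool document-level token embeddings over each (start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


# Toy token embeddings for a four-token document (not real model output).
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token index ranges
chunk_embeddings = late_chunk_embeddings(token_embeddings, spans)
print(chunk_embeddings)  # -> [[2.0, 0.0], [0.0, 3.0]]
```

Because the token embeddings come from a single pass over the full document, each pooled chunk vector can carry context from outside its own span, which is why the whole document needs to fit within the model's context window.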

In [32]:
from chonkie import LateChunker

In [33]:
chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    mode="sentence",
    chunk_size=128,
    min_sentences_per_chunk=1,
    min_characters_per_sentence=12,
)

In [34]:
chunks = chunker(story)

Token indices sequence length is longer than the specified maximum sequence length for this model (611 > 256). Running this sequence through the model will result in indexing errors


In [35]:
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Chunk text: 
A Purple Dragon's Dawn

In the heart of the Dragon Realms, beneath the towering spires of the Artisans' world, a very special egg rested among dozens of others in the Year of the Dragon. While most dragon eggs shimmered in shades of blue and green, this particular egg gleamed with an unusual purple hue that caught the sunlight in mysterious ways.

Elder Dragon Astor watched over the eggs with particular interest in the purple one. According to ancient scrolls, purple dragons were exceedingly rare and possessed unique abilities that could master all the elemental powers - not just one, as was common among their kin.


Token count: 121
Number of sentences: 6
Chunk text: On the evening of the great hatching, while other dragonets emerged gracefully from their shells, the purple egg practically exploded with energy, sending sparkles in every direction. Out tumbled a tiny purple dragon with bright orange wings and golden horns - Spyro. While other newly hatched dragons were sti

## Misc.
Batch Chunking

In [36]:
# Example: reuses the LateChunker instance `chunker` from Exp 7

texts = [
    "First document about topic A...",
    "Second document about topic B..."
]

batch_chunks = chunker(texts)

for chunks in batch_chunks:
    for curr_chunk in chunks:
        print(f"Chunk text: {curr_chunk.text}")
        print(f"Token count: {curr_chunk.token_count}")
        print(f"Number of sentences: {len(curr_chunk.sentences)}")

Chunk text: First document about topic A...
Token count: 8
Number of sentences: 1
Chunk text: Second document about topic B...
Token count: 8
Number of sentences: 1
