# Helix Chunking

Helix chunking uses Chonkie's chunkers under the hood. To see which Chonkie chunker best fits your use case, read https://docs.chonkie.ai/python-sdk/chunkers/overview

In [3]:
import helix

### Sample Text for Single Text Chunking

In [4]:
massive_text_blob = """
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunkers may need to account for different languages, code blocks, or special formatting, which can add complexity to the chunking process.

This example demonstrates how the token chunker works with a realistic text sample that would be common in document processing and RAG (Retrieval-Augmented Generation) applications. The chunks will be created with specified token limits and overlap settings to optimize for both comprehension and processing efficiency. Each chunk will contain metadata about its position in the original text and token count for further processing. By using a robust chunking strategy, we can ensure that downstream models receive high-quality, context-rich input, improving the overall performance of NLP pipelines and applications.
"""
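
The sample text above talks about token limits and overlap between neighboring chunks. As a rough illustration of that idea only, here is a minimal sliding-window sketch over whitespace-split words. The function name `sliding_window_chunks` and its parameters are hypothetical, and this is not how Helix or Chonkie implement chunking.

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens.

    Hypothetical helper for illustration; not part of helix or Chonkie.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "one two three four five six seven".split()
for window in sliding_window_chunks(words, chunk_size=4, overlap=1):
    print(window)
# ['one', 'two', 'three', 'four']
# ['four', 'five', 'six', 'seven']
```

The word `four` appears in both windows, which is the overlap the sample text describes: content near a boundary is available to both chunks.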

### Sample Text List for Batch Chunking

In [5]:
texts = [
    "First document to chunk with some content for testing.",
    "Second document with different content for batch processing."
]

### Sample Text for Code Chunker

In [6]:
code_sample = """
def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42
"""

code_samples = [
    "def func1():\n    pass",
    "const x = 10;\nfunction add(a, b) { return a + b; }"
]

### Helper Functions

#### Printing out Chunks from Single Text Chunking

In [7]:
def print_chunks(chunks, chunker_name):
    print(f"\n=== {chunker_name} - Single Text ===")
    print(f"Created {len(chunks)} chunks:")
    for i, chunk in enumerate(chunks):
        print(f"\nChunk {i+1}:")
        print(f"  Text: {chunk.text}")
        print(f"  Start: {chunk.start_index}")
        print(f"  End: {chunk.end_index}")
        print(f"  Tokens: {chunk.token_count}")
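
The helper above relies on four chunk attributes: `text`, `start_index`, `end_index`, and `token_count`. The sketch below uses a stand-in dataclass with those same fields to check that the offsets slice back to the chunk text. It assumes `end_index` is exclusive, which the contiguous offsets in the outputs later in this notebook (0 to 524, then 524 to 763) seem to suggest. `StandInChunk` and `offsets_are_consistent` are hypothetical names, not helix or Chonkie types.

```python
from dataclasses import dataclass


@dataclass
class StandInChunk:
    """Hypothetical stand-in carrying the fields that print_chunks reads."""
    text: str
    start_index: int
    end_index: int
    token_count: int


def offsets_are_consistent(source, chunks):
    """Return True if every chunk's text equals source[start_index:end_index].

    Assumes end_index is exclusive (an assumption, not confirmed by the docs).
    """
    return all(source[c.start_index:c.end_index] == c.text for c in chunks)


source = "Hello world. Bye."
chunks = [
    StandInChunk("Hello world.", 0, 12, 2),
    StandInChunk(" Bye.", 12, 17, 1),
]
print(offsets_are_consistent(source, chunks))  # True
```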

#### Printing out Chunks from Batch Text Chunking

In [8]:
def print_batch_chunks(batch_chunks, chunker_name):
    print(f"\n=== {chunker_name} - Batch Text ===")
    for doc_idx, doc_chunks in enumerate(batch_chunks):
        print(f"\nDocument {doc_idx + 1} ({len(doc_chunks)} chunks):")
        for chunk_idx, chunk in enumerate(doc_chunks):
            text_preview = chunk.text
            print(f"  Chunk {chunk_idx + 1}: {text_preview} (tokens: {chunk.token_count})")
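
The variable `text_preview` in the helper above holds the full chunk text. If shorter output is preferred, one option is a small truncating function like the sketch below. The name `preview` and the default limit of 60 characters are illustrative choices, not part of the original notebook.

```python
def preview(text, limit=60):
    """Collapse whitespace and truncate to at most `limit` characters.

    Hypothetical helper; the limit of 60 is an arbitrary choice.
    """
    if limit < 4:
        raise ValueError("limit must be at least 4 to leave room for '...'")
    flat = " ".join(text.split())  # newlines and repeated spaces become single spaces
    if len(flat) <= limit:
        return flat
    return flat[: limit - 3].rstrip() + "..."


print(preview("def func1():\n    pass"))  # def func1(): pass
print(len(preview("x" * 100, limit=20)))  # 20
```

Assigning `text_preview = preview(chunk.text)` inside the loop would keep each chunk on one line, at the cost of hiding the exact whitespace in the chunk.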


### Token Chunker

In [8]:
chunks = helix.Chunk.token_chunk(massive_text_blob)
print_chunks(chunks, "Token Chunker")


=== Token Chunker - Single Text ===
Created 1 chunks:

Chunk 1:
  Text: 
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunk

In [9]:
batch_chunks = helix.Chunk.token_chunk(texts)
print_batch_chunks(batch_chunks, "Token Chunker")

🦛 choooooooooooooooooooonk 100% • 2/2 batches chunked [00:00<00:00, 4040.76batch/s] 🌱


=== Token Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 54)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 60)
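
The token counts reported above (54 and 60) equal the character lengths of the two input strings, which suggests that this call counted characters rather than subword tokens. That is an inference from the output, not something the notebook or the library documentation is quoted as stating. A quick check with plain Python:

```python
# The same two strings used for batch chunking in this notebook.
texts = [
    "First document to chunk with some content for testing.",
    "Second document with different content for batch processing.",
]

# Character lengths, for comparison with the reported token counts.
char_counts = [len(t) for t in texts]
print(char_counts)  # [54, 60]
```

If subword token counts are needed, it may be worth checking which tokenizer the chunker is configured with.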





### Sentence Chunker

In [10]:
chunks = helix.Chunk.sentence_chunk(massive_text_blob)
print_chunks(chunks, "Sentence Chunker")


=== Sentence Chunker - Single Text ===
Created 1 chunks:

Chunk 1:
  Text: 
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, ch

In [11]:
batch_chunks = helix.Chunk.sentence_chunk(texts)
print_batch_chunks(batch_chunks, "Sentence Chunker")

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 11.63doc/s] 🌱


=== Sentence Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 54)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 60)





### Recursive Chunker

In [13]:
chunks = helix.Chunk.recursive_chunk(massive_text_blob)
print_chunks(chunks, "Recursive Chunker")


=== Recursive Chunker - Single Text ===
Created 1 chunks:

Chunk 1:
  Text: 
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, c

In [14]:
batch_chunks = helix.Chunk.recursive_chunk(texts)
print_batch_chunks(batch_chunks, "Recursive Chunker")

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 13.45doc/s] 🌱


=== Recursive Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 54)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 60)





### Code Chunker

In [16]:
chunks = helix.Chunk.code_chunk(code_sample, language="python")
print_chunks(chunks, "Code Chunker")


=== Code Chunker - Single Text ===
Created 1 chunks:

Chunk 1:
  Text: 
def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42

  Start: 0
  End: 113
  Tokens: 109


In [17]:
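# Note: the second entry in code_samples is JavaScript, but both are chunked with language="python" here.
batch_chunks = helix.Chunk.code_chunk(code_samples, language="python")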
print_batch_chunks(batch_chunks, "Code Chunker")

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 3210.34doc/s] 🌱


=== Code Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: def func1():
    pass (tokens: 21)

Document 2 (1 chunks):
  Chunk 1: const x = 10;
function add(a, b) { return a + b; } (tokens: 49)





### Semantic Chunker

In [18]:
chunks = helix.Chunk.semantic_chunk(massive_text_blob)
print_chunks(chunks, "Semantic Chunker")




=== Semantic Chunker - Single Text ===
Created 7 chunks:

Chunk 1:
  Text: 
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.
  Start: 0
  End: 524
  Tokens: 94

Chunk 2:
  Text: 

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text.
  Start: 524
  End: 763
  Tokens: 45

Chunk 3:
  Text:  This is especially important in applications like 

In [19]:
batch_chunks = helix.Chunk.semantic_chunk(texts)
print_batch_chunks(batch_chunks, "Semantic Chunker")

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 3343.41doc/s] 🌱


=== Semantic Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 10)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 9)





### SDPM Chunker

In [20]:
chunks = helix.Chunk.sdp_chunk(massive_text_blob)
print_chunks(chunks, "SDPM Chunker")


=== SDPM Chunker - Single Text ===
Created 7 chunks:

Chunk 1:
  Text: 
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.
  Start: 0
  End: 524
  Tokens: 94

Chunk 2:
  Text: 

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text.
  Start: 524
  End: 763
  Tokens: 45

Chunk 3:
  Text:  This is especially important in applications like docu

In [21]:
batch_chunks = helix.Chunk.sdp_chunk(texts)
print_batch_chunks(batch_chunks, "SDPM Chunker")

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 4527.04doc/s] 🌱


=== SDPM Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 10)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 9)





### Late Chunker

In [22]:
chunks = helix.Chunk.late_chunk(massive_text_blob)
print_chunks(chunks, "Late Chunker")

Token indices sequence length is longer than the specified maximum sequence length for this model (300 > 256). Running this sequence through the model will result in indexing errors



=== Late Chunker - Single Text ===
Created 1 chunks:

Chunk 1:
  Text: 
This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits. When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. Additionally, chunke

In [23]:
batch_chunks = helix.Chunk.late_chunk(texts)
print_batch_chunks(batch_chunks, "Late Chunker")

🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 24.91doc/s] 🌱


=== Late Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 12)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 11)





### Neural Chunker

In [24]:
chunks = helix.Chunk.neural_chunk(massive_text_blob)
print_chunks(chunks, "Neural Chunker")

Device set to use cpu



=== Neural Chunker - Single Text ===
Created 8 chunks:

Chunk 1:
  Text: 

  Start: 0
  End: 1
  Tokens: 1

Chunk 2:
  Text: This is a massive text blob that we want to chunk into smaller pieces for processing. It contains multiple sentences and paragraphs that need to be divided appropriately to maintain context while fitting within token limits.
  Start: 1
  End: 225
  Tokens: 38

Chunk 3:
  Text:  When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

  Start: 225
  End: 525
  Tokens: 54

Chunk 4:
  Text: 
The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text.
 

In [25]:

batch_chunks = helix.Chunk.neural_chunk(texts)
print_batch_chunks(batch_chunks, "Neural Chunker")

Device set to use cpu
🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 14.30doc/s] 🌱


=== Neural Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 10)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 9)





### Slumber Chunker

You need to set a Gemini API key in your environment to run this.

In [1]:
import dotenv
dotenv.load_dotenv()

True
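
Because `load_dotenv()` returns `True` whenever it finds a `.env` file, regardless of which keys the file defines, it can help to confirm that the key itself is present before making API calls. The sketch below is one way to do that. The variable name `GEMINI_API_KEY` is an assumption, so check the Chonkie documentation for the name it actually reads. `require_env` is a hypothetical helper.

```python
import os


def require_env(name, env=None):
    """Return the stripped value of an environment variable or raise.

    Hypothetical helper; `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return value


# Demonstration with an explicit mapping so the example runs without a real key.
# "GEMINI_API_KEY" is an assumed variable name.
print(require_env("GEMINI_API_KEY", {"GEMINI_API_KEY": "example-key"}))  # example-key
```

In the notebook, calling `require_env("GEMINI_API_KEY")` after `dotenv.load_dotenv()` would fail early with a clear message instead of partway through chunking.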

In [9]:
chunks = helix.Chunk.slumber_chunk(massive_text_blob)
print_chunks(chunks, "Slumber Chunker")

🦛 choooooooooooooooooooonk 100% • 36/36 splits processed [00:41<00:00,  1.17s/split] 🌱


=== Slumber Chunker - Single Text ===
Created 3 chunks:

Chunk 1:
  Text: 
This is a massive text blob that we want to chunk into smaller pieces for processing. Itcontainsmultiplesentencesandparagraphsthatneedtobedividedappropriatelytomaintaincontextwhilefittingwithintokenlimits.When working with large documents, it is important to ensure that each chunk maintains enough context for downstream tasks, such as retrieval or summarization. Chunking strategies can vary depending on the use case, but the goal is always to balance context preservation with processing efficiency.

The chunker should handle overlaps properly to ensure no important information is lost at chunk boundaries. For example, if a sentence is split between two chunks, the overlap ensures that both chunks retain the full meaning of the text. This is especially important in applications like document question answering, where missing a single sentence could lead to incorrect answers. 
  Start: 0
  End: 908
  Tokens: 888





In [10]:
batch_chunks = helix.Chunk.slumber_chunk(texts)
print_batch_chunks(batch_chunks, "Slumber Chunker")

🦛 choooooooooooooooooooonk 100% • 1/1 splits processed [00:05<00:00,  5.84s/split] 🌱
🦛 choooooooooooooooooooonk 100% • 1/1 splits processed [00:06<00:00,  6.27s/split] 🌱
🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:12<00:00,  6.06s/doc] 🌱


=== Slumber Chunker - Batch Text ===

Document 1 (1 chunks):
  Chunk 1: First document to chunk with some content for testing. (tokens: 54)

Document 2 (1 chunks):
  Chunk 1: Second document with different content for batch processing. (tokens: 60)



