## Chonkie

Chunking is the process of breaking a text into smaller, more manageable pieces that can be used in RAG (retrieval-augmented generation) applications.

### Chunking Strategies

1. Token Chunker
2. Sentence Chunker
3. Recursive Chunker
4. Code Chunker
5. Semantic Chunker
6. SDPM Chunker
7. Late Chunker
8. Neural Chunker

### Installation

In [1]:
!pip install "chonkie[all]"

# Or install only the extras you need:

# # Basic installation (TokenChunker, SentenceChunker, RecursiveChunker)
# pip install chonkie

# # For Hugging Face Hub support
# pip install "chonkie[hub]"

# # For visualization support (e.g., rich text output)
# pip install "chonkie[viz]"

# # For the default semantic provider support (includes Model2Vec)
# pip install "chonkie[semantic]"

# # For OpenAI embeddings support
# pip install "chonkie[openai]"

# # For Cohere embeddings support
# pip install "chonkie[cohere]"

# # For Jina embeddings support
# pip install "chonkie[jina]"

# # For SentenceTransformer embeddings support (required by LateChunker)
# pip install "chonkie[st]"

# # For CodeChunker support
# pip install "chonkie[code]"

# # For NeuralChunker support (BERT-based)
# pip install "chonkie[neural]"

# # For SlumberChunker support (Genie/LLM interface)
# pip install "chonkie[genie]"

# # For installing multiple features together
# pip install "chonkie[st, code, genie]"

# # For all features
# pip install "chonkie[all]"




### Token Chunker

Splits text into fixed-size token chunks with configurable overlap.



In [None]:
from chonkie import TokenChunker

# Basic initialization with default parameters
chunker = TokenChunker(
    tokenizer="gpt2",  # Supports string identifiers
    chunk_size=512,    # Maximum tokens per chunk
    chunk_overlap=128  # Overlap between chunks
)
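To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of sliding-window chunking over a token list. It is not Chonkie's implementation: the `chunk_tokens` helper is hypothetical, and whitespace-separated words stand in for real tokenizer output. Under this scheme each new chunk starts `chunk_size - chunk_overlap` tokens after the previous one, so the settings above (512 and 128) would give a stride of 384 tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption for illustration: whitespace "tokens" instead of a real tokenizer.
tokens = "a b c d e f g h i j".split()
windows = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

Each window repeats the last token of the previous one, which is the overlap that helps preserve context across chunk boundaries.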


### Single Text Chunking

In [3]:
text = "Some long text that needs to be chunked into smaller pieces..."
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Chunk text: Some long text that needs to be chunked into smaller pieces...
Token count: 13
Start index: 0
End index: 62
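
The indices printed above appear to be character offsets into the original string, with an exclusive end index: the sample text is 62 characters long. A quick check in plain Python, using only the values from that output (no Chonkie call involved):

```python
text = "Some long text that needs to be chunked into smaller pieces..."
start_index, end_index = 0, 62  # values copied from the output above

# Slicing the source text with the reported offsets recovers the chunk text.
recovered = text[start_index:end_index]
print(len(text))
print(recovered == text)
```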


### Batch Chunking

In [None]:
texts = [
    "First document to chunk...",
    "Second document to chunk..."
]
batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")

In [None]:
# The chunker instance can also be called directly.
# Single text
chunks = chunker("Text to chunk...")

# Multiple texts
batch_chunks = chunker(["Text 1...", "Text 2..."])

### Sentence Chunker

- It splits text into chunks while preserving complete sentences.
- It ensures that each chunk respects sentence boundaries, so no sentence is cut in the middle.

In [6]:
from chonkie import SentenceChunker

# Initialization with a small chunk size for demonstration
chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",  # Supports string identifiers
    chunk_size=15,                      # Maximum tokens per chunk
    chunk_overlap=2,                    # Overlap between chunks
    min_sentences_per_chunk=1           # Minimum sentences in each chunk
)

In [7]:
text = """This is the first sentence. This is the second sentence. 
And here's a third one with some additional context."""
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Chunk text: This is the first sentence. This is the second sentence. 
Token count: 13
Number of sentences: 2
Chunk text: 
And here's a third one with some additional context.
Token count: 12
Number of sentences: 1
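
The grouping seen in the output above (two sentences in the first chunk, one in the second) can be approximated with a greedy packing loop. This is a rough sketch under stated assumptions, not Chonkie's algorithm: `split_sentences`, `pack_sentences`, and `word_count` are hypothetical helpers, sentences are split with a naive regex, words are counted instead of GPT-2 tokens, and overlap is ignored.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, chunk_size, count_tokens):
    """Greedily group whole sentences so each chunk stays within chunk_size.
    A single sentence longer than chunk_size becomes its own oversized chunk."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_tokens + n > chunk_size:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def word_count(s):
    # Assumption: whitespace words as a stand-in for tokenizer tokens.
    return len(s.split())


text = ("This is the first sentence. This is the second sentence. "
        "And here's a third one with some additional context.")
sentence_chunks = pack_sentences(split_sentences(text), chunk_size=10, count_tokens=word_count)
for c in sentence_chunks:
    print(repr(c))
```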


### Recursive Chunker 
    
- Recursively splits documents into smaller chunks.
- A good choice for long, well-structured documents such as a book or a research paper.

In [8]:
from chonkie import RecursiveChunker, RecursiveRules

chunker = RecursiveChunker(
    tokenizer_or_token_counter = "gpt2",
    chunk_size = 12,
    rules = RecursiveRules(),
    min_characters_per_chunk = 24,
    return_type = "chunks",
)
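
As a rough illustration of the recursive idea, the sketch below splits on the coarsest separator first and only descends to finer separators for pieces that are still too large. It is not how `RecursiveRules` is implemented: the `recursive_split` helper, the separator list, and the word-based token count are all assumptions made for this example, and small pieces are not merged back together.

```python
def recursive_split(text, chunk_size, separators, count_tokens):
    """Split on the first separator, then recurse with the remaining
    separators on any piece that still exceeds chunk_size."""
    text = text.strip()
    if not text:
        return []
    # Small enough, or no finer separator left: keep the piece as it is.
    if count_tokens(text) <= chunk_size or not separators:
        return [text]

    head, *rest = separators
    parts = text.split(head)
    # Re-attach the separator to every part except the last so punctuation is kept.
    parts = [p + head for p in parts[:-1]] + [parts[-1]]

    chunks = []
    for part in parts:
        chunks.extend(recursive_split(part, chunk_size, rest, count_tokens))
    return chunks


def word_count(s):
    # Assumption: whitespace words as a stand-in for tokenizer tokens.
    return len(s.split())


doc = ("Intro line one.\nIntro line two.\n\n"
       "Body sentence one here. Body sentence two here.")
# Hypothetical hierarchy: paragraphs, then lines, then sentences.
pieces = recursive_split(doc, chunk_size=4, separators=["\n\n", "\n", ". "],
                         count_tokens=word_count)
for p in pieces:
    print(repr(p))
```

The first paragraph is resolved at the line level, while the second has no line breaks and is only split once the sentence separator is tried.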

In [None]:
# Available recipes: https://huggingface.co/datasets/chonkie-ai/recipes/viewer/recipes?views%5B%5D=recipes

from chonkie import RecursiveChunker

# Initialize a recursive chunker for Markdown
# (separate names are used so the `chunker` defined above is not overwritten)
md_chunker = RecursiveChunker.from_recipe("markdown", lang="en")

# Initialize a recursive chunker for Hindi text
hi_chunker = RecursiveChunker.from_recipe(lang="hi")

In [9]:
text = """This is the first sentence. This is the second sentence. 
And here's a third one with some additional context.
This is the first sentence."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")

Chunk text: This is the first sentence. 
Token count: 7
Chunk text: This is the second sentence. 

Token count: 8
Chunk text: And here's a third one with some additional context.

Token count: 12
Chunk text: This is the first sentence.
Token count: 6


In [10]:
texts = [
    "This is the first sentence. This is the second sentence. And here's a third one with some additional context.",
    "This is the first sentence. This is the second sentence. And here's a third one with some additional context.",
]

chunks = chunker.chunk_batch(texts)

for chk in chunks:
    for chunk in chk:
        print(f"Chunk text: {chunk.text}")
        print(f"Token count: {chunk.token_count}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Chunk text: This is the first sentence. 
Token count: 7
Chunk text: This is the second sentence. 
Token count: 7
Chunk text: And here's a third one with some additional context.
Token count: 11
Chunk text: This is the first sentence. 
Token count: 7
Chunk text: This is the second sentence. 
Token count: 7
Chunk text: And here's a third one with some additional context.
Token count: 11
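
The `huggingface/tokenizers` fork warning shown in the batch output above names its own fix: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal way to do that from inside the notebook, assuming the cell runs before the tokenizer is first used:

```python
import os

# Set this before the tokenizer is first used (ideally before importing chonkie),
# so the forked worker processes do not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it to `"false"` disables parallelism inside the tokenizer itself, which is what the library falls back to anyway after the fork.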
