# Input text

In [1]:
text = ''''One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]

You can't understand the world without understanding the concept of superlinear returns. And if you're ambitious you definitely should, because this will be the wave you surf on.

It may seem as if there are a lot of different situations with superlinear returns, but as far as I can tell they reduce to two fundamental causes: exponential growth and thresholds.

The most obvious case of superlinear returns is when you're working on something that grows exponentially. For example, growing bacterial cultures. When they grow at all, they grow exponentially. But they're tricky to grow. Which means the difference in outcome between someone who's adept at it and someone who's not is very great.

Startups can also grow exponentially, and we see the same pattern there. Some manage to achieve high growth rates. Most don't. And as a result you get qualitatively different outcomes: the companies with high growth rates tend to become immensely valuable, while the ones with lower growth rates may not even survive.

Y Combinator encourages founders to focus on growth rate rather than absolute numbers. It prevents them from being discouraged early on, when the absolute numbers are still low. It also helps them decide what to focus on: you can use growth rate as a compass to tell you how to evolve the company. But the main advantage is that by focusing on growth rate you tend to get something that grows exponentially.

YC doesn't explicitly tell founders that with growth rate "you get out what you put in," but it's not far from the truth. And if growth rate were proportional to performance, then the reward for performance p over time t would be proportional to pt.

Even after decades of thinking about this, I find that sentence startling.'''

# Chunker settings

In [2]:
CHUNK_SIZE = 128
CHUNK_OVERLAP = 32
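
As a rough illustration of how these two settings interact, the sketch below slides a fixed-size window over a list of token IDs. It is a simplified stand-in, not chonkie's implementation: `sliding_windows` is a hypothetical helper, and plain integers stand in for real tokens. With a chunk size of 128 and an overlap of 32, each window starts 96 tokens after the previous one.

```python
CHUNK_SIZE = 128
CHUNK_OVERLAP = 32


def sliding_windows(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. Hypothetical helper
    for illustration only; not part of chonkie.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


tokens = list(range(300))  # integers stand in for token IDs
windows = sliding_windows(tokens, CHUNK_SIZE, CHUNK_OVERLAP)
for w in windows:
    print(f"first={w[0]} last={w[-1]} length={len(w)}")
```

Only the final window can be shorter than `CHUNK_SIZE`; every other window repeats the last 32 tokens of its predecessor.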

# Token Chunker

In [3]:
from chonkie import TokenChunker

# Initialize with an explicit tokenizer, chunk size, and overlap
chunker = TokenChunker(
    tokenizer="gpt2",                   # Supports string identifiers
    chunk_size=CHUNK_SIZE,              # Maximum tokens per chunk
    chunk_overlap=CHUNK_OVERLAP         # Overlap between chunks
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Chunk text: 'One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is
Token count: 128
Start index: 0
End index: 573
Chunk text:  get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same 
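
The start and end indices printed above look like character offsets into `text` (the first chunk spans 0 to 573). Assuming that is the case, one quick sanity check is to slice the original string and compare it with the chunk text. The sketch uses a hypothetical `offsets_match` helper and `SimpleNamespace` stand-ins so that it runs without chonkie installed; with real chunks you would pass `text` and `chunks` from the cell above.

```python
from types import SimpleNamespace


def offsets_match(text, chunks):
    """Return True if every chunk's text equals text[start_index:end_index].

    Assumes start_index and end_index are character offsets; this is an
    assumption about the chunk objects, not documented behavior.
    """
    return all(text[c.start_index:c.end_index] == c.text for c in chunks)


sample = "You get no customers, and you go out of business."
fake_chunks = [
    SimpleNamespace(text="You get no customers,", start_index=0, end_index=21),
    SimpleNamespace(text=" and you go out of business.", start_index=21, end_index=49),
]
print(offsets_match(sample, fake_chunks))
```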

# Sentence Chunker

In [4]:
from chonkie import SentenceChunker

# Initialize with explicit settings rather than the library defaults
chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",      # Supports string identifiers
    chunk_size=CHUNK_SIZE,                  # Maximum tokens per chunk
    chunk_overlap=CHUNK_OVERLAP,            # Overlap between chunks
    min_sentences_per_chunk=1               # Minimum sentences in each chunk
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Chunk text: 'One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

Token count: 108
Start index: 0
End index: 472
Chunk text: You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of the
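
The output above shows chunks that end on sentence boundaries instead of cutting mid-sentence. A minimal sketch of that idea, under simplifying assumptions: sentences are found with a naive regular expression, "tokens" are whitespace-separated words rather than GPT-2 tokens, and overlap is ignored. `count_tokens` and `pack_sentences` are hypothetical helpers, not chonkie functions, and the real chunker may work differently.

```python
import re


def count_tokens(s):
    """Hypothetical token counter: whitespace-separated words, not GPT-2 tokens."""
    return len(s.split())


def pack_sentences(text, chunk_size):
    """Greedily group whole sentences so each chunk stays within chunk_size.

    A single sentence longer than chunk_size becomes its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the current chunk if adding this sentence would exceed the budget
        if current and current_tokens + n > chunk_size:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "They meant well, but this is rarely true. "
    "You get no customers, and you go out of business. "
    "In all of these, the rich get richer."
)
for chunk in pack_sentences(sample, chunk_size=20):
    print(count_tokens(chunk), repr(chunk))
```

Unlike the chunker output above, this sketch joins sentences with single spaces, so the original whitespace and character offsets are not preserved.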

# Recursive Chunker

In [5]:
from chonkie import RecursiveChunker

chunker = RecursiveChunker.from_recipe(
    name="default",
    lang="en",
    chunk_size=CHUNK_SIZE,  # Maximum tokens per chunk
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Chunk text: 'One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
Token count: 107
Start index: 0
End index: 471
Chunk text: 

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
Token count: 95
Start

In [6]:
type(chunks[0]), chunks[0].to_dict().keys()

(chonkie.types.recursive.RecursiveChunk,
 dict_keys(['text', 'start_index', 'end_index', 'token_count', 'context', 'level']))
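
The `level` key suggests that each chunk records how deep in a separator hierarchy it was produced. The sketch below illustrates recursive splitting in general terms: try the coarsest separator first, and recurse with the next one only on pieces that are still too long. It is not chonkie's algorithm. `recursive_split` is a hypothetical helper, it measures length in characters instead of tokens, it drops the separators, and it does not merge small pieces back together.

```python
def recursive_split(text, separators, max_chars, level=0):
    """Split text with separators[level]; recurse on pieces that are too long.

    Returns (piece, level) pairs, where level counts how many separators
    were applied to produce the piece. Hypothetical helper for illustration.
    """
    if len(text) <= max_chars or level >= len(separators):
        return [(text, level)]
    pieces = []
    for part in text.split(separators[level]):
        if part:  # skip empty strings left by adjacent separators
            pieces.extend(recursive_split(part, separators, max_chars, level + 1))
    return pieces


sample = "Short paragraph.\n\nFirst long sentence here. Second long sentence here."
pieces = recursive_split(sample, separators=["\n\n", ". "], max_chars=30)
for piece, level in pieces:
    print(level, repr(piece))
```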

# Semantic Chunker

In [7]:
from chonkie import SemanticChunker

# Initialize with an explicit embedding model, threshold, and chunk size
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=CHUNK_SIZE,                       # Maximum tokens per chunk
    min_sentences=1                              # Initial sentences per chunk
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Chunk text: 'One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

Token count: 111
Start index: 0
End index: 472
Chunk text: 
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]

You can't understand

In [8]:
type(chunks[0]), chunks[0].to_dict().keys()

(chonkie.types.semantic.SemanticChunk,
 dict_keys(['text', 'start_index', 'end_index', 'token_count', 'context', 'sentences']))

## Inspecting sentence embeddings

In [9]:
chunk_sentences = chunks[0].sentences
type(chunk_sentences[0]), len(chunk_sentences)

(chonkie.types.semantic.SemanticSentence, 5)

In [10]:
for cs in chunk_sentences:
    print(f"Sentence text: {cs.text}")
    print(f"Sentence embedding: {cs.embedding[:10]}...")  # Print first 10 elements of the embedding
    print("")

Sentence text: 'One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Sentence embedding: [-0.08002032  0.08384115 -0.15242851 -0.01662905  0.1546974   0.1862143
 -0.14668271 -0.1180561  -0.09991404 -0.06428061]...

Sentence text: 
Teachers and coaches implicitly told us the returns were linear. 
Sentence embedding: [-0.29372326  0.0668354  -0.17556645  0.06406913  0.19441003  0.2577094
 -0.15412916 -0.13282807 -0.07830012  0.02185827]...

Sentence text: "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. 
Sentence embedding: [-0.3769128   0.11292984 -0.10909131 -0.00038342  0.20120978  0.2154198
 -0.1785941  -0.14814211  0.05318506  0.0355464 ]...

Sentence text: If your product is only half as good as your competitor's, you don't get half as many customers. 
Sentence embedding: [-0.3903855   0.11076837 -0.04779201 -0.03220946  0.1733
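
To give a feel for how a similarity threshold can turn sentence embeddings into chunk boundaries, here is a minimal sketch in plain Python. It assumes a boundary is placed wherever the cosine similarity between two consecutive sentence embeddings falls below the threshold; this is a simplification and may differ from what `SemanticChunker` does internally. The three-dimensional vectors are made up for illustration, and `cosine_similarity` and `split_points` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_points(embeddings, threshold):
    """Return each index i where sentence i starts a new group.

    A new group starts when sentence i is less similar than threshold
    to sentence i - 1.
    """
    return [
        i for i in range(1, len(embeddings))
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold
    ]


# Made-up embeddings: the first two point one way, the last two another
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
print(split_points(toy_embeddings, threshold=0.5))
```

Raising the threshold makes boundaries more frequent, since more neighboring pairs fall below it; lowering it merges more sentences into the same group.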

# Neural Chunker

In [11]:
from chonkie import NeuralChunker

# Basic initialization with default parameters
chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1",  # Default model
    device_map="auto",                       # Device to run the model on ('cpu', 'cuda', etc.)
    min_characters_per_chunk=10,             # Minimum characters for a chunk
    return_type="chunks"                     # Output type
)

chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Chunk text: 'One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in."
Token count: 63
Start index: 0
End index: 282
Chunk text:  They meant well, but this is rarely true.
Token count: 10
Start index: 282
End index: 324
Chunk text:  If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business.
Token count: 51
Start index: 324
End index: 554
Chunk text:  Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, mil