### Code Chunker

- Splits code into chunks based on its structure, using Abstract Syntax Trees (ASTs) to create contextually relevant segments.

An AST is a tree representation of the syntactic structure of source code, providing a hierarchical view of code elements such as functions, variables, and control flow.
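
To make the idea concrete, here is a minimal sketch using Python's built-in `ast` module. It only illustrates what an AST exposes (node types and line spans); it is an assumption for demonstration purposes and does not claim to show how `CodeChunker` parses code internally. The sample source string and the helper name `top_level_nodes` are hypothetical.

```python
import ast

# Hypothetical sample source; it is parsed, never executed.
source = """
import math

def area(radius):
    return math.pi * radius ** 2

for r in (1, 2):
    print(area(r))
"""

def top_level_nodes(code: str) -> list[tuple[str, int, int]]:
    """Return (node type, first line, last line) for each top-level statement."""
    tree = ast.parse(code)
    return [(type(node).__name__, node.lineno, node.end_lineno) for node in tree.body]

nodes = top_level_nodes(source)
for name, start, end in nodes:
    print(f"{name}: lines {start}-{end}")
```

A structure-aware chunker can cut at boundaries like these, so that a function body which fits within the token budget is not split across two chunks.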

In [1]:
from chonkie import CodeChunker

# Basic initialization with default parameters
chunker = CodeChunker(
    language="python",                 # Specify the programming language
    tokenizer_or_token_counter="gpt2", # Tokenizer to use
    chunk_size=512,                    # Maximum tokens per chunk
    include_nodes=False                # Optionally include AST nodes in output
)

In [3]:
code = """
mlflow.set_tracking_uri('http://localhost:8080')
mlflow.set_experiment('Exp-1')

with mlflow.start_run() as run:
    x, y = make_regression(n_features=4, n_informative=2, random_state=42, shuffle=False)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    params = {'max_depth': 1, 'random_state': 42}
    model = RandomForestRegressor(**params)
    model.fit(x_train, y_train)

    y_pred = model.predict(x_test)
    signature = infer_signature(x_test, y_pred)
    
    mlflow.log_params(params)
    mlflow.log_metrics({'mse': mean_squared_error(y_test, y_pred)})

    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path='sklearn-model-2',
        signature=signature,
        registered_model_name='sklearn-random-forest-reg'
    )

"""
chunks = chunker.chunk(code)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Language: {chunk.language}")
    if chunk.nodes:
        print(f"Node count: {len(chunk.nodes)}")

Chunk text: 
mlflow.set_tracking_uri('http://localhost:8080')
mlflow.set_experiment('Exp-1')

with mlflow.start_run() as run:
    x, y = make_regression(n_features=4, n_informative=2, random_state=42, shuffle=False)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    params = {'max_depth': 1, 'random_state': 42}
    model = RandomForestRegressor(**params)
    model.fit(x_train, y_train)

    y_pred = model.predict(x_test)
    signature = infer_signature(x_test, y_pred)
    
    mlflow.log_params(params)
    mlflow.log_metrics({'mse': mean_squared_error(y_test, y_pred)})

    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path='sklearn-model-2',
        signature=signature,
        registered_model_name='sklearn-random-forest-reg'
    )


Token count: 351
Language: None


### Semantic Chunking

In [4]:
from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=20,                               # Maximum tokens per chunk
    min_sentences=1                              # Initial sentences per chunk
)

In [5]:
text = """First paragraph about a Donkeys.
Why donkey are colored black and white.
This paragraph is about Pigs.
Pigs are pink and have curly tails."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")


Chunk text: First paragraph about a Donkeys.
Why donkey are colored black and white.

Token count: 15
Number of sentences: 2
Chunk text: This paragraph is about Pigs.
Pigs are pink and have curly tails.
Token count: 14
Number of sentences: 2
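
The role of `threshold` can be pictured with a toy sketch: consecutive sentences stay in the same group while the cosine similarity of their embeddings is at or above the threshold, and a new group starts when it drops below. The code below is a simplified illustration with hand-made 2-D vectors, not Chonkie's actual algorithm; the function `group_by_similarity`, the sentences, and the vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def group_by_similarity(sentences, vectors, threshold=0.5):
    """Start a new group whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) >= threshold:
            groups[-1].append(sentence)   # similar enough: extend current group
        else:
            groups.append([sentence])     # topic shift: open a new group
    return groups

# Hand-made stand-ins for real sentence embeddings (assumption for illustration).
sentences = [
    "Donkeys have long ears.",
    "Donkeys bray loudly.",
    "Pigs are pink.",
    "Pigs have curly tails.",
]
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

groups = group_by_similarity(sentences, vectors, threshold=0.5)
for group in groups:
    print(group)
```

A real semantic chunker would also have to respect `chunk_size` and `min_sentences`, which this sketch leaves out.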


### SDPM Chunker

In [6]:
from chonkie import SDPMChunker

# Basic initialization with default parameters
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1)
    chunk_size=50,                               # Maximum tokens per chunk
    min_sentences=1,                             # Initial sentences per chunk
    skip_window=1                                # Number of chunks to skip when looking for similarities
)

In [7]:
text = """The neural network processes input data through layers.
Training data is essential for model performance.
GPUs accelerate neural network computations significantly.
Quality training data improves model accuracy.
TPUs provide specialized hardware for deep learning.
Data preprocessing is a crucial step in training."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Chunk text: The neural network processes input data through layers.
Training data is essential for model performance.
GPUs accelerate neural network computations significantly.
Quality training data improves model accuracy.
TPUs provide specialized hardware for deep learning.

Token count: 42
Number of sentences: 5
Chunk text: Data preprocessing is a crucial step in training.
Token count: 12
Number of sentences: 1


### Late Chunker

In [None]:
# !pip install numpy==1.26.4
# !pip install tf-keras

In [12]:
from chonkie import LateChunker, RecursiveRules

chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    chunk_size=10,
    rules=RecursiveRules(),
    min_characters_per_chunk=24,
)

In [14]:
text = """First paragraph about a specific topic.
Second paragraph continuing the same topic.
Third paragraph switching to a different topic.
Fourth paragraph expanding on the new topic."""

chunks = chunker(text)

for chunk in chunks:
    
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    #print(f"Number of sentences: {len(chunk.sentences)}")

Chunk text: First paragraph about a specific topic.

Token count: 8
Chunk text: Second paragraph continuing the same topic.

Token count: 7
Chunk text: Third paragraph switching to a different topic.

Token count: 8
Chunk text: Fourth paragraph expanding on the new topic.
Token count: 9


### Neural Chunker

In [16]:
from chonkie import NeuralChunker

# Basic initialization with default parameters
chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1",  # Default model
    device_map="cuda",                        # Device to run the model on ('cpu', 'cuda', etc.)
    min_characters_per_chunk=10,             # Minimum characters for a chunk
    return_type="chunks"                     # Output type
)

Device set to use cuda


In [18]:
text = """Limited context window - All LLMs have a limit on how much text they can process at once. This is referred to as the Context Window. Chunking helps in breaking down the large text document into processable tokens
Computational Efficiency - It is not possible to load a 100GB document every time you make a query. Attention mechanisms, even when optimized, are computationally expensive O(n). Chunking keeps things efficient and memory-friendly.
Better Representation - As mentioned earlier, chunks represent each idea as an independent entity. Not chunking your document will likely cause your model to conflate concepts and get confused. 
Representation models use lossy compression, so keeping chunks concise ensures the model understands the context better.
Reduced Hallucination - Feeding too much context at once causes the models to hallucinate. They start using irrelevant information to answer queries, and that's a big no-no. Smaller, focused chunks reduce this risk."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}") # Note: token_count might be added post-hoc or not available depending on implementation
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")

Chunk text: Limited context window - All LLMs have a limit on how much text they can process at once. This is referred to as the Context Window. Chunking helps in breaking down the large text document into processable tokens

Token count: 48
Start index: 0
End index: 213
Chunk text: Computational Efficiency - It is not possible to load a 100GB document every time you make a query. Attention mechanisms, even when optimized, are computationally expensive O(n). Chunking keeps things efficient and memory-friendly.

Token count: 51
Start index: 213
End index: 445
Chunk text: Better Representation - As mentioned earlier, chunks represent each idea as an independent entity. Not chunking your document will likely cause your model to conflate concepts and get confused. 
Representation models use lossy compression, so keeping chunks concise ensures the model understands the context better.

Token count: 60
Start index: 445
End index: 761
Chunk text: Reduced Hallucination - Feeding too much conte