# Text Chunking Strategies

In [1]:
# Sample Text


sample_text = '''
Artificial Intelligence (AI) is a branch of computer science focused on building smart machines.
It has applications in a variety of fields, including healthcare, finance, and transportation.
LLMs like GPT-4 are examples of AI systems trained on large datasets.
They are capable of understanding and generating human-like language.
However, due to memory constraints, text must be split into chunks.
This process is known as chunking, and it ensures the model can process all input without being overloaded.
Various strategies exist for chunking, including fixed-length, sentence-based, and sliding windows.
Choosing the right method depends on your use case.
'''

In [None]:
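# fixed-length chunking (word-based)

def fixed_length_chunking(text, chunk_size):
    """Split text into chunks of chunk_size words; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()  # split on any run of whitespace
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

fixed_length_chunking(sample_text, chunk_size=10)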

['Artificial Intelligence (AI) is a branch of computer science focused',
 'on building smart machines. It has applications in a variety',
 'of fields, including healthcare, finance, and transportation. LLMs like GPT-4',
 'are examples of AI systems trained on large datasets. They',
 'are capable of understanding and generating human-like language. However, due',
 'to memory constraints, text must be split into chunks. This',
 'process is known as chunking, and it ensures the model',
 'can process all input without being overloaded. Various strategies exist',
 'for chunking, including fixed-length, sentence-based, and sliding windows. Choosing the',
 'right method depends on your use case.']
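
The chunk count in the output above follows from simple arithmetic: the sample text has 97 whitespace-separated words, so a chunk size of 10 gives nine full chunks plus one shorter chunk of 7 words. A minimal sketch of that calculation follows; the helper names `expected_chunk_count` and `last_chunk_length` are hypothetical and not part of the notebook.

```python
import math

def expected_chunk_count(n_words, chunk_size):
    """Number of chunks from non-overlapping fixed-length chunking (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return math.ceil(n_words / chunk_size)

def last_chunk_length(n_words, chunk_size):
    """Word count of the final chunk (hypothetical helper)."""
    if n_words == 0:
        return 0
    # A remainder of 0 means the last chunk is a full one.
    return n_words % chunk_size or chunk_size

# 97 is the word count of the sample text used above.
print(expected_chunk_count(97, 10), last_chunk_length(97, 10))  # prints: 10 7
```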

In [16]:
# sentence-based chunking

import re

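def sentence_chunks_simple(text, n_sentences=2):
    """Group every n_sentences sentences into one chunk."""
    # Split after '.', '!' or '?' followed by whitespace, so each sentence
    # keeps its own terminator instead of having it replaced with '.'.
    # This is still a naive splitter: abbreviations such as "e.g. this"
    # are treated as sentence boundaries.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    for i in range(0, len(sentences), n_sentences):
        chunks.append(' '.join(sentences[i:i + n_sentences]))
    return chunks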

sentence_based = sentence_chunks_simple(sample_text, 2)
for i, chunk in enumerate(sentence_based):
    print(f"\n--- Sentence Chunk {i+1} ---\n{chunk}")


--- Sentence Chunk 1 ---
Artificial Intelligence (AI) is a branch of computer science focused on building smart machines. It has applications in a variety of fields, including healthcare, finance, and transportation.

--- Sentence Chunk 2 ---
LLMs like GPT-4 are examples of AI systems trained on large datasets. They are capable of understanding and generating human-like language.

--- Sentence Chunk 3 ---
However, due to memory constraints, text must be split into chunks. This process is known as chunking, and it ensures the model can process all input without being overloaded.

--- Sentence Chunk 4 ---
Various strategies exist for chunking, including fixed-length, sentence-based, and sliding windows. Choosing the right method depends on your use case.
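
Grouping a fixed number of sentences gives chunks whose length varies with sentence length. One possible variation, shown here only as a sketch, packs whole sentences into a chunk until a word budget is reached. The function name `sentence_chunks_by_budget` and the `max_words` parameter are assumptions introduced for illustration, and the splitter is the same naive regex approach, so abbreviations will still be mis-split.

```python
import re

def sentence_chunks_by_budget(text, max_words=30):
    """Pack whole sentences into chunks of at most max_words words (hypothetical helper).

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    # Split after '.', '!' or '?' followed by whitespace; punctuation stays attached.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks

demo = "One two three. Four five! Six seven eight nine? Ten."
print(sentence_chunks_by_budget(demo, max_words=5))
# prints: ['One two three. Four five!', 'Six seven eight nine? Ten.']
```

Compared with a fixed sentence count, this keeps chunk sizes closer to a target length while still never cutting a sentence in half.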


In [19]:
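# sliding window chunking (word-based, with overlap)

def sliding_window_chunks(text, chunk_size=10, overlap=3):
    """Split text into chunk_size-word chunks; consecutive chunks share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()  # split text into words
    chunks = []

    step = chunk_size - overlap  # how far to move for the next chunk

    for i in range(0, len(words), step):
        chunk_words = words[i:i + chunk_size]  # take up to chunk_size words
        chunks.append(' '.join(chunk_words))  # join words back into one string
        if i + chunk_size >= len(words):
            break  # text fully covered; another window would only repeat words

    return chunks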

sliding_chunks = sliding_window_chunks(sample_text, 20, 5)
for i, chunk in enumerate(sliding_chunks):
    print(f"\n--- Sliding Chunk {i+1} ---\n{chunk}")


--- Sliding Chunk 1 ---
Artificial Intelligence (AI) is a branch of computer science focused on building smart machines. It has applications in a variety

--- Sliding Chunk 2 ---
has applications in a variety of fields, including healthcare, finance, and transportation. LLMs like GPT-4 are examples of AI systems

--- Sliding Chunk 3 ---
are examples of AI systems trained on large datasets. They are capable of understanding and generating human-like language. However, due

--- Sliding Chunk 4 ---
generating human-like language. However, due to memory constraints, text must be split into chunks. This process is known as chunking,

--- Sliding Chunk 5 ---
process is known as chunking, and it ensures the model can process all input without being overloaded. Various strategies exist

--- Sliding Chunk 6 ---
being overloaded. Various strategies exist for chunking, including fixed-length, sentence-based, and sliding windows. Choosing the right method depends on your

--- Sliding Chunk 7 ---
right method depends on your use case.

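
A quick way to sanity-check the sliding-window output is to confirm that each chunk begins with the last `overlap` words of the chunk before it. The sketch below does this for the first two chunks printed above, with `overlap=5` as in the call; the helper name `shared_overlap` is hypothetical.

```python
def shared_overlap(prev_chunk, next_chunk, overlap):
    """Return True if the last `overlap` words of prev_chunk open next_chunk."""
    prev_words = prev_chunk.split()
    next_words = next_chunk.split()
    # Guard against overlap == 0, where prev_words[-0:] would be the whole list.
    return overlap > 0 and prev_words[-overlap:] == next_words[:overlap]

# First two chunks from the output above, copied verbatim.
chunk_1 = ("Artificial Intelligence (AI) is a branch of computer science focused "
           "on building smart machines. It has applications in a variety")
chunk_2 = ("has applications in a variety of fields, including healthcare, finance, "
           "and transportation. LLMs like GPT-4 are examples of AI systems")

print(shared_overlap(chunk_1, chunk_2, 5))  # prints: True
```

The same check applies to every adjacent pair, including the final pair, where the last chunk is shorter than `chunk_size` but still starts with the shared words.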