# Chunking Strategies

#### Setup: Sample Documents

First, let's create a sample document to work with.

In [6]:
# Sample document about Python programming
sample_document = """
Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.

One of Python's key strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax allows programmers to express concepts in fewer lines of code compared to languages 
like C++ or Java.

Python is widely used in web development, data science, artificial intelligence, scientific computing, 
and automation. Popular frameworks include Django and Flask for web development, NumPy and Pandas for 
data analysis, and TensorFlow and PyTorch for machine learning.

The Python Package Index (PyPI) hosts hundreds of thousands of third-party packages that extend Python's 
capabilities. Installation is simple using pip, Python's package installer. The active community contributes 
to a rich ecosystem of libraries and frameworks.

Python continues to be one of the most popular programming languages worldwide. Its beginner-friendly nature 
makes it ideal for education, while its powerful features support professional software development and research.
"""

print(f"Document length: {len(sample_document)} characters")
print(f"Document length: {len(sample_document.split())} words")
print(f"\nFirst 200 characters:\n{sample_document[:200]}...")


Document length: 1367 characters
Document length: 187 words

First 200 characters:

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming...


#### Fixed-Size Chunking (Character-Based)

**Simplest approach:** Split text every N characters, regardless of word or sentence boundaries.

**How it works:**

1. Set a chunk size (e.g., 200 characters)
2. Split the text at every 200 characters
3. Optionally add overlap between chunks
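
As a quick sanity check on steps 2 and 3, the start offset of each chunk can be worked out before any text is sliced. The sketch below assumes the window advances by `chunk_size - overlap` characters each time; the helper name `chunk_starts` is hypothetical and only illustrates the arithmetic.

```python
def chunk_starts(text_length, chunk_size=200, overlap=50):
    """Return the start offset of every chunk for a text of the given length."""
    step = chunk_size - overlap  # how far the window advances each time
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return list(range(0, text_length, step))

# 1367 is the character count of the sample document
starts = chunk_starts(1367, chunk_size=200, overlap=50)
print(starts[:3])   # [0, 150, 300]
print(len(starts))  # 10
```

Because each chunk after the first repeats the last 50 characters of the previous one, a 1367-character text yields 10 chunks instead of the 7 that a step of 200 would give.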

In [7]:
def chunk_by_characters(text, chunk_size=200, overlap=50):
    """
    Split text into chunks of specified character length.

    Args:
        text: The text to chunk
        chunk_size: Number of characters per chunk
        overlap: Number of characters to overlap between chunks

    Returns:
        List of text chunks
    """
    # The step (chunk_size - overlap) must be positive, or the loop never advances
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        # Take the slice from start to start + chunk_size
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)

        # Move start position (with overlap)
        start += chunk_size - overlap

    return chunks

chunks = chunk_by_characters(sample_document, chunk_size=200, overlap=50)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):   # Show first 3 chunks
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("_" * 80)

Number of chunks: 10

Chunk 1 (200 chars):

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming
________________________________________________________________________________
Chunk 2 (200 chars):
ased in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.

One of Python's key strengths is its extensive standard library, which
________________________________________________________________________________
Chunk 3 (200 chars):
strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax
________________________________________________________________________________
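
The overlap is easier to see on a short string than in the output above. The following is a self-contained sketch of the same sliding-window idea; `sliding_chunks` is a hypothetical helper written for this illustration, not the notebook's function.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Slice text into windows of chunk_size that advance by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

alphabet = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_chunks(alphabet, chunk_size=10, overlap=3)
print(demo)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']

# Each chunk begins with the last 3 characters of the previous one
for previous, current in zip(demo, demo[1:]):
    assert previous[-3:] == current[:3]
```

Note that the final chunk is shorter than `chunk_size`, and that chunk boundaries fall wherever the character count dictates, which is why Chunk 2 in the output begins mid-word with "ased".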


#### Fixed-Size Chunking (Word-Based)

**Better approach:** Split by words instead of characters, so no word is cut in half.

**How it works:**

1. Split text into words
2. Group words into chunks of N words
3. Join words back into text
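
One side effect of steps 1 and 3 is worth knowing: splitting on whitespace and rejoining with single spaces discards the original line and paragraph breaks. A small illustration, using a made-up two-paragraph string:

```python
text = "First paragraph ends here.\n\nSecond paragraph   starts here."

words = text.split()        # splits on any run of whitespace, including newlines
rejoined = " ".join(words)  # every gap becomes exactly one space

print(rejoined)
# First paragraph ends here. Second paragraph starts here.
print("\n\n" in rejoined)  # False
```

If paragraph structure matters downstream, it has to be captured before the text is split into words.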



In [9]:
def chunk_by_words(text, chunk_size=50, overlap=10):
    """
    Split text into chunks of specified word count.

    Args:
        text: The text to chunk
        chunk_size: Number of words per chunk
        overlap: Number of words to overlap between chunks

    Returns:
        List of text chunks
    """
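    # The step (chunk_size - overlap) must be positive, or the loop never advances
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    # Split text into words
    words = text.split()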
    chunks = []
    start = 0

    while start < len(words):
        # Get chunk of words
        end = start + chunk_size
        chunk_words = words[start:end]

        # Join words back into text
        chunk = ' '.join(chunk_words)
        chunks.append(chunk)

        # Move start position (with overlap)
        start += chunk_size - overlap

    return chunks

chunks = chunk_by_words(sample_document, chunk_size=50, overlap=10)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(chunk)
    print("-" * 80)

Number of chunks: 5

Chunk 1 (50 words):
Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms including procedural, object-oriented, and functional programming. One of Python's key strengths is its extensive standard library, which provides tools for
--------------------------------------------------------------------------------
Chunk 2 (50 words):
strengths is its extensive standard library, which provides tools for many common programming tasks. The language emphasizes code readability with its use of significant indentation. Python's syntax allows programmers to express concepts in fewer lines of code compared to languages like C++ or Java. Python is widely used in web
--------------------------------------------------------------------------------
Chunk 3 (50 words):
like C++ or Java. Python is widely used in web development, d

#### Sentence-Based Chunking

**Better approach:** Split by sentences, then group whole sentences into chunks, so no chunk starts or ends mid-sentence.

**How it works:**

1. Split text into sentences
2. Group sentences until reaching the target chunk size
3. Keep sentence boundaries intact (no sentence is split across chunks)
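
The three steps above can be sketched as a greedy packing loop. This is a minimal illustration under stated assumptions, not a definitive implementation: `group_sentences` is a hypothetical name, a sentence is assumed to end at `.`, `!`, or `?` followed by whitespace, and a single sentence longer than `max_chunk_size` is kept whole rather than cut.

```python
import re

def group_sentences(text, max_chunk_size=500):
    """Greedily pack whole sentences into chunks of at most max_chunk_size characters."""
    # Assumption: sentence boundaries are . ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_size or not current:
            # The sentence fits, or it is the first one in an empty chunk
            current = candidate
        else:
            # Adding it would exceed the limit: close this chunk, start a new one
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

demo = group_sentences("One. Two! Three? Four.", max_chunk_size=10)
print(demo)  # ['One. Two!', 'Three?', 'Four.']
```

A regex split like this also breaks on abbreviations such as "e.g." or "Dr.", so it is best treated as a starting point for clean prose.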

In [None]:
def chunk_by_sentences(text, max_chunk_size=500):
    """
    Split text into chunks by sentences, keeping sentences intact.

    Args:
        text: The text to chunk
        max_chunk_size: Maximum characters per chunk

    Returns: 
        List of text chunks
    """
    # Simple sentence splitting (split on . ! ?)
    import re
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())