## Chunking Strategies

### Character-based Chunking

In [1]:
# Sample document about Python programming
sample_document = """
Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.

One of Python's key strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax allows programmers to express concepts in fewer lines of code compared to languages 
like C++ or Java.

Python is widely used in web development, data science, artificial intelligence, scientific computing, 
and automation. Popular frameworks include Django and Flask for web development, NumPy and Pandas for 
data analysis, and TensorFlow and PyTorch for machine learning.

The Python Package Index (PyPI) hosts hundreds of thousands of third-party packages that extend Python's 
capabilities. Installation is simple using pip, Python's package installer. The active community contributes 
to a rich ecosystem of libraries and frameworks.

Python continues to be one of the most popular programming languages worldwide. Its beginner-friendly nature 
makes it ideal for education, while its powerful features support professional software development and research.
"""

print(f"Document length: {len(sample_document)} characters")
print(f"Document length: {len(sample_document.split())} words")
print(f"\nFirst 200 characters:\n{sample_document[:200]}...")

Document length: 1367 characters
Document length: 187 words

First 200 characters:

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming...


In [3]:
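def chunk_by_characters(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    # Guard against a zero or negative stride, which would make the loop run forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]  # slicing past the end is safe; the last chunk may be shorter
        chunks.append(chunk)

        # Advance by the stride so consecutive chunks share `overlap` characters
        start += chunk_size - overlap

    return chunks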

chunks = chunk_by_characters(sample_document)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-"*80)

Number of chunks: 10

Chunk 1 (200 chars):

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming
--------------------------------------------------------------------------------
Chunk 2 (200 chars):
ased in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.

One of Python's key strengths is its extensive standard library, which
--------------------------------------------------------------------------------
Chunk 3 (200 chars):
strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax
--------------------------------------------------------------------------------
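
With the defaults above, a new chunk starts every 150 characters. Given the 1367-character length reported earlier, the tenth chunk starts at offset 1350 and holds only text that the ninth chunk (offsets 1200 to 1367) already contains. The following is a minimal sketch of one possible variant that stops as soon as a chunk reaches the end of the text. It is not part of the original notebook; the name `chunk_without_redundant_tail` and the stand-in text are assumptions made for illustration.

```python
def chunk_without_redundant_tail(text, chunk_size=200, overlap=50):
    """Hypothetical variant: stop once a chunk reaches the end of the text."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0

    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already covers the tail, so no further chunk is needed
        start += step

    return chunks


# Stand-in text with the same length as the sample document (an assumption for the demo)
demo_text = "x" * 1367
demo_chunks = chunk_without_redundant_tail(demo_text)
print(f"Number of chunks: {len(demo_chunks)}")
print(f"Last chunk length: {len(demo_chunks[-1])}")
```

Whether the redundant final chunk matters depends on the downstream use; for retrieval it mostly adds a near-duplicate entry to the index.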


### Word-based Chunking

In [14]:
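def chunk_by_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words that overlap by `overlap` words."""
    # Guard against a zero or negative stride, which would make the loop run forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    # split() with no arguments splits on any whitespace, so line breaks are discarded
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]

        # Join the words back into a single string
        chunk = ' '.join(chunk_words)
        chunks.append(chunk)

        # Advance by the stride so consecutive chunks share `overlap` words
        start += chunk_size - overlap

    return chunks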

chunks = chunk_by_words(sample_document)
print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(chunk)
    print("-"*80)
    

Number of chunks: 5

Chunk 1 (50 words):
Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms including procedural, object-oriented, and functional programming. One of Python's key strengths is its extensive standard library, which provides tools for
--------------------------------------------------------------------------------
Chunk 2 (50 words):
strengths is its extensive standard library, which provides tools for many common programming tasks. The language emphasizes code readability with its use of significant indentation. Python's syntax allows programmers to express concepts in fewer lines of code compared to languages like C++ or Java. Python is widely used in web
--------------------------------------------------------------------------------
Chunk 3 (50 words):
like C++ or Java. Python is widely used in web development, d
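
Both chunkers above advance by `chunk_size - overlap` items per iteration, so the chunk count can be predicted without running them. The sketch below is an illustration under that assumption; `expected_chunk_count` is a hypothetical helper, not something defined in the notebook, and the short word list is made-up demo data. It also checks that each chunk begins with the last `overlap` items of the chunk before it.

```python
import math


def expected_chunk_count(n_items, chunk_size, overlap):
    """Number of chunks a sliding window with this stride yields over n_items items."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return math.ceil(n_items / step)


# Sizes reported by the notebook: 1367 characters and 187 words
print(expected_chunk_count(1367, 200, 50))  # 10, matching the character-based run
print(expected_chunk_count(187, 50, 10))    # 5, matching the word-based run

# Small demo: 12 placeholder words, windows of 5 with an overlap of 2
words = [f"w{i}" for i in range(12)]
chunk_size, overlap = 5, 2
step = chunk_size - overlap
word_chunks = [words[i:i + chunk_size] for i in range(0, len(words), step)]

# Each chunk should start with the last `overlap` words of its predecessor
overlaps_match = all(
    prev[-overlap:] == nxt[:overlap]
    for prev, nxt in zip(word_chunks, word_chunks[1:])
)
print(len(word_chunks), overlaps_match)
```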

### Sentence-based Chunking

In [None]:
import re

def chunk_by_sentences(text, max_chunk_size=500):

    # Simple sentence splitting: break on whitespace that follows ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Check if adding this sentence would exceed max size