## Chunking Strategies

### Character-based Chunking

In [3]:
# Sample document about Python programming
sample_document = """
Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.

One of Python's key strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax allows programmers to express concepts in fewer lines of code compared to languages 
like C++ or Java.

Python is widely used in web development, data science, artificial intelligence, scientific computing, 
and automation. Popular frameworks include Django and Flask for web development, NumPy and Pandas for 
data analysis, and TensorFlow and PyTorch for machine learning.

The Python Package Index (PyPI) hosts hundreds of thousands of third-party packages that extend Python's 
capabilities. Installation is simple using pip, Python's package installer. The active community contributes 
to a rich ecosystem of libraries and frameworks.

Python continues to be one of the most popular programming languages worldwide. Its beginner-friendly nature 
makes it ideal for education, while its powerful features support professional software development and research.
"""

print(f"Document length: {len(sample_document)} characters")
print(f"Document length: {len(sample_document.split())} words")
print(f"\nFirst 200 characters:\n{sample_document[:200]}...")

Document length: 1367 characters
Document length: 187 words

First 200 characters:

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming...


In [3]:
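def chunk_by_characters(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    # Without this check the loop below would never advance
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)

        # Advance by less than chunk_size so consecutive chunks share `overlap` characters
        start += chunk_size - overlap
    return chunks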

chunks = chunk_by_characters(sample_document)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-"*80)

Number of chunks: 10

Chunk 1 (200 chars):

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming
--------------------------------------------------------------------------------
Chunk 2 (200 chars):
ased in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.

One of Python's key strengths is its extensive standard library, which
--------------------------------------------------------------------------------
Chunk 3 (200 chars):
strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax
--------------------------------------------------------------------------------
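
The chunk count in the output follows from the step size: each iteration advances `chunk_size - overlap` characters, so a text of length `n` yields `ceil(n / step)` chunks. The sketch below checks that arithmetic and the shared overlap on a toy string. `sliding_window` and `expected_chunk_count` are hypothetical helper names introduced here for illustration; they are not part of the notebook.

```python
import math


def sliding_window(text, chunk_size, overlap):
    """Compact restatement of fixed-size chunking with overlap (illustrative helper)."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def expected_chunk_count(text_length, chunk_size, overlap):
    """Number of windows produced when each window starts `step` characters later."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return math.ceil(text_length / step)


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window(toy_text, chunk_size=10, overlap=4)
print(toy_chunks)

# The tail of each chunk (everything past the step) reappears at the head of the next one
step = 10 - 4
for prev, nxt in zip(toy_chunks, toy_chunks[1:]):
    shared = prev[step:]
    print(repr(shared), "==", repr(nxt[:len(shared)]))

print(expected_chunk_count(len(toy_text), chunk_size=10, overlap=4))
print(expected_chunk_count(1367, chunk_size=200, overlap=50))
```

For the sample document, 1367 characters with a step of 150 gives 10 chunks, which matches the count printed earlier. The final window can lie entirely inside the previous chunk's overlap region, as `yz` does in the toy example, so the last chunk may add no new text.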


### Word-based Chunking

In [14]:
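def chunk_by_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words that overlap by `overlap` words."""
    # Without this check the loop below would never advance
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    # split() collapses all whitespace, so line breaks are not preserved
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]

        # Join words back into text
        chunk = ' '.join(chunk_words)
        chunks.append(chunk)

        # Move the start position, keeping `overlap` words from the previous chunk
        start += chunk_size - overlap

    return chunks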

chunks = chunk_by_words(sample_document)
print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3],1):
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(chunk)
    print("-"*80)
    

Number of chunks: 5

Chunk 1 (50 words):
Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms including procedural, object-oriented, and functional programming. One of Python's key strengths is its extensive standard library, which provides tools for
--------------------------------------------------------------------------------
Chunk 2 (50 words):
strengths is its extensive standard library, which provides tools for many common programming tasks. The language emphasizes code readability with its use of significant indentation. Python's syntax allows programmers to express concepts in fewer lines of code compared to languages like C++ or Java. Python is widely used in web
--------------------------------------------------------------------------------
Chunk 3 (50 words):
like C++ or Java. Python is widely used in web development, d

### Sentence-based Chunking

In [15]:
def chunk_by_sentences(text, max_chunk_size=500):

    # Simple sentence splitting (split on .!?)
    import re
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Check if adding this sentence would exceed max size
        if len(current_chunk) + len(sentence) > max_chunk_size and current_chunk:
            # Save the current chunk, then start a new one with the sentence that did not fit
            chunks.append(current_chunk.strip())
            current_chunk = sentence

        else:
            # Add the sentence to the current chunk, separated by a space
            current_chunk += " " + sentence if current_chunk else sentence

    
    # The last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

chunks = chunk_by_sentences(sample_document, max_chunk_size=400)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 4

Chunk 1 (398 chars):
Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming. One of Python's key strengths is its extensive standard library, which provides tools for many common 
programming tasks.
--------------------------------------------------------------------------------
Chunk 2 (320 chars):
The language emphasizes code readability with its use of significant indentation. Python's syntax allows programmers to express concepts in fewer lines of code compared to languages 
like C++ or Java. Python is widely used in web development, data science, artificial intelligence, scientific computing, 
and automation.
--------------------------------------------------------------------------------
Chunk 3 (332 chars):
Popular frameworks include Django 
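
To see what the regular expression in `chunk_by_sentences` treats as a sentence boundary, it can help to run it on a short string. This is a small illustration with a made-up input, and the names `SENTENCE_BOUNDARY` and `example_text` are introduced here only for the demonstration. It also shows a limitation of splitting on punctuation: an abbreviation such as `Dr.` is treated as the end of a sentence.

```python
import re

# Same pattern as in chunk_by_sentences: split on whitespace that follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

example_text = "Python was released in 1991. Is it popular? Yes! It is used by Dr. Smith."
sentences = SENTENCE_BOUNDARY.split(example_text)

for sentence in sentences:
    print(repr(sentence))
# 'Python was released in 1991.'
# 'Is it popular?'
# 'Yes!'
# 'It is used by Dr.'
# 'Smith.'
```

The lookbehind keeps the punctuation attached to the sentence it ends, and only the whitespace after it is consumed. If abbreviations matter for your documents, a dedicated sentence tokenizer is likely a better fit than this pattern.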

### Paragraph-based Chunking

In [16]:
def chunk_by_paragraphs(text, min_chunk_size=100):
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue

        # Append the paragraph to the pending chunk, so a paragraph shorter than
        # min_chunk_size is combined with the one that follows it
        current_chunk = current_chunk + "\n\n" + para if current_chunk else para

        # Emit the pending chunk once it reaches min_chunk_size
        if len(current_chunk) >= min_chunk_size:
            chunks.append(current_chunk)
            current_chunk = ""
    
    # the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

chunks = chunk_by_paragraphs(sample_document, min_chunk_size=100)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)


Number of chunks: 5

Chunk 1 (277 chars):
Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.
--------------------------------------------------------------------------------
Chunk 2 (323 chars):
One of Python's key strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax allows programmers to express concepts in fewer lines of code compared to languages 
like C++ or Java.
--------------------------------------------------------------------------------
Chunk 3 (270 chars):
Python is widely used in web development, data science, artificial intelligence, scientific computing, 
and automation. Popular frameworks include Djang

## Adding Metadata to Chunks

In [None]:
def chunk_with_metadata(text, source_name, chunk_size=50, overlap=10):
    words = text.split()
    chunks = []
    start = 0
    chunk_index = 0

    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]
        chunk_text = ' '.join(chunk_words)

        # Create chunk with metadata
        chunk_with_meta = {
            'text': chunk_text,
            "metadata": {
                "source": source_name,
                "chunk_index": chunk_index,
                "chunk_size": len(chunk_words),
                "char_count": len(chunk_text),
                "start_word": words[start],  # first word of the chunk
                "end_word": min(end, len(words))  # exclusive end index, clamped to the word count
            }
        }

        chunks.append(chunk_with_meta)
        start += chunk_size - overlap
        chunk_index += 1
    return chunks

chunks = chunk_with_metadata(
    sample_document, source_name="python_intro.txt",
    chunk_size=50, overlap=10
)

print(f"Total chunks: {len(chunks)}\n")
print("Second chunk with metadata:")
print("=" * 80)
print(f"Text: {chunks[1]['text'][:200]}...")  # first 200 characters of the second chunk (index 1)
print("\nMetadata:")
for key, value in chunks[1]['metadata'].items():
    print(f" {key}: {value}")

Total chunks: 5

Second chunk with metadata:
================================================================================
Text: strengths is its extensive standard library, which provides tools for many common programming tasks. The language emphasizes code readability with its use of significant indentation. Python's syntax a...

Metadata:
 source: python_intro.txt
 chunk_index: 1
 chunk_size: 50
 char_count: 329
 start_word: strengths
 end_word: 90
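
One use for this metadata is selecting and ordering chunks after they have been mixed together, for example when chunks from several files share one list. The sketch below assumes chunk dictionaries shaped like those returned by `chunk_with_metadata`. The helper name `chunks_from_source` and the sample records are hypothetical and exist only for illustration.

```python
def chunks_from_source(chunks, source_name):
    """Return the chunks from one source, ordered by their chunk_index."""
    selected = [c for c in chunks if c["metadata"]["source"] == source_name]
    return sorted(selected, key=lambda c: c["metadata"]["chunk_index"])


# Hand-written records in the same shape as chunk_with_metadata's output (made-up data)
mixed_chunks = [
    {"text": "second part of A", "metadata": {"source": "a.txt", "chunk_index": 1}},
    {"text": "only part of B", "metadata": {"source": "b.txt", "chunk_index": 0}},
    {"text": "first part of A", "metadata": {"source": "a.txt", "chunk_index": 0}},
]

for chunk in chunks_from_source(mixed_chunks, "a.txt"):
    meta = chunk["metadata"]
    print(f"{meta['source']}#{meta['chunk_index']}: {chunk['text']}")
# a.txt#0: first part of A
# a.txt#1: second part of A
```

Keeping `source` and `chunk_index` on every chunk means a retrieved chunk can be traced back to its file and position without re-reading the original text.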


## Loading and Chunking a Real Document

### Loading Text Files

In [None]:
def load_and_chunk_text_file(file_path, chunk_size=500, overlap=50):
    import os

    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()

        file_name = os.path.basename(file_path)
        file_size = os.path.getsize(file_path)

        chunks = chunk_by_sentences(text, max_chunk_size=chunk_size)