## Chunking Strategies

### Setup Sample Document

In [2]:
# Sample document about Python programming
sample_document = """
Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.
 
One of Python's key strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax allows programmers to express concepts in fewer lines of code compared to languages 
like C++ or Java.

Python is widely used in web development, data science, artificial intelligence, scientific computing, 
and automation. Popular frameworks include Django and Flask for web development, NumPy and Pandas for 
data analysis, and TensorFlow and PyTorch for machine learning.

The Python Package Index (PyPI) hosts hundreds of thousands of third-party packages that extend Python's 
capabilities. Installation is simple using pip, Python's package installer. The active community contributes 
to a rich ecosystem of libraries and frameworks.

Python continues to be one of the most popular programming languages worldwide. Its beginner-friendly nature 
makes it ideal for education, while its powerful features support professional software development and research.
"""

print(f"Document length: {len(sample_document)} characters")
print(f"Document length: {len(sample_document.split())} words")
print(f"\nFirst 200 characters:\n{sample_document[:200]}...")

Document length: 1367 characters
Document length: 187 words

First 200 characters:

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming...


## Fixed-Size Chunking (Character-Based)

In [3]:
def chunk_by_characters(text, chunk_size=200, overlap=50):
  """
  Split text into chunks of a fixed character length.

  Args:
      text: The text to chunk
      chunk_size: Number of characters per chunk
      overlap: Number of characters to overlap between chunks

  Returns:
      List of text chunks
  """
  # A non-positive step would never advance `start` and loop forever
  if overlap >= chunk_size:
    raise ValueError("overlap must be smaller than chunk_size")

  chunks = []
  start = 0

  while start < len(text):
    # Take the slice from start to start + chunk_size
    end = start + chunk_size
    chunk = text[start:end]
    chunks.append(chunk)

    # Advance by chunk_size minus overlap so consecutive chunks share text
    start += chunk_size - overlap

  return chunks

# Test it
chunks = chunk_by_characters(sample_document, chunk_size=200, overlap=50)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):   #Show first 3 chunks
  print(f"Chunk {i} ({len(chunk)} chars):")
  print(chunk)
  print("=" * 80)

Number of chunks: 10

Chunk 1 (200 chars):

Python is a high-level, interpreted programming language known for its simplicity and readability. 
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming
Chunk 2 (200 chars):
ased in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming.

One of Python's key strengths is its extensive standard library, which
Chunk 3 (200 chars):
strengths is its extensive standard library, which provides tools for many common 
programming tasks. The language emphasizes code readability with its use of significant indentation. 
Python's syntax
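
The chunk count reported above can be sanity-checked with a little arithmetic: each iteration advances `start` by `chunk_size - overlap`, so the loop runs `ceil(length / step)` times. The sketch below uses a hypothetical helper, `expected_chunk_count`, which is not part of the notebook and assumes `overlap < chunk_size`.

```python
import math

def expected_chunk_count(total_length, chunk_size, overlap):
    """Predict how many chunks a sliding window with overlap produces."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    # One chunk starts at every multiple of `step` below total_length
    return math.ceil(total_length / step)

# 1367 characters with a step of 150 characters
print(expected_chunk_count(1367, 200, 50))  # 10
# 187 words with a step of 40 words
print(expected_chunk_count(187, 50, 10))    # 5
```

The same formula applies to the word-based variant, since only the unit being counted changes.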


## Fixed-Size Chunking (Word-Based)

In [4]:
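def chunk_by_words(text, chunk_size=50, overlap=10):
  """
  Split text into chunks of a fixed word count.

  Args:
      text: The text to chunk
      chunk_size: Number of words per chunk
      overlap: Number of words to overlap between chunks

  Returns:
      List of text chunks
  """
  # A non-positive step would never advance `start` and loop forever
  if overlap >= chunk_size:
    raise ValueError("overlap must be smaller than chunk_size")

  # Split text into words (any run of whitespace is a separator)
  words = text.split()
  chunks = []
  start = 0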

  while start < len(words):
    # Get chunk of words
    end = start + chunk_size
    chunk_words = words[start:end]

    # Join words back into text
    chunk = ' '.join(chunk_words)
    chunks.append(chunk)

    # Move start position  (with overlap)
    start += chunk_size - overlap
  
  return chunks

# Test it
chunks = chunk_by_words(sample_document, chunk_size=50, overlap=10)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):  # Show first 3 chunks
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(chunk)
    print("-" * 80)

Number of chunks: 5

Chunk 1 (50 words):
Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms including procedural, object-oriented, and functional programming. One of Python's key strengths is its extensive standard library, which provides tools for
--------------------------------------------------------------------------------
Chunk 2 (50 words):
strengths is its extensive standard library, which provides tools for many common programming tasks. The language emphasizes code readability with its use of significant indentation. Python's syntax allows programmers to express concepts in fewer lines of code compared to languages like C++ or Java. Python is widely used in web
--------------------------------------------------------------------------------
Chunk 3 (50 words):
like C++ or Java. Python is widely used in web development, d

## Sentence-Based Chunking

In [5]:
def chunk_by_sentences(text, max_chunk_size=500):
  """
    Split text into chunks by sentences, keeping sentences intact.
    
    Args:
        text: The text to chunk
        max_chunk_size: Maximum characters per chunk
    
    Returns:
        List of text chunks
    """
  # Simple sentence splitting (Split on . ! ?)
  import re
  sentences = re.split(r'(?<=[.!?])\s+', text)

  chunks = []
  current_chunk = ""

  for sentence in sentences:
    # Check if adding this sentence would exceed max size
    if len(current_chunk) + len(sentence) > max_chunk_size and current_chunk:
      # Save current chunk and start new one
      chunks.append(current_chunk.strip())
      current_chunk = sentence
    else:
      # Add sentence to current chunk
      current_chunk += " " + sentence if current_chunk else sentence

  # Don't forget the last chunk
  if current_chunk:
    chunks.append(current_chunk.strip())
  
  return chunks


# Test it
chunks = chunk_by_sentences(sample_document, max_chunk_size=400)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
  print(f"Chunk {i} ({len(chunk)} chars):")
  print(chunk)
  print("-" * 80)


Number of chunks: 4

Chunk 1 (398 chars):
Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python supports multiple programming 
paradigms including procedural, object-oriented, and functional programming. One of Python's key strengths is its extensive standard library, which provides tools for many common 
programming tasks.
--------------------------------------------------------------------------------
Chunk 2 (320 chars):
The language emphasizes code readability with its use of significant indentation. Python's syntax allows programmers to express concepts in fewer lines of code compared to languages 
like C++ or Java. Python is widely used in web development, data science, artificial intelligence, scientific computing, 
and automation.
--------------------------------------------------------------------------------
Chunk 3 (332 chars):
Popular frameworks include Django 
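
The splitter in `chunk_by_sentences` is deliberately simple: it breaks after any `.`, `!`, or `?` that is followed by whitespace. A small sketch of what that means for abbreviations, using the same regular expression on a made-up sentence (`SENTENCE_BOUNDARY` and the sample text are illustrative, not part of the notebook):

```python
import re

# Same pattern as in chunk_by_sentences: split on whitespace that
# follows a period, exclamation mark, or question mark.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

text = "Dr. Smith wrote the parser. It handles e.g. commas well!"
sentences = SENTENCE_BOUNDARY.split(text)

for sentence in sentences:
    print(repr(sentence))
# 'Dr.'
# 'Smith wrote the parser.'
# 'It handles e.g.'
# 'commas well!'
```

Abbreviations such as "Dr." and "e.g." are treated as sentence ends, so text containing many of them may need a dedicated sentence tokenizer.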

## Paragraph-Based Chunking

In [6]:
def chunk_by_paragraphs(text, min_chunk_size= 100):
  """
  Split text by paragraphs (double newlines).
    
  Args:
      text: The text to chunk
      min_chunk_size: Minimum characters per chunk (combine small paragraphs)
    
  Returns:
      List of text chunks
  """
  # Split on blank lines (paragraph separator), tolerating stray whitespace
  import re
  paragraphs = re.split(r'\n\s*\n', text)

  chunks = []
  current_chunk = ""

  for para in paragraphs:
    para = para.strip()
    if not para:
      continue

    # Add the paragraph to the buffer of not-yet-emitted text
    current_chunk = current_chunk + "\n\n" + para if current_chunk else para

    # Emit the buffer once it is large enough; otherwise keep combining
    if len(current_chunk) >= min_chunk_size:
      chunks.append(current_chunk)
      current_chunk = ""

  # A short trailing buffer joins the previous chunk, or stands alone
  if current_chunk:
    if chunks:
      chunks[-1] += "\n\n" + current_chunk
    else:
      chunks.append(current_chunk)

  return chunks

# Test it
chunks = chunk_by_paragraphs(sample_document, min_chunk_size=100)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
  print(f"Chunk {i} ({len(chunk)} chars):")
  print(chunk)
  print("-" * 80)



### Adding Metadata to Chunks

#### Implementing Metadata

In [None]:
def chunk_with_metadata(text, source_name, chunk_size=50, overlap=10):
  """
  Create chunks with metadata.
    
  Args:
      text: The text to chunk
      source_name: Name of the source document
      chunk_size: Number of words per chunk
      overlap: Number of words to overlap
    
    Returns:
        List of dictionaries with 'text' and 'metadata'
    """
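  # A non-positive step would never advance `start` and loop forever
  if overlap >= chunk_size:
    raise ValueError("overlap must be smaller than chunk_size")

  # Split into words, then slide a window of chunk_size words
  words = text.split()
  chunks = []
  start = 0
  chunk_index = 0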

  while start < len(words):
    end = start + chunk_size
    chunk_words = words[start:end]
    chunk_text = ' '.join(chunk_words)

    # Create chunk with metadata
    chunk_with_meta = {
      'text': chunk_text,
      'metadata': {
        'source': source_name,
        'chunk_index': chunk_index,
        'chunk_size': len(chunk_words),
        'char_count': len(chunk_text),
        'start_word': start,
        'end_word': start + len(chunk_words)  # exclusive end, clamped to the document length
      }
    }

    chunks.append(chunk_with_meta)
    start += chunk_size - overlap
    chunk_index += 1
  return chunks

# Test it
chunks = chunk_with_metadata(
  sample_document,
  source_name="example.txt",
  chunk_size=50,
  overlap=10
)

print(f"Total chunks: {len(chunks)}\n")
print("First chunk with metadata:")
print("=" * 80)
print(f"Text: {chunks[0]['text'][:200]}...")
print("\nMetadata:")
for key, value in chunks[0]['metadata'].items():
    print(f"  {key}: {value}")
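
One reason to carry metadata is that chunks can later be filtered and reordered without touching their text. A minimal sketch of that idea, using a small hand-written list in the same `{'text', 'metadata'}` shape (`sample_chunks` and `chunks_from_source` are hypothetical names, and the file names are made up):

```python
# Hand-written chunks in the same shape that chunk_with_metadata returns
sample_chunks = [
    {"text": "alpha beta", "metadata": {"source": "a.txt", "chunk_index": 1}},
    {"text": "gamma delta", "metadata": {"source": "b.txt", "chunk_index": 0}},
    {"text": "epsilon zeta", "metadata": {"source": "a.txt", "chunk_index": 0}},
]

def chunks_from_source(chunks, source_name):
    """Return the chunks of one source document, in their original order."""
    matching = [c for c in chunks if c["metadata"]["source"] == source_name]
    return sorted(matching, key=lambda c: c["metadata"]["chunk_index"])

ordered = chunks_from_source(sample_chunks, "a.txt")
print([c["text"] for c in ordered])  # ['epsilon zeta', 'alpha beta']
```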


## Loading and Chunking Real Documents

### Loading Text Files

In [2]:
import os

In [None]:
def load_and_chunk_text_file(file_path, chunk_size=500, overlap=50):
    """
    Load a text file and chunk it.
    
    Args:
        file_path: Path to the text file
        chunk_size: Maximum characters per chunk
        overlap: Currently unused; chunk_by_sentences does not overlap chunks
    
    Returns:
        List of chunks with metadata
    """
    import os
    
    # Read the file
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    
    # Get file metadata
    file_name = os.path.basename(file_path)
    file_size = os.path.getsize(file_path)
    
    # Chunk the text
    chunks = chunk_by_sentences(text, max_chunk_size=chunk_size)
    
    # Add metadata to each chunk
    chunks_with_metadata = []
    for i, chunk in enumerate(chunks):
        chunks_with_metadata.append({
            'text': chunk,
            'metadata': {
                'source': file_name,
                'file_path': file_path,
                'file_size': file_size,
                'chunk_index': i,
                'total_chunks': len(chunks)
            }
        })
    
    return chunks_with_metadata

# Example usage (create a sample file first)
sample_file_path = 'sample_document.txt'
with open(sample_file_path, 'w', encoding='utf-8') as f:
    f.write(sample_document)

# Load and chunk
chunks = load_and_chunk_text_file(sample_file_path, chunk_size=400)

print(f"Loaded and chunked: {chunks[0]['metadata']['source']}")
print(f"Total chunks: {len(chunks)}")
print(f"\nChunk 1:")
print(chunks[0]['text'])

In [None]:
chunks[2]['metadata']

In [None]:
def load_and_chunk_pdf(file_path, chunk_size=500):
    """
    Load a PDF file and chunk it.
    
    Args:
        file_path: Path to the PDF file
        chunk_size: Characters per chunk
    
    Returns:
        List of chunks with metadata (including page numbers)
    """
    import PyPDF2
    import os
    
    chunks_with_metadata = []
    file_name = os.path.basename(file_path)
    
    # Open PDF
    with open(file_path, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        num_pages = len(pdf_reader.pages)
        
        # Process each page
        for page_num in range(num_pages):
            # Extract text from page (fall back to "" if nothing is extracted)
            page = pdf_reader.pages[page_num]
            text = page.extract_text() or ""
            
            # Chunk the page text
            page_chunks = chunk_by_sentences(text, max_chunk_size=chunk_size)
            
            # Add metadata to each chunk
            for chunk_idx, chunk in enumerate(page_chunks):
                chunks_with_metadata.append({
                    'text': chunk,
                    'metadata': {
                        'source': file_name,
                        'page': page_num + 1,  # 1-indexed
                        'total_pages': num_pages,
                        'chunk_on_page': chunk_idx,
                    }
                })
    
    return chunks_with_metadata

# Example (you would use this with a real PDF file)
print("PDF loading function ready!")
print("\nUsage:")
print("chunks = load_and_chunk_pdf('your_document.pdf', chunk_size=500)")

In [None]:
# You can test the load_and_chunk_pdf function with the code below

# chunks = load_and_chunk_pdf("Retrieval-Augmented_Generation_RAG.pdf", chunk_size=500)

# print(f"Total chunks: {len(chunks)}\n")

# for i, c in enumerate(chunks[:5]):  # show first 5 chunks
#     print(f"CHUNK {i+1}")
#     print("TEXT:", c["text"][:200])  # print first 200 chars
#     print("META:", c["metadata"])
#     print("-"*40)


## Practice Exercises

#### Document A: FAQ

Strategy_A
Sentence chunking


Reason_A
Sentence chunking, because each question in an FAQ document is a single sentence, so splitting on sentence boundaries makes it easy to retrieve the answer to each question.


In [None]:
doc_a = """
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.

"""

In [8]:
def chunk_by_sentences(text, max_chunk_size=50):
  """
    Split text into chunks by sentences, keeping sentences intact.
    
    Args:
        text: The text to chunk
        max_chunk_size: Maximum characters per chunk
    
    Returns:
        List of text chunks
    """
  # Simple sentence splitting (Split on . ! ?)
  import re
  sentences = re.split(r'(?<=[.!?])\s+', text)

  chunks = []
  current_chunk = ""

  for sentence in sentences:
    # Check if adding this sentence would exceed max size
    if len(current_chunk) + len(sentence) > max_chunk_size and current_chunk:
      # Save current chunk and start new one
      chunks.append(current_chunk.strip())
      current_chunk = sentence
    else:
      # Add sentence to current chunk
      current_chunk += " " + sentence if current_chunk else sentence

  # Don't forget the last chunk
  if current_chunk:
    chunks.append(current_chunk.strip())
  
  return chunks


# Test it
chunks = chunk_by_sentences(doc_a, max_chunk_size=50)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
  print(f"Chunk {i} ({len(chunk)} chars):")
  print(chunk)
  print("-" * 80)


Number of chunks: 7

Chunk 1 (29 chars):
Q: What is the return policy?
--------------------------------------------------------------------------------
Chunk 2 (74 chars):
A: Items can be returned within 30 days of purchase with original receipt.
--------------------------------------------------------------------------------
Chunk 3 (39 chars):
Q: Do you offer international shipping?
--------------------------------------------------------------------------------
Chunk 4 (47 chars):
A: Yes, we ship to over 50 countries worldwide.
--------------------------------------------------------------------------------
Chunk 5 (32 chars):
Shipping times vary by location.
--------------------------------------------------------------------------------
Chunk 6 (27 chars):
Q: How do I track my order?
--------------------------------------------------------------------------------
Chunk 7 (61 chars):
A: Use the tracking number sent to your email after shipment.
--------------------------------------------------------------------------------
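
The output above places each answer in a different chunk from its question. For comparison, here is a minimal sketch that keeps every question together with its answer by splitting on blank lines instead. `chunk_by_qa_pairs` and `faq_text` are hypothetical names, and the approach assumes pairs are separated by an empty line, as they are in `doc_a`.

```python
import re

# A shortened copy of the FAQ text, so the example runs on its own
faq_text = """
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.
"""

def chunk_by_qa_pairs(text):
    """Split on blank lines so each question stays with its answer."""
    blocks = re.split(r'\n\s*\n', text.strip())
    return [block.strip() for block in blocks if block.strip()]

pairs = chunk_by_qa_pairs(faq_text)
for pair in pairs:
    print(pair)
    print("-" * 40)
```

Whether pair-level or sentence-level chunks retrieve better depends on how the questions are phrased at query time, so it is worth trying both on real queries.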

In [9]:
doc_b = """
Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.
"""

In [10]:
doc_c = """
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
"""