# Advanced Chunking Strategies

## Welcome!
In Basic RAG, we used simple character-based chunking. But chunking is one of the most impactful
factors in RAG quality. Bad chunking = bad retrieval = bad answers!

## What You Will Learn
1. **Why chunking matters** - The impact on retrieval quality
2. **Fixed-size chunking** - What we did before (baseline)
3. **Semantic chunking** - Split based on meaning, not characters
4. **Parent-Child chunking** - Small chunks for search, big chunks for context
5. **Document-aware chunking** - Respect document structure (headers, sections)

## The Problem with Basic Chunking
```
Basic chunking (500 chars) might split like this:

Chunk 1: "...LoRA (Low-Rank Adaptation) is a method for fine-tuning large"
Chunk 2: "language models efficiently. It works by..."

The concept is split! Neither chunk has the complete information.
```
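The split above can be reproduced with a few lines of plain Python. This is only a minimal sketch: `naive_chunks` is a hypothetical helper (not part of any library), and the 60-character width is chosen purely so the cut lands where the example shows it.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative sentence; the width below is an assumption for the demo.
text = (
    "LoRA (Low-Rank Adaptation) is a method for fine-tuning large "
    "language models efficiently. It works by adding small trainable matrices."
)

chunks = naive_chunks(text, size=60)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk!r}")
```

The first chunk ends at "fine-tuning large" and the second begins with "language models", so a search for the definition of LoRA can match either half without retrieving the whole sentence.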

Let us learn better approaches!

## Step 1: Environment Setup

In [2]:
# Load environment variables
from dotenv import load_dotenv
import os

load_dotenv()
print("Environment loaded!")
print(f"OpenAI API Key found: {'OPENAI_API_KEY' in os.environ}")

Environment loaded!
OpenAI API Key found: False


In [None]:
# Install required packages (run once)
# !pip install langchain-experimental sentence-transformers

## Step 2: Load Sample Document

We will use the same PDF you have been working with for consistency.

In [3]:
from langchain_community.document_loaders import PyPDFLoader

# Load our familiar PDF
pdf_path = "llm_fundamentals.pdf"

# Check if file exists, if not try alternate path
if not os.path.exists(pdf_path):
    pdf_path = "../RAG/llm_fundamentals.pdf"

loader = PyPDFLoader(pdf_path)
documents = loader.load()

# Combine all pages into one text for demonstration
full_text = "\n\n".join([doc.page_content for doc in documents])

print(f"Loaded {len(documents)} pages")
print(f"Total characters: {len(full_text)}")
print(f"\nFirst 500 characters:")
print(full_text[:500])

Loaded 8 pages
Total characters: 15540

First 500 characters:
@genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ....................................................................................................................... 2 
Advanced Architectures ......................................................................................................................... 3 
Training & Tuning ............................................................................


---
## Strategy 1: Fixed-Size Chunking (Baseline)

This is what we have been doing. Let us see its limitations.

**How it works:**
- Split the text into chunks of at most N characters
- Add overlap so that text near a boundary appears in both neighboring chunks

**Problem:**
- Does not respect meaning or document structure
- May split related concepts across chunks
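The idea of size plus overlap can be sketched without any library. This is a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers paragraph and line breaks); `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a window of `chunk_size` characters, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
print(demo)
```

Each chunk starts with the last character of the previous one, which is the overlap. The window still cuts wherever the count runs out, regardless of meaning.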

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Basic fixed-size chunking (what we did before)
basic_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # At most 500 characters per chunk
    chunk_overlap=50,    # Up to 50 characters shared between neighboring chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try to split on paragraphs first
)

basic_chunks = basic_splitter.split_text(full_text)

print(f"Fixed-Size Chunking Results:")
print(f"Number of chunks: {len(basic_chunks)}")
print(f"Average chunk size: {sum(len(c) for c in basic_chunks) / len(basic_chunks):.0f} chars")
print(f"\n" + "="*80)
print(f"Sample chunks (notice how they might cut mid-concept):")
print(f"\n" + "="*80)

for i, chunk in enumerate(basic_chunks[:3], 1):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(f"{chunk[:200]}...")
    print("-"*40)

Fixed-Size Chunking Results:
Number of chunks: 37
Average chunk size: 419 chars

Sample chunks (notice how they might cut mid-concept):


Chunk 1 (404 chars):
@genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ....................................................................
----------------------------------------

Chunk 2 (452 chars):
Training & Tuning .................................................................................................................................... 3 
Generation Controls .............................
----------------------------------------

Chunk 3 (448 chars):
Efficiency & Scaling ................................................................................................................................. 5 
Data & Preprocessing ............................
----------------------------------------


---
## Strategy 2: Semantic Chunking

**The Idea:**
Instead of splitting by character count, split when the MEANING changes!

**How it works:**
1. Split text into sentences
2. Create embeddings for each sentence
3. Compare embeddings of adjacent sentences
4. When similarity drops significantly - that is a chunk boundary!

**Think of it like:**
Reading a book and noticing when the topic changes.
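The four steps above can be sketched with the standard library alone. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and a fixed distance threshold replaces the percentile-based rule a production chunker would typically use.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for an embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 1.0 means no shared words."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(text: str, threshold: float) -> list[str]:
    # Step 1: split into sentences
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Step 2: embed each sentence
    vectors = [toy_embed(s) for s in sentences]
    # Steps 3-4: compare neighbors and start a new chunk when the distance is large
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sample = (
    "LoRA adds small trainable matrices. LoRA matrices are small and cheap to train. "
    "Paris is the capital of France. France is in Europe."
)
result = semantic_chunks(sample, threshold=0.8)
for chunk in result:
    print(chunk)
```

The only boundary falls between the LoRA sentences and the France sentences, because that is the one pair of neighbors with no words in common.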

In [5]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize embeddings for semantic comparison
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create semantic chunker
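# The chunker measures the embedding distance between adjacent sentences
# and splits where that distance is unusually large.
# breakpoint_threshold_type options:
#   - "percentile": Split where the distance exceeds the X-th percentile of all distances
#   - "standard_deviation": Split where the distance exceeds the mean by X standard deviations
#   - "interquartile": Split based on the interquartile range (IQR) of the distances

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # Most intuitive option
    breakpoint_threshold_amount=70  # Split at gaps above the 70th percentile of distances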
)

print("Creating semantic chunks (this may take a moment)...")
semantic_chunks = semantic_splitter.split_text(full_text)

print(f"\nSemantic Chunking Results:")
print(f"Number of chunks: {len(semantic_chunks)}")
print(f"Average chunk size: {sum(len(c) for c in semantic_chunks) / len(semantic_chunks):.0f} chars")
print(f"\n" + "="*80)
print(f"Sample semantic chunks (notice how each chunk has a coherent topic):")
print(f"\n" + "="*80)

for i, chunk in enumerate(semantic_chunks[:3], 1):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(f"{chunk[:300]}..." if len(chunk) > 300 else chunk)
    print("-"*40)

Creating semantic chunks (this may take a moment)...

Semantic Chunking Results:
Number of chunks: 54
Average chunk size: 287 chars

Sample semantic chunks (notice how each chunk has a coherent topic):


Chunk 1 (402 chars):
@genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ....................................................................................................................... 2 
Advanced Architectures ......................
----------------------------------------

Chunk 2 (153 chars):
3 
Training & Tuning ....................................................................................................................................
----------------------------------------

Chunk 3 (150 chars):
3 
Generation Controls ...............................................................................................................................
----------------------------------------

---
## Strategy 3: Parent-Child Chunking

**The Problem:**
- Small chunks: better search precision, but less context for the LLM
- Large chunks: more context, but harder to find the relevant parts

**The Solution: Have both!**
- **Child chunks**: Small (used for searching)
- **Parent chunks**: Large (sent to LLM for context)

**How it works:**
```
1. User asks a question
2. Search finds relevant CHILD chunk (small, precise)
3. Return the PARENT chunk (large, full context) to LLM
```

**Analogy:**
Like finding a specific sentence in a book (child), but reading the whole paragraph (parent).
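The search-small, return-large flow can be shown in miniature before bringing in a vector store. This is a rough sketch under stated assumptions: word overlap stands in for embedding similarity, sentences stand in for child chunks, and `retrieve_parent` is a hypothetical helper.

```python
import re

# Two "parent" chunks (large, full context)
parents = [
    "LoRA adds small trainable matrices. It leaves the base weights frozen.",
    "Transformers use self-attention. Attention weighs input tokens.",
]

# Children are sentences; each one remembers the index of its parent
children = [
    (sentence, parent_idx)
    for parent_idx, parent in enumerate(parents)
    for sentence in parent.split(". ")
]


def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of words shared by query and text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words)


def retrieve_parent(query: str) -> str:
    # Search over the small CHILD chunks...
    _best_child, parent_idx = max(children, key=lambda item: word_overlap(query, item[0]))
    # ...but hand back the large PARENT chunk
    return parents[parent_idx]


answer_context = retrieve_parent("What is LoRA?")
print(answer_context)
```

The query only matches the first sentence, yet the returned context also includes the sentence about frozen base weights. Storing a parent index next to each child, instead of keying a dictionary by child text, also keeps identical child texts from different parents apart.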

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: Create PARENT chunks (larger, for context)
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,     # Large chunks
    chunk_overlap=100
)

parent_chunks = parent_splitter.split_text(full_text)
print(f"Parent chunks created: {len(parent_chunks)}")
print(f"Average parent size: {sum(len(c) for c in parent_chunks) / len(parent_chunks):.0f} chars")

Parent chunks created: 16
Average parent size: 1005 chars


In [7]:
# Step 2: Create CHILD chunks from each parent
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,      # Small chunks for precise search
    chunk_overlap=30
)

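# Store parent-child relationships
# Note: the map is keyed by the child's text, so two identical child texts
# from different parents would overwrite each other. Fine for this demo.
parent_child_map = {}  # child_chunk -> parent info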
all_child_chunks = []

for parent_idx, parent_chunk in enumerate(parent_chunks):
    # Split this parent into children
    children = child_splitter.split_text(parent_chunk)
    
    for child in children:
        all_child_chunks.append(child)
        # Map each child back to its parent
        parent_child_map[child] = {
            "parent_idx": parent_idx,
            "parent_text": parent_chunk
        }

print(f"\nChild chunks created: {len(all_child_chunks)}")
print(f"Average child size: {sum(len(c) for c in all_child_chunks) / len(all_child_chunks):.0f} chars")
print(f"\nRatio: ~{len(all_child_chunks) / len(parent_chunks):.1f} children per parent")


Child chunks created: 69
Average child size: 234 chars

Ratio: ~4.3 children per parent


In [8]:
# Step 3: Demonstrate the parent-child relationship
print("Example Parent-Child Relationship:")
print("="*80)

# Pick a child chunk
sample_child = all_child_chunks[5]
sample_parent_info = parent_child_map[sample_child]

print(f"\nCHILD CHUNK (what we SEARCH with):")
print(f"Length: {len(sample_child)} chars")
print(f"Content: {sample_child}")
print(f"\n" + "-"*80)

print(f"\nPARENT CHUNK (what we SEND TO LLM):")
print(f"Length: {len(sample_parent_info['parent_text'])} chars")
print(f"Content: {sample_parent_info['parent_text'][:500]}...")

Example Parent-Child Relationship:

CHILD CHUNK (what we SEARCH with):
Length: 294 chars
Content: Data & Preprocessing ............................................................................................................................. 6 
Evaluation & Benchmarks ...................................................................................................................... 6

--------------------------------------------------------------------------------

PARENT CHUNK (what we SEND TO LLM):
Length: 1466 chars
Content: @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ....................................................................................................................... 2 
Advanced Architectures ......................................................................................................................... 3 
Training & Tuning ....................................

In [12]:
# Step 4: Create a simple parent-child retriever
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Create documents with parent reference in metadata
child_documents = []
for i, child in enumerate(all_child_chunks):
    parent_info = parent_child_map[child]
    doc = Document(
        page_content=child,
        metadata={
            "child_id": i,
            "parent_idx": parent_info["parent_idx"],
            "is_child": True
        }
    )
    child_documents.append(doc)

# Create vector store with CHILD chunks (for searching)
child_vectorstore = Chroma.from_documents(
    documents=child_documents,
    embedding=embeddings,
    collection_name="child_chunks"
)

print(f"Vector store created with {len(child_documents)} child chunks")

Vector store created with 69 child chunks


In [13]:
def parent_child_retrieve(query: str, k: int = 3):
    """
    Search using child chunks, but return parent chunks for context.
    
    This gives you:
    - Precise search (small child chunks)
    - Rich context (large parent chunks)
    """
    # Step 1: Search child chunks
    child_results = child_vectorstore.similarity_search(query, k=k)
    
    # Step 2: Get unique parent chunks
    seen_parents = set()
    parent_results = []
    
    for child_doc in child_results:
        parent_idx = child_doc.metadata["parent_idx"]
        
        if parent_idx not in seen_parents:
            seen_parents.add(parent_idx)
            parent_results.append({
                "parent_idx": parent_idx,
                "parent_text": parent_chunks[parent_idx],
                "matched_child": child_doc.page_content
            })
    
    return parent_results

# Test it!
query = "What is LoRA?"
results = parent_child_retrieve(query, k=3)

print(f"Query: {query}")
print(f"\n" + "="*80)

for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"Matched CHILD (search hit): {result['matched_child'][:100]}...")
    print(f"\nReturned PARENT (full context): {result['parent_text'][:300]}...")
    print("-"*80)

Query: What is LoRA?


Result 1:
Matched CHILD (search hit): 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest hardware 
10. PEFT → F...

Returned PARENT (full context): @genieincodebottle 
Advanced Architectures 
1. Diffusion Models → Generate images/video by learning to reverse noise process (DALL-E, 
Midjourney, Stable Diffusion) 
2. VAEs (Variational Autoencoders) → Probabilistic generative models with latent spaces 
3. GANs (Generative Adversarial Networks) → G...
--------------------------------------------------------------------------------

Result 2:
Matched CHILD (search hit): @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Co...

Returned PARENT (full context): @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ..........................................................................................................

---
## Strategy 4: Document-Aware Chunking

**The Idea:**
Respect the document's natural structure (headers, sections, lists).

**Best for:**
- Technical documentation
- Legal documents with sections
- Markdown files
- HTML content
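To make the idea concrete, here is a minimal sketch of header-based splitting for Markdown in plain Python. It is an illustration only: `split_by_headers` is a hypothetical helper, it assumes ATX-style `#` headers, and it does not handle `#` characters inside code fences.

```python
import re


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into sections, recording the header path of each section."""
    sections = []
    path = {}   # header level -> title, e.g. {1: "Guide", 2: "Setup"}
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(path), "content": content})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes any header at the same or a deeper level
            for old_level in [lvl for lvl in path if lvl >= level]:
                del path[old_level]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\n\n## Setup\nInstall it.\n\n## Usage\n### CLI\nRun the tool.\n"
sections = split_by_headers(doc)
for section in sections:
    print(section)
```

Each section carries its full header path as metadata, so a chunk about "CLI" still records that it belongs under "Usage" in "Guide".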

In [14]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Sample markdown document (simulating structured content)
sample_markdown = """
# LLM Fundamentals

## Core Concepts

Large Language Models are neural networks trained on vast amounts of text data.
They learn patterns in language and can generate human-like text.

### Transformer Architecture

The transformer is the backbone of modern LLMs. It uses self-attention
to process sequences in parallel, making it efficient and powerful.

### Attention Mechanism

Attention allows the model to focus on relevant parts of the input
when generating each output token.

## Fine-tuning Techniques

### LoRA

Low-Rank Adaptation (LoRA) is an efficient fine-tuning method.
It adds small trainable matrices to the model instead of updating all weights.

### QLoRA

QLoRA combines LoRA with quantization for even more efficient fine-tuning.
It allows fine-tuning large models on consumer hardware.
"""

# Define headers to split on
headers_to_split_on = [
    ("#", "h1"),      # Main title
    ("##", "h2"),     # Section
    ("###", "h3"),    # Subsection
]

# Create markdown-aware splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

# Split the document
md_chunks = markdown_splitter.split_text(sample_markdown)

print(f"Document-Aware Chunking Results:")
print(f"Number of chunks: {len(md_chunks)}")
print(f"\n" + "="*80)

for i, chunk in enumerate(md_chunks, 1):
    print(f"\nChunk {i}:")
    print(f"Headers: {chunk.metadata}")  # Shows the section hierarchy!
    print(f"Content: {chunk.page_content}")
    print("-"*40)

Document-Aware Chunking Results:
Number of chunks: 5


Chunk 1:
Headers: {'h1': 'LLM Fundamentals', 'h2': 'Core Concepts'}
Content: Large Language Models are neural networks trained on vast amounts of text data.
They learn patterns in language and can generate human-like text.
----------------------------------------

Chunk 2:
Headers: {'h1': 'LLM Fundamentals', 'h2': 'Core Concepts', 'h3': 'Transformer Architecture'}
Content: The transformer is the backbone of modern LLMs. It uses self-attention
to process sequences in parallel, making it efficient and powerful.
----------------------------------------

Chunk 3:
Headers: {'h1': 'LLM Fundamentals', 'h2': 'Core Concepts', 'h3': 'Attention Mechanism'}
Content: Attention allows the model to focus on relevant parts of the input
when generating each output token.
----------------------------------------

Chunk 4:
Headers: {'h1': 'LLM Fundamentals', 'h2': 'Fine-tuning Techniques', 'h3': 'LoRA'}
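Content: Low-Rank Adaptation (LoRA) is an efficient fine-tuning method.
It adds small trainable matrices to the model instead of updating all weights.
----------------------------------------

Chunk 5:
Headers: {'h1': 'LLM Fundamentals', 'h2': 'Fine-tuning Techniques', 'h3': 'QLoRA'}
Content: QLoRA combines LoRA with quantization for even more efficient fine-tuning.
It allows fine-tuning large models on consumer hardware.
----------------------------------------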

---
## Comparison: Which Strategy to Use?

| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| **Fixed-Size** | Quick prototypes | Fast, simple | May split concepts |
| **Semantic** | General documents | Respects meaning | Slower, variable sizes |
| **Parent-Child** | Precise search + context | Best of both worlds | More complex setup |
| **Document-Aware** | Structured docs | Preserves structure | Needs structured input |
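When comparing strategies on your own documents, the average size alone can hide very small or very large chunks. The helper below is a small sketch for summarizing any list of chunk strings; `chunk_stats` is a hypothetical name, and the sample lists are made up for illustration.

```python
def chunk_stats(chunks: list[str]) -> dict:
    """Summarize a list of chunks: count plus average, smallest, and largest size."""
    if not chunks:
        return {"count": 0, "avg": 0.0, "min": 0, "max": 0}
    sizes = [len(chunk) for chunk in chunks]
    return {
        "count": len(sizes),
        "avg": sum(sizes) / len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }


# Made-up chunk lists standing in for the output of two strategies
strategies = {
    "even": ["a" * 100, "b" * 100],
    "uneven": ["a" * 10, "b" * 190],
}
for name, chunks in strategies.items():
    print(name, chunk_stats(chunks))
```

Both sample lists average 100 characters, but the minimum and maximum reveal how differently the text was divided.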

In [15]:
# Summary comparison
print("Chunking Strategy Comparison:")
print("="*80)
print(f"\n{'Strategy':<20} {'Chunks':<10} {'Avg Size':<15} {'Notes'}")
print("-"*80)
print(f"{'Fixed-Size':<20} {len(basic_chunks):<10} {sum(len(c) for c in basic_chunks) / len(basic_chunks):<15.0f} {'Fast, simple'}")
print(f"{'Semantic':<20} {len(semantic_chunks):<10} {sum(len(c) for c in semantic_chunks) / len(semantic_chunks):<15.0f} {'Meaning-based'}")
print(f"{'Parent-Child':<20} {len(all_child_chunks):<10} {sum(len(c) for c in all_child_chunks) / len(all_child_chunks):<15.0f} {'Search + context'}")
print(f"{'(Parents)':<20} {len(parent_chunks):<10} {sum(len(c) for c in parent_chunks) / len(parent_chunks):<15.0f} {'Full context'}")

Chunking Strategy Comparison:

Strategy             Chunks     Avg Size        Notes
--------------------------------------------------------------------------------
Fixed-Size           37         419             Fast, simple
Semantic             54         287             Meaning-based
Parent-Child         69         234             Search + context
(Parents)            16         1005            Full context


---
## Best Practices for Chunking

### 1. Data Cleaning (Pre-processing)
Clean your data BEFORE chunking!

In [16]:
import re

def clean_text(text: str) -> str:
    """
    Clean text before chunking for better quality.
    
    Good chunking starts with clean data!
    """
    # Collapse runs of whitespace (note: this also flattens paragraph breaks)
    text = re.sub(r'\s+', ' ', text)
    
    # Fix common OCR errors (if from scanned PDFs).
    # This must run BEFORE the special-character filter below, which would
    # otherwise strip the '|' first. Blunt heuristic: customize for your docs.
    text = text.replace('|', 'I')  # Common OCR mistake
    
    # Remove special characters that do not add meaning
    text = re.sub(r'[^\w\s.,!?;:\-\(\)\[\]\'\"]+', '', text)
    
    # Remove page numbers, headers, footers (customize based on your docs)
    text = re.sub(r'Page \d+ of \d+', '', text)
    
    return text.strip()

# Example
dirty_text = "This   is    messy    text|with OCR   errors.   Page 1 of 10"
clean = clean_text(dirty_text)
print(f"Before: {dirty_text}")
print(f"After:  {clean}")

Before: This   is    messy    text|with OCR   errors.   Page 1 of 10
After:  This is messy textIwith OCR errors.


### 2. Chunk Size Guidelines

In [17]:
# Recommended chunk sizes based on use case
chunk_size_guide = """
Chunk Size Guidelines:
======================

Small (200-500 chars):
  - Best for: Precise Q&A, factual lookups
  - Example: "What year was X founded?"
  - Trade-off: May lack context

Medium (500-1000 chars):
  - Best for: General RAG applications
  - Example: "Explain how X works"
  - Trade-off: Good balance (recommended default)

Large (1000-2000 chars):
  - Best for: Complex reasoning, summaries
  - Example: "Compare X and Y"
  - Trade-off: Less precise retrieval

Overlap Guidelines:
  - 10-20% of chunk size
  - Ensures context is not lost at boundaries
  - Example: chunk_size=500, overlap=50-100
"""
print(chunk_size_guide)


Chunk Size Guidelines:

Small (200-500 chars):
  - Best for: Precise Q&A, factual lookups
  - Example: "What year was X founded?"
  - Trade-off: May lack context

Medium (500-1000 chars):
  - Best for: General RAG applications
  - Example: "Explain how X works"
  - Trade-off: Good balance (recommended default)

Large (1000-2000 chars):
  - Best for: Complex reasoning, summaries
  - Example: "Compare X and Y"
  - Trade-off: Less precise retrieval

Overlap Guidelines:
  - 10-20% of chunk size
  - Ensures context is not lost at boundaries
  - Example: chunk_size=500, overlap=50-100
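
The 10-20% overlap guideline is simple arithmetic, sketched here so it can be applied to any chunk size. `overlap_range` is a hypothetical helper, and the percentages are the rule of thumb from the guidelines above, not a library default.

```python
def overlap_range(chunk_size: int) -> tuple[int, int]:
    """Return the (low, high) overlap in characters for the 10-20% rule of thumb."""
    return chunk_size // 10, chunk_size // 5


for size in (300, 500, 1000):
    low, high = overlap_range(size)
    print(f"chunk_size={size}: overlap between {low} and {high} characters")
```

For `chunk_size=500` this gives 50 to 100 characters, matching the example in the guidelines.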



---
## Summary

### What You Have Learned:
1. **Fixed-Size**: Quick but may split concepts (our baseline)
2. **Semantic**: Splits based on meaning changes (smarter)
3. **Parent-Child**: Small chunks for search, large for context (best of both)
4. **Document-Aware**: Respects headers and structure (for organized docs)

### Key Takeaways:
- Chunking is ONE OF THE MOST IMPORTANT factors in RAG quality
- There is no single best strategy - it depends on your use case
- Clean your data before chunking
- Test different strategies and measure results (we will learn evaluation later!)

### When to Use Each:
- **Quick prototype**: Fixed-size
- **General RAG**: Semantic or Fixed-size with good overlap
- **High-quality production**: Parent-Child
- **Technical docs**: Document-Aware

### Next Up:
**Hybrid Search** - Combining keyword search (BM25) with semantic search for even better retrieval!