# Lab 3.5.2: Chunking Strategies Comparison

**Module:** 3.5 - RAG Systems & Vector Databases  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

- [ ] Understand why chunking strategy matters for RAG quality
- [ ] Implement fixed-size chunking with various sizes
- [ ] Implement semantic chunking by section/paragraph
- [ ] Implement sentence-based chunking
- [ ] Evaluate and compare strategies using retrieval quality metrics
- [ ] Know when to use each strategy

---

## 📚 Prerequisites

- Completed: Lab 3.5.1 (Basic RAG Pipeline)
- Knowledge of: Basic RAG concepts, embeddings

---

## 🌍 Real-World Context

**The Problem:** You've built a RAG system but users complain that it sometimes retrieves irrelevant chunks or misses important information. The culprit? Poor chunking strategy.

**The Impact:**
- Chunks too small: "The model uses" ... uses WHAT? The context is missing!
- Chunks too large: the retriever returns pages of text when only one sentence was needed
- Wrong boundaries: Splits a concept mid-explanation

**The Solution:** Test different chunking strategies and measure which works best for YOUR data.

---

## 🧒 ELI5: Why Does Chunking Matter?

> **Imagine you're cutting a pizza.**
>
> - Cut it into tiny pieces (like dice) → Hard to pick up, toppings fall off. That's **too small chunks** - you lose context.
> - Don't cut it at all → Can't fit a whole pizza in your mouth! That's **too large chunks** - too much irrelevant info.
> - Cut through the middle of a topping → Messy and wasteful. That's **bad boundaries** - splitting concepts mid-thought.
>
> The perfect pizza slice: just right size, cut between the toppings, easy to eat. That's what good chunking achieves for documents!
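
To make the three failure modes concrete, here is a minimal sketch using a naive character splitter. The helper `naive_split` and the sample text are made up for illustration; they are not part of the lab code or of LangChain.

```python
# Hypothetical helper: cuts every `size` characters and ignores
# word and sentence boundaries entirely.
def naive_split(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative sample text (an assumption, not one of the lab documents).
text = (
    "LoRA freezes the base model weights. "
    "It trains small low-rank matrices instead. "
    "This cuts trainable parameters dramatically."
)

for size in (16, 48, 400):
    pieces = naive_split(text, size)
    # Small sizes produce fragments with no context, mid sizes cut
    # through a sentence, and a very large size returns everything at once.
    print(f"size={size:>3}: {len(pieces)} chunk(s), first = {pieces[0]!r}")
```

Running it shows each problem in turn: a fragment that ends mid-word, a chunk that ends mid-sentence, and a single chunk that contains the whole text whether or not it is all relevant.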

---

## Part 1: Setup

In [None]:
# Install dependencies
!pip install -q langchain langchain-community langchain-huggingface langchain-text-splitters chromadb sentence-transformers nltk

import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("✅ Dependencies installed!")

In [None]:
import os
import time
import re
from pathlib import Path
from typing import List, Dict, Tuple, Any
from dataclasses import dataclass
import numpy as np

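# Import from the dedicated packages; the older `langchain.text_splitter`
# and `langchain.schema` paths are deprecated in recent LangChain releases
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    SentenceTransformersTokenTextSplitter
)
from langchain_core.documents import Document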
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

import torch
import gc

print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

In [None]:
# Load our sample documents
DOCS_PATH = Path("../data/sample_documents")

documents = []
for file_path in sorted(DOCS_PATH.glob("*.md")):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    documents.append(Document(
        page_content=content,
        metadata={"source": file_path.name}
    ))
    print(f"📄 Loaded: {file_path.name} ({len(content):,} chars)")

print(f"\n📚 Total: {len(documents)} documents")

---

## Part 2: Chunking Strategies

We'll implement and compare three chunking strategies in five configurations:

| Strategy | Description | Best For |
|----------|-------------|----------|
| Fixed-Size (Small) | 256 chars, 25 overlap | Precise retrieval |
| Fixed-Size (Medium) | 512 chars, 50 overlap | Balanced |
| Fixed-Size (Large) | 1024 chars, 100 overlap | Rich context |
| Semantic (Headers) | Split by markdown headers | Structured docs |
| Sentence-Based | Group sentences | Natural boundaries |

### 2.1 Fixed-Size Chunking

In [None]:
def create_fixed_size_chunks(
    documents: List[Document],
    chunk_size: int = 512,
    chunk_overlap: int = 50
) -> List[Document]:
    """
    Split documents into fixed-size chunks with overlap.
    Uses recursive splitting to try to maintain semantic boundaries.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    return splitter.split_documents(documents)

# Create chunks with different sizes
print("📏 Creating fixed-size chunks...")

chunks_256 = create_fixed_size_chunks(documents, chunk_size=256, chunk_overlap=25)
chunks_512 = create_fixed_size_chunks(documents, chunk_size=512, chunk_overlap=50)
chunks_1024 = create_fixed_size_chunks(documents, chunk_size=1024, chunk_overlap=100)

print(f"\n📊 Chunk Counts:")
print(f"   256 chars:  {len(chunks_256):4d} chunks (avg: {np.mean([len(c.page_content) for c in chunks_256]):.0f} chars)")
print(f"   512 chars:  {len(chunks_512):4d} chunks (avg: {np.mean([len(c.page_content) for c in chunks_512]):.0f} chars)")
print(f"   1024 chars: {len(chunks_1024):4d} chunks (avg: {np.mean([len(c.page_content) for c in chunks_1024]):.0f} chars)")
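
The role of `chunk_overlap` is easier to see in a plain sliding window. The sketch below uses a hypothetical helper, `sliding_window_chunks`, that is not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter prefers separator boundaries, so its chunks vary in length and its overlap is approximate. The window here only illustrates the size and overlap arithmetic.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` chars that share `chunk_overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the end of the text is covered; avoid a redundant tail chunk
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(alphabet, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is what lets a fact that sits on a boundary appear whole in at least one chunk.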

### 2.2 Semantic Chunking (By Headers)

In [None]:
def create_semantic_chunks(
    documents: List[Document],
    max_chunk_size: int = 1500
) -> List[Document]:
    """
    Split documents by markdown headers, preserving semantic structure.
    Falls back to fixed-size if chunks are too large.
    """
    # Define header hierarchy
    headers_to_split_on = [
        ("#", "header_1"),
        ("##", "header_2"),
        ("###", "header_3"),
    ]
    
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=True  # Headers are re-added below from metadata; keeping them here would duplicate them
    )
    
    # Secondary splitter for oversized chunks
    size_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=100
    )
    
    all_chunks = []
    
    for doc in documents:
        # First split by headers
        header_chunks = markdown_splitter.split_text(doc.page_content)
        
        for chunk in header_chunks:
            # Reconstruct header context
            header_context = ""
            for key in ['header_1', 'header_2', 'header_3']:
                if key in chunk.metadata:
                    level = int(key[-1])
                    header_context += "#" * level + " " + chunk.metadata[key] + "\n"
            
            content = header_context + chunk.page_content
            
            # If chunk is too large, split further
            if len(content) > max_chunk_size:
                sub_chunks = size_splitter.split_text(content)
                for i, sub in enumerate(sub_chunks):
                    all_chunks.append(Document(
                        page_content=sub,
                        metadata={
                            "source": doc.metadata["source"],
                            **chunk.metadata,
                            "sub_chunk": i
                        }
                    ))
            else:
                all_chunks.append(Document(
                    page_content=content,
                    metadata={
                        "source": doc.metadata["source"],
                        **chunk.metadata
                    }
                ))
    
    return all_chunks

print("📑 Creating semantic chunks (by headers)...")
chunks_semantic = create_semantic_chunks(documents)

print(f"   Created {len(chunks_semantic)} chunks")
print(f"   Avg size: {np.mean([len(c.page_content) for c in chunks_semantic]):.0f} chars")
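
For intuition about what header-based splitting does, here is a simplified pure-Python sketch. The function `split_by_headers` is a hypothetical stand-in, not the `MarkdownHeaderTextSplitter` implementation. It tracks the current header path and starts a new section at every `#`, `##`, or `###` line. One known limitation of this sketch: it does not recognize fenced code blocks, so a `# comment` line inside code would be misread as a header.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split markdown into sections, each tagged with its header hierarchy."""
    sections = []
    path = {}   # header level -> title for the section being collected
    body = []   # lines of the section being collected

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headers": dict(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section before changing the path
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


sample = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n"
for section in split_by_headers(sample):
    print(section)
```

Every section carries its full header path, so a chunk taken from "Usage" still knows it belongs to "Guide".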

### 2.3 Sentence-Based Chunking

In [None]:
def create_sentence_chunks(
    documents: List[Document],
    sentences_per_chunk: int = 5,
    sentence_overlap: int = 1
) -> List[Document]:
    """
    Split documents by sentences, grouping N sentences per chunk.
    Maintains natural language boundaries.
    """
    from nltk.tokenize import sent_tokenize
    
    all_chunks = []
    
    for doc in documents:
        # Tokenize into sentences
        sentences = sent_tokenize(doc.page_content)
        
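        # The step between chunk starts must be positive, or the loop never advances
        step = sentences_per_chunk - sentence_overlap
        if step <= 0:
            raise ValueError("sentence_overlap must be smaller than sentences_per_chunk")
        
        # Group sentences with overlap
        i = 0
        chunk_idx = 0
        while i < len(sentences):
            # Get chunk sentences
            end = min(i + sentences_per_chunk, len(sentences))
            chunk_sentences = sentences[i:end]
            chunk_text = " ".join(chunk_sentences)
            
            all_chunks.append(Document(
                page_content=chunk_text,
                metadata={
                    "source": doc.metadata["source"],
                    "chunk_idx": chunk_idx,
                    "sentence_count": len(chunk_sentences)
                }
            ))
            
            # Stop once the last sentence is covered; otherwise a trailing chunk
            # would contain only the overlap sentences already emitted
            if end == len(sentences):
                break
            
            # Move forward with overlap
            i += step
            chunk_idx += 1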
    
    return all_chunks

print("📝 Creating sentence-based chunks...")
chunks_sentence = create_sentence_chunks(documents, sentences_per_chunk=5, sentence_overlap=1)

print(f"   Created {len(chunks_sentence)} chunks")
print(f"   Avg size: {np.mean([len(c.page_content) for c in chunks_sentence]):.0f} chars")
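
The grouping logic comes down to index arithmetic over the sentence list. The sketch below isolates that arithmetic in a hypothetical helper, `sentence_windows`, which returns the `(start, end)` slice bounds for each group and stops as soon as the last sentence is covered. It works on counts only, so it runs without NLTK.

```python
def sentence_windows(n_sentences: int, per_chunk: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) slice bounds for overlapping groups of sentences."""
    step = per_chunk - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than per_chunk")

    windows = []
    start = 0
    while start < n_sentences:
        end = min(start + per_chunk, n_sentences)
        windows.append((start, end))
        if end == n_sentences:
            break  # the final sentence is included; nothing new remains
        start += step
    return windows


print(sentence_windows(12, per_chunk=5, overlap=1))
# [(0, 5), (4, 9), (8, 12)]
```

With 12 sentences, 5 per chunk, and 1 sentence of overlap, sentences 4 and 8 each appear in two neighboring chunks.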

### Compare Chunk Characteristics

In [None]:
strategies = {
    "Fixed-256": chunks_256,
    "Fixed-512": chunks_512,
    "Fixed-1024": chunks_1024,
    "Semantic": chunks_semantic,
    "Sentence": chunks_sentence
}

print("📊 Chunking Strategy Comparison:")
print("=" * 70)
print(f"{'Strategy':<15} {'Count':<10} {'Avg Size':<12} {'Min':<8} {'Max':<8}")
print("-" * 70)

for name, chunks in strategies.items():
    sizes = [len(c.page_content) for c in chunks]
    print(f"{name:<15} {len(chunks):<10} {np.mean(sizes):<12.0f} {min(sizes):<8} {max(sizes):<8}")

---

## Part 3: Visual Comparison

In [None]:
# Let's look at how each strategy chunks the same content
sample_doc = documents[0]  # DGX Spark guide

print(f"📄 Sample Document: {sample_doc.metadata['source']}")
print(f"   Total length: {len(sample_doc.page_content):,} characters")
print("\n" + "=" * 70)

# Show first few chunks from each strategy for this document
for name, all_chunks in strategies.items():
    # Filter chunks from this document
    doc_chunks = [c for c in all_chunks if c.metadata.get('source') == sample_doc.metadata['source']]
    
    print(f"\n🔹 {name}: {len(doc_chunks)} chunks")
    
    if doc_chunks:
        # Show first chunk
        first_chunk = doc_chunks[0].page_content[:200]
        print(f"   First chunk preview: '{first_chunk}...'")

---

## Part 4: Retrieval Quality Evaluation

Now let's actually test which chunking strategy retrieves the best results!

### Create Evaluation Dataset

In [None]:
# Evaluation Q&A pairs with expected source documents
eval_dataset = [
    {
        "question": "How much unified memory does DGX Spark have?",
        "expected_source": "dgx_spark_technical_guide.md",
        "expected_keywords": ["128GB", "unified memory", "LPDDR5X"]
    },
    {
        "question": "What is the attention mechanism in transformers?",
        "expected_source": "transformer_architecture_explained.md",
        "expected_keywords": ["attention", "query", "key", "value"]
    },
    {
        "question": "How does LoRA reduce training memory requirements?",
        "expected_source": "lora_finetuning_guide.md",
        "expected_keywords": ["low-rank", "decomposition", "parameters", "trainable"]
    },
    {
        "question": "What is GPTQ quantization?",
        "expected_source": "quantization_methods.md",
        "expected_keywords": ["GPTQ", "quantization", "4-bit", "weight"]
    },
    {
        "question": "What are the advantages of RAG over fine-tuning?",
        "expected_source": "rag_architecture_patterns.md",
        "expected_keywords": ["RAG", "retrieval", "dynamic", "grounded"]
    },
    {
        "question": "How does FAISS compare to ChromaDB?",
        "expected_source": "vector_database_comparison.md",
        "expected_keywords": ["FAISS", "ChromaDB", "GPU", "performance"]
    },
    {
        "question": "What are Tensor Cores used for?",
        "expected_source": "dgx_spark_technical_guide.md",
        "expected_keywords": ["tensor core", "AI", "compute"]
    },
    {
        "question": "How do positional encodings work?",
        "expected_source": "transformer_architecture_explained.md",
        "expected_keywords": ["positional", "encoding", "sinusoidal", "position"]
    },
    {
        "question": "What is QLoRA and how does it differ from LoRA?",
        "expected_source": "lora_finetuning_guide.md",
        "expected_keywords": ["QLoRA", "4-bit", "quantized", "NF4"]
    },
    {
        "question": "What is hybrid search in RAG?",
        "expected_source": "rag_architecture_patterns.md",
        "expected_keywords": ["hybrid", "dense", "sparse", "BM25"]
    }
]

print(f"📋 Created {len(eval_dataset)} evaluation questions")
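
Before trusting the metrics, it can help to check that the evaluation set matches the corpus: a question whose `expected_source` was never loaded can never score a hit, and a keyword that appears nowhere in its source caps keyword coverage below 100%. The sketch below is one possible sanity check. `audit_eval_dataset` is a hypothetical helper, and the toy corpus and items are invented so that the example runs on its own.

```python
def audit_eval_dataset(eval_items: list[dict], corpus: dict[str, str]) -> list[str]:
    """Return human-readable problems; `corpus` maps filename -> full document text."""
    problems = []
    for item in eval_items:
        source = item["expected_source"]
        if source not in corpus:
            problems.append(f"{source}: document was not loaded")
            continue
        text = corpus[source].lower()
        # Same case-insensitive substring test the evaluation uses.
        missing = [kw for kw in item["expected_keywords"] if kw.lower() not in text]
        if missing:
            problems.append(f"{source}: keywords never appear in the document: {missing}")
    return problems


# Toy data for illustration only.
toy_corpus = {"a.md": "GPTQ quantizes weights to 4-bit precision."}
toy_items = [
    {"expected_source": "a.md", "expected_keywords": ["GPTQ", "NF4"]},
    {"expected_source": "b.md", "expected_keywords": ["attention"]},
]
for problem in audit_eval_dataset(toy_items, toy_corpus):
    print(problem)
```

In the notebook, the corpus could be built as `{d.metadata["source"]: d.page_content for d in documents}` and passed in together with `eval_dataset`.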

### Load Embedding Model

In [None]:
print("🔄 Loading embedding model...")

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True, "batch_size": 32}
)

print("✅ Embedding model loaded!")

### Evaluate Each Strategy

In [None]:
@dataclass
class EvalResult:
    """Results from evaluating a chunking strategy."""
    strategy_name: str
    source_recall_at_1: float  # Correct source in top 1
    source_recall_at_3: float  # Correct source in top 3
    source_recall_at_5: float  # Correct source in top 5
    keyword_coverage: float    # % of expected keywords found
    avg_retrieval_time_ms: float
    chunk_count: int


def evaluate_chunking_strategy(
    strategy_name: str,
    chunks: List[Document],
    embedding_model: HuggingFaceEmbeddings,
    eval_dataset: List[Dict],
    k: int = 5
) -> EvalResult:
    """
    Evaluate a chunking strategy on the evaluation dataset.
    """
    import shutil
    
    # Create a temporary vector store
    db_path = f"./temp_chroma_{strategy_name.replace('-', '_')}"
    if Path(db_path).exists():
        shutil.rmtree(db_path)
    
    # Build vector store
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_model,
        persist_directory=db_path
    )
    
    # Evaluate
    source_correct_at_1 = 0
    source_correct_at_3 = 0
    source_correct_at_5 = 0
    keyword_scores = []
    retrieval_times = []
    
    for item in eval_dataset:
        question = item["question"]
        expected_source = item["expected_source"]
        expected_keywords = item["expected_keywords"]
        
        # Time the retrieval (this includes embedding the query)
        start_time = time.perf_counter()
        results = vectorstore.similarity_search(question, k=k)
        retrieval_times.append((time.perf_counter() - start_time) * 1000)
        
        # Check source recall
        retrieved_sources = [r.metadata.get('source') for r in results]
        
        if retrieved_sources and retrieved_sources[0] == expected_source:
            source_correct_at_1 += 1
        if expected_source in retrieved_sources[:3]:
            source_correct_at_3 += 1
        if expected_source in retrieved_sources[:5]:
            source_correct_at_5 += 1
        
        # Check keyword coverage
        all_content = " ".join([r.page_content.lower() for r in results])
        keywords_found = sum(1 for kw in expected_keywords if kw.lower() in all_content)
        keyword_scores.append(keywords_found / len(expected_keywords))
    
    # Cleanup
    shutil.rmtree(db_path)
    
    n = len(eval_dataset)
    return EvalResult(
        strategy_name=strategy_name,
        source_recall_at_1=source_correct_at_1 / n,
        source_recall_at_3=source_correct_at_3 / n,
        source_recall_at_5=source_correct_at_5 / n,
        keyword_coverage=np.mean(keyword_scores),
        avg_retrieval_time_ms=np.mean(retrieval_times),
        chunk_count=len(chunks)
    )
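
The two quality metrics are simple enough to compute by hand on toy data, which is a useful check on how to read them. The sketch below restates them as standalone functions. The names and the `whole_word` option are assumptions added for illustration. Keyword coverage as used above is a case-insensitive substring test, so a short keyword such as "AI" also matches inside words like "main" or "training" and can inflate the score. The `whole_word` variant avoids that, at the cost of no longer matching "weight" inside "weights".

```python
import re


def source_recall_at_k(retrieved_sources: list[str], expected_source: str, k: int) -> float:
    """1.0 if the expected document is among the top-k retrieved sources, else 0.0."""
    return 1.0 if expected_source in retrieved_sources[:k] else 0.0


def keyword_coverage(retrieved_texts: list[str], expected_keywords: list[str],
                     whole_word: bool = False) -> float:
    """Fraction of expected keywords found in the concatenated retrieved text."""
    haystack = " ".join(retrieved_texts).lower()

    def found(keyword: str) -> bool:
        keyword = keyword.lower()
        if whole_word:
            return re.search(rf"\b{re.escape(keyword)}\b", haystack) is not None
        return keyword in haystack  # substring match, as in the evaluation above

    return sum(found(kw) for kw in expected_keywords) / len(expected_keywords)


# Toy retrieval result: the right document is ranked second.
ranked = ["b.md", "a.md", "c.md"]
print(source_recall_at_k(ranked, "a.md", k=1))  # 0.0
print(source_recall_at_k(ranked, "a.md", k=3))  # 1.0

texts = ["The main container holds training data"]
print(keyword_coverage(texts, ["AI"]))                   # 1.0 (matches inside "main")
print(keyword_coverage(texts, ["AI"], whole_word=True))  # 0.0
```

Averaging `source_recall_at_k` over all questions gives the R@K columns reported in the results table.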

In [None]:
# Evaluate all strategies
print("🔬 Evaluating chunking strategies...")
print("   This may take a few minutes...\n")

results = []
for name, chunks in strategies.items():
    print(f"   Evaluating {name}...")
    result = evaluate_chunking_strategy(
        strategy_name=name,
        chunks=chunks,
        embedding_model=embedding_model,
        eval_dataset=eval_dataset
    )
    results.append(result)
    print(f"   ✅ {name}: Recall@5 = {result.source_recall_at_5:.0%}")

print("\n✅ Evaluation complete!")

### Results Summary

In [None]:
print("\n📊 EVALUATION RESULTS")
print("=" * 90)
print(f"{'Strategy':<15} {'Chunks':<8} {'R@1':<8} {'R@3':<8} {'R@5':<8} {'Keywords':<10} {'Time(ms)':<10}")
print("-" * 90)

for r in sorted(results, key=lambda x: x.source_recall_at_5, reverse=True):
    print(f"{r.strategy_name:<15} {r.chunk_count:<8} {r.source_recall_at_1:<8.0%} "
          f"{r.source_recall_at_3:<8.0%} {r.source_recall_at_5:<8.0%} "
          f"{r.keyword_coverage:<10.0%} {r.avg_retrieval_time_ms:<10.1f}")

print("=" * 90)
print("\n📝 Legend:")
print("   R@K = Source Recall at K (correct document in top K results)")
print("   Keywords = Percentage of expected keywords found in retrieved chunks")

In [None]:
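# Find the winner: rank by Recall@5 plus keyword coverage, weighted equally.
# On a tie, max() keeps the first strategy in the list.
best_result = max(results, key=lambda r: r.source_recall_at_5 + r.keyword_coverage)

print(f"\n🏆 Best Strategy: {best_result.strategy_name}")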
print(f"   - Recall@5: {best_result.source_recall_at_5:.0%}")
print(f"   - Keyword Coverage: {best_result.keyword_coverage:.0%}")
print(f"   - Chunk Count: {best_result.chunk_count}")

---

## Part 5: Analysis and Recommendations

### When to Use Each Strategy

| Strategy | Best For | Avoid When |
|----------|----------|------------|
| **Fixed-256** | Very specific queries, code snippets | Long-form explanations needed |
| **Fixed-512** | General purpose, balanced | Highly structured documents |
| **Fixed-1024** | Complex topics, rich context | Precise retrieval required |
| **Semantic** | Structured documents (manuals, documentation) | Unstructured text |
| **Sentence** | Natural language, Q&A | Technical/code-heavy docs |

In [None]:
# Let's examine where each strategy succeeds and fails
print("🔍 Detailed Analysis: Looking at specific examples\n")

# Test a specific question with all strategies
test_q = eval_dataset[0]  # "How much unified memory does DGX Spark have?"
print(f"Question: {test_q['question']}")
print(f"Expected Source: {test_q['expected_source']}")
print("-" * 70)

import shutil

for name, chunks in strategies.items():
    # Quick search (build temp vectorstore)
    db_path = f"./temp_analysis_{name.replace('-', '_')}"
    if Path(db_path).exists():
        shutil.rmtree(db_path)
    
    vs = Chroma.from_documents(chunks, embedding_model, persist_directory=db_path)
    top_result = vs.similarity_search(test_q['question'], k=1)[0]
    
    is_correct = top_result.metadata.get('source') == test_q['expected_source']
    icon = "✅" if is_correct else "❌"
    
    print(f"\n{icon} {name}:")
    print(f"   Retrieved: {top_result.metadata.get('source')}")
    print(f"   Content: {top_result.page_content[:100]}...")
    
    shutil.rmtree(db_path)

---

## ⚠️ Common Mistakes

### Mistake 1: No Overlap Between Chunks
```python
# ❌ Wrong: A sentence that straddles a boundary is split, so neither chunk holds it whole
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)

# ✅ Right: 10-20% overlap repeats boundary text in both neighboring chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
```

### Mistake 2: Using the Same Strategy for All Document Types
```python
# ❌ Wrong: One size fits all
all_chunks = splitter.split_documents(all_documents)

# ✅ Right: Different strategies for different content
# (fixed_size_splitter and semantic_splitter are placeholder names)
code_chunks = fixed_size_splitter.split_documents(code_docs)
doc_chunks = semantic_splitter.split_documents(documentation)
```
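
One way to route documents is a small dispatcher that inspects each file and returns the name of a strategy. The sketch below is a rough heuristic and not a recommendation from this lab: the function `choose_strategy`, the list of code file extensions, and the threshold of three headers are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Assumed list of source-code extensions; extend it for your own corpus.
CODE_SUFFIXES = {".py", ".js", ".ts", ".java", ".go", ".rs"}


def choose_strategy(filename: str, text: str, min_headers: int = 3) -> str:
    """Pick "fixed", "semantic", or "sentence" from the file type and structure."""
    if Path(filename).suffix.lower() in CODE_SUFFIXES:
        return "fixed"  # code has no prose sentences or markdown sections to follow

    # Count markdown headers (levels 1-3) as a proxy for document structure.
    header_count = sum(
        1 for line in text.splitlines() if re.match(r"#{1,3}\s+\S", line)
    )
    if header_count >= min_headers:
        return "semantic"
    return "sentence"  # plain prose: follow natural language boundaries


print(choose_strategy("train.py", "# load data\n# train\n# save"))
print(choose_strategy("guide.md", "# A\ntext\n## B\ntext\n### C\ntext"))
print(choose_strategy("notes.txt", "Just prose. More prose."))
```

The returned name can then select one of the chunking functions defined earlier, and the choice should still be validated with the retrieval metrics rather than assumed.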

### Mistake 3: Not Preserving Metadata
```python
# ❌ Wrong: Source information is lost
chunks = [chunk.page_content for chunk in split_result]

# ✅ Right: Keep metadata for citation
chunks = splitter.split_documents(documents)  # Preserves metadata
```

---

## ✋ Try It Yourself

### Exercise 1: Custom Chunk Size
Try `chunk_size=384` with `overlap=64`. How does it compare?

### Exercise 2: Paragraph-Based Chunking
Implement a chunking strategy that splits only on `\n\n` (paragraph breaks).

### Exercise 3: Add Your Own Documents
Add 2-3 new documents and see if the best strategy changes.

<details>
<summary>💡 Hint for Exercise 2</summary>

```python
paragraph_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=2000,  # Max size; shorter paragraphs are merged up to this limit
    chunk_overlap=0
)
```
</details>

---

## 🎉 Checkpoint

You've learned:
- ✅ Why chunking strategy matters for RAG quality
- ✅ How to implement fixed-size, semantic, and sentence-based chunking
- ✅ How to evaluate chunking strategies with retrieval metrics
- ✅ When to use different chunking strategies

**Key Insight:** There's no universally "best" chunking strategy. The optimal choice depends on your documents and queries. Always evaluate on your specific use case!

---

## 🧹 Cleanup

In [None]:
# Clean up
del embedding_model
gc.collect()
torch.cuda.empty_cache()

# Remove any temp databases
import shutil
for p in Path(".").glob("temp_*"):
    if p.is_dir():
        shutil.rmtree(p)

print("✅ Cleanup complete!")

---

## Next Steps

In the next lab, we'll compare different **vector databases** (ChromaDB, FAISS, Qdrant) to find the best one for your use case!

➡️ Continue to [Lab 3.5.3: Vector Database Comparison](./lab-3.5.3-vector-dbs.ipynb)