# Module 05 - Notebook 05: Chunking Strategies

## Learning Objectives
- Understand why chunking is necessary
- Implement different chunking strategies
- Optimize chunk size for retrieval
- Preserve context with overlap
- Handle different document types

---

## 1. Why Chunking Matters

### The Problem:
- Embedding models have **token limits** (e.g., 8192 tokens)
- A single embedding for a long document loses **granularity**: distinct topics blur into one vector
- **Retrieval precision** suffers when chunks are large

### The Solution: Chunking
- Break documents into **smaller pieces**
- Each chunk gets its own embedding
- Retrieve **most relevant** chunks, not entire documents

### Chunking Strategies:
1. **Fixed-size**: Split by character/token count
2. **Sentence-based**: Split at sentence boundaries
3. **Paragraph-based**: Split at paragraph boundaries
4. **Semantic**: Split at topic changes
5. **Recursive**: Try multiple strategies in order
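
Strategies 2 and 3 can be illustrated without any library. The following is a minimal sketch using only the standard library; the regular expressions are simplifying assumptions (the sentence pattern will mis-split abbreviations such as "e.g."), and the function names are hypothetical.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text: str) -> list[str]:
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = (
    "Chunking helps retrieval. It keeps embeddings focused.\n\n"
    "Overlap preserves context."
)
paragraphs = split_paragraphs(text)
sentences = split_sentences(text)
print(f"{len(paragraphs)} paragraphs, {len(sentences)} sentences")
```

Paragraph splits keep related sentences together, while sentence splits give finer retrieval units at the cost of less context per chunk.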

## 2. Setup

In [None]:
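!pip install -q langchain langchain-text-splitters tiktoken sentence-transformers scikit-learn chromadb openai python-dotenv

In [None]:
# Text splitters live in the standalone langchain-text-splitters package
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)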
import tiktoken

# Sample long document
sample_doc = """
Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It has revolutionized many industries.

There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning uses labeled data to train models. For example, you might train a model to recognize cats by showing it many images labeled "cat" or "not cat".

Unsupervised learning finds patterns in unlabeled data. Clustering algorithms are a common example of unsupervised learning.

Reinforcement learning trains agents through rewards and punishments. It's commonly used in game AI and robotics.

Deep learning is a subset of machine learning that uses neural networks with many layers. It has achieved remarkable success in image recognition, natural language processing, and many other domains.
"""

print(f"Document length: {len(sample_doc)} characters")
print(f"Word count: {len(sample_doc.split())} words")

## 3. Fixed-Size Chunking

In [None]:
# Character-based splitting: split on the separator, then merge pieces up to chunk_size
char_splitter = CharacterTextSplitter(
    separator="\n\n",  # Split on double newlines (paragraphs)
    chunk_size=200,    # Target max characters per chunk (a longer paragraph is kept whole)
    chunk_overlap=20,  # Overlap between chunks
    length_function=len
)

char_chunks = char_splitter.split_text(sample_doc)

print(f"Fixed-size chunking: {len(char_chunks)} chunks\n")
for i, chunk in enumerate(char_chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk[:100] + "..." if len(chunk) > 100 else chunk)
    print()
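
With a separator set, the splitter above does not cut at exact character positions. It splits on the separator and then merges neighbouring pieces while they fit. The sketch below imitates that idea in plain Python; it is a simplified illustration (no overlap handling) rather than LangChain's implementation, and `split_and_merge` is a hypothetical name.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size`."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the open chunk
        else:
            if current:
                chunks.append(current)   # close the open chunk
            current = piece              # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks

demo = "short one\n\nshort two\n\n" + "x" * 30
demo_chunks = split_and_merge(demo, "\n\n", chunk_size=25)
for c in demo_chunks:
    print(len(c), repr(c))
```

The last piece is 30 characters long and has no separator inside it, so it is emitted whole even though it is over the limit. This is why `chunk_size` behaves as a target rather than a hard cap when splitting on a single separator.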

## 4. Token-Based Chunking

In [None]:
# Token-based splitting (uses the same tiktoken encoding we count with below)
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # Default is "gpt2"; match the counter below
    chunk_size=50,     # Max tokens per chunk
    chunk_overlap=10   # Overlap in tokens
)

token_chunks = token_splitter.split_text(sample_doc)

print(f"Token-based chunking: {len(token_chunks)} chunks\n")

# Count tokens in each chunk
enc = tiktoken.get_encoding("cl100k_base")  # Same encoding as the splitter
for i, chunk in enumerate(token_chunks[:3], 1):  # Show first 3
    tokens = enc.encode(chunk)
    print(f"Chunk {i} ({len(tokens)} tokens):")
    print(chunk[:100] + "..." if len(chunk) > 100 else chunk)
    print()
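
Token-based splitting amounts to sliding a window over the token sequence with a stride of `chunk_size - chunk_overlap`. The sketch below works on token counts only, so it needs no tokenizer; the 120-token document is a made-up example and both function names are hypothetical.

```python
import math

def window_spans(n_tokens: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs for overlapping windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break                      # the last window reached the end
        start += stride
    return spans

def expected_window_count(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Closed form for the number of windows produced by window_spans."""
    if n_tokens <= 0:
        return 0
    return max(1, math.ceil((n_tokens - overlap) / (chunk_size - overlap)))

spans = window_spans(120, chunk_size=50, overlap=10)
print(spans)
print(expected_window_count(120, 50, 10))
```

With the settings used in this section (50 tokens, 10 overlap) the stride is 40, so a 120-token text yields three windows. Library versions may differ in how they treat the final partial window, so treat the count as an estimate.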

## 5. Recursive Chunking (Smart)

In [None]:
# Recursive splitting tries multiple separators in order
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separators=["\n\n", "\n", ".", " ", ""],  # Try in order
    length_function=len
)

recursive_chunks = recursive_splitter.split_text(sample_doc)

print(f"Recursive chunking: {len(recursive_chunks)} chunks\n")
for i, chunk in enumerate(recursive_chunks, 1):
    print(f"Chunk {i}:")
    print(chunk)
    print(f"  [{len(chunk)} chars]\n")
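
To show what "try separators in order" means, here is a simplified recursive splitter in plain Python. It is a sketch of the idea and not LangChain's implementation: it has no overlap, it drops separators at chunk boundaries, it expects non-empty separators, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to try: fall back to hard character cuts.
        cuts = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        return [c for c in cuts if c.strip()]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator does not help; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

demo = "aaa bbb ccc\n\nddd eee fff ggg hhh"
demo_chunks = recursive_split(demo, chunk_size=12, separators=["\n\n", " "])
print(demo_chunks)
```

The first paragraph fits as a whole, while the second is too long and is split again on spaces. Coarse boundaries are kept wherever possible and finer ones are used only when needed.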

## 6. Chunk Overlap

Overlap repeats the tail of one chunk at the head of the next, so text near a boundary keeps some of its surrounding context.

In [None]:
# Compare with and without overlap.
# RecursiveCharacterTextSplitter is used here because it can split inside a
# paragraph; CharacterTextSplitter only splits on "\n\n", so paragraphs longer
# than chunk_size would be kept whole and no overlap would appear.
no_overlap = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0
).split_text(sample_doc)

with_overlap = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
).split_text(sample_doc)

print(f"Without overlap: {len(no_overlap)} chunks")
print(f"With overlap: {len(with_overlap)} chunks")
print("\nOverlap example:")
print("Chunk 1 end:", with_overlap[0][-40:])
print("Chunk 2 start:", with_overlap[1][:40])
print("\n↑ Notice the shared text between chunks")
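
Rather than checking overlap by eye, you can measure it. The helper below is a small sketch (the name `shared_overlap` is hypothetical) that returns the longest suffix of one chunk that is also a prefix of the next.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):
        if left[-size:] == right[:size]:
            return left[-size:]
    return ""

a = "machine learning is"
b = "learning is fun"
print(repr(shared_overlap(a, b)))
```

Applied to consecutive chunks from a splitter, this shows how much text is actually shared. Splitters that respect word boundaries may share fewer characters than the configured `chunk_overlap`.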

## 7. Preserving Metadata

In [None]:
from langchain_core.documents import Document

# Create document with metadata
doc = Document(
    page_content=sample_doc,
    metadata={
        "source": "ml_tutorial.txt",
        "author": "AI Teacher",
        "date": "2024-01-01"
    }
)

# Split preserving metadata
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)

chunks = splitter.split_documents([doc])

# Each chunk inherits the source document's metadata.
# Record every chunk's position (not only the ones printed below).
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_index"] = i

print(f"Created {len(chunks)} chunks\n")
for i, chunk in enumerate(chunks[:2], 1):  # Show first 2
    print(f"Chunk {i}:")
    print(f"  Content: {chunk.page_content[:80]}...")
    print(f"  Metadata: {chunk.metadata}")
    print()
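
The same idea can be shown without LangChain: copy the source metadata onto every chunk and add position fields. This is a minimal sketch with a hypothetical `Chunk` dataclass, a hypothetical file name `demo.txt`, and plain fixed-width cuts standing in for a real splitter.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_with_metadata(text: str, metadata: dict, chunk_size: int) -> list[Chunk]:
    """Cut `text` into fixed-width chunks that each carry their own metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunk_meta = dict(metadata)   # copy, so chunks do not share one dict
        chunk_meta.update(chunk_index=index, start_char=start)
        chunks.append(Chunk(text[start:start + chunk_size], chunk_meta))
    return chunks

source = "abcdefghij"
pieces = chunk_with_metadata(source, {"source": "demo.txt"}, chunk_size=4)
for p in pieces:
    print(p.text, p.metadata)
```

Storing `start_char` lets you map a retrieved chunk back to its location in the original document. LangChain splitters offer a similar option through the `add_start_index=True` constructor argument.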

## 8. Optimal Chunk Size

In [None]:
# Test different chunk sizes
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Query to test
query = "What is supervised learning?"
query_emb = model.encode(query)

# Test different chunk sizes
chunk_sizes = [50, 100, 200, 400]
results = []

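for size in chunk_sizes:
    # Recursive splitting falls back to finer separators, so small sizes
    # actually produce smaller chunks (paragraph-only splitting would not)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=int(size * 0.1)  # 10% overlap
    )
    chunks = splitter.split_text(sample_doc)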
    embeddings = model.encode(chunks)
    
    # Find best match
    similarities = cosine_similarity([query_emb], embeddings)[0]
    best_score = max(similarities)
    best_chunk = chunks[np.argmax(similarities)]
    
    results.append({
        "chunk_size": size,
        "num_chunks": len(chunks),
        "best_score": best_score,
        "best_chunk": best_chunk[:100]
    })

print(f"Query: '{query}'\n")
print("Results by chunk size:\n")
for r in results:
    print(f"Size {r['chunk_size']}: {r['num_chunks']} chunks, score {r['best_score']:.3f}")
    print(f"  Best match: {r['best_chunk']}...\n")
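
The score compared across chunk sizes is cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on tiny made-up vectors, so the numbers are illustrative and are not real embeddings.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query_vec: list[float], chunk_vecs: list[list[float]]) -> tuple[int, float]:
    """Return (index, score) of the chunk vector most similar to the query."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

query_vec = [1.0, 0.0]
chunk_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
index, score = best_match(query_vec, chunk_vecs)
print(index, round(score, 3))
```

Because the measure ignores vector length, `[2.0, 0.0]` matches the query perfectly even though it is twice as long. Only the direction of the embedding matters.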

## 9. Document-Specific Chunking

In [None]:
# Different strategies for different document types

# Code documentation
code_doc = '''
def calculate_sum(a, b):
    """Add two numbers."""
    return a + b

def calculate_product(a, b):
    """Multiply two numbers."""
    return a * b
'''

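# Split on function boundaries
# chunk_size is kept below the combined length of the two functions;
# otherwise the splitter merges them back into a single chunk
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    separators=["\ndef ", "\nclass ", "\n\n", "\n"],
    chunk_overlap=0
)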
code_chunks = code_splitter.split_text(code_doc)
print("Code chunks:")
for i, chunk in enumerate(code_chunks, 1):
    print(f"{i}. {chunk.strip()}\n")

# Markdown document
markdown_doc = '''
# Header 1
Content under header 1.

## Header 2
Content under header 2.

### Header 3
Content under header 3.
'''

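# Split on headers
# chunk_size is kept below the combined length of two sections;
# otherwise the splitter merges the sections back into a single chunk
markdown_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    separators=["\n# ", "\n## ", "\n### ", "\n\n"],
    chunk_overlap=0
)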
md_chunks = markdown_splitter.split_text(markdown_doc)
print("\nMarkdown chunks:")
for i, chunk in enumerate(md_chunks, 1):
    print(f"{i}. {chunk.strip()}\n")
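
Header-based splitting can also be done with a single regular expression when you want exactly one chunk per section regardless of size. This is a minimal sketch: the function name is hypothetical, and it assumes no `#` lines occur inside code blocks, which it would misread as headers.

```python
import re

# Zero-width match at the start of any line that begins with 1-6 '#' and a space.
HEADER_BOUNDARY = re.compile(r"^(?=#{1,6} )", re.MULTILINE)

def split_markdown_sections(text: str) -> list[str]:
    """Return one chunk per markdown section, each starting at its header."""
    parts = HEADER_BOUNDARY.split(text)
    return [p.strip() for p in parts if p.strip()]

md = """
# Header 1
Content under header 1.

## Header 2
Content under header 2.
"""
sections = split_markdown_sections(md)
for s in sections:
    print(repr(s))
```

Each section keeps its header line, which gives the embedding a topical anchor. LangChain also ships a `MarkdownHeaderTextSplitter` for this purpose.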

## 10. Full Pipeline: Load → Chunk → Embed → Store

In [None]:
import chromadb
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
chroma_client = chromadb.Client()

# Full pipeline
def process_document(text: str, chunk_size: int = 200):
    """Complete pipeline: chunk → embed → store."""
    
    # 1. Chunk
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.1)
    )
    chunks = splitter.split_text(text)
    print(f"Step 1: Created {len(chunks)} chunks")
    
    # 2. Embed
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )
    embeddings = [item.embedding for item in response.data]
    print(f"Step 2: Generated {len(embeddings)} embeddings")
    
    # 3. Store (get_or_create + upsert keep this cell safe to re-run)
    collection = chroma_client.get_or_create_collection(
        name=f"chunked_doc_{chunk_size}"
    )
    collection.upsert(
        documents=chunks,
        embeddings=embeddings,
        ids=[f"chunk_{i}" for i in range(len(chunks))]
    )
    print("Step 3: Stored in ChromaDB")
    
    return collection

# Process the sample document
collection = process_document(sample_doc, chunk_size=150)

# Query the collection
query = "Tell me about reinforcement learning"
query_emb = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

results = collection.query(
    query_embeddings=[query_emb],
    n_results=2
)

print(f"\nQuery: '{query}'\n")
print("Results:")
for doc in results['documents'][0]:
    print(f"  • {doc}\n")
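
The pipeline above sends every chunk in a single embedding request, which is fine for a short sample. For larger documents, embedding endpoints typically cap the number of inputs per request, so chunks are usually sent in batches. The helper below is a hedged sketch: `batched` is a hypothetical name and the batch size of 3 is chosen only to keep the output small.

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

demo_chunks = [f"chunk {i}" for i in range(7)]
batches = list(batched(demo_chunks, batch_size=3))
print([len(b) for b in batches])
```

Inside `process_document`, the embedding step could loop over `batched(chunks, n)` and extend the `embeddings` list after each request. The order of the results then still matches the order of `chunks`.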

## Exercise: Build a Smart Chunker

Create an adaptive chunker that chooses the best strategy based on document type.

In [None]:
# TODO: Complete this exercise
class AdaptiveChunker:
    """
    Automatically choose the best chunking strategy.
    """
    
    def detect_document_type(self, text: str) -> str:
        """
        Detect document type (code, markdown, plain text, etc.).
        """
        # TODO: Implement detection logic
        pass
    
    def chunk(self, text: str, chunk_size: int = 200) -> list:
        """
        Chunk text using the best strategy for its type.
        """
        # TODO: Implement
        # 1. Detect document type
        # 2. Choose appropriate splitter
        # 3. Return chunks
        pass

# Test your implementation
# chunker = AdaptiveChunker()
# chunks = chunker.chunk(your_document)
# print(f"Created {len(chunks)} chunks")

## Summary

You learned:
- ✅ Why chunking is essential for embeddings
- ✅ Different chunking strategies (fixed-size, token-based, recursive)
- ✅ Importance of chunk overlap
- ✅ Optimal chunk size selection
- ✅ Document-specific chunking

## Best Practices
1. **Start with 200-500 characters** per chunk
2. **Use 10-20% overlap** to preserve context
3. **Recursive splitting** works well for most cases
4. **Test different sizes** for your specific use case
5. **Preserve metadata** through chunking process
6. **Adapt strategy** to document type

## Next Steps
- 📘 Notebook 06: Real-World Applications