# Module 3: Chunking Strategies & Implementation

## 🎯 Learning Objectives
By the end of this module, you will:
- Understand why chunking is essential for RAG systems
- Implement different chunking strategies using LangChain
- Compare fixed-size, semantic, and recursive chunking approaches
- Optimize chunk sizes for different use cases
- Preserve context and metadata across chunks
- Apply 2025's latest semantic chunking techniques

## 📚 Key Concepts

### Why Chunking is Critical 🔪

**Think of chunking like creating an index for a book:**
- You don't read the entire book to find one fact
- You look up the relevant chapter/section
- RAG does the same with your documents

### The Chunking Challenge
| Too Small 📏 | Just Right ✅ | Too Large 📐 |
|--------------|---------------|---------------|
| Loses context | Preserves meaning | Hard to match |
| Poor retrieval | Good relevance | Noise |
| Fast processing | Balanced | Slow processing |

### 2025 Chunking Evolution 🚀
- **Semantic Chunking**: AI determines natural boundaries
- **Adaptive Chunking**: Dynamic sizing based on content type
- **Contextual Preservation**: Better metadata and overlap strategies
- **Embedding-Based Splitting**: Use embeddings to find semantic breaks

### Common Chunking Strategies
1. **Fixed-Size**: Split by character/token count
2. **Semantic**: Split by meaning and context
3. **Recursive**: Try multiple separators hierarchically
4. **Adaptive**: Adjust parameters by content type
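
To make the first strategy concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. The function name `fixed_size_chunks` and its default values are illustrative assumptions, not part of LangChain; the exercises in this module use LangChain's splitters instead.

```python
def fixed_size_chunks(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Slice text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "Chunking splits long documents into retrievable pieces."
for piece in fixed_size_chunks(sample):
    print(repr(piece))
```

Each window repeats the last `overlap` characters of the previous one, which is why a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.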


## 🛠️ Setup
Let's install and import the required packages for chunking.

In [None]:
# Install required packages
!pip install -q langchain langchain-community langchain-text-splitters tiktoken
!pip install -q sentence-transformers numpy matplotlib
!pip install -q scikit-learn  # For clustering in semantic chunking

In [None]:
import os
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

# LangChain imports (current package layout; the older `langchain.*` paths are deprecated)
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter
)
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader

# For token counting
import tiktoken

# For semantic chunking (2025 approach)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

print("✅ Setup complete!")
print("🔪 Ready to chunk documents!")

## 📄 Exercise 1: Creating Sample Documents for Chunking

Let's create different types of documents to see how chunking strategies work.

In [None]:
# Create sample documents directory
docs_dir = Path("chunking_samples")
docs_dir.mkdir(exist_ok=True)

# Sample 1: Technical documentation
technical_doc = """
# API Documentation

## Authentication

Our API uses Bearer token authentication. Include your API key in the Authorization header:

```
Authorization: Bearer YOUR_API_KEY
```

All API requests must be made over HTTPS. Requests made over plain HTTP will fail.

## Rate Limiting

API calls are rate limited to prevent abuse. The current limits are:

- Free tier: 100 requests per hour
- Pro tier: 1,000 requests per hour  
- Enterprise: 10,000 requests per hour

When you exceed your rate limit, you'll receive a 429 status code with details about when you can make your next request.

## User Management

### Creating Users

To create a new user, send a POST request to `/api/v1/users` with the following payload:

```json
{
  "username": "john_doe",
  "email": "john@example.com",
  "password": "secure_password",
  "role": "user"
}
```

### Updating Users

User information can be updated by sending a PUT request to `/api/v1/users/{user_id}`. Only the user themselves or an admin can update user information.

### Deleting Users

Users can be deleted by sending a DELETE request to `/api/v1/users/{user_id}`. This action is irreversible and will permanently remove all user data.

## Data Analytics

The analytics endpoint provides insights into your application usage. Available metrics include:

- User engagement statistics
- API usage patterns
- Performance metrics
- Error rates and debugging information

Analytics data is updated every 15 minutes and retained for 90 days.

## Error Handling

Our API uses conventional HTTP response codes to indicate success or failure:

- 200: Success
- 400: Bad Request - often missing required parameters
- 401: Unauthorized - invalid or missing API key
- 403: Forbidden - valid API key but insufficient permissions
- 404: Not Found - resource doesn't exist
- 429: Too Many Requests - rate limit exceeded
- 500: Internal Server Error - something went wrong on our end

Error responses include a detailed message explaining what went wrong and how to fix it.
"""

with open(docs_dir / "api_documentation.txt", "w") as f:
    f.write(technical_doc.strip())

print("✅ Created: api_documentation.txt")
print(f"📏 Length: {len(technical_doc.strip())} characters")

In [None]:
# Sample 2: Narrative content (different structure)
narrative_doc = """
The Evolution of Artificial Intelligence in Healthcare

Healthcare has always been at the forefront of technological innovation, but the integration of artificial intelligence represents perhaps the most significant transformation in modern medical practice. From early expert systems in the 1970s to today's sophisticated machine learning algorithms, AI has steadily evolved to address complex medical challenges.

The journey began with simple rule-based systems that could assist doctors in diagnosis. These early systems, while limited, laid the groundwork for more sophisticated approaches. They demonstrated that machines could process medical knowledge in structured ways, even if they couldn't match human intuition and experience.

In the 1990s, the advent of neural networks brought new possibilities. Researchers began experimenting with pattern recognition in medical imaging, leading to breakthrough applications in radiology. The ability to detect subtle patterns in X-rays, CT scans, and MRIs that might escape the human eye became a game-changer for early disease detection.

The real revolution came with deep learning in the 2010s. Convolutional neural networks achieved superhuman performance in specific imaging tasks. Google's DeepMind developed AI systems that could diagnose diabetic retinopathy from retinal photographs with greater accuracy than specialist doctors. Similarly, IBM's Watson for Oncology promised to revolutionize cancer treatment by analyzing vast amounts of medical literature and patient data.

However, the path hasn't been without challenges. Early enthusiasm sometimes outpaced practical implementation. Watson for Oncology, despite initial promise, faced criticism for providing treatment recommendations that didn't always align with standard care practices. This highlighted the importance of rigorous testing and validation in medical AI systems.

Today, AI in healthcare has matured significantly. Electronic health records powered by natural language processing can extract insights from unstructured clinical notes. Predictive analytics help identify patients at risk of sepsis or readmission. Drug discovery processes that once took decades are being accelerated through AI-driven molecular modeling.

The COVID-19 pandemic accelerated AI adoption in unprecedented ways. Contact tracing apps, vaccine distribution optimization, and real-time monitoring of virus variants all relied on artificial intelligence. Telemedicine platforms integrated AI-powered triage systems to manage the surge in virtual consultations.

Looking forward, the future of AI in healthcare appears incredibly promising. Personalized medicine, where treatments are tailored to individual genetic profiles, is becoming reality through AI analysis of genomic data. Robotic surgery assisted by AI provides unprecedented precision. Mental health applications use natural language processing to provide support and early intervention.

Yet challenges remain. Data privacy and security are paramount concerns when dealing with sensitive medical information. Algorithmic bias can perpetuate healthcare disparities if not carefully addressed. The need for transparency and explainability in AI medical decisions is crucial for maintaining physician and patient trust.

Regulatory frameworks are evolving to keep pace with technological advancement. The FDA has approved numerous AI-based medical devices, establishing precedents for safety and efficacy standards. International collaboration on AI healthcare standards ensures global compatibility and safety.

As we move into the next decade, the integration of AI in healthcare will likely become even more seamless and pervasive. The key lies in maintaining the balance between technological innovation and human-centered care, ensuring that artificial intelligence enhances rather than replaces the fundamental human elements of medical practice.
"""

with open(docs_dir / "ai_healthcare.txt", "w") as f:
    f.write(narrative_doc.strip())

print("✅ Created: ai_healthcare.txt")
print(f"📏 Length: {len(narrative_doc.strip())} characters")

In [None]:
# Load our documents
docs = []

# Load technical documentation
tech_loader = TextLoader(str(docs_dir / "api_documentation.txt"))
tech_docs = tech_loader.load()
docs.extend(tech_docs)

# Load narrative document
narrative_loader = TextLoader(str(docs_dir / "ai_healthcare.txt"))
narrative_docs = narrative_loader.load()
docs.extend(narrative_docs)

print(f"📚 Loaded {len(docs)} documents for chunking experiments")
for i, doc in enumerate(docs):
    print(f"   Doc {i+1}: {len(doc.page_content):,} chars")

## 🔪 Exercise 2: Fixed-Size Chunking

Let's start with the simplest approach: splitting documents into fixed-size chunks.

In [None]:
def analyze_chunks(chunks, method_name):
    """
    Analyze and visualize chunk characteristics
    """
    print(f"\n📊 {method_name} Analysis:")
    print(f"   Total chunks: {len(chunks)}")
    
    if chunks:
        # Calculate statistics
        chunk_sizes = [len(chunk.page_content) for chunk in chunks]
        
        print(f"   Avg chunk size: {np.mean(chunk_sizes):.0f} chars")
        print(f"   Min chunk size: {min(chunk_sizes)} chars")
        print(f"   Max chunk size: {max(chunk_sizes)} chars")
        print(f"   Std deviation: {np.std(chunk_sizes):.0f} chars")
        
        # Show first chunk preview
        print(f"\n📝 First chunk preview:")
        print(f"   {chunks[0].page_content[:150]}...")
        
        return chunk_sizes
    return []

# Test different fixed-size approaches
print("🔪 FIXED-SIZE CHUNKING EXPERIMENTS")
print("=" * 50)

# Use the technical document for comparison
test_doc = docs[0]  # API documentation

# 1. Character-based chunking
char_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n\n"  # Split on paragraphs when possible
)

char_chunks = char_splitter.split_documents([test_doc])
char_sizes = analyze_chunks(char_chunks, "Character-based (500 chars)")

In [None]:
# 2. Token-based chunking (more precise for LLMs)
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # Match the tiktoken encoding used for counting below
    chunk_size=100,  # 100 tokens
    chunk_overlap=20
)

token_chunks = token_splitter.split_documents([test_doc])
token_sizes = analyze_chunks(token_chunks, "Token-based (100 tokens)")

# Let's also show token vs character relationship
if token_chunks:
    encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding
    first_chunk_text = token_chunks[0].page_content
    token_count = len(encoding.encode(first_chunk_text))
    char_count = len(first_chunk_text)
    
    print(f"\n🔍 Token vs Character relationship:")
    print(f"   First chunk: {char_count} chars = {token_count} tokens")
    print(f"   Ratio: {char_count/token_count:.1f} chars per token")
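
`CharacterTextSplitter` measures `chunk_size` in characters, while model limits are stated in tokens. A rough conversion, sketched below, multiplies a token budget by the chars-per-token ratio measured above. The helper name `chars_for_token_budget`, the default ratio of 4.0 (a common rule of thumb for English text), and the safety margin are all assumptions for illustration.

```python
def chars_for_token_budget(token_budget: int,
                           chars_per_token: float = 4.0,
                           safety_margin: float = 0.9) -> int:
    """Convert a token budget into an approximate character-based chunk_size."""
    if token_budget <= 0 or chars_per_token <= 0:
        raise ValueError("token_budget and chars_per_token must be positive")
    if not 0 < safety_margin <= 1:
        raise ValueError("safety_margin must be in (0, 1]")
    # Round down so the estimate errs on the side of smaller chunks
    return int(token_budget * chars_per_token * safety_margin)


# Example: character sizes for a few token budgets, using the assumed ratio
for budget in (100, 256, 512):
    print(f"{budget} tokens -> about {chars_for_token_budget(budget)} chars")
```

Because the ratio varies with language and content (code and non-English text usually tokenize less efficiently), it is safer to pass in a ratio measured on your own documents than to rely on the default.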

In [None]:
# 3. Different chunk sizes comparison
chunk_size_experiments = [200, 500, 1000, 2000]
size_results = {}

print(f"\n🔬 Chunk Size Comparison:")
for size in chunk_size_experiments:
    splitter = CharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=size // 10,  # 10% overlap
        separator="\n\n"
    )
    
    chunks = splitter.split_documents([test_doc])
    chunk_sizes = [len(chunk.page_content) for chunk in chunks]
    
    size_results[size] = {
        'count': len(chunks),
        'avg_size': np.mean(chunk_sizes) if chunk_sizes else 0,
        'std_dev': np.std(chunk_sizes) if chunk_sizes else 0
    }
    
    print(f"   Size {size}: {len(chunks)} chunks, avg {np.mean(chunk_sizes):.0f} chars")

# Visualize the results
sizes = list(size_results.keys())
counts = [size_results[s]['count'] for s in sizes]
avg_sizes = [size_results[s]['avg_size'] for s in sizes]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot chunk count vs target size (categorical x-axis keeps the bars visible)
ax1.bar([str(s) for s in sizes], counts, color='skyblue', alpha=0.7)
ax1.set_xlabel('Target Chunk Size (chars)')
ax1.set_ylabel('Number of Chunks')
ax1.set_title('Chunk Count vs Target Size')

# Plot actual vs target size
ax2.plot(sizes, sizes, 'r--', label='Target Size', alpha=0.7)
ax2.plot(sizes, avg_sizes, 'bo-', label='Actual Avg Size')
ax2.set_xlabel('Target Chunk Size (chars)')
ax2.set_ylabel('Actual Chunk Size (chars)')
ax2.set_title('Target vs Actual Chunk Sizes')
ax2.legend()

plt.tight_layout()
plt.show()

print("\n📈 Chart shows: Larger target sizes = fewer chunks, but actual sizes may vary due to natural boundaries")
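
The inverse relationship in the chart can be approximated with simple arithmetic. For a pure sliding window, each chunk after the first advances by `chunk_size - overlap` characters, so the count is roughly `ceil((n_chars - overlap) / (chunk_size - overlap))`. The sketch below uses a hypothetical helper `estimate_chunk_count` and an assumed document length of 2,400 characters; separator-aware splitters such as `CharacterTextSplitter` keep paragraphs whole, so real counts will differ.

```python
def estimate_chunk_count(n_chars: int, chunk_size: int, overlap: int) -> int:
    """Estimate how many sliding-window chunks cover n_chars characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    # Integer ceiling division of (n_chars - overlap) by step
    return -(-(n_chars - overlap) // step)


doc_length = 2400  # assumed length, for illustration only
for size in (200, 500, 1000, 2000):
    count = estimate_chunk_count(doc_length, size, size // 10)
    print(f"chunk_size={size}: about {count} chunks")
```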

## 🧠 Exercise 3: Recursive Chunking (LangChain's Smart Approach)

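Recursive chunking tries separators in priority order (paragraphs, then lines, then sentences, then words) and falls back to a finer separator only when a piece is still larger than `chunk_size`.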
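
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and the sketch ignores overlap entirely.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


demo = "aaa bbb\n\nccc ddd eee"
print(recursive_split(demo, 8, ["\n\n", " "]))
```

The first paragraph fits within the limit and stays whole; only the second, oversized paragraph is broken at word boundaries.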

In [None]:
# Recursive Character Text Splitter - LangChain's recommended approach
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=100,
    separators=[
        "\n\n",  # Paragraphs (best)
        "\n",    # Lines  
        ". ",    # Sentences
        " ",     # Words
        ""       # Characters (last resort)
    ]
)

print("🔄 RECURSIVE CHUNKING EXPERIMENT")
print("=" * 40)

# Test on both documents
for i, doc in enumerate(docs):
    doc_name = ["Technical (API Docs)", "Narrative (AI Healthcare)"][i]
    
    recursive_chunks = recursive_splitter.split_documents([doc])
    sizes = analyze_chunks(recursive_chunks, f"Recursive - {doc_name}")
    
    # Let's see where it chose to split
    if len(recursive_chunks) >= 2:
        print(f"\n🔍 Split analysis:")
        chunk1_end = recursive_chunks[0].page_content[-50:]
        chunk2_start = recursive_chunks[1].page_content[:50]
        
        print(f"   Chunk 1 ends: ...{chunk1_end}")
        print(f"   Chunk 2 starts: {chunk2_start}...")
        
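        # Check if it found a good boundary. Chunks are whitespace-stripped,
        # so inspect the text that follows chunk 1 in the original document.
        first = recursive_chunks[0].page_content
        end_pos = doc.page_content.find(first) + len(first)
        if doc.page_content[end_pos:end_pos + 2] == "\n\n":
            print(f"   ✅ Found natural paragraph boundary")
        elif first.endswith(('.', '!', '?')):
            print(f"   ✅ Found sentence boundary")
        else:
            print(f"   ⚠️  Had to split mid-sentence")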
    
    print("\n" + "="*40 + "\n")

## 🎯 Exercise 4: Semantic Chunking (2025 Advanced Technique)

Let's implement semantic chunking using embeddings to find natural topic boundaries.

In [None]:
# Load a lightweight embedding model for semantic analysis
print("🤖 Loading embedding model for semantic chunking...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, lightweight model
print("✅ Model loaded!")

import re

def semantic_chunking(text, max_chunk_size=800, similarity_threshold=0.7):
    """
    Semantic chunking using sentence embeddings.

    A new chunk starts when two adjacent sentences are less similar than
    `similarity_threshold`, or when adding the next sentence would push the
    chunk past `max_chunk_size` characters. Useful threshold values depend
    on the embedding model, so tune this on your own data.
    """
    # Split after sentence-ending punctuation so '.', '!' and '?' are preserved
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    
    if len(sentences) < 2:
        return [text]  # Too short to chunk meaningfully
    
    # Get embeddings for each sentence
    print(f"   🧮 Computing embeddings for {len(sentences)} sentences...")
    embeddings = embedding_model.encode(sentences)
    
    # Find semantic boundaries
    chunks = []
    current_chunk = [sentences[0]]
    current_length = len(sentences[0])
    
    for i in range(1, len(sentences)):
        # Calculate similarity with previous sentence
        similarity = cosine_similarity(
            embeddings[i-1].reshape(1, -1),
            embeddings[i].reshape(1, -1)
        )[0][0]
        
        sentence_length = len(sentences[i]) + 1  # +1 for the joining space
        
        # Decide whether to continue current chunk or start new one
        should_split = (
            current_length + sentence_length > max_chunk_size or  # Size limit
            similarity < similarity_threshold  # Semantic boundary
        )
        
        if should_split:
            # Finish current chunk
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
            current_length = len(sentences[i])
        else:
            # Continue current chunk
            current_chunk.append(sentences[i])
            current_length += sentence_length
    
    # Add final chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

print("\n🧠 SEMANTIC CHUNKING EXPERIMENT")
print("=" * 40)

# Test semantic chunking on narrative document (better for this approach)
narrative_text = docs[1].page_content

semantic_chunks_text = semantic_chunking(
    narrative_text, 
    max_chunk_size=800, 
    similarity_threshold=0.75
)

# Convert to LangChain documents
semantic_chunks = [
    Document(
        page_content=chunk,
        metadata={"chunk_type": "semantic", "chunk_id": i}
    ) 
    for i, chunk in enumerate(semantic_chunks_text)
]

semantic_sizes = analyze_chunks(semantic_chunks, "Semantic Chunking")

# Show topic analysis
print(f"\n🎯 Semantic Chunk Topics:")
for i, chunk in enumerate(semantic_chunks[:5]):  # Show first 5
    first_sentence = chunk.page_content.split('.')[0][:80]
    print(f"   Chunk {i+1}: {first_sentence}...")
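
A fixed `similarity_threshold` is sensitive to the embedding model, since different models produce different similarity ranges. One alternative, sketched here as an assumption rather than as part of this module's method, is to break at the least similar gaps relative to the document's own distribution. The names `find_breakpoints` and `group_sentences` and the 25th-percentile default are illustrative, and toy 2-D vectors stand in for real sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_breakpoints(embeddings, percentile=25.0):
    """Return sentence indices at which a new chunk should start."""
    sims = [cosine(embeddings[i - 1], embeddings[i]) for i in range(1, len(embeddings))]
    if not sims:
        return []
    ranked = sorted(sims)
    # Nearest-rank percentile: gaps at or below this similarity become boundaries
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    cutoff = ranked[rank - 1]
    return [i + 1 for i, s in enumerate(sims) if s <= cutoff]


def group_sentences(sentences, breakpoints):
    """Join sentences into chunks, starting a new chunk at each breakpoint."""
    chunks, start = [], 0
    for bp in list(breakpoints) + [len(sentences)]:
        if bp > start:
            chunks.append(" ".join(sentences[start:bp]))
            start = bp
    return chunks


toy_sentences = ["Cats purr.", "Kittens purr too.", "Stocks fell.", "Bonds fell too."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
breaks = find_breakpoints(toy_embeddings)
print(breaks)
print(group_sentences(toy_sentences, breaks))
```

Note that this variant always produces at least one boundary when there are two or more sentences, so it suits longer documents better than short snippets.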

## 📏 Exercise 5: Comparing Chunking Strategies

Let's compare all our chunking approaches side by side.

In [None]:
# Compare all methods on the same document
test_document = docs[1]  # Use narrative document

# 1. Fixed-size chunking
fixed_splitter = CharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=100,
    separator="\n\n"
)
fixed_chunks = fixed_splitter.split_documents([test_document])

# 2. Recursive chunking
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=100
)
recursive_chunks = recursive_splitter.split_documents([test_document])

# 3. Semantic chunking (already done above)
# semantic_chunks is already available

# Compare results
methods = {
    "Fixed-Size": fixed_chunks,
    "Recursive": recursive_chunks,
    "Semantic": semantic_chunks
}

print("📊 CHUNKING STRATEGY COMPARISON")
print("=" * 50)

comparison_data = {}

for method, chunks in methods.items():
    if chunks:
        sizes = [len(chunk.page_content) for chunk in chunks]
        
        comparison_data[method] = {
            'count': len(chunks),
            'avg_size': np.mean(sizes),
            'std_dev': np.std(sizes),
            'min_size': min(sizes),
            'max_size': max(sizes)
        }
        
        print(f"\n🔪 {method}:")
        print(f"   Chunks: {len(chunks)}")
        print(f"   Avg size: {np.mean(sizes):.0f} ± {np.std(sizes):.0f} chars")
        print(f"   Range: {min(sizes)} - {max(sizes)} chars")
        
        # Calculate consistency score (lower std dev = more consistent)
        consistency = 1 - (np.std(sizes) / np.mean(sizes))
        print(f"   Consistency: {consistency:.2f} (higher = more uniform)")

# Visualize comparison
methods_list = list(comparison_data.keys())
counts = [comparison_data[m]['count'] for m in methods_list]
avg_sizes = [comparison_data[m]['avg_size'] for m in methods_list]
std_devs = [comparison_data[m]['std_dev'] for m in methods_list]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Chunk count comparison
bars1 = ax1.bar(methods_list, counts, color=['lightcoral', 'lightblue', 'lightgreen'])
ax1.set_ylabel('Number of Chunks')
ax1.set_title('Chunk Count by Method')
ax1.tick_params(axis='x', rotation=45)

# Average size with error bars
bars2 = ax2.bar(methods_list, avg_sizes, yerr=std_devs, 
                capsize=10, color=['lightcoral', 'lightblue', 'lightgreen'], alpha=0.7)
ax2.set_ylabel('Average Chunk Size (chars)')
ax2.set_title('Average Chunk Size ± Std Dev')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n📈 Key Insights:")
print("   • Fixed-size: Most predictable, may break context")
print("   • Recursive: Good balance of size and natural boundaries")
print("   • Semantic: Most context-aware, variable sizes")

## 🔧 Exercise 6: Advanced Chunking with Metadata Preservation

Let's see how to preserve important metadata and context across chunks.

In [None]:
def advanced_chunking_with_metadata(document, chunk_size=600):
    """
    Advanced chunking that preserves metadata and adds context
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=100,
        separators=["\n\n", "\n", ". ", " ", ""],
        add_start_index=True  # Record each chunk's character offset in metadata
    )
    
    chunks = splitter.split_documents([document])
    
    # Enhance each chunk with additional metadata
    enhanced_chunks = []
    
    for i, chunk in enumerate(chunks):
        # Calculate position information from the recorded offset
        # (searching for the chunk text could match an earlier duplicate passage)
        start_char = max(chunk.metadata.get('start_index', 0), 0)
        position_ratio = start_char / len(document.page_content) if document.page_content else 0
        
        # Extract key terms (simple approach): drop short and non-alphabetic
        # tokens first, then keep the five most frequent of what remains
        words = chunk.page_content.lower().split()
        word_freq = Counter(w for w in words if len(w) > 4 and w.isalpha())
        key_terms = [word for word, _ in word_freq.most_common(5)]
        
        # Create enhanced metadata
        enhanced_metadata = chunk.metadata.copy()
        enhanced_metadata.update({
            'chunk_id': i,
            'total_chunks': len(chunks),
            'chunk_size': len(chunk.page_content),
            'position_in_doc': position_ratio,
            'position_label': 'beginning' if position_ratio < 0.33 else 
                            'middle' if position_ratio < 0.67 else 'end',
            'key_terms': key_terms,
            'word_count': len(words),
            'has_code': '```' in chunk.page_content or 'def ' in chunk.page_content,
            'has_headers': any(line.startswith('#') for line in chunk.page_content.split('\n')),
            'next_chunk_preview': chunks[i+1].page_content[:50] + '...' if i < len(chunks)-1 else None,
            'prev_chunk_preview': chunks[i-1].page_content[-50:] if i > 0 else None
        })
        
        enhanced_chunks.append(Document(
            page_content=chunk.page_content,
            metadata=enhanced_metadata
        ))
    
    return enhanced_chunks

# Test advanced chunking
print("🔧 ADVANCED CHUNKING WITH METADATA")
print("=" * 45)

enhanced_chunks = advanced_chunking_with_metadata(docs[0])  # Technical doc

print(f"📚 Created {len(enhanced_chunks)} enhanced chunks")

# Show detailed metadata for first few chunks
for i, chunk in enumerate(enhanced_chunks[:3]):
    meta = chunk.metadata
    print(f"\n📄 Chunk {i+1}:")
    print(f"   Position: {meta['position_label']} ({meta['position_in_doc']:.2f})")
    print(f"   Size: {meta['chunk_size']} chars, {meta['word_count']} words")
    print(f"   Key terms: {', '.join(meta['key_terms'][:3])}")
    print(f"   Has code: {'Yes' if meta['has_code'] else 'No'}")
    print(f"   Has headers: {'Yes' if meta['has_headers'] else 'No'}")
    
    if meta['prev_chunk_preview']:
        print(f"   Previous: ...{meta['prev_chunk_preview']}")
    if meta['next_chunk_preview']:
        print(f"   Next: {meta['next_chunk_preview']}")
    
    print(f"   Content preview: {chunk.page_content[:100]}...")
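
Metadata like this pays off at retrieval time, when candidate chunks can be narrowed down before or after similarity search. The sketch below shows the idea with a hypothetical `Chunk` dataclass standing in for a LangChain `Document` and a hypothetical helper `filter_by_metadata`; the sample texts are taken from the API documentation above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks, **required):
    """Keep chunks whose metadata matches every key=value pair in `required`."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required.items())
    ]


chunks = [
    Chunk("Authorization: Bearer YOUR_API_KEY",
          {"has_code": True, "position_label": "beginning"}),
    Chunk("API calls are rate limited to prevent abuse.",
          {"has_code": False, "position_label": "beginning"}),
    Chunk("Error responses include a detailed message.",
          {"has_code": False, "position_label": "end"}),
]

code_chunks = filter_by_metadata(chunks, has_code=True)
print([c.page_content for c in code_chunks])
```

Many vector stores offer their own metadata filters; this in-memory version only illustrates why the fields are worth recording.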

## 🎯 Exercise 7: Adaptive Chunking (2025 Technique)

Let's implement adaptive chunking that adjusts parameters based on content type.

In [None]:
def adaptive_chunking_strategy(document):
    """
    Adaptive chunking that adjusts parameters based on content analysis
    """
    content = document.page_content
    
    # Analyze content characteristics
    lines = content.split('\n')
    words = content.split()
    
    # Content type detection
    code_indicators = content.count('```') + content.count('def ') + content.count('class ')
    header_count = sum(1 for line in lines if line.strip().startswith('#'))
    list_items = sum(1 for line in lines if line.strip().startswith(('-', '*', '•')))
    avg_line_length = np.mean([len(line) for line in lines if line.strip()])
    
    # Determine content type and optimal parameters
    if code_indicators > 5:
        content_type = "code_heavy"
        chunk_size = 800  # Larger chunks to preserve code context
        overlap = 50      # Minimal overlap
        separators = ["\n\n", "\ndef ", "\nclass ", "\n", " "]
    elif header_count > 3:
        content_type = "structured_docs"
        chunk_size = 600  # Medium chunks
        overlap = 80      # Good overlap
        separators = ["\n# ", "\n## ", "\n\n", "\n", ". ", " "]
    elif list_items > 5:
        content_type = "list_heavy"
        chunk_size = 400  # Smaller chunks
        overlap = 60      # Medium overlap
        separators = ["\n\n", "\n- ", "\n* ", "\n", ". ", " "]
    elif avg_line_length > 100:
        content_type = "narrative"
        chunk_size = 1000  # Larger chunks for narrative flow
        overlap = 150      # High overlap for context
        separators = ["\n\n", "\n", ". ", " "]
    else:
        content_type = "general"
        chunk_size = 600   # Default
        overlap = 100      # Default
        separators = ["\n\n", "\n", ". ", " "]
    
    # Create splitter with adaptive parameters
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=separators
    )
    
    chunks = splitter.split_documents([document])
    
    # Add adaptive metadata
    for chunk in chunks:
        chunk.metadata.update({
            'adaptive_strategy': content_type,
            'chunk_size_used': chunk_size,
            'overlap_used': overlap,
            'content_analysis': {
                'code_indicators': code_indicators,
                'header_count': header_count,
                'list_items': list_items,
                'avg_line_length': round(avg_line_length, 1)
            }
        })
    
    return chunks, content_type

print("🎯 ADAPTIVE CHUNKING EXPERIMENT")
print("=" * 40)

# Test adaptive chunking on both documents
for i, doc in enumerate(docs):
    doc_name = ["Technical (API Docs)", "Narrative (AI Healthcare)"][i]
    
    adaptive_chunks, detected_type = adaptive_chunking_strategy(doc)
    
    print(f"\n📄 {doc_name}:")
    print(f"   Detected type: {detected_type}")
    print(f"   Chunks created: {len(adaptive_chunks)}")
    
    if adaptive_chunks:
        first_chunk_meta = adaptive_chunks[0].metadata
        print(f"   Strategy used:")
        print(f"     Chunk size: {first_chunk_meta['chunk_size_used']}")
        print(f"     Overlap: {first_chunk_meta['overlap_used']}")
        
        analysis = first_chunk_meta['content_analysis']
        print(f"   Content analysis:")
        print(f"     Code indicators: {analysis['code_indicators']}")
        print(f"     Headers: {analysis['header_count']}")
        print(f"     List items: {analysis['list_items']}")
        print(f"     Avg line length: {analysis['avg_line_length']}")
        
        # Show size distribution
        sizes = [len(chunk.page_content) for chunk in adaptive_chunks]
        print(f"   Size distribution: {min(sizes)} - {max(sizes)} chars (avg: {np.mean(sizes):.0f})")

## 📊 Exercise 8: Chunking Quality Assessment

Let's create metrics to evaluate chunking quality.

In [None]:
def evaluate_chunking_quality(chunks, method_name):
    """
    Comprehensive quality assessment for chunking
    """
    if not chunks:
        return {"error": "No chunks to evaluate"}
    
    sizes = [len(chunk.page_content) for chunk in chunks]
    
    # 1. Size consistency (lower coefficient of variation = better)
    size_consistency = 1 - (np.std(sizes) / np.mean(sizes)) if np.mean(sizes) > 0 else 0
    
    # 2. Boundary quality (heuristic check for chunks cut mid-sentence)
    broken_boundaries = 0
    for chunk in chunks:
        content = chunk.page_content.strip()
        if content and content[0].islower():  # Starts with a lowercase letter, likely mid-sentence
            broken_boundaries += 1
        if content and not content.endswith(('.', '!', '?')):  # Doesn't end on sentence punctuation
            broken_boundaries += 1
    
    boundary_quality = 1 - (broken_boundaries / (len(chunks) * 2))  # Max 2 issues per chunk
    
    # 3. Content completeness (no very short chunks)
    very_short_chunks = sum(1 for size in sizes if size < 50)
    completeness = 1 - (very_short_chunks / len(chunks))
    
    # 4. Context preservation (simple heuristic: shared words across the boundary)
    context_breaks = 0
    for i in range(len(chunks) - 1):
        current_end = chunks[i].page_content.strip()[-20:]
        next_start = chunks[i+1].page_content.strip()[:20]
        
        # Lexical overlap is only a rough proxy for semantic continuity
        if not any(word in next_start.lower() for word in current_end.lower().split() if len(word) > 3):
            context_breaks += 1
    
    context_preservation = 1 - (context_breaks / max(1, len(chunks) - 1))
    
    # 5. Overall quality score
    overall_quality = (size_consistency + boundary_quality + completeness + context_preservation) / 4
    
    return {
        'method': method_name,
        'chunk_count': len(chunks),
        'avg_size': np.mean(sizes),
        'size_consistency': size_consistency,
        'boundary_quality': boundary_quality,
        'completeness': completeness,
        'context_preservation': context_preservation,
        'overall_quality': overall_quality
    }

# Evaluate all our chunking methods
print("📊 CHUNKING QUALITY ASSESSMENT")
print("=" * 50)

# Prepare test chunks (using narrative document)
test_doc = docs[1]

# Get chunks from different methods
fixed_chunks = CharacterTextSplitter(chunk_size=600, chunk_overlap=100).split_documents([test_doc])
recursive_chunks = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=100).split_documents([test_doc])
adaptive_chunks, _ = adaptive_chunking_strategy(test_doc)

methods_to_evaluate = [
    (fixed_chunks, "Fixed-Size"),
    (recursive_chunks, "Recursive"),
    (semantic_chunks, "Semantic"),
    (adaptive_chunks, "Adaptive")
]

quality_results = []

for chunks, method_name in methods_to_evaluate:
    quality = evaluate_chunking_quality(chunks, method_name)
    quality_results.append(quality)
    
    print(f"\n🔪 {method_name}:")
    print(f"   Overall Quality: {quality['overall_quality']:.3f}")
    print(f"   Size Consistency: {quality['size_consistency']:.3f}")
    print(f"   Boundary Quality: {quality['boundary_quality']:.3f}")
    print(f"   Completeness: {quality['completeness']:.3f}")
    print(f"   Context Preservation: {quality['context_preservation']:.3f}")

# Visualize quality comparison
methods = [r['method'] for r in quality_results]
overall_scores = [r['overall_quality'] for r in quality_results]
metrics = ['size_consistency', 'boundary_quality', 'completeness', 'context_preservation']

# Create side-by-side bar charts
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Overall quality bar chart
bars = ax1.bar(methods, overall_scores, color=['lightcoral', 'lightblue', 'lightgreen', 'gold'])
ax1.set_ylabel('Overall Quality Score')
ax1.set_title('Chunking Methods - Overall Quality')
ax1.set_ylim(0, 1)
ax1.tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar, score in zip(bars, overall_scores):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{score:.3f}', ha='center', va='bottom')

# Detailed metrics comparison
x = np.arange(len(methods))
width = 0.2

for i, metric in enumerate(metrics):
    values = [r[metric] for r in quality_results]
    ax2.bar(x + i*width, values, width, label=metric.replace('_', ' ').title())

ax2.set_xlabel('Chunking Methods')
ax2.set_ylabel('Quality Score')
ax2.set_title('Detailed Quality Metrics')
ax2.set_xticks(x + width * 1.5)
ax2.set_xticklabels(methods, rotation=45)
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
ax2.set_ylim(0, 1)

plt.tight_layout()
plt.show()

# Determine best method
best_method = max(quality_results, key=lambda x: x['overall_quality'])
print(f"\n🏆 Best performing method: {best_method['method']} (Quality: {best_method['overall_quality']:.3f})")

## 🧠 Key Takeaways

From this module, you should now understand:

### 🔪 Chunking Fundamentals:
1. **Why chunking matters**: Enables precise retrieval and fits LLM context windows
2. **Size trade-offs**: Small chunks = precise but lack context; Large chunks = more context but more noise
3. **Boundary importance**: Natural breaks preserve meaning and readability
4. **Overlap strategy**: Helps maintain context across chunk boundaries

### 📊 Strategy Comparison:
- **Fixed-Size**: Simple, predictable, but may break context
- **Recursive**: Good balance, respects natural boundaries
- **Semantic**: Context-aware, but variable sizes and more complex
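- **Adaptive**: Tunes chunk size, overlap, and separators to the detected content type; a strong default for mixed corpora, but verify it with the quality metrics on your own data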

### 🚀 2025 Best Practices:
1. **Semantic chunking** with embeddings for topic-aware splitting
2. **Adaptive parameters** based on content type analysis
3. **Rich metadata** preservation for better retrieval
4. **Quality assessment** metrics to optimize chunking strategies

### 🎯 Practical Guidelines:
- **Technical docs**: Use recursive chunking with code-aware separators
- **Narrative text**: Use semantic or adaptive chunking for flow preservation
- **Mixed content**: Use adaptive chunking for automatic optimization
- **Production systems**: Always include quality assessment and monitoring
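
For the monitoring guideline, a lightweight check is to flag chunks whose size looks abnormal before they reach the index. The sketch below is one possible approach using only the standard library; the helper name `flag_outlier_chunks` and the z-score cutoff of 2.0 are assumptions, while the 50-character minimum mirrors the "very short chunk" threshold used in the quality assessment above.

```python
import statistics


def flag_outlier_chunks(sizes, min_size=50, z_cutoff=2.0):
    """Return indices of chunks that are very short or far from the mean size."""
    if not sizes:
        return []
    mean = statistics.fmean(sizes)
    stdev = statistics.pstdev(sizes)
    flagged = []
    for i, size in enumerate(sizes):
        too_short = size < min_size
        # Skip the z-score test when all chunks have the same size
        unusual = stdev > 0 and abs(size - mean) / stdev > z_cutoff
        if too_short or unusual:
            flagged.append(i)
    return flagged


chunk_sizes = [500, 520, 480, 510, 30, 495]
print(flag_outlier_chunks(chunk_sizes))
```

In a pipeline, `sizes` would come from `len(chunk.page_content)` for each chunk, and flagged indices could be logged or sent back for re-chunking.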

## 🎯 Next Steps

In **Module 4**, we'll explore embedding models that convert our chunks into vectors:
- Understanding different embedding models and their characteristics
- Model selection criteria for different use cases
- Cost vs performance trade-offs
- Latest 2025 embedding models and benchmarks

Good chunking is the foundation of effective RAG - it directly impacts retrieval quality!

## 🤔 Discussion Questions

1. For your use case, which chunking strategy would work best and why?
2. How would you handle documents with mixed content types (code + text + tables)?
3. What additional quality metrics would be useful for your domain?
4. How would you handle multilingual documents or documents with special formatting?

## 📝 Optional Exercise

**Advanced Challenge**: Implement a custom chunking strategy that:
1. Detects document sections automatically
2. Preserves hierarchical structure (headers, subheaders)
3. Handles code blocks specially
4. Includes cross-references between related chunks

Test it on documents from your domain and compare with the methods in this module!
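
As a possible starting point for points 1 to 3, the sketch below walks a Markdown document line by line, tracks the current stack of headers, and ignores `#` lines inside code fences. It is a plain-Python illustration under simplifying assumptions (ATX-style `#` headers, triple-backtick fences); the function name `split_by_headers` and the `header_path` key are hypothetical.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.+)$")
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def split_by_headers(markdown_text):
    """Return sections as dicts with the enclosing header path and the body text."""
    sections, path, body = [], [], []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"header_path": [title for _, title in path],
                             "content": text})
        body.clear()

    for line in markdown_text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is not a header
        match = None if in_fence else HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or deeper level before pushing the new one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


doc = "\n".join([
    "# API", "intro",
    "## Auth", "Use a token.", FENCE, "# not a header", FENCE,
    "## Limits", "100 per hour",
])
for section in split_by_headers(doc):
    print(section["header_path"], "->", len(section["content"]), "chars")
```

Each section could then be passed to a size-aware splitter, with `header_path` copied into the metadata of every resulting chunk.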