# RAG Series - Module 3: Advanced Chunking Techniques

Welcome to Module 3! Building on the foundation from Module 2, we now explore **cutting-edge chunking techniques** that leverage AI, semantics, and intelligent analysis to create optimal chunks for RAG systems.

## Table of Contents
- [1 - Introduction](#1)
  - [1.1 Setup and Installation](#1-1)
  - [1.2 Advanced Data Preparation](#1-2)
- [2 - Semantic Chunking](#2)
  - [2.1 LangChain Semantic Chunker](#2-1)
- [3 - Agentic Chunking](#3)
  - [3.1 LLM-Powered Chunk Analysis](#3-1)
- [4 - Proposition-Based Chunking](#4)
  - [4.1 Atomic Fact Extraction](#4-1)
- [5 - Production Evaluation](#6)
  - [5.1 Comparative Analysis](#6-1)

---

## 🧠 What Makes Advanced Chunking Different?

While traditional chunking relies on **static rules** (character counts, separators), advanced techniques use:

- **🎯 Semantic Understanding**: Chunks based on meaning, not just structure
- **🤖 AI-Powered Analysis**: LLMs determine optimal chunk boundaries
- **🔗 Context Preservation**: Intelligent overlap and relationship modeling
- **📊 Multi-Vector Representations**: Different embeddings for summaries vs. details
- **⚡ Dynamic Adaptation**: Chunking strategies that adapt to content type

These techniques can **dramatically improve RAG performance** but come with trade-offs in complexity and computational cost.

---

**Technologies we'll explore:**
- **LangChain Experimental**: Semantic chunkers and advanced splitters
- **OpenAI GPT Models**: For intelligent content analysis
- **Custom Algorithms**: Proposition extraction and clustering
- **Multi-Vector Storage**: Late interaction patterns with Pinecone

<a id='1'></a>
## 1 - Introduction

---

<a id='1-1'></a>
### 1.1 Setup and Installation

For advanced chunking techniques, we need additional packages beyond the standard LangChain suite.

**What we're doing:** Installing cutting-edge packages for semantic analysis, experimental LangChain features, and advanced text processing capabilities required for intelligent chunking.

In [2]:
# Install required packages for advanced chunking
%pip install langchain langchain-pinecone langchain-openai langchain-experimental
%pip install pinecone tiktoken requests tqdm uuid numpy scipy scikit-learn
%pip install sentence-transformers transformers spacy nltk

Collecting langchain-experimental
  Downloading langchain_experimental-0.3.4-py3-none-any.whl (209 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m209.2/209.2 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: langchain-experimental
Successfully installed langchain-experimental-0.3.4

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
Collecting uuid
  Downloading uuid-1.30.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... [?25ldone
Installing collected packages: uuid
[33m  DEPRECATION: uuid is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce 

Now let's import our libraries and set up the advanced chunking environment:

**What we're doing:** Importing all necessary libraries including experimental features, setting up API keys, and initializing models for semantic analysis and LLM-powered chunking.

In [None]:
# Core imports
from typing import List, Dict, Tuple, Optional, Any
import requests
import re
import os
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.cluster import DBSCAN
import nltk
from uuid import uuid4
import time
import json
from tqdm import tqdm

# LangChain imports
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Advanced chunking imports
try:
    from langchain_experimental.text_splitter import SemanticChunker
    print("✅ Semantic Chunker available")
except ImportError:
    print("⚠️ Semantic Chunker not available - will use custom implementation")
    SemanticChunker = None

# Pinecone
from pinecone import Pinecone, ServerlessSpec

# Set your API keys here
OPENAI_API_KEY = "your-openai-api-key-here"      # Get from https://platform.openai.com/account/api-keys
PINECONE_API_KEY = "your-pinecone-api-key-here"  # Get from https://app.pinecone.io

# Set environment variables
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY

print("✅ Advanced chunking environment configured!")

# Download required NLTK data
try:
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    print("✅ NLTK data downloaded")
except:
    print("⚠️ NLTK download failed - some features may be limited")

  from .autonotebook import tqdm as notebook_tqdm


✅ Semantic Chunker available
✅ Advanced chunking environment configured!
✅ NLTK data downloaded


**Result:** ✅ Advanced environment setup complete! We now have access to experimental chunking features, semantic analysis tools, and LLM-powered content understanding capabilities.

<a id='1-2'></a>
### 1.2 Advanced Data Preparation

For advanced chunking, we need diverse, complex content that showcases different techniques. Let's gather multiple document types.

**What we're doing:** Loading multiple document types (technical documentation, academic papers, code files) to demonstrate how advanced chunking techniques handle different content structures and semantic patterns.

In [2]:
# Load diverse content for advanced chunking demonstrations
def load_diverse_content():
    """
    Load multiple types of documents for advanced chunking testing.
    """
    documents = []
    
    # 1. Technical documentation (Git book)
    print("📚 Loading Git documentation...")
    git_urls = [
        "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc",
        "https://raw.githubusercontent.com/progit/progit2/main/book/02-git-basics/sections/getting-a-repository.asc",
        "https://raw.githubusercontent.com/progit/progit2/main/book/03-git-branching/sections/basic-branching-and-merging.asc"
    ]
    
    for i, url in enumerate(git_urls):
        try:
            response = requests.get(url)
            response.raise_for_status()
            doc = Document(
                page_content=response.text,
                metadata={
                    'source': f'git_book_chapter_{i+1}',
                    'type': 'technical_documentation',
                    'complexity': 'intermediate',
                    'url': url
                }
            )
            documents.append(doc)
        except Exception as e:
            print(f"⚠️ Failed to load {url}: {e}")
    
    # 2. Academic/research content
    print("🔬 Adding research content...")
    academic_content = """
    # Retrieval-Augmented Generation: A Comprehensive Survey
    
    ## Abstract
    Retrieval-Augmented Generation (RAG) represents a paradigm shift in natural language processing, combining the parametric knowledge of large language models with non-parametric retrieval from external knowledge bases. This approach addresses fundamental limitations of standalone generative models, including knowledge cutoffs, hallucination tendencies, and inability to access real-time information.
    
    ## Introduction
    The integration of retrieval mechanisms with generative models has emerged as a critical advancement in AI systems. Traditional language models, while powerful, suffer from several key limitations. First, their knowledge is frozen at training time, making them unable to access new information. Second, they are prone to generating plausible-sounding but factually incorrect information, a phenomenon known as hallucination.
    
    ## Methodology
    RAG systems typically follow a two-stage process: retrieval and generation. The retrieval stage involves searching through external knowledge sources to find relevant information given a query. This search can be performed using various techniques, including dense vector search, sparse keyword matching, or hybrid approaches that combine both methodologies.
    
    The generation stage takes the retrieved context and the original query as input to a language model, which then generates a response that incorporates the retrieved information. This approach allows the model to produce answers that are both contextually relevant and factually grounded.
    
    ## Evaluation Metrics
    Evaluating RAG systems requires consideration of multiple dimensions. Retrieval quality can be measured using traditional information retrieval metrics such as precision, recall, and normalized discounted cumulative gain (NDCG). Generation quality involves assessing factual accuracy, relevance, coherence, and completeness of responses.
    
    Recent research has introduced specialized metrics for RAG evaluation, including faithfulness (how well the generated text aligns with retrieved sources), answer relevance (how well the answer addresses the query), and context precision (quality of retrieved context).
    
    ## Challenges and Future Directions
    Despite significant progress, RAG systems face several challenges. Information quality control remains difficult, as retrieved documents may contain outdated, biased, or incorrect information. Computational efficiency is another concern, as the retrieval process adds latency to response generation.
    
    Future research directions include improving retrieval algorithms, developing better evaluation frameworks, and exploring novel architectures that more tightly integrate retrieval and generation components.
    """
    
    doc = Document(
        page_content=academic_content,
        metadata={
            'source': 'rag_survey_paper',
            'type': 'academic_research',
            'complexity': 'advanced',
            'domain': 'machine_learning'
        }
    )
    documents.append(doc)
    
    # 3. Mixed content (instructions + code + theory)
    print("💻 Adding mixed technical content...")
    mixed_content = """
    # Advanced Vector Search Implementation Guide
    
    Vector search has revolutionized information retrieval by enabling semantic similarity matching rather than exact keyword matching. This guide covers advanced implementation techniques.
    
    ## Core Concepts
    
    ### Dense Vector Representations
    Dense vectors encode semantic meaning in high-dimensional space. Unlike sparse representations (like TF-IDF), dense vectors capture contextual relationships between concepts.
    
    ```python
    import numpy as np
    from sentence_transformers import SentenceTransformer
    
    # Initialize embedding model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Generate embeddings
    texts = ["Machine learning fundamentals", "Deep learning architectures"]
    embeddings = model.encode(texts)
    
    # Calculate similarity
    similarity = np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
    ```
    
    ### Similarity Metrics
    
    Different similarity metrics serve different purposes:
    
    1. **Cosine Similarity**: Measures angle between vectors, good for text
    2. **Euclidean Distance**: Measures geometric distance, sensitive to magnitude
    3. **Dot Product**: Fast computation, unnormalized similarity
    
    ## Implementation Strategies
    
    ### Hybrid Search Architecture
    Modern search systems combine multiple retrieval methods:
    
    ```python
    class HybridRetriever:
        def __init__(self, dense_index, sparse_index, alpha=0.7):
            self.dense_index = dense_index
            self.sparse_index = sparse_index
            self.alpha = alpha  # Weight for dense vs sparse
        
        def search(self, query, top_k=10):
            # Dense search
            dense_results = self.dense_index.search(query, top_k)
            
            # Sparse search  
            sparse_results = self.sparse_index.search(query, top_k)
            
            # Combine results
            return self._fusion(dense_results, sparse_results)
    ```
    
    ### Performance Optimization
    
    Vector search performance depends on several factors:
    
    - **Index Structure**: HNSW, IVF, or LSH for approximate search
    - **Dimensionality**: Higher dimensions capture more information but increase compute
    - **Quantization**: Reduce memory usage with minimal quality loss
    - **Caching**: Store frequently accessed embeddings in memory
    
    ## Advanced Techniques
    
    ### Multi-Vector Approaches
    
    Instead of single embeddings per document, use multiple representations:
    
    1. **Summary Embeddings**: High-level document overview
    2. **Chunk Embeddings**: Detailed section representations
    3. **Keyword Embeddings**: Important term vectors
    
    ### Query Enhancement
    
    Improve retrieval quality through query processing:
    
    - **Query Expansion**: Add related terms using embeddings
    - **Query Rewriting**: Rephrase for better matching
    - **Multi-Query**: Generate multiple query variants
    
    ## Production Considerations
    
    Deploying vector search in production requires careful attention to:
    
    - **Latency Requirements**: Sub-100ms response times
    - **Scalability**: Handle millions of vectors efficiently
    - **Cost Optimization**: Balance performance vs infrastructure costs
    - **Monitoring**: Track search quality and system performance
    
    The key to successful vector search implementation lies in understanding your specific use case and optimizing accordingly.
    """
    
    doc = Document(
        page_content=mixed_content,
        metadata={
            'source': 'vector_search_guide',
            'type': 'technical_tutorial',
            'complexity': 'advanced',
            'contains_code': True,
            'domain': 'information_retrieval'
        }
    )
    documents.append(doc)
    
    print(f"✅ Loaded {len(documents)} diverse documents for advanced chunking")
    
    # Display document statistics
    total_chars = sum(len(doc.page_content) for doc in documents)
    total_words = sum(len(doc.page_content.split()) for doc in documents)
    
    print(f"📊 Dataset statistics:")
    print(f"   Total characters: {total_chars:,}")
    print(f"   Total words: {total_words:,}")
    print(f"   Document types: {set(doc.metadata['type'] for doc in documents)}")
    
    return documents

# Load our diverse content
documents = load_diverse_content()

📚 Loading Git documentation...
🔬 Adding research content...
💻 Adding mixed technical content...
✅ Loaded 5 diverse documents for advanced chunking
📊 Dataset statistics:
   Total characters: 31,159
   Total words: 4,821
   Document types: {'technical_tutorial', 'academic_research', 'technical_documentation'}


**Result:** ✅ Successfully loaded diverse content including technical documentation, academic research, and mixed tutorial content. This variety will showcase how different advanced chunking techniques handle various content structures and semantic patterns.

<a id='2'></a>
## 2 - Semantic Chunking

---

**Semantic chunking** goes beyond structural markers to understand content meaning. Instead of splitting on fixed boundaries, it analyzes semantic similarity between sentences and paragraphs to create meaningful chunks.

### 🎯 How Semantic Chunking Works:

1. **Sentence Embeddings**: Convert each sentence to vector representation
2. **Similarity Analysis**: Compare adjacent sentence embeddings
3. **Boundary Detection**: Identify semantic breaks where similarity drops
4. **Chunk Formation**: Group sentences with high semantic coherence

### ✅ Benefits:
- Preserves topical coherence
- Creates natural information boundaries
- Improves retrieval relevance
- Adapts to content structure

### ❌ Trade-offs:
- Higher computational cost
- Embedding model dependency
- Less predictable chunk sizes

<a id='2-1'></a>
### 2.1 LangChain Semantic Chunker

LangChain's experimental `SemanticChunker` uses embeddings to identify semantic boundaries.

**What we're doing:** Implementing LangChain's SemanticChunker to automatically detect topic boundaries using embedding similarity analysis. This creates chunks based on semantic coherence rather than arbitrary size limits.

In [3]:
# LangChain Semantic Chunker implementation
def setup_semantic_chunker():
    """
    Set up LangChain's semantic chunker with OpenAI embeddings.
    """
    try:
        # Initialize embeddings for semantic analysis
        embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            api_key=os.environ["OPENAI_API_KEY"]
        )
        
        if SemanticChunker is not None:
            # Use LangChain's experimental semantic chunker
            semantic_chunker = SemanticChunker(
                embeddings=embeddings,
                breakpoint_threshold_type="percentile",
                breakpoint_threshold_amount=95,  # Only split at top 5% of dissimilarity
            )
            print("✅ LangChain SemanticChunker initialized")
            return semantic_chunker, embeddings
        else:
            print("⚠️ LangChain SemanticChunker not available, using custom implementation")
            return None, embeddings
            
    except Exception as e:
        print(f"❌ Error setting up semantic chunker: {e}")
        return None, None

def test_langchain_semantic_chunking(documents, chunker):
    """
    Test LangChain semantic chunking on our documents.
    """
    if chunker is None:
        print("❌ Semantic chunker not available")
        return []
    
    print("🧠 Testing LangChain Semantic Chunking...")
    all_chunks = []
    
    for i, doc in enumerate(documents):
        print(f"\n📄 Processing document {i+1}: {doc.metadata['source']}")
        
        try:
            # Apply semantic chunking
            chunks = chunker.split_documents([doc])
            
            # Add metadata
            for j, chunk in enumerate(chunks):
                chunk.metadata.update({
                    'chunking_method': 'langchain_semantic',
                    'chunk_index': j,
                    'original_doc_index': i,
                    'word_count': len(chunk.page_content.split()),
                    'char_count': len(chunk.page_content)
                })
            
            all_chunks.extend(chunks)
            
            print(f"   📊 Created {len(chunks)} semantic chunks")
            print(f"   📏 Average chunk size: {np.mean([len(c.page_content.split()) for c in chunks]):.1f} words")
            
            # Show first chunk sample
            if chunks:
                sample = chunks[0].page_content[:200] + "..." if len(chunks[0].page_content) > 200 else chunks[0].page_content
                print(f"   📝 Sample chunk: {sample}")
                
        except Exception as e:
            print(f"   ❌ Error chunking document: {e}")
    
    print(f"\n✅ LangChain semantic chunking complete: {len(all_chunks)} total chunks")
    return all_chunks

# Set up and test LangChain semantic chunking
semantic_chunker, embeddings = setup_semantic_chunker()
langchain_semantic_chunks = test_langchain_semantic_chunking(documents, semantic_chunker)

✅ LangChain SemanticChunker initialized
🧠 Testing LangChain Semantic Chunking...

📄 Processing document 1: git_book_chapter_1
   📊 Created 4 semantic chunks
   📏 Average chunk size: 350.8 words
   📝 Sample chunk: [[what_is_git_section]]
=== What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using G...

📄 Processing document 2: git_book_chapter_2
   📊 Created 2 semantic chunks
   📏 Average chunk size: 327.5 words
   📝 Sample chunk: [[_getting_a_repo]]
=== Getting a Git Repository

You typically obtain a Git repository in one of two ways:

1. You can take a local directory that is currently not under version control, and turn it ...

📄 Processing document 3: git_book_chapter_3
   📊 Created 5 semantic chunks
   📏 Average chunk size: 410.0 words
   📝 Sample chunk: === Basic Branching and Merging

Let's go through a simple example of branching and merging with a workflow that yo

**Result:** ✅ LangChain's SemanticChunker successfully identified semantic boundaries by analyzing embedding similarity between sentences. Notice how chunk sizes vary based on topic coherence rather than fixed limits.

<a id='3'></a>
## 3 - Agentic Chunking

---

**Agentic chunking** uses LLMs as intelligent agents to analyze content and determine optimal chunk boundaries. Instead of relying on statistical similarity, it leverages the language model's understanding of context, topics, and natural information flow.

### 🤖 How Agentic Chunking Works:

1. **Content Analysis**: LLM analyzes text for topics, transitions, and logical structure
2. **Boundary Reasoning**: AI decides where natural breaks should occur
3. **Context Preservation**: Ensures chunks maintain coherent, complete ideas
4. **Quality Validation**: LLM verifies chunk quality and coherence

### ✅ Benefits:
- Human-like understanding of content structure
- Preserves complete ideas and arguments
- Adapts to content type and complexity
- Can handle complex reasoning about boundaries

### ❌ Trade-offs:
- High computational cost (LLM calls for every document)
- Dependent on LLM quality and prompting
- Slower processing time
- Best suited for high-value content where quality matters more than speed

<a id='3-1'></a>
### 3.1 LLM-Powered Chunk Analysis

**What we're doing:** Creating an AI agent that analyzes document structure and intelligently determines where to split content. The LLM considers topic flow, argument structure, and logical transitions to create semantically meaningful chunks.

In [5]:
# Agentic Chunking Implementation
class AgenticChunker:
    def __init__(self, llm, max_chunk_size=1000, min_chunk_size=200):
        """
        Agentic chunker using LLM for intelligent boundary detection.
        
        Args:
            llm: LangChain LLM instance (e.g., ChatOpenAI)
            max_chunk_size: Maximum characters per chunk
            min_chunk_size: Minimum characters per chunk
        """
        self.llm = llm
        self.max_chunk_size = max_chunk_size
        self.min_chunk_size = min_chunk_size
        
        # Prompt for analyzing document structure
        self.analysis_prompt = PromptTemplate(
            template="""You are an expert at analyzing document structure and identifying natural break points for optimal text chunking.

Analyze the following text and identify the best places to split it into chunks. Consider:
- Topic transitions and thematic shifts
- Logical argument flow and complete ideas
- Natural paragraph and section boundaries
- Maintaining context and coherence within chunks

Text to analyze:
{text}

Instructions:
1. Identify 3-5 optimal split points in the text
2. Explain your reasoning for each split point
3. Ensure each resulting chunk would be 200-1000 characters
4. Consider the content type: {content_type}

Format your response as:
SPLIT_POINT_1: [character position] - [reason]
SPLIT_POINT_2: [character position] - [reason]
...

If no good split points exist (text is too short or highly cohesive), respond with:
NO_SPLITS_NEEDED: [reason]

Response:""",
            input_variables=["text", "content_type"]
        )
        
        # Chain for document analysis
        self.analysis_chain = LLMChain(
            llm=self.llm,
            prompt=self.analysis_prompt
        )
    
    def _parse_split_points(self, llm_response: str, text_length: int) -> List[int]:
        """Parse LLM response to extract split point positions."""
        split_points = []
        
        if "NO_SPLITS_NEEDED" in llm_response:
            return split_points
        
        # Extract split points from response
        import re
        pattern = r"SPLIT_POINT_\d+:\s*(\d+)"
        matches = re.findall(pattern, llm_response)
        
        for match in matches:
            pos = int(match)
            # Validate position is within text bounds
            if 0 < pos < text_length:
                split_points.append(pos)
        
        # Sort split points
        split_points.sort()
        
        # Remove split points that would create chunks that are too small or large
        validated_points = []
        last_pos = 0
        
        for pos in split_points:
            chunk_size = pos - last_pos
            if chunk_size >= self.min_chunk_size:
                validated_points.append(pos)
                last_pos = pos
        
        return validated_points
    
    def _create_chunks_from_splits(self, text: str, split_points: List[int]) -> List[str]:
        """Create text chunks based on split points."""
        if not split_points:
            return [text]
        
        chunks = []
        start = 0
        
        for split_point in split_points:
            chunk = text[start:split_point].strip()
            if chunk and len(chunk) >= self.min_chunk_size:
                chunks.append(chunk)
            start = split_point
        
        # Add final chunk
        final_chunk = text[start:].strip()
        if final_chunk:
            if chunks and len(final_chunk) < self.min_chunk_size:
                # Merge small final chunk with previous chunk
                chunks[-1] += " " + final_chunk
            else:
                chunks.append(final_chunk)
        
        return chunks
    
    def chunk_text(self, text: str, content_type: str = "general") -> List[str]:
        """Chunk text using LLM analysis."""
        # If text is small enough, don't chunk
        if len(text) <= self.max_chunk_size:
            return [text]
        
        print(f"   🤖 Analyzing document structure with LLM...")
        
        try:
            # Get LLM analysis of optimal split points
            analysis = self.analysis_chain.run(
                text=text[:4000],  # Limit input to avoid token limits
                content_type=content_type
            )
            
            print(f"   🧠 LLM Analysis: {analysis[:200]}...")
            
            # Parse split points from LLM response
            split_points = self._parse_split_points(analysis, len(text))
            print(f"   📍 Identified {len(split_points)} split points: {split_points}")
            
            # Create chunks based on analysis
            chunks = self._create_chunks_from_splits(text, split_points)
            
            return chunks
            
        except Exception as e:
            print(f"   ❌ LLM analysis failed: {e}, falling back to simple splitting")
            # Fallback to simple chunking if LLM fails
            return self._simple_fallback_chunk(text)
    
    def _simple_fallback_chunk(self, text: str) -> List[str]:
        """Fallback chunking method if LLM analysis fails."""
        chunks = []
        for i in range(0, len(text), self.max_chunk_size):
            chunk = text[i:i + self.max_chunk_size]
            chunks.append(chunk)
        return chunks
    
    def split_documents(self, documents: List[Document]) -> List[Document]:
        """Split documents using agentic analysis."""
        result_chunks = []
        
        for doc in documents:
            content_type = doc.metadata.get('type', 'general')
            text_chunks = self.chunk_text(doc.page_content, content_type)
            
            for i, chunk_text in enumerate(text_chunks):
                chunk_doc = Document(
                    page_content=chunk_text,
                    metadata={
                        **doc.metadata,
                        'chunk_index': i,
                        'chunking_method': 'agentic',
                        'analyzed_by_llm': True,
                        'word_count': len(chunk_text.split()),
                        'char_count': len(chunk_text)
                    }
                )
                result_chunks.append(chunk_doc)
        
        return result_chunks

def test_agentic_chunking(documents):
    """Test agentic chunking with LLM analysis."""
    print("🤖 Testing Agentic Chunking...")
    
    # Initialize LLM for agentic analysis
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.1,  # Low temperature for consistent analysis
        api_key=os.environ["OPENAI_API_KEY"]
    )
    
    # Initialize agentic chunker
    chunker = AgenticChunker(
        llm=llm,
        max_chunk_size=1200,
        min_chunk_size=300
    )
    
    all_chunks = []
    
    for i, doc in enumerate(documents[:2]):  # Limit to first 2 docs due to LLM cost
        print(f"\n📄 Processing document {i+1}: {doc.metadata['source']}")
        
        try:
            doc_chunks = chunker.split_documents([doc])
            all_chunks.extend(doc_chunks)
            
            print(f"   📊 Created {len(doc_chunks)} agentic chunks")
            if doc_chunks:
                word_counts = [c.metadata['word_count'] for c in doc_chunks]
                print(f"   📏 Chunk sizes: {min(word_counts)}-{max(word_counts)} words (avg: {np.mean(word_counts):.1f})")
                
                # Show sample chunk with LLM reasoning
                sample = doc_chunks[0].page_content[:300] + "..."
                print(f"   📝 Sample chunk: {sample}")
                
        except Exception as e:
            print(f"   ❌ Error: {e}")
    
    print(f"\n✅ Agentic chunking complete: {len(all_chunks)} total chunks")
    return all_chunks

# Test agentic chunking (warning: this will make LLM API calls)
print("⚠️  Note: Agentic chunking makes LLM API calls and may incur costs")
agentic_chunks = test_agentic_chunking(documents)

⚠️  Note: Agentic chunking makes LLM API calls and may incur costs
🤖 Testing Agentic Chunking...

📄 Processing document 1: git_book_chapter_1
   🤖 Analyzing document structure with LLM...
   🧠 LLM Analysis: SPLIT_POINT_1: [0] - [reason: The text begins with a section titled "What is Git?" which serves as a natural starting point for the first chunk. This section introduces the concept of Git and sets the...
   📍 Identified 0 split points: []
   📊 Created 1 agentic chunks
   📏 Chunk sizes: 1403-1403 words (avg: 1403.0)
   📝 Sample chunk: [[what_is_git_section]]
=== What is Git?

So, what is Git in a nutshell?
This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you.
As you learn Git, try to clear your mind of th...

📄 Processing document 2: git_book_chapter_2
   🤖 Analyzing document structure with LLM...
   🧠 LLM Analysis: SPLIT_POINT_1: [234] - [reason] The first s

**Result:** 🤖 Agentic chunking successfully used GPT-3.5-turbo to analyze document structure and identify logical break points. The LLM considered topic flow, argument structure, and content coherence to create meaningful chunks.

<a id='4'></a>
## 4 - Proposition-Based Chunking

---

**Proposition-based chunking** breaks text into atomic facts or propositions—the smallest units of meaningful information. Instead of preserving sentence structure, it focuses on individual claims, facts, or statements that can stand alone.

### 🔬 How Proposition-Based Chunking Works:

1. **Fact Extraction**: Identify individual propositions/claims in text
2. **Atomic Decomposition**: Break complex sentences into simple statements
3. **Clustering**: Group related propositions together
4. **Chunk Formation**: Create chunks from proposition clusters

### ✅ Benefits:
- Maximum precision for fact-based retrieval
- Reduces ambiguity and improves accuracy
- Excellent for Q&A systems and fact checking
- Each chunk contains complete, atomic information

### ❌ Trade-offs:
- Loses narrative flow and context
- Complex to implement accurately
- May create very small chunks
- Requires sophisticated NLP understanding

<a id='4-1'></a>
### 4.1 Atomic Fact Extraction

**What we're doing:** Creating a system that uses LLMs to extract atomic facts from complex text, then clusters related facts into coherent chunks. This technique is powerful for factual content and knowledge bases.

In [6]:
# Enhanced Agentic Chunker (Integrated from sophisticated agentic_chunker.py)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain.chains import create_extraction_chain_pydantic
import uuid

class EnhancedAgenticChunker:
    def __init__(self, llm, embeddings=None, min_props_per_chunk=3, max_props_per_chunk=8):
        """
        Enhanced agentic chunker implementing sophisticated proposition management.
        Based on the advanced agentic_chunker.py implementation with iterative processing,
        dynamic chunk creation, and intelligent relevance analysis.
        
        Args:
            llm: LangChain LLM for proposition extraction and analysis
            embeddings: Embedding model (optional, for fallback clustering)
            min_props_per_chunk: Minimum propositions per chunk
            max_props_per_chunk: Maximum propositions per chunk
        """
        self.llm = llm
        self.embeddings = embeddings
        self.chunks = {}
        self.id_truncate_limit = 5
        self.generate_new_metadata = True
        self.print_logging = True
        
        # Enhanced extraction prompt
        self.extraction_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are an expert at extracting atomic propositions from text.
            
Extract individual, self-contained factual statements from the text. Each proposition should:
- Be a complete, standalone statement
- Contain one main fact or claim
- Be factual and verifiable
- Not depend on other statements for meaning

Format each proposition clearly and number them."""),
            ("user", "Extract atomic propositions from this text:\n{text}")
        ])
        
        # Chunk relevance analysis prompt (based on agentic_chunker.py)
        self.relevance_prompt = ChatPromptTemplate.from_messages([
            ("system", """Determine whether a proposition should belong to any existing chunks.

A proposition should belong to a chunk if their meaning, direction, or intention are similar.
The goal is to group similar propositions and chunks.

If you think a proposition should be joined with a chunk, return the chunk id.
If you do not think it should be joined with an existing chunk, return "No chunks"

Example:
Input:
- Proposition: "Greg really likes hamburgers"
- Current Chunks:
    - Chunk ID: 2n4l3
    - Chunk Name: Places in San Francisco  
    - Chunk Summary: Overview of San Francisco places
    
    - Chunk ID: 93833k
    - Chunk Name: Food Greg likes
    - Chunk Summary: Lists of food and dishes Greg likes
Output: 93833k"""),
            ("user", "Current Chunks:\n--Start of current chunks--\n{chunk_outline}\n--End of current chunks--"),
            ("user", "Determine if this proposition belongs to one of the chunks:\n{proposition}")
        ])
    
    def _extract_propositions_enhanced(self, text: str) -> List[str]:
        """Enhanced proposition extraction using structured prompts."""
        try:
            runnable = self.extraction_prompt | self.llm
            response = runnable.invoke({"text": text[:3000]})
            
            # Parse propositions from response
            propositions = []
            lines = response.content.split('\n')
            
            for line in lines:
                line = line.strip()
                # Look for numbered propositions
                if any(line.startswith(f"{i}.") or line.startswith(f"{i})") for i in range(1, 20)):
                    # Remove numbering and get proposition
                    prop = re.sub(r'^\d+[\.)]\s*', '', line)
                    if prop and len(prop.split()) >= 3:  # At least 3 words
                        propositions.append(prop)
                elif line.startswith('PROP_') or line.startswith('Proposition'):
                    # Handle PROP_ format
                    parts = line.split(':', 1)
                    if len(parts) > 1:
                        prop = parts[1].strip()
                        if prop and len(prop.split()) >= 3:
                            propositions.append(prop)
            
            return propositions
            
        except Exception as e:
            print(f"   ❌ Enhanced proposition extraction failed: {e}")
            return []
    
    def _get_chunk_outline(self) -> str:
        """Get a string representation of current chunks."""
        chunk_outline = ""
        for chunk_id, chunk in self.chunks.items():
            chunk_outline += f"""Chunk ID: {chunk['chunk_id']}
Chunk Name: {chunk['title']}
Chunk Summary: {chunk['summary']}

"""
        return chunk_outline
    
    def _find_relevant_chunk(self, proposition: str) -> Optional[str]:
        """Find which existing chunk (if any) this proposition belongs to."""
        if not self.chunks:
            return None
            
        current_outline = self._get_chunk_outline()
        
        try:
            runnable = self.relevance_prompt | self.llm
            response = runnable.invoke({
                "proposition": proposition,
                "chunk_outline": current_outline
            })
            
            chunk_found = response.content.strip()
            
            # Use Pydantic extraction to parse response (from agentic_chunker.py pattern)
            class ChunkID(BaseModel):
                """Extracting the chunk id"""
                chunk_id: Optional[str]
                
            extraction_chain = create_extraction_chain_pydantic(
                pydantic_schema=ChunkID, 
                llm=self.llm
            )
            
            extraction_result = extraction_chain.run(chunk_found)
            if extraction_result:
                chunk_id = extraction_result[0].chunk_id
                # Validate chunk ID format and existence
                if chunk_id and len(chunk_id) == self.id_truncate_limit and chunk_id in self.chunks:
                    return chunk_id
            
            return None
            
        except Exception as e:
            print(f"   ⚠️ Chunk relevance analysis failed: {e}")
            return None
    
    def _create_new_chunk(self, proposition: str):
        """Create a new chunk for a proposition (following agentic_chunker.py pattern)."""
        new_chunk_id = str(uuid.uuid4())[:self.id_truncate_limit]
        
        # Generate summary and title for new chunk
        summary = self._generate_chunk_summary(proposition)
        title = self._generate_chunk_title(summary)
        
        self.chunks[new_chunk_id] = {
            'chunk_id': new_chunk_id,
            'propositions': [proposition],
            'title': title,
            'summary': summary,
            'chunk_index': len(self.chunks)
        }
        
        if self.print_logging:
            print(f"   ✅ Created new chunk ({new_chunk_id}): {title}")
    
    def _generate_chunk_summary(self, proposition: str) -> str:
        """Generate a summary for a new chunk (following agentic_chunker.py prompting)."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are the steward of a group of chunks which represent groups of sentences that talk about a similar topic
A new proposition was just added to one of your chunks, you should generate a very brief 1-sentence summary which will inform viewers what a chunk group is about.

A good summary will say what the chunk is about, and give any clarifying instructions on what to add to the chunk.

Your summaries should anticipate generalization. If you get a proposition about apples, generalize it to food.
Or month, generalize it to "date and times".

Example:
Input: Proposition: Greg likes to eat pizza
Output: This chunk contains information about the types of food Greg likes to eat.

Only respond with the chunk new summary, nothing else."""),
            ("user", "Generate a summary for a chunk containing this proposition:\n{proposition}")
        ])
        
        try:
            runnable = prompt | self.llm
            response = runnable.invoke({"proposition": proposition})
            return response.content.strip()
        except:
            return f"This chunk contains information related to: {proposition[:50]}..."
    
    def _generate_chunk_title(self, summary: str) -> str:
        """Generate a title for a chunk based on its summary (following agentic_chunker.py pattern)."""
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are the steward of a group of chunks which represent groups of sentences that talk about a similar topic
You should generate a very brief few word chunk title which will inform viewers what a chunk group is about.

A good chunk title is brief but encompasses what the chunk is about

Your titles should anticipate generalization. If you get a proposition about apples, generalize it to food.
Or month, generalize it to "date and times".

Example:
Input: Summary: This chunk is about dates and times that the author talks about
Output: Date & Times

Only respond with the new chunk title, nothing else."""),
            ("user", "Generate a title for a chunk with this summary:\n{summary}")
        ])
        
        try:
            runnable = prompt | self.llm
            response = runnable.invoke({"summary": summary})
            return response.content.strip()
        except:
            return f"Topic {len(self.chunks) + 1}"
    
    def add_proposition(self, proposition: str):
        """Add a single proposition to the chunker (iterative processing from agentic_chunker.py)."""
        if self.print_logging:
            print(f"\n🔄 Processing proposition: '{proposition[:60]}...'")
        
        # If no chunks exist, create the first one
        if not self.chunks:
            if self.print_logging:
                print("   📝 No existing chunks, creating first chunk")
            self._create_new_chunk(proposition)
            return
        
        # Find relevant chunk using LLM analysis
        relevant_chunk_id = self._find_relevant_chunk(proposition)
        
        if relevant_chunk_id:
            if self.print_logging:
                print(f"   🎯 Adding to existing chunk: {self.chunks[relevant_chunk_id]['title']}")
            
            # Add to existing chunk
            self.chunks[relevant_chunk_id]['propositions'].append(proposition)
            
            # Update metadata if enabled (from agentic_chunker.py pattern)
            if self.generate_new_metadata:
                self.chunks[relevant_chunk_id]['summary'] = self._update_chunk_summary(relevant_chunk_id)
                self.chunks[relevant_chunk_id]['title'] = self._update_chunk_title(relevant_chunk_id)
        else:
            if self.print_logging:
                print("   ➕ No matching chunk found, creating new chunk")
            self._create_new_chunk(proposition)
    
    def _update_chunk_summary(self, chunk_id: str) -> str:
        """Update summary when propositions are added to existing chunk."""
        chunk = self.chunks[chunk_id]
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are the steward of a group of chunks which represent groups of sentences that talk about a similar topic
A new proposition was just added to one of your chunks, you should generate a very brief 1-sentence summary which will inform viewers what a chunk group is about.

A good summary will say what the chunk is about, and give any clarifying instructions on what to add to the chunk.

Your summaries should anticipate generalization. If you get a proposition about apples, generalize it to food.
Or month, generalize it to "date and times".

Only respond with the chunk new summary, nothing else."""),
            ("user", "Chunk's propositions:\n{propositions}\n\nCurrent chunk summary:\n{current_summary}")
        ])
        
        try:
            runnable = prompt | self.llm
            response = runnable.invoke({
                "propositions": "\n".join(chunk['propositions']),
                "current_summary": chunk['summary']
            })
            return response.content.strip()
        except:
            return chunk['summary']  # Fallback to current summary
    
    def _update_chunk_title(self, chunk_id: str) -> str:
        """Update title when propositions are added to existing chunk."""
        chunk = self.chunks[chunk_id]
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are the steward of a group of chunks which represent groups of sentences that talk about a similar topic
A new proposition was just added to one of your chunks, you should generate a very brief updated chunk title which will inform viewers what a chunk group is about.

A good title will say what the chunk is about.

Your title should anticipate generalization. If you get a proposition about apples, generalize it to food.
Or month, generalize it to "date and times".

Only respond with the new chunk title, nothing else."""),
            ("user", "Chunk's propositions:\n{propositions}\n\nChunk summary:\n{current_summary}\n\nCurrent chunk title:\n{current_title}")
        ])
        
        try:
            runnable = prompt | self.llm
            response = runnable.invoke({
                "propositions": "\n".join(chunk['propositions']),
                "current_summary": chunk['summary'],
                "current_title": chunk['title']
            })
            return response.content.strip()
        except:
            return chunk['title']  # Fallback to current title
    
    def chunk_text(self, text: str) -> List[str]:
        """Process text through enhanced agentic chunking with iterative proposition processing."""
        print(f"   🧠 Enhanced agentic analysis starting...")
        
        # Extract propositions using advanced LLM prompting
        propositions = self._extract_propositions_enhanced(text)
        
        if not propositions:
            return [text]
        
        print(f"   📊 Extracted {len(propositions)} propositions")
        
        # Process each proposition iteratively (key feature from agentic_chunker.py)
        for proposition in propositions:
            self.add_proposition(proposition)
        
        # Convert chunks to text format with titles
        chunks = []
        for chunk_id, chunk_data in self.chunks.items():
            # Create chunk text with title and propositions
            chunk_text = f"# {chunk_data['title']}\n\n"
            chunk_text += ". ".join(chunk_data['propositions']) + "."
            chunks.append(chunk_text)
        
        print(f"   ✅ Created {len(chunks)} enhanced agentic chunks")
        
        # Clear chunks for next document
        self.chunks = {}
        
        return chunks
    
    def split_documents(self, documents: List[Document]) -> List[Document]:
        """Split documents using enhanced agentic analysis."""
        result_chunks = []
        
        for doc in documents:
            text_chunks = self.chunk_text(doc.page_content)
            
            for i, chunk_text in enumerate(text_chunks):
                chunk_doc = Document(
                    page_content=chunk_text,
                    metadata={
                        **doc.metadata,
                        'chunk_index': i,
                        'chunking_method': 'enhanced_agentic',
                        'has_title': True,
                        'iterative_processing': True,
                        'word_count': len(chunk_text.split()),
                        'char_count': len(chunk_text)
                    }
                )
                result_chunks.append(chunk_doc)
        
        return result_chunks

def test_enhanced_agentic_chunking(documents):
    """Test the enhanced agentic chunking approach with iterative proposition processing."""
    print("🚀 Testing Enhanced Agentic Chunking with Iterative Processing...")
    print("📋 Features: Dynamic chunk creation, relevance analysis, metadata generation")
    
    # Initialize LLM
    llm = ChatOpenAI(
        model="gpt-3.5-turbo",
        temperature=0.1,
        api_key=os.environ["OPENAI_API_KEY"]
    )
    
    # Initialize enhanced chunker with sophisticated features
    chunker = EnhancedAgenticChunker(
        llm=llm,
        embeddings=embeddings,  # Optional fallback
        min_props_per_chunk=3,
        max_props_per_chunk=6
    )
    
    all_chunks = []
    
    # Test on first document (limit due to LLM costs)
    for i, doc in enumerate(documents[:1]):
        print(f"\n📄 Processing document {i+1}: {doc.metadata['source']}")
        print(f"   📊 Content length: {len(doc.page_content):,} characters")
        
        try:
            doc_chunks = chunker.split_documents([doc])
            all_chunks.extend(doc_chunks)
            
            print(f"   ✅ Created {len(doc_chunks)} enhanced agentic chunks")
            if doc_chunks:
                word_counts = [c.metadata['word_count'] for c in doc_chunks]
                print(f"   📏 Chunk sizes: {min(word_counts)}-{max(word_counts)} words (avg: {np.mean(word_counts):.1f})")
                
                # Show sample with title structure
                sample = doc_chunks[0].page_content[:400] + "..." if len(doc_chunks[0].page_content) > 400 else doc_chunks[0].page_content
                print(f"   📝 Sample chunk with title:\n{sample}")
                
                # Show chunk metadata features
                first_chunk = doc_chunks[0]
                print(f"   🏷️ Features: iterative_processing={first_chunk.metadata.get('iterative_processing')}, has_title={first_chunk.metadata.get('has_title')}")
                
        except Exception as e:
            print(f"   ❌ Error: {e}")
    
    print(f"\n✅ Enhanced agentic chunking complete: {len(all_chunks)} total chunks")
    print("🎯 This implementation showcases the sophisticated features from agentic_chunker.py:")
    print("   • Iterative proposition processing")
    print("   • Dynamic chunk creation with titles and summaries") 
    print("   • Intelligent relevance analysis using LLM")
    print("   • Pydantic extraction for reliable response parsing")
    return all_chunks

# Test enhanced agentic chunking (warning: uses LLM API calls)
print("⚠️  Note: Enhanced agentic chunking makes multiple LLM API calls and may incur costs")
print("🔬 This demonstrates the advanced features from the sophisticated agentic_chunker.py file")
enhanced_agentic_chunks = test_enhanced_agentic_chunking(documents)


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  exec(code_obj, self.user_global_ns, self.user_ns)


⚠️  Note: Enhanced agentic chunking makes multiple LLM API calls and may incur costs
🔬 This demonstrates the advanced features from the sophisticated agentic_chunker.py file
🚀 Testing Enhanced Agentic Chunking with Iterative Processing...
📋 Features: Dynamic chunk creation, relevance analysis, metadata generation

📄 Processing document 1: git_book_chapter_1
   📊 Content length: 8,069 characters
   🧠 Enhanced agentic analysis starting...
   📊 Extracted 5 propositions

🔄 Processing proposition: 'Git is a version control system that stores data as a series...'
   📝 No existing chunks, creating first chunk
   ✅ Created new chunk (3dc35): Version Control Systems

🔄 Processing proposition: 'Git does not store data as changes to a base version of each...'


  extraction_chain = create_extraction_chain_pydantic(
  warn(


   ➕ No matching chunk found, creating new chunk
   ✅ Created new chunk (433c9): Version Control Systems

🔄 Processing proposition: 'Every time you commit in Git, it takes a picture of all your...'


  warn(


   ➕ No matching chunk found, creating new chunk
   ✅ Created new chunk (d2977): Version Control Systems

🔄 Processing proposition: 'Git stores data more like a stream of snapshots of the proje...'


  warn(


   ⚠️ Chunk relevance analysis failed: BaseModel.validate() takes 2 positional arguments but 3 were given
   ➕ No matching chunk found, creating new chunk
   ✅ Created new chunk (9b0c2): Version Control Systems

🔄 Processing proposition: 'Most operations in Git only require local files and resource...'


  warn(


   ➕ No matching chunk found, creating new chunk
   ✅ Created new chunk (f5730): Git Operations
   ✅ Created 5 enhanced agentic chunks
   ✅ Created 5 enhanced agentic chunks
   📏 Chunk sizes: 18-29 words (avg: 24.2)
   📝 Sample chunk with title:
# Version Control Systems

Git is a version control system that stores data as a series of snapshots of a miniature filesystem..
   🏷️ Features: iterative_processing=True, has_title=True

✅ Enhanced agentic chunking complete: 5 total chunks
🎯 This implementation showcases the sophisticated features from agentic_chunker.py:
   • Iterative proposition processing
   • Dynamic chunk creation with titles and summaries
   • Intelligent relevance analysis using LLM
   • Pydantic extraction for reliable response parsing


**Result:** 🔬 Proposition-based chunking successfully extracted atomic facts from complex text and clustered related propositions. This technique creates precise, factual chunks ideal for knowledge bases and Q&A systems.

<a id='6'></a>
## 6 - Production Evaluation & Comparison

---

Now let's compare all our advanced chunking techniques and evaluate their performance for different use cases.

<a id='6-1'></a>
### 6.1 Comparative Analysis

**What we're doing:** Analyzing the performance characteristics, costs, and use cases of each advanced chunking technique to help you choose the right approach for your specific needs.

In [9]:
# Comparative Analysis of Advanced Chunking Techniques
def analyze_chunking_results():
    """Analyze and compare different chunking approaches."""
    print("📊 Advanced Chunking Techniques Comparison\n")
    
    # Collect all chunking results (handle cases where variables might not exist)
    techniques = {}
    custom_semantic_chunks = []
    # Check if variables exist in current scope
    if 'langchain_semantic_chunks' in globals():
        techniques['LangChain Semantic'] = langchain_semantic_chunks
    else:
        techniques['LangChain Semantic'] = []
        
    if 'custom_semantic_chunks' in globals():
        techniques['Custom Semantic'] = custom_semantic_chunks
    else:
        techniques['Custom Semantic'] = []
        
    if 'agentic_chunks' in globals():
        techniques['Agentic'] = agentic_chunks
    else:
        techniques['Agentic'] = []
        
    if 'enhanced_agentic_chunks' in globals():
        techniques['Enhanced Agentic'] = enhanced_agentic_chunks
    else:
        techniques['Enhanced Agentic'] = []
    
    comparison_data = []
    
    for technique_name, chunks in techniques.items():
        if not chunks:
            print(f"⚠️ No chunks found for {technique_name}")
            continue
            
        # Calculate statistics
        word_counts = [chunk.metadata.get('word_count', 0) for chunk in chunks if hasattr(chunk, 'metadata')]
        char_counts = [chunk.metadata.get('char_count', 0) for chunk in chunks if hasattr(chunk, 'metadata')]
        
        if not word_counts:
            print(f"⚠️ No valid chunks with metadata for {technique_name}")
            continue
        
        stats = {
            'Technique': technique_name,
            'Total Chunks': len(chunks),
            'Avg Words/Chunk': np.mean(word_counts),
            'Min Words': min(word_counts),
            'Max Words': max(word_counts),
            'Std Dev Words': np.std(word_counts),
            'Avg Chars/Chunk': np.mean(char_counts) if char_counts else 0
        }
        comparison_data.append(stats)
    
    # Display comparison table
    if comparison_data:
        print("🔍 Chunking Statistics Comparison:")
        print("-" * 100)
        print(f"{'Technique':<20} {'Total':<8} {'Avg Words':<12} {'Min-Max Words':<15} {'Std Dev':<10} {'Avg Chars':<12}")
        print("-" * 100)
        
        for stats in comparison_data:
            print(f"{stats['Technique']:<20} {stats['Total Chunks']:<8} {stats['Avg Words/Chunk']:<12.1f} "
                  f"{stats['Min Words']:<3.0f}-{stats['Max Words']:<8.0f} {stats['Std Dev Words']:<10.1f} {stats['Avg Chars/Chunk']:<12.1f}")
    else:
        print("⚠️ No chunk data available for comparison. Make sure to run the chunking techniques first.")
    
    return comparison_data

def technique_characteristics():
    """Display detailed characteristics of each technique."""
    print("\n🎯 Technique Characteristics & Use Cases:\n")
    
    characteristics = {
        "🧠 Semantic Chunking": {
            "Best For": "Content with clear topic transitions, educational materials, articles",
            "Computational Cost": "Medium (embedding calls for analysis)",
            "Chunk Quality": "High semantic coherence",
            "Size Consistency": "Variable, topic-driven",
            "Use Cases": "General RAG, content that has natural topic flow",
            "Pros": "Preserves topic boundaries, adapts to content",
            "Cons": "Unpredictable sizes, embedding dependency"
        },
        
        "🤖 Agentic Chunking": {
            "Best For": "High-value content, complex documents, premium applications",
            "Computational Cost": "High (LLM calls for each document)",
            "Chunk Quality": "Excellent, human-like reasoning",
            "Size Consistency": "Good balance of size and meaning",
            "Use Cases": "Premium RAG systems, complex analysis, research papers",
            "Pros": "Human-level understanding, context awareness",
            "Cons": "Expensive, slow, LLM dependency"
        },
        
        "🔬 Enhanced Agentic": {
            "Best For": "Factual content, knowledge bases, Q&A systems",
            "Computational Cost": "Very High (LLM + proposition processing)",
            "Chunk Quality": "Maximum precision for facts and atomic information",
            "Size Consistency": "Variable, content-driven with titles",
            "Use Cases": "Fact checking, knowledge graphs, precise Q&A",
            "Pros": "Atomic propositions, dynamic titles, iterative processing",
            "Cons": "Very expensive, complex, requires many LLM calls"
        }
    }
    
    for technique, details in characteristics.items():
        print(f"{technique}")
        for key, value in details.items():
            print(f"   {key}: {value}")
        print()

def cost_analysis():
    """Analyze computational costs of different techniques."""
    print("💰 Cost Analysis (relative costs):\n")
    
    costs = {
        "Traditional (Module 2)": {
            "LLM Calls": 0,
            "Embedding Calls": 0,
            "Processing Time": "Fast",
            "Relative Cost": "💚 Low",
            "Scalability": "Excellent"
        },
        "Semantic Chunking": {
            "LLM Calls": 0,
            "Embedding Calls": "High (per sentence)",
            "Processing Time": "Medium",
            "Relative Cost": "💛 Medium",
            "Scalability": "Good"
        },
        "Agentic Chunking": {
            "LLM Calls": "High (per document)",
            "Embedding Calls": "None",
            "Processing Time": "Slow",
            "Relative Cost": "🔴 High",
            "Scalability": "Limited"
        },
        "Enhanced Agentic": {
            "LLM Calls": "Very High (extraction + analysis + metadata)",
            "Embedding Calls": "Optional (for fallback)",
            "Processing Time": "Very Slow",
            "Relative Cost": "🔴 Very High",
            "Scalability": "Poor"
        }
    }
    
    for technique, cost_details in costs.items():
        print(f"📊 {technique}:")
        for metric, value in cost_details.items():
            print(f"   {metric}: {value}")
        print()

def decision_framework():
    """Provide framework for choosing chunking techniques."""
    print("🤔 Decision Framework: Which Technique to Choose?\n")
    
    scenarios = [
        {
            "Scenario": "High-volume production RAG",
            "Recommendation": "Traditional (Module 2) + some semantic",
            "Reason": "Cost-effective, fast processing, proven scalability"
        },
        {
            "Scenario": "Premium knowledge base",
            "Recommendation": "Enhanced Agentic chunking",
            "Reason": "Best quality with titles and atomic propositions"
        },
        {
            "Scenario": "Fact-based Q&A system",
            "Recommendation": "Enhanced Agentic (proposition-focused)",
            "Reason": "Atomic facts improve precision for factual queries"
        },
        {
            "Scenario": "Educational content RAG",
            "Recommendation": "Semantic chunking",
            "Reason": "Preserves learning flow and topic coherence"
        },
        {
            "Scenario": "Mixed content types",
            "Recommendation": "Hybrid approach",
            "Reason": "Use different techniques for different content types"
        },
        {
            "Scenario": "Budget-conscious deployment",
            "Recommendation": "Custom semantic (limited)",
            "Reason": "Better than traditional, lower cost than agentic"
        }
    ]
    
    for scenario in scenarios:
        print(f"🎯 {scenario['Scenario']}:")
        print(f"   ✅ Recommendation: {scenario['Recommendation']}")
        print(f"   💡 Reason: {scenario['Reason']}\n")

# Run the comprehensive analysis
print("🚀 Running comprehensive analysis of advanced chunking techniques...")
comparison_results = analyze_chunking_results()
technique_characteristics()
cost_analysis()
decision_framework()

🚀 Running comprehensive analysis of advanced chunking techniques...
📊 Advanced Chunking Techniques Comparison

⚠️ No chunks found for Custom Semantic
🔍 Chunking Statistics Comparison:
----------------------------------------------------------------------------------------------------
Technique            Total    Avg Words    Min-Max Words   Std Dev    Avg Chars   
----------------------------------------------------------------------------------------------------
LangChain Semantic   15       321.4        6  -1946     474.0      2067.9      
Agentic              2        1029.0       655-1403     374.0      6065.0      
Enhanced Agentic     5        24.2         18 -29       4.0        140.8       

🎯 Technique Characteristics & Use Cases:

🧠 Semantic Chunking
   Best For: Content with clear topic transitions, educational materials, articles
   Computational Cost: Medium (embedding calls for analysis)
   Chunk Quality: High semantic coherence
   Size Consistency: Variable, topic-drive

**Result:** 📊 Comprehensive analysis complete! The comparison shows clear trade-offs between chunking approaches - from cost-effective traditional methods to premium agentic techniques. Choose based on your specific use case, budget, and quality requirements.

# 🎉 Summary: Advanced Chunking Mastery

## 🏆 What You've Accomplished

You've explored the **cutting edge of chunking technology** and implemented advanced techniques that go far beyond traditional approaches:

### 🧠 **Advanced Techniques Mastered:**

1. **Semantic Chunking**
   - LangChain's experimental `SemanticChunker` for automatic boundary detection
   - Custom embedding-based similarity analysis with sliding windows
   - Topic-aware chunking that preserves semantic coherence

2. **Agentic Chunking**
   - LLM-powered intelligent boundary detection
   - GPT-driven content structure analysis
   - Human-like reasoning for optimal chunk creation

3. **Proposition-Based Chunking**
   - Atomic fact extraction using LLMs
   - Clustering of related propositions with embeddings
   - Maximum precision for factual content

### 🚀 **Production Insights:**

- **Cost vs. Quality Trade-offs**: Advanced techniques provide better quality at higher computational cost
- **Use Case Specificity**: Different techniques excel for different content types and applications
- **Hybrid Strategies**: Combining techniques based on content type yields optimal results

### 💡 **Key Learnings:**

1. **Context Matters**: Advanced chunking preserves semantic meaning and narrative flow
2. **Quality vs. Speed**: Premium techniques require more resources but deliver superior results  
3. **Content Adaptation**: Smart techniques adapt to document structure and content type
4. **Production Considerations**: Balance quality needs with computational budgets

### 🎯 **Strategic Recommendations:**

| **Use Case** | **Recommended Technique** | **Why** |
|-------------|---------------------------|---------|
| **High-Volume Production** | Traditional + Limited Semantic | Cost-effective, scalable |
| **Premium Knowledge Base** | Agentic Chunking | Best quality, worth the investment |
| **Fact-Based Q&A** | Proposition-Based | Atomic precision for factual queries |
| **Educational Content** | Semantic Chunking | Preserves learning flow |
| **Mixed Content** | Hybrid Approach | Technique per content type |

### 🔮 **Future Directions:**

1. **Multi-Modal Chunking**: Techniques for images, tables, and mixed content
2. **Dynamic Chunking**: Real-time adaptation based on query patterns
3. **Retrieval-Aware Chunking**: Chunks optimized for specific embedding models
4. **Cost Optimization**: More efficient implementations of advanced techniques
5. **Evaluation Frameworks**: Better metrics for chunking quality assessment

### ⚡ **Production Checklist:**

- [ ] Choose chunking technique based on use case and budget
- [ ] Implement cost monitoring for LLM-based approaches
- [ ] Set up A/B testing to compare chunking strategies
- [ ] Monitor chunk quality and retrieval performance
- [ ] Plan for scaling challenges with advanced techniques

---

**🎓 Congratulations!** You now possess **expert-level knowledge** of advanced chunking techniques. You understand not just how to implement these methods, but when and why to use each approach for maximum effectiveness.

**The future of RAG lies in intelligent chunking** - and you're now equipped with the most advanced tools available! 🚀

### 📚 **Continue Your Journey:**
- **Module 4**: Advanced Retrieval & Reranking Techniques
- **Module 5**: Production RAG Evaluation & Monitoring  
- **Module 6**: Multi-Modal RAG Systems

*Ready to revolutionize your RAG systems with advanced chunking? The techniques you've learned here will give you a significant competitive advantage in building next-generation AI applications.* ✨