# 📄 Document Ingestion & Processing in RAG Systems

> **Educational Notebook 22**: Deep dive into document ingestion, parsing, chunking, and indexing workflows

---

## 📋 Learning Objectives

By the end of this notebook, you will:

1. Understand the complete document ingestion pipeline in RAG systems
2. Learn about different document parsers and their capabilities
3. Explore various chunking strategies and their trade-offs
4. Implement a custom chunking algorithm from scratch
5. Understand multi-modal processing (text, images, tables)
6. Learn about deduplication techniques
7. Appreciate the production considerations for document processing

## 🎯 Why Document Processing Matters

Document processing is the foundation of any RAG system. Poor document processing leads to:
- Low-quality chunks that don't answer user questions effectively
- Information loss due to inappropriate chunking
- High latency during retrieval
- Inconsistent retrieval results

Let's dive into the complete workflow!

## 🛠️ Setting Up Our Environment

First, let's set up our environment and import the necessary libraries:

In [1]:
import os
import sys
from pathlib import Path
import asyncio
import hashlib
import time
from typing import List, Dict, Any, Optional
import json

# Add project root to path
current = Path().resolve()
repo_root = None
for parent in [current, *current.parents]:
    if (parent / "src").exists() and (parent / "notebooks").exists():
        repo_root = parent
        break

if repo_root is None:
    raise RuntimeError("Could not locate rag-engine-mini root for imports")

sys.path.insert(0, str(repo_root))

# Import required modules
import fitz  # PyMuPDF
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

print(f"Repository root: {repo_root}")
print("Environment setup complete!")

## 📚 Understanding Document Types & Parsing

Different document formats require different parsing strategies. Let's explore the key formats and their characteristics:

In [2]:
# Define supported document types
document_types = {
    "PDF": {
        "extensions": [".pdf"],
        "characteristics": [
            "Can contain text, images, tables",
            "May have complex layouts",
            "Requires specialized parsing (PyMuPDF, pdfminer)",
            "OCR may be needed for scanned documents"
        ],
        "parser": "PyMuPDF (fitz)"
    },
    "Word": {
        "extensions": [".docx", ".doc"],
        "characteristics": [
            "Structured text with formatting",
            "Can contain embedded images",
            "Requires python-docx for parsing (.docx only; convert legacy .doc files first)"
        ],
        "parser": "python-docx"
    },
    "PowerPoint": {
        "extensions": [".pptx", ".ppt"],
        "characteristics": [
            "Slide-based content",
            "Often contains images and charts",
            "Requires python-pptx for parsing (.pptx only; convert legacy .ppt files first)"
        ],
        "parser": "python-pptx"
    },
    "Excel": {
        "extensions": [".xlsx", ".xls", ".csv"],
        "characteristics": [
            "Tabular data",
            "Requires pandas or openpyxl for parsing",
            "Need special handling for table structures"
        ],
        "parser": "pandas/openpyxl"
    },
    "Plain Text": {
        "extensions": [".txt", ".md", ".rst"],
        "characteristics": [
            "Simple text content",
            "No complex layout",
            "Fast parsing"
        ],
        "parser": "Built-in file reading"
    }
}

# Display the document types information
print("Supported Document Types in RAG Systems:")
print("=" * 50)
for doc_type, info in document_types.items():
    print(f"\n{doc_type} ({', '.join(info['extensions'])}):")
    print(f"  Parser: {info['parser']}")
    print("  Characteristics:")
    for char in info['characteristics']:
        print(f"    - {char}")
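
The table above can drive parser selection. The following is a minimal sketch, assuming a simple lookup by file extension; `PARSER_BY_EXTENSION` and `pick_parser` are hypothetical names introduced for illustration, and the values are labels rather than real parser objects. Legacy `.doc` and `.ppt` are left out because python-docx and python-pptx read only the XML-based formats.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a parser label; the labels
# mirror the document_types table and are not a real parser registry.
PARSER_BY_EXTENSION = {
    ".pdf": "PyMuPDF (fitz)",
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".xlsx": "pandas/openpyxl",
    ".csv": "pandas",
    ".txt": "Built-in file reading",
    ".md": "Built-in file reading",
    ".rst": "Built-in file reading",
}


def pick_parser(filename: str) -> str:
    """Return the parser label for a file, based on its lowercased extension."""
    extension = Path(filename).suffix.lower()
    if extension not in PARSER_BY_EXTENSION:
        raise ValueError(f"Unsupported document type: {filename!r}")
    return PARSER_BY_EXTENSION[extension]


for name in ["report.PDF", "notes.md", "slides.pptx"]:
    print(f"{name} -> {pick_parser(name)}")
```

Failing loudly on unknown extensions is a deliberate choice here: silently treating an unsupported file as plain text tends to push binary garbage into the index.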

## 🧩 Understanding Chunking Strategies

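How you chunk determines what the retriever can return: chunks that are too large dilute relevance, and chunks that are too small lose context. Let's explore different strategies and their implications: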

In [3]:
# Define chunking strategies
chunking_strategies = {
    "Fixed Size": {
        "description": "Split documents into fixed-size chunks based on token count or character length",
        "pros": [
            "Simple to implement",
            "Consistent chunk sizes",
            "Predictable performance"
        ],
        "cons": [
            "May split related content",
            "Ignores context boundaries",
            "Information fragmentation"
        ],
        "use_case": "General-purpose RAG with diverse document types"
    },
    "Semantic": {
        "description": "Split documents based on semantic boundaries (sentences, paragraphs, topics)",
        "pros": [
            "Preserves semantic coherence",
            "Better context preservation",
            "More meaningful chunks"
        ],
        "cons": [
            "Variable chunk sizes",
            "Requires NLP processing",
            "Higher computational cost"
        ],
        "use_case": "Documents with clear semantic structure (articles, books)"
    },
    "Hierarchical": {
        "description": "Create multiple levels of chunks (sections, subsections, paragraphs)",
        "pros": [
            "Multiple resolution levels",
            "Flexible retrieval",
            "Can switch between summary-level and detail-level context"
        ],
        "cons": [
            "Complex implementation",
            "Storage overhead",
            "Indexing complexity"
        ],
        "use_case": "Long documents requiring both detailed and summary retrieval"
    },
    "Sliding Window": {
        "description": "Create overlapping chunks to preserve context across boundaries",
        "pros": [
            "Preserves boundary context",
            "Reduces information loss",
            "Better for QA tasks"
        ],
        "cons": [
            "Increased storage",
            "Potential redundancy",
            "Higher retrieval costs"
        ],
        "use_case": "Question answering systems requiring precise context"
    }
}

# Display chunking strategies
print("Chunking Strategies in RAG Systems:")
print("=" * 60)
for strategy, info in chunking_strategies.items():
    print(f"\n{strategy}: {info['description']}")
    print("  Pros:")
    for pro in info['pros']:
        print(f"    ✓ {pro}")
    print("  Cons:")
    for con in info['cons']:
        print(f"    ✗ {con}")
    print(f"  Use Case: {info['use_case']}")
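
To make the hierarchical strategy concrete, here is a minimal sketch that builds two levels of chunks and links each paragraph to its section. It assumes, purely for illustration, that a section starts at a line beginning with `# ` and that paragraphs are separated by blank lines; `hierarchical_chunks` is a hypothetical helper, not part of the RAG engine.

```python
from typing import Dict, List


def hierarchical_chunks(text: str) -> List[Dict[str, str]]:
    """Build two levels of chunks: one per section, one per paragraph.

    Assumption: a section starts at a line beginning with '# ', and
    paragraphs inside a section are separated by blank lines.
    """
    sections: List[List[str]] = []
    for line in text.strip().splitlines():
        # Open a new section on a heading, or for text before any heading
        if line.startswith("# ") or not sections:
            sections.append([])
        sections[-1].append(line)

    chunks: List[Dict[str, str]] = []
    for s_index, lines in enumerate(sections):
        section_id = f"sec_{s_index}"
        section_text = "\n".join(lines).strip()
        chunks.append({"id": section_id, "level": "section",
                       "parent": "", "text": section_text})
        paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
        for p_index, paragraph in enumerate(paragraphs):
            # Each paragraph chunk keeps a pointer to its parent section
            chunks.append({"id": f"{section_id}_p{p_index}", "level": "paragraph",
                           "parent": section_id, "text": paragraph})
    return chunks


doc = "# Parsing\nPDFs need a layout-aware parser.\n\nScanned pages need OCR.\n\n# Chunking\nRespect paragraph boundaries."
for chunk in hierarchical_chunks(doc):
    print(chunk["id"], chunk["level"], "parent:", chunk["parent"] or "-")
```

With the `parent` link, a retriever can match on a small paragraph chunk and then hand the larger section to the LLM, which is where the storage overhead listed under the cons comes from.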

## 🔧 Implementing a Custom Chunking Algorithm

Let's implement a token-aware chunking algorithm from scratch to understand the mechanics. To stay dependency-free, it approximates a token as one whitespace-separated word:

In [4]:
class TokenAwareChunker:
    """
    A token-aware chunker that splits text into chunks considering token limits.
    This is a simplified implementation similar to what's used in the RAG engine.
    """
    
    def __init__(self, max_tokens: int = 512, overlap: int = 50):
        """
        Initialize the chunker.
        
        Args:
            max_tokens: Maximum tokens per chunk
            overlap: Number of overlapping tokens between chunks
                (stored, but not applied by this simplified chunk_text)
        """
        self.max_tokens = max_tokens
        self.overlap = overlap
        # For simplicity, we'll approximate tokens as words
        # In production, use proper tokenizers like tiktoken or transformers
        
    def estimate_tokens(self, text: str) -> int:
        """
        Estimate the number of tokens in a text.
        This is a simple approximation - in practice, use proper tokenizers.
        """
        # Remove extra whitespace and count words
        words = text.strip().split()
        return len(words)
    
    def chunk_text(self, text: str) -> List[Dict[str, Any]]:
        """
        Split text into chunks respecting token limits.
        
        Args:
            text: Input text to chunk
            
        Returns:
            List of chunks with metadata
        """
        # First, split by paragraphs to respect semantic boundaries
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        
        chunks = []
        current_chunk = ""
        current_tokens = 0
        chunk_id = 0
        
        for para in paragraphs:
            para_tokens = self.estimate_tokens(para)
            
            # If paragraph alone exceeds max tokens, split it further
            if para_tokens > self.max_tokens:
                # Add current chunk if it's not empty
                if current_chunk:
                    chunks.append({
                        "id": f"chunk_{chunk_id}",
                        "text": current_chunk.strip(),
                        "tokens": current_tokens,
                        "type": "continuation"
                    })
                    chunk_id += 1
                    current_chunk = ""
                    current_tokens = 0
                
                # Now split the long paragraph into sentences
                sentences = self._split_into_sentences(para)
                temp_chunk = ""
                temp_tokens = 0
                
                for sent in sentences:
                    sent_tokens = self.estimate_tokens(sent)
                    
                    if temp_tokens + sent_tokens <= self.max_tokens:
                        temp_chunk += sent + " "
                        temp_tokens += sent_tokens
                    else:
                        if temp_chunk:
                            chunks.append({
                                "id": f"chunk_{chunk_id}",
                                "text": temp_chunk.strip(),
                                "tokens": temp_tokens,
                                "type": "sentence_split"
                            })
                            chunk_id += 1
                        
                        # Start new chunk with current sentence
                        temp_chunk = sent + " "
                        temp_tokens = sent_tokens
                
                # Add remaining text if any
                if temp_chunk:
                    chunks.append({
                        "id": f"chunk_{chunk_id}",
                        "text": temp_chunk.strip(),
                        "tokens": temp_tokens,
                        "type": "sentence_split_continuation"
                    })
                    chunk_id += 1
            
            # If adding this paragraph would exceed the limit, start a new chunk
            elif current_tokens + para_tokens > self.max_tokens:
                if current_chunk:
                    chunks.append({
                        "id": f"chunk_{chunk_id}",
                        "text": current_chunk.strip(),
                        "tokens": current_tokens,
                        "type": "paragraph_based"
                    })
                    chunk_id += 1
                
                # Start new chunk with this paragraph
                current_chunk = para + "\n\n"
                current_tokens = para_tokens
            
            # Otherwise, add to current chunk
            else:
                current_chunk += para + "\n\n"
                current_tokens += para_tokens
        
        # Add the last chunk if it has content
        if current_chunk:
            chunks.append({
                "id": f"chunk_{chunk_id}",
                "text": current_chunk.strip(),
                "tokens": current_tokens,
                "type": "final_chunk"
            })
        
        return chunks
    
    def _split_into_sentences(self, text: str) -> List[str]:
        """
        Simple sentence splitting (in practice, use NLTK or spaCy)
        """
        import re
        # Split on whitespace that follows sentence-ending punctuation (., !, ?)
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]

# Example usage
sample_text = """
Introduction to Machine Learning.

Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed. The concept isn't new - it dates back to the 1950s when Arthur Samuel coined the term. However, recent advances in computing power and data availability have made machine learning more practical and widespread than ever before.

There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training a model on labeled data, where the desired output is known. This is commonly used for classification and regression tasks. Unsupervised learning works with unlabeled data to discover hidden patterns or intrinsic structures in the data. Clustering and association are common unsupervised learning tasks.

Reinforcement learning is different from the other two. It involves an agent that learns to make decisions by performing actions in an environment to maximize cumulative reward. This type of learning is used in robotics, gaming, and navigation. The agent receives feedback in the form of rewards or penalties, learning through trial and error.

Deep learning, a subset of machine learning, uses neural networks with many layers to model complex patterns in data. It has revolutionized fields like computer vision, natural language processing, and speech recognition. The success of deep learning has led to breakthrough applications including autonomous vehicles, real-time translation, and sophisticated virtual assistants.
"""

print("Sample text for chunking:")
print(sample_text[:200] + "... [truncated]")
print(f"\nTotal estimated tokens: {TokenAwareChunker().estimate_tokens(sample_text)}")

# Create chunker and process text
chunker = TokenAwareChunker(max_tokens=75, overlap=10)
chunks = chunker.chunk_text(sample_text)

print(f"\nNumber of chunks created: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} (Type: {chunk['type']}, Tokens: {chunk['tokens']}):")
    print(f"  Preview: {chunk['text'][:80]}...")
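
The chunker stores `overlap` but `chunk_text` never reads it, so the chunks above do not share any text. One possible way to add overlap is a post-processing pass, sketched below. `apply_overlap` is a hypothetical helper, and it counts words as tokens to match the approximation used by the chunker.

```python
from typing import Any, Dict, List


def apply_overlap(chunks: List[Dict[str, Any]], overlap: int) -> List[Dict[str, Any]]:
    """Prefix each chunk (except the first) with the last `overlap` words
    of the previous chunk's original text. Words stand in for tokens."""
    if overlap <= 0:
        return [dict(chunk) for chunk in chunks]

    result: List[Dict[str, Any]] = []
    previous_words: List[str] = []
    for chunk in chunks:
        words = chunk["text"].split()
        carried = previous_words[-overlap:]
        new_chunk = dict(chunk)  # copy so the input list is left untouched
        if carried:
            new_chunk["text"] = " ".join(carried) + " " + chunk["text"]
            new_chunk["tokens"] = len(carried) + len(words)
        result.append(new_chunk)
        previous_words = words
    return result


demo = [
    {"id": "chunk_0", "text": "one two three four", "tokens": 4},
    {"id": "chunk_1", "text": "five six seven", "tokens": 3},
]
for chunk in apply_overlap(demo, overlap=2):
    print(chunk["id"], "|", chunk["text"], "|", chunk["tokens"])
```

Note that a chunk produced this way can exceed `max_tokens` by up to `overlap` words. If the limit is strict, one option is to construct the chunker with `max_tokens` reduced by the overlap.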

## 🧠 Multi-Modal Processing

Modern RAG systems need to handle more than just text. Let's explore multi-modal processing (text + images + tables):

In [5]:
# Simulate multi-modal processing concepts
print("Multi-Modal Processing in RAG Systems:")
print("=" * 50)

multi_modal_approaches = {
    "Text Extraction": {
        "method": "OCR and text extraction from documents",
        "tools": ["PyMuPDF", "pdfminer", "Tesseract", "python-docx"],
        "challenge": "Handling different layouts and formats"
    },
    "Image Processing": {
        "method": "Extract and describe images using vision models",
        "tools": ["OpenAI Vision API", "CLIP", "BLIP"],
        "challenge": "Generating meaningful descriptions"
    },
    "Table Processing": {
        "method": "Convert tables to structured text or vectors",
        "tools": ["pandas", "camelot", "tabula"],
        "challenge": "Preserving structure and relationships"
    },
    "Layout Understanding": {
        "method": "Understanding document structure (headers, footers, columns)",
        "tools": ["LayoutParser", "DocTR", "Donut"],
        "challenge": "Complex layouts and formatting"
    }
}

for modality, details in multi_modal_approaches.items():
    print(f"\n{modality}:")
    print(f"  Method: {details['method']}")
    print(f"  Tools: {', '.join(details['tools'])}")
    print(f"  Challenge: {details['challenge']}")

# Show how this is implemented in the RAG engine
print("\n" + "="*50)
print("Real Implementation from RAG Engine Workers:")
print("="*50)

print("The RAG engine handles multi-modal processing in the index_document task:")
print("1. Extract text & tables (structural extraction)")
print("2. Extract & describe images (LLM-vision)")
print("3. Hierarchical & contextual linking")
print("4. Graph triplet extraction")
print("5. Batch embedding with caching")
print("6. Detailed storage & graph extraction")
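
For table processing, one common way to "convert tables to structured text" is to serialize each row together with its column names, so that a row still makes sense when retrieved on its own. The sketch below assumes the table has already been extracted into a list of dictionaries; `table_to_text` is a hypothetical helper and does not reflect how the RAG engine stores tables.

```python
from typing import Dict, List


def table_to_text(rows: List[Dict[str, str]], caption: str = "") -> List[str]:
    """Serialize each table row as a self-describing line of 'column: value' pairs."""
    lines = []
    for index, row in enumerate(rows, start=1):
        cells = "; ".join(f"{column}: {value}" for column, value in row.items())
        prefix = f"{caption}, row {index}" if caption else f"Row {index}"
        lines.append(f"{prefix} | {cells}")
    return lines


rows = [
    {"model": "A", "accuracy": "0.91"},
    {"model": "B", "accuracy": "0.88"},
]
for line in table_to_text(rows, caption="Table 1"):
    print(line)
```

Repeating the column names in every line costs extra tokens, but it preserves the header-to-cell relationship that is otherwise lost when a table is flattened.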

## 🔁 Deduplication Techniques

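Deduplication avoids storing and embedding the same content twice. The example below hashes normalized text, so it catches exact repeats as well as case and whitespace variants: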

In [6]:
# Implement a basic deduplication technique
class ChunkDeduplicator:
    """
    Implements chunk deduplication using hash-based comparison
    """
    
    def __init__(self):
        self.chunk_hashes = set()  # Store hashes of seen chunks
        self.duplicate_count = 0
    
    def normalize_text(self, text: str) -> str:
        """
        Normalize text to reduce false duplicates
        """
        # Remove extra whitespace, convert to lowercase
        normalized = ' '.join(text.lower().split())
        return normalized
    
    def calculate_hash(self, text: str) -> str:
        """
        Calculate SHA256 hash of normalized text
        """
        normalized = self.normalize_text(text)
        return hashlib.sha256(normalized.encode('utf-8')).hexdigest()
    
    def is_duplicate(self, text: str) -> bool:
        """
        Check if a text chunk is a duplicate
        """
        chunk_hash = self.calculate_hash(text)
        if chunk_hash in self.chunk_hashes:
            self.duplicate_count += 1
            return True
        
        self.chunk_hashes.add(chunk_hash)
        return False
    
    def get_stats(self) -> Dict[str, int]:
        """
        Get duplication statistics
        """
        return {
            "unique_chunks": len(self.chunk_hashes),
            "duplicate_count": self.duplicate_count
        }

# Example usage
deduplicator = ChunkDeduplicator()

# Sample chunks with some duplicates
sample_chunks = [
    "Artificial Intelligence is a wonderful field of computer science.",
    "Machine learning enables computers to learn from data.",
    "Artificial Intelligence is a wonderful field of computer science.",  # Duplicate
    "Deep learning uses neural networks with multiple layers.",
    "Machine learning enables computers to learn from data.",  # Duplicate
    "Natural Language Processing helps machines understand human language.",
    "artificial intelligence is a wonderful field of computer science.",  # Near duplicate (different case)
]

print("Processing chunks for deduplication:")
unique_chunks = []
for i, chunk in enumerate(sample_chunks):
    is_dup = deduplicator.is_duplicate(chunk)
    status = "DUPLICATE" if is_dup else "NEW"
    print(f"Chunk {i+1}: {status}")
    
    if not is_dup:
        unique_chunks.append(chunk)

stats = deduplicator.get_stats()
print(f"\nDeduplication Results:")
print(f"  Original chunks: {len(sample_chunks)}")
print(f"  Unique chunks: {stats['unique_chunks']}")
print(f"  Duplicates removed: {stats['duplicate_count']}")
print(f"  Storage efficiency: {(stats['duplicate_count']/len(sample_chunks)*100):.1f}% reduction")
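
Hash comparison only catches chunks that are identical after normalization; a chunk with one extra word produces a completely different hash. A minimal sketch of near-duplicate detection, assuming Jaccard similarity over word shingles as the measure, is shown below. The function names and the shingle size of 3 are illustrative choices, and any cutoff for "near duplicate" would need tuning on real data.

```python
from typing import Set


def word_shingles(text: str, size: int = 3) -> Set[str]:
    """Return the set of overlapping word n-grams in lowercased text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    set_a, set_b = word_shingles(a, size), word_shingles(b, size)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


original = "Machine learning enables computers to learn from data."
variant = "Machine learning enables computers to learn from large data."
unrelated = "Deep learning uses neural networks with multiple layers."

print(f"variant:   {jaccard_similarity(original, variant):.3f}")
print(f"unrelated: {jaccard_similarity(original, unrelated):.3f}")
```

Pairwise comparison is quadratic in the number of chunks, so this sketch suits small collections; larger corpora typically need an approximate scheme such as MinHash.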

## ⚡ Performance Considerations

Document processing can be computationally expensive. Here are key performance considerations:

In [7]:
# Performance considerations for document processing
performance_considerations = {
    "Parallel Processing": {
        "technique": "Process multiple documents simultaneously",
        "implementation": "Use multiprocessing for CPU-bound parsing, threading for I/O-bound work",
        "benefit": "Higher throughput when ingesting many documents"
    },
    "Caching": {
        "technique": "Cache embeddings and processed chunks",
        "implementation": "Redis or in-memory cache",
        "benefit": "Avoid recomputing embeddings for identical content"
    },
    "Batch Processing": {
        "technique": "Process multiple items together",
        "implementation": "Batch API calls for embeddings",
        "benefit": "Reduced API overhead and cost"
    },
    "Asynchronous Processing": {
        "technique": "Use async/await for I/O operations",
        "implementation": "Async document parsing and API calls",
        "benefit": "Better resource utilization"
    },
    "Memory Management": {
        "technique": "Process large documents in streams",
        "implementation": "Generator functions and streaming",
        "benefit": "Handle large documents without memory issues"
    }
}

print("Performance Considerations for Document Processing:")
print("=" * 60)
for name, details in performance_considerations.items():
    print(f"\n{name}:")
    print(f"  Technique: {details['technique']}")
    print(f"  Implementation: {details['implementation']}")
    print(f"  Benefit: {details['benefit']}")

# Show how the RAG engine implements these
print("\n" + "="*60)
print("How RAG Engine Implements Performance:")
print("="*60)

print("1. Background Processing: Uses Celery for async document indexing")
print("2. Embedding Caching: Reuses embeddings for identical content")
print("3. Batch Operations: Processes multiple chunks together")
print("4. Memory Efficiency: Streams large files instead of loading entirely")
print("5. Connection Pooling: Efficient DB and vector store connections")
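
Caching and batching combine naturally: hash each chunk, look it up in the cache, and send only the misses to the embedding backend in a single batch. The sketch below illustrates the idea with an in-memory dictionary. `CachedEmbedder` and `fake_embed_batch` are hypothetical stand-ins, not the engine's actual implementation, and a production setup would more likely use Redis or a similar store.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class CachedEmbedder:
    """Wrap a batch embedding function with a content-hash cache."""

    def __init__(self, embed_batch: Callable[[List[str]], List[Vector]]):
        self.embed_batch = embed_batch
        self.cache: Dict[str, Vector] = {}
        self.calls = 0  # number of batch calls made to the backend

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[Vector]:
        keys = [self._key(text) for text in texts]
        # Collect each uncached text once, preserving first-seen order
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self.cache and key not in missing:
                missing[key] = text
        if missing:
            vectors = self.embed_batch(list(missing.values()))
            self.calls += 1
            for key, vector in zip(missing, vectors):
                self.cache[key] = vector
        return [self.cache[key] for key in keys]


def fake_embed_batch(texts: List[str]) -> List[Vector]:
    """Stand-in for a real embedding API: returns [length, word count]."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


embedder = CachedEmbedder(fake_embed_batch)
embedder.embed(["alpha beta", "gamma", "alpha beta"])
embedder.embed(["gamma", "delta"])
print(f"Backend calls: {embedder.calls}, cached vectors: {len(embedder.cache)}")
```

Five texts were requested across the two calls, but only three distinct texts reached the backend, in two batch calls.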

## 🧪 Hands-On Exercise: Building a Document Processing Pipeline

Let's put everything together and create a simple document processing pipeline:

In [8]:
class DocumentProcessorPipeline:
    """
    A complete document processing pipeline combining all concepts
    """
    
    def __init__(self, max_tokens=512, overlap=50):
        self.chunker = TokenAwareChunker(max_tokens=max_tokens, overlap=overlap)
        self.deduplicator = ChunkDeduplicator()
        
    def process_document(self, text: str, doc_id: str) -> Dict[str, Any]:
        """
        Process a document through the complete pipeline
        """
        start_time = time.time()
        
        # Step 1: Initial text cleaning
        cleaned_text = self._clean_text(text)
        
        # Step 2: Chunk the text
        chunks = self.chunker.chunk_text(cleaned_text)
        
        # Step 3: Deduplicate chunks
        unique_chunks = []
        for chunk in chunks:
            if not self.deduplicator.is_duplicate(chunk['text']):
                # Add document ID and other metadata
                chunk['doc_id'] = doc_id
                chunk['hash'] = self.deduplicator.calculate_hash(chunk['text'])
                unique_chunks.append(chunk)
        
        processing_time = time.time() - start_time
        
        # Prepare results
        result = {
            "document_id": doc_id,
            "original_length": len(text),
            "chunks_created": len(chunks),
            "unique_chunks": len(unique_chunks),
            "duplication_rate": ((len(chunks) - len(unique_chunks)) / len(chunks)) * 100 if chunks else 0,
            "processing_time": processing_time,
            "chunks": unique_chunks
        }
        
        return result
    
    def _clean_text(self, text: str) -> str:
        """
        Basic text cleaning
        """
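        # Trim each line, but keep blank-line paragraph breaks so the
        # chunker can still split on '\n\n'
        paragraphs = []
        for block in text.split('\n\n'):
            lines = [line.strip() for line in block.split('\n') if line.strip()]
            if lines:
                paragraphs.append('\n'.join(lines))
        return '\n\n'.join(paragraphs)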

# Test the pipeline
pipeline = DocumentProcessorPipeline(max_tokens=60, overlap=10)

# Process our sample text
result = pipeline.process_document(sample_text, "doc_ml_intro_001")

print("Document Processing Pipeline Results:")
print("=" * 50)
print(f"Document ID: {result['document_id']}")
print(f"Original Length: {result['original_length']} chars")
print(f"Chunks Created: {result['chunks_created']}")
print(f"Unique Chunks: {result['unique_chunks']}")
print(f"Duplication Rate: {result['duplication_rate']:.1f}%")
print(f"Processing Time: {result['processing_time']:.3f}s")

print(f"\nChunk Details:")
for i, chunk in enumerate(result['chunks']):
    print(f"  Chunk {i+1}: {chunk['tokens']} tokens, type '{chunk['type']}'")
    print(f"    Preview: {chunk['text'][:70]}...")

# Show deduplication stats
stats = pipeline.deduplicator.get_stats()
print(f"\nOverall Pipeline Stats:")
print(f"  Total unique chunks processed: {stats['unique_chunks']}")
print(f"  Total duplicates found: {stats['duplicate_count']}")

## 📊 Key Takeaways

Document ingestion and processing is a critical component of RAG systems with several important considerations:

1. **Format Diversity**: Different document types require specialized parsers
2. **Chunking Strategy**: The right strategy depends on your use case and content type
3. **Semantic Coherence**: Preserve meaning when splitting documents
4. **Performance**: Optimize for speed and resource usage
5. **Quality Control**: Deduplicate and validate content
6. **Multi-Modal**: Handle text, images, and structured data appropriately
7. **Metadata**: Preserve context and source information

The RAG Engine Mini implements these concepts with production-grade features including asynchronous processing, caching, and comprehensive error handling.

## 🏁 Conclusion

In this notebook, we've explored the critical aspects of document ingestion and processing in RAG systems:

- We learned about different document types and their parsing requirements
- We implemented a token-aware chunking algorithm from scratch
- We explored multi-modal processing concepts
- We implemented a deduplication technique
- We considered performance optimizations
- We built a complete document processing pipeline

These concepts form the foundation of any successful RAG system. The quality of your document processing directly affects retrieval quality and, ultimately, the effectiveness of the whole system.

Continue to the next notebooks to learn about retrieval techniques, ranking algorithms, and evaluation methodologies!