# RAG Pipeline Deep Dive (Comprehensive Analysis)

This notebook provides a comprehensive exploration of the RAG (Retrieval-Augmented Generation) pipeline in the RAG Engine Mini project. It covers the complete flow from document ingestion to response generation, including all intermediate processing steps.

## Learning Objectives

By the end of this notebook, you will understand:
- How documents are ingested and processed in the RAG pipeline
- The multi-modal processing capabilities of the system
- Hierarchical chunking strategies and their benefits
- Knowledge graph extraction and storage
- Embedding generation and caching mechanisms
- Background task processing with Celery

## Prerequisites

This notebook assumes familiarity with:
- Basic RAG concepts
- Python async programming
- Document processing workflows

## Step 0: Setup and Imports

We add the project root to `sys.path` so imports work when running from the notebook directory.

In [None]:
import os
import sys
import asyncio
import hashlib
from pathlib import Path

# Find rag-engine-mini root by walking upwards
current = Path.cwd().resolve()
repo_root = None
for parent in [current, *current.parents]:
    if (parent / "src").exists() and (parent / "notebooks").exists():
        repo_root = parent
        break

if repo_root is None:
    raise RuntimeError("Could not locate rag-engine-mini root for imports")

sys.path.insert(0, str(repo_root))

print("Repo root:", repo_root)

## Step 1: Understanding the Core Components

Let's examine the core components of the RAG pipeline by importing and inspecting them.

In [None]:
from src.core.config import settings
from src.core.bootstrap import get_container
from src.domain.entities import TenantId, DocumentId, ChunkSpec
from src.application.services.chunking import chunk_hierarchical

# Get the dependency container
container = get_container()

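# List some of the key services available.
# Note: `_services` is a private attribute (leading underscore); it is read here for inspection only.
services = list(container._services.keys())
print("Available services in container:", services[:10], "...")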

# Focus on the key services for our RAG pipeline
print("\nKey RAG services:")
print("- Document Repository:", type(container.get("document_repo")))
print("- Document Reader:", type(container.get("document_reader")))
print("- Text Extractor:", type(container.get("text_extractor")))
print("- Cached Embeddings:", type(container.get("cached_embeddings")))
print("- Chunk Dedup Repository:", type(container.get("chunk_dedup_repo")))
print("- Vector Store:", type(container.get("vector_store")))
print("- LLM:", type(container.get("llm")))
print("- Graph Extractor:", type(container.get("graph_extractor")))
print("- Graph Repository:", type(container.get("graph_repo")))
print("- Vision Service:", type(container.get("vision_service")))

## Step 2: Simulating Document Ingestion Process

Let's simulate the document ingestion process step-by-step, following the same pattern as the actual pipeline.

In [None]:
# Simulate a simplified version of the document indexing process

def simulate_document_ingestion_pipeline():
    """
    Simulates the document ingestion pipeline with explanations for each step.
    This mirrors the actual index_document Celery task, but is simplified for demonstration.
    """
    print("Starting simulated document ingestion pipeline...")
    
    # Get services from container, using the same accessor as in Step 1.
    # Several of these appear only in comments below, because the storage calls are simulated.
    document_repo = container.get("document_repo")
    document_reader = container.get("document_reader")
    text_extractor = container.get("text_extractor")
    cached_embeddings = container.get("cached_embeddings")
    chunk_dedup_repo = container.get("chunk_dedup_repo")
    vector_store = container.get("vector_store")
    llm = container.get("llm")
    graph_extractor = container.get("graph_extractor")
    graph_repo = container.get("graph_repo")
    
    print("✓ Services initialized")
    
    # Step 1: Create a mock document entity
    tenant = TenantId("demo-tenant")
    doc_id = DocumentId("demo-doc-id")
    
    print(f"\n1. Processing document: {doc_id.value} for tenant: {tenant.value}")
    
    # Step 2: Mock document content (in real scenario, this would come from document_reader)
    mock_document_content = """
    Artificial Intelligence and Machine Learning Overview
    
    Introduction to AI
    Artificial Intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
    
    Machine Learning Concepts
    Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence.
    
    Deep Learning Applications
    Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.
    
    Future of AI
    The future of AI holds many possibilities, from autonomous vehicles to personalized medicine. However, ethical considerations must be taken into account as AI systems become more powerful.
    """
    
    print("2. Document content loaded (simulated)")
    
    # Step 3: Document Summary Generation
    print("3. Generating document summary with LLM...")
    # The prompt the real pipeline would send to the LLM (built here for illustration only)
    summary_prompt = (
        "Summarize the following document content in 1-2 sentences to provide context for RAG chunks. "
        "Output ONLY the summary sentences:\n\n"
        f"{mock_document_content[:12000]}"
    )
    # Simulated: the summary is hard-coded here rather than generated from summary_prompt
    doc_summary = "This document provides an overview of AI, ML, and deep learning concepts along with their applications and ethical considerations."
    print(f"   Generated summary: {doc_summary}")
    
    # Step 4: Hierarchical Chunking
    print("4. Performing hierarchical chunking...")
    spec = ChunkSpec(strategy="hierarchical", parent_size=2048, child_size=512)
    hierarchy = chunk_hierarchical(mock_document_content, spec)
    print(f"   Created {len(hierarchy)} chunk pairs (parent-child)")
    
    # Display first few chunks as examples
    print("   Example chunks:")
    for i, item in enumerate(hierarchy[:2]):
        print(f"     Parent {i+1}: '{item['parent_text'][:50]}...'")
        print(f"     Child {i+1}:  '{item['child_text'][:50]}...'\n")
    
    # Step 5: Batch Embedding
    print("5. Generating embeddings for chunks...")
    child_texts = [h["child_text"] for h in hierarchy]
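    # dict.fromkeys removes duplicates while keeping first-seen order (set() order is arbitrary)
    unique_child_texts = list(dict.fromkeys(child_texts))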
    
    # In a real system, we'd call: unique_vectors = cached_embeddings.embed_many(unique_child_texts)
    # For simulation, we'll create mock embeddings
    mock_vectors = [[0.1] * 384 for _ in unique_child_texts]  # 384-dim mock vectors
    vec_map = dict(zip(unique_child_texts, mock_vectors))
    
    print(f"   Generated embeddings for {len(unique_child_texts)} unique chunks")
    
    # Step 6: Storage Operations
    print("6. Storing chunks and vectors...")
    chunk_ids_in_order = []
    handled_parents = set()
    
    for idx, item in enumerate(hierarchy):
        c_text = item["child_text"]
        p_text = item["parent_text"]
        
        # Generate hashes for parent and child
        p_hash = hashlib.sha256(p_text.encode()).hexdigest()[:16]
        c_hash = hashlib.sha256(c_text.encode()).hexdigest()[:16]
        
        # Simulate parent chunk creation
        p_id = f"parent_{p_hash}"
        if p_id not in handled_parents:
            # In real system: p_id = chunk_dedup_repo.upsert_chunk_store(...)
            print(f"   Created parent chunk: {p_id}")
            
            # Extract and store graph triplets for parent
            # In real system: triplets = graph_extractor.extract_triplets(p_text)
            print("   Extracted and stored graph triplets for parent")
            handled_parents.add(p_id)
        
        # Simulate child chunk creation
        c_id = f"child_{c_hash}_{idx}"
        # In real system: c_id = chunk_dedup_repo.upsert_chunk_store(...)
        print(f"   Created child chunk: {c_id}")
        
        chunk_ids_in_order.append(c_id)
        
        # Simulate vector storage
        # In real system: vector_store.upsert_points(ids=[c_id], vectors=[vec_map[c_text]], ...)
        print(f"   Stored vector for chunk: {c_id}")
    
    print("\n✓ Pipeline completed successfully!")
    print(f"  Total chunks processed: {len(chunk_ids_in_order)}")
    print(f"  Document summary: {doc_summary[:60]}...")
    
    return {
        "chunks_created": len(chunk_ids_in_order),
        "document_summary": doc_summary,
        "chunk_ids": chunk_ids_in_order
    }

# Run the simulation
result = simulate_document_ingestion_pipeline()
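
Step 5 of the simulation deduplicates child texts before embedding and then looks each vector up per chunk. The pattern can be sketched in isolation as follows. This is a minimal sketch, not the project's implementation: `embed_with_dedup` and `fake_embed_many` are hypothetical names, and the toy 4-dimensional vectors stand in for whatever the real embedding call returns.

```python
from typing import Callable, Dict, List

Vector = List[float]


def embed_with_dedup(
    texts: List[str],
    embed_many: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Embed each distinct text once, then return vectors aligned with `texts`."""
    # dict.fromkeys keeps first-seen order, so the batch sent to the embedder is reproducible
    unique_texts = list(dict.fromkeys(texts))
    unique_vectors = embed_many(unique_texts)
    vec_map: Dict[str, Vector] = dict(zip(unique_texts, unique_vectors))
    # Fan the vectors back out so position i still corresponds to texts[i]
    return [vec_map[t] for t in texts]


calls: List[List[str]] = []


def fake_embed_many(batch: List[str]) -> List[Vector]:
    """Hypothetical embedder: records the batch and returns toy 4-dim vectors."""
    calls.append(list(batch))
    return [[float(len(t))] * 4 for t in batch]


chunks = ["alpha", "beta", "alpha", "gamma"]
vectors = embed_with_dedup(chunks, fake_embed_many)
print(len(calls[0]), "texts embedded for", len(vectors), "chunks")
```

Only three texts reach the embedder for four chunks, and the duplicate chunk reuses the vector computed for its first occurrence.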

## Step 3: Exploring the Chunk Deduplication Mechanism

The RAG engine deduplicates chunks by hashing their whitespace-normalized text, so identical content is stored and embedded only once.

In [None]:
def demonstrate_chunk_deduplication():
    """
    Demonstrates the chunk deduplication mechanism.
    """
    print("Demonstrating chunk deduplication mechanism:")
    
    # Example texts, including some duplicates
    texts = [
        "Artificial Intelligence is transforming industries worldwide.",
        "Machine learning enables computers to learn from experience.",
        "Artificial Intelligence is transforming industries worldwide.",  # Duplicate
        "Deep learning uses neural networks with multiple layers.",
        "Machine learning enables computers to learn from experience."   # Duplicate
    ]
    
    print("\nOriginal texts:")
    for i, text in enumerate(texts):
        print(f"  {i+1}. {text}")
    
    # Calculate hashes (same as _chunk_hash function in tasks.py)
    def chunk_hash(text: str) -> str:
        """Generate SHA256 hash of normalized text."""
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    
    hashes = [chunk_hash(text) for text in texts]
    
    print("\nHashes calculated:")
    for i, (text, h) in enumerate(zip(texts, hashes)):
        print(f"  {i+1}. {h[:16]}... <- '{text[:30]}...'")
    
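    # Identify unique texts by hash, keeping the first occurrence of each chunk in order
    unique_by_hash = {}
    for text, h in zip(texts, hashes):
        unique_by_hash.setdefault(h, text)
    unique_texts = list(unique_by_hash.values())

    print("\nAfter deduplication:")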
    print(f"  Original count: {len(texts)}")
    print(f"  Unique count: {len(unique_texts)}")
    print(f"  Saved: {len(texts) - len(unique_texts)} duplicate chunks")
    
    print("\nUnique texts:")
    for i, text in enumerate(unique_texts):
        print(f"  {i+1}. {text}")

demonstrate_chunk_deduplication()
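
The hash from the cell above can serve as a storage key, so that writing the same content twice returns the existing chunk instead of creating a new one. The sketch below illustrates that idea under stated assumptions: `InMemoryChunkStore` and its `upsert` method are hypothetical, not the project's `chunk_dedup_repo` API, and scoping rows per tenant is an assumption based on the multi-tenant design mentioned later in this notebook.

```python
import hashlib
from typing import Dict, Tuple


def chunk_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized text, as in the cell above."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class InMemoryChunkStore:
    """Hypothetical store that keeps one row per (tenant, content hash)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], str] = {}

    def upsert(self, tenant_id: str, text: str) -> Tuple[str, bool]:
        """Return (chunk_id, created); existing content returns its original id."""
        key = (tenant_id, chunk_hash(text))
        if key in self._rows:
            return self._rows[key], False
        chunk_id = f"chunk_{key[1][:16]}"
        self._rows[key] = chunk_id
        return chunk_id, True


store = InMemoryChunkStore()
first = store.upsert("demo-tenant", "Machine learning enables computers to learn.")
# Same words with different whitespace: normalization makes the hashes equal
second = store.upsert("demo-tenant", "Machine   learning enables\ncomputers to learn.")
# Same text under another tenant is stored separately (assumed isolation rule)
other = store.upsert("other-tenant", "Machine learning enables computers to learn.")
print(first[1], second[1], other[1])
```

Because the text is normalized before hashing, chunks that differ only in spacing or line breaks collapse to one row.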

## Step 4: Examining the Hierarchical Chunking Strategy

The RAG engine uses hierarchical chunking: small child chunks are embedded and searched, and each child is linked to a larger parent chunk that supplies the surrounding context.

In [None]:
def demonstrate_hierarchical_chunking():
    """
    Demonstrates the hierarchical chunking strategy.
    """
    print("Demonstrating hierarchical chunking strategy:")
    
    # Example content to chunk
    sample_content = """
    Chapter 1: Introduction to Neural Networks
    
    Neural networks are computing systems inspired by the human brain. They consist of interconnected nodes or neurons in various layers. Each connection has a weight that adjusts as learning proceeds.
    
    Section 1.1: Perceptron Model
    The perceptron is the simplest form of a neural network, consisting of a single neuron. It takes multiple inputs, applies weights to them, sums them up, and passes the result through an activation function.
    
    Section 1.2: Multilayer Perceptrons
    Multilayer perceptrons extend the simple perceptron by adding hidden layers between the input and output layers. This allows the network to learn more complex patterns and relationships in the data.
    
    Activation Functions
    Activation functions determine the output of a neural network. Common activation functions include sigmoid, tanh, and ReLU. Each has its own advantages and disadvantages depending on the use case.
    """
    
    print("Sample content to chunk:")
    print(sample_content.strip())
    
    # Chunk specification the real chunker would receive.
    # It is shown for reference only; the manual simulation below does not use it.
    spec = ChunkSpec(strategy="hierarchical", parent_size=200, child_size=100)

    # Simulate hierarchical chunking (in real system, this would call chunk_hierarchical)
    # For demonstration, we'll manually slice the text by character offsets

    # Parents (larger context chunks): 300-character windows overlapping by 100 characters
    end = len(sample_content)
    parent_spans = [(0, 300), (200, 500), (400, end)]

    # Children (focused chunks): 100-character slices; the last one takes the remainder
    child_spans = [(0, 100), (100, 200), (200, 300), (300, 400), (400, 500), (500, end)]

    # Create hierarchy pairs: attach each child to the first parent span that fully contains it
    hierarchy = []
    for c_start, c_end in child_spans:
        for p_start, p_end in parent_spans:
            if p_start <= c_start and c_end <= p_end:
                hierarchy.append({
                    "parent_text": sample_content[p_start:p_end],
                    "child_text": sample_content[c_start:c_end],
                })
                break
    
    print(f"\nCreated {len(hierarchy)} parent-child pairs:")
    
    for i, pair in enumerate(hierarchy):
        print(f"\nPair {i+1}:")
        print(f"  Parent ({len(pair['parent_text'])} chars): {pair['parent_text'][:60]}...")
        print(f"  Child ({len(pair['child_text'])} chars):  {pair['child_text'][:60]}...")
    
    print("\nBenefits of hierarchical chunking:")
    print("• Granular retrieval: Search for specific details in child chunks")
    print("• Context preservation: Access broader context via parent chunks")
    print("• Shared parents: Each parent chunk is stored once and referenced by its children")
    print("• Flexible querying: Choose appropriate granularity for different questions")

demonstrate_hierarchical_chunking()
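
The cell above builds its hierarchy by hand. For comparison, here is a minimal sketch of an automatic two-level splitter. It is an assumption-laden illustration, not the project's `chunk_hierarchical`: the name `simple_chunk_hierarchical` is hypothetical, and it splits on raw character counts, whereas a production chunker would more likely respect token or sentence boundaries.

```python
from typing import Dict, List


def simple_chunk_hierarchical(
    text: str, parent_size: int, child_size: int
) -> List[Dict[str, str]]:
    """Split text into parent windows, then split each parent into children."""
    if child_size <= 0 or parent_size < child_size:
        raise ValueError("require 0 < child_size <= parent_size")
    # Collapse whitespace so sizes are measured on the cleaned text
    normalized = " ".join(text.split())
    pairs: List[Dict[str, str]] = []
    for p_start in range(0, len(normalized), parent_size):
        parent = normalized[p_start:p_start + parent_size]
        for c_start in range(0, len(parent), child_size):
            child = parent[c_start:c_start + child_size]
            # Every child is a slice of its parent, so containment holds by construction
            pairs.append({"parent_text": parent, "child_text": child})
    return pairs


text = "Neural networks are computing systems inspired by the human brain. " * 6
pairs = simple_chunk_hierarchical(text, parent_size=200, child_size=100)
parent_count = len({p["parent_text"] for p in pairs})
print(f"{len(pairs)} pairs from {parent_count} parents")
```

The output uses the same `parent_text` / `child_text` keys as the notebook's hierarchy, so it could be passed through the same printing loop.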

## Step 5: Understanding the Knowledge Graph Extraction

The RAG engine extracts structured knowledge from text and stores it as graph triplets.

In [None]:
def demonstrate_knowledge_graph_extraction():
    """
    Demonstrates the knowledge graph extraction concept.
    """
    print("Demonstrating knowledge graph extraction:")
    
    # Sample text with extractable relationships
    sample_text = (
        "Apple Inc. is headquartered in Cupertino, California. Tim Cook is the CEO of Apple Inc. "
        "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne. The iPhone was developed by Apple Inc."
    )
    
    print(f"\nInput text: {sample_text}")
    
    # Simulate triplet extraction
    # In a real system, graph_extractor.extract_triplets(text) would return actual triplets
    extracted_triplets = [
        ("Apple Inc.", "headquartered_in", "Cupertino, California"),
        ("Tim Cook", "CEO_of", "Apple Inc."),
        ("Apple Inc.", "founded_by", "Steve Jobs"),
        ("Apple Inc.", "founded_by", "Steve Wozniak"),
        ("Apple Inc.", "founded_by", "Ronald Wayne"),
        ("iPhone", "developed_by", "Apple Inc.")
    ]
    
    print(f"\nExtracted {len(extracted_triplets)} triplets:")
    for i, triplet in enumerate(extracted_triplets, 1):
        print(f"  {i}. ({triplet[0]}, {triplet[1]}, {triplet[2]})")
    
    print("\nBenefits of knowledge graph extraction:")
    print("• Structured knowledge representation")
    print("• Ability to answer relationship-based questions")
    print("• Enhanced reasoning capabilities")
    print("• Improved retrieval for fact-based queries")
    
    # Show how triplets could be used for enhanced retrieval
    print("\nExample query processing:")
    print("Q: Who founded Apple Inc.?")
    print("A: Based on triplets: Steve Jobs, Steve Wozniak, Ronald Wayne")
    
    print("\nQ: What did Apple Inc. develop?")
    print("A: Based on triplets: iPhone")

demonstrate_knowledge_graph_extraction()
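
The two example questions above are answered by looking triplets up in opposite directions: "Who founded Apple Inc.?" reads objects for a subject, while "What did Apple Inc. develop?" reads subjects for an object. A minimal in-memory sketch of such a lookup follows. `TripletIndex` is a hypothetical helper for illustration and says nothing about how the project's `graph_repo` stores or queries triplets.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Triplet = Tuple[str, str, str]


class TripletIndex:
    """Hypothetical in-memory index over (subject, predicate, object) triplets."""

    def __init__(self) -> None:
        self._by_subject: DefaultDict[Tuple[str, str], List[str]] = defaultdict(list)
        self._by_object: DefaultDict[Tuple[str, str], List[str]] = defaultdict(list)

    def add(self, triplet: Triplet) -> None:
        subject, predicate, obj = triplet
        # Index both directions so either end of a relation can be queried
        self._by_subject[(subject, predicate)].append(obj)
        self._by_object[(predicate, obj)].append(subject)

    def objects(self, subject: str, predicate: str) -> List[str]:
        """All objects o such that (subject, predicate, o) was added."""
        return list(self._by_subject.get((subject, predicate), []))

    def subjects(self, predicate: str, obj: str) -> List[str]:
        """All subjects s such that (s, predicate, obj) was added."""
        return list(self._by_object.get((predicate, obj), []))


index = TripletIndex()
for t in [
    ("Apple Inc.", "founded_by", "Steve Jobs"),
    ("Apple Inc.", "founded_by", "Steve Wozniak"),
    ("Apple Inc.", "founded_by", "Ronald Wayne"),
    ("iPhone", "developed_by", "Apple Inc."),
]:
    index.add(t)

print("Founders:", index.objects("Apple Inc.", "founded_by"))
print("Developed by Apple Inc.:", index.subjects("developed_by", "Apple Inc."))
```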

## Step 6: Understanding the Caching Strategy

The RAG engine implements a multi-layer caching strategy for performance optimization.

In [None]:
def demonstrate_caching_strategy():
    """
    Demonstrates the caching strategy used in the RAG pipeline.
    """
    print("Demonstrating caching strategy:")
    
    print("\nThe RAG engine uses multiple caching layers:")

    print("\n1. Embedding Cache:")
    print("   • Stores computed embeddings to avoid recomputation")
    print("   • Uses document/content hashing for quick lookup")
    print("   • Reduces API costs and improves response time")
    
    # Example of embedding caching
    sample_texts = [
        "Machine learning is a subset of artificial intelligence",
        "Deep learning uses neural networks with multiple layers",
        "Machine learning is a subset of artificial intelligence"  # Duplicate
    ]
    
    print(f"\n   Example: Processing {len(sample_texts)} texts with caching")
    unique_texts = list(set(sample_texts))
    print(f"   Unique texts after deduplication: {len(unique_texts)}")
    print(f"   Embeddings computed: {len(unique_texts)} (not {len(sample_texts)})")
    print(f"   Cost saving: {(len(sample_texts) - len(unique_texts))/len(sample_texts)*100:.0f}% fewer embeddings")
    
    print("\n2. Document Chunk Cache:")
    print("   • Stores processed document chunks to avoid re-processing")
    print("   • Maintains parent-child relationships")
    print("   • Enables faster document retrieval")
    
    print("\n3. Query Result Cache:")
    print("   • Caches frequent query results")
    print("   • Improves response time for repeated questions")
    print("   • Reduces computational overhead")
    
    print("\n4. LLM Response Cache:")
    print("   • Caches responses for similar prompts")
    print("   • Reduces API usage and latency")
    print("   • Maintains quality while improving performance")

demonstrate_caching_strategy()
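
The embedding cache described above can be sketched as a thin wrapper that keys vectors by a content hash and only forwards unseen texts to the backend. This is an illustrative sketch under assumptions: `HashKeyedEmbeddingCache` and `toy_embed_many` are hypothetical names, the cache lives in a plain dict, and the project's `cached_embeddings` service may be built quite differently.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class HashKeyedEmbeddingCache:
    """Hypothetical wrapper that caches vectors by a hash of the normalized text."""

    def __init__(self, embed_many: Callable[[List[str]], List[Vector]]) -> None:
        self._embed_many = embed_many
        self._cache: Dict[str, Vector] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def embed_many(self, texts: List[str]) -> List[Vector]:
        keys = [self._key(t) for t in texts]
        missing: Dict[str, str] = {}  # key -> text, in first-seen order
        for key, text in zip(keys, texts):
            if key in self._cache or key in missing:
                self.hits += 1  # already cached, or a duplicate within this batch
            else:
                missing[key] = text
                self.misses += 1
        if missing:
            # One backend call per batch, covering only the unseen texts
            new_vectors = self._embed_many(list(missing.values()))
            self._cache.update(zip(missing.keys(), new_vectors))
        return [self._cache[k] for k in keys]


backend_calls: List[List[str]] = []


def toy_embed_many(batch: List[str]) -> List[Vector]:
    """Stand-in for a real embedding API; records each batch it receives."""
    backend_calls.append(list(batch))
    return [[float(len(t))] for t in batch]


cache = HashKeyedEmbeddingCache(toy_embed_many)
texts = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Machine learning is a subset of artificial intelligence",
]
cache.embed_many(texts)  # first pass: two distinct texts reach the backend
cache.embed_many(texts)  # second pass: served entirely from the cache
print(f"backend calls: {len(backend_calls)}, misses: {cache.misses}, hits: {cache.hits}")
```

Across both passes the backend is called once with two texts, which is where the API cost saving mentioned above comes from.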

## Step 7: Reviewing the Complete Pipeline Architecture

Let's review the complete RAG pipeline architecture with all components working together.

In [None]:
def summarize_rag_pipeline():
    """
    Summarizes the complete RAG pipeline architecture.
    """
    print("SUMMARY: Complete RAG Pipeline Architecture")
    print("=" * 50)
    
    print("\nINGESTION PHASE:")
    print("1. Document Upload → Stored in file system")
    print("2. Document Processing → Extract text, tables, images")
    print("3. Multi-modal Processing → Describe images with vision models")
    print("4. Document Summarization → Generate context summary with LLM")
    print("5. Hierarchical Chunking → Create parent-child chunk relationships")
    print("6. Knowledge Extraction → Extract graph triplets from content")
    print("7. Embedding Generation → Convert chunks to vector representations")
    print("8. Storage → Save chunks to DB, vectors to vector store, graphs to graph DB")
    
    print("\nRETRIEVAL PHASE:")
    print("1. Query Processing → Parse and potentially expand user query")
    print("2. Hybrid Search → Combine semantic and keyword-based retrieval")
    print("3. Re-ranking → Improve relevance with cross-encoder models")
    print("4. Context Assembly → Gather relevant chunks with parent context")
    
    print("\nGENERATION PHASE:")
    print("1. Prompt Construction → Build context-aware prompt with retrieved content")
    print("2. LLM Generation → Generate response using LLM")
    print("3. Verification → Validate response accuracy if enabled")
    print("4. Response Delivery → Return answer with sources and metadata")
    
    print("\nKEY FEATURES:")
    print("✓ Multi-tenant isolation")
    print("✓ Multi-modal content support")
    print("✓ Hierarchical chunking for context preservation")
    print("✓ Knowledge graph integration")
    print("✓ Embedding caching for cost optimization")
    print("✓ Asynchronous processing with Celery")
    print("✓ Comprehensive observability")
    print("✓ Enterprise-grade security")
    
    print("\nPERFORMANCE OPTIMIZATIONS:")
    print("• Chunk deduplication to reduce storage/compute")
    print("• Batch embedding operations")
    print("• Multi-level caching strategy")
    print("• Asynchronous processing for scalability")
    print("• Connection pooling for databases")
    
summarize_rag_pipeline()
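
The retrieval phase above lists context assembly as gathering relevant chunks with parent context. One plausible reading, sketched below, is that search ranks child chunks and the prompt receives each matching child's parent exactly once, in rank order. The function name `assemble_parent_context`, its parameters, and the sample ids are all hypothetical; the project's actual assembly logic is not shown in this notebook.

```python
from typing import Dict, List


def assemble_parent_context(
    ranked_child_ids: List[str],
    child_to_parent: Dict[str, str],
    parent_texts: Dict[str, str],
    max_parents: int = 3,
) -> List[str]:
    """Map ranked child hits to their parents, keeping each parent once."""
    seen = set()
    context: List[str] = []
    for child_id in ranked_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # a higher-ranked sibling already pulled in this parent
        seen.add(parent_id)
        context.append(parent_texts[parent_id])
        if len(context) == max_parents:
            break
    return context


# Hypothetical search result: child ids ordered from most to least relevant
ranked = ["c3", "c1", "c2", "c5"]
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2", "c5": "p3"}
parent_texts = {
    "p1": "parent one text",
    "p2": "parent two text",
    "p3": "parent three text",
}

context = assemble_parent_context(ranked, child_to_parent, parent_texts, max_parents=2)
print(context)
```

Deduplicating by parent keeps the prompt from repeating the same passage when several sibling children match the query.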

## Conclusion

This notebook provided a comprehensive exploration of the RAG pipeline in the RAG Engine Mini project. We examined:

1. The complete document ingestion pipeline
2. Multi-modal processing capabilities
3. Hierarchical chunking strategies
4. Knowledge graph extraction and storage
5. Caching mechanisms for performance
6. The complete architecture with all components integrated

The RAG Engine demonstrates enterprise-grade practices for building scalable, reliable, and performant RAG systems. The combination of hierarchical chunking, knowledge graph integration, multi-modal processing, and comprehensive caching creates a robust foundation for production RAG applications.