# Context-Enriched Processor Testing

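This notebook tests the context-enriched chunking approach, which prepends each chunk's hierarchical section context (for example, `Section: 1 Introduction`) to its text so that section information is embedded together with the content for RAG retrieval.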
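
As a rough illustration of the idea, the sketch below prepends a section-path prefix to a chunk's text before embedding. The `Section: ... | ...` layout mirrors the output printed later in this notebook. The `build_context` and `enhance` helpers and the `Subsection:` label for nested levels are assumptions made for illustration, not the actual `ContextEnrichedProcessor` implementation.

```python
from typing import List


def build_context(section_path: List[str]) -> str:
    """Join a section path into a context string (hypothetical helper)."""
    if not section_path:
        return ""
    # Assumption: the top level is labelled "Section", deeper levels "Subsection".
    labels = ["Section"] + ["Subsection"] * (len(section_path) - 1)
    return " | ".join(f"{label}: {title}" for label, title in zip(labels, section_path))


def enhance(content: str, section_path: List[str]) -> str:
    """Prefix the chunk text with its context; this is the text that would be embedded."""
    context = build_context(section_path)
    return f"{context} | {content}" if context else content


text = "Example chunk text."  # stand-in content, not taken from the paper
enhanced = enhance(text, ["1 Introduction"])
print(enhanced)
print(f"Context adds: {len(enhanced) - len(text)} chars")
```

With the single-level path used here, the prefix adds 26 characters, which matches the `Context adds: 26 chars` figure reported for the introduction chunks in Test 4.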

## Setup

In [12]:
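import os
import sys
from pathlib import Path

# Work around the duplicate OpenMP runtime error (common with ML libraries on macOS).
# Set this before importing torch so the flag is in place when the OpenMP runtimes load.
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

import torch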

# Add the project root to Python path
project_root = Path().absolute().parent
sys.path.append(str(project_root))

print(f"Project root: {project_root}")
print("✅ Setup complete")

Project root: /Users/paulschmitt/DataspellProjects/verbatim-rag
✅ Setup complete


In [13]:
from verbatim_rag.ingestion.context_enriched_processor import ContextEnrichedProcessor, ContextEnrichedChunk
from pprint import pprint

print("✅ Imports successful")

✅ Imports successful


## Test 1: Basic Context-Enriched Processing

In [16]:
# Test with the academic paper
pdf_path = project_root / "data" / "acl_papers" / "VERBATIM_RAG_ACL.pdf"

# Create context-enriched processor
processor = ContextEnrichedProcessor.for_rag(chunk_size=512)

# Process document
document = processor.process_file(pdf_path, title="Verbatim RAG ACL Paper")

print("✅ Document processed successfully!")
print(f"Title: {document.title}")
print(f"Chunks: {len(document.chunks)}")
print(f"Content type: {document.content_type}")



✅ Document processed successfully!
Title: Verbatim RAG ACL Paper
Chunks: 57
Content type: DocumentType.PDF


## Test 2: Examine Context-Enriched Chunks

In [17]:
print("🔍 Context-Enriched Chunk Analysis:")
print(f"Total chunks: {len(document.chunks)}")

🔍 Context-Enriched Chunk Analysis:
Total chunks: 57


In [26]:
# Show the first n chunks with their context
n = 10
print(f"\n📝 First {n} Chunks with Context:")
for i, chunk in enumerate(document.chunks[:n]):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Type: {type(chunk).__name__}")
    print(f"Section Path: {chunk.section_path}")
    print(f"Context: {chunk.context_string}")
    print(f"Content: {chunk.content[:100]}...")

    # Show enhanced content (what gets embedded)
    if hasattr(chunk, 'get_enhanced_content'):
        enhanced = chunk.get_enhanced_content()
        print(f"Enhanced: {enhanced[:150]}...")

    # Show citation context
    if hasattr(chunk, 'get_citation_context'):
        citation = chunk.get_citation_context()
        print(f"Citation: {citation}")

if len(document.chunks) > n:
    print(f"\n... and {len(document.chunks) - n} more chunks")


📝 First 10 Chunks with Context:

--- Chunk 1 ---
Type: ContextEnrichedChunk
Section Path: ['1 Introduction']
Context: Section: 1 Introduction
Content: Modern question-answering (QA) and retrievalaugmented generation (RAG) systems play a vital role in ...
Enhanced: Section: 1 Introduction | Modern question-answering (QA) and retrievalaugmented generation (RAG) systems play a vital role in many high-stakes domains...
Citation: 1 Introduction

--- Chunk 2 ---
Type: ContextEnrichedChunk
Section Path: ['1 Introduction']
Context: Section: 1 Introduction
Content: incorrect information, commonly referred to as hallucinations (Ji et al., 2023; Madsen et al., 2024)...
Enhanced: Section: 1 Introduction | incorrect information, commonly referred to as hallucinations (Ji et al., 2023; Madsen et al., 2024). We argue that a reliab...
Citation: 1 Introduction

--- Chunk 3 ---
Type: ContextEnrichedChunk
Section Path: ['1 Introduction']
Context: Section: 1 Introduction
Content: trained generation , dyn

## Test 3: Context Distribution Analysis

In [28]:
print("📊 Context Distribution Analysis:")

# Analyze section distribution
section_counts = {}
context_lengths = []

for chunk in document.chunks:
    if hasattr(chunk, 'section_path') and chunk.section_path:
        # Count chunks per top-level section (section_path is non-empty here)
        main_section = chunk.section_path[0]
        section_counts[main_section] = section_counts.get(main_section, 0) + 1

        # Track context length
        context_lengths.append(len(chunk.context_string))

print(f"\n🏷️  Chunks per Main Section:")
for section, count in sorted(section_counts.items()):
    print(f"  {section}: {count} chunks")

if context_lengths:
    print(f"\n📏 Context String Statistics:")
    print(f"  Average length: {sum(context_lengths)/len(context_lengths):.1f} chars")
    print(f"  Min length: {min(context_lengths)} chars")
    print(f"  Max length: {max(context_lengths)} chars")

# Show unique section paths
unique_paths = set()
for chunk in document.chunks:
    if hasattr(chunk, 'section_path') and chunk.section_path:
        path_str = " → ".join(chunk.section_path)
        unique_paths.add(path_str)

print(f"\n🌳 Unique Section Paths ({len(unique_paths)} total):")
for path in sorted(unique_paths):
    print(f"  {path}")

📊 Context Distribution Analysis:

🏷️  Chunks per Main Section:
  1 Introduction: 5 chunks
  2 Background: 5 chunks
  3 Method: 16 chunks
  4 Evaluation: 9 chunks
  5 Ethical Considerations: 2 chunks
  6 Limitations: 2 chunks
  7 Conclusion: 18 chunks

📏 Context String Statistics:
  Average length: 35.0 chars
  Min length: 21 chars
  Max length: 67 chars

🌳 Unique Section Paths (12 total):
  1 Introduction
  2 Background → 2.1 Dataset
  2 Background → 2.2 Limitations of Standard RAG
  2 Background → 2.3 Synthetic Training Data
  3 Method → 3.1 System Overview
  3 Method → 3.2 Evidence Extraction
  3 Method → 3.3 Synthetic Data Generation
  3 Method → 3.4 Answer Generation
  4 Evaluation
  5 Ethical Considerations
  6 Limitations
  7 Conclusion
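
The same per-section tally can be written more compactly with `collections.Counter`. The sketch below runs on hypothetical stand-in section paths rather than on `document.chunks`, and, unlike the cell above, it buckets chunks without a section under `"No Section"` instead of skipping them. That bucketing is an illustrative choice, not the notebook's behavior.

```python
from collections import Counter

# Hypothetical stand-in data: one section path per chunk.
section_paths = [
    ["1 Introduction"],
    ["2 Background", "2.1 Dataset"],
    ["2 Background", "2.2 Limitations of Standard RAG"],
    ["3 Method", "3.1 System Overview"],
    [],  # a chunk with no detected section
]

# Count chunks per top-level section, bucketing empty paths explicitly.
section_counts = Counter(path[0] if path else "No Section" for path in section_paths)

# Collect the unique full paths, skipping chunks without one.
unique_paths = sorted({" → ".join(path) for path in section_paths if path})

for section, count in sorted(section_counts.items()):
    print(f"{section}: {count} chunks")
print(f"Unique section paths: {len(unique_paths)}")
```

Counting the empty bucket makes it easier to notice when the processor fails to assign a section to some chunks.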


## Test 4: Embedding-Ready Content Examples

In [31]:
print("🎯 Embedding-Ready Content Examples:")
print("This shows what will actually be embedded for RAG retrieval.")

# Show enhanced content for the first 12 chunks
for i, chunk in enumerate(document.chunks[:12]):
    if hasattr(chunk, 'get_enhanced_content'):
        print(f"\n--- Example {i+1} ---")
        print(f"Original: {chunk.content[:100]}...")
        print(f"Enhanced: {chunk.get_enhanced_content()[:200]}...")
        print(f"Context adds: {len(chunk.get_enhanced_content()) - len(chunk.content)} chars")

# Show processed chunks (what goes to the index)
print(f"\n💾 ProcessedChunk Integration:")
total_processed = sum(len(chunk.processed_chunks) for chunk in document.chunks)
print(f"Total processed chunks: {total_processed}")

if document.chunks and document.chunks[0].processed_chunks:
    pc = document.chunks[0].processed_chunks[0]
    print(f"\nExample ProcessedChunk:")
    print(f"  Section title: {pc.section_title}")
    print(f"  Enhanced content: {pc.enhanced_content[:150]}...")
    print(f"  Processing metadata: {pc.processing_metadata}")

🎯 Embedding-Ready Content Examples:
This shows what will actually be embedded for RAG retrieval.

--- Example 1 ---
Original: Modern question-answering (QA) and retrievalaugmented generation (RAG) systems play a vital role in ...
Enhanced: Section: 1 Introduction | Modern question-answering (QA) and retrievalaugmented generation (RAG) systems play a vital role in many high-stakes domains for information extraction and generation tasks. ...
Context adds: 26 chars

--- Example 2 ---
Original: incorrect information, commonly referred to as hallucinations (Ji et al., 2023; Madsen et al., 2024)...
Enhanced: Section: 1 Introduction | incorrect information, commonly referred to as hallucinations (Ji et al., 2023; Madsen et al., 2024). We argue that a reliable QA system should guarantee complete traceabilit...
Context adds: 26 chars

--- Example 3 ---
Original: trained generation , dynamically creating answer templates filled exclu-

We participated in the Arc...
Enhanced: Section: 1 Introduct

## Test 5: RAG Benefits Demonstration

In [33]:
print("🚀 RAG Benefits Demonstration:")
print("Examples of how context enrichment improves retrieval...")

# Simulate search scenarios
search_terms = [
    "dataset",
    "background",
    "method",
    "evaluation",
    "limitations"
]

for term in search_terms:
    print(f"\n🔍 Query: '{term}'")
    matches = []

    for chunk in document.chunks:
        if hasattr(chunk, 'get_enhanced_content'):
            enhanced = chunk.get_enhanced_content().lower()
            if term.lower() in enhanced:
                # Record where the term matched: content, context, or both
                content_match = term.lower() in chunk.content.lower()
                context_match = term.lower() in chunk.context_string.lower()

                matches.append({
                    'chunk': chunk,
                    'content_match': content_match,
                    'context_match': context_match,
                    'both': content_match and context_match
                })

    if matches:
        print(f"  Found {len(matches)} potential matches")

        # Show best matches
        best_matches = sorted(matches, key=lambda x: (x['both'], x['context_match'], x['content_match']), reverse=True)[:2]

        for i, match in enumerate(best_matches):
            chunk = match['chunk']
            match_type = "Content+Context" if match['both'] else ("Context" if match['context_match'] else "Content")
            print(f"    Match {i+1} ({match_type}): {chunk.get_citation_context()}")
            print(f"      Content: {chunk.content[:80]}...")
    else:
        print(f"  No matches found")

🚀 RAG Benefits Demonstration:
Examples of how context enrichment improves retrieval...

🔍 Query: 'dataset'
  Found 9 potential matches
    Match 1 (Content+Context): 2 Background → 2.1 Dataset
      Content: Early clinical QA datasets such as emrQA (Pampari et al., 2018) and CliCR (Šuste...
    Match 2 (Context): 2 Background → 2.1 Dataset
      Content: sentence-level as essential , supplementary , or irrelevant . Answers must be co...

🔍 Query: 'background'
  Found 6 potential matches
    Match 1 (Context): 2 Background → 2.1 Dataset
      Content: Early clinical QA datasets such as emrQA (Pampari et al., 2018) and CliCR (Šuste...
    Match 2 (Context): 2 Background → 2.1 Dataset
      Content: sentence-level as essential , supplementary , or irrelevant . Answers must be co...

🔍 Query: 'method'
  Found 28 potential matches
    Match 1 (Content+Context): 3 Method → 3.4 Answer Generation
      Content: repair of his ruptured thoracoabdominal aortic aneurysm. |1| -He was immediately...
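
The loop above relies on plain substring matching, so a query such as `dataset` also counts `datasets` as a content hit. A minimal sketch of a stricter variant, using whole-word matching from the standard `re` module plus a small numeric score, is shown below. The `match_flags` helper, the weights, and the example context string are assumptions for illustration only and are not part of `verbatim_rag`.

```python
import re


def match_flags(term: str, content: str, context: str) -> dict:
    """Report whether a term occurs as a whole word in content and/or context."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    content_match = bool(pattern.search(content))
    context_match = bool(pattern.search(context))
    return {
        "content_match": content_match,
        "context_match": context_match,
        # Assumption: context hits are weighted above content hits (arbitrary weights).
        "score": 2 * context_match + 1 * content_match,
    }


flags = match_flags(
    "dataset",
    content="Early clinical QA datasets such as emrQA ...",
    context="Section: 2 Background | Subsection: 2.1 Dataset",  # hypothetical format
)
print(flags)
```

Here the plural `datasets` in the content no longer counts as a hit, while the section title still does, so the match is attributed to the context alone. Whether that stricter behavior is desirable depends on the queries being simulated.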

## Summary and Next Steps

In [34]:
print("📋 Context-Enriched Processing Summary:")
print("=" * 50)

print(f"\n✅ Successfully processed document with context enrichment")
print(f"  📄 Document: {document.title}")
print(f"  🧩 Total chunks: {len(document.chunks)}")

# Count context-enriched chunks
enriched_count = sum(1 for chunk in document.chunks if hasattr(chunk, 'section_path') and chunk.section_path)
print(f"  🏷️  Context-enriched chunks: {enriched_count}")

# Show unique sections
sections = set()
for chunk in document.chunks:
    if hasattr(chunk, 'section_path') and chunk.section_path:
        sections.add(chunk.section_path[0])
print(f"  📚 Unique sections: {len(sections)}")

print(f"\n🎯 Key Benefits for RAG:")
print(f"  • Each chunk contains full hierarchical context")
print(f"  • Section information embedded with content")
print(f"  • Better retrieval through context matching")
print(f"  • Rich citation context for answers")
print(f"  • Backward compatible with existing VerbatimRAG")

print(f"\n🚀 Ready for:")
print(f"  • Integration with VerbatimIndex")
print(f"  • Embedding generation with context")
print(f"  • Enhanced RAG retrieval testing")

print(f"\n🧹 Test complete - ready for production integration!")

📋 Context-Enriched Processing Summary:
==================================================

✅ Successfully processed document with context enrichment
  📄 Document: Verbatim RAG ACL Paper
  🧩 Total chunks: 57
  🏷️  Context-enriched chunks: 57
  📚 Unique sections: 7

🎯 Key Benefits for RAG:
  • Each chunk contains full hierarchical context
  • Section information embedded with content
  • Better retrieval through context matching
  • Rich citation context for answers
  • Backward compatible with existing VerbatimRAG

🚀 Ready for:
  • Integration with VerbatimIndex
  • Embedding generation with context
  • Enhanced RAG retrieval testing

🧹 Test complete - ready for production integration!
