# DocumentProcessor Testing Notebook

This notebook tests the DocumentProcessor functionality with different chunking strategies and document types.

## Setup and Imports

This section sets up the Python environment, handles common ML library conflicts (OpenMP), and imports the necessary modules for testing the DocumentProcessor.

In [70]:
import sys
import os
from pathlib import Path

# Fix OpenMP conflict (common with ML libraries on macOS)
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

# Add the project root to Python path
project_root = Path().absolute().parent
sys.path.append(str(project_root))

print(f"Project root: {project_root}")
print(f"Current working directory: {Path.cwd()}")
print("✅ OpenMP conflict workaround applied")

Project root: /Users/paulschmitt/DataspellProjects/verbatim-rag
Current working directory: /Users/paulschmitt/DataspellProjects/verbatim-rag/notebooks
✅ OpenMP conflict workaround applied


In [2]:
from verbatim_rag.ingestion import DocumentProcessor

In [3]:
from verbatim_rag.document import DocumentType, ChunkType
import json
from pprint import pprint

## Check Available Test Files

Before running tests, we need to verify which document files are available for processing. This helps us understand what test data we have to work with.

In [4]:
# Check available example documents
example_docs_path = project_root / "data" / "acl_papers"
print(f"Example docs path: {example_docs_path}")
print(f"Exists: {example_docs_path.exists()}")

if example_docs_path.exists():
    print("\nAvailable files:")
    for file in example_docs_path.iterdir():
        print(f"  - {file.name} ({file.stat().st_size} bytes)")
else:
    print("Example docs directory not found!")

Example docs path: /Users/paulschmitt/DataspellProjects/verbatim-rag/data/acl_papers
Exists: True

Available files:
  - VERBATIM_RAG_ACL.pdf (362982 bytes)


## Test 1: Basic DocumentProcessor Creation

This test verifies that the DocumentProcessor can be instantiated with default settings. It checks if all required dependencies (docling, chonkie) are properly installed and accessible.

In [5]:
try:
    # Test basic creation with default settings
    processor = DocumentProcessor()
    print("✅ DocumentProcessor created successfully")
    print(f"Chunker type: {processor.chunker_type}")
    print(f"Chunk size: {processor.chunk_size}")
    print(f"Chunk overlap: {processor.chunk_overlap}")
except Exception as e:
    print(f"❌ Error creating DocumentProcessor: {e}")
    print("Make sure you have installed the document-processing dependencies:")
    print("pip install -e .[document-processing]")

✅ DocumentProcessor created successfully
Chunker type: recursive
Chunk size: 512
Chunk overlap: 50


## Test 2: Process a Simple Text File

This test processes a sample markdown document to understand how the DocumentProcessor converts content into chunks. We create a test document with headers and sections to analyze the chunking behavior.

In [10]:
test_file_name = "VERBATIM_RAG_ACL.pdf"
test_file_path = example_docs_path / test_file_name

In [11]:
# Test processing the file
try:
    processor = DocumentProcessor()
    document = processor.process_file(test_file_path, title=test_file_name)
    
    print("✅ Document processed successfully!")
    print(f"Document ID: {document.id}")
    print(f"Title: {document.title}")
    print(f"Source: {document.source}")
    print(f"Content Type: {document.content_type}")
    print(f"Number of chunks: {len(document.chunks)}")
    print(f"Raw content length: {len(document.raw_content)} characters")
    
except Exception as e:
    print(f"❌ Error processing document: {e}")
    import traceback
    traceback.print_exc()



✅ Document processed successfully!
Document ID: 7c2a73ad-6491-45ab-a371-8717b4d72a18
Title: VERBATIM_RAG_ACL.pdf
Source: /Users/paulschmitt/DataspellProjects/verbatim-rag/data/acl_papers/VERBATIM_RAG_ACL.pdf
Content Type: DocumentType.PDF
Number of chunks: 88
Raw content length: 25746 characters


## Test 3: Examine Chunks in Detail

This test provides a detailed analysis of the generated chunks, including their structure, content, and metadata. It helps us understand how content is split and what information is preserved in each chunk.

In [12]:
if 'document' in locals():
    print(f"\n📄 Document Analysis:")
    print(f"Total chunks: {len(document.chunks)}")
    
    print("\n📝 Chunk Details:")
    for i, chunk in enumerate(document.chunks[:5]):  # Show first 5 chunks
        print(f"\n--- Chunk {i+1} ---")
        print(f"ID: {chunk.id}")
        print(f"Type: {chunk.chunk_type}")
        print(f"Number: {chunk.chunk_number}")
        print(f"Content length: {len(chunk.content)} chars")
        print(f"Processed chunks: {len(chunk.processed_chunks)}")
        print(f"Content preview: {chunk.content[:200]}...")
        
        # Show processed chunk details
        if chunk.processed_chunks:
            pc = chunk.processed_chunks[0]
            print(f"Enhanced content length: {len(pc.enhanced_content)} chars")
            print(f"Enhanced preview: {pc.enhanced_content[:200]}...")
    
    if len(document.chunks) > 5:
        print(f"\n... and {len(document.chunks) - 5} more chunks")
else:
    print("❌ No document available to analyze")


📄 Document Analysis:
Total chunks: 88

📝 Chunk Details:

--- Chunk 1 ---
ID: 9ca6dcc7-f0e1-438b-96d7-a0e227425290
Type: ChunkType.PARAGRAPH
Number: 0
Content length: 227 chars
Processed chunks: 1
Content preview: ## KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering

Ádám Kovács KR Labs kovacs@krlabs.eu

## Paul Schmitt

TU Wien paul.schmitt@tuwien.ac.at

Gábor Recski KR Labs...
Enhanced content length: 227 chars
Enhanced preview: ## KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering

Ádám Kovács KR Labs kovacs@krlabs.eu

## Paul Schmitt

TU Wien paul.schmitt@tuwien.ac.at

Gábor Recski KR Labs...

--- Chunk 2 ---
ID: 408e5810-c9ac-4d0b-9faa-9504d8008d27
Type: ChunkType.PARAGRAPH
Number: 1
Content length: 458 chars
Processed chunks: 1
Content preview: ## Abstract

We present a lightweight, domain-agnostic verbatim pipeline for evidence-grounded question answering. Our pipeline operates in two steps: first, a sentence

## Test 4: Different Chunking Strategies

This test compares various chunking approaches (recursive, token, sentence, word) to understand how each strategy affects the resulting chunks. This comparison helps identify the best approach for different use cases.

In [13]:
# Test different chunking strategies
chunking_strategies = [
    ("recursive", {"chunker_recipe": "markdown", "chunk_size": 512}),
    ("token", {"chunk_size": 256, "chunk_overlap": 50}),
    ("sentence", {"chunk_size": 3, "chunk_overlap": 1}),
    ("word", {"chunk_size": 100, "chunk_overlap": 20}),
]

results = {}

for strategy_name, kwargs in chunking_strategies:
    try:
        print(f"\n🔄 Testing {strategy_name} chunking...")
        processor = DocumentProcessor(chunker_type=strategy_name, **kwargs)
        doc = processor.process_file(test_file_path, title=f"Test Doc - {strategy_name}")
        
        results[strategy_name] = {
            "chunks": len(doc.chunks),
            "avg_chunk_size": sum(len(chunk.content) for chunk in doc.chunks) / len(doc.chunks),
            "total_content": sum(len(chunk.content) for chunk in doc.chunks)
        }
        
        print(f"  ✅ {strategy_name}: {len(doc.chunks)} chunks")
        
    except Exception as e:
        print(f"  ❌ {strategy_name} failed: {e}")
        results[strategy_name] = {"error": str(e)}

# Summary
print("\n📊 Chunking Strategy Comparison:")
print("-" * 60)
for strategy, result in results.items():
    if "error" in result:
        print(f"{strategy:12} | ERROR: {result['error']}")
    else:
        print(f"{strategy:12} | {result['chunks']:3d} chunks | Avg size: {result['avg_chunk_size']:6.1f} chars")


🔄 Testing recursive chunking...




  ✅ recursive: 88 chunks

🔄 Testing token chunking...




  ✅ token: 125 chunks

🔄 Testing sentence chunking...




  ✅ sentence: 273 chunks

🔄 Testing word chunking...
  ❌ word failed: module 'chonkie' has no attribute 'WordChunker'

📊 Chunking Strategy Comparison:
------------------------------------------------------------
recursive    |  88 chunks | Avg size:  292.6 chars
token        | 125 chunks | Avg size:  255.6 chars
sentence     | 273 chunks | Avg size:   94.3 chars
word         | ERROR: module 'chonkie' has no attribute 'WordChunker'


## Test 5: Factory Methods

This test evaluates the convenience factory methods provided by DocumentProcessor. These methods create pre-configured processors optimized for specific tasks like embeddings, Q&A, and semantic processing.

In [14]:
# Test factory methods
factory_methods = [
    ("for_embeddings", DocumentProcessor.for_embeddings),
    ("for_qa", DocumentProcessor.for_qa),
    ("semantic", DocumentProcessor.semantic),
    ("markdown_recursive", DocumentProcessor.markdown_recursive),
]

print("🏭 Testing Factory Methods:")
for method_name, method in factory_methods:
    try:
        processor = method()
        doc = processor.process_file(test_file_path, title=f"Factory Test - {method_name}")
        print(f"  ✅ {method_name:18} | {len(doc.chunks):3d} chunks | Type: {processor.chunker_type}")
    except Exception as e:
        print(f"  ❌ {method_name:18} | Error: {str(e)[:50]}...")

🏭 Testing Factory Methods:




  ✅ for_embeddings     |  56 chunks | Type: token




  ✅ for_qa             | 273 chunks | Type: sentence


Falling back to loading default provider model.
Falling back to SentenceTransformerEmbeddings.


  ❌ semantic           | Error: Failed to load embeddings via SentenceTransformerE...




  ✅ markdown_recursive |  88 chunks | Type: recursive


## Test 6: Directory Processing

This test examines the DocumentProcessor's ability to process multiple files from a directory. It's useful for understanding batch processing capabilities and handling various file formats.

In [15]:
# Test directory processing if example docs exist
if example_docs_path.exists():
    print("📁 Testing directory processing...")
    try:
        processor = DocumentProcessor()
        documents = processor.process_directory(example_docs_path)
        
        print(f"✅ Processed {len(documents)} documents from directory")
        
        for doc in documents:
            print(f"  - {doc.title}: {len(doc.chunks)} chunks ({doc.content_type.value})")
            
    except Exception as e:
        print(f"❌ Directory processing failed: {e}")
        import traceback
        traceback.print_exc()
else:
    print("📁 Skipping directory test - no example docs found")

📁 Testing directory processing...




✅ Processed 1 documents from directory
  - VERBATIM_RAG_ACL: 88 chunks (pdf)


## Test 7: Document Structure Analysis

This critical test analyzes the document's structural patterns, including headers, sections, and hierarchical elements. The insights from this test inform our hierarchical chunking strategy and help identify documents suitable for hierarchical processing.

In [18]:
# Analyze document structure for hierarchical chunking insights
if 'document' in locals():
    print("🔍 Document Structure Analysis (for hierarchical chunking):")
    print(f"\nDocument: {document.title}")
    # print(f"Raw content sample:")
    # print(document.raw_content[:500] + "...")
    
    print(f"\n📊 Chunk Analysis:")
    chunk_types = {}
    chunk_sizes = []
    
    for chunk in document.chunks:
        chunk_types[chunk.chunk_type] = chunk_types.get(chunk.chunk_type, 0) + 1
        chunk_sizes.append(len(chunk.content))
    
    print(f"Chunk types distribution:")
    for chunk_type, count in chunk_types.items():
        print(f"  {chunk_type.value}: {count}")
    
    print(f"\nChunk size statistics:")
    print(f"  Min: {min(chunk_sizes)} chars")
    print(f"  Max: {max(chunk_sizes)} chars")
    print(f"  Avg: {sum(chunk_sizes)/len(chunk_sizes):.1f} chars")
    
    # Look for hierarchical patterns in content
    print(f"\n🌳 Hierarchical Pattern Analysis:")
    headers = []
    for chunk in document.chunks:
        lines = chunk.content.split('\n')
        for line in lines:
            line = line.strip()
            if line.startswith('#'):
                level = len(line) - len(line.lstrip('#'))
                headers.append((level, line))
    
    if headers:
        print(f"Found {len(headers)} headers:")
        for level, header in headers:
            indent = "  " * (level - 1)
            print(f"{indent}Level {level}: {header}")
    else:
        print("No markdown headers found")

🔍 Document Structure Analysis (for hierarchical chunking):

Document: VERBATIM_RAG_ACL.pdf

📊 Chunk Analysis:
Chunk types distribution:
  paragraph: 88

Chunk size statistics:
  Min: 1 chars
  Max: 510 chars
  Avg: 292.6 chars

🌳 Hierarchical Pattern Analysis:
Found 18 headers:
  Level 2: ## KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering
  Level 2: ## Paul Schmitt
  Level 2: ## Abstract
  Level 2: ## 1 Introduction
  Level 2: ## 2 Background
  Level 2: ## 2.1 Dataset
  Level 2: ## 2.2 Limitations of Standard RAG
  Level 2: ## 2.3 Synthetic Training Data
  Level 2: ## 3 Method
  Level 2: ## 3.1 System Overview
  Level 2: ## 3.2 Evidence Extraction
  Level 2: ## 3.3 Synthetic Data Generation
  Level 2: ## 3.4 Answer Generation
  Level 2: ## 4 Evaluation
  Level 2: ## 5 Ethical Considerations
  Level 2: ## 6 Limitations
  Level 2: ## 7 Conclusion
  Level 2: ## References


## Test 8: Document Serialization

This test verifies that documents can be properly serialized to and deserialized from dictionary format. This capability is essential for storing, transferring, and reconstructing document objects while maintaining data integrity.

In [None]:
# Test document serialization/deserialization
if 'document' in locals():
    print("💾 Testing document serialization...")
    
    try:
        # Convert to dict
        doc_dict = document.to_dict()
        print(f"✅ Document serialized to dict ({len(str(doc_dict))} chars)")
        
        # Convert back to document
        restored_doc = document.__class__.from_dict(doc_dict)
        print(f"✅ Document restored from dict")
        
        # Verify integrity
        print(f"\n🔍 Integrity check:")
        print(f"  Title match: {document.title == restored_doc.title}")
        print(f"  Chunks count match: {len(document.chunks) == len(restored_doc.chunks)}")
        print(f"  Content match: {document.raw_content == restored_doc.raw_content}")
        
        if len(document.chunks) > 0 and len(restored_doc.chunks) > 0:
            print(f"  First chunk content match: {document.chunks[0].content == restored_doc.chunks[0].content}")
        
    except Exception as e:
        print(f"❌ Serialization test failed: {e}")
        import traceback
        traceback.print_exc()

## Test 9: Docling HierarchicalChunker

This test explores Docling's built-in HierarchicalChunker to determine if it preserves document hierarchy better than standard chunking. We investigate whether the issue lies in PDF conversion or the chunking process itself.

## Test 11: Hierarchical Chunking Prototype

This is our breakthrough test that implements hierarchical chunking using section numbering patterns. It creates a complete hierarchical document structure with parent-child relationships, demonstrating how to overcome Docling's hierarchy flattening limitations through intelligent post-processing.

## Test 10: View Raw Converted Content

This test examines the raw markdown content produced by Docling's PDF conversion to understand exactly what happens during the document conversion process. It helps identify where hierarchy information is lost and explores alternative export methods.

## Test 11: Hierarchical Chunking Prototype

In [116]:
# Prototype hierarchical chunking using section numbering

import re
from dataclasses import dataclass, field
from typing import List, Optional
from verbatim_rag.document import Chunk, ChunkType

HEADER_RE = re.compile(r'^##\s+(\d+(?:\.\d+)*)\s+(.+)$')
NUM_RE    = re.compile(r'^(\d+(?:\.\d+)*)\s+([A-Z][A-Za-z\s]+.*)$')

@dataclass
class HierarchicalChunk(Chunk):
    """Extended Chunk class with hierarchy support."""
    parent_chunk_id: Optional[str] = None
    child_chunk_ids: List[str] = field(default_factory=list)
    hierarchy_level: int = 0  # 0=document, 1=section, 2=subsection, 3=content
    section_number: Optional[str] = None  # "1", "2.1", "3.2.1"

    def add_child(self, child_chunk: 'HierarchicalChunk'):
        """Add a child chunk and set up parent-child relationship."""
        child_chunk.parent_chunk_id = self.id
        if child_chunk.id not in self.child_chunk_ids:
            self.child_chunk_ids.append(child_chunk.id)

    def __str__(self):
        indent = "  " * self.hierarchy_level
        section = f"[{self.section_number}] " if self.section_number else ""
        return f"{indent}{section}{self.content[:100]}..."

def detect_section_numbering(content: str) -> List[tuple]:
    """
    Detect section numbering patterns in content.
    Returns list of (line_number, section_number, title, level, full_line)
    """
    lines = content.split('\n')
    sections = []

    for i, line in enumerate(lines):
        line_stripped = line.strip()

        # Pattern 1: "## 1 Introduction" or "## 2.1 Dataset"
        match = HEADER_RE.match(line_stripped)
        if match:
            section_num = match.group(1)
            title = match.group(2)
            level = len(section_num.split('.'))
            sections.append((i+1, section_num, title, level, line_stripped))
            continue

        # Pattern 2: Just numbers "1 Introduction" (without ##)
        match = NUM_RE.match(line_stripped)
        if match:
            section_num = match.group(1)
            title = match.group(2)
            level = len(section_num.split('.'))
            sections.append((i+1, section_num, title, level, line_stripped))

    return sections

def create_hierarchical_chunks(content: str, document_id: str) -> List[HierarchicalChunk]:
    """
    Create hierarchical chunks from content using section numbering.
    """
    # Step 1: Detect sections
    sections = detect_section_numbering(content)

    if not sections:
        print("❌ No section numbering found - fallback to flat chunking")
        return []

    print(f"✅ Found {len(sections)} sections with numbering")

    # Step 2: Split content by sections
    lines = content.split('\n')
    hierarchical_chunks = []
    chunk_map = {}  # section_number -> chunk

    for i, (line_num, section_num, title, level, full_line) in enumerate(sections):
        # Find content for this section (until next section)
        start_line = line_num - 1  # Convert to 0-based
        if i + 1 < len(sections):
            end_line = sections[i + 1][0] - 1  # Next section's line
        else:
            end_line = len(lines)  # End of document

        # Extract section content
        section_lines = lines[start_line:end_line]
        header, *body = section_lines
        section_content = '\n'.join(body).strip()

        # Create hierarchical chunk
        chunk = HierarchicalChunk(
            document_id=document_id,
            content=section_content,
            chunk_number=i,
            chunk_type=ChunkType.SECTION if level <= 2 else ChunkType.PARAGRAPH,
            hierarchy_level=level,
            section_number=section_num,
            metadata={
                'section_title': title,
                'section_number': section_num,
                'hierarchy_level': level
            }
        )

        hierarchical_chunks.append(chunk)
        chunk_map[section_num] = chunk

    # Step 3: Build parent-child relationships
    for chunk in hierarchical_chunks:
        section_parts = chunk.section_number.split('.')

        # Find parent (e.g., "2.1" parent is "2")
        if len(section_parts) > 1:
            parent_section = '.'.join(section_parts[:-1])
            parent_chunk = chunk_map.get(parent_section)
            if parent_chunk:
                parent_chunk.add_child(chunk)

    return hierarchical_chunks

## Test Summary and Recommendations

This final section summarizes all test results and provides actionable recommendations for implementing production-ready hierarchical chunking. It consolidates insights from all previous tests and outlines the next steps for development.

**Analyze the detected Sections**

In [117]:
# Test section detection
content = document.raw_content
sections = detect_section_numbering(content)
print(f"🔍 Section Detection Results:")
print(f"Found {len(sections)} sections:")

for line_num, section_num, title, level, full_line in sections[:10]:
    indent = "  " * (level - 1)
    print(f"  Line {line_num:3d}: {indent}Level {level} - {section_num} {title}")

if len(sections) > 10:
    print(f"  ... and {len(sections) - 10} more sections")

🔍 Section Detection Results:
Found 14 sections:
  Line  17: Level 1 - 1 Introduction
  Line  30: Level 1 - 2 Background
  Line  32:   Level 2 - 2.1 Dataset
  Line  42:   Level 2 - 2.2 Limitations of Standard RAG
  Line  46:   Level 2 - 2.3 Synthetic Training Data
  Line  50: Level 1 - 3 Method
  Line  52:   Level 2 - 3.1 System Overview
  Line  56:   Level 2 - 3.2 Evidence Extraction
  Line  66:   Level 2 - 3.3 Synthetic Data Generation
  Line  79:   Level 2 - 3.4 Answer Generation
  ... and 4 more sections


**Hierarchical Chunks**

In [118]:
hierarchical_chunks = create_hierarchical_chunks(content, document.id)
print(f"Created {len(hierarchical_chunks)} hierarchical chunks")

✅ Found 14 sections with numbering
Created 14 hierarchical chunks


**Analyze the Hierarchy**

In [119]:
level_counts = {}
parent_child_pairs = 0

for chunk in hierarchical_chunks:
    level_counts[chunk.hierarchy_level] = level_counts.get(chunk.hierarchy_level, 0) + 1
    if chunk.child_chunk_ids:
        parent_child_pairs += len(chunk.child_chunk_ids)

**Level Distribution**

In [120]:
print(f"Level distribution:")
for level in sorted(level_counts.keys()):
    print(f"  Level {level}: {level_counts[level]} chunks")
print(f"Parent-child relationships: {parent_child_pairs}")

Level distribution:
  Level 1: 7 chunks
  Level 2: 7 chunks
Parent-child relationships: 7


**Hierarchy Structure**

In [121]:
print(len(hierarchical_chunks))

14


In [126]:
print(f"📋 Hierarchy Structure (first 10 chunks):")
for i, chunk in enumerate(hierarchical_chunks[:10]):
    print()
    #print(f"{i+1:2d}. {chunk}")
    print(chunk.metadata["section_title"])
    print(chunk.content)
    if chunk.child_chunk_ids:
        print(f"     └── Has {len(chunk.child_chunk_ids)} children")

📋 Hierarchy Structure (first 10 chunks):

Introduction
Modern question-answering (QA) and retrievalaugmented generation (RAG) systems play a vital role in many high-stakes domains for information extraction and generation tasks. In medicine, a typical use case involves clinicians asking questions based on a patient's electronic health record (EHR) notes, rather than manually sifting through lengthy notes, which can be time-consuming. However, in practice, RAG and QA pipelines often misalign evidence and produce incorrect information, commonly referred to as hallucinations (Ji et al., 2023; Madsen et al., 2024). We argue that a reliable QA system should guarantee complete traceability of answers. To tackle this problem, we propose a verbatim pipeline that clearly separates extraction and generation to mitigate hallucinations (other errors may still occur):

- Sentence-level extraction , using either zeroshot LLMs or supervised ModernBERT classifiers.
- Template-constrained generation , 

In [114]:
for i, chunk in enumerate(hierarchical_chunks[:10]):
    print(chunk.metadata)
    print(chunk.content)
    break

{'section_title': 'Introduction', 'section_number': '1', 'hierarchy_level': 1}
Modern question-answering (QA) and retrievalaugmented generation (RAG) systems play a vital role in many high-stakes domains for information extraction and generation tasks. In medicine, a typical use case involves clinicians asking questions based on a patient's electronic health record (EHR) notes, rather than manually sifting through lengthy notes, which can be time-consuming. However, in practice, RAG and QA pipelines often misalign evidence and produce incorrect information, commonly referred to as hallucinations (Ji et al., 2023; Madsen et al., 2024). We argue that a reliable QA system should guarantee complete traceability of answers. To tackle this problem, we propose a verbatim pipeline that clearly separates extraction and generation to mitigate hallucinations (other errors may still occur):

- Sentence-level extraction , using either zeroshot LLMs or supervised ModernBERT classifiers.
- Template-c

**Test: Find children of a parent**

In [115]:
print(f"👨‍👧‍👦 Parent-Child Relationship Test:")
for chunk in hierarchical_chunks[:5]:
    if chunk.child_chunk_ids:
        print(f"Parent: {chunk.section_number} {chunk.metadata.get('section_title', '')}")
        for child_id in chunk.child_chunk_ids:
            child_chunk = next((c for c in hierarchical_chunks if c.id == child_id), None)
            if child_chunk:
                print(f"  └── Child: {child_chunk.section_number} {child_chunk.metadata.get('section_title', '')}")

👨‍👧‍👦 Parent-Child Relationship Test:
Parent: 2 Background
  └── Child: 2.1 Dataset
  └── Child: 2.2 Limitations of Standard RAG
  └── Child: 2.3 Synthetic Training Data


## Test 9: Docling HierarchicalChunker

In [21]:
# Follow-up: Examine DocChunk structure and content
if 'chunks' in locals() and chunks:
    print("🔍 Deep Analysis of Docling DocChunks:")
    
    for i, chunk in enumerate(chunks[:5]):
        print(f"\n--- DocChunk {i+1} Deep Dive ---")
        
        # Access the text content directly
        chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
        print(f"Raw text ({len(chunk_text)} chars):")
        print(f"'{chunk_text[:300]}...'")
        
        # Check chunk metadata
        if hasattr(chunk, 'meta'):
            print(f"Meta type: {type(chunk.meta)}")
            if hasattr(chunk.meta, 'doc_items'):
                print(f"Doc items count: {len(chunk.meta.doc_items)}")
                
                # Look at document items for structure
                for j, item in enumerate(chunk.meta.doc_items[:3]):
                    print(f"  Item {j}: {type(item).__name__}")
                    if hasattr(item, 'text'):
                        print(f"    Text: '{item.text[:100]}...'")
                    if hasattr(item, 'parent'):
                        print(f"    Parent: {item.parent}")
                    if hasattr(item, 'level') or hasattr(item, 'hierarchy_level'):
                        level = getattr(item, 'level', getattr(item, 'hierarchy_level', None))
                        print(f"    Level: {level}")
        
        # Look for section patterns in the text
        lines = chunk_text.split('\n')
        for line_num, line in enumerate(lines[:10]):
            line = line.strip()
            # Look for section patterns like "1 Introduction", "2.1 Dataset"
            import re
            if re.match(r'^\d+(\.\d+)*\s+[A-Z]', line):
                print(f"  📍 Section pattern found: '{line}'")
            elif re.match(r'^[A-Z][A-Za-z\s]+$', line) and len(line) < 50:
                print(f"  📝 Possible header: '{line}'")

    # Test: Can we detect hierarchy from section numbering?
    print(f"\n🔢 Section Number Analysis:")
    section_patterns = []
    
    for chunk in chunks:
        chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
        lines = chunk_text.split('\n')
        
        for line in lines:
            line = line.strip()
            # Match patterns like "1 Introduction", "2.1 Dataset", "2.1.1 Details"
            match = re.match(r'^(\d+(?:\.\d+)*)\s+([A-Z][A-Za-z\s]+)', line)
            if match:
                section_num = match.group(1)
                section_title = match.group(2)
                level = len(section_num.split('.'))
                section_patterns.append((level, section_num, section_title))
    
    if section_patterns:
        print("✅ Found section number hierarchy:")
        for level, num, title in section_patterns[:10]:
            indent = "  " * (level - 1)
            print(f"{indent}Level {level}: {num} {title}")
        
        print(f"\n🚀 SOLUTION: Use section numbering for hierarchy!")
        print("We can create hierarchical chunks based on section numbers:")
        print("  Level 1: 1, 2, 3, 4...")
        print("  Level 2: 2.1, 2.2, 3.1...")
        print("  Level 3: 2.1.1, 2.1.2...")
    else:
        print("❌ No clear section numbering patterns found")
        print("May need to use content-based or semantic hierarchy")
else:
    print("❌ No chunks available for deep analysis")

🔍 Deep Analysis of Docling DocChunks:

--- DocChunk 1 Deep Dive ---
Raw text (36 chars):
'Ádám Kovács KR Labs kovacs@krlabs.eu...'
Meta type: <class 'docling_core.transforms.chunker.hierarchical_chunker.DocMeta'>
Doc items count: 1
  Item 0: TextItem
    Text: 'Ádám Kovács KR Labs kovacs@krlabs.eu...'
    Parent: cref='#/body'

--- DocChunk 2 Deep Dive ---
Raw text (33 chars):
'TU Wien paul.schmitt@tuwien.ac.at...'
Meta type: <class 'docling_core.transforms.chunker.hierarchical_chunker.DocMeta'>
Doc items count: 1
  Item 0: TextItem
    Text: 'TU Wien paul.schmitt@tuwien.ac.at...'
    Parent: cref='#/body'

--- DocChunk 3 Deep Dive ---
Raw text (45 chars):
'Gábor Recski KR Labs TU Wien recski@krlabs.eu...'
Meta type: <class 'docling_core.transforms.chunker.hierarchical_chunker.DocMeta'>
Doc items count: 1
  Item 0: TextItem
    Text: 'Gábor Recski KR Labs TU Wien recski@krlabs.eu...'
    Parent: cref='#/body'

--- DocChunk 4 Deep Dive ---
Raw text (678 chars):
'We present a lightweight

## Test 9.5: Docling Conversion Analysis

In [25]:
# View the raw converted markdown content to see what Docling actually produces
from docling.document_converter import DocumentConverter

print("📄 Examining Raw Docling Conversion:")

pdf_path = project_root / "data" / "acl_papers" / "VERBATIM_RAG_ACL.pdf"

if pdf_path.exists():
    converter = DocumentConverter()
    result = converter.convert(str(pdf_path))
    
    # Method 1: Standard markdown export (what DocumentProcessor uses)
    print("\n🔍 Method 1: Standard export_to_markdown():")
    markdown_content = result.document.export_to_markdown()
    print(f"Length: {len(markdown_content)} characters")
    print("First 2000 characters:")
    print("-" * 60)
    print(repr(markdown_content[:2000]))  # Use repr to see actual characters
    print("-" * 60)
    
    # Method 2: Check if there are other export options
    print(f"\n🔍 Method 2: Available export methods:")
    export_methods = [method for method in dir(result.document) if 'export' in method.lower()]
    print(f"Available methods: {export_methods}")
    
    # Method 3: Look at document structure
    print(f"\n🔍 Method 3: Document structure:")
    print(f"Document type: {type(result.document)}")
    doc_attrs = [attr for attr in dir(result.document) if not attr.startswith('_')]
    print(f"Document attributes: {doc_attrs[:10]}...")
    
    # Method 4: Try to access raw document elements
    if hasattr(result.document, 'body'):
        print(f"\n🔍 Method 4: Document body structure:")
        print(f"Body type: {type(result.document.body)}")
        body_attrs = [attr for attr in dir(result.document.body) if not attr.startswith('_')]
        print(f"Body attributes: {body_attrs[:10]}...")
    
    # Method 5: Save to file for inspection
    output_file = project_root / "debug_converted_content.md"
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    print(f"\n💾 Saved full content to: {output_file}")
    print("You can open this file to see the complete converted content!")
    
    # Method 6: Look for header patterns in the raw content
    print(f"\n🔍 Method 6: Header pattern analysis in raw content:")
    lines = markdown_content.split('\n')
    header_lines = []
    
    for i, line in enumerate(lines):
        line_stripped = line.strip()
        if line_stripped.startswith('#'):
            header_lines.append((i+1, line))
        elif re.match(r'^\d+(\.\d+)*\s+[A-Z]', line_stripped):
            header_lines.append((i+1, f"[NUMBER] {line}"))
    
    print(f"Found {len(header_lines)} potential headers:")
    for line_num, header in header_lines[:15]:  # Show first 15
        print(f"  Line {line_num}: {header}")
    
    if len(header_lines) > 15:
        print(f"  ... and {len(header_lines) - 15} more")

else:
    print(f"❌ PDF not found: {pdf_path}")

📄 Examining Raw Docling Conversion:





🔍 Method 1: Standard export_to_markdown():
Length: 25746 characters
First 2000 characters:
------------------------------------------------------------
"## KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering\n\nÁdám Kovács KR Labs kovacs@krlabs.eu\n\n## Paul Schmitt\n\nTU Wien paul.schmitt@tuwien.ac.at\n\nGábor Recski KR Labs TU Wien recski@krlabs.eu\n\n## Abstract\n\nWe present a lightweight, domain-agnostic verbatim pipeline for evidence-grounded question answering. Our pipeline operates in two steps: first, a sentence-level extractor flags relevant note sentences using either zero-shot LLM prompts or supervised ModernBERT classifiers. Next, an LLM drafts a questionspecific template, which is filled verbatim with sentences from the extraction step. This prevents hallucinations and ensures traceability. In the ArchEHR-QA 2025 shared task, our system scored 42.01%, ranking top-10 in core metrics and outperforming the organiser's 70B-parameter Llama-3.

## Test 10: View Raw Converted Content

In [24]:
print("📋 Test Summary:")
print("=" * 50)
print("\n✅ Successfully tested:")
print("  - Basic DocumentProcessor creation")
print("  - File processing with different chunking strategies")
print("  - Factory methods for specialized processors")
print("  - Document structure analysis")
print("  - Document serialization/deserialization")

print("\n🚀 Ready for hierarchical chunking implementation!")
print("\nNext steps for hierarchical chunking:")
print("  1. Extend Chunk class with parent_chunk_id field")
print("  2. Modify DocumentProcessor to create hierarchical relationships")
print("  3. Update VerbatimIndex to handle hierarchical chunks")
print("  4. Add hierarchical search capabilities")

print("\n💡 Insights for hierarchical chunking:")
if 'headers' in locals() and headers:
    print(f"  - Document has {len(headers)} headers for natural hierarchy")
    max_level = max(level for level, _ in headers)
    print(f"  - Maximum header depth: {max_level} levels")
    print("  - Can use markdown structure for parent-child relationships")
else:
    print("  - Document lacks clear hierarchical structure")
    print("  - Consider semantic-based or size-based hierarchical chunking")

print("\n🧹 Cleanup...")
if test_file_path.exists():
    test_file_path.unlink()
    print(f"Removed test file: {test_file_path}")

📋 Test Summary:

✅ Successfully tested:
  - Basic DocumentProcessor creation
  - File processing with different chunking strategies
  - Factory methods for specialized processors
  - Document structure analysis
  - Document serialization/deserialization

🚀 Ready for hierarchical chunking implementation!

Next steps for hierarchical chunking:
  1. Extend Chunk class with parent_chunk_id field
  2. Modify DocumentProcessor to create hierarchical relationships
  3. Update VerbatimIndex to handle hierarchical chunks
  4. Add hierarchical search capabilities

💡 Insights for hierarchical chunking:
  - Document has 18 headers for natural hierarchy
  - Maximum header depth: 2 levels
  - Can use markdown structure for parent-child relationships

🧹 Cleanup...
Removed test file: /Users/paulschmitt/DataspellProjects/verbatim-rag/data/acl_papers/VERBATIM_RAG_ACL.pdf
