# Document Processing Demo

This notebook demonstrates the document processing pipeline used in the Content Verification Tool:

1. **Docling PDF Conversion** - Converting PDF to DoclingDocument
2. **Docling DOCX Conversion** - Converting DOCX to PDF with LibreOffice, then to DoclingDocument
3. **Hierarchical Pre-chunking** - Using HybridChunker from Docling
4. **Paragraph-level Splitting** - Using LangChain RecursiveCharacterTextSplitter
5. **Sentence-level Splitting** - Using spaCy sentence detection
6. **Verification Shell Creation** - Creating DocumentChunk objects with page-number and item-number assignments

## Prerequisites

```bash
# Install dependencies (already in pyproject.toml)
pip install docling docling-core langchain-text-splitters spacy python-docx termcolor

# Download spaCy model
python -m spacy download en_core_web_sm
```

## Sample Documents

This demo requires sample PDF and DOCX files. You can:
- Use your own legal documents (max 10MB)
- Create sample files named `sample.pdf` and `sample.docx` in the project root
- Update the file paths in the code cells below
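
Before running the cells below, it can help to confirm that a candidate file fits these constraints. The following is a minimal sketch, not part of the project's code: `check_sample_file` is a hypothetical helper, and treating "10MB" as 10 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024  # Assumption: "10MB" means 10 MiB
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}  # The two formats this demo covers


def check_sample_file(path: str) -> list[str]:
    """Return a list of problems with the file; an empty list means it looks usable."""
    problems = []
    file_path = Path(path)
    if file_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {file_path.suffix or '(none)'}")
    if not file_path.exists():
        problems.append(f"file not found: {file_path}")
    elif file_path.stat().st_size > MAX_FILE_SIZE_BYTES:
        problems.append(f"file exceeds 10 MB: {file_path.stat().st_size} bytes")
    return problems


# Check the default file names used later in this notebook.
for candidate in ("sample.pdf", "sample.docx"):
    issues = check_sample_file(candidate)
    print(candidate, "OK" if not issues else issues)
```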

## Setup: Import Libraries and Initialize Processors

In [None]:
import os
import sys
from pathlib import Path
from pprint import pprint

# Add backend to path for imports
sys.path.insert(0, str(Path.cwd() / 'backend'))

from app.processing.document_processor import DocumentProcessor
from app.processing.chunker import DocumentChunker
from app.models import ChunkingMode, DocumentChunk

print("=" * 80)
print("INITIALIZING DOCUMENT PROCESSING PIPELINE")
print("=" * 80)

# Initialize processor and chunker
processor = DocumentProcessor()
chunker = DocumentChunker()

print("\n✓ All components initialized successfully")

## Part 1: PDF Processing with Docling

We'll convert a PDF document to a DoclingDocument object, which preserves:
- Page structure and numbers
- Paragraphs and sections
- Tables
- Footnotes
- Provenance data (for tracking text spans across pages)

In [None]:
# Define path to sample PDF
PDF_PATH = "sample.pdf"  # User should replace with their file

print("=" * 80)
print("PART 1: PDF → DoclingDocument")
print("=" * 80)

# Check if file exists
if not Path(PDF_PATH).exists():
    print(f"\n❌ ERROR: File not found: {PDF_PATH}")
    print("Please create a sample PDF file or update PDF_PATH variable")
else:
    # Read file content
    with open(PDF_PATH, "rb") as f:
        pdf_content = f.read()
    
    print(f"\n📄 Processing PDF: {PDF_PATH}")
    print(f"   File size: {len(pdf_content) / 1024:.2f} KB")
    
    # Convert with Docling
    result = processor.convert_document(pdf_content, PDF_PATH, use_cache=True)
    
    docling_doc = result['docling_document']
    
    print(f"\n✓ Conversion successful!")
    print(f"   Filename: {result['filename']}")
    print(f"   Pages: {result['page_count']}")
    print(f"   File size: {result['file_size']} bytes")
    
    # Inspect DoclingDocument structure
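    print(f"\n📊 DoclingDocument Structure:")
    print(f"   Type: {type(docling_doc)}")
    print(f"   Has pages: {hasattr(docling_doc, 'pages')}")
    if hasattr(docling_doc, 'pages'):
        pages = docling_doc.pages
        # In current docling-core releases, pages is a dict keyed by page number,
        # so pages[0] would raise KeyError; take the first value instead
        page_items = list(pages.values()) if isinstance(pages, dict) else list(pages)
        print(f"   Total pages: {len(page_items)}")
        print(f"   First page: {page_items[0] if page_items else 'N/A'}")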

## Part 2: DOCX Processing with LibreOffice + Docling

DOCX files are converted to PDF using LibreOffice first, then processed with Docling.

**Why?**
- Accurate page numbers (DOCX doesn't have fixed pages)
- Consistent processing pipeline (everything goes through PDF)
- Better metadata extraction
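
The DOCX-to-PDF step can be sketched as a headless LibreOffice call. This is an illustration, not the `DocumentProcessor` implementation: the helper names are hypothetical, and the sketch assumes the LibreOffice binary is available on `PATH` as `soffice` (some installations expose it as `libreoffice` instead).

```python
import shutil
import subprocess
from pathlib import Path


def build_libreoffice_command(docx_path: str, output_dir: str, binary: str = "soffice") -> list[str]:
    """Build the headless LibreOffice command that converts one DOCX file to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(docx_path),
    ]


def convert_docx_to_pdf(docx_path: str, output_dir: str, timeout_seconds: int = 120) -> Path:
    """Run the conversion and return the path of the PDF that LibreOffice writes."""
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice ('soffice') was not found on PATH")
    subprocess.run(
        build_libreoffice_command(docx_path, output_dir),
        check=True,            # Raise CalledProcessError on a non-zero exit code
        capture_output=True,   # Keep LibreOffice's console output out of the notebook
        timeout=timeout_seconds,
    )
    # LibreOffice names the output after the input file's stem.
    return Path(output_dir) / (Path(docx_path).stem + ".pdf")


# Only build and show the command here; nothing is executed.
print(build_libreoffice_command("sample.docx", "converted_pdfs"))
```

Separating command construction from execution keeps the command easy to inspect and test on machines where LibreOffice is not installed.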

In [None]:
# Define path to sample DOCX
DOCX_PATH = "sample.docx"  # User should replace with their file

print("=" * 80)
print("PART 2: DOCX → PDF → DoclingDocument")
print("=" * 80)

if not Path(DOCX_PATH).exists():
    print(f"\n❌ ERROR: File not found: {DOCX_PATH}")
    print("Please create a sample DOCX file or update DOCX_PATH variable")
else:
    # Read file content
    with open(DOCX_PATH, "rb") as f:
        docx_content = f.read()
    
    print(f"\n📄 Processing DOCX: {DOCX_PATH}")
    print(f"   File size: {len(docx_content) / 1024:.2f} KB")
    print(f"\n⚠️  Note: This will use LibreOffice for DOCX→PDF conversion")
    
    # Convert with Docling (includes LibreOffice conversion)
    result = processor.convert_document(docx_content, DOCX_PATH, use_cache=True)
    
    docling_doc_docx = result['docling_document']
    
    print(f"\n✓ Conversion successful!")
    print(f"   Original: {result['filename']}")
    print(f"   Pages: {result['page_count']}")
    print(f"   Note: Intermediate PDF was created and cleaned up automatically")

## Part 3: Hierarchical Pre-chunking with HybridChunker

After conversion, the first chunking step uses Docling's **HybridChunker** to:
- Preserve document structure (headings, paragraphs, sections)
- Maintain page numbers and provenance data
- Extract footnotes as separate items
- Handle tables

This creates the foundation for further splitting.
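
The base chunks inspected in the next cell are dictionaries with `text`, `page_number`, and `is_overlap` keys. As a rough sketch of how such a record might be derived from provenance data, the hypothetical `make_base_chunk` helper below assigns a chunk to the first page it appears on and flags it as an overlap when its text spans more than one page. Both rules are assumptions made for illustration; see `backend/app/processing/` for the actual logic.

```python
from typing import TypedDict


class BaseChunk(TypedDict):
    text: str
    page_number: int
    is_overlap: bool


def make_base_chunk(text: str, provenance_pages: list[int]) -> BaseChunk:
    """Collapse the pages a chunk's text was found on into one record.

    Assumption: the chunk is assigned to the first page it appears on and is
    flagged as an overlap when its text spans more than one page.
    """
    if not provenance_pages:
        raise ValueError("a chunk needs at least one provenance page")
    pages = sorted(set(provenance_pages))
    return {
        "text": text.strip(),
        "page_number": pages[0],
        "is_overlap": len(pages) > 1,
    }


print(make_base_chunk("A clause that starts on page 2 and ends on page 3.", [2, 2, 3]))
```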

In [None]:
print("=" * 80)
print("PART 3: Hierarchical Pre-chunking")
print("=" * 80)

# Use the PDF document from Part 1
if 'docling_doc' not in locals():
    print("\n❌ ERROR: No DoclingDocument available")
    print("Please run Part 1 first to load a PDF document")
else:
    # Apply hierarchical chunking (internal method)
    base_chunks = chunker._apply_hierarchical_chunking(docling_doc)
    
    print(f"\n✓ Hierarchical chunking complete")
    print(f"   Total base chunks: {len(base_chunks)}")
    
    # Show sample chunks
    print(f"\n📋 Sample Base Chunks (first 5):")
    print("-" * 80)
    
    for i, chunk in enumerate(base_chunks[:5], 1):
        page = chunk['page_number']
        overlap = "⚠️ OVERLAP" if chunk['is_overlap'] else ""
        text_preview = chunk['text'][:100] + "..." if len(chunk['text']) > 100 else chunk['text']
        
        print(f"\n{i}. Page {page} {overlap}")
        print(f"   Text: \"{text_preview}\"")
    
    # Statistics
    pages_with_content = set(chunk['page_number'] for chunk in base_chunks)
    overlap_count = sum(1 for chunk in base_chunks if chunk['is_overlap'])
    
    print(f"\n📊 Statistics:")
    print(f"   Pages with content: {len(pages_with_content)}")
    print(f"   Chunks with overlap: {overlap_count}")
    print(f"   Average chunk length: {sum(len(c['text']) for c in base_chunks) / len(base_chunks):.0f} chars")

## Part 4: Paragraph-level Splitting

After hierarchical chunking, we apply **LangChain's RecursiveCharacterTextSplitter** to break content into paragraphs.

**Configuration:**
- chunk_size: 100 characters (configurable)
- chunk_overlap: 10 characters
- Separators: `\n\n`, `\n`, `. `, etc.
- keep_separator: "end" (preserves punctuation)
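
To make the separator hierarchy concrete, here is a simplified, standard-library-only sketch of the recursive splitting idea. It is not LangChain's implementation: `split_recursive` is a hypothetical name, the separator tuple is an assumed stand-in for the list above, and `chunk_overlap` is left out for brevity. It does mirror `keep_separator="end"` by leaving each separator attached to the end of the preceding piece.

```python
def split_recursive(text, chunk_size=100, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer ones for
    pieces that are still too long. Separators stay attached to the end of the
    piece before them, so concatenating the result reproduces the input.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator to the end of every part except the last.
    pieces = [part + sep for part in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        if len(current) + len(piece) <= chunk_size:
            current += piece  # Greedily merge small neighbouring pieces.
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry this piece with the finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
for chunk in split_recursive(sample, chunk_size=40):
    print(repr(chunk))
```

Because separators are kept rather than dropped, joining the chunks gives back the original text, which makes it easier to map chunks to their source positions.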

In [None]:
print("=" * 80)
print("PART 4: Paragraph-level Splitting")
print("=" * 80)

if 'base_chunks' not in locals():
    print("\n❌ ERROR: No base chunks available")
    print("Please run Part 3 first")
else:
    # Apply paragraph splitting
    paragraph_chunks = chunker._apply_paragraph_splitting(base_chunks)
    
    print(f"\n✓ Paragraph splitting complete")
    print(f"   Base chunks: {len(base_chunks)}")
    print(f"   Paragraph chunks: {len(paragraph_chunks)}")
    print(f"   Expansion factor: {len(paragraph_chunks) / len(base_chunks):.2f}x")
    
    # Show sample paragraphs
    print(f"\n📋 Sample Paragraph Chunks (first 5):")
    print("-" * 80)
    
    for i, chunk in enumerate(paragraph_chunks[:5], 1):
        page = chunk['page_number']
        text_preview = chunk['text'][:150] + "..." if len(chunk['text']) > 150 else chunk['text']
        
        print(f"\n{i}. Page {page}")
        print(f"   \"{text_preview}\"")
    
    # Length distribution
    lengths = [len(chunk['text']) for chunk in paragraph_chunks]
    print(f"\n📊 Paragraph Length Statistics:")
    print(f"   Min: {min(lengths)} chars")
    print(f"   Max: {max(lengths)} chars")
    print(f"   Avg: {sum(lengths) / len(lengths):.0f} chars")

## Part 5: Sentence-level Splitting with spaCy

For fine-grained verification, we use **spaCy's sentence boundary detection** to split text into individual sentences.

**Key Features:**
- One sentence per chunk
- Intelligent boundary detection (handles abbreviations, titles, etc.)
- Tracks which base chunk each sentence came from
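
The bookkeeping behind these features can be sketched without loading a model. The example below swaps spaCy for a small regex-based splitter with a hand-written abbreviation list, purely as a stand-in: it is far less capable than spaCy's detection. The helper names and the 1-based `base_chunk_index` are assumptions made for illustration.

```python
import re

# Small, illustrative abbreviation list; a trained model handles far more cases.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "inc.", "ltd.", "no.", "v.", "e.g.", "i.e."}

# Candidate boundary: whitespace that follows ., ! or ? and precedes an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list[str]:
    """Split text at candidate boundaries, skipping those that follow an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # The period belongs to an abbreviation, not a sentence end.
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def split_base_chunks(base_chunks: list[dict]) -> list[dict]:
    """Emit one chunk per sentence, remembering the base chunk it came from."""
    sentence_chunks = []
    for base_index, base in enumerate(base_chunks, start=1):
        for sentence in split_sentences(base["text"]):
            sentence_chunks.append({
                "text": sentence,
                "page_number": base["page_number"],
                "base_chunk_index": base_index,
            })
    return sentence_chunks


demo = [{"text": "Dr. Smith signed the lease. The tenant agreed!", "page_number": 1}]
for row in split_base_chunks(demo):
    print(row)
```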

In [None]:
print("=" * 80)
print("PART 5: Sentence-level Splitting")
print("=" * 80)

if 'base_chunks' not in locals():
    print("\n❌ ERROR: No base chunks available")
    print("Please run Part 3 first")
else:
    # Apply sentence splitting
    sentence_chunks = chunker._apply_sentence_splitting(base_chunks)
    
    print(f"\n✓ Sentence splitting complete")
    print(f"   Base chunks: {len(base_chunks)}")
    print(f"   Sentence chunks: {len(sentence_chunks)}")
    print(f"   Expansion factor: {len(sentence_chunks) / len(base_chunks):.2f}x")
    
    # Show sample sentences
    print(f"\n📋 Sample Sentence Chunks (first 10):")
    print("-" * 80)
    
    for i, chunk in enumerate(sentence_chunks[:10], 1):
        page = chunk['page_number']
        base_idx = chunk.get('base_chunk_index', 'N/A')
        text = chunk['text']
        
        print(f"\n{i}. Page {page} (Base Chunk {base_idx})")
        print(f"   \"{text}\"")
    
    # Sentences per base chunk
    from collections import Counter
    base_chunk_counts = Counter(chunk.get('base_chunk_index', -1) for chunk in sentence_chunks)
    
    print(f"\n📊 Sentence Distribution:")
    print(f"   Total sentences: {len(sentence_chunks)}")
    print(f"   Avg sentences per base chunk: {len(sentence_chunks) / len(base_chunks):.1f}")
    print(f"   Base chunk with most sentences: {max(base_chunk_counts.values())} sentences")

## Part 6: Creating Verification Shells (DocumentChunk Objects)

The final step assigns **item numbers** to each chunk and creates **DocumentChunk** objects ready for verification.

**Item Numbering:**
- **Paragraph mode:** Simple sequential (1, 2, 3...) - resets per page
- **Sentence mode:** Hierarchical (1.1, 1.2, 2.1, 2.2...) - shows base chunk relationship
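
The two numbering schemes can be sketched as follows. This is an illustration under stated assumptions, not the `DocumentChunker` code: the function names are hypothetical, item numbers are stored as strings, and the sentence-mode base index does not restart per page (the text above does not say whether it does).

```python
from collections import defaultdict


def number_paragraphs(chunks: list[dict]) -> list[dict]:
    """Paragraph mode: 1, 2, 3, ... restarting on every page."""
    last_item = defaultdict(int)  # page_number -> last item number used on that page
    numbered = []
    for chunk in chunks:
        page = chunk["page_number"]
        last_item[page] += 1
        numbered.append({**chunk, "item_number": str(last_item[page])})
    return numbered


def number_sentences(chunks: list[dict]) -> list[dict]:
    """Sentence mode: '<base chunk>.<sentence>', such as 1.1, 1.2, 2.1."""
    sentences_seen = defaultdict(int)  # base_chunk_index -> sentences numbered so far
    numbered = []
    for chunk in chunks:
        base = chunk["base_chunk_index"]
        sentences_seen[base] += 1
        numbered.append({**chunk, "item_number": f"{base}.{sentences_seen[base]}"})
    return numbered


paragraphs = [{"page_number": 1}, {"page_number": 1}, {"page_number": 2}]
print([c["item_number"] for c in number_paragraphs(paragraphs)])

sentences = [{"base_chunk_index": 1}, {"base_chunk_index": 1}, {"base_chunk_index": 2}]
print([c["item_number"] for c in number_sentences(sentences)])
```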

In [None]:
print("=" * 80)
print("PART 6: Creating DocumentChunk Objects (Paragraph Mode)")
print("=" * 80)

if 'docling_doc' not in locals():
    print("\n❌ ERROR: No DoclingDocument available")
else:
    # Chunk in paragraph mode
    paragraph_doc_chunks = chunker.chunk_document(
        docling_doc, 
        mode=ChunkingMode.PARAGRAPH
    )
    
    print(f"\n✓ Document chunks created")
    print(f"   Total chunks: {len(paragraph_doc_chunks)}")
    print(f"   Mode: {ChunkingMode.PARAGRAPH.value}")
    
    # Show first 10 chunks with full metadata
    print(f"\n📋 Document Chunks (first 10):")
    print("-" * 80)
    
    for chunk in paragraph_doc_chunks[:10]:
        overlap_flag = " [OVERLAP]" if chunk.is_overlap else ""
        text_preview = chunk.text[:80] + "..." if len(chunk.text) > 80 else chunk.text
        
        print(f"\nPage {chunk.page_number}, Item {chunk.item_number}{overlap_flag}")
        print(f"  \"{text_preview}\"")
    
    # Page distribution
    from collections import defaultdict
    chunks_per_page = defaultdict(int)
    for chunk in paragraph_doc_chunks:
        chunks_per_page[chunk.page_number] += 1
    
    print(f"\n📊 Distribution by Page:")
    for page in sorted(chunks_per_page.keys())[:5]:  # First 5 pages
        print(f"   Page {page}: {chunks_per_page[page]} chunks")

In [None]:
print("=" * 80)
print("PART 6b: Creating DocumentChunk Objects (Sentence Mode)")
print("=" * 80)

if 'docling_doc' not in locals():
    print("\n❌ ERROR: No DoclingDocument available")
else:
    # Chunk in sentence mode
    sentence_doc_chunks = chunker.chunk_document(
        docling_doc, 
        mode=ChunkingMode.SENTENCE
    )
    
    print(f"\n✓ Document chunks created")
    print(f"   Total chunks: {len(sentence_doc_chunks)}")
    print(f"   Mode: {ChunkingMode.SENTENCE.value}")
    
    # Show first 15 chunks with hierarchical numbering
    print(f"\n📋 Document Chunks with Hierarchical Numbering (first 15):")
    print("-" * 80)
    
    for chunk in sentence_doc_chunks[:15]:
        overlap_flag = " [OVERLAP]" if chunk.is_overlap else ""
        
        print(f"\nPage {chunk.page_number}, Item {chunk.item_number}{overlap_flag}")
        print(f"  \"{chunk.text}\"")
    
    # Analyze hierarchical structure
    print(f"\n📊 Hierarchical Structure Analysis:")
    
    # Count base chunks (items like 1.x, 2.x, 3.x)
    base_items = set()
    for chunk in sentence_doc_chunks:
        if '.' in chunk.item_number:
            base_items.add(chunk.item_number.split('.')[0])
    
    print(f"   Total sentences: {len(sentence_doc_chunks)}")
    print(f"   Total base chunks (paragraph-level): {len(base_items)}")
    # base_items is empty when no item number contains '.', so guard the division
    if base_items:
        print(f"   Avg sentences per base chunk: {len(sentence_doc_chunks) / len(base_items):.1f}")

## Part 7: Comparing Chunking Modes

Let's compare the two chunking modes side-by-side to understand their differences.

In [None]:
print("=" * 80)
print("PART 7: Chunking Mode Comparison")
print("=" * 80)

if 'paragraph_doc_chunks' in locals() and 'sentence_doc_chunks' in locals():
    print(f"\n{'Metric':<40} {'Paragraph':<15} {'Sentence':<15}")
    print("-" * 70)
    
    print(f"{'Total chunks':<40} {len(paragraph_doc_chunks):<15} {len(sentence_doc_chunks):<15}")
    
    # Average text length
    para_avg_len = sum(len(c.text) for c in paragraph_doc_chunks) / len(paragraph_doc_chunks)
    sent_avg_len = sum(len(c.text) for c in sentence_doc_chunks) / len(sentence_doc_chunks)
    print(f"{'Avg chunk length (chars)':<40} {para_avg_len:<15.0f} {sent_avg_len:<15.0f}")
    
    # Chunks per page (average)
    para_pages = set(c.page_number for c in paragraph_doc_chunks)
    sent_pages = set(c.page_number for c in sentence_doc_chunks)
    para_per_page = len(paragraph_doc_chunks) / len(para_pages)
    sent_per_page = len(sentence_doc_chunks) / len(sent_pages)
    print(f"{'Avg chunks per page':<40} {para_per_page:<15.1f} {sent_per_page:<15.1f}")
    
    # Overlap counts
    para_overlap = sum(1 for c in paragraph_doc_chunks if c.is_overlap)
    sent_overlap = sum(1 for c in sentence_doc_chunks if c.is_overlap)
    print(f"{'Chunks with overlap flag':<40} {para_overlap:<15} {sent_overlap:<15}")
    
    print(f"\n💡 Recommendations:")
    print(f"   • Use PARAGRAPH mode for: General document verification, faster processing")
    print(f"   • Use SENTENCE mode for: Fine-grained verification, detailed fact-checking")
    
else:
    print("\n❌ Please run Part 6 and 6b first")

## Part 8: Exporting Verification Shell

Now that we have DocumentChunk objects, let's see how they would be exported for verification.

In [None]:
import json

print("=" * 80)
print("PART 8: Exporting DocumentChunks to JSON")
print("=" * 80)

if 'paragraph_doc_chunks' in locals():
    # Convert to JSON-serializable format
    chunks_data = [
        {
            "page_number": chunk.page_number,
            "item_number": chunk.item_number,
            "text": chunk.text,
            "is_overlap": chunk.is_overlap,
            "verified": None,  # To be filled by AI
            "verification_score": None,
            "verification_source": "",
            "verification_note": ""
        }
        for chunk in paragraph_doc_chunks[:5]  # First 5 for demo
    ]
    
    # Pretty print JSON
    print("\n📄 Sample Export (first 5 chunks):")
    print(json.dumps(chunks_data, indent=2, ensure_ascii=False))
    
    # Save to file
    output_file = "verification_shell.json"
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(
            # mode="json" converts enums and similar values to JSON-safe types
            [chunk.model_dump(mode="json") for chunk in paragraph_doc_chunks],
            f, 
            indent=2, 
            ensure_ascii=False
        )
    
    print(f"\n✓ Exported {len(paragraph_doc_chunks)} chunks to: {output_file}")
    
else:
    print("\n❌ No chunks available for export")

## Summary

This notebook demonstrated the complete document processing pipeline:

1. ✅ **PDF Conversion** - Docling converts PDF to structured DoclingDocument
2. ✅ **DOCX Conversion** - LibreOffice → PDF → DoclingDocument for accurate pagination
3. ✅ **Hierarchical Chunking** - HybridChunker preserves document structure
4. ✅ **Paragraph Splitting** - RecursiveCharacterTextSplitter for paragraph-level chunks
5. ✅ **Sentence Splitting** - spaCy sentence boundary detection for fine-grained chunks
6. ✅ **DocumentChunk Creation** - Structured objects with page/item metadata

### Key Takeaways

- **Docling** provides robust PDF/DOCX parsing with metadata preservation
- **HybridChunker** maintains document hierarchy while chunking
- **Paragraph mode** creates ~100-char chunks with simple numbering (1, 2, 3...)
- **Sentence mode** creates true sentence-level chunks with hierarchical numbering (1.1, 1.2...)
- **DocumentChunk objects** are ready for AI verification with Gemini

### Next Steps

- Explore `gemini_features.ipynb` to see AI verification in action
- Run the full application: `./start_all.sh`
- Check `backend/app/processing/` for implementation details
- Try different documents and chunking modes to understand behavior