# Multi-Format Document Processing Demo

This notebook demonstrates the DocumentConverter's ability to process different document formats:
- Plain Text (.txt)
- CSV files (.csv)
- Excel files (.xlsx)
- Word documents (.docx)

All formats are converted to page images and text for processing in the IDP pipeline.

## Setup and Imports

In [None]:
import sys
import os
import io
import tempfile
import time
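from PIL import Image
from IPython.display import display  # explicit import so display() also works outside a notebook kernel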
import pandas as pd
from docx import Document as WordDocument

# Add the idp_common package to the path
sys.path.insert(0, '../../lib/idp_common_pkg')
from idp_common.ocr.document_converter import DocumentConverter

print("✅ All imports successful")

## Initialize Document Converter

In [None]:
# Initialize DocumentConverter at 150 DPI (higher values give sharper but larger page images)
converter = DocumentConverter(dpi=150)

print("✅ DocumentConverter initialized")
print(f"   DPI: {converter.dpi}")
print(f"   Page size: {converter.page_width}x{converter.page_height} pixels")
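
The page size printed above follows from the DPI and a physical page size. As a rough sketch of that arithmetic, the helper below assumes a US Letter page (8.5 x 11 inches). That default is an assumption for illustration, not something this notebook confirms about the converter, and `page_size_pixels` is a hypothetical helper that is not part of `idp_common`.

```python
def page_size_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int]:
    """Return (width, height) in pixels for a page of the given physical size.

    The 8.5 x 11 inch default (US Letter) is an assumption for illustration;
    the real DocumentConverter may use a different page size.
    """
    return round(width_in * dpi), round(height_in * dpi)


width, height = page_size_pixels(150)
print(f"{width}x{height} pixels")
```

If `converter.page_width` and `converter.page_height` report different values, the converter is using a different page size than the one assumed here.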

## Helper Functions

In [None]:
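def process_and_display(content, format_name, convert_method):
    """
    Convert content with convert_method, then display each resulting page.

    Args:
        content: Document content (str or bytes, depending on the format).
        format_name: Label printed in the section header.
        convert_method: Converter method that returns a list of
            (image_bytes, text) tuples, one per page.

    Returns:
        The list of (image_bytes, text) tuples.
    """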
    print(f"\n{'='*60}")
    print(f"Processing: {format_name}")
    print(f"{'='*60}")
    
    # Convert to pages
    start_time = time.perf_counter()  # monotonic clock, better suited to timing short intervals
    pages = convert_method(content)
    processing_time = time.perf_counter() - start_time
    
    print(f"⏱️  Processing time: {processing_time:.3f} seconds")
    print(f"📄 Number of pages: {len(pages)}")
    
    # Display each page
    for i, (image_bytes, text_content) in enumerate(pages):
        print(f"\n--- Page {i+1} ---")
        
        # Display image info and image
        if len(image_bytes) > 0:
            img = Image.open(io.BytesIO(image_bytes))
            print(f"📸 Image: {len(image_bytes):,} bytes, {img.size[0]}x{img.size[1]} pixels")
            
            # Resize for display if needed
            display_img = img
            if img.size[0] > 800:
                ratio = 800 / img.size[0]
                new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
                display_img = img.resize(new_size)
            
            display(display_img)
        else:
            print("⚠️ No image generated")
        
        # Display text preview
        print(f"\n📝 Extracted Text ({len(text_content)} characters):")
        print(f"{text_content[:400]}{'...' if len(text_content) > 400 else ''}")
    
    return pages
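
The display step above shrinks pages wider than 800 pixels while keeping their aspect ratio. A minimal sketch of that size calculation as a standalone function follows; `fit_to_width` is a hypothetical name introduced here, and it pins the width to the target instead of multiplying by a float ratio.

```python
def fit_to_width(size: tuple[int, int], max_width: int = 800) -> tuple[int, int]:
    """Scale (width, height) down to max_width, preserving aspect ratio.

    Sizes already within max_width are returned unchanged.
    """
    width, height = size
    if width <= max_width:
        return size
    # Pin the width exactly and round the height, since int(width * ratio)
    # may truncate a float that lands just below the target width.
    return max_width, max(1, round(height * max_width / width))


print(fit_to_width((1275, 1650)))
print(fit_to_width((640, 480)))
```

With Pillow, the result can be passed straight to `img.resize(fit_to_width(img.size))`.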

## 1. Plain Text Document

In [None]:
# Create sample invoice text
invoice_text = """INVOICE

Invoice Number: INV-2024-001
Date: January 15, 2024
Due Date: February 15, 2024

Bill To:
ACME Corporation
123 Business Street
Suite 100
Business City, BC 12345
Phone: (555) 123-4567

From:
Professional Services LLC
456 Service Avenue
Service Town, ST 67890

DESCRIPTION                    QTY    RATE      AMOUNT
-----------------------------------------------
Consulting Services - Jan      40    $150.00   $6,000.00
Project Management            20    $125.00   $2,500.00
Technical Documentation       10    $100.00   $1,000.00
Travel Expenses                1    $350.00     $350.00
-----------------------------------------------
                            SUBTOTAL:   $9,850.00
                                 TAX:     $985.00
                               TOTAL:  $10,835.00

Payment Terms: Net 30 days
Thank you for your business!"""

# Process text document
text_pages = process_and_display(
    invoice_text, 
    "Plain Text Invoice", 
    converter.convert_text_to_pages
)

## 2. CSV Document

In [None]:
# Create sample employee CSV
csv_content = """Employee ID,Name,Department,Position,Salary,Start Date
E001,John Smith,Engineering,Senior Developer,95000,2020-03-15
E002,Sarah Johnson,Marketing,Marketing Manager,78000,2019-07-22
E003,Michael Brown,Sales,Sales Representative,65000,2021-01-10
E004,Emily Davis,HR,HR Specialist,58000,2020-11-30
E005,David Wilson,Engineering,Frontend Developer,82000,2022-02-14
E006,Lisa Anderson,Finance,Financial Analyst,70000,2019-09-05
E007,Robert Taylor,Operations,Operations Manager,85000,2018-12-01
E008,Jennifer Martinez,Marketing,Content Creator,55000,2021-06-18
E009,Christopher Lee,Sales,Sales Manager,90000,2017-05-20
E010,Amanda White,Engineering,DevOps Engineer,88000,2020-08-12"""

# Process CSV document
csv_pages = process_and_display(
    csv_content, 
    "Employee Data CSV", 
    converter.convert_csv_to_pages
)
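
To give a feel for how CSV rows can become readable page text, here is a small sketch that renders CSV content as a fixed-width table using only the standard library. It is an illustration under assumptions, not the converter's actual formatting logic, and `csv_to_aligned_text` is a hypothetical helper.

```python
import csv
import io


def csv_to_aligned_text(csv_text: str) -> str:
    """Render CSV text as a left-aligned, fixed-width text table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    n_cols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    rows = [row + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(n_cols)]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    # Separate the header from the data with a rule of dashes.
    lines.insert(1, "  ".join("-" * width for width in widths))
    return "\n".join(lines)


sample = "ID,Name\nE001,John Smith\nE002,Sarah Johnson"
print(csv_to_aligned_text(sample))
```

Using `csv.reader` instead of splitting on commas keeps quoted fields that contain commas intact.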

## 3. Excel Document

In [None]:
# Create sample Excel file
def create_excel_file():
    """Build a two-sheet workbook (Sales, Expenses) and return it as .xlsx bytes."""
    # Sales data
    sales_data = {
        'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
        'Category': ['Computer', 'Accessory', 'Accessory', 'Computer', 'Accessory'],
        'Price': [999.99, 29.99, 79.99, 299.99, 149.99],
        'Units Sold': [45, 120, 85, 30, 67],
        'Revenue': [44999.55, 3598.80, 6799.15, 8999.70, 10049.33]
    }
    
    # Quarterly expenses
    expense_data = {
        'Category': ['Office Supplies', 'Travel', 'Marketing', 'Utilities'],
        'Q1 2024': [1200, 3500, 5000, 800],
        'Q2 2024': [1350, 2800, 6200, 850],
        'Q3 2024': [1100, 4100, 4800, 900],
        'Q4 2024': [1450, 3200, 7500, 950]
    }
    
    # Write both sheets to an in-memory buffer. This avoids reopening a
    # still-open temporary file by name, which fails on Windows.
    buffer = io.BytesIO()
    with pd.ExcelWriter(buffer, engine='openpyxl') as writer:
        pd.DataFrame(sales_data).to_excel(writer, sheet_name='Sales', index=False)
        pd.DataFrame(expense_data).to_excel(writer, sheet_name='Expenses', index=False)

    return buffer.getvalue()

# Process Excel document
excel_bytes = create_excel_file()
excel_pages = process_and_display(
    excel_bytes, 
    "Financial Report Excel", 
    converter.convert_excel_to_pages
)

## 4. Word Document

In [None]:
# Create sample Word document
def create_word_file():
    """Build a short report with headings, paragraphs, and a table; return it as .docx bytes."""
    doc = WordDocument()
    
    # Add title and content
    doc.add_heading('Quarterly Business Report', 0)
    doc.add_heading('Q4 2024 Executive Summary', level=1)
    
    doc.add_paragraph(
        'This document provides a comprehensive overview of our business '
        'performance during Q4 2024. We achieved significant growth across '
        'all business segments with strong performance in technology and services.'
    )
    
    doc.add_heading('Key Metrics', level=2)
    
    # Add table
    table = doc.add_table(rows=1, cols=3)
    table.style = 'Table Grid'
    
    # Header row
    hdr_cells = table.rows[0].cells
    hdr_cells[0].text = 'Metric'
    hdr_cells[1].text = 'Q4 2024'
    hdr_cells[2].text = 'Change'
    
    # Data rows
    metrics = [
        ('Revenue', '$2.4M', '+12%'),
        ('Net Profit', '$480K', '+18%'),
        ('Customers', '1,250', '+8%')
    ]
    
    for metric, value, change in metrics:
        row = table.add_row().cells
        row[0].text = metric
        row[1].text = value
        row[2].text = change
    
    doc.add_paragraph(
        'We expect continued growth in 2025 driven by strategic investments '
        'and market expansion initiatives.'
    )
    
    # Save to an in-memory buffer. python-docx accepts a file-like object,
    # so no temporary file or cleanup is needed.
    buffer = io.BytesIO()
    doc.save(buffer)
    return buffer.getvalue()

# Process Word document
word_bytes = create_word_file()
word_pages = process_and_display(
    word_bytes, 
    "Business Report Word Doc", 
    converter.convert_word_to_pages
)

## Summary

In [None]:
# Show processing summary
print("📊 Processing Summary")
print("=" * 50)

formats = {
    'Text': text_pages,
    'CSV': csv_pages, 
    'Excel': excel_pages,
    'Word': word_pages
}

for format_name, pages in formats.items():
    print(f"\n{format_name} Document:")
    print(f"  📄 Pages generated: {len(pages)}")
    
    if pages:
        image_bytes, text_content = pages[0]
        print(f"  📸 Image size: {len(image_bytes):,} bytes")
        print(f"  📝 Text length: {len(text_content):,} characters")

print("\n✅ All formats processed successfully!")
print("\n🔧 Key Benefits:")
print("   • Unified DocumentConverter for all document types")
print("   • Consistent output: (image_bytes, text) tuples")
print("   • Automatic text formatting and image generation")
print("   • Ready for OCR service integration")
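
The summary above reports only the first page of each format. Because every converter call returns the same list of `(image_bytes, text)` tuples, totals across all pages are easy to compute. The sketch below uses stand-in data, and `summarize_pages` is a hypothetical helper that is not part of `idp_common`.

```python
def summarize_pages(pages: list[tuple[bytes, str]]) -> dict[str, int]:
    """Aggregate page count, total image bytes, and total text length."""
    return {
        "pages": len(pages),
        "image_bytes": sum(len(image) for image, _ in pages),
        "text_chars": sum(len(text) for _, text in pages),
    }


# Stand-in data; real pages would come from a convert_*_to_pages call.
fake_pages = [(b"\x89PNG", "first page"), (b"", "second")]
print(summarize_pages(fake_pages))
```

In this notebook the same function could be applied to `text_pages`, `csv_pages`, `excel_pages`, and `word_pages`.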

## Conclusion

This notebook demonstrated the DocumentConverter's ability to process multiple document formats:

### 🎯 **What We Showed:**
- **Multi-format Support**: Text, CSV, Excel, and Word documents
- **Consistent Output**: All formats produce (image_bytes, text) tuples
- **Visual Results**: Generated page images with proper formatting
- **Text Extraction**: Clean, structured text from all document types

### 🏗️ **Integration Ready:**
- **OCR Service**: The DocumentConverter's (image_bytes, text) output is ready to hand to the OCR service
- **Format Detection**: Automatic file type detection based on content and extensions
- **Pipeline Ready**: Consistent output enables downstream classification and extraction
- **Extensible**: Easy to add support for additional document formats
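
Format detection by content and extension, as mentioned in the list above, could look roughly like the sketch below. It is not the accelerator's implementation: `detect_format` and `EXTENSIONS` are hypothetical names, and the check relies on the conventional part names that Excel and Word files carry inside their ZIP containers.

```python
import io
import zipfile
from pathlib import Path

# Hypothetical extension map used as a fallback when content is inconclusive.
EXTENSIONS = {".txt": "txt", ".csv": "csv", ".xlsx": "xlsx", ".docx": "docx"}


def detect_format(data: bytes, filename: str = "") -> str:
    """Guess 'xlsx', 'docx', 'csv', or 'txt'; return 'unknown' otherwise."""
    # .xlsx and .docx are both ZIP containers, so look inside the archive.
    if data[:4] == b"PK\x03\x04":
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            names = set()
        if "xl/workbook.xml" in names:
            return "xlsx"
        if "word/document.xml" in names:
            return "docx"
    # Plain text and CSV have no signature, so fall back to the extension.
    return EXTENSIONS.get(Path(filename).suffix.lower(), "unknown")


print(detect_format(b"a,b\n1,2", "data.csv"))
```

The detected format can then select one of the converter methods used in this notebook, for example `convert_csv_to_pages` for `'csv'`. Note that the text and CSV methods were called with `str` here while the Excel and Word methods were called with `bytes`, so a dispatcher would need to decode text inputs first.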

The DocumentConverter lets the IDP accelerator accept text, CSV, Excel, and Word inputs through a single pipeline, so organizations can process mixed document portfolios without format-specific handling downstream.