# PaddleOCR Pipeline with Intelligent Column Detection

This notebook creates an advanced OCR pipeline using **PaddleOCR** that:
- ✅ Automatically detects number of columns (1, 2, 3, or more)
- ✅ Uses layout_data.json for proper reading order
- ✅ Maintains left-to-right, top-to-bottom reading order
- ✅ Generates clean markdown with intelligent column formatting
- ✅ Higher accuracy than Tesseract
- ✅ Handles multi-page documents
- ✅ Preserves document structure automatically

**Column Detection Strategy:**
- Analyzes horizontal positions of elements
- Clusters elements by the gaps between their horizontal centers to identify column boundaries
- Detects full-width elements (headers, tables)
- Groups related content by columns
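
The gap-based idea in the list above can be sketched in isolation. This is a minimal illustration rather than the notebook's full function: `find_column_boundaries` and the sample centers are hypothetical, and the 15% default mirrors the `column_gap_threshold` setting used later.

```python
from typing import List


def find_column_boundaries(centers: List[float], page_width: float,
                           gap_threshold: float = 0.15) -> List[float]:
    """Split sorted element centers wherever the gap exceeds a fraction of the page width."""
    ordered = sorted(centers)
    boundaries = [0.0]
    for left, right in zip(ordered, ordered[1:]):
        if right - left > page_width * gap_threshold:
            # Place the boundary midway between the two neighbouring centers.
            boundaries.append((left + right) / 2)
    boundaries.append(page_width)
    return boundaries


# Hypothetical centers: three elements on the left, two on the right of a 1000 px page.
boundaries = find_column_boundaries([200.0, 210.0, 220.0, 700.0, 720.0], page_width=1000.0)
num_columns = len(boundaries) - 1
print(boundaries, num_columns)
```

Only the 480 px jump between 220 and 700 exceeds the 150 px threshold, so a single boundary is placed at 460 and two columns result.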

**Prerequisites:**
- PaddleOCR installed
- Cropped sections from layout detection
- layout_data.json with bounding boxes and reading order
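
For orientation, the snippet below shows the shape of `layout_data.json` that the helper functions in this notebook appear to expect. The key names (`pages`, `elements`, `id`, `type`, `reading_order`, `bounding_box`) are taken from the code in section 6; the values are made up for illustration, and a real file may carry additional fields.

```python
import json

# Hypothetical layout_data.json content; only the keys read later in this notebook are shown.
example_layout = {
    "pages": [
        {
            "elements": [
                {
                    "id": 27,
                    "type": "section_header",
                    "reading_order": 1,
                    "bounding_box": {"left": 120.0, "top": 80.0, "right": 640.0, "bottom": 130.0},
                }
            ]
        }
    ]
}

# Round-trip through JSON to confirm the structure is serializable.
layout = json.loads(json.dumps(example_layout))
first = layout["pages"][0]["elements"][0]
width = first["bounding_box"]["right"] - first["bounding_box"]["left"]
print(first["id"], first["type"], width)
```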

## 1. Import Required Libraries

In [1]:
import json
import logging
from pathlib import Path
from typing import Dict, Any, List, Tuple, Optional
import re
import warnings
warnings.filterwarnings('ignore')

# OCR and Image processing
try:
    from paddleocr import PaddleOCR
    from PIL import Image
    import numpy as np
    print("✓ PaddleOCR and PIL imported successfully")
except ImportError as e:
    print(f"❌ Error: {e}")
    print("Please install: pip install paddleocr paddlepaddle")
    raise

# Data processing
import pandas as pd
from datetime import datetime
from tqdm.auto import tqdm

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✓ All libraries imported")

✓ PaddleOCR and PIL imported successfully
✓ All libraries imported


## 2. Configuration

In [2]:
# Configuration
CONFIG = {
    'input_dir': 'output_results',
    'output_dir': 'paddle_markdown_output',
    
    # PaddleOCR settings
    'use_gpu': False,
    'lang': 'en',
    'use_angle_cls': True,
    'show_log': False,
    
    # Column detection settings
    'column_gap_threshold': 0.15,  # 15% of page width for column gaps
    'full_width_threshold': 0.7,   # Elements wider than 70% are full-width
    'vertical_pairing_threshold': 100,  # Pixels for vertical alignment
    
    # Reading order strategy
    'use_layout_reading_order': True,
    'sort_by_position': 'top_left',
    
    # Output settings
    'confidence_threshold': 0.0,
    'add_spacing_between_elements': True,
    'format_tables': True,
    'include_confidence_comments': False,
    'show_column_info': True,  # Show column detection info in markdown
    
    # Document structure
    'separate_pages': True,
    'page_separator': '\n---\n\n',
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

Configuration:
  input_dir: output_results
  output_dir: paddle_markdown_output
  use_gpu: False
  lang: en
  use_angle_cls: True
  show_log: False
  column_gap_threshold: 0.15
  full_width_threshold: 0.7
  vertical_pairing_threshold: 100
  use_layout_reading_order: True
  sort_by_position: top_left
  confidence_threshold: 0.0
  add_spacing_between_elements: True
  format_tables: True
  include_confidence_comments: False
  show_column_info: True
  separate_pages: True
  page_separator: 
---
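
The two fractional thresholds are easier to reason about in pixels. As a worked example, the arithmetic below applies them to the 1520.1 px page width reported in the sample run of section 10; the variable names are illustrative only.

```python
# Values copied from CONFIG; the page width comes from the sample run in section 10.
column_gap_threshold = 0.15
full_width_threshold = 0.7
page_width = 1520.1

# A gap between neighbouring element centers must exceed this to start a new column.
min_gap_px = page_width * column_gap_threshold
# An element wider than this is treated as full-width rather than as column content.
min_full_width_px = page_width * full_width_threshold

print(f"column gap > {min_gap_px:.1f} px, full width > {min_full_width_px:.1f} px")
```

On that page, centers more than roughly 228 px apart split into separate columns, and elements wider than roughly 1064 px are handled as full-width.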




## 3. Initialize PaddleOCR

In [3]:
# Initialize PaddleOCR
print("Initializing PaddleOCR...\n")

try:
    ocr = PaddleOCR(
        use_angle_cls=CONFIG['use_angle_cls'],
        lang=CONFIG['lang'],
    )
    print("✓ PaddleOCR initialized successfully")
except Exception as e:
    print(f"❌ Error initializing PaddleOCR: {e}")
    raise

Initializing PaddleOCR...

Creating model: ('PP-LCNet_x1_0_doc_ori', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/abhishek-mishra/.paddlex/official_models/PP-LCNet_x1_0_doc_ori`.
Creating model: ('UVDoc', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/abhishek-mishra/.paddlex/official_models/UVDoc`.
Creating model: ('PP-LCNet_x1_0_textline_ori', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/abhishek-mishra/.paddlex/official_models/PP-LCNet_x1_0_textline_ori`.
Creating model: ('PP-OCRv5_server_det', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/abhishek-mishra/.paddlex/official_models/PP-OCRv5_server_det`.
Creating model: ('en_PP-OCRv5_mobile_rec', None)
Model files

✓ PaddleOCR initialized successfully


## 4. Helper Functions - File Parsing

In [4]:
def parse_crop_filename(filename: str) -> Dict[str, Any]:
    """
    Parse crop filename to extract metadata.
    Format: page_001_order_001_type_id_27.png
    """
    parts = filename.replace('.png', '').split('_')
    
    metadata = {
        'filename': filename,
        'page': None,
        'order': None,
        'element_type': None,
        'element_id': None
    }
    
    try:
        if 'page' in parts:
            page_idx = parts.index('page')
            if page_idx + 1 < len(parts):
                metadata['page'] = int(parts[page_idx + 1])
        
        if 'order' in parts:
            order_idx = parts.index('order')
            if order_idx + 1 < len(parts):
                metadata['order'] = int(parts[order_idx + 1])
        
        if 'id' in parts:
            id_idx = parts.index('id')
            if id_idx + 1 < len(parts):
                metadata['element_id'] = int(parts[id_idx + 1])
        
        if 'order' in parts and 'id' in parts:
            order_idx = parts.index('order')
            id_idx = parts.index('id')
            if order_idx + 2 < id_idx:
                metadata['element_type'] = '_'.join(parts[order_idx + 2:id_idx])
    
    except (ValueError, IndexError) as e:
        logger.warning(f"Error parsing filename {filename}: {e}")
    
    return metadata


def load_layout_data(document_dir: Path) -> Optional[Dict]:
    """Load layout_data.json for a document."""
    layout_file = document_dir / "layout_data.json"
    
    if not layout_file.exists():
        logger.warning(f"No layout_data.json found in {document_dir}")
        return None
    
    try:
        with open(layout_file, 'r', encoding='utf-8') as f:
            return json.load(f)
    except Exception as e:
        logger.error(f"Error loading layout data: {e}")
        return None


print("✓ File parsing functions defined")

✓ File parsing functions defined
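
As a compact alternative to the index-based parsing above, the same filename convention can be matched with one regular expression. This is a sketch under the assumption that every crop follows the documented `page_001_order_001_type_id_27.png` pattern; `CROP_NAME` and `parse_crop_name` are hypothetical names, not part of the pipeline.

```python
import re
from typing import Dict, Optional

# Assumed pattern, derived from the documented format page_001_order_001_type_id_27.png
CROP_NAME = re.compile(r"^page_(\d+)_order_(\d+)_(.+)_id_(\d+)\.png$")


def parse_crop_name(filename: str) -> Optional[Dict[str, object]]:
    """Return page, order, element_type and element_id, or None if the name does not match."""
    match = CROP_NAME.match(filename)
    if match is None:
        return None
    page, order, element_type, element_id = match.groups()
    return {
        "page": int(page),
        "order": int(order),
        "element_type": element_type,  # may itself contain underscores
        "element_id": int(element_id),
    }


parsed = parse_crop_name("page_001_order_003_section_header_id_27.png")
print(parsed)
```

Unlike the function above, this variant rejects non-matching names outright instead of returning partially filled metadata, which may or may not suit a given dataset.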


## 5. OCR Functions

In [5]:
def perform_paddle_ocr(image_path: Path, ocr_engine, config: Dict) -> Dict[str, Any]:
    """
    Perform OCR using PaddleOCR on a single image.
    Returns text and confidence score.
    """
    try:
        result = ocr_engine.ocr(str(image_path))
        
        if not result or not result[0]:
            return {
                'success': True,
                'text': '',
                'confidence': 0.0,
                'lines': []
            }
        
        ocr_result = result[0]
        rec_texts = ocr_result.get('rec_texts', [])
        rec_scores = ocr_result.get('rec_scores', [])
        
        if not rec_texts:
            return {
                'success': True,
                'text': '',
                'confidence': 0.0,
                'lines': []
            }
        
        lines = []
        confidences = []
        
        for text, score in zip(rec_texts, rec_scores):
            if score >= config['confidence_threshold']:
                lines.append(text)
                confidences.append(score * 100)  # Convert to percentage
        
        full_text = ' '.join(lines)
        avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0
        
        return {
            'success': True,
            'text': full_text,
            'confidence': avg_confidence,
            'lines': lines
        }
    
    except Exception as e:
        logger.error(f"OCR error for {image_path.name}: {e}")
        return {
            'success': False,
            'text': '',
            'confidence': 0.0,
            'lines': [],  # keep the same keys as the success case
            'error': str(e)
        }


print("✓ PaddleOCR functions defined")

✓ PaddleOCR functions defined
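
The filtering and averaging step can be tried without running the OCR engine. The sketch below works on plain lists standing in for the recognized texts and scores; `aggregate_lines` and its inputs are hypothetical and only mirror the logic of the function above.

```python
from typing import Dict, List


def aggregate_lines(texts: List[str], scores: List[float],
                    threshold: float = 0.0) -> Dict[str, object]:
    """Drop lines below the score threshold, join the rest, and average their scores in percent."""
    kept = [(text, score) for text, score in zip(texts, scores) if score >= threshold]
    lines = [text for text, _ in kept]
    confidence = sum(score for _, score in kept) / len(kept) * 100 if kept else 0.0
    return {"text": " ".join(lines), "confidence": confidence, "lines": lines}


# Made-up recognition results: the middle line falls below the threshold and is dropped.
summary = aggregate_lines(["Invoice no:", "smudge", "51335214"], [0.99, 0.40, 0.97], threshold=0.5)
print(summary["text"], round(summary["confidence"], 1))
```

With the notebook's default `confidence_threshold` of 0.0, nothing is filtered, so the average reflects every recognized line.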


## 6. Reading Order & Layout Functions

In [6]:
def get_element_layout_info(element_id: int, layout_data: Dict, page_num: int) -> Optional[Dict]:
    """
    Get layout information for a specific element from layout_data.json.
    """
    if not layout_data or 'pages' not in layout_data:
        return None
    
    page_idx = page_num - 1
    if page_idx >= len(layout_data['pages']):
        return None
    
    page = layout_data['pages'][page_idx]
    
    for elem in page.get('elements', []):
        if elem.get('id') == element_id:
            bbox = elem.get('bounding_box', {})
            return {
                'reading_order': elem.get('reading_order'),
                'bbox': bbox,
                'left': bbox.get('left', 0),
                'top': bbox.get('top', 0),
                'right': bbox.get('right', 0),
                'bottom': bbox.get('bottom', 0),
                'type': elem.get('type', 'unknown')
            }
    
    return None


def sort_elements_by_reading_order(elements: List[Dict], config: Dict) -> List[Dict]:
    """
    Sort elements by reading order.
    Priority: Page -> Reading order -> Top -> Left
    """
    def sort_key(elem: Dict) -> Tuple:
        # None values are replaced with defaults so the tuples stay comparable.
        page = elem.get('page') or 0
        
        if config['use_layout_reading_order'] and elem.get('layout_info'):
            layout = elem['layout_info']
            reading_order = layout.get('reading_order')
            if reading_order is None:
                reading_order = 999
            return (page, reading_order, layout.get('top', 0), layout.get('left', 0))
        else:
            order = elem.get('order')
            if order is None:
                order = 999
            return (page, order, 0, 0)
    
    return sorted(elements, key=sort_key)


print("✓ Reading order functions defined")

✓ Reading order functions defined
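
The sort priority (page, then reading order, then top, then left) can be illustrated on a few hand-made elements. The sketch below is standalone; `reading_key` is a hypothetical helper, and mapping a missing reading order to 999 is an assumption about how such elements should be treated.

```python
from typing import Dict, Optional, Tuple


def reading_key(elem: Dict) -> Tuple[int, int, float, float]:
    """Sort by page, reading order, top, then left; a missing order falls back to 999."""
    layout: Optional[Dict] = elem.get("layout_info")
    page = elem.get("page") or 0
    if layout:
        order = layout.get("reading_order")
        return (page, 999 if order is None else order, layout.get("top", 0), layout.get("left", 0))
    order = elem.get("order")
    return (page, 999 if order is None else order, 0, 0)


# Hypothetical elements: "footer" has no reading order and should sort last on its page.
items = [
    {"name": "footer", "page": 1, "layout_info": {"reading_order": None, "top": 900, "left": 50}},
    {"name": "body", "page": 1, "layout_info": {"reading_order": 2, "top": 300, "left": 50}},
    {"name": "title", "page": 1, "layout_info": {"reading_order": 1, "top": 40, "left": 50}},
]
ordered_names = [e["name"] for e in sorted(items, key=reading_key)]
print(ordered_names)
```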


## 7. Advanced Column Detection

In [7]:
def detect_document_columns(elements: List[Dict], page_num: int, config: Dict) -> Dict[str, Any]:
    """
    Automatically detect the number of columns in a document (1, 2, 3, or more).
    Uses clustering on horizontal positions to identify columns.
    """
    page_elements = [e for e in elements if e.get('page') == page_num]
    
    if not page_elements:
        return {
            'num_columns': 0,
            'columns': [],
            'top_elements': [],
            'bottom_elements': [],
            'column_boundaries': [],
            'layout_type': 'empty'
        }
    
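    # Get page width; 1654 px is the fallback when no bounding boxes are available
    # (roughly the width of an A4 page at 200 DPI)
    page_width = 1654.0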
    elements_with_layout = [e for e in page_elements if e.get('layout_info')]
    
    if elements_with_layout:
        max_right = max([e['layout_info'].get('right', 0) for e in elements_with_layout])
        if max_right > 0:
            page_width = max_right
    
    # Collect center positions of all elements
    element_centers = []
    for elem in elements_with_layout:
        layout = elem['layout_info']
        left = layout.get('left', 0)
        right = layout.get('right', 0)
        width = right - left
        center = left + (width / 2)
        element_centers.append({
            'element': elem,
            'center': center,
            'left': left,
            'right': right,
            'width': width,
            'top': layout.get('top', 0),
            'bottom': layout.get('bottom', 0)
        })
    
    if not element_centers:
        return {
            'num_columns': 1,
            'columns': [page_elements],
            'top_elements': [],
            'bottom_elements': [],
            'column_boundaries': [0, page_width],
            'layout_type': 'single_column'
        }
    
    # Identify full-width elements
    full_width_threshold = page_width * config['full_width_threshold']
    full_width_elements = [ec for ec in element_centers if ec['width'] > full_width_threshold]
    narrow_elements = [ec for ec in element_centers if ec['width'] <= full_width_threshold]
    
    if len(narrow_elements) < 2:
        return {
            'num_columns': 1,
            'columns': [page_elements],
            'top_elements': [],
            'bottom_elements': [],
            'column_boundaries': [0, page_width],
            'layout_type': 'single_column'
        }
    
    # Find gaps in horizontal distribution
    sorted_centers = sorted([ec['center'] for ec in narrow_elements])
    
    gaps = []
    for i in range(len(sorted_centers) - 1):
        gap = sorted_centers[i + 1] - sorted_centers[i]
        if gap > page_width * config['column_gap_threshold']:
            gaps.append({
                'position': (sorted_centers[i] + sorted_centers[i + 1]) / 2,
                'size': gap
            })
    
    # Determine number of columns
    num_columns = len(gaps) + 1
    
    # Calculate column boundaries
    if num_columns == 1:
        column_boundaries = [0, page_width]
    else:
        column_boundaries = [0] + [gap['position'] for gap in gaps] + [page_width]
    
    # Assign elements to columns
    columns = [[] for _ in range(num_columns)]
    
    for ec in narrow_elements:
        center = ec['center']
        for i in range(num_columns):
            left_boundary = column_boundaries[i]
            right_boundary = column_boundaries[i + 1]
            
            if left_boundary <= center < right_boundary:
                columns[i].append(ec['element'])
                break
    
    # Sort elements within each column by vertical position
    for i in range(num_columns):
        columns[i] = sorted(columns[i], key=lambda x: (
            x.get('layout_info', {}).get('top', 0),
            x.get('order', 0)
        ))
    
    # Detect vertical range of multi-column section
    column_start = None
    column_end = None
    
    if num_columns > 1 and narrow_elements:
        # Find paired elements (elements at similar heights in different columns)
        paired_tops = []
        paired_bottoms = []
        vertical_threshold = config['vertical_pairing_threshold']
        
        for i in range(num_columns - 1):
            for elem1 in columns[i]:
                top1 = elem1.get('layout_info', {}).get('top', 999)
                bottom1 = elem1.get('layout_info', {}).get('bottom', 0)
                
                for elem2 in columns[i + 1]:
                    top2 = elem2.get('layout_info', {}).get('top', 999)
                    bottom2 = elem2.get('layout_info', {}).get('bottom', 0)
                    
                    if abs(top1 - top2) < vertical_threshold:
                        paired_tops.append(min(top1, top2))
                        paired_bottoms.append(max(bottom1, bottom2))
        
        if paired_tops and paired_bottoms:
            column_start = min(paired_tops)
            column_end = max(paired_bottoms)
    
    # Categorize full-width elements
    top_elements = []
    bottom_elements = []
    
    for ec in full_width_elements:
        # Compare against None explicitly: a column_start of 0 is a valid position.
        if column_start is not None and ec['top'] < column_start - 50:
            top_elements.append(ec['element'])
        else:
            # Everything else (below or between the column blocks) is emitted after the columns.
            bottom_elements.append(ec['element'])
    
    # Sort top and bottom elements
    top_elements.sort(key=lambda x: (
        x.get('layout_info', {}).get('top', 0),
        x.get('order', 0)
    ))
    bottom_elements.sort(key=lambda x: (
        x.get('layout_info', {}).get('top', 0),
        x.get('order', 0)
    ))
    
    # Determine layout type
    if num_columns == 1:
        layout_type = 'single_column'
    elif num_columns == 2:
        layout_type = 'two_column'
    elif num_columns == 3:
        layout_type = 'three_column'
    else:
        layout_type = f'{num_columns}_column'
    
    return {
        'num_columns': num_columns,
        'columns': columns,
        'top_elements': top_elements,
        'bottom_elements': bottom_elements,
        'column_boundaries': column_boundaries,
        'layout_type': layout_type,
        'page_width': page_width
    }


print("✓ Advanced column detection functions defined")

✓ Advanced column detection functions defined
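
Once the boundaries are known, assigning an element to a column is a lookup of its center against those boundaries. The sketch below uses `bisect` from the standard library instead of the nested loop above; `assign_column` is a hypothetical helper, and the boundaries are the ones reported for the sample page in section 10.

```python
from bisect import bisect_right
from typing import List


def assign_column(center: float, boundaries: List[float]) -> int:
    """Return the zero-based column index whose [left, right) range contains center."""
    inner = boundaries[1:-1]  # keep only the boundaries between columns
    return bisect_right(inner, center)


boundaries = [0.0, 722.5, 1520.1]
indices = [assign_column(c, boundaries) for c in (300.0, 722.5, 1100.0)]
print(indices)
```

A center lying exactly on a boundary goes to the column on its right, matching the `left_boundary <= center < right_boundary` test above. One difference: a center beyond the page width lands in the last column here, whereas the loop above would leave it unassigned.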


## 8. Markdown Generation Functions

In [8]:
def format_element_for_markdown(element: Dict, config: Dict) -> str:
    """
    Format an element for markdown output based on its type.
    """
    text = element.get('text', '').strip()
    if not text:
        return ''
    
    element_type = element.get('element_type', 'text')
    markdown_lines = []
    
    if config['include_confidence_comments'] and 'confidence' in element:
        confidence = element['confidence']
        markdown_lines.append(f"<!-- OCR Confidence: {confidence:.1f}% -->")
    
    if element_type == 'title':
        markdown_lines.append(f"# {text}")
    elif element_type == 'section_header':
        markdown_lines.append(f"## {text}")
    elif element_type == 'page_header':
        markdown_lines.append(f"*{text}*")
    elif element_type == 'page_footer':
        markdown_lines.append(f"*{text}*")
    elif element_type == 'table':
        if config['format_tables']:
            markdown_lines.append("```")
            markdown_lines.append(text)
            markdown_lines.append("```")
        else:
            markdown_lines.append(text)
    elif element_type == 'key_value_region':
        markdown_lines.append(f"**{text}**")
    else:
        markdown_lines.append(text)
    
    return '\n'.join(markdown_lines)


def generate_markdown_with_columns(elements: List[Dict], config: Dict, doc_name: str) -> str:
    """
    Generate markdown document with automatic column detection and formatting.
    """
    markdown_parts = []
    
    # No document title is emitted here; doc_name is currently unused.
    
    # Group elements by page
    pages = {}
    for elem in elements:
        page_num = elem.get('page', 1)
        if page_num not in pages:
            pages[page_num] = []
        pages[page_num].append(elem)
    
    # Process each page
    for page_num in sorted(pages.keys()):
        if config['separate_pages']:
            markdown_parts.append(f"## Page {page_num}\n")
        
        # Detect columns
        column_info = detect_document_columns(elements, page_num, config)
        
        # Add layout info comment
        if config['show_column_info']:
            markdown_parts.append(f"<!-- Layout: {column_info['layout_type']} ({column_info['num_columns']} column(s)) -->")
            markdown_parts.append('')
        
        # Process top elements
        for elem in column_info['top_elements']:
            formatted = format_element_for_markdown(elem, config)
            if formatted:
                markdown_parts.append(formatted)
                if config['add_spacing_between_elements']:
                    markdown_parts.append('')
        
        # Process columns
        if column_info['num_columns'] > 1:
            markdown_parts.append('<div style="display: flex; gap: 20px;">')
            markdown_parts.append('')
            
            for col_idx, column in enumerate(column_info['columns']):
                markdown_parts.append(f'<div style="flex: 1;">  <!-- Column {col_idx + 1} -->')
                markdown_parts.append('')
                
                for elem in column:
                    formatted = format_element_for_markdown(elem, config)
                    if formatted:
                        markdown_parts.append(formatted)
                        if config['add_spacing_between_elements']:
                            markdown_parts.append('')
                
                markdown_parts.append('</div>')
                markdown_parts.append('')
            
            markdown_parts.append('</div>')
            markdown_parts.append('')
        else:
            for column in column_info['columns']:
                for elem in column:
                    formatted = format_element_for_markdown(elem, config)
                    if formatted:
                        markdown_parts.append(formatted)
                        if config['add_spacing_between_elements']:
                            markdown_parts.append('')
        
        # Process bottom elements
        for elem in column_info['bottom_elements']:
            formatted = format_element_for_markdown(elem, config)
            if formatted:
                markdown_parts.append(formatted)
                if config['add_spacing_between_elements']:
                    markdown_parts.append('')
        
        # Add page separator
        if config['separate_pages'] and page_num < max(pages.keys()):
            markdown_parts.append(config['page_separator'])
    
    return '\n'.join(markdown_parts)


print("✓ Markdown generation functions defined")

✓ Markdown generation functions defined
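
The multi-column wrapper can be looked at separately from the rest of the generator. The sketch below builds the same flexbox structure from plain strings; `wrap_in_flex_columns` is a hypothetical helper and the sample blocks are shortened from the preview output.

```python
from typing import List


def wrap_in_flex_columns(columns: List[List[str]]) -> str:
    """Wrap each column's markdown blocks in a flex child <div>, separated by blank lines."""
    parts = ['<div style="display: flex; gap: 20px;">', ""]
    for index, blocks in enumerate(columns, start=1):
        parts.append(f'<div style="flex: 1;">  <!-- Column {index} -->')
        parts.append("")
        for block in blocks:
            parts.extend([block, ""])
        parts.extend(["</div>", ""])
    parts.append("</div>")
    return "\n".join(parts)


html = wrap_in_flex_columns([["## Seller:", "Rivera Group"], ["## Client:"]])
print(html)
```

The blank line after each opening `<div>` matters in CommonMark-style renderers, where it ends the HTML block so that the content inside is parsed as markdown again. Renderers that strip inline styles will show the columns stacked vertically.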


## 9. Main Processing Function

In [9]:
def process_document_with_column_detection(document_dir: Path, ocr_engine, config: Dict) -> Dict[str, Any]:
    """
    Process a single document with PaddleOCR and automatic column detection.
    """
    doc_name = document_dir.name
    logger.info(f"Processing document: {doc_name}")
    
    # Load layout data
    layout_data = load_layout_data(document_dir)
    
    # Find cropped sections
    crops_dir = document_dir / "cropped_sections"
    if not crops_dir.exists():
        return {
            'success': False,
            'error': f"No cropped_sections directory found in {document_dir}"
        }
    
    crop_files = sorted([f for f in crops_dir.glob('*.png') if f.name.startswith('page_')])
    
    if not crop_files:
        return {
            'success': False,
            'error': f"No crop files found in {crops_dir}"
        }
    
    # Process each crop
    elements = []
    
    for crop_file in tqdm(crop_files, desc=f"OCR {doc_name}", leave=False):
        metadata = parse_crop_filename(crop_file.name)
        ocr_result = perform_paddle_ocr(crop_file, ocr_engine, config)
        
        if not ocr_result['success'] or not ocr_result['text'].strip():
            continue
        
        # Get layout information
        layout_info = None
        if layout_data and metadata['page'] and metadata['element_id'] is not None:
            layout_info = get_element_layout_info(
                metadata['element_id'],
                layout_data,
                metadata['page']
            )
        
        element = {
            'page': metadata['page'],
            'order': metadata['order'],
            'element_type': metadata['element_type'],
            'element_id': metadata['element_id'],
            'text': ocr_result['text'],
            'confidence': ocr_result['confidence'],
            'layout_info': layout_info
        }
        
        elements.append(element)
    
    # Sort elements by reading order
    sorted_elements = sort_elements_by_reading_order(elements, config)
    
    # Detect columns for each page
    pages = list(set([e.get('page', 1) for e in sorted_elements]))
    column_info = {}
    for page_num in pages:
        col_info = detect_document_columns(sorted_elements, page_num, config)
        column_info[page_num] = col_info
        logger.info(f"  Page {page_num}: {col_info['layout_type']} ({col_info['num_columns']} columns)")
    
    # Generate markdown
    markdown_content = generate_markdown_with_columns(sorted_elements, config, doc_name)
    
    # Calculate average confidence
    confidences = [e['confidence'] for e in sorted_elements if e.get('confidence', 0) > 0]
    avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0
    
    return {
        'success': True,
        'document_name': doc_name,
        'total_elements': len(sorted_elements),
        'markdown': markdown_content,
        'elements': sorted_elements,
        'column_info': column_info,
        'avg_confidence': avg_confidence
    }


print("✓ Main processing function defined")

✓ Main processing function defined
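
The processing function expects each document folder to contain a `cropped_sections` directory with `page_*.png` crops. A quick pre-flight check of that layout might look like the sketch below; `find_processable_documents` and the folder names are hypothetical, and a temporary directory stands in for the real `output_results` tree.

```python
import tempfile
from pathlib import Path
from typing import List


def find_processable_documents(input_dir: Path) -> List[Path]:
    """Return document folders that contain at least one page_*.png crop."""
    ready = []
    for doc_dir in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        crops_dir = doc_dir / "cropped_sections"
        if crops_dir.is_dir() and any(crops_dir.glob("page_*.png")):
            ready.append(doc_dir)
    return ready


# Build a throwaway tree with one complete and one incomplete document.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "doc_a" / "cropped_sections").mkdir(parents=True)
    (root / "doc_a" / "cropped_sections" / "page_001_order_001_title_id_1.png").touch()
    (root / "doc_b").mkdir()
    ready_names = [p.name for p in find_processable_documents(root)]

print(ready_names)
```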


## 10. Test Single Document with Column Detection

In [10]:
# Test with automatic column detection
input_dir = Path(CONFIG['input_dir'])
output_dir = Path(CONFIG['output_dir'])

document_dirs = [d for d in input_dir.iterdir() if d.is_dir()]
if document_dirs:
    test_doc = document_dirs[0]
    print(f"Testing PaddleOCR with Column Detection: {test_doc.name}\n")
    print("="*70)
    
    # Process document
    result = process_document_with_column_detection(test_doc, ocr, CONFIG)
    
    if result['success']:
        print(f"\n✅ Successfully processed: {result['document_name']}")
        print(f"   Total elements: {result['total_elements']}")
        print(f"   Average OCR confidence: {result['avg_confidence']:.1f}%")
        
        # Display column information
        print("\n📊 Column Detection Results:")
        for page_num, col_info in result['column_info'].items():
            print(f"\n   Page {page_num}:")
            print(f"     - Layout Type: {col_info['layout_type']}")
            print(f"     - Number of Columns: {col_info['num_columns']}")
            print(f"     - Page Width: {col_info.get('page_width', 0):.1f}px")
            if col_info['num_columns'] > 1:
                print(f"     - Column Boundaries: {[f'{b:.1f}' for b in col_info['column_boundaries']]}")
                for i, col in enumerate(col_info['columns']):
                    print(f"     - Column {i+1}: {len(col)} elements")
            print(f"     - Top Elements: {len(col_info['top_elements'])}")
            print(f"     - Bottom Elements: {len(col_info['bottom_elements'])}")
        
        # Save markdown
        output_dir.mkdir(parents=True, exist_ok=True)
        output_file = output_dir / f"{result['document_name']}.md"
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(result['markdown'])
        
        print(f"\n✓ Markdown saved to: {output_file}")
        print("\n📄 Markdown preview (first 600 chars):")
        print("="*70)
        print(result['markdown'][:600])
        if len(result['markdown']) > 600:
            print("...")
    else:
        print(f"✗ Error: {result.get('error')}")
else:
    print("No documents found for testing")

Testing PaddleOCR with Column Detection: batch1-0287

✅ Successfully processed: batch1-0287
   Total elements: 13
   Average OCR confidence: 99.1%

📊 Column Detection Results:

   Page 1:
     - Layout Type: two_column
     - Number of Columns: 2
     - Page Width: 1520.1px
     - Column Boundaries: ['0.0', '722.5', '1520.1']
     - Column 1: 7 elements
     - Column 2: 4 elements
     - Top Elements: 0
     - Bottom Elements: 2

✓ Markdown saved to: paddle_markdown_output/batch1-0287.md

📄 Markdown preview (first 600 chars):
## Page 1

<!-- Layout: two_column (2 column(s)) -->

<div style="display: flex; gap: 20px;">

<div style="flex: 1;">  <!-- Column 1 -->

**Invoice no: 51335214 Date of issue: 03/27/201**

## Seller:

Rivera Group 90443 lan Inlet Suite 58e Lake Abigail, WV 40743

**Tax Id: 988-71-1654**

BAN: GB10XNTM3891843789666

## ITEMS

## SUMMARY

</div>

<div style="flex: 1;">  <!-- Column 2 -->

## Client:

Malone, Wilson and Carson 01909 Kyle Port South Joyce, VA 45070

Tax Id: 928-86-3224

**Tax Id: 928-86-322**



## 11. Batch Processing Function

In [11]:
def batch_process_documents(input_dir: Path, output_dir: Path, ocr_engine, config: Dict) -> Dict[str, Any]:
    """
    Process all documents in the input directory.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    
    document_dirs = [d for d in input_dir.iterdir() if d.is_dir()]
    
    if not document_dirs:
        logger.warning(f"No document directories found in {input_dir}")
        return {'success': False, 'error': 'No documents found'}
    
    logger.info(f"Found {len(document_dirs)} documents to process")
    
    results = {
        'total_documents': len(document_dirs),
        'successful': 0,
        'failed': 0,
        'details': [],
        'total_columns_detected': {}
    }
    
    for doc_dir in tqdm(document_dirs, desc="Processing documents"):
        try:
            result = process_document_with_column_detection(doc_dir, ocr_engine, config)
            
            if result['success']:
                # Save markdown
                output_file = output_dir / f"{result['document_name']}.md"
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(result['markdown'])
                
                results['successful'] += 1
                
                # Track column statistics
                for page_num, col_info in result['column_info'].items():
                    layout_type = col_info['layout_type']
                    results['total_columns_detected'][layout_type] = \
                        results['total_columns_detected'].get(layout_type, 0) + 1
                
                results['details'].append({
                    'document': result['document_name'],
                    'status': 'success',
                    'elements': result['total_elements'],
                    'confidence': result['avg_confidence'],
                    'output_file': str(output_file)
                })
                
                logger.info(f"✓ {result['document_name']}: {result['total_elements']} elements, "
                          f"{result['avg_confidence']:.1f}% confidence")
            else:
                results['failed'] += 1
                results['details'].append({
                    'document': doc_dir.name,
                    'status': 'failed',
                    'error': result.get('error', 'Unknown error')
                })
                logger.error(f"✗ {doc_dir.name}: {result.get('error')}")
        
        except Exception as e:
            results['failed'] += 1
            results['details'].append({
                'document': doc_dir.name,
                'status': 'failed',
                'error': str(e)
            })
            logger.error(f"✗ {doc_dir.name}: {e}")
    
    return results


print("✓ Batch processing function defined")

✓ Batch processing function defined
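
The per-layout page counts collected in `total_columns_detected` can also be expressed with `collections.Counter`. The sketch below is a simplified stand-in for that bookkeeping; `tally_layouts` and its input (one page-number-to-layout mapping per document) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def tally_layouts(per_document: List[Dict[int, str]]) -> Counter:
    """Count pages per layout type across documents (page number -> layout type)."""
    counts: Counter = Counter()
    for pages in per_document:
        counts.update(pages.values())
    return counts


# Two made-up documents: one single-page, one with two pages.
layout_counts = tally_layouts([{1: "two_column"}, {1: "two_column", 2: "single_column"}])
print(layout_counts.most_common())
```

Because the tally is per page, the totals can exceed the number of documents when some documents have more than one page.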


## 12. Batch Process All Documents

In [12]:
# Process all documents
print("Starting batch processing with PaddleOCR...\n")

input_dir = Path(CONFIG['input_dir'])
output_dir = Path(CONFIG['output_dir'])

results = batch_process_documents(input_dir, output_dir, ocr, CONFIG)

print("\n" + "="*70)
print("BATCH PROCESSING SUMMARY")
print("="*70)
print(f"Total documents: {results['total_documents']}")
print(f"Successful: {results['successful']}")
print(f"Failed: {results['failed']}")
print(f"\nOutput directory: {output_dir}")

if results.get('total_columns_detected'):
    print("\n📊 Column Layout Statistics:")
    for layout_type, count in results['total_columns_detected'].items():
        print(f"  {layout_type}: {count} pages")

if results['failed'] > 0:
    print("\n❌ Failed documents:")
    for detail in results['details']:
        if detail['status'] == 'failed':
            print(f"  - {detail['document']}: {detail['error']}")

Starting batch processing with PaddleOCR...



Processing documents: 100%|██████████| 61/61 [23:05<00:00, 22.71s/it]


BATCH PROCESSING SUMMARY
Total documents: 61
Successful: 61
Failed: 0

Output directory: paddle_markdown_output

📊 Column Layout Statistics:
  two_column: 49 pages
  single_column: 7 pages
  three_column: 7 pages





## 13. View Generated Markdown

In [15]:
# View a generated markdown file
output_dir = Path(CONFIG['output_dir'])
markdown_files = list(output_dir.glob('*.md'))

if markdown_files:
    sample_file = markdown_files[0]
    print(f"Viewing: {sample_file.name}\n")
    print("="*70)
    
    with open(sample_file, 'r', encoding='utf-8') as f:
        content = f.read()
        print(content[:1500])
        if len(content) > 1500:
            print("\n... (truncated)")
            print(f"\nTotal length: {len(content)} characters")
else:
    print("No markdown files found")

Viewing: 00921466.md

# 00921466

*Generated with PaddleOCR - 2025-10-23 01:59:35*

## Page 1

<!-- Layout: three_column (3 column(s)) -->

<div style="display: flex; gap: 20px;">

<div style="flex: 1;">  <!-- Column 1 -->

BORRISTON RESEARCH LABORATORIES, INC.

August 20, 1981

Greensboro, N.C. 27420 420 English St. LORILLARD, INC.

**Attention: Dr. Harry Minnemeyer Reference: Purchase Order # 312-A BRL Ref.: 2-22-222-J Invoice No.: 5-J**

## DESCRIPTION

For submission of Final Report "Cardiovascular Testing of Compound A-11 in the Beagle Dog" at $2,700.00 per compound.

</div>

<div style="flex: 1;">  <!-- Column 2 -->

## * * * *INVOICE* * * *

**Remittance Address: ENVIRO CONTROL INC. 11140 Rockville Pike Rockville, Md. 20852 Attn: B. Belford, Accountin**

* * * *INVOICE* * *

*5050 Beech Place • Temple Hills, Maryland 20031 • 301899-353*

</div>

<div style="flex: 1;">  <!-- Column 3 -->

**AMOUNT $2,700.00**

*00921466*

</div>

</div>



## Summary

### ✅ Features Implemented:

1. **PaddleOCR Integration**
   - Higher accuracy than Tesseract (reported recognition confidence is typically 98-100%)
   - Better multi-language support
   - Robust text detection and recognition

2. **Intelligent Column Detection**
   - Automatically identifies 1, 2, 3, or more columns
   - Splits columns at large gaps between element centers (a simple gap-based clustering of horizontal positions)
   - Detects full-width elements (headers, tables, footers)
   - Groups paired elements across columns

3. **Layout-Aware Processing**
   - Reads layout_data.json for structure
   - Maintains proper reading order
   - Preserves spatial relationships

4. **Smart Markdown Generation**
   - Multi-column formatting with HTML flexbox
   - Proper grouping of related content
   - Clean, readable output
   - Column information in comments

5. **Batch Processing**
   - Process multiple documents
   - Progress tracking
   - Error handling
   - Statistics reporting

### 📊 Processing Flow:
```
Input: Cropped images + layout_data.json
  ↓
PaddleOCR → Extract text with high confidence
  ↓
Layout Analysis → Get reading order & bounding boxes
  ↓
Column Detection → Identify layout structure
  ↓
Markdown Generation → Format with proper columns
  ↓
Output: Structured markdown files
```

### 🎯 Next Steps:
1. Run cell 10 to test single document
2. Run cell 12 to batch process all documents
3. Run cell 13 to view generated markdown
4. Check output in `paddle_markdown_output/` directory