# NanoNet OCR Pipeline with Intelligent Column Detection

This notebook mirrors the PaddleOCR column-aware workflow but swaps in the **NanoNet OCR** model from Hugging Face. It provides:
- ✅ Automatic column detection (1, 2, 3, or more)
- ✅ Reading-order aware text reconstruction via `layout_data.json`
- ✅ Full-page and cropped-region processing with NanoNet OCR
- ✅ Markdown output that preserves multi-column structure
- ✅ Batch processing across many documents

**Why NanoNet?**
- Vision-language model that understands complex layouts
- Produces richer text (tables, checkboxes, watermarks) than classic OCR
- Works on CPU or GPU (CUDA recommended for speed)

**Prerequisites:**
- Layout detection notebook already executed (`layout_data.json`, `cropped_sections/` available)
- Python environment with `torch`, `transformers>=4.41`, `Pillow`, `tqdm` installed
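
Before loading the model, it can help to confirm that the layout notebook's artifacts are actually present. The sketch below is a minimal check, assuming each document lives in its own directory containing `layout_data.json` and `cropped_sections/`; the function name `missing_prerequisites` is hypothetical and not part of this notebook.

```python
from pathlib import Path
from typing import List


def missing_prerequisites(document_dir: Path) -> List[str]:
    """Return the names of expected layout artifacts that are absent (hypothetical helper)."""
    missing = []
    # The layout notebook is expected to write one JSON file per document.
    if not (document_dir / "layout_data.json").is_file():
        missing.append("layout_data.json")
    # Cropped element images are expected in a sub-directory.
    if not (document_dir / "cropped_sections").is_dir():
        missing.append("cropped_sections/")
    return missing
```

An empty list means both artifacts were found; anything else names what still has to be generated.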

## 1. Import Required Libraries

In [26]:
import json
import logging
from pathlib import Path
from typing import Dict, Any, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# OCR / ML
torch = None
try:
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForImageTextToText
except ImportError as e:
    print(f"❌ Missing dependency: {e}")
    print("Please install required packages, e.g. → pip install torch transformers pillow")
    raise

# Utilities
from datetime import datetime
from tqdm.auto import tqdm

# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 2. Configuration

In [27]:
CONFIG: Dict[str, Any] = {
    # Data directories
    'input_dir': 'output_results',
    'output_dir': 'nanonet_markdown_output',
    
    # NanoNet model options
    'model_name': 'nanonets/Nanonets-OCR-s',
    'device': 'auto',  # 'auto', 'cuda', or 'cpu'
    'dtype_priority': ['bfloat16', 'float16', 'float32'],
    'page_max_tokens': 2048,
    'crop_max_tokens': 1024,
    'use_full_page_ocr': True,
    'use_crop_ocr': True,
    'ocr_prompt': (
        """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    ),
    
    # Column detection thresholds
    'column_gap_threshold': 0.15,
    'full_width_threshold': 0.7,
    'vertical_pairing_threshold': 100,
    
    # Reading order / sorting
    'use_layout_reading_order': True,
    'sort_by_position': 'top_left',
    
    # Output formatting
    'confidence_threshold': 0.0,
    'add_spacing_between_elements': True,
    'format_tables': True,
    'include_confidence_comments': False,
    'show_column_info': True,
    'separate_pages': True,
    'include_document_header': False,
    'include_generation_stamp': False,
    'page_separator': '\n---\n\n',
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

Configuration:
  input_dir: output_results
  output_dir: nanonet_markdown_output
  model_name: nanonets/Nanonets-OCR-s
  device: auto
  dtype_priority: ['bfloat16', 'float16', 'float32']
  page_max_tokens: 2048
  crop_max_tokens: 1024
  use_full_page_ocr: True
  use_crop_ocr: True
  ocr_prompt: Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.
  column_gap_threshold: 0.15
  full_width_threshold: 0.7
  vertical_pairing_threshold: 100
  use_layout_reading_order
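
The two fractional thresholds are multiplied by the page width when columns are detected, while `vertical_pairing_threshold` is used as an absolute distance in layout coordinates. As a rough illustration, the sketch below converts the fractions to pixel values for a page 1654 units wide (the fallback width used later in the column detector); the helper name `thresholds_in_pixels` is hypothetical.

```python
def thresholds_in_pixels(
    page_width: float,
    column_gap_threshold: float = 0.15,
    full_width_threshold: float = 0.7,
) -> dict:
    """Convert the fractional CONFIG thresholds into layout-coordinate distances."""
    return {
        # Two neighbouring element centers further apart than this start a new column.
        "min_column_gap_px": page_width * column_gap_threshold,
        # Elements wider than this are treated as spanning the full page.
        "full_width_min_px": page_width * full_width_threshold,
    }


px = thresholds_in_pixels(1654.0)
print(px)
```

On such a page, a gap between element centers must exceed roughly 248 units to split columns, and an element must be wider than roughly 1158 units to count as full-width.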

## 3. Initialize NanoNet OCR

In [28]:
class NanoNetOCREngine:
    """Lightweight wrapper that loads and executes the NanoNet OCR model."""

    def __init__(self, model_name: str, device_pref: str = 'auto', dtype_priority: Optional[List[str]] = None):
        self.model_name = model_name
        self.device = self._resolve_device(device_pref)
        self.dtype = self._resolve_dtype(dtype_priority or ['bfloat16', 'float16', 'float32'])

        print(f"🚀 Loading NanoNet OCR model: {model_name}")
        print(f"   • Device: {self.device}")
        print(f"   • Torch dtype: {self.dtype}")

        self.processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
        self.model = AutoModelForImageTextToText.from_pretrained(
            model_name,
            torch_dtype=self.dtype,
            trust_remote_code=True,
            low_cpu_mem_usage=True
        ).eval()

        if self.device.startswith('cuda'):
            self.model.to(self.device)

        pad_id = getattr(self.model.generation_config, 'pad_token_id', None)
        if pad_id is None:
            fallback_pad = getattr(self.processor.tokenizer, 'pad_token_id', None)
            if fallback_pad is not None:
                self.model.generation_config.pad_token_id = fallback_pad

    def _resolve_device(self, device_pref: str) -> str:
        if device_pref == 'cuda' or (device_pref == 'auto' and torch.cuda.is_available()):
            return 'cuda'
        return 'cpu'

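    def _resolve_dtype(self, candidates: List[str]):
        # Half-precision dtypes are only chosen when the model will actually run on CUDA.
        # Checking the resolved device keeps device='cpu' on float32 even if a GPU exists.
        on_cuda = self.device.startswith('cuda') and torch.cuda.is_available()
        for candidate in candidates:
            if candidate == 'bfloat16' and on_cuda and torch.cuda.is_bf16_supported():
                return torch.bfloat16
            if candidate == 'float16' and on_cuda:
                return torch.float16
            if candidate == 'float32':
                return torch.float32
        return torch.float32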

    def generate(self, image: Image.Image, prompt: str, max_tokens: int) -> str:
        messages = [
            {"role": "system", "content": "You are a helpful OCR assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": prompt},
                ],
            },

        ]

        chat_template = self.processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.processor(
            text=[chat_template],
            images=[image],
            padding=True,
            return_tensors='pt'
        )

        if self.device.startswith('cuda'):
            inputs = {k: v.to(self.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

        with torch.inference_mode():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=False,
                num_beams=1,
                use_cache=True
            )

        generated_ids = [out[len(inp):] for inp, out in zip(inputs['input_ids'], output)]
        text = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        )[0]
        return text.strip()


nanonet_engine = NanoNetOCREngine(
    CONFIG['model_name'],
    device_pref=CONFIG['device'],
    dtype_priority=CONFIG['dtype_priority']
)
print("✓ NanoNet OCR engine ready")

🚀 Loading NanoNet OCR model: nanonets/Nanonets-OCR-s
   • Device: cuda
   • Torch dtype: torch.bfloat16


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✓ NanoNet OCR engine ready
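
The `generate` method slices each output sequence by the length of its input because decoder-only models typically return the prompt tokens followed by the newly generated ones. The sketch below shows that slicing step on plain Python lists, with made-up token ids and a hypothetical helper name, so it runs without loading the model.

```python
from typing import List


def strip_prompt_tokens(
    input_ids: List[List[int]], output_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the echoed prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Made-up token ids: the output repeats the 4 prompt tokens, then adds 3 new ones.
prompts = [[101, 7, 8, 9]]
outputs = [[101, 7, 8, 9, 42, 43, 2]]
new_tokens = strip_prompt_tokens(prompts, outputs)
print(new_tokens)
```

Only the trailing tokens are passed to `batch_decode`, so the decoded text does not contain the OCR prompt.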


## 4. Helper Functions - File Parsing & Layout

In [29]:
def parse_crop_filename(filename: str) -> Dict[str, Any]:
    """Extract metadata from crop filename (page/order/type/id)."""
    parts = filename.replace('.png', '').split('_')
    metadata = {'filename': filename, 'page': None, 'order': None, 'element_type': None, 'element_id': None}

    try:
        if 'page' in parts:
            page_idx = parts.index('page')
            metadata['page'] = int(parts[page_idx + 1])
        if 'order' in parts:
            order_idx = parts.index('order')
            metadata['order'] = int(parts[order_idx + 1])
        if 'id' in parts:
            id_idx = parts.index('id')
            metadata['element_id'] = int(parts[id_idx + 1])
        if 'order' in parts and 'id' in parts:
            order_idx = parts.index('order')
            id_idx = parts.index('id')
            if order_idx + 2 < id_idx:
                metadata['element_type'] = '_'.join(parts[order_idx + 2:id_idx])
    except (ValueError, IndexError) as exc:
        logger.warning(f"Failed to parse filename {filename}: {exc}")

    return metadata


def load_layout_data(document_dir: Path) -> Optional[Dict[str, Any]]:
    layout_file = document_dir / 'layout_data.json'
    if not layout_file.exists():
        logger.warning(f"layout_data.json not found in {document_dir}")
        return None
    try:
        with open(layout_file, 'r', encoding='utf-8') as fh:
            return json.load(fh)
    except Exception as exc:
        logger.error(f"Error reading {layout_file}: {exc}")
        return None


def get_element_layout_info(element_id: int, layout_data: Dict[str, Any], page_num: int) -> Optional[Dict[str, Any]]:
    if not layout_data or 'pages' not in layout_data:
        return None
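    page_idx = page_num - 1
    # Reject page 0 or negative pages too: a negative index would silently wrap to the last page.
    if page_idx < 0 or page_idx >= len(layout_data['pages']):
        return None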
    page = layout_data['pages'][page_idx]
    for elem in page.get('elements', []):
        if elem.get('id') == element_id:
            bbox = elem.get('bounding_box', {})
            return {
                'reading_order': elem.get('reading_order'),
                'bbox': bbox,
                'left': bbox.get('left', 0),
                'top': bbox.get('top', 0),
                'right': bbox.get('right', 0),
                'bottom': bbox.get('bottom', 0),
                'type': elem.get('type', 'unknown')
            }
    return None
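
To make the expected inputs concrete, the sketch below parses one crop filename and looks the element up in a small layout dictionary. The filename `page_1_order_3_section_header_id_7.png` and the sample `layout_data` are assumptions inferred from the keys the helpers above read; the real layout notebook may use different names or extra fields. `parse_tokens` is a condensed, hypothetical stand-in for `parse_crop_filename`.

```python
from typing import Any, Dict, Optional


def parse_tokens(filename: str) -> Dict[str, Optional[Any]]:
    """Condensed stand-in for parse_crop_filename (illustrative only)."""
    parts = filename.rsplit(".", 1)[0].split("_")

    def int_after(key: str) -> Optional[int]:
        # The integer value follows its keyword, e.g. "page", "1".
        return int(parts[parts.index(key) + 1]) if key in parts else None

    meta = {
        "page": int_after("page"),
        "order": int_after("order"),
        "element_id": int_after("id"),
        "element_type": None,
    }
    if "order" in parts and "id" in parts:
        # Everything between the order value and "id" is the element type.
        start, end = parts.index("order") + 2, parts.index("id")
        if start < end:
            meta["element_type"] = "_".join(parts[start:end])
    return meta


# Assumed filename pattern and layout structure (not confirmed by the source data).
example = parse_tokens("page_1_order_3_section_header_id_7.png")
layout_data = {
    "pages": [
        {
            "elements": [
                {
                    "id": 7,
                    "type": "section_header",
                    "reading_order": 2,
                    "bounding_box": {"left": 120, "top": 80, "right": 760, "bottom": 130},
                }
            ]
        }
    ]
}

# Pages are 1-based in filenames and 0-based in the list.
page = layout_data["pages"][example["page"] - 1]
found = next(e for e in page["elements"] if e["id"] == example["element_id"])
print(example, found["bounding_box"])
```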

## 5. NanoNet OCR Function

In [30]:
def perform_nanonet_ocr(
    image_path: Path,
    engine: NanoNetOCREngine,
    config: Dict[str, Any],
    max_tokens: int
) -> Dict[str, Any]:
    """Run NanoNet OCR on an image path, returning text and status."""
    try:
        image = Image.open(image_path).convert('RGB')
    except Exception as exc:
        logger.error(f"Failed to open image {image_path}: {exc}")
        return {'success': False, 'text': '', 'confidence': 0.0, 'error': str(exc)}

    try:
        text = engine.generate(image, config['ocr_prompt'], max_tokens=max_tokens)
        if not text.strip():
            text = ''
        return {
            'success': True,
            'text': text,
            'confidence': 0.0,  # NanoNet does not return per-line confidence
            'lines': [line.strip() for line in text.splitlines() if line.strip()]
        }
    except RuntimeError as exc:
        if 'CUDA out of memory' in str(exc) and max_tokens > 128:
            logger.warning(f"OOM on {image_path.name} with {max_tokens} tokens; retrying with {max_tokens // 2}")
            torch.cuda.empty_cache()
            return perform_nanonet_ocr(image_path, engine, config, max_tokens=max_tokens // 2)
        logger.error(f"NanoNet inference failed for {image_path.name}: {exc}")
        return {'success': False, 'text': '', 'confidence': 0.0, 'error': str(exc)}
    except Exception as exc:
        logger.error(f"NanoNet inference failed for {image_path.name}: {exc}")
        return {'success': False, 'text': '', 'confidence': 0.0, 'error': str(exc)}
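
The out-of-memory branch above retries with half the token budget until the budget reaches 128. The sketch below expresses the same policy as a loop around a stand-in task so that it can be run without a GPU; `run_with_halving` and `fake_ocr` are hypothetical names, and the simulated error message is not real CUDA output.

```python
from typing import Callable, List


def run_with_halving(task: Callable[[int], str], budget: int, floor: int = 128) -> str:
    """Call task(budget); on an out-of-memory RuntimeError, retry with half the budget."""
    while True:
        try:
            return task(budget)
        except RuntimeError as exc:
            # Re-raise anything that is not an OOM, or when the budget cannot shrink further.
            if "out of memory" not in str(exc) or budget <= floor:
                raise
            budget //= 2


attempts: List[int] = []


def fake_ocr(max_tokens: int) -> str:
    """Stand-in for the OCR call: pretends that budgets above 300 tokens exhaust memory."""
    attempts.append(max_tokens)
    if max_tokens > 300:
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"ok with {max_tokens} tokens"


result = run_with_halving(fake_ocr, 1024)
print(attempts, result)
```

A smaller budget can truncate long pages, so it may be worth logging which documents needed a retry.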

## 6. Reading Order & Sorting

In [31]:
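def sort_elements_by_reading_order(elements: List[Dict[str, Any]], config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Sort by page, then by layout reading order (or filename order as a fallback)."""
    def _or_default(value: Any, default: int) -> Any:
        # Keys can be present with a None value, and None cannot be compared with ints.
        return default if value is None else value

    def sort_key(elem: Dict[str, Any]) -> Tuple:
        page = _or_default(elem.get('page'), 0)
        layout_info = elem.get('layout_info')
        if config['use_layout_reading_order'] and layout_info:
            return (
                page,
                _or_default(layout_info.get('reading_order'), 9999),
                layout_info.get('top', 0),
                layout_info.get('left', 0)
            )
        return (page, _or_default(elem.get('order'), 9999), 0, 0)

    return sorted(elements, key=sort_key)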
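
A small worked example may make the sort key easier to follow. The sketch below uses made-up elements and a simplified copy of the key function; note that elements without layout information fall back to the filename `order`, which is compared in the same tuple slot as `reading_order`.

```python
# Made-up elements: three with layout information, one without.
elements = [
    {"page": 1, "order": 2, "layout_info": None, "text": "fallback order 2"},
    {"page": 1, "order": 5, "layout_info": {"reading_order": 0, "top": 40, "left": 100},
     "text": "reading order 0"},
    {"page": 2, "order": 0, "layout_info": {"reading_order": 1, "top": 10, "left": 100},
     "text": "page 2"},
    {"page": 1, "order": 1, "layout_info": {"reading_order": 3, "top": 300, "left": 100},
     "text": "reading order 3"},
]


def sort_key(elem):
    """Simplified key: (page, reading order or filename order, top, left)."""
    info = elem.get("layout_info")
    if info:
        return (elem["page"], info.get("reading_order", 9999), info.get("top", 0), info.get("left", 0))
    return (elem["page"], elem.get("order", 9999), 0, 0)


ordered = [e["text"] for e in sorted(elements, key=sort_key)]
print(ordered)
```

Page number always dominates, so the page 2 element sorts last even though its `reading_order` is small.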

## 7. Advanced Column Detection

In [32]:
def detect_document_columns(elements: List[Dict[str, Any]], page_num: int, config: Dict[str, Any]) -> Dict[str, Any]:
    page_elements = [e for e in elements if e.get('page') == page_num]
    if not page_elements:
        return {
            'num_columns': 0,
            'columns': [],
            'top_elements': [],
            'bottom_elements': [],
            'column_boundaries': [],
            'layout_type': 'empty'
        }

    page_width = 1654.0
    elements_with_layout = [e for e in page_elements if e.get('layout_info')]
    if elements_with_layout:
        max_right = max(e['layout_info'].get('right', 0) for e in elements_with_layout)
        if max_right > 0:
            page_width = max_right

    element_centers = []
    for elem in elements_with_layout:
        layout = elem['layout_info']
        left = layout.get('left', 0)
        right = layout.get('right', 0)
        width = max(right - left, 1)
        center = left + (width / 2)
        element_centers.append({
            'element': elem,
            'center': center,
            'left': left,
            'right': right,
            'width': width,
            'top': layout.get('top', 0),
            'bottom': layout.get('bottom', 0)
        })

    if not element_centers:
        return {
            'num_columns': 1,
            'columns': [page_elements],
            'top_elements': [],
            'bottom_elements': [],
            'column_boundaries': [0, page_width],
            'layout_type': 'single_column'
        }

    full_width_threshold = page_width * config['full_width_threshold']
    full_width_elements = [ec for ec in element_centers if ec['width'] > full_width_threshold]
    narrow_elements = [ec for ec in element_centers if ec['width'] <= full_width_threshold]

    if len(narrow_elements) < 2:
        return {
            'num_columns': 1,
            'columns': [page_elements],
            'top_elements': [],
            'bottom_elements': [],
            'column_boundaries': [0, page_width],
            'layout_type': 'single_column'
        }

    sorted_centers = sorted(ec['center'] for ec in narrow_elements)
    gaps = []
    for i in range(len(sorted_centers) - 1):
        gap = sorted_centers[i + 1] - sorted_centers[i]
        if gap > page_width * config['column_gap_threshold']:
            gaps.append({
                'position': (sorted_centers[i] + sorted_centers[i + 1]) / 2,
                'size': gap
            })

    num_columns = len(gaps) + 1
    column_boundaries = [0] + [gap['position'] for gap in gaps] + [page_width]

    columns: List[List[Dict[str, Any]]] = [[] for _ in range(num_columns)]
    for ec in narrow_elements:
        center = ec['center']
        for i in range(num_columns):
            if column_boundaries[i] <= center < column_boundaries[i + 1]:
                columns[i].append(ec['element'])
                break

    for i in range(num_columns):
        columns[i] = sorted(
            columns[i],
            key=lambda x: (
                x.get('layout_info', {}).get('top', 0),
                x.get('order', 0)
            )
        )

    column_start = None
    column_end = None
    vertical_threshold = config['vertical_pairing_threshold']
    if num_columns > 1:
        paired_tops = []
        paired_bottoms = []
        for i in range(num_columns - 1):
            for elem1 in columns[i]:
                for elem2 in columns[i + 1]:
                    top1 = elem1.get('layout_info', {}).get('top', 9999)
                    top2 = elem2.get('layout_info', {}).get('top', 9999)
                    bottom1 = elem1.get('layout_info', {}).get('bottom', 0)
                    bottom2 = elem2.get('layout_info', {}).get('bottom', 0)
                    if abs(top1 - top2) < vertical_threshold:
                        paired_tops.append(min(top1, top2))
                        paired_bottoms.append(max(bottom1, bottom2))
        if paired_tops and paired_bottoms:
            column_start = min(paired_tops)
            column_end = max(paired_bottoms)

    top_elements = []
    bottom_elements = []
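    for ec in full_width_elements:
        top_pos = ec['top']
        # Compare against None explicitly: a column region that starts at y=0 is still valid.
        if column_start is not None and top_pos < column_start - 50:
            top_elements.append(ec['element'])
        else:
            # Elements below or overlapping the column region, and every full-width
            # element when no column region was found, are emitted after the columns.
            bottom_elements.append(ec['element'])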

    top_elements.sort(key=lambda x: (x.get('layout_info', {}).get('top', 0), x.get('order', 0)))
    bottom_elements.sort(key=lambda x: (x.get('layout_info', {}).get('top', 0), x.get('order', 0)))

    if num_columns == 1:
        layout_type = 'single_column'
    elif num_columns == 2:
        layout_type = 'two_column'
    elif num_columns == 3:
        layout_type = 'three_column'
    else:
        layout_type = f'{num_columns}_column'

    return {
        'num_columns': num_columns,
        'columns': columns,
        'top_elements': top_elements,
        'bottom_elements': bottom_elements,
        'column_boundaries': column_boundaries,
        'layout_type': layout_type,
        'page_width': page_width
    }
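
The core of the detector is the gap search over sorted element centers: wherever two neighbouring centers are further apart than `column_gap_threshold` times the page width, a column boundary is placed halfway between them. The sketch below isolates that step with made-up center positions; `column_boundaries_from_centers` is a hypothetical helper, not a function defined in this notebook.

```python
from typing import List


def column_boundaries_from_centers(
    centers: List[float], page_width: float, gap_threshold: float = 0.15
) -> List[float]:
    """Place a boundary midway between neighbouring centers separated by a large gap."""
    ordered = sorted(centers)
    min_gap = page_width * gap_threshold
    cuts = [(a + b) / 2 for a, b in zip(ordered, ordered[1:]) if b - a > min_gap]
    return [0.0] + cuts + [page_width]


# Three elements centered near x=400 and two near x=1245 on a 1654-wide page.
bounds = column_boundaries_from_centers([400, 410, 395, 1240, 1250], page_width=1654.0)
num_columns = len(bounds) - 1
print(bounds, num_columns)
```

Here the only gap wider than about 248 units lies between 410 and 1240, so a single boundary at 825 yields two columns.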

## 8. Markdown Generation

In [33]:
def format_element_for_markdown(element: Dict[str, Any], config: Dict[str, Any]) -> str:
    text = element.get('text', '').strip()
    if not text:
        return ''

    element_type = element.get('element_type', 'text')
    markdown_lines = []

    if config['include_confidence_comments'] and 'confidence' in element:
        markdown_lines.append(f"<!-- OCR Confidence: {element['confidence']:.1f}% -->")

    if element_type == 'title':
        markdown_lines.append(f"# {text}")
    elif element_type in {'section_header', 'header'}:
        markdown_lines.append(f"## {text}")
    elif element_type == 'page_header':
        markdown_lines.append(f"*{text}*")
    elif element_type == 'page_footer':
        markdown_lines.append(f"*{text}*")
    elif element_type == 'table':
        if config['format_tables']:
            markdown_lines.append('```html')
            markdown_lines.append(text)
            markdown_lines.append('```')
        else:
            markdown_lines.append(text)
    elif element_type == 'key_value_region':
        markdown_lines.append(f"**{text}**")
    else:
        markdown_lines.append(text)

    return '\n'.join(markdown_lines)


def generate_markdown_with_columns(elements: List[Dict[str, Any]], config: Dict[str, Any], doc_name: str) -> str:
    markdown_parts: List[str] = []

    pages: Dict[int, List[Dict[str, Any]]] = {}
    for elem in elements:
        page_num = elem.get('page', 1)
        pages.setdefault(page_num, []).append(elem)

    for page_num in sorted(pages.keys()):
        if config['separate_pages']:
            markdown_parts.append(f"## Page {page_num}\n")

        column_info = detect_document_columns(elements, page_num, config)

        if config['show_column_info']:
            markdown_parts.append(
                f"<!-- Layout: {column_info['layout_type']} ({column_info['num_columns']} column(s)) -->"
            )
            markdown_parts.append('')

        for elem in column_info['top_elements']:
            formatted = format_element_for_markdown(elem, config)
            if formatted:
                markdown_parts.append(formatted)
                if config['add_spacing_between_elements']:
                    markdown_parts.append('')

        if column_info['num_columns'] > 1:
            markdown_parts.append('<div style="display: flex; gap: 20px;">')
            markdown_parts.append('')
            for idx, column in enumerate(column_info['columns']):
                markdown_parts.append(f'<div style="flex: 1;">  <!-- Column {idx + 1} -->')
                markdown_parts.append('')
                for elem in column:
                    formatted = format_element_for_markdown(elem, config)
                    if formatted:
                        markdown_parts.append(formatted)
                        if config['add_spacing_between_elements']:
                            markdown_parts.append('')
                markdown_parts.append('</div>')
                markdown_parts.append('')
            markdown_parts.append('</div>')
            markdown_parts.append('')
        else:
            for column in column_info['columns']:
                for elem in column:
                    formatted = format_element_for_markdown(elem, config)
                    if formatted:
                        markdown_parts.append(formatted)
                        if config['add_spacing_between_elements']:
                            markdown_parts.append('')

        for elem in column_info['bottom_elements']:
            formatted = format_element_for_markdown(elem, config)
            if formatted:
                markdown_parts.append(formatted)
                if config['add_spacing_between_elements']:
                    markdown_parts.append('')

        if config['separate_pages'] and page_num < max(pages.keys()):
            markdown_parts.append(config['page_separator'])

    return '\n'.join(markdown_parts)
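
For multi-column pages, the generator wraps each column in a flex `<div>`. The sketch below reproduces just that wrapper for already-formatted text blocks, assuming the Markdown renderer passes inline HTML through; `render_columns` is a hypothetical helper. The blank line after each opening tag is kept because many Markdown renderers only resume Markdown parsing after a blank line inside an HTML block.

```python
from typing import List


def render_columns(columns: List[List[str]]) -> str:
    """Wrap pre-formatted blocks in the flex layout used for multi-column pages."""
    parts = ['<div style="display: flex; gap: 20px;">', '']
    for idx, column in enumerate(columns):
        parts.append(f'<div style="flex: 1;">  <!-- Column {idx + 1} -->')
        parts.append('')
        for block in column:
            # Each block is followed by a blank line so paragraphs stay separate.
            parts.extend([block, ''])
        parts.extend(['</div>', ''])
    parts.append('</div>')
    return '\n'.join(parts)


html = render_columns([["Left paragraph"], ["## Right heading", "Right paragraph"]])
print(html)
```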

## 9. Main Processing Function

In [34]:
def process_document_with_column_detection(
    document_dir: Path,
    engine: NanoNetOCREngine,
    config: Dict[str, Any]
) -> Dict[str, Any]:
    doc_name = document_dir.name
    logger.info(f"Processing document with NanoNet OCR: {doc_name}")

    layout_data = load_layout_data(document_dir)
    crops_dir = document_dir / 'cropped_sections'

    if not crops_dir.exists():
        return {'success': False, 'error': f'No cropped_sections directory in {document_dir}'}

    crop_files = sorted([f for f in crops_dir.glob('*.png') if f.name.startswith('page_')])
    if not crop_files:
        return {'success': False, 'error': f'No crop PNG files found in {crops_dir}'}

    elements: List[Dict[str, Any]] = []

    for crop_file in tqdm(crop_files, desc=f"NanoNet OCR {doc_name}", leave=False):
        metadata = parse_crop_filename(crop_file.name)
        max_tokens = config['crop_max_tokens']
        ocr_result = perform_nanonet_ocr(crop_file, engine, config, max_tokens=max_tokens)

        if not ocr_result['success'] or not ocr_result['text'].strip():
            continue

        layout_info = None
        if layout_data and metadata['page'] and metadata['element_id'] is not None:
            layout_info = get_element_layout_info(metadata['element_id'], layout_data, metadata['page'])

        elements.append({
            'page': metadata['page'],
            'order': metadata['order'],
            'element_type': metadata['element_type'],
            'element_id': metadata['element_id'],
            'text': ocr_result['text'],
            'confidence': ocr_result['confidence'],
            'layout_info': layout_info
        })

    # Optional full-page OCR pass
    if config['use_full_page_ocr']:
        page_images_dir = document_dir / 'page_images'
        if page_images_dir.exists():
            for page_image in sorted(page_images_dir.glob('page_*.png')):
                page_num = int(page_image.stem.split('_')[1]) if '_' in page_image.stem else 1
                ocr_result = perform_nanonet_ocr(
                    page_image,
                    engine,
                    config,
                    max_tokens=config['page_max_tokens']
                )
                if ocr_result['success'] and ocr_result['text'].strip():
                    elements.append({
                        'page': page_num,
                        'order': 0,
                        'element_type': 'full_page',
                        'element_id': None,
                        'text': ocr_result['text'],
                        'confidence': ocr_result['confidence'],
                        'layout_info': None
                    })

    sorted_elements = sort_elements_by_reading_order(elements, config)
    pages = sorted({e.get('page', 1) for e in sorted_elements if e.get('page') is not None})

    column_info = {}
    for page_num in pages:
        column_info[page_num] = detect_document_columns(sorted_elements, page_num, config)
        info = column_info[page_num]
        logger.info(
            f"  Page {page_num}: {info['layout_type']} ({info['num_columns']} column(s))"
        )

    markdown_content = generate_markdown_with_columns(sorted_elements, config, doc_name)

    confidences = [e['confidence'] for e in sorted_elements if e.get('confidence', 0) > 0]
    avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0

    return {
        'success': True,
        'document_name': doc_name,
        'total_elements': len(sorted_elements),
        'markdown': markdown_content,
        'elements': sorted_elements,
        'column_info': column_info,
        'avg_confidence': avg_confidence
    }

## 10. Test Single Document

In [35]:
input_dir = Path(CONFIG['input_dir'])
output_dir = Path(CONFIG['output_dir'])
output_dir.mkdir(parents=True, exist_ok=True)

document_dirs = [d for d in input_dir.iterdir() if d.is_dir()]
if document_dirs:
    test_doc = document_dirs[0]
    print(f"Testing NanoNet OCR with Column Detection: {test_doc.name}\n")
    print("=" * 70)

    result = process_document_with_column_detection(test_doc, nanonet_engine, CONFIG)

    if result['success']:
        print(f"\n✅ Processed: {result['document_name']}")
        print(f"   Elements captured: {result['total_elements']}")
        print(f"   Average confidence placeholder: {result['avg_confidence']:.1f}%")

        print("\n📊 Column Layout Summary:")
        for page_num, info in result['column_info'].items():
            print(f"  Page {page_num} → {info['layout_type']} ({info['num_columns']} cols)")
            if info['num_columns'] > 1:
                bounds = ', '.join(f"{b:.1f}" for b in info['column_boundaries'])
                print(f"    Boundaries: {bounds}")

        output_file = output_dir / f"{result['document_name']}.md"
        with open(output_file, 'w', encoding='utf-8') as fh:
            fh.write(result['markdown'])

        print(f"\n✓ Markdown saved to {output_file}")
        print("\nPreview (first 600 chars):")
        print("=" * 70)
        preview = result['markdown'][:600]
        print(preview)
        if len(result['markdown']) > 600:
            print("... (truncated)")
    else:
        print(f"✗ Error: {result.get('error')}")
else:
    print("No documents available in input directory.")

2025-10-27 18:18:09,131 - INFO - Processing document with NanoNet OCR: 00922240


Testing NanoNet OCR with Column Detection: 00922240



NanoNet OCR 00922240:   0%|          | 0/17 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


✅ Processed: 00922240
   Elements captured: 17
   Average confidence placeholder: 0.0%

📊 Column Layout Summary:
  Page 1 → single_column (1 cols)

✓ Markdown saved to nanonet_markdown_output/00922240.md

Preview (first 600 chars):
## Page 1

<!-- Layout: single_column (1 column(s)) -->

<img>R-T-I triangle pointing to the right</img>

## INVOICE

**DATE: March 8, 1984
TERMS: Net
INV. NO.: 31T 2552-14B**

**TO:
Lorillard Research Center
Post Office Box 21688
Greensboro, N. C. 27420

ATTN: Dr. Harry Minnemeyer
Director, Research

REFERENCE:
P.O. No. 327B and 336A
Agreement Dated 11/18/82

FOR:
Carbon-14 Syntheses
Other Direct Costs

PERIOD:
February 1, 1984 to February 29, 1984**

*022240*

<...>

-54-0000

DURHAM

FROM

RALEIGH.

AND

CHAPEL

HILL

*RESEARCH TRIANGLE INSTITUTE
POST OFFICE BOX 12194
RESEARCH TRIANGLE PARK
... (truncated)


## 11. Batch Processing Helper

In [36]:
def batch_process_documents(
    input_dir: Path,
    output_dir: Path,
    engine: NanoNetOCREngine,
    config: Dict[str, Any]
) -> Dict[str, Any]:
    output_dir.mkdir(parents=True, exist_ok=True)
    document_dirs = [d for d in input_dir.iterdir() if d.is_dir()]
    if not document_dirs:
        logger.warning(f"No document directories found in {input_dir}")
        return {'success': False, 'error': 'No documents'}

    logger.info(f"Batch processing {len(document_dirs)} documents with NanoNet OCR")
    summary = {
        'total_documents': len(document_dirs),
        'successful': 0,
        'failed': 0,
        'details': [],
        'column_layouts': {}
    }

    for doc_dir in tqdm(document_dirs, desc='NanoNet batch'):
        try:
            result = process_document_with_column_detection(doc_dir, engine, config)
            if result['success']:
                output_file = output_dir / f"{result['document_name']}.md"
                with open(output_file, 'w', encoding='utf-8') as fh:
                    fh.write(result['markdown'])

                summary['successful'] += 1
                for page_num, info in result['column_info'].items():
                    layout = info['layout_type']
                    summary['column_layouts'][layout] = summary['column_layouts'].get(layout, 0) + 1

                summary['details'].append({
                    'document': result['document_name'],
                    'status': 'success',
                    'elements': result['total_elements'],
                    'output_file': str(output_file)
                })
            else:
                summary['failed'] += 1
                summary['details'].append({
                    'document': doc_dir.name,
                    'status': 'failed',
                    'error': result.get('error')
                })
        except Exception as exc:
            summary['failed'] += 1
            summary['details'].append({
                'document': doc_dir.name,
                'status': 'failed',
                'error': str(exc)
            })
            logger.error(f"Batch failure for {doc_dir.name}: {exc}")

    return summary
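
The `column_layouts` entry of the summary counts pages, not documents: every page of every successful document adds one to its layout type. The sketch below shows the same tally with `collections.Counter` on made-up per-document `column_info` dictionaries.

```python
from collections import Counter

# Made-up column_info results for two documents (page number -> layout info).
column_info_per_doc = [
    {1: {"layout_type": "single_column"}, 2: {"layout_type": "two_column"}},
    {1: {"layout_type": "two_column"}},
]

# One count per page, across all documents.
layouts = Counter(
    info["layout_type"]
    for doc in column_info_per_doc
    for info in doc.values()
)
print(dict(layouts))
```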

## 12. Batch Process All Documents

In [37]:
print("Starting NanoNet batch processing...\n")

batch_results = batch_process_documents(input_dir, output_dir, nanonet_engine, CONFIG)

print("\n" + "=" * 70)
print("NANONET BATCH SUMMARY")
print("=" * 70)
print(f"Total documents: {batch_results.get('total_documents', 0)}")
print(f"Successful: {batch_results.get('successful', 0)}")
print(f"Failed: {batch_results.get('failed', 0)}")
print(f"Output directory: {output_dir}")

if batch_results.get('column_layouts'):
    print("\nColumn layout breakdown:")
    for layout, count in batch_results['column_layouts'].items():
        print(f"  {layout}: {count} pages")

if batch_results.get('failed'):
    print("\nFailures:")
    for detail in batch_results['details']:
        if detail['status'] == 'failed':
            print(f"  - {detail['document']}: {detail.get('error')}")
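
The run log for this cell repeats a `transformers` warning that `temperature` is not a valid generation flag. That message usually appears when a sampling-only flag is supplied while decoding greedily (`do_sample=False`). The sketch below assumes the OCR engine builds its own keyword arguments for `model.generate`; `build_generation_kwargs` and `SAMPLING_ONLY_FLAGS` are hypothetical names, not part of the notebook or of `transformers`.

```python
from typing import Any, Dict

# Flags that only influence sampling-based decoding (assumed list for this sketch)
SAMPLING_ONLY_FLAGS = ('temperature', 'top_p', 'top_k')


def build_generation_kwargs(max_new_tokens: int, do_sample: bool = False,
                            **sampling: Any) -> Dict[str, Any]:
    """Return generate() kwargs, dropping sampling-only flags for greedy decoding."""
    kwargs: Dict[str, Any] = {'max_new_tokens': max_new_tokens, 'do_sample': do_sample}
    for name, value in sampling.items():
        # Keep the flag only if sampling is on, or if it is not sampling-specific
        if do_sample or name not in SAMPLING_ONLY_FLAGS:
            kwargs[name] = value
    return kwargs


greedy_kwargs = build_generation_kwargs(1024, do_sample=False, temperature=0.1)
sampled_kwargs = build_generation_kwargs(1024, do_sample=True, temperature=0.7)
print(greedy_kwargs)
print(sampled_kwargs)
```

If the flag comes from the checkpoint's own generation config rather than from the call site, this filter alone will not remove the message.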

2025-10-27 18:18:39,251 - INFO - Batch processing 61 documents with NanoNet OCR


Starting NanoNet batch processing...



NanoNet batch:   0%|          | 0/61 [00:00<?, ?it/s]

2025-10-27 18:18:39,259 - INFO - Processing document with NanoNet OCR: 00922240


NanoNet OCR 00922240:   0%|          | 0/17 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

NanoNet OCR batch1-0281:   0%|          | 0/11 [00:00<?, ?it/s]

NanoNet OCR 00920576:   0%|          | 0/8 [00:00<?, ?it/s]

NanoNet OCR invoice_Angele Hood_12988:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:20:46,462 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:20:46,464 - INFO - Processing document with NanoNet OCR: batch1-0280


NanoNet OCR batch1-0280:   0%|          | 0/13 [00:00<?, ?it/s]

NanoNet OCR invoice_Andy Yotov_37312:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:21:25,295 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:21:25,297 - INFO - Processing document with NanoNet OCR: 00921466


NanoNet OCR 00921466:   0%|          | 0/12 [00:00<?, ?it/s]

NanoNet OCR 0013046347:   0%|          | 0/13 [00:00<?, ?it/s]

NanoNet OCR invoice_Anna Andreadi_39301:   0%|          | 0/2 [00:00<?, ?it/s]

2025-10-27 18:22:49,809 - INFO -   Page 1: single_column (1 column(s))
2025-10-27 18:22:49,810 - INFO - Processing document with NanoNet OCR: invoice_Angele Hood_35601


NanoNet OCR invoice_Angele Hood_35601:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:23:01,974 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:23:01,976 - INFO - Processing document with NanoNet OCR: 2085538660


NanoNet OCR 2085538660:   0%|          | 0/3 [00:00<?, ?it/s]

2025-10-27 18:23:49,263 - INFO -   Page 1: single_column (1 column(s))
2025-10-27 18:23:49,265 - INFO - Processing document with NanoNet OCR: batch1-0286


NanoNet OCR batch1-0286:   0%|          | 0/13 [00:00<?, ?it/s]

NanoNet OCR invoice_Anna Andreadi_39300:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:24:42,500 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:24:42,502 - INFO - Processing document with NanoNet OCR: 50120516-0516


NanoNet OCR 50120516-0516:   0%|          | 0/5 [00:00<?, ?it/s]

2025-10-27 18:24:59,724 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:24:59,725 - INFO - Processing document with NanoNet OCR: batch1-0005


NanoNet OCR batch1-0005:   0%|          | 0/13 [00:00<?, ?it/s]

NanoNet OCR invoice_Anna Andreadi_35319:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:25:50,249 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:25:50,251 - INFO - Processing document with NanoNet OCR: batch1-0274


NanoNet OCR batch1-0274:   0%|          | 0/12 [00:00<?, ?it/s]

NanoNet OCR invoice_Anna Andreadi_39302:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:26:42,545 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:26:42,547 - INFO - Processing document with NanoNet OCR: batch1-0027


NanoNet OCR batch1-0027:   0%|          | 0/11 [00:00<?, ?it/s]

NanoNet OCR ti31149327_9330:   0%|          | 0/12 [00:00<?, ?it/s]

NanoNet OCR batch1-0288:   0%|          | 0/11 [00:00<?, ?it/s]

NanoNet OCR invoice_Anna Chung_36195:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:28:40,819 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:28:40,821 - INFO - Processing document with NanoNet OCR: batch1-0285


NanoNet OCR batch1-0285:   0%|          | 0/14 [00:00<?, ?it/s]

NanoNet OCR batch1-0021:   0%|          | 0/12 [00:00<?, ?it/s]

NanoNet OCR invoice_Ann Blume_35427:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:30:01,406 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:30:01,408 - INFO - Processing document with NanoNet OCR: ti31689113


NanoNet OCR ti31689113:   0%|          | 0/10 [00:00<?, ?it/s]

NanoNet OCR batch1-0007:   0%|          | 0/13 [00:00<?, ?it/s]

NanoNet OCR motelone_20240203:   0%|          | 0/18 [00:00<?, ?it/s]

NanoNet OCR invoice_Anna H–îberlin_40216:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:32:34,758 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:32:34,760 - INFO - Processing document with NanoNet OCR: spaceneedle_20240528_005


NanoNet OCR spaceneedle_20240528_005:   0%|          | 0/15 [00:00<?, ?it/s]

NanoNet OCR invoice_Anna Andreadi_35317:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:32:56,846 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:32:56,848 - INFO - Processing document with NanoNet OCR: batch1-0020


NanoNet OCR batch1-0020:   0%|          | 0/11 [00:00<?, ?it/s]


NanoNet OCR invoice_Anna Andreadi_35318:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:33:46,774 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:33:46,776 - INFO - Processing document with NanoNet OCR: batch1-0282


NanoNet OCR batch1-0282:   0%|          | 0/12 [00:00<?, ?it/s]


NanoNet OCR sliders-454353423425:   0%|          | 0/2 [00:00<?, ?it/s]

2025-10-27 18:34:39,434 - INFO -   Page 1: single_column (1 column(s))
2025-10-27 18:34:39,436 - INFO - Processing document with NanoNet OCR: invoice_Andy Yotov_37313


NanoNet OCR invoice_Andy Yotov_37313:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:34:52,836 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:34:52,838 - INFO - Processing document with NanoNet OCR: batch1-0278


NanoNet OCR batch1-0278:   0%|          | 0/12 [00:00<?, ?it/s]


NanoNet OCR invoice_Andy Yotov_37314:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:35:47,166 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:35:47,168 - INFO - Processing document with NanoNet OCR: batch1-0276


NanoNet OCR batch1-0276:   0%|          | 0/13 [00:00<?, ?it/s]


NanoNet OCR invoice_Andy Yotov_37315:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:36:23,374 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:36:23,377 - INFO - Processing document with NanoNet OCR: 0001139626


NanoNet OCR 0001139626:   0%|          | 0/8 [00:00<?, ?it/s]


NanoNet OCR invoice_Anna Chung_36194:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:38:04,432 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:38:04,434 - INFO - Processing document with NanoNet OCR: batch1-0287


NanoNet OCR batch1-0287:   0%|          | 0/13 [00:00<?, ?it/s]


NanoNet OCR batch1-0275:   0%|          | 0/11 [00:00<?, ?it/s]


NanoNet OCR invoice_Angele Hood_35602:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:39:27,617 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:39:27,619 - INFO - Processing document with NanoNet OCR: batch1-0001


NanoNet OCR batch1-0001:   0%|          | 0/13 [00:00<?, ?it/s]


NanoNet OCR batch1-0283:   0%|          | 0/11 [00:00<?, ?it/s]


NanoNet OCR batch1-0029:   0%|          | 0/12 [00:00<?, ?it/s]


NanoNet OCR invoice_Andy Yotov_39986:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:41:49,340 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:41:49,342 - INFO - Processing document with NanoNet OCR: 0001139716


NanoNet OCR 0001139716:   0%|          | 0/9 [00:00<?, ?it/s]


NanoNet OCR batch1-0003:   0%|          | 0/12 [00:00<?, ?it/s]


NanoNet OCR batch1-0279:   0%|          | 0/12 [00:00<?, ?it/s]


NanoNet OCR batch1-0277:   0%|          | 0/11 [00:00<?, ?it/s]


NanoNet OCR invoice_Anemone Ratner_8876:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:44:33,403 - INFO -   Page 1: two_column (2 column(s))
2025-10-27 18:44:33,405 - INFO - Processing document with NanoNet OCR: ti31689150


NanoNet OCR ti31689150:   0%|          | 0/10 [00:00<?, ?it/s]


NanoNet OCR shakeshack_20181208_004:   0%|          | 0/12 [00:00<?, ?it/s]


NanoNet OCR sliders-454353423425_with_crops:   0%|          | 0/2 [00:00<?, ?it/s]

2025-10-27 18:45:22,987 - INFO -   Page 1: single_column (1 column(s))
2025-10-27 18:45:22,990 - INFO - Processing document with NanoNet OCR: happyitaly_20240306_001


NanoNet OCR happyitaly_20240306_001:   0%|          | 0/4 [00:00<?, ?it/s]

2025-10-27 18:45:39,836 - INFO -   Page 1: single_column (1 column(s))
2025-10-27 18:45:39,838 - INFO - Processing document with NanoNet OCR: ti16311032


NanoNet OCR ti16311032:   0%|          | 0/9 [00:00<?, ?it/s]


NanoNet OCR batch1-0284:   0%|          | 0/12 [00:00<?, ?it/s]


NanoNet OCR invoice_Anna Gayman_42837:   0%|          | 0/6 [00:00<?, ?it/s]

2025-10-27 18:46:26,787 - INFO -   Page 1: two_column (2 column(s))



NANONET BATCH SUMMARY
Total documents: 61
Successful: 61
Failed: 0
Output directory: nanonet_markdown_output

Column layout breakdown:
  single_column: 8 pages
  two_column: 47 pages
  three_column: 8 pages


## 13. Preview Generated Markdown

In [38]:
markdown_files = list(output_dir.glob('*.md'))
if markdown_files:
    sample_file = markdown_files[0]
    print(f"Previewing: {sample_file.name}\n")
    print("=" * 70)
    with open(sample_file, 'r', encoding='utf-8') as fh:
        preview = fh.read()
        print(preview[:1500])
        if len(preview) > 1500:
            print("\n... (truncated)")
            print(f"\nTotal length: {len(preview)} characters")
else:
    print("No markdown files found in output directory.")

Previewing: 0001139626.md

## Page 1

<!-- Layout: three_column (3 column(s)) -->

```html
Phone 212-800-3131 Cable Bateboard, New York

RUN ON AUG09/79 AT 19:59 PRODUCTION BILL
BILL NUMBER P-08-0933 PAGE 1
DATE DUE AUG23/79

CLIENT BH GROWN & WILLIAMSON TOBACCO CORP
PRODUCT KM KOOL MILDS
J98 PM581Z KOOL MILDS PACK PRESENTATION
MEDIA P KOOL PRODUCT
ESTIMATE NUMBER KM-MISC-70-IRI BEW CODE N/A

1000 HILL ST
LOUISVILLE KY 40201
```

MEDIA P
ESTIMATE NUMBER= KM-MISC-78-1R1 BEW CODE= N/A

<table>
  <thead>
    <tr>
      <th>DESCRIPTION</th>
      <th>VENDOR NAME</th>
      <th>NET AMOUNT</th>
      <th>COMMISSION</th>
      <th>TOTAL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PHOTOGRAPHS FEE</td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>122176 PO NU 3935</td>
      <td>THOM DE SANTO</td>
      <td>1500.00</td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>761041 PO NO 3760</td>
      <td>SHIG IKEDA</td>
      <t

## Summary & Next Steps

### ✅ What this notebook provides
1. **NanoNet OCR Integration** – drop-in replacement for PaddleOCR with richer output (tables, checkboxes, watermarks).
2. **Column-Aware Formatting** – the same clustering strategy used in the Paddle notebook preserves the column layout.
3. **Batch & Single Document Modes** – test on one document, then scale out to whole folders.
4. **Markdown Output** – structured, human-readable results suitable for downstream use.

### 🚀 Suggested workflow
1. Run the layout detection notebook to generate crops & `layout_data.json`.
2. Execute sections 1–3 here to load the NanoNet model.
3. Test a single document (Section 10) to validate output.
4. Process entire folders with Section 12.
5. Inspect results in `nanonet_markdown_output/`.
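
For step 5, one quick check is to recount the column layouts directly from the generated Markdown and compare them with the batch summary. The sketch below assumes every page in every `.md` file carries a layout comment in the form shown in the preview (`<!-- Layout: three_column (3 column(s)) -->`). The helper name `tally_layouts` is hypothetical and is not part of the notebook.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the per-page layout comment seen in the preview output, e.g.
# <!-- Layout: three_column (3 column(s)) -->
# Assumption: every page section contains exactly one such comment.
LAYOUT_COMMENT = re.compile(r"<!--\s*Layout:\s*(\w+)\s*\(")


def tally_layouts(output_dir):
    """Count pages per layout label across all .md files in output_dir."""
    counts = Counter()
    for md_path in sorted(Path(output_dir).glob("*.md")):
        text = md_path.read_text(encoding="utf-8")
        # findall returns the captured layout label for each page comment
        counts.update(LAYOUT_COMMENT.findall(text))
    return counts
```

Calling `tally_layouts('nanonet_markdown_output')` should give per-layout page counts that can be compared against the "Column layout breakdown" printed by the batch run. A mismatch would suggest that some files were overwritten or only partially written.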

### 🛠️ Troubleshooting tips
- **Model downloads slowly?** Pre-cache the Hugging Face model or mount a shared cache.
- **GPU memory errors?** Lower `page_max_tokens` / `crop_max_tokens` or switch to CPU mode.
- **No crops found?** Ensure layout detection produced a `cropped_sections/` folder for each document.
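
The last tip can be checked before a batch is started. The sketch below assumes one sub-folder per document under the input directory (`output_results` in `CONFIG`), each holding a `layout_data.json` file and a non-empty `cropped_sections/` folder. That folder layout is an assumption, and the helper name `find_incomplete_documents` is hypothetical, so adjust both to match what the layout detection notebook actually writes.

```python
from pathlib import Path


def find_incomplete_documents(input_dir):
    """Return {document_name: [missing items]} for folders lacking layout outputs.

    Assumed layout (not confirmed by the notebook):
        input_dir/<document>/layout_data.json
        input_dir/<document>/cropped_sections/<crop images>
    """
    problems = {}
    doc_dirs = sorted(p for p in Path(input_dir).iterdir() if p.is_dir())
    for doc_dir in doc_dirs:
        missing = []
        if not (doc_dir / "layout_data.json").is_file():
            missing.append("layout_data.json")
        crops_dir = doc_dir / "cropped_sections"
        # An empty crops folder is as unusable as a missing one
        if not crops_dir.is_dir() or not any(crops_dir.iterdir()):
            missing.append("cropped_sections/")
        if missing:
            problems[doc_dir.name] = missing
    return problems
```

An empty result from `find_incomplete_documents(CONFIG['input_dir'])` means every document folder has both inputs. Otherwise, the returned mapping lists the documents to re-run through layout detection.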

Happy extracting!