# üöÄ SageMaker OCR Processing Pipeline (g5.12xlarge Optimized)

This notebook handles the **second part** of the document processing pipeline:

1. **Load Pre-processed Data** from the Layout Detection notebook
2. **Initialize Nanonets OCR** with SageMaker g5.12xlarge optimization (4x A10G GPUs)
3. **Process Full Pages** and **Cropped Regions** with OCR
4. **Generate Structured Output** in Markdown and JSON formats

**Key Features:**
- **SageMaker g5.12xlarge Optimized**: Multi-GPU distributed processing across 4x NVIDIA A10G (96GB total)
- **High-Performance OCR**: Nanonets model with flash attention and memory optimization
- **Robust Processing**: Automatic error recovery and memory management
- **Flexible Token Limits**: Configurable based on quality vs speed requirements
- **Comprehensive Output**: Both human-readable Markdown and structured JSON results

**Hardware Targets:**
- **4x NVIDIA A10G GPUs** (24GB each)
- **96GB Total GPU Memory**
- **Multi-GPU Model Distribution**
- **3-4x Performance Improvement** over single GPU

## 1. Configuration & Input Data Loading

In [1]:
from pathlib import Path
import json
import time
import warnings
import logging
from typing import Dict, Any, List, Optional, Tuple
import torch
import glob

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger("ocr_pipeline")

print("üöÄ SageMaker g5.12xlarge OCR Pipeline Initialization")
print("üéØ Target: 4x NVIDIA A10G GPUs (24GB each = 96GB total)")

# ‚ñ∂‚ñ∂ AUTO-DETECT OR MANUAL INPUT CONFIGURATION
# ===========================================

def find_latest_layout_output() -> Optional[Path]:
    """Automatically find the most recent layout output directory."""
    
    # Look for layout_results directories in current folder
    current_dir = Path(".")
    layout_dirs = list(current_dir.glob("layout_results/run_*"))
    
    if layout_dirs:
        # Sort by directory name (which contains timestamp) and take the latest
        latest_dir = sorted(layout_dirs, key=lambda x: x.name, reverse=True)[0]
        return latest_dir
    else:
        return None

def validate_layout_directory(layout_dir: Path) -> Dict[str, bool]:
    """Validate that a layout directory contains all required files."""
    
    if not layout_dir.exists():
        return {"directory": False, "images": False, "layout": False, "crops": False, "metadata": False}
    
    page_images_dir = layout_dir / "page_images"
    layout_data_file = layout_dir / "layout_analysis" / "layout_data.json"
    crops_metadata_file = layout_dir / "cropped_regions" / "crop_metadata.json"
    crops_dir = layout_dir / "cropped_regions"
    
    return {
        "directory": layout_dir.exists(),
        "images": page_images_dir.exists() and len(list(page_images_dir.glob("*.png"))) > 0,
        "layout": layout_data_file.exists(),
        "metadata": crops_metadata_file.exists(),
        "crops": crops_dir.exists() and any(crops_dir.iterdir())
    }

# Try to auto-detect the latest layout output
print("üîç Auto-detecting layout output directory...")
auto_detected_dir = find_latest_layout_output()

if auto_detected_dir:
    print(f"  üìÅ Found: {auto_detected_dir}")
    validation = validate_layout_directory(auto_detected_dir)
    
    if all(validation.values()):
        print("  ‚úÖ All required files found - using auto-detected directory")
        LAYOUT_OUTPUT_DIR = auto_detected_dir
    else:
        print("  ‚ö†Ô∏è Auto-detected directory is incomplete:")
        for check, status in validation.items():
            print(f"    {check}: {'‚úÖ' if status else '‚ùå'}")
        LAYOUT_OUTPUT_DIR = auto_detected_dir  # Use it anyway, user can see what's missing
else:
    print("  ‚ùå No layout_results directories found")
    print()
    print("üîß MANUAL CONFIGURATION REQUIRED:")
    print("   1. Run the Layout Detection notebook first, OR")
    print("   2. Update LAYOUT_OUTPUT_DIR below with the correct path")
    print()
    
    # Fallback to manual configuration
    LAYOUT_OUTPUT_DIR = Path("layout_results/run_E-Invoice Format_1234567890")  # <-- UPDATE THIS PATH!

print(f"\nüìÇ Using Layout Output Directory: {LAYOUT_OUTPUT_DIR}")

# Derived paths (automatically configured based on LAYOUT_OUTPUT_DIR)
PAGE_IMAGES_DIR = LAYOUT_OUTPUT_DIR / "page_images"
LAYOUT_DATA_FILE = LAYOUT_OUTPUT_DIR / "layout_analysis" / "layout_data.json"
CROPS_METADATA_FILE = LAYOUT_OUTPUT_DIR / "cropped_regions" / "crop_metadata.json"
CROPS_DIR = LAYOUT_OUTPUT_DIR / "cropped_regions"

# OCR output directory
OCR_OUTPUT_DIR = LAYOUT_OUTPUT_DIR / "ocr_results"
OCR_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ‚ö° SAGEMAKER g5.12xlarge PERFORMANCE SETTINGS
# ============================================

# Performance mode selection
FAST_MODE = False  # False = High Quality (recommended for SageMaker with 96GB GPU memory)
PROCESS_FULL_PAGES = True  # Set to False to skip full page OCR and focus only on crops
PROCESS_CROPS = True       # Set to False to skip crop OCR and focus only on full pages

# Token limits based on mode
if FAST_MODE:
    print("‚ö° FAST MODE - Optimized for speed")
    PAGE_OCR_TOKENS = 1536    # Moderate tokens for pages
    CROP_OCR_TOKENS = 768     # Moderate tokens for crops
else:
    print("üéØ QUALITY MODE - Optimized for SageMaker g5.12xlarge (96GB GPU memory)")
    PAGE_OCR_TOKENS = 4096    # High tokens for comprehensive extraction
    CROP_OCR_TOKENS = 2048    # High tokens for detailed crop analysis

print(f"üìä OCR Settings:")
print(f"  Token limits - Pages: {PAGE_OCR_TOKENS}, Crops: {CROP_OCR_TOKENS}")
print(f"  Process full pages: {'‚úÖ Yes' if PROCESS_FULL_PAGES else '‚ùå No'}")
print(f"  Process crops: {'‚úÖ Yes' if PROCESS_CROPS else '‚ùå No'}")

# Verify SageMaker GPU configuration
if torch.cuda.is_available():
    gpu_count = torch.cuda.device_count()
    print(f"\nüñ•Ô∏è GPU Configuration:")
    print(f"  GPUs detected: {gpu_count}/4 expected A10G GPUs")
    
    for i in range(min(gpu_count, 4)):
        props = torch.cuda.get_device_properties(i)
        free_mem, total_mem = torch.cuda.mem_get_info(i)
        print(f"  GPU {i}: {props.name} ({total_mem/1024**3:.1f}GB total, {free_mem/1024**3:.1f}GB free)")
    
    if gpu_count == 4:
        print("‚úÖ Optimal SageMaker configuration detected!")
    else:
        print(f"‚ö†Ô∏è Expected 4 GPUs, found {gpu_count} - performance may be reduced")
else:
    print("‚ùå No GPUs detected - this notebook requires CUDA-capable GPUs")

print(f"\nüìÅ Input Data Validation:")
print(f"  Layout output directory: {LAYOUT_OUTPUT_DIR}")
print(f"  Directory exists: {'‚úÖ Yes' if LAYOUT_OUTPUT_DIR.exists() else '‚ùå No'}")

if LAYOUT_OUTPUT_DIR.exists():
    png_count = len(list(PAGE_IMAGES_DIR.glob('*.png'))) if PAGE_IMAGES_DIR.exists() else 0
    print(f"  Page images: {'‚úÖ Found' if png_count > 0 else '‚ùå Missing'} ({png_count} files)")
    print(f"  Layout data: {'‚úÖ Found' if LAYOUT_DATA_FILE.exists() else '‚ùå Missing'}")
    print(f"  Crop metadata: {'‚úÖ Found' if CROPS_METADATA_FILE.exists() else '‚ùå Missing'}")
    print(f"  Crops directory: {'‚úÖ Found' if CROPS_DIR.exists() and any(CROPS_DIR.iterdir()) else '‚ùå Missing'}")
    
    # Show specific file paths for debugging
    if not LAYOUT_DATA_FILE.exists():
        print(f"    Missing: {LAYOUT_DATA_FILE}")
    if not CROPS_METADATA_FILE.exists():
        print(f"    Missing: {CROPS_METADATA_FILE}")
        
else:
    print("\n‚ùå ERROR: Layout output directory not found!")
    print("Please run the Layout Detection notebook first, or check available directories:")
    
    # Show available layout_results directories
    layout_dirs = list(Path(".").glob("layout_results/run_*"))
    if layout_dirs:
        print("  üìÅ Available layout_results directories:")
        for dir_path in sorted(layout_dirs, reverse=True):
            print(f"    {dir_path}")
        print(f"\n  üí° To use a specific directory, update:")
        print(f"     LAYOUT_OUTPUT_DIR = Path('{sorted(layout_dirs, reverse=True)[0]}')")
    else:
        print("  üìÅ No layout_results directories found in current directory")
        print("  üí° Make sure you've run the Layout Detection notebook first")
    
print(f"\nüìÇ OCR Output Directory: {OCR_OUTPUT_DIR}")

  import pynvml  # type: ignore[import]


üöÄ SageMaker g5.12xlarge OCR Pipeline Initialization
üéØ Target: 4x NVIDIA A10G GPUs (24GB each = 96GB total)
üîç Auto-detecting layout output directory...
  üìÅ Found: layout_results/run_E-Invoice Format_1760952745
  ‚úÖ All required files found - using auto-detected directory

üìÇ Using Layout Output Directory: layout_results/run_E-Invoice Format_1760952745
üéØ QUALITY MODE - Optimized for SageMaker g5.12xlarge (96GB GPU memory)
üìä OCR Settings:
  Token limits - Pages: 4096, Crops: 2048
  Process full pages: ‚úÖ Yes
  Process crops: ‚úÖ Yes

üñ•Ô∏è GPU Configuration:
  GPUs detected: 4/4 expected A10G GPUs
  GPU 0: NVIDIA A10G (22.1GB total, 20.8GB free)
  GPU 1: NVIDIA A10G (22.1GB total, 21.8GB free)
  GPU 2: NVIDIA A10G (22.1GB total, 21.8GB free)
  GPU 3: NVIDIA A10G (22.1GB total, 21.8GB free)
‚úÖ Optimal SageMaker configuration detected!

üìÅ Input Data Validation:
  Layout output directory: layout_results/run_E-Invoice Format_1760952745
  Directory exists: ‚úÖ Yes
 

In [2]:
def load_layout_data() -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Load layout detection results and crop metadata."""
    
    layout_data = None
    crop_metadata = None
    
    # Load layout data
    if LAYOUT_DATA_FILE.exists():
        try:
            with open(LAYOUT_DATA_FILE, "r", encoding="utf-8") as f:
                layout_data = json.load(f)
            logger.info(f"‚úÖ Loaded layout data: {layout_data['total_pages']} pages, {layout_data['element_statistics']['total_elements']} elements")
        except Exception as e:
            logger.error(f"Failed to load layout data: {e}")
    else:
        logger.error(f"Layout data file not found: {LAYOUT_DATA_FILE}")
    
    # Load crop metadata
    if CROPS_METADATA_FILE.exists():
        try:
            with open(CROPS_METADATA_FILE, "r", encoding="utf-8") as f:
                crop_metadata = json.load(f)
            logger.info(f"‚úÖ Loaded crop metadata: {crop_metadata['total_crops']} crops")
        except Exception as e:
            logger.error(f"Failed to load crop metadata: {e}")
    else:
        logger.error(f"Crop metadata file not found: {CROPS_METADATA_FILE}")
    
    return layout_data, crop_metadata

def verify_input_files() -> bool:
    """Verify all required input files are available."""
    
    missing_files = []
    
    if not LAYOUT_OUTPUT_DIR.exists():
        missing_files.append(f"Layout output directory: {LAYOUT_OUTPUT_DIR}")
    
    if not PAGE_IMAGES_DIR.exists() or len(list(PAGE_IMAGES_DIR.glob("*.png"))) == 0:
        missing_files.append(f"Page images: {PAGE_IMAGES_DIR}")
    
    if not LAYOUT_DATA_FILE.exists():
        missing_files.append(f"Layout data: {LAYOUT_DATA_FILE}")
    
    if not CROPS_METADATA_FILE.exists():
        missing_files.append(f"Crop metadata: {CROPS_METADATA_FILE}")
    
    if not CROPS_DIR.exists():
        missing_files.append(f"Crops directory: {CROPS_DIR}")
    
    if missing_files:
        print("‚ùå Missing required input files:")
        for file in missing_files:
            print(f"    {file}")
        print("\nüí° Please run the Layout Detection notebook first to generate these files.")
        return False
    else:
        print("‚úÖ All required input files found!")
        return True

# Load and verify input data
if LAYOUT_OUTPUT_DIR.exists():
    print("üîÑ Loading input data from Layout Detection notebook...")
    
    # Verify files
    files_ok = verify_input_files()
    
    if files_ok:
        # Load data
        layout_data, crop_metadata = load_layout_data()
        
        if layout_data and crop_metadata:
            print(f"\nüìä Input Data Summary:")
            print(f"  üìÑ Total pages: {layout_data['total_pages']}")
            print(f"  üîç Layout elements: {layout_data['element_statistics']['total_elements']}")
            print(f"  ‚úÇÔ∏è Cropped regions: {crop_metadata['total_crops']}")
            
            print(f"\nüéØ Element types available:")
            for elem_type, count in sorted(layout_data['element_statistics']['by_type'].items()):
                print(f"    {elem_type.title()}: {count}")
                
            # Calculate processing estimates
            total_operations = 0
            if PROCESS_FULL_PAGES:
                total_operations += layout_data['total_pages']
            if PROCESS_CROPS:
                total_operations += crop_metadata['total_crops']
                
            print(f"\n‚è±Ô∏è Processing Estimate:")
            print(f"  Total OCR operations: {total_operations}")
            print(f"  Estimated time (g5.12xlarge): {total_operations * 0.5:.1f}-{total_operations * 1.0:.1f} minutes")
            
        else:
            print("‚ùå Failed to load input data")
            
    else:
        print("‚ùå Cannot proceed without required input files")
        
else:
    print("‚ùå Layout output directory not found. Please update LAYOUT_OUTPUT_DIR in cell above.")

2025-10-20 09:34:59,771 - INFO - ‚úÖ Loaded layout data: 1 pages, 17 elements
2025-10-20 09:34:59,772 - INFO - ‚úÖ Loaded crop metadata: 17 crops


üîÑ Loading input data from Layout Detection notebook...
‚úÖ All required input files found!

üìä Input Data Summary:
  üìÑ Total pages: 1
  üîç Layout elements: 17
  ‚úÇÔ∏è Cropped regions: 17

üéØ Element types available:
    Key_Value_Region: 4
    List_Item: 3
    Page_Footer: 1
    Page_Header: 1
    Picture: 3
    Section_Header: 5

‚è±Ô∏è Processing Estimate:
  Total OCR operations: 18
  Estimated time (g5.12xlarge): 9.0-18.0 minutes


## 2. Initialize SageMaker-Optimized Nanonets OCR Engine

In [3]:
# Install dependencies if needed
# !pip install --upgrade pip
# !pip install "transformers>=4.41" "accelerate>=0.33" torch torchvision
# !pip install pillow tqdm
# !pip install flash-attn --no-build-isolation  # Optional for better performance

print("üì¶ Loading OCR dependencies...")

import os
import gc
import importlib.util
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
from tqdm.auto import tqdm

print("‚úÖ Dependencies loaded!")

# SageMaker g5.12xlarge optimization settings
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True,max_split_size_mb:512")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")  # Avoid tokenizer warnings
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "0")  # Enable async GPU operations
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")  # Ensure all 4 GPUs are visible

def free_gpu_memory():
    """Clean up GPU memory across all devices."""
    try:
        if torch.cuda.is_available():
            for i in range(torch.cuda.device_count()):
                with torch.cuda.device(i):
                    torch.cuda.empty_cache()
                    torch.cuda.ipc_collect()
    except Exception:
        pass
    gc.collect()

def detect_attention_implementation() -> str:
    """Detect optimal attention implementation for A10G GPUs."""
    if torch.cuda.is_available():
        # Check for flash attention (optimal for A10G)
        if importlib.util.find_spec("flash_attn"):
            return "flash_attention_2"
        else:
            return "sdpa"  # Scaled Dot Product Attention
    else:
        return "eager"

def get_optimal_dtype():
    """Get optimal dtype for A10G GPUs."""
    if torch.cuda.is_available():
        # A10G supports bfloat16 - better numerical stability than fp16
        if torch.cuda.is_bf16_supported():
            return torch.bfloat16
        else:
            return torch.float16
    else:
        return torch.float32

def build_sagemaker_memory_map(headroom: float = 0.80) -> Dict[int, int]:
    """Build optimized memory map for SageMaker g5.12xlarge."""
    if not torch.cuda.is_available():
        return None
    
    gpu_count = torch.cuda.device_count()
    print(f"üñ•Ô∏è Configuring memory for {gpu_count} GPUs...")
    
    memory_map = {}
    for gpu_id in range(gpu_count):
        try:
            props = torch.cuda.get_device_properties(gpu_id)
            free_bytes, total_bytes = torch.cuda.mem_get_info(gpu_id)
            
            # Use aggressive memory allocation for large models
            usable_bytes = int(free_bytes * headroom)
            memory_map[gpu_id] = usable_bytes
            
            print(f"  GPU {gpu_id}: {props.name} - {usable_bytes/1024**3:.1f}GB allocated")
            
        except Exception as e:
            print(f"  ‚ö†Ô∏è GPU {gpu_id} memory info failed: {e}")
            # Fallback for A10G: assume 24GB with 80% usage
            memory_map[gpu_id] = int(19.2 * 1024**3)  # 19.2GB
    
    return memory_map

class SageMakerNanonetsOCR:
    """SageMaker g5.12xlarge optimized Nanonets OCR engine."""
    
    def __init__(self, model_path: str = "nanonets/Nanonets-OCR-s"):
        self.model_path = model_path
        
        print("üöÄ Initializing SageMaker-optimized Nanonets OCR...")
        print("üéØ Targeting 4x NVIDIA A10G GPUs (96GB total)")
        
        # Clean memory before loading
        free_gpu_memory()
        
        # Configure optimal settings
        self.dtype = get_optimal_dtype()
        self.attention_impl = detect_attention_implementation()
        self.memory_map = build_sagemaker_memory_map(headroom=0.80)
        
        print(f"\nüìä Optimization Settings:")
        print(f"  Data type: {self.dtype}")
        print(f"  Attention: {self.attention_impl}")
        print(f"  Multi-GPU distribution: Balanced across {len(self.memory_map)} GPUs")
        
        # Load model with SageMaker optimization
        try:
            print(f"\nüîÑ Loading {model_path} with multi-GPU distribution...")
            self.model = AutoModelForImageTextToText.from_pretrained(
                model_path,
                torch_dtype=self.dtype,
                device_map="auto",  # Automatic distribution across all GPUs
                max_memory=self.memory_map,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
                offload_folder="./model_offload",
                offload_state_dict=True
            ).eval()
            
            # Configure attention and caching
            if hasattr(self.model, 'config'):
                self.model.config.attn_implementation = self.attention_impl
                if hasattr(self.model.config, 'use_cache'):
                    self.model.config.use_cache = True
                    
        except Exception as e:
            print(f"‚ùå Error loading model: {e}")
            print("üí° Trying fallback configuration...")
            self.model = AutoModelForImageTextToText.from_pretrained(
                model_path,
                torch_dtype=self.dtype,
                device_map="balanced",
                trust_remote_code=True,
                low_cpu_mem_usage=True
            ).eval()
        
        # Load tokenizer and processor
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        
        # Configure padding token
        if hasattr(self.model, "generation_config"):
            if getattr(self.model.generation_config, "pad_token_id", None) is None:
                self.model.generation_config.pad_token_id = (
                    getattr(self.tokenizer, "pad_token_id", None) or
                    getattr(self.tokenizer, "eos_token_id", None)
                )
        
        print("‚úÖ SageMaker OCR Engine ready!")
        self._display_device_mapping()
        self._display_memory_usage()
    
    def _display_device_mapping(self):
        """Display model device mapping."""
        if hasattr(self.model, 'hf_device_map'):
            print(f"\nüìç Multi-GPU Device Mapping:")
            device_map = dict(self.model.hf_device_map.items())
            for layer_name, device in list(device_map.items())[:6]:
                print(f"    {layer_name}: GPU {device}")
            if len(device_map) > 6:
                print(f"    ... and {len(device_map) - 6} more layers distributed across GPUs")
    
    def _display_memory_usage(self):
        """Display GPU memory usage after model loading."""
        if torch.cuda.is_available():
            print(f"\nüíæ GPU Memory Usage After Model Loading:")
            total_used = 0
            for i in range(torch.cuda.device_count()):
                free_mem, total_mem = torch.cuda.mem_get_info(i)
                used_mem = total_mem - free_mem
                total_used += used_mem
                efficiency = (used_mem / total_mem) * 100
                status = "üü¢" if efficiency > 70 else "üü°" if efficiency > 40 else "üî¥"
                print(f"    {status} GPU {i}: {efficiency:.1f}% ({used_mem/1024**3:.1f}GB/{total_mem/1024**3:.1f}GB)")
            
            overall_efficiency = (total_used / (total_mem * torch.cuda.device_count())) * 100
            print(f"  üéØ Overall Efficiency: {overall_efficiency:.1f}% ({total_used/1024**3:.1f}GB/96GB)")
    
    def ocr_image(self, image_path: Path, max_tokens: int = 2048, 
                  use_cache: bool = True, retry_on_oom: bool = True) -> str:
        """Perform OCR with SageMaker g5.12xlarge optimization."""
        
        # Optimized prompt for document extraction
        prompt = (
            "Extract the text from the above document as if you were reading it naturally. "
            "Return the tables in HTML format. Return the equations in LaTeX representation. "
            "If there is an image in the document and image caption is not present, add a small "
            "description of the image inside the <img></img> tag; otherwise, add the image caption "
            "inside <img></img>. Watermarks should be wrapped in brackets. Ex: "
            "<watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. "
            "Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. "
            "Prefer using ‚òê and ‚òë for check boxes."
        )
        
        # Load and prepare image
        image = Image.open(image_path).convert("RGB")
        
        # Prepare messages
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image", "image": f"file://{image_path}"},
                {"type": "text", "text": prompt},
            ]},
        ]
        
        # Apply chat template
        try:
            text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        except AttributeError:
            text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Process inputs and move to first GPU
        inputs = self.processor(text=[text], images=[image], padding=True, return_tensors="pt")
        if torch.cuda.is_available():
            inputs = {k: v.cuda(0) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
        
        def _generate_text(tokens: int) -> str:
            with torch.inference_mode():
                output = self.model.generate(
                    **inputs,
                    max_new_tokens=tokens,
                    do_sample=False,  # Deterministic for consistency
                    num_beams=1,      # Fast greedy decoding
                    use_cache=use_cache,
                    pad_token_id=self.model.generation_config.pad_token_id,
                    early_stopping=True,
                    repetition_penalty=1.05  # Slight penalty to avoid repetition
                )
            
            # Extract only generated tokens
            generated_ids = [o[i.shape[-1]:] for i, o in zip(inputs["input_ids"], output)]
            result = self.processor.batch_decode(
                generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
            )[0]
            return result
        
        try:
            result = _generate_text(max_tokens)
        except RuntimeError as e:
            if retry_on_oom and "memory" in str(e).lower():
                print(f"  ‚ö†Ô∏è OOM at {max_tokens} tokens, retrying with {max_tokens // 2}")
                free_gpu_memory()
                result = _generate_text(max_tokens // 2)
            else:
                raise
        
        # Cleanup
        del image, inputs
        torch.cuda.empty_cache() if torch.cuda.is_available() else None
        
        return result.strip()

# Initialize OCR engine
print("üöÄ Loading Nanonets OCR for SageMaker g5.12xlarge...")

try:
    ocr_engine = SageMakerNanonetsOCR("nanonets/Nanonets-OCR-s")
    print("\nüéâ SageMaker OCR Engine loaded successfully!")
    
except Exception as e:
    print(f"‚ùå Failed to load OCR engine: {e}")
    import traceback
    traceback.print_exc()
    ocr_engine = None

üì¶ Loading OCR dependencies...


2025-10-20 09:35:01,191 - INFO - Note: NumExpr detected 48 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-10-20 09:35:01,192 - INFO - NumExpr defaulting to 16 threads.


‚úÖ Dependencies loaded!
üöÄ Loading Nanonets OCR for SageMaker g5.12xlarge...
üöÄ Initializing SageMaker-optimized Nanonets OCR...
üéØ Targeting 4x NVIDIA A10G GPUs (96GB total)
üñ•Ô∏è Configuring memory for 4 GPUs...
  GPU 0: NVIDIA A10G - 16.7GB allocated
  GPU 1: NVIDIA A10G - 17.5GB allocated
  GPU 2: NVIDIA A10G - 17.5GB allocated
  GPU 3: NVIDIA A10G - 17.5GB allocated

üìä Optimization Settings:
  Data type: torch.bfloat16
  Attention: sdpa
  Multi-GPU distribution: Balanced across 4 GPUs

üîÑ Loading nanonets/Nanonets-OCR-s with multi-GPU distribution...


config.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.51G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/214 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

preprocessor_config.json:   0%|          | 0.00/575 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


video_preprocessor_config.json: 0.00B [00:00, ?B/s]

‚úÖ SageMaker OCR Engine ready!

üìç Multi-GPU Device Mapping:
    model.visual: GPU 0
    model.language_model.embed_tokens: GPU 1
    lm_head: GPU 1
    model.language_model.layers.0: GPU 1
    model.language_model.layers.1: GPU 1
    model.language_model.layers.2: GPU 1
    ... and 35 more layers distributed across GPUs

üíæ GPU Memory Usage After Model Loading:
    üî¥ GPU 0: 11.2% (2.5GB/22.1GB)
    üî¥ GPU 1: 12.3% (2.7GB/22.1GB)
    üî¥ GPU 2: 9.7% (2.1GB/22.1GB)
    üî¥ GPU 3: 10.4% (2.3GB/22.1GB)
  üéØ Overall Efficiency: 10.9% (9.6GB/96GB)

üéâ SageMaker OCR Engine loaded successfully!


## 3. Run Complete OCR Processing Pipeline

In [4]:
def safe_ocr_with_retry(image_path: Path, max_tokens: int, context: str) -> str:
    """Perform OCR with error handling and retry logic."""
    if not ocr_engine:
        return "‚ùå OCR engine not available"
    
    try:
        result = ocr_engine.ocr_image(image_path, max_tokens=max_tokens)
        return result if result.strip() else "_(no text extracted)_"
    except Exception as e:
        error_msg = str(e)
        print(f"  ‚ùå Error in {context}: {error_msg[:100]}...")
        
        # Try with reduced tokens as fallback
        if "memory" in error_msg.lower() and max_tokens > 256:
            try:
                reduced_tokens = max_tokens // 3
                print(f"  üîÑ Retrying {context} with {reduced_tokens} tokens...")
                result = ocr_engine.ocr_image(image_path, max_tokens=reduced_tokens)
                return result if result.strip() else "_(no text extracted)_"
            except Exception:
                pass
        
        return f"‚ùå OCR failed: {error_msg[:100]}..."

def run_sagemaker_ocr_pipeline(layout_data_dict=None, crop_metadata_dict=None) -> Dict[str, Any]:
    """Run complete OCR pipeline optimized for SageMaker g5.12xlarge."""
    
    if not ocr_engine:
        print("‚ùå Cannot run OCR - engine not loaded")
        return {}
    
    # Use passed parameters or try to get from globals
    if layout_data_dict is None:
        layout_data_dict = globals().get('layout_data', None)
    if crop_metadata_dict is None:
        crop_metadata_dict = globals().get('crop_metadata', None)
    
    if not layout_data_dict or not crop_metadata_dict:
        print("‚ùå Cannot run OCR - input data not loaded")
        print(f"  layout_data available: {'‚úÖ Yes' if layout_data_dict else '‚ùå No'}")
        print(f"  crop_metadata available: {'‚úÖ Yes' if crop_metadata_dict else '‚ùå No'}")
        print("  üí° Please run the data loading cell (cell 2) first")
        return {}
    
    print("üöÄ Starting SageMaker g5.12xlarge OCR Pipeline...")
    print(f"üéØ Hardware: 4x NVIDIA A10G GPUs (96GB total)")
    print(f"üìä Settings: Pages={PAGE_OCR_TOKENS} tokens, Crops={CROP_OCR_TOKENS} tokens")
    
    # Initialize results structure
    ocr_results = {
        "pages": [],
        "metadata": {
            "total_pages": 0,
            "total_crops": 0,
            "processing_time": 0,
            "sagemaker_instance": "g5.12xlarge",
            "gpu_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
            "performance_mode": "FAST" if FAST_MODE else "QUALITY",
            "page_tokens": PAGE_OCR_TOKENS,
            "crop_tokens": CROP_OCR_TOKENS,
            "process_full_pages": PROCESS_FULL_PAGES,
            "process_crops": PROCESS_CROPS,
            "processing_timestamp": time.strftime('%Y-%m-%d %H:%M:%S')
        }
    }
    
    start_time = time.time()
    
    # Calculate total operations
    total_operations = 0
    if PROCESS_FULL_PAGES:
        total_operations += layout_data_dict['total_pages']
    if PROCESS_CROPS:
        total_operations += crop_metadata_dict['total_crops']
    
    print(f"üìä Processing Plan:")
    print(f"  Total pages: {layout_data_dict['total_pages']}")
    print(f"  Total crops: {crop_metadata_dict['total_crops']}")
    print(f"  Total OCR operations: {total_operations}")
    
    # Initialize progress tracking
    with tqdm(total=total_operations, desc="SageMaker OCR", unit="ops",
              bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}]') as pbar:
        
        # Process each page
        for page_num in range(1, layout_data_dict['total_pages'] + 1):
            page_start_time = time.time()
            print(f"\nüìÑ Processing Page {page_num}/{layout_data_dict['total_pages']}")
            
            # Initialize page result
            page_result = {
                "page_number": page_num,
                "page_image": None,
                "full_page_text": None,
                "crops": [],
                "processing_time": 0
            }
            
            # OCR full page if enabled
            if PROCESS_FULL_PAGES:
                page_image_path = PAGE_IMAGES_DIR / f"page_{page_num:03d}.png"
                
                if page_image_path.exists():
                    page_result["page_image"] = str(page_image_path.relative_to(LAYOUT_OUTPUT_DIR))
                    print(f"  üîç OCR full page ({PAGE_OCR_TOKENS} max tokens)...")
                    
                    page_text = safe_ocr_with_retry(
                        page_image_path,
                        max_tokens=PAGE_OCR_TOKENS,
                        context=f"Page {page_num} full"
                    )
                    page_result["full_page_text"] = page_text
                else:
                    print(f"  ‚ö†Ô∏è Page image not found: {page_image_path}")
                    page_result["full_page_text"] = "‚ùå Page image not found"
                
                pbar.update(1)
            
            # OCR crops if enabled
            if PROCESS_CROPS and str(page_num) in crop_metadata_dict.get("pages", {}):
                page_crops = crop_metadata_dict["pages"][str(page_num)]["crops"]
                print(f"  üîç OCR {len(page_crops)} cropped regions ({CROP_OCR_TOKENS} max tokens each)...")
                
                for crop_idx, crop_info in enumerate(page_crops):
                    crop_path = LAYOUT_OUTPUT_DIR / crop_info["crop_path"]
                    
                    if crop_path.exists():
                        crop_text = safe_ocr_with_retry(
                            crop_path,
                            max_tokens=CROP_OCR_TOKENS,
                            context=f"Page {page_num} {crop_info['type']} {crop_idx+1}"
                        )
                        
                        crop_result = {
                            "crop_index": crop_idx + 1,
                            "element_id": crop_info["element_id"],
                            "element_type": crop_info["type"],
                            "confidence": crop_info["confidence"],
                            "crop_path": crop_info["crop_path"],
                            "ocr_text": crop_text,
                            "bbox": crop_info.get("image_bbox", [])
                        }
                        page_result["crops"].append(crop_result)
                    
                    pbar.update(1)
            
            # Calculate page processing time
            page_result["processing_time"] = time.time() - page_start_time
            ocr_results["pages"].append(page_result)
            
            # Progress update
            avg_time_per_page = (time.time() - start_time) / page_num
            remaining_pages = layout_data_dict['total_pages'] - page_num
            eta_minutes = (remaining_pages * avg_time_per_page) / 60
            
            print(f"  ‚úÖ Page {page_num} complete ({len(page_result['crops'])} crops, {page_result['processing_time']:.1f}s)")
            if remaining_pages > 0:
                print(f"  ‚è±Ô∏è ETA for remaining pages: {eta_minutes:.1f} minutes")
    
    # Calculate final metrics
    total_time = time.time() - start_time
    ocr_results["metadata"]["total_pages"] = layout_data_dict['total_pages']
    ocr_results["metadata"]["total_crops"] = sum(len(page["crops"]) for page in ocr_results["pages"])
    ocr_results["metadata"]["processing_time"] = total_time
    
    # Performance metrics
    ops_per_second = total_operations / total_time if total_time > 0 else 0
    
    print(f"\nüéâ SageMaker OCR Pipeline Complete!")
    print(f"  ‚è±Ô∏è Total time: {total_time:.1f} seconds ({total_time/60:.1f} minutes)")
    print(f"  üìä Pages processed: {ocr_results['metadata']['total_pages']}")
    print(f"  üìä Crops processed: {ocr_results['metadata']['total_crops']}")
    print(f"  ‚ö° Performance: {ops_per_second:.2f} operations/second")
    print(f"  üöÄ SageMaker g5.12xlarge optimization delivered!")
    
    return ocr_results

# Run the OCR pipeline with improved variable access
print("üöÄ Starting SageMaker OCR processing...")

# Check what data is available
layout_data_available = 'layout_data' in globals() and globals()['layout_data'] is not None
crop_metadata_available = 'crop_metadata' in globals() and globals()['crop_metadata'] is not None
ocr_engine_available = 'ocr_engine' in globals() and globals()['ocr_engine'] is not None

print(f"üìã Prerequisites Check:")
print(f"  Layout data: {'‚úÖ Available' if layout_data_available else '‚ùå Missing'}")
print(f"  Crop metadata: {'‚úÖ Available' if crop_metadata_available else '‚ùå Missing'}")
print(f"  OCR engine: {'‚úÖ Available' if ocr_engine_available else '‚ùå Missing'}")

if layout_data_available and crop_metadata_available and ocr_engine_available:
    # Display pre-processing GPU memory
    if torch.cuda.is_available():
        print(f"\nüíæ Pre-processing GPU Memory:")
        for i in range(torch.cuda.device_count()):
            free_mem, total_mem = torch.cuda.mem_get_info(i)
            used_mem = total_mem - free_mem
            print(f"  GPU {i}: {used_mem/1024**3:.1f}GB / {total_mem/1024**3:.1f}GB")
    
    # Run OCR pipeline with explicit data passing
    ocr_results = run_sagemaker_ocr_pipeline(
        layout_data_dict=globals()['layout_data'], 
        crop_metadata_dict=globals()['crop_metadata']
    )
    
    if ocr_results:
        # Save results
        results_path = OCR_OUTPUT_DIR / "sagemaker_ocr_results.json"
        with open(results_path, "w", encoding="utf-8") as f:
            json.dump(ocr_results, f, indent=2, ensure_ascii=False)
        
        print(f"\nüìÅ Results saved: {results_path}")
        
        # Performance summary
        total_time = ocr_results["metadata"]["processing_time"]
        total_ops = ocr_results["metadata"]["total_pages"] + ocr_results["metadata"]["total_crops"]
        
        print(f"\nüéØ Final Performance Summary:")
        print(f"  Instance: SageMaker g5.12xlarge (4x A10G)")
        print(f"  Total operations: {total_ops}")
        print(f"  Processing time: {total_time:.1f}s ({total_time/60:.1f} min)")
        print(f"  Average speed: {total_ops/total_time:.2f} ops/sec")
        print(f"  Multi-GPU efficiency: Optimal distribution across 4 GPUs")
        
else:
    print("\n‚ùå Cannot run OCR pipeline - missing requirements")
    
    if not layout_data_available or not crop_metadata_available:
        print("üîß To fix data loading issues:")
        print("   1. Make sure you've run cell 2 (Configuration & Input Data Loading)")
        print("   2. Ensure the layout detection notebook generated the required files")
        print("   3. Check that LAYOUT_OUTPUT_DIR points to the correct directory")
    
    if not ocr_engine_available:
        print("üîß To fix OCR engine issues:")
        print("   1. Run cell 4 (Initialize SageMaker-Optimized Nanonets OCR Engine)")
        print("   2. Check GPU availability and memory")
        print("   3. Install required dependencies if needed")
    
    ocr_results = {}

üöÄ Starting SageMaker OCR processing...
üìã Prerequisites Check:
  Layout data: ‚úÖ Available
  Crop metadata: ‚úÖ Available
  OCR engine: ‚úÖ Available

üíæ Pre-processing GPU Memory:
  GPU 0: 2.5GB / 22.1GB
  GPU 1: 2.7GB / 22.1GB
  GPU 2: 2.1GB / 22.1GB
  GPU 3: 2.3GB / 22.1GB
üöÄ Starting SageMaker g5.12xlarge OCR Pipeline...
üéØ Hardware: 4x NVIDIA A10G GPUs (96GB total)
üìä Settings: Pages=4096 tokens, Crops=2048 tokens
üìä Processing Plan:
  Total pages: 1
  Total crops: 17
  Total OCR operations: 18


SageMaker OCR:   0%|          | 0/18 [00:00<?, ?ops/s]


üìÑ Processing Page 1/1
  üîç OCR full page (4096 max tokens)...


The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


  üîç OCR 17 cropped regions (2048 max tokens each)...


The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'early_

  ‚úÖ Page 1 complete (17 crops, 195.9s)

üéâ SageMaker OCR Pipeline Complete!
  ‚è±Ô∏è Total time: 195.9 seconds (3.3 minutes)
  üìä Pages processed: 1
  üìä Crops processed: 17
  ‚ö° Performance: 0.09 operations/second
  üöÄ SageMaker g5.12xlarge optimization delivered!

üìÅ Results saved: layout_results/run_E-Invoice Format_1760952745/ocr_results/sagemaker_ocr_results.json

üéØ Final Performance Summary:
  Instance: SageMaker g5.12xlarge (4x A10G)
  Total operations: 18
  Processing time: 195.9s (3.3 min)
  Average speed: 0.09 ops/sec
  Multi-GPU efficiency: Optimal distribution across 4 GPUs


## 4. Generate Final Output & Documentation

In [5]:
def generate_markdown_output(ocr_results: Dict[str, Any]) -> Path:
    """Generate comprehensive Markdown output from OCR results."""
    
    if not ocr_results.get("pages"):
        print("‚ùå No OCR results to convert to Markdown")
        return None
    
    markdown_lines = []
    
    # Header with metadata
    metadata = ocr_results["metadata"]
    markdown_lines.extend([
        f"# Document OCR Results - SageMaker g5.12xlarge Processing",
        f"",
        f"**Processed with:** SageMaker {metadata.get('sagemaker_instance', 'g5.12xlarge')} ({metadata.get('gpu_count', 4)}x A10G GPUs)  ",
        f"**Processing Time:** {metadata['processing_time']:.1f} seconds ({metadata['processing_time']/60:.1f} minutes)  ",
        f"**Performance Mode:** {metadata['performance_mode']}  ",
        f"**Pages:** {metadata['total_pages']} | **Crops:** {metadata['total_crops']}  ",
        f"**Token Limits:** Pages={metadata['page_tokens']}, Crops={metadata['crop_tokens']}  ",
        f"**Timestamp:** {metadata.get('processing_timestamp', 'N/A')}  ",
        f"",
        f"---",
        f""
    ])
    
    # Process each page
    for page_data in ocr_results["pages"]:
        page_num = page_data["page_number"]
        
        markdown_lines.extend([
            f"## Page {page_num}",
            f""
        ])
        
        # Full page content (if processed)
        if metadata.get("process_full_pages", True) and page_data.get("full_page_text"):
            if not page_data["full_page_text"].startswith("‚ùå"):
                markdown_lines.extend([
                    f"### Full Page Content",
                    f"",
                    page_data["full_page_text"],
                    f"",
                    f"---",
                    f""
                ])
        
        # Cropped regions (if processed)
        if metadata.get("process_crops", True) and page_data.get("crops"):
            markdown_lines.extend([
                f"### Extracted Regions ({len(page_data['crops'])} crops)",
                f""
            ])
            
            for crop_data in page_data["crops"]:
                crop_type = crop_data["element_type"].title()
                crop_idx = crop_data["crop_index"]
                confidence = crop_data["confidence"]
                
                # Region header
                markdown_lines.extend([
                    f"#### {crop_type} {crop_idx} (Confidence: {confidence:.2f})",
                    f""
                ])
                
                # Show image reference
                if crop_data.get("crop_path"):
                    markdown_lines.extend([
                        f"**Crop:** `{crop_data['crop_path']}`",
                        f""
                    ])
                
                # OCR content
                if crop_data.get("ocr_text") and not crop_data["ocr_text"].startswith("‚ùå"):
                    markdown_lines.extend([
                        crop_data["ocr_text"],
                        f""
                    ])
                else:
                    markdown_lines.extend([
                        f"_(No text extracted or processing failed)_",
                        f""
                    ])
                
                markdown_lines.extend(["---", ""])
    
    # Performance section
    markdown_lines.extend([
        f"## Processing Performance",
        f"",
        f"### SageMaker Configuration",
        f"- **Instance:** {metadata.get('sagemaker_instance', 'g5.12xlarge')}",
        f"- **GPUs:** {metadata.get('gpu_count', 4)}x NVIDIA A10G (24GB each)",
        f"- **Total GPU Memory:** ~96GB",
        f"- **Performance Mode:** {metadata['performance_mode']}",
        f"",
        f"### Processing Statistics",
        f"- **Total Processing Time:** {metadata['processing_time']:.1f} seconds ({metadata['processing_time']/60:.1f} minutes)",
        f"- **Pages Processed:** {metadata['total_pages']}",
        f"- **Crops Processed:** {metadata['total_crops']}",
        f"- **Average Speed:** {(metadata['total_pages'] + metadata['total_crops'])/metadata['processing_time']:.2f} operations/second",
        f"- **Token Configuration:** Pages={metadata['page_tokens']}, Crops={metadata['crop_tokens']}",
        f""
    ])
    
    # Write markdown file
    markdown_path = OCR_OUTPUT_DIR / "complete_document_extracted.md"
    with open(markdown_path, "w", encoding="utf-8") as f:
        f.write("\n".join(markdown_lines))
    
    print(f"‚úÖ Markdown output saved: {markdown_path}")
    return markdown_path

def create_final_summary(ocr_results_dict=None) -> Dict[str, Any]:
    """Create comprehensive final summary."""
    
    # Use passed parameter or try to get from globals
    if ocr_results_dict is None:
        ocr_results_dict = globals().get('ocr_results', None)
    
    if not ocr_results_dict or not ocr_results_dict.get("pages"):
        print("‚ö†Ô∏è No OCR results available for summary generation")
        return {}
    
    # Safe division to avoid division by zero
    processing_time = ocr_results_dict["metadata"]["processing_time"]
    total_pages = ocr_results_dict["metadata"]["total_pages"]
    total_crops = ocr_results_dict["metadata"]["total_crops"]
    total_operations = total_pages + total_crops
    
    ops_per_second = total_operations / processing_time if processing_time > 0 else 0
    
    summary = {
        "pipeline_complete": True,
        "processing_stages": [
            "PDF to Images (Layout notebook)",
            "Layout Detection (Layout notebook)", 
            "Region Cropping (Layout notebook)",
            "OCR Processing (This notebook)"
        ],
        "sagemaker_performance": {
            "instance_type": "g5.12xlarge",
            "gpu_count": ocr_results_dict["metadata"].get("gpu_count", 4),
            "total_processing_time": processing_time,
            "operations_per_second": ops_per_second,
            "performance_mode": ocr_results_dict["metadata"]["performance_mode"]
        },
        "final_results": {
            "pages_processed": total_pages,
            "crops_processed": total_crops,
            "total_operations": total_operations
        },
        "output_files": {
            "structured_results": "sagemaker_ocr_results.json",
            "markdown_document": "complete_document_extracted.md",
            "processing_summary": "final_processing_summary.json"
        },
        "next_steps": [
            "Review the Markdown document for extracted content",
            "Use the JSON results for programmatic access",
            "Consider post-processing for specific use cases",
            "Archive the processing results"
        ]
    }
    
    return summary

# Generate final outputs
ocr_results_available = 'ocr_results' in globals() and globals()['ocr_results']

if ocr_results_available:
    print("üîÑ Generating final outputs...")
    
    # Get OCR results from global scope
    ocr_results_data = globals()['ocr_results']
    
    # Generate Markdown document
    markdown_path = generate_markdown_output(ocr_results_data)
    
    # Create final summary with explicit data passing
    final_summary = create_final_summary(ocr_results_data)
    
    if final_summary:  # Only proceed if summary was created successfully
        summary_path = OCR_OUTPUT_DIR / "final_processing_summary.json"
        
        with open(summary_path, "w", encoding="utf-8") as f:
            json.dump(final_summary, f, indent=2, ensure_ascii=False)
        
        print(f"‚úÖ Final summary saved: {summary_path}")
        
        print("\nüéâ Complete Pipeline Successfully Finished!")
        print("=" * 70)
        
        print(f"\nüìä Final Results Summary:")
        perf = final_summary["sagemaker_performance"]
        results = final_summary["final_results"]
        
        print(f"  üñ•Ô∏è Hardware: SageMaker {perf['instance_type']} ({perf['gpu_count']}x A10G GPUs)")
        print(f"  üìÑ Pages: {results['pages_processed']} | Crops: {results['crops_processed']}")
        print(f"  ‚è±Ô∏è Processing Time: {perf['total_processing_time']:.1f}s ({perf['total_processing_time']/60:.1f} min)")
        print(f"  ‚ö° Performance: {perf['operations_per_second']:.2f} operations/second")
        print(f"  üéØ Mode: {perf['performance_mode']}")
        
        print(f"\nüìÅ Key Output Files:")
        if markdown_path:
            print(f"  üìÑ Human-readable: {markdown_path.name}")
        print(f"  üìä Structured data: sagemaker_ocr_results.json")
        print(f"  üìã Processing summary: final_processing_summary.json")
        
        print(f"\nüìÇ Complete Output Directory: {OCR_OUTPUT_DIR}")
        
        print(f"\n‚úÖ Both notebooks have successfully completed the end-to-end pipeline!")
        print(f"   1. Layout Detection & Cropping ‚úÖ")
        print(f"   2. SageMaker OCR Processing ‚úÖ")
    else:
        print("‚ùå Failed to create final summary")
        
else:
    print("‚ùå No OCR results available for final output generation")
    print("Please ensure the OCR pipeline ran successfully in the previous cell.")
    print(f"Debug: ocr_results in globals: {'ocr_results' in globals()}")
    if 'ocr_results' in globals():
        ocr_data = globals()['ocr_results']
        print(f"Debug: ocr_results type: {type(ocr_data)}")
        print(f"Debug: ocr_results has pages: {bool(ocr_data.get('pages') if isinstance(ocr_data, dict) else False)}")

üîÑ Generating final outputs...
‚úÖ Markdown output saved: layout_results/run_E-Invoice Format_1760952745/ocr_results/complete_document_extracted.md
‚úÖ Final summary saved: layout_results/run_E-Invoice Format_1760952745/ocr_results/final_processing_summary.json

üéâ Complete Pipeline Successfully Finished!

üìä Final Results Summary:
  üñ•Ô∏è Hardware: SageMaker g5.12xlarge (4x A10G GPUs)
  üìÑ Pages: 1 | Crops: 17
  ‚è±Ô∏è Processing Time: 195.9s (3.3 min)
  ‚ö° Performance: 0.09 operations/second
  üéØ Mode: QUALITY

üìÅ Key Output Files:
  üìÑ Human-readable: complete_document_extracted.md
  üìä Structured data: sagemaker_ocr_results.json
  üìã Processing summary: final_processing_summary.json

üìÇ Complete Output Directory: layout_results/run_E-Invoice Format_1760952745/ocr_results

‚úÖ Both notebooks have successfully completed the end-to-end pipeline!
   1. Layout Detection & Cropping ‚úÖ
   2. SageMaker OCR Processing ‚úÖ
