# Introduction

In this notebook, we use Amazon Textract and Google Vision to provide a quick way of extracting text/tables from an image of a page.

Intended use: The intended use of this notebook is to quickly prototype. You should expect to modify the code in this notebook to suit your usecase.

Preparation: At a minimum, set a working folder, and make sure to add your API keys for both Textract and Google Vision. To do so, please follow the steps outlined here: https://github.com/MikeJGiordano/OCR_History/blob/main/ReadMe.md

This notebook contains four parts:

    1. Unmodified image OCR. This is intended to quickly detect text from a single image.
        a. There is then an option to run one or both OCR tools on a whole folder.
        
    2. Image preprocessing. This routine helps you to quickly preprocess a single image (adjust contrast, split image, etc). 
        a. If you are satisfied with the preprocessing routine, it will give you the option to preprocess a whole folder.
        
    3. Image preprocessing with text extraction. This runs the image modification from part 2 into the text detection from part 1.
    
    4. Image preprocessing with table extraction from Textract. This uses the image modification from part 2 to extract a table using Textract.

# Program Setup

## There are 5 steps, marked A-E.

### A: Import packages

In [3]:
import io
import json
import os

# if you don't have these packages use any package manager to install
# you can install all packages at once using the provided requirements.txt file
import cv2
import boto3
from google.cloud import vision

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tqdm as tq

from PIL import Image, ImageDraw
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures, Direction, DirectionalFinderType
import math 

# note: the following py file, you'll have to download
import preprocess as pp 
import logging
import sys

# Set up logging
logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Disable PIL max image size limit
Image.MAX_IMAGE_PIXELS = None

### B: Please set your working directories here

In [4]:
# please set the path to the folder containing your images here
input_folder = "/mnt/c/Users/WATLINGS/Documents/OCR Files/Census Processing/Documents/1920/Output"
output_folder = "/mnt/c/Users/WATLINGS/Documents/OCR Files/Census Processing/Documents/1920/Output_OCR"

In [5]:
#Authenticate Google Cloud here

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/mnt/c/Users/WATLINGS/Documents/GitHub/OCR_History/OCR_Python/ServiceAccountToken.json'
client = vision.ImageAnnotatorClient()

### E: Please authenticate Amazon Textract

For help with Amazon Textract, see https://github.com/MikeJGiordano/OCR_History/blob/main/Setup_AWS_Root.md

In [None]:
#Authenticate AWS Textract in the console/terminal

# Part 1: Basic text extraction

In [6]:

def resize_image_if_needed(img, max_size=8000, quality=85):
    """
    Resize image if either dimension exceeds max_size while maintaining aspect ratio.
    Added memory-efficient handling of large images.
    """
    width, height = img.size
    logger.info(f"Processing image of size {width}x{height}")
    
    if width > max_size or height > max_size:
        # Calculate new dimensions
        scale = max_size / max(width, height)
        new_width = math.floor(width * scale)
        new_height = math.floor(height * scale)
        
        try:
            logger.info(f"Resizing image from {width}x{height} to {new_width}x{new_height}")
            
            # Use LANCZOS for better quality, but fall back to NEAREST if memory error
            try:
                img = img.resize((new_width, new_height), Image.LANCZOS)
            except MemoryError:
                logger.warning("Memory error with LANCZOS, falling back to NEAREST")
                img = img.resize((new_width, new_height), Image.NEAREST)
            
            logger.info("Resize successful")
            
            # Convert to RGB if needed
            if img.mode != 'RGB':
                img = img.convert('RGB')
                logger.info("Converted to RGB mode")
            
            # Optimize memory usage
            if max(new_width, new_height) > 4000:
                # For very large images, compress more aggressively
                quality = min(quality, 75)
                logger.info(f"Large image detected, using reduced quality: {quality}")
            
            return img
            
        except Exception as e:
            logger.error(f"Error during resize: {str(e)}")
            raise
    
    return img

# First, let's create a function to process images and save results
def process_and_save_text(input_folder, output_folder, filename):
    print(f"\nProcessing {filename}...")
    
    # Setup paths
    input_path = os.path.join(input_folder, filename)
    base_name = os.path.splitext(filename)[0]
    output_text = os.path.join(output_folder, f"{base_name}_Textract.txt")
    output_json = os.path.join(output_folder, f"{base_name}_Textract.json")
    
    try:
        # Process image
        with Image.open(input_path) as img:
            # Resize if needed
            img = resize_image_if_needed(img)
            # Convert to RGB mode if needed
            if img.mode != 'RGB':
                img = img.convert('RGB')
            # Save as JPEG in memory
            buffer = io.BytesIO()
            img.save(buffer, format='JPEG', quality=95)
            image_content = buffer.getvalue()
        
        # Process with Textract
        textract = boto3.client('textract')
        response = textract.detect_document_text(
            Document={'Bytes': image_content}
        )
        
        # Save JSON response
        with open(output_json, 'w', encoding='utf-8') as f:
            json.dump(response, f, indent=2)
            
        # Save extracted text
        with open(output_text, 'w', encoding='utf-8') as f:
            for block in response['Blocks']:
                if block['BlockType'] == 'LINE':
                    f.write(block.get('Text', '') + '\n')
        
        print(f"Successfully processed {filename}")
        print(f"Text saved to: {output_text}")
        print(f"JSON saved to: {output_json}")
        
        return True
    
    except Exception as e:
        print(f"Error processing {filename}: {e}")
        return False





# Part 4: Textract Table Extraction

## Extract the tables

In [10]:
import os
from PIL import Image
import math
import io
from tqdm import tqdm 

def process_image(input_path, max_size=8000):
    """Process a single image with enhanced error handling and memory management"""
    try:
        logger.info(f"Opening image: {input_path}")
        
        # Open image with lazy loading
        with Image.open(input_path) as img:
            # Get original size
            orig_size = img.size
            logger.info(f"Original image size: {orig_size}, Mode: {img.mode}")
            
            # Resize if needed
            img = resize_image_if_needed(img, max_size)
            
            # Convert to RGB if needed
            if img.mode != 'RGB':
                img = img.convert('RGB')
            
            # Save as JPEG in memory with appropriate quality
            buffer = io.BytesIO()
            quality = 85 if max(img.size) <= 4000 else 75
            img.save(buffer, format='JPEG', quality=quality, optimize=True)
            
            logger.info(f"Successfully processed image. Original size: {orig_size}, Final size: {img.size}")
            return buffer.getvalue()
            
    except MemoryError:
        logger.error(f"Memory error processing {input_path}. Try reducing max_size parameter.")
        raise
    except Exception as e:
        logger.error(f"Error processing image {input_path}: {str(e)}")
        raise

def batch_resize_and_extract(extractor, input_folder, output_folder, max_size=4000):
    """Process all images in a folder with enhanced error handling"""
    os.makedirs(output_folder, exist_ok=True)
    
    # Get list of image files
    valid_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.tiff', '.bmp']
    image_files = [f for f in os.listdir(input_folder) 
                  if any(f.lower().endswith(ext) for ext in valid_extensions)]
    
    if not image_files:
        logger.warning(f"No image files found in {input_folder}")
        return
    
    logger.info(f"\nProcessing {len(image_files)} images...")
    
    successful = []
    failed = []
    tables_found = 0
    
    for filename in tqdm(image_files, desc="Processing images"):
        try:
            # Process image first
            input_path = os.path.join(input_folder, filename)
            
            with Image.open(input_path) as img:
                # Resize if needed
                img = resize_image_if_needed(img, max_size)
                
                # Convert to RGB if needed
                if img.mode != 'RGB':
                    img = img.convert('RGB')
                
                # Save directly to bytes
                buffer = io.BytesIO()
                img.save(buffer, format='JPEG', quality=85)
                image_bytes = buffer.getvalue()
            
            # Extract tables using Textract directly
            textract_client = boto3.client('textract')
            response = textract_client.analyze_document(
                Document={'Bytes': image_bytes},
                FeatureTypes=['TABLES']
            )
            
            # Process tables from response
            if 'Blocks' in response:
                # Find table blocks
                table_blocks = [block for block in response['Blocks'] 
                              if block['BlockType'] == 'TABLE']
                
                tables_found += len(table_blocks)
                
                if table_blocks:
                    # Process each table
                    for i, table_block in enumerate(table_blocks):
                        # Extract cells for this table
                        cells = [block for block in response['Blocks'] 
                               if block['BlockType'] == 'CELL' and 
                               block.get('TableId') == table_block['Id']]
                        
                        # Convert to DataFrame
                        table_data = []
                        max_row = max(cell['RowIndex'] for cell in cells)
                        max_col = max(cell['ColumnIndex'] for cell in cells)
                        
                        # Initialize empty table
                        table_data = [['' for _ in range(max_col)] for _ in range(max_row)]
                        
                        # Fill in cell values
                        for cell in cells:
                            row_idx = cell['RowIndex'] - 1
                            col_idx = cell['ColumnIndex'] - 1
                            if 'Text' in cell:
                                table_data[row_idx][col_idx] = cell['Text']
                        
                        # Convert to DataFrame and save
                        df = pd.DataFrame(table_data)
                        base_name = os.path.splitext(filename)[0]
                        excel_filename = f"{base_name}_table_{i+1}.xlsx"
                        output_path = os.path.join(output_folder, excel_filename)
                        df.to_excel(output_path, index=False)
                    
                    successful.append(filename)
                    logger.info(f"Successfully extracted {len(table_blocks)} tables from {filename}")
                else:
                    failed.append((filename, "No tables found"))
                    logger.warning(f"No tables found in {filename}")
            else:
                failed.append((filename, "No blocks in response"))
                logger.warning(f"No blocks found in response for {filename}")
                
        except Exception as e:
            logger.error(f"Error processing {filename}: {str(e)}")
            failed.append((filename, str(e)))
    
    # Print summary
    logger.info("\nProcessing complete!")
    logger.info(f"Successfully processed: {len(successful)} images")
    logger.info(f"Total tables extracted: {tables_found}")
    if failed:
        logger.error(f"\nFailed to process {len(failed)} images:")
        for filename, error in failed:
            logger.error(f"- {filename}: {error}")

    return successful, failed, tables_found
    

In [11]:
extractor = Textractor(profile_name="default")
successful, failed, tables = batch_resize_and_extract(extractor, input_folder, output_folder)

Processing images:   2%|█▍                                                               | 1/47 [00:04<03:21,  4.38s/it]ERROR:__main__:Error processing 1920 Census_10.png: max() arg is an empty sequence
Processing images:   4%|██▊                                                              | 2/47 [00:11<04:17,  5.73s/it]


KeyboardInterrupt: 

In [None]:
image_files = [f for f in os.listdir(input_folder) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
test_image = image_files[20]
debug_response = debug_textract_response(test_image, input_folder)

In [16]:
def split_and_process_census(input_folder, output_folder, image_index=19):
    """Split census page into left and right halves, then process each separately"""
    image_files = [f for f in os.listdir(input_folder) 
                  if f.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp'))]
    
    if not image_files:
        print("No image files found in the input folder")
        return
    
    if image_index >= len(image_files):
        print(f"Image index {image_index + 1} is out of range. Only {len(image_files)} images in folder.")
        return
        
    filename = image_files[image_index]
    print(f"\nProcessing image: {filename}")
    
    try:
        # Open and process the image
        input_path = os.path.join(input_folder, filename)
        with Image.open(input_path) as img:
            print(f"Original image size: {img.size}")
            
            # Convert to RGB if needed
            if img.mode != 'RGB':
                img = img.convert('RGB')
            
            # Split image into left and right halves
            width, height = img.size
            mid_point = width // 2
            
            # Create left and right halves
            left_half = img.crop((0, 0, mid_point, height))
            right_half = img.crop((mid_point, 0, width, height))
            
            print("\nProcessing left half...")
            process_half(left_half, filename, "_left", output_folder)
            
            print("\nProcessing right half...")
            process_half(right_half, filename, "_right", output_folder)
            
    except Exception as e:
        print(f"Error processing image: {str(e)}")
        import traceback
        traceback.print_exc()

def process_half(img_half, original_filename, suffix, output_folder):
    """Process one half of the split census page"""
    try:
        # Resize if needed
        max_dimension = 7900
        width, height = img_half.size
        resize_ratio = min(max_dimension / width, max_dimension / height)
        
        if resize_ratio < 1:
            new_width = int(width * resize_ratio)
            new_height = int(height * resize_ratio)
            print(f"Resizing to: {new_width} x {new_height}")
            img_half = img_half.resize((new_width, new_height), Image.Resampling.LANCZOS)
        
        # Convert to bytes
        buffer = io.BytesIO()
        quality = 95
        while True:
            buffer.seek(0)
            buffer.truncate()
            img_half.save(buffer, format='JPEG', quality=quality)
            if buffer.tell() > 9.5 * 1024 * 1024:
                quality -= 5
                print(f"Reducing quality to {quality}")
                if quality < 65:
                    raise Exception("Cannot reduce image size enough while maintaining quality")
            else:
                break
        
        image_bytes = buffer.getvalue()
        print(f"Image size in bytes: {len(image_bytes):,}")
        
        # Call Textract
        print("Calling Textract...")
        textract_client = boto3.client('textract')
        response = textract_client.analyze_document(
            Document={'Bytes': image_bytes},
            FeatureTypes=['TABLES', 'FORMS']
        )
        
        blocks = response['Blocks']
        print(f"Total blocks found: {len(blocks)}")
        
        # Get all cells
        all_cells = [block for block in blocks if block['BlockType'] in ['CELL', 'MERGED_CELL']]
        print(f"Total cells found: {len(all_cells)}")
        
        if not all_cells:
            print("No cells found!")
            return
        
        # Get cell positions
        cell_tops = []
        for cell in all_cells:
            top = round(cell['Geometry']['BoundingBox']['Top'], 3)
            grouped = False
            for i, existing_top in enumerate(cell_tops):
                if abs(top - existing_top) < 0.005:
                    top = existing_top
                    grouped = True
                    break
            if not grouped:
                cell_tops.append(top)
        
        cell_tops = sorted(set(cell_tops))
        print(f"Found {len(cell_tops)} distinct rows")
        
        # Group cells by row
        rows = []
        for top in cell_tops:
            row_cells = [cell for cell in all_cells 
                        if abs(cell['Geometry']['BoundingBox']['Top'] - top) < 0.005]
            if row_cells:
                row_cells.sort(key=lambda c: c['Geometry']['BoundingBox']['Left'])
                rows.append(row_cells)
        
        # Create table data
        table_data = []
        for row in rows:
            row_data = []
            for cell in row:
                text = cell.get('Text', '')
                if not text and 'Relationships' in cell:
                    child_ids = [rel['Ids'] for rel in cell['Relationships'] 
                               if rel['Type'] == 'CHILD']
                    if child_ids:
                        child_texts = []
                        for child_id_list in child_ids:
                            for child_id in child_id_list:
                                child_block = next((b for b in blocks if b['Id'] == child_id), None)
                                if child_block and 'Text' in child_block:
                                    child_texts.append(child_block['Text'])
                        text = ' '.join(child_texts)
                row_data.append(text)
            table_data.append(row_data)
        
        # Pad rows to equal length
        max_cols = max(len(row) for row in table_data)
        table_data = [row + [''] * (max_cols - len(row)) for row in table_data]
        
        # Save to Excel
        df = pd.DataFrame(table_data)
        base_name = os.path.splitext(original_filename)[0]
        excel_filename = f"{base_name}{suffix}_table.xlsx"
        output_path = os.path.join(output_folder, excel_filename)
        df.to_excel(output_path, index=False)
        print(f"Saved table to: {excel_filename}")
        
    except Exception as e:
        print(f"Error processing half: {str(e)}")
        import traceback
        traceback.print_exc()

# Run the function
split_and_process_census(input_folder, output_folder, 19)


Processing image: 1920 Census_27.png
Original image size: (10176, 13184)

Processing left half...
Resizing to: 3048 x 7900
Image size in bytes: 5,291,697
Calling Textract...
Total blocks found: 2179
Total cells found: 797
Found 102 distinct rows
Saved table to: 1920 Census_27_left_table.xlsx

Processing right half...
Resizing to: 3048 x 7900
Image size in bytes: 4,571,282
Calling Textract...
Total blocks found: 2042
Total cells found: 685
Found 85 distinct rows
Saved table to: 1920 Census_27_right_table.xlsx
