# Introduction

In this notebook, we use Amazon Textract and Google Vision to quickly extract text and tables from an image of a page.

Intended use: This notebook is meant for quick prototyping. Expect to modify the code to suit your use case.

Preparation: At a minimum, set a working folder, and make sure to add your API keys for both Textract and Google Vision. To do so, please follow the steps outlined here: https://github.com/MikeJGiordano/OCR_History/blob/main/ReadMe.md

This notebook contains four parts:

    1. Unmodified image OCR. This is intended to quickly detect text from a single image.
        a. There is then an option to run one or both OCR tools on a whole folder.
        
    2. Image preprocessing. This routine helps you quickly preprocess a single image (adjust contrast, split the image, etc.).
        a. If you are satisfied with the preprocessing routine, you then have the option to preprocess a whole folder.
        
    3. Image preprocessing with text extraction. This runs the image modification from part 2 into the text detection from part 1.
    
    4. Image preprocessing with table extraction from Textract. This uses the image modification from part 2 to extract a table using Textract.

# Program Setup

## There are 5 steps, marked A-E.

In [11]:
from google.cloud import vision
import os

# Check credentials
print("Checking Google Cloud setup:")
if 'GOOGLE_APPLICATION_CREDENTIALS' in os.environ:
    cred_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
    print(f"Credentials file path: {cred_path}")
    print(f"Credentials file exists: {os.path.exists(cred_path)}")
else:
    print("GOOGLE_APPLICATION_CREDENTIALS environment variable not set")

# Try to create a client
try:
    client = vision.ImageAnnotatorClient()
    print("Successfully created Vision client")
except Exception as e:
    print(f"Error creating Vision client: {e}")

Checking Google Cloud setup:
Credentials file path: /mnt/c/Users/WATLINGS/Documents/GitHub/OCR_History/OCR_Python/ServiceAccountToken.json
Credentials file exists: True
Successfully created Vision client


### A: Import packages

In [1]:
import io
import json
import os

# If you don't have these packages, install them with any package manager.
# You can install all of them at once using the provided requirements.txt file.
import cv2
import boto3
from google.cloud import vision

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tqdm as tq

from PIL import Image, ImageDraw
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures, Direction, DirectionalFinderType

# Note: preprocess.py must be downloaded separately and be importable
# (for example, saved in the same folder as this notebook).
import preprocess as pp

### B: Please set your working directories here

In [2]:
# please set the path to the folder containing your images here
input_folder = "images"  # relative path since we're already in the correct directory
# please set the path to a desired output folder here
output_folder = "output"
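
Before running the OCR cells, it can help to confirm that both folders are usable. The following is a minimal sketch, not part of `preprocess.py`; the helper name `ensure_folders` is hypothetical. It fails early if the input folder is missing and creates the output folder if needed.

```python
from pathlib import Path


def ensure_folders(input_folder, output_folder):
    """Check that the input folder exists and create the output folder if missing."""
    in_path = Path(input_folder)
    if not in_path.is_dir():
        raise FileNotFoundError(f"Input folder not found: {in_path.resolve()}")
    out_path = Path(output_folder)
    # parents=True also creates intermediate folders; exist_ok avoids an error on reruns
    out_path.mkdir(parents=True, exist_ok=True)
    return out_path
```

In this notebook, it could be called as `ensure_folders(input_folder, output_folder)` right after setting the two variables.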

### C: Please set your main input file here

In [3]:
# set the filename to your image here
newspaper_image = "NYT.png"

### D: Please authenticate Google Cloud

For help with Google Cloud, see https://github.com/MikeJGiordano/OCR_History/blob/main/Setup_Google_Cloud.md

In [4]:
# Authenticate Google Cloud here: replace the path below with the path to your own service-account key file

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/mnt/c/Users/WATLINGS/Documents/GitHub/OCR_History/OCR_Python/ServiceAccountToken.json'
client = vision.ImageAnnotatorClient()

### E: Please authenticate Amazon Textract

For help with Amazon Textract, see https://github.com/MikeJGiordano/OCR_History/blob/main/Setup_AWS_Root.md

In [5]:
#Authenticate AWS Textract in the console/terminal
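
Since AWS authentication happens outside the notebook, a quick check from inside it can save a failed Textract call later. The sketch below uses only the standard library and looks at two of the standard places the AWS SDK reads credentials from: the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables and the shared credentials file (`~/.aws/credentials`, or the path in `AWS_SHARED_CREDENTIALS_FILE`). The function name is hypothetical, and it does not verify that the credentials are valid or cover other sources such as SSO or instance roles.

```python
import os
from pathlib import Path


def aws_credential_sources(env=None, home=None):
    """Return the standard credential locations that appear to be populated."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    found = []
    # 1. Credentials passed through environment variables
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")
    # 2. Shared credentials file (default location can be overridden)
    default_file = home / ".aws" / "credentials"
    cred_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", default_file))
    if cred_file.is_file():
        found.append(f"shared credentials file: {cred_file}")
    return found


print(aws_credential_sources() or "No AWS credentials found in the locations checked")
```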

# Part 1: Basic text extraction

In [None]:
import os
from pathlib import Path
from google.cloud import vision

# Convert a Windows path on the C: drive to its WSL equivalent
def to_wsl_path(windows_path):
    return windows_path.replace('C:', '/mnt/c').replace('\\', '/')

# Convert all paths
wsl_image_path = to_wsl_path(r"C:\Users\WATLINGS\Documents\GitHub\OCR_History\OCR_Python\images\NYT.png")
wsl_input_folder = to_wsl_path(r"C:\Users\WATLINGS\Documents\GitHub\OCR_History\OCR_Python\images")
wsl_output_folder = to_wsl_path(r"C:\Users\WATLINGS\Documents\GitHub\OCR_History\OCR_Python\output")

# Verify all paths exist
print("Checking paths:")
print(f"Image exists: {os.path.exists(wsl_image_path)}")
print(f"Input folder exists: {os.path.exists(wsl_input_folder)}")
print(f"Output folder exists: {os.path.exists(wsl_output_folder)}")

try:
    # First test direct Vision API access
    client = vision.ImageAnnotatorClient()
    print("Successfully created Vision client")
    
    # Test direct image processing
    with open(wsl_image_path, 'rb') as image_file:
        content = image_file.read()
    
    image = vision.Image(content=content)
    response = client.text_detection(image=image)
    if response.error.message:
        print(f"API Error: {response.error.message}")
    else:
        print("Direct Vision API call successful!")
        
except Exception as e:
    print(f"Detailed error information: {str(e)}")
    import traceback
    print(f"Full traceback: {traceback.format_exc()}")
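
The `to_wsl_path` helper in the cell above only rewrites an uppercase `C:` prefix. If your images live on another drive, a slightly more general version might look like the sketch below. It assumes the default WSL layout, in which Windows drives are mounted under `/mnt/<lowercase drive letter>`.

```python
import re


def to_wsl_path(windows_path):
    """Convert e.g. 'D:\\data\\img.png' to '/mnt/d/data/img.png'."""
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", windows_path)
    if not match:
        # Not a drive-letter path: only normalize the separators
        return windows_path.replace("\\", "/")
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/{rest.replace(chr(92), '/')}"  # chr(92) is a backslash
```

Paths without a drive letter, such as the relative `images/NYT.png`, pass through unchanged apart from separator normalization.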

In [6]:
import logging
logging.basicConfig(level=logging.DEBUG)

# Check the input variables before calling process_content
print("Checking input variables:")
print(f"newspaper_image type: {type(newspaper_image)}")
print(f"input_folder path exists: {os.path.exists(input_folder)}")
print(f"output_folder path exists: {os.path.exists(output_folder)}")

# Test a direct Vision API call before calling process_content
try:
    print("\nTesting direct Vision API call...")
    from google.cloud import vision
    
    # Create vision client
    client = vision.ImageAnnotatorClient()
    
    # Construct the full image path
    full_image_path = os.path.join(input_folder, newspaper_image)
    print(f"Attempting to read image from: {full_image_path}")
    
    # Read the image file
    with open(full_image_path, 'rb') as image_file:
        content = image_file.read()
    
    # Create vision image object
    image = vision.Image(content=content)
    
    # Try text detection
    response = client.text_detection(image=image)
    texts = response.text_annotations
    
    print(f"Number of text blocks found: {len(texts)}")
    if len(texts) > 0:
        print("First text block found:", texts[0].description[:100])
    
except Exception as e:
    print(f"Direct Vision API test failed with error: {str(e)}")

# Then run process_content itself
print("\nNow trying process_content...")

try:
    print("Starting process_content...")
    pp.process_content(newspaper_image, 
                      input_folder,
                      output_folder,
                      show_image=True,
                      use_google_vision=True, 
                      use_textract=False, 
                      verbose=True)
    print("process_content completed")
except Exception as e:
    print(f"Detailed error information: {str(e)}")
    print(f"Error type: {type(e)}")
    import traceback
    print(f"Full traceback: {traceback.format_exc()}")
    
    # Check Google Cloud credentials
    import os
    print("\nChecking Google Cloud credentials:")
    if 'GOOGLE_APPLICATION_CREDENTIALS' in os.environ:
        print(f"Credentials path: {os.environ['GOOGLE_APPLICATION_CREDENTIALS']}")
        print(f"File exists: {os.path.exists(os.environ['GOOGLE_APPLICATION_CREDENTIALS'])}")
    else:
        print("GOOGLE_APPLICATION_CREDENTIALS not set in environment")

Checking input variables:
newspaper_image type: <class 'str'>
input_folder path exists: True
output_folder path exists: True
Starting process_content...
Google Vision Output:
Error with Cloud Vision
Setting all parameters=True gives a basic visualization of the outputs of both Cloud Vision, defaulted as the first image, and Textract, the second image. The .txt and .json outputs for both Cloud Vision and Textract are saved in the output_folder. By setting a parameter=False, you can skip that function. For example, if use_textract=False and use_google_vision=True, this will not send the image through Textract, but will send the image through Google Vision.
process_content completed


### You can use the next cell to get text and JSON files for the entire input folder through Google Vision, Textract, or both.

In [7]:
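# Batch process all images in the input folder and save text and JSON outputs to the output folder.
# Set use_google_vision and/or use_textract to True; with both set to False, neither OCR tool is run.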

pp.batch_ocr(input_folder, 
                 output_folder, 
                 use_google_vision=False, 
                 use_textract=False)

Processing Images: 100%|█████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 9679.16image/s]

All images OCR'd. text and JSON files are in folder output
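
To see what a batch run would actually cover before spending API calls, the flag logic can be sketched as follows. This is an illustration, not the code of `pp.batch_ocr`: the function name `plan_batch`, the engine labels, and the set of image extensions are all assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your files
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def plan_batch(folder, use_google_vision=False, use_textract=False):
    """Return (image name, engine) pairs that a batch run would cover."""
    engines = []
    if use_google_vision:
        engines.append("google_vision")
    if use_textract:
        engines.append("textract")
    images = sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    # With both flags False, `engines` is empty and so is the plan
    return [(name, engine) for name in images for engine in engines]
```

An empty result from `plan_batch(input_folder)` makes it obvious that at least one of the two flags must be set to `True` for any OCR to happen.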





# Part 2: Preprocess images
Often, it helps to preprocess an image. Common routines are:
    
    1. Adjusting contrast or brightness
    2. Converting to grayscale
    3. Cropping
    4. Erasing margins
    5. Splitting images
    
We now provide two examples:
    
    1. Applying routines 1-4 to a full image
    2. Preprocessing and splitting the image
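
The routines listed above can be sketched with plain NumPy array operations. This is only an illustration of the ideas, not the implementation in `preprocess.py`: the function names and the `left_percent` / `top_percent` parameters are hypothetical, and erasing a margin is assumed to mean painting it white.

```python
import numpy as np


def adjust_contrast_brightness(img, alpha=1.0, beta=0.0):
    """Scale pixel values by alpha (contrast) and shift them by beta (brightness)."""
    out = img.astype(np.float32) * alpha + beta
    return np.clip(out, 0, 255).astype(np.uint8)


def to_grayscale(img):
    """Convert an H x W x 3 RGB array to H x W using ITU-R BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = img.astype(np.float32) @ weights
    return np.clip(gray, 0, 255).round().astype(np.uint8)


def erase_margins(img, left_percent=0, top_percent=0, fill=255):
    """Paint the left and top margins, given as a percent of width/height."""
    out = img.copy()
    height, width = out.shape[:2]
    out[: height * top_percent // 100, :] = fill
    out[:, : width * left_percent // 100] = fill
    return out


def split_vertically(img):
    """Split an image into its left and right halves."""
    mid = img.shape[1] // 2
    return img[:, :mid], img[:, mid:]


# Cropping is plain slicing: rows first, then columns
page = np.zeros((100, 200, 3), dtype=np.uint8)
cropped = page[10:90, 20:180]
```

Each function returns a new array (or views, in the case of `split_vertically`), so the steps can be chained in whatever order suits the image.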

### Example 1: Full image

In [8]:
# set the filename to your image here
railroad_table = "1888_Page_161.png"

In [9]:
#The next cell will apply the default preprocess settings to your image.
#If you are unsatisfied with those settings, it will instruct you on how to make changes.
#Those changes should be inserted in this cell.



In [10]:
#Preprocess a single image.
pp.preprocess_image(railroad_table,
                       input_folder,
                       output_folder,
                       **pp.default);

Do you want to split this image into two separate images? (y/n): y
Do you want to split it Vertically or Horizontally? (v/h) y


AttributeError: module 'numpy.core.multiarray' has no attribute 'integer'

### Example 2: Split image

In [None]:
# set the filename to your split image here
korean_image = "126.png"

In [None]:
# Override two of the default preprocess settings before running the next cell.
# If you are unsatisfied with the result, the next cell will provide instructions on how to make further changes.

pp.default['left_margin_percent'] = 30
pp.default['top_margin_percent'] = 5

In [None]:
#Preprocess a split image.
pp.preprocess_image(korean_image,
                       input_folder,
                       output_folder,
                       **pp.default);

# Part 3: Preprocessed Text Extraction

### Example 1: Full image

In [None]:
# using the above processing, the folder of modified images is located at:

modified_images = "output/modified_images/"

# Modification alters the name of the file to be:

modified_railroad = 'modified_' + railroad_table

In [None]:
# plot the image, save .json outputs
pp.process_content(modified_railroad, 
                   modified_images,
                   output_folder,
                   show_image = True,
                   use_google_vision=False, 
                   use_textract=True, 
                   verbose=True)

### Example 2: Split image

In [None]:
# Modification splits the file into two and renames them:

modified_1_split = 'modified_1_' + korean_image
modified_2_split = 'modified_2_' + korean_image
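
The naming pattern used in these cells (`modified_` for a full image, `modified_1_` and `modified_2_` for a split image) can be wrapped in a small helper that also reports which expected files are missing. This is a sketch based only on the names shown in this notebook; both function names are hypothetical.

```python
from pathlib import Path


def modified_names(filename, split=False):
    """Return the file name(s) expected after preprocessing `filename`."""
    if split:
        return [f"modified_1_{filename}", f"modified_2_{filename}"]
    return [f"modified_{filename}"]


def missing_modified(folder, filename, split=False):
    """Return the expected preprocessed files that are not present in `folder`."""
    return [
        name
        for name in modified_names(filename, split)
        if not (Path(folder) / name).is_file()
    ]
```

For example, `missing_modified(modified_images, korean_image, split=True)` returning a non-empty list would indicate that the preprocessing step in Part 2 has not produced both halves yet.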

In [None]:
# plot the images, save .json and .txt outputs
pp.process_content(modified_1_split, 
                   modified_images,
                   output_folder,
                   show_image = True,
                   use_google_vision=True, 
                   use_textract=False, 
                   verbose=True)

pp.process_content(modified_2_split,
                   modified_images,
                   output_folder,
                   show_image = True,
                   use_google_vision=True,
                   use_textract=False,
                   verbose=True)

### You can use the next cell to get text and JSON files for the entire folder of modified images through Google Vision, Textract, or both.

In [None]:
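# Batch process all images in the modified folder and save text and JSON outputs to the output folder.
# Set use_google_vision and/or use_textract to True; with both set to False, neither OCR tool is run.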

pp.batch_ocr(modified_images, 
                 output_folder, 
                 use_google_vision=False, 
                 use_textract=False)

# Part 4: Textract Table Extraction

### Setup

Initialize the Textractor client. If required, set a different region with the `region_name` argument.

In [None]:
extractor = Textractor(profile_name="default")

Please specify the image you want to extract a table from.

In [None]:
# using the above processing, the folder of modified images is located at:

modified_images = "output/modified_images/"

# Modification alters the name of the file to be:

modified_railroad = 'modified_' + railroad_table

## Extract the tables

In [None]:
pp.extract_table(extractor, 
                       modified_railroad,
                       modified_images,
                       output_folder);
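
`pp.extract_table` is defined in `preprocess.py`, so its internals are not shown here. As a rough sketch of what saving tables might involve when working with Textractor directly: a document returned by `extractor.analyze_document` exposes a `tables` list, and each table offers `to_pandas()`. The helper name `save_tables` and the CSV naming scheme below are assumptions for illustration, not the author's method.

```python
from pathlib import Path


def save_tables(document, output_folder, stem):
    """Write each table in a Textractor document to its own CSV file."""
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table in enumerate(document.tables, start=1):
        path = out_dir / f"{stem}_table_{i}.csv"  # hypothetical naming scheme
        table.to_pandas().to_csv(path, index=False)
        paths.append(path)
    return paths


# Usage (needs AWS credentials and the variables defined above, so it is left commented out):
# document = extractor.analyze_document(
#     file_source=os.path.join(modified_images, modified_railroad),
#     features=[TextractFeatures.TABLES],
# )
# save_tables(document, output_folder, Path(modified_railroad).stem)
```

Writing one CSV per table keeps pages with several tables easy to inspect in pandas or a spreadsheet.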