# Preprocessing Images for OCR

## 00. Opening an Image

In [4]:
import cv2
from matplotlib import pyplot as plt

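image_file = "./data/page_01.jpg"
img = cv2.imread(image_file)
# cv2.imread returns None instead of raising when the file cannot be read.
if img is None:
    raise FileNotFoundError(f"Could not read image: {image_file}")
print("Image read into memory.")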

Image read into memory.


### Helper function to display an image inline

In [5]:
def display(im_path):
    dpi = 80
    im_data = plt.imread(im_path)

    height, width = im_data.shape[:2]
    
    # What size does the figure need to be in inches to fit the image?
    figsize = width / float(dpi), height / float(dpi)

    # Create a figure of the right size with one axes that takes up the full figure
    fig = plt.figure(figsize=figsize)
    ax = fig.add_axes([0, 0, 1, 1])

    # Hide spines, ticks, etc.
    ax.axis('off')

    # Display the image.
    ax.imshow(im_data, cmap='gray')

    plt.show()

In [7]:
# display(image_file)

## 01. Inverted Images

Inverting an image flips the value of every pixel: light pixels become dark and dark pixels become light, so white turns black and vice versa.

Note: Inversion is not a critical or widely used step with Tesseract 4. In fact, it can make results worse.

In [8]:
inverted_image = cv2.bitwise_not(img)
cv2.imwrite("./temp/inverted.jpg", inverted_image)

True

In [10]:
# display("./temp/inverted.jpg")

## 02. Rescaling

We rescale because OCR works best when the image falls within an optimal size range.

This range is defined by the height of the characters in pixels, which depends on the scan resolution, or DPI (dots per inch).

## 03. Binarization - CRITICAL STEP

Binarization is the process of converting an image to black and white, which increases the contrast between background and text.

For an image to convert to black and white well, it must first be in grayscale.

This is a two-step process:

1. Process your image with a `grayscale` function.
2. Run the grayscale image through the `cv2.threshold` function, adjusting the threshold values until the text is clean.

### Helper function to grayscale an image

In [11]:
def grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

In [12]:
gray_image = grayscale(image=img)
cv2.imwrite("./temp/gray.jpg", gray_image)

True

Often, the `cv2.threshold` values (the threshold and the maximum output value, respectively) start at 127 and 255, but for text these are not always the best parameters to start with.

For faded images of text, start with 200 and 230.

In [14]:
thresh, im_bw = cv2.threshold(gray_image, 200, 230, cv2.THRESH_BINARY)
cv2.imwrite("./temp/bw_image.jpg", im_bw)

True