# Converting Images to Black and White for OCR
While Tesseract will to all appearances perform OCR quite happily on color images, it's worth noting that, behind the scenes, the software is performing a number of transformations on those images before attempting to recognize text, including converting our lovely color images to black and white. Most discussions I've seen of OCR workflows involve converting images before sending them to Tesseract in the first place. (If nothing else, the black and white files are considerably smaller. In a quick test, it looks like Tesseract completed text recognition of a black and white image in a bit less than half the time it took to process the same image in color.)

The code in this notebook lets us see what different effects can be produced by adjusting parameters while converting color images to black and white.

For most people, the focus doesn't really need to be on the details of the code itself. Rather, focus on the different variables that affect the way that the images are transformed. I've set the notebook up so that you can change variables easily and re-process images to see what difference your changes make.


In [None]:
#Connect to Google Drive
from google.colab import drive
drive.mount('/gdrive')

#Import libraries to allow interactive widgets in this notebook
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, interactive

## Move images from Google Drive to Colaboratory environment
We won't use all of the images for this notebook, but it's more convenient just to have them in one place in the `data_class` folder, so we'll go ahead and pull them all in, anyway.

In [None]:
%cp -r /gdrive/MyDrive/rbs_digital_approaches_2021/data_class/page_images/penn_pr3732_t7_1730b.zip /content/penn_pr3732_t7_1730b.zip
%cd /content/
!unzip penn_pr3732_t7_1730b.zip

## Setting our source image
We'll use a page from the University of Pennsylvania's scan of one of the copies of James Thomson's *Sophonisba* in the Kislak Center for Special Collections, Rare Books, and Manuscripts (PR 3732 T7 1730b).

In [None]:
source_directory = '/content/penn_pr3732_t7_1730b/'
source_image = source_directory + 'PR3732_T7_1730b_body0003.tif'

## Converting images using PIL/Pillow
PIL was the original library for working with images in Python, but is not compatible with Python 3. It has been superseded by a fork called Pillow. Because PIL was so widely used, however, Pillow was written as a drop-in replacement to maintain compatibility with existing code: we even import it by calling it `PIL` instead of `Pillow`. Pillow can do lots of useful things with images, and you can find lots of pointers online for using it, so we'll start there. (Note that Pillow is installed by default in Google Colaboratory. If you were working in a different environment, you'd need to install Pillow using `pip`.)

We'll start by converting our color image to grayscale as an intermediate step towards getting our image to true black and white—a "binary" file in which each pixel is either black or white. We'll use [PIL's `convert()` method](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.convert), which requires us to select a mode for conversion. We'll use `L` for grayscale. (PIL's various modes are [explained in the documentation](https://pillow.readthedocs.io/en/stable/handbook/concepts.html#concept-modes).)

In [None]:
# !pip install pillow
#Import the Image and ImageDraw modules from PIL (actually Pillow)
from PIL import Image, ImageDraw

### Converting from color to grayscale

In [None]:
#Use the open() method of Pillow's Image library to open the .tif file
pilcolor_image = Image.open(source_image)

#Use the Image library's convert() method to convert the color image to grayscale
pilgray_image = pilcolor_image.convert('L')

#Output
pilgray_image
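
For anyone curious about what `convert('L')` is doing to each pixel: Pillow's documentation describes the conversion as a weighted sum of the red, green, and blue channels (the ITU-R 601-2 luma transform), with green counting most and blue least. The sketch below applies those weights to a few pixels in plain Python. The function name `rgb_to_gray` and the sample pixels are made up for illustration, and the rounding is an assumption, so Pillow's own results may differ by a level here and there.

```python
def rgb_to_gray(pixel):
    """Approximate Pillow's 'L' conversion for a single (R, G, B) pixel."""
    r, g, b = pixel
    # Weights from the Pillow documentation:
    # L = R * 299/1000 + G * 587/1000 + B * 114/1000
    # Integer division is an assumption; Pillow's internal rounding may differ.
    return (299 * r + 587 * g + 114 * b) // 1000

# Hypothetical sample pixels (not taken from our page image)
samples = {
    'cream paper': (240, 230, 200),
    'brown ink': (70, 50, 40),
    'pure green': (0, 255, 0),
    'pure blue': (0, 0, 255),
}

for name, pixel in samples.items():
    print(name, '->', rgb_to_gray(pixel))
```

Notice that pure green comes out much lighter than pure blue, even though both are "full strength" in their own channel. The weights are meant to reflect how bright each color looks to the eye.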

### Converting our grayscale image to black and white
Since converting from color to grayscale was as simple as selecting the mode `L`, let's try converting our grayscale image to binary by selecting mode `1`: "1-bit pixels, black and white, stored with one pixel per byte."

In [None]:
pilbw_image = pilgray_image.convert('1')
pilbw_image

#### Overriding the default behavior: turning off dithering
Okay, that's not ideal. As noted in [the documentation](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.convert):
>The default method of converting a greyscale (“L”) or “RGB” image into a bilevel (mode “1”) image uses Floyd-Steinberg dither to approximate the original image luminosity levels. If dither is NONE, all values larger than 127 are set to 255 (white), all other values to 0 (black). To use other thresholds, use the point() method.

Let's try that again, this time turning off the default dithering behavior: now any pixel with a value above 127 (i.e., lighter than the midpoint) will be turned to white, and every other pixel will be turned to black.

In [None]:
pilbw_image = pilgray_image.convert('1', dither=0)
pilbw_image
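
To get a feel for why the dithered version looked so speckled, here is a rough plain-Python sketch of the two approaches applied to a small patch of flat gray. It follows the textbook description of Floyd-Steinberg dithering (pass each pixel's rounding error along to its neighbors) and isn't Pillow's actual implementation. The function names and the sample patch are made up for illustration.

```python
def simple_threshold(pixels, thresh=127):
    """Values above thresh become white (255); everything else becomes black (0)."""
    return [[255 if value > thresh else 0 for value in row] for row in pixels]

def floyd_steinberg(pixels, thresh=127):
    """Textbook Floyd-Steinberg dithering for a grayscale image stored as a list of rows."""
    height, width = len(pixels), len(pixels[0])
    work = [[float(value) for value in row] for row in pixels]  # working copy
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old > thresh else 0
            out[y][x] = new
            error = old - new
            # Spread the rounding error over pixels we haven't visited yet
            if x + 1 < width:
                work[y][x + 1] += error * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += error * 3 / 16
                work[y + 1][x] += error * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += error * 1 / 16
    return out

# Hypothetical 8 x 8 patch of uniform dark gray
patch = [[100] * 8 for _ in range(8)]

thresholded = simple_threshold(patch)
dithered = floyd_steinberg(patch)

for row in dithered:
    print(''.join('#' if value == 0 else '.' for value in row))
```

Simple thresholding turns the whole patch black, since 100 is below 127. Dithering instead scatters white pixels through the patch so that, from a distance, it still reads as dark gray. That's great for photographs, but it's exactly the kind of speckle we don't want around letterforms.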

#### Adjusting the threshold point manually
That's probably better, but some of the text seems kind of attenuated. Given how variable the inking can be in early print, this default value might not always work. Let's see what happens if we adjust the threshold point.

Be sure to run the next cell to bring up a slider that will allow you to change the threshold value, then try experimenting with different threshold values.


In [None]:
#@title Set a threshold value {display-mode: "form"}
#@markdown Run this cell, then use the slider that will appear to adjust the threshold point for our image in the cell below.

#@markdown You only need to run this cell once (re-running it will just set things back to the default value). Try adjusting the slider and then re-running the *next* cell a few times to see the difference that different threshold values make.
thresh_value_slider = widgets.IntSlider(
    min=0,
    max=255,
    step=1,
    description='Threshold:',
    value=150
)
display(thresh_value_slider)

In [None]:
#See https://stackoverflow.com/questions/9506841/using-python-pil-to-turn-a-rgb-image-into-a-pure-black-and-white-image/50090612#50090612

#Get the current value of our slider widget from the cell above
thresh = thresh_value_slider.value

#This defines a quick function, 'fn', that we'll hand to point() below.
#Pillow uses it to decide what happens to each gray level in our image: if the
#value is greater than our "thresh" value (set by the slider above), the pixel
#is set to 255 (pure white). Otherwise, the pixel is set to 0 (pure black).
fn = lambda x : 255 if x > thresh else 0

#Convert our image, this time overriding the default dithering behavior
#with the threshold we've chosen, using the lambda function defined above
pilbinary_image = pilgray_image.point(fn, mode='1')

pilbinary_image
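
One detail worth knowing about `point()`: according to the Pillow documentation, the function we pass in is called once for each possible pixel value (0 through 255) to build a lookup table, and that table is then applied to the image. Here is a plain-Python sketch of the same idea. The names `table` and `row` and the sample values are made up for illustration, and `thresh` is set by hand here rather than read from the slider.

```python
thresh = 150  # stand-in for the slider value

# Build the lookup table: one entry for each possible gray level
table = [255 if level > thresh else 0 for level in range(256)]

# A hypothetical row of gray levels, running from dark ink to light paper
row = [30, 90, 149, 150, 151, 200, 245]

# "Thresholding" is now just looking each pixel up in the table
binary_row = [table[level] for level in row]
print(binary_row)
```

Note that a pixel exactly equal to the threshold ends up black, because the test is "greater than" rather than "greater than or equal to."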

It shouldn't take much experimenting to see that different threshold points can create very different results. But let's say we have 1,000 page images from 50 different books. Given the variability of early print, there may not be one perfect threshold value that gets the best results for all of our images: what would be great for one image might leave another one too dark and noisy, and might leave a third too faint.

What we need is a way to determine a good threshold value for each image without having to experiment on each image individually. Fortunately, this is a problem that people have worked on. While we might be able to work out how to implement, say, [Otsu's method](https://en.wikipedia.org/wiki/Otsu%27s_method) in code ourselves, we'd probably be better off taking advantage of the fact that other people have already done that work. For this, we'll turn from PIL to a different library.

## Converting images using OpenCV
OpenCV is a computer vision library that can be used for all sorts of things, including feature detection, image classification, and more. A library like this one might seem to be overkill for simply converting images from color to black and white, but it offers us lots of things built-in that would be difficult to work out from scratch, ourselves.

(Note that OpenCV and the Python wrapper for it are installed by default in Google Colab. If you were working in a different environment, you'd first need to install OpenCV—the process differs depending on your operating system, so I won't go into that here. You'd also need to install the Python wrapper using `pip`.)

In [None]:
# !pip install opencv-python
import cv2
#This is a Google Colab-specific patch to enable us to view OpenCV images
#in the browser
from google.colab.patches import cv2_imshow
#Other Python libraries that CV2 needs
import matplotlib.pyplot as plt
import numpy as np

#### Converting from color to grayscale
Opening and converting images with OpenCV is more or less like opening and converting images with PIL, but you'll notice a few differences. We use `cv2.imread()` and `cv2.cvtColor()` rather than `Image.open()` and `Image.convert()`, for example, and, in addition to supplying the file to open and convert, we also have to supply a code that names the color spaces we're converting from and to. (OpenCV loads color images with the channels in blue-green-red order, which is why the code we need is `cv2.COLOR_BGR2GRAY`.) We're working with a Python wrapper for OpenCV, but OpenCV, itself, is written in C++, so some of the conventions around capitalization and so forth will look different from lots of Python code you'll see.

In [None]:
#Open the color image
cv2color_image = cv2.imread(source_image, cv2.IMREAD_COLOR)

#Convert the color image to grayscale
cv2gray_image = cv2.cvtColor(cv2color_image, cv2.COLOR_BGR2GRAY)

#Output
cv2_imshow(cv2gray_image)

Yep. Looks like a grayscale image, all right. (And the conversion was actually rather faster than with PIL.)
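
Why does it matter that we tell OpenCV the image is `BGR` rather than `RGB`? The grayscale conversion weights the three color channels differently, so getting the order wrong changes the result. The sketch below uses the weights given in the OpenCV documentation for RGB-to-gray conversion. The function name `bgr_to_gray` and the sample pixels are made up for illustration, and the rounding is an assumption.

```python
def bgr_to_gray(pixel):
    """Approximate OpenCV's BGR-to-gray conversion for a single (B, G, R) pixel."""
    b, g, r = pixel  # note the order: blue first, red last
    # Weights from the OpenCV documentation: Y = 0.299*R + 0.587*G + 0.114*B
    return round(0.299 * r + 0.587 * g + 0.114 * b)

# Hypothetical pixels, written in OpenCV's blue-green-red order
pure_blue = (255, 0, 0)
pure_red = (0, 0, 255)

print('blue ->', bgr_to_gray(pure_blue))
print('red  ->', bgr_to_gray(pure_red))
```

Pure blue comes out much darker than pure red. If we had mistakenly treated the channels as red-green-blue, those two results would have been swapped, and reddish-brown stains or inks would have ended up at the wrong gray levels.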

#### Converting from grayscale to black and white
We could simply convert our grayscale image to black and white right now, but there are arguments for applying a slight blur to our image first. While it seems counterintuitive that we would want to make an image blurrier when what we want to do is to recognize text clearly, that blurring can help to minimize the effect of any noise in the image (including, say, tiny flecks in the paper).

(Be sure to run the next cell, as it creates a widget for adjusting the blur that we'll apply in the cell following it.)

In [None]:
#@title Set values for Gaussian blur {display-mode: "form"}
#@markdown Try adjusting the value that will be used for blurring in the next cell.

#@markdown (You only need to run this cell once—re-running it will simply reset it to the default value. After changing the value of the slider, try re-running the cell below this one.)
blur = widgets.IntSlider(min=1, max=31, step=2, value=5, description='Blur')
display(blur)


In [None]:
#Apply a Gaussian blur, using a kernel whose width and height are both equal to
#the value set by the slider above.
cv2blurred_image = cv2.GaussianBlur(cv2gray_image, (blur.value, blur.value), 0)
cv2_imshow(cv2blurred_image)
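
For a sense of what the blur is doing (and why the slider only offers odd numbers), here is a rough one-dimensional sketch in plain Python. A Gaussian kernel needs a center pixel, so its size has to be odd. Each blurred pixel is a weighted average of its neighbors, with the weights falling off with distance from the center. The sigma formula is the one the OpenCV documentation gives for deriving sigma from the kernel size when sigma is passed as 0. Everything else (function names, the sample row, the edge handling) is an assumption made for illustration, and OpenCV's own kernels and border handling may differ in detail.

```python
import math

def gaussian_kernel(ksize, sigma=0):
    """Build a normalized 1-D Gaussian kernel of odd length ksize."""
    if ksize < 1 or ksize % 2 == 0:
        raise ValueError('kernel size must be a positive odd number')
    if sigma <= 0:
        # Formula from the OpenCV documentation for deriving sigma from ksize
        sigma = 0.3 * ((ksize - 1) * 0.5 - 1) + 0.8
    center = ksize // 2
    weights = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(ksize)]
    total = sum(weights)
    return [weight / total for weight in weights]

def blur_row(row, kernel):
    """Blur one row of pixels, repeating the edge pixels past the borders."""
    half = len(kernel) // 2
    blurred = []
    for x in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            # Clamp the index so the window never runs off the row
            # (a simplification; OpenCV's default border handling differs)
            source = min(max(x + k - half, 0), len(row) - 1)
            acc += weight * row[source]
        blurred.append(round(acc))
    return blurred

# Hypothetical row of light paper (230) with a single dark fleck (90)
row = [230] * 5 + [90] + [230] * 5
kernel = gaussian_kernel(5)
blurred = blur_row(row, kernel)

print([round(weight, 3) for weight in kernel])
print(blurred)
```

The fleck is still the darkest point in the row after blurring, but it is much closer to the paper color than before, which makes it less likely to survive thresholding as a stray black dot.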

#### Determining an appropriate threshold using Otsu's method 
We'll experiment with two different methods for automatically thresholding our image. The first, Otsu's method (as best I understand without actually being able to follow the equations), arrives at a threshold level for the image as a whole by noting the distribution of intensities across *all* the pixels of an image and finding the threshold level that optimally divides those intensities into two clusters: the point at which it makes most sense to say "Everything greater than this belongs together in one group, and everything less than this belongs together in a different group." The "greater than" group (the lighter pixels) gets turned to white, and the "less than" group (the darker pixels) gets turned to black.

This is more or less what we were trying to do experimentally, ourselves, with the threshold slider, above, but without the trial and error. Otsu's method seems to work quite well for the kinds of page images we're dealing with, as we'll see in the cell below.

**Note:** Make sure that the blurred image you've created in the cell above looks good to you, since that's what we'll be thresholding here. If you had experimented with the blur until it started looking terrible, now would be the time to set it back to a more sensible level.

In [None]:
#Threshold the blurred image using Otsu's method. This line looks a little weird.
#We're really creating two variables at the same time--T and 
#cv2binary_otsu_image--grouped together as a tuple. The cv2.threshold() method
#will return both the threshold level that's calculated by Otsu's method (that 
#will be T) and the image that results from applying that threshold. If we
#just want the image, we could substitute the following:
#cv2binary_otsu_image = cv2.threshold(cv2blurred_image, 0, 255, cv2.THRESH_OTSU)[1]
(T, cv2binary_otsu_image) = cv2.threshold(cv2blurred_image, 0, 255, cv2.THRESH_OTSU)

#Output
cv2_imshow(cv2binary_otsu_image)
print('Otsu threshold is: ' + str(T))

That looks pretty good with a threshold of 155. How does that compare to the threshold level you had arrived at, above?

Note that this conversion was slower than what we got when supplying PIL with a threshold level, but we can be more confident that this method has found a good threshold level. If we were trying to automate the conversion of scores (or hundreds, or thousands) of images, the tradeoff in speed for adaptability would almost surely be worth it.
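
If you'd like to peek under the hood, here is a small plain-Python sketch of the idea behind Otsu's method as it's usually described: try every possible threshold, split the pixels into a "dark" group and a "light" group, and keep the threshold that pushes the two groups' averages furthest apart (weighted by how many pixels are in each group). This is an illustration only, not OpenCV's actual code. The function name `otsu_threshold` and the sample pixel values are made up.

```python
def otsu_threshold(pixels):
    """Return the gray level that best splits pixels into a dark and a light group."""
    # Count how many pixels there are at each of the 256 gray levels
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_score = 0, -1.0
    count_dark, sum_dark = 0, 0
    for level in range(256):
        # Move every pixel at this level into the "dark" group
        count_dark += histogram[level]
        sum_dark += level * histogram[level]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue  # one group is empty, so there's nothing to compare
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        # Between-class variance (up to a constant factor): large when both
        # groups are well populated and their averages are far apart
        score = count_dark * count_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_threshold, best_score = level, score
    return best_threshold

# Hypothetical page: a little dark ink and a lot of light paper
pixels = [40, 50, 60] * 20 + [190, 200, 210] * 80

T = otsu_threshold(pixels)
binary = [255 if value > T else 0 for value in pixels]
print('Otsu threshold is:', T)
print('Black pixels:', binary.count(0), 'White pixels:', binary.count(255))
```

With values this cleanly separated, any threshold in the gap between ink and paper does the job equally well. Real page images are messier, which is where letting the method weigh the whole distribution starts to pay off.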

#### Sounds great! Is there a catch?
Otsu's method seems to work great for this page image. We could imagine circumstances, though, where the results wouldn't be so good. In an image where the separation between light and dark pixels was less clear, the threshold determined by Otsu's method might yield a result that was difficult to read. This could be the case, for instance, with an image of a page with lots of ink showing through from the other side, or with too much shadow on the page from less-than-ideal photographic circumstances, or, as we'll see in the next cell, with severe foxing of the paper.

In [None]:
foxed = source_directory + 'st_tz_foxing.jpg'
color_foxed = cv2.imread(foxed, cv2.IMREAD_COLOR)
cv2_imshow(color_foxed)

How does Otsu's method do with this image? Wellll...

(Note that this is just an image I found in a [blog post from the New England Document Conservation Center](https://www.nedcc.org/about/nedcc-stories/story-tz-interview), and not an image that was created through careful digitization. Still, if you've spent any time working with scanned books, you've surely seen something like this before.)

In [None]:
gray_foxed = cv2.cvtColor(color_foxed, cv2.COLOR_BGR2GRAY)
blurred_foxed = cv2.GaussianBlur(gray_foxed, (5, 5), 0)
otsu_foxed = cv2.threshold(blurred_foxed, 0, 255, cv2.THRESH_OTSU)[1]

cv2_imshow(otsu_foxed)

#### Applying adaptive thresholding for problematic images
Rather than attempting to calculate a single threshold point appropriate for the image as a whole, adaptive thresholding calculates a separate threshold for each pixel, based on the levels in the region of the image immediately surrounding it. A shadowed or stained part of the page is therefore judged against its own surroundings rather than against the page as a whole. For a generally good image like our page of *Sophonisba*, the difference isn't really all that noticeable. We can certainly see a difference, but it's not terribly dramatic.

In [None]:
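#The last two arguments are the size of the neighborhood examined around each pixel (101 x 101
#pixels; this value must be odd) and a constant (30) subtracted from that neighborhood's weighted average
cv2binary_adaptive_image = cv2.adaptiveThreshold(cv2blurred_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 101, 30)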
cv2_imshow(cv2binary_adaptive_image)

But it makes a remarkable difference for the badly foxed title page we saw giving Otsu's method trouble:

In [None]:
cv2binary_adaptive_image = cv2.adaptiveThreshold(blurred_foxed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 101, 30)
cv2_imshow(cv2binary_adaptive_image)
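
To make the difference between a single global threshold and a local one concrete, here is a plain-Python sketch that works on one row of pixels from an imaginary page that gets darker from left to right (as if a shadow fell across it). For simplicity it uses a plain average of each pixel's neighbors rather than the Gaussian-weighted average we asked OpenCV for above, and it simply cuts the neighborhood short at the edges of the row. The function names, the sample data, and the block size and constant are all assumptions chosen for illustration.

```python
def global_threshold(row, thresh):
    """One threshold for the whole row: above it is white (255), otherwise black (0)."""
    return [255 if value > thresh else 0 for value in row]

def adaptive_mean_threshold(row, block_size, c):
    """A separate threshold for each pixel: the average of its neighborhood, minus c."""
    half = block_size // 2
    result = []
    for x, value in enumerate(row):
        # Neighborhood around this pixel, cut short at the edges of the row
        window = row[max(0, x - half): x + half + 1]
        local_thresh = sum(window) / len(window) - c
        result.append(255 if value > local_thresh else 0)
    return result

# Hypothetical row: the paper darkens steadily from 220 down to 106,
# and three "ink" pixels are each 80 levels darker than the paper around them
background = [220 - 6 * i for i in range(20)]
ink_positions = [3, 10, 16]
row = [value - 80 if i in ink_positions else value
       for i, value in enumerate(background)]

global_black = [i for i, v in enumerate(global_threshold(row, 150)) if v == 0]
adaptive_black = [i for i, v in enumerate(adaptive_mean_threshold(row, 7, 10)) if v == 0]

print('Global threshold marks as ink:  ', global_black)
print('Adaptive threshold marks as ink:', adaptive_black)
```

The global threshold picks up two of the ink pixels but also turns the whole shadowed end of the row black, swallowing the third. The adaptive version marks just the three ink positions, because each pixel is only being compared with its immediate neighbors.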

## Clear Google Colab environment
I don't *think* that leaving all those .tif images in the Colab environment will count against your Google Drive storage quota, but it might, so let's just wipe out all of those files.

In [None]:
%cd /content/
!rm -r ./*

## Takeaways
During the course of OCR, images are going to get converted to black and white, and the quality of that conversion can have a dramatic effect on the quality of the recognized text.

Different images respond better to different treatments. You could set a manual threshold level of, say, 150 or 155 and get pretty good results most of the time, but sometimes the results would not be good at all.

Automated methods for figuring out likely-optimal values for thresholding any given image are necessary to do this kind of work at scale, but here, too, there are options. All-around, Otsu's method is quite effective (and since it's faster than adaptive thresholding, it's probably a good choice for most uses). But there will be some occasions where it's not the best tool for the job.