# HOMEWORK 8

In this homework you are going to implement your first machine learning algorithm to automatically binarize document images. The goal of document binarization is to separate the characters (letters) from everything else. This is a crucial step for automatic document understanding and information extraction. To do so, you will use the Otsu thresholding algorithm.

At the end of this notebook, there are a couple of questions for you to answer.

In [1]:
import cv2
import math
import numpy as np
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [15, 10]

Let's load the document image we will be working on in this homework.

In [None]:
img = cv2.imread('./data/document.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
plt.imshow(img, cmap='gray');

First, let's have a look at the histogram.

In [None]:
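# One bin per gray level (0..255), so integer pixel values never share or skip a bin
h = np.histogram(img, bins=256, range=(0, 256))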
plt.bar(h[1][0:-1], h[0])
plt.xlabel('Colour'), plt.ylabel('Count')
plt.grid(True)

### Otsu Thresholding

Let's now implement the Otsu thresholding algorithm. Remember that the algorithm consists of an optimization process that finds the threshold that minimizes the intra-class (within-class) variance or, equivalently, maximizes the inter-class (between-class) variance.
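
As a reminder of why the two criteria are equivalent, the relation can be written out as follows. The subscripts $b$ and $f$ (background and foreground) and the threshold symbol $t$ are notation chosen here for illustration; $\omega$, $\mu$ and $\sigma^2$ denote the class weight, class mean and class variance.

$$\sigma_{\text{within}}^2(t) = \omega_b(t)\,\sigma_b^2(t) + \omega_f(t)\,\sigma_f^2(t)$$

$$\sigma_{\text{between}}^2(t) = \omega_b(t)\,\omega_f(t)\,\bigl[\mu_b(t) - \mu_f(t)\bigr]^2$$

$$\sigma_{\text{total}}^2 = \sigma_{\text{within}}^2(t) + \sigma_{\text{between}}^2(t)$$

Since the total variance of the image does not depend on $t$, the threshold that minimizes the within-class variance is the same one that maximizes the between-class variance.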

In this homework, you are going to demonstrate the working principle of the Otsu algorithm. Therefore, you won't have to worry about an efficient implementation: we are going to use the brute-force approach here.

In [5]:
# Get image dimensions
rows, cols = img.shape
# Compute the total amount of image pixels
num_pixels = rows * cols

# Initializations
best_wcv = float('inf')  # Best within-class variance (wcv)
opt_th = None   # Threshold corresponding to the best wcv

# Brute force search using all possible thresholds (levels of gray)
for th in range(0, 256):
    # Extract the image pixels corresponding to the foreground (levels >= th)
    foreground = img[img >= th]
    # Extract the image pixels corresponding to the background (levels < th)
    background = img[img < th]
    
    # If foreground or background are empty, continue
    if len(foreground) == 0 or len(background) == 0:
        continue
    
    # Compute class-weights (omega parameters) for foreground and background
    omega_f = len(foreground) / num_pixels
    omega_b = len(background) / num_pixels
    
    # Compute pixel variance for foreground and background
    # Hint: Check out the var function from numpy ;-)
    # https://numpy.org/doc/stable/reference/generated/numpy.var.html
    sigma2_f = np.var(foreground)
    sigma2_b = np.var(background)
    
    # Compute the within-class variance
    wcv = omega_f * sigma2_f + omega_b * sigma2_b
    
    # Perform the optimization
    if wcv < best_wcv:
        best_wcv = wcv
        opt_th = th
        
# Print out the optimal threshold found by the Otsu algorithm
print('Optimal threshold', opt_th)

Optimal threshold 160
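
The brute-force loop re-extracts the pixels of both classes for every threshold. As a rough sketch of the equivalent between-class formulation, the same search can be done from the 256-bin histogram alone. The function name `otsu_from_histogram` is hypothetical, and the code assumes an 8-bit grayscale image and the same class convention as the loop (background is `< th`, foreground is `>= th`).

```python
import numpy as np

def otsu_from_histogram(gray):
    """Return the threshold th that maximizes the between-class variance.

    Classes follow the brute-force loop: background < th, foreground >= th.
    Assumes `gray` is an array of dtype uint8.
    """
    hist = np.bincount(gray.ravel(), minlength=256)  # pixel count per gray level
    levels = np.arange(256)
    n = hist.sum()

    # For each th: number of pixels and sum of levels strictly below th
    count_b = np.concatenate(([0], np.cumsum(hist)[:-1]))
    sum_b = np.concatenate(([0], np.cumsum(hist * levels)[:-1]))
    count_f = n - count_b
    sum_f = (hist * levels).sum() - sum_b

    # Skip thresholds that leave one class empty
    valid = (count_b > 0) & (count_f > 0)
    mu_b = sum_b[valid] / count_b[valid]
    mu_f = sum_f[valid] / count_f[valid]

    # Between-class variance: omega_b * omega_f * (mu_b - mu_f)^2
    bcv = np.full(256, -1.0)
    bcv[valid] = count_b[valid] * count_f[valid] * (mu_b - mu_f) ** 2 / n ** 2

    # argmax returns the first maximum, like the strict "<" in the loop
    return int(np.argmax(bcv))
```

Calling `otsu_from_histogram(img)` should give the same threshold as the loop, up to ties and floating-point rounding, while touching the image only once to build the histogram.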


Finally, let's compare the original image and its thresholded representation.

In [None]:
plt.subplot(121), plt.imshow(img, cmap='gray')
plt.subplot(122), plt.imshow(img >= opt_th, cmap='gray');

### Questions

* Looking at the computed histogram, could it be considered bimodal?
* Looking at the computed histogram, what binarization threshold would you choose? Why?
* Looking at the resulting (thresholded) image, is the text binarization (detection) good?

> Looking at the computed histogram, could it be considered bimodal?

The histogram has two peaks, at about `130` and `200`, so it is roughly bimodal. However, a mixture of two Gaussians with the same means and reasonable standard deviations drops much lower between its peaks than the histogram does. This suggests that pixels with levels between `150` and `170` may be assigned to the wrong class.

And indeed, the text in the upper part of the page, between the columns, is hardly readable.

In [None]:
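from scipy.stats import norm

# Hand-picked parameters of a two-component Gaussian mixture
weights = [0.4, 0.6]   # Mixing weights
centers = [130, 200]   # Means, placed at the two histogram peaks
sigmas = [18, 10]      # Standard deviations

xx = np.linspace(0, 255, 256)
mixture = np.zeros_like(xx)

# Sum the weighted Gaussian densities
for wk, mu, s in zip(weights, centers, sigmas):
    mixture += wk * norm.pdf(xx, loc=mu, scale=s)

# Histogram of the image with the scaled mixture drawn on top
plt.bar(h[1][0:-1], h[0])
plt.xlabel('Colour'), plt.ylabel('Count')
plt.grid(True)
# Scale the density to pixel counts; the extra factor 1.8 is an ad hoc adjustment
plt.plot(xx, mixture * np.sum(h[0]) / 1.8, 'r')
plt.show()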

> Looking at the computed histogram, what binarization threshold would you choose? Why?

To be honest, just from looking at the histogram I see no reason why any other threshold would be better than `160`. I would try lowering the threshold by `5-10` levels to make the text in the middle more readable.

> Looking at the resulting (thresholded) image, is the text binarization (detection) good?

It is not good everywhere. I would try an adaptive thresholding approach, which is marginally better here:

In [55]:
from skimage.filters import threshold_sauvola

th = threshold_sauvola(img, window_size=15)

In [None]:
plt.subplot(121), plt.imshow(img >= opt_th, cmap='gray')
plt.subplot(122), plt.imshow(img > th, cmap='gray');
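
For intuition about what an adaptive method of this kind computes, here is a minimal NumPy sketch of the Sauvola formula `T = m * (1 + k * (s / R - 1))`, where `m` and `s` are the mean and standard deviation in a window around each pixel. The function name `sauvola_sketch`, the values `k=0.2` and `r=128.0`, and the reflect padding at the borders are assumptions made for illustration; `threshold_sauvola` may differ in its defaults and border handling, so the two results are not guaranteed to match exactly.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sauvola_sketch(gray, window_size=15, k=0.2, r=128.0):
    """Per-pixel Sauvola threshold for a 2-D grayscale image (illustrative only).

    `window_size` must be odd so that the output has the same shape as `gray`.
    """
    pad = window_size // 2
    # Reflect-pad so that every pixel has a full window around it
    padded = np.pad(gray.astype(np.float64), pad, mode='reflect')

    # Local mean and local mean of squares over each window
    shape = (window_size, window_size)
    mean = sliding_window_view(padded, shape).mean(axis=(-2, -1))
    mean_sq = sliding_window_view(padded ** 2, shape).mean(axis=(-2, -1))

    # Local standard deviation; clip tiny negative values caused by rounding
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))

    # Sauvola: the threshold drops below the local mean where contrast is low
    return mean * (1.0 + k * (std / r - 1.0))
```

Unlike the single global value `opt_th`, the result is an array with one threshold per pixel, so `img > sauvola_sketch(img)` adapts to uneven illumination. This version is much slower than the library call and is best tried on a small crop.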