# Show, Don't Tell: Image Search
*Curtis Miller*

In this notebook I will demonstrate image search. By this I mean I will write an algorithm that accepts an image as input and will return a list of "similar" images as output.

Search, in this case, means finding items that are "similar" to some specified item. In practice "similarity" needs to be defined, and different definitions may be useful for different applications. In this notebook, I define "similar" to mean that the color distributions of the pixels in the images are similar. In fact, I will refine the definition and say that similar images have similar color distributions in corresponding regions of the image.

I have a collection of images through which my algorithm will search. The algorithm, when given an image, will compute how similar each image in the collection is to the input image. This may not be the most efficient approach to search; my objective is not efficiency but to demonstrate how we can find "similar" images.

We also need a way to describe the color distribution of an image. Here I bin RGB values (viewed as separate and independent channels) and use the discretized distributions to describe the colors in the image. Given these distributions for two images, we then compare them with a metric known as the $\chi^2$-distance:

$$\sum_{k = 1}^K \frac{(x_k - y_k)^2}{x_k + y_k}$$

where $K$ is the number of bins, $x_k$ is the (normalized) count of the $k^{\text{th}}$ bin of one image, and $y_k$ is the equivalent number for the other image.
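As a quick worked example of the formula, the sketch below evaluates the $\chi^2$-distance for two small made-up histograms. The function name `chisq_distance_demo` and the toy values are illustrative assumptions, not part of the notebook's pipeline; bins that are empty in both histograms are skipped to avoid a 0/0 term.

```python
import numpy as np

def chisq_distance_demo(x, y):
    """Chi-square distance between two normalized histograms (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = x + y
    mask = denom > 0          # skip bins that are empty in both histograms
    return np.sum((x[mask] - y[mask]) ** 2 / denom[mask])

# Toy normalized histograms with K = 4 bins (made-up values)
x = np.array([0.5, 0.5, 0.0, 0.0])
y = np.array([0.0, 0.5, 0.5, 0.0])

d_same = chisq_distance_demo(x, x)   # identical histograms
d_diff = chisq_distance_demo(x, y)   # 0.5 + 0 + 0.5 = 1.0
```

Identical histograms give a distance of 0, and the distance grows as the mass in the bins moves apart, so smaller values mean "more similar".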

Each image is divided into a 3x3 grid, and these $\chi^2$-distances are computed between corresponding cells of the two grids, once per color channel. The sum of these distances (across all cells and channels) is then used as the numeric measure of how dissimilar the two images are: the smaller the sum, the more similar the images.
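The grid step can be sketched on its own. The helper below (`grid_cells`, a hypothetical name) cuts an array into a `div` x `div` grid using integer arithmetic on the image's own height and width. Because the cuts are relative to each image's size, two images of different dimensions still produce matching cells. The blank array is a stand-in for a real image.

```python
import numpy as np

def grid_cells(image, div=3):
    """Yield (row, col, cell) for each cell of a div x div grid (hypothetical helper)."""
    h, w = image.shape[:2]
    for i in range(div):
        for j in range(div):
            r0, r1 = (h * i) // div, (h * (i + 1)) // div
            c0, c1 = (w * j) // div, (w * (j + 1)) // div
            yield i, j, image[r0:r1, c0:c1]

toy = np.zeros((100, 160, 3), dtype=np.uint8)    # stand-in for an RGB image
cells = list(grid_cells(toy))
shapes = [cell.shape for _, _, cell in cells]
total_pixels = sum(s[0] * s[1] for s in shapes)  # cells tile the image exactly
```

Every pixel lands in exactly one cell, even when the height or width is not divisible by 3.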

Our first task is to load in the image data.

In [None]:
import cv2
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import sys, os
import pandas as pd
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (18, 16)

In [None]:
im_dir = "images"
im0 = cv2.cvtColor(cv2.imread(im_dir + "/ocean0.jpg"), cv2.COLOR_BGR2RGB)
plt.imshow(im0)

In [None]:
im1 = cv2.cvtColor(cv2.imread(im_dir + "/ocean1.jpg"), cv2.COLOR_BGR2RGB)
plt.imshow(im1)

Let's now prepare tools for creating color histograms.

In [None]:
def col_256_bins(bincount = 10):
    """Returns a NumPy array of bincount + 1 integer bin edges spanning 0 to 256, defining bincount histogram bins"""
    delta = 256/bincount
    return np.array(np.arange(bincount + 1) * delta, dtype = np.uint16)

col_256_bins()

In [None]:
np.histogram(im0[:, :, 0].flatten(), bins=col_256_bins())

In [None]:
np.histogram(im0[:, :, 0].flatten(), bins=col_256_bins(), density=True)
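A note on `density=True`: NumPy scales the counts so that the histogram integrates to 1 over the bin widths, so the returned values do not themselves sum to 1 unless every bin has width 1. The sketch below checks this on synthetic pixel values (random data standing in for an image channel, and equal-width edges from `np.linspace`, both assumptions made so the cell runs without the image files) and shows one way to get per-bin proportions that do sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
channel = rng.integers(0, 256, size=10_000)   # fake 8-bit channel values
edges = np.linspace(0, 256, 11)               # 10 equal-width bins over [0, 256]

density, _ = np.histogram(channel, bins=edges, density=True)
counts, _ = np.histogram(channel, bins=edges)

area = np.sum(density * np.diff(edges))       # integral of the density over the bins
proportions = counts / counts.sum()           # fraction of pixels in each bin
```

With equal-width bins the two normalizations differ only by a constant factor, and the $\chi^2$-distance scales by that same factor, so the ranking of images is unaffected by which one is used.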

In [None]:
plt.hist(im0[:, :, 0].flatten(), bins=col_256_bins())    # Red

In [None]:
plt.hist(im0[:, :, 1].flatten(), bins=col_256_bins())    # Green

In [None]:
plt.hist(im0[:, :, 2].flatten(), bins=col_256_bins())    # Blue

Now I write tools that compute a distance between two images based on the histograms computed from the images.

In [None]:
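def chisq_dist(x, y):
    """Compute chi-square distance between histograms x and y, given as np.histogram output tuples"""
    binscore = (x[0] - y[0])**2 / (x[0] + y[0])    # [0] selects the counts, since np.histogram returns (counts, edges)
    return np.nansum(binscore)    # Bins empty in both histograms give 0/0 = NaN and are ignored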

In [None]:
histbins = col_256_bins()
chisq_dist(np.histogram(im0[0:266, 0:400, 0].flatten(), histbins, density=True),
           np.histogram(im1[0:180, 0:320, 0].flatten(), histbins, density=True))

In [None]:
def image_dist(x, y, bins=col_256_bins()):
    """Compute the "distance" between images x and y"""
    hx, wx, _ = x.shape
    hy, wy, _ = y.shape
    div = 3     # Number of divisions; a div x div grid
    dist = 0    # Eventual distance measure
    
    # Iterate through the grid
    for i in range(div):
        for j in range(div):
            hdim_x = (int((hx / div) * i), int((hx / div) * (i + 1)))
            wdim_x = (int((wx / div) * j), int((wx / div) * (j + 1)))
            
            hdim_y = (int((hy / div) * i), int((hy / div) * (i + 1)))
            wdim_y = (int((wy / div) * j), int((wy / div) * (j + 1)))
            
            subimage_x = x[hdim_x[0]:hdim_x[1], wdim_x[0]:wdim_x[1], :]
            subimage_y = y[hdim_y[0]:hdim_y[1], wdim_y[0]:wdim_y[1], :]
            
            # Iterate through dimensions
            for d in range(3):
                chan_x = subimage_x[:, :, d].flatten()
                chan_y = subimage_y[:, :, d].flatten()
                
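                # Pass the full np.histogram tuples: chisq_dist indexes [0] itself,
                # so indexing here as well would compare only the first bin
                hist_x = np.histogram(chan_x, bins, density=True)
                hist_y = np.histogram(chan_y, bins, density=True)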
                
                dist += chisq_dist(hist_x, hist_y)
    
    return dist

In [None]:
image_dist(im0, im1)

Let's now prepare to search images in folders to find "similar" images to an input image.

In [None]:
imfiles = sorted(os.listdir(im_dir))    # Reuse im_dir; sort for a reproducible order
imfiles

In [None]:
image_list = [cv2.cvtColor(cv2.imread(im_dir + "/" + i), cv2.COLOR_BGR2RGB) for i in imfiles]
image_dict = dict(zip(imfiles, image_list))

def image_dist_list(image, imlist):
    """Compute the distance from image to every image in imlist, in order"""
    dists = np.zeros(len(imlist))
    for i in range(len(imlist)):
        dists[i] = image_dist(image, imlist[i])
    
    return dists

Let's test the algorithm.

In [None]:
im0_scores = pd.Series(image_dist_list(im0, image_list), index=imfiles)
im0_scores

In [None]:
im0_scores.sort_values()
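Sorting the scores gives the full ranking. In practice one usually wants only the top few matches, and not the query image itself, which has distance 0 to itself when it is part of the collection. Here is a minimal sketch with made-up file names and distances; `top_matches` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_matches(names, dists, query_name=None, k=3):
    """Return the k (name, distance) pairs with the smallest distances (hypothetical helper)."""
    order = np.argsort(dists)                      # indices from most to least similar
    ranked = [(names[i], float(dists[i])) for i in order if names[i] != query_name]
    return ranked[:k]

# Made-up names and distances, standing in for imfiles and image_dist_list(...)
names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
dists = np.array([0.0, 2.5, 0.7, 1.2])

result = top_matches(names, dists, query_name="a.jpg", k=2)
# result == [("c.jpg", 0.7), ("d.jpg", 1.2)]
```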

In [None]:
plt.imshow(image_dict['ocean0.jpg'])

In [None]:
plt.imshow(image_dict['ocean3.jpg'])

In [None]:
plt.imshow(image_dict['ocean4.jpg'])

In [None]:
plt.imshow(image_dict['city1.jpg'])

In [None]:
plt.imshow(image_dict['city7.jpeg'])

In [None]:
plt.imshow(image_dict['forest2.jpeg'])

In [None]:
plt.imshow(image_dict['forest8.jpg'])

Based on this test, it looks like our algorithm is doing a fair job of finding "similar" images according to our criterion. Two ocean images were matched, along with some "similar" cityscapes; after that the images don't seem to bear a strong resemblance.

This system could of course be improved, but this should give the basic idea of what is involved in an image search system, and it seems this simple approach already produces decent results.
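One possible improvement, sketched here under assumptions, is to compute each image's histograms once and store them as a single descriptor vector, so that a search only compares vectors and never recomputes histograms. The helpers `describe` and `chisq_vec` are hypothetical. The sketch uses per-bin proportions rather than `density=True`, equal-width bins from `np.linspace`, and random arrays in place of real images.

```python
import numpy as np

def describe(image, bins=10, div=3):
    """Flatten per-cell, per-channel histograms into one vector (hypothetical helper)."""
    edges = np.linspace(0, 256, bins + 1)
    h, w = image.shape[:2]
    feats = []
    for i in range(div):
        for j in range(div):
            cell = image[(h * i) // div:(h * (i + 1)) // div,
                         (w * j) // div:(w * (j + 1)) // div]
            for d in range(3):
                counts, _ = np.histogram(cell[:, :, d], bins=edges)
                feats.append(counts / counts.sum())   # proportion of pixels per bin
    return np.concatenate(feats)

def chisq_vec(a, b):
    """Chi-square distance between two descriptor vectors, skipping empty bins."""
    denom = a + b
    mask = denom > 0
    return float(np.sum((a[mask] - b[mask]) ** 2 / denom[mask]))

# Synthetic "images": two dark ones of different sizes and one bright one
rng = np.random.default_rng(1)
dark = rng.integers(0, 64, size=(60, 90, 3)).astype(np.uint8)
dark2 = rng.integers(0, 64, size=(45, 75, 3)).astype(np.uint8)
bright = rng.integers(192, 256, size=(60, 90, 3)).astype(np.uint8)

# Build the index once, then score a query against the stored vectors
index = {"dark2": describe(dark2), "bright": describe(bright)}
query = describe(dark)
scores = {name: chisq_vec(query, vec) for name, vec in index.items()}
```

The dark query scores much closer to the other dark image than to the bright one, and the descriptors are the same length regardless of image size.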