Image hashing (or image fingerprinting) is a technique that converts an image to a short alphanumeric string. While this might sound somewhat pointless, it has a number of practical applications and can make a useful feature in certain machine learning models.

Image hashing has two main uses: it lets you detect identical or visually similar images, and it lets you store the image fingerprint so you don’t need to reload each image to check it.

# How does image hashing work?
Hashing functions convert images to short alphanumeric strings that act as a fingerprint of the image. Because the hashes are short text strings, like e7643c330f0f0f0f, they take up very little storage and can be searched and compared quickly.

However, unlike the commonly used MD5 or SHA hashes we use on text strings, image hashes are designed to tolerate images that have been resized, recoloured, or have had noise or watermarks added. Depending on the algorithm, small rotations or crops may also be tolerated, but larger ones usually change the hash. So while MD5 or SHA would give you a completely different hash for each variant, image hashing gives similar or identical hashes for images that have only been manipulated slightly.
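To make the contrast with MD5 concrete, here is a minimal sketch of the idea behind average hashing, written in plain Python on a toy 8x8 greyscale grid. It is a simplification for illustration, not the ImageHash implementation, and the helper names `average_hash_bits` and `bits_to_hex` are made up for this example.

```python
import hashlib


def average_hash_bits(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def bits_to_hex(bits):
    """Pack a list of 0/1 values into a zero-padded hexadecimal string."""
    width = len(bits) // 4
    return format(int("".join(str(bit) for bit in bits), 2), "0{}x".format(width))


# Toy 8x8 "image": bright left half, dark right half
original = [[200] * 4 + [50] * 4 for _ in range(8)]

# The same image with every pixel brightened slightly
brighter = [[min(255, value + 10) for value in row] for row in original]

hash_a = bits_to_hex(average_hash_bits(original))
hash_b = bits_to_hex(average_hash_bits(brighter))

# MD5 works on the raw bytes, so any pixel change alters the digest
md5_a = hashlib.md5(bytes(v for row in original for v in row)).hexdigest()
md5_b = hashlib.md5(bytes(v for row in brighter for v in row)).hexdigest()

print(hash_a, hash_b, hash_a == hash_b)
print(md5_a == md5_b)
```

Because each pixel is only compared with the image's own mean brightness, a uniform brightness shift leaves the bit pattern unchanged, while the MD5 digest of the raw bytes changes completely.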

# Load the packages
For this project we’re using five main packages. Requests is used for fetching remote images, Pillow (PIL) is used for opening the image files, ImageHash is used for calculating image hashes, Pandas is used for displaying the results in dataframes, and iPyPlot is used for displaying the sample images.

In [5]:
import requests
from PIL import Image
import imagehash
import pandas as pd
import ipyplot

# Define the images to hash
Next, create a Python list of image URLs to hash. For demonstration purposes, I’ve included a selection of Land Rover Defender images from an eBay listing. Most of these show different but fairly similar views, and a couple are the same picture saved at a different size.

In [6]:
images = ['https://i.ebayimg.com/images/g/vOUAAOSwVHle64yO/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/jN8AAOSwxMle64yY/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/3p8AAOSwk2Je64ym/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/qqYAAOSweNle64zN/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/cnkAAOSw~n9e64za/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/3p8AAOSwk2Je64ym/s-l64.jpg',
         'https://i.ebayimg.com/images/g/qqYAAOSweNle64zN/s-l64.jpg']

In [7]:
ipyplot.plot_images(images)

# Create the image hashes
To create the image hashes and assess the performance of the different hashing algorithms, we’ll open each of the images in the list and hash the image using the average hash, perceptual hash, difference hash, Haar wavelet hash, and HSV color hash algorithms. We’ll store the results in a Pandas dataframe and print it out.

In [8]:
rows = []

for url in images:
    # Stream the remote image and open it with Pillow
    response = requests.get(url, stream=True)
    response.raise_for_status()
    file = Image.open(response.raw)

    rows.append({
        'image': url,
        'ahash': imagehash.average_hash(file),
        'phash': imagehash.phash(file),
        'dhash': imagehash.dhash(file),
        'whash': imagehash.whash(file),
        'colorhash': imagehash.colorhash(file),
    })

# DataFrame.append() was removed in pandas 2.0, so build the dataframe in one go
df = pd.DataFrame(rows, columns=['image','ahash','phash','dhash','whash','colorhash'])

If you check the results below, you’ll see that some of the images have identical hashes from the average hashing, perceptual hashing, and Haar wavelet hashing algorithms. That is expected, because they are the same picture at different sizes. Where a scaled copy was also cropped, the hashes can differ very slightly.

In [9]:
df.head()

| | image | ahash | phash | dhash | whash | colorhash |
|---|---|---|---|---|---|---|
| 0 | https://i.ebayimg.com/images/g/vOUAAOSwVHle64y... | 333e9f8981c3c3e3 | ac9ec7216b61b076 | e7643c330f0f0f0f | 033e9f9981c3c3e3 | 072c0040000 |
| 1 | https://i.ebayimg.com/images/g/jN8AAOSwxMle64y... | ffc1818183cfffff | bc6bc3903d8f9264 | 301b07373f3d6338 | ff8181818185e3ff | 07000000000 |
| 2 | https://i.ebayimg.com/images/g/3p8AAOSwk2Je64y... | fe86c38181c1c3c7 | f4959690dbc385a5 | f436060337978f0e | fe86c381c1c1c3e7 | 06280040001 |
| 3 | https://i.ebayimg.com/images/g/qqYAAOSweNle64z... | 2700017303030f9f | af41943dc186ad3d | d6cac3c3c36f7f3f | 7702037323838fbf | 06080009000 |
| 4 | https://i.ebayimg.com/images/g/cnkAAOSw~n9e64z... | ffc1818183cfffff | bc6bc3903d8f9264 | 301b07373f3d6338 | ff8181818185e3ff | 07000000000 |


Running the Pandas duplicated() function flags every row whose value has already appeared earlier in the column (the first occurrence is marked False). Image 4 has the same hashes as image 1. Image 5 is a rescaled version of image 2, so both share the aHash fe86c38181c1c3c7, while image 6 is a rescaled version of image 3, so both share the aHash 2700017303030f9f.

In [10]:
df.ahash.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
6     True
Name: ahash, dtype: bool

In [11]:
df.phash.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: phash, dtype: bool

In [12]:
df.dhash.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: dhash, dtype: bool

In [13]:
df.whash.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: whash, dtype: bool

In [14]:
df.colorhash.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: colorhash, dtype: bool
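duplicated() tells us which rows repeat an earlier value, but not which earlier row they match. As a rough sketch, the rows can be grouped by hash in plain Python. The average hashes below are copied from the table for rows 0 to 4 and from the description above for rows 5 and 6, and `group_duplicates` is a hypothetical helper name.

```python
from collections import defaultdict

# Average hashes by row index, copied from the results above
ahashes = [
    "333e9f8981c3c3e3",  # 0
    "ffc1818183cfffff",  # 1
    "fe86c38181c1c3c7",  # 2
    "2700017303030f9f",  # 3
    "ffc1818183cfffff",  # 4
    "fe86c38181c1c3c7",  # 5
    "2700017303030f9f",  # 6
]


def group_duplicates(hashes):
    """Map each hash that occurs more than once to the row indexes sharing it."""
    groups = defaultdict(list)
    for index, value in enumerate(hashes):
        groups[value].append(index)
    return {value: rows for value, rows in groups.items() if len(rows) > 1}


duplicate_groups = group_duplicates(ahashes)
for value, rows in duplicate_groups.items():
    print(value, rows)
```

With a dataframe, grouping on the hash column (after converting it to strings) should give the same pairing.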

# Identifying duplicate images
Now we’ll create a function to identify whether a new image is a duplicate of another image already in our list. It takes the URL of a new image, loads the remote image, calculates its average hash, and compares the string form of that hash to the average hashes stored for the known images. Running the function on an image which is already in our dataset returns True.

In [15]:
def find_duplicate_images(df, ahash_column, image_url):
    """Determine whether a new image is a duplicate of 
    an existing image using average hashing.
    
    :param df: Pandas dataframe containing image hashes
    :param ahash_column: Name of column containing average hashes
    :param image_url: URL of new image
    
    :return: True if the image is a duplicate, or False if unique
    """
    
    file = Image.open(requests.get(image_url, stream=True).raw)
    ahash = str(imagehash.average_hash(file))

    # Compare whole strings; str.contains() would treat the hash as a
    # regular expression and would also match substrings
    matches = df[ahash_column].astype(str) == ahash
    
    return bool(matches.any())

In [16]:
find_duplicate_images(df, 'ahash', 'https://i.ebayimg.com/images/g/vOUAAOSwVHle64yO/s-l1600.jpg')

True

# Identifying visually similar images
We can use a similar approach to create a reverse image search algorithm to find visually similar images. We’ll make a function which uses the Hamming distance to do this. The Hamming distance is a metric for comparing two strings of equal length: it counts the positions at which they differ. Here we apply it to the hexadecimal hash strings, so a lower score means the new image is more similar to a known one.

By passing in the URL of an unseen image, we can calculate its average hash or ahash, convert it to a string, and then use a lambda function to calculate its Hamming distance from each of the previously seen images in our dataframe. By sorting the values in the dataframe, we get a list of images ranked by their visual similarity to the new image. A score of 0 means the images match exactly.
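One detail worth seeing in code is that comparing the hexadecimal strings character by character is not the same as comparing the underlying bits. The sketch below computes both for two of the average hashes from the results table. The function names `char_hamming` and `bit_hamming` are illustrative and do not come from any of the packages used here.

```python
def char_hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    return sum(x != y for x, y in zip(a, b))


def bit_hamming(hex_a, hex_b):
    """Count the differing bits between two equal-length hex strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must be the same length")
    # XOR leaves a 1 wherever the two hashes disagree
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


query = "fe86c38181c1c3c7"  # aHash of image 2
other = "333e9f8981c3c3e3"  # aHash of image 0

char_distance = char_hamming(query, other)
bit_distance = bit_hamming(query, other)

print(char_distance, bit_distance)
```

The character-level count is what the string comparison in this section measures. The bit-level count is finer grained, since one differing hex character can hide between one and four differing bits. As far as I know, subtracting two ImageHash objects returns this bit-level distance directly.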

In [19]:
import distance

In [20]:
def find_similar_images(df, ahash_column, image_url):
    """Compare an unseen image to previously seen images and return
    a list of images ranked by their similarity according to the 
    Hamming distance of their average hash or ahash.
    
    :param df: Pandas dataframe containing image and ahash columns
    :param ahash_column: Name of ahash column
    :param image_url: URL of the unseen image to hash and compare
   
    :return: Pandas dataframe of images sorted from most to least similar
    """
    
    file = Image.open(requests.get(image_url, stream=True).raw)
    ahash = str(imagehash.average_hash(file))

    # Work on a copy so the caller's dataframe is not modified
    result = df[['image', ahash_column]].copy()
    result['hamming_distance'] = result[ahash_column].apply(
        lambda x: distance.hamming(str(x), ahash))

    return result.sort_values(by='hamming_distance', ascending=True)

In [21]:
df = find_similar_images(df, 'ahash', 'https://i.ebayimg.com/images/g/3p8AAOSwk2Je64ym/s-l1600.jpg')

In [22]:
df.head(10)

| | image | ahash | hamming_distance |
|---|---|---|---|
| 2 | https://i.ebayimg.com/images/g/3p8AAOSwk2Je64y... | fe86c38181c1c3c7 | 0 |
| 0 | https://i.ebayimg.com/images/g/vOUAAOSwVHle64y... | 333e9f8981c3c3e3 | 10 |
| 1 | https://i.ebayimg.com/images/g/jN8AAOSwxMle64y... | ffc1818183cfffff | 11 |
| 4 | https://i.ebayimg.com/images/g/cnkAAOSw~n9e64z... | ffc1818183cfffff | 11 |
| 5 | https://i.ebayimg.com/images/g/3p8AAOSwk2Je64y... | ffc1818183cfffff | 11 |
| 3 | https://i.ebayimg.com/images/g/qqYAAOSweNle64z... | 2700017303030f9f | 16 |
| 6 | https://i.ebayimg.com/images/g/qqYAAOSweNle64z... | 2700017303030f9f | 16 |
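A ranked list is most useful with a cutoff, so that only close matches are returned. The sketch below applies one to the four distinct average hashes from rows 0 to 3 of the earlier results. The `max_distance` value of 10 is an arbitrary assumption for illustration rather than a recommended threshold, and `rank_matches` is a hypothetical helper.

```python
def char_hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_matches(query, known, max_distance):
    """Return (row, distance) pairs within max_distance, closest first."""
    scored = [(row, char_hamming(query, value)) for row, value in known.items()]
    close = [item for item in scored if item[1] <= max_distance]
    return sorted(close, key=lambda item: item[1])


# Average hashes keyed by row index, copied from the results above
known_hashes = {
    0: "333e9f8981c3c3e3",
    1: "ffc1818183cfffff",
    2: "fe86c38181c1c3c7",
    3: "2700017303030f9f",
}

matches = rank_matches("fe86c38181c1c3c7", known_hashes, max_distance=10)
print(matches)
```

A sensible cutoff depends on the hash algorithm, the hash length, and how much variation you want to accept, so it is worth tuning on your own images.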


By using image hashing we can store a compact fingerprint for each image in our database, then identify identical or visually similar images by comparing the hash of a new image with the hashes we calculated earlier. The hashes are small and quick to search, and in this example the average hash picked out the rescaled duplicates.