Detect duplicate images in a dataset. Duplicates introduce bias into the trained model and reduce its ability to generalize.

In [1]:
# Imports

import cv2
from imutils import paths
import numpy as np
import os

We are going to detect duplicates with image hashing, which creates a numerical representation of each image.

Image hashing, or perceptual hashing, is the process of

-> Examining the contents of the image

-> Constructing a hash value that identifies the image based on those contents.

Cryptographic hashing algorithms such as MD5 and SHA-1 produce a different hash even when the color of a single pixel changes.
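
To make this concrete, here is a minimal sketch that hashes two hypothetical 4-pixel "images" given as raw bytes. The byte values and variable names are made up for illustration; only `hashlib.md5` comes from the standard library. The two inputs differ by one in a single pixel, yet the digests do not resemble each other, which is why such hashes cannot tell us that two images are nearly the same.

```python
import hashlib

# Two hypothetical 4-pixel grayscale "images" as raw bytes.
# Only the last pixel differs, and only by 1.
pixels_a = bytes([10, 20, 30, 40])
pixels_b = bytes([10, 20, 30, 41])

digest_a = hashlib.md5(pixels_a).hexdigest()
digest_b = hashlib.md5(pixels_b).hexdigest()

# Count how many of the hex characters differ between the two digests.
differing = sum(1 for x, y in zip(digest_a, digest_b) if x != y)

print(digest_a)
print(digest_b)
print(f"{differing} of {len(digest_a)} hex characters differ")
```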


In perceptual hashing, we want similar images to have similar hash values as well.


We are going to use the dHash algorithm to detect duplicates in the dataset.

# dHash

The dHash algorithm is fairly simple and has four self-explanatory steps:

Step 1) Convert the image to grayscale, to discard color information

Grayscale lets us hash the image faster, since we only have to examine one channel. It also lets us match images that are identical but have slightly altered color spaces (since color information has been removed).

Step 2) Resize the image to 9 × 8 pixels (9 columns wide, 8 rows high), ignoring the aspect ratio. (The aspect ratio is ignored so that images match regardless of their original spatial dimensions.)

Step 3) Compute the difference

So, why in the world would we resize to 9 × 8?

Well, keep in mind the name of the algorithm we are implementing: difference hash (dhash).

The difference hash algorithm works by computing the difference (i.e., relative gradients) between adjacent pixels.

If we take an input image with 9 pixels per row and compute the difference between horizontally adjacent pixels, we end up with 8 differences per row. Eight rows of eight differences (i.e., 8 × 8) give 64 values, which become our 64-bit hash.


Step 4) Build the hash

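Given a difference image D built from the pixels P of the resized image, each entry of D is set by the following test: 1 if P[x] > P[x + 1], else 0. (The dhash function below tests the opposite direction, P[x + 1] > P[x]. Either convention works as long as the same one is used for every image.)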
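
Steps 3 and 4 can be traced by hand on a tiny grid. The sketch below is plain Python and uses a hypothetical 2-row, 3-column grid in place of the real 8-row, 9-column image; the helper names `difference_bits` and `bits_to_int` are illustrative and not part of any library. It applies the test P[x] > P[x + 1] to each adjacent pair and packs the resulting bits with the same `2 ** i` ordering as the dhash function in this notebook.

```python
def difference_bits(pixels):
    """Return one bit per horizontally adjacent pixel pair, row by row.

    A bit is 1 if the left pixel is greater than the right pixel, else 0.
    """
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits


def bits_to_int(bits):
    """Pack bits into an integer, treating bits[i] as the coefficient of 2**i."""
    return sum(2 ** i for i, bit in enumerate(bits) if bit)


# Hypothetical grid: 2 rows of 3 pixels gives 2 rows of 2 differences (4 bits).
tiny = [
    [50, 40, 60],
    [10, 10, 5],
]

bits = difference_bits(tiny)
print(bits)               # [1, 0, 0, 1]
print(bits_to_int(bits))  # 9
```

With 8 rows of 9 pixels the same procedure yields 8 × 8 = 64 bits.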



# Comparing difference hashes

Typically we use the Hamming distance to compare hashes. The Hamming distance measures the number of bits in two hashes that are different.

Two hashes with a Hamming distance of zero are identical (there are no differing bits), which implies that the two images are identical or perceptually similar as well.

Dr. Neal Krawetz of HackerFactor suggests that hashes with differences > 10 bits are most likely different while Hamming distances between 1 and 10 are potentially a variation of the same image. In practice you may need to tune these thresholds for your own applications and corresponding datasets.
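
For integer hashes like the ones produced in this notebook, the Hamming distance can be sketched as an XOR followed by a bit count. The function names and the `max_bits` parameter below are illustrative assumptions; the default of 10 simply reflects the threshold quoted above and should be tuned for your own data.

```python
def hamming_distance(hash_a, hash_b):
    """Count the bit positions at which two integer hashes differ."""
    # XOR leaves a 1 bit exactly where the two hashes disagree.
    return bin(hash_a ^ hash_b).count("1")


def possibly_same_image(hash_a, hash_b, max_bits=10):
    """Return True if the hashes differ in at most max_bits bits (assumed threshold)."""
    return hamming_distance(hash_a, hash_b) <= max_bits


print(hamming_distance(0b1011, 0b1001))  # 1
print(hamming_distance(0b1111, 0b0000))  # 4
print(possibly_same_image(0b1011, 0b1001))  # True
```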

In [2]:
def dhash(image, hashSize=8):
    # convert the image to grayscale and resize the grayscale image,
    # adding a single column (width) so we can compute the horizontal gradient

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (hashSize + 1, hashSize))
    
    # compute the (relative) horizontal gradient between adjacent column pixels
    
    diff = resized[:, 1:] > resized[:, :-1]
    
    # convert the difference image to a hash and return it
    
    return sum([2 ** i for (i, v) in enumerate(diff.flatten()) if v])

In [3]:
imagePaths = list(paths.list_images('./dataset/'))
hashes = {}

# loop over our image paths

for imagePath in imagePaths:
    # load the input image; cv2.imread returns None if the file cannot be read
    image = cv2.imread(imagePath)
    if image is None:
        continue
    # compute the hash
    h = dhash(image)
    # grab all image paths with that hash, add the current image
    # path to it, and store the list back in the hashes dictionary
    p = hashes.get(h, [])
    p.append(imagePath)
    hashes[h] = p
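
Once the dictionary is built, any hash that maps to more than one path marks a group of exact-hash duplicates. The following is a minimal sketch of how such groups might be listed; `duplicate_groups` is a hypothetical helper, and the example dictionary and file names are made up to mirror the shape of `hashes`.

```python
def duplicate_groups(hashes):
    """Return only the path lists that contain more than one image."""
    return [group for group in hashes.values() if len(group) > 1]


# Hypothetical dictionary in the same shape as `hashes` (hash -> list of paths).
example = {
    9: ["./a.jpg", "./a_copy.jpg"],
    12: ["./b.jpg"],
}

for group in duplicate_groups(example):
    # Treat the first path as the one to keep and the rest as duplicates.
    keep, *extras = group
    print(f"keep {keep}; duplicates: {extras}")
```

This only catches images whose hashes are identical. Near-duplicates with slightly different hashes would need a Hamming-distance comparison instead.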

In [4]:
# All the hashes and their corresponding images.

hashes

{10759989518710190966: ['./dataset/00000243.jpg'],
 2111365950431910193: ['./dataset/00000827.jpg'],
 2968350046898703313: ['./dataset/00000631.jpg'],
 1616449088627773713: ['./dataset/00000816.jpg'],
 7443708506176673229: ['./dataset/00000501.jpg'],
 9631857315479889769: ['./dataset/00000438.jpg'],
 8676299376825512104: ['./dataset/00000496.jpg'],
 3032091204763964520: ['./dataset/00000790.jpg'],
 669322114467932787: ['./dataset/00000918.jpg'],
 17219604635314433308: ['./dataset/00000169.jpg'],
 8799115945459290201: ['./dataset/00000601.jpg'],
 3114532889054481163: ['./dataset/00000195.jpg'],
 6149259078788845340: ['./dataset/00000447.jpg'],
 5387881369107347042: ['./dataset/00000371.jpg'],
 2964251922798372395: ['./dataset/00000080.jpg'],
 7030613359754327978: ['./dataset/00000319.jpg'],
 6670422650237953830: ['./dataset/00000271.jpg'],
 13462643074793641303: ['./dataset/00000278.jpg'],
 14623440295126545702: ['./dataset/00000016.jpg'],
 368957821463000957: ['./dataset/00000906.jpg']