# EDA - Finding Similar Images in Train and Test Datasets

In this notebook, we use [imagededup](https://github.com/idealo/imagededup) to find similar images in the test dataset of the [Peking University/Baidu - Autonomous Driving](https://www.kaggle.com/c/pku-autonomous-driving) competition.

### Install imagededup

In [None]:
!pip install imagededup

### Import modules

In [None]:
import matplotlib.pyplot as plt
import imagededup
from imagededup.methods import PHash
from imagededup.utils import plot_duplicates

In [None]:
image_dir='../input/pku-autonomous-driving/test_images/'

## Find similar images

In [None]:
phasher = PHash()
duplicates = phasher.find_duplicates(image_dir=image_dir, scores=True, max_distance_threshold=3)

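__Note:__ `max_distance_threshold` is the maximum distance between the hashes of two images for them to be considered similar. The higher the value, the more tolerant the search is to differences between images. With `scores=True`, each match is returned together with its distance.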
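To make the threshold more concrete, here is a minimal sketch of how a distance between two perceptual hashes can be computed. It assumes the hashes are 64-bit values written as 16-character hex strings and compared with the Hamming distance (the number of differing bits), which is our understanding of how imagededup works. The helper `hamming_distance` and the two hash values are made up for illustration and do not come from the dataset.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Hypothetical 64-bit hashes (16 hex digits), not taken from the dataset.
hash_a = "9f172786e71f1e00"
hash_b = "9f172786e71f1e03"  # differs from hash_a in its two lowest bits

distance = hamming_distance(hash_a, hash_b)
is_similar = distance <= 3  # same cutoff as max_distance_threshold=3
print(distance, is_similar)
```

With a threshold of 3, two images are reported as similar only if at most 3 of the 64 hash bits differ, so the matches should be very close to each other visually.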

Below we list the first 15 images for which imagededup found similar images. To see the full list, display the content of the `duplicates` variable.

In [None]:
# Keep only the images with at least one match and show the first 15 of them.
dict([(name, matches) for name, matches in duplicates.items() if matches][:15])

In [None]:
print('There are', len([x for x in duplicates if duplicates[x] != []]), 'images with at least one similar image, out of', len(duplicates), 'images.')

Wow! It seems there are a lot of similar images in this dataset. Let's have a look.

### Visualize results

First, we visualise an image for which imagededup found two similar images at a distance of 0.

In [None]:
plt.figure(figsize=(20,20))
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates, filename='ID_5bf531cf3.jpg')

imagededup is correct: apart from some image pre-processing and enhancement, these 3 images are identical.

Now we'll visualise an image for which imagededup found two similar images at a distance of 0 and two others at a distance of 2.

In [None]:
plt.figure(figsize=(20,20))
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates, filename='ID_ca20646c5.jpg')

The results are impressive:
 - it found two duplicates of `ID_ca20646c5.jpg` that differ only in contrast
 - it found two other images (identical to each other) which were probably taken at the same place as `ID_ca20646c5.jpg` but at a slightly different time.
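The distinction made above between matches at distance 0 and matches at a larger distance can also be read directly from the scores. The sketch below assumes the map returned with `scores=True` has the form `{filename: [(other_filename, distance), ...]}`. The helper `split_by_distance` and the toy filenames are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Tuple

DuplicateMap = Dict[str, List[Tuple[str, int]]]


def split_by_distance(duplicate_map: DuplicateMap, filename: str):
    """Return (exact, near) matches for one image; exact means distance 0."""
    matches = duplicate_map.get(filename, [])
    exact = sorted(name for name, score in matches if score == 0)
    near = sorted(name for name, score in matches if score > 0)
    return exact, near


# Toy map with hypothetical filenames, shaped like the `duplicates` variable.
toy_duplicates = {
    "a.jpg": [("b.jpg", 0), ("c.jpg", 0), ("d.jpg", 2), ("e.jpg", 2)],
    "b.jpg": [("a.jpg", 0), ("c.jpg", 0)],
    "f.jpg": [],
}

exact, near = split_by_distance(toy_duplicates, "a.jpg")
print("distance 0:", exact)
print("distance > 0:", near)
```

In the notebook, calling `split_by_distance(duplicates, 'ID_ca20646c5.jpg')` should separate the contrast-only duplicates from the images taken a moment later, provided the map has the shape assumed here.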

# Conclusions

1. [imagededup](https://github.com/idealo/imagededup) is a powerful tool to have in your toolbox for tasks such as finding similar images in a dataset or finding duplicated images to avoid leaks from the train to the test dataset.

2. The test dataset has duplicated images. Some image filters appear to have been applied to them, which hides this duplication.

3. The test dataset also contains highly correlated images because they were taken at the same location within a short period of time.
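One possible way to act on these conclusions is to group images that are linked by a similarity match, so that each group can be handled as a unit (for example, kept together on the same side of a split). The sketch below is not part of imagededup: `duplicate_groups` is a hypothetical helper that uses a small union-find, and it assumes the `{filename: [(other_filename, distance), ...]}` shape of the `duplicates` variable.

```python
from typing import Dict, List, Set, Tuple


def duplicate_groups(duplicate_map: Dict[str, List[Tuple[str, int]]]) -> List[Set[str]]:
    """Group filenames into connected components of the 'is similar to' relation."""
    parent = {name: name for name in duplicate_map}

    def find(name: str) -> str:
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    # Merge every image with each of its matches.
    for name, matches in duplicate_map.items():
        for other, _score in matches:
            root_a, root_b = find(name), find(other)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[str, Set[str]] = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    # Keep only groups that contain more than one image.
    return [group for group in groups.values() if len(group) > 1]


# Toy map with hypothetical filenames.
toy_duplicates = {
    "a.jpg": [("b.jpg", 0)],
    "b.jpg": [("a.jpg", 0), ("c.jpg", 2)],
    "c.jpg": [("b.jpg", 2)],
    "d.jpg": [],
}
groups = duplicate_groups(toy_duplicates)
print(groups)
```

Note that grouping is transitive: `a.jpg` and `c.jpg` end up in the same group through `b.jpg` even though they are not listed as direct matches of each other.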

Thanks for reading! 