# Find and remove duplicate images

Issue https://github.com/MVC-Datasets/MVC/issues/4 reports that 261 images in the dataset have matching 'hash' values despite having different filenames.

This means that there are duplicate images in our dataset that we need to delete before creating our model.
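
For context, a duplicate report of this kind can be reproduced by hashing each file's bytes and grouping files by digest. The sketch below assumes MD5, which is consistent with the 32-character hex values in `duplicates.csv` but is not confirmed anywhere in this notebook. `find_duplicate_hashes` is a hypothetical helper, not part of the dataset tooling.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_hashes(folder):
    """Group the files in `folder` by the MD5 digest of their bytes.

    Returns {digest: [filenames]} for digests shared by two or more files.
    MD5 is an assumption; this helper is hypothetical.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only the digests that occur more than once
    return {h: names for h, names in groups.items() if len(names) > 1}
```

Files with identical bytes always share a digest, so every group returned is a set of candidate duplicates that can then be checked visually.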

The notebook follows these steps:

   1. Explore the duplicate files and make the transformations needed
   2. Identify the actual filename for each duplicate
   3. Copy all duplicates into a new folder for visual exploration
   4. Select and remove only the duplicate files

In [1]:
import os
from os import listdir
from shutil import copy2

import pandas as pd
import numpy as np

In [2]:
attribute_labels = "attribute_labels.json"  #~590Mb
mvc_info = "mvc_info.json"                  #~140Mb
duplicates = "duplicates.csv"

img_mvc = "./images/mvc/" #path to all images
img_selected = "./images/images_selected/"

In [3]:
df_dup = pd.read_csv(duplicates)
df_info = pd.read_json(mvc_info)
df_labels = pd.read_json(attribute_labels)

In [4]:
df_dup.describe()

Unnamed: 0,hash,link
count,261,261
unique,125,261
top,99d9d6ecfd4f4895696c29fb5cd103cd,z/2/5/1/6/8/4/2516846-3-4x.jpg
freq,3,1


It seems there are 125 unique images and 136 suspected duplicates (261 rows minus 125 distinct hashes).
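
The same arithmetic can be checked directly instead of being read off `describe()`. This is a minimal sketch on a toy frame. The hashes and links are invented for illustration, and `value_counts` additionally shows how many files share each hash.

```python
import pandas as pd

# Toy stand-in for df_dup; the hashes and links are invented.
toy = pd.DataFrame({
    "hash": ["aaa", "aaa", "aaa", "bbb", "bbb"],
    "link": ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg"],
})

n_rows = len(toy)                          # all suspected files
n_unique = toy["hash"].nunique()           # distinct images
n_extra = n_rows - n_unique                # copies beyond the first of each hash
group_sizes = toy["hash"].value_counts()   # number of files per hash

print(n_rows, n_unique, n_extra)
print(group_sizes)
```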

In [5]:
df_dup.head(3)

Unnamed: 0,hash,link
0,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/4/2516846-2-4x.jpg
1,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/4/2516849-2-4x.jpg
2,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/5/2516853-2-4x.jpg


In [6]:
# create a new column with the filename pattern we need:
# drop the 14-character folder prefix and the '-4x.jpg' suffix, then turn '-' into '_v'
# when the MVC dataset was created, images with '_vp' were renamed to '_v0', so we do the same

df_dup['filename'] = df_dup['link'].str[14:-7].str.replace('-', '_v')
df_dup['filename'] = df_dup['filename'].str.replace('_vp', '_v0')

# this dataset has 4 views for each product (name_v0, name_v1, name_v2, etc.)
# after formatting the filename column we sort it so the '_v0' view always comes first within each hash
df_dup.sort_values(by=['hash', 'filename'], inplace=True)
df_dup.reset_index(drop=True, inplace=True)
df_dup.head(3)

Unnamed: 0,hash,link,filename
0,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/4/2516846-2-4x.jpg,2516846_v2
1,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/4/2516849-2-4x.jpg,2516849_v2
2,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/5/2516853-2-4x.jpg,2516853_v2


The 'filename' column of **df_labels** holds the full image names, so it is the column we match against to locate the duplicate images.

In [7]:
df_labels[['filename']].head(3)

Unnamed: 0,filename
0,p7258521_s3163710_v0
1,p7258521_s3163710_v1
2,p7258521_s3163710_v2


### Extract all duplicates into a folder

The code below copies all suspected duplicate images into the *'./images/image-duplicates'* folder for a visual check.

In [8]:
os.makedirs("./images/image-duplicates", exist_ok=True)
img_dup = "./images/image-duplicates/" #new folder that will contain all duplicate images

In [9]:
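label_names = df_labels['filename'].tolist()
for i in df_dup['filename']:
    for k in label_names:
        if i in k:  # the short name from df_dup is contained in the full filename
            copy2(img_mvc + k + '.jpg', img_dup)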
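
A variant of this copy step can work from the files actually on disk and list the folder only once. This is a sketch, not the notebook's original method. `copy_matching` is a hypothetical helper, and it assumes that a substring match on the file name is enough to identify a duplicate, as in the deletion loop later in this notebook.

```python
import os
from shutil import copy2


def copy_matching(names, src_dir, dst_dir):
    """Copy every file in src_dir whose name contains one of `names`."""
    os.makedirs(dst_dir, exist_ok=True)
    names = list(names)
    copied = []
    for fname in os.listdir(src_dir):      # list the folder a single time
        if any(name in fname for name in names):
            copy2(os.path.join(src_dir, fname), dst_dir)
            copied.append(fname)
    return sorted(copied)
```

Calling `copy_matching(df_dup['filename'], img_mvc, img_dup)` would return the list of copied files, which can be compared against the 261 rows of `df_dup`.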

### Remove all duplicates

Now we split the dataframe in two: the first image of each hash, which we keep, and the remaining images, which we delete.

In [8]:
df_keep = df_dup.drop_duplicates(subset=['hash'], keep='first')
df_keep.head()

Unnamed: 0,hash,link,filename
0,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/4/2516846-2-4x.jpg,2516846_v2
3,02138c173e87f49b963ccc6d1f603fb0,z/2/5/2/4/8/0/2524801-2-4x.jpg,2524801_v2
5,057cddb955e4b288d32aad576011b9fa,z/2/5/1/9/2/3/2519235-2-4x.jpg,2519235_v2
7,0609a47a7ce0a3399503dd30228a4dc2,z/2/5/2/4/8/3/2524835-3-4x.jpg,2524835_v3
9,090d4f54346a2191f79f553165813766,z/2/5/2/4/4/8/2524482-3-4x.jpg,2524482_v3


In [9]:
# every row after the first occurrence of its hash is a duplicate to delete
df_delete = df_dup[df_dup.duplicated(subset=['hash'], keep='first')]
df_delete.head()

Unnamed: 0,hash,link,filename
1,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/4/2516849-2-4x.jpg,2516849_v2
2,0168b5eced9b063bb3a0b96940b21512,z/2/5/1/6/8/5/2516853-2-4x.jpg,2516853_v2
4,02138c173e87f49b963ccc6d1f603fb0,z/2/5/2/4/8/1/2524817-2-4x.jpg,2524817_v2
6,057cddb955e4b288d32aad576011b9fa,z/2/5/1/9/2/7/2519271-2-4x.jpg,2519271_v2
8,0609a47a7ce0a3399503dd30228a4dc2,z/2/5/2/4/8/6/2524865-3-4x.jpg,2524865_v3


In [11]:
print('Number of images to keep:', len(df_keep))
print('Number of images to delete:', len(df_delete))

Number of images to keep: 125
Number of images to delete: 136
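
Before deleting anything it is worth checking that the two parts do not overlap and together cover every row. The sketch below runs that check on a toy frame with invented hashes and filenames.

```python
import pandas as pd

# Toy stand-in for the sorted df_dup; values are invented.
toy = pd.DataFrame({
    "hash": ["aaa", "aaa", "bbb", "bbb", "bbb"],
    "filename": ["10_v2", "11_v2", "20_v3", "21_v3", "22_v3"],
})

is_extra = toy.duplicated(subset=["hash"], keep="first")
toy_keep = toy[~is_extra]      # first row of each hash
toy_delete = toy[is_extra]     # every later row of each hash

# The two parts must not overlap and must cover every row.
assert len(toy_keep) + len(toy_delete) == len(toy)
assert set(toy_keep.index).isdisjoint(toy_delete.index)
# Every row marked for deletion still has a kept row with the same hash.
assert set(toy_delete["hash"]) <= set(toy_keep["hash"])
```

The same three assertions can be run on `df_keep` and `df_delete` against `df_dup`.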


To remove the images from *"./images/images_selected/"* instead, replace `img_mvc` with `img_selected` in the loop below.

In [32]:
%%time
for i in df_delete['filename']:
    for k in listdir(img_mvc):
        if i in k:
            #copy2(img_mvc + k , img_dup) # uncomment to preview the 136 images that will be deleted
            os.remove(img_mvc + k)       # comment out this line if you only want the preview

Wall time: 4.55 s
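
Because `os.remove` cannot be undone, a dry run that only reports the matches is a safer first pass. The sketch below is one way to structure it. `remove_matching` and its `dry_run` flag are hypothetical names, and the substring match is the same assumption the loop above makes.

```python
import os


def remove_matching(names, folder, dry_run=True):
    """Return the files in `folder` whose name contains one of `names`.

    The files are deleted only when dry_run is False.
    """
    names = list(names)
    matches = sorted(
        f for f in os.listdir(folder)
        if any(name in f for name in names)
    )
    if not dry_run:
        for f in matches:
            os.remove(os.path.join(folder, f))
    return matches
```

A typical use would be to call `remove_matching(df_delete['filename'], img_mvc)` first, confirm that it returns 136 files, and only then repeat the call with `dry_run=False`.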


In [10]:
# we save this dataframe so we can use it to delete the duplicates in any other notebook
df_delete.to_csv('duplicates_delete.csv')
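
In another notebook the saved file can be read back with `pd.read_csv`. Since `to_csv` writes the index as the first, unnamed column by default, passing `index_col=0` restores it as the index. The sketch below round-trips a toy frame with invented values through a temporary file.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_delete; values are invented.
toy_delete = pd.DataFrame(
    {"hash": ["aaa", "bbb"], "filename": ["11_v2", "21_v3"]},
    index=[1, 4],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "duplicates_delete.csv")
    toy_delete.to_csv(path)                      # index goes into the first column
    restored = pd.read_csv(path, index_col=0)    # read that column back as the index

names_to_delete = restored["filename"].tolist()
```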