<!-- Autogenerated by `scripts/make_examples.py` -->
<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/voxel51/fiftyone-examples/blob/master/examples/image_deduplication.ipynb">
            <img src="https://user-images.githubusercontent.com/25985824/104791629-6e618700-5769-11eb-857f-d176b37d2496.png" height="32" width="32">
            Try in Google Colab
        </a>
    </td>
    <td>
        <a target="_blank" href="https://nbviewer.jupyter.org/github/voxel51/fiftyone-examples/blob/master/examples/image_deduplication.ipynb">
            <img src="https://user-images.githubusercontent.com/25985824/104791634-6efa1d80-5769-11eb-8a4c-71d6cb53ccf0.png" height="32" width="32">
            Share via nbviewer
        </a>
    </td>
    <td>
        <a target="_blank" href="https://github.com/voxel51/fiftyone-examples/blob/master/examples/image_deduplication.ipynb">
            <img src="https://user-images.githubusercontent.com/25985824/104791633-6efa1d80-5769-11eb-8ee3-4b2123fe4b66.png" height="32" width="32">
            View on GitHub
        </a>
    </td>
    <td>
        <a href="https://github.com/voxel51/fiftyone-examples/raw/master/examples/image_deduplication.ipynb" download>
            <img src="https://user-images.githubusercontent.com/25985824/104792428-60f9cc00-576c-11eb-95a4-5709d803023a.png" height="32" width="32">
            Download notebook
        </a>
    </td>
</table>


# Find and remove duplicate images with FiftyOne

Duplicate images in a train or test set can bias your model and hurt its ability to generalize to new data. In practice, however, large image datasets are hard to inspect, so finding duplicates manually can be prohibitively time-consuming.

This notebook will guide you through how to find and remove duplicates in your data by:

* Loading your data into FiftyOne

* Computing embeddings for your images using the FiftyOne Model Zoo

* Calculating the similarity of your images

* Visualizing and automatically removing duplicates

There are multiple ways to do this. The approach shown here finds near-duplicates by comparing every image in the dataset with every other image.

[FiftyOne also provides a `uniqueness` function](https://voxel51.com/docs/fiftyone/user_guide/brain.html) that computes a scalar score for each sample measuring how unique it is relative to the rest of the dataset. It can also be used to find near-duplicates manually, since low uniqueness indicates likely duplicate or near-duplicate images. You can see an example of it at the end of this notebook.

Alternatively, if you are only interested in exact duplicates, you can [compute a hash over your files](https://voxel51.com/docs/fiftyone/recipes/image_deduplication.html) to quickly find matches. However, if images vary by only small pixel values, this method will fail to find the duplicates.
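
For the exact-duplicate case, a minimal sketch using only the standard library might look like the following. The helper names `file_hash` and `find_exact_duplicates`, the choice of SHA-256, and the `*.jpg` pattern are assumptions made for illustration; they are not part of FiftyOne or of the linked recipe.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(image_dir, pattern="*.jpg"):
    """Group files under image_dir whose bytes are identical.

    Returns a list of groups; each group is a sorted list of paths that
    share the same hash. Only groups with two or more files are returned.
    """
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).rglob(pattern)):
        groups[file_hash(path)].append(str(path))
    return [paths for paths in groups.values() if len(paths) > 1]
```

Because the hash covers the raw bytes, two files that differ in a single pixel or in their encoding settings end up in different groups, which is why the embedding-based approach is used for near-duplicates.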

## Loading data

In [None]:
import os
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone.utils import yolo

In [None]:
name = "tat-dataset_hhf_v1"
dataset_dir = "../data/interim_hhf/"

In [None]:
# dataset_path = "../data/processed/image-deduplication_hhf"

# Remove empty label files together with their images
labels_dir = os.path.join(dataset_dir, "labels", "train")
images_dir = os.path.join(dataset_dir, "images", "train")
for label_file in os.listdir(labels_dir):
    label_path = os.path.join(labels_dir, label_file)
    if os.path.getsize(label_path) == 0:
        os.remove(label_path)
        os.remove(os.path.join(images_dir, os.path.splitext(label_file)[0] + ".jpg"))

In [None]:
# Delete any existing dataset with this name so it can be recreated below
if fo.dataset_exists(name):
    fo.delete_dataset(name)

In [22]:
# The splits to load
splits = ["train"]

# Load the dataset, using tags to mark the samples in each split
dataset = fo.Dataset(name)
for split in splits:
    dataset.add_dir(
        dataset_dir=dataset_dir,
        dataset_type=fo.types.YOLOv5Dataset,
        split=split,
        tags=split,
    )

# View summary info about the dataset
print(dataset)

# Print the first few samples in the dataset
print(dataset.head())

Name:        tat-dataset_hhf_v1
Media type:  image
Num samples: 4779
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
[<Sample: {
    'id': '62f4fa2b209eefac84b167c2',
    'media_type': 'image',
    'filepath': '/media/denis/d6ca67e8-705f-40a9-8314-f59bfaaf7872/work/TAT/Татарстан/data/interim_hhf/images/train/23-11-2021_02-18-44_PM_0.jpg',
    'tags': BaseList(['train']),
    'metadata': None,
    'ground_truth': <Detections: {
        'detections': BaseList([
            <Detection: {
                'id': '62f4fa2b209eefac84b167c1',
                'attributes': BaseDict({}),
                'tags': BaseList([

In [None]:
session = fo.launch_app(dataset)

[Loading your own data](https://voxel51.com/docs/fiftyone/user_guide/dataset_creation/index.html) is also easy to accomplish in FiftyOne. For example, the following will let you load an image classification dataset stored in a directory tree:

```
import fiftyone as fo

dataset = fo.Dataset.from_dir("/path/to/dir", dataset_type=fo.types.ImageClassificationDirectoryTree)
```

## Generate Embeddings

Images store a lot of information in their pixel values. Comparing images pixel by pixel would be expensive and would give poor results, since a small shift or brightness change alters many pixel values without changing the content.

Instead, we can use a pretrained computer vision model to generate an embedding for each image. An embedding is an intermediate representation extracted from within the model as the image passes through it: a vector of a few thousand values (1,280 for the model used here) that distills the information stored in the millions of pixels.

For deep learning models, one typically uses the output of a fully-connected layer near the end of the forward pass to generate embeddings.

The [FiftyOne Model Zoo](https://voxel51.com/docs/fiftyone/user_guide/model_zoo/index.html) contains a host of different pretrained models that we can use for this task. In this example, we will use a [MobileNet v2 model trained on ImageNet](https://voxel51.com/docs/fiftyone/user_guide/model_zoo/models.html#mobilenet-v2-imagenet-torch). This model is reasonably accurate and, more importantly, lightweight, so it can process our dataset faster than larger models.

Any off-the-shelf model will be informative, but one can easily experiment with other models that may be more useful for particular datasets.

We can easily load the model and compute embeddings on our dataset.

In [47]:
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")

RuntimeError: CUDA error: unspecified launch failure

In [4]:
embeddings = dataset.compute_embeddings(model)

print(embeddings.shape)

 100% |███████████████| 2002/2002 [1.7m elapsed, 0s remaining, 31.5 samples/s]      
(2002, 1280)


## Calculate Similarity

Now that we have significantly reduced the dimensionality of our images, we can use classical similarity algorithms to compute how similar every image embedding is to every other image embedding.

In this case, we will use [cosine similarity provided by scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) since this measure is simple and works fairly well in high-dimensional spaces.
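
To make the measure concrete, here is a small NumPy sketch of what pairwise cosine similarity computes: each embedding is scaled to unit length, and the dot products of the scaled vectors are the similarities. The function name `cosine_similarity_matrix` and the toy vectors are illustrative assumptions; the notebook itself relies on scikit-learn's implementation.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T


# Rows 0 and 1 point in the same direction; row 2 is orthogonal to both
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
```

Note that the first two rows have different lengths but a similarity of 1: cosine similarity depends only on direction, not on magnitude.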

In [5]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [6]:
similarity_matrix = cosine_similarity(embeddings)

print(similarity_matrix.shape)
print(similarity_matrix)

(2002, 2002)
[[1.         0.73752369 0.67065162 ... 0.71426472 0.65319278 0.75256688]
 [0.73752369 1.         0.86043928 ... 0.73059472 0.6575839  0.65946913]
 [0.67065162 0.86043928 1.         ... 0.69547338 0.58414217 0.6155496 ]
 ...
 [0.71426472 0.73059472 0.69547338 ... 1.         0.70968906 0.71188586]
 [0.65319278 0.6575839  0.58414217 ... 0.70968906 1.         0.78240149]
 [0.75256688 0.65946913 0.6155496  ... 0.71188586 0.78240149 1.        ]]


As you can see, all diagonal values are 1 since every image is identical to itself. We can subtract the identity matrix (an N x N matrix with 1's on the diagonal and 0's elsewhere) to zero out the diagonal so those values don't show up when we look for the samples with maximum similarity.
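
As a small illustration on a made-up 3 x 3 matrix, the sketch below removes the self-similarities in two equivalent ways and then takes the per-row maximum. The `np.fill_diagonal` variant is an alternative not used in this notebook; it may be slightly more robust, because computed self-similarities can differ from 1 by a tiny floating-point error.

```python
import numpy as np

# Toy similarity matrix (values are invented for illustration)
sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])

# Option 1: subtract the identity matrix, as this notebook does
no_self = sim - np.identity(len(sim))

# Option 2: overwrite the diagonal of a copy with exact zeros
no_self_alt = sim.copy()
np.fill_diagonal(no_self_alt, 0.0)

# Highest similarity of each sample to any other sample
max_similarity = no_self.max(axis=1)
```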

In [7]:
n = len(similarity_matrix)

similarity_matrix = similarity_matrix - np.identity(n)

**Note:** Computing cosine similarity on datasets with more than 100,000 images can be time- and memory-intensive. It is recommended to split the embeddings into batches and parallelize the process to speed up this computation.
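
One way to batch the computation is sketched below, under the assumption that only each sample's maximum similarity is needed. It processes the embeddings in blocks, so the full N x N matrix is never held in memory. The function name `max_similarity_batched` and the default `batch_size` are illustrative choices, and self-similarity is masked with `-inf` rather than 0, which differs slightly from the identity-matrix subtraction used in this notebook.

```python
import numpy as np


def max_similarity_batched(embeddings, batch_size=1024):
    """For each row, return the highest cosine similarity to any *other* row.

    Only a (batch_size x N) block of the similarity matrix exists at a time.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)

    n = len(normalized)
    max_sim = np.empty(n)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Similarities of this batch against all samples: shape (stop - start, n)
        block = normalized[start:stop] @ normalized.T
        # Exclude each sample's similarity to itself
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        max_sim[start:stop] = block.max(axis=1)
    return max_sim
```

The batches are independent of one another, so they could also be distributed across processes or machines; that part is left out of the sketch.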

## Visualize and remove duplicates

We can now iterate through every sample and find which other samples are the most similar to it.






In [8]:
id_map = [s.id for s in dataset.select_fields(["id"])]

for idx, sample in enumerate(dataset):
    sample["max_similarity"] = similarity_matrix[idx].max()
    sample.save()

The [FiftyOne App](https://voxel51.com/docs/fiftyone/user_guide/app.html) allows us to visualize and explore our dataset right in this notebook.

In [50]:
session = fo.launch_app(dataset)

Visualizing the results and sorting by the samples with the highest similarity shows us the duplicates in the dataset. 

Right away, we can see a lot of duplicates and something even more problematic: two of the images are duplicates, but one is in the train split and one is in the test split, and they are labeled differently (seal vs. otter). There are a few things wrong with this:

* It can't be both a seal and an otter so one of the labels is wrong. Additionally, providing different labels for the train and test versions of the image will undoubtedly cause the model to fail.

* Test sets that contain duplicates of the training set will lead to false confidence in the generalizability of your model. If your test set is not truly independent of your training set, the apparent performance of your model will likely drop off when applied to production data.

By looking through the results, we can find a similarity threshold above which two images are considered duplicates. This threshold will differ for every dataset and model used in this process, so the visualization step is crucial.

Further inspection puts a good threshold for guaranteed duplicates around 0.92. Lower values likely also include duplicates but should be verified manually so that we do not remove useful data. We can also filter the dataset in code to see how many samples have a `max_similarity` above a chosen cutoff; the cell below uses a stricter cutoff of 0.995.
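
When choosing between candidate cutoffs, it can help to count how many samples each one would flag. The sketch below does this for a plain list of scores; the helper name `count_above_thresholds` and the example scores are made up for illustration.

```python
import numpy as np


def count_above_thresholds(max_similarity, thresholds):
    """Map each candidate threshold to the number of samples whose
    max similarity is strictly greater than it."""
    max_similarity = np.asarray(max_similarity, dtype=float)
    return {t: int((max_similarity > t).sum()) for t in thresholds}


# Made-up scores purely for illustration
scores = [0.50, 0.93, 0.97, 0.999, 0.80]
counts = count_above_thresholds(scores, [0.92, 0.995])
```

With the dataset in this notebook, the scores could be read with `dataset.values("max_similarity")` once that field has been populated.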

In [None]:
from fiftyone import ViewField as F

dataset.match(F("max_similarity")>0.995)

4,345 out of 60,000 samples are conservatively marked as duplicates!

Let's pick a threshold and tag all duplicate samples; the cells below use 0.995 rather than 0.92 to stay on the conservative side. This is where you would remove them if so desired.

In [None]:
id_map = [s.id for s in dataset.select_fields(["id"])]

In [None]:
thresh = 0.995
samples_to_remove = set()
samples_to_keep = set()

for idx, sample in enumerate(dataset):
    if sample.id not in samples_to_remove:
        # Keep the first instance of two duplicates
        samples_to_keep.add(sample.id)
        
        dup_idxs = np.where(similarity_matrix[idx] > thresh)[0]
        for dup in dup_idxs:
            # We kept the first instance so remove all other duplicates
            samples_to_remove.add(id_map[dup])

        if len(dup_idxs) > 0:
            sample.tags.append("has_duplicates")
            sample.save()

    else:
        sample.tags.append("duplicate")
        sample.save()

print(len(samples_to_remove) + len(samples_to_keep))

print(len(samples_to_remove))
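
The selection logic in this cell can be hard to follow with the tagging mixed in, so here is the same greedy pass isolated as a plain function and run on a toy matrix. The name `split_duplicates`, the toy IDs, and the similarity values are assumptions for illustration. The sketch expects a symmetric matrix whose diagonal has already been zeroed; otherwise every sample would match itself.

```python
import numpy as np


def split_duplicates(similarity, ids, thresh):
    """Greedy pass over the samples in order: keep the first sample of each
    group of near-duplicates and mark every later match for removal."""
    keep, remove = [], set()
    for idx, sample_id in enumerate(ids):
        if sample_id in remove:
            continue  # already marked as a duplicate of an earlier sample
        keep.append(sample_id)
        for dup in np.where(similarity[idx] > thresh)[0]:
            remove.add(ids[dup])
    return keep, remove


ids = ["a", "b", "c", "d"]
similarity = np.array([
    [0.0, 0.999, 0.10, 0.20],
    [0.999, 0.0, 0.15, 0.25],
    [0.10, 0.15, 0.0, 0.998],
    [0.20, 0.25, 0.998, 0.0],
])
keep, remove = split_duplicates(similarity, ids, thresh=0.995)
```

Here `a` and `c` are kept while `b` and `d` are marked for removal. Which member of a duplicate group survives depends only on iteration order, so a different ordering of the dataset can keep different samples.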

In [None]:
# Remove the duplicate samples from the dataset entirely (comment out the following line to keep them)

dataset.delete_samples(list(samples_to_remove))

In [None]:
EXPORT_DIR = "./data/image-deduplication"

# dataset.export(label_field="ground_truth", export_dir=EXPORT_DIR, dataset_type=fo.types.FiftyOneImageClassificationDataset,)
dataset.export(dataset_type=fo.types.ImageSegmentationDirectory, data_path=EXPORT_DIR + "/images/", labels_path=EXPORT_DIR + "/mask/")

In [None]:
session.show()

Let's see how many of these samples have duplicates both in the test and train split and how many are labeled differently.

In [None]:
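view = dataset.match_tags(["has_duplicates", "duplicate"])
view_ids = set(view.values("id"))
thresh = 0.995


def label_signature(s):
    # `ground_truth` holds Detections, so summarize a sample by its sorted class labels
    dets = s.ground_truth.detections if s.ground_truth is not None else []
    return ",".join(sorted({d.label for d in dets}))


# NOTE: `idx` must match the rows of `similarity_matrix`, so run this cell
# before deleting any samples from the dataset
for idx, sample in enumerate(dataset):
    if sample.id in view_ids:
        dup_idxs = np.where(similarity_matrix[idx] > thresh)[0]
        dup_splits = []
        dup_labels = {label_signature(sample)}
        for dup in dup_idxs:
            dup_sample = dataset[id_map[dup]]
            dup_split = "test" if "test" in dup_sample.tags else "train"
            dup_splits.append(dup_split)
            dup_labels.add(label_signature(dup_sample))

        sample["dup_splits"] = dup_splits
        sample["dup_labels"] = list(dup_labels)
        sample.save()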

In [None]:
view.first()

In [None]:
from fiftyone import ViewField as F

Compute how many of the duplicates exist in BOTH the train and test split.

In [None]:
train_w_test_dups = len(
    view
    .match(F("tags").contains("train"))
    .match(F("dup_splits").contains("test"))
)

test_w_train_dups = len(
    view
    .match(F("tags").contains("test"))
    .match(F("dup_splits").contains("train"))
)

print(train_w_test_dups + test_w_train_dups)

Compute how many duplicates are labeled differently.

In [None]:
label_mismatches = len(
    view
    .match(F("dup_labels").length() > 1)
)

print(label_mismatches)

Visualize the samples with the most duplicates.

In [None]:
session.view = view.sort_by(F("dup_splits").length(), reverse=True)

## Finding unique images

FiftyOne also provides a [uniqueness method](https://voxel51.com/docs/fiftyone/user_guide/brain.html#image-uniqueness) that can compute the uniqueness of every image in a dataset. This will result in a score for every image indicating how unique the contents of the image are with respect to all other images.

We previously computed near-duplicate images in the dataset with cosine similarity, but with uniqueness, we are able to rank the images in the dataset according to their relative uniqueness (i.e., information content) compared to the other images.

Uniqueness can be helpful when deciding which samples to send to annotators. If you have a limited annotation budget, then you will want to have the most unique samples annotated for training and testing.

In [3]:
import fiftyone.brain as fob

In [None]:
# Compute a uniqueness score for every sample; it is stored in the "uniqueness" field
fob.compute_uniqueness(dataset)

In [None]:
uniqueness_view = dataset.exists("uniqueness").sort_by("uniqueness", reverse=True)

In [None]:
uniqueness_view

In [None]:
session.view = uniqueness_view

## Finding duplicate objects

In [4]:
import fiftyone.utils.iou as foui

foui.compute_max_ious(dataset, "ground_truth", iou_attr="max_iou", classwise=True)
print("Max IoU range: (%f, %f)" % dataset.bounds("ground_truth.detections.max_iou"))

 100% |███████████████| 3039/3039 [1.1s elapsed, 0s remaining, 2.9K samples/s]         
Max IoU range: (0.000000, 0.162791)
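
For reference, the intersection-over-union (IoU) score used in this section can be sketched for a single pair of boxes as follows. The function `box_iou` is a hypothetical helper written for illustration, not the implementation in `fiftyone.utils.iou`; it assumes boxes in the `[top-left-x, top-left-y, width, height]` format that FiftyOne uses for detections.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as [top-left-x, top-left-y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0
```

Two identical boxes score 1, disjoint boxes score 0, and a threshold such as the 0.8 used in this section flags only boxes that overlap almost completely.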


In [5]:
from fiftyone import ViewField as F
#ground_truth
# Retrieve detections that overlap above a chosen threshold
dups_view = dataset.filter_labels("ground_truth", F("max_iou") > 0.8)
print(dups_view)

Dataset:     tat-dataset_hhf_v1
Media type:  image
Num samples: 0
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
View stages:
    1. FilterLabels(field='ground_truth', filter={'$gt': ['$$this.max_iou', 0.8]}, only_matches=True, trajectories=False)


In [6]:
session = fo.launch_app(view=dups_view)

Connected to FiftyOne on port 5151 at localhost.
If you are not connecting to a remote session, you may need to start a new session and specify a port



In [12]:
# Automatically finding duplicates

dup_ids = foui.find_duplicates(
    dataset, "ground_truth", iou_thresh=0.80, classwise=True
)
print("Found %d duplicates" % len(dup_ids))

 100% |███████████████| 2002/2002 [6.2s elapsed, 0s remaining, 357.8 samples/s]      
Found 192 duplicates


In [13]:
# Tag the automatically selected duplicates
dataset.select_labels(ids=dup_ids).tag_labels("duplicate")
print(dataset.count_label_tags())

{'duplicate': 192}


In [14]:
session.view = dataset.match_labels(ids=dup_ids)


In [15]:
dataset.delete_labels(tags="duplicate")

In [2]:
dataset = fo.load_dataset("tat-dataset_v1")

In [20]:
# All data
EXPORT_DIR = "../data/image-deduplication"
#dataset.export

# print(dataset.default_classes)

# dataset.export(dataset_type=fo.types.YOLOv5Dataset, export_dir=EXPORT_DIR, label_field='ground_truth')
dataset.export(dataset_type=fo.types.YOLOv5Dataset, export_dir=EXPORT_DIR, classes=dataset.default_classes,)

Directory '../data/image-deduplication' already exists; export will be merged with existing files
 100% |███████████████| 2002/2002 [8.7s elapsed, 0s remaining, 207.1 samples/s]      


In [22]:
# Cars, human
EXPORT_DIR = "../data/image-deduplication_c_h"
#dataset.export

classes = ['car', 'human']

# print(dataset.default_classes)

# dataset.export(dataset_type=fo.types.YOLOv5Dataset, export_dir=EXPORT_DIR, label_field='ground_truth')
dataset.export(dataset_type=fo.types.YOLOv5Dataset, export_dir=EXPORT_DIR, classes=classes,)

 100% |███████████████| 2002/2002 [6.2s elapsed, 0s remaining, 338.3 samples/s]      


In [3]:
# Carplate
EXPORT_DIR = "../data/image-deduplication_cp"

classes = ['carplate']

dataset.export(dataset_type=fo.types.YOLOv5Dataset, export_dir=EXPORT_DIR, classes=classes,)

 100% |███████████████| 2002/2002 [41.9s elapsed, 0s remaining, 29.6 samples/s]      


In [3]:
# Human, head, face
EXPORT_DIR = "../data/image-deduplication_hhf"

classes = ['head', 'face', 'human']

dataset.export(dataset_type=fo.types.YOLOv5Dataset, export_dir=EXPORT_DIR, classes=classes,)

 100% |███████████████| 2002/2002 [36.1s elapsed, 0s remaining, 55.3 samples/s]      


## Removing images without labels

In [11]:
import os

dataset_path = "../data/processed/image-deduplication_hhf"

# Remove empty label files together with their images
labels_dir = os.path.join(dataset_path, "labels", "val")
images_dir = os.path.join(dataset_path, "images", "val")
for label_file in os.listdir(labels_dir):
    label_path = os.path.join(labels_dir, label_file)
    if os.path.getsize(label_path) == 0:
        os.remove(label_path)
        os.remove(os.path.join(images_dir, os.path.splitext(label_file)[0] + ".jpg"))

In [13]:
session = fo.launch_app(dataset)

In [12]:
session.freeze()

NameError: name 'session' is not defined