# Dataset Cleaning

*Image Deleter and Relabeler by [Zach Caceres](http://zachcaceres.com/now/) and Jason Hendrix, Duplicate Finder by Francisco Ingham*

In this notebook we show how to use the fastai widgets to clean a dataset: we delete images that do not belong, relabel images with incorrect labels, and delete duplicates. We use the CIFAR10 dataset here, but you can apply the same steps to your own custom dataset, for example one built with the [google images dataset](https://github.com/fpingham/google-images-dataset) notebook.

## Training your first model

In [53]:
from fastai import *
from fastai.vision import *

In [54]:
path = untar_data(URLs.CIFAR)

We first train a model, because it will suggest which images are most likely to be mislabeled or not to belong in our dataset. We will also use the weights of the pretrained model to find similar images that might be duplicates.

In [55]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train="train", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [56]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [58]:
learn.fit_one_cycle(4)

| epoch | train_loss | valid_loss | error_rate |
|-------|------------|------------|------------|
| 1 | 0.514697 | 0.313609 | 0.107500 |
| 2 | 0.312180 | 0.205868 | 0.070167 |
| 3 | 0.272417 | 0.170283 | 0.057250 |
| 4 | 0.228463 | 0.160440 | 0.054333 |

In [59]:
learn.save('stage-1');

In [60]:
learn.load('stage-1');

## Cleaning your dataset

In [62]:
from fastai.widgets import *

To start, we sort the indices of our images by loss, highest first, since a high loss suggests that an image might be mislabeled or might not belong in the dataset.

In [66]:
ds, idxs = DatasetFormatter().from_toplosses(learn, ds_type=DatasetType.Valid)
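The idea of ranking by loss can be illustrated without fastai. The following is a minimal sketch, not the library's implementation: `top_loss_indices` is a hypothetical helper, and the loss values are made up for illustration.

```python
def top_loss_indices(losses, n=None):
    """Return item indices ordered from highest loss to lowest.

    `losses` holds one loss value per image (hypothetical data);
    `n` optionally limits the result to the n highest-loss items.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order if n is None else order[:n]


# Made-up per-image losses; index 1 has the highest loss, so it comes first.
losses = [0.12, 2.31, 0.05, 1.07]
print(top_loss_indices(losses))       # [1, 3, 0, 2]
print(top_loss_indices(losses, n=2))  # [1, 3]
```

Images at the front of this ordering are the ones worth reviewing first, which is the order in which the widget presents them.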

Now we will use the widget to delete or relabel images. Flag photos for deletion by clicking 'Delete', or move them to a different class by using the dropdown menu. Then click 'Next Batch' to delete the flagged photos and keep the rest in that row. `ImageCleaner` will show you a new row of images until there are no more to show.

Pretty sure the first one is not a truck...

In [67]:
ImageCleaner(ds, idxs)


You can also find duplicates in your dataset and delete them! We will first get the sorted indices for the most similar images in the dataset.

In [None]:
ds, idxs = DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid, pool_dim=4)

Getting activations...


Computing similarities...
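To give an intuition for how "most similar" pairs can be ranked, here is a minimal sketch that compares feature vectors pairwise. It assumes cosine similarity as the measure and uses tiny made-up vectors in place of real network activations; the notebook does not state which measure `from_similars` uses, and the helper names below are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_pairs(features):
    """Return all index pairs (i, j), most similar pair first."""
    scored = [
        (cosine_similarity(features[i], features[j]), i, j)
        for i, j in combinations(range(len(features)), 2)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, j) for _, i, j in scored]


# Made-up feature vectors: items 0 and 2 point in nearly the same direction.
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
print(most_similar_pairs(features)[:2])  # [(0, 2), (1, 2)]
```

Pairs at the top of such a ranking are the candidates for duplicates. Comparing every pair grows quadratically with the number of images, which is acceptable for a sketch but would need care on a large dataset.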


Take a look at the images in pairs and delete the ones you don't want to keep, until there are no more to show. `ImageCleaner` shows 40 images by default. If you still see duplicates among the last of those 40, you can always run the widget again, specifying `start=40` and `end=100` to see the next 60.

In [None]:
ImageCleaner(ds, idxs, duplicates=True)

Turns out there are quite a few duplicates in CIFAR!

## Train with new dataset

Now we are ready to do our real training with a clean dataset! To use the new dataset, we must tell our DataBunch object to load the labels from a CSV file.

In [None]:
np.random.seed(42)
data = ImageDataBunch.from_csv(".", folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
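Conceptually, a label CSV of this kind is a table of file names and labels in which deleted images are absent and relabeled images carry their corrected class. The sketch below illustrates that bookkeeping with the standard library only. The column names `name` and `label`, the file names, and the helper functions are assumptions made for illustration; the actual layout of `cleaned.csv` may differ.

```python
import csv
import io


def apply_cleaning(labels, deleted, relabeled):
    """Drop deleted files and apply corrected labels to the rest."""
    cleaned = {}
    for name, label in labels.items():
        if name in deleted:
            continue  # flagged for deletion: leave it out of the output
        cleaned[name] = relabeled.get(name, label)
    return cleaned


def to_csv(cleaned):
    """Serialize the mapping as CSV text with an assumed name,label header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "label"])
    for name, label in cleaned.items():
        writer.writerow([name, label])
    return buf.getvalue()


def from_csv(text):
    """Read the CSV text back into a {file name: label} mapping."""
    return {row["name"]: row["label"] for row in csv.DictReader(io.StringIO(text))}


# Hypothetical files: b.png was deleted, a.png was relabeled from truck to ship.
original = {"a.png": "truck", "b.png": "cat", "c.png": "dog"}
cleaned = apply_cleaning(original, deleted={"b.png"}, relabeled={"a.png": "ship"})
text = to_csv(cleaned)
print(from_csv(text))  # {'a.png': 'ship', 'c.png': 'dog'}
```

Keeping the cleaned labels in a separate file, rather than editing the image folders, leaves the original data untouched and makes the cleaning step easy to review or redo.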

In [None]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [None]:
learn.fit_one_cycle(4)