# Dataset Cleaning

*Image Deleter and Relabeler by [Zach Caceres](http://zachcaceres.com/now/) in collaboration with Jason Patnick, Duplicate Finder by [Francisco Ingham](https://medium.com/@fpingham)*

In this notebook we will show you how to take advantage of fastai widgets to clean your dataset! We will delete images that do not correspond, relabel images with incorrect labels and delete duplicates. For this, we will use the CIFAR10 dataset but you can use it in your own custom dataset you built with the [google images dataset](https://github.com/fpingham/google-images-dataset) notebook.

# Training your first model

In [24]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [25]:
from fastai import *
from fastai.vision import *

In [26]:
path = untar_data(URLs.CIFAR)

We will first train a model since it will suggest us which are the images that are most likely to be mislabelled or not belong to our dataset. We will also use the weights of the pretrained model to find similar images that might be duplicates.

In [4]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)



  warn(f"There seems to be something wrong with your dataset, can't access self.train_ds[i] for all i in {idx}")


In [5]:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

In [6]:
# learn.fit_one_cycle(4)

In [7]:
# learn.save('stage-1')

In [6]:
learn.load('stage-1')

## Cleaning your dataset

In [27]:
from fastai.widgets import *
import torch

We will create an `ImageDataBunch` with all the images in the training set since the widget will only use the images in the training set.

In [28]:
# We create a databunch with all the data in the training set and no validation set (DatasetFormatter uses only the training set)
db = (ImageItemList.from_folder(path)
                   .no_split()
                   .label_from_folder()
                   .transform([crop_pad, crop_pad], size=224)
                   .databunch())



  warn(f"There seems to be something wrong with your dataset, can't access self.train_ds[i] for all i in {idx}")


To start, we will sort the indices of our images by the highest loss 
images since this suggests that the image might be mislabeled or just not belong to the dataset.

In [29]:
learn_rel = create_cnn(db, models.resnet34, metrics=error_rate)
learn_rel.load('stage-1')

Learner(data=ImageDataBunch;

Train: LabelList
y: CategoryList (60000 items)
[Category dog, Category dog, Category dog, Category dog, Category dog]...
Path: /home/jupyter/.fastai/data/cifar10
x: ImageItemList (60000 items)
[Image (3, 32, 32), Image (3, 32, 32), Image (3, 32, 32), Image (3, 32, 32), Image (3, 32, 32)]...
Path: /home/jupyter/.fastai/data/cifar10;

Valid: LabelList
y: CategoryList (0 items)
[]...
Path: /home/jupyter/.fastai/data/cifar10
x: ImageItemList (0 items)
[]...
Path: /home/jupyter/.fastai/data/cifar10;

Test: None, model=Sequential(
  (0): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)

In [30]:
ds, idxs = DatasetFormatter().from_toplosses(learn_rel)

AttributeError: Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/jupyter/fastai/fastai/data_block.py", line 532, in __getitem__
    x = x.apply_tfms(self.tfms, **self.tfmargs)
  File "/home/jupyter/fastai/fastai/vision/image.py", line 96, in apply_tfms
    tfms = sorted(listify(tfms), key=lambda o: o.tfm.order)
  File "/home/jupyter/fastai/fastai/vision/image.py", line 96, in <lambda>
    tfms = sorted(listify(tfms), key=lambda o: o.tfm.order)
AttributeError: 'TfmCrop' object has no attribute 'tfm'


Now we will use the widget to delete or move images. Flag photos for deletion by clicking 'Delete' or move them by using the dropdown menu. Then click 'Next Batch' to delete flagged photos and keep the rest in that row. `ImageCleaner` will show you a new row of images until there are no more to show. 

When you change your dataset, `ImageCleaner` will save the new dataset in a 'cleaned.csv' file in the same file where you have your notebook. 

In [16]:
ImageCleaner(ds, idxs, path)

HBox(children=(VBox(children=(Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00d\x00d\x00\x00\xff…

Button(button_style='primary', description='Next Batch', layout=Layout(width='auto'), style=ButtonStyle())

You can also find duplicates in your dataset and delete them! First we need to load our data from the csv and create a new `.Learner` object. We will only use 10000 examples in the dataset since examining the whole dataset would take a long time. To compute similarities for the whole data just delete the line with `df.head(10000)`.

In [31]:
df = pd.read_csv(path/'cleaned.csv', header='infer')

In [32]:
# We create a databunch from our csv. We include the data in the training set and we don't use a validation set (DatasetFormatter uses only the training set)
np.random.seed(42)
db = (ImageItemList.from_df(df, path)
                   .no_split()
                   .label_from_df()
                   .transform([crop_pad(), crop_pad()], size=224)
                   .databunch())

In [33]:
learn_dup = create_cnn(db, models.resnet34, metrics=error_rate)
learn_dup.load('stage-1');

Take a look at the images in pairs and delete the ones you don't want to see anymore, until you feel that images don't look alike anymore. Remember that if you want to rerun the widget you need to recreate the `ImageDataBunch` object, loading the data from `cleaned.csv`.

In [None]:
ds, fns_idxs = DatasetFormatter.from_similars(learn_dup, pool_dim=4)

Getting activations...


In [None]:
ImageCleaner(ds, fns_idxs, path, duplicates=True)

Turns out there is quite a number of duplicates in CIFAR!

## Train with new dataset

Now you are ready to work in your real training with a clean dataset! To use the new dataset you must load the labels and files from csv into your ImageDataBunch object.

In [86]:
# We create a databunch from our csv. We include the data in the training set and we don't use a validation set (DatasetFormatter uses only the training set)
np.random.seed(42)
db = (ImageItemList.from_df(df, path)
                   .no_split()
                   .label_from_df()
                   .transform([crop_pad(), crop_pad()], size=224)
                   .databunch())

In [87]:
learn = create_cnn(db, models.resnet34, metrics=error_rate)

In [45]:
learn.fit_one_cycle(4)

epoch,train_loss,valid_loss,error_rate
1,2.697477,1.928366,0.726316
2,2.147982,1.205164,0.421053
3,1.731222,1.010826,0.294737
4,1.483092,0.964751,0.326316
,,,
