# Cleaning Datasets for Image Classification with fastdup and Renumics Spotlight
This notebook provides a blueprint for quickly **improving your machine learning data** with fastdup and Renumics Spotlight. *fastdup* is an open source library for scalable data curation that offers high-quality detection algorithms for uncovering the most common data problems. *Renumics Spotlight* is an open source tool for interactively visualizing datasets and machine learning results. Combined, they let you automatically detect data issues and discover systematic patterns in the detection results, so you can improve your data quickly and effectively.

In [25]:
!pip install -U pandas fastdup renumics-spotlight "sliceguard[all]"


In [1]:
from pathlib import Path
import pandas as pd
from renumics import spotlight
import fastdup



In [2]:
INPUT_DIR = Path("/home/daniel/data/license_plates")

In [3]:
# Load the data
train_df = pd.read_csv(INPUT_DIR / "train.csv")
train_df["split"] = "train"
test_df = pd.read_csv(INPUT_DIR / "test.csv")
test_df["split"] = "test"
df = pd.concat((train_df, test_df))
df = df.rename(columns={"Image": "filename", "Label": "label"})
df["filename"] = df.apply(lambda row: str(INPUT_DIR / "data" / row["label"] / row["filename"]), axis=1)
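
Before handing the table to fastdup, it can be worth confirming that every constructed path points to an existing file. The following is a minimal sketch on a throwaway directory. The helper name `find_missing_files` and the toy file names are assumptions made for illustration and are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def find_missing_files(frame: pd.DataFrame, column: str = "filename") -> pd.DataFrame:
    """Return the rows whose path in `column` does not point to an existing file."""
    exists = frame[column].map(lambda p: Path(p).is_file())
    return frame[~exists]


# Toy data set (hypothetical): one image that exists on disk and one that does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Utah").mkdir()
    (root / "Utah" / "a.jpg").touch()

    toy_df = pd.DataFrame({"label": ["Utah", "Utah"], "filename": ["a.jpg", "b.jpg"]})
    # Same path layout as in the cell above: <root>/<label>/<filename>
    toy_df["filename"] = [
        str(root / label / name)
        for label, name in zip(toy_df["label"], toy_df["filename"])
    ]

    missing = find_missing_files(toy_df)

print(len(missing))
```

On the real data, calling `find_missing_files(df)` should return an empty frame if the CSV files and the image folders are in sync.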

In [4]:
df

| | label | filename | split |
|---|---|---|---|
| 0 | Oklahoma | /home/daniel/data/license_plates/data/Oklahoma/3bee0f9774d98e.jpg | train |
| 1 | Maryland | /home/daniel/data/license_plates/data/Maryland/c721fc8835807c.jpg | train |
| 2 | Nevada | /home/daniel/data/license_plates/data/Nevada/bb8fe304434127.jpg | train |
| 3 | Wyoming | /home/daniel/data/license_plates/data/Wyoming/8242a974d5e154.jpg | train |
| 4 | Wyoming | /home/daniel/data/license_plates/data/Wyoming/66a962c5a3605c.jpg | train |
| ... | ... | ... | ... |
| 442 | Utah | /home/daniel/data/license_plates/data/Utah/b13e4ec47aa127.jpg | test |
| 443 | WashingtonDC | /home/daniel/data/license_plates/data/WashingtonDC/8a9342d9b114b8.jpg | test |
| 444 | Vermont | /home/daniel/data/license_plates/data/Vermont/d313026cffe30c.jpg | test |
| 445 | NewYork | /home/daniel/data/license_plates/data/NewYork/9b702e5aca6c97.jpg | test |


In [5]:
fd = fastdup.create(input_dir=INPUT_DIR / "data")
fd.run(annotations=df, overwrite=True) # Detect data issues using fastdup

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-08-30 11:02:41 [INFO] Going to loop over dir /tmp/tmpxl_xwfm0.csv
2023-08-30 11:02:41 [INFO] Found total 4463 images to run on, 4463 train, 0 test, name list 4463, counter 4463 
2023-08-30 11:02:52 [INFO] Found total 4463 images to run ontimated: 0 Minutes
Finished histogram 1.003
Finished bucket sort 1.013
2023-08-30 11:02:52 [INFO] 67) Finished write_index() NN model
2023-08-30 11:02:52 [INFO] Stored nn model index file work_dir/nnf.index
2023-08-30 11:02:52 [INFO] Total time took 11186 ms
2023-08-30 11:02:52 [INFO] Found a total of 14 fully identical images (d>0.990), which are 0.16 %
2023-08-30 11:02:52 [INFO] Found a total of 50 nearly identical images(d>0.980), which are 0.56 %
2023-08-30 11:02:52 [INFO] Found a total of 6399 above threshold images (d>0.900), which are 71.69 %
2023-08-30 11:02:52 [INFO] Found a total of 446 outlier images         (d<0.050), which are 5.00 %
2023-08-30 11:02:52 [INFO] 

0

# Check the data for outliers

## Get a first overview using fastdup

In [7]:
fd.vis.outliers_gallery()

100%|███████████████████████████████████████| 20/20 [00:00<00:00, 33907.07it/s]


Stored outliers visual view in  work_dir/galleries/outliers.html


| Distance | Path | label |
|---|---|---|
| 0.582192 | /California/93299a6c613240.jpg | California |
| 0.603176 | /California/1eb3c0c9f265c3.jpg | California |
| 0.678315 | /California/5e6c58774fea4c.jpg | California |
| 0.687646 | /WestVirginia/240b51be4efb2c.jpg | WestVirginia |
| 0.707301 | /California/370ab5accd50f7.jpg | California |
| 0.714195 | /RhodeIsland/bfea2760c32730.jpg | RhodeIsland |
| 0.723277 | /Pennsylvania/66ae08153bc9a1.jpg | Pennsylvania |
| 0.741869 | /Illinois/75d35d77f7a6fd.jpg | Illinois |
| 0.742653 | /Pennsylvania/ddfe2ce716d9fd.jpg | Pennsylvania |
| 0.744157 | /Wyoming/1e0f47c2670231.jpg | Wyoming |
| 0.744255 | /Florida/2661a21c8d6f1e.jpg | Florida |
| 0.744639 | /Michigan/dafa681c128a00.jpg | Michigan |
| 0.749459 | /Tennessee/3b6102b059097f.jpg | Tennessee |
| 0.750616 | /Iowa/4332a4213dd5d8.jpg | Iowa |
| 0.753136 | /Arizona/cc94037a9fb12f.jpg | Arizona |
| 0.75574 | /Idaho/24cad694cbbd56.jpg | Idaho |
| 0.757052 | /California/7d374f8b041b8a.jpg | California |
| 0.760553 | /Michigan/814db28f298b8a.jpg | Michigan |
| 0.766651 | /California/e0e72f389ecda7.jpg | California |
| 0.766651 | /California/fcf834f3989975.jpg | California |


0
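
The notebook does not show how fastdup computes the `Distance` values. Judging from the log lines above, where fully identical images have `d>0.990`, the value appears to behave like a similarity score in which lower means more isolated. Under that assumption, the general idea of nearest-neighbor outlier scoring can be sketched as follows. This is an illustration with toy data and a hypothetical helper `nearest_neighbor_similarity`, not fastdup's actual implementation.

```python
import numpy as np


def nearest_neighbor_similarity(vectors: np.ndarray) -> np.ndarray:
    """For each row, return the cosine similarity to its most similar other row."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # ignore each row's similarity to itself
    return sims.max(axis=1)


# Toy embeddings (hypothetical): three points in a tight cluster and one far away.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.99, 0.05],
    [0.98, 0.1],
    [0.0, 1.0],  # points in a different direction than the rest
])

scores = nearest_neighbor_similarity(toy_embeddings)
# The row with the lowest nearest-neighbor similarity is the strongest outlier candidate.
most_isolated = int(np.argmin(scores))
print(most_isolated)
```

Sorting by such a score in ascending order would give a ranking comparable to the gallery, with the most isolated images first.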

## Discover patterns using Renumics Spotlight
fastdup's report already gives a good first overview of severe outliers in the data. To find out whether there are systematic **outlier patterns**, possibly related to specific classes, you can use Renumics Spotlight for an **interactive analysis** of fastdup's detection results.

In [13]:
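outlier_df = fd.outliers()  # table of detected outliers
_, embeddings = fd.embeddings()  # keep only the embedding matrix
# Attach each outlier's embedding vector so it can be used for Spotlight's Embedding dtype
outlier_df["embedding"] = embeddings[outlier_df["outlier"]].tolist()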

Read a total of  4463 images
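
The indexing in the cell above implies that the `outlier` column holds integer positions into the embedding matrix. Under that assumption, a small sketch with toy data shows what the lookup does. The arrays and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in: 4 images with 3-dimensional embeddings, row i belongs to image i.
toy_embeddings = np.arange(12, dtype=float).reshape(4, 3)

# Toy outlier table: the "outlier" column holds integer image indices (assumption).
toy_outliers = pd.DataFrame({"outlier": [2, 0], "distance": [0.58, 0.60]})

# Integer-array indexing selects one embedding row per outlier, in table order.
selected = toy_embeddings[toy_outliers["outlier"].to_numpy()]

# Store each row as a plain Python list so it fits into a single DataFrame cell.
toy_outliers["embedding"] = selected.tolist()

print(toy_outliers["embedding"].iloc[0])
```

Each row of the outlier table then carries its own embedding vector, which is the shape of data the next cell passes to Spotlight.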


In [14]:
spotlight.show(
    outlier_df,
    dtype={
        "embedding": spotlight.Embedding,
        "filename_outlier": spotlight.Image,
        "filename_nearest": spotlight.Image,
    },
    layout="spotlight-layout-outlier.json",
)


**Result**:
![Spotlight Outlier View](img/outliers_spotlight.png)

Spotlight gives you additional ways to interactively explore the outliers detected by fastdup and to answer questions such as:
1. How are the outliers **distributed** across classes?
2. Where are **clusters** of outliers that share similar properties?
3. Are the outliers that fastdup detects in the image data **explainable via metadata** you might have?
4. ...
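
The first question can also be approximated without the UI by a quick aggregation in pandas. The sketch below uses a toy table, and the column names `label` and `distance` are assumptions, so adapt them to the columns your outlier table actually contains.

```python
import pandas as pd

# Toy outlier table (hypothetical values and column names).
toy_outliers = pd.DataFrame({
    "label": ["California", "California", "Wyoming", "California", "Iowa"],
    "distance": [0.58, 0.60, 0.74, 0.71, 0.75],
})

# Number of outliers and mean distance per class, most affected class first.
per_class = (
    toy_outliers.groupby("label")["distance"]
    .agg(count="count", mean_distance="mean")
    .sort_values("count", ascending=False)
)

print(per_class)
```

A class that dominates such a table is a natural starting point for the interactive inspection in Spotlight.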