# Cleaning Datasets for Image Classification with fastdup and Renumics Spotlight
This notebook provides a blueprint for **improving your machine learning data** quickly with fastdup and Renumics Spotlight. *fastdup* is an open source library for scalable data curation that offers detection algorithms for the most common data problems. *Renumics Spotlight* is an open source tool for interactively visualizing datasets and machine learning results. Combined, they let you detect data issues automatically and then discover systematic patterns in the detection results, so you can improve your data quickly and effectively.

In [25]:
!pip install -U numpy pandas fastdup renumics-spotlight


In [1]:
# The imports you need for running this example notebook
from pathlib import Path
import pandas as pd
import numpy as np
import fastdup
from renumics import spotlight
from renumics.spotlight.analysis import DataIssue



In [2]:
# The dataset input directory
INPUT_DIR = Path("/home/daniel/data/license_plates")

In [3]:
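# Load the train and test annotations and remember which split each row came from
train_df = pd.read_csv(INPUT_DIR / "train.csv")
train_df["split"] = "train"
test_df = pd.read_csv(INPUT_DIR / "test.csv")
test_df["split"] = "test"

# Combine both splits into a single DataFrame with a fresh 0..N-1 index
df = pd.concat((train_df, test_df)).reset_index(drop=True)
df = df.rename(columns={"Image": "filename", "Label": "label"})

# Build the full image paths: <INPUT_DIR>/data/<label>/<filename>
df["filename"] = df.apply(lambda row: str(INPUT_DIR / "data" / row["label"] / row["filename"]), axis=1)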
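Before handing these paths to fastdup, it can be worth confirming that every constructed path points to a file on disk, since a mismatch between a label and its folder name would otherwise only surface later. The following is a minimal sketch on a toy DataFrame and a temporary directory; `find_missing_files` is a hypothetical helper, not part of fastdup or Spotlight.

```python
from pathlib import Path
import tempfile

import pandas as pd


def find_missing_files(frame: pd.DataFrame, column: str = "filename") -> pd.DataFrame:
    """Return the rows whose file path does not exist on disk."""
    exists = frame[column].map(lambda p: Path(p).is_file())
    return frame[~exists]


# Toy stand-in for the real dataset: one file that exists, one that does not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.jpg"
    present.write_bytes(b"")  # empty placeholder, not a real image
    toy_df = pd.DataFrame(
        {
            "filename": [str(present), str(Path(tmp) / "absent.jpg")],
            "label": ["Oklahoma", "Maryland"],
        }
    )
    missing = find_missing_files(toy_df)

print(missing["filename"].map(lambda p: Path(p).name).tolist())
```

On the real data, `find_missing_files(df)` would ideally return an empty DataFrame.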

In [4]:
df

| | label | filename | split |
|---|---|---|---|
| 0 | Oklahoma | /home/daniel/data/license_plates/data/Oklahoma/3bee0f9774d98e.jpg | train |
| 1 | Maryland | /home/daniel/data/license_plates/data/Maryland/c721fc8835807c.jpg | train |
| 2 | Nevada | /home/daniel/data/license_plates/data/Nevada/bb8fe304434127.jpg | train |
| 3 | Wyoming | /home/daniel/data/license_plates/data/Wyoming/8242a974d5e154.jpg | train |
| 4 | Wyoming | /home/daniel/data/license_plates/data/Wyoming/66a962c5a3605c.jpg | train |
| ... | ... | ... | ... |
| 4458 | Utah | /home/daniel/data/license_plates/data/Utah/b13e4ec47aa127.jpg | test |
| 4459 | WashingtonDC | /home/daniel/data/license_plates/data/WashingtonDC/8a9342d9b114b8.jpg | test |
| 4460 | Vermont | /home/daniel/data/license_plates/data/Vermont/d313026cffe30c.jpg | test |
| 4461 | NewYork | /home/daniel/data/license_plates/data/NewYork/9b702e5aca6c97.jpg | test |


In [5]:
# Point fastdup at the image directory
fd = fastdup.create(input_dir=INPUT_DIR / "data")
# Detect data issues using fastdup
fd.run(annotations=df, overwrite=True)
# Keep the generated embeddings in a variable for the Spotlight steps
_, embeddings = fd.embeddings()

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-08-31 09:55:16 [INFO] Going to loop over dir /tmp/tmp0h4wtiye.csv
2023-08-31 09:55:16 [INFO] Found total 4463 images to run on, 4463 train, 0 test, name list 4463, counter 4463 
Finished histogram 0.891
Finished bucket sort 0.909
2023-08-31 09:55:26 [INFO] 67) Finished write_index() NN model
2023-08-31 09:55:26 [INFO] Stored nn model index file work_dir/nnf.index
2023-08-31 09:55:27 [INFO] Total time took 10959 ms
2023-08-31 09:55:27 [INFO] Found a total of 14 fully identical images (d>0.990), which are 0.16 %
2023-08-31 09:55:27 [INFO] Found a total of 50 nearly identical images(d>0.980), which are 0.56 %
2023-08-31 09:55:27 [INFO] Found a total of 6399 above threshold images (d>0.900), which are 71.69 %
2023-08-31 09:55:27 [INFO] Found a total of 446 outlier images         (d<0.050), which are 5.00 %
2023-08-31 09:55:27 [INFO] 

# Check the data for outliers

## Detect issues fast using fastdup

In [9]:
fd.vis.outliers_gallery()

100%|█████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 29495.81it/s]


Stored outliers visual view in  work_dir/galleries/outliers.html
########################################################################################
Would you like to see awesome visualizations for some of the most popular academic datasets?
Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup
########################################################################################


| Distance | Path | Label |
|---|---|---|
| 0.582192 | /California/93299a6c613240.jpg | California |
| 0.633572 | /California/1eb3c0c9f265c3.jpg | California |
| 0.678315 | /California/5e6c58774fea4c.jpg | California |
| 0.687646 | /WestVirginia/240b51be4efb2c.jpg | WestVirginia |
| 0.707301 | /California/370ab5accd50f7.jpg | California |
| 0.714195 | /RhodeIsland/bfea2760c32730.jpg | RhodeIsland |
| 0.723277 | /Pennsylvania/66ae08153bc9a1.jpg | Pennsylvania |
| 0.741869 | /Illinois/75d35d77f7a6fd.jpg | Illinois |
| 0.742653 | /Pennsylvania/ddfe2ce716d9fd.jpg | Pennsylvania |
| 0.744157 | /Wyoming/1e0f47c2670231.jpg | Wyoming |
| 0.744255 | /Florida/2661a21c8d6f1e.jpg | Florida |
| 0.744639 | /Michigan/dafa681c128a00.jpg | Michigan |
| 0.749459 | /Tennessee/3b6102b059097f.jpg | Tennessee |
| 0.750616 | /Iowa/4332a4213dd5d8.jpg | Iowa |
| 0.753136 | /Arizona/cc94037a9fb12f.jpg | Arizona |
| 0.75574 | /Idaho/24cad694cbbd56.jpg | Idaho |
| 0.757052 | /California/7d374f8b041b8a.jpg | California |
| 0.760553 | /Michigan/814db28f298b8a.jpg | Michigan |
| 0.766651 | /California/e0e72f389ecda7.jpg | California |
| 0.766651 | /California/fcf834f3989975.jpg | California |



## Discover patterns using Renumics Spotlight
fastdup's report already gives a good first overview of the most severe outliers in the data. To find out whether there are systematic **outlier patterns**, for example outliers concentrated in specific classes, you can use Renumics Spotlight for an **interactive analysis** of fastdup's detection results.

In [10]:
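# Table of outliers: each row refers to an outlier image and its nearest neighbor
outlier_df = fd.outliers()
# Attach the embedding of each outlier image so Spotlight can use it
outlier_df["embedding"] = embeddings[outlier_df["outlier"]].tolist()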

In [11]:
spotlight.show(
    outlier_df,
    dtype={"embedding": spotlight.Embedding, "filename_outlier": spotlight.Image, "filename_nearest": spotlight.Image},
    layout="spotlight-layout-outlier.json",
)


**Result**:
![Spotlight Outlier View](img/outliers_spotlight.png)

Spotlight gives you additional ways to interactively explore the outliers detected by fastdup and to answer questions such as:
1. How are the outliers **distributed** across classes?
2. Where are **clusters** of outliers that share similar properties?
3. Are the outliers that fastdup detects in the image data **explainable via metadata** you might have?
4. ...
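The first question can also be approximated numerically before opening Spotlight. The sketch below counts outliers per class and relates them to the class size. It runs on toy data; the column name `label_outlier` is an assumption about the table returned by `fd.outliers()` and should be checked against `outlier_df.columns` before applying it to the real data.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `outlier_df`.
# Assumption: the outlier table carries the class of each outlier in "label_outlier".
dataset = pd.DataFrame({"label": ["California"] * 4 + ["Wyoming"] * 4 + ["Iowa"] * 2})
outliers = pd.DataFrame(
    {"label_outlier": ["California", "California", "California", "Wyoming"]}
)

# Align image counts and outlier counts by class name.
per_class = (
    pd.DataFrame(
        {
            "images": dataset["label"].value_counts(),
            "outliers": outliers["label_outlier"].value_counts(),
        }
    )
    .fillna(0)  # classes without any outlier
    .astype({"outliers": int})
)

# A high rate hints at a class-specific problem rather than random noise.
per_class["outlier_rate"] = per_class["outliers"] / per_class["images"]
per_class = per_class.sort_values("outlier_rate", ascending=False)
print(per_class)
```

Normalizing by class size matters here: a class with many images will naturally collect more outliers in absolute terms.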

# Check the data for duplicates

## Detect issues fast using fastdup

In [6]:
fd.vis.duplicates_gallery()

100%|███████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 137.35it/s]


Stored similarity visual view in  work_dir/galleries/duplicates.html


| Distance | From | To | From_Label | To_Label |
|---|---|---|---|---|
| 1.0 | /Missouri/23074d4b0ae51a.jpg | /Missouri/f3ba17a8f63858.jpg | Missouri | Missouri |
| 1.0 | /Maryland/055f24ebae6f05.jpg | /Maryland/76da827b7b9113.jpg | Maryland | Maryland |
| 1.0 | /Wyoming/d52f79c00711d5.jpg | /Wyoming/f65a5529651e94.jpg | Wyoming | Wyoming |
| 0.993153 | /Colorado/b45b3808c66bc7.jpg | /Colorado/9ff4fa35f7858d.jpg | Colorado | Colorado |
| 0.992452 | /Pennsylvania/c82893b58e10d5.jpg | /Pennsylvania/0ecfedc5235aee.jpg | Pennsylvania | Pennsylvania |
| 0.992388 | /Pennsylvania/e2d034793b759f.jpg | /Pennsylvania/8caec045384da2.jpg | Pennsylvania | Pennsylvania |
| 0.990011 | /Colorado/62edf3574a18f0.jpg | /Colorado/041effb53537c6.jpg | Colorado | Colorado |
| 0.989356 | /Colorado/002102e0b8e03b.jpg | /Colorado/ffbe1574b0d2cb.jpg | Colorado | Colorado |
| 0.988373 | /Pennsylvania/a4ded363c6ba21.jpg | /Pennsylvania/8caec045384da2.jpg | Pennsylvania | Pennsylvania |
| 0.987977 | /Virginia/62990e755ac92f.jpg | /Virginia/29e2236e465aaf.jpg | Virginia | Virginia |



## Discover patterns using Renumics Spotlight

In [7]:
similarity_df = fd.similarity()

# Adjust the thresholds to include more dissimilar images in the issues list
exact_dup_threshold = 0.998
near_dup_threshold = 0.98

# similarity_df holds one row per image pair, so its index does not line up with df.
# Select the affected df rows through the "from" ids instead of a boolean mask.
is_exact = similarity_df["distance"] >= exact_dup_threshold
is_near = (similarity_df["distance"] >= near_dup_threshold) & ~is_exact

df["duplicate"] = "no"
df.loc[similarity_df.loc[is_near, "from"].unique(), "duplicate"] = "near"
df.loc[similarity_df.loc[is_exact, "from"].unique(), "duplicate"] = "exact"

# Create one Spotlight issue per duplicate pair
duplicate_issues = []
for _, row in similarity_df[is_near | is_exact].iterrows():
    kind = "Exact" if row["distance"] >= exact_dup_threshold else "Near"
    duplicate_issue = DataIssue(
        title=f"{kind} Duplicate | Distance {row['distance']:.2f}",
        description=f"Labels from/to: {row['label_from']}/{row['label_to']}",
        rows=[row["from"], row["to"]],
        columns=["embedding"],
    )
    duplicate_issues.append(duplicate_issue)

df["embedding"] = embeddings.tolist()

In [8]:
spotlight.show(
    df,
    dtype={"embedding": spotlight.Embedding, "filename": spotlight.Image},
    issues=duplicate_issues,
    layout="spotlight-layout-duplicates.json",
)


**Results:**
![Spotlight Results](img/duplicates_spotlight.png)

Spotlight lets you explore fastdup's detection results interactively. You can investigate questions such as:
1. Are there **data slices** containing a large number of duplicates?
2. Can you manually identify larger **clusters of near duplicates**?
3. Are certain **metadata** attributes explanatory for certain types of duplicates?
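One slice that is easy to check programmatically is the train/test split: duplicate pairs that straddle both splits are a potential source of evaluation leakage. The sketch below uses toy data shaped like the notebook's `df` and `similarity_df`. It assumes, as the issue-creation cell does, that the `from` and `to` columns hold row positions of the image table; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` (images) and `similarity_df` (pairs).
# Assumption: "from"/"to" are row positions in `images`.
images = pd.DataFrame(
    {
        "label": ["Missouri", "Missouri", "Wyoming", "Wyoming", "Utah"],
        "split": ["train", "test", "train", "train", "test"],
    }
)
pairs = pd.DataFrame(
    {"from": [0, 2, 4], "to": [1, 3, 0], "distance": [1.0, 0.993, 0.91]}
)

near_dup_threshold = 0.98
dup_pairs = pairs[pairs["distance"] >= near_dup_threshold].copy()

# Attach the split of both images in each pair.
dup_pairs["split_from"] = images["split"].loc[dup_pairs["from"]].to_numpy()
dup_pairs["split_to"] = images["split"].loc[dup_pairs["to"]].to_numpy()

# Pairs whose two images live in different splits are leakage candidates.
leaks = dup_pairs[dup_pairs["split_from"] != dup_pairs["split_to"]]

# Share of images per (label, split) slice that take part in any duplicate pair.
involved = pd.unique(dup_pairs[["from", "to"]].to_numpy().ravel())
images["is_duplicate"] = images.index.isin(involved)
dup_rate = images.groupby(["label", "split"])["is_duplicate"].mean()

print(leaks[["from", "to", "distance"]])
print(dup_rate)
```

Slices with a high `dup_rate` are good candidates for a closer look in Spotlight.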

# Check the data for label inconsistencies

## Detect issues fast using fastdup

In [16]:
similarities_df = fd.vis.similarity_gallery() 

100%|██████████████████████████████████████████| 20/20 [00:03<00:00,  6.57it/s]

Stored similar images visual view in  work_dir/galleries/similarity.html





| Query label | Query image | Distance | Similar image | Similar label |
|---|---|---|---|---|
| Oregon | /Oregon/f8864dfd661ae0.jpg | 0.905165 | /Oregon/a9afba35ca9bcb.jpg | Oregon |
| Oregon | /Oregon/f8864dfd661ae0.jpg | 0.900036 | /Michigan/cdb9305189c756.jpg | Michigan |
| Nevada | /Nevada/f62670663e330f.jpg | 0.900042 | /Nevada/beec05dfb8a5b9.jpg | Nevada |
| Nevada | /Nevada/beec05dfb8a5b9.jpg | 0.900042 | /Nevada/f62670663e330f.jpg | Nevada |
| Oregon | /Oregon/e274fe82ce9588.jpg | 0.908024 | /Oregon/3768cbb9aff7cb.jpg | Oregon |
| Oregon | /Oregon/e274fe82ce9588.jpg | 0.900055 | /Oregon/0e8c2f823fd7e3.jpg | Oregon |
| Texas | /Texas/e2c3ca9317f112.jpg | 0.90006 | /Texas/c8c1aecf25c09d.jpg | Texas |
| Oklahoma | /Oklahoma/8b0df1bb75f715.jpg | 0.907905 | /Montana/2974f5fe69537f.jpg | Montana |
| Oklahoma | /Oklahoma/8b0df1bb75f715.jpg | 0.900063 | /Nevada/17f2d8680fcbe4.jpg | Nevada |
| NewHampshire | /NewHampshire/9f387a24fc3210.jpg | 0.910486 | /NewHampshire/d0ebc4bfe0f9a7.jpg | NewHampshire |
| NewHampshire | /NewHampshire/9f387a24fc3210.jpg | 0.900076 | /NewHampshire/129a3ad14af708.jpg | NewHampshire |
| Vermont | /Vermont/6361fc8f82e3c0.jpg | 0.924734 | /Vermont/c7ec8309b85a1c.jpg | Vermont |
| Vermont | /Vermont/6361fc8f82e3c0.jpg | 0.900082 | /Vermont/d313026cffe30c.jpg | Vermont |
| Wyoming | /Wyoming/7081e21b53ae0c.jpg | 0.917465 | /Missouri/231423076a5dd4.jpg | Missouri |
| Wyoming | /Wyoming/7081e21b53ae0c.jpg | 0.900105 | /Massachusetts/23773e96d237d3.jpg | Massachusetts |
| SouthDakota | /SouthDakota/cb67ce7d2dead8.jpg | 0.900899 | /California/89cfa25f9cbbc8.jpg | California |
| SouthDakota | /SouthDakota/cb67ce7d2dead8.jpg | 0.900107 | /SouthDakota/829f5966c712a1.jpg | SouthDakota |
| Wyoming | /Wyoming/29a05a0fd700ac.jpg | 0.900123 | /Oklahoma/8c08d6e75aa4ee.jpg | Oklahoma |
| Hawaii | /Hawaii/fc5506ca91ebea.jpg | 0.915178 | /Hawaii/708eeabdbd966d.jpg | Hawaii |
| Hawaii | /Hawaii/fc5506ca91ebea.jpg | 0.900154 | /Hawaii/262bdfa8abf42f.jpg | Hawaii |
| California | /California/9a1b8e6ce8f3ad.jpg | 0.900163 | /Alabama/121bceec957e21.jpg | Alabama |
| Connecticut | /Connecticut/b43ea7671bd8a1.jpg | 0.918041 | /Connecticut/36e180f70bafe1.jpg | Connecticut |
| Connecticut | /Connecticut/b43ea7671bd8a1.jpg | 0.90018 | /Massachusetts/855181245af131.jpg | Massachusetts |
| Iowa | /Iowa/1962b6722606f5.jpg | 0.900205 | /Illinois/371abc3c057b31.jpg | Illinois |
| Alabama | /Alabama/36e7eda4d54c3f.jpg | 0.909184 | /SouthDakota/3e56a93793a98c.jpg | SouthDakota |
| Alabama | /Alabama/36e7eda4d54c3f.jpg | 0.900209 | /Alabama/bf52479a7da3da.jpg | Alabama |
| Alabama | /Alabama/79cf57b79115b6.jpg | 0.900232 | /Alabama/6b01738e517a2d.jpg | Alabama |
| Alabama | /Alabama/6b01738e517a2d.jpg | 0.965183 | /Alabama/d06e2c1a017c5e.jpg | Alabama |
| Alabama | /Alabama/6b01738e517a2d.jpg | 0.900232 | /Alabama/79cf57b79115b6.jpg | Alabama |
| Massachusetts | /Massachusetts/0dbd597f5a1934.jpg | 0.90744 | /Connecticut/6fd16f6b24b8aa.jpg | Connecticut |
| Massachusetts | /Massachusetts/0dbd597f5a1934.jpg | 0.900243 | /Ohio/7bf70037594568.jpg | Ohio |
| Oklahoma | /Oklahoma/fdfeab99b753a4.jpg | 0.901383 | /SouthDakota/b79e660d5de1ff.jpg | SouthDakota |
| Oklahoma | /Oklahoma/fdfeab99b753a4.jpg | 0.900274 | /Utah/b5f6f0cf50b2ea.jpg | Utah |


## Discover patterns using Renumics Spotlight

In [22]:
# One row per pair of similar images
similarity_df = fd.similarity()
_, embeddings = fd.embeddings()
# Attach the embedding of the "from" image of each pair
similarity_df["embedding"] = embeddings[similarity_df["from"]].tolist()

Read a total of  4463 images


In [26]:
spotlight.show(similarity_df, dtype={"embedding": spotlight.Embedding})

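Pairs of highly similar images that carry different labels are natural candidates for label inconsistencies. The following is a minimal sketch of how such pairs could be extracted from a table shaped like `similarity_df`. It runs on toy data and assumes the label columns are named `label_from` and `label_to`, as in the duplicate-issue cell of this notebook. Whether a mismatch is a real labeling error or just two visually similar plates from different states still needs manual review.

```python
import pandas as pd

# Toy stand-in for the notebook's `similarity_df`.
# Assumption: labels of both images are available as "label_from" / "label_to".
pairs = pd.DataFrame(
    {
        "from": [0, 1, 2, 3],
        "to": [1, 0, 3, 2],
        "distance": [0.917, 0.917, 0.905, 0.905],
        "label_from": ["Wyoming", "Missouri", "Oregon", "Oregon"],
        "label_to": ["Missouri", "Wyoming", "Oregon", "Oregon"],
    }
)

# Keep only pairs whose two images carry different labels.
mismatch = pairs[pairs["label_from"] != pairs["label_to"]].copy()

# A pair can appear in both directions; keep one row per unordered pair.
mismatch["pair_key"] = [
    tuple(sorted(p)) for p in zip(mismatch["from"], mismatch["to"])
]
mismatch = mismatch.drop_duplicates("pair_key").sort_values(
    "distance", ascending=False
)

# Count which label combinations occur most often among the mismatches.
confusions = (
    mismatch.groupby(["label_from", "label_to"]).size().sort_values(ascending=False)
)

print(mismatch[["from", "to", "distance", "label_from", "label_to"]])
print(confusions)
```

Label combinations that appear often in `confusions` point to classes worth inspecting side by side in Spotlight.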

# Check the data for common issues like low contrast or blur

# Identify clusters to gain additional insights for training and evaluation