# Cleaning Datasets for Image Classification with fastdup and Renumics Spotlight
This notebook provides a blueprint for **improving your machine learning data** quickly with fastdup and Renumics Spotlight. *fastdup* is an open source library for **scalable data curation** that offers high-quality detection algorithms for uncovering the most common data problems. *Renumics Spotlight* is an open source tool for **interactively visualizing** datasets and machine learning results. Combined, they are a powerful way to **automatically detect** data issues and **discover systematic patterns** in the detection results, helping you improve your data quickly and effectively.

# Preparations
To run this notebook you need to install the following **dependencies**:

In [None]:
!pip install -U numpy pandas fastdup renumics-spotlight

You will also need some imports and have to **load the dataset** used in this example. You can find the dataset [here on Kaggle](https://www.kaggle.com/datasets/tolgadincer/us-license-plates).

Be sure to also **adjust the dataset path** below if you want to run the code locally.

In [None]:
# The imports you need for running this example notebook
from pathlib import Path
import pandas as pd
import numpy as np
import fastdup
from renumics import spotlight
from renumics.spotlight.analysis import DataIssue

In [None]:
# The dataset input directory
INPUT_DIR = Path("/home/daniel/data/license_plates/data") # ADJUST THE PATH TO YOUR DIRECTORY

In [None]:
# Load the data
labels = []
filenames = []

class_dirs = INPUT_DIR.glob("*")
for class_dir in class_dirs:
    if class_dir.is_dir():
        images = class_dir.glob("*")
        for image in images:
            filenames.append(str(image))
            labels.append(class_dir.name)

df = pd.DataFrame({"filename": filenames, "label": labels})

After loading the data, you end up with a **dataframe** that looks as follows. If you want to **apply the notebook to your own data**, make sure that either the file layout or at least the final dataframe matches the format shown here. You can then apply all the following steps directly.

In [None]:
df
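A quick sanity check can catch format problems before fastdup runs. The following is a minimal sketch that assumes only the two columns shown above; `check_annotations` is a hypothetical helper (not part of fastdup or Spotlight), and the toy dataframe merely stands in for your own data.

```python
import pandas as pd


def check_annotations(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame looks fine."""
    problems = []
    for column in ("filename", "label"):
        if column not in frame.columns:
            problems.append(f"missing column: {column}")
    if problems:
        return problems  # the remaining checks need both columns
    if frame["filename"].duplicated().any():
        problems.append("duplicate filenames")
    if frame["label"].isna().any():
        problems.append("missing labels")
    return problems


# Toy stand-in for the dataframe built above
toy_df = pd.DataFrame(
    {"filename": ["CA/1.jpg", "CA/2.jpg", "NY/1.jpg"], "label": ["CA", "CA", "NY"]}
)
print(check_annotations(toy_df))
```

In the notebook you would call the helper on `df` instead of `toy_df` and inspect the returned list before continuing.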

In [None]:
fd = fastdup.create(input_dir=INPUT_DIR)
fd.run(annotations=df, overwrite=True, model_path="dinov2s", run_advanced_stats=True)  # Detect data issues using fastdup
_, embeddings = fd.embeddings(d=384)  # Store the generated embeddings in a variable
df["embedding"] = embeddings.tolist()

In [None]:
fd.img_stats().columns

# Check the data for common issues like low contrast or blur

## Detect issues fast using fastdup
fastdup already does a great job of **automatically** finding images that might be difficult for a machine learning model to handle because of **challenging environmental conditions**.

In [None]:
fd.vis.stats_gallery(metric="bright") # also try dark and blur

## Discover patterns using Renumics Spotlight
fastdup's report already gives you a very useful overview of potentially challenging cases your model might struggle with. Spotlight can additionally help you **identify patterns**, e.g. clusters of **scenarios where these challenging conditions occur**.

In [None]:
stats_df = fd.img_stats()

columns_to_use = ["mean", "blur", "contrast", "mean_saturation", "edge_density"] # for more stats check out the dataframe columns!

df = pd.concat([df, stats_df[np.setdiff1d(columns_to_use, df.columns)]], axis=1)

df["issue"] = "no"
df.loc[df["mean"] > 220.5, "issue"] = "bright"
df.loc[df["mean"] < 40, "issue"] = "dark"
df.loc[df["blur"] < 400, "issue"] = "blurry"

stats_issues = []
for issue_type in df["issue"].unique():
    if issue_type == "no":
        continue
    stats_issue = DataIssue(
                        title=f"{issue_type.capitalize()} Image",
                        rows=df[df["issue"] == issue_type].index.tolist(),
                        columns=["blur"] if issue_type == "blurry" else ["mean"]
                        )
    stats_issues.append(stats_issue)
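Note that the three threshold assignments above run in sequence, so an image that is both dark and blurry ends up labeled "blurry" because the last matching assignment wins. If you prefer to make that priority explicit, `np.select` is one option. The sketch below uses a made-up toy dataframe in place of the real statistics and reproduces the same priority order.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the statistics columns; the values are made up for illustration
toy_stats = pd.DataFrame(
    {
        "mean": [230.0, 30.0, 128.0, 30.0],
        "blur": [1000.0, 1000.0, 1000.0, 100.0],
    }
)

# Conditions are checked in order and the first match wins,
# so "blurry" takes priority over "dark", which takes priority over "bright".
conditions = [
    toy_stats["blur"] < 400,
    toy_stats["mean"] < 40,
    toy_stats["mean"] > 220.5,
]
choices = ["blurry", "dark", "bright"]

toy_stats["issue"] = np.select(conditions, choices, default="no")
print(toy_stats["issue"].tolist())
```

Reordering `conditions` and `choices` changes which issue is reported for images that match several thresholds.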

In [None]:
spotlight.show(df, dtype={"embedding": spotlight.Embedding, "filename": spotlight.Image}, issues=stats_issues, layout="spotlight-layout-issues.json", wait=True)

**Results:**
![Spotlight Results](img/issues_spotlight.png)

Spotlight will give you the opportunity to interactively explore the data using the features generated by fastdup. This will help you answer questions such as:
1. Where are **clusters** of images taken under challenging conditions?
2. Are the challenging conditions **associated** with specific classes?
3. Are there any conditions the **scalar features** do not sufficiently capture?
4. ...
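The second question can also be approached numerically before opening Spotlight. The following sketch assumes a dataframe with the `label` and `issue` columns created above; the toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df with the "label" and "issue" columns from above
toy_df = pd.DataFrame(
    {
        "label": ["CA", "CA", "NY", "NY", "NY", "TX"],
        "issue": ["no", "dark", "blurry", "blurry", "no", "no"],
    }
)

# Number of images per class and issue type
issue_counts = pd.crosstab(toy_df["label"], toy_df["issue"])

# Share of images per class that have any issue flagged
issue_rate = (toy_df["issue"] != "no").groupby(toy_df["label"]).mean()

print(issue_counts)
print(issue_rate.sort_values(ascending=False))
```

Classes with a noticeably higher issue rate are good candidates for a closer look in the Spotlight similarity map.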

# Check the data for outliers

## Detect issues fast using fastdup
Again, fastdup gives us a great **overview** of which types of outliers might exist in our data. Rendering an **HTML report** is just a one-liner.

In [None]:
fd.vis.outliers_gallery()

## Discover patterns using Renumics Spotlight
If you want to find out if there are systematic **outlier patterns**, which are possibly related to specific classes, you can use Renumics Spotlight for an **interactive analysis** of fastdup's detection results.

In [None]:
outlier_df = fd.outliers()

df["outlier"] = "no"
df.loc[outlier_df["outlier"], "outlier"] = "outlier"

columns_to_use = ["distance", "filename_nearest","label_nearest"]

df = pd.concat([df, outlier_df.set_index("outlier")[np.setdiff1d(columns_to_use, df.columns)]], axis=1)


outlier_issue = DataIssue(
                    title="Outlier",
                    rows=df[df["outlier"] == "outlier"].sort_values("distance").index.tolist(),
                    columns=["embedding"]
                )

In [None]:
spotlight.show(df, dtype={"embedding": spotlight.Embedding, "filename": spotlight.Image, "filename_nearest": spotlight.Image}, issues=[outlier_issue], layout="spotlight-layout-outliers.json", wait=True)

**Result**:
![Spotlight Outlier View](img/outliers_spotlight.png)

Spotlight will give you additional possibilities to interactively explore the outliers detected by fastdup and answer questions such as:
1. How are the outliers **distributed** across classes?
2. Where are **clusters** of outliers that share similar properties?
3. Are outliers fastdup detects in the image data **explainable via metadata** you might have?
4. ...
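As a complement to the interactive view, a few lines of pandas give a first answer to the question of how outliers are distributed. This sketch assumes the `outlier` and `label_nearest` columns added above; the toy dataframe and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df after the outlier columns were added
toy_df = pd.DataFrame(
    {
        "label": ["CA", "CA", "NY", "NY", "TX"],
        "outlier": ["outlier", "no", "outlier", "outlier", "no"],
        "label_nearest": ["CA", None, "TX", "NY", None],
    }
)

outliers = toy_df[toy_df["outlier"] == "outlier"]

# How many outliers does each class contain?
outliers_per_class = outliers["label"].value_counts()

# Outliers whose nearest neighbor carries a different label may deserve a label check
cross_class = outliers[outliers["label"] != outliers["label_nearest"]]

print(outliers_per_class)
print(cross_class)
```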

# Check the data for duplicates

## Detect issues fast using fastdup
fastdup also offers an **HTML report** on the most likely exact and near duplicates.

In [None]:
fd.vis.duplicates_gallery()

## Discover patterns using Renumics Spotlight
Spotlight will again help you **interactively explore** exact and near duplicates and thereby find additional patterns in the data.

In [None]:
similarity_df = fd.similarity()

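# Lower the thresholds to include more dissimilar images in the issues list
exact_dup_threshold = 0.998
near_dup_threshold = 0.98

# similarity_df holds one row per image pair, so label both images of each pair by their row index
near_pairs = similarity_df[similarity_df["distance"] >= near_dup_threshold]
exact_pairs = similarity_df[similarity_df["distance"] >= exact_dup_threshold]

df["duplicate"] = "no"
df.loc[np.union1d(near_pairs["from"], near_pairs["to"]), "duplicate"] = "near"
df.loc[np.union1d(exact_pairs["from"], exact_pairs["to"]), "duplicate"] = "exact"  # exact overrides near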

duplicate_issues = []
for _, row in similarity_df[similarity_df["distance"] >= near_dup_threshold].iterrows():
    duplicate_issue = DataIssue(
        title=f"{'Exact' if row['distance'] >= exact_dup_threshold else 'Near'} Duplicate | Distance {row['distance']:.2f}",
        description=f"Labels from/to: {row['label_from']}/{row['label_to']}",
        rows=[row["from"], row["to"]],
        columns=["embedding"],
    )
    duplicate_issues.append(duplicate_issue)

In [None]:
spotlight.show(df, dtype={"embedding": spotlight.Embedding, "filename": spotlight.Image}, issues=duplicate_issues, layout="spotlight-layout-duplicates.json", wait=True)

**Results:**
![Spotlight Results](img/duplicates_spotlight.png)

Spotlight will give you the possibility to explore fastdup's detection results interactively. You can investigate questions such as:
1. Are there **data slices** containing a large number of duplicates?
2. Can you manually identify larger **clusters of near duplicates**?
3. Are certain **metadata** attributes explanatory for certain types of duplicates?
4. ...
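To get a first impression of larger groups of near duplicates, the pairwise results can be merged into groups: if image A duplicates B and B duplicates C, all three belong together. The sketch below uses a small union-find structure from the standard library only. `group_pairs` is a hypothetical helper, and the toy pairs stand in for the `from`/`to` columns of the filtered `similarity_df`.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group row indices linked by duplicate pairs into clusters (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps lookups short
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    # Largest groups first, members sorted for readability
    return sorted((sorted(members) for members in groups.values()), key=len, reverse=True)


# Toy (from, to) pairs standing in for the rows of similarity_df above the threshold
toy_pairs = [(0, 1), (1, 2), (5, 6), (2, 0)]
print(group_pairs(toy_pairs))
```

With the real data, the pairs could be taken from the rows of `similarity_df` that exceed `near_dup_threshold`, for example via `zip` over its `from` and `to` columns.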

# Check the data for label inconsistencies

## Detect issues fast using fastdup
fastdup's view of potential **label inconsistencies** is based on the assumption that very similar images should have the same label. Pairs of similar images with different labels are therefore flagged as potential inconsistencies.

In [None]:
similarities_df = fd.vis.similarity_gallery() 

## Discover patterns using Renumics Spotlight
Spotlight will give you the opportunity to explore label inconsistencies on a **cluster level**.

In [None]:
similarity_df = fd.similarity()

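# Lower the threshold to include more dissimilar images in the issues list
inconsistency_threshold = 0.96

df["label_inconsistency"] = "no"
# similarity_df holds one row per image pair, so flag both images of each inconsistent pair by their row index
inconsistent_pairs = similarity_df[
    (similarity_df["distance"] >= inconsistency_threshold)
    & (similarity_df["label_from"] != similarity_df["label_to"])
]
inconsistent_rows = np.union1d(inconsistent_pairs["from"], inconsistent_pairs["to"])
df.loc[inconsistent_rows, "label_inconsistency"] = "inconsistent"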

inconsistency_issues = []
for _, row in similarity_df[(similarity_df["distance"] >= inconsistency_threshold) & (similarity_df["label_from"] != similarity_df["label_to"])].iterrows():
    inconsistency_issue = DataIssue(
        title=f"Label Inconsistency | Distance {row['distance']:.2f}",
        description=f"Labels from/to: {row['label_from']}/{row['label_to']}",
        rows=[row["from"], row["to"]],
        columns=["embedding"],
    )
    inconsistency_issues.append(inconsistency_issue)

In [None]:
spotlight.show(df, dtype={"embedding": spotlight.Embedding, "filename": spotlight.Image}, issues=inconsistency_issues, layout="spotlight-layout-label-inconsistencies.json", wait=True)

**Results:**
![Spotlight Results](img/inconsistencies_spotlight.png)

Spotlight will help you answer questions such as:
1. Are the detected label inconsistencies **true inconsistencies**?
2. Are the label inconsistencies especially present in certain **clusters** or classes?
3. Are there ways to **filter or correct** inconsistencies automatically?
4. ...
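One possible starting point for the third question is a simple neighbor vote: if most of an image's highly similar neighbors agree on a different label, that label becomes a suggestion for manual review. This is only a sketch of the idea, not a fastdup feature. `suggest_labels` and `min_neighbors` are hypothetical names, and the code assumes each pair appears only once in the input.

```python
from collections import Counter, defaultdict


def suggest_labels(pairs, labels, min_neighbors=2):
    """Suggest a label for each row whose similar neighbors mostly disagree with it.

    pairs: iterable of (from_index, to_index) for highly similar images
    labels: dict mapping row index to its current label
    """
    neighbor_labels = defaultdict(list)
    for left, right in pairs:
        neighbor_labels[left].append(labels[right])
        neighbor_labels[right].append(labels[left])

    suggestions = {}
    for index, votes in neighbor_labels.items():
        if len(votes) < min_neighbors:
            continue  # too little evidence to suggest anything
        winner, count = Counter(votes).most_common(1)[0]
        # Only suggest a change when a strict majority of neighbors agrees on another label
        if winner != labels[index] and count > len(votes) / 2:
            suggestions[index] = winner
    return suggestions


# Toy data: image 2 is labeled "NY" but all of its similar neighbors are labeled "CA"
toy_labels = {0: "CA", 1: "CA", 2: "NY", 3: "CA"}
toy_pairs = [(0, 2), (1, 2), (3, 2), (0, 1)]
print(suggest_labels(toy_pairs, toy_labels))
```

Treat the output as candidates to inspect in Spotlight rather than as automatic corrections, since visually similar images can legitimately belong to different classes.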

# Identify clusters to gain additional insights for training and evaluation

## Detect image clusters with fastdup
fastdup offers a gallery view to explore the most relevant **clusters of similar images**.

In [None]:
fd.vis.component_gallery()

## Interactively browse image clusters using Renumics Spotlight
Spotlight offers similar functionality but lets you explore **interactively** and examine subsets of the data in more detail using its **filtering** and dynamic **dimensionality reduction** capabilities.

In [None]:
cc_df, _ = fd.connected_components()

largest_groups = cc_df.groupby("component_id")["component_id"].count().sort_values(ascending=False)[:20]

df["cluster"] = -1

clusters = []

for group in largest_groups.index:
    indices = cc_df[(cc_df["component_id"] == group)]["index"]
    df.loc[indices, "cluster"] = group
    cluster = DataIssue(
                        title="Image Cluster",
                        rows=indices.tolist(),
                        columns=["embedding"]
                        )
    clusters.append(cluster)
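Once the `cluster` column exists, a short summary per cluster can hint at which clusters are worth opening first, for instance clusters that mix several labels. The sketch below assumes the `label` and `cluster` columns from above; the toy values and the helper functions `dominant_label` and `purity` are invented for illustration.

```python
import pandas as pd


def dominant_label(labels: pd.Series) -> str:
    """Most frequent label within one cluster."""
    return labels.mode().iloc[0]


def purity(labels: pd.Series) -> float:
    """Share of the cluster that carries the most frequent label."""
    return labels.value_counts(normalize=True).iloc[0]


# Toy stand-in for df with the "label" and "cluster" columns from above
toy_df = pd.DataFrame(
    {
        "label": ["CA", "CA", "NY", "TX", "TX", "TX", "NY"],
        "cluster": [3, 3, 3, 7, 7, 7, -1],
    }
)

clustered = toy_df[toy_df["cluster"] != -1]  # -1 marks rows outside the largest clusters
summary = clustered.groupby("cluster")["label"].agg(
    size="count", dominant_label=dominant_label, purity=purity
)
print(summary.sort_values("purity"))
```

Clusters with low purity contain visually similar images from different classes and are natural candidates for a closer look.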

In [None]:
spotlight.show(df, dtype={"embedding": spotlight.Embedding, "filename": spotlight.Image}, issues=clusters, layout="spotlight-layout-clusters.json", wait=True)

**Results:**
![Spotlight Results](img/clusters_spotlight.png)

Here Spotlight can help you **build an understanding** of the dataset in an intuitive way, using structured as well as unstructured data.