# Find data slices with Sliceguard

We use [Sliceguard](https://github.com/Renumics/sliceguard) to identify data slices: data segments where our machine learning model performs anomalously. We then explore these slices interactively to find model failure modes and problematic data segments.

More information about this play can be found in the Spotlight documentation: [Find data slices with Sliceguard](https://renumics.com/docs/playbook/data-slices-sliceguard)

For more data-centric AI workflows, check out our [Awesome Open Data-centric AI](https://github.com/Renumics/awesome-open-data-centric-ai) list on GitHub.

## TL;DR

In [None]:
# @title Install required packages with pip

!pip install renumics-spotlight datasets cleanvision sliceguard

In [None]:
from sklearn.metrics import accuracy_score
import pandas as pd
import datasets
from renumics.spotlight import Image
from sliceguard import SliceGuard
from cleanvision.imagelab import Imagelab


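def find_data_slices(
    df,
    categories,
    category_types=None,
    spotlight_dtype=None,
    embedding_name="embedding",
    label_name="label",
    prediction_name="prediction",
):
    """Find underperforming data slices with Sliceguard and open its report."""
    # Avoid mutable default arguments; fall back to empty dicts.
    category_types = category_types or {}
    spotlight_dtype = spotlight_dtype or {}

    sg = SliceGuard()
    df_slices = sg.find_issues(
        df,
        categories,
        label_name,
        prediction_name,
        accuracy_score,
        precomputed_embeddings={"embedding": df[embedding_name].to_numpy()},
        # Accuracy is a "higher is better" metric.
        metric_mode="max",
        feature_types=category_types,
    )

    # Explore the detected slices interactively.
    sg.report(spotlight_dtype=spotlight_dtype)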

    return df_slices
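
What counts as a data slice can be illustrated without Sliceguard. The sketch below is a simplified, hand-rolled stand-in, not Sliceguard's algorithm: it groups rows by a single categorical feature and flags groups whose accuracy falls clearly below the overall value. Accuracy is a "higher is better" metric, so the low-accuracy groups are the interesting ones. The helper names, the `min_drop` threshold and the toy rows are all made up for illustration.

```python
from collections import defaultdict


def slice_accuracies(rows, feature, label="label", prediction="prediction"):
    """Return (overall accuracy, {feature value: accuracy}) for a list of dict rows."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        hits[row[feature]] += int(row[label] == row[prediction])
    overall = sum(hits.values()) / sum(totals.values())
    per_slice = {value: hits[value] / totals[value] for value in totals}
    return overall, per_slice


def underperforming_slices(rows, feature, min_drop=0.1, **kwargs):
    """List feature values whose accuracy is at least `min_drop` below overall."""
    overall, per_slice = slice_accuracies(rows, feature, **kwargs)
    return sorted(v for v, acc in per_slice.items() if overall - acc >= min_drop)


# Toy data (hypothetical): the model is worse on dark images.
rows = [
    {"brightness": "dark", "label": "cat", "prediction": "dog"},
    {"brightness": "dark", "label": "dog", "prediction": "dog"},
    {"brightness": "dark", "label": "cat", "prediction": "dog"},
    {"brightness": "normal", "label": "cat", "prediction": "cat"},
    {"brightness": "normal", "label": "dog", "prediction": "dog"},
    {"brightness": "normal", "label": "dog", "prediction": "dog"},
]

overall, per_slice = slice_accuracies(rows, "brightness")
weak = underperforming_slices(rows, "brightness")
print(overall, per_slice, weak)
```

Sliceguard goes further than this: it can combine several features, including embeddings, which a single group-by cannot do.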

## Step-by-step example on CIFAR-100

### Load CIFAR-100 from the Hugging Face Hub and convert it to a pandas DataFrame

In [None]:
import datasets
from renumics import spotlight

dataset = datasets.load_dataset("renumics/cifar100-enriched", split="test")

df = dataset.to_pandas()

### Compute heuristic scores for typical image data errors with Cleanvision

In [None]:
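def cv_issues_cleanvision(df, image_name="image"):
    # Run Cleanvision on the image files referenced in the given column.
    image_paths = df[image_name].to_list()
    imagelab = Imagelab(filepaths=image_paths)
    imagelab.find_issues()

    # reset_index() moves the index into a regular "index" column.
    df_cv = imagelab.issues.reset_index()

    return df_cv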


df_cv = cv_issues_cleanvision(df)
df = pd.concat([df, df_cv], axis=1)
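
The `pd.concat(..., axis=1)` step pairs rows by index, so it relies on the Cleanvision scores coming back in the same order as `df`. A more defensive alternative is to join on the file path instead, as sketched below. This assumes that the `index` column produced by `reset_index()` holds the image paths and that paths are unique. `join_by_path` is a hypothetical helper, not part of Cleanvision or Spotlight, and the toy frames are made up.

```python
import pandas as pd


def join_by_path(df, df_cv, image_name="image", path_name="index"):
    """Attach the scores in df_cv to df by matching file paths, not row order."""
    return df.merge(
        df_cv,
        how="left",  # keep every row of df, in its original order
        left_on=image_name,
        right_on=path_name,
        validate="one_to_one",  # raise if a path occurs more than once
    )


# Toy frames (hypothetical) whose rows are deliberately in different orders.
df_toy = pd.DataFrame({"image": ["b.png", "a.png"], "fine_label": [1, 0]})
df_cv_toy = pd.DataFrame({"index": ["a.png", "b.png"], "dark_score": [0.9, 0.2]})

joined = join_by_path(df_toy, df_cv_toy)
print(joined)
```

If the two tables are already in the same order, this gives the same pairing as the positional `pd.concat`.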

### Inspect errors and detect problematic data slices with Sliceguard

> ⚠️ Running Spotlight in Colab currently has severe limitations (slow, no similarity map, no layouts) due to Colab restrictions (e.g. no websocket support). Run the notebook locally for the full Spotlight experience.

In [None]:
categories = [
    "dark_score",
    "low_information_score",
    "light_score",
    "blurry_score",
    "fine_label",
]
prediction = "fine_label_prediction"
label = "fine_label"
category_types = {"fine_label": "nominal"}
spotlight_dtype = {"image": Image}

find_data_slices(
    df,
    categories,
    category_types=category_types,
    spotlight_dtype=spotlight_dtype,
    embedding_name="embedding",
    label_name=label,
    prediction_name=prediction,
)