# Find data slices with Sliceguard

We use [Sliceguard](https://github.com/Renumics/sliceguard) to identify data slices: segments of the data where our machine learning model performs anomalously. We then explore these slices interactively to find model failure modes and problematic data segments.

More information about this play can be found in the Spotlight documentation: [Find data slices with Sliceguard](https://renumics.com/docs/playbook/data-slices-sliceguard)

For more data-centric AI workflows, check out our [Awesome Open Data-centric AI](https://github.com/Renumics/awesome-open-data-centric-ai) list on GitHub.

## TL;DR

In [1]:
# @title Install required packages with PIP

!pip install renumics-spotlight datasets cleanvision sliceguard




In [2]:
from sklearn.metrics import accuracy_score
import pandas as pd
import datasets
from renumics.spotlight import Image
from sliceguard import SliceGuard
from cleanvision.imagelab import Imagelab


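def find_data_slices(df, categories, category_types=None, spotlight_dtype=None, embedding_name='embedding', label_name='label', prediction_name='prediction'):
    # Avoid mutable default arguments; fall back to empty dicts.
    category_types = category_types or {}
    spotlight_dtype = spotlight_dtype or {}

    # Search for slices whose accuracy falls below the overall accuracy.
    sg = SliceGuard()
    df_slices = sg.find_issues(
        df,
        categories,
        label_name,
        prediction_name,
        accuracy_score,
        precomputed_embeddings={'embedding': df[embedding_name].to_numpy()},
        metric_mode="max",  # accuracy: higher is better
        feature_types=category_types
    )

    # Show the detected slices for interactive inspection.
    sg.report(spotlight_dtype=spotlight_dtype)

    return df_slices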
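
The list returned by `find_data_slices` can be post-processed with plain Python. The following is a minimal sketch, assuming each slice is a dict with the `id`, `metric`, and `indices` keys that appear in the cell output further down in this notebook. The helper name `worst_slices` and the toy values are hypothetical and only serve as an illustration.

```python
def worst_slices(slices, top_k=3):
    """Return the top_k slices with the lowest metric value (worst first)."""
    return sorted(slices, key=lambda s: s["metric"])[:top_k]


# Toy stand-in for the list returned by find_data_slices (values are made up).
toy_slices = [
    {"id": 0, "metric": 0.75, "indices": [37, 88, 130]},
    {"id": 1, "metric": 0.60, "indices": [57, 65]},
    {"id": 2, "metric": 0.80, "indices": [5, 9, 12, 40]},
]

# Print id, metric and size of the two weakest slices.
for s in worst_slices(toy_slices, top_k=2):
    print(s["id"], s["metric"], len(s["indices"]))
```

Sorting by metric first makes it easier to decide which slices to inspect in Spotlight before looking at the smaller drops.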

## Step-by-step example on CIFAR-100

### Load CIFAR-100 from Huggingface hub and convert it to Pandas dataframe

In [3]:
import datasets
from renumics import spotlight

dataset = datasets.load_dataset("renumics/cifar100-enriched", split="test")

df = dataset.to_pandas()
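
The later cells rely on the columns `image`, `embedding`, `fine_label`, and `fine_label_prediction`. A quick check along the following lines might catch a missing column before the longer-running steps start. This is a sketch only: the helper `missing_columns` is hypothetical and not part of any of the libraries used here.

```python
# Columns that the later cells of this notebook read from the dataframe.
REQUIRED_COLUMNS = ["image", "embedding", "fine_label", "fine_label_prediction"]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]


# Example with a deliberately incomplete column list.
print(missing_columns(["image", "fine_label", "embedding"]))
```

In the notebook itself this could be called as `missing_columns(df.columns)`, where an empty list would mean that all expected columns are present.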

### Compute heuristics for typical image data error scores with Cleanvision

In [4]:
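def cv_issues_cleanvision(df, image_name='image'):
    # Run Cleanvision on the image files referenced in the given column.
    image_paths = df[image_name].to_list()
    imagelab = Imagelab(filepaths=image_paths)
    imagelab.find_issues()

    # Move the index into a column so that the rows get a plain 0..n-1 index.
    df_cv = imagelab.issues.reset_index()

    return df_cv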

df_cv = cv_issues_cleanvision(df)
df = pd.concat([df, df_cv], axis=1)

Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...


Issue checks completed. 70 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
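
The cell above joins the Cleanvision scores to the dataframe with `pd.concat(..., axis=1)`, which aligns rows by index label rather than by position. The toy example below sketches why both frames should share the same index before concatenating. The frames and values are made up for illustration.

```python
import pandas as pd

# Toy frames: `df` has a non-default index, `scores` has the default 0..2 index.
df = pd.DataFrame({"image": ["a.png", "b.png", "c.png"]}, index=[10, 11, 12])
scores = pd.DataFrame({"dark_score": [0.9, 0.1, 0.5]})

# Index labels do not match, so the outer join produces six half-empty rows.
misaligned = pd.concat([df, scores], axis=1)

# Resetting both indices makes the rows line up by position.
aligned = pd.concat(
    [df.reset_index(drop=True), scores.reset_index(drop=True)], axis=1
)

print(len(misaligned), len(aligned))
```

Comparing `len(df)` before and after the concatenation is a cheap way to notice such a misalignment.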


### Inspect errors and detect problematic data slices with Sliceguard

> ⚠️ Running Spotlight in Colab currently has severe limitations (slow, no similarity map, no layouts) due to Colab restrictions (e.g. no websocket support). Run the notebook locally for the full Spotlight experience.

In [6]:
categories=['dark_score', 'low_information_score', 'light_score', 'blurry_score', 'fine_label']
prediction = 'fine_label_prediction'
label = 'fine_label'
category_types={'fine_label': 'nominal'}
spotlight_dtype={"image": Image}

find_data_slices(df, categories, category_types=category_types, spotlight_dtype=spotlight_dtype, embedding_name='embedding', label_name=label, prediction_name=prediction)

Feature dark_score will be treated as numerical value. You can override this by specifying feature_types.
Feature low_information_score will be treated as numerical value. You can override this by specifying feature_types.
Feature light_score will be treated as numerical value. You can override this by specifying feature_types.
Feature blurry_score will be treated as numerical value. You can override this by specifying feature_types.
The overall metric value is 0.9148
Using 25 as minimum support for determining problematic clusters.
Using 0.09148 as minimum drop for determining problematic clusters.
Identified 34 problematic slices.
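
The numbers in the log above can be related to each other with a little arithmetic: the minimum drop of 0.09148 is 10% of the overall metric of 0.9148. The sketch below assumes that a slice counts as problematic when it contains at least `min_support` samples and its metric lies at least `min_drop` below the overall metric. That rule is an interpretation of the log messages, not a confirmed description of Sliceguard's internals, and the helper `is_problematic` is hypothetical.

```python
def is_problematic(slice_metric, slice_size, overall_metric, min_support, min_drop):
    """Assumed rule: enough samples and a sufficiently large metric drop."""
    drop = overall_metric - slice_metric
    return slice_size >= min_support and drop >= min_drop


overall = 0.9148           # overall accuracy reported in the log
min_support = 25           # minimum support reported in the log
min_drop = 0.1 * overall   # matches the reported 0.09148

# The first slice in the output below has 100 samples and a metric of 0.75.
flagged = is_problematic(0.75, 100, overall, min_support, min_drop)
print(round(min_drop, 5), flagged)
```

Under this assumed rule, any slice with an accuracy of at most 0.9148 - 0.09148 = 0.82332 and at least 25 samples would be reported.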


[{'id': 0,
  'level': 0,
  'indices': array([  37,   88,  130,  141,  244,  291,  358,  379,  503,  511,  830,
         1182, 1238, 1399, 1461, 1521, 1541, 1561, 1758, 1800, 1833, 2119,
         2170, 2305, 2427, 2451, 2452, 2608, 2638, 2718, 2780, 2868, 2878,
         2899, 2923, 2952, 3068, 3400, 3425, 3557, 3599, 3687, 3707, 3724,
         3816, 3869, 3980, 4036, 4046, 4082, 4110, 4141, 4180, 4324, 4544,
         4574, 4623, 4917, 4955, 5118, 5203, 5512, 5685, 5798, 5996, 6023,
         6110, 6241, 6247, 6606, 6617, 6672, 6716, 6775, 6871, 7260, 7346,
         7368, 7487, 7608, 7671, 7929, 8016, 8235, 8271, 8276, 8375, 8591,
         8615, 9026, 9137, 9240, 9278, 9339, 9635, 9674, 9727, 9858, 9887,
         9974], dtype=int64),
  'metric': 0.75,
  'explanation': 'fine_label, (1.00), blurry_score, (0.00), light_score, (0.00)'},
 {'id': 1,
  'level': 0,
  'indices': array([  57,   65,   99,  138,  142,  188,  211,  398,  408,  431,  454,
          512,  523,  534,  551,  589,  600,  6