In [1]:
%load_ext autoreload
%autoreload 2

# Sliceguard – Find critical data segments in your data (fast)

Sliceguard is a Python library for quickly finding **critical data slices** such as outliers, errors, or biases. It works on **structured** and **unstructured** data.

This notebook focuses on the unstructured data case. If you have tabular data, have a look at [this notebook](examples/quickstart_structured_data.ipynb) instead.

Sliceguard is useful if you want to:
1. Find **performance issues** of your machine learning model.
2. Find **anomalies and inconsistencies** in your data.
3. Quickly **explore** your data using an interactive report to generate **insights**.

To run this notebook, install and import sliceguard:

In [None]:
!pip install sliceguard

In [2]:
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface

Now download the demo dataset with our utility function:

In [26]:
df = from_huggingface("Matthijs/snacks")

This gives you a dataframe with an image column holding the path to each raw image on the hard drive, a label column, and a split marker.

In [20]:
# Optional: uncomment the next line to downsample the dataset to a single class for a faster run.
# df = df[df["label"] == "banana"]
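
Before slicing, it can help to check how the classes are distributed and to see what the boolean-mask filtering used in this notebook does. The following is a minimal sketch on a small stand-in dataframe: the column name `split`, the file paths, and all values are assumptions made for illustration and do not reflect the real contents of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe returned by from_huggingface.
# Assumed columns: "image" (file path), "label" (class name), "split" (subset marker).
demo_df = pd.DataFrame(
    {
        "image": ["img/0.jpg", "img/1.jpg", "img/2.jpg", "img/3.jpg"],
        "label": ["banana", "popcorn", "popcorn", "apple"],
        "split": ["train", "train", "test", "validation"],
    }
)

# Count the number of images per class.
label_counts = demo_df["label"].value_counts()
print(label_counts)

# Keep only the rows of a single class using a boolean mask.
popcorn_df = demo_df[demo_df["label"] == "popcorn"]
print(len(popcorn_df))
```

On the real `df`, the same two steps show how many images each class contributes and how large a per-class subset will be before it is passed on.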

## Check for outliers and larger error groups

In [58]:
sg = SliceGuard()
issues = sg.find_issues(df[df["label"] == "popcorn"], ["image"], drop_reference="overall") # also try out drop_reference="parent" for more class-specific results

Feature image was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Using default model for computing embeddings for feature image.




Embedding computation on cuda with batch size 1 and multiprocessing None.


Some weights of ViTModel were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized: ['vit.pooler.dense.weight', 'vit.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/260 [00:00<?, ? examples/s]

Pre-reducing feature image in mode outlier.
Using op mix ratio 0.25.
Using num dimensions 32.
You didn't supply ground-truth labels and predictions. Will fit outlier detection model to find anomal slices instead.
The overall metric value is 0.4712803953080694
For outlier detection mode metric_mode will be set to min if not specified otherwise.
Using 20 as maximum slice number to return.
Using drop as sorting criterion for the slices to return.
Identified 20 problematic slices.


In [59]:
report_df = sg.report()
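
The log above reports an overall metric value and states that slices are sorted by their drop. One plausible way to read such a drop is as the difference between a slice's mean score and the mean score of a reference set (all data for `drop_reference="overall"`, or an enclosing slice for `"parent"`). The sketch below illustrates that reading only. The helper `metric_drop` and the scores are hypothetical, and this is not taken from sliceguard's implementation.

```python
from statistics import mean


def metric_drop(slice_scores, reference_scores, metric_mode="max"):
    """Illustrative drop of a slice's mean score relative to a reference set.

    metric_mode="max": higher scores are better, so a drop means the slice
    scores lower than the reference.
    metric_mode="min": lower scores are better, so a drop means the slice
    scores higher than the reference.
    """
    slice_value = mean(slice_scores)
    reference_value = mean(reference_scores)
    if metric_mode == "max":
        return reference_value - slice_value
    if metric_mode == "min":
        return slice_value - reference_value
    raise ValueError("metric_mode must be 'max' or 'min'")


# Hypothetical per-sample outlier scores, where lower means more typical.
overall_scores = [0.2, 0.3, 0.4, 0.9, 1.0]
candidate_slice = [0.9, 1.0]

drop = metric_drop(candidate_slice, overall_scores, metric_mode="min")
print(round(drop, 2))
```

Under this reading, switching the reference from the whole dataset to a parent slice only changes which scores are passed as `reference_scores`.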

## Let sliceguard train a model to pinpoint problems even better

In [48]:
sg = SliceGuard()
issues = sg.find_issues(df.sample(1000), ["image"], y="label", drop_reference="overall") # also try out drop_reference="parent" for more class-specific results

AssertionError: 

## Train your own advanced model and find its weaknesses