In [1]:
%load_ext autoreload
%autoreload 2

# Sliceguard – Find critical data segments in your data (fast)

Sliceguard is a Python library for quickly finding **critical data slices** such as outliers, errors, or biases. It works on **structured** and **unstructured** data.

This notebook focuses on the unstructured data case. If you have tabular data, have a look at [this notebook](examples/quickstart_structured_data.ipynb) instead.

Sliceguard is useful if you want to:
1. Find **performance issues** of your machine learning model.
2. Find **anomalies and inconsistencies** in your data.
3. Quickly **explore** your data using an interactive report to generate **insights**.

To run this notebook, install and import sliceguard:

In [None]:
!pip install sliceguard

In [2]:
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface

Now download the demo dataset with our utility function:

In [26]:
df = from_huggingface("Matthijs/snacks")

You now have a dataframe with three columns: an image column holding the path to the raw image on the hard drive, a label, and a split marker.

In [27]:
df

|     | image | label | split |
| --- | --- | --- | --- |
| 0 | /home/daniel/.cache/huggingface/datasets/downl... | apple | train |
| 1 | /home/daniel/.cache/huggingface/datasets/downl... | apple | train |
| 2 | /home/daniel/.cache/huggingface/datasets/downl... | apple | train |
| 3 | /home/daniel/.cache/huggingface/datasets/downl... | apple | train |
| 4 | /home/daniel/.cache/huggingface/datasets/downl... | apple | train |
| ... | ... | ... | ... |
| 950 | /home/daniel/.cache/huggingface/datasets/downl... | watermelon | validation |
| 951 | /home/daniel/.cache/huggingface/datasets/downl... | watermelon | validation |
| 952 | /home/daniel/.cache/huggingface/datasets/downl... | watermelon | validation |
| 953 | /home/daniel/.cache/huggingface/datasets/downl... | watermelon | validation |
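
Because the image column only stores file paths, it can be worth confirming that those files actually exist before computing embeddings. The following is a minimal sketch using only the standard library; the helper name `find_missing_paths` and the demo file names are hypothetical and not part of sliceguard.

```python
from pathlib import Path
import tempfile


def find_missing_paths(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Self-contained demo: a temporary file stands in for a cached image.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "apple.jpg"  # hypothetical file name
    existing.write_bytes(b"")           # create an empty placeholder file
    ghost = Path(tmp) / "missing.jpg"   # intentionally never created
    demo_missing = find_missing_paths([str(existing), str(ghost)])

print(f"{len(demo_missing)} path(s) missing in the demo")
```

In this notebook, calling `find_missing_paths(df["image"])` should return an empty list if the download completed successfully.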


## Check for larger error groups and outliers

In [50]:
sg = SliceGuard()
issues = sg.find_issues(df, ["image"], drop_reference="parent")

Feature image was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Using default model for computing embeddings for feature image.
Embedding computation on cuda with batch size 1 and multiprocessing None.
Some weights of ViTModel were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized: ['vit.pooler.dense.bias', 'vit.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You didn't supply ground-truth labels and predictions. Will fit outlier detection model to find anomal slices instead.
The overall metric value is 0.4790081435851741
For outlier detection mode metric_mode will be set to min if not specified otherwise.
Using 20 as maximum slice number to return.
Using drop as sorting criterion for the slices to return.
        metric  support      drop  level  drop+support
1343  0.639310        3  0.160302      4      0.480906
1363  0.638014        3  0.159006      4      0.477017
1326  0.637582        2  0.158574      4      0.317148
1370  0.637582        2  0.158574      4      0.317148
1330  0.637367        2  0.158359      4      0.316717
...        ...      ...       ...    ...           ...
235   0.412520        3 -0.066488      4     -0.199464
268   0.412320        4 -0.066688      4     -0.266752
284   0.411953        3 -0.067055      4     -0.201165
92    0.411787       11 -0.067221      3     -0.739433
1087  0.410110        3 -0.068898      4 
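
The columns of the printed table can be related to each other with a little arithmetic. Judging only from the numbers shown above (this is an inference from the output, not something taken from sliceguard's documentation), `drop` appears to be the slice metric minus the overall metric, and `drop+support` appears to be that drop multiplied by the slice's support. The sketch below recomputes both for a few rows copied from the output; the small tolerance accounts for the rounding in the printed values.

```python
import math

# Overall metric as printed in the log output above.
overall_metric = 0.4790081435851741

# (slice id, metric, support, printed drop, printed drop+support),
# copied from the table above.
rows = [
    (1343, 0.639310, 3, 0.160302, 0.480906),
    (1363, 0.638014, 3, 0.159006, 0.477017),
    (268, 0.412320, 4, -0.066688, -0.266752),
    (92, 0.411787, 11, -0.067221, -0.739433),
]


def recompute(metric, support, overall=overall_metric):
    """Assumed relationship: drop = metric - overall, weighted = drop * support."""
    drop = metric - overall
    return drop, drop * support


checks = []
for slice_id, metric, support, printed_drop, printed_weighted in rows:
    drop, weighted = recompute(metric, support)
    # Printed values are rounded to six decimals, so compare with a tolerance.
    ok = (math.isclose(drop, printed_drop, abs_tol=1e-5)
          and math.isclose(weighted, printed_weighted, abs_tol=1e-5))
    checks.append(ok)
    print(f"slice {slice_id}: drop={drop:.6f}, drop*support={weighted:.6f}, matches={ok}")
```

Read this way, sorting by `drop` ranks slices by how far their metric deviates from the overall value, while `drop+support` also rewards larger slices.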

In [51]:
sg.report()

(                                                 image       label  \
 0    /home/daniel/.cache/huggingface/datasets/downl...       apple   
 1    /home/daniel/.cache/huggingface/datasets/downl...       apple   
 2    /home/daniel/.cache/huggingface/datasets/downl...       apple   
 3    /home/daniel/.cache/huggingface/datasets/downl...       apple   
 4    /home/daniel/.cache/huggingface/datasets/downl...       apple   
 ..                                                 ...         ...   
 950  /home/daniel/.cache/huggingface/datasets/downl...  watermelon   
 951  /home/daniel/.cache/huggingface/datasets/downl...  watermelon   
 952  /home/daniel/.cache/huggingface/datasets/downl...  watermelon   
 953  /home/daniel/.cache/huggingface/datasets/downl...  watermelon   
 954  /home/daniel/.cache/huggingface/datasets/downl...  watermelon   
 
           split                                       sg_emb_image  sg_y_pred  
 0         train  [0.12190916389226913, 0.2230367809534073, -0.1.

## Let sliceguard train a model to pinpoint problems even better

## Train your own advanced model and find its weaknesses