In [1]:
%load_ext autoreload
%autoreload 2

#  Sliceguard – Find critical data segments in your data (fast)

Sliceguard is a Python library for quickly finding **critical data slices** like outliers, errors, or biases. It works on **structured** and **unstructured** data.

This notebook focuses on the unstructured data case. If you have tabular data, have a look at [this notebook](examples/quickstart_structured_data.ipynb) instead.

Sliceguard is useful if you want to:
1. Find **performance issues** of your machine learning model.
2. Find **anomalies and inconsistencies** in your data.
3. Quickly **explore** your data using an interactive report to generate **insights**.

To run this notebook, install and import sliceguard:

In [None]:
!pip install sliceguard

In [2]:
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface

Now download the demo dataset with our utility function:

In [26]:
df = from_huggingface("Matthijs/snacks")

You now have a dataframe with an `image` column holding the path to each raw image on the hard drive, a `label` column, and a `split` marker.

In [27]:
df

index,image,label,split
0,/home/daniel/.cache/huggingface/datasets/downl...,apple,train
1,/home/daniel/.cache/huggingface/datasets/downl...,apple,train
2,/home/daniel/.cache/huggingface/datasets/downl...,apple,train
3,/home/daniel/.cache/huggingface/datasets/downl...,apple,train
4,/home/daniel/.cache/huggingface/datasets/downl...,apple,train
...,...,...,...
950,/home/daniel/.cache/huggingface/datasets/downl...,watermelon,validation
951,/home/daniel/.cache/huggingface/datasets/downl...,watermelon,validation
952,/home/daniel/.cache/huggingface/datasets/downl...,watermelon,validation
953,/home/daniel/.cache/huggingface/datasets/downl...,watermelon,validation


## Check for larger error groups and outliers

In [48]:
sg = SliceGuard()
issues = sg.find_issues(df, ["image"])

Feature image was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Using default model for computing embeddings for feature image.




Embedding computation on cuda with batch size 1 and multiprocessing None.


Some weights of ViTModel were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized: ['vit.pooler.dense.bias', 'vit.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/400 [00:00<?, ? examples/s]

You didn't supply ground-truth labels and predictions. Will fit outlier detection model to find anomal slices instead.
The overall metric value is 0.47909346882516424
For outlier detection mode metric_mode will be set to min if not specified otherwise.
Using 20 as maximum slice number to return.
Using drop as sorting criterion for the slices to return.
      metric  support      drop  level  drop+support
89  0.648239        2  0.169146      3      0.338291
47  0.633935        3  0.154842      3      0.464525
26  0.632910        2  0.153816      3      0.307633
64  0.628950        4  0.149856      3      0.599425
16  0.619255       11  0.140161      2      1.541776
..       ...      ...       ...    ...           ...
43  0.426496        8 -0.052598      3     -0.420783
13  0.425738        4 -0.053355      3     -0.213422
61  0.420830        5 -0.058263      3     -0.291315
85  0.419886        2 -0.059208      3     -0.118416
83  0.419882        2 -0.059211      3     -0.118423

[136 row
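
The columns in the table above can be checked by hand. The sketch below is inferred from the printed numbers, not taken from sliceguard's documentation: it assumes `drop` is the slice metric minus the overall metric value from the log, and that `drop+support` is `drop` multiplied by `support`. The rows are copied from the output above; `OVERALL_METRIC`, `rows`, `recomputed`, and `top_slices` are hypothetical names used only for illustration.

```python
import math

# Overall metric value reported in the log output above.
OVERALL_METRIC = 0.47909346882516424

# (slice id, metric, support, printed drop, printed drop+support),
# copied from the first rows of the table above.
rows = [
    (89, 0.648239, 2, 0.169146, 0.338291),
    (47, 0.633935, 3, 0.154842, 0.464525),
    (26, 0.632910, 2, 0.153816, 0.307633),
    (64, 0.628950, 4, 0.149856, 0.599425),
    (16, 0.619255, 11, 0.140161, 1.541776),
]

# The printed values are rounded to six decimals, so compare with a tolerance.
TOL = 2e-5

recomputed = []
for slice_id, metric, support, drop_printed, weighted_printed in rows:
    drop = metric - OVERALL_METRIC  # assumed definition of "drop"
    weighted = drop * support       # assumed definition of "drop+support"
    assert math.isclose(drop, drop_printed, abs_tol=TOL)
    assert math.isclose(weighted, weighted_printed, abs_tol=TOL)
    recomputed.append({"slice": slice_id, "drop": drop, "drop+support": weighted})

# Sort by drop (largest first) and cap the number of slices, mirroring the
# "drop as sorting criterion" and "maximum slice number" log messages.
MAX_SLICES = 20
top_slices = sorted(recomputed, key=lambda r: r["drop"], reverse=True)[:MAX_SLICES]
print([r["slice"] for r in top_slices])
```

If these assumptions hold, sorting by `drop+support` instead of `drop` would favor larger slices: among the rows shown, slice 16 (support 11) would move to the top.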

In [49]:
sg.report()

(                                                  image       label  \
 578   /home/daniel/.cache/huggingface/datasets/downl...      muffin   
 2661  /home/daniel/.cache/huggingface/datasets/downl...       juice   
 645   /home/daniel/.cache/huggingface/datasets/downl...        cake   
 2851  /home/daniel/.cache/huggingface/datasets/downl...      muffin   
 801   /home/daniel/.cache/huggingface/datasets/downl...       salad   
 ...                                                 ...         ...   
 807   /home/daniel/.cache/huggingface/datasets/downl...  strawberry   
 3977  /home/daniel/.cache/huggingface/datasets/downl...       salad   
 1017  /home/daniel/.cache/huggingface/datasets/downl...      carrot   
 2937  /home/daniel/.cache/huggingface/datasets/downl...      muffin   
 209   /home/daniel/.cache/huggingface/datasets/downl...      carrot   
 
            split                                       sg_emb_image  sg_y_pred  
 578         test  [-0.007233772426843643, 0.7412320

## Let sliceguard train a model to pinpoint problems even better

## Train your own advanced model and find its weaknesses