In [1]:
%load_ext autoreload
%autoreload 2

# Sliceguard – Find critical data segments in your data (fast)
## Structured Data Walkthrough

Sliceguard is a Python library for quickly finding **critical data slices** such as outliers, errors, or biases. It works on **structured** and **unstructured** data.

This notebook focuses on the **structured** data case. If you have unstructured data such as images or audio, have a look at [this notebook](./examples/quickstart_unstructured_data.ipynb) instead.

Sliceguard is useful if you want to:
1. Find **performance issues** of your machine learning model.
2. Find **anomalies and inconsistencies** in your data.
3. Quickly **explore** your data using an interactive report to generate **insights**.

To run this notebook, install and import sliceguard:

In [None]:
!pip install sliceguard

In [15]:
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

Now download the demo dataset from the Hugging Face Hub:

In [3]:
df = from_huggingface("mstz/wine")

In [11]:
# Define the feature names
feature_names = ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
                "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density", "pH",
                "sulphates", "alcohol"]

In [4]:
# Show dataframe
df

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,is_red,split
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red,train
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,red,train
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,red,train
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,red,train
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red,train
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,white,train
6493,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,white,train
6494,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,white,train
6495,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,white,train


## Check for data slices that are particularly different (Outliers/Errors in the data)
Here sliceguard trains an **outlier detection** model to highlight data points that differ strongly from the rest. Note that the outlier detection runs only on the class "red". You will quickly see several strong outliers, e.g. with extremely high residual sugar values.

You can use the **report feature**, which uses [Renumics Spotlight](https://github.com/Renumics/spotlight) for visualization, to dig into why a cluster is considered an outlier. For example, use multiple **histograms** to see what is characteristic of a data point.
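To build intuition for what "especially different from the rest" means, here is a minimal sketch of outlier scoring on a single feature. It uses a robust z-score (median and median absolute deviation) as a simple stand-in; it is not sliceguard's internal method, which is not described in this notebook. The values and the helper `robust_z_scores` are made up for illustration.

```python
import numpy as np


def robust_z_scores(values):
    """Return |x - median| / (1.4826 * MAD) for each value (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # All values (or at least half of them) are identical: nothing stands out.
        return np.zeros_like(values)
    return np.abs(values - median) / (1.4826 * mad)


# Made-up values that only mimic a column such as residual_sugar.
residual_sugar = np.array([1.9, 2.6, 2.3, 1.9, 2.1, 2.4, 2.0, 15.5])

scores = robust_z_scores(residual_sugar)
outlier_mask = scores > 3.5  # a common rule-of-thumb threshold, not a sliceguard setting

print(residual_sugar[outlier_mask])
```

A real outlier detection model looks at all features jointly, so it can also flag points whose individual values look unremarkable but whose combination is unusual.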

In [5]:
sg = SliceGuard()
issues = sg.find_issues(df[df["is_red"] == "red"], features=feature_names)

Feature fixed_acidity will be treated as numerical value. You can override this by specifying feature_types.
Feature volatile_acidity will be treated as numerical value. You can override this by specifying feature_types.
Feature citric_acid will be treated as numerical value. You can override this by specifying feature_types.
Feature residual_sugar will be treated as numerical value. You can override this by specifying feature_types.
Feature chlorides will be treated as numerical value. You can override this by specifying feature_types.
Feature free_sulfur_dioxide will be treated as numerical value. You can override this by specifying feature_types.
Feature total_sulfur_dioxide will be treated as numerical value. You can override this by specifying feature_types.
Feature density will be treated as numerical value. You can override this by specifying feature_types.
Feature pH will be treated as numerical value. You can override this by specifying feature_types.
Feature sulphates will be

In [6]:
_ = sg.report()

## Check for data slices where models are prone to fail (hard samples, inconsistencies)
Here sliceguard will **train a classification model** and check for data slices where the accuracy score is particularly bad. This surfaces wines that are easily confused based on their chemical values, e.g. red wines with particularly high residual sugar levels or especially low chloride values.

Note that the dataset is relatively easy, so sliceguard won't find as many issues as it would in more complex datasets.
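The idea of "a data slice where the accuracy score is bad" can be sketched by hand for a single feature. The example below bins one made-up feature and compares per-bin accuracy with the overall accuracy. This is a manual simplification under stated assumptions: the data is invented, the bin edges are arbitrary, and sliceguard's own slice search is automated and may work differently.

```python
import pandas as pd

# Invented labels and predictions; the column names only mirror the wine dataset.
toy = pd.DataFrame({
    "residual_sugar": [1.2, 1.5, 1.9, 2.3, 2.6, 6.0, 8.0, 9.5, 11.0, 13.0],
    "is_red": ["red", "red", "red", "red", "red",
               "white", "white", "red", "red", "white"],
    "predictions": ["red", "red", "red", "red", "red",
                    "white", "white", "white", "white", "white"],
})

# Mark each row as correctly or incorrectly classified.
toy["correct"] = toy["is_red"] == toy["predictions"]
overall_accuracy = toy["correct"].mean()

# Slice the data into arbitrary bins of one feature and compute accuracy per bin.
toy["sugar_bin"] = pd.cut(
    toy["residual_sugar"], bins=[0, 5, 10, 15], labels=["low", "medium", "high"]
)
slice_accuracy = toy.groupby("sugar_bin", observed=True)["correct"].mean()

# Slices that score below the overall accuracy are candidates for a closer look.
weak_slices = slice_accuracy[slice_accuracy < overall_accuracy]

print(overall_accuracy)
print(slice_accuracy)
print(weak_slices)
```

In this toy data the misclassified red wines all sit in the higher sugar bins, which is the kind of pattern the text above describes.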

In [None]:
# Train the model and predict on the same data (in practice, you will want to split your data!)
# This is only to illustrate the principle.
sg = SliceGuard()
issues = sg.find_issues(df,
                        features=feature_names,
                        y="is_red",
                        metric=accuracy_score,
                        automl_task="classification"
                       ) # also try out drop_reference="parent" for more class-specific results

In [None]:
_ = sg.report()

## Check for weaknesses of your own model (...and hard samples + inconsistencies)
This shows how to pass your **own model predictions** into sliceguard to find slices that perform badly according to a supplied metric function. This allows you to uncover **inconsistencies** and samples that are **hard to learn** in no time!
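The metric used throughout this notebook is scikit-learn's `accuracy_score`, which is called as `metric(y_true, y_pred)` and returns a score where higher is better. Assuming sliceguard accepts any callable with that signature (an assumption, not verified here), a custom metric could look like the sketch below. `red_recall` is a hypothetical helper, not part of sliceguard or scikit-learn.

```python
def red_recall(y_true, y_pred):
    """Fraction of truly red wines that were predicted as red (higher is better)."""
    # Keep only the predictions for samples whose true label is "red".
    red_predictions = [pred for true, pred in zip(y_true, y_pred) if true == "red"]
    if not red_predictions:
        # No red wines in this slice: return 0.0 by convention (a design choice).
        return 0.0
    return sum(pred == "red" for pred in red_predictions) / len(red_predictions)


# Made-up labels to show the call pattern metric(y_true, y_pred).
example_true = ["red", "red", "white", "red", "white"]
example_pred = ["red", "white", "white", "red", "red"]

example_score = red_recall(example_true, example_pred)
print(example_score)
```

Because the function only iterates over its inputs, it works with plain lists as well as pandas Series.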

In [17]:
# Train the model and predict on the same data (in practice, you will want to split your data!)
# This is only to illustrate the principle.
X = StandardScaler().fit_transform(df[feature_names].values)
y = df["is_red"]

clf = SVC()
clf.fit(X, y)
df["predictions"] = clf.predict(X)
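The comment above recommends splitting the data in practice. A minimal sketch of such a hold-out evaluation with scikit-learn follows. It uses synthetic data from `make_classification` as a stand-in for the wine features, so the variable names (`X_demo`, `y_demo`, `model`) and all numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 11 features, mirroring the number of wine features.
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)

# Hold out 25% of the samples; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training split only, which avoids leaking
# statistics of the held-out samples into the model.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Predictions on unseen samples are the ones worth analysing for weak slices.
test_predictions = model.predict(X_test)
holdout_accuracy = accuracy_score(y_test, test_predictions)
print(holdout_accuracy)
```

With a split like this, the predictions for the held-out rows (rather than the training rows) would be the ones to store in a `predictions` column before looking for weak slices.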

In [18]:
# Pass the predictions to sliceguard and uncover hard samples and inconsistencies.
sg = SliceGuard()
issues = sg.find_issues(df,
                        features=feature_names,
                        y="is_red",
                        y_pred="predictions",
                        metric=accuracy_score)

Feature fixed_acidity will be treated as numerical value. You can override this by specifying feature_types.
Feature volatile_acidity will be treated as numerical value. You can override this by specifying feature_types.
Feature citric_acid will be treated as numerical value. You can override this by specifying feature_types.
Feature residual_sugar will be treated as numerical value. You can override this by specifying feature_types.
Feature chlorides will be treated as numerical value. You can override this by specifying feature_types.
Feature free_sulfur_dioxide will be treated as numerical value. You can override this by specifying feature_types.
Feature total_sulfur_dioxide will be treated as numerical value. You can override this by specifying feature_types.
Feature density will be treated as numerical value. You can override this by specifying feature_types.
Feature pH will be treated as numerical value. You can override this by specifying feature_types.
Feature sulphates will be

In [19]:
_ = sg.report()