In [1]:
%load_ext autoreload
%autoreload 2

# Sliceguard – Find critical data segments in your data (fast)
## Mixed Data Walkthrough

Sliceguard is a Python library for quickly finding **critical data slices** such as outliers, errors, or biases. It works on **structured** and **unstructured** data.

This notebook focuses on the **mixed** data case. If you are specifically interested in structured or unstructured data analysis, see the dedicated guides for **[structured data](./quickstart_structured_data.ipynb)** and **[unstructured data](./quickstart_unstructured_data.ipynb)**.

This notebook is for you if you want to:
1. Find **performance issues** of your machine learning model.
2. Find **anomalies and inconsistencies** in your data.
3. Quickly **explore** your data using an interactive report to generate **insights**.

To run this notebook, install and import sliceguard:

In [7]:
!pip install sliceguard

In [51]:
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface
from sklearn.metrics import mean_squared_error

Now download the demo dataset from the Hugging Face Hub:

In [20]:
df = from_huggingface("alfredodeza/wine-ratings")

Downloading readme:   0%|          | 0.00/502 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/13.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/81.6k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/81.0k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/32780 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/200 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/200 [00:00<?, ? examples/s]

In [30]:
# Subsample dataframe for quicker execution
df = df.sample(2000)

In [31]:
# Show dataframe
df

| | name | region | variety | rating | notes | split |
|---|---|---|---|---|---|---|
| 2631 | Arrowood Reserve Speciale Cabernet Sauvignon 2012 | Sonoma Valley, Sonoma County, California | Red Wine | 94.0 | The 2012 Réserve Spéciale Cabernet Sauvignon o... | train |
| 25274 | Gramercy Cellars The Deuce Syrah 2014 | Walla Walla Valley, Columbia Valley, Washington | Red Wine | 93.0 | Red fruit – raspberry, cranberry, red cherry. ... | train |
| 11602 | Chateau Haut-Brisson 2000 | St. Emilion, Bordeaux, France | Red Wine | 88.0 | "One of Saint-Emilion's most attractive over-p... | train |
| 30946 | La Cana Albarino 2010 | Rias Baixas, Spain | White Wine | 90.0 | The grapes for this wine come from 11.5 Ha of ... | train |
| 7532 | Ca' Rome Rapet Gold Label Barolo (1.5L Magnum)... | Barolo, Piedmont, Italy | Red Wine | 91.0 | Garnet with orange reflections. Ample, elegant... | train |
| ... | ... | ... | ... | ... | ... | ... |
| 8492 | Casanova di Neri Brunello di Montalcino (1.5 L... | Montalcino, Tuscany, Italy | Red Wine | 91.0 | The White Label 2014 is extraordinarily unique... | train |
| 8331 | Carpe Diem Chardonnay 2016 | Anderson Valley, Mendocino, California | White Wine | 91.0 | Carpe Diem Chardonnay is ripe with aromas and ... | train |
| 860 | Allegrini Amarone 2001 | Veneto, Italy | Red Wine | 93.0 | Deep purple in color with a bouquet full of dr... | train |
| 2323 | Argiano Solengo 2003 | Tuscany, Italy | Red Wine | 90.0 | Opaque ruby purple color, black currant and bl... | train |


## Check for data slices that are particularly different (outliers/errors in the data)
Here sliceguard trains an **outlier detection** model to highlight data points that differ markedly from the rest. You can combine **structured data**, such as the categorical variables *variety* and *region*, with **unstructured data**, such as *notes* or *name*. Sliceguard computes the embeddings and normalizes the features internally. Be aware, however, that embeddings of raw data are often much richer than a categorical field with only 5 unique values, which makes it much more likely that sliceguard finds isolated clusters based on the embeddings. To lower the influence of specific embeddings manually, use the `embedding_weights` parameter.
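To build intuition for what such a weight does, here is a minimal sketch in plain Python. It assumes, purely for illustration, that features are combined by scaling each embedding by its weight and concatenating it with a one-hot encoding of the categorical feature; sliceguard's actual internals may differ. The helper `combine` and all vectors are hypothetical.

```python
import math


def combine(embedding, one_hot, weight):
    """Scale the embedding by `weight` and append the one-hot categorical part.

    Illustrative assumption only, not sliceguard's implementation.
    """
    return [weight * x for x in embedding] + list(one_hot)


# Hypothetical 2-d "notes" embeddings and one-hot "variety" encodings.
emb_a, emb_b = [1.0, 0.0], [0.0, 1.0]
red, white = [1.0, 0.0], [0.0, 1.0]

for weight in (1.0, 0.5, 0.1):
    # Same variety, different notes: distance comes only from the embedding.
    d_notes = math.dist(combine(emb_a, red, weight), combine(emb_b, red, weight))
    # Same notes, different variety: distance comes only from the category.
    d_variety = math.dist(combine(emb_a, red, weight), combine(emb_a, white, weight))
    print(f"weight={weight}: notes distance={d_notes:.3f}, variety distance={d_variety:.3f}")
```

With these toy vectors, the distance caused by differing notes shrinks in proportion to the weight, while the distance caused by differing variety stays constant, so the categorical feature gains relative influence.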

You can then use the **report feature**, which uses [Renumics Spotlight](https://github.com/Renumics/spotlight) for visualization, to dig into the reasons why a cluster is considered an outlier. For mixed data, it makes particular sense to use the inspector view to look at unstructured data alongside structured data visualized with histograms, scatter plots, and so on.

In [48]:
sg = SliceGuard()
# Play with the embedding weights parameter a bit. More fun in richer datasets.
issues = sg.find_issues(
    df,
    features=["notes", "variety"],
    embedding_weights={"notes": 0.5},
)

Feature notes was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Feature region was inferred as being categorical. Will be treated as nominal by default. If ordinal specify in feature_types and feature_orders!
Using default model for computing embeddings for feature notes.
Embedding computation on cuda with batch size 1 and multiprocessing None.
Pre-reducing feature notes in mode outlier.
Using op mix ratio 0.25.
Using num dimensions 8.
Weighting the embedding with manually supplied weight 0.05.
You didn't supply ground-truth labels and predictions. Will fit outlier detection model to find anomal slices instead.
The overall metric value is 0.31693185527105866
For outlier detection mode metric_mode will be set to min if not specified otherwise.
Using 20 as maximum slice number to return.
Using drop as sorting criterion for the slices to return.
Identified 20 problematic slices.


In [49]:
_ = sg.report()

## Check for data slices where models are prone to fail (hard samples, inconsistencies)
Here sliceguard **trains a regression model** and checks for data slices where the MSE is particularly bad. In general, the model finds it hard to determine the rating from the notes and variety alone. However, there are certain patterns you can uncover, especially **uninformative notes** such as "Ex-chateau release" that contain no information that generalizes to other data.
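The idea of a slice whose MSE is worse than the overall MSE can be sketched in a few lines of plain Python. This is only an illustration of the principle under the assumption that slices are already known; sliceguard discovers the slices itself and may compute its "drop" differently. The slice names, ratings, and predictions below are made up.

```python
from collections import defaultdict


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical (slice, true rating, predicted rating) rows.
rows = [
    ("informative", 90.0, 91.0),
    ("informative", 94.0, 93.0),
    ("informative", 88.0, 88.0),
    ("uninformative", 95.0, 90.0),
    ("uninformative", 86.0, 91.0),
]

# Metric over the whole dataset, used as the reference.
overall = mse([r[1] for r in rows], [r[2] for r in rows])

# Group (truth, prediction) pairs by slice.
by_slice = defaultdict(list)
for name, y, y_hat in rows:
    by_slice[name].append((y, y_hat))

# Difference between the slice MSE and the overall MSE; positive means worse.
drops = {}
for name, pairs in by_slice.items():
    ys, preds = zip(*pairs)
    drops[name] = mse(ys, preds) - overall

# Slices sorted from most to least problematic.
worst_first = sorted(drops, key=drops.get, reverse=True)
print(worst_first)
```

In this toy data, the slice with uninformative notes ends up first because its predictions are far from the true ratings.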

In [67]:
# Train the model and predict on the same data. In practice, split your data into training and evaluation sets!
# This is only meant to illustrate the principle.
sg = SliceGuard()
issues = sg.find_issues(df,
                        features=["notes", "variety"],
                        y="rating",
                        n_slices=30,
                        criterion="drop",
                        metric=mean_squared_error,
                        automl_task="regression",
                        automl_time_budget=180
                       ) # also try out drop_reference="parent" for more class-specific results

Feature notes was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Feature variety was inferred as being categorical. Will be treated as nominal by default. If ordinal specify in feature_types and feature_orders!
Using default model for computing embeddings for feature notes.
Embedding computation on cuda with batch size 1 and multiprocessing None.
Pre-reducing feature notes in mode automl.
Using op mix ratio 0.8.
Using num dimensions 8.
[flaml.automl.logger: 08-21 13:25:01] {1679} INFO - task = regression
[flaml.automl.logger: 08-21 13:25:01] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 08-21 13:25:01] {1788} INFO - Minimizing error metric: mse
[flaml.automl.logger: 08-21 13:25:01] {1900} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl.logger: 08-21 13:25:01] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 08-21 13:25:01] {2344} INFO - Estimated sufficient time budget=892s. Estimated 

In [68]:
_ = sg.report()

In [70]:
notes_embeddings = sg.embeddings["notes"]

## Check for weaknesses of your own model (...and hard samples + inconsistencies)
This shows how to pass your **own model predictions** into sliceguard to find slices that are performing badly according to a supplied metric function. This allows you to uncover **inconsistencies** and samples that are **hard to learn** in no time!
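Predictions are most informative when they come from data the model has not seen during training. A minimal sketch of producing such held-out predictions, using plain Python and a deliberately trivial mean baseline as a stand-in for your own model, could look like this. The ratings, the 20% holdout size, and the baseline are all assumptions made for illustration.

```python
import random
from statistics import mean

# Hypothetical ratings standing in for df["rating"].
ratings = [94.0, 93.0, 88.0, 90.0, 91.0, 91.0, 91.0, 93.0, 90.0, 89.0]

# Shuffle the row indices with a fixed seed and hold out 20% for evaluation.
rng = random.Random(0)
indices = list(range(len(ratings)))
rng.shuffle(indices)
n_test = max(1, int(0.2 * len(indices)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Stand-in "model": always predict the mean rating of the training rows.
baseline = mean(ratings[i] for i in train_idx)

# Predictions exist only for held-out rows, keyed by row index.
predictions = {i: baseline for i in test_idx}
print(predictions)
```

The resulting predictions can then be stored as a column next to the ground-truth ratings of the held-out rows and analyzed slice by slice.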

In [71]:
notes_embeddings.shape

(2000, 384)