In [None]:
%load_ext autoreload
%autoreload 2

# Sliceguard – Find critical data segments in your data (fast)
## Mixed Data Walkthrough

Sliceguard is a Python library for quickly finding **critical data slices** such as outliers, errors, or biases. It works on **structured** and **unstructured** data.

This notebook focuses on the **mixed** data case. If you are specifically interested in structured or unstructured data analysis, refer to the dedicated guides for **[structured data](./quickstart_structured_data.ipynb)** and **[unstructured data](./quickstart_unstructured_data.ipynb)** respectively.

This walkthrough is useful if you want to:
1. Find **performance issues** of your machine learning model.
2. Find **anomalies and inconsistencies** in your data.
3. Quickly **explore** your data using an interactive report to generate **insights**.

To run this notebook, install and import sliceguard:

In [None]:
!pip install sliceguard

In [None]:
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

Now download the demo dataset from the Hugging Face Hub:

In [None]:
df = from_huggingface("alfredodeza/wine-ratings")

In [None]:
# Subsample dataframe for quicker execution
df = df.sample(2000)

In [None]:
# Show dataframe
df

## Check for data slices that are particularly different (Outliers/Errors in the data)
Here sliceguard will train an **outlier detection** model to highlight data points that are especially different from the rest. Note that you can use **structured data** like the categorical variables *variety* and *region* in parallel with **unstructured data** like *notes* or *name*. Sliceguard computes the embeddings and applies proper normalization internally. Beware, however, that raw data and embeddings are often much richer than a categorical field with only 5 unique values. This makes it much more likely that sliceguard will find isolated clusters based on the embeddings. To counter this, you can use the `embedding_weights` parameter to manually lower the influence of specific embeddings.
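
To build some intuition for why lowering an embedding weight shifts the balance towards the structured features, here is a minimal sketch. It is not sliceguard's internal implementation: the function `weighted_distance`, the two samples, and the way weights are applied (scaling each feature block before a Euclidean distance) are assumptions made purely for illustration.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance over named feature blocks, each scaled by a weight.

    Hypothetical helper for illustration only; blocks without an explicit
    weight are used with weight 1.0.
    """
    total = 0.0
    for name, vec_a in a.items():
        w = weights.get(name, 1.0)
        total += sum((w * (x - y)) ** 2 for x, y in zip(vec_a, b[name]))
    return math.sqrt(total)


# Two made-up samples: a 3-dimensional "notes" embedding and a
# 1-dimensional encoded "variety" value.
sample_a = {"notes": [0.9, 0.1, 0.4], "variety": [0.0]}
sample_b = {"notes": [0.1, 0.8, 0.2], "variety": [1.0]}

d_full = weighted_distance(sample_a, sample_b, {})
d_half = weighted_distance(sample_a, sample_b, {"notes": 0.5})

print(f"distance with notes weight 1.0: {d_full:.3f}")
print(f"distance with notes weight 0.5: {d_half:.3f}")
```

With the smaller weight, differences in the *notes* block contribute less to the overall distance, so the categorical *variety* difference carries relatively more influence.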

You can then use the **report feature**, which relies on [Renumics Spotlight](https://github.com/Renumics/spotlight) for visualization, to dig into the reasons why a cluster is considered an outlier. For mixed data it makes particular sense to use the inspector view to look at the unstructured data while visualizing the structured data with histograms, scatterplots, and so on.

In [None]:
sg = SliceGuard()
issues = sg.find_issues(df, features=["notes", "variety"], embedding_weights={"notes": 0.5}) # Play with the embedding weights parameter a bit. More fun in richer datasets.

In [None]:
_ = sg.report()

## Check for data slices where models are prone to fail (hard samples, inconsistencies)
Here sliceguard will **train a regression model** and check for data slices where the MSE score is particularly bad. You will notice that, in general, it is hard for the model to determine the proper rating from the notes and variety alone. However, there are certain patterns you can uncover, especially **uninformative notes** such as "Ex-chateau release" that do not contain any information for generalizing to other data.
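
The idea of flagging slices by how much their score drops relative to the whole dataset can be sketched in a few lines. This is only an illustration of the comparison, not sliceguard's slice detection algorithm: the ratings, predictions, and slice labels below are made up, and the helper `mse` is defined here just to keep the example self-contained.

```python
from collections import defaultdict


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings, predictions, and slice assignments.
y_true = [90, 92, 88, 91, 95, 85]
y_pred = [89, 92, 89, 91, 88, 93]
slices = ["a", "a", "a", "a", "b", "b"]

overall = mse(y_true, y_pred)

# Collect targets and predictions per slice.
by_slice = defaultdict(lambda: ([], []))
for s, t, p in zip(slices, y_true, y_pred):
    by_slice[s][0].append(t)
    by_slice[s][1].append(p)

# MSE is an error metric, so a slice is worse when its score is HIGHER
# than the overall score.
gaps = {s: mse(t, p) - overall for s, (t, p) in by_slice.items()}
worst = max(gaps, key=gaps.get)

print(f"overall MSE: {overall:.2f}")
for s, gap in gaps.items():
    print(f"slice {s}: gap to overall = {gap:+.2f}")
print(f"worst slice: {worst}")
```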

In [None]:
# Train the model and predict on the same data (of course in practice you will want to split your data!!!)
# This is only for showing the principle
sg = SliceGuard()
issues = sg.find_issues(df,
                        features=["notes", "variety"],
                        y="rating",
                        n_slices=30,
                        criterion="drop",
                        metric=mean_squared_error,
                        automl_task="regression",
                        automl_time_budget=180
                       ) # also try out drop_reference="parent" for more class-specific results

In [None]:
_ = sg.report()

In [None]:
# Save embeddings for later use in own model example
notes_embeddings = sg.embeddings["notes"]

## Check for weaknesses of your own model (...and hard samples + inconsistencies)
This shows how to pass your **own model predictions** into sliceguard to find slices that are performing badly according to a supplied metric function. This allows you to uncover **inconsistencies** and samples that are **hard to learn** in no time!
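
The cells in this section train and predict on the same rows to keep the demo short. One way to get held-out predictions instead is scikit-learn's `cross_val_predict`. The sketch below uses synthetic data as a stand-in for the notes embeddings and ratings, so the array shapes and the target function are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-ins: X plays the role of the notes embeddings,
# y the role of the ratings.
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# With 5-fold cross-validation, each prediction comes from a model
# that did not see that row during training.
held_out_pred = cross_val_predict(SVR(), X, y, cv=5)
held_out_mse = mean_squared_error(y, held_out_pred)

print(f"held-out MSE: {held_out_mse:.3f}")
```

Predictions obtained this way can be stored in a dataframe column and passed via `y_pred` in the same way as the in-sample predictions used in this section.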

In [None]:
# Train the model and predict on the same data (of course in practice you will want to split your data!!!)
# This is only for showing the principle
clf = SVR()
clf.fit(notes_embeddings, df["rating"])
df["predictions"] = clf.predict(notes_embeddings)

In [None]:
# Pass the predictions to sliceguard and uncover hard samples and inconsistencies.
sg = SliceGuard()
issues = sg.find_issues(df,
                        features=["notes"],
                        y="rating",
                        y_pred="predictions",
                        metric=mean_squared_error,
                        metric_mode="min",
                        precomputed_embeddings={"notes": notes_embeddings})

In [None]:
_ = sg.report()