# End-to-end Sliceline application

_____________________________
This demo notebook is split into 2 parts:

- **Machine Learning modelling**

This part implements a basic classification pipeline on the [Titanic dataset](https://www.openml.org/search?type=data&sort=runs&id=40945&status=active) to predict whether a passenger survived.

- **Model debugging with Sliceline**

This part identifies slices where the training error of the model is significantly higher, thanks to [sliceline](https://github.com/DataDome/sliceline).

## Machine Learning modelling

The pipeline is composed of 2 steps:
1. The preprocessor: to transform raw data into numerical data without NaN,
2. The classifier: a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with default parameters.

The training error is the element-wise [log loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html).
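As a minimal sketch of what "element-wise" means here, the snippet below computes one log loss value per sample on made-up labels and probabilities (the arrays are illustrative and not taken from the Titanic data). The mean of these per-sample values is the usual aggregated log loss.

```python
import numpy as np

# toy labels and predicted probabilities of the positive class (illustrative values)
y_true = np.array([1, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.4, 0.7])

# clip to avoid log(0), then compute one loss value per sample
eps = 1e-15
p = np.clip(p_pred, eps, 1 - eps)
per_sample_loss = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# the aggregated log loss is the mean of the per-sample values
mean_loss = per_sample_loss.mean()
```

Keeping one value per sample is what lets us later sum the errors over any subset of rows.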

In [1]:
# import useful modules
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# fetch titanic dataset
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X.drop(['cabin', 'boat', 'body', 'home.dest', 'name', 'ticket'], axis=1, inplace=True)

# define pipeline
cat_cols = X.select_dtypes("category").columns
num_cols = X.select_dtypes("number").columns

cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
])

num_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols),
    ])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=42))])

# training
clf.fit(X, y)

# predict on training data
y_proba = clf.predict_proba(X)[:, 1] # score of being a '1'

# compute element-wise log loss (the lower, the better)
eps = 1e-15
y_proba = np.clip(y_proba, eps, 1 - eps)
y = y.astype(int)

training_errors = - (y * np.log(y_proba) + (1 - y) * np.log(1 - y_proba))

## Model debugging with Sliceline

**Sliceline considers all the columns of the input dataset as categorical.**

So, to get more relevant slices, the following numerical features should be discretized:
- `age`
- `fare`

Left as-is, those columns would yield slices that are hard to exploit, because each distinct value would be treated as its own category. We would rather have ranges of values than single values in our slice definitions.

To discretize them and compute their bins, we use [OptBinning](http://gnpalencia.org/optbinning/), but feel free to experiment with other binning implementations.
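As one example of another binning approach, here is a rough sketch of plain quantile binning with NumPy on made-up values. Unlike the OptBinning call used in this notebook, it is unsupervised: it ignores the training errors and only looks at the feature distribution. The array and variable names are illustrative.

```python
import numpy as np

# toy numerical feature (illustrative values, not the Titanic data)
ages = np.array([2.0, 18.0, 22.0, 25.0, 31.0, 39.5, 47.0, 63.0])

# interior quartile edges: 3 cut points give 4 bins
edges = np.quantile(ages, [0.25, 0.5, 0.75])

# bin index (0 to 3) of each value; bins are closed on the left, open on the right
bin_ids = np.digitize(ages, edges)

# human-readable interval labels, similar in spirit to "[39.50, inf)"
bounds = np.concatenate(([-np.inf], edges, [np.inf]))
labels = np.array([f"[{bounds[i]:.2f}, {bounds[i + 1]:.2f})" for i in bin_ids])
```

Either the bin indices or the interval labels can then replace the raw column, since Sliceline only needs a categorical value per row.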

Sliceline configuration:
- `alpha = 0.95`: we are interested in small slices with a high log loss.
- `k = 1`: we want Sliceline to find the rules with the best score.
- `max_l = X_trans.shape[1]`: we want Sliceline to be able to use all of the input columns.
- `min_sup = 1`: because the input dataset is relatively small, we do not add a constraint on the minimal support.
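To make the role of `alpha` more concrete, here is a rough sketch of the slice score as defined in the SliceLine paper, to the best of our understanding: a weighted trade-off between how much worse the slice's mean error is than the dataset's mean error, and how small the slice is. The helper name `slice_score` and the numbers are ours, for illustration only; this is not the library's code.

```python
def slice_score(slice_error_sum, slice_size, average_error, n_rows, alpha):
    """Illustrative slice score following the SliceLine paper's definition."""
    # relative increase of the slice's mean error over the dataset's mean error
    error_term = (slice_error_sum / slice_size) / average_error - 1
    # penalty that grows as the slice gets smaller
    size_term = n_rows / slice_size - 1
    return alpha * error_term - (1 - alpha) * size_term


# toy comparison: same mean error (3x the average), different slice sizes
small = slice_score(slice_error_sum=30.0, slice_size=50,
                    average_error=0.2, n_rows=1000, alpha=0.95)
large = slice_score(slice_error_sum=120.0, slice_size=200,
                    average_error=0.2, n_rows=1000, alpha=0.95)
```

With `alpha` close to 1, the size penalty weighs little, so a small slice with a high error can still reach a good score.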

In [2]:
# import sliceline and binning class
from sliceline.slicefinder import Slicefinder
from optbinning import ContinuousOptimalBinning

# dataset before prediction
X_trans = pd.DataFrame(clf[0].transform(X), columns=clf[0].get_feature_names_out())

# `age` and `fare` have to be binned
columns_to_bin = ["num__age", "num__fare"]

optimal_binner = ContinuousOptimalBinning()

X_trans[columns_to_bin] = np.array(
    [
        optimal_binner.fit_transform(X_trans[col], training_errors, metric="bins") for col in columns_to_bin
    ]
).T

# fitting sliceline
sf = Slicefinder(
    alpha = 0.95,
    k = 1,
    max_l = X_trans.shape[1],
    min_sup = 1,
    verbose = True
)

sf.fit(X_trans, training_errors)

DEBUG:sliceline.slicefinder:Dropping 0/38 features below min_sup = 1.
DEBUG:sliceline.slicefinder:Initial top-K: count=1, max=0.413802, min=0.413802
DEBUG:sliceline.slicefinder:Level 2:
DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 38 -> 451
DEBUG:sliceline.slicefinder: -- valid slices after eval: 410/451
DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.565950, min=0.565950
DEBUG:sliceline.slicefinder:Level 3:
DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 451 -> 2282
DEBUG:sliceline.slicefinder: -- valid slices after eval: 2237/2282
DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.682603, min=0.682603
DEBUG:sliceline.slicefinder:Level 4:
DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 2282 -> 6396
DEBUG:sliceline.slicefinder: -- valid slices after eval: 6325/6396
DEBUG:sliceline.slicefinder: -- top-K

In [3]:
# slices found
pd.DataFrame(sf.top_slices_, columns=sf.feature_names_in_, index=sf.get_feature_names_out())

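| | `num__pclass` | `num__age` | `num__sibsp` | `num__parch` | `num__fare` | `cat__sex_female` | `cat__sex_male` | `cat__embarked_C` | `cat__embarked_Q` | `cat__embarked_S` |
|---|---|---|---|---|---|---|---|---|---|---|
| slice_0 | 3.0 | [39.50, inf) | None | 0.0 | None | None | None | None | 1.0 | None |
| slice_1 | 3.0 | [39.50, inf) | 0.0 | None | None | None | None | None | 1.0 | None |
| slice_2 | 3.0 | [39.50, inf) | None | 0.0 | None | None | None | None | 1.0 | 0.0 |
| slice_3 | 3.0 | [39.50, inf) | None | 0.0 | None | None | None | 0.0 | None | 0.0 |
| slice_4 | 3.0 | [39.50, inf) | None | 0.0 | None | None | None | 0.0 | 1.0 | None |
| slice_5 | 3.0 | [39.50, inf) | 0.0 | None | None | None | None | None | 1.0 | 0.0 |
| slice_6 | 3.0 | [39.50, inf) | 0.0 | None | None | None | None | 0.0 | None | 0.0 |
| slice_7 | 3.0 | [39.50, inf) | 0.0 | None | None | None | None | 0.0 | 1.0 | None |
| slice_8 | 3.0 | [39.50, inf) | 0.0 | 0.0 | None | None | None | None | 1.0 | None |
| slice_9 | 3.0 | [39.50, inf) | None | 0.0 | None | None | None | 0.0 | 1.0 | 0.0 |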


**Note:**

We found 15 slices with `k` set to 1. As described in the documentation, it means that those 15 slices have the same score.

**In fact, they target the same subset of data.**

_(`None` values refer to unused features in each slice.)_
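As a rough illustration of how one row of the table above translates into a row filter, the sketch below applies a slice to a small made-up array: every feature with a value must match, and every `None` feature is ignored. The array, the column order, and the variable names are illustrative assumptions.

```python
import numpy as np

# toy encoded dataset; columns are (pclass, parch, embarked_Q), values are illustrative
X_toy = np.array([
    [3.0, 0.0, 1.0],
    [3.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [3.0, 0.0, 1.0],
])

# a slice in the same layout as one row of the table: None means "feature not used"
toy_slice = [3.0, None, 1.0]

# start from "all rows" and keep only the rows matching every used feature
mask = np.ones(len(X_toy), dtype=bool)
for col, value in enumerate(toy_slice):
    if value is not None:
        mask &= X_toy[:, col] == value

# positions of the rows that belong to the slice
matching_rows = np.flatnonzero(mask)
```

Because unused features place no constraint on the rows, several slices with different definitions can end up selecting exactly the same rows.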

In [4]:
from sklearn.metrics import log_loss

# select one slice
slice_index = 0
current_slice = sf.top_slices_[slice_index]

# create a pandas filter
predicate_conditions = [X_trans[feature_name] == feature_value for feature_name, feature_value in zip(
    sf.feature_names_in_, current_slice) if feature_value is not None]
condition = " & ".join(
    [f"@predicate_conditions[{i}]" for i in range(len(predicate_conditions))]
)

# get slice element indices
indices = X_trans.query(condition).index

print("Model log loss on:")
print(f"- the full dataset ({X.shape[0]} passengers):", log_loss(y, y_proba))
print(f"- the selected slice ({len(indices)} passengers):", log_loss(y.iloc[indices], y_proba[indices]))

Model log loss on:
- the full dataset (1309 passengers): 0.14262975213717055
- the selected slice (38 passengers): 0.49969327000130753


## Conclusion

With Sliceline, we identified a subset of 38 passengers on which the model performs significantly worse. Those passengers:
- were in 3rd class (`num__pclass=3`),
- were 39.5 years old or more (`num__age='[39.50, inf)'`),
- had no parents or children aboard (`num__parch=0.0`),
- and embarked in Queenstown (`cat__embarked_Q=1`).

To improve the model, we should focus on reducing the error on those passengers.