# End-to-end Sliceline application

_____________________________
This demo notebook is split into two parts:

- **Machine Learning modelling**

This part implements a basic regressor on the [California housing dataset](https://www.openml.org/search?type=data&sort=runs&id=41211&status=active) to predict house values.
  
- **Model debugging with Sliceline**

This part uses [sliceline](https://github.com/DataDome/sliceline) to identify slices of the data where the training error of the model is significantly higher than average.

## Machine Learning modelling

We use a [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html) with default parameters (apart from a fixed `random_state`) as the regressor. The optimisation metric is the [Root Mean Square Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).

No preprocessing or feature engineering is applied in the pipeline. It is not the purpose of this demo notebook.

The training error is the element-wise squared error: one value per house, computed as the squared difference between the true and the predicted value.
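As a minimal sketch of what "element-wise" means here, the toy example below computes one squared error per sample and then aggregates it. The arrays `y_true` and `y_hat` are made-up values for illustration only; they are not taken from the California housing dataset.

```python
import numpy as np

# Toy targets and predictions (made-up values, for illustration only)
y_true = np.array([3.0, 1.5, 2.0, 4.0])
y_hat = np.array([2.5, 1.5, 3.0, 3.0])

# One error value per sample: this is the kind of vector Sliceline receives
errors = (y_true - y_hat) ** 2

# Averaging the per-sample errors gives the MSE; its square root is the RMSE
mse = errors.mean()
rmse = np.sqrt(mse)

print(errors, mse, rmse)
```

Keeping the errors per sample, rather than aggregating them straight away, is what makes it possible to average them over any subset of rows later on.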

In [1]:
# import useful modules
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor

pd.set_option("display.max_rows", None, "display.max_columns", None)

# fetch California housing dataset
X, y = fetch_california_housing(as_frame=True, return_X_y=True)

# define the model
model = HistGradientBoostingRegressor(random_state=42)

# training
model.fit(X, y)

# predict
y_pred = model.predict(X)

# compute element-wise squared error (the lower, the better)
training_errors = (y - y_pred)**2

## Model debugging with Sliceline

**Sliceline considers all the columns of the input dataset as categorical.**

So, to get more relevant slices, continuous features should be discretized.

Used as-is, continuous columns would yield slices that are hard to exploit, because each distinct value would be treated as its own category. Slices defined by ranges of values are more useful than slices defined by single values.

To discretize the features and compute their bins, we use [OptBinning](http://gnpalencia.org/optbinning/), but feel free to experiment with other binning implementations.
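As one possible alternative binning, the sketch below uses quantile binning with `pandas.qcut`. This is a swapped-in technique, not what this notebook uses: unlike the OptBinning call in the next cell, it is unsupervised and ignores the training errors. The `toy` DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous features (random data, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "feature_a": rng.normal(size=1000),
        "feature_b": rng.exponential(size=1000),
    }
)

# Quantile binning: 5 bins per column, each holding roughly the same number of rows.
# Intervals are cast to strings so every bin becomes a plain categorical label.
toy_binned = toy.apply(lambda col: pd.qcut(col, q=5).astype(str))

print(toy_binned.nunique())
```

Quantile bins guarantee balanced bin sizes, whereas a supervised binner chooses cut points based on the target it is given, so the two approaches may surface different slices.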

Sliceline configuration:
- `alpha = 0.95`: we are interested in small slices with a high squared error.
- `k = 1`: we want Sliceline to return only the slice with the best score.
- `max_l = X_trans.shape[1]`: we want Sliceline to be able to use all of the input columns.
- `min_sup = 1`: because the input dataset is relatively small, we do not add a constraint on the minimal support.
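To make the role of `alpha` concrete, here is a minimal sketch of the slice scoring function as described in the SliceLine paper (Sagadeeva and Boehm, 2021): a weighted trade-off between how much worse the slice's average error is and how small the slice is. The function name `slice_score` and its argument names are hypothetical and are not part of the sliceline API. The numbers plugged in are the ones reported in this notebook (20640 houses overall, 1756 houses in the selected slice, and the two mean errors printed in the last cell).

```python
def slice_score(slice_mean_error, overall_mean_error, slice_size, n_rows, alpha):
    """Sketch of the SliceLine score: reward high relative error, penalise small size."""
    # How much worse the slice is than the dataset average (0 means "same as average")
    error_term = slice_mean_error / overall_mean_error - 1
    # How small the slice is relative to the dataset (0 means "the whole dataset")
    size_term = n_rows / slice_size - 1
    return alpha * error_term - (1 - alpha) * size_term


# Figures reported in this notebook for the selected slice
score = slice_score(
    slice_mean_error=0.3706251665488993,
    overall_mean_error=0.1641313869461246,
    slice_size=1756,
    n_rows=20640,
    alpha=0.95,
)
print(round(score, 6))
```

With these inputs the score comes out at about 0.6575, in line with the top-K value shown in the Sliceline log. With `alpha` close to 1 the error term dominates, which is why small slices are tolerated in this configuration.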

In [2]:
# import sliceline and binning class
from sliceline.slicefinder import Slicefinder
from optbinning import ContinuousOptimalBinning

# Columns have to be binned
optimal_binner = ContinuousOptimalBinning(max_n_bins=5)

X_trans = pd.DataFrame(np.array(
    [
        optimal_binner.fit_transform(X[col], training_errors, metric="bins") for col in X.columns
    ]
).T, columns=X.columns)

# fitting sliceline
sf = Slicefinder(
    alpha = 0.95,
    k = 1,
    max_l = X_trans.shape[1],
    min_sup = 1,
    verbose = True
)

sf.fit(X_trans, training_errors)

DEBUG:sliceline.slicefinder:Dropping 0/40 features below min_sup = 1.


DEBUG:sliceline.slicefinder:Initial top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Level 2:


DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 40 -> 700


DEBUG:sliceline.slicefinder: -- valid slices after eval: 692/700


DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Level 3:


DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 692 -> 6590


DEBUG:sliceline.slicefinder: -- valid slices after eval: 6468/6590


DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Level 4:


DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 6468 -> 26328


DEBUG:sliceline.slicefinder: -- valid slices after eval: 24429/26328


DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Level 5:


DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 24429 -> 37712


DEBUG:sliceline.slicefinder: -- valid slices after eval: 35863/37712


DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Level 6:


DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 35863 -> 24242


DEBUG:sliceline.slicefinder: -- valid slices after eval: 23833/24242


DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Level 7:


DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 23833 -> 7727


DEBUG:sliceline.slicefinder: -- valid slices after eval: 7695/7727


DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Level 8:


DEBUG:sliceline.slicefinder: -- generated paired slice candidates: 7695 -> 1018


DEBUG:sliceline.slicefinder: -- valid slices after eval: 1018/1018


DEBUG:sliceline.slicefinder: -- top-K: count=1, max=0.657496, min=0.657496


DEBUG:sliceline.slicefinder:Terminated at level 8.


In [3]:
# slices found
pd.DataFrame(sf.top_slices_, columns=sf.feature_names_in_, index=sf.get_feature_names_out())

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| slice_0 | None | None | None | None | None | (-inf, 2.02) | None | None |


**Note:**

We found 1 slice with `k` set to 1.

_(`None` values refer to features that are not used in the slice definition.)_

In [4]:
from sklearn.metrics import mean_squared_error

# select one slice
slice_index = 0
current_slice = sf.top_slices_[slice_index]

# build one boolean condition per feature used in the slice
predicate_conditions = [
    X_trans[feature_name] == feature_value
    for feature_name, feature_value in zip(sf.feature_names_in_, current_slice)
    if feature_value is not None
]
condition = " & ".join(
    [f"@predicate_conditions[{i}]" for i in range(len(predicate_conditions))]
)

# get slice element indices
indices = X_trans.query(condition).index

# mean_squared_error returns the MSE (no square root is taken)
print("Model MSE on:")
print(f"- the full dataset ({X.shape[0]} houses):", mean_squared_error(y, y_pred))
print(f"- the selected slice ({len(indices)} houses):", mean_squared_error(y.iloc[indices], y_pred[indices]))

Model MSE on:
- the full dataset (20640 houses): 0.1641313869461246
- the selected slice (1756 houses): 0.3706251665488993


## Conclusion

With Sliceline, we identified a subset of 1756 houses on which the model performs significantly worse: their mean error is about 0.37, versus 0.16 on the full dataset. Those houses:
- have an average household occupancy below 2.02 members (`AveOccup='(-inf, 2.02)'`).

To improve the model, we should focus on reducing the error on those houses.