# Using uncertainty to monitor ML models

A crucial part of running a model in production is monitoring: how do you know when it is time to retrain, or when you can no longer trust its predictions?

*In this work, we use non-parametric bootstrapped uncertainty estimates and SHAP values to provide explainable uncertainty estimation as a technique that aims to monitor the deterioration of machine learning models in deployment environments, as well as determine the source of model deterioration when target labels are not available.*
*Classical methods are purely aimed at detecting distribution shift, which can lead to false positives in the sense that the model has not deteriorated despite a shift in the data distribution.*

The basic idea is to estimate and combine several sources of variation in our ML predictions, and create prediction intervals from the combination of all these sources.

Firstly, we want to estimate how much the model depends on the specific samples in our training set. We achieve this by repeatedly resampling the dataset, fitting the model on each resample in turn, and measuring how much the predictions of the resulting models differ. Secondly, our model might have an inherent bias, meaning that it will never be able to fully approximate the underlying data distribution, no matter how much data we throw at it. Lastly, we want to estimate the noise inherent in the data, which again remains no matter how much data we throw at the model.

We can approximate all of these sources of variation using bootstrapping, and combine them into accurate prediction intervals.
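To make the bootstrapping idea concrete, here is a minimal NumPy sketch. It is a simplified illustration, not the procedure that the `doubt` library implements: the spread of predictions across bootstrap fits stands in for the dependence on the training sample, and out-of-bag residuals stand in for bias and noise together. The helper names (`fit_linear`, `predict_linear`, `bootstrap_interval`) and the toy data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): y depends linearly on x, plus noise
n = 500
X = rng.normal(size=(n, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the weight vector."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict_linear(w, X):
    """Predictions of the linear model with weights `w`."""
    return np.column_stack([np.ones(len(X)), X]) @ w


def bootstrap_interval(X, y, X_new, n_boots=200, alpha=0.05, rng=rng):
    """Point predictions and (1 - alpha) prediction intervals for X_new."""
    n = len(X)
    boot_preds = np.empty((n_boots, len(X_new)))
    residuals = []
    for b in range(n_boots):
        idx = rng.integers(0, n, size=n)        # sample rows with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # rows left out of this resample
        w = fit_linear(X[idx], y[idx])
        boot_preds[b] = predict_linear(w, X_new)
        # Out-of-bag residuals reflect both model bias and data noise
        residuals.append(y[oob] - predict_linear(w, X[oob]))
    residuals = np.concatenate(residuals)

    center = boot_preds.mean(axis=0)
    # Deviation of each bootstrap model from the average prediction
    model_dev = boot_preds - center
    # Combine model variation with randomly drawn out-of-bag residuals
    total = model_dev + rng.choice(residuals, size=model_dev.shape)
    lower = center + np.quantile(total, alpha / 2, axis=0)
    upper = center + np.quantile(total, 1 - alpha / 2, axis=0)
    return center, np.stack([lower, upper], axis=1)


X_new = rng.normal(size=(200, 1))
preds, intervals = bootstrap_interval(X, y, X_new)
widths = intervals[:, 1] - intervals[:, 0]
print("mean interval width:", widths.mean())
```

The interval width per sample is the quantity used as the uncertainty value: wider intervals signal predictions the model is less sure about.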

## Detecting the source of deterioration

It is one thing to detect that a model is deteriorating, but sometimes we also want to know **how** it is deteriorating. The deterioration could be due to a shift in one or more variables, and knowing which variables are responsible can be useful in its own right.

To account for the reasons behind model deterioration, we fitted a separate model on the uncertainty values (the inputs of this model are the shifted feature values, and the outputs are the estimated uncertainties). We then computed SHAP values for this separate model, which show which features of the dataset contribute the most to an increased uncertainty.
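The following is a minimal sketch of that explanation step, assuming the separate model is linear. For a linear model whose features are treated as independent, the SHAP value of feature `j` on a given row is the coefficient times the feature's deviation from its mean, so it can be computed directly with NumPy. The data and the names `X_shifted` and `uncertainty` are hypothetical stand-ins for the shifted features and the interval widths; a non-linear model would need the `shap` package instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: shifted features and one uncertainty value per row.
# The uncertainty is constructed to grow with feature 0 only.
n = 1000
X_shifted = rng.normal(size=(n, 3))
uncertainty = 1.0 + 0.8 * X_shifted[:, 0] + rng.normal(scale=0.05, size=n)

# Fit the separate model: uncertainty ~ intercept + X_shifted @ w
A = np.column_stack([np.ones(n), X_shifted])
coef, *_ = np.linalg.lstsq(A, uncertainty, rcond=None)
intercept, w = coef[0], coef[1:]

# Linear SHAP values (independent-feature assumption):
# contribution of feature j on row i is w[j] * (x[i, j] - mean of feature j)
shap_values = w * (X_shifted - X_shifted.mean(axis=0))

# Global importance per feature: mean absolute SHAP value
importance = np.abs(shap_values).mean(axis=0)
print("feature importances:", importance)
```

Ranking features by `importance` points at the variables that drive the increase in uncertainty; in this toy setup that should be the first feature.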

To test this approach and compare it to competing methods, we took one of our datasets and shifted the two features that are most correlated with the target variable, **GrLivArea** and **TotalBsmtSF**. We also introduced a random variable and shifted that as well. We thus want the method to identify the first two features but disregard the random one, since shifting it does not affect model performance at all. Here are the results:

![](https://saattrupdan.github.io/img/uncertainty-shap.png)

```python
from sklearn.linear_model import LinearRegression
from doubt import Boot
import numpy as np

# Generate normally distributed random data
x1 = np.random.normal(1, 0.1, size=10000)
x2 = np.random.normal(1, 0.1, size=10000)
x3 = np.random.normal(1, 0.1, size=10000)

# Create a synthetic dataset with the random data, of shape (10000, 3)
X = np.array([x1, x2, x3]).T

# Create out-of-distribution data by shifting the first feature by 5
X_ood = np.array([x1 + 5, x2, x3]).T

# Create the target variable, which depends non-linearly on `x1`,
# linearly on `x2`, and does not depend on `x3` at all
y = x1 ** 2 + x2 + np.random.normal(0, 0.01, 10000)

# Create linear regression model with uncertainty estimation support,
# using our `Boot` wrapper class
clf = Boot(LinearRegression())

# Fit the model to the data
clf = clf.fit(X, y)
```

```python
# Compute predictions along with prediction intervals on the out-of-distribution data
preds, intervals = clf.predict(X_ood, uncertainty=0.05)

# Compute the uncertainty, being the width of the prediction intervals
unc = intervals[:, 1] - intervals[:, 0]

# To explain where the uncertainty comes from, we fit a new linear regression model
# on the out-of-distribution data, which attempts to predict the uncertainties
m = LinearRegression().fit(X_ood, unc)

# Print the coefficients of the second model, which correspond to the SHAP values.
# After rounding, the only non-zero coefficient belongs to the shifted first feature
np.round(m.coef_, decimals=2)
```

```
array([ 0.01, -0.  , -0.  ])
```