In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import phik
import numpy as np
import pandas as pd
import scipy
import scipy.stats as sts
import scipy.special as ssp
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_recall_curve, precision_recall_fscore_support

In [None]:
import mmu
from mmur import ModelGenerator

In [None]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['text.color'] = 'black'
plt.rcParams['figure.max_open_warning'] = 0
COLORS = [i['color'] for i in plt.rcParams['axes.prop_cycle']]

In [None]:
def plot_probas(probas, ground_truth, probas_alt=None, fig=None, axs=None):
    if axs is None:
        fig, axs = plt.subplots(figsize=(14, 7), nrows=1, ncols=2)

    for i in range(probas.shape[1]):
        axs[0].plot(np.sort(probas[:, i]), c='grey', alpha=0.5)
        axs[1].plot(np.sort(ground_truth['proba']), np.sort(probas[:, i]), c='grey', alpha=0.5)
    if probas_alt is not None:
        for i in range(probas_alt.shape[1]):
            axs[0].plot(np.sort(probas_alt[:, i]), c=COLORS[2], alpha=0.5)
            axs[1].plot(np.sort(ground_truth['proba']), np.sort(probas_alt[:, i]), c=COLORS[2], alpha=0.5)
            
    axs[0].plot(np.sort(ground_truth['proba']), c='red', ls='--', lw=2, zorder=10, label='True model')
    axs[0].set_title('Probabilities -- model draws', fontsize=18)
    axs[0].set_ylabel('proba', fontsize=18)
    axs[0].set_xlabel('sorted observations', fontsize=18)
    axs[0].tick_params(labelsize=16)
    axs[0].legend(fontsize=18)
    axs[1].plot(ground_truth['proba'], ground_truth['proba'], c='red', ls='--', lw=2, zorder=10, label='True model')
    axs[1].set_title('model draws -- Q-Q ', fontsize=18)
    # x holds the sorted ground-truth probabilities, y the sorted model draws
    axs[1].set_ylabel('proba -- draws', fontsize=18)
    axs[1].set_xlabel('proba -- ground truth', fontsize=18)
    axs[1].tick_params(labelsize=16)
    axs[1].legend(fontsize=18)
    
    if fig is not None:
        fig.tight_layout()
        return fig, axs
    return axs

# Model Metric Uncertainty

## Ralph Urlus

#### What is model metric uncertainty?
* Who has ever created a credible or confidence interval on their metrics?
* Did you use cross-validation?
* Did you have enough statistics to be confident in your CI?
* Did you have an unbiased CI?
    

#### Model performance is stochastic and depends on multiple sources of uncertainty

We have well-defined uncertainties for most statistical models, so why not for ML models?

### The setting

* Binary classification problem

* Non-symmetrical costs for errors

* Utility function over sensitivity (recall on the positive class) and specificity (recall on the negative class)

* The utility function can be optimized for any model that outputs a probability using the classification threshold.
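To make the setting concrete, here is a minimal sketch of optimising a utility over sensitivity and specificity by sweeping the classification threshold. The linear form of the utility, the weight `w_sensitivity` and the toy scores are assumptions for illustration only; the slides do not specify the utility function.

```python
import numpy as np


def utility_curve(y_true, proba, thresholds, w_sensitivity=0.7):
    """Weighted sum of sensitivity and specificity for each threshold.

    The linear utility and its weight are hypothetical choices.
    """
    y_true = np.asarray(y_true, dtype=bool)
    proba = np.asarray(proba, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # predictions per threshold: shape (n_thresholds, n_observations)
    y_pred = proba[None, :] >= thresholds[:, None]
    tp = (y_pred & y_true).sum(axis=1)
    tn = (~y_pred & ~y_true).sum(axis=1)
    sensitivity = tp / max(y_true.sum(), 1)
    specificity = tn / max((~y_true).sum(), 1)
    return w_sensitivity * sensitivity + (1.0 - w_sensitivity) * specificity


rng = np.random.default_rng(0)
y_true = rng.random(2000) < 0.3
# toy scores: positives score higher on average
proba = np.clip(rng.normal(0.35 + 0.3 * y_true, 0.15), 0.0, 1.0)

thresholds = np.linspace(0.01, 0.99, 99)
utility = utility_curve(y_true, proba, thresholds)
best_threshold = thresholds[np.argmax(utility)]
```

The resulting `best_threshold` is a point estimate for a single test set; it carries no statement about how stable that optimum is.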

### Problem statement

Determine the optimal classification threshold that maximises the utility function over a pair of classification metrics considering their simultaneous uncertainty.

#### Let's introduce a bit of formalism

<sub><sup>I am sorry Fari</sup></sub>

Assume we have a classification problem with feature set $X \subset \mathbb{R}^{N \times K}$ and labels $y = \{y_{i} \in \{0, 1\} \mid 1 \leq  i \leq N\}$.

Let $T_{m} \subset X$ be the train set and $Z_{m} \subset X$ be the test set for run $ m \in \mathbb{M};~\mathbb{M} = \{1, \ldots, M\}$.

Additionally let $T_{m} \cap Z_{m} = \emptyset~\forall~m \in \mathbb{M}$ and 

$T_{i} \cap Z_{j} \not\equiv \emptyset~\forall~i, j \in \mathbb{M}$

where $a \not\equiv b$ denotes $a$ _is not necessarily equal to_ $b$.

Let $f_{m}:\mathbf{x} \to [0, 1]$ represent one of the $M$ model instances trained on $T_{m}$ with hyper-parameters $\Theta_{m}$ and evaluated on $Z_{m}$.

Assume that $f$ is not deterministic given the same training data

$f_{i}(T_{m}, \Theta_{m}) \not\equiv f_{j}(T_{m}, \Theta_{m})~\forall~i, j \in \mathbb{M}$.

Assume that not all observations are equivalently easy to predict

$\exists~i, j \in \{1, \ldots, N\}~|~i \neq j \text{ s.t. } P\left(y_{i} = f_{m}(X_{i})\right) > P\left(y_{j} = f_{m}(X_{j})\right)$.

We observe an estimate of the population metric $\phi$:
\begin{equation*}
    \hat{\phi} = \phi + \epsilon = \phi + \epsilon_{X} + \epsilon_{f} = \phi + \epsilon_{T} + \epsilon_{Z} + \epsilon_{f}
\end{equation*}
where $\epsilon_{X}$ represents the error induced by the data sample, $\epsilon_{f}$ the error due to non-deterministic behaviour during training, and $\epsilon_{T},~\epsilon_{Z}$ are the subcomponents of the data-driven uncertainty attributable to the training and test sets.

### Uncertainties

What uncertainties did we just include?

### Ilan's list

1. Sampling variation
2. Measurement noise
3. Model misspecification
4. Overt overfitting
5. Covert overfitting
6. Data leakage
7. Pseudo-random number generation
8. Dataset shift
9. Optimisation ambiguities
10. Identifiability

1. Sampling variation -- Yes
2. Measurement noise -- Yes
3. Model misspecification -- No
4. Overt overfitting -- Partially
5. Covert overfitting -- No
6. Data leakage -- Partially
7. Pseudo-random number generation -- Partially
8. Dataset shift -- Partially
9. Optimisation ambiguities -- Partially
10. Identifiability -- Yes

Let's assume we don't screw up too often, so we exclude mistakes.

### My list


1. Assume that not all observations are equivalently easy to predict
    * Sampling variation
    * Overt overfitting
    * Dataset shift
2. Assume that $f$ is not deterministic given the same training data
    * Overt overfitting
    * Optimisation ambiguities
    * Identifiability
3. There can be overlap between the test and training sets between runs
    * Data leakage


#### How do these uncertainties manifest themselves?

In practice these uncertainties cannot be cleanly separated from one another.

An analytical description of these joint uncertainties is unlikely to exist.

For example, a Gaussian error on the linear predictor $\theta = \alpha + \beta_{0} X_{0} + \ldots$ before the logistic function in logistic regression results in a logit-normal distributed error in probability space, and the moments of the logit-normal distribution have no closed-form expression.
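The logit-normal example can be checked numerically. The sketch below pushes Gaussian noise through the logistic function and estimates the first two moments by Monte Carlo; the values of `theta` and `sigma` are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(12345)

theta = 1.0  # hypothetical value of the linear predictor
sigma = 1.5  # hypothetical standard deviation of the Gaussian error

# Gaussian error before the logistic function ...
noisy_theta = theta + rng.normal(0.0, sigma, size=200_000)
# ... becomes logit-normal error in probability space
noisy_proba = expit(noisy_theta)

# the moments are only available numerically
mc_mean = noisy_proba.mean()
mc_std = noisy_proba.std()

# plugging the noise-free predictor into the logistic function
# does not give the mean of the noisy probabilities
naive_mean = expit(theta)
```

Comparing `mc_mean` with `naive_mean` shows that the noise does not simply average out once it has passed through the non-linear link.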


### Simulation it is...

So how do we simulate all the uncertainties involved?

What if we can simulate classifiers where we can turn on and off various sources of uncertainty? 

#### ModelGenerator

We developed a model generator that can incorporate and isolate these sources:

1. Sampling uncertainty

    * number of data points
    * split between train and test (sample noise)
    * class imbalance
    * sub-class imbalance (cluster imbalances)
   
2. Measurement noise
    
    * noise over X (cluster noise)
    * label noise (label noise)
    
3. Non-deterministic training (model noise)

The generator is an extension of sklearn's `make_classification`.

Generate `n_clusters_per_class` clusters per class, positioned on the vertices of a hypercube, and assign each cluster a class label.
For binary classification and 2 clusters per class we have 4 clusters.

1. For each cluster generate a hypersphere of dimension `n_features` from standard normals.
2. Generate a random covariance matrix per cluster over the features
3. Shift the hyperspheres to the centroids
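The three steps above can be sketched with plain NumPy. This is an illustration of the idea only, not the `ModelGenerator` implementation; the helper `sample_cluster`, the choice of vertices and `class_sep` are assumptions made for the example.

```python
import numpy as np


def sample_cluster(centroid, n_points, rng):
    """Draw one cluster around `centroid` (hypothetical helper)."""
    n_features = centroid.shape[0]
    # 1. hypersphere of dimension n_features from standard normals
    points = rng.standard_normal((n_points, n_features))
    # 2. random linear map A, giving the cluster covariance A.T @ A
    A = rng.uniform(-1.0, 1.0, size=(n_features, n_features))
    points = points @ A
    # 3. shift the cluster to its centroid
    return points + centroid


rng = np.random.default_rng(12345)
class_sep = 2.0  # assumed half-length of the hypercube sides

# four distinct vertices of a 3-dimensional hypercube,
# two clusters per class for a binary problem
vertices = class_sep * np.array(
    [[-1, -1, -1], [1, 1, -1], [1, -1, 1], [-1, 1, 1]], dtype=float
)
cluster_labels = np.array([0, 0, 1, 1])

n_per_cluster = 500
X = np.vstack([sample_cluster(v, n_per_cluster, rng) for v in vertices])
y = np.repeat(cluster_labels, n_per_cluster)
```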

#### Ground truth

We need a deterministic base model:

1. Sample 250K samples from X and y that are noise free
2. Fit Logistic regression on X, y
3. Store coefficients

This model should be largely free of noise.
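A rough sketch of such a base model, using sklearn's `make_classification` without label noise as a stand-in for the noise-free sample. The number of features and the other generator settings are assumptions, not the values used by `ModelGenerator`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 1. sample 250K noise-free observations (flip_y=0.0 disables label noise)
X, y = make_classification(
    n_samples=250_000,
    n_features=10,      # assumed dimensionality
    n_informative=10,
    n_redundant=0,
    n_clusters_per_class=2,
    flip_y=0.0,
    random_state=12345,
)

# 2. fit logistic regression on X, y
base_model = LogisticRegression(max_iter=1000).fit(X, y)

# 3. store the coefficients as the ground truth
ground_truth_coefs = {
    'intercept': base_model.intercept_.copy(),
    'coef': base_model.coef_.copy(),
}
```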

#### What does it look like?

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000
)

#### Output 

In [None]:
ground_truth

In [None]:
probas[:5, :3]

In [None]:
_ = plot_probas(probas, ground_truth) 

#### Without the uncertainties

What does it look like if we disable all uncertainties in the generator?

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    enable_cluster_imbalances=False,
    enable_cluster_noise=False,
    enable_sample_noise=False,
    enable_label_noise=False,
    enable_model_noise=False,
)

In [None]:
_ = plot_probas(probas, ground_truth)

## Sample noise 

Sample noise is defined here as the effect of having drawn a different train-test split.
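One way to picture this: every model draw gets its own random train-test split of the same observations. The sketch below is an assumption about how such a mask could be built and is not taken from `ModelGenerator`; `train_frac` is a hypothetical parameter.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_samples, n_models = 5000, 100
train_frac = 0.7  # hypothetical train fraction
n_train = int(train_frac * n_samples)

# one boolean column per model draw: True marks a training observation
train_mask = np.zeros((n_samples, n_models), dtype=bool)
for m in range(n_models):
    train_idx = rng.choice(n_samples, size=n_train, replace=False)
    train_mask[train_idx, m] = True

# an observation ends up in the test set of some runs and the train set of others
test_share_per_obs = 1.0 - train_mask.mean(axis=1)
```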

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    enable_sample_noise=True,
    enable_cluster_imbalances=False,
    enable_cluster_noise=False,
    enable_label_noise=False,
    enable_model_noise=False,
)

In [None]:
_ = plot_probas(probas, ground_truth)

#### Cluster imbalances

Cluster imbalances are introduced by creating clusters of uneven size, with the cluster weights sampled from a Dirichlet distribution.
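A small sketch of the Dirichlet step. It assumes that `alpha_weights` plays the role of a symmetric concentration parameter, which is a guess based on its name and not confirmed by the slides.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_clusters, n_samples = 4, 5000
alpha = 10.0  # assumed to correspond to `alpha_weights`

# cluster weights on the simplex; a larger alpha gives more even weights
weights = rng.dirichlet(np.full(n_clusters, alpha))

# translate the weights into cluster sizes that sum to n_samples
cluster_sizes = rng.multinomial(n_samples, weights)
```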

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=True,
    enable_cluster_noise=False,
    enable_label_noise=False,
    enable_model_noise=False,
)

In [None]:
_ = plot_probas(probas, ground_truth)

#### Cluster noise

Scale and shift clusters of X to simulate certain subclasses/observations that are harder to predict than others.

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=False,
    enable_cluster_noise=True,
    enable_label_noise=False,
    enable_model_noise=False,
)

probas_alt = probas.copy()

In [None]:
_ = plot_probas(probas, ground_truth)

#### Cluster noise and imbalances

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=True,
    enable_cluster_noise=True,
    enable_label_noise=False,
    enable_model_noise=False,
)

In [None]:
_ = plot_probas(probas, ground_truth, probas_alt=probas_alt)

#### Label Noise

The probability of a label being flipped is 1%
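As a minimal illustration of label noise, the sketch below flips each binary label independently with a given probability. The helper `flip_labels` is hypothetical and only mirrors the idea behind the `label_flip` setting.

```python
import numpy as np


def flip_labels(y, flip_proba, rng):
    """Flip each binary label independently with probability `flip_proba`."""
    y = np.asarray(y).astype(bool)
    flip = rng.random(y.shape[0]) < flip_proba
    return np.where(flip, ~y, y).astype(int)


rng = np.random.default_rng(12345)
y = rng.integers(0, 2, size=5000)
y_noisy = flip_labels(y, flip_proba=0.01, rng=rng)

# realised share of flipped labels, close to the flip probability
flipped_share = (y_noisy != y).mean()
```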

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=False,
    enable_cluster_noise=False,
    enable_label_noise=True,
    enable_model_noise=False,
)

probas_alt = probas.copy()

In [None]:
_ = plot_probas(probas, ground_truth)

10% label flip probability

In [None]:
generator = ModelGenerator(label_flip=0.1, random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=False,
    enable_cluster_noise=False,
    enable_label_noise=True,
    enable_model_noise=False,
)

In [None]:
_ = plot_probas(probas, ground_truth, probas_alt)

#### Model noise

Apply a rotation and a shift to the probabilities.

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=False,
    enable_cluster_noise=False,
    enable_label_noise=False,
    enable_model_noise=True,
)

In [None]:
_ = plot_probas(probas, ground_truth)

Model noise -- no rotation

In [None]:
generator = ModelGenerator(
    model_rotation=0.0,
    random_state=12345
)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=False,
    enable_cluster_noise=False,
    enable_label_noise=False,
    enable_model_noise=True,
)

In [None]:
_ = plot_probas(probas, ground_truth)

Model noise -- no shift

In [None]:
generator = ModelGenerator(
    model_shift=0.0,
    random_state=12345
)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=False,
    enable_cluster_imbalances=False,
    enable_cluster_noise=False,
    enable_label_noise=False,
    enable_model_noise=True,
)

In [None]:
_ = plot_probas(probas, ground_truth)

#### Data-driven uncertainty

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=True,
    enable_cluster_imbalances=True,
    enable_cluster_noise=True,
    enable_label_noise=False,
    enable_model_noise=False,
)

In [None]:
_ = plot_probas(probas, ground_truth)

#### All sources of uncertainty combined

In [None]:
generator = ModelGenerator(random_state=12345)
fit = generator.fit()
train_mask, labels, probas, X, models, ground_truth = fit.transform(
    n_models=100,
    n_samples=5000,
    alpha_weights=10,
    enable_sample_noise=True,
    enable_cluster_imbalances=True,
    enable_cluster_noise=True,
    enable_label_noise=True,
    enable_model_noise=True,
)

In [None]:
_ = plot_probas(probas, ground_truth)

## What is next?

##### Experiment week

We want to explore different modelling approaches, e.g.
* Beta-Binomial
* Dirichlet-Multinomial
* Bootstrapping
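As a starting point for the bootstrapping option, here is a sketch that resamples a single test set to get a joint sample of precision and recall. The toy labels and predictions are made up; note that this only captures the test-set component $\epsilon_{Z}$, not the training-set or model noise.

```python
import numpy as np


def bootstrap_precision_recall(y_true, y_pred, n_boot=1000, seed=0):
    """Bootstrap draws of (precision, recall) for the positive class."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    n = y_true.shape[0]
    draws = np.empty((n_boot, 2))
    for b in range(n_boot):
        # resample observations with replacement
        idx = rng.integers(0, n, size=n)
        yt, yp = y_true[idx], y_pred[idx]
        tp = np.sum(yt & yp)
        draws[b, 0] = tp / max(np.sum(yp), 1)  # precision
        draws[b, 1] = tp / max(np.sum(yt), 1)  # recall
    return draws


rng = np.random.default_rng(1)
y_true = rng.random(1000) < 0.3
# toy classifier that agrees with the label about 85% of the time
y_pred = np.where(rng.random(1000) < 0.85, y_true, ~y_true)

draws = bootstrap_precision_recall(y_true, y_pred)
# marginal 95% percentile intervals; rows are bounds, columns are metrics
ci = np.percentile(draws, [2.5, 97.5], axis=0)
```

Because each draw yields both metrics from the same resample, `draws` also carries their correlation, which is what a simultaneous uncertainty needs.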

Model the Precision-Recall curve with a Gaussian Process:
* with uncertainty on X
* with heterogeneous errors

...

##### Data science 'hobby project'

We want to develop a package that becomes the internal standard for model evaluation.

We would be happy if you joined the efforts.

Thus far we have:
* the draft simulation engine
* fast confusion matrix & computation of binary classification metrics
    * 0 - neg.precision aka Negative Predictive Value
    * 1 - pos.precision aka Positive Predictive Value
    * 2 - neg.recall aka True Negative Rate aka Specificity
    * 3 - pos.recall aka True Positive Rate aka Sensitivity
    * 4 - neg.f1 score
    * 5 - pos.f1 score
    * 6 - False Positive Rate
    * 7 - False Negative Rate
    * 8 - Accuracy
    * 9 - MCC
    
    * Plus we are 700 times faster than sklearn while computing four more metrics
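For reference, the ten metrics listed above can all be derived from the four confusion-matrix counts. The function below is a plain NumPy sketch in the same index order; it is not the package's implementation or API, and returning `0.0` for undefined ratios is an assumption.

```python
import numpy as np


def binary_metrics(tn, fp, fn, tp):
    """Return the ten metrics in the order of the list above."""
    tn, fp, fn, tp = (float(v) for v in (tn, fp, fn, tp))

    def ratio(num, den):
        # assumed convention: undefined ratios are reported as 0.0
        return num / den if den > 0 else 0.0

    neg_prec = ratio(tn, tn + fn)  # 0 - Negative Predictive Value
    pos_prec = ratio(tp, tp + fp)  # 1 - Positive Predictive Value
    neg_rec = ratio(tn, tn + fp)   # 2 - Specificity
    pos_rec = ratio(tp, tp + fn)   # 3 - Sensitivity
    neg_f1 = ratio(2 * neg_prec * neg_rec, neg_prec + neg_rec)  # 4
    pos_f1 = ratio(2 * pos_prec * pos_rec, pos_prec + pos_rec)  # 5
    fpr = ratio(fp, fp + tn)       # 6 - False Positive Rate
    fnr = ratio(fn, fn + tp)       # 7 - False Negative Rate
    acc = ratio(tp + tn, tp + tn + fp + fn)  # 8 - Accuracy
    mcc_den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)  # 9 - MCC
    return np.array(
        [neg_prec, pos_prec, neg_rec, pos_rec, neg_f1, pos_f1, fpr, fnr, acc, mcc]
    )


metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
```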

##### Get a master student

There is relatively little literature on the topic.

Most literature focusses on distinguishing the better model from a small set.

But how do we get well defined uncertainties in the setting we discussed above?

If we can come up with a good approach, we will try to publish it.

# Up next:

## A Beta-Binomial model for a confusion matrix

### Ilan Fridman Rojas