# A Primer on Prior Networks

Harvard Fall 2019 Applied Math 207 Term Project

Authors (alphabetically ordered): Simon Batzner, Theo Guenais, Rylan Schaeffer, Dimitris Vamvourellis

Paper: [Predictive Uncertainty Estimation via Prior Networks](https://papers.nips.cc/paper/7936-predictive-uncertainty-estimation-via-prior-networks.pdf)


## Section 1: Problem Statement

## Section 2: Context/Scope

## Section 3: Existing Work

## Section 4: Technical Content

## Section 5: Experiments

In [2]:
# import packages
import numpy as np
import torch
from matplotlib import pyplot as plt
%matplotlib inline
from utils import data, measures, models, plot, run

### Ordinarily Trained Discriminative Classifiers Have No Distributional Uncertainty

The typical lifecycle of a machine learning model consists of splitting one's data into a training dataset and a test dataset, training the model on the training dataset, and then evaluating it on the test dataset. Once trained, the model can be used for inference on new input data. However, this naive approach can be dangerous: given new input data sampled from a distribution drastically different from the training distribution, the model might make confident but incorrect predictions.

To demonstrate this, suppose a medical study finds that breast cancer patients were cured by one of three treatment plans, and we're recruited to train a model that recommends a treatment plan for a new patient. For simplicity, we'll assume that each patient has two features of interest (e.g. age, BMI) that are predictive of the best treatment for that patient. The dataset might look like the following.

In [3]:
train_data = data.create_data(
        create_data_functions=[
            data.create_data_mixture_of_gaussians,
        ],
        functions_args=[
            data.mog_three_in_distribution
        ])

In [4]:
labels_np = train_data['targets'].numpy()
samples_np = train_data['samples'].numpy()
train_labels_idx = np.where(labels_np != 3)[0]
x_train_np = samples_np[train_labels_idx]
y_train_np = labels_np[train_labels_idx]

labels_names = ['Cured by Treatment 1', 'Cured by Treatment 2', 'Cured by Treatment 3', 'New Patient Data']

In [5]:
plot.plot_training_data(
    samples=train_data['samples'].numpy(),
    labels=train_data['targets'].numpy(),
    labels_names=labels_names,
    plot_title='Training Data',
    xaxis=dict(title='Patient Feature 1 (e.g. age)'),
    yaxis=dict(title='Patient Feature 2 (e.g. BMI)')
)

This looks like a solvable problem! We decide to model these three clusters using a categorical distribution and we train a model to predict the correct class.

In [8]:
# create the model, optimizer, training data
model = run.create_model(in_dim=2, out_dim=3, n_per_hidden_layer=[50], args={})
optimizer = run.create_optimizer(model=model, args={'lr': 0.001})
loss_fn = run.create_loss_fn(loss_fn_str='kl', args={})

In [9]:
# fit the model
model, optimizer, training_loss = run.train_model(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_data=train_data,
    args={},
    n_epochs=1000,
    batch_size=32)

In [10]:
# plot the training loss
plot.plot_training_loss(training_loss=training_loss)

Looks good! But now, suppose that a medical hospital sends us new patient data that looks unlike anything we've previously seen. What should the model recommend?

In [11]:
new_data = data.create_data(
    create_data_functions=[data.create_data_mixture_of_gaussians,],
    functions_args=[data.mog_three_in_distribution_one_out])
plot.plot_training_data(
    samples=new_data['samples'].numpy(),
    labels=new_data['targets'].numpy(),
    labels_names=labels_names,
    plot_title='Training Data and New Patient Data',
    xaxis=dict(title='Patient Feature 1 (e.g. age)'),
    yaxis=dict(title='Patient Feature 2 (e.g. BMI)')
)

Because this new patient data looks nothing like the training data, we would want the model to be highly uncertain. However, when we ask the model to predict the appropriate treatment for the new patients, as the below code demonstrates, the model is actually strongly confident that the correct solution is to prescribe Treatment 2.

In [12]:
new_patient_data_indices = new_data['targets'] == 3
new_patient_samples = new_data['samples'][new_patient_data_indices]
new_patient_model_output = model(new_patient_samples)
print(np.round(new_patient_model_output['y_pred'].detach().numpy(), 3))

[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 ...
(output truncated in the source; every displayed row is [0. 1. 0.])

To understand why the model assigns the new patient data to the closest training class, we visualize the decision surface of the model by calculating the entropy of the predicted probability vector, which measures the total uncertainty in the model's predictions. We overlay the three training classes and the new patient data on the surface. As we can see, the model makes very confident predictions (very low entropy) even far away from the training data. It makes uncertain predictions (high entropy) only along the boundaries between the classes and in the middle of the data. Hence, the model does not account for distributional uncertainty. In other words, if the model is given test data that are not drawn from the same distribution as the training data, it will still make very confident predictions (a low-entropy probability vector), which may well be utterly wrong.
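As a minimal sketch of the entropy measure discussed here (not the implementation in `utils.measures`, which is not shown; the function name `categorical_entropy` is hypothetical), the entropy of a probability vector $p$ is $H[p] = -\sum_c p_c \log p_c$. It is zero for a one-hot prediction and reaches its maximum, $\log K$, for the uniform distribution over $K$ classes.

```python
import numpy as np

def categorical_entropy(probs, eps=1e-12):
    """Entropy (in nats) of each row of an (N, K) array of class probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    # Clip before the log so that zero probabilities do not produce -inf or NaN.
    return -np.sum(probs * np.log(np.clip(probs, eps, 1.0)), axis=-1)

confident = np.array([[0.0, 1.0, 0.0]])        # like the new-patient predictions
uniform = np.array([[1 / 3, 1 / 3, 1 / 3]])    # maximally uncertain over 3 classes

entropy_confident = categorical_entropy(confident)[0]
entropy_uniform = categorical_entropy(uniform)[0]
print(entropy_confident, entropy_uniform, np.log(3))
```

With three classes, the surface plotted in this section can therefore range from 0 up to $\log 3 \approx 1.10$ nats.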

In [13]:
#please rerun this to create interactive 3D plot
plot.plot_decision_surface(model=model,
                           samples=new_data['samples'],
                           labels=new_data['targets'],
                           labels_names=labels_names,
                           z_fn=measures.entropy_categorical,
                           x_axis_title='Patient Feature 1 (e.g. age)',
                           y_axis_title='Patient Feature 2 (e.g. BMI)',
                           z_axis_title='Predicted Class Entropy'
                          )

### Modified Loss Function/Training Curriculum Gives Classifier Distributional Uncertainty

Below, we fit a Dirichlet Prior Network. According to the methodology described in the paper, training this model requires targets for both in-distribution and out-of-distribution data. We therefore use the same training data as before as the in-distribution data, for which we know the labels. The labels of the new patient data are unknown and these points lie far from the training data, so we use them as the out-of-distribution data. We then visualize the model's behaviour in both the in-distribution and out-of-distribution regions.
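The training targets described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `utils`: the helper `make_target_alphas`, the target precision of 100 and the smoothing value of 0.01 are all hypothetical choices. In-distribution points get a sharp Dirichlet concentrated on the labelled class, out-of-distribution points get a flat Dirichlet (all concentration parameters equal to 1), and the loss is the KL divergence from the target Dirichlet to the Dirichlet predicted by the network.

```python
import torch
from torch.distributions import Dirichlet, kl_divergence

def make_target_alphas(labels, n_classes=3, ood_label=3,
                       precision=100.0, smoothing=0.01):
    """Hypothetical helper: target Dirichlet concentrations for each sample."""
    # Flat Dirichlet (all ones) is the default and is kept for OOD points.
    alphas = torch.ones(len(labels), n_classes, dtype=torch.float64)
    in_dist = labels != ood_label
    # Smoothed one-hot mean for in-distribution points, scaled by the precision.
    one_hot = torch.nn.functional.one_hot(labels[in_dist], n_classes).double()
    mean = one_hot * (1.0 - n_classes * smoothing) + smoothing
    alphas[in_dist] = mean * precision
    return alphas

labels = torch.tensor([0, 2, 3])          # two in-distribution points, one OOD
target_alphas = make_target_alphas(labels)

# Stand-in for network outputs (assumed values); concentrations must be positive.
predicted_alphas = torch.tensor([[50.0, 1.0, 1.0],
                                 [1.0, 1.0, 50.0],
                                 [1.0, 1.0, 1.0]], dtype=torch.float64)

loss_per_sample = kl_divergence(Dirichlet(target_alphas),
                                Dirichlet(predicted_alphas))
loss = loss_per_sample.mean()
print(target_alphas, loss_per_sample, loss)
```

In this toy example the third sample contributes zero loss, because the stand-in prediction for the OOD point already matches the flat target.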

The model makes confident (low-entropy) predictions in regions dominated by a single class. However, it now makes maximally uncertain predictions in the region we designated as out-of-distribution. In this region, as well as in the regions between the classes and far from the training data, the entropy attains its maximum value. In other words, the Prior Network assigns a probability of 1/3 to each class, which is a more reasonable decision given that data from such regions are assumed to be out-of-distribution and we cannot be certain which class they belong to.

The last figure plots the mutual information between the label and the categorical distribution predicted by the network. As mentioned in the paper, this is a measure of distributional uncertainty: it is the difference between the total uncertainty (the entropy of the expected categorical) and the expected data uncertainty (the expected entropy of the categorical). The mutual information surface has a similar shape to the entropy surface. However, the mutual information is approximately zero in the middle of the three classes, since the uncertainty in this region arises from class overlap and not from the region being considered out-of-domain.
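For a Dirichlet with concentration parameters $\alpha_c$ and precision $\alpha_0 = \sum_c \alpha_c$, these quantities have closed forms. The sketch below is an illustration only (the function name is hypothetical, and `plot.plot_MI` may compute things differently): the total uncertainty is the entropy of the expected categorical $\alpha / \alpha_0$, the expected data uncertainty is $-\sum_c \frac{\alpha_c}{\alpha_0}\left(\psi(\alpha_c + 1) - \psi(\alpha_0 + 1)\right)$ with $\psi$ the digamma function, and the mutual information is their difference.

```python
import math
import torch

def dirichlet_uncertainties(alphas):
    """Total, expected data and distributional uncertainty for (N, K) concentrations."""
    alpha0 = alphas.sum(dim=-1, keepdim=True)
    mean = alphas / alpha0                          # expected categorical
    total = -(mean * torch.log(mean)).sum(dim=-1)   # entropy of the expected categorical
    expected_data = -(mean * (torch.digamma(alphas + 1.0)
                              - torch.digamma(alpha0 + 1.0))).sum(dim=-1)
    mutual_information = total - expected_data
    return total, expected_data, mutual_information

alphas = torch.tensor([[1.0, 1.0, 1.0],         # flat: out-of-distribution
                       [100.0, 100.0, 100.0],   # sharp, uniform mean: class overlap
                       [98.0, 1.0, 1.0]],       # sharp at one class: confident
                      dtype=torch.float64)
total, expected_data, mi = dirichlet_uncertainties(alphas)
print(total, expected_data, mi)
```

The first two rows have the same total uncertainty, $\log 3$, but only the flat Dirichlet has substantial mutual information. This is the distinction between out-of-distribution regions and regions of class overlap described above.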

In [15]:
# fit the model on in-distribution and out-of-distribution data
model, optimizer, training_loss = run.train_model(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_data=new_data, #TRAIN WITH IN AND OUT OF DISTRIBUTION
    args={},
    n_epochs=1000,
    batch_size=128)

In [16]:
plot.plot_training_loss(training_loss=training_loss)

In [13]:
#please rerun this to create interactive 3D plot
plot.plot_decision_surface(model=model,
                           samples=new_data['samples'],
                           labels=new_data['targets'],
                           labels_names=labels_names,
                           z_fn=measures.entropy_categorical,
                           x_axis_title='Patient Feature 1 (e.g. age)',
                           y_axis_title='Patient Feature 2 (e.g. BMI)',
                           z_axis_title='Predicted Class Entropy'
                          )

In [14]:
#please rerun this to create interactive 3D plot
plot.plot_MI(model=model,
            samples=new_data['samples'],
            labels=new_data['targets'],
            labels_names=labels_names,
            x_axis_title='Patient Feature 1 (e.g. age)',
            y_axis_title='Patient Feature 2 (e.g. BMI)',
            )

## Experiments with different OOD shapes

### OOD as a ring around the data

To train a Prior Network, the multi-task objective defined in the paper requires samples from the out-of-domain distribution. In practice, this distribution is unknown and samples from it are unavailable. One solution is to synthetically generate points on the boundary of the in-domain region, which is feasible in two dimensions, where the data can be visualized. A sensible choice is an out-of-domain distribution that forms a ring around the training data. As a result, any test data on or outside the ring are assumed to be out-of-distribution.
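One way to generate such a ring is sketched below. This is not the notebook's `data.create_data_spherical_shells`, whose implementation is not shown; the helper `sample_ring` and the radii are hypothetical. Points are sampled from an annulus and given label 3, the value the notebook uses elsewhere to mark new or out-of-distribution patients.

```python
import numpy as np

def sample_ring(n_samples, center=(0.0, 0.0), r_min=8.0, r_max=10.0, seed=0):
    """Hypothetical helper: sample 2D points from an annulus around `center`."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    # Sampling r^2 uniformly spreads points uniformly over the annulus area.
    radii = np.sqrt(rng.uniform(r_min ** 2, r_max ** 2, size=n_samples))
    offsets = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return np.asarray(center) + offsets

ood_samples = sample_ring(n_samples=200)
ood_labels = np.full(len(ood_samples), 3)   # label 3 marks OOD points
print(ood_samples.shape, ood_labels[:5])
```

The inner radius should be chosen large enough that the ring does not overlap the in-distribution clusters. In higher dimensions this construction becomes much harder, because the boundary of the data cannot simply be read off a plot.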

By observing the entropy and mutual information plots, it is evident that the model makes confident predictions within the ring (entropy is low except at the center, where the classes overlap), whereas it makes uncertain predictions outside the ring (entropy and mutual information are maximized).

Again, the entropy is high in the middle of the classes whereas mutual information is low. This is because the uncertainty in this region arises due to class overlap (data uncertainty) and not due to lack of training data.

In [17]:
#create the data
train_data_ring = data.create_data(
    create_data_functions=[
        data.create_data_spherical_shells,
        data.create_data_mixture_of_gaussians,
    ],
    functions_args=[
        data.rings,
        data.mog_three_in_distribution_overlap
    ])
plot.plot_training_data(
    samples=train_data_ring['samples'].numpy(),
    labels=train_data_ring['targets'].numpy(),
    labels_names=labels_names,
    plot_title='Training Data',
    xaxis=dict(title='Patient Feature 1 (e.g. age)'),
    yaxis=dict(title='Patient Feature 2 (e.g. BMI)')
)

In [18]:
model, optimizer, training_loss = run.train_model(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_data=train_data_ring, #TRAIN WITH IN AND OUT OF DISTRIBUTION
    args={},
    n_epochs=1000,
    batch_size=128)

In [19]:
plot.plot_training_loss(training_loss=training_loss)

In [20]:
#please rerun this to create interactive 3D plot
plot.plot_decision_surface(model=model,
                           samples=train_data_ring['samples'],
                           labels=train_data_ring['targets'],
                           labels_names=labels_names,
                           z_fn=measures.entropy_categorical,
                           x_axis_title='Patient Feature 1 (e.g. age)',
                           y_axis_title='Patient Feature 2 (e.g. BMI)',
                           z_axis_title='Predicted Class Entropy'
                          )

In [21]:
#please rerun this to create interactive 3D plot
plot.plot_MI(model=model,
            samples=train_data_ring['samples'],
            labels=train_data_ring['targets'],
            labels_names=labels_names,
            x_axis_title='Patient Feature 1 (e.g. age)',
            y_axis_title='Patient Feature 2 (e.g. BMI)',
            )

### Interpolation: OOD in the middle of the data

In other cases, we might not have any data in the region in between the classes. Consequently, we may want our classifier to refrain from making confident predictions in this region since without any data, we are quite uncertain of the class that these cases belong to. 

Hence, we can generate points inside the inner boundary of the in-domain region to form the OOD data needed to train a Prior Network. The resulting model is now uncertain when interpolating. This is confirmed by the fact that both the entropy and the mutual information are maximized in the middle of the three classes, a region that we explicitly designated as out-of-domain.
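A minimal sketch of this construction, assuming three well-separated class centers (the values below are hypothetical; the notebook's actual configuration lives in `data.mog_ood_in_middle_no_overlap`): OOD points are drawn from a tight Gaussian at the centroid of the class centers and labelled 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class centers, chosen so that their centroid is the origin.
class_means = np.array([[0.0, 4.0], [-3.5, -2.0], [3.5, -2.0]])
centroid = class_means.mean(axis=0)

# Tight Gaussian blob in the empty region between the classes, labelled OOD.
ood_std = 0.3
ood_samples = centroid + ood_std * rng.standard_normal(size=(100, 2))
ood_labels = np.full(len(ood_samples), 3)
print(centroid, ood_samples.shape)
```

The standard deviation of the blob controls how much of the interior is treated as out-of-domain, so it should stay small relative to the distance between the centroid and the class centers.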

In [22]:
train_data_interpolation = data.create_data(
    create_data_functions=[
        data.create_data_mixture_of_gaussians,
    ],
    functions_args=[
        data.mog_ood_in_middle_no_overlap
    ])
plot.plot_training_data(
    samples=train_data_interpolation['samples'].numpy(),
    labels=train_data_interpolation['targets'].numpy(),
    labels_names=labels_names,
    plot_title='Training Data',
    xaxis=dict(title='Patient Feature 1 (e.g. age)'),
    yaxis=dict(title='Patient Feature 2 (e.g. BMI)')
)

In [23]:
# fit the model on in-distribution and out-of-distribution data
model, optimizer, training_loss = run.train_model(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_data=train_data_interpolation, #TRAIN WITH IN AND OUT OF DISTRIBUTION
    args={},
    n_epochs=1000,
    batch_size=128)

In [24]:
#please rerun this to create interactive 3D plot
plot.plot_decision_surface(model=model,
                           samples=train_data_interpolation['samples'],
                           labels=train_data_interpolation['targets'],
                           labels_names=labels_names,
                           z_fn=measures.entropy_categorical,
                           x_axis_title='Patient Feature 1 (e.g. age)',
                           y_axis_title='Patient Feature 2 (e.g. BMI)',
                           z_axis_title='Predicted Class Entropy'
                          )

In [25]:
#please rerun this to create interactive 3D plot
plot.plot_MI(model=model,
            samples=train_data_interpolation['samples'],
            labels=train_data_interpolation['targets'],
            labels_names=labels_names,
            x_axis_title='Patient Feature 1 (e.g. age)',
            y_axis_title='Patient Feature 2 (e.g. BMI)',
            )

## Section 6: Evaluation

## Section 7: Future Work