In this notebook we show how a traditional machine learning approach compares to the Bayesian approach. As we will see, setting up a Bayesian model will be less straightforward. However, the predictions made with the Bayesian approach will generalise better and will be easier to inspect.
To keep the discussion simple, we will focus on a supervised, binary classification task, where the model of choice is logistic regression.

Before anything else, let's import all the necessary dependencies.


In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import os                        
import jax.numpy as jnp
import matplotlib.pyplot as plt
import numpy as np
import numpyro
import numpyro.distributions as dist

from jax import random
from numpyro.infer import MCMC, NUTS


NUM_CPUS = int(os.environ.get("NUM_CPUS", os.cpu_count()))
numpyro.set_host_device_count(NUM_CPUS)

# set a random seed for later use
seed = random.PRNGKey(42)

In [None]:
# The data that we wish to model looks as follows:
X = np.array([
    [  0.84,  -1.48],
    [ -4.64,  -4.08],
    [  1.32,  -7.64],
    [ -3.04,  -6.64],
    [ -8.8 , -10.48],
    [ -0.84,   1.48],
    [  4.64,   4.08],
    [ -1.32,   7.64],
    [  3.04,   6.64],
    [  8.8 ,  10.48],
])

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

In [None]:
# Our test set is "every" point in the [-15, 15]x[-15, 15] 2d space.
x_space = np.linspace(-15, 15, num=100)
X1, X2 = np.meshgrid(x_space, x_space)
X_test = np.array([X1.ravel(), X2.ravel()]).T

In [None]:
# Test set is every "pixel" in the grid.
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(*X[:5].T, 'o', ms=12, mec='w', label='y = 0')
ax.plot(*X[5:].T, "o", ms=12, mec='w', label='y = 1')
ax.set_xticks(x_space[::])
ax.set_xticklabels([])
ax.set_yticks(x_space[::])
ax.set_yticklabels([])
ax.grid()
ax.set(xlabel='$x_1$', ylabel='$x_2$')
ax.legend()
plt.show()

## Traditional ML approach: fit-predict
Here's how you quickly solve this task using scikit-learn.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(fit_intercept=False)
model.fit(X, y)
y_pred = model.predict_proba(X_test)[:, 0] # keep the 1st column, which corresponds to label "y=0"

### What's going on under the hood?

The class `LogisticRegression` contains a hard-coded cost function given by:

$$ \mathrm{Cost}(y, X, w) = \frac{1}{2} w\cdot w - C\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] $$

where $p_i = \mathrm{logistic}(X_i\cdot w)$. From a Bayesian point of view, the second term says that the data follow a Bernoulli distribution: apart from the factor $-C$, it is the log-likelihood of a Bernoulli distribution with probability $p_i$ (convince yourself of this!).

At the same time, scikit-learn has put some regularisation in place for you (without your consent, but whatever). The default is L2 regularisation with $C=1$. From a Bayesian point of view, this says that the weights are expected to follow a _standard_ normal distribution (why?).
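
One way to approach the "why?" is to check it numerically. The sketch below uses plain NumPy; the helper names (`log_likelihood`, `log_prior`, `neg_log_posterior`) are hypothetical and not part of scikit-learn. Assuming $C=1$ and no intercept, it evaluates the cost and the negative log-posterior (Bernoulli likelihood plus independent standard normal priors on the weights) at a few arbitrary weight vectors and prints the difference, which should be the same constant every time.

```python
import numpy as np

# Same training data as in the notebook.
X = np.array([
    [0.84, -1.48], [-4.64, -4.08], [1.32, -7.64], [-3.04, -6.64], [-8.8, -10.48],
    [-0.84, 1.48], [4.64, 4.08], [-1.32, 7.64], [3.04, 6.64], [8.8, 10.48],
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
C = 1.0  # scikit-learn's default inverse regularisation strength


def log_likelihood(w):
    """Bernoulli log-likelihood with p_i = logistic(X_i . w)."""
    z = X @ w
    log_p = -np.logaddexp(0.0, -z)    # log(logistic(z)), numerically stable
    log_1mp = -np.logaddexp(0.0, z)   # log(1 - logistic(z))
    return np.sum(y * log_p + (1 - y) * log_1mp)


def cost(w):
    """L2-regularised cost: 0.5 * w.w - C * log-likelihood."""
    return 0.5 * w @ w - C * log_likelihood(w)


def log_prior(w):
    """Log-density of independent standard normal weights."""
    return np.sum(-0.5 * w**2 - 0.5 * np.log(2 * np.pi))


def neg_log_posterior(w):
    """Unnormalised: the evidence p(y | X) does not depend on w and is dropped."""
    return -(log_likelihood(w) + log_prior(w))


rng = np.random.default_rng(0)
ws = rng.normal(size=(5, 2))  # arbitrary weight vectors to test with
gaps = np.array([neg_log_posterior(w) - cost(w) for w in ws])
print(gaps)  # the same constant for every w
```

Because the two functions differ only by a constant, they share the same minimiser, which is why minimising the cost amounts to finding the posterior mode under this prior.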

The line `model.fit(X, y)` finds the value of $w$ that minimises the cost function for the given data. As we saw, from the Bayesian point of view, this value of $w$ is the maximum of the posterior distribution (a.k.a. the mode). We can inspect this value by looking at the `model.coef_` attribute:

In [None]:
model.coef_
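
To make the `fit` step less of a black box, here is a minimal sketch that minimises the cost above directly, assuming $C=1$ and no intercept. It uses plain gradient descent with a fixed step size, which is not the solver scikit-learn uses by default; the names `grad_cost` and `w_map` are hypothetical. The result should land close to `model.coef_`, up to solver tolerance.

```python
import numpy as np

# Same training data as in the notebook.
X = np.array([
    [0.84, -1.48], [-4.64, -4.08], [1.32, -7.64], [-3.04, -6.64], [-8.8, -10.48],
    [-0.84, 1.48], [4.64, 4.08], [-1.32, 7.64], [3.04, 6.64], [8.8, 10.48],
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
C = 1.0


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


def grad_cost(w):
    """Gradient of 0.5 * w.w - C * (Bernoulli log-likelihood)."""
    p = logistic(X @ w)
    return w + C * X.T @ (p - y)


# The gradient is Lipschitz with constant at most 1 + C/4 * lambda_max(X^T X),
# so a step of 1/L is a safe choice for gradient descent on this convex cost.
L = 1.0 + 0.25 * C * np.linalg.eigvalsh(X.T @ X).max()
step = 1.0 / L

w_map = np.zeros(X.shape[1])
for _ in range(20_000):
    w_map = w_map - step * grad_cost(w_map)

print(w_map)                             # posterior mode (MAP estimate)
print(np.linalg.norm(grad_cost(w_map)))  # close to zero at the minimum
```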

Keep in mind, though, that the mode of the distribution has no special place in the Bayesian framework. Its actual probability is zero, just like every other point!

What happens when we call `model.predict_proba`? The name given to this method might suggest that the result is the probability of `y=0` given the data seen so far, but that's not true (sorry to be the bringer of sad news). Computing that probability would require evaluating some complicated integrals, which sklearn does not do. A more appropriate name for this method would be `eval_likelihood`, because what is really happening is that the expression

$$1 - \mathrm{logistic}(X_{new}\cdot w)$$

is being evaluated, with $w$ being replaced by the value that minimised the cost. Let's check that this statement is true.

In [None]:
from scipy.special import expit as logistic

In [None]:
w_star = model.coef_.flatten()
y_pred_manual = 1 - logistic(X_test @ w_star)

In [None]:
# check:
np.isclose(y_pred_manual, y_pred).all()

In case there's any confusion, I'll say it clearly: **the above answer is wrong!** It's not the answer we are really looking for, and for it to be a valid approximation to the mathematically correct answer one would have to make some very strong assumptions. To be fair, those assumptions are not unrealistic and often hold in practice. But the key point is that often they don't.
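
One such assumption (not spelled out above) is that the posterior is sharply concentrated around a single value of $w$. The toy sketch below illustrates this with made-up numbers: it pretends that the logit $x_{new}\cdot w$ follows a normal distribution with a hypothetical mean of 2 and compares the plug-in value with the average over draws, once for a narrow spread and once for a wide one.

```python
import numpy as np


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
mean_logit = 2.0  # hypothetical centre of the distribution of x_new . w

results = {}
for spread in (0.01, 5.0):  # narrow vs. wide distribution of the logit
    logits = rng.normal(mean_logit, spread, size=100_000)
    plug_in = 1 - logistic(mean_logit)         # evaluate at a single value
    averaged = np.mean(1 - logistic(logits))   # average over all draws
    results[spread] = (plug_in, averaged)
    print(f"spread={spread}: plug-in={plug_in:.3f}, averaged={averaged:.3f}")
```

With the narrow spread the two numbers nearly coincide; with the wide spread the averaged probability is pulled towards 0.5, and the plug-in value is overconfident by comparison.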

To conclude this section, let's visualise the decision boundary predicted by the ML model:

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
contour = ax.contourf(X1, X2, y_pred.reshape(*X1.shape), cmap='RdYlBu', levels=11)
ax.plot(*X[:5].T, 'o', ms=12, mec='w')
ax.plot(*X[5:].T, "o", ms=12, mec='w')
cbar = plt.colorbar(contour, ax=ax)
ax.set(xlabel='$x_1$', ylabel='$x_2$')
cbar.ax.set_ylabel('Probability of $y=0$')
plt.show()

Despite the regularisation, predictions made by an ML model tend to be overconfident away from the data. In this case, the decision boundary extends in a straight line despite the fact that we don't have any data at the "edges" of the grid. Hopefully this bothers you.

## Bayesian approach

Rather than evaluating the likelihood at a single value of $w$, the Bayesian approach aims to compute the correct answer which, as we saw in the slides, is the average likelihood (where the average is taken over the posterior distribution). The average is calculated via a complicated integral, which we approximate with samples drawn using MCMC. To draw the samples, we first write our NumPyro model:

In [None]:
# note: there are more "elegant" ways of writing the model below,
# but I'm aiming for readability
def logistic_regression(X, y=None):
    n_obs, n_dims = X.shape
    # Let's use the same prior as the one sklearn uses:
    w = numpyro.sample("w", dist.Normal(0, 1).expand((n_dims,)))
    
    # This is the likelihood. The `obs` argument allows this model
    # to be used on unseen data. But we will not cover that syntax here.
    numpyro.sample("y", dist.BernoulliLogits(X @ w), obs=y)

In [None]:
mcmc_kwargs = dict(num_warmup=2000, num_samples=2000, num_chains=NUM_CPUS)
mcmc = MCMC(NUTS(logistic_regression), **mcmc_kwargs)
mcmc.run(seed, X=X, y=y)

The samples are stored inside the `mcmc` object. We can print a summary of the samples drawn:

In [None]:
mcmc.print_summary()

The means of the coefficients do not agree with the values obtained by sklearn, but they shouldn't anyway: the mean and the mode are different quantities.
When we run MCMC, though, we don't just get the means. We obtain the full distribution:

In [None]:
w_samples = mcmc.get_samples()["w"]

In [None]:
w_samples

In [None]:
f, axes = plt.subplots(1, 2, figsize=(13, 5), sharey=True, sharex=True)
axes[0].hist(w_samples[:,0], density=True, bins=30)
axes[1].hist(w_samples[:, 1], density=True, bins=30)
axes[0].set_title("$W_1$ distribution")
axes[1].set_title("$W_2$ distribution")
plt.show()

Or, if you want to see the joint distribution:

In [None]:
from seaborn import jointplot

In [None]:
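# Recent seaborn versions require x and y to be passed as keyword arguments.
jointplot(x=np.asarray(w_samples[:, 0]), y=np.asarray(w_samples[:, 1]), alpha=0.05);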

Now, NumPyro provides a nice API for making predictions on new data once the samples are available. However, to keep this notebook as didactic as possible, I'm going to do manually what NumPyro would do for you.

Say, for instance, that we want to make a prediction for a new point `x=(2, 0.5)`.

Then we do:

In [None]:
x_new_ = np.array([2, 0.5])  # an array, so that the `@` product below is defined
likelihoods = []
for w in w_samples:
    lkhood = 1 - logistic(x_new_ @ w)
    likelihoods.append(lkhood)

The value we should predict is then given by `np.mean(likelihoods)`:

In [None]:
np.mean(likelihoods)
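
In symbols, the average computed above is a Monte Carlo estimate of the integral over the posterior. The notation here is an assumption introduced for illustration: $S$ is the number of posterior samples and $w^{(s)}$ is the $s$-th sample of the weights.

$$ P(y_{new}=0 \mid x_{new}, X, y) = \int \left[1 - \mathrm{logistic}(x_{new}\cdot w)\right] p(w \mid X, y)\, dw \approx \frac{1}{S}\sum_{s=1}^{S} \left[1 - \mathrm{logistic}(x_{new}\cdot w^{(s)})\right] $$

The plug-in value from the previous section corresponds to replacing the whole sum with a single term evaluated at the posterior mode.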

Let's now do it for every point in the test set:

In [None]:
bayesian_y_pred = np.zeros(X_test.shape[0])
for i, x_new in enumerate(X_test):
    p = 1 - logistic(x_new @ w_samples.T) # resulting likelihood for every sample
    bayesian_y_pred[i] = np.mean(p)
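
The loop above can also be written as a single matrix product. The sketch below is self-contained and uses stand-in arrays with hypothetical random values (`X_grid` plays the role of `X_test`, `w_draws` the role of `w_samples`), so the numbers themselves are meaningless; it only checks that the vectorised and looped computations agree.

```python
import numpy as np


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
# Stand-ins with the same layout as the notebook's arrays (hypothetical values).
X_grid = rng.uniform(-15, 15, size=(50, 2))    # (n_points, n_dims)
w_draws = rng.normal(0.0, 1.0, size=(200, 2))  # (n_samples, n_dims)

# One row per test point, one column per posterior sample.
p_all = 1 - logistic(X_grid @ w_draws.T)
vectorised_mean = p_all.mean(axis=1)  # average over the samples

# Same computation, written as the explicit loop used in the notebook.
loop_mean = np.array([np.mean(1 - logistic(x @ w_draws.T)) for x in X_grid])

print(np.allclose(loop_mean, vectorised_mean))
```

Note that the intermediate matrix has one entry per (test point, sample) pair, so with a large grid and many samples it can take a lot of memory; processing the test points in chunks is one way around that.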

Let's look at the prediction boundary:

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Bayesian Logistic Regression")
contour = ax.contourf(X1, X2, bayesian_y_pred.reshape(*X1.shape), cmap='RdYlBu', levels=11)
ax.plot(*X[:5].T, 'o', ms=12, mec='w')
ax.plot(*X[5:].T, "o", ms=12, mec='w')
cbar = plt.colorbar(contour, ax=ax)
ax.set(xlabel='$x_1$', ylabel='$x_2$')
cbar.ax.set_ylabel('Average Probability of $y=0$')
plt.show()

A cool feature of Bayesian models is that we can also inspect how much the predictions vary across samples, which serves as an indicator of "model uncertainty":

In [None]:
y_pred_uncertainty = np.zeros(X_test.shape[0])
for i, x_new in enumerate(X_test):
    p = 1 - logistic(x_new @ w_samples.T) # resulting likelihood for every sample
    y_pred_uncertainty[i] = np.std(p) # notice the difference to previous calculation.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Bayesian Logistic Regression")
contour = ax.contourf(X1, X2, y_pred_uncertainty.reshape(*X1.shape), cmap='magma_r', levels=10)
ax.plot(*X[:5].T, 'o', ms=12, mec='w')
ax.plot(*X[5:].T, "o", ms=12, mec='w')
cbar = plt.colorbar(contour, ax=ax)
ax.set(xlabel='$x_1$', ylabel='$x_2$')
cbar.ax.set_ylabel('Std. deviation of predicted proba')
plt.show()

This shows the regions of space where predictions are reliable. Basically, away from the data there is too much uncertainty and you shouldn't trust the predictions. It is extremely difficult to get a similar insight from a traditional ML model.