# Bayesian Logistic Regression

In this notebook we compare logistic regression under the traditional machine learning approach with logistic regression under the Bayesian approach. As we will see, setting up a Bayesian model is less straightforward. However, the predictions made with the Bayesian approach generalise better and are easier to inspect.

The data that we wish to model is hard coded, centered at zero, and reads as follows.

In [None]:
import numpy as np

In [None]:
X = np.array(
    [
        [0.84, -1.48],
        [-4.64, -4.08],
        [1.32, -7.64],
        [-3.04, -6.64],
        [-8.8, -10.48],
        [-0.84, 1.48],
        [4.64, 4.08],
        [-1.32, 7.64],
        [3.04, 6.64],
        [8.8, 10.48],
    ]
)

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

The data looks like this.

In [None]:
import matplotlib.pyplot as plt

In [None]:
f, ax = plt.subplots(figsize=(6, 6))

ax.plot(*X[:5].T, "o", ms=8, mec="w", label="y = 0")
ax.plot(*X[5:].T, "o", ms=8, mec="w", label="y = 1")
ax.set(xlabel="$x_1$", ylabel="$x_2$")
ax.legend()

f.tight_layout()
plt.show()

## Conventional ML

First let's fit a regular logistic regression to this data using Scikit-learn.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(fit_intercept=False, C=1)
model.fit(X, y)

#### What's going on under the hood?

The class LogisticRegression contains a hard-coded cost function given by:

$$ Cost(y, x, \theta) = \frac{1}{2} \theta \cdot \theta - \sum_{i=1}^{n} \left( y_i \log p_i + (1 - y_i)\log(1 - p_i) \right) $$

where $p_i = \mathrm{logistic}(X_i\cdot \theta)$.

See logistic function here: https://en.wikipedia.org/wiki/Logistic_function

The second term says that the data follow a Bernoulli distribution: the sum is exactly the log-likelihood of Bernoulli observations with probabilities $p_i$, so the term as a whole is the negative log-likelihood (convince yourself of this!).
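
One way to convince yourself is numerical. The following is a minimal sketch, assuming an arbitrary (hypothetical) value of $\theta$ and a hand-written `logistic` helper; it evaluates the second term of the cost as written above and compares it with the negative log of the Bernoulli probability mass function $p_i^{y_i}(1 - p_i)^{1 - y_i}$.

```python
import numpy as np


def logistic(z):
    """Logistic sigmoid, written out by hand to keep the sketch NumPy-only."""
    return 1.0 / (1.0 + np.exp(-z))


# Same data as the notebook: the y = 1 points are the mirror image of the y = 0 points.
X_half = np.array(
    [[0.84, -1.48], [-4.64, -4.08], [1.32, -7.64], [-3.04, -6.64], [-8.8, -10.48]]
)
X = np.vstack([X_half, -X_half])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Arbitrary parameter value, chosen only for illustration (not a fitted value).
theta = np.array([0.3, 0.2])
p = logistic(X @ theta)

# Second term of the cost function, exactly as written in the formula.
cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Negative log-likelihood built directly from the Bernoulli pmf.
bernoulli_pmf = p**y * (1 - p) ** (1 - y)
neg_log_lik = -np.sum(np.log(bernoulli_pmf))

print(cross_entropy, neg_log_lik)
```

The two printed numbers should agree up to floating-point error, whatever value of `theta` you plug in.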

The first term is a regularisation term that Scikit-learn adds by default. The default is an L2 penalty, and `C=1` gives it the weight shown above.

The line `model.fit(X, y)` finds the value of $\theta$ that minimises the cost function for the given data. The fitted parameters are:

In [None]:
model.coef_
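
To make the minimisation less of a black box, here is a rough sketch that minimises the cost function above by hand. It uses a damped Newton method, which is an assumption made for illustration and not what Scikit-learn runs internally; the helper names `cost`, `gradient` and `hessian` are hypothetical.

```python
import numpy as np

# Same data as the notebook: the y = 1 points are the mirror image of the y = 0 points.
X_half = np.array(
    [[0.84, -1.48], [-4.64, -4.08], [1.32, -7.64], [-3.04, -6.64], [-8.8, -10.48]]
)
X = np.vstack([X_half, -X_half])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])


def cost(theta):
    z = X @ theta
    # -(y log p + (1 - y) log(1 - p)) equals log(1 + e^z) - y z, which is numerically stable.
    return 0.5 * theta @ theta + np.sum(np.logaddexp(0.0, z) - y * z)


def gradient(theta):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return theta + X.T @ (p - y)


def hessian(theta):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return np.eye(X.shape[1]) + X.T @ (X * (p * (1 - p))[:, None])


theta_hat = np.zeros(X.shape[1])
for _ in range(50):
    g = gradient(theta_hat)
    if np.linalg.norm(g) < 1e-10:
        break
    direction = np.linalg.solve(hessian(theta_hat), g)
    step = 1.0
    # Halve the step while it would increase the cost (simple safeguard).
    while step > 1e-8 and cost(theta_hat - step * direction) > cost(theta_hat):
        step /= 2
    theta_hat = theta_hat - step * direction

print(theta_hat)
```

If the cost function is indeed the one stated above, `theta_hat` should land close to `model.coef_`, up to the optimiser tolerance used by Scikit-learn.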

Keep in mind, though, that a single value of $\theta$ has no special place in the Bayesian framework. Its actual probability is zero, just like every other point.

#### Predictions

Our test set is "every" point in the _[-15, 15] x [-15, 15]_ 2d space.

In [None]:
x_space = np.linspace(-15, 15, num=100)
X1, X2 = np.meshgrid(x_space, x_space)
X_test = np.array([X1.ravel(), X2.ravel()]).T

In [None]:
# Test set is every "pixel" in the grid.
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(*X[:5].T, 'o', ms=12, mec='w', label='y = 0')
ax.plot(*X[5:].T, "o", ms=12, mec='w', label='y = 1')
ax.set_xticks(x_space[::])
ax.set_xticklabels([])
ax.set_yticks(x_space[::])
ax.set_yticklabels([])
ax.grid()
ax.set(xlabel='$x_1$', ylabel='$x_2$')
ax.legend()
plt.show()

In [None]:
y_pred = model.predict_proba(X_test)[:, 1] # keep the 2nd column, which corresponds to label "y=1"
y_pred

What happens when we call `model.predict_proba`? The name of this method might suggest that the result is the probability of $y=1$ given the data seen so far, but **that's not true.**

Such a value would require us to evaluate some complicated integrals, which Scikit-learn does not do. A more appropriate name for this method would be `eval_likelihood`, because what really happens is that the expression

$$ \mathrm{logistic}( X_{\text{new}} \cdot \hat{\theta}) \tag{1}$$
 
is being evaluated, where $ X_{\text{new}}$ is an unseen data point and $\hat{\theta}$ is the value of $\theta$ that minimised the cost. Let's check that this statement is true.

#### Check that `model.predict_proba` evaluates equation (1).

In [None]:
from scipy.special import expit as logistic # this is the logistic sigmoid function

In [None]:
y_pred_manual = logistic(np.dot(X_test, model.coef_.flatten()))
np.isclose(y_pred_manual, y_pred).all()

Let's plot the heatmap of predictions.

In [None]:
# increase/decrease the number of levels to bin the probability
levels=11

f, ax = plt.subplots(figsize=(8, 5))

contour = ax.contourf(X1, X2, y_pred.reshape(*X1.shape), cmap="YlOrRd", levels=levels)

# training data
ax.plot(*X[:5].T, "o", ms=8, mec="w")
ax.plot(*X[5:].T, "o", ms=8, mec="w")

cbar = plt.colorbar(contour, ax=ax)
ax.set(xlabel="$x_1$", ylabel="$x_2$")
cbar.ax.set_ylabel("Probability of $y=1$")

f.tight_layout()
plt.show()

The decision boundary is a relatively sharp straight line. The fact that we have no data points on the far left or far right is not reflected in the predictions.

Despite the regularisation, predictions made by a ML model are doomed to perform poorly away from the data. In this case, the decision boundary extends in a straight line despite the fact that we don't have any data in the "edges" of the grid. Hopefully this bothers you.

## Bayesian Logistic regression

Rather than evaluating the likelihood with a single value of theta, the Bayesian approach aims to compute the correct answer which, as we saw in the slides, should be the average likelihood, where the average is taken over the posterior distribution. 

The average is calculated via a complicated integral, which we approximate with samples using MCMC.
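
To give a feel for what "approximating with samples" means, below is a minimal random-walk Metropolis sketch for the model defined in the rest of this section (standard normal prior, Bernoulli likelihood). This is a deliberately simple stand-in, not the NUTS sampler that the notebook uses later; the proposal scale of 0.5, the number of steps and the burn-in length are arbitrary assumptions.

```python
import numpy as np

# Same data as the notebook: the y = 1 points are the mirror image of the y = 0 points.
X_half = np.array(
    [[0.84, -1.48], [-4.64, -4.08], [1.32, -7.64], [-3.04, -6.64], [-8.8, -10.48]]
)
X = np.vstack([X_half, -X_half])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])


def log_posterior(theta):
    """Unnormalised log posterior: Bernoulli log-likelihood plus N(0, 1) log-prior."""
    z = X @ theta
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * theta @ theta
    return log_lik + log_prior


rng = np.random.default_rng(0)
theta = np.zeros(2)
current_lp = log_posterior(theta)
chain = []
n_accept = 0

for _ in range(6000):
    # Propose a small random move around the current position.
    proposal = theta + 0.5 * rng.normal(size=2)
    proposal_lp = log_posterior(proposal)
    # Accept with probability min(1, posterior ratio).
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        theta, current_lp = proposal, proposal_lp
        n_accept += 1
    chain.append(theta)

# Discard the first 1000 steps as burn-in.
metropolis_samples = np.array(chain[1000:])
print(metropolis_samples.mean(axis=0), n_accept / 6000)
```

Averaging any function of $\theta$ over the rows of `metropolis_samples` approximates the corresponding integral over the posterior. NUTS reaches the same goal far more efficiently, which is why we rely on it below.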

We start by defining a simple, two-parameter logistic regression model, with a standard normal prior distribution on each component of $\theta$:

<div align="center"><b> prior: </b></div> 
$$ 
\theta_i \sim \mathcal{N}(0, 1)\\
$$


<div align="center"><b> likelihood: </b></div> 
$$ 
y_i \sim \text{Bernoulli}(p_i) \\ 
p_i = \text{Logistic}(x_i \cdot \theta) \\
$$
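
It may help to relate this model to the cost function from the Conventional ML section. The short derivation below is a sketch, assuming that cost function is exactly the one written there (`C=1`, no intercept); terms that do not depend on $\theta$ are collected in "const".

$$
\log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) + \text{const} \\
= \sum_{i=1}^{n} \left( y_i \log p_i + (1 - y_i)\log(1 - p_i) \right) - \frac{1}{2} \theta \cdot \theta + \text{const} \\
= -Cost(y, x, \theta) + \text{const}
$$

Under that assumption, minimising the cost is the same as maximising the posterior density, so the $\hat{\theta}$ found by Scikit-learn is the mode of the posterior of this model.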


To draw the samples, we first write our model in NumPyro.

#### Exercise: Specify the above model in NumPyro.

(Tip: You'll need to use `dist.Normal` and `dist.BernoulliLogits`.)

In [None]:
def model(X, y=None):
    n_obs, n_dims = X.shape
    
    # prior
    with numpyro.plate("theta_plate", n_dims):
        theta = numpyro.sample("theta", dist.Normal(0, 1))

    # likelihood
    with numpyro.plate("n", n_obs):
        numpyro.sample("y", dist.BernoulliLogits(X @ theta), obs=y)

In [None]:
import os
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

# tell numpyro to use multiple cores
numpyro.set_host_device_count(4)

In [None]:
rng_key = random.PRNGKey(42)

mcmc = MCMC(NUTS(model), num_warmup=1000, num_samples=500, num_chains=4)
mcmc.run(rng_key, X=X, y=y)

In [None]:
mcmc.print_summary()

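The mean of the coefficients does not agree with the values obtained by sklearn, but it shouldn't anyway: sklearn returns the value that minimises the cost, which is the mode of the posterior, whereas the summary reports the posterior mean, and the mean and the mode are different quantities. When we run MCMC, though, we don't just get the means. We obtain the full distribution: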

In [None]:
theta_samples = mcmc.get_samples()["theta"]

In [None]:
theta_samples.shape

In [None]:
f, axes = plt.subplots(1, 2, figsize=(13, 5), sharey=True, sharex=True)
axes[0].hist(np.array(theta_samples)[:,0], density=True, bins=30)
axes[1].hist(np.array(theta_samples)[:, 1], density=True, bins=30)
axes[0].set_title(r"$\theta_1$ distribution")
axes[1].set_title(r"$\theta_2$ distribution")
plt.show()

Or, if you want to see the joint distribution:

In [None]:
from seaborn import jointplot

In [None]:
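# pass x and y by keyword: recent seaborn versions no longer accept them positionally
jointplot(x=np.array(theta_samples)[:, 0], y=np.array(theta_samples)[:, 1], alpha=0.05);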

### New predictions

NumPyro provides a nice API for making predictions on new data once the samples are available. However, to keep this notebook as didactic as possible, let's do manually what NumPyro would do for us.

Say, for instance, that we want to make a prediction for a new point `x_new`. 

We know that:

$$
p(\tilde{y}|y) = \int p(\tilde{y}|\theta)p(\theta|y)d\theta
$$

which is basically the integral over all possible values of $\theta$ of our posterior distribution times the likelihood of a new observation for that value of $\theta$.


We know we can rewrite that as the conditional mean of the likelihood function on the new data:

$$
p(\tilde{y}|y) = \mathbb{E}_{\theta} (p(\tilde{y}|\theta)|y)
$$

which basically means that our Monte Carlo approximation of the probability for the new prediction is simply the **mean of the likelihoods** over the samples of $\theta$ drawn from the posterior!
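
Written out as a formula, and introducing $S$ for the number of posterior samples and $\theta^{(s)}$ for the $s$-th sample (notation assumed here, not used elsewhere in the notebook), the approximation reads:

$$
p(\tilde{y}|y) \approx \frac{1}{S} \sum_{s=1}^{S} p(\tilde{y}|\theta^{(s)}), \qquad \theta^{(s)} \sim p(\theta|y)
$$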

In other words, we do:

In [None]:
x_new = np.array([2, 0.5])
likelihoods = []
for theta in theta_samples:
    lkhood = logistic(x_new @ theta)
    likelihoods.append(lkhood)

The value we should predict is then given by ```np.mean(likelihoods)```:

In [None]:
np.mean(likelihoods)

Let's now do it for every point on the test set:

In [None]:
bayesian_y_pred = np.zeros(X_test.shape[0])
for i, x_new in enumerate(X_test):
    p = logistic(x_new @ theta_samples.T)  # resulting likelihood for every sample
    bayesian_y_pred[i] = np.mean(p)
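
The loop above can also be written as a single matrix product. The following is a sketch only: it uses synthetic stand-in samples (drawn from an arbitrary normal distribution, purely hypothetical) in place of the MCMC output, and a hand-written `logistic` helper, so that it runs on its own.

```python
import numpy as np


def logistic(z):
    """Logistic sigmoid, written out by hand to keep the sketch NumPy-only."""
    return 1.0 / (1.0 + np.exp(-z))


# Stand-in for the posterior samples (hypothetical values, NOT the real MCMC output).
rng = np.random.default_rng(0)
theta_samples = rng.normal(loc=[0.2, 0.7], scale=0.3, size=(500, 2))

# Same test grid as in the notebook.
x_space = np.linspace(-15, 15, num=100)
X1, X2 = np.meshgrid(x_space, x_space)
X_test = np.array([X1.ravel(), X2.ravel()]).T

# (n_test, n_dims) @ (n_dims, n_samples) -> one likelihood per test point and per sample.
probs = logistic(X_test @ theta_samples.T)

# Averaging over the sample axis gives the Monte Carlo prediction for every test point.
bayesian_y_pred_vec = probs.mean(axis=1)

# Sanity check against the explicit loop on the first few test points.
loop_check = np.array([np.mean(logistic(x_new @ theta_samples.T)) for x_new in X_test[:50]])
print(np.allclose(bayesian_y_pred_vec[:50], loop_check))
```

The intermediate `probs` array has one entry per test point and per sample, so for large grids or long chains the loop version may be preferable for memory reasons.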

Let's look at the prediction boundary:

In [None]:
# increase/decrease the number of levels to bin the probability
levels=11

f, ax = plt.subplots(figsize=(8, 5))

contour = ax.contourf(X1, X2, bayesian_y_pred.reshape(*X1.shape), cmap="YlOrRd", levels=levels)

# training data
ax.plot(*X[:5].T, "o", ms=8, mec="w")
ax.plot(*X[5:].T, "o", ms=8, mec="w")

cbar = plt.colorbar(contour, ax=ax)
ax.set(xlabel="$x_1$", ylabel="$x_2$")
cbar.ax.set_ylabel("Probability of $y=1$")

f.tight_layout()
plt.show()

A cool feature of Bayesian models is that we can also inspect how much the predictions vary across samples, which serves as an indicator of "model uncertainty":

**Exercise:** Compute the prediction uncertainty as the standard deviation of the predicted probabilities.

In [None]:
y_pred_uncertainty = np.zeros(X_test.shape[0])
for i, x_new in enumerate(X_test):
    p = logistic(x_new @ theta_samples.T)  # resulting likelihood for every sample
    y_pred_uncertainty[i] = np.std(p)  # notice the difference to previous calculation.

Let's look at the uncertainties.

In [None]:
# increase/decrease the number of levels to bin the probability
levels=11

f, ax = plt.subplots(figsize=(8, 5))

contour = ax.contourf(X1, X2, y_pred_uncertainty.reshape(*X1.shape), cmap="magma_r", levels=levels)

# training data
ax.plot(*X[:5].T, "o", ms=8, mec="w")
ax.plot(*X[5:].T, "o", ms=8, mec="w")

cbar = plt.colorbar(contour, ax=ax)
ax.set(xlabel="$x_1$", ylabel="$x_2$")
cbar.ax.set_ylabel("Std. dev. of predicted probability")

f.tight_layout()
plt.show()

This is showing the regions of space where predictions are reliable. Basically, away from the data there is too much uncertainty and you shouldn't trust the predictions. It is extremely difficult to get a similar insight from a traditional ML model.