# MNIST For ML Beginners: The Bayesian way
**(c) 2017 Sreekumar Thaithara Balan, Fergus Simpson and Richard Mason, Alpha-I**. 

Download this notebook at: [link]

This tutorial is intended for readers who are new to machine learning, TensorFlow and Bayesian methods. Our intention is to teach you how to train your first Bayesian neural network, and to provide a Bayesian companion to the well-known [getting started example](https://www.tensorflow.org/get_started/mnist/beginners) in TensorFlow.

So why do we need Bayesian Neural Networks? Traditionally, neural networks are trained to produce a point estimate of some variable of interest. For example, we might train a neural network on historical stock price data to predict the price at a future point in time. The limitation of a single point estimate is that it does not provide any measure of the uncertainty in its prediction. To continue our stock prediction example, if the network is 95% confident that the stock will increase in value then we have an easy decision to buy, but what if it is only, say, 50% confident? With point estimates we just don't know how uncertain we are. Bayesian Neural Networks, on the other hand, use the formalism of Bayes' rule to provide just this sort of measure of uncertainty.

In this tutorial, we will learn about:

+ How Bayesian statistics are related to machine learning.
+ How to construct a Bayesian model for the classification of MNIST images.
+ How Bayesian neural networks can quantify uncertainties in predictions.

*For those who are eager to see why we care about uncertainties, scroll down to the bottom of this blog where we input the image of the letter **D** and ask our model to classify it. With a Bayesian model we can see how confident we are about our predictions!*

For more background information on Bayesian Neural Networks, Thomas Wiecki's blog on [Bayesian Deep Learning](http://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/) and Yarin Gal's blog [What my deep model doesn't know...](http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html) are extremely useful starting points.

The tutorial requires [TensorFlow](https://www.tensorflow.org/) *version 1.1.0* and [Edward](http://edwardlib.org/) *version 1.3.1*.

## Bayesian Neural Networks

So what is a Bayesian Neural Network? To understand it, we'll begin with a brief outline of Bayesian statistics. At its core, Bayesian statistics is a tool which advises us on how we should alter our beliefs in light of new information.

### Bayes' rule

Suppose that we have two events $x$ and $y$ and we want to know the conditional probability distribution of $x$ given $y$. Then Bayes' rule from probability theory tells us that

$$ P(x \;|\;y) = \frac{P(y\;|\;x)P(x)}{P(y)}$$

where $P(y\;|\;x)$ is the likelihood of observing event $y$ given $x$, $P(x)$ is our prior belief about $x$ and $P(y)$ is the probability of event $y$. Note that our prior belief about the variable $x$ is a probability distribution and that we obtain an entire distribution on the possible values of $x$ given $y$.
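
To make the rule concrete, here is a small numerical sketch. The scenario (a noisy detector that says "8") and all of the probabilities are made up for illustration and are not part of the MNIST model built later.

```python
import numpy as np

# Hypothetical setup: x is the true digit, restricted to the two values 3 and 8.
# y is the event "a noisy detector says 8". All numbers are assumptions.
prior = np.array([0.5, 0.5])        # P(x) for x = 3 and x = 8
likelihood = np.array([0.2, 0.9])   # P(y | x) for x = 3 and x = 8

# P(y) is obtained by summing P(y | x) P(x) over every value of x.
evidence = np.sum(likelihood * prior)

# Bayes' rule gives a full distribution over x, not a single number.
posterior = likelihood * prior / evidence

print(posterior)
```

Observing $y$ shifts the belief from an even split towards the digit 8, and the posterior still sums to one.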

### Neural Networks

**[Reference to Blundell Paper somewhere here]** So how does Bayes' rule connect to neural networks? Suppose that we are given a data set $D= \{(x_{i},y_{i})\}_{i=1}^{N}$ consisting of pairs of inputs $x_{i}$ and corresponding outputs $y_{i}$ for $i=1,2,\ldots,N$. We can use a neural network to model the likelihood function $P(y\;|\;x;\omega)$, where $\omega$ is the set of tunable parameters of the model, i.e. the weights and biases of the network. For example, for a classification problem we could use a standard feedforward network, $y_{i} = f(x_{i};\omega)$, followed by a softmax layer that normalises the output so that it represents a valid probability mass function $P(y_{i}\;|\;x_{i};\omega)$.

Traditional approaches to training neural networks typically produce a point estimate by optimising the weights and biases to minimise a loss function, such as a cross-entropy loss in the case of a classification problem. From the probabilistic viewpoint, this is equivalent to maximising the log likelihood of the observed data, $\log P(D\;|\;\omega)$, to find the maximum likelihood estimate (MLE)

$$\begin{aligned}
\omega^{\text{MLE}} &= \text{arg}\underset{\omega}{\text{max}} \;\log{P(D\;|\;\omega)} \\
&= \text{arg}\underset{\omega}{\text{max}} \;\sum_{i=1}^{N}\log{P(y_{i}\;|\;x_{i};\omega)}
\end{aligned}$$


This optimisation is typically carried out using some form of gradient descent, with the gradients computed by backpropagation. With the weights and biases then fixed, we can predict a new output $y^{*}=f(x^{*};\omega)$ for a given input $x^{*}$.
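
The equivalence between minimising the cross-entropy loss and maximising the log likelihood can be checked numerically. The sketch below uses plain NumPy with random toy data; the sizes, the `softmax` helper and the variable names are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 5, 4, 3                      # toy sizes (assumptions)
x = rng.normal(size=(n, d))            # inputs x_i
labels = rng.integers(0, k, size=n)    # integer class labels y_i
w = rng.normal(size=(d, k))            # weights
b = np.zeros(k)                        # biases

probs = softmax(x @ w + b)             # P(y | x_i; omega) for every class

# Log likelihood: sum over i of log P(y_i | x_i; omega).
log_lik = np.log(probs[np.arange(n), labels]).sum()

# Cross-entropy loss with one-hot targets, summed over the batch.
one_hot = np.eye(k)[labels]
cross_entropy = -(one_hot * np.log(probs)).sum()

print(log_lik, cross_entropy)
```

The two printed numbers differ only in sign, so any $\omega$ that minimises the cross-entropy also maximises the log likelihood.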

Training a neural network in this way is well known to be prone to overfitting, so we often introduce a regularisation term such as an $L_{2}$ norm of the weights. Using Bayes' rule you can show that applying $L_{2}$ regularisation to the weights is equivalent to placing a zero-mean Gaussian prior, $\omega\sim\mathcal{N}(0,I)$, on the weights and maximising the posterior $P(\omega\;|\;D)$. This gives us the maximum a posteriori (MAP) estimate of the parameters (see chapter 41 of MacKay's [book](http://www.inference.phy.cam.ac.uk/itprnn/book.html) for details):

$$\begin{aligned}
\omega^{\text{MAP}} &= \text{arg}\underset{\omega}{\text{max}}\;\log{P(\omega\;|\;D)} \\
&= \text{arg}\underset{\omega}{\text{max}}\;\left[\log{P(D\;|\;\omega)} + \log{P(\omega)}\right].
\end{aligned}$$

From this we can see that traditional approaches to neural network training and regularisation can be placed within the framework of performing inference using Bayes' rule. Bayesian Neural Networks go one step further by trying to approximate the entire posterior distribution using either Monte Carlo or Variational Inference techniques. In the rest of the tutorial we will show you how to do this using TensorFlow and [Edward](http://edwardlib.org/).
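
The link between $L_{2}$ regularisation and a Gaussian prior can also be checked numerically. This sketch assumes a standard normal prior and an $L_{2}$ penalty with coefficient $1/2$; the helper names are hypothetical.

```python
import numpy as np

def log_standard_normal(w):
    # Log density of N(0, I) evaluated at the weight vector w.
    return -0.5 * np.sum(w ** 2) - 0.5 * w.size * np.log(2.0 * np.pi)

def l2_penalty(w):
    # L2 regularisation term with coefficient 1/2 (an assumption).
    return 0.5 * np.sum(w ** 2)

rng = np.random.default_rng(1)
w_a = rng.normal(size=10)   # two arbitrary weight vectors
w_b = rng.normal(size=10)

# log P(w) + penalty(w) is the same constant for every w, so maximising
# the log prior is the same as minimising the L2 penalty.
gap_a = log_standard_normal(w_a) + l2_penalty(w_a)
gap_b = log_standard_normal(w_b) + l2_penalty(w_b)

print(gap_a, gap_b)
```

Because the gap does not depend on the weights, adding $\log P(\omega)$ to the log likelihood changes the objective in the same way as subtracting the penalty.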

## Importing data
Let us import the [MNIST images](http://yann.lecun.com/exdb/mnist/) using the built-in TensorFlow methods.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from edward.models import Categorical, Normal
import edward as ed
import pandas as pd


In [5]:
# Use the TensorFlow method to download and/or load the data.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) 

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


## Modeling

Our machine learning model will be a simple softmax regression that attempts to classify the handwritten MNIST digits into one of the classes {0,1,2,...,9}. To do this we need a function that quantifies the probability of the observed data given a set of parameters (weights and biases in our case); this is called the likelihood function. We use a Categorical likelihood function (see Chapter 2 of [Machine Learning: a Probabilistic Perspective](https://www.cs.ubc.ca/~murphyk/MLbook/) by Kevin Murphy for a detailed description of the Categorical distribution, also called the Multinoulli distribution).

We first set up some placeholder variables in TensorFlow. This follows the same procedure as for a standard neural network, except that we use Edward to place priors on the weights and biases. In the code below, we place a standard normal prior on the weights and biases.

In [1]:
ed.set_seed(31415)
N = 100   # number of images in a minibatch.
D = 784   # number of features.
K = 10    # number of classes.


In [None]:
# Create a placeholder to hold the data (in minibatches) in a TensorFlow graph.
x = tf.placeholder(tf.float32, [None, D])
# Normal(0,1) priors for the variables. Note that the syntax assumes TensorFlow 1.1.
w = Normal(loc=tf.zeros([D, K]), scale=tf.ones([D, K]))
b = Normal(loc=tf.zeros(K), scale=tf.ones(K))
# Categorical likelihood for classification; the linear output is passed as logits.
y = Categorical(logits=tf.matmul(x, w) + b)
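
To see what this model says before any training, it can help to simulate it once by hand. The sketch below is a plain NumPy analogue, not Edward code: it draws one set of parameters from the Normal(0, 1) priors and samples labels from the resulting Categorical distribution. The random "images" and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 784, 10                       # features and classes, as in the tutorial
x_fake = rng.random(size=(3, D))     # three made-up "images" with pixels in [0, 1)

# One draw of the parameters from the Normal(0, 1) priors.
w_draw = rng.normal(size=(D, K))
b_draw = rng.normal(size=K)

# The softmax of the logits gives the class probabilities of the likelihood.
logits = x_fake @ w_draw + b_draw
logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Sample one label per image from its Categorical distribution.
y_draw = np.array([rng.choice(K, p=p) for p in probs])

print(y_draw)
```

Each fresh draw of the weights gives a different classifier, which is the sense in which the prior defines a distribution over models.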

## Variational Inference

Up to this point we have defined the likelihood $P(y\;|\;x;\omega)$ and the prior $P(\omega)$. Next we want to use Bayes' rule to compute the posterior $P(\omega\;|\;y,x)$. However, we immediately face a problem: the normalising constant in Bayes' rule, the probability of the outputs $P(y)$, requires integrating over all possible weights and biases. For large problems this is computationally intractable, so we can't calculate the posterior directly.

To tackle this problem we will use Variational Inference (VI), which uses a family of parameterised distributions $Q(\omega;\lambda)$ over the parameters $\omega$ to approximate the true posterior, and optimises the variational parameters $\lambda$ so that $Q$ matches the true posterior as closely as possible. The core idea is to minimise the Kullback-Leibler (KL) divergence between the approximating distribution $Q(\omega;\lambda)$ and the true posterior $P(\omega\;|\;y,x)$, which can be thought of as a measure of how dissimilar two probability distributions are.
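
As a rough illustration of what the KL divergence measures, the sketch below evaluates its closed form for two univariate normal distributions. The "true posterior" here is simply assumed to be a known normal, which is never the case in practice, and the function name `kl_normal` is hypothetical.

```python
import numpy as np

def kl_normal(m_q, s_q, m_p, s_p):
    """KL(q || p) for univariate normals q = N(m_q, s_q^2) and p = N(m_p, s_p^2)."""
    return np.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5

# Pretend the true posterior is N(1.0, 0.5^2) (an assumption for illustration).
m_p, s_p = 1.0, 0.5

kl_same = kl_normal(1.0, 0.5, m_p, s_p)    # q identical to p
kl_close = kl_normal(0.9, 0.6, m_p, s_p)   # q close to p
kl_far = kl_normal(0.0, 1.0, m_p, s_p)     # q far from p

print(kl_same, kl_close, kl_far)
```

The divergence is zero when the two distributions coincide and grows as they move apart. Since the true posterior is not available in closed form for a real model, VI optimises an equivalent objective, the evidence lower bound, rather than evaluating the KL divergence directly.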

The theory behind VI is beyond the scope of this blog. A quick introduction to VI can be found in Edward's documentation, and a detailed one in *Variational Inference: A Review for Statisticians* by Blei et al. Chapter 33 of MacKay's book is also a very good reference.

So next we use Edward to place a distribution on the weights and biases of the network:

In [None]:
# Construct q(w) and q(b). In this case we assume Normal distributions.
# The softplus transform keeps the scale (standard deviation) positive.
qw = Normal(loc=tf.Variable(tf.random_normal([D, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, K]))))
qb = Normal(loc=tf.Variable(tf.random_normal([K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([K]))))

In [None]:
# We use a placeholder for the labels in anticipation of the training data.
y_ph = tf.placeholder(tf.int32, [N])
# Define the VI inference technique, i.e. minimise the KL divergence between q and p.
inference = ed.KLqp({w: qw, b: qb}, data={y: y_ph})

In [3]:
# Initialise the inference variables.
# The scale factor reweights each minibatch to the size of the full training set.
inference.initialize(n_iter=5000, n_print=100, scale={y: float(mnist.train.num_examples) / N})


Now we are ready for the VI. We load up a TensorFlow session and start the iterations. This may take a few minutes...

In [None]:
# We will use an interactive session.
sess = tf.InteractiveSession()
# Initialise all the variables in the session.
tf.global_variables_initializer().run()

In [None]:
# Let the training begin. We load the data in minibatches and update the VI inference using each new batch.
for _ in range(inference.n_iter):
    X_batch, Y_batch = mnist.train.next_batch(N)
    # The TensorFlow method gives the labels in a one-hot vector format. We convert them into single integer labels.
    Y_batch = np.argmax(Y_batch, axis=1)
    info_dict = inference.update(feed_dict={x: X_batch, y_ph: Y_batch})
    inference.print_progress(info_dict)

## Evaluating Our Model
We now have everything that we need to run our model on the test data, so let's see how good it is! The major difference in Bayesian model evaluation is that there is no single set of weights that we should use to evaluate the model. Instead we should use the distribution of weights and biases, so that the uncertainties in these parameters are reflected in the final prediction. Thus, instead of a single prediction, we get a set of predictions and their accuracies.

We draw 100 samples from the posterior distribution and see how we perform on each of these samples. *Taking samples might be a slow process and may take a few seconds!*

In [None]:
# Load the test images.
X_test = mnist.test.images
# The TensorFlow method gives the labels in a one-hot vector format. We convert them into single integer labels.
Y_test = np.argmax(mnist.test.labels, axis=1)

In [None]:
# Generate samples from the posterior and store them.
n_samples = 100
prob_lst = []
samples = []
w_samples = []
b_samples = []
for _ in range(n_samples):
    w_samp = qw.sample()
    b_samp = qb.sample()
    w_samples.append(w_samp)
    b_samples.append(b_samp)
    # Also compute the probability of each class for each (w, b) sample.
    prob = tf.nn.softmax(tf.matmul(X_test, w_samp) + b_samp)
    prob_lst.append(prob.eval())
    sample = tf.concat([tf.reshape(w_samp,[-1]),b_samp],0)
    samples.append(sample.eval())

In [4]:
# Compute the accuracy of the model.
# For each sample we compute the predicted class and compare it with the test labels.
# The predicted class is defined as the one which has the maximum probability.
# We perform this test for each (w, b) sample from the posterior, giving us a set of accuracies.
# Finally we make a histogram of the accuracies for the test data.
accy_test = []
for prob in prob_lst:
    y_trn_prd = np.argmax(prob,axis=1).astype(np.float32)
    acc = (y_trn_prd == Y_test).mean()*100
    accy_test.append(acc)

plt.hist(accy_test)
plt.title("Histogram of prediction accuracies in the MNIST test data")
plt.xlabel("Accuracy")
plt.ylabel("Frequency")


We have a range of accuracies for the samples. Note that the posterior distributions of the weights and biases reflect the information gained from the entire MNIST training data. Thus the above histogram is representative of the uncertainty coming from the statistically possible range of weights and biases.

We can also perform model averaging to obtain the equivalent of a classical machine learning model. We do this by stacking up the predictions of the 100 samples we took from the posterior distribution and then computing the average of the predictions.
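
A minimal NumPy sketch of this averaging step is shown here, using random stand-in probabilities rather than the real model output. The array `prob_stack`, the toy sizes and the made-up labels are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_images, n_classes = 100, 6, 10   # toy sizes (assumptions)

# Stand-in for the stacked per-sample class probabilities:
# random values normalised so that each row sums to one.
raw = rng.random(size=(n_samples, n_images, n_classes))
prob_stack = raw / raw.sum(axis=2, keepdims=True)

# Average over the posterior samples, then pick the most probable class per image.
mean_prob = prob_stack.mean(axis=0)
y_pred = np.argmax(mean_prob, axis=1)

# Compare with made-up labels to obtain a single accuracy figure.
labels = rng.integers(0, n_classes, size=n_images)
accuracy = (y_pred == labels).mean() * 100

print(y_pred, accuracy)
```

Averaging the probabilities before taking the argmax yields one prediction per image, which is what makes the result comparable to a classical point-estimate model.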