# MNIST For ML Beginners: The Bayesian way
**(c) 2017 Sreekumar Thaithara Balan, Fergus Simpson and Richard Mason, Alpha-I**. 

Download this notebook at: [link]

**[Who is it for?]** This tutorial is intended for readers who are new to machine learning, TensorFlow and Bayesian methods. **[What is it?]** Our intention is to teach you how to train your first Bayesian neural network and provide a Bayesian companion to the well-known [getting started example](https://www.tensorflow.org/get_started/mnist/beginners) in TensorFlow.

**[Why should I read it?]** So why do we need Bayesian Neural Networks? Traditionally, neural networks are trained to produce a point estimate of some variable of interest. For example, we might train a neural network on historical stock price data to produce a prediction of the price at a future point in time. The limitation of a single point estimate is that it does not provide us with any measure of the uncertainty in its prediction. To continue our stock prediction example, if the network is 95% confident that the stock will increase in value then we have an easy decision to buy, but what if it is only, say, 50% confident? With point estimates we just don't know how uncertain we are. Bayesian Neural Networks, on the other hand, can use the formalism of Bayes' rule to provide just this sort of measure of uncertainty.

**[What the tutorial will teach you]** In this tutorial, we will learn about:

+ How Bayesian statistics are related to machine learning.
+ How to construct a Bayesian model for the classification of MNIST images.
+ How Bayesian neural networks can quantify uncertainties in predictions.

*For those who are eager to see why we care about uncertainties, scroll down to the bottom of this blog where we input the image of the letter **D** and ask our model to classify it. With a Bayesian model we can see how confident we are about our predictions!*

For more background information on Bayesian Neural Networks, Thomas Wiecki's blog on [Bayesian Deep Learning](http://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/) and Yarin Gal's blog [What my deep model doesn't know...](http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html) are extremely useful starting points.

The tutorial requires [TensorFlow](https://www.tensorflow.org/) *version 1.1.0* and [Edward](http://edwardlib.org/) *version 1.3.1*.

## Bayesian Neural Networks

So what is a Bayesian Neural Network? To understand their nature, we'll begin with a brief outline of Bayesian statistics. At its core, Bayesian statistics is a tool which advises us on how we should alter our beliefs in light of new information.

### Bayes' rule: 

Suppose that we have two events $x$ and $y$ and we want to know the conditional probability distribution of $x$ given $y$. Then Bayes' rule from probability theory tells us that

$$ P(x \;|\;y) = \frac{P(y\;|\;x)P(x)}{P(y)}$$

where $P(y\;|\;x)$ is the likelihood of observing event $y$ given $x$, $P(x)$ is our prior belief about $x$ and $P(y)$ is the probability of event $y$. Note that our prior belief about the variable $x$ is a probability distribution and that we obtain an entire distribution on the possible values of $x$ given $y$.
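As a small worked example of the rule, here is a sketch with two possible values of $x$ and a single observed event $y$. The scenario and all of the numbers are made up purely for illustration.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' rule.
prior = {"up": 0.5, "down": 0.5}        # P(x): belief before seeing y
likelihood = {"up": 0.8, "down": 0.3}   # P(y | x): chance of observing y for each x

# P(y) is obtained by summing P(y | x) P(x) over every value of x.
evidence = sum(likelihood[x] * prior[x] for x in prior)

# P(x | y) for every x: an entire distribution, not a single number.
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}
print(posterior)
```

Observing $y$ shifts belief towards the value of $x$ under which $y$ was more likely, and the posterior still sums to one.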

### Neural Networks

So how does this connect to neural networks? Well, suppose that we are given a data set $D= \{(x_{i},y_{i})\}_{i=1}^{N}$ consisting of pairs of inputs $x_{i}$ and corresponding outputs $y_{i}$ for $i=1,2,\ldots,N$. We can use a neural network to model the likelihood function $P(y\;|\;x;\omega)$, where $\omega$ is the set of tunable parameters of the model, i.e., the weights and biases of the network.

Traditional approaches to training neural networks produce a point estimate by optimising the weights and biases to maximise the log-likelihood of the observed data, $\log P(D\;|\;\omega)$. The result is known as the maximum likelihood estimate (MLE):

$$ \omega^{\text{MLE}} = \text{arg}\underset{\omega}{\text{max}} \;\log{P(D\;|\;\omega)}$$
$$ \quad\quad\quad\quad = \text{arg}\underset{\omega}{\text{max}} \;\sum_{i=1}^{N}\log{P(y_{i}\;|\;x_{i};\omega)}$$


This optimisation is typically carried out using some form of gradient descent, with the gradients computed by backpropagation. Training a neural network in this way is well known to be prone to overfitting, and so researchers have introduced regularisation techniques such as placing a penalty on the $L_{2}$ norm of the weights.
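To make the maximum likelihood objective concrete, here is a minimal NumPy sketch for a softmax-regression model. It is an illustration only: the data are random toy values rather than MNIST, and the helper names, step size and iteration count are arbitrary assumptions.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax along the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def log_likelihood(w, b, x, y):
    # sum_i log P(y_i | x_i; omega) for a softmax-regression model.
    return log_softmax(x @ w + b)[np.arange(len(y)), y].sum()

rng = np.random.RandomState(0)
x = rng.randn(200, 5)              # toy inputs, not MNIST
y = rng.randint(0, 3, size=200)    # toy integer labels in {0, 1, 2}
w, b = np.zeros((5, 3)), np.zeros(3)

ll_before = log_likelihood(w, b, x, y)
for _ in range(100):
    probs = np.exp(log_softmax(x @ w + b))
    err = np.eye(3)[y] - probs     # gradient of the log-likelihood w.r.t. the logits
    w += 0.1 * x.T @ err / len(y)  # gradient ascent on the mean log-likelihood
    b += 0.1 * err.mean(axis=0)
ll_after = log_likelihood(w, b, x, y)
print(ll_before, ll_after)
```

Because the labels here are pure noise, any gain in log-likelihood comes from fitting that noise, which is a small-scale picture of the overfitting problem mentioned above.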

Using Bayes' rule you can show that $L_{2}$ regularisation of the weights is equivalent to placing a Gaussian prior $\omega\sim\mathcal{N}(0,I)$ on the weights and finding the weights that maximise the posterior $P(\omega\;|\;D)$. This gives us the maximum a posteriori (MAP) estimate of the parameters:

$$ \omega^{\text{MAP}} = \text{arg}\underset{\omega}{\text{max}}\;\log{P(\omega\;|\;D)}$$
$$ \quad\quad\quad\quad\quad\quad\;\; = \text{arg}\underset{\omega}{\text{max}}\;\left[\log{P(D\;|\;\omega)} + \log{P(\omega)}\right].$$
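The intermediate steps can be written out as follows. Here $d$ denotes the number of weights, a symbol introduced only for this sketch. Taking the log of Bayes' rule gives

$$ \log{P(\omega\;|\;D)} = \log{P(D\;|\;\omega)} + \log{P(\omega)} - \log{P(D)}, $$

and $\log{P(D)}$ does not depend on $\omega$, so it drops out of the maximisation. For a standard Gaussian prior,

$$ \log{P(\omega)} = -\frac{1}{2}\|\omega\|_{2}^{2} - \frac{d}{2}\log{2\pi}, $$

so, up to a constant, the MAP objective is the log-likelihood minus an $L_{2}$ penalty:

$$ \omega^{\text{MAP}} = \text{arg}\underset{\omega}{\text{max}}\;\left[\log{P(D\;|\;\omega)} - \frac{1}{2}\|\omega\|_{2}^{2}\right]. $$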

From this we see that traditional approaches to neural network training and regularisation can be placed within the framework of performing inference using Bayes' rule. Bayesian Neural Networks go one step further by explicitly placing a prior on the network weights and trying to approximate the entire posterior distribution using either Monte Carlo or Variational Inference techniques. In the rest of the tutorial we will show you how to do this using TensorFlow and [Edward](http://edwardlib.org/).

## Importing data
Let us import the [MNIST images](http://yann.lecun.com/exdb/mnist/) using the built-in TensorFlow methods.

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from edward.models import Categorical, Normal
import edward as ed
import pandas as pd

In [5]:
# Use the TensorFlow method to download and/or load the data.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) 

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


## Modeling

Our machine learning model will be a simple softmax regression that will attempt to classify the handwritten MNIST digits into one of the classes {0,1,2,...,9}. Since we are "thinking" in terms of probabilities, each parameter in our model has a probability distribution attached to it. We also need a function to quantify the probability of the observed data given a set of parameters (weights and biases in our case); this is called the likelihood function.

We use a Categorical likelihood function (see Chapter 2 of [Machine Learning: a Probabilistic Perspective](https://www.cs.ubc.ca/~murphyk/MLbook/) by Kevin Murphy for a detailed description of the Categorical distribution, also called the Multinoulli distribution).

The combination of the Categorical likelihood and Gaussian priors on the weights and biases results in a log-posterior function which resembles cross-entropy minimisation with an $L_{2}$ regulariser (Chapter 41 of MacKay's [book](http://www.inference.phy.cam.ac.uk/itprnn/book.html))! We will not go into more detail now but will try to deal with it in another blog post.
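This correspondence can be checked numerically. The sketch below uses small random arrays (toy sizes, not the MNIST dimensions) and assumes independent Normal(0, 1) priors on every weight and bias. It compares the negative log-posterior, up to the constant $\log P(D)$, with the cross-entropy plus an $L_{2}$ penalty.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, k = 6, 4, 3                       # toy sizes, not the MNIST dimensions
x = rng.rand(n, d)
labels = rng.randint(0, k, size=n)
w, b = rng.randn(d, k), rng.randn(k)

# Softmax class probabilities for each input.
logits = x @ w + b
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Categorical log-likelihood of the observed labels.
log_lik = np.log(probs[np.arange(n), labels]).sum()
# Cross-entropy against one-hot targets: the same quantity with the sign flipped.
cross_entropy = -(np.eye(k)[labels] * np.log(probs)).sum()

# Log of independent Normal(0, 1) priors on every weight and bias.
sq_norm = np.sum(w ** 2) + np.sum(b ** 2)
const = 0.5 * (w.size + b.size) * np.log(2 * np.pi)
log_prior = -0.5 * sq_norm - const

neg_log_posterior = -(log_lik + log_prior)        # up to the constant log P(D)
regularised_loss = cross_entropy + 0.5 * sq_norm
print(np.isclose(neg_log_posterior, regularised_loss + const))
```

The two quantities differ only by a term that does not depend on the weights or biases, so they share the same optimum.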

We first set up some placeholder variables in TensorFlow, just as you would for a standard neural network:

In [7]:
ed.set_seed(31415)
N = 100   # number of images in a minibatch.
D = 784   # number of features.
K = 10    # number of classes.

In [8]:
# Create a placeholder to hold the data (in minibatches) in a TensorFlow graph.
x = tf.placeholder(tf.float32, [None, D])
# Normal(0,1) priors for the variables. Note that the syntax assumes TensorFlow 1.1.
w = Normal(loc=tf.zeros([D, K]), scale=tf.ones([D, K]))
b = Normal(loc=tf.zeros(K), scale=tf.ones(K))
# Categorical likelihood for classification; the linear model output is used as the logits.
y = Categorical(logits=tf.matmul(x, w) + b)
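To see what this model describes as a generative process, here is a plain NumPy sketch of a single forward sample. It does not use Edward or TensorFlow, the input batch is random noise standing in for real images, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(31415)
N, D, K = 100, 784, 10

x_batch = rng.rand(N, D)       # stand-in for a minibatch of flattened images (not real MNIST)
w_sample = rng.randn(D, K)     # one draw from the Normal(0, 1) prior on the weights
b_sample = rng.randn(K)        # one draw from the Normal(0, 1) prior on the biases

# Logits of the linear model, turned into class probabilities with a softmax.
logits = x_batch @ w_sample + b_sample
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

# One Categorical draw per image: a class label in {0, ..., 9}.
y_sample = np.array([rng.choice(K, p=p) for p in probs])
print(probs.shape, y_sample.shape)
```

Each new draw of the weights and biases gives a different set of class probabilities for the same inputs, and that spread is what a Bayesian model uses to express uncertainty.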