# Getting started with [TensorFlow](https://www.tensorflow.org/): The Bayesian way

## Introduction
This tutorial is about getting started with TensorFlow, the deep learning framework (and in fact more than that) from Google. Here you will approach TensorFlow from a Bayesian viewpoint.

### Uncertainty matters!
We will start this blog by explaining why uncertainty matters and how Bayesian methods can help estimate the uncertainty in the predictions, classifications or decisions made with machine learning models. Traditionally, when we think about uncertainty in machine learning, the concept that comes to mind is the accuracy of prediction or classification. See, for example, the figure below from [Canziani et al 2017](https://arxiv.org/abs/1605.07678) for a relative comparison of various CNN models and their accuracy.

![Canziani et al 2017](figures/acc_vs_net_vs_ops.png)

What we have to remember here is that these numbers are computed by keeping the model fixed. In other words, the numbers quoted only consider the best model (architecture) chosen by the corresponding team out of the ensemble of models they experimented with. The models chosen are the ones that are most generally applicable and have the lowest prediction or classification errors. However, only a single model has been considered, whereas we could have tested the entire ensemble, which would give a range of prediction accuracies, one for each model. This gives us a feeling for another kind of uncertainty in the process: the one coming from the choice of the model itself. In many machine learning applications this model uncertainty is extremely important for decision making.

### Why Bayesian?
The question now is how to incorporate the model uncertainty when quoting these numbers or, more generally, whether we can find a unified probabilistic framework (probability being the natural way of quantifying uncertainties) with which we can describe the uncertainties coming from the data, the model and our assumptions. This is where Bayesian methodologies come to mind. In many fields of the natural sciences, Bayesian inference is already the de facto language for making inferences. In machine learning there was some pioneering work in the 90s by MacKay and Neal, but the ideas were forgotten quickly after their introduction. As the use of deep learning methodologies has grown rapidly, so has the need for a deeper understanding of the uncertainties originating from them. A seminal work in this field is Yarin Gal's PhD thesis, which explains why it is important to know when the model doesn't know.

### Why [TensorFlow](https://www.tensorflow.org/)?
TensorFlow has become hugely popular in the deep learning community in recent years. It has a huge number of examples covering a wide range of problems in deep learning and has [Keras](https://keras.io/) as a top layer for easy prototyping. It is fast enough for Bayesian computations. Finally, we will be using [Edward](http://edwardlib.org/) for probabilistic programming. To summarise, TensorFlow
+ is popular among the deep learning community
+ offers easy prototyping using Keras
+ has excellent support
+ can build complex models
+ supports Bayesian computation using Edward

### Current state of the art
We will now summarise the current state of the art in Bayesian deep learning. Applying Bayesian inference to deep learning has mainly been about developing variational methods, which approximate the probability distributions, and Hamiltonian Monte Carlo methods. More generally, there has been significant interest in uncertainty quantification in deep neural networks. In the table below I summarise recent advances in this field.

| Paper/Blog        | Remarks  |
| :-----------------: | :---------:|
| [Learning as Inference, Chapter 41, MacKay 2003](http://www.inference.phy.cam.ac.uk/itprnn/book.html)      | A great starting point. Clearly explains the idea that learning is an inference problem. |
|    Uncertainty in Deep Learning, Yarin Gal, 2016   |  A superb work on the quantification of uncertainty in deep learning. A must read. |
| [Weight Uncertainty in Neural Networks, Blundell et al 2015](https://arxiv.org/abs/1505.05424) | Describes a novel method for calculating the probability distribution of weights in a neural network. |
| more | papers... |

The table below summarises the probabilistic programming libraries that can be used for implementing Bayesian methods.

| Library/Framework        | Remarks  |
| :-----------------: | :---------:|
| [Edward](http://edwardlib.org/) | A great library for Bayesian computation and, in general, probabilistic programming. Built on top of TensorFlow. |
| [Stan](http://mc-stan.org/) | Stan is a general purpose library for statistical modelling.|
| more | libraries... |


## MNIST data and TF example
![MNIST data](figures/MNIST.png)

In the last section we described the need for uncertainty quantification in deep learning. Now we dive into implementing MNIST classification using TensorFlow. We will not go into much detail here, as it has already been discussed in detail in the [TensorFlow getting started guide](https://www.tensorflow.org/get_started/mnist/beginners). What we will take from this example is the loss function, an understanding of which is crucial to implementing Bayesian inference.

$$ H_{y'}(y) = - \sum_i y'_i \log(y_i) $$

Here $y$ is the predicted probability distribution over the classes and $y'$ is the true distribution (the one-hot label). In that tutorial this is referred to as the cross entropy, which basically tells us what it means to be a good model. *A bit more about the cross entropy function and its history*

The takeaway is that each machine learning problem has an associated loss function.
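
To make the formula concrete, here is a minimal NumPy sketch that evaluates the cross entropy for a single example. The three-class probabilities and the helper name `cross_entropy` are made up for illustration and are not part of the TensorFlow tutorial.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy -sum_i y'_i log(y_i) for a single example."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))

# Hypothetical 3-class example: the true class is index 1 (one-hot label).
y_true = np.array([0.0, 1.0, 0.0])

confident_right = np.array([0.05, 0.90, 0.05])
unsure = np.array([1 / 3, 1 / 3, 1 / 3])
confident_wrong = np.array([0.90, 0.05, 0.05])

for name, y_pred in [("confident and right", confident_right),
                     ("unsure", unsure),
                     ("confident and wrong", confident_wrong)]:
    print(name, cross_entropy(y_true, y_pred))
```

With a one-hot label only the term for the true class survives, so the loss is simply minus the log of the probability assigned to the correct class: small when the model is confident and right, large when it is confident and wrong.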


    # Requires TensorFlow 1.x: the tutorials module was removed in TensorFlow 2.
    from tensorflow.examples.tutorials.mnist import input_data

    # Download MNIST (if needed) and load it with one-hot encoded labels.
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

    Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
    Extracting MNIST_data/train-images-idx3-ubyte.gz
    Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
    Extracting MNIST_data/train-labels-idx1-ubyte.gz
    Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
    Extracting MNIST_data/t10k-images-idx3-ubyte.gz
    Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
    Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


    import tensorflow as tf

    # Softmax regression: y = softmax(xW + b)
    # x holds flattened 28x28 images; W and b map them to 10 digit classes.
    x = tf.placeholder(tf.float32, [None, 784])
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)

    # Training: minimise the mean cross entropy between the one-hot labels y_
    # and the predicted class probabilities y.
    y_ = tf.placeholder(tf.float32, [None, 10])
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
    sess = tf.InteractiveSession()
    tf.global_variables_initializer().run()

    # 1000 gradient descent steps, each on a mini-batch of 100 images.
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})


    # Accuracy: fraction of test images whose most probable predicted class
    # matches the true label.
    correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

    0.9186
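
One practical caveat about the loss above: `tf.log(y)` takes the logarithm of a softmax output, which can underflow to zero for very negative logits and then produce infinite or NaN losses. A common remedy is to compute the log-softmax directly from the logits with the log-sum-exp trick. The following is a minimal NumPy sketch of that idea, not the tutorial's code; the function names and the extreme logits are hypothetical and chosen only to show the effect.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax computed with the log-sum-exp trick."""
    # Subtracting the row maximum leaves the result unchanged but keeps
    # every exponent <= 0, so np.exp cannot overflow.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

def mean_cross_entropy(logits, one_hot_labels):
    """Mean over the batch of -sum_i y'_i log(y_i), with y = softmax(logits)."""
    return np.mean(-np.sum(one_hot_labels * log_softmax(logits), axis=1))

# Hypothetical batch of two examples; the first has extreme logits that
# would overflow in a naive exp-then-log implementation.
logits = np.array([[1000.0, 0.0, -1000.0],
                   [0.5, 2.0, -1.0]])
labels = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])

loss = mean_cross_entropy(logits, labels)
print(loss)
```

The loss stays finite even for the extreme first row, because the logarithm is never applied to a probability that has already underflowed to zero.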


## Thinking Bayesian: Learning is inference
We are now ready to transform what we learned in implementing a simple cross entropy loss function into a Bayesian framework. The Bayesian formalism is based on probabilities and relates any prior beliefs we have, together with the information content in the data, to a transformed set of beliefs.
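
In symbols, writing $\mathbf{w}$ for the model weights and $\mathbf{d}$ for the observed data, this update of beliefs is Bayes' theorem:

$$
\Pr(\mathbf{w}|\mathbf{d}) = \frac{\Pr(\mathbf{d}|\mathbf{w})\,\Pr(\mathbf{w})}{\Pr(\mathbf{d})}
$$

Here $\Pr(\mathbf{w})$ is the prior (our beliefs before seeing the data), $\Pr(\mathbf{d}|\mathbf{w})$ is the likelihood (the information content in the data), $\Pr(\mathbf{w}|\mathbf{d})$ is the posterior (the transformed beliefs), and $\Pr(\mathbf{d})$ is a normalising constant.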

The key thing to remember here is that by thinking probabilistically, we transform the learning problem into an inference problem. In other words, the learning or training part of machine learning becomes a statistical inference problem.

Now that we have described what this is about, let us get our hands dirty and implement a Bayesian inference problem based on the cross entropy function. We recognise that the cross entropy function is related to what Bayesians call a likelihood function. A likelihood function is defined as the probability distribution of the data given a model. In our problem the likelihood has the form of a Bernoulli distribution (WIKI LINK) or, with ten digit classes, its multi-class generalisation, the categorical distribution.
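
The link between the cross entropy and the likelihood can be checked numerically. For one-hot labels, the likelihood of one example is the probability the model assigns to the observed class, and the likelihood of a dataset of independent examples is the product of these. The sketch below, using made-up probabilities and labels rather than MNIST, shows that the summed cross entropy equals the negative log of that likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predicted class probabilities for 5 examples and 4 classes,
# obtained by applying a softmax to random logits.
logits = rng.normal(size=(5, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Hypothetical observed classes and their one-hot encoding.
labels = np.array([2, 0, 3, 1, 2])
one_hot = np.eye(4)[labels]

# Likelihood of the data: product over examples of the probability
# assigned to the class that was actually observed.
likelihood = np.prod(probs[np.arange(5), labels])

# Cross entropy summed over all examples.
total_cross_entropy = -np.sum(one_hot * np.log(probs))

print(total_cross_entropy, -np.log(likelihood))
```

The two printed numbers agree, so minimising the cross entropy is the same as maximising the likelihood of the labels given the weights.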



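    import tensorflow as tf
    from edward.models import Normal
    import numpy as np

    # Toy regression data: 50 noisy observations of cos(x) on [-3, 3].
    x_train = np.linspace(-3, 3, num=50)
    y_train = np.cos(x_train) + np.random.normal(0, 0.1, size=50)
    x_train = x_train.astype(np.float32).reshape((50, 1))
    y_train = y_train.astype(np.float32).reshape((50, 1))

    # Standard normal priors on the weights and biases of a two-layer network.
    # Older Edward releases name the Normal parameters mu and sigma, as here;
    # later releases renamed them to loc and scale.
    W_0 = Normal(mu=tf.zeros([1, 2]), sigma=tf.ones([1, 2]))
    W_1 = Normal(mu=tf.zeros([2, 1]), sigma=tf.ones([2, 1]))
    b_0 = Normal(mu=tf.zeros(2), sigma=tf.ones(2))
    b_1 = Normal(mu=tf.zeros(1), sigma=tf.ones(1))

    # Likelihood: Gaussian noise (standard deviation 0.1) around the network output.
    x = x_train
    y = Normal(mu=tf.matmul(tf.tanh(tf.matmul(x, W_0) + b_0), W_1) + b_1,
               sigma=0.1)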
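
To get a feel for what the priors above imply before any data are seen, one can draw weights from them and evaluate the network. The following is a minimal NumPy sketch of that idea, not Edward code: the helper `sample_prior_function` and the choice of 100 draws are hypothetical and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same inputs as the toy data set: 50 points on [-3, 3], shape (50, 1).
x = np.linspace(-3, 3, num=50).astype(np.float32).reshape((50, 1))

def sample_prior_function(x, rng):
    """Draw one set of weights from the N(0, 1) priors and evaluate the network."""
    W_0 = rng.normal(0.0, 1.0, size=(1, 2))
    W_1 = rng.normal(0.0, 1.0, size=(2, 1))
    b_0 = rng.normal(0.0, 1.0, size=2)
    b_1 = rng.normal(0.0, 1.0, size=1)
    # Two-layer network with a tanh hidden layer, as in the model above.
    return np.tanh(x @ W_0 + b_0) @ W_1 + b_1

# 100 functions drawn from the prior; each has shape (50, 1).
prior_draws = np.stack([sample_prior_function(x, rng) for _ in range(100)])

# Point-wise spread across the draws: the prior uncertainty in the output.
spread = prior_draws.std(axis=0)
print(prior_draws.shape, float(spread.mean()))
```

Each draw is a different function, and the spread across draws is the uncertainty encoded by the prior alone. Inference then narrows this spread to the functions that are compatible with the observed data.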

## Making predictions with Uncertainties
Making predictions is a crucial part of the machine learning process. In Bayesian inference this step corresponds to the posterior predictive distribution. In the previous section we explained what is meant by the posterior distribution of the weights, which describes the uncertainty in their possible values. We are now in a position to translate these uncertainties into the prediction or classification of new data $d^{*}$ given the observed data $\mathbf{d}$. Since the value of $\mathbf{w}$ is uncertain, we should average over all possible values of $\mathbf{w}$. The posterior predictive distribution is given by
$$
\Pr(d^{*}|\mathbf{d}) = \int \Pr(d^{*}|\mathbf{w}) \Pr(\mathbf{w}|\mathbf{d})\,\mathrm{d}\mathbf{w}
$$
This distribution tells us about the predictive uncertainties in the new data, given the previously observed data and the uncertainty in the model parameters. Thus the prediction becomes a statistical inference problem. In other words, when new unseen data comes in, we can find the possible predictions given the posterior distribution of the model parameters we computed.

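
In practice the integral is rarely available in closed form and is approximated by Monte Carlo: draw weights from the posterior, simulate a prediction for each draw, and summarise the results. The sketch below illustrates this on a deliberately simple stand-in model, a one-parameter linear regression with known noise, for which the posterior and the posterior predictive are known exactly and can be compared with the Monte Carlo estimate. The data, the prior and all variable names are assumptions made for this illustration; it is not the neural network model from the previous section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: d = 2 x + Gaussian noise with known standard deviation.
sigma = 0.5
x = np.linspace(-1, 1, 20)
d = 2.0 * x + rng.normal(0.0, sigma, size=x.shape)

# Exact Gaussian posterior of the slope w under the prior w ~ N(0, 1).
post_var = 1.0 / (1.0 + np.sum(x**2) / sigma**2)
post_mean = post_var * np.sum(x * d) / sigma**2

# Monte Carlo posterior predictive at a new input x_new:
# 1) sample w from the posterior, 2) sample d_new given each w.
x_new = 1.5
w_samples = rng.normal(post_mean, np.sqrt(post_var), size=100_000)
d_new_samples = rng.normal(w_samples * x_new, sigma)

mc_mean, mc_std = d_new_samples.mean(), d_new_samples.std()

# Closed-form posterior predictive for comparison.
exact_mean = post_mean * x_new
exact_std = np.sqrt(sigma**2 + x_new**2 * post_var)

print(mc_mean, exact_mean)
print(mc_std, exact_std)
```

The predictive standard deviation has two contributions: the noise in the data (`sigma`) and the remaining uncertainty in the weight (`post_var`). The second term is what a single point estimate of the weights would miss.
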
## Summary
The main points we discussed in this blog are:

1. From a Bayesian perspective, machine learning is an inference problem.
2. Transforming the MNIST classification into an inference problem can be easily achieved with Edward.
3. The key trick is to think of the cross entropy function as the negative log-likelihood of the data given the weights.
4. Uncertainties in the prediction can be inferred from the posterior predictive distribution.