# MNIST For ML Beginners
This tutorial is based on the MNIST For ML Beginners tutorial from the TensorFlow website: https://www.tensorflow.org/get_started/mnist/beginners



MNIST is a simple computer vision dataset. It consists of images of handwritten digits. It also includes labels for each image, telling us which digit it is. 
The following code downloads the images from Yann LeCun's website.

In [1]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The MNIST data is split into three parts:
- 55,000 data points of training data (mnist.train)
- 10,000 data points of test data (mnist.test)
- 5,000 data points of validation data (mnist.validation)
    
Images will be called X and labels will be called Y!

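To access the training images: mnist.train.images, with shape [55000, 784]

To access the corresponding labels: mnist.train.labels, with shape [55000, 10]

Each image is 28x28 pixels. This matrix is flattened into a vector of 784 numbers (28 x 28 = 784). Because we loaded the data with one_hot=True, each label is a vector of 10 numbers that is 1 at the position of the correct digit and 0 everywhere else.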
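
The flattening and the one-hot labels can be illustrated with plain Python lists. This is only a sketch on made-up data: the blank toy image and the helper `one_hot` are assumptions for illustration and are not part of the MNIST loader.

```python
# Toy stand-in for an MNIST image: a 28x28 grid of pixel intensities.
# (Hypothetical data; the real images come from mnist.train.images.)
ROWS, COLS = 28, 28
image = [[0.0 for _ in range(COLS)] for _ in range(ROWS)]
image[3][5] = 1.0  # mark one pixel so we can find it after flattening

# Flatten row by row into a single vector of 784 numbers.
flat = [pixel for row in image for pixel in row]


def one_hot(digit, num_classes=10):
    """Return a list with 1.0 at position `digit` and 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]


label = one_hot(3)

print(len(flat))        # 784
print(flat.index(1.0))  # 89, because 3 * 28 + 5 = 89
print(label)
```

Note that flattening throws away the 2-D structure of the image; the simple model in this tutorial does not make use of it.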

# Softmax Regression

We know that every image in MNIST is of a handwritten digit between zero and nine. So there are only ten possible things that a given image can be. We want to be able to look at an image and give the probabilities for it being each digit. For example, our model might look at a picture of a nine and be 80% sure it's a nine, but give a 5% chance to it being an eight (because of the top loop) and a bit of probability to all the others because it isn't 100% sure.

We will get this probability by using a technique called Softmax Regression. A softmax regression has two steps:
- First we add up the evidence of our input being in certain classes.
- Then we convert that evidence into probabilities.


To check the evidence that a given image is in a particular class, we do a weighted sum of the pixel intensities. The weight is negative if that pixel having a high intensity is evidence against the image being in that class, and positive if it is evidence in favor.

We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independent of the input. The result is that the evidence for a class i given an input x is:

\begin{equation} 
\text{evidence}_i = \sum_j W_{i,j} x_j + b_i
\end{equation}

$W_i$ is the weights and $b_i$ is the bias for class $i$, and $j$ is an index for summing over the pixels in our input image $x$. The bias is added once per class, after the sum. We then use the softmax function to shape the output of our linear function into the form we want: in this case, a probability distribution over 10 classes.
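
The two steps (adding up evidence, then turning it into probabilities) can be sketched in a few lines of plain Python. The sizes here (3 classes, 4 "pixels") and all numbers are made up for illustration, and the softmax is written in its usual form, exponentiate and then normalize, which is an assumption about the definition rather than TensorFlow's own code.

```python
import math


def evidence(weights, bias, x):
    """Weighted sum of the inputs plus one bias per class.

    weights[i][j] plays the role of W_{i,j}, bias[i] of b_i, x[j] of pixel j.
    """
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]


def softmax(scores):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy problem: 3 classes, 4 "pixels".
W = [[1.0, 0.0, -1.0, 0.5],
     [0.0, 2.0, 0.0, 0.0],
     [-1.0, 0.0, 1.0, 0.0]]
b = [0.1, 0.0, -0.1]
x = [1.0, 0.5, 0.0, 1.0]

scores = evidence(W, b, x)
probs = softmax(scores)
print(scores)  # evidence per class
print(probs)   # probabilities per class, summing to 1
```

The class with the most evidence gets the highest probability, but every class keeps a probability greater than zero.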

![Illustration of The Computational Graph](images/Convolution1.PNG)

We can "vectorize" this procedure, turning it into a matrix multiplication and a vector addition. This is helpful for computational efficiency.

![Illustration of The Computational Graph](images/Convolution2.PNG)

\begin{equation} 
y = softmax(Wx+b)
\end{equation}

# Implementing the Regression

Matrix multiplication is computationally intensive and takes a lot of time. That is why the math is not done in Python itself but in better-performing languages such as C++. Libraries like NumPy already work this way, running each expensive operation in compiled code. With TensorFlow we don't hand over every single computation one at a time. Instead, we first describe the already mentioned computation graph, and the whole graph is then executed outside Python, for example on one or more GPUs or distributed across several machines.
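
The idea of describing a computation first and running it later can be sketched with a toy graph in plain Python. The classes `Placeholder`, `Add` and `Multiply` below are hypothetical and written only for this illustration; they are not how TensorFlow is implemented, and nothing here runs outside Python.

```python
class Placeholder:
    """A named input whose value is supplied only when the graph is run."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]


class Add:
    """Graph node that adds the results of two other nodes."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) + self.right.evaluate(feed)


class Multiply:
    """Graph node that multiplies the results of two other nodes."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) * self.right.evaluate(feed)


# Building the graph computes nothing yet; it only records the structure.
a = Placeholder("a")
b = Placeholder("b")
result = Add(Multiply(a, b), a)  # describes a * b + a

# Running the graph with concrete inputs, loosely like a feed_dict.
print(result.evaluate({"a": 2.0, "b": 3.0}))  # 8.0
```

Because the whole description exists before anything is computed, a real system is free to decide where and how to run it.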

Now let's create our model!

In [2]:
import tensorflow as tf


In [3]:
x = tf.placeholder(tf.float32, [None, 784])


$x$ is a placeholder: a value that we supply when we ask TensorFlow to run a computation. We want to be able to pass any number of MNIST images through our model, each flattened into a 784-dimensional vector. We represent this as a 2-D tensor of floating-point numbers with shape [None, 784]. (Here None means that the dimension can be of any length, so we don't fix the number of images.)

In [4]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

W and b are our Variables; they stand for the weights and biases. We create them as tf.Variable objects so that the model can adjust them during training. Both are initialized with zeros, since training is going to change them anyway.

The following lines implement our full model.

In [5]:
logits = tf.matmul(x,W)+b
y = tf.nn.softmax(logits)

Notice that W has a shape of [784, 10] because we want to multiply the 784-dimensional image vectors by it to produce 10-dimensional vectors of evidence for the different classes. b has a shape of [10] so we can add it to the output. Compared with the equation above, we also flipped x and W in the tf.matmul call, so that x can hold many images at once and the shapes work out like this:
- [None x 784] x [784 x 10] + [10] = [None x 10] (one row of 10 values per image)
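
The shape bookkeeping can be checked on a small example in plain Python. The sizes below (2 images, 3 "pixels", 2 classes) are hypothetical stand-ins for [None, 784] and [784, 10], and the helpers `matmul` and `add_bias` are written just for this sketch.

```python
def matmul(a, b):
    """Multiply an [n x k] matrix by a [k x m] matrix (lists of rows)."""
    return [[sum(a_row[t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


def add_bias(rows, bias):
    """Add the same bias vector to every row, as broadcasting would."""
    return [[value + b_j for value, b_j in zip(row, bias)] for row in rows]


x = [[1.0, 2.0, 3.0],  # 2 images, 3 "pixels" each
     [0.0, 1.0, 0.0]]
W = [[1.0, 0.0],       # 3 "pixels" x 2 classes
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -0.5]        # one bias per class

logits = add_bias(matmul(x, W), b)
print(logits)                       # [[4.5, 4.5], [0.5, 0.5]]
print(len(logits), len(logits[0]))  # 2 2: one row per image, one column per class
```

With x on the left, adding more images only adds more rows to the result; W and b stay the same.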

# Training - Cross Entropy

To train the model, we need to keep track of our loss. The loss tells us how far we are from the optimal result. One very common, very nice function to determine the loss of a model is called "cross-entropy". In a rough sense, the cross-entropy measures how inefficient our predictions are for describing the truth.

\begin{equation} 
H_{y'}(y) = - \sum_i{y'_{i} log(y_i)}
\end{equation}
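
A small worked example may make the formula more concrete. The sketch below evaluates it in plain Python for a one-hot true distribution and three made-up predictions over 3 classes; the function name `cross_entropy` and all numbers are assumptions for illustration.

```python
import math


def cross_entropy(true_dist, predicted):
    """-sum_i y'_i * log(y_i); terms where y'_i is 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(true_dist, predicted) if t > 0)


y_true = [0.0, 0.0, 1.0]  # one-hot label: the true class is index 2

confident = [0.05, 0.05, 0.90]  # mostly right
unsure = [0.30, 0.30, 0.40]     # right, but barely
wrong = [0.90, 0.05, 0.05]      # confidently wrong

print(round(cross_entropy(y_true, confident), 4))  # 0.1054
print(round(cross_entropy(y_true, unsure), 4))     # 0.9163
print(round(cross_entropy(y_true, wrong), 4))      # 2.9957
```

The loss is small when the model puts high probability on the true class and grows quickly when it is confidently wrong.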

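Here $y$ is our predicted probability distribution and $y'$ is the true distribution (the one-hot label). To compare our prediction with the truth, we need to give our model access to the labels. For that, we introduce a new placeholder y_.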

In [6]:
y_ = tf.placeholder(tf.float32, [None, 10])

Then we implement our cross entropy function.

In [7]:
# cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
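
The commented-out line takes the logarithm of the softmax output directly, which can break down when a probability becomes so small that it rounds to zero. Computing the loss from the raw logits avoids this. The sketch below shows the idea in plain Python using the log-sum-exp trick; it is an illustration under that assumption, not TensorFlow's actual implementation, and the extreme logits are made up.

```python
import math


def naive_cross_entropy(logits, label_index):
    """Softmax first, then log: fails when a probability underflows to 0."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    prob = exps[label_index] / sum(exps)
    return -math.log(prob)  # math.log(0.0) raises ValueError


def stable_cross_entropy(logits, label_index):
    """Same quantity computed as log(sum_j exp(z_j)) - z_label."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label_index]


extreme = [1000.0, 0.0, 0.0]  # hypothetical, very confident and wrong
print(stable_cross_entropy(extreme, 1))  # 1000.0
try:
    print(naive_cross_entropy(extreme, 1))
except ValueError:
    print("naive version failed: log(0)")
```

For moderate logits both functions agree; they only differ in how they behave at the extremes.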

Because TensorFlow knows the entire graph of our computations, it can automatically use the backpropagation algorithm to efficiently determine how our variables affect the loss we ask it to minimize. We then use GradientDescentOptimizer to optimize our model variables (weights and biases).

In [8]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

The learning rate is 0.5. Gradient descent is a simple procedure in which TensorFlow shifts each variable a little bit in the direction that reduces the cost.
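
To see what "shifting a variable a little bit in the direction that reduces the cost" means, here is a minimal sketch of gradient descent on a single variable in plain Python. The cost function (w - 3)^2, the learning rate of 0.1 and the helper names are made up for this illustration; TensorFlow computes the gradients for us, whereas here the gradient is written out by hand.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly move the value a small step against the gradient."""
    value = start
    for _ in range(steps):
        value = value - learning_rate * grad(value)
    return value


def cost_gradient(w):
    """Gradient of the hypothetical cost (w - 3)^2, which is minimal at w = 3."""
    return 2.0 * (w - 3.0)


w = gradient_descent(cost_gradient, start=0.0, learning_rate=0.1, steps=100)
print(w)  # very close to 3
```

The learning rate controls the step size: too small and training is slow, too large and the steps can overshoot the minimum.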

We can now launch the model in an InteractiveSession.

In [9]:
sess = tf.InteractiveSession()

We first have to create an operation to initialize the variables we created.

In [10]:
tf.global_variables_initializer().run()

Now we run 1000 training steps.

In [11]:
for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

We run "train_step" 1000 times, but we don't feed all 55,000 images through the model at every step. That would take too much time. Instead, each step uses a small so-called batch of 100 random data points from the training set, which is much cheaper.
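
Drawing a random batch can be sketched in plain Python as follows. The helper `sample_batch` and the tiny made-up dataset are assumptions for illustration; mnist.train.next_batch may select its batches differently, so this is not a description of its implementation.

```python
import random


def sample_batch(images, labels, batch_size, rng):
    """Pick `batch_size` random (image, label) pairs, keeping them aligned."""
    indices = rng.sample(range(len(images)), batch_size)
    return [images[i] for i in indices], [labels[i] for i in indices]


# Hypothetical tiny dataset: image i is [i, i] and its label is i % 10.
images = [[float(i), float(i)] for i in range(1000)]
labels = [i % 10 for i in range(1000)]

rng = random.Random(0)  # fixed seed so the run is repeatable
batch_xs, batch_ys = sample_batch(images, labels, batch_size=100, rng=rng)
print(len(batch_xs), len(batch_ys))  # 100 100
```

The important detail is that images and labels are picked with the same indices, so each image stays paired with its own label.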

# Evaluating Our Model
tf.argmax returns the index of the highest entry in a tensor along some axis. We use this function to compare the predicted digit (tf.argmax(y,1)) with the true digit (tf.argmax(y_,1)).

In [12]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

correct_prediction is a list of booleans. We cast them to floating-point numbers and then take the mean.
- [True, False, True, True] -> [1, 0, 1, 1] -> 0.75 (75% accuracy)
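
The same calculation can be written out in plain Python. The predictions and labels below are made up so that they reproduce the [True, False, True, True] example; the helpers `argmax` and `accuracy` are written just for this sketch.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, labels):
    """Fraction of rows where predicted and true class indices agree."""
    correct = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]
    return sum(1.0 if c else 0.0 for c in correct) / len(correct)


# Hypothetical model outputs for 4 images and 3 classes.
predictions = [[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.3, 0.3, 0.4],
               [0.2, 0.5, 0.3]]
# One-hot labels; the second prediction above is wrong.
labels = [[1, 0, 0],
          [0, 0, 1],
          [0, 0, 1],
          [0, 1, 0]]

print(accuracy(predictions, labels))  # 0.75
```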

In [13]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.9182


# Complete program

In [14]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

import tensorflow as tf

# Model input and output
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# Model parameters
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

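# The model: raw class scores (logits) and their softmax probabilities
logits = tf.matmul(x, W) + b
y = tf.nn.softmax(logits)

# loss: this op applies softmax internally, so it must receive the raw logits, not y
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))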

# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.5)
train_step = optimizer.minimize(loss)

# training loop
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# evaluate accuracy on the test set
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy = %s" % sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Accuracy = 0.9057
