$$ \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$

$$ Wx + b = y $$
for each observation. Here $W$ is the weight matrix, $x$ is a single sample (a vector of feature values), $b$ is a bias vector (a per-class offset that does not depend on the input, loosely analogous to a prior), and $y$ is a vector of class membership scores.

The exponential is greater than zero for every $x_i$, so every class receives a positive weight. It also means that increases in evidence stack multiplicatively rather than additively: raising a score by one unit multiplies its unnormalized weight by $e$.
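As a minimal sketch of the exponentiate-then-normalize idea, here is softmax in plain Python using only the standard library. The function name `softmax` and the example scores are illustrative and are not part of the TensorFlow code in this notebook. Subtracting the largest score before exponentiating is a common numerical-stability trick; it cancels out in the ratio and does not change the result.

```python
import math

def softmax(scores):
    """Exponentiate each score, then normalize so the outputs sum to 1."""
    m = max(scores)  # shift by the max for stability; cancels in the ratio
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)

# Adding 1 to a single score multiplies its unnormalized weight by e,
# so the odds of class 2 relative to class 0 grow by a factor of e.
before = softmax([1.0, 2.0, 3.0])
after = softmax([1.0, 2.0, 4.0])
odds_ratio = (after[2] / after[0]) / (before[2] / before[0])
print(odds_ratio)
```

The second half illustrates the multiplicative stacking: an additive change in a score shows up as a multiplicative change in the odds between classes.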

In [3]:
import tensorflow as tf

# Create tensorflow data object
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) # labels are one-hot encoded

# Build Graph #
#-------------#
x = tf.placeholder(tf.float32, shape=[None, 784]) # 'None' lets the batch dimension be of any length
W = tf.Variable(tf.zeros([784, 10])) # one column of weights per class (labels are one-hot, so there are 10 outputs)
b = tf.Variable(tf.zeros([10])) # one bias per class

y = tf.nn.softmax(tf.matmul(x, W) + b) # n x m matrix; each row is a vector of predicted class membership probabilities
# x is an n x p matrix of observations (n samples, p features each)
# W is a p x m weight matrix, where m is the number of classes an observation could belong to
# b holds m biases, one per class; each acts as an input-independent offset, loosely like a prior

# Note that  W is applied to each row in x to return a vector of estimated class membership probabilities
# - the same W is applied to all samples (as is the same b).
# W and b are tuned to minimize the classification error.

y_ = tf.placeholder(tf.float32, [None, 10]) # True classes
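cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
# reduction_indices=[1] sums over the 2nd dimension of y (the classes), giving one loss value per sample
# tf.reduce_mean() then averages over all samples in the batch (mean cross-entropy)
# Caution: tf.log(y) returns -inf wherever y is exactly 0, which can turn the loss into NaN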

train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
# Ask tensorflow to minimize cross-entropy using the gradient descent algorithm, with a learning rate of 0.5.
# Gradient descent shifts each variable a little bit in the direction that reduces the cost.
# There are many other optimization tools in TensorFlow.

# As for how this step is influencing the TensorFlow graph - it adds backpropagation and gradient descent
# operations. The variable returned is an operation that, when run, implements one step of gradient descent.

init = tf.initialize_all_variables() # later TensorFlow releases deprecate this in favor of tf.global_variables_initializer()


Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


$$ H_{y'}(y) = -\sum_i y_i' \log(y_i) $$

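Here $y_i$ is the predicted probability that the observation belongs to class $i$, and $y_i'$ is the true (one-hot) label for class $i$. The sum runs over classes, and the loss is then averaged over the observations in a batch. We want to minimize cross-entropy.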
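To make the loss concrete, here is a small worked sketch in plain Python. It mirrors the `y_ * tf.log(y)` term in the graph code: `true_dist` plays the role of the one-hot labels and `predicted_dist` the role of the softmax output. The function name and the example numbers are made up for illustration.

```python
import math

def cross_entropy(true_dist, predicted_dist):
    """Negative sum over classes of true_i * log(predicted_i).

    Classes with a true weight of 0 contribute nothing and are skipped.
    """
    return -sum(t * math.log(p)
                for t, p in zip(true_dist, predicted_dist) if t > 0)

y_true = [0.0, 1.0, 0.0]        # one-hot: the true class is index 1
confident = [0.05, 0.90, 0.05]  # most probability mass on the true class
unsure = [0.40, 0.30, 0.30]     # little probability mass on the true class

loss_confident = cross_entropy(y_true, confident)
loss_unsure = cross_entropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

With a one-hot label the loss reduces to the negative log of the probability assigned to the true class, so the confident prediction yields the smaller loss.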


In [7]:
# Run Session #
#-------------#
sess = tf.Session()
sess.run(init) # initialize variables
for i in range(2000): # 2000 gradient descent steps
    batch_xs, batch_ys = mnist.train.next_batch(100) # Gets a subset of data - 100 samples
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

In [8]:
# Evaluate model #
# Append an evaluation step to the graph, then run it
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) # Boolean vector: was each prediction correct?
# tf.argmax(t, 1) returns the index of the largest value along axis 1, i.e. within each row
# Remember that y is a matrix in which each row holds one sample's estimated class membership
# probabilities, so the argmax of a row is the predicted class
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x:mnist.test.images, y_: mnist.test.labels}))

0.9203
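The evaluation step can be illustrated without TensorFlow. The sketch below uses three made-up samples and a hypothetical `argmax` helper to reproduce the same logic: take the index of the largest entry in each row, compare prediction against label, then average the resulting booleans.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

# Made-up predicted probabilities and one-hot labels for three samples
predicted = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]

# True where the predicted class matches the labelled class
correct = [argmax(p) == argmax(t) for p, t in zip(predicted, labels)]

# Booleans count as 0/1, so their mean is the fraction classified correctly
accuracy = sum(correct) / len(correct)
print(correct, accuracy)
```

Two of the three toy samples are classified correctly here; the notebook's figure of 0.9203 is the same quantity computed over the MNIST test set.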
