******************
******************

In this notebook I will use Keras to build a multilayer perceptron (MLP), i.e. a fully connected feed-forward neural network (not a convolutional one), to classify the MNIST dataset of hand-written digits.

******************
******************

Import keras

In [1]:
from keras.datasets import mnist # mnist is the dataset we use here

Using TensorFlow backend.


In [2]:
from keras.models import Model # used to specify and train the NN

In [3]:
from keras.layers import Input, Dense # the two types of NN layers we will use are Input and Dense

In [4]:
from keras.utils import np_utils 
# utilities for one-hot encoding of ground truth values

In [5]:
# example of one-hot encoding (as produced by np_utils.to_categorical):
# "0" => [1, then 0 nine times]
# "1" => [0, 1, then 0 eight times]
# ...
# "9" => [0 nine times, then 1]
# ground truth values = labels that are known to be correct
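
One-hot encoding can also be written in a few lines of plain NumPy. The sketch below follows the convention where class k gets a 1 at index k (so "0" maps to [1, 0, ..., 0]). The helper name `one_hot` is made up for illustration and is not a Keras function; it assumes integer labels in 0..num_classes-1.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row.

    Hypothetical helper for illustration; assumes integer labels in
    the range 0 .. num_classes-1.
    """
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype="float32")[labels]

encoded = one_hot([0, 1, 9], 10)
print(encoded)
```

Each row contains exactly one 1, so summing a row always gives 1.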

Define the hyperparameters

In [6]:
batch_size=128 # mini-batch gradient descent: each parameter update uses a batch of 128 training examples,
# rather than batch gradient descent, which uses the entire data set of m examples for every update
num_epochs=30 # the number of epochs, i.e. full passes over the training set, before terminating (each epoch consists of many mini-batch updates); we use the final theta
hidden_size=512 # the number of neurons in each hidden layer (both have the same number)
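
To make the relation between batch size, epochs and gradient descent updates concrete, here is a small calculation. It assumes 54000 examples are used for fitting, i.e. the 60000 MNIST training images minus the 10% held out for validation later in this notebook. The variable names are chosen for illustration only.

```python
import math

batch_size = 128
num_epochs = 30
num_fit_examples = 54000  # assumption: 60000 images minus a 10% validation split

# One epoch = one pass over all examples, in mini-batches of batch_size.
# The last batch may be smaller, hence the ceiling.
updates_per_epoch = math.ceil(num_fit_examples / batch_size)
total_updates = updates_per_epoch * num_epochs

print(updates_per_epoch)  # 422
print(total_updates)      # 12660
```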

Preprocess the MNIST data. This is very easy with Keras, which has a built-in interface for fetching the data from a remote server and extracting it directly into NumPy arrays.

In [7]:
num_train=60000 # =m the number of training examples in MNIST

In [8]:
num_test=10000 # =m_test the number of test examples in MNIST

In [9]:
height, width, depth = 28, 28, 1 # MNIST images are 28x28 and greyscale

In [10]:
num_classes=10 # =k=10 classes, one class for each digit

In [11]:
(X_train, y_train), (X_test, y_test) = mnist.load_data() # fetch the MNIST data via keras.datasets.mnist and split it into...
# ... training and test sets

In [12]:
# aside:
# X_train, y_train, X_test and y_test are all NumPy arrays
type(X_train) # e.g. X_train is a numpy.ndarray

numpy.ndarray

In [13]:
X_train.shape  # (60000, 28, 28): what does this mean?
# 60000 elements (one per image)
# each element is an array of 28 rows
# each row has 28 pixel values
# np.array([1,2,3]) is a 1-D array, i.e. a vector with elements 1, 2 and 3
# np.array([[1,2],[3,4],[5,6]]) is an array of 3 elements
# each element, i.e. [1,2], [3,4] and [5,6], is an array of two elements
# this is AKA a 3x2 matrix
# np.array([ [ [1,2],[3,4],[5,6] ], [ [1,2],[3,4],[5,6] ], [ [1,2],[3,4],[5,6] ] ])
# is an array of 3 elements
# each element is an array of 3 elements
# each element of that array has 2 elements, so the shape is (3, 3, 2)
## arrays of [] are vectors
## arrays of [[]] are matrices
## arrays of [[[]]] can be thought of as vectors of matrices
## in this case. In other cases it might be a matrix of vectors

(60000, 28, 28)
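
The small arrays from the comments above can be checked directly. This is only a sketch with NumPy; the names `vector`, `matrix` and `stack` are made up here.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4], [5, 6]])
stack = np.array([matrix, matrix, matrix])  # three copies of the 3x2 matrix

print(vector.shape)  # (3,)
print(matrix.shape)  # (3, 2)
print(stack.shape)   # (3, 3, 2)

# Indexing the first axis of the stack gives back one matrix,
# just as X_train[i] gives back one 28x28 image.
print(stack[0].shape)  # (3, 2)
```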

In [14]:
print(X_train) # printing array X_train

[[[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]

 [[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]

 [[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]

 ..., 
 [[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]

 [[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]

 [[0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  ..., 
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]
  [0 0 0 ..., 0 0 0]]]


In [15]:
X_train=X_train.reshape(num_train, height*width) 
# reshape X_train from (60000,28,28) == 60000x28x28
# to (num_train, height*width) = (60000, 784)

# more or less, this treats X_train as one long run of 60000x28x28 values,
# takes the first 784 values (image 0, row by row) and makes them the
# first row of the new X_train, then takes the next 784 values and makes
# them the second row, and so on.

# X_train is NOW a reshaped array of examples
# where each example is a vector of integers 0-255 laid out like this:
# for an example index i in 0..59999, X_train[i]
# is a vector = [ pixel (1,1) intensity, ..., pixel (1,n), pixel (2,1), ...
# ... pixel (2,n), ..., pixel (n,1), ..., pixel (n,n) ] with n=28
# aside note np.arange(6) makes an array = array([0,...,5])
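
Here is a tiny version of the same reshape, using 3x3 "images" instead of 28x28 so the result is easy to inspect. The array contents are made up with np.arange; the point is only to show where each pixel lands after flattening (NumPy's default row-major order is assumed).

```python
import numpy as np

height, width = 3, 3  # tiny stand-in for 28x28
images = np.arange(2 * height * width).reshape(2, height, width)  # 2 fake images

flat = images.reshape(2, height * width)

# Pixel (r, c) of image i ends up at column r*width + c of row i.
i, r, c = 1, 2, 0
print(images[i, r, c], flat[i, r * width + c])

# Reshaping back recovers the original images exactly.
restored = flat.reshape(2, height, width)
print(np.array_equal(restored, images))
```

So flattening loses no information; it only drops the explicit row/column layout, which a fully connected network does not use anyway.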

In [16]:
print(X_train)

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


In [17]:
type(X_train)

numpy.ndarray

In [18]:
X_train.shape

(60000, 784)

In [19]:
X_test=X_test.reshape(num_test, height*width) # reshape X_test from (10000,28,28) == 10000x28x28
# to (10000,784) == 10000x784

In [20]:
# convert both X_train and X_test from uint8 to float32
# the in-place division "/=" below produces floats, which cannot be
# stored back into an integer (uint8) array, so we convert first
X_train=X_train.astype('float32') 
X_test=X_test.astype('float32')

In [21]:
X_train/=255 # data was between 0-255 normalise it so it is between 0-1
X_test/=255 # data was between 0-255 normalise it so it is between 0-1
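
The conversion and scaling can be tried on a handful of made-up pixel values. The sketch below also shows why the conversion to float32 comes first: on the NumPy versions I am aware of, in-place true division on a uint8 array raises a TypeError because the float result cannot be written back into an integer array.

```python
import numpy as np

pixels = np.array([0, 128, 255], dtype=np.uint8)  # made-up pixel values

# In-place true division on an integer array fails, because the float
# result cannot be written back into a uint8 array.
try:
    pixels /= 255
    inplace_failed = False
except TypeError:
    inplace_failed = True
print(inplace_failed)

# Converting to float32 first makes the in-place division valid.
scaled = pixels.astype("float32")
scaled /= 255
print(scaled.min(), scaled.max())
```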

One-hot encode labels 

In [22]:
y_train.shape # this is a vector of 60000 integer labels (digits 0-9)

(60000,)

In [23]:
# y_train examples
y_train

array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

In [24]:
# one-hot encode y_train and y_test
# (using to_categorical from np_utils)
# so that if the ith element of y_train/y_test is 3 and there are 4 classes (0-3), then
# the ith element of Y_train/Y_test is [0,0,0,1]
Y_train=np_utils.to_categorical(y_train, num_classes)
Y_test=np_utils.to_categorical(y_test, num_classes)

In [25]:
Y_train[5] # e.g. here is the element at index 5 of Y_train (its label is 2)

array([ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
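
Going the other way, from a one-hot row back to the digit, is a single argmax. A minimal sketch using the row printed above (typed in by hand here rather than read from Y_train):

```python
import numpy as np

# The one-hot row shown above for index 5.
row = np.array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])

label = int(np.argmax(row))  # position of the single 1
print(label)  # 2
```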

Making NN

In [26]:
inp=Input(shape=(height*width,)) # our input layer is defined as inp
# Input, from keras.layers, creates a symbolic placeholder tensor for the data;
# it has no weights of its own (the weights live in the Dense layers)
# we choose an input layer of size 784, one value per pixel

In [27]:
inp.shape # inp has shape (None, 784): None is the batch size (not fixed yet), 784 the number of input features

TensorShape([Dimension(None), Dimension(784)])

In [28]:
type(inp) # inp is a tensor

tensorflow.python.framework.ops.Tensor

In [29]:
# note: In 2-class logistic regression, the predicted probs are as follows
# using the sigmoid function:
# p(y=0) = exp(-bX)/(1+exp(-bX))
# p(y=1) = 1/(1+exp(-bX))
# for multiclass logistic regression with K classes the predicted probs,
# using the softmax function are:
# P(y=k) = exp(b_kX)/Sum(j in [1..K])(exp(b_jX))
hidden_1 = Dense(hidden_size, activation='relu')(inp) # First hidden ReLU layer 
# == first hidden layer which has an activation=max(0,input)='if negative set to zero'
hidden_2 = Dense(hidden_size, activation='relu')(hidden_1) # Second hidden ReLU layer
# == second hidden layer which has an activation=max(0,input)='if negative set to zero'
out = Dense(num_classes, activation='softmax')(hidden_2) # Output softmax layer
# == output layer whose activation is softmax, the generalisation of the logistic function to many outputs
# Softmax is often used for classification in the output layer because it
# provides probabilities for the different classes which sum to 1.
# Sigmoids can be used as well (e.g. when several labels can be true at once),
# but their outputs do not sum to 1; tanh outputs lie in (-1, 1),
# so they are not probabilities at all.
# Softmax is typically used only in the output layer of a NN
# to represent a probability distribution over the possible outcomes
# of the network. ReLU (rectified linear unit) is a popular
# activation function, esp. in deep convolutional NNs, for introducing
# non-linearity into the incoming activations.
# Linear activations are used in the output layer if your NN is for
# regression instead of classification.
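
The two activation functions are short enough to write out in NumPy. This is a sketch for intuition, not how Keras implements them internally; the input scores are made up. The last lines check the link to the two-class formulas: softmax over the scores [z, 0] gives the sigmoid of z for the first class.

```python
import numpy as np

def relu(z):
    # max(0, z) applied element-wise
    return np.maximum(0, z)

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, -1.0, 0.5])  # made-up pre-activations
print(relu(scores))
probs = softmax(scores)
print(probs, probs.sum())

# Two-class special case: softmax([z, 0])[0] equals sigmoid(z).
z = 1.5
two_class = softmax(np.array([z, 0.0]))
sigmoid = 1.0 / (1.0 + np.exp(-z))
print(np.isclose(two_class[0], sigmoid))
```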

In [30]:
model= Model(inputs=inp, outputs=out) # specify the input and output layers
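
It is worth counting how many trainable parameters this model has. The sketch below assumes each Dense layer has one weight per input-output pair plus one bias per output neuron (to my knowledge the Keras default), and the helper `dense_params` is made up for illustration. The totals can be compared against what model.summary() reports.

```python
def dense_params(n_in, n_out):
    # one weight per input-output pair, plus one bias per output neuron
    return n_in * n_out + n_out

layer_sizes = [784, 512, 512, 10]  # input, two hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [401920, 262656, 5130]
print(sum(per_layer))  # 669706
```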

Loss function = cross-entropy loss rather than least squared error.
The cross-entropy loss is better for probabilistic tasks, i.e. ones with
logistic sigmoid/softmax output neurons, BECAUSE of the
cross-entropy derivation: it aims only to maximise the model's confidence
in the correct class, and is not concerned with the distribution of probabilities
over the other classes, while the squared error loss would dedicate equal attention
to getting all of the other class probabilities as close to zero as possible. This
is due to the fact that incorrect classes, i.e. classes i with target y_i = 0,
eliminate the respective neuron's output from the loss function.
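
A small numerical check of this claim, with made-up numbers: two predictions give the same probability to the correct class but spread the rest differently. The cross-entropy is identical for both, while the squared error is not. The function names are illustrative.

```python
import numpy as np

target = np.array([0.0, 1.0, 0.0])  # one-hot: class 1 is correct

# Two made-up predictions with the same probability for the correct class
# but different probabilities for the incorrect ones.
pred_a = np.array([0.20, 0.60, 0.20])
pred_b = np.array([0.35, 0.60, 0.05])

def cross_entropy(y, p):
    # terms with y_i = 0 vanish, so only the correct class contributes
    return -np.sum(y * np.log(p))

def squared_error(y, p):
    return np.sum((y - p) ** 2)

print(cross_entropy(target, pred_a), cross_entropy(target, pred_b))  # identical
print(squared_error(target, pred_a), squared_error(target, pred_b))  # different
```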

The optimisation algorithm will revolve around some form of gradient descent, but with a
chosen or adaptive learning rate. The Adam optimiser typically performs well.

The classes are balanced, not skewed, so the accuracy (the proportion of inputs classified
correctly) is a good measure of how well the NN is working.
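
Accuracy is computed by taking the most probable class for each example and comparing it with the true label. A minimal sketch with made-up probabilities for 4 examples and 3 classes:

```python
import numpy as np

# Made-up predicted probabilities for 4 examples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
true_labels = np.array([0, 1, 2, 1])

predicted = np.argmax(probs, axis=1)          # most probable class per example
accuracy = np.mean(predicted == true_labels)  # fraction classified correctly
print(predicted, accuracy)
```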

In [31]:
model.compile(loss='categorical_crossentropy', # we use the cross-entropy loss function
              optimizer='adam', # we use the Adam optimiser
              metrics=['accuracy']) # report the accuracy

Call the training algorithm with the batch size and epoch count, holding out 10% of the data for cross-validation (cv).
Verbosity provides real-time pretty printing of the training algorithm's progress.

In [None]:
model.fit(X_train, Y_train, # train model w training set
          batch_size=batch_size, epochs=num_epochs,
          verbose=1, # nice data vis
          validation_split=0.1) # hold out 10% of the data for cv
# How is the cv data set used?
# The algorithm trains on the remaining 90% during each epoch.
# At the end of each epoch the loss and accuracy are computed on the
# cv data set, which the model has not trained on. We use the cv set rather
# than the test set for this so that the test set stays untouched:
# the cv set lets us pick values of particular hyperparameters
# in our model without biasing the final evaluation on the
# test set

Train on 54000 samples, validate on 6000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
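
The 54000/6000 split in the output above can be reproduced by hand. As far as I know, Keras takes the validation samples from the end of the arrays passed to fit, before any shuffling; the sketch below mimics that with a stand-in index array in place of X_train. The names `split_at`, `X_fit` and `X_val` are made up.

```python
import numpy as np

num_examples = 60000
validation_split = 0.1

# Stand-in data: only the indices matter for illustrating the split.
X = np.arange(num_examples)

split_at = int(num_examples * (1 - validation_split))
X_fit, X_val = X[:split_at], X[split_at:]

print(len(X_fit), len(X_val))  # 54000 6000
```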

In [35]:
model.evaluate(X_test, Y_test, # finally evaluate the model on the test set, NOT the training/cv set
               verbose=1) # print progress; returns [loss, accuracy]



[0.11312305948414719, 0.97999999999999998]

In [None]:
# 98% accuracy is pretty good!
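
To put the accuracy in terms of individual images, here is the corresponding count on the 10000 test examples, using the rounded accuracy value from the output above:

```python
num_test = 10000
test_accuracy = 0.98  # rounded value reported by model.evaluate above

num_correct = round(test_accuracy * num_test)
num_wrong = num_test - num_correct
print(num_correct, num_wrong)  # 9800 200
```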