# Shallow Net in Keras


Build a shallow neural network to classify MNIST digits.


#### Install prerequisites

First steps (these instructions are for Mac or Linux): install a recent version of Python, plus the packages keras, numpy, matplotlib and jupyter.

#### Start by setting a seed to get reproducible results

In [None]:
%matplotlib inline
import numpy as np

In [None]:
np.random.seed(42)
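
To see what the seed buys us, here is a small sketch (the helper name `draw` is made up for this illustration). Seeding NumPy's global generator with the same value should give the same sequence of pseudo-random numbers every time.

```python
import numpy as np

def draw(seed, n=3):
    """Seed NumPy's global generator, then return n pseudo-random floats."""
    np.random.seed(seed)
    return np.random.rand(n)

first = draw(42)
second = draw(42)      # same seed, so the same numbers come out
different = draw(7)    # a different seed gives a different sequence

print(np.array_equal(first, second))
print(np.array_equal(first, different))
```

Keep in mind that this only fixes NumPy's own generator. Depending on the backend, Keras may draw its initial weights from a separate generator, so runs can still differ a little.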

#### Import the prerequisites

Keras is a high-level, batteries-included neural network library created by François Chollet at Google.

Among many other things, it wraps an API similar to scikit-learn's around backends such as Theano or TensorFlow, which makes it easy to get started.

In [None]:
import keras
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (7,7) # Make the figures a bit bigger
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

#### Load Data

In [None]:
(X_train,y_train),(X_test,y_test) = mnist.load_data()

In [None]:
print("X_train original shape", X_train.shape) # handwritten images of digits
print("y_train original shape", y_train.shape) # actual value of those digits

Let's look at some of the training examples.

In [None]:
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Class {}".format(y_train[i]))

In [None]:
X_train[0]

#### Let's do some preprocessing now

In [None]:
# Flatten the input so that each 28x28 image becomes a single 784-dimensional vector.
X_train = X_train.reshape(60000,784).astype('float32')
X_test = X_test.reshape(10000,784).astype('float32')

In [None]:
# Rescale pixel intensities from the 0-255 range to the 0-1 range
X_train /= 255
X_test /= 255

In [None]:
print("Training matrix shape", X_train.shape)
print("Testing matrix shape", X_test.shape)

The loss functions used below expect one-hot encoded labels, so we convert the target vectors to one-hot format, i.e.

0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

etc.
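
If you want to see how that mapping works, here is a rough NumPy sketch of one-hot encoding. The `one_hot` helper is hypothetical and written just for this illustration. In the notebook itself we use `keras.utils.to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows that hold a single 1 at the label's index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row i gets a 1 in column labels[i]; everything else stays 0.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([0, 1, 2, 5], num_classes=10)
print(example)
```

Each row has exactly ten entries, one per digit class, and sums to 1.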

In [None]:
X_train[0] # it is no longer a two dimensional image but a one dimensional array

In [None]:
nbr_classes = 10
y_train = keras.utils.to_categorical(y_train, nbr_classes)
y_test = keras.utils.to_categorical(y_test, nbr_classes)

In [None]:
y_train[0] # the label 5 is now one-hot encoded

#### Let's start building our Lego model, aka neural network, now

While the number of features/classes in your data provide constraints, you can determine all the other aspects of model structure: number of layers, size of layers, the nature of the connections between the layers, etc. (And if that didn't make sense, Keras is a great way to experiment with it!)

In [None]:
# We're going to define our model in the most common way: as a sequential stack of layers. 
# The alternative is as a computational graph, but we're going to stick to Sequential() here.
model = Sequential()

In [None]:
model.add(Dense(64, activation= 'sigmoid', input_shape = (784,))) 
# An "activation" is just a non-linear function applied to the output
# of this layer. Here, with a "sigmoid",
# we squash every value into the range between 0 and 1.
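
As a quick check of what a sigmoid does, here is a minimal NumPy sketch (the input values are arbitrary, picked only for illustration). Large negative inputs should land close to 0, large positive inputs close to 1, and an input of 0 exactly at 0.5.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
a = sigmoid(z)
print(a)
```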

In [None]:
model.add(Dense(10,activation = 'softmax')) 
# This special "softmax" activation, among other things,
# ensures the output is a valid probability distribution, that is,
# its values are all non-negative and sum to 1.
# Before the softmax, each of the 10 output units holds an unnormalized score
# of how likely the input is to belong to a particular class.
# Softmax normalizes these scores so that the output represents a probability for every class.
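
Here is a small NumPy sketch of softmax on a handful of made-up scores, just to show the normalization. Subtracting the maximum first is a common trick to keep the exponentials numerically stable and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Convert unnormalized scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # numerical stability; result is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])   # arbitrary example scores
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score still gets the largest probability; softmax only rescales.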

#### Let's see what model we created

In [None]:
model.summary() # we made a fully connected (dense) network

In [None]:
64*10+10 # parameters in the output layer: 64 weights for each of the 10 units, plus 10 biases
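
The same arithmetic works for every dense layer: one weight per input per unit, plus one bias per unit. A small sketch (the `dense_params` helper is hypothetical) for the two layers defined above:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(784, 64)   # input pixels -> hidden layer
output = dense_params(64, 10)    # hidden layer -> output layer
total = hidden + output
print(hidden, output, total)     # 50240 650 50890
```

These numbers should line up with the "Param #" column printed by `model.summary()`.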

#### Configure model

Now we need to tell this network how to learn.

Keras is built on top of TensorFlow, which allows you to define a computation graph in Python and then compiles and runs it efficiently on the CPU or GPU without the overhead of the Python interpreter.

When compiling a model, Keras asks you to specify your loss function and your optimizer. We'll start with mean squared error as the loss and then try categorical cross-entropy, a loss function well suited to comparing two probability distributions.

Here our predictions are probability distributions across the ten different digits (e.g. "we're 80% confident this image is a 3, 10% sure it's an 8, 5% it's a 2, etc."), and the target is a probability distribution with 100% for the correct category, and 0 for everything else. The cross-entropy is a measure of how different your predicted distribution is from the target distribution. See Wikipedia for more detail.
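
To make the two losses concrete, here is a rough NumPy sketch that scores the "80% sure it's a 3" prediction against a one-hot target, and compares it with a prediction that spreads its belief evenly over all ten digits. The helper functions are written for this illustration and are not the Keras implementations; the remaining 5% of the confident prediction is split evenly over the other seven digits as an assumption.

```python
import numpy as np

def categorical_crossentropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    predicted = np.clip(predicted, eps, 1.0)   # avoid log(0)
    return float(-np.sum(target * np.log(predicted)))

def mean_squared_error(target, predicted):
    """Average squared difference across the ten outputs."""
    return float(np.mean((target - predicted) ** 2))

target = np.zeros(10)
target[3] = 1.0                      # the true digit is a 3

confident = np.full(10, 0.05 / 7)    # leftover 5% split over seven digits (assumption)
confident[3], confident[8], confident[2] = 0.8, 0.1, 0.05

uniform = np.full(10, 0.1)           # no idea which digit it is

ce_confident = categorical_crossentropy(target, confident)
ce_uniform = categorical_crossentropy(target, uniform)
mse_confident = mean_squared_error(target, confident)
mse_uniform = mean_squared_error(target, uniform)
print(ce_confident, ce_uniform)
print(mse_confident, mse_uniform)
```

Both losses are lower for the confident, correct prediction. With a one-hot target, the cross-entropy reduces to minus the log of the probability assigned to the correct class.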

The optimizer helps determine how quickly the model learns and how resistant it is to getting "stuck" or "blowing up". We won't discuss this in too much detail; "adam" is often a good choice (co-developed here at U of T), although the first compile call below uses plain SGD.
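
To get a feel for "learning quickly" versus "blowing up", here is a toy sketch of the gradient descent update on a one-parameter loss, `(w - 3)**2`. This is plain gradient descent on a made-up function, not the mini-batch SGD that Keras runs, and the learning rates are chosen only to show the three behaviours.

```python
def gradient_descent(grad_fn, w, learning_rate, steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

def grad(w):
    """Gradient of the toy loss (w - 3)**2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

slow = gradient_descent(grad, 0.0, learning_rate=0.01, steps=50)
fast = gradient_descent(grad, 0.0, learning_rate=0.1, steps=50)
diverged = gradient_descent(grad, 0.0, learning_rate=1.1, steps=50)
print(slow, fast, diverged)
```

The small learning rate is still far from 3 after 50 steps, the medium one is essentially there, and the large one overshoots further on every step.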

In [None]:

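# Start with mean squared error and plain SGD.
# Recent Keras versions name the argument learning_rate; older ones call it lr.
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01), metrics=['accuracy'])

# Alternatives to try afterwards:
#model.compile(loss='categorical_crossentropy', optimizer='adam')

#model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.01), metrics=['accuracy'])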


#### Train the model

In [None]:
model.fit(X_train,y_train,batch_size=128,epochs=20,verbose=1,validation_data=(X_test,y_test))

In [None]:
# Now let's evaluate the final performance. evaluate() returns the loss value and
# any other metrics we asked for when we compiled the model. In our case, we asked for accuracy.
score = model.evaluate(X_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

This hints at one of the dangers of neural networks: overfitting. We've been careful here to hold out a test set and measure performance with that, but it's a small set, and 89% accuracy seems high to me, so I wouldn't be surprised if there was some overfitting going on. You could work on that by adding dropout (which is built into Keras). That's the neural network equivalent of the regularization our logistic regression classifier uses.
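
For intuition about what dropout does, here is a minimal NumPy sketch of "inverted" dropout applied to a batch of fake hidden activations. The `dropout` helper is hypothetical and written for this illustration; in Keras you would add a `keras.layers.Dropout` layer instead of writing this yourself.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out roughly a fraction `rate` of units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where a unit is kept
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 64))          # fake activations, all equal to 1
dropped = dropout(hidden, rate=0.5, rng=rng)
print(np.unique(dropped), dropped.mean())
```

About half of the units are zeroed on each pass, which stops the network from relying too heavily on any single unit. Dropout is only applied during training, not when evaluating or predicting.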