# Digit Recognition with TensorFlow and Keras
This notebook illustrates how to use TensorFlow and Keras to build and train a Convolutional Neural Network (CNN) to perform digit recognition.

<br>
<hr>
<br>

<img height="60" src="https://www.tensorflow.org/images/tf_logo_social.png" />

[**TensorFlow**](https://tensorflow.org):
> An end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.

<img height="60" src="https://keras.io/img/logo-k-keras-wb.png" />

[**Keras**](https://keras.io/): 
> A deep-learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.

## The Goal
1. Create our own multi-layer neural network for digit recognition using Keras and TensorFlow.
2. Train the network on the MNIST training data.
3. Use the trained network to classify new digits that are presented at its inputs.

![picture](https://miro.medium.com/max/700/1*XdCMCaHPt-pqtEibUfAnNw.png)

## What is MNIST
MNIST stands for [Modified National Institute of Standards and Technology](https://en.wikipedia.org/wiki/MNIST_database).  It is a large database of handwritten digits that is commonly used for training various image processing systems.  The handwritten digits look like this:

![picture](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

The dataset is a blend of digit images taken from handwritten notes from Census Bureau employees and high school students.  It is one of the most common datasets used for image classification and it is accessible from many different sources. TensorFlow and Keras allow this dataset to be imported directly through their API.

The MNIST database contains 60,000 training images and 10,000 testing images. Here we separate the groups according to training vs testing. These groups are further subdivided into images and labels.

In [None]:
import tensorflow as tf
# The x arrays hold the greyscale image data
# The y arrays hold the actual digit labels
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(f"Training images: {len(x_train):10}")
print(f"Training labels: {len(y_train):10}")
print(f"Testing images:  {len(x_test):10}")
print(f"Testing labels:  {len(y_test):10}")

## View a random image from the dataset.

Note the dimensions of each image: 28 x 28 pixels with a single greyscale channel (28 x 28 x 1).

In [None]:
import random
import matplotlib.pyplot as plt
%matplotlib inline

index = random.randint(0, len(x_train)-1)
print(f'The selected index is {index}')
print(f'The label of this data is {y_train[index]}')
plt.imshow(x_train[index], cmap='Greys')

## What is the "shape" of the data?
The x and y datasets are represented by `numpy.ndarray` objects.  An `ndarray` is a (usually fixed-size) multidimensional container of items of the same type and size. The number of dimensions and items in an array is defined by its shape, which is a tuple of N non-negative integers that specify the size of each dimension.

In [None]:
print(f"Type of data: {type(x_train)}")
print(f"Shape of data: {x_train.shape}")

Before passing the data to Keras, we need to reshape the input from 3 dimensions to 4, because the `Conv2D` layer expects an explicit depth (channel) axis.  The four dimensions are:
- number of images (60000)
- height of each image (28)
- width of each image (28)
- depth of each image (1)

In [None]:
# Reshape all input image data (x) into 4 dimensions
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)

# Convert input pixel data to float instead of int, because we'll be dividing
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Normalize the Greyscale values by dividing by max Greyscale value.  
x_train /= 255.0
x_test /= 255.0

print(f"New shape of x_train: {x_train.shape}")
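
The reshape and normalization steps can be checked on a small stand-in array without downloading MNIST. The sketch below is only an illustration: `fake_images` is a hypothetical random array, not real data, and `np.expand_dims` is shown as an equivalent way to add the depth axis.

```python
import numpy as np

# Hypothetical stand-in for x_train: 5 random 28x28 greyscale images (values 0-255)
rng = np.random.default_rng(seed=0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Two equivalent ways to add a trailing depth (channel) axis
with_reshape = fake_images.reshape(fake_images.shape[0], 28, 28, 1)
with_expand_dims = np.expand_dims(fake_images, axis=-1)

# Convert to float and scale the pixel values into the range [0, 1]
normalized = with_reshape.astype("float32") / 255.0

print(with_reshape.shape)                             # (5, 28, 28, 1)
print(np.array_equal(with_reshape, with_expand_dims))  # True
print(normalized.min(), normalized.max())
```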

## Building the Convolutional Neural Network (CNN)
We will create a CNN in Keras, using a Sequential model and adding layers to it:

1. 2D Convolution
2. 2D Maxpool
3. Flatten
4. Dense with ReLU activation
5. Dropout
6. Dense with Softmax activation

The model will look something like this:

<img height="300" src="https://cdn-images-1.medium.com/max/628/1*RM7nqjYSMxkc0QlCxERQnw.png" />


### 2D Convolution
This first network layer detects local features in the images while keeping the computational cost low, because the same small set of weights is reused at every position. Notice the 3x3 matrix that is sliding over the larger 5x5 matrix.  This is called a `kernel`; it performs a weighted sum of the cells in its view as it slides over the larger matrix.

<img height="200" src="https://miro.medium.com/max/535/1*Zx-ZMLKab7VOCQTxdZ1OAw.gif" />

Each output feature combines only the input features from roughly the same location in the image. Whether or not an input feature contributes to a given output is determined directly by whether it lies in the area of the kernel that produced that output. This means the size of the kernel directly determines how many (or few) input features get combined in the production of a new output feature.
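
The sliding weighted sum can be sketched in a few lines of NumPy. This is a minimal illustration, assuming no padding and a stride of 1; `conv2d_valid` is a hypothetical helper, not a Keras function, and the averaging kernel is chosen only to make the numbers easy to check (a trained layer learns its kernel values).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the weighted sums."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]   # the cells currently in the kernel's view
            out[r, c] = np.sum(window * kernel)  # weighted sum of those cells
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input holding the values 0..24
kernel = np.ones((3, 3)) / 9.0                    # illustrative 3x3 averaging kernel
feature_map = conv2d_valid(image, kernel)

print(feature_map.shape)  # (3, 3)
print(feature_map)
```

A 3x3 kernel fits into a 5x5 input at 3 positions along each axis, which is why the output is 3x3.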


### 2D Maxpool
<img height="200" src="https://media.geeksforgeeks.org/wp-content/uploads/20190721025744/Screenshot-2019-07-21-at-2.57.13-AM.png" />

Why this layer?
- Dimension reduction. Reduces the number of parameters to learn and the amount of computation performed in the network.
- The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer. So, further operations are performed on summarized features instead of precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the position of the features in the input image.
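
As a rough illustration of the summarizing step, the sketch below applies 2x2 max pooling to a small hand-written feature map. `max_pool_2x2` is a hypothetical helper that assumes an even height and width; it is not the Keras implementation.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block (even height and width assumed)."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 1, 8],
])
pooled = max_pool_2x2(feature_map)

print(pooled.shape)  # (2, 2)
print(pooled)        # rows: [6 4] and [7 9]
```

The 4x4 input shrinks to 2x2, so every later layer has a quarter as many values to process.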


### [Flatten](https://keras.io/api/layers/reshaping_layers/flatten/)
<img height="280" src="https://miro.medium.com/max/700/1*IWUxuBpqn2VuV-7Ubr01ng.png" />

Flattening converts the data into a 1-dimensional array so that it can be fed to the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is then connected to the final classification stage, made up of fully-connected layers. In other words, we put all the feature values in one line and connect them to the layers that follow.
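
To see how long the flattened vector is, we can follow the shapes through the first layers by hand. The sketch below assumes the layer settings used in the model cell of this notebook (28 filters, a 3x3 kernel with no padding, 2x2 pooling); the variable names are illustrative only.

```python
# Shape arithmetic for one 28 x 28 x 1 input image
height, width, filters = 28, 28, 28
kernel, pool = 3, 2

# A 3x3 kernel without padding trims one pixel from each edge
conv_h, conv_w = height - kernel + 1, width - kernel + 1
# 2x2 pooling halves the height and the width
pool_h, pool_w = conv_h // pool, conv_w // pool
# Flatten lays every remaining value out in one line
flat_length = pool_h * pool_w * filters

print(f"After convolution: {conv_h} x {conv_w} x {filters}")  # 26 x 26 x 28
print(f"After max pooling: {pool_h} x {pool_w} x {filters}")  # 13 x 13 x 28
print(f"After flatten:     {flat_length}")                    # 4732
```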

### [Dense + ReLU](https://keras.io/api/layers/core_layers/dense/)
After the flatten, we have one long feature vector (13 x 13 x 28 = 4,732 elements for the model built below).  The Dense layer has 128 neurons, so it produces 128 outputs.  Each neuron performs the simple function of

    output = sum(inputs * weights) + bias

and then passes the output through an activation function called ReLU (Rectified Linear Unit), which is basically just `max(0, x)`.

<img height="240" src="https://miro.medium.com/max/746/1*umurYoig4DJ2_j3AIjI4Gw.png" />

This activation function passes positive values through unchanged and outputs 0 for everything else.
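
A tiny Dense layer with this activation can be sketched in NumPy as follows. The weights and biases here are made-up numbers chosen for easy arithmetic, and `dense` and `relu` are hypothetical helpers; a trained network learns its weights from the data.

```python
import numpy as np

def relu(x):
    """Pass positive values through unchanged, replace everything else with 0."""
    return np.maximum(0.0, x)

def dense(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, computed for every neuron at once."""
    return inputs @ weights + bias

inputs = np.array([1.0, -2.0, 3.0])   # 3 input features
weights = np.array([[0.5, -1.0],
                    [1.0,  0.5],
                    [-0.5, 1.0]])     # 3 inputs feeding 2 neurons
bias = np.array([0.0, 0.5])

pre_activation = dense(inputs, weights, bias)
activated = relu(pre_activation)

print(pre_activation)  # values -3.0 and 1.5
print(activated)       # values 0.0 and 1.5
```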

### [Dropout](https://keras.io/api/layers/regularization_layers/dropout/)
The Dropout layer randomly sets a fraction of its input units (the drop rate) to 0.0 at each step during training, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - drop_rate) so that the expected sum over all inputs is unchanged.  Note that the Dropout layer is only active while the model is being trained, not when it is used for predicting.
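
The zeroing and rescaling can be sketched as below. This is an illustration of the idea under simple assumptions (one flat array of inputs, a fixed random seed), with a hypothetical `dropout` helper rather than the Keras layer itself.

```python
import numpy as np

def dropout(inputs, rate, rng):
    """Zero each input with probability `rate`, then scale the survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(inputs.shape) >= rate
    return inputs * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=42)
inputs = np.ones(10_000)
dropped = dropout(inputs, rate=0.2, rng=rng)

print(f"Fraction zeroed:    {np.mean(dropped == 0):.3f}")  # close to 0.2
print(f"Mean after dropout: {dropped.mean():.3f}")         # close to 1.0
```

Because the surviving inputs are scaled up, the average input stays roughly the same, so the next layer sees values of a similar size during training and prediction.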

### [Dense + Softmax](https://keras.io/api/layers/activation_layers/softmax/)
<img height="280" src="https://krisbolton.com/images/posts/2018/softmax-activation-function.jpg" />

This is the output layer with 10 outputs, one for each digit.  We need a function that takes the raw output values and transforms them into a probability distribution, and softmax does exactly that.  It is well suited to multi-class classification problems, as it reports a “confidence score” for each class. Since we’re dealing with probabilities here, the scores returned by the softmax function add up to 1.
The predicted class is, therefore, the item in the list with the highest confidence score.
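
A minimal NumPy sketch of softmax is shown below. The ten `scores` are made-up raw outputs, one per digit, and subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into positive values that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in exp() without changing the result
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw outputs for the digits 0-9
scores = np.array([1.0, 0.5, 0.2, 0.1, 0.3, 0.0, 0.4, 4.0, 0.6, 0.2])
probs = softmax(scores)

print(f"Sum of probabilities: {probs.sum():.4f}")
print(f"Predicted digit: {probs.argmax()}")  # 7, the index of the largest score
```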



In [None]:
# We will use the Sequential model from Keras, which lets us build the network layer by layer.
from tensorflow.keras.models import Sequential
# These are the layers that will be in the model
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D

# Construct the model
model = Sequential()
model.add(Conv2D(28, kernel_size=(3,3), input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten()) # Flattening the 2D arrays for fully connected layers
model.add(Dense(128, activation=tf.nn.relu))
model.add(Dropout(0.2))  # For training only: randomly zeros 20% of inputs
model.add(Dense(10,activation=tf.nn.softmax))

# Let's see the summary
print(f"The model output shape is {model.output_shape}")
model.summary()  # summary() prints the table itself and returns None

## Compile and Train the Model
Now that the model (Neural Network) is defined, it must be compiled.  Compiling assigns a specific loss function, as well as an optimizer method which instructs how to perform the training.

For this model we chose the "Adam" (Adaptive Moment Estimation) optimizer instead of standard gradient descent.  Adam combines two ideas: momentum, which gives it speed, and the per-parameter adaptive step sizes of RMSProp, which let it adapt to gradients in different directions. The combination of the two makes it powerful.
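
To make the two ingredients concrete, here is a rough sketch of the Adam update rule applied to a single weight. The function name `adam_minimize`, the toy gradient, and the step count are assumptions for illustration; the default-style values for `beta1`, `beta2`, and `eps` follow the commonly published ones, and Keras applies the same idea to every weight in the network.

```python
import math

def adam_minimize(gradient, w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter function with the Adam update rule."""
    m = 0.0  # running average of gradients (the momentum part)
    v = 0.0  # running average of squared gradients (the RMSProp part)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # correct the bias from starting at 0
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy problem: the error (w - 3)^2 is smallest at w = 3, and its gradient is 2 * (w - 3)
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, steps=500)
print(f"Weight found by Adam: {w_final:.3f}")
```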

### Gradient Descent
This is how models are trained.  Think of each ball as a different method of finding the lowest spot on the loss surface.  Finding the lowest spot means that the network has found the combination of weights that minimizes its prediction error.

<img height="300" src="https://miro.medium.com/max/1432/1*47skUygd3tWf3yB9A10QHg.gif" />
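
The basic idea can be sketched with a single weight and a made-up error function. Everything here (the function `loss`, the starting point, the learning rate) is an assumption chosen for illustration; a real network repeats the same step for every weight.

```python
def loss(w):
    """Toy error function whose lowest spot is at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the toy error function at w."""
    return 2.0 * (w - 3.0)

w = 0.0              # starting guess for the weight
learning_rate = 0.1  # how big a step to take downhill

for step in range(50):
    w -= learning_rate * gradient(w)  # step in the direction that lowers the error

print(f"Weight after training: {w:.4f}")
print(f"Remaining error:       {loss(w):.6f}")
```

Each step moves the weight a little further downhill, so it ends up very close to 3, the lowest spot of this toy function.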

### Selecting the Loss Function
The purpose of loss (error) functions is to compute the quantity that a model should seek to minimize during training.  Since the model's output for the 10 digit categories is a set of probability scores, we select from the family of loss functions that Keras calls "probabilistic" losses.  Within that family we choose "sparse categorical crossentropy" because our digit classes are mutually exclusive (i.e. each sample belongs to exactly one class) and our labels are stored as integers (0 to 9) rather than one-hot vectors.  A 1 can never also be a 2, and a 2 can never also be a 3.
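
As a hedged sketch of what this loss computes, the NumPy function below takes integer labels and predicted probabilities and averages the negative log of the probability assigned to the true class. The helper name and the two-sample batch are made up for illustration, and the small clipping that Keras applies for numerical safety is left out.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probabilities):
    """Mean of -log(probability of the true class); labels are integer class indices."""
    rows = np.arange(len(labels))
    true_class_probs = probabilities[rows, labels]
    return float(np.mean(-np.log(true_class_probs)))

labels = np.array([0, 2])  # true classes of two samples
probabilities = np.array([
    [0.7, 0.2, 0.1],       # confident and correct: small loss
    [0.1, 0.8, 0.1],       # confident but wrong: large loss
])
loss_value = sparse_categorical_crossentropy(labels, probabilities)

print(f"Loss: {loss_value:.4f}")
```

The second sample dominates the result, which is the point: the loss punishes confident wrong answers much more than it rewards confident right ones.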


In [None]:
model.compile(
    optimizer='adam', # Extends the SGD optimizer with momentum and adaptive step sizes
    loss='sparse_categorical_crossentropy', 
    metrics=['accuracy']
    )
model.fit(x=x_train, y=y_train, epochs=5)

## Check the network
In this stage we ask, how does our model perform against the actual test data, which it has never seen during the training phase?

In [None]:
model.evaluate(x_test, y_test)

## Use the network to make a prediction!

In [None]:
# pick a number from 0 - 9999
index = 874

# retrieve the 28 x 28 image data
img = x_test[index]

# give the image to the model. Needs a slight reshape first.
pred = model.predict(img.reshape(1, 28, 28, 1))

print(f"Predicted Digit: {pred.argmax()}")
print("\nDigit  Probabilities")
for i, p in enumerate(pred[0]):
    print(f"{i}:     {p:.4f}")

In [None]:
# Let's verify our model's prediction
plt.imshow(img.reshape(28, 28), cmap='Greys')

## Conclusion
- Keras makes machine learning highly accessible to developers.
- Sample code for trying out the MNIST dataset is everywhere.
- There are many variations of CNNs that can perform with similar accuracy.
- There is big money in designing predictive models that are accurate.

## References
- [Keras Documentation](https://keras.io/about/)
- [Keras Simple MNIST convnet](https://keras.io/examples/vision/mnist_convnet/)
- [Visualizing Gradient Descent Methods](https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c)
- [Gentle Introduction to the Adam Optimization Algorithm for Deep Learning](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
- [Image classification with MNIST](https://towardsdatascience.com/image-classification-in-10-minutes-with-mnist-dataset-54c35b77a38d)
- [Understanding Convolutions for Deep Learning](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1)