<a href="https://colab.research.google.com/github/RubeRad/tcscs/blob/master/MNIST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Notebook for MNIST neural network
This is a Jupyter Python notebook, which is a collection of cells. Each cell is either of type 'markdown' (formatted text, like this cell) or code (python, grey background). The two most important rules of Jupyter Notebooks are:
1. ***SHIFT-ENTER*** will cause the current cell to execute. 
  - For Markdown cells, 'execute' means render the formatting.
  - For Code cells, 'execute' means run the python.
  - Some Code cells take a while to execute; watch for the * to change to a number.
1. Any cell can be edited (double-click into it) and re-executed (SHIFT-ENTER again).
--- 

# Setup
This first code cell includes import statements, which set up libraries of ready-to-go capabilities

In [None]:
import random                      # so we can display random images from the dataset
import numpy as np                 # numpy is everything with arrays and matrices, 'np' is a commonly-used nickname
import matplotlib.pyplot as plt    # most common python graphing library
import tensorflow as tf            # tensorflow is a deep learning framework.
from   tensorflow   import keras   # tensorflow.keras offers higher-level control of common tensorflow tasks
from   sklearn      import metrics # we get the 'confusion_matrix' function from here

The tensorflow keras library includes some datasets for training. The Modified National Institute of Standards and Technology (MNIST) dataset contains 70000 images of handwritten digits, each 28x28 pixels, along with a label for each image specifying which digit it is.

The dataset is pre-divided into 60000 images for training the neural network, and 10000 held back from the training set for independent testing.

In [None]:
# Fetch the dataset, pre-divided into train/test, and labels for each
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

Hold off: don't execute the next cell right away. First use the cells below to investigate the data, then come back, run it, and peek at the data again.

In [None]:
# Just a little data prep here, we scale the pixel values from the range 0..255 to 0..1.0 
# (numpy makes it simple, every value in every 28x28 array gets scaled by the one divisor).
train_images = train_images / 255.0  # rescale from 0..255 to 0..1.0
test_images  = test_images  / 255.0 
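To see what that single division does, here is a minimal sketch on a made-up 2x3 array (the name `tiny` and its values are illustrative, not part of the dataset). numpy applies the one divisor to every element, and the result becomes floating point:

```python
import numpy as np

# A made-up stand-in for one (very small) image with integer pixel values 0..255
tiny = np.array([[0, 51, 255],
                 [102, 204, 153]], dtype=np.uint8)

scaled = tiny / 255.0   # one divisor, applied to every element

print(scaled.dtype)     # the result is floating point, no longer uint8
print(scaled.min(), scaled.max())
print(scaled)
```

Note that the scaling cell stores the result back under the same names, so running it a second time would divide by 255 again.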

## How the data looks to Python

What did we just get? 

In [None]:
train_images.shape

In [None]:
ntrain = len(train_images)
ntrain

In [None]:
train_images[0]

## Exercise
Can you tell from the layout of the nonzero numbers what digit that is? Try looking at `train_images[3]` or `test_images[2]`.
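If the wall of numbers is hard to read, one option is to print a character per pixel instead. The helper below is a hypothetical sketch (`show_as_text` is not part of numpy or keras), demonstrated on a made-up 5x5 array rather than a real MNIST image:

```python
import numpy as np

def show_as_text(img, threshold=0):
    """Return the image as lines of text: '#' where a pixel is above threshold, '.' elsewhere."""
    return '\n'.join(''.join('#' if v > threshold else '.' for v in row) for row in img)

# A made-up 5x5 'image' that looks like the digit 1 (not taken from MNIST)
demo = np.array([[0,   0, 200,   0, 0],
                 [0, 180, 255,   0, 0],
                 [0,   0, 255,   0, 0],
                 [0,   0, 255,   0, 0],
                 [0, 255, 255, 255, 0]])

print(show_as_text(demo))
```

With the dataset loaded, `print(show_as_text(train_images[0]))` would draw the first training digit the same way.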

In [None]:
train_labels.shape

In [None]:
train_labels

In [None]:
test_images.shape

In [None]:
ntest = len(test_images)
ntest

In [None]:
test_labels.shape

In [None]:
test_labels

## How the data looks to humans
Let's take a look at what some of these handwritten digits look like when displayed as images.

In [None]:
which = 0
#which = random.randint(0, ntrain-1)
image = train_images[which]
#ansa  = train_labels[which]

# Matplotlib can draw an image with imshow(), instead of graphing with plot()
fig = plt.figure()
ax  = fig.add_subplot()
ax.imshow(image)  # , cmap=plt.cm.binary)

#ax.set_xticks( [] )
#ax.set_yticks( [] )
#caption = 'train_images[{}] is a {}'.format(which, ansa)
#ax.set_xlabel(caption)


## Exercise 
* In the graphing cell above, uncomment the extra parts one by one, see what happens.
* Change it to show an image from the test set instead of the training set
* Go back to the cell which divides the train/test image pixel values by 255, and rerun up to here

## Functions for plotting training and test cases
Here's a function for drawing an image and a caption:

In [None]:
def plot_image(ax,   # a matplotlib Axes to draw the image onto
               img,  # the image to draw
               cap): # caption text
    
    # Draw the image with the colormap that draws 0=white-->1=black
    ax.imshow(img, cmap=plt.cm.binary)
    
    # get rid of the pixel row/column counters
    ax.set_xticks( [] )
    ax.set_yticks( [] )

    # use that caption as the label for the X axis
    ax.set_xlabel(cap)

Now that the function is written, test it out and make sure it works (and you understand how it works)

In [None]:
fig = plt.figure()
ax  = fig.add_subplot() 
img = train_images[0]
ansa = train_labels[0]
cap = 'train[{}] is a {}'.format(0, ansa)
plot_image(ax, img, cap)

## Exercise

* Change the cell above to plot the first *test* image instead of the first *training* image
  * (And appropriate caption)
* Change the cell above to plot *both* the first training and testing images
  * (Refer to the Matplotlib intro/Anscombe's quartet for a reminder of how to use multiple subplots)

It's still a lot of typing or pasting to get it to render a particular training or test image. Here's a function to render a training image, given just a number.

Test it, and then make the analogous function for test images (and test it).

We will use these later.

In [None]:
# Here's a function to plot and caption a training image
def plot_training_image(idx):
    fig = plt.figure()   # boilerplate stuff
    ax  = fig.add_subplot()
    img = train_images[idx]   # this is the training image we want to plot
    ansa = train_labels[idx]  # this is the correct answer for what it is
    cap = 'train[{}] is a {}'.format(idx, ansa)  # build a caption string
    plot_image(ax, img, cap)  # plot & caption on this ax

In [None]:
plot_training_image(0)

In [None]:
# Use this cell to make a function to plot and caption a test image
def plot_test_image(idx):
    pass  # replace this line with your own code, modeled on plot_training_image()

In [None]:
# Now test it
plot_test_image(0)

This python code renders 25 `train_images` in a 5x5 grid. In the middle of the loop you can switch between `which=i` and `which=random...` to control which 25 get displayed.

In [None]:
# This cell shows 5x5 training images (first 25 or random 25)
fig = plt.figure(figsize=(10,10))  # 10x10 'inches'

for i in range(25): # i = 0...24
    # set up each next subplot
    ax = fig.add_subplot(5,5, i+1)       # in a 5x5 grid, setup subplot 1...25
    
    # which training image to show?
    which = i                            # show the ith training image
    #which = random.randint(0,ntrain-1)  # show a random image
    
    # fill these out
    img = ...
    ansa = ...
    cap = ...
    plot_image(ax, img, cap)


## Exercise

* Fill in the `img=`, `ansa=`, `cap=` lines so the `plot_image()` command has what it needs to run
* Refactor so the `plot_image()` call is just one longer line
* Uncomment the `which = random` line, and re-run to generate a bunch more grids of digit images. Can you find some that look ambiguous and might be difficult?

# Neural Network Analysis
## Structure the Network
This is where we set up the structure of the neural network, and run the training. The model has 3 layers:

1. The first/input `Flatten` layer maps the individual pixel values from their 28x28 grid to an array of 784 values.
1. The second/middle layer is `Dense`, which means there is an edge from each of the 784 first-layer nodes to each of the 2nd-layer nodes.
  * `relu` ("rectified linear unit") is the most common 'activation function', and all it does is check whether the sum of scaled/biased inputs is positive or not. 
  * If it is positive, it 'fires' by outputting that value. 
  * If it is zero or negative, it doesn't fire (or rather, outputs the value 0).
1. The third/output layer (also `Dense`) must have 10 nodes, because we are classifying the 10 different digits. 
  * `softmax` takes the 10 numerical values that accumulate in the 10 nodes, and rescales them so they are positive and sum to 1. 
  * This way we can interpret them as probabilities (of being various digits)

Note for the middle layer we can choose more or fewer nodes, but the outer layers have to fit the input and output. Also we could add more intermediate layers.
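The three layers described above can be sketched in plain numpy. This is only an illustration: the weights here are random stand-ins (not trained), the 'image' is random noise, and the names `W1`, `b1`, `W2`, `b2` are hypothetical. It shows how 784 inputs become 16 middle values and then 10 probabilities:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)      # negative sums become 0, positive sums pass through

def softmax(x):
    e = np.exp(x - np.max(x))      # subtracting the max avoids overflow, result is unchanged
    return e / e.sum()

rng = np.random.default_rng(0)     # random stand-ins, NOT trained weights
image = rng.random((28, 28))       # a fake 28x28 'image' with values in 0..1
W1, b1 = rng.normal(size=(784, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 10)) * 0.1, np.zeros(10)

flat   = image.reshape(784)        # Flatten: 28x28 grid -> 784 values
hidden = relu(flat @ W1 + b1)      # Dense(16, relu)
probs  = softmax(hidden @ W2 + b2) # Dense(10, softmax)

print(hidden.shape, probs.shape)
print(probs.sum())                 # the 10 outputs sum to 1 (up to rounding)
```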

In [None]:
# Model architecture
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),   # input  layer; one node per pixel
    keras.layers.Dense(16, activation='relu'),    # middle layer; may be changed to more nodes
    keras.layers.Dense(10, activation='softmax')  # output layer; must have 10 nodes because 10 digits
])

# 'compile' basically means get ready to run
model.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

## Train the Network
This is where the computation happens:
- Forward-propagation from pixel inputs through the network, to scores in the output layer
- Comparison of scores to truth, yielding errors
- Back-propagation from errors to update coefficients in the network

The two output statistics are:
* **accuracy**: the fraction of the 60000 training images predicted correctly (so 0.95 means 95%).
* **loss**: a penalty for not assigning probability 1 to the correct answer; see below

The number of training epochs can be increased until convergence (the model stops improving)
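The forward/compare/back-propagate cycle described above can be sketched in numpy for a much smaller problem. This is an illustration under stated assumptions, not what `model.fit` does internally: it uses a single softmax layer (no middle layer), made-up data, and one step of plain gradient descent rather than the 'adam' optimizer. All names (`X`, `y`, `W`, `b`, `lr`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))                 # 6 made-up cases, 4 'pixels' each
y = np.array([0, 1, 2, 0, 1, 2])       # made-up truth labels (3 classes)
W, b = np.zeros((4, 3)), np.zeros(3)   # coefficients start at zero

def forward(X, W, b):
    z = X @ W + b
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)    # softmax scores per case

def mean_loss(probs, y):
    return -np.log(probs[np.arange(len(y)), y]).mean()

probs = forward(X, W, b)               # 1. forward-propagation
loss_before = mean_loss(probs, y)

err = probs.copy()                     # 2. compare scores to truth:
err[np.arange(len(y)), y] -= 1.0       #    error = score minus 1 for the true class

lr = 0.1                               # 3. back-propagate: nudge coefficients against the error
W -= lr * X.T @ err / len(y)
b -= lr * err.mean(axis=0)

loss_after = mean_loss(forward(X, W, b), y)
print(loss_before, loss_after)         # the loss after the step is smaller
```

Training repeats this kind of update many times per epoch, which is why more epochs can keep improving the model until it converges.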

In [None]:
model.fit(train_images, 
          train_labels, 
          epochs=1)

Now that the model has been trained, it's got weights -- scale factors for the edges, and biases for the nodes. Look at all the shapes of the parts of `model.weights`

In [None]:
model.weights
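The shapes follow directly from the layer sizes. As a sketch, assuming the 784-16-10 architecture defined above (the `Flatten` layer has no weights of its own), the arithmetic is:

```python
# (inputs, outputs) for each Dense layer in the model above
layers = [(784, 16), (16, 10)]

total = 0
for n_in, n_out in layers:
    n_weights = n_in * n_out   # one scale factor per edge
    n_biases  = n_out          # one bias per receiving node
    print('weights {}x{} = {}, biases = {}'.format(n_in, n_out, n_weights, n_biases))
    total += n_weights + n_biases

print('total trainable values:', total)
```

If you change the middle layer to more nodes, change the 16 here to match.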

## Evaluate the Network
Now that the model is trained (the network coefficients have been fit to the training data), we test it by evaluating on the test images it has never seen.

Note that **accuracy** and **loss** are also computed to evaluate the performance of the trained model on the test dataset.

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)

In [None]:
test_loss

In [None]:
test_acc

How do those values of loss and accuracy on the *test* set compare to loss and accuracy from the *training*?

## Look at the predictions
The point of the network is to predict what digit an image is. We can run the model on all the `test_images` and get all the predictions, like this:

In [None]:
predictions = model.predict(test_images)

In [None]:
predictions.shape

In [None]:
predictions

## Examine a particular test image and its prediction from the network

In [None]:
which = 0                            # first we'll look at test case 0
#which = random.randint(0, ntest-1)   # later we can do random

In [None]:
preds = predictions[which]
preds

What is `predictions[0]`? How big is it? (Why is it that big?) What is it saying? Is this prediction correct?

`predictions[0]` is supposed to predict the right answer for test case number 0. What is the right answer for test case number 0?

In [None]:
ansa = test_labels[which]
ansa

`predictions[0]` is the neural network output layer, when `test_images[0]` is fed into the input layer of the network. What is `test_images[0]` anyways?

In [None]:
# use this cell to get matplotlib to show a picture of test_images[0]
plot_test_image(which)

In [None]:
max(preds)

In [None]:
np.argmax(preds)

In [None]:
pred = np.argmax(preds)

In [None]:
# two kinds of printouts
if pred == ansa:              # if the model predicted the right ansa        
    p = 100 * preds[ansa]
    print('The highest-probability prediction is {:.1f}% for {}, which is correct'.format(p,ansa))
else:
    q = preds[ansa]     # this is what we SHOULD have chosen
    p = preds[pred]     # but this larger probability is what DID get chosen
    print('The correct answer {} had probability {:.1f}%'.format(ansa, 100*q))
    print('But the prediction {} had probability {:.1f}%'.format(pred, 100*p))
    # compute and print 'loss' here, see below

In [None]:
# Plot the 10 predictions as a bar graph
fig = plt.figure()
ax  = fig.add_subplot()
#ax.set_xticks(range(10)) # xticks at 0,1,...9, matching the digit labels

barplot = ax.bar(range(10), preds, color="gray")

# remember 'pred' and 'ansa' that were set a few cells above?
#barplot[pred].set_color('red')
#barplot[ansa].set_color('blue')
# why does this work? what happens if pred==ansa vs if pred!=ansa?

## Understand 'loss'
If everything is working right, `preds` has a probability near 1 for the correct answer (and the bar graph has a bar of height near 1). If the prediction probability for the right answer is less than 1, that is the basis for computing 'loss', as in the cells below.

The reported 'loss' (along with accuracy) is the average of these values for all cases (across either the test set or training set).

In [None]:
q = preds[ansa]  # remember ansa is the truth label (and the index of the truth label)
q                # for a good prediction, this should be near 1

In [None]:
# This is the formula for loss (natural log, which is what Keras uses for crossentropy)
loss = np.log(1.0/q)     # If q is almost 1, this is almost 0. 
                         # The smaller q gets, the bigger 1/q gets, so the larger log(1/q) gets
loss
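As a sketch of how the per-case value turns into the reported averages, here is the same calculation done for several cases at once. The arrays are made up (3 cases, 4 classes), and the natural log is assumed because that is what Keras' `sparse_categorical_crossentropy` uses:

```python
import numpy as np

# Made-up output-layer scores for 3 cases and 4 classes (each row sums to 1)
toy_predictions = np.array([[0.7, 0.1, 0.1, 0.1],
                            [0.2, 0.5, 0.2, 0.1],
                            [0.6, 0.2, 0.1, 0.1]])
toy_labels = np.array([0, 1, 2])          # made-up truth labels

# q for every case at once: the probability given to the true label
qs = toy_predictions[np.arange(len(toy_labels)), toy_labels]

losses   = np.log(1.0 / qs)               # natural log, one loss per case
accuracy = np.mean(toy_predictions.argmax(axis=1) == toy_labels)

print(qs)
print(losses.mean(), accuracy)
```

On the real data, swapping in `predictions` and `test_labels` should give values close to `test_loss` and `test_acc`.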

## Exercise
* Add a loss calculation and printout to the `if/else` cell above
* Repeat the cells above, setting `which` to different indices, or letting it be random. Can you find any interesting cases which are wrongly-predicted, or more marginal than an easy correct prediction?

## Confusion Matrix
Here is the list of all the correct answers (most not printed, because 10,000 is too long!)

In [None]:
test_labels # these are the correct answers

As we saw above, each prediction is an array of 10 floating point numbers. This cell applies `argmax` to each to get the index of the largest score in each.
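The `axis=1` argument is what makes `argmax` work row by row. A minimal sketch on a made-up array (4 classes instead of 10, to keep it readable):

```python
import numpy as np

# Three made-up score rows (4 classes each instead of 10)
toy = np.array([[0.1, 0.7, 0.1, 0.1],
                [0.6, 0.2, 0.1, 0.1],
                [0.1, 0.1, 0.2, 0.6]])

print(toy.argmax(axis=1))   # one winning index per row    -> [1 0 3]
print(toy.argmax(axis=0))   # for contrast: per column     -> [1 0 2 2]
```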

In [None]:
test_preds = predictions.argmax(axis=1) # these are the predictions; the index of the largest score for each test
test_preds

As you can probably see, the first three predictions and the last three predictions match the truth. But since accuracy was not 100%, there must be some mismatches among the 9994 that are not printed. 

The 'confusion matrix' generated by the next cell details which numbers were mistaken for which. The rows of the matrix mean 'which digit it actually is'. The columns mean 'which digit was predicted'. 

What is the meaning of the large diagonal values? Other than the upper-left value, what's the largest value in the first column? What does it mean? What's the largest off-diagonal value, and what does it mean?

In [None]:
metrics.confusion_matrix(test_labels, test_preds)
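To make the row/column meaning concrete, here is a hand-rolled sketch of the same counting on made-up data (8 cases, only the 'digits' 0, 1, 2). It follows the same convention as `metrics.confusion_matrix`: rows are the actual digit, columns are the predicted digit. The names `toy_truth`, `toy_preds`, and `cm` are illustrative.

```python
import numpy as np

# Made-up truth and predictions for 8 cases, using only the 'digits' 0, 1, 2
toy_truth = np.array([0, 0, 1, 1, 1, 2, 2, 2])
toy_preds = np.array([0, 0, 1, 2, 1, 2, 1, 1])

cm = np.zeros((3, 3), dtype=int)
for t, p in zip(toy_truth, toy_preds):
    cm[t, p] += 1            # row = actual digit, column = predicted digit

print(cm)

# Largest off-diagonal entry = most common kind of mistake
off = cm.copy()
np.fill_diagonal(off, 0)
actual, predicted = np.unravel_index(off.argmax(), off.shape)
print('most common mistake: a {} predicted as a {}'.format(actual, predicted))
```

The same `fill_diagonal` and `argmax` steps could be applied to the 10x10 matrix from the cell above to find the most common confusion in the real results.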

## Visualize groups of results
Below is more complex code that graphs a large number of test results, with the bar graph red to highlight wrong answers.

Cells with `def` create functions (python analogues of Snap! custom blocks), and only have to be run once. The last cell can be rerun many times, especially if `which` is set to random.

In [None]:
# This function plots predictions (an array of 10 scores) as a bar graph,
# color coding blue for true, red for error
def plot_predictions(ax, i, predictions, true_label):
  ax.grid(False)
  ax.set_xticks(range(10))
  #ax.set_yticks([])  # let there be yticks, so we can see the scale of the barplot
  barplot = ax.bar(range(10), predictions, color="#777777")
  ax.set_ylim([0, 1])
  pred_label = np.argmax(predictions)

  barplot[pred_label].set_color('red')   # set the prediction red
  barplot[true_label].set_color('blue')  # set the truth blue
  # if the prediction was right, it will just be blue
  # if the prediction was wrong, the truth will be blue, 
  # and the incorrect prediction still red

In [None]:
fig = plt.figure(figsize=(12,8))

# This uses the functions above to graph images and bar graphs in a grid.
# In the middle again choose either which=i for the first results, or which=random

# 6 rows, and 4 columns
num_images = 6*4
# BUT, next to each image we also plot its prediction bar graph
# so really 8 columns

subplot_i = 0 # we will actually start with 1

for i in range(num_images):
    which = i # as before, leave this for sequential, or uncomment the next line for random
    #which = random.randint(0, len(test_images)-1)
    
    img   = test_images[which] # this is the image we're trying to predict
    ansa  = test_labels[which] # this is the correct answer
    
    preds = predictions[which] # the 10 scores from the output layer
    score = np.max(preds)      # max score
    pred  = np.argmax(preds)   # index of max score (i.e. prediction)
    pct   = round(score*100)
    cap   = '#{}({}) {}%'.format(which, ansa, pct)
    
    subplot_i = subplot_i + 1                # advance to the next subplot number
    imgax = fig.add_subplot(6,8, subplot_i)  # create the new subplot
    plot_image(imgax, img, cap)
    
    subplot_i = subplot_i + 1                # advance to the next subplot number
    barax = fig.add_subplot(6,8, subplot_i)  # create the new subplot
    plot_predictions(barax, which, preds, ansa)    

plt.tight_layout()

# Homework
(See also Schoology)

**Part 1**

* Make sure the network is structured with the middle Dense layer set to 16 nodes, and epochs to 1 (so that incorrect predictions are relatively common).
* Just in case, re-execute the Structure/Train/Evaluate cells to (badly) retrain the network
* In the last code cell, uncomment the `which=random` line
* Repeatedly run the last cell to generate a new grid of random results, until at least 4 incorrect predictions are shown.
* Submit your notebook with your grid containing $\ge 4$ incorrect predictions.

**Part 2**

Go back to the cells above in the 'Structure the Network' section. 
* Set the size of the middle layer to 16, 32, 64, 128 nodes and re-execute the cell.
* Set the number of training epochs to 1, 3, 5 and re-execute the cell.
* Re-execute the cell that evaluates the model on the test data.

For each of those 4x3 = 12 runs, populate statistics into a spreadsheet with these columns:
* Nodes (middle layer)
* Epochs
* Accuracy (Train)
* Loss (Train)
* Accuracy (Test)
* Loss (Test)

