# Python Notebook for MNIST neural network
This is a Jupyter Python notebook, which is a collection of cells. Each cell is either of type 'markdown' (formatted text, like this cell) or code (python, grey background). The two most important rules of Jupyter Notebooks are:
1. ***SHIFT-ENTER*** will cause the current cell to execute. 
  - For Markdown cells, 'execute' means render the formatting.
  - For Code cells, 'execute' means run the python.
  - Some Code cells take a while to execute; watch for the `*` to change to a number.
1. Any cell can be edited (double-click into it) and re-executed (SHIFT-ENTER again).
--- 

### Setup
This first code cell contains import statements, which load libraries of ready-to-go capabilities.

In [None]:
import random                    # so we can display random images from the dataset
import numpy as np               # numpy is everything with arrays and matrices, 'np' is a commonly-used nickname
import matplotlib.pyplot as plt  # most common python graphing library
import tensorflow as tf          # tensorflow is a deep learning framework.
from   tensorflow import keras   # tensorflow.keras offers higher-level control of common tensorflow tasks
from      sklearn import metrics # we get the 'confusion_matrix' function from here

The tensorflow keras library includes some datasets for training. The Modified National Institute of Standards and Technology (MNIST) dataset contains 70000 images of handwritten digits, each 28x28 pixels, along with a label for each image specifying the right answer.

The dataset is pre-divided into 60000 images for training the neural network, and 10000 held back from the training set for independent testing.

In [None]:
# Fetch the dataset, pre-divided into train/test, and labels for each
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

# Just a little data prep here, we scale the pixel values from the range 0..255 to 0..1.0 
# (numpy makes it simple, every value in the 28x28 array gets scaled by the one divisor).

train_images = train_images / 255.0  # rescale from 0..255 to 0..1.0
test_images  = test_images  / 255.0 

print(len(train_images), 'training images')
print(len(test_images),  'test images')
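
To see what the rescaling does, here is a small sketch on a made-up 2x3 array standing in for an image (the name `tiny_image` and its values are invented for illustration, not taken from MNIST). Dividing the whole array by one number scales every element.

```python
import numpy as np

# Made-up 2x3 "image" with 8-bit pixel values (an assumption, not real MNIST data)
tiny_image = np.array([[0, 51, 102],
                       [153, 204, 255]], dtype=np.uint8)

tiny_scaled = tiny_image / 255.0   # the one divisor is applied to every element

print(tiny_scaled)                 # values now run from 0.0 to 1.0
print(tiny_scaled.dtype)           # integer pixels have become floating point
```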

### Examine the Data
Let's take a look at what some of these handwritten digits look like. This python code plots 25 of them in a 5x5 grid. In the middle of the loop you can switch between `which=i` and `which=random...` to control which 25 get displayed.

If you display enough random digits, can you find some that look ambiguous and might be difficult? Make a note of the index; we can see later how the network did.

In [None]:
# This cell shows 5x5 training images (first 25 or random 25)
plt.figure(figsize=(10,10))  # 10x10 'inches'
for i in range(25):          # i = 0...24
    plt.subplot(5,5, i+1)    # in a 5x5 grid, setup subplot 1...25
    plt.grid(False)
    plt.xticks([])           # don't use xticks, yticks, or grid
    plt.yticks([])
    which = i                                       # show the ith training image
    #which = random.randint(0,len(train_images)-1)  # show a random image
    plt.imshow(train_images[which], cmap=plt.cm.binary) # show image in plot
    caption = 'train[{}] is a {}'.format(which, train_labels[which])
    plt.xlabel(caption)        # caption with corresp. label
plt.show()  # after all 25 subplots are set up, show the plot

# Exercise
Uncomment the `which = random` line, and regenerate a bunch more grids of digit images

# Neural Network Analysis
### Structure the Network
This is where we set up the structure of the neural network, and run the training. The model has 3 layers:

1. The first/input `Flatten` layer maps the individual pixel values from their 28x28 grid to an array of 784 values.
1. The second/middle layer is `Dense`, which means an arc from each of the 784 first-layer nodes, to each of the 2nd-layer nodes.   
  * `relu` is the most common 'activation function', and all it does is check whether the sum of scaled/biased inputs is positive or not. 
  * If it is positive, it 'fires' by outputting that value. 
  * If it is negative, it doesn't fire (it outputs 0).
1. The third/output layer (also `Dense`) must have 10 nodes, because we are classifying the 10 different digits. 
  * `softmax` takes the 10 numerical values that accumulate in the 10 nodes, and rescales them so they are positive and sum to 1. 
  * This way we can interpret them as probabilities (of being various digits).

Note for the middle layer we can choose more or fewer nodes, but the outer layers have to fit the input and output. Also we could add more intermediate layers.
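
The two activation functions described above can be sketched in a few lines of numpy. This is only an illustration on made-up numbers; the helper names `relu` and `softmax` are written here by hand and are not the functions Keras calls internally.

```python
import numpy as np

def relu(z):
    # pass positive values through unchanged; replace negative values with 0
    return np.maximum(z, 0.0)

def softmax(z):
    # subtracting the max does not change the result, but keeps exp() from overflowing
    e = np.exp(z - np.max(z))
    return e / e.sum()

node_sums = np.array([-2.0, 0.5, 3.0])    # made-up weighted sums for three middle nodes
print(relu(node_sums))                    # the negative entry becomes 0

node_scores = np.array([1.0, 2.0, 3.0])   # made-up values for three output nodes
probs = softmax(node_scores)
print(probs, probs.sum())                 # all positive, and they sum to 1
```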

In [None]:
# Model architecture
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),   # input  layer; one node per pixel
    keras.layers.Dense(16, activation='relu'),    # middle layer; may be changed to more nodes
    keras.layers.Dense(10, activation='softmax')  # output layer; must have 10 nodes because 10 digits
])

# 'compile' basically means get ready to run
model.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])
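
Changing the middle layer changes how many coefficients the network has to learn. As a rough sketch: each `Dense` node has one weight per input plus one bias, and `Flatten` has none. The helper names below are made up for this example; the totals should agree with what `model.summary()` reports.

```python
def dense_params(n_inputs, n_nodes):
    # one weight for every (input, node) pair, plus one bias per node
    return n_inputs * n_nodes + n_nodes

def total_params(middle_nodes, n_pixels=28 * 28, n_digits=10):
    # middle Dense layer reads the 784 pixels; output Dense layer reads the middle nodes
    return dense_params(n_pixels, middle_nodes) + dense_params(middle_nodes, n_digits)

for nodes in (16, 32, 64, 128):
    print(nodes, 'middle nodes ->', total_params(nodes), 'trainable coefficients')
```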

### Train the Network
This is where the computation happens:
- Forward-propagation from pixel inputs through the network, to scores in the output layer
- Comparison of scores to truth, yielding errors
- Back-propagation from errors to update coefficients in the network

The two output statistics are
* **accuracy**: the fraction of the 60000 training images predicted correctly (Keras prints it as a number between 0 and 1).
* **loss**: a penalty for not assigning probability 1 to the correct answer; see below.

The number of training epochs can be increased until convergence (the model stops improving).
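
The three steps above can be sketched by hand on a much smaller problem. This is a simplified stand-in, not what `model.fit` actually runs: it uses made-up data, a single softmax layer with no middle layer, and plain gradient descent instead of the `adam` optimizer. The learning rate 0.5 and the 100 steps are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data (not MNIST): 200 samples, 4 features, 3 classes
X = rng.normal(size=(200, 4))
y = np.argmax(X[:, :3], axis=1)          # label = index of the largest of the first 3 features

W = np.zeros((4, 3))                     # coefficients to be learned
b = np.zeros(3)

def forward(X, W, b):
    z = X @ W + b
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_loss(probs, y):
    # average of -log(probability assigned to the correct class)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

losses = []
for step in range(100):
    probs = forward(X, W, b)              # forward-propagation to scores
    losses.append(mean_loss(probs, y))    # comparison of scores to truth
    err = probs.copy()
    err[np.arange(len(y)), y] -= 1.0      # error = predicted probabilities minus one-hot truth
    W -= 0.5 * (X.T @ err) / len(y)       # back-propagation: nudge coefficients downhill
    b -= 0.5 * err.mean(axis=0)

accuracy = np.mean(np.argmax(forward(X, W, b), axis=1) == y)
print('loss went from {:.3f} to {:.3f}; accuracy is now {:.2f}'.format(losses[0], losses[-1], accuracy))
```

Each pass through the loop plays the role of one (very small) training epoch: the loss shrinks as the coefficients are updated.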

In [None]:
results = model.fit(train_images, 
                    train_labels, 
                    epochs=1)

### Evaluate the Network
Now that the model is trained (the network coefficients have been fit to the training data), we test it by evaluating on the test images it has never seen.

Note that **accuracy** and **loss** are also computed to evaluate the performance of the trained model on the test dataset.

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
predictions = model.predict(test_images)

# Examine Individual Results
Now that we have applied the trained model to predict classifications for all the test data, let's take a look at a few to see how they compare to truth (the provided labels).

### Output Layer Prediction Scores
Set `which` variously and run the following cells to investigate.

This first cell looks at the predictions array (the 10 numbers in the final layer of the network, after pushing the image through). The index of the largest value is the predicted digit.

In [None]:
# 0 is the index of the first test image (python always counts starting from 0 not 1)
which=0  # edit this to examine a specific test image
# uncomment this to choose a random test case
#which = random.randint(0,len(test_images)-1)

preds = predictions[which]    # grab this prediction, which is a 10-vector of scores from the final layer
print('Predictions:', preds)  # let's see what it looks like!
ansa = test_labels[which]     # this is the right answer
pred = np.argmax(preds)       # 'max' is the largest *value* in preds, we want the *index* max lives at

# two kinds of printouts
if pred == ansa:              # if the model predicted the right ansa        
    p = 100 * preds[ansa]
    print('The highest-probability prediction is {:.1f}% for {}, which is correct'.format(p,ansa))
else:
    p = 100 * preds[ansa]     # this is what we should have chosen
    q = 100 * preds[pred]     # but this had a larger probability
    print('The correct answer {} had probability {:.1f}%'.format(ansa, p))
    print('But the prediction {} had probability {:.1f}%'.format(pred, q))
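
The difference between `max` and `argmax` is easy to see on a made-up score vector. The values in `example_preds` below are invented for illustration and are not model output. `np.argsort` is used here as one way to also find the runner-up digit.

```python
import numpy as np

# Made-up 10-vector of scores (sums to 1), standing in for one row of predictions
example_preds = np.array([0.01, 0.02, 0.05, 0.02, 0.10, 0.03, 0.02, 0.70, 0.03, 0.02])

print(np.max(example_preds))      # the largest *value*
print(np.argmax(example_preds))   # the *index* where it lives, i.e. the predicted digit

top_two = np.argsort(example_preds)[::-1][:2]   # indices of the two largest scores
print(top_two)                    # predicted digit first, then the runner-up
```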

### View the Image
This cell shows the image for test data 'which', using matplotlib similarly to above. Does it look like the prediction? If the prediction is wrong, does it make sense why it could have chosen that wrong prediction?

In [None]:
plt.figure() # similar matplotlib setup as above to simply show the pixels 
plt.grid(False)
plt.xticks([])
plt.yticks([])
plt.imshow(test_images[which], cmap=plt.cm.binary)
plt.show()

### View the Prediction Scores
This cell plots the predictions as a bar graph.

In [None]:
plt.figure()
plt.grid(False)
plt.xticks(range(10)) # xticks at 0,1,...9, matching the digit labels
#plt.yticks([])       # don't eliminate yticks, let them show percentages
barplot = plt.bar(range(10), preds, color="gray")
# remember 'pred' and 'ansa' that were set a few cells above?
barplot[pred].set_color('red')
barplot[ansa].set_color('blue')
# why does this work? what happens if pred==ansa vs if pred!=ansa?

### Understand 'loss'
If everything is working right, `preds` has a probability near 1 for the correct answer (and the bar graph has a bar of nearly height 1). The shortfall of that probability below 1 is the basis for computing 'loss', as in the next cell.

The 'loss' reported alongside accuracy is the average of these per-case values across all cases (in either the test set or the training set).
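
Here is a small worked example of that averaging, using made-up probabilities rather than real model output. Keras computes `sparse_categorical_crossentropy` with the natural logarithm, so the per-case penalty is `-log(q)`, which is the same as `log(1/q)`.

```python
import numpy as np

# Made-up probabilities a model might assign to the correct digit in four cases
example_q = np.array([0.99, 0.90, 0.50, 0.10])

per_case_loss = -np.log(example_q)   # natural log; identical to np.log(1.0 / example_q)
print(per_case_loss)                 # small when q is near 1, large when q is small
print(per_case_loss.mean())          # an average like this is what gets reported as 'loss'
```

Notice that the one badly-missed case (q = 0.10) contributes most of the average.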

In [None]:
q = preds[ansa]          # remember ansa is the truth label (and the index of the truth label)
print("This digit is a {}; the model's score for that was {:.6f} (should be near 1.0)".format(ansa, q))

# This is the formula for loss (natural log, which is what Keras uses for cross-entropy)
loss = np.log(1.0/q)     # If q is almost 1, this is almost 0. 
                         # The smaller q gets, the bigger 1/q gets, so the larger log(1/q) gets
print('The average loss is {:.4f}; the loss that this case contributes is {:.6f}'.format(test_loss, loss))

# Exercise
Repeat the cells above, setting which to a different index (any of the test images 0...9999), looking at the results for various test data.

### Confusion Matrix
Here is the list of all the correct answers (most not printed, because 10,000 is too long!)

In [None]:
test_labels # these are the correct answers

As we saw above, each prediction is an array of 10 floating point numbers. This cell applies `argmax` to each to get the index of the largest score in each.

In [None]:
test_preds = predictions.argmax(axis=1) # these are the predictions; the index of the largest score for each test
test_preds

As you can see, the first three predictions and the last three predictions match the truth. But since accuracy was not 100%, there are some mismatches among the 9994 that are not printed.
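
One way to find the mismatches without printing everything is to compare the two arrays element by element. The sketch below uses short made-up arrays (`example_labels` and `example_guesses` are invented names and values) in place of `test_labels` and `test_preds`.

```python
import numpy as np

example_labels  = np.array([7, 2, 1, 0, 4, 1, 4, 9])   # made-up correct answers
example_guesses = np.array([7, 2, 1, 0, 9, 1, 4, 4])   # made-up predictions

wrong = np.flatnonzero(example_guesses != example_labels)   # indices where they differ
print(wrong)                                        # which cases were missed
print(len(wrong), 'wrong out of', len(example_labels))
print(np.mean(example_guesses == example_labels))   # fraction correct, i.e. accuracy
```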

The 'confusion matrix' generated by the next cell details which numbers were mistaken for which. The rows of the matrix mean 'which digit it actually is'. The columns mean 'which digit was predicted'. 

What is the meaning of the large diagonal values? Other than the upper-left value, what's the largest value in the first column? What does it mean? What's the largest off-diagonal value, and what does it mean?

In [None]:
metrics.confusion_matrix(test_labels, test_preds)
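
To make the row/column meaning concrete, here is a hand-built confusion matrix for a made-up 3-class problem, using the same layout (rows are the actual class, columns are the predicted class). It is a sketch with numpy only, not the scikit-learn implementation, and all names and values are invented for the example.

```python
import numpy as np

example_truth = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])   # made-up correct answers
example_guess = np.array([0, 0, 1, 2, 1, 2, 2, 1, 1, 2])   # made-up predictions

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (example_truth, example_guess), 1)   # add 1 at [actual, predicted] for each case
print(cm)

# The diagonal counts correct answers, so accuracy is diagonal total / grand total
print('accuracy:', cm.trace() / cm.sum())

# Zero out the diagonal to find the most common mistake
mistakes = cm.copy()
np.fill_diagonal(mistakes, 0)
actual, predicted = np.unravel_index(np.argmax(mistakes), mistakes.shape)
print('most common mistake: a {} predicted as a {}'.format(actual, predicted))
```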

## Visualize groups of results
Below is more complex code that graphs a large number of test results, with the bar graph red to highlight wrong answers.

The cells with `def` create functions (python analogues of Snap! custom blocks), and only have to be run once. The last cell can be rerun many times, especially if `which` is set to random.

In [None]:
# This function plots one image with a blue or red caption, 
# into the currently-selected subplot
def plot_image(i, predictions_array, true_label, img):
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])

    plt.imshow(img, cmap=plt.cm.binary)  # this shows the pixels

    # the rest of this assembles the caption text
    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'

    # this is the probability of whatever digit was predicted (right or wrong)
    p = np.max(predictions_array)
    
    # this is the probability of predicting the right answer
    q = predictions_array[true_label]
    # if we chose the right answer, p==q
    
    # include the loss of this individual case in the caption
    loss = np.log(1.0/q) # natural log, as Keras uses; the smaller the q, the larger the loss
  
    # assemble the caption by formatting values into a text string
    caption = "#{}({}) {:2.0f}% loss {:.3f}".format(i,
                                                    true_label,
                                                    100*p, # 100* turns fractions into %
                                                    loss)
    # add the caption, with the appropriate color
    plt.xlabel(caption, color=color) 

In [None]:
# This function plots the corresponding bar graph into the currently-selected
# subplot: the predicted digit's bar is red, the true digit's bar is blue
# (so a correct prediction shows only blue)
def plot_value_array(i, predictions_array, true_labels):
    true_label = true_labels[i]   # look up the correct answer for case i
    plt.grid(False)
    plt.xticks(range(10))
    #plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color="#777777")
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)

    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')   # set last, so blue wins when correct

In [None]:
# This uses the functions above to graph images and bar graphs in a grid.
# In the middle again choose either which=i for the first results, or which=random
num_cols=4 # twice as many columns really, because a digit and a bar graph for each
num_rows=6 # freals this many rows
num_images = num_rows*num_cols
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for i in range(num_images):
    which = i # as before, leave this for sequential, or uncomment the next line for random
    #which = random.randint(0, len(test_images)-1)
    plt.subplot(num_rows, 2*num_cols, 2*i+1)                                      # advance to the next subplot
    plot_image(which, predictions[which], test_labels[which], test_images[which]) # then plot in it
    plt.subplot(num_rows, 2*num_cols, 2*i+2)
    plot_value_array(which, predictions[which], test_labels)    
plt.tight_layout()
plt.show()

# Homework
(See also Schoology)

**Part 1**

Go back to the cells above in the 'Neural Network Analysis' section. 
* Set the size of the middle layer to 16, 32, 64, 128 nodes and re-execute the cell.
* Set the number of training epochs to 1, 3, 5 and re-execute the cell.
* Re-execute the cell that evaluates the model on the test data.

For each of those 4x3 runs, populate statistics into a spreadsheet with these columns:
* Nodes (middle layer)
* Epochs
* Accuracy (Train)
* Loss (Train)
* Accuracy (Test)
* Loss (Test)
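
If it helps to stay organized, the 4x3 grid of runs and the column headings can be laid out with a few lines of standard-library python. This is only an optional sketch: the statistics are left blank to be filled in from your own runs, and the CSV text is just one possible starting point for the spreadsheet.

```python
import csv
import io
from itertools import product

node_counts  = [16, 32, 64, 128]
epoch_counts = [1, 3, 5]
columns = ['Nodes', 'Epochs', 'Accuracy (Train)', 'Loss (Train)', 'Accuracy (Test)', 'Loss (Test)']

runs = list(product(node_counts, epoch_counts))   # every (nodes, epochs) combination

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
for nodes, epochs in runs:
    writer.writerow([nodes, epochs, '', '', '', ''])   # blanks to fill in after each run

print(len(runs), 'runs')
print(buffer.getvalue())   # paste this into a spreadsheet as a starting table
```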

**Part 2**

* Reset the middle Dense layer to 16 nodes, and epochs to 1, so that incorrect predictions are more common.
* Re-execute the Structure/Train/Evaluate cells to (badly) retrain the network
* In the last code cell, uncomment the `which=random` line
* Repeatedly run the last cell to generate a new grid of random results, until 4 incorrect predictions are shown.
* Submit a screenshot of the grid (right-click, Save Image As...)
