<a href="https://colab.research.google.com/github/AI-and-Cultural-Computing/caicc_intensive_week2/blob/main/hello_computer_vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we're going to take a very brief look at the "hello world" of computer vision and deep learning: recognizing single-digit numbers from pictures. We're going to assume just a little bit of familiarity with Python but not much, explaining the rationale for why the code is written the way it is more than the syntax itself. There will be pointers to other references, tutorials, and the like as we go.

Why do we start with single digits? Well this is actually a good lesson about the importance of breaking tasks apart. If we tried to have a "general number recognizer" we'd have to figure out how to make a network that can parse out numbers of arbitrary lengths, possibly with commas or periods between digits. But! The easier thing to do is to have a network that can see spaces between things in a picture, which is called *segmentation*, and then a separate network that can process an isolated digit. If we can make both of these pieces then we can glue them together in our code in a way that's simpler than making a single network that combines both these tasks together.
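
To make the "glue" idea concrete, here is a minimal sketch of the pipeline's shape in plain Python. It is not how a segmentation network works: as a stand-in, it splits a tiny hand-made picture wherever it finds a completely blank column, and `recognize_digit` is a hypothetical placeholder for the digit network built later in this notebook. All names here are made up for illustration.

```python
def split_on_blank_columns(picture):
    """Split a picture (list of equal-length strings, '.' = blank) into
    (start, end) column ranges that contain ink, separated by blank columns."""
    width = len(picture[0])
    pieces = []
    start = None
    for col in range(width):
        has_ink = any(row[col] != "." for row in picture)
        if has_ink and start is None:
            start = col                    # a new piece begins here
        elif not has_ink and start is not None:
            pieces.append((start, col))    # the piece ended at this blank column
            start = None
    if start is not None:                  # a piece touching the right edge
        pieces.append((start, width))
    return pieces


def crop(picture, start, end):
    """Cut columns start..end-1 out of every row."""
    return [row[start:end] for row in picture]


def read_number(picture, recognize_digit):
    """Glue code: segment first, then hand each isolated piece to the recognizer."""
    return [recognize_digit(crop(picture, s, e))
            for s, e in split_on_blank_columns(picture)]


# A hand-drawn "1 7" with a blank gap between the two digits.
picture = [
    ".#..###.",
    ".#....#.",
    ".#....#.",
]
print(split_on_blank_columns(picture))  # [(1, 2), (4, 7)]
```

The point of the design is that `read_number` doesn't care how either piece works internally, so each one can be built and tested on its own.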

One last note: the code in this tutorial is inspired by the official [TensorFlow tutorial](https://www.tensorflow.org/tutorials/quickstart/beginner), with some simplifications where I think it's helpful.

So the very first thing we need to do is load the TensorFlow library, since that's what we're going to use for this little introduction. You don't even need to install anything because TensorFlow, like Colab, is a Google product, so they bundle it in. Convenient, yet possibly problematic if you're old enough to remember when Microsoft got in hot water for bundling a web browser with an operating system.

In [11]:
import tensorflow as tf
import numpy as np

The next thing is to load our dataset. We're taking a shortcut here and using the famous [MNIST digit dataset](https://en.wikipedia.org/wiki/MNIST_database), which is a bunch of small images of handwritten digits that are only 28 pixels by 28 pixels each. This dataset is small enough and famous enough that it tends to be included inside most machine learning systems, which is true for TensorFlow as well.

In [3]:
mnist = tf.keras.datasets.mnist

(image_train, number_train), (image_test, number_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


So we've taken the data from the `mnist` dataset and loaded it into four separate variables. 
 
 - `image_train` : the list of actual data of the images we'll use for training
 - `number_train` : the list of (correct) classifications of the images for training
 - `image_test` :  the list of data for the images we'll use to test our trained algorithm
 - `number_test` : the list of classifications of the images for testing

We can take a look at how this data is formatted easily enough.

Since this is a "list" (actually an "array" as provided by the NumPy library, which you can think of like a list but it's more efficient for the computer to use) we can access the first element and see what it looks like:

In [20]:
image_train[0]

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.    

So it's a two-dimensional structure of numbers (28 lists of 28 elements) from 0 to 255. You can't see all the entries but you can calculate out that if it's 28x28 there must be a total of 784 entries in the whole image. 
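
As a quick sanity check on that arithmetic, this small sketch builds a stand-in 28x28 image out of plain Python lists (all zeros, not real MNIST data) and counts its entries. The `flatten` helper is written here just for illustration; it mirrors the reshaping from a grid into one long list of pixels that the network will need later.

```python
ROWS, COLS = 28, 28

# A stand-in image: 28 rows, each a list of 28 pixel values.
fake_image = [[0 for _ in range(COLS)] for _ in range(ROWS)]


def flatten(image):
    """Turn a list of rows into one long list of pixels, row by row."""
    return [pixel for row in image for pixel in row]


pixels = flatten(fake_image)
print(len(fake_image), len(fake_image[0]), len(pixels))  # 28 28 784
```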

This makes sense because a grayscale image, like these are, has a single number per pixel ranging from 0 to 255, with 0 representing black and 255 representing white.

Here, let's take a look at which digit this first image is supposed to be:

In [None]:
number_train[0]

5

It's kind of hard to look at the raw array of numbers and tell that it's a 5.

Here, if we write just a little bit of Python code we can visualize this a little bit better:

In [36]:
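def showImg(img):
  # Walk over every row and column of the image (28 x 28 for MNIST).
  for i in range(len(img)):
    print("")  # start each row on a new line
    for j in range(len(img[i])):
      # "w" for any pixel brighter than pure black, "b" otherwise
      if img[i][j] > 0:
        print("w", end=" ")
      else:
        print("b", end=" ")

showImg(image_train[0])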


b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b w w w w w w w w w w w w b b b 
b b b b b b b b w w w w w w w w w w w w w w w w b b b 
b b b b b b b w w w w w w w w w w w w w w w w b b b b 
b b b b b b b w w w w w w w w w w w b b b b b b b b b 
b b b b b b b b w w w w w w w b w w b b b b b b b b b 
b b b b b b b b b w w w w w b b b b b b b b b b b b b 
b b b b b b b b b b b w w w w b b b b b b b b b b b b 
b b b b b b b b b b b w w w w b b b b b b b b b b b b 
b b b b b b b b b b b b w w w w w w b b b b b b b b b 
b b b b b b b b b b b b b w w w w w w b b b b b b b b 
b b b b b b b b b b b b b b w w w w w w b b b b b b b 
b b b b b b b b b b b b b b b w w w w w b b b b b b b 
b b b b b b b b b b b b b b b b b w w w w b b b b b b 
b b b b b

There, now it's a little easier to visualize! It's sort of a scraggly five!
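
If you'd like to see the in-between grays as well, here's a slightly fancier sketch. It's not part of the original tutorial code: `show_shaded` and its `max_value` parameter are names invented here, and the ramp of characters is an arbitrary choice. It maps each pixel's brightness onto one of a few characters, from a space for black up to `@` for white.

```python
RAMP = " .:*#@"  # darkest to brightest; an arbitrary choice of characters


def shade(pixel, max_value=255):
    """Pick a character for one pixel based on how bright it is."""
    fraction = pixel / max_value                   # 0.0 (black) .. 1.0 (white)
    index = int(fraction * (len(RAMP) - 1) + 0.5)  # round to the nearest ramp step
    return RAMP[min(max(index, 0), len(RAMP) - 1)]


def show_shaded(img, max_value=255):
    """Print an image (rows of pixel values), one text line per row."""
    for row in img:
        print(" ".join(shade(p, max_value) for p in row))


# A tiny made-up 3x4 example rather than a real digit.
show_shaded([
    [0,  64, 128, 255],
    [0, 255, 255,   0],
    [0,   0, 200,   0],
])
```

For the real data you could call `show_shaded(image_train[0])`, or `show_shaded(image_train[0], max_value=1.0)` once the pixels have been rescaled to the 0 to 1 range.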

For machine learning, though, we generally want to *normalize* our data so that it falls between 0 and 1 as much as possible. To do that we can do kind of a neat trick and just divide the entire array of images by 255.0; NumPy applies the division to every pixel of every image at once.

In [6]:
image_train = image_train / 255.0
image_test = image_test / 255.0
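
To see what that division is doing to each pixel, here is the same idea spelled out on a tiny made-up image using plain Python lists. The `normalize` helper below is only an illustration of the arithmetic; with NumPy arrays, the single expression used in the cell above does the same thing without any explicit loops.

```python
def normalize(image, max_value=255.0):
    """Rescale every pixel from the 0..max_value range into the 0..1 range."""
    return [[pixel / max_value for pixel in row] for row in image]


tiny = [
    [0, 51, 255],
    [102, 204, 0],
]
scaled = normalize(tiny)
print(scaled)  # [[0.0, 0.2, 1.0], [0.4, 0.8, 0.0]]
```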

Now that we have our data we need to actually build our neural network. We're going to build the simplest one we can. First, though, let's think about what constraints we're really dealing with: we know that there must be 784 inputs, one for each pixel, and we know there must be 10 outputs, one representing each of the digits 0 through 9.

But this is a "deep" network, meaning that to work it also needs to have some kind of "hidden" layers between the inputs and outputs. 

We're going to start with just one hidden layer and we're going to make it 32 nodes in size. Why? Honestly it's kind of arbitrary! If your hidden layers are too big or too numerous, you risk "overfitting", meaning that your network has just "memorized" the training data and will perform terribly when run on perfectly valid data that's too different from the training set. Too few nodes and there's not enough flexibility to actually do any generalization.

Now, if you're thinking to yourself that there *should* be some kind of definitive answer I don't blame you, but deep learning is still very "ad hoc", meaning that we try things and figure out strategies on a case by case basis rather than having rigorous answers.

Despite all its potential it's not yet a science or an engineering discipline. Yet.

Making the network is going to be pretty easy, so I'll show you the code first and then explain it bit by bit.

In [7]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(32, activation='sigmoid'),
  tf.keras.layers.Dense(10)
])

This means that we're defining our model as taking 784 inputs, which we tell it to expect as a 28x28 list of lists, and then fully connecting every one of those inputs to every one of the nodes in the middle layer; that's what `Dense` means. Already we can do some math and figure out how many weights connect the input to the middle layer: 784 * 32 = 25088 weights. The argument `activation='sigmoid'` is the activation we've talked about before, the classic [logistic function](https://en.wikipedia.org/wiki/Logistic_function), and this function is applied as data exits each node.

From there we can connect the hidden layer to the output layer by adding another "dense" layer with ten nodes, adding another 320 weights to be adjusted.
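
Here's that bookkeeping as a short sketch. One detail the counts above leave out: by default a Keras `Dense` layer also gives every node one extra adjustable number called a bias, so the total number of trainable parameters comes out slightly higher than the weights alone. The helper name `dense_parameter_count` is made up for this example.

```python
def dense_parameter_count(n_inputs, n_nodes, use_bias=True):
    """Count the adjustable numbers in one fully connected layer."""
    weights = n_inputs * n_nodes          # one weight per input-node pair
    biases = n_nodes if use_bias else 0   # one bias per node
    return weights + biases


hidden_weights = dense_parameter_count(784, 32, use_bias=False)
output_weights = dense_parameter_count(32, 10, use_bias=False)
total_with_biases = dense_parameter_count(784, 32) + dense_parameter_count(32, 10)

print(hidden_weights)     # 25088
print(output_weights)     # 320
print(total_with_biases)  # 25450
```

Calling `model.summary()` on the model defined above is a handy way to check a count like this against what Keras itself reports.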

So how do we read the outputs? Well, there are ten outputs, one for each possible digit. The number the network is trained to calculate for each output is a score for how likely that label is to be correct (we'll turn these scores into proper probabilities in a moment). You can interpret the biggest of all the output numbers as the network's "best guess" for what the correct answer is. I put "guess" in quotes because there's no thinking or real guesswork involved, it's just a calculated estimate: input to output, deterministically. But as long as we're careful to remember that there's no thinking, "guess" is a pretty clear and evocative word.

We can actually retrieve a prediction like this:





In [13]:
model(image_train[:1])

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[ 0.23398848,  1.1608454 ,  0.59939647, -0.950471  , -0.43713567,
        -0.22372867,  0.03539851,  0.0704702 , -0.1868309 ,  0.18669273]],
      dtype=float32)>

In other words, a model is just a function that takes in inputs and gives out a list of lists of outputs.

Okay, but what, how on earth did we run the model if we haven't *trained* it yet? 

Well the secret is that every, and I do mean every, machine learning system initializes the weights to *something*. Typically they're set to small randomly assigned numbers, to provide the variation needed for the learning algorithm to be most effective.
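
To make this less mysterious, here is a toy version of the same idea in plain Python: a miniature network with 4 inputs, 3 hidden nodes, and 2 outputs, whose weights are filled in with small random numbers and never trained. Everything here (the sizes, the helper names, the range of the random numbers) is invented for illustration and is not how Keras initializes or stores its weights; the point is only that random weights are enough to turn inputs into outputs.

```python
import math
import random


def random_layer(n_inputs, n_nodes, rng):
    """One weight per (node, input) pair plus one bias per node."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
               for _ in range(n_nodes)]
    biases = [0.0] * n_nodes
    return weights, biases


def dense(inputs, layer, activation=None):
    """Weighted sum of the inputs for each node, then an optional activation."""
    weights, biases = layer
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(total) if activation else total)
    return outputs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)                  # fixed seed so reruns give the same numbers
hidden_layer = random_layer(4, 3, rng)
output_layer = random_layer(3, 2, rng)


def tiny_model(inputs):
    hidden = dense(inputs, hidden_layer, activation=sigmoid)
    return dense(hidden, output_layer)  # raw scores, no activation


print(tiny_model([0.0, 0.5, 1.0, 0.25]))
```

The outputs are meaningless until the weights are trained, but they are perfectly well-defined numbers, which is exactly what we saw from the real model.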

Now you'll notice that these numbers are both positive and negative and yet they're somehow related to the "guess" the model is making. These are called [logits](https://en.wikipedia.org/wiki/Logit) in the machine learning literature.

We can interpret this prediction of logits with something called the [softmax](https://en.wikipedia.org/wiki/Softmax_function) function, which takes these positive and negative numbers and then converts them into properly normalized probabilities.

You don't have to write the softmax function, thankfully, it already exists in any machine learning library: specifically in Tensorflow, which we're using in this tutorial, it's `tf.nn.softmax`.

So we can try this use of `model` again with:

In [17]:
tf.nn.softmax(model(image_train[:1]))

<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
array([[0.10312701, 0.26055613, 0.14861652, 0.03154774, 0.05271169,
        0.06525119, 0.0845524 , 0.08757041, 0.06770378, 0.09836309]],
      dtype=float32)>
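
For the curious, the calculation behind softmax is short enough to write out by hand. This sketch is a plain-Python stand-in, not TensorFlow's actual implementation: it exponentiates each logit so everything becomes positive, then divides by the total so the results add up to 1. Fed the ten logits printed earlier, it should reproduce the probabilities shown above to within rounding.

```python
import math


def softmax(logits):
    """Exponentiate each logit, then divide by the total so they sum to 1."""
    biggest = max(logits)                        # subtracting the max avoids overflow
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# The ten logits printed by the untrained model earlier.
logits = [0.23398848, 1.1608454, 0.59939647, -0.950471, -0.43713567,
          -0.22372867, 0.03539851, 0.0704702, -0.1868309, 0.18669273]

probabilities = softmax(logits)
print([round(p, 4) for p in probabilities])
print(sum(probabilities))  # 1.0, up to floating-point rounding
```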

Okay, so now we can at least read these like percentages but it would be nice to make it more readable still. We're going to do that with a Python function that goes through the entire list of results and finds the probability that's the largest and, then, returns both the number and the probability that the number is correct.

Exercise: try to step through the logic of the for-loop in the function in order to understand why it's written the way it is

In [32]:
def bestGuess(ls):
  ps = tf.nn.softmax(ls)
  maxp = 0
  maxi = -1
  for i in range(0,tf.size(ps)):
    p = ps[i]
    if p > maxp:
      maxp = p
      maxi = i
  return (maxi, maxp)

bestGuess(model(image_train[:1])[0])

(5, <tf.Tensor: shape=(), dtype=float32, numpy=0.9002729>)
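
Once the loop makes sense, it's worth knowing that finding the position of the largest entry is common enough to have a name: *argmax*. Here is a compact plain-Python sketch of the same logic, run on the softmax probabilities the untrained model produced earlier; `best_guess_compact` is a name made up here. TensorFlow has its own `tf.argmax` for doing this on tensors.

```python
def best_guess_compact(probabilities):
    """Return (index of the largest probability, that probability)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index, probabilities[best_index]


# The softmax output of the untrained model, copied from earlier.
probabilities = [0.10312701, 0.26055613, 0.14861652, 0.03154774, 0.05271169,
                 0.06525119, 0.0845524, 0.08757041, 0.06770378, 0.09836309]

print(best_guess_compact(probabilities))  # (1, 0.26055613)
```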

So now we're all set to actually train the model. The last thing we need to do is figure out what it means for a prediction to be right or wrong. In other words, to define the *loss* function.

If our model were calculating plain numeric values we'd use the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error), but since we're calculating probabilities over categories we want to use something called the, and bear with me here, [categorical cross entropy loss](https://gombru.github.io/2018/05/23/cross_entropy_loss/). That's a dense term, but all it really means is that it's a way to quantify how "wrong" a guess at categorizing something is.
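
To give a feel for the numbers, here is a minimal sketch of that loss for a single image, written in plain Python rather than taken from Keras: it is just the negative logarithm of the probability the model assigned to the correct digit. The function name is made up. The first example uses the untrained model's probabilities from earlier, where the true label 5 got only about 0.065; the second is a made-up confident prediction for comparison.

```python
import math


def cross_entropy_for_one_image(probabilities, correct_label):
    """The loss is small when the correct label got a high probability."""
    return -math.log(probabilities[correct_label])


# Untrained model's probabilities for the first training image (true label: 5).
untrained = [0.10312701, 0.26055613, 0.14861652, 0.03154774, 0.05271169,
             0.06525119, 0.0845524, 0.08757041, 0.06770378, 0.09836309]

# A made-up confident prediction: 0.9 on the digit 5, the rest spread evenly.
confident = [0.1 / 9] * 10
confident[5] = 0.9

print(round(cross_entropy_for_one_image(untrained, 5), 2))  # 2.73
print(round(cross_entropy_for_one_image(confident, 5), 2))  # 0.11
```

A perfect prediction (probability 1.0 on the right digit) would give a loss of 0, and the loss keeps growing as the probability on the correct answer shrinks toward 0.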

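To prepare our model for training we just need to do one last thing: compile it.

This turns our description of our model and our loss into a thing we can train. We use the *sparse* variant of the loss because our labels are plain integers like 5 rather than lists of ten probabilities, and `from_logits=True` tells the loss that our model outputs raw logits, so the softmax step is handled inside the loss. After compiling, `fit` runs the actual training, passing over the whole training set five times (`epochs=5`).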

In [26]:
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

In [27]:
model.fit(image_train, number_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f74c10ff550>

Alright, you've now trained a model!

Let's test it now on elements of our test set, which the model never saw during training.

We're going to write just a little more Python glue code (that is, code that ties together code from libraries) to print out a comparison of guesses with numbers!

In [34]:
for i in range(0,10):
  print(f'Our prediction for image {i} is {bestGuess(model(image_test[i:i+1])[0])}')
  print(f'The real number is {number_test[i]}')

Our prediction for image 0 is (7, <tf.Tensor: shape=(), dtype=float32, numpy=0.99907875>)
The real number is 7
Our prediction for image 1 is (2, <tf.Tensor: shape=(), dtype=float32, numpy=0.98654515>)
The real number is 2
Our prediction for image 2 is (1, <tf.Tensor: shape=(), dtype=float32, numpy=0.992543>)
The real number is 1
Our prediction for image 3 is (0, <tf.Tensor: shape=(), dtype=float32, numpy=0.9990615>)
The real number is 0
Our prediction for image 4 is (4, <tf.Tensor: shape=(), dtype=float32, numpy=0.9872649>)
The real number is 4
Our prediction for image 5 is (1, <tf.Tensor: shape=(), dtype=float32, numpy=0.9944832>)
The real number is 1
Our prediction for image 6 is (4, <tf.Tensor: shape=(), dtype=float32, numpy=0.9708294>)
The real number is 4
Our prediction for image 7 is (9, <tf.Tensor: shape=(), dtype=float32, numpy=0.9796934>)
The real number is 9
Our prediction for image 8 is (5, <tf.Tensor: shape=(), dtype=float32, numpy=0.66218424>)
The real number is 5
Our pred

Not bad for being so simple! If you look at the probabilities you'll probably see that most are above 90% while a few are much lower. This tells us that the model is generally very confident but there are ambiguous cases it has trouble with.
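
The side-by-side comparison can also be condensed into a single number, the fraction of guesses that were right. This is a small plain-Python sketch with a made-up helper name, run on the nine prediction/label pairs visible in the output above rather than on the full test set.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# The nine prediction/label pairs visible in the output above.
predicted = [7, 2, 1, 0, 4, 1, 4, 9, 5]
actual = [7, 2, 1, 0, 4, 1, 4, 9, 5]

print(accuracy(predicted, actual))           # 1.0
print(accuracy([7, 2, 1, 3], [7, 2, 1, 0]))  # 0.75
```

Nine images is far too few to judge a model by, so treat this as a pattern to apply to more of the test set rather than a result.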

Try visualizing one of the lower-confidence predictions with our `showImg` function from above:

In [39]:
bad_img = 8 #change this to the number of the image you want to visualize
showImg(image_test[bad_img])


b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b b b b b b b b b b b 
b b b b b b b b b b b b b b b b b w w w w w w w w b b 
b b b b b b b b b b b b b b b w w w w w w w w w w b b 
b b b b b b b b b b b b w w w w w w w w w w w w w b b 
b b b b b b b b b b b b w w w w w w w w w w w w w b b 
b b b b b b b b w w w b w w w w w w w w w b b b b b b 
b b b b b b b w w w w b b w b w w b b b b b b b b b b 
b b b b b b w w w w w b b b b b b b b b b b b b b b b 
b b b b b w w w w w b b b b b b b b b b b b b b b b b 
b b b b b w w w w b b b b b b b b b b b b b b b b b b 
b b b b b w w w b b b b b b b b b b b b b b b b b b b 
b b b b b w w w w w b b b b b b b b b b b b b b b b b 
b b b b b w w w w w w w w w w w w w b b b b b b b b b 
b b b b b w w w w w w w w w w w w w w b b b b b b b b 
b b b b b w w w w w w w w w w w w w w w b b b b b b b 
b b b b b