# How do Neural Nets work?  

Note: I am assuming that students have Anaconda installed on their machine, along with the standard libraries that come with it: numpy, matplotlib, and sklearn. Sklearn is not commonly used to build neural networks in real-world applications, but using it here was a design choice that lets people try out and learn about our method without having to install more complicated software.

Neural networks are a very popular method in machine learning.  There are many different architectures, but to demonstrate how to prune them, we will start with one of the simplest, the single hidden layer network.  


The central idea behind neural nets, especially historically, is that they are based on how we think information is processed in the brain, hence the name. Information is stored in the network in how we weight the connections between different neurons, and we train the network to perform tasks by changing these weights. A single hidden layer neural network means that data passes into a layer of neurons (the hidden layer), and then the output of those neurons goes to a second (not hidden) layer that produces the output.

In our example here, we are going to train a network to identify the digits 0-9 from images of hand-written numbers.  

We start by importing some simple data below. Each piece of data is an 8x8 image of a handwritten digit. These images are what we will call "X". Their labels we will call "y". Note that we split our data into a "train" and a "test" set. This lets us teach the network using some of our data, and then verify that it learned correctly using data that it has never seen before. This is a way for us to know whether the network learned how to recognize our digits, or whether it simply memorized the data it has already seen and cannot perform well on new data. The latter case is called over-fitting.


(Select the cell below, and press "shift" and "enter" together to run the cell.)  




In [None]:
#some imports
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn import datasets
np.set_printoptions(suppress=True)


#load the dataset from the library.  
digits = datasets.load_digits()

#display some examples of the dataset.  
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Training: %i' % label)

n_samples = len(digits.images)
    


#flatten each 8x8 image into a single row of 64 values.  These correspond to the 64 features of each data point.
data = digits.images.reshape((n_samples, -1))


X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)



Now we will define a neural network using sklearn, and train it to fit the data.  

First, we need to quantify how close the network is to fitting the data.  We do this with a "loss function".  This can take several forms.  When we want the network to look at data and return a number (for example, looking at housing data and returning a predicted price), we will commonly use the Mean Squared Error (MSE).  This likely looks familiar if you have ever done curve fitting.  

$MSE=\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$

where $y_i$ is the true value, and $\hat{y}_i$ is the value predicted by the network. We expect the MSE to decrease as the network gets closer to predicting the correct values.
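
As a small worked example of the MSE, here is a minimal sketch using numpy. The numbers are made up for illustration and are not taken from the digits data.

```python
import numpy as np

# Hypothetical true values and network predictions (illustrative only).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# Mean of the squared differences between predictions and true values.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
```

The squared differences are 0.25, 0, 1 and 1, so the mean is 0.5625.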

In this case, we want the network to sort the images into classes; for example, the first class would be all the written numbers the computer thinks are "0".  We encode the labels as probabilities; a label of "0" in our scheme would be encoded as [1,0,0,0,0,0,0,0,0,0] (one entry for each of the ten digits), meaning that there is a probability of 1 that the hand-drawn digit is a zero, and a probability of zero that it is anything else (this is called "one-hot encoding").  We then train the network to predict these probabilities.  Our metric for how closely the probabilities match is called the log-loss:  
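
To make the encoding concrete, here is a minimal sketch of one-hot encoding with numpy. The helper `one_hot` is a hypothetical name written for this example; sklearn encodes the labels internally, so you do not need to call anything like this yourself.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return one row per label, with a 1 in the column of that label (hypothetical helper)."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, n_classes), dtype=int)
    encoded[np.arange(labels.size), labels] = 1
    return encoded

# Encode the labels "0" and "7" for the ten digit classes.
encoded = one_hot([0, 7], n_classes=10)
print(encoded)
```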

$\text{log-loss}= -\sum_{i=1}^N \sum_{j=1}^M p_j(x_i)\log(q_j(x_i))$

for each of $N$ images (also called data points; the $i^{th}$ image in the set is labelled as $x_i$), with $M$ possible classes, where $p_j(x_i)$ is the probability that the true label for data point $i$ is class $j$, and $q_j(x_i)$ is what the network predicts for the probability that $x_i$ is in class $j$. As an example of the probabilities, let's say that $x_{323}$ is a drawing of the number 7. $p_0(x_{323})$ should be zero, because the true label of this image is "7" and not "0". Since the labels are one-hot encoded, $p_j(x_i)$ will always be either 0 or 1. $q_0(x_{323})$, on the other hand, is the probability the computer assigns to this hand-drawn 7 being a 0. This could be anything between 0 and 1.
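
Here is a minimal sketch of the log-loss for two made-up data points and three classes, written with the conventional leading minus sign so that the loss is positive and shrinks as the predictions improve. The arrays `p` and `q` are illustrative values, not output from the network. Implementations often also divide by the number of data points, which this sketch does not do.

```python
import numpy as np

# One-hot true labels p for two hypothetical images with three classes.
p = np.array([[1, 0, 0],
              [0, 0, 1]])
# Hypothetical predicted probabilities q (each row sums to 1).
q = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.3, 0.5]])

# Only the predicted probability of the true class contributes, since p is 0 elsewhere.
log_loss = -np.sum(p * np.log(q))
print(log_loss)
```

Only the terms $\log(0.8)$ and $\log(0.5)$ survive the sum, because every other term is multiplied by a zero in `p`.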

The "fit" method uses a variant of something called gradient descent, which calculates the gradient of the loss function with respect to all of the parameters in the network, and then attempts to minimize the loss by changing the parameters to move "downhill" in the loss.  In addition to the loss, we can also look at the accuracy: the fraction of the data points which have their correct class predicted by the network.
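
To illustrate the "downhill" idea, here is a minimal sketch of plain gradient descent on a toy loss with a single parameter. The loss, the starting value and the learning rate are all assumptions chosen for this example. By default, sklearn's MLPClassifier uses a more elaborate variant (the Adam solver) over many parameters at once.

```python
# Toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting value
learning_rate = 0.1   # step size, chosen by hand for this example
for step in range(100):
    w = w - learning_rate * grad(w)   # step against the gradient, i.e. "downhill"

print(w, loss(w))
```

After enough steps, `w` ends up very close to 3, where the toy loss is smallest.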

Run the cell below:  



In [None]:
#With its default settings, MLPClassifier is a neural network with one hidden layer of 100 neurons.
clf = MLPClassifier(random_state=1, max_iter=300)
print("Label for first data point:  {}".format(y_train[0]))

#train the network on the training set
clf.fit(X_train, y_train)
print('Class probabilities predicted by trained network:\n {}'.format(clf.predict_proba(X_train[0].reshape(1, -1)).round(6)))

#fraction of the test set that is classified correctly
print("Accuracy: {}".format(clf.score(X_test, y_test )))



We see that a relatively simple network with only two layers and 100 neurons in the hidden layer can perform well on this simple task.  

At their core, neural networks are mathematical models. We can show how to calculate the output of the network as a series of matrix multiplications.  

Each of the 100 neurons in the first layer has weights, which correspond to the "features" of the data (pixels, in this case).  Since the data points in this set are 8x8, there are 64 features and thus 64 weights.  We can find these weights from the model and write them as a matrix.

The shape of the weight matrix is written as (weights per neuron, number of neurons).


Run the code below.  



In [None]:
print("Shape of weight matrix")
print(clf.coefs_[0].shape)

print("Weight matrix")
print(clf.coefs_[0])


The data can be written as a matrix, where each data point has 64 features. 

In [None]:
print("Shape of data matrix: ")
print(X_train.shape)

The shape of the data matrix is written as (number of data points/images, number of features/pixels in each). Each feature also has a value associated with it, which describes the intensity of the pixel. Since these images are greyscale, every feature has a value from 0 to 16. (With the reversed grey colormap used in the plot above, 0 is drawn as white and 16 as black.)

Each of the neurons applies its weights to a data point by multiplying each weight by the value of the corresponding feature and summing the results.  Doing this for every neuron and every data point is the matrix multiplication $z=xW$.
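
Before running this on the real data, here is a minimal sketch with made-up numbers showing how the shapes combine. The names `x_small` and `W_toy` are hypothetical and kept separate from the variables used in the notebook cells.

```python
import numpy as np

# Toy data: 3 data points with 4 features each (made-up values).
x_small = np.array([[1.0, 0.0, 2.0, 1.0],
                    [0.0, 1.0, 0.0, 3.0],
                    [2.0, 2.0, 1.0, 0.0]])
# Toy weights: 4 weights per neuron, 2 neurons.
W_toy = np.array([[ 0.5, -1.0],
                  [ 1.0,  0.0],
                  [-0.5,  2.0],
                  [ 0.0,  1.0]])

z_toy = x_small @ W_toy   # shape (3, 2): one output per data point per neuron
# Entry [i, j] is the inner product of data point i with the weights of neuron j.
manual = np.dot(x_small[0], W_toy[:, 1])
print(z_toy.shape, z_toy[0, 1], manual)
```

The output for the first data point and the second neuron is $1\cdot(-1) + 0\cdot 0 + 2\cdot 2 + 1\cdot 1 = 4$, which matches the entry of the matrix product.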

Run the code below.  
(using numpy, the @ symbol means matrix multiplication)

In [None]:
W=clf.coefs_[0]
z=X_train@W
print(z)


print(z.shape)

Now we can see that each of the 100 neurons has an output for each of the 898 data points.  

Each neuron also has a bias, which it adds to its output for each data point.    

We can add this in numpy by broadcasting out the shape.  


$Z=xW+b$
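
As a minimal sketch of what broadcasting does here, the toy example below adds one bias per neuron to every row of a small matrix. The values and the names `xW_toy` and `b_toy` are made up for illustration.

```python
import numpy as np

xW_toy = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])   # 2 data points, 3 neurons
b_toy = np.array([10.0, 20.0, 30.0])   # one bias per neuron

# Broadcasting adds b_toy to every row, i.e. to the outputs for every data point.
Z_toy = xW_toy + b_toy
print(Z_toy)
```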

In [None]:
print("bias shape:")
print(clf.intercepts_[0].shape)
print("bias:")
print(clf.intercepts_[0])
b=clf.intercepts_[0]
print("adding bias:")
Z=X_train@W+b
print(Z)

So far, we have only done linear operations. However, our network applies a non-linear activation function to this output. The activation function is called ReLU, or Rectified Linear Unit.  It returns zero if the value of $Z$ is negative and leaves the value unchanged if it is positive. 

The value might be negative if the weights of a neuron and data point do not align well, i.e. their inner product is negative.  Intuitively, we do not want neurons that are very far away from "matching" the data point to be contributing to the classification.  The bias allows the network to decide how close the neuron and data point have to be to contribute.  

We can implement this simply using the np.maximum function.  

In [None]:
Z=np.maximum(Z, 0)

print(Z)

Note that all of the negative values have become zero. Applying the activation function can produce a matrix that is very different from the one before it. Because the activation function is non-linear, it does not in general commute with other operations such as matrix multiplication.  This will become important later.  
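
One way to see that the order of operations matters is the toy example below, a minimal sketch with made-up values. Applying ReLU after a matrix multiplication gives a different answer from applying it before. The names `relu`, `x_one` and `W_one` are hypothetical.

```python
import numpy as np

def relu(v):
    # Replace negative entries with zero, leave the rest unchanged.
    return np.maximum(v, 0)

x_one = np.array([[1.0, -1.0]])    # one data point with two features
W_one = np.array([[1.0],
                  [1.0]])          # one neuron with two weights

after = relu(x_one @ W_one)    # multiply first, then ReLU
before = relu(x_one) @ W_one   # ReLU first, then multiply
print(after, before)
```

Multiplying first gives $1 - 1 = 0$, while applying ReLU first removes the $-1$ and gives $1$.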



We can apply the same process to the second layer (the output layer). This layer has $M$ neurons since there are $M$ classes; the output need only tell us which image is in which class.  However, in this case we will use "softmax" $\sigma$ as the activation function rather than ReLU.  

$\sigma (\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^M e^{x_j}}$

Note that this makes the predicted probabilities of all the classes sum to 1.  
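
Here is a minimal sketch of a row-wise softmax in numpy. Subtracting the largest value in each row before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged but avoids overflow for large inputs. The name `stable_softmax` and the scores are made up for this example.

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # large values would overflow a naive exp
probs = stable_softmax(scores)
print(probs)
print(probs.sum(axis=1))
```

Each row of `probs` sums to 1, and the largest score in a row receives the largest probability.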

We will call the weights of the second layer $U$, and the biases of the second layer $\beta$.  
In matrix form:  

$\sigma ( ReLU( xW + b)U+ \beta)$
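
The whole expression can be sketched as a single function. The example below uses random weights rather than the trained ones, so its predictions are meaningless; it only shows how the shapes fit together. The function name `forward` and all of the random arrays are assumptions made for this illustration.

```python
import numpy as np

def forward(x, W, b, U, beta):
    """Compute softmax(ReLU(xW + b)U + beta), one row per data point."""
    hidden = np.maximum(x @ W + b, 0)                  # first layer with ReLU
    scores = hidden @ U + beta                         # second layer, before softmax
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # softmax over the classes

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(5, 64))      # 5 made-up data points, 64 features
W_rand = rng.normal(size=(64, 100))   # 100 hidden neurons
b_rand = rng.normal(size=100)
U_rand = rng.normal(size=(100, 10))   # 10 output classes
beta_rand = rng.normal(size=10)

probs_toy = forward(x_toy, W_rand, b_rand, U_rand, beta_rand)
print(probs_toy.shape)
```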

Putting this all together (run the code below):


In [None]:
U=clf.coefs_[1]
beta=clf.intercepts_[1]

print("Shape of output of first layer")
print(Z.shape)
print("Shape of second layer weights")
print(U.shape)

secondLayer=Z@U+beta

print("Output of second layer shape:  ")
print(secondLayer.shape)

def softmax(layer2):
    #exponentiate each entry, then normalize each row so that it sums to 1
    e=np.exp(layer2)
    normed=e/np.sum(e, axis=1, keepdims=True)
    return normed

output=softmax(secondLayer)



To check that we have done this correctly, we will check the output on the first data point:  


In [None]:
print("Network prediction")
print(clf.predict_proba(X_train[0].reshape(1, -1)).round(6))
print("Network prediction from our matrix multiplications")
print(output[0].round(6))

Finally, we want to compare how well our network does on data it has seen (the "training set") vs. data it has not seen (the "test set").  


In [None]:
print("Training Accuracy: {}%".format(round(clf.score(X_train, y_train)*100,2)))
print("Test Accuracy: {}%".format(round(clf.score(X_test, y_test)*100,2)))

How well the network performs on data it has not seen, compared with the data it was trained on, is what we refer to as generalization. This will become important later.  