# Elementary Neural Networks

In the introduction notebook we clarified what precisely we mean by a neural network and some key terms such as *node, neuron, activation function*. In this notebook we will begin to build more complicated and useful neural networks. The first neural network is a historical landmark which provided the basis for not only a useful network topology, but also a working mathematical model of memory: the Hopfield network. We will then cover the multilayer perceptron in more detail. The key learning outcomes of this notebook are: understanding how neural networks work in practice; developing a working model of memory; understanding the blueprint model for modern neural networks.

## 1.0 Hopfield and Tank: Associative Memory

The Hopfield and Tank network was a landmark development in biology and deep learning: for the first time it gave a working model of associative (non-addressable) memory as well as a generic method to train a neural network to classify objects which was, in some sense, provably reliable. The network can be formulated in a very biologically minded fashion, which is helpful because it allows us not only to develop useful computational tools, but also to draw biological insight.

The task the network aims to solve is: "Given a set of classifiable input data, map the data to their classification labels". We will assume the data is encoded in a vector $v \in \mathbb{R}^n$ and the classification labels in a vector $u \in \mathbb{R}^d$. Given the association $v_i \rightarrow u_i$, we aim to construct a matrix of weights $W$ and a vector of biases $b$ such that, for a given activation function $f$, we have $f(W v_i + b) = u_i$.

Let's simplify our problem a little bit. We know that the vector $u$ contains class labels, and a natural way to encode this is to let $d$ be the number of classes: we can then associate every unit vector with a class, e.g. (1,0,0) => apple, (0,1,0) => banana, (0,0,1) => pear. The simplest way to find a unit vector from our weights and biases estimate is to take the coordinate with the maximum value (convince yourself this gives the same answer for every strictly increasing activation function). We can further reduce the problem by assuming that $b = \vec{0}$. This allows us to think about the problem just in terms of the matrix $W$.
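
To make the encoding concrete, here is a minimal sketch. The class list, the helper names `onehot` and `decode`, and the raw output `u` are all made up for illustration; they are not part of the notebook's data.

```julia
# Hypothetical class list for illustration; any ordered collection of labels works.
classes = ["apple", "banana", "pear"]

# Encode a label as a unit (one-hot) vector of length d = number of classes.
onehot(label, classes) = Float64.(classes .== label)

# Decode a network output by taking the coordinate with the largest value.
decode(u, classes) = classes[argmax(u)]

u = [0.2, 1.7, -0.4]                  # an imagined raw network output

println(onehot("banana", classes))    # unit vector for "banana"
println(decode(u, classes))           # label of the largest coordinate
# A strictly increasing activation (here tanh) leaves the argmax unchanged.
println(argmax(tanh.(u)) == argmax(u))
```

Because only the position of the largest coordinate matters, the decoded class does not depend on which strictly increasing activation function we pick.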

### Training
Now, we know that the brain uses the Hebbian rule to learn, which is often colloquially summarised as "neurones that fire together, wire together". What this means is that neurones with strongly correlated activity patterns will have high connection weights. What does this mean for us? Well, we definitely want our network to identify the labelled data, and each training vector must be strongly correlated with itself. The Hebb rule then dictates that for a single training vector $v_i$ we should examine its autocorrelation $v_i v_i^T$. Fortunately, this is a matrix! If we assume that all the data are independent then the natural thing to do is to sum them all up! Thus, we arrive at:

$$ W = \sum_{i=1}^{|\text{data}|} v_i v_i^T $$ 

In [None]:
# Insert Hopfield and Tank here
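
As a stand-in until the cell above is filled in, here is a minimal sketch of the sum-of-outer-products rule. The three patterns are hand-picked toy vectors (an assumption, chosen to be orthogonal and of equal length so the result is easy to check), and `hebbian_weights` is a name invented for this example.

```julia
using LinearAlgebra

# Toy patterns chosen by hand (an assumption for illustration): mutually
# orthogonal and of equal length, so the result is easy to verify.
patterns = [
    [3.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0, 1.0],
]

# Hebbian rule: sum the outer products v_i * v_i' over all training vectors.
hebbian_weights(patterns) = sum(v * v' for v in patterns)

W = hebbian_weights(patterns)

println(size(W))                               # n-by-n weight matrix
println(issymmetric(W))                        # symmetric by construction
println(W * patterns[1] ≈ 10 .* patterns[1])   # each toy pattern is an eigenvector of W
```

For these orthogonal patterns, multiplying by $W$ simply rescales a stored pattern by its squared length (here 10). Real data will not be exactly orthogonal, so the stored vectors interfere with one another to some degree.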

We now need to put the class labels in the right place and for this we use our labelled training data. This is nothing more than a simple hashing routine.

In [None]:
# Insert hashing routine
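
One possible reading of this step, consistent with how `hash` is indexed by the output of `argmax` in the prediction routine later in this section, is a dictionary from "winning coordinate" to class label. The sketch below builds such a table on hand-picked toy data; `build_hash`, `label_hash` and the patterns themselves are assumptions made for illustration.

```julia
# Hand-picked toy data (hypothetical): orthogonal patterns with distinct peaks.
patterns = [
    [3.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0, 1.0],
]
labels = ["apple", "banana", "pear"]

W = sum(v * v' for v in patterns)    # Hebbian weights, as in the formula above

# For each training vector, find the coordinate the network settles on and
# store the label under that index.
function build_hash(W, patterns, labels; M = 20)
    WM = W^M                         # compute the matrix power once
    table = Dict{Int, String}()
    for (v, label) in zip(patterns, labels)
        table[argmax(WM * v)] = label
    end
    return table
end

label_hash = build_hash(W, patterns, labels)
println(label_hash)
```

Note the limitation of this simple scheme: if two classes settle on the same coordinate, the later label overwrites the earlier one, so the table is only as good as the separation between the stored patterns.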

### Classification

We have now trained our first neural network! How do we use it? First, we should understand a little more about what happened in training. While it seems quite simple and intuitive, some subtleties are at work behind the scenes. We have assumed that the energy landscape is organised by correlation, i.e. vectors that look similar to each other are close to each other in the vector space under some metric. We then started with a flat energy landscape and constructively added wells to it: we made a locally stable point for each $v_i$ through $v_i v_i^T$. These stable points have traditionally been likened to spin glass states from physics. What does this mean for us? If we are correct in our assumption that similar vectors are in some sense close to each other, then we have created an energy well near each of our class labels. To classify a new point we can rely on dynamical systems theory and simply iterate our input vector recursively until it reaches a stable point. The stable point it reaches will be one of the class labels (and hopefully the correct one!).

To find a stable point we simply feed the network output back into the network until the output vector stops changing. This amounts to pre-multiplying by $W$ several times: $v^{(1)}_k = W v^{(0)}_k$, $v^{(2)}_k = W v^{(1)}_k = W W v^{(0)}_k$, and so on. Therefore, for an unknown vector $v_k$ we predict the class by the routine:

$$ u_k = W^M v_k, $$

where $M$ is some suitably large number. Let's write a routine that uses our neural network to predict the class!

In [7]:
function class_predict(W, hash, v; M = 20)
    u = W^M * v
    uarg = argmax(u)
    class_label = hash[uarg]
    return class_label
end

class_predict (generic function with 1 method)
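
One practical caveat: the entries of `W^M * v` typically grow (or shrink) roughly like the $M$-th power of the largest eigenvalue of $W$, so for large weights or large $M$ they can overflow. Since `argmax` is unaffected by multiplying a vector by a positive number, we can rescale after every step instead. The sketch below is one way to do that; `iterate_normalised` and the toy data are assumptions for illustration, not part of the routine above.

```julia
using LinearAlgebra

# Feed the output back into the network M times, rescaling to unit length
# after every step so the entries cannot overflow.
function iterate_normalised(W, v; M = 20)
    u = float.(v)
    for _ in 1:M
        u = W * u
        n = norm(u)
        n == 0 && break        # the zero vector is a fixed point; stop early
        u = u / n
    end
    return u
end

# Small hand-made example (hypothetical data).
patterns = [[3.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 3.0]]
W = sum(p * p' for p in patterns)
v = [2.5, 1.2, 0.4, 0.1]       # a noisy version of the first pattern

# Both routes pick out the same coordinate.
println(argmax(iterate_normalised(W, v)) == argmax(W^20 * v))
```

The rescaled iterate points in the same direction as `W^M * v`, so it can be dropped into `class_predict` in place of the matrix power without changing the predicted label.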

It is always good practice to partition your dataset into a training subset and a validation subset. We have already done this, fortunately. Let's have a look at how well we did:

In [None]:
classifications=[]
for i in validation_data
    push!(classifications, (i[:label], class_predict(W, hash, i[:data])))
end
println(classifications)

We did remarkably well! Our neural network can read! The interpretation that we can make about the brain is that it uses neural networks to create *content addressable* memory. That is to say our brains, under this model, store things by clustering together those that have similar correlational properties. This is different from a computer, which stores data in a linear array and retrieves it through a pointer.

### Corruption

Let's now do some experiments to show off some of the remarkable properties of the Hopfield-Tank network. First, we will corrupt some input data by adding random noise to it. The noisy images are actually quite straining for us to process visually, but the network handles them with ease!

In [8]:
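randinds = rand(1:size(data, 1), 10)   # pick 10 random rows of the data table
# add standard normal noise to every entry of each selected data vector
vcorrupt = [data[i, :data] .+ randn(length(data[i, :data])) for i in randinds]
classes = data[randinds, :classes]
# Ref(...) makes broadcasting treat W and hash as single arguments
pred_classes = class_predict.(Ref(W), Ref(hash), vcorrupt)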
# println(hcat(classes, pred_classes))
# show corrupted data


That works better than one might expect on some simply corrupted data. I would encourage you to examine various different ways of corrupting data to get a flavour for how this is working. Now let's look at corrupting the network itself by deleting random nodes.

In [9]:
# delete nodes
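
As a stand-in for the empty cell above, here is a hedged sketch of the deletion experiment on synthetic data. Note that it uses the classical Hopfield recall rule (patterns with entries ±1, zero diagonal, a sign threshold at every step) rather than the linear `W^M * v` routine used earlier in this notebook. The random patterns, the sizes, the seed and the helper `recall` are all assumptions made for illustration.

```julia
using Random, LinearAlgebra

rng = MersenneTwister(1)            # fixed seed so the run is repeatable
n, P = 200, 3                       # neurons and stored patterns (toy sizes)
patterns = [rand(rng, [-1.0, 1.0], n) for _ in 1:P]

W = sum(v * v' for v in patterns)   # Hebbian weights
W[diagind(W)] .= 0.0                # classical convention: no self-coupling

# "Delete" half of the neurons by cutting every connection into and out of them.
dead = randperm(rng, n)[1:n ÷ 2]
alive = setdiff(1:n, dead)
W[dead, :] .= 0.0
W[:, dead] .= 0.0

# Corrupt the first pattern by flipping 10% of its entries.
probe = copy(patterns[1])
flip = randperm(rng, n)[1:n ÷ 10]
probe[flip] .*= -1

# Synchronous threshold updates; deleted neurons output 0 from the first step.
function recall(W, v; steps = 10)
    s = copy(v)
    for _ in 1:steps
        s = sign.(W * s)
    end
    return s
end

recovered = recall(W, probe)
# Compare the surviving neurons with the stored pattern.
println(recovered[alive] == patterns[1][alive])
```

With so few patterns stored in so many neurons, the surviving half of the network still carries enough of the correlations to pull a corrupted input back to the stored pattern. Try raising `P` or the fraction of deleted neurons to see where this breaks down.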

This is even more amazing than corrupting the data. We have taken our memory storage device (and classifier), deleted half of it, and it still works! In computer memory, by contrast, deleting even a single pointer can sometimes corrupt the entire drive. This implies that the content addressable neural network is *robust*: it can withstand modification. Think about what this implies for strokes and other forms of brain damage, and the biological advantages it confers over computer memory.

### Catastrophic Forgetting

This all seems too good to be true: why don't we just apply this method to all new labelled data in some centralised location? Then we would have robust storage of that data with good recall! Unfortunately, our network, as with all things, has limits. After a certain number of vectors have been encoded there is simply not enough informational space in the weights matrix to store them and the network begins to forget; quite catastrophically, in fact. Let's try to add too much data.

In [10]:
# code block
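
As a stand-in for the empty cell above, the sketch below overloads a small network on synthetic data. Like the classical analysis of Hopfield capacity, it uses random ±1 patterns, a zero diagonal and a sign threshold, which differs from the linear `W^M * v` routine used earlier in this notebook. For this classical variant a commonly quoted estimate is that recall degrades once the number of random patterns exceeds roughly $0.14\,n$. The sizes, the seed and the helper names are assumptions made for illustration.

```julia
using Random, LinearAlgebra

# Hebbian weights with zero diagonal (the classical Hopfield convention).
function hebbian(patterns)
    W = sum(v * v' for v in patterns)
    W[diagind(W)] .= 0.0
    return W
end

# Fraction of stored patterns that are left unchanged by one threshold update.
function stable_fraction(patterns)
    W = hebbian(patterns)
    return count(v -> sign.(W * v) == v, patterns) / length(patterns)
end

rng = MersenneTwister(1)                 # fixed seed so the run is repeatable
n = 200                                  # number of neurons (toy size)
random_patterns(P) = [rand(rng, [-1.0, 1.0], n) for _ in 1:P]

few  = stable_fraction(random_patterns(5))     # light load
many = stable_fraction(random_patterns(120))   # heavy load

println("stable with 5 patterns:   ", few)
println("stable with 120 patterns: ", many)
```

Under a light load every stored pattern should sit at a stable point; under a heavy load the patterns interfere so strongly that most of them are no longer stable, and the memories are effectively lost.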

While disappointing, this leads us to yet another biological insight: why we forget. We are continually learning and relearning things. We store these in some content addressable format, and so it makes sense under this model that we tend to forget things that become old and are not continually reinforced (trained).

This model is remarkable for a number of reasons: it is simple, it is performant, and (importantly) it offers key biological insights about how real brains may work. There are a number of extensions of the model which improve the storage capacity and training time, e.g. the Storkey network.

## 2.0 Multilayer Perceptrons: The Blueprint

In the previous notebook we encountered a challenge with the perceptron: it could not model the XOR gate. This was a substantial blow to early mathematical neuroscience, as models of the brain needed to be able to model logic (early thinking around AI assumed that the brain operated on primitive logical operations, like a computer; it doesn't). We will now demonstrate that we can solve our XOR problem with a multi-layered neural network. First, try this as an exercise!

In [None]:
# Insert XOR here
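
If you want something to compare your attempt against, here is one hand-coded solution. It is a sketch with hand-chosen weights and biases (one of many valid choices), using a step activation; `heaviside` and `xor_net` are names invented for this example.

```julia
# Heaviside step activation: fires (1) when the input is positive.
heaviside(x) = x > 0 ? 1 : 0

# Two hidden neurons feed one output neuron; all weights are set by hand.
function xor_net(x1, x2)
    h_or  = heaviside(x1 + x2 - 0.5)      # fires if at least one input is on
    h_and = heaviside(x1 + x2 - 1.5)      # fires only if both inputs are on
    return heaviside(h_or - h_and - 0.5)  # on when OR fires but AND does not
end

for (x1, x2) in [(0, 0), (0, 1), (1, 0), (1, 1)]
    println((x1, x2), " => ", xor_net(x1, x2))
end
```

The hidden layer is what makes this possible: neither the OR neuron nor the AND neuron can separate the XOR cases alone, but their difference can.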

So it is nice that we can fix the XOR problem, but it seems highly unlikely that intelligent brains operate in this strict logical fashion. In fact, our model for the Hopfield-Tank network seems to offer more realism. The Hopfield-Tank network allows all the neurons to communicate information to one another and is a recurrent neural network. Such networks exist in the brain but are not ubiquitous.

In fact, a more common structure is the *feed-forward* network where information comes in through a set of neurones (say the eye cells), and is transmitted forward for processing into another region (say the colliculus), and again into a further region (say the visual cortex) and so forth. We can imagine the brain as being constructed by layers of neurons. The key insight here is the layers and the informational flow and this is precisely what the XOR network is: two layers and two neurons. We can readily generalise the concept to arbitrary layers and neurons and in doing so find ourselves with another working model of the brain. This one is actually more general because we can readily add the recurrent connections found in the Hopfield network into every layer. For this reason, multi-layer perceptrons (MLP) are what we typically refer to when we say *Artificial Neural Network* (ANN). They are usually depicted like this:

[Picture]

As we add more and more layers, the network gets deeper and deeper. This is *deep learning*. It seems that it would be useful to have a computational structure that abstracts this complexity into its core definition. We probably want to change the weights and biases, so it should be a mutable structure. The connections are all-to-all (for now) in every layer. So we just need to specify the dimension of each layer and the activation function associated with it. We can also define a useful function that actually applies the network to data.

In [66]:
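mutable struct ANN
    W::Vector{Matrix{Float64}}     # one weight matrix per layer
    b::Vector{Vector{Float64}}     # one bias vector per layer
    f::Vector{Function}            # one activation function per layer
    # AbstractVector{<:Function} also accepts e.g. [tanh, tanh]
    function ANN(dims::Tuple, act::AbstractVector{<:Function})
        length(act) == length(dims) - 1 ||
            throw(ArgumentError("need one activation per layer after the input"))
        gen = 2:length(dims)
        # random initial parameters; W[k] maps layer k to layer k + 1
        W = [rand(dims[i], dims[i-1]) for i in gen]
        b = [rand(dims[i]) for i in gen]
        new(W, b, Function[a for a in act])
    end
end

function feed_forward(nn::ANN, data::Vector)
    u = data                       # start from the input data
    for i in 1:length(nn.W)
        # affine map followed by the layer's elementwise activation
        u = nn.f[i].(nn.W[i] * u + nn.b[i])
    end
    return u
end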

feed_forward (generic function with 2 methods)

We could make this more or less complicated. Perhaps you might like to write your own structure which takes *pre-trained* weights and biases and constructs the network from them. 
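
As one possible starting point for that exercise, here is a minimal sketch. `PretrainedNet`, `forward` and `relu` are names invented for this example and are not part of the `ANN` type above; the weights are set by hand so that the result is easy to check.

```julia
# Hypothetical companion to ANN: stores weights and biases supplied by the caller.
struct PretrainedNet
    W::Vector{Matrix{Float64}}
    b::Vector{Vector{Float64}}
    f::Vector{Function}
    function PretrainedNet(W, b, f)
        length(W) == length(b) == length(f) ||
            throw(ArgumentError("need one bias vector and one activation per layer"))
        for i in eachindex(W)
            size(W[i], 1) == length(b[i]) ||
                throw(DimensionMismatch("layer $i: rows of W and length of b differ"))
        end
        new(W, b, f)
    end
end

# Forward pass in the spirit of feed_forward: apply each layer in turn.
function forward(nn::PretrainedNet, x::AbstractVector)
    u = x
    for i in eachindex(nn.W)
        u = nn.f[i].(nn.W[i] * u .+ nn.b[i])
    end
    return u
end

# Example: a 2-2-1 network with hand-set parameters.
relu(x) = max(x, 0.0)
net = PretrainedNet(
    [[1.0 1.0; 1.0 1.0], [1.0 -2.0]],   # weight matrices, layer by layer
    [[0.0, -1.0], [0.0]],               # bias vectors, layer by layer
    [relu, identity],                   # activations, layer by layer
)

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0])
    println(x, " => ", forward(net, x))
end
```

These particular hand-set parameters happen to reproduce the XOR gate, which makes the output easy to verify by eye. Checking the shapes in the constructor catches mismatched weights and biases early rather than in the middle of a forward pass.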

Deeper networks are also more abstract and are able to disentangle higher order correlations, and so are typically more powerful. This is expected, since they are combinatorially more connected, but it also means they are harder to train. This is something we haven't touched on yet. With our perceptron example we were able to train by analogy with a routine that already worked in another domain. We had to hand-code our XOR example. These MLPs have many more parameters, so hand-tuning is not an option, and linear regression isn't possible here.

To train them we first need to realise that they are nothing more than functions (say $\vec{y}_* = f(\vec{x})$) from $\mathbb{R}^n$ to $\mathbb{R}^m$, parameterised by their weights and biases. The other thing we need is a task or an *objective*. This is provided by the *objective function*, also known as a *loss function*. The function typically takes the true labels and the network's predictions for the same inputs and combines them into a single score that quantifies how well the objective is being met. A common loss function is the mean squared error, defined by:

$$ L(f(\vec{x}), y(\vec{x})) = \frac{1}{N} \sum_{i=1}^N |f(\vec{x})_i - y(\vec{x})_i|^2. $$

This useful function measures how far apart two vectors are in the Euclidean sense (incidentally, this is often a very good proxy for other notions of distance). Can you think of what the loss function for the Hopfield-Tank network might have been? To train the network we want to optimise the loss function over the weights and biases, given the data. To do this we can draw on tools from optimisation, in particular *gradient descent*. This will be the subject of the next notebook.
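
To get a feel for the loss before the next notebook, here is a minimal sketch of the mean squared error, dividing the sum of squared differences by the number of components to make it a mean. The function name `mse` and the example vectors are assumptions made for illustration.

```julia
# Mean squared error between a prediction and its target.
mse(prediction, target) = sum(abs2, prediction .- target) / length(target)

target = [1.0, 0.0, 0.0]   # a unit-vector class label
good   = [0.9, 0.2, 0.1]   # imagined network output close to the target
bad    = [0.1, 0.8, 0.3]   # imagined network output far from the target

println(mse(good, target))
println(mse(bad, target))
println(mse(target, target))   # a perfect prediction scores zero
```

The closer the prediction is to the target, the smaller the score, so training amounts to finding the weights and biases that make this number as small as possible over the data.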