# An Introduction to Neural Networks

Neural networks are one of the hottest areas of machine learning. They attract the most press because their relationship to the human brain encourages further anthropomorphism of artificial intelligence. Neural networks aim to emulate the neurons of the brain with perceptrons, which are linked together in networks.

Humans still maintain the edge in the comparison of human and computer intelligence, and a look at the wetware specs of the brain alongside the hardware specs of computers gives us an idea as to why.

### Human Brain
* Neuron Switching Time: $0.001\ \text{s}$ ($1\ \text{ms}$)
* Number of Neurons: $10^{10}$
* Connections per Neuron: $10^4-10^5$
* Scene Recognition Time: $\sim 0.1\ \text{s}$

### Computer
* Transistor Switching Time: $\sim 333\ \text{ps}$ (3 GHz processor)
* Transistors per Chip: $8$ to $20$ billion
* Connections per Transistor: $\sim 10$
* Scene Recognition Time: No general scene recognition programme yet exists (the _Inception_ model is getting pretty close).

We may see that computer hardware wins the comparison hands down in switching time and is starting to win in terms of the number of transistors versus neurons, but the human brain seems far more powerful as a general-purpose computing unit at the moment. This power comes through the connections per neuron. It is currently not foreseen that computing hardware of the current paradigm will ever overtake the brain in this respect. The connections between neurons (albeit slow, exchanging information via synapses) are where the brain learns and maintains its computational superiority.

Furthermore, the brain manages this with a power consumption of roughly 20 watts, whereas the computers used to train and build the leading neural network models consume far more energy. The brain achieves massive parallelism through the connections from one neuron to another, and thereby achieves remarkable efficiency and understanding.

## The Neuron and the Perceptron
The software equivalent of the neuron is the perceptron. A single neuron is able to encode a single binary threshold, which could be sufficient for a simple binary classification model. The neuron either fires or doesn't. If the incoming stimulus (received through the dendrites) is sufficient to cause the neuron to fire, an action potential is sent down the axon, which then passes the stimulus on to the neurons connected to the firing neuron. This threshold will differ from neuron to neuron, as will the number and paths of the connections between neurons.

![Neuron Diagram](./img/neuron.png "A Neuron")
<center><span style="color:gray">*Source: Wikimedia Commons*</span></center>

The perceptron tries to emulate this mathematically by taking in a series of inputs, summing them (scaled by some weights) and comparing the result to a threshold.

![Perceptron Diagram](./img/perceptron.png "The Perceptron")
<center><span style="color:gray">*Source: Wikimedia Commons (Edited)*</span></center>

In the diagram the threshold is given by the constant $\theta$, which is also referred to as the *bias*. The inputs are given by the *n*-vector ***x***. The *x* values are multiplied by their respective weights and summed before the result is passed through the step function, which leads to a binary output ($1$ '*fire*' or $0$ '*do not fire*'). This step function is known as the *activation function* since it determines the rule as to whether or not the perceptron '*fires*'.

Mathematically the perceptron (in the case of the purest neuron emulation) undertakes the following calculation:

$$output = \begin{cases} 1 &\mbox{if } \left(\sum_{i=1}^n w_ix_i - w_0\right)>0 \\ 
0 & \mbox{otherwise} \end{cases}$$

The output of the perceptron can then be passed forwards to other perceptrons, akin to the action potential passing from the axon of one neuron to the dendrites of the next through the chemical mediation of a synapse (which is very slow by computing standards). In this way perceptrons are built up to generate a neural network.

Taking the single perceptron as an example, we may see that it characterises a hyperplane which implements a binary classification. The orientation of the hyperplane is determined by the weights vector $\boldsymbol{w}$ and its position is given by the bias $w_0 = \theta$. The two-dimensional case leads to classification according to a single straight line, which clearly shows the limitations of the perceptron as a classifier: it can only separate two regions, and only then in a linear way. The most often cited example of what this cannot deal with is the case of parity (a.k.a. the *exclusive-or* or `XOR` function). However, we will see later that neural networks can get around this issue to be one of the most flexible learners out there.
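
To make the limitation concrete, the following sketch brute-forces a small grid of candidate weights for a two-input perceptron and checks whether any of them reproduces `AND` and `XOR`. The helper names `fires` and `separable_on_grid` and the search grid are assumptions made for illustration. A failed grid search on its own would only be suggestive, but it agrees with the known result that `XOR` is not linearly separable.

```python
from itertools import product

# The four possible inputs of a two-input binary function, with their targets.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def fires(w0, w1, w2, x1, x2):
    # Fire when the weighted sum of the inputs exceeds the threshold w0.
    return int(w1 * x1 + w2 * x2 - w0 > 0)

def separable_on_grid(targets, grid):
    # Try every (w0, w1, w2) on the grid and report whether one fits all four points.
    for w0, w1, w2 in product(grid, repeat=3):
        outputs = [fires(w0, w1, w2, x1, x2) for x1, x2 in POINTS]
        if outputs == targets:
            return True
    return False

# Hypothetical search grid: values from -2 to 2 in steps of 0.25.
GRID = [i / 4 for i in range(-8, 9)]

print('AND solvable:', separable_on_grid(AND, GRID))
print('XOR solvable:', separable_on_grid(XOR, GRID))
```

For `AND` a single line such as $x_1 + x_2 = 1.5$ separates the one positive point from the rest, whereas for `XOR` the two positive points sit on opposite corners and no single line can isolate them.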

The 'learning' of anything in neural networks is mediated through the weights. The weights encode all knowledge. The learning process begins with the weights initialised at small random numbers and then, as examples are seen and errors noted, the weights are updated in order to reduce the error. This can be thought of as the strengthening of pathways in the brain. The more a path is used to predict the output, the greater the magnitude of the weight of that link in the network. This comparison is more clear-cut in the case of a neural network where the inputs to one perceptron are the outputs of other perceptrons.

## Coding A Perceptron
We will use Python to set up a perceptron and gain an idea of how to train it. To do this we will use NumPy, the go-to package for scientific computing in Python.

There are many packages for building neural networks in Python, the most popular being scikit-learn, Keras, Theano and TensorFlow; however, we will initially build our models from scratch for pedagogic suitability.

In [1]:
import numpy as np
import pandas as pd

In keeping with the idea that a perceptron simply learns a hyperplane, we will set up a dataset that is based on a hyperplane and try to learn it with a single perceptron.

#### Creating a Dataset
To set up the dataset we will simply take points in three-dimensional space and classify them according to the hyperplane defined below. Below the hyperplane the points will be coded as 0 and above they will be encoded as 1.

$$-1.8x+0.7y+2.1z=0.5$$

The dataset will be created by generating random sets of numbers and then determining the output according to the hyperplane above.

In [2]:
x = np.random.uniform(size=(10000))
y = np.random.uniform(size=(10000))
z = np.random.uniform(size=(10000))
output = -1.8*x + 0.7*y + 2.1*z > 0.5
output = output.astype(np.float32)

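Each of $x$, $y$ and $z$ is uniform on $[0, 1]$ and therefore symmetric about $0.5$, so the expression $-1.8x+0.7y+2.1z$ is symmetric about its mean of $-0.9+0.35+1.05=0.5$. All being well, the dataset should therefore yield approximately 50% positive and 50% negative outcomes. Let us place all of the data in a Pandas DataFrame so that we may see and manage the data in tabular form.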

In [3]:
data = pd.DataFrame(data={'x':x,'y':y,'z':z,'Target':output})
data = data[['x','y','z','Target']]
data.head()

|   | x | y | z | Target |
|---|---|---|---|---|
| 0 | 0.137614 | 0.338238 | 0.967687 | 1.0 |
| 1 | 0.774698 | 0.138471 | 0.833223 | 0.0 |
| 2 | 0.048048 | 0.931693 | 0.506675 | 1.0 |
| 3 | 0.238811 | 0.333215 | 0.58125 | 1.0 |
| 4 | 0.966882 | 0.312617 | 0.409213 | 0.0 |
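
Before moving on, the class balance of such a dataset can be checked directly. The sketch below regenerates the data in a self-contained way. The fixed seed and the use of `np.random.default_rng` are assumptions made for repeatability (the cells above are unseeded), so the exact figures will differ from those of the notebook run.

```python
import numpy as np

# Fixed seed (an assumption) so that the check is repeatable.
rng = np.random.default_rng(0)
n = 10000
x, y, z = rng.uniform(size=(3, n))

# Same labelling rule as the hyperplane defined above.
target = (-1.8 * x + 0.7 * y + 2.1 * z > 0.5).astype(np.float32)

# The mean of a 0/1 label is the fraction of positive examples.
positive_fraction = target.mean()
print('Positive: %.1f%%, negative: %.1f%%'
      % (100 * positive_fraction, 100 * (1 - positive_fraction)))
```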


Now that we have our dataset and an idea of the target function we may set up our perceptron. We initialise the weights vector and then, for simplicity, build a function which will generate a prediction given the input vector and the weights vector. The weights vector includes the bias (hence it is of size `NUM_INPUTS + 1`). It is initialised to small random values as we do not wish to impose any significant initial model or relationship before the model has been shown data.

In [4]:
NUM_INPUTS = 3
weights = np.random.normal(0, 0.2, size=NUM_INPUTS + 1)

When generating a prediction we must generate a binary output. This is done by comparing the sum of the products of the weights and the inputs to the threshold. This threshold is learned and is given by $w_0$. This is why $-1$ is prepended to the input vector: the threshold is then subtracted from the sum of the other weighted inputs, so that a generic comparison to zero may be undertaken.

In [5]:
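def perceptron_prediction(input_vector, weights):
    # Prepend -1 so that weights[0] acts as the threshold w_0.
    inputs = [-1, *input_vector]
    # Fire (1.0) only if the weighted sum exceeds the threshold.
    prediction = float(np.dot(inputs, weights) > 0)
    return prediction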

If the perceptron manages to learn the model exactly then we will expect the following weights vector.

$$\boldsymbol{w}=\begin{bmatrix}w_0 \\ w_1 \\ w_2 \\ w_3\end{bmatrix}=
\left[
 \begin{matrix}
  0.5  \\
  -1.8 \\
  0.7  \\
  2.1
 \end{matrix}
\right]$$
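
As a sanity check, the sketch below applies this weights vector to freshly generated points and compares the predictions with the labels given by the hyperplane. The prediction function is repeated so that the block runs on its own. The seed, the sample size and the `scaled_weights` vector are assumptions made for illustration. Agreement should be complete, up to floating-point rounding for points lying almost exactly on the hyperplane.

```python
import numpy as np

def perceptron_prediction(input_vector, weights):
    # Same rule as the function defined above: prepend -1 so w_0 is the threshold.
    inputs = [-1, *input_vector]
    return float(np.dot(inputs, weights) > 0)

rng = np.random.default_rng(0)  # seed chosen only for repeatability
points = rng.uniform(size=(1000, 3))
targets = (points @ np.array([-1.8, 0.7, 2.1]) > 0.5).astype(float)

exact_weights = np.array([0.5, -1.8, 0.7, 2.1])
scaled_weights = 3.0 * exact_weights  # hypothetical positive rescaling

agreement = {}
for name, w in [('exact', exact_weights), ('scaled', scaled_weights)]:
    predictions = np.array([perceptron_prediction(p, w) for p in points])
    # Fraction of points on which the perceptron matches the hyperplane label.
    agreement[name] = (predictions == targets).mean()
    print(name, 'agreement:', agreement[name])
```

The rescaled vector is included because multiplying every weight by the same positive constant does not change which side of zero the weighted sum falls on. A trained perceptron may therefore recover the hyperplane with weights that are a positive multiple of the vector above rather than these exact values.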

## Perceptron Training
In machine learning, as in life, we learn from our mistakes. Models are improved by identifying errors and reducing them by manipulating the parameters of our model. In the case of the perceptron we will update our weights to reduce the error.

The weights are first initialised to small non-zero values. They must be non-zero as all zero entries will lead to a zero output regardless of inputs and therefore the initial step does not enable us to distinguish which weights are most responsible for the initial inaccuracy. We also initialise to random small numbers so as not to impose any significant presupposed relationship to start learning from (as would be the case with large initial weights). The use of randomness means that, in the case that there are local minima, our model may converge to a different result on each run rather than finding the true global optimum. Running several training sessions with random initialisations enables comparison of the final models, in the hope of finding the global minimum of the loss function.

### The Perceptron Training Rule
As one may expect given the explanation above the perceptron training rule relates the size of the errors to the changes in the weights needed. The weights are updated in turn by the required change $\Delta w_i$ where the update is as given below.

$$\Delta w_i=\eta\ (t - o)x_i$$

Where $\eta$ is the 'learning rate', $w_i$ is the weight being trained, $t$ is the target value (i.e. the label for this observation), $o$ is the output of the perceptron with the current weights and $x_i$ is the $i^{th}$ element of the input vector $\boldsymbol{x}$ which, when scaled by $w_i$, contributes to the generation of the output $o$. For the threshold weight $w_0$ the corresponding input is the constant $x_0 = -1$ that is prepended to the input vector.

The error term $(t - o)$ ensures that the weights are only updated if they need to be and that, when they are updated, they move in the right direction. The inclusion of the input element $x_i$ further ensures that learning occurs in the right direction since inputs may in practice be positive or negative. Since the weights enter the prediction equation multiplicatively, the sign of the inputs is important in determining the sign of the influence on the output.
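
A single update can be worked through by hand. In the sketch below the current weights, the input and the target are all hypothetical values chosen for illustration. The input vector is extended with the constant $-1$ so that the same rule also updates the threshold $w_0$.

```python
import numpy as np

eta = 0.05
w = np.array([0.2, 0.1, -0.3, 0.4])   # hypothetical current weights, w_0 first
x = np.array([0.5, 0.2, 0.9])         # hypothetical input vector
t = 0.0                               # hypothetical target label

inputs = np.concatenate(([-1.0], x))  # the constant -1 carries the threshold
weighted_sum = np.dot(inputs, w)      # -0.2 + 0.05 - 0.06 + 0.36 = 0.15
o = float(weighted_sum > 0)           # the perceptron fires, so o = 1

delta_w = eta * (t - o) * inputs      # the training rule for every weight at once
w_new = w + delta_w

print('output:', o)
print('delta_w:', delta_w)
print('weighted sum before and after:', weighted_sum, np.dot(inputs, w_new))
```

Here the perceptron fires when it should not, so $(t - o) = -1$. The threshold is raised and the weights on the positive inputs are lowered, and the weighted sum for this example drops from $0.15$ to $0.045$. The example is still misclassified after one step, but the weights have moved in the right direction.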

The size of the learning rate is one of the hyperparameters that one can play with to improve a model. Since the learning rate $(\eta)$ determines the size of the step taken towards the minimum, too large an $\eta$ can lead to overshooting the global minimum and to being unable to reach it with sufficient precision, attaining final weights only to a precision of about $\eta$. Too small a learning rate may lead to very slow convergence if the initialisation is far from the optimum, and in the case of real noisy data the model is likely to overfit.

Overfitting is a major concern in machine learning. It is the problem of learning and then drawing on relationships in the training data that are specific to that sample and not generally applicable to the population as a whole. This leads to in-sample tests showing high accuracy, but when the model is applied to out-of-sample cases it is inaccurate since the special-case relationships of the sample no longer hold true.

In our case, since the relationship is not noisy (by construction), we cannot overfit and hence we will take a small value of $\eta$ to hopefully get close to learning the true relationship. In practical applications decades of field experience and research have led to the rule of thumb that a learning rate of $0.1$ is generally a good place to start.

We may now train our perceptron.

In [6]:
eta = 0.05
w = np.round(np.random.normal(size=4, scale = 0.5), 2)

In [7]:
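def loss_function(data, weights, prediction_fn):
    # The last column holds the target; all other columns are inputs.
    inputs = data.iloc[:, :-1].to_numpy()
    targets = data.iloc[:, -1].to_numpy()
    predictions = np.array([prediction_fn(input_x, weights) for input_x in inputs])
    # Despite the function's name, the value returned is the percentage of correct predictions.
    accuracy = (predictions == targets).mean() * 100
    return accuracy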

In [8]:
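def train_perceptron(prediction_fn, training_data, eta, weights):
    # A single pass over the data: each row triggers one update of the weights.
    w = np.array(weights, dtype=float)
    inputs = training_data.iloc[:, :-1].to_numpy()
    targets = training_data.iloc[:, -1].to_numpy()
    for step, (x, y) in enumerate(zip(inputs, targets)):
        # Predict once per example, before any weight is changed.
        prediction = prediction_fn(x, w)
        # The threshold weight w[0] sees the constant input -1, as in the prediction.
        inputs_with_bias = np.array([-1, *x])
        w = w + eta * (y - prediction) * inputs_with_bias
        # Scoring the whole dataset is slow, so report only every 1000 examples.
        if step % 1000 == 0:
            print('Current accuracy: %.2f%%' % loss_function(training_data, w, prediction_fn))
    return w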
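
Putting the pieces together, the sketch below runs one pass of the training rule over a freshly generated dataset. It is a standalone, array-based variant rather than the DataFrame-based functions above. The names `predict` and `train_one_pass`, the fixed seed and the choice to compare unit-length weight vectors are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption, chosen for repeatability

# Rebuild the dataset: 10000 points labelled by the hyperplane used above.
points = rng.uniform(size=(10000, 3))
targets = (points @ np.array([-1.8, 0.7, 2.1]) > 0.5).astype(float)

# Prepend the constant -1 input so that w[0] plays the role of the threshold.
inputs = np.hstack([-np.ones((len(points), 1)), points])

def predict(inputs, w):
    # Vectorised perceptron output for every row of inputs.
    return (inputs @ w > 0).astype(float)

def train_one_pass(inputs, targets, eta, w):
    # Apply the perceptron training rule once per example, in order.
    w = np.array(w, dtype=float)
    for x, t in zip(inputs, targets):
        o = float(x @ w > 0)
        w += eta * (t - o) * x
    return w

eta = 0.05
initial_w = rng.normal(0, 0.2, size=4)
trained_w = train_one_pass(inputs, targets, eta, initial_w)

accuracy_before = (predict(inputs, initial_w) == targets).mean()
accuracy_after = (predict(inputs, trained_w) == targets).mean()
print('Accuracy before: %.2f%%, after: %.2f%%'
      % (100 * accuracy_before, 100 * accuracy_after))

# Compare directions only, since the overall scale of the weights is arbitrary.
true_w = np.array([0.5, -1.8, 0.7, 2.1])
print('Learned direction:', trained_w / np.linalg.norm(trained_w))
print('Target direction: ', true_w / np.linalg.norm(true_w))
```

Multiplying every weight by the same positive constant leaves the classification unchanged, so the learned weights are best compared with the target after normalising both to unit length. A single pass is not guaranteed to classify every point correctly, and repeating the pass or lowering $\eta$ are natural things to try from here.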