# Introduction to neural networks
## Perceptrons and logic functions

A perceptron (see the figure below) is a binary classifier that can take any number of inputs. As a first example we consider a single perceptron, how it represents a boolean function, and how it can be trained to do so.

<img src="percept_img/PerceptronSymbol.png" width="400">

In theory a perceptron can have any number of inputs, but here we use only two, $x_1 \in \{0, 1\}$ and $x_2 \in \{0, 1\}$, to represent the simplest boolean functions with output $y \in \{0,1\}$.

Each perceptron input $x_1$, $x_2$ has an associated weight $w_1$, $w_2$, and the output $y$ is determined by passing the weighted sum $\sum_i w_i x_i$ through an activation function, given as follows:

\begin{equation}\label{eq:}
y = 
\begin{cases}
 1 & \mathrm{if} \sum_i w_i x_i \geq \theta_1 \\  
 0 & \mathrm{if} \sum_i w_i x_i < \theta_1 \\ 
\end{cases}
\end{equation}

Here, $\theta_1$ is the activation threshold, or bias, of the activation function, which is effectively a step function. Because the input space is small, there are only 4 combinations of inputs. We want to see if we can train this node to become a boolean AND function.
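As a minimal sketch of the step activation above, the snippet below evaluates a single perceptron on all four input combinations. The function name `step_perceptron` and the parameter values ($w_1 = w_2 = 1$, $\theta_1 = 1.5$) are illustrative assumptions; many other choices give the same truth table.

```python
def step_perceptron(x, w, theta):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if weighted_sum >= theta else 0

# All four combinations of two binary inputs
inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]

# Assumed, illustrative parameters (not the only valid choice)
weights = (1.0, 1.0)
theta = 1.5

truth_table = [step_perceptron(x, weights, theta) for x in inputs]
for x, y in zip(inputs, truth_table):
    print(x, "->", y)
```

With these values only the input $(1, 1)$ reaches the threshold, so the node behaves as an AND function.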

## The Basic boolean functions AND/OR

### Problems
- Manually tune the parameters of the single perceptron to create a logic AND function and then a logic OR.
- Can you output the same if you keep the bias $\theta$ at 0?


In [None]:
from __future__ import print_function
import os
import matplotlib.pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [None]:
def activation_function(x,w,theta):
    wtsum = np.dot(x,w)
    output = 0
    if (wtsum >= theta): output=1
    return output

%matplotlib inline
def boolean_matrix(w1,w2,theta):
    #Print matrix
    #os.system('clear')
    print('+-----+-----+-----+')
    print('+ x_1 + x_2 +  y  +')
    print('+-----+-----+-----+')
    print('+  0  +  0  +  '+str(activation_function([0,0],[w1,w2],theta))+'  +')
    print('+-----+-----+-----+')
    print('+  1  +  0  +  '+str(activation_function([1,0],[w1,w2],theta))+'  +')
    print('+-----+-----+-----+')
    print('+  0  +  1  +  '+str(activation_function([0,1],[w1,w2],theta))+'  +')
    print('+-----+-----+-----+')
    print('+  1  +  1  +  '+str(activation_function([1,1],[w1,w2],theta))+'  +')
    print('+-----+-----+-----+')
    
    fig = plt.figure(1)
    plt.plot([0],[0],'ok')
    plt.plot([1],[0],'ok')
    plt.plot([0],[1],'ok')
    plt.plot([1],[1],'ok')
    plt.grid()
    plt.xlabel('x_1')
    plt.ylabel('x_2')
    plt.xlim(-0.5,1.5)
    plt.ylim(-0.5,1.5)
    plt.title('Hyperplane')
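    # Decision boundary: w1*x_1 + w2*x_2 = theta, drawn across the whole plot range
    x1 = np.linspace(-0.5, 1.5, 51)
    if w2 != 0:
        # Solve the boundary equation for x_2 as a function of x_1
        plt.plot(x1, (theta - w1*x1)/w2, 'k')
    elif w1 != 0:
        # w2 == 0: the boundary is the vertical line x_1 = theta/w1
        plt.plot([theta/w1, theta/w1], [-0.5, 1.5], 'k')
    # If both weights are zero there is no boundary to draw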

In [None]:
interactive(boolean_matrix,
            w1=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=0.0, continuous_update=False),
            w2=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=0.0, continuous_update=False),
            theta=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=1.0, continuous_update=False)
           )

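With the bias set to zero we struggle to define our functions. This makes sense, since the hyperplane separating the coordinates with value 1 from those with value 0 must then go through the origin. In fact, with $\theta = 0$ the input $(0, 0)$ always gives $0 \geq 0$ and hence output 1, which neither AND nor OR allows. Let us now try to train the perceptron using "the perceptron rule".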

The change in weights is expressed as the rule

\begin{equation*}
\Delta w_i = \eta (y-\hat{y}) x_i,
\end{equation*}

where $\eta$ is the training rate, $y$ is the output as determined by training data, while $\hat{y}$ is the estimated output.
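The rule can be sketched in a few lines of plain Python. The example below is an illustration under stated assumptions: it trains on the complete AND truth table, keeps the threshold fixed at $\theta = 0.5$, starts from zero weights, and uses $\eta = 0.05$. The helper names `predict` and `train_perceptron` are hypothetical.

```python
def predict(x, w, theta):
    """Step activation: 1 if the weighted sum reaches theta, else 0."""
    return 1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= theta else 0

def train_perceptron(samples, w, theta, eta=0.05, epochs=20):
    """Apply the perceptron rule dw_i = eta * (y - y_hat) * x_i to every sample."""
    w = list(w)
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(x, w, theta)
            for i, x_i in enumerate(x):
                w[i] += eta * error * x_i
    return w

# Complete truth table of the AND function: ((x_1, x_2), y)
and_samples = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]

theta = 0.5  # threshold kept fixed; only the weights are trained
w_trained = train_perceptron(and_samples, w=(0.0, 0.0), theta=theta)
learned = [predict(x, w_trained, theta) for x, _ in and_samples]
print("weights:", w_trained)
print("outputs:", learned)
```

Only the point $(1, 1)$ is misclassified at the start, so both weights grow in equal steps until their sum reaches the threshold, after which the error is zero and the weights stop changing.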

Let us provide the single training point $(x_1, x_2) = (1, 1)$ with target output $y = 1$, and see whether we manage to get an AND function from this data point alone if we start with weights equal to 2 and a bias $\theta = 0.5$.

In [None]:
def training_single(y,x1,x2,w1,w2,iterations):
    #Assuming y, and x can be lists of coordinates and outputs
    for i in range(0,iterations):
        try:
            for ind,element in enumerate(y):
                dw = deltaw(y[ind],[x1[ind],x2[ind]],[w1,w2])
                w1 += dw[0]
                w2 += dw[1]
        except TypeError:
            # y is a single number rather than a list of outputs
            dw = deltaw(y,[x1,x2],[w1,w2])
            w1 += dw[0]
            w2 += dw[1]
    return w1,w2

def deltaw(y,x,w):
    trate = 0.05
    return trate*(y - activation_function(x,w,0.5))*x[0],trate*(y - activation_function(x,w,0.5))*x[1]

In [None]:
print("Initial state")
boolean_matrix(0.0,0.0,0.5)

#Specify training data
y = [1,1]# ,0,0,0]  # outputs must be 0 or 1
x1 = [0,1]#,1,0,0]
x2 = [1,0]#,0,1,0]

#Initial weights
w1 = 0.0
w2 = 0.0

w1, w2 = training_single(y,x1,x2,w1,w2,iterations=100)
print('Weights (w1,w2):',w1, w2)
boolean_matrix(w1,w2,0.5)

Try specifying a new set of points and train an OR function with the fewest points possible. Try changing the number of iterations between 1 and 100. The training rate $\eta$ is set to 0.05.

A single perceptron cannot represent the exclusive or (XOR) function, however, as this requires either two hyperplanes or a curved one. Let us see if we can create one by adding another node to form a network, since we know that any boolean function can be built by combining AND, OR and NOT functions.
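The claim can be probed numerically. The sketch below scans a grid of weights and thresholds over the same range as the sliders ($[-2, 2]$, here with an assumed step of 0.25) and counts how many parameter sets reproduce each truth table. A grid search is not a proof, but it illustrates that AND and OR are easy to hit while XOR never appears.

```python
from itertools import product

def step_perceptron(x, w, theta):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    return 1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= theta else 0

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
targets = {
    "AND": [0, 0, 0, 1],
    "OR": [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

# Candidate values in [-2, 2]; multiples of 0.25 are exactly representable floats
grid = [k * 0.25 for k in range(-8, 9)]

solutions = {name: 0 for name in targets}
for w1, w2, theta in product(grid, repeat=3):
    table = [step_perceptron(x, (w1, w2), theta) for x in inputs]
    for name, target in targets.items():
        if table == target:
            solutions[name] += 1

print(solutions)
```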

## The XOR function

<img src="percept_img/PerceptronXOR.png" width="400">
 
Keeping the parameters of the first node at those of an AND function (for example: $w_1=0.5, w_2=0.5, \theta_1=1$; a threshold of 0.5 would give an OR, since the activation uses $\geq$) and adding a second perceptron, we can create the XOR. Use the sliders below to tune the XOR.
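One possible parameter choice can be checked directly in code. In this sketch the first node is an AND with $w_1 = w_2 = 0.5$ and $\theta_1 = 1$, and the second node receives $x_1$, $x_2$ and $y_1$ with assumed weights $1$, $1$ and $-2$ and threshold $\theta_2 = 0.5$. These values are illustrative assumptions, not the only solution.

```python
def step(weighted_sum, theta):
    """Step activation: 1 if the weighted sum reaches theta, else 0."""
    return 1 if weighted_sum >= theta else 0

def xor_network(x1, x2):
    # First node: AND of the two inputs
    y1 = step(0.5 * x1 + 0.5 * x2, 1.0)
    # Second node: sees both inputs and y1; the negative weight on y1
    # switches the output off again when both inputs are 1
    y2 = step(1.0 * x1 + 1.0 * x2 - 2.0 * y1, 0.5)
    return y1, y2

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    y1, y2 = xor_network(x1, x2)
    print(x1, x2, "-> y_1 =", y1, ", y_2 =", y2)
```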

In [None]:
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D

def boolean_matrix_XOR(w1,w2,w3,w4,w5,theta1,theta2):
    #Print matrix
    #os.system('clear')
    w0 = [w1,w2]
    w = [w3,w5,w4]
    xset = np.zeros([4,2])
    xset[:,0] = [0,1,0,1]
    xset[:,1] = [0,0,1,1]
    
    print('+-----+-----+-----+-----+')
    print('+ x_1 + x_2 + y_1 + y_2 +')
    for i in range(0,len(xset[:,0])):
        xpoint = xset[i,:]
        y1 = activation_function(xpoint,w0,theta1)
        xpoint = np.append(xpoint,y1)
        #print(xpoint,w)
        print('+-----+-----+-----+-----+')
        print('+  '+str(int(xpoint[0]))+'  +  '+str(int(xpoint[1]))+'  +  '+str(y1)+'  +  '+str(activation_function(xpoint,w,theta2))+'  +')
        print('+-----+-----+-----+-----+')
    
    x1=np.linspace(-0.5,1.5,51)
    x2=x1
    
    xx1, xx2 = np.meshgrid(x1,x2)

    # calculate corresponding y
    if(w4 == 0):
        y = (theta2-xx1*w3-xx2*w5)/1.e-2
    else:
        y = (theta2-xx1*w3-xx2*w5)/w4
    
    # plot the surface
    #plt3d = plt.figure().gca(projection='3d')
    #plt3d.plot_surface(xx1, xx2, y, alpha=0.2)
    
    #plt3d.set_xlabel('x_1')
    #plt3d.set_ylabel('x_2')
    #plt3d.set_ylabel('y_2')
    #plt3d.set_xlim(-0.5,1.5)
    #plt3d.set_ylim(-0.5,1.5)
    #plt3d.set_zlim(-0.5,1.5)
    #plt3d.set_title('Hyperplane')
    #ax = plt3d.gca()
    #ax.hold(True)
    #plt3d.scatter(xset[:,0], xset[:,1], [0,0,0,0], color='black')
    #plt3d.scatter(xset[:,0], xset[:,1], [1,1,1,1], color='black')
    #plt.show()

In [None]:
interactive(boolean_matrix_XOR,
            w1=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=0.0, continuous_update=False),
            w2=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=0.0, continuous_update=False),
            w3=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=0.0, continuous_update=False),
            w4=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=0.0, continuous_update=False),
            w5=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=0.0, continuous_update=False),
            theta1=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=1.0, continuous_update=False),
            theta2=widgets.FloatSlider(min=-2.0,max=2.0,step=0.05,value=1.0, continuous_update=False)
           )

# Fully connected feed forward neural networks

<img src="percept_img/NeuralNet2.png" width="800">

The previous examples show simple constructions of neural networks. In this section we'll look at fully connected feed forward neural networks with multiple layers, and show how one can represent them in terms of a series of matrix multiplications. In this context 
- "feed forward" means that information is propagated in one direction only (from input to output) 
- "fully connected" means that all neurons in a layer $l$ are connected to all neurons in the previous ($l - 1$) and next ($l + 1$) layer.
- also of note is that information only passes between neighbouring layers. The structure of the XOR network above, where the inputs feed directly into both the first and the second node, is therefore not possible.

The figure above shows a neural net with 1 input neuron, 2 hidden layers with 3 and 2 neurons respectively, and 1 output layer with 1 neuron. The figure also depicts the input and output (activations), all weights connecting the neurons, and the bias and activation of each neuron, using the following naming conventions:

- $\omega^l_{j,k}$ is the weight from the $k$<sup>th</sup> neuron in the ($l - 1$)<sup>th</sup> layer to the $j$<sup>th</sup> neuron in the $l$<sup>th</sup> layer
- $b^l_{j}$ is the bias of the $j$<sup>th</sup> neuron in the $l$<sup>th</sup> layer
- $a^l_{j}$ is the activation of the $j$<sup>th</sup> neuron in the $l$<sup>th</sup> layer

With this naming convention we note that the activation of neuron $a_j^l$ is given by:

\begin{equation}
a_j^l = \sigma \left( \sum_k \omega_{j, k}^l a_k^{l - 1} + b_j^l\right) \,,
\end{equation}
where $\sigma$ is an activation function. In the case of using perceptron neurons the output of $a_j^l$ would be: 
\begin{equation}
\mathrm{output} = 
\begin{cases}
 1 & \mathrm{if} \left( \sum_k \omega_{j, k}^l a_k^{l - 1} + b_j^l\right) \geq 0 \\  
 0 & \mathrm{if} \left( \sum_k \omega_{j, k}^l a_k^{l - 1} + b_j^l\right) < 0 
\end{cases} \,,
\end{equation}
but we note that $\sigma$ can be any function. In either case, the activation of a whole layer $l$, $a^l$, can be written in vectorized form:
\begin{equation}
a^l = \sigma \left( \omega^l a^{l - 1} + b^l\right) \,,
\end{equation}
where $\omega^l$ is the matrix of weights into layer $l$, $b^l$ is the vector of biases, and $\sigma$ is applied element-wise.

With this, the activations of the different layers in the example above can be computed as:
\begin{equation}
a^1 = \sigma \left(  \left[\begin{matrix}\omega^1_{11}\\\omega^1_{21}\\\omega^1_{31}\end{matrix}\right] \cdot \left[\begin{matrix}x\end{matrix}\right] + \left[\begin{matrix}b^1_{1}\\b^1_{2}\\b^1_{3}\end{matrix}\right]\right) \,, 
\end{equation}
\begin{equation}
a^2 = \sigma \left( \left[\begin{matrix}\omega^2_{11} & \omega^2_{12} & \omega^2_{13}\\\omega^2_{21} & \omega^2_{22} & \omega^2_{23}\end{matrix}\right] \cdot \left[\begin{matrix}a^1_{1}\\a^1_{2}\\a^1_{3}\end{matrix}\right]  + \left[\begin{matrix}b^2_{1}\\b^2_{2}\end{matrix}\right]\right) 
\,,
\end{equation}
\begin{equation}
y = \sigma \left( \left[\begin{matrix}\omega^3_{11} & \omega^3_{12}\end{matrix}\right] \cdot \left[\begin{matrix}a^2_{1}\\a^2_{2}\end{matrix}\right] + \left[\begin{matrix}b^3_{1}\end{matrix}\right] \right)
\,.
\end{equation}
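The layer-by-layer computation can be sketched with NumPy for the 1-3-2-1 network in the figure. The random weights, the choice of a sigmoid for $\sigma$, and the helper name `forward` are assumptions made for illustration only. The last lines compare one entry of the vectorized result with the explicit sum over $k$ from the per-neuron formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer sizes of the example network: 1 input, 3 and 2 hidden neurons, 1 output
sizes = [1, 3, 2, 1]
# Each weight matrix has shape (neurons in layer l, neurons in layer l - 1)
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n_out, 1)) for n_out in sizes[1:]]

def forward(x, weights, biases, sigma=sigmoid):
    """Return the list of activations [a^0, a^1, ..., a^L] as column vectors."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigma(W @ activations[-1] + b))
    return activations

x = np.array([[1.0]])  # a single input as a (1, 1) column vector
activations = forward(x, weights, biases)

# Explicit sum for the first neuron of the second layer, for comparison
a1, W2, b2 = activations[1], weights[1], biases[1]
a2_first = sigmoid(sum(W2[0, k] * a1[k, 0] for k in range(3)) + b2[0, 0])
print(activations[2][0, 0], a2_first)
```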

# Setting up neural networks in tensorflow
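TensorFlow is an open source platform for machine learning, which we will use in this notebook. The code below uses the TensorFlow 1.x graph API (`tf.Session`, `tf.placeholder`); in TensorFlow 2.x these calls are available under `tf.compat.v1`. Unless eager execution is enabled (`tf.executing_eagerly()` reports whether it is), TensorFlow computations are run using symbolic handles through <em>graphs</em>. Execution of a graph is performed in TensorFlow sessions. 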
- Quoted from the TensorFlow website; "A computational graph (or graph in short) is a series of TensorFlow operations arranged into a graph of nodes". Basically, it means a graph is just an arrangement of nodes that represent the operations in your model.
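To make the idea of "build the graph first, run it later" concrete, here is a toy sketch in plain Python. It is not how TensorFlow is implemented; the `Node` class and the `run` function are hypothetical names used only to illustrate deferred evaluation.

```python
class Node:
    """A toy graph node: an operation plus the nodes (or constants) it depends on."""
    def __init__(self, op, *inputs, name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

def run(node):
    """Evaluate a node by first evaluating everything it depends on."""
    if not isinstance(node, Node):
        return node  # a plain Python constant
    args = [run(inp) for inp in node.inputs]
    return node.op(*args)

# Building the graph performs no arithmetic yet
add = Node(lambda a, b: a + b, 2, 3, name="Add")
scaled = Node(lambda a, b: a * b, add, 10, name="Mul")

print(add)          # prints the node object, not a number
print(run(add))     # evaluates the "Add" node
print(run(scaled))  # evaluates "Add" first, then "Mul"
```

Printing the node shows a symbolic handle, and the value only appears once the node is run, which mirrors the graph and session split described above.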

Before we set up the neural net above, let's look at a simple TensorFlow graph and run it in a session.
## TensorFlow graphs

In [None]:
import tensorflow as tf
#tf.set_random_seed(1)
a_scalar = 2
b_scalar = 3
c_scalar = tf.add(a_scalar, b_scalar, name='Add')
print(c_scalar)

W_tmp = tf.constant([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
a_tmp = tf.constant([[1], [2], [3]])
b_tmp = tf.constant([[1], [2], [3]])
a2_tmp = tf.add(tf.matmul(W_tmp, a_tmp), b_tmp)

## TensorFlow session

In [None]:
# run the first graph in a session
sess = tf.Session()
print(sess.run(c_scalar))
sess.close()

# run the second graph in a session
# using 'with tf.Session() as sess:' in which we do not need to close the session
with tf.Session() as sess:
    print(sess.run(a2_tmp))

## TensorFlow variables and placeholders
Trainable parameters such as weights and biases are declared using `tf.Variable`, whereas placeholders (`tf.placeholder`) are used to feed in actual training examples. 

In [None]:
W1 = tf.Variable(tf.random_normal([3, 1], stddev=0.75), name='W1')
b1 = tf.Variable(tf.random_normal([3, 1]), name='b1')
W2 = tf.Variable(tf.random_normal([2, 3], stddev=0.75), name='W2')
b2 = tf.Variable(tf.random_normal([2, 1]), name='b2')
W3 = tf.Variable(tf.random_normal([1, 2], stddev=0.75), name='W3')
b3 = tf.Variable(tf.random_normal([1, 1]), name='b3')

x = tf.placeholder(tf.float32, shape=[1, 1], name='x')

With the weights and biases declared, we can create graphs for the different layers in the above example neural network. For simplicity we'll assume a linear activation function, $\sigma(x)=x$.

In [None]:
a1 = tf.add(tf.matmul(W1, x), b1)
a2 = tf.add(tf.matmul(W2, a1), b2)
y = tf.add(tf.matmul(W3, a2), b3)


And finally we can initialize the weights and biases and evaluate the neural network by feeding it an x value (a training example):

In [None]:
x_data = np.array([[1]])
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init) # initialize variables
    print(sess.run(y, feed_dict = {x:x_data}))

In the above example we have treated vectors as column vectors. TensorFlow, however, by default assumes row vectors, so in the following we work with the transpose of the equation for the activation of a layer:
\begin{equation}
(a^l)^T = \sigma \left( \left(\omega^l a^{l - 1}\right)^T + \left(b^l\right)^T\right) = \sigma \left( {a^{l - 1}}^T {\omega^l}^T  + \left(b^l\right)^T\right)\,,
\end{equation}
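The equivalence of the two conventions can be checked numerically with NumPy. The sketch below assumes a linear activation ($\sigma(x) = x$) and random example values, and also shows a practical benefit of the row convention: several samples can be stacked as rows and pushed through the layer in a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 1))  # column convention: shape (neurons, inputs)
b = rng.normal(size=(3, 1))

a_prev = np.array([[0.5]])   # one input as a column vector, shape (1, 1)

col = W @ a_prev + b         # column convention, shape (3, 1)
row = a_prev.T @ W.T + b.T   # row convention, shape (1, 3)
print(np.allclose(col.T, row))

# Row convention with a batch: 5 samples stacked as rows, 1 feature each
X = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
batch = X @ W.T + b.T        # shape (5, 3); the bias row is broadcast over samples
print(batch.shape)
```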

With this convention the above neural network can be implemented as

In [None]:
W1 = tf.Variable(tf.random_normal([1, 3], stddev=0.75), name='W1')
b1 = tf.Variable(tf.random_normal([3]), name='b1')
W2 = tf.Variable(tf.random_normal([3, 2], stddev=0.75), name='W2')
b2 = tf.Variable(tf.random_normal([2]), name='b2')
W3 = tf.Variable(tf.random_normal([2, 1], stddev=0.75), name='W3')
b3 = tf.Variable(tf.random_normal([1]), name='b3')
x = tf.placeholder(tf.float32, shape=[1, 1], name='x')
a1 = tf.add(tf.matmul(x, W1), b1)
a2 = tf.add(tf.matmul(a1, W2), b2)
y = tf.add(tf.matmul(a2, W3), b3)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init) # initialize variables
    print(sess.run(y, feed_dict = {x:x_data}))

# How do neural networks represent functions?

Unfortunately the perceptron rule generalises poorly to multiple layers of perceptrons, which was an initial hurdle for the further development of neural networks. However, if we introduce differentiable activation functions for every node, we can instead work with the gradient descent rule, which is formalised as follows:

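If we define an error function as half the sum of squared deviations between the target values $y_j$ and the estimates $\hat{y}_j$, we can write it as

\begin{equation}
E = \frac{1}{2}\sum_j ( y_j - \hat{y}_j )^2,
\end{equation}

where $j$ spans all points in a set of training data. For a single node with a linear activation, the estimate for data point $j$ is

\begin{equation}
\hat{y}_j = \sum_k w_k x_{k,j} - \theta,
\end{equation}

where $x_{k,j}$ is the $k$<sup>th</sup> input of data point $j$. If we then differentiate with respect to a weight $w_i$ we get

\begin{equation}
\frac{\partial E}{\partial w_i} = \sum_j ( y_j - \hat{y}_j) \left(-\sum_k x_{k,j} \frac{\partial w_k}{\partial w_i}\right) = \sum_j ( y_j - \hat{y}_j) \left(-\sum_k x_{k,j} \delta_{ik}\right) = - \sum_j ( y_j - \hat{y}_j) x_{i,j} .
\end{equation}

Moving each weight a small step against the gradient, $\Delta w_i = -\eta \, \partial E / \partial w_i = \eta \sum_j ( y_j - \hat{y}_j) x_{i,j}$, gives an update of the same form as the perceptron rule.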
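The gradient expression can be verified numerically. In the sketch below, row $j$ of the array `X` holds the inputs of training point $j$, so that the gradient with respect to $w_i$ is $-\sum_j (y_j - \hat{y}_j)$ times input $i$ of point $j$. The synthetic data, the step size $\eta = 0.01$ and the function names are assumptions for illustration. The analytic gradient is compared with a central finite difference, and a few gradient descent steps are then taken.

```python
import numpy as np

def predict(X, w, theta):
    # Linear estimate for every training point: sum_i w_i * x_i - theta
    return X @ w - theta

def error(X, y, w, theta):
    # E = 1/2 * sum_j (y_j - y_hat_j)^2
    return 0.5 * np.sum((y - predict(X, w, theta)) ** 2)

def grad_w(X, y, w, theta):
    # dE/dw_i = -sum_j (y_j - y_hat_j) * (input i of point j)
    return -X.T @ (y - predict(X, w, theta))

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))              # 20 training points, 2 inputs each
theta = 0.3
y = X @ np.array([1.5, -0.5]) - theta     # synthetic targets from made-up weights
w = np.zeros(2)

# Central finite-difference approximation of the gradient
eps = 1e-6
numeric = np.array([
    (error(X, y, w + eps * e_i, theta) - error(X, y, w - eps * e_i, theta)) / (2 * eps)
    for e_i in np.eye(2)
])
analytic = grad_w(X, y, w, theta)
print(analytic, numeric)

# Gradient descent: repeatedly step against the gradient
eta = 0.01
E_start = error(X, y, w, theta)
for _ in range(200):
    w = w - eta * grad_w(X, y, w, theta)
E_end = error(X, y, w, theta)
print(E_start, E_end, w)
```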


This allows for backpropagation, a technique for minimizing the error throughout a multilayered network. We won't go through the details of how it works, but we will use it in the coming examples. "The universal approximation theorem" states that:

### Theorem
- "A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly."


So let us see how a simple neural network with a single hidden layer may be trained to represent a polynomial function. Along the way we will get to know the TensorFlow package in Python.

Some of the most common activation functions are listed below. A ReLU with a small fixed slope for negative inputs is often referred to as the "leaky" ReLU (the parametric ReLU treats that slope as a trainable parameter), and the logistic function is also known as the sigmoid.

<img src="percept_img/ActivationFunctions.png" width="800">
<center>Source: https://towardsdatascience.com</center>
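For reference, a few of these activation functions can be written in NumPy as follows. This is a sketch with commonly used default parameters (`alpha=0.01` for the leaky ReLU and `alpha=1.0` for the ELU are assumptions, not values taken from the figure).

```python
import numpy as np

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but with a small slope alpha for negative inputs
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    # Logistic function, squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def elu(z, alpha=1.0):
    # Exponential linear unit; np.minimum keeps exp() from overflowing for large z
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

tanh = np.tanh  # hyperbolic tangent, squashes inputs into (-1, 1)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, func in [("relu", relu), ("leaky_relu", leaky_relu),
                   ("sigmoid", sigmoid), ("tanh", tanh), ("elu", elu)]:
    print(name, func(z))
```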

We will now try to represent a few functions using a single layer network, with a varying number of nodes, and activation functions.

We shall also mention that in the example below we do not train on the full data set. Instead we draw a random subset of `N_train` points from it and repeatedly learn from that subset; each pass over the subset is called an "epoch".

### Problems
Attempt to represent a: 

- linear function $f(x) = 2x+3$ using ReLU activation functions with 5, 10 and 20 nodes

- polynomial function $f(x) = x^3 + x^2 + x - 1$ using ReLU activation functions, sigmoid and tanh with 5, 10 and 100 nodes for $x \in [-2,2]$, learning rate = 0.02 and 10000 epochs. Try a higher learning rate for the sigmoid $\approx$ 0.1

- exponential function $f(x) = e^x$ using the "leaky ReLU", tanh and exponential linear unit with ..5 nodes

If you are interested, try adding normally distributed white noise with a standard deviation of 2% to the training data set. You may have to tune the hyperparameters (learning rate, number of training points and epochs) to make this work. Remember to check that the cost function value decreases as the epochs progress.
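To show what the optimiser does under the hood, here is a NumPy-only sketch of the same kind of network: one hidden layer with tanh activation, a linear output, and plain gradient descent on the mean squared error. It is not the TensorFlow code used in this notebook. The target $f(x) = 2x + 3$, the 10 hidden neurons, the learning rate of 0.01 and the 2000 epochs are assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(3)

# Full data set and a random training subset, as in the notebook
x_all = np.linspace(-2.0, 2.0, 1001).reshape(-1, 1)
y_all = 2.0 * x_all + 3.0
idx = rng.choice(x_all.shape[0], size=100, replace=False)
x_train, y_train = x_all[idx], y_all[idx]

# Weights and biases (row-vector convention)
n_neurons = 10
W1 = rng.normal(scale=0.75, size=(1, n_neurons))
b1 = rng.normal(size=n_neurons)
W2 = rng.normal(scale=0.75, size=(n_neurons, 1))
b2 = rng.normal(size=1)

def forward(x):
    hidden = np.tanh(x @ W1 + b1)   # hidden layer activations
    return hidden, hidden @ W2 + b2  # linear output layer

def mse(x, y):
    return float(np.mean((y - forward(x)[1]) ** 2))

learning_rate = 0.01
loss_start = mse(x_train, y_train)
for epoch in range(2000):
    hidden, y_nn = forward(x_train)
    # Gradient of the mean squared error with respect to the network output
    d_out = 2.0 * (y_nn - y_train) / len(x_train)
    grad_W2 = hidden.T @ d_out
    grad_b2 = d_out.sum(axis=0)
    # Backpropagate through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    d_hidden = (d_out @ W2.T) * (1.0 - hidden ** 2)
    grad_W1 = x_train.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0)
    # Gradient descent step on every parameter
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
loss_end = mse(x_train, y_train)

print("training loss before:", loss_start, "after:", loss_end)
print("loss on the full data set:", mse(x_all, y_all))
```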



In [None]:
import tensorflow as tf
%matplotlib inline
#==========================================================================================#
# Set parameters (hyperparameters) and compute the training data for the desired function  #
#==========================================================================================#

N = 10001 # Number of training points
N_train = 100 # Number of points to train on per training attempt
N_neurons = 10#10 # Number of neurons in the hidden layer
afunc = tf.nn.elu # Set activation function, other functions .relu, .sigmoid, .tanh, .elu

learning_rate = 0.1
epochs = 10000 # Number of training passes (epochs) over the sampled training subset


#Set x-domain boundaries
x_start = -2.
x_end = 2.
x_data = np.linspace(x_start, x_end, N)
#===================================================
# Specify what function you wish to train for
#===================================================
#y_data = 2.0*x_data + 3.0 #Linear polynomial function
#y_data = x_data**3 + x_data**2 + x_data - 1.0 #Cubic polynomial function
y_data = np.exp(x_data) #Exponential function

#y_data = (1 + 0.02*np.random.randn(len(y_data)))*y_data #Added noise

In [None]:
x_data = x_data[:, np.newaxis] # turn 1D array into 2D array of shape (N, 1)
y_data = y_data[:, np.newaxis] # turn 1D array into 2D array of shape (N, 1)

idx = np.random.choice(x_data.shape[0], N_train, replace=False)
x_data_train = x_data[idx]
y_data_train = y_data[idx]

In [None]:
#===================================================
# set up placeholder for inputs and outputs
#===================================================
x = tf.placeholder(tf.float32, shape=[None, x_data.shape[1]], name='x')
y = tf.placeholder(tf.float32, shape=[None, y_data.shape[1]], name='y')
#===================================================
# declare weights and biases input --> hidden layer
#===================================================
W1 = tf.Variable(tf.random_normal([1, N_neurons], stddev=0.75), name='W1')
b1 = tf.Variable(tf.random_normal([N_neurons]), name='b1')
#===================================================
# declare weights and biases of hidden --> output layer
#===================================================
W2 = tf.Variable(tf.random_normal([N_neurons, 1], stddev=0.75), name='W2')
b2 = tf.Variable(tf.random_normal([1]), name='b2')
#===================================================
# declare output of NN
#===================================================
hidden_out = afunc(tf.add(tf.matmul(x, W1), b1)) #Apply activation function for inputs to hidden layer
y_NN = tf.add(tf.matmul(hidden_out, W2), b2) #Apply linear sum for outputs from each hidden layer node

In [None]:
#===========================================================
# plot y_pred before training using only initial conditions
#===========================================================
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init) # initialize variables
    y_pred_init = sess.run(y_NN, feed_dict = {x:x_data})
plt.figure()
plt.plot(x_data.flatten(), y_data.flatten())
plt.plot(x_data.flatten(), y_pred_init.flatten())

In [None]:
#===================================================
# Train the model
#===================================================
batch_size = N_train
loss = tf.reduce_mean(tf.square(y - y_NN)) # Minimize the mean (least) square error
optimiser = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

print_every_N_batch = 1000
with tf.Session() as sess:
    sess.run(init) # initialize variables
    for epoch in range(epochs):
        avg_cost = 0
        _, c = sess.run([optimiser, loss], 
                     feed_dict={x: x_data_train, y: y_data_train})
        avg_cost += c
        if epoch % print_every_N_batch == 0:
            print("Epoch:", (epoch + 1), "cost =", "{:.6f}".format(avg_cost))

    y_pred = sess.run(y_NN, feed_dict = {x:x_data, y:y_data})
    loss_pred = sess.run(loss, feed_dict = {x:x_data, y:y_data})
    plt.figure()
    plt.plot(x_data.flatten(), y_data.flatten())
    plt.plot(x_data.flatten(), y_pred.flatten())
    print(loss_pred)

## Simplifying with keras