# Fully connected neural network

This notebook covers fully connected neural networks (FCNNs): the forward pass, the backward pass, different activation functions, and different cost functions.

## Weighted sum

This section will cover neurons as weighted sums, without regard to activation functions.

A single neuron in a fully connected neural network can be viewed as:

$$z = \sum_i{w_i x_i} + b$$
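As a minimal sketch of the weighted sum above, assuming NumPy and some arbitrarily chosen illustrative values for the weights, inputs and bias (they are not taken from the later examples):

```python
import numpy as np

# Illustrative values only (assumed, not from the notebook's later examples).
w = np.array([0.5, -1.0, 2.0])   # weights w_i
x = np.array([1.0, 2.0, 3.0])    # inputs x_i
b = 0.5                          # bias

# The weighted sum written out as an explicit sum over i ...
z_loop = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
# ... and as a dot product, which computes the same quantity.
z_dot = np.dot(w, x) + b

print(z_loop, z_dot)
```

Both expressions compute the same $z$; the dot-product form is the one that generalizes to the matrix notation used for a whole layer.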

### Forward pass
**Given** multiple neurons in both the input and the hidden layer, **find** the relationship between the input and hidden layers in matrix notation.

**Solution:** This can be described in matrix notation as follows:

$$\vec{z} = W \vec{x} + \vec{b}$$

If $\vec{z}$ is a vector of length $m$ and $\vec{x}$ is a vector of length $n$, then the matrix $W$ has shape $(m, n)$ and the vector $\vec{b}$ has length $m$.

#### Numerical example
The following numerical example declares an input vector of length $n = 10$, a hidden layer of length $m = 5$, a weight matrix and a bias vector, and performs a forward pass. It does not consider multiple samples in a mini-batch, but could easily be extended to cover that.

The weight matrix is initialized to ones, and the input vector is given values evenly spaced in the range $[0,9]$. The bias is also given linearly spaced values in the range $[0,4]$. From the formulas, it should be clear that with a weight matrix of ones, each output element equals the sum of the inputs (here $0 + 1 + \dots + 9 = 45$) plus the corresponding bias. This is confirmed in the numerical example:

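```python
import numpy as np

n = 10  # input length
m = 5   # hidden layer length

w = np.ones((m, n))        # weight matrix of shape (m, n)
b = np.linspace(0, 4, m)   # bias vector: 0, 1, 2, 3, 4

x = np.linspace(0, 9, n)   # input vector: 0, 1, ..., 9

print(x)

z = w @ x + b              # forward pass

print(z)
```

```
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[45. 46. 47. 48. 49.]
```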
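The extension to a mini-batch mentioned above could look roughly like the following sketch. It assumes that each sample is stored as a column of a matrix `X`, and that the bias is reshaped into a column vector so that NumPy broadcasts it over the samples. The name `batch_size` and the way the extra samples are generated are illustrative assumptions.

```python
import numpy as np

n, m, batch_size = 10, 5, 3   # batch_size is an assumed, illustrative value

W = np.ones((m, n))
b = np.linspace(0, 4, m).reshape(m, 1)   # column vector, broadcasts over samples

# Each column of X is one sample; sample k is the original input plus k.
x = np.linspace(0, 9, n)
X = np.stack([x + k for k in range(batch_size)], axis=1)   # shape (n, batch_size)

Z = W @ X + b   # shape (m, batch_size), one column of outputs per sample

print(Z.shape)
print(Z[:, 0])   # the first column reproduces the single-sample result
```

Storing samples as columns keeps the formula $\vec{z} = W \vec{x} + \vec{b}$ unchanged. Many libraries store samples as rows instead, in which case the product is written as `X @ W.T + b`.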


### Backward pass

The backward pass of such a linear layer can be calculated by differentiating with respect to the weights and the bias.

In the example with one hidden neuron this is simply:

$$\frac{\partial z}{\partial \vec{w}} = \vec{x}^T$$  


$$\frac{\partial z}{\partial b} = 1$$


The vector $\vec{x}$ must be transposed because $\vec{w}$ is a row vector whose weights correspond to the elements of the input column vector $\vec{x}$.

This becomes more interesting once activation functions are added, so I will hold off on a numerical example until then.

## Activation functions

There are numerous activation functions. They introduce non-linearities into the network so that it can learn non-linear functions.
Here I will cover:

* ReLU
* Sigmoid
* Softmax

**Given** the names of these three functions, I want to **find** their forward and backward functions.

### ReLU

ReLU can be described simply as $$f(z) = \max(0, z)$$

Differentiating ReLU gives $f'(z) = 1$ if $z > 0$, else $0$. (The derivative is undefined at $z = 0$; setting it to $0$ there is a common convention.)
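A minimal NumPy sketch of the ReLU forward and backward functions could look like this. The function names are my own, and the derivative at $z = 0$ is set to $0$ by convention.

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)

def relu_derivative(z):
    # 1 where z > 0, else 0 (the value at z = 0 is a convention)
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)
relu_grad = relu_derivative(z)

print(relu_out)
print(relu_grad)
```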

### Sigmoid

Sigmoid is another non-linear activation function. It is expressed as:

$$ f(z) = \frac{1}{1 + e^{- z}}$$

The sigmoid function is useful because it is non-linear, it squashes the output to between 0 and 1, and it has a simple derivative:

$$f'(z) = \frac{e^{-z}}{(1+ e^{-z})^2} = f(z) (1-f(z))$$
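The identity $f'(z) = f(z)(1 - f(z))$ can be checked numerically. The following sketch compares it against a central finite difference; the step size `eps` and the test points are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Uses the closed form f'(z) = f(z) * (1 - f(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-3, 3, 7)   # a few arbitrary test points
eps = 1e-6                  # finite-difference step (assumed value)

# Central difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative(z)

max_diff = np.max(np.abs(numeric - analytic))
print(max_diff)
```

The maximum difference should be tiny, limited only by floating-point round-off.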

### Softmax
The softmax function is widely used for classification tasks. Given an input vector $\vec{z}$ that represents $n$ classes, it gives an output vector $f(\vec{z})$ which sums to one. Thus the output can be viewed as a probability distribution over all the classes.

The function is given as follows:

$$f(\vec{z})_j = \frac{e^{z_j}}{\sum_i{e^{z_i}}}$$

The derivative of the function is a bit more difficult to derive.
Using $y_k$ to denote the $k$-th element of the vector $f(\vec{z})$:

$$\frac{\partial y_k}{\partial z_j} = \frac{\partial}{\partial z_j} \frac{e^{z_k}}{\sum{e^{z_i}}}$$

Using $g_k = e^{z_k}$ and $h_k = \sum{e^{z_i}}$ (the denominator is the same for every $k$):

$$\frac{\partial g_k}{\partial z_j} = e^{z_j}, j = k$$

$$\frac{\partial g_k}{\partial z_j} = 0, j \neq k$$

and 

$$\frac{\partial h_k}{\partial z_j} = (e^{z_1} + e^{z_2} + ... + e^{z_j} + ... + e^{z_n})' = e^{z_j}$$


Then we can use the quotient rule:

$$\frac{\partial y_k}{\partial z_j} = \frac{g_k' h_k - h_k' g_k}{h_k^2}$$

Using the case $j = k$ from above I get:

$$\frac{\partial y_k}{\partial z_j} = \frac{e^{z_j}}{\sum{e^{z_i}}} \Big(\frac{\sum{e^{z_i}} - e^{z_j}}{\sum{e^{z_i}}} \Big) = y_k (1-y_k)$$

If $j \neq k$, then $g_k' = 0$ and we get a different solution:

$$\frac{\partial y_k}{\partial z_j}= \frac{-e^{z_j} e^{z_k}}{(\sum{e^{z_i}})^2} = - y_j * y_k$$
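Both cases can be combined into a single Jacobian matrix with entries $y_k(\delta_{kj} - y_j)$, i.e. $\mathrm{diag}(\vec{y}) - \vec{y}\vec{y}^T$. The following sketch builds that matrix and compares it with a finite-difference approximation. The function names, the test vector and the step size are my own illustrative choices, and subtracting the maximum inside `softmax` is a common numerical-stability trick that is not part of the derivation above.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def softmax_jacobian(z):
    # J[k, j] = dy_k / dz_j = y_k * (delta_kj - y_j)
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, 2.0, 0.5])   # arbitrary test input
J = softmax_jacobian(z)

# Finite-difference check, one input coordinate (column) at a time
eps = 1e-6
J_num = np.zeros((z.size, z.size))
for j in range(z.size):
    dz = np.zeros_like(z)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J - J_num)))
```

Each column of the Jacobian sums to zero, which reflects the fact that the outputs always sum to one.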

## Bringing it together

A complete forward pass through a layer consists of both a weighted sum and an activation function:

$$a = f(wx + b)$$

Differentiating this with respect to $w$ gives:

$$\frac{\partial a}{\partial w} = f'(wx + b) x^T$$

and $$\frac{\partial a}{\partial b} = f'(wx + b)$$
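As a sketch of these two formulas for a single neuron, the following compares the analytic gradients with finite differences. It assumes a sigmoid activation and arbitrarily chosen values for `w`, `x` and `b`; the helper `neuron` is hypothetical and only exists for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumed): one neuron with three inputs
w = np.array([[0.2, -0.4, 0.1]])      # row vector of weights, shape (1, 3)
x = np.array([[1.0], [2.0], [3.0]])   # input column vector, shape (3, 1)
b = 0.3

def neuron(w, b):
    # a = f(wx + b), returned as a Python float
    return sigmoid(w @ x + b).item()

# Analytic gradients: f'(wx + b) * x^T and f'(wx + b)
a = neuron(w, b)
f_prime = a * (1.0 - a)   # sigmoid derivative expressed through the output
da_dw = f_prime * x.T     # shape (1, 3), same shape as w
da_db = f_prime

# Central finite differences for comparison
eps = 1e-6
da_dw_num = np.zeros_like(w)
for i in range(w.shape[1]):
    dw = np.zeros_like(w)
    dw[0, i] = eps
    da_dw_num[0, i] = (neuron(w + dw, b) - neuron(w - dw, b)) / (2 * eps)
da_db_num = (neuron(w, b + eps) - neuron(w, b - eps)) / (2 * eps)

print(da_dw, da_dw_num)
print(da_db, da_db_num)
```

Note that the gradient with respect to `w` has the same shape as `w` itself, which is why the transposed input $x^T$ appears in the formula.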

### Cost function

The output of the last activation layer is usually passed to a cost function. For regression, the mean squared error is a common choice. It is defined as:

$$C(y) = \frac{1}{N} \sum{(y - y')^2}$$

where $y'$ is the ground truth.

This is the function we want to differentiate with respect to the weights and biases of the network. A common notation that utilizes the chain rule is to define:

$$\delta^l = \frac{\partial C}{\partial z^l}$$
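With this definition, the chain rule gives a set of recurrences that is commonly used for backpropagation. The layer-index notation here is an assumption chosen to match $\delta^l$: layer $l$ computes $z^l = W^l a^{l-1} + b^l$ and $a^l = f(z^l)$, with $a^0 = x$, $L$ denoting the last layer, and $\odot$ the element-wise product.

$$\delta^L = \frac{\partial C}{\partial a^L} \odot f'(z^L)$$

$$\delta^l = \big((W^{l+1})^T \delta^{l+1}\big) \odot f'(z^l)$$

$$\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^T, \qquad \frac{\partial C}{\partial b^l} = \delta^l$$

The first equation starts the recursion at the output, the second propagates the error one layer back, and the third converts each $\delta^l$ into gradients for that layer's weights and bias.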

### Numerical task

This task illustrates the forward and backward pass with a practical example, using matrix notation and the chain rule.

The following figure outlines a neural network with two hidden layers and one output layer.
![Task](fcnn_task.jpg)

**Given** this network, **find** the matrix representation of the network, $\frac{\partial C}{\partial b^2}$ and $\frac{\partial C}{\partial w^1}$


**Solutions**:
I first calculate the solutions by hand:
![Forward](fcnn-forward.jpg)
![Backward](fcnn-backward.jpg)



Then I verify my results with a Python script:

```python
import numpy as np

def ReLU_forward(z):
    return np.maximum(0, z)

def ReLU_backward(intermediate_a):
    # Derivative of ReLU, evaluated on the activation a = ReLU(z):
    # a > 0 exactly when z > 0, so this equals f'(z).
    return np.where(intermediate_a > 0, 1, 0)

def MSE(y, target):
    # Squared error with a factor 1/2, so that the derivative is simply (y - target)
    return 1./2. * (y - target)**2

layer_sizes = [3, 2, 1]

input_size = 2

w1 = np.array([[3, -3], [2, -2], [1, -1]])
w2 = np.array([[-1, -3, 2], [1, 3, 2]])
w3 = np.array([[-1, 1]])

b1 = np.array([[1], [2], [3]])
b2 = np.array([[1], [0]])
b3 = np.array([[2]])

x = np.array([[1], [-1]])  # Column vector
target = 1

# Forward pass
a1 = ReLU_forward(w1 @ x + b1)
#print(f"a1: {a1}")
a2 = ReLU_forward(w2 @ a1 + b2)
#print(f"a2: {a2}")
a3 = ReLU_forward(w3 @ a2 + b3)
#print(f"a3: {a3}")
loss = MSE(a3, target)
print(f"The mean square error given this network and input is: {loss}")

# Backward pass

# Derivative of cost function multiplied with derivative of activation function
delta3 = (a3 - target) * ReLU_backward(a3)

dw3 = delta3 @ a2.transpose()
db3 = delta3

# Propagate the error back through w3, then through the ReLU of layer 2
delta2 = ReLU_backward(a2) * (w3.transpose() @ delta3)

dw2 = delta2 @ a1.transpose()
db2 = delta2

# Propagate the error back through w2, then through the ReLU of layer 1
delta1 = ReLU_backward(a1) * (w2.transpose() @ delta2)

dw1 = delta1 @ x.transpose()
db1 = delta1

print(f"db2 =\n{db2}\n\n\ndw1 = \n{dw1}")

# Weight update (plain gradient descent step with a learning rate of 1)
w1 -= dw1
w2 -= dw2
w3 -= dw3

b1 -= db1
b2 -= db2
b3 -= db3

print("As you can see, the Python script and the calculation by hand give the same results.")
```

```
The mean square error given this network and input is: [[648.]]
db2 =
[[ 0]
 [36]]


dw1 = 
[[  36  -36]
 [ 108 -108]
 [  72  -72]]
As you can see, the Python script and the calculation by hand give the same results.
```
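As an additional sanity check, the gradients can also be approximated with central finite differences, independently of the backward pass. The sketch below re-declares the same network with floating-point values so the parameters can be perturbed. The helper names `forward_loss` and `numerical_gradient` and the step size `eps` are my own assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward_loss(w1, b1, w2, b2, w3, b3, x, target):
    # Same forward pass and 1/2 * squared-error loss as in the script above
    a1 = relu(w1 @ x + b1)
    a2 = relu(w2 @ a1 + b2)
    a3 = relu(w3 @ a2 + b3)
    return (0.5 * (a3 - target) ** 2).item()

# Same values as in the script above, stored as floats so they can be perturbed
w1 = np.array([[3., -3.], [2., -2.], [1., -1.]])
w2 = np.array([[-1., -3., 2.], [1., 3., 2.]])
w3 = np.array([[-1., 1.]])
b1 = np.array([[1.], [2.], [3.]])
b2 = np.array([[1.], [0.]])
b3 = np.array([[2.]])
x = np.array([[1.], [-1.]])
target = 1.0

def numerical_gradient(param, eps=1e-4):
    # Perturbs `param` in place, one entry at a time, and restores it afterwards.
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = forward_loss(w1, b1, w2, b2, w3, b3, x, target)
        param[idx] = original - eps
        loss_minus = forward_loss(w1, b1, w2, b2, w3, b3, x, target)
        param[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

dw1_num = numerical_gradient(w1)
db2_num = numerical_gradient(b2)

print(np.round(dw1_num, 3))
print(np.round(db2_num, 3))
```

Up to floating-point round-off, the approximated gradients should agree with the `dw1` and `db2` values printed by the backward pass.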
