Example: let's say we define an input $x = [1, 2]$ and an expected output $y = [1]$. We are going to use a 3-layer network with 2 neurons in the first layer, 2 neurons in the second layer and 1 neuron in the last layer. As a loss function we are going to use $MSE$.


**TODO:** image with inputs and all notation



**Notation:**
- **L** - layer index, where *L* is the output layer, _L-1_ is the layer before it, and so on
- **w** - weight on a given connection
- **b** - bias of a given neuron
- **z** - weighted sum of a neuron's inputs plus its bias. It is not the neuron's output yet, because the activation function has not been applied
- **a** - input value, or the output value of a neuron (which is the input value to the next layer). It is __z__ passed through the activation function
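
To make the difference between **z** and **a** concrete, here is a minimal sketch of a single neuron. The sigmoid activation, the weights, the bias and the helper name `neuron` are assumptions chosen for illustration; only the input $x = [1, 2]$ comes from the example above.

```python
import math


def sigmoid(z):
    """Sigmoid activation, assumed here for illustration."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Return (z, a): the weighted sum plus bias, and its activated value."""
    z = sum(w * a_in for w, a_in in zip(weights, inputs)) + bias
    return z, sigmoid(z)


x = [1.0, 2.0]           # input from the example
weights = [0.5, -0.25]   # hypothetical weights
bias = 0.0               # hypothetical bias

z, a = neuron(x, weights, bias)
print(z, a)  # z = 0.5*1 - 0.25*2 + 0 = 0.0, a = sigmoid(0) = 0.5
```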


<br><br>


<hr>

**To see how far we are from the desired result we need some kind of measure. For this example we are going to use a loss function called *mean squared error*, expressed as follows:**

$MSE = \frac{1}{n}\sum_{i=1}^n{(y_i - \hat{y}_i)^2}$
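
As a small sketch, the loss can be computed as below. It assumes the usual definition in which the summed squared errors are divided by the number of samples $n$; with a single sample, as in this example, the division changes nothing. The function name `mse` is just an illustrative choice.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


print(mse([1], [2]))        # one sample: (1 - 2)^2 = 1.0
print(mse([1, 0], [2, 2]))  # two samples: (1 + 4) / 2 = 2.5
```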

<hr>


**We see that we are far from the right solution, so we need to optimize the network**

For that we will use gradient descent. The thing we are trying to optimize is our **loss function**, which we want to be as small as possible. Our cost function is $MSE$, which for a single output value can be written as follows (we will use $C$ to denote the loss/cost):


$C = (\hat{y} - y)^2$

<hr>


Out of all these parameters, the ones we can tune are the weights $w$ and biases $b$. We mentioned that we are going to use gradient descent. For that we need the derivative of $C$, but an ordinary derivative is not enough because $C$ depends on many variables.
What we can do instead is take the partial derivative of $C$ with respect to one particular weight $w$ or bias $b$.
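
One way to build intuition for a partial derivative is to nudge a single parameter while holding the others fixed and watch how the cost changes. The sketch below does this numerically for a hypothetical one-neuron model $\hat{y} = w x + b$; the model, the values and the helper names `cost` and `partial` are assumptions for illustration only.

```python
def cost(w, b, x=1.0, y=1.0):
    """Squared error of a hypothetical one-neuron model y_hat = w * x + b."""
    y_hat = w * x + b
    return (y_hat - y) ** 2


def partial(f, args, index, eps=1e-6):
    """Central-difference estimate of the partial derivative of f with
    respect to args[index], holding the other arguments fixed."""
    up = list(args)
    down = list(args)
    up[index] += eps
    down[index] -= eps
    return (f(*up) - f(*down)) / (2 * eps)


w, b = 2.0, 0.5
dC_dw = partial(cost, (w, b), 0)  # analytic value: 2 * (w*x + b - y) * x = 3
dC_db = partial(cost, (w, b), 1)  # analytic value: 2 * (w*x + b - y)     = 3
print(dC_dw, dC_db)
```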

<!---
Optimize weight $w_8$:

$ \frac{\partial C}{\partial w_8} = 2 * (z_{n3} * w_7 + z_{n4} * w_8 + b_5) * (z_{n3} * w_7 + z_{n4} + b_5) $

$ \frac{\partial C}{\partial w_8} = 2 * (86 * 7 + 106 + 5) * (86 * 7 + 106 + 5) = 1,016,738$


Our new weight $w_8 = 8 - 1016738 = -1,016,730$


**Sidenote:** If we run the calulation now we will get output value of $\hat{y} = -107773621$ and $MSE = $
-->


$\frac{\partial C_o}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}} {\partial z^{(L)}} \frac{\partial C_o}{\partial a^{(L)}}$








**Cost or loss can be calculated as follows:**
$C_o = (a^{(L)} - y)^2$

<br>

**Then we take the derivative of the cost with respect to a weight:**
$\frac{\partial C_o}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}} {\partial z^{(L)}} \frac{\partial C_o}{\partial a^{(L)}}$


<br><hr>
The derivative of $C_o$ with respect to $a^{(L)}$ is:

$\frac{\partial C_o}{\partial a^{(L)}} = 2(a^{(L)}-y)$


<hr>
The next derivative is just the derivative of the activation function:

$\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)})$

<hr>
Since $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$, this derivative is the activated output of the previous layer:

$\frac{\partial z^{(L)}}{\partial w^{(L)}} = a ^{(L-1)}$


<hr>
The derivative for the bias follows the same chain, with $\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1$:

$\frac{\partial C_o}{\partial b^{(L)}} = \frac{\partial z^{(L)}}{\partial b^{(L)}} \frac{\partial a^{(L)}} {\partial z^{(L)}} \frac{\partial C_o}{\partial a^{(L)}}$
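
The three factors of the chain rule can be evaluated one by one and then multiplied. The sketch below does that for a single output neuron, assuming a sigmoid activation; all numeric values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values for a single output neuron (not taken from the text).
a_prev = 0.5      # a^(L-1): activated output of the previous layer
w, b = 0.8, 0.1   # weight and bias of the output neuron
y = 1.0           # target

z = w * a_prev + b    # z^(L)
a = sigmoid(z)        # a^(L)

dC_da = 2 * (a - y)   # dC/da^(L)
da_dz = a * (1 - a)   # sigma'(z^(L)) for the sigmoid
dz_dw = a_prev        # dz^(L)/dw^(L)
dz_db = 1.0           # dz^(L)/db^(L)

dC_dw = dz_dw * da_dz * dC_da
dC_db = dz_db * da_dz * dC_da
print(dC_dw, dC_db)
```

Because $\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1$, the bias gradient is the weight gradient without the $a^{(L-1)}$ factor.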


<hr>

When we have multiple output neurons, our cost function becomes a sum over all $n_L$ of them:

$C_o = \sum_{j=0}^{n_L - 1}{(a_j^{(L)} - y_j)^2}$
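
A minimal sketch of this sum, with made-up activations and targets for three output neurons (the values and the function name `cost` are assumptions for illustration):

```python
def cost(outputs, targets):
    """Sum of squared differences over all output neurons."""
    return sum((a_j - y_j) ** 2 for a_j, y_j in zip(outputs, targets))


a_L = [0.5, 2.0, 1.0]   # hypothetical activations of three output neurons
y = [1.0, 0.0, 1.0]     # hypothetical targets
print(cost(a_L, y))     # 0.25 + 4.0 + 0.0 = 4.25
```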


<hr>

When calculating the derivative for a neuron in layer $L-1$ that affects multiple neurons in the next layer ($L$), we must sum the contributions of all the neurons it is connected to:

$\frac{\partial C_o}{\partial a_k^{(L-1)}} = \sum_{j=0}^{n_L - 1}{\frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}}
\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}
\frac{\partial C_o}{\partial a_j^{(L)}}}$
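
The sketch below illustrates this sum for one neuron in layer $L-1$ feeding two output neurons, and compares the result with a finite-difference estimate. The sigmoid activation, all numeric values and the function names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(a_prev, weights, biases, targets):
    """Summed squared error over all output neurons fed by a_prev."""
    total = 0.0
    for w, b, y in zip(weights, biases, targets):
        total += (sigmoid(w * a_prev + b) - y) ** 2
    return total


def dcost_da_prev(a_prev, weights, biases, targets):
    """Chain rule, summed over every output neuron that a_prev feeds."""
    total = 0.0
    for w, b, y in zip(weights, biases, targets):
        a = sigmoid(w * a_prev + b)
        dz_da_prev = w          # dz_j/da_prev
        da_dz = a * (1 - a)     # sigma'(z_j)
        dC_da = 2 * (a - y)     # dC/da_j
        total += dz_da_prev * da_dz * dC_da
    return total


# Hypothetical values: one neuron in layer L-1, two neurons in layer L.
a_prev = 0.5
weights = [0.4, -0.7]
biases = [0.1, 0.2]
targets = [1.0, 0.0]

analytic = dcost_da_prev(a_prev, weights, biases, targets)
eps = 1e-6
numeric = (cost(a_prev + eps, weights, biases, targets)
           - cost(a_prev - eps, weights, biases, targets)) / (2 * eps)
print(analytic, numeric)  # the two estimates should agree closely
```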

<hr>



## Super-simple example

<pre>
   w
x ----- [n1] ----> $\hat{y}$
</pre>

Let's say $x = 1, w = 2, y=1$

<hr>

If we forward propagate our input value we get:

$\hat{y} = x * w = 1 * 2 = 2$

<hr>

We now need to measure how far off we are from the desired output. For that we are going to use $MSE$, expressed as:

$MSE = \frac{1}{n}\sum_{i=1}^n{(y_i - \hat{y}_i)^2}$

With a single sample ($n = 1$) our loss will be:

$MSE = (1 - 2)^2 = 1$

<hr>

To optimize our network we must minimize the loss/cost. The loss/cost depends on the output of the last neuron, so we need to change that output in a way that lowers the loss.
We do that by taking the partial derivative of the loss function (denoted $C$) with respect to the last neuron's output value:

$\frac{\partial C}{\partial a} = -2 (y - a)$

<hr>


But $a$ is just the output of the neuron, which is determined by the weight $w$. So we need to change $w$ in order to change $a$, which in turn lowers our loss. To do so we need the partial derivative of $a$ with respect to $w$:

(Sidenote: $a$ was just $w * x$)

$\frac{\partial a}{\partial w} = x$

<hr>


Now that we have computed both partial derivatives, we can calculate the partial derivative of the cost with respect to $w$:

$\frac{\partial C}{\partial w} = \frac{\partial a}{\partial w} \frac{\partial C}{\partial a}$

and now we just insert what we got before:

$ = x * (-2(y - a))$
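
As a quick check, substituting the values from this example ($x = 1$, $y = 1$, $a = 2$) gives:

$\frac{\partial C}{\partial w} = 1 * (-2(1 - 2)) = 2$

The gradient is positive, so increasing $w$ would increase the loss, and gradient descent will therefore lower $w$.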

<hr>

Now we can update our weight. We are going to include one more thing, called the learning rate $lr$, and set it to $0.1$ for now. It scales how large a step we take against the gradient. Then we only need to calculate our new weight:

$w = w - lr * (x * (-2(y - a))) = 1.8$


<hr>

If we run the input $x$ through the network again we see that $\hat{y} = 1.8$ and $MSE = 0.64$, which is better than before.

In [2]:
x = 1
w = 2
y = 1
lr = 0.1

a = x * w

w = w - lr * (x * (-2 * (y - a)))

print(w)

1.8
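
Repeating the same update several times shows the loss shrinking step by step. The loop below is a minimal sketch that reuses the values from this example; the number of steps (5) is an arbitrary choice.

```python
x, y = 1.0, 1.0
w = 2.0
lr = 0.1

for step in range(1, 6):
    a = x * w                    # forward pass
    grad = x * (-2 * (y - a))    # dC/dw, as derived above
    w = w - lr * grad            # gradient-descent update
    loss = (y - x * w) ** 2      # loss after the update
    print(step, round(w, 4), round(loss, 4))
```

With these particular values each step shrinks the distance between $w$ and the ideal value $1$ by a factor of $0.8$, so the first step reproduces $w = 1.8$ and the loss keeps decreasing.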


In [122]:
# Validation with tensorflow
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

x = [1]
y = 1
w = 2

# One linear neuron without bias, so the model is y_hat = w * x
model = Sequential()
model.add(Dense(1, input_dim=1, use_bias=False))
model.compile(loss='mse', optimizer=SGD(learning_rate=0.1), metrics=['mse'])

# Start from the same weight as in the manual calculation
model.set_weights([np.array([[w]], dtype="float32")])
print("weights:", model.get_weights())
print("y_hat:", model.predict(np.array(x).reshape(1, 1)))

# One epoch on a single sample is exactly one gradient-descent step
model.fit(np.array([x]), np.array([y]), epochs=1, verbose=0)
print("weights_1:", model.get_weights())
print("y_hat_1:", model.predict(np.array(x).reshape(1, 1)))

weights: [array([[2.]], dtype=float32)]
y_hat: [[2.]]
weights_1: [array([[1.8]], dtype=float32)]
y_hat_1: [[1.8]]


<hr><hr>

## Super-simple example 2

<pre>
   w1         w2
x ----- [n1] ----- [n2] -----> $\hat{y}$
</pre>

Let's say $x = 1, w_1 = 2, w_2 = 3, y=1$, and we keep the learning rate $lr = 0.1$.

Additional notation:

$a_1$: output from neuron $n_1$

$a_2$: output from neuron $n_2$

<hr>

After forward propagation we get:

$a_1 = x * w_1 = 1 * 2 = 2$

$a_2 = a_1 * w_2 = 2 * 3 = 6$

or

$\hat{y} = w_2 (x * w_1) = 3 (1 * 2) = 6$

<hr>

Our $MSE$ loss will be:

$MSE = (1 - 6)^2 = 25$

<hr>

Now we optimize our loss/cost:

$\frac{\partial C}{\partial a_2} = -2 (y - a_2)$

<hr>

Next we need the derivative of $a_2$ with respect to $w_2$:

(Sidenote: $a_2$ was $w_2 * a_1$)

$\frac{\partial a_2}{\partial w_2} = a_1$

<hr>

We are now ready to calculate the partial derivative of the cost with respect to $w_2$:

$\frac{\partial C}{\partial w_2} = a_1 * (-2 (y - a_2))$

<hr>

Now update $w_2$:

$w_2 = w_2 - lr * (a_1 * (-2 (y - a_2))) = 1$
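
Written out with the values from the forward pass ($a_1 = 2$, $a_2 = 6$), and assuming the same learning rate $lr = 0.1$ as in the first example:

$w_2 = 3 - 0.1 * (2 * (-2 (1 - 6))) = 3 - 0.1 * 20 = 1$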

<hr>
If we run the input $x$ through the network again we see that $\hat{y} = 2$ and $MSE = 1$, which is much better than before.

But we are not done yet! We need to update $w_1$ too.

<hr>

That means we need the partial derivative of $C$ with respect to $w_1$:

$\frac{\partial C}{\partial w_1} = \frac{\partial a_1}{\partial w_1} \frac{\partial a_2}{\partial a_1} \frac{\partial C}{\partial a_2}$

<hr>

We calculated the $\frac{\partial C}{\partial a_2}$ before and we can reuse it here, which means we only need to calculate $\frac{\partial a_1}{\partial w_1}$ and $\frac{\partial a_2}{\partial a_1}$:

(Sidenote: $\frac{\partial C}{\partial a_2} = -2 (y - a_2)$)

<hr>

(Sidenote: $a_2 = a_1 * w_2$)

$\frac{\partial a_2}{\partial a_1} = w_2$

<hr>

(Sidenote: $a_1 = x * w_1$)

$\frac{\partial a_1}{\partial w_1} = x$


<hr>

Now our complete derivative of $C$ with respect to $w_1$ is:

$\frac{\partial C}{\partial w_1} = x * w_2 * (-2 (y - a_2))$

<hr>

Now update $w_1$:

$w_1 = w_1 - lr * (x * w_2 * (-2 (y - a_2))) = -1$
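
Written out, again assuming $lr = 0.1$. Note that the gradient uses the value $w_2 = 3$ from the forward pass, not the already updated $w_2 = 1$, because all gradients are computed before any weight is changed:

$w_1 = 2 - 0.1 * (1 * 3 * (-2 (1 - 6))) = 2 - 0.1 * 30 = -1$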


<hr>

Running the example with both updated weights gives us $\hat{y} = -1$ and $MSE = 4$, which is better than what we started with.

In [83]:
x = 1
w_1 = 2
w_2 = 3
y=1
lr = 0.1

a_1 = x * w_1
a_2 = a_1 * w_2

new_w_2 = w_2 - lr * (a_1 * (-2 * (y - a_2)))
print("New w_2:", new_w_2)

new_w_1 = w_1 - lr * (x * w_2 * (-2 * (y - a_2)))
print("New w_1:", new_w_1)

print("Update network...")

w_2 = new_w_2
w_1 = new_w_1

a_1 = x * w_1
a_2 = a_1 * w_2
print(a_1,a_2)
print("y_hat: ", a_2, "MSE: ", (y-a_2)**2)


New w_2: 1.0
New w_1: -1.0
Update network...
-1.0 -1.0
y_hat:  -1.0 MSE:  4.0
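
A common sanity check for hand-derived gradients is to compare them with finite-difference estimates. The sketch below does this for both weights of this example; the helper name `cost` and the step size `eps` are illustrative choices.

```python
def cost(w_1, w_2, x=1.0, y=1.0):
    """Squared error of the two-neuron chain y_hat = w_2 * (w_1 * x)."""
    a_1 = x * w_1
    a_2 = a_1 * w_2
    return (y - a_2) ** 2


x, y = 1.0, 1.0
w_1, w_2 = 2.0, 3.0
a_1 = x * w_1
a_2 = a_1 * w_2

# Analytic gradients from the chain rule
grad_w_2 = a_1 * (-2 * (y - a_2))        # 2 * 10 = 20
grad_w_1 = x * w_2 * (-2 * (y - a_2))    # 1 * 3 * 10 = 30

# Central finite differences: nudge one weight, keep the other fixed
eps = 1e-6
num_w_1 = (cost(w_1 + eps, w_2) - cost(w_1 - eps, w_2)) / (2 * eps)
num_w_2 = (cost(w_1, w_2 + eps) - cost(w_1, w_2 - eps)) / (2 * eps)

print(grad_w_1, num_w_1)
print(grad_w_2, num_w_2)
```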


<hr>

**We may as well do another epoch**

Let's recap our values:

$x = 1, w_1 = -1, w_2 = 1, y = 1$

$a_1 = x * w_1 = -1$

$a_2 = a_1 * w_2 = -1 * 1 = -1$


$\hat{y} = -1$, $MSE = 4$

<hr>

**Update $w_2$:**

$\frac{\partial C}{\partial a_2} = - 2(y - a_2)$

<hr>

(Sidenote: $a_2 = w_2 * a_1$)

$\frac{\partial a_2}{\partial w_2} = a_1$

<hr>

$\frac{\partial C}{\partial w_2} = a_1 * (-2(y - a_2))$

<hr>

$w_2 = w_2 - lr * (a_1 * (-2(y - a_2))) = 0.6$

<hr>

**Update $w_1$:**

$\frac{\partial C}{\partial w_1} = \frac{\partial a_1}{\partial w_1} \frac{\partial a_2}{\partial a_1} \frac{\partial C}{\partial a_2}$


(Sidenote: $\frac{\partial C}{\partial a_2} = -2(y - a_2)$)

<hr>

(Sidenote: $a_2 = a_1 * w_2$)

$\frac{\partial a_2}{\partial a_1} = w_2$

<hr>

(Sidenote: $a_1 = x * w_1$)

$\frac{\partial a_1}{\partial w_1} = x$

<hr>

$\frac{\partial C}{\partial w_1} = x * w_2 * (- 2(y - a_2))$

<hr>

$w_1 = w_1 - lr * (x * w_2 * (- 2(y - a_2))) = -0.6$

<hr>

If we check our example now we will get $\hat{y} = -0.36$ and $MSE = 1.85$, which is a slow but steady improvement.

In [85]:
x = 1
w_1 = -1
w_2 = 1
y = 1
lr = 0.1

a_1 = x * w_1
a_2 = a_1 * w_2

new_w_2 = w_2 - lr * (a_1 * (-2 * (y - a_2)))
print("New w_2:", new_w_2)

new_w_1 = w_1 - lr * (x * w_2 * (-2 * (y - a_2)))
print("New w_1:", new_w_1)

print("Update network...")

w_2 = new_w_2
w_1 = new_w_1

a_1 = x * w_1
a_2 = a_1 * w_2
print(a_1,a_2)
print("y_hat: ", a_2, "MSE: ", (y-a_2)**2)

New w_2: 0.6
New w_1: -0.6
Update network...
-0.6 -0.36
y_hat:  -0.36 MSE:  1.8495999999999997


In [120]:
# Validation with tensorflow
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

x = [1]
y = 1
w_1 = 2
w_2 = 3

# Two linear neurons in a chain, no biases: y_hat = w_2 * (w_1 * x)
model = Sequential()
model.add(Dense(1, input_dim=1, use_bias=False))
model.add(Dense(1, use_bias=False))
model.compile(loss='mse', optimizer=SGD(learning_rate=0.1), metrics=['mse'])

# Start from the same weights as in the manual calculation
model.set_weights([np.array([[w_1]], dtype="float32"), np.array([[w_2]], dtype="float32")])
print("weights:", model.get_weights())
print("y_hat:", model.predict(np.array(x).reshape(1, 1)))

# First epoch (one gradient-descent step on the single sample)
model.fit(np.array([x]), np.array([y]), epochs=1, verbose=0)
print("weights_1:", model.get_weights())
print("y_hat_1:", model.predict(np.array(x).reshape(1, 1)))

# Second epoch
model.fit(np.array([x]), np.array([y]), epochs=1, verbose=0)
print("weights_2:", model.get_weights())
print("y_hat_2:", model.predict(np.array(x).reshape(1, 1)))

weights: [array([[2.]], dtype=float32), array([[3.]], dtype=float32)]
y_hat: [[6.]]
weights_1: [array([[-1.]], dtype=float32), array([[1.]], dtype=float32)]
y_hat_1: [[-1.]]
weights_2: [array([[-0.6]], dtype=float32), array([[0.6]], dtype=float32)]
y_hat_2: [[-0.36]]


<hr>

## Let's mix it up a bit!
**We must include a bias and an activation function in our network**

**Let's consider the following example:**


<pre>
   b1___        b2__
   w1   \       w2  \
x ----- [n1]\ ----- [n2]\ -----> $\hat{y}$
</pre>

\ - marks where the activation function is applied

Let's say $x = 1, w_1 = 1, b_1 = 1, b_2 = 2, w_2 = 2, y=2$

**Notation:**

$x$: input value

$w_1, w_2$: weights

$b_1, b_2$: biases for neurons

$z_1, z_2$: weighted input plus bias of each neuron

$a_1, a_2$: outputs of the neurons ($z_1, z_2$ passed through the activation function)

<br>

**Formulas:**

For the activation function we will use the sigmoid $\sigma$:

$\displaystyle \sigma(x)={\frac {1}{1+e^{-x}}}$

Its derivative is:

$\displaystyle \sigma'(x)= \sigma(x) * (1 - \sigma(x))$
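
The identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be checked numerically. The sketch below compares it with a central finite difference at a few arbitrarily chosen points; the function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)


eps = 1e-6
for x in (-2.0, 0.0, 2.0):
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    print(x, sigmoid_prime(x), numeric)  # the last two columns should agree closely
```

The derivative peaks at $x = 0$, where $\sigma'(0) = 0.25$, and shrinks toward zero for large positive or negative inputs.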


<br>

$z_1 = x * w_1 + b_1$

$a_1 = \sigma(z_1)$

$z_2 = a_1 * w_2 + b_2$

$a_2 = \sigma(z_2)$

<hr>

When we run $x$ through the network we get $\hat{y} = 0.9772$ and $MSE = 1.046$

<hr>

**First things first! We now have some extra parameters which we need to take into account.**

Our derivative of $C$ with respect to $w$ now depends on the activation function as well:

$\displaystyle \frac{\partial C}{\partial w} = \frac{\partial z}{\partial w} \frac{\partial a}{\partial z} \frac{\partial C}{\partial a}$


We need to update our biases too, and the formula looks very similar to the one for $w$:

$\displaystyle \frac{\partial C}{\partial b} = \frac{\partial z}{\partial b} \frac{\partial a}{\partial z} \frac{\partial C}{\partial a}$


Okay, we are ready.

<hr>

As always we start by finding the derivative of our loss/cost function:

$\frac{\partial C}{\partial a_2} = -2 (y - a_2)$

<hr>

(Sidenote: $a_2 = \sigma(z_2)$)

$\frac{\partial a_2}{\partial z_2} = \sigma(z_2) * (1 - \sigma(z_2))$

<hr>

(Sidenote: $z_2 = a_1 * w_2 + b_2$)

$\frac{\partial z_2}{\partial w_2} = a_1$

<hr>

$\frac{\partial C}{\partial w_2} = (a_1) * (\sigma(z_2) * (1 - \sigma(z_2))) * (-2 (y - a_2))$

<hr>

$w_2 = w_2 - lr * ((a_1) * (\sigma(z_2) * (1 - \sigma(z_2))) * (-2 (y - a_2)))$

$w_2 = 2.004$
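
With the forward-pass values rounded to four decimals ($a_1 \approx 0.8808$, $a_2 = \sigma(z_2) \approx 0.9773$) and assuming $lr = 0.1$ as in the earlier examples, the numbers work out roughly as:

$\frac{\partial C}{\partial w_2} \approx 0.8808 * (0.9773 * 0.0227) * (-2 (2 - 0.9773)) \approx 0.8808 * 0.0222 * (-2.0454) \approx -0.0400$

$w_2 \approx 2 - 0.1 * (-0.0400) = 2.004$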

<hr>

Next we update $b_2$:

(Sidenote: $z_2 = a_1 * w_2 + b_2$)

$\frac{\partial z_2}{\partial b_2} = 1$


$\frac{\partial C}{\partial b_2} = (1) * (\sigma(z_2) * (1 - \sigma(z_2))) * (-2 (y - a_2))$

<hr>

$b_2 = b_2 - lr * ((1) * (\sigma(z_2) * (1 - \sigma(z_2))) * (-2 (y - a_2)))$

$b_2 = 2.0045$

<hr>

**Next we update $w_1$ and $b_1$:**

$\displaystyle \frac{\partial C}{\partial w_1} = \frac{\partial z_1}{\partial w_1} \frac{\partial a_1}{\partial z_1} \frac{\partial z_2}{\partial a_1} \frac{\partial a_2}{\partial z_2} \frac{\partial C}{\partial a_2}$

$\displaystyle \frac{\partial C}{\partial b_1} = \frac{\partial z_1}{\partial b_1} \frac{\partial a_1}{\partial z_1} \frac{\partial z_2}{\partial a_1} \frac{\partial a_2}{\partial z_2} \frac{\partial C}{\partial a_2}$

<hr>

**Calculate gradient for $w_1$:**

(Sidenote: $z_2 = a_1 * w_2 + b_2$)

$\frac{\partial z_2}{\partial a_1} = w_2 + 0$

<hr>

(Sidenote: $a_1 = \sigma(z_1)$)

$\frac{\partial a_1}{\partial z_1} = \sigma(z_1) * (1 - \sigma(z_1))$

<hr>

(Sidenote: $z_1 = x * w_1 + b_1$)

$\frac{\partial z_1}{\partial w_1} = x$

<hr>

Finally we get:

$\frac{\partial C}{\partial w_1} = 
(x) 
* 
(\sigma(z_1) * (1 - \sigma(z_1)))
*
(w_2)
*
(\sigma(z_2) * (1 - \sigma(z_2)))
*
(-2 (y -a_2))$


$w_1 = w_1 - lr * ((x) * (\sigma(z_1) * (1 - \sigma(z_1))) * (w_2) * (\sigma(z_2) * (1 - \sigma(z_2))) * (-2 (y -a_2)))$

$w_1 = 1.00095$

<hr>

**Calculate gradient for $b_1$:**

(Sidenote: $z_1 = x * w_1 + b_1$)

$\frac{\partial z_1}{\partial b_1} = 1$


<hr>

Then we get:

$\displaystyle \frac{\partial C}{\partial b_1} = 
(1)
*
(\sigma(z_1) * (1 - \sigma(z_1)))
*
(w_2)
*
(\sigma(z_2) * (1 - \sigma(z_2)))
*
(-2 (y - a_2))$


$b_1 = b_1 - lr * ((1) * (\sigma(z_1) * (1 - \sigma(z_1))) * (w_2) * (\sigma(z_2) * (1 - \sigma(z_2))) * (-2 (y - a_2)))$

$b_1 = 1.00095$

<hr>

**There we have it! One epoch is done.**

Our initial result was: $\hat{y}=0.9772$ and $MSE=1.046$

And after one epoch result is: $\hat{y}=0.9774$ and $MSE=1.045$

In [129]:
import numpy as np
def sigma(Z):
    return 1/(1+np.exp(-Z))

x = 1
y = 2
w_1 = 1
b_1 = 1
w_2 = 2
b_2 = 2
lr = 0.1  # learning rate, same value as in the previous examples

z_1 = x * w_1 + b_1
a_1 = sigma(z_1)
z_2 = a_1 * w_2 + b_2
a_2 = sigma(z_2)


new_w_2 = w_2 - lr * ((a_1) * (sigma(z_2) * (1 - sigma(z_2))) * (-2 *(y - a_2)))
print("New w_2:", new_w_2)

new_b_2 = b_2 - lr * ((1) * (sigma(z_2) * (1 - sigma(z_2))) * (-2 * (y - a_2)))
print("New b_2:", new_b_2)

new_w_1 = w_1 - lr * ((x) * (sigma(z_1) * (1 - sigma(z_1))) * (w_2) * (sigma(z_2) * (1 - sigma(z_2))) * (-2 *(y - a_2)))
print("New w_1:", new_w_1)

new_b_1 = b_1 - lr * ((1) * (sigma(z_1) * (1 - sigma(z_1))) * (w_2) * (sigma(z_2) * (1 - sigma(z_2))) * (-2 * (y - a_2)))
print("New b_1:", new_b_1)

print("Update network...")

w_2 = new_w_2
b_2 = new_b_2
w_1 = new_w_1
b_1 = new_b_1

z_1 = x * w_1 + b_1
a_1 = sigma(z_1)
z_2 = a_1 * w_2 + b_2
a_2 = sigma(z_2)

print("y_hat: ", a_2, "MSE: ", (y-a_2)**2)

New w_2: 2.0040000160377818
New b_2: 2.0045413593412063
New w_1: 1.000953627199678
New b_1: 1.000953627199678
Update network...
y_hat:  0.9774686759735803 MSE:  1.045570308615223
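
To double-check all four hand-derived gradients at once, they can be compared with finite-difference estimates. The sketch below uses the same starting values as this example; the helper names (`cost`, `analytic_gradients`, `numeric_gradients`) and the intermediate names `delta_1`, `delta_2` are illustrative choices, not part of the derivation above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(params, x=1.0, y=2.0):
    """Squared error of the two-neuron sigmoid chain."""
    w_1, b_1, w_2, b_2 = params
    a_1 = sigmoid(x * w_1 + b_1)
    a_2 = sigmoid(a_1 * w_2 + b_2)
    return (y - a_2) ** 2


def analytic_gradients(params, x=1.0, y=2.0):
    """Chain-rule gradients in the order [w_1, b_1, w_2, b_2]."""
    w_1, b_1, w_2, b_2 = params
    a_1 = sigmoid(x * w_1 + b_1)
    a_2 = sigmoid(a_1 * w_2 + b_2)
    delta_2 = a_2 * (1 - a_2) * (-2 * (y - a_2))   # dC/dz_2
    delta_1 = a_1 * (1 - a_1) * w_2 * delta_2      # dC/dz_1
    return [x * delta_1, delta_1, a_1 * delta_2, delta_2]


def numeric_gradients(params, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(up) - cost(down)) / (2 * eps))
    return grads


params = [1.0, 1.0, 2.0, 2.0]   # w_1, b_1, w_2, b_2 from the example
analytic = analytic_gradients(params)
numeric = numeric_gradients(params)
for name, g_a, g_n in zip(["w_1", "b_1", "w_2", "b_2"], analytic, numeric):
    print(name, g_a, g_n)  # each pair should agree closely
```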


In [131]:
# Validation with tensorflow
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

x = [1]
y = 2
w_1 = 1
b_1 = 1
w_2 = 2
b_2 = 2

# Two sigmoid neurons in a chain, each with its own bias
model = Sequential()
model.add(Dense(1, input_dim=1, use_bias=True, activation="sigmoid"))
model.add(Dense(1, use_bias=True, activation="sigmoid"))
model.compile(loss='mse', optimizer=SGD(learning_rate=0.1), metrics=['mse'])

# Weights are listed per layer as [kernel, bias, kernel, bias]
model.set_weights([
    np.array([[w_1]], dtype="float32"), np.array([b_1], dtype="float32"),
    np.array([[w_2]], dtype="float32"), np.array([b_2], dtype="float32"),
])
print("weights:", model.get_weights())
print("y_hat:", model.predict(np.array(x).reshape(1, 1)))

# One epoch on a single sample is exactly one gradient-descent step
model.fit(np.array([x]), np.array([y]), epochs=1, verbose=0)
print("weights:", model.get_weights())
print("y_hat:", model.predict(np.array(x).reshape(1, 1)))

weights: [array([[1.]], dtype=float32), array([1.], dtype=float32), array([[2.]], dtype=float32), array([2.], dtype=float32)]
y_hat: [[0.9772815]]
weights: [array([[1.0009537]], dtype=float32), array([1.0009537], dtype=float32), array([[2.004]], dtype=float32), array([2.0045414], dtype=float32)]
y_hat: [[0.97746867]]


# TODO: add for multiple neurons in a layer + matrix representation of weights and biases