## Sigmoid neurons

Just like a perceptron, a sigmoid neuron has weights ($w_1, w_2, ... w_n$) and a $bias$. But instead of outputting 0 or 1, the weighted input is passed through the $sigmoid \space function$, so the output is $\sigma(w \cdot x + b)$.

$$ \sigma(w \cdot x + b) = \frac{1}{1 + e^{-z}}, z = w \cdot x + b $$
$$ w \cdot x + b = \sum_{j} w_j \cdot x_j + b $$
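
As a minimal sketch of the two formulas above, the output of a single sigmoid neuron could be computed as follows. The helper name `neuron_output` and the example weights, inputs and bias are assumptions made for illustration, not values from the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x, b):
    """Output of one sigmoid neuron: sigma(w . x + b). Hypothetical helper."""
    z = np.dot(w, x) + b  # weighted input z = sum_j w_j * x_j + b
    return sigmoid(z)

# Illustrative values (assumed for this example)
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
b = 0.0

# z = 0.5*2 - 1.0*1 + 0 = 0, so the output is sigma(0) = 0.5
out = neuron_output(w, x, b)
```

Unlike a perceptron, the result is a real number strictly between 0 and 1 rather than a hard 0 or 1.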

The sigmoid function is visualized below.

In [6]:
import numpy as np
import matplotlib.pyplot as plt

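# Logistic sigmoid; np.exp is the idiomatic vectorized exponential
sigmoid = lambda z: 1 / (1 + np.exp(-z))
x = np.linspace(-4, 4, 100)
y = sigmoid(x)

# Plot on the axes object returned by plt.subplots()
fig, ax = plt.subplots()
ax.plot(x, y)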

The smoothness of the sigmoid function is what's important. The exact form doesn't really matter, but this particular choice simplifies the calculations. Because the function is smooth, small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias cause only a small change $\Delta output$ in the output, which lets us control the output. Mathematically:

$$ \Delta output \approx \sum_{j} \frac{\partial output}{\partial w_j} \cdot \Delta w_j + \frac{\partial output}{\partial b} \cdot \Delta b $$
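
The approximation above can be checked numerically. The sketch below assumes a single sigmoid neuron, for which $\frac{\partial output}{\partial w_j} = \sigma'(z) \cdot x_j$ and $\frac{\partial output}{\partial b} = \sigma'(z)$; all numeric values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative neuron (all values are assumptions for this example)
w = np.array([0.6, -0.4])
x = np.array([1.0, 2.0])
b = 0.1

# Small perturbations of the weights and the bias
dw = np.array([0.01, -0.02])
db = 0.005

z = np.dot(w, x) + b
# Linear prediction: sum_j (d output / d w_j) * dw_j + (d output / d b) * db
predicted = np.dot(sigmoid_prime(z) * x, dw) + sigmoid_prime(z) * db
# Actual change in the output after applying the perturbations
actual = sigmoid(np.dot(w + dw, x) + (b + db)) - sigmoid(z)
print(predicted, actual)  # the two values should nearly agree
```

The smaller the perturbations, the closer the two numbers should be, since the formula drops the higher-order terms.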

$ x - training \space input$
 
$ y = y(x) - desired \space output $

What we need is an algorithm that adjusts the weights and biases so that the output is close to the desired output for every training input. To quantify how close we are, we use a **cost function** (aka loss or objective).

$$ C(w, b) = \frac{1}{2n} \sum_x || y(x) - a ||^2 $$
$ w - all \space weights $

$ b - all \space biases $

$ a - output \space when \space x \space is \space the \space input $

This particular *cost function* is known as *the quadratic cost function* or *mean squared error* or just MSE.
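
A minimal sketch of this cost in NumPy might look as follows. The function name `quadratic_cost`, the array layout (one row per training input) and the sample data are assumptions chosen for illustration.

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over x of ||y(x) - a||^2.

    outputs: array of shape (n, d), the network outputs a, one row per training input
    targets: array of shape (n, d), the desired outputs y(x)
    """
    n = outputs.shape[0]
    return np.sum((targets - outputs) ** 2) / (2 * n)

# Illustrative data (assumed): two training inputs, two output neurons each
a = np.array([[0.8, 0.2],
              [0.4, 0.6]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Squared distances are 0.08 and 0.32, so C = (0.08 + 0.32) / (2 * 2) = 0.1
cost = quadratic_cost(a, y)
```

The cost is zero only when every output equals its desired output, and it grows as the outputs move away from the targets.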

Minimizing the cost function means finding the weights and biases that make the cost as small as possible. Gradient descent is a common technique used to minimize a function.

Assuming C is a function that depends on v1, v2 we know from calculus:

$$ \Delta C \approx \frac{\partial C }{\partial v1} \cdot \Delta v1 + \frac{\partial C }{\partial v2} \cdot \Delta v2 $$

So we are going to choose $ \Delta v1 $ and $ \Delta v2 $ so as to make $ \Delta C $ negative.

We can define $ \Delta v $ as the vector of changes.

$$ \Delta v = (\Delta v1, \Delta v2)^T $$

And define the gradient of $C$ as the vector of its partial derivatives with respect to each variable.

$$ \nabla C = (\frac{\partial C}{\partial v1}, \frac{\partial C}{\partial v2})^T  $$

With all the definitions we can express the equation for $ \Delta C $ as a dot product.

$$ \Delta C \approx \nabla C \cdot \Delta v $$

This equation is very important because it shows how to choose $ \Delta v $: if we take $ \Delta v $ to be $ -\eta \cdot \nabla C $, where $\eta$ is a small positive number known as the *learning rate*, then $ \Delta C $ is never positive, so $C$ will always decrease, never increase.

$$ \Delta v = -\eta \cdot \nabla C $$
$$ \Delta C \approx - \eta \cdot \nabla C \cdot \nabla C = - \eta || \nabla C ||^2 $$

Because $ || \nabla C ||^2  \ge 0  $, this gives $ \Delta C \le 0 $ within the limits of our approximation.

That is, we compute a value for $ \Delta v $ and move the ball's position $ v $ by that amount.

$$ v \rightarrow v' = v - \eta \nabla C $$
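
The update rule can be sketched on a toy problem. The cost $C(v) = v1^2 + v2^2$, the learning rate, the starting point and the number of steps below are all assumptions picked for illustration; a real network's cost depends on its weights and biases instead.

```python
import numpy as np

def C(v):
    """Toy cost with its minimum at v = (0, 0); chosen only for illustration."""
    return v[0] ** 2 + v[1] ** 2

def grad_C(v):
    """Gradient of the toy cost: (2*v1, 2*v2)."""
    return 2.0 * v

eta = 0.1                    # learning rate (assumed value)
v = np.array([3.0, -4.0])    # starting position (assumed value)
costs = [C(v)]
for _ in range(50):
    v = v - eta * grad_C(v)  # update rule v -> v' = v - eta * grad C
    costs.append(C(v))

print(costs[0], costs[-1])   # the cost shrinks from 25 towards 0
```

With this learning rate every step shrinks the cost. If $\eta$ is chosen too large, the linear approximation behind the update no longer holds and the cost can increase instead.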

## Backpropagation algorithm

Backpropagation is an algorithm used to train a neural network. At its core is an expression for the partial derivative of the cost function with respect to each weight and bias.

### Conventions used for weights in the network

$$ w_{jk}^{l} \newline w - weight$$
weight of the connection from the $k-th$ neuron in the $l-1$ layer to the $j-th$ neuron in the $l$ layer

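$$ b_j^l \newline b - bias $$
bias of the $j-th$ neuron in the $l$ layer

Writing $a_j^l$ for the activation of the $j-th$ neuron in the $l$ layer, we can succinctly write: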

$$ a^l_j =  \sigma(\sum_k w_{jk}^l a_k^{l-1} + b_j^l) $$

We can write this in a vectorized form.

$$ a^l = \sigma(w^la^{l-1} + b^l) $$

The intermediate quantity inside the $\sigma$ function is useful enough to be worth naming.

$$ z^l \equiv w^la^{l-1} + b^l$$
$$ z^l - weighted \space input$$

We can then write the activation vector of a layer in terms of the weighted input like so:

$$ a^l = \sigma(z^l) $$
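
The vectorized form maps directly onto a loop over layers. The sketch below is one possible way to write it: the function name `feedforward`, the 2-3-1 layer sizes and the random parameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Apply a^l = sigma(w^l a^(l-1) + b^l) layer by layer.

    weights[i] has shape (neurons in layer, neurons in previous layer), so
    entry [j, k] connects neuron k of the previous layer to neuron j of this one.
    Returns the weighted inputs z^l and the activations a^l of every layer.
    """
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b      # weighted input z^l
        a = sigmoid(z)     # activation a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Illustrative 2-3-1 network with random parameters (assumed for this example)
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

zs, activations = feedforward(np.array([0.5, -0.2]), weights, biases)
```

Storing both `zs` and `activations` is a design choice: backpropagation needs the weighted inputs $z^l$ as well as the activations $a^l$.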

## The Hadamard product ($\bigodot$)



The $Hadamard \space or \space Schur \space product$ of two vectors of the same dimension is their elementwise product:
$$
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
\bigodot
\begin{bmatrix} 3 \\ 4 \end{bmatrix}
=
\begin{bmatrix} 1 \cdot 3 \\ 2 \cdot 4 \end{bmatrix}
=
\begin{bmatrix} 3 \\ 8 \end{bmatrix}
$$
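
In NumPy, the `*` operator on two arrays of the same shape is already elementwise, so the example above can be reproduced with a short sketch (the variable names are arbitrary):

```python
import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])

hadamard = s * t   # elementwise product: [1*3, 2*4] = [3, 8]
dot = s @ t        # for contrast, the dot product: 1*3 + 2*4 = 11
```

The contrast with `@` matters because the two operations are easy to confuse: the Hadamard product returns a vector, while the dot product returns a single number.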

Backpropagation is all about computing the partial derivatives of the cost function with respect to every single weight and bias. Mathematically we can express those as $ \frac{\partial C}{\partial w^l_{jk}} $ and $ \frac{\partial C}{\partial b_j^l} $.

To compute those we introduce an intermediate quantity called the error of the $j-th$ neuron in the $l-th$ layer:

$$\delta^l_j = \frac{\partial C}{\partial z_j^l}$$

## Four equations for backpropagation

### An equation for the error in the output layer

$$ \delta_{j}^L = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) $$

The first term $ \frac{\partial C}{\partial a_j^L} $ measures how fast the cost is changing as a function of the $j-th$ output activation. The second term $\sigma'(z_j^L) $ measures how fast the activation function $ \sigma $ is changing at $ z_j^L $.

In matrix-based form this can be written as:

$$ \delta^L = \nabla_a C \bigodot \sigma'(z^L) $$
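
As a hedged sketch, the output error can be computed and checked numerically. The code assumes the quadratic cost for a single training example, $C = \frac{1}{2} || y - a^L ||^2$, for which $\nabla_a C = a^L - y$; the function name `output_error` and the numeric values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(z_L, y):
    """delta^L = grad_a C (Hadamard) sigma'(z^L), assuming C = 1/2 * ||y - a^L||^2."""
    a_L = sigmoid(z_L)
    grad_a_C = a_L - y                    # dC/da_j^L for the assumed quadratic cost
    return grad_a_C * sigmoid_prime(z_L)  # elementwise (Hadamard) product

# Illustrative weighted inputs and desired output (assumed values)
z_L = np.array([0.3, -1.2])
y = np.array([1.0, 0.0])
delta_L = output_error(z_L, y)

# Sanity check: delta_j^L should match a finite-difference estimate of dC/dz_j^L
cost = lambda z: 0.5 * np.sum((y - sigmoid(z)) ** 2)
eps = 1e-6
numeric = np.array([(cost(z_L + eps * e) - cost(z_L - eps * e)) / (2 * eps)
                    for e in np.eye(len(z_L))])
print(delta_L, numeric)  # the two vectors should nearly agree
```

The finite-difference comparison ties the formula back to the definition $\delta^L_j = \frac{\partial C}{\partial z_j^L}$.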