## Neural Network
### Neuron
A neuron is the basic unit of a neural network: it computes a weighted sum of its inputs plus a bias, then passes the result through an activation function.

So it receives inputs and produces a single output.

$$
z = w^{T}x + b = \sum_{i=1}^{n}w_{i}x_{i} + b
$$
Here $w$ is the weight vector ($w^{T}$ is its transpose), $x$ is the input vector (the values coming from the previous layer) and $b$ is the bias.

A small worked example:
$$
\begin{aligned}
x &= \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad w = \begin{bmatrix} 0.5 \\ -1.0 \end{bmatrix}, \quad b = 0.1\\
w_{1}x_{1} &= 0.5 \cdot 1 = 0.5\\
w_{2}x_{2} &= -1.0 \cdot 2 = -2.0\\
\sum_{i=1}^{2} w_{i}x_{i} &= 0.5 + (-2.0) = -1.5\\
z &= -1.5 + 0.1 = -1.4 \quad \text{(add the bias)}\\
a &= \mathrm{ReLU}(z) = \max(0, z) = 0
\end{aligned}
$$
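
The worked example above can be reproduced in a few lines of plain Python. This is only a minimal sketch: the function names `relu` and `neuron` are illustrative choices, not something defined in these notes.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def neuron(x, w, b, activation=relu):
    """Return (z, a) for one neuron: z = w.x + b and a = activation(z)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, activation(z)


# Same numbers as the worked example: x = [1, 2], w = [0.5, -1.0], b = 0.1
z, a = neuron(x=[1.0, 2.0], w=[0.5, -1.0], b=0.1)
print(z, a)  # z is approximately -1.4, so ReLU clips the output to 0.0
```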

### Activation function
It's a function that calculates the output of a node from its pre-activation $z$, i.e. the weighted sum of its inputs plus the bias.

In logistic regression we previously used the sigmoid activation function.

Recalling it: $\sigma(x) = \frac{1}{1+e^{-x}}$, but there are other activation functions.

$\mathrm{ReLU}(z) = \max(0,z)$: simple and very effective in neural networks.

$$
\mathrm{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j=1}^{J}e^{x_{j}}}
$$
Softmax is usually used in the last layer of a categorical (multi-class) model, where it turns the outputs into a probability for each class.
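
The three activation functions above, written as a plain-Python sketch. Subtracting the maximum inside `softmax` is an addition that is not in the notes: it leaves the result unchanged mathematically but avoids overflow in `exp` for large inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(z):
    """ReLU: max(0, z)."""
    return max(0.0, z)


def softmax(xs):
    """Softmax over a list of scores; the outputs are positive and sum to 1."""
    m = max(xs)  # shift by the max for numerical stability (assumption, see above)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(sigmoid(0.0), relu(-3.0), probs)
```

The largest score gets the largest probability, and the probabilities sum to 1, which is why softmax suits the last layer of a classifier.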

### Layers
The first layer is called the input layer. Its nodes perform no operations, and its size usually equals the number of inputs.

The last layer is the output layer; its size usually equals the size of the desired output vector.

For example, in the well-known digit recognition task, the input layer has one node per pixel of the image (height × width), and the output layer has 10 nodes, giving a vector with one probability per digit.

The layers between these two are called hidden layers, and they can be of any size. A hidden layer $l$ with $m$ neurons and $n$ inputs has a weight matrix $W^{l} \in \mathbb{R}^{m \times n}$, a bias vector $b^{l} \in \mathbb{R}^{m}$ (one bias per neuron) and an input vector of activations from the previous layer $a^{l-1} \in \mathbb{R}^{n}$.

So generally for a layer $l$:

Pre-activation:
$z^{l} = W^{l}a^{l-1}+b^{l}$

Post-activation: $a^{l} = \phi(z^{l})$

Shape reminder: if $W^{l}$ is $m \times n$ then $a^{l-1}$ must be $n \times 1$, producing $z^{l}$ of shape $m \times 1$.
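
A small NumPy sketch of the pre-activation and post-activation steps, mainly to make the shapes concrete. The helper name `layer_forward`, the random values and the choice of ReLU for $\phi$ are assumptions made for illustration.

```python
import numpy as np


def layer_forward(W, a_prev, b, phi):
    """One layer: z = W a_prev + b (pre-activation), a = phi(z) (post-activation)."""
    z = W @ a_prev + b
    return z, phi(z)


rng = np.random.default_rng(0)
m, n = 3, 4                        # m neurons, n inputs
W = rng.normal(size=(m, n))        # weight matrix, shape (m, n)
b = np.zeros((m, 1))               # one bias per neuron, shape (m, 1)
a_prev = rng.normal(size=(n, 1))   # activations of the previous layer, shape (n, 1)

z, a = layer_forward(W, a_prev, b, phi=lambda z: np.maximum(0.0, z))
print(z.shape, a.shape)  # (3, 1) (3, 1)
```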

### Loss functions
A loss function measures how bad a single prediction $\hat{y}$ is compared to the true label $y$.
#### Mean squared error (regression)
Penalizes larger deviations more heavily, is differentiable, and corresponds to maximum likelihood under Gaussian noise.
$$
L(\hat{y},y) = \frac{1}{2} \sum_{i}(\hat{y}_{i} - y_{i})^2
$$
#### Binary classification loss
This is the negative log-likelihood of a Bernoulli distribution. It encourages the model to assign high probability to the correct class and strongly penalizes wrong predictions.
$$
L(\hat{y},y) = -\left(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right)
$$
#### Categorical Cross-Entropy (Multi-class classification)
The target is one of $K$ classes, represented by a one-hot vector.
$$
L(\hat{y},y) = -\sum_{j=1}^{K}y_{j}\log(\hat{y}_{j})
$$
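
The three losses above as a plain-Python sketch. The `eps` clipping is an assumption added here so that `log(0)` never occurs; it is not part of the formulas in the notes.

```python
import math


def mse(y_hat, y):
    """Mean squared error with the 1/2 factor: 0.5 * sum((y_hat_i - y_i)^2)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))


def binary_cross_entropy(y_hat, y, eps=1e-12):
    """-(y log(y_hat) + (1 - y) log(1 - y_hat)) for one predicted probability."""
    p = min(max(y_hat, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_cross_entropy(y_hat, y, eps=1e-12):
    """-sum_j y_j log(y_hat_j), where y is a one-hot vector."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y_hat, y))


print(mse([1.0, 2.0], [0.0, 2.0]))                            # 0.5
print(binary_cross_entropy(0.9, 1))                           # equals -log(0.9)
print(categorical_cross_entropy([0.7, 0.2, 0.1], [1, 0, 0]))  # equals -log(0.7)
```

With a one-hot target, only the term of the correct class survives in the categorical cross-entropy sum.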

### Back propagation
Everything above describes the step called forward propagation, since it computes the values from earlier neurons to later ones.

Now the core step is to update the model's parameters to align with the ground truth.

Thus we need to know whether to increase or decrease each neuron's parameters, and by how much.

We adjust the parameters of each neuron in layer $l$ based on what layer $l+1$ requires; this is what propagating backwards means.

So from forward propagation, we have these formulas:
(Supposing the loss is MSE)
$$
L(\hat{y},y) = \frac{1}{2} \sum_{i}(\hat{y}_{i} - y_{i})^2\\
z = w^{T}x + b = \sum_{i=1}^{n}w_{i}x_{i} + b\\
a^{l} = \phi(z^{l})
$$

We can clearly see that $\hat{y}_{i}$ influences $L(\hat{y},y)$

$w^{T}, x, b$ influence $z$

$z$ influences $a$, and the activation of the output layer is the prediction $\hat{y}$

What we want to calculate is how much we should increase or decrease the parameters of the neurons in the previous layer.

Basically we're trying to calculate:
$$\frac{dCost}{dw^{l}}$$

The derivative tells us how sensitive the cost is to $w$: if we change $w$ slightly, how much does the cost change?

Conceptually, following the chain of influence above:
$$
\frac{dCost}{dw^{l}} = \frac{dz^{l}}{dw^{l}} \cdot \frac{da^{l}}{dz^{l}} \cdot \frac{dCost}{da^{l}}
$$

And this is the chain rule in networks

Now we can calculate each derivative, since we have the formulas and all of them are differentiable:

$$
\frac{dz^{l}}{dw^{l}} = a^{l-1}\\
\\
\frac{da^{l}}{dz^{l}} = \phi^{'}(z^{l})\\
\\
\frac{dCost}{da^{l}} = (a^{l} - y)
$$
Substituting back into the original formula:
$$
\frac{dCost}{dw^{l}} = a^{l-1}\phi^{'}(z^{l})(a^{l} - y)
$$

Since the full cost is averaged over all training examples, this gradient must also be averaged over all training examples.

To compute the same thing for the bias, replace $\frac{dz^{l}}{dw^{l}}$ with $\frac{dz^{l}}{db^{l}} = 1$:
$$
\frac{dCost}{db^{l}} = 1 \cdot \phi^{'}(z^{l})(a^{l} - y)
$$
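
One way to sanity-check the formulas for $\frac{dCost}{dw^{l}}$ and $\frac{dCost}{db^{l}}$ is to compare them with a finite-difference estimate. The sketch below assumes a single sigmoid neuron with one input and the MSE loss $\frac{1}{2}(a - y)^2$; the numbers and function names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, a_prev, y):
    """Forward pass of one neuron followed by the loss 0.5 * (a - y)^2."""
    a = sigmoid(w * a_prev + b)
    return 0.5 * (a - y) ** 2


def analytic_grads(w, b, a_prev, y):
    """Chain rule: dCost/dw = a_prev * phi'(z) * (a - y), dCost/db = 1 * phi'(z) * (a - y)."""
    z = w * a_prev + b
    a = sigmoid(z)
    dphi = a * (1.0 - a)  # derivative of the sigmoid at z
    return a_prev * dphi * (a - y), 1.0 * dphi * (a - y)


w, b, a_prev, y = 0.8, -0.2, 1.5, 1.0
dw, db = analytic_grads(w, b, a_prev, y)

# Central finite differences: nudge one parameter and watch the loss change.
h = 1e-6
dw_num = (loss(w + h, b, a_prev, y) - loss(w - h, b, a_prev, y)) / (2 * h)
db_num = (loss(w, b + h, a_prev, y) - loss(w, b - h, a_prev, y)) / (2 * h)
print(dw, dw_num)
print(db, db_num)
```

The analytic and numerical values should agree to several decimal places, which is exactly the "change $w$ a bit and see how the cost changes" reading of the derivative.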

#### Multiple layers
So this was a simplified example with one neuron per layer, but what happens when each layer has multiple neurons?
The cost would become:
$\frac{1}{2}\sum_{i}(\hat{y}_{i} - y_{i})^2$ and the $z$ of each neuron would be: $\sum_{i=1}^{n}w_{i}x_{i} + b$
The form of the derivatives does not change.

In practice we don't compute $\frac{dCost}{dw^{l}}$ from scratch for every weight, but rather compute an error signal per layer:
$$\delta^{l} = \frac{dCost}{dz^{l}} = \frac{da^{l}}{dz^{l}}⊙\frac{dCost}{da^{l}}$$

This error signal is passed backwards during back propagation, rather than recomputing all the previous calculations for each neuron in each layer:

$$
\frac{dCost}{dW^{l}} = \delta^{l}(a^{l-1})^{T}
$$

Here $W^{l}$ is the weight matrix of layer $l$. For the bias:

$$
\frac{dCost}{db^{l}} = \delta^{l}
$$

We distinguish two cases: one for the output layer, and the other for the hidden layers.

Output (here the superscript $L$ is the index of the last layer):
$$
\delta^{L} = \frac{dCost}{dz^{L}} = \frac{dCost}{da^{L}}⊙\phi^{'}(z^{L}) = (a^{L} - y)⊙\phi^{'}(z^{L})
$$

Hidden:
$$
\delta^{l} = \left((W^{l+1})^{T}\delta^{l+1}\right)⊙\phi^{'}(z^{l})
$$

Note: $(W^{l+1})^{T}\delta^{l+1}$ is matrix multiplication and $⊙$ is elementwise multiplication

As we can see, $\delta^{l+1}$ is passed back to the previous layer, so we don't need to recompute every derivative for each neuron.

$(W^{l+1})^{T}\delta^{l+1}$ tells how much each neuron in layer $l$ contributes to the error in the next layer

$\phi^{'}(z^{l})$ adjusts this by how sensitive the activation is to its input
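
The output-layer and hidden-layer formulas can be put together in a short NumPy sketch. It assumes a network with one hidden layer, sigmoid activations everywhere and the MSE loss; the layer sizes, random values and function names are illustrative only. A finite-difference check on one weight is included to confirm that the $\delta$ recursion gives the same gradient.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2


def loss(params, x, y):
    a2 = forward(params, x)[-1]
    return 0.5 * float(np.sum((a2 - y) ** 2))


def backward(params, x, y):
    """Return [dW1, db1, dW2, db2] using the delta recursion."""
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    delta2 = (a2 - y) * a2 * (1 - a2)         # output layer: (a - y) ⊙ phi'(z)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # hidden layer: (W^T delta_next) ⊙ phi'(z)
    return [delta1 @ x.T, delta1, delta2 @ a1.T, delta2]


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))
y = np.array([[1.0], [0.0]])
params = [rng.normal(size=(4, 3)), np.zeros((4, 1)),
          rng.normal(size=(2, 4)), np.zeros((2, 1))]
grads = backward(params, x, y)

# Finite-difference check on a single entry of W1.
h = 1e-6
W1_plus = params[0].copy()
W1_plus[0, 0] += h
W1_minus = params[0].copy()
W1_minus[0, 0] -= h
num = (loss([W1_plus] + params[1:], x, y)
       - loss([W1_minus] + params[1:], x, y)) / (2 * h)
print(grads[0][0, 0], num)
```

Note that `delta2` is computed once and reused for both the output-layer gradients and `delta1`, which is the saving the error signal provides.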

### Update Rules
Now that we know how sensitive the cost is to each parameter, we need to actually update the parameters.

#### Stochastic Gradient Descent (SGD)
Compute the gradient on a mini-batch of examples (e.g. images), and update the parameters once per mini-batch. Here $\alpha$ is the learning rate.
$$ W = W - \alpha\frac{dL}{dW}$$
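
A minimal sketch of mini-batch SGD on a toy problem, to show where the update rule sits in the training loop. The model (a single weight fitting $y = 2x$), the data, the learning rate and the batch size are all assumptions chosen for illustration.

```python
import random


def sgd_step(w, grad, lr):
    """Plain SGD update: w = w - lr * grad."""
    return w - lr * grad


random.seed(0)
data = [(float(x), 2.0 * x) for x in range(1, 9)]  # toy dataset: y = 2x
w, lr, batch_size = 0.0, 0.01, 2

for epoch in range(20):
    random.shuffle(data)  # new mini-batches every epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of 0.5 * (w*x - y)^2 with respect to w, averaged over the batch.
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w = sgd_step(w, grad, lr)

print(w)  # approaches 2.0
```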

#### SGD with momentum
Don't change direction instantly; instead build up momentum.
$$
v_{t} = \mu v_{t-1} - \alpha\frac{d}{d\Theta_{j}}{L_{t}}\\
W_{t} = W_{t-1} + v_{t}
$$
A common choice is $\mu = 0.9$; it can be increased for noisy problems.

$v$ is the velocity; it accumulates a moving average of past gradients

$\mu$ is the momentum coefficient

$\frac{d}{d\Theta_{j}}{L_{t}}$ is the ordinary gradient of the loss with respect to a parameter $\Theta_{j}$ (here $W$)
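
The two momentum equations map directly onto a small update function. The sketch below applies it to a toy loss $(w - 3)^2$, which is an assumption chosen only because its gradient, $2(w - 3)$, is easy to read.

```python
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """v_t = mu * v_{t-1} - lr * grad, then w_t = w_{t-1} + v_t."""
    v = mu * v - lr * grad
    w = w + v
    return w, v


w, v = 0.0, 0.0
for _ in range(300):
    grad = 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3)^2
    w, v = momentum_step(w, v, grad)

print(w)  # approaches 3.0 after some overshoot and oscillation
```

Because the velocity keeps part of the previous step, the iterate overshoots the minimum and oscillates before settling, instead of reversing direction immediately.
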
#### RMSProp (Adaptive learning rate)
$$
s_{t} = \beta s_{t-1} + (1 - \beta)(\frac{d}{d\Theta_{j}}{L_{t}})^2\\
W_{t} = W_{t-1} - \frac{\alpha}{\sqrt{s_{t} + \epsilon}}\frac{d}{d\Theta_{j}}{L_{t}}
$$

$s$: exponential moving average of squared gradients

$\epsilon = 10^{-8}$: avoids division by 0

$\beta = 0.9$: decay rate of the moving average
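
A sketch of the RMSProp update for a single parameter, using the moving-average form $s_{t} = \beta s_{t-1} + (1 - \beta)g_{t}^2$ and keeping $\epsilon$ inside the square root as in the formula above. The toy loss $(w - 3)^2$ and the learning rate are assumptions for illustration.

```python
import math


def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    """Update the running average of squared gradients, then take a rescaled step."""
    s = beta * s + (1 - beta) * grad ** 2
    w = w - lr / math.sqrt(s + eps) * grad
    return w, s


w, s = 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3)^2
    w, s = rmsprop_step(w, s, grad)

print(w)  # ends up close to 3.0
```

Since the gradient is divided by the square root of its own running average, the step length stays on the order of the learning rate regardless of how large the raw gradient is.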

#### Adam
A combination of Momentum and RMSProp

- Momentum smooths gradients over time

- RMSProp rescales the learning rate per parameter (larger gradients -> smaller effective learning rate)

Compute the biased first moment (mean of gradients), where $\nabla L_{t}$ is the gradient of the loss at step $t$
$$
m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})\nabla L_{t}
$$
Compute the biased second moment (mean of squared gradients)
$$
v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})(\nabla L_{t})^2
$$
Bias correction, because $m_{t}$ and $v_{t}$ start at zero and are therefore too small at the beginning
$$
\hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^{t}}, \quad \hat{v}_{t} = \frac{v_{t}}{1-\beta_{2}^{t}}
$$
Update the weights and biases
$$
W_{t} = W_{t-1} - \alpha\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}
$$
Typical values are $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$.
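
The four Adam steps (first moment, second moment, bias correction, update) fit in one small function. This is a sketch for a single scalar parameter on the toy loss $(w - 3)^2$; the learning rate of 0.1 and the number of steps are assumptions for illustration, not recommended settings.

```python
import math


def adam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * grad       # biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # biased second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 1001):  # t starts at 1 so that 1 - beta^t is never zero
    grad = 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3)^2
    w, m, v = adam_step(w, m, v, grad, t)
    history.append(w)

print(history[0], w)
```

On the very first step the bias-corrected ratio $\hat{m}_{t}/\sqrt{\hat{v}_{t}}$ has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of the gradient's scale.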