# Backpropagation:

### Ockham's Razor:

Ockham's razor is a problem-solving principle which asserts that "the most likely hypothesis is the simplest one consistent with the data".

<img src="images/ockham's-razor.png" width="50%">

*Overfitting* occurs when a learning model fits the noise in its training data rather than the true signal. In the third panel above, the network has learned a decision boundary that doesn't generalise to the true pattern in the dataset.


### Training Neural Networks:
In the context of neural networks, *learning* is the process of adjusting a network's weights to achieve a configuration that minimises the <em>error function</em>.

#### Error Functions:
An example *error function*, also called *loss function*, that can be used to quantify the incorrectness of a network's prediction is the *mean squared error function*.

- *Mean squared error* &mdash; half the sum of the squared differences between the network's predicted value $z_i$ and the expected value $t_i$, taken over all neurons $i$ in the output layer:
$$
    E(z_i, t_i)=\frac{1}{2}\sum_{i} (z_i-t_i)^2. \tag{1}
$$
Note that the factor of $\frac{1}{2}$ is chosen so that it cancels the $2$ produced by the square when we differentiate this error function with respect to a network weight.

- *Cross-entropy loss* &mdash;  ... for all $i$ neurons in the output layer:
$$
    E(z_i, t_i)=-\sum_{i}t_i\log{(z_i)}. \tag{2}
$$
One way to interpret cross-entropy is as a negative log-likelihood: when the target is 1-hot encoded, only the term for the correct class $c$ survives, so $E=-\log(z_c)$, the negative log of the probability the network assigns to the correct class. Minimising it is therefore the same as maximising the likelihood of the correct label.
Cross-entropy is a good error function for measuring the 'distance' between the vector of the output layer's predicted probabilities for each class and the expected label vector (which is 1-hot encoded).
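- Binary cross-entropy loss &mdash; the two-class case of cross-entropy, where a single output $z$ is compared against a target $t\in\{0,1\}$: $E(z, t)=-(t\log(z)+(1-t)\log(1-z))$
- Categorical cross-entropy loss &mdash; also called *softmax loss*. It is just a softmax activation followed by a cross-entropy loss

A very helpful resource on these variants: https://gombru.github.io/2018/05/23/cross_entropy_loss/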
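
The two error functions in equations (1) and (2) can be sketched in plain Python as follows. This is only an illustrative sketch: the function names, the sample vectors, and the small `eps` guard against $\log(0)$ are assumptions added here, not part of the definitions above.

```python
import math

def half_squared_error(z, t):
    """Equation (1): half the sum of squared differences over the output neurons."""
    return 0.5 * sum((zi - ti) ** 2 for zi, ti in zip(z, t))

def cross_entropy(z, t, eps=1e-12):
    """Equation (2): -sum_i t_i * log(z_i). eps (an assumption) avoids log(0)."""
    return -sum(ti * math.log(max(zi, eps)) for zi, ti in zip(z, t))

# Hypothetical output-layer probabilities and a 1-hot target for class 0.
z = [0.7, 0.2, 0.1]
t = [1.0, 0.0, 0.0]

print(half_squared_error(z, t))  # 0.5 * (0.09 + 0.04 + 0.01)
print(cross_entropy(z, t))       # only the correct class contributes: -log(0.7)
```

With a 1-hot target, the cross-entropy depends only on the probability assigned to the correct class, which is what makes the negative log-likelihood reading possible.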


#### Differentiable Activation Functions:

A requirement for backpropagation in neural networks is a *differentiable activation function*, since backpropagation computes the gradients that *gradient descent* then uses to adjust the weights.

The Heaviside step function's derivative is $0$ everywhere except at the step itself, where it is undefined, so gradient descent can't make progress in updating the weights. One very important transition beyond the perceptron model is therefore the introduction of continuous, differentiable activation functions.

Alternative differentiable activation functions:

- Logistic sigmoid: $\sigma(s) =\frac{1}{1+e^{-s}}$
    - Output values range: $(0, 1)$
    - In statistics, *logistic regression* is used to model the probability of a binary event (e.g. alive/dead, cat/non-cat, etc.). Like all regression analyses, logistic regression is a form of predictive analysis
- Hyperbolic tan: $\texttt{tanh}(s)=\frac{e^s-e^{-s}}{e^s+e^{-s}}=2\left(\frac{1}{1+e^{-2s}}\right) - 1=2\sigma(2s)-1$
    - Output values range: $(-1, 1)$
    - $\texttt{tanh}$ is just the logistic sigmoid function compressed horizontally by a factor of 2, doubled, and then shifted down 1 unit vertically
- Rectified linear unit &mdash; 
$\texttt{ReLU}(s) = \begin{cases}
s, & s > 0,\\
0, & s \leq 0. 
\end{cases}$
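
As a rough sketch, the three activation functions listed above can be written in Python together with their derivatives. The derivative formulas are standard results that these notes do not state, and the convention of using $0$ for the ReLU derivative at $s=0$ is an assumption. The final loop checks numerically that $\texttt{tanh}(s)$ equals $2\sigma(2s)-1$.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def tanh(s):
    return (math.exp(s) - math.exp(-s)) / (math.exp(s) + math.exp(-s))

def relu(s):
    return s if s > 0 else 0.0

def sigmoid_prime(s):
    y = sigmoid(s)
    return y * (1.0 - y)          # sigma'(s) = sigma(s) * (1 - sigma(s))

def tanh_prime(s):
    return 1.0 - tanh(s) ** 2     # tanh'(s) = 1 - tanh(s)^2

def relu_prime(s):
    return 1.0 if s > 0 else 0.0  # assumed convention: derivative 0 at s = 0

for s in (-2.0, -0.5, 0.0, 1.5):
    # tanh is a rescaled sigmoid: tanh(s) = 2 * sigmoid(2s) - 1
    print(s, tanh(s), 2.0 * sigmoid(2.0 * s) - 1.0)
```

Unlike the Heaviside step function, the sigmoid and tanh derivatives are non-zero for every finite input, which is what gives gradient descent a usable signal.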




### Gradient Descent:

 


## Backpropagation

Backpropagation is a technique for computing gradients: the process of computing the derivatives of the error function with respect to a network's weights.

- This is just an application of the chain rule
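
To make the chain rule concrete, here is a minimal sketch for the smallest possible case: a single sigmoid neuron with one input, $z=\sigma(wx+b)$, and the squared error $E=\frac{1}{2}(z-t)^2$. The setup, the function names, and the numeric values are all assumptions chosen for illustration. The analytic gradient $\frac{\partial E}{\partial w}=(z-t)\,z(1-z)\,x$ is compared against a finite-difference estimate.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(w, b, x):
    return sigmoid(w * x + b)

def loss(w, b, x, t):
    return 0.5 * (forward(w, b, x) - t) ** 2

def grad_w(w, b, x, t):
    """dE/dw via the chain rule: dE/dz * dz/ds * ds/dw."""
    z = forward(w, b, x)
    dE_dz = z - t            # the 1/2 cancels the 2 from the square
    dz_ds = z * (1.0 - z)    # derivative of the logistic sigmoid
    ds_dw = x                # s = w * x + b
    return dE_dz * dz_ds * ds_dw

# Hypothetical values for the weight, bias, input and target.
w, b, x, t = 0.5, -0.1, 2.0, 1.0

# Central finite difference as an independent check on the analytic gradient.
h = 1e-6
numeric = (loss(w + h, b, x, t) - loss(w - h, b, x, t)) / (2 * h)

print(grad_w(w, b, x, t), numeric)
```

In a multi-layer network the same product of local derivatives is extended backwards, layer by layer, from the output to each weight.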



Gradient descent is the process of adjusting network parameters to minimise the loss function, using the derivatives calculated in backpropagation.

- Stochastic gradient descent — a *first order optimiser*. TODO
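
The update step can be sketched as follows for a deliberately tiny case: a single linear neuron $z=wx+b$ trained with the squared error, updating the parameters after every training example (the stochastic variant). The model, the learning rate, the number of epochs, and the toy data generated from $t=2x+1$ are all assumptions chosen so that the example is easy to check.

```python
import random

def sgd_fit(data, lr=0.05, epochs=500, seed=0):
    """Fit z = w * x + b by stochastic gradient descent on E = 0.5 * (z - t)^2."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)            # visit the examples in a random order
        for x, t in data:
            z = w * x + b            # forward pass
            error = z - t            # dE/dz
            w -= lr * error * x      # dE/dw = (z - t) * x
            b -= lr * error          # dE/db = (z - t)
    return w, b

# Toy, noise-free data from the hypothetical target t = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = sgd_fit(samples)
print(w, b)  # should approach w = 2, b = 1
```

Each update moves the parameters a small step against the gradient of the loss for one example; the learning rate `lr` controls the size of that step.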


The __*loss function*__ computes the error for a single training example.

- Mean squared error: $E(z, t)=\frac{1}{2}(z-t)^2$, where $z$ is the predicted value and $t$ is the target/expected value
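- Cross-entropy (binary case): $E(z, t)=-(t\log(z)+(1-t)\log(1-z))$, where $z\in(0,1)$ is the predicted probability and $t\in\{0,1\}$ is the target label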



The __*cost function*__ computes the average of the loss function values across the entire training set: $\frac{1}{m}\sum_{i=1}^m E(z^{(i)}, t^{(i)})$, where $m$ is the total number of training examples and $z^{(i)}$ and $t^{(i)}$ are the prediction and target value for the $i^{\text{th}}$ training sample respectively.
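
The distinction between the per-example loss and the cost over the training set can be sketched as below. The helper names and the three sample predictions are assumptions made for illustration only.

```python
import math

def squared_error(z, t):
    return 0.5 * (z - t) ** 2

def binary_cross_entropy(z, t):
    return -(t * math.log(z) + (1 - t) * math.log(1 - z))

def cost(loss, predictions, targets):
    """Average a per-example loss over all m training examples."""
    m = len(predictions)
    return sum(loss(z, t) for z, t in zip(predictions, targets)) / m

# Hypothetical predictions and targets for m = 3 training examples.
predictions = [0.9, 0.2, 0.6]
targets = [1.0, 0.0, 1.0]

print(cost(squared_error, predictions, targets))
print(cost(binary_cross_entropy, predictions, targets))
```

Passing the loss in as an argument keeps the averaging step identical whichever per-example loss is chosen.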




### Resources
- Backpropagation:
    - https://towardsdatascience.com/the-heart-of-artificial-neural-networks-26627e8c03ba
- Gradient descent:
    - https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9