#### Entropy:
In statistics, thermodynamics and information theory, *entropy* is a measure of uncertainty or information (the connection between uncertainty and information is that "an occurrence of an unlikely event gives you more *information* than the occurrence of a likely event"). The *entropy* of a discrete probability distribution with $n$ different outcomes, $p=\langle p_1, ..., p_n \rangle$, is:

$$H(p)=-\sum_{i=1}^{n} p_i \cdot \log_{2}p_i.$$
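
As a minimal sketch, the formula above can be computed directly in Python. The function name `entropy` and the example distributions are illustrative choices, and terms with $p_i = 0$ are skipped on the usual convention that $0 \cdot \log 0 = 0$.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability contribute nothing (0 * log 0 is taken as 0).
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

fair_coin = entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # more predictable, so lower entropy
certain = entropy([1.0, 0.0])        # no uncertainty at all
```

The more predictable the distribution, the lower the entropy: the fair coin has the highest value of the three and the certain outcome has zero.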

For a continuous probability distribution with density $p(\theta)$, the entropy is given by:

$$H(p)=-\int_\theta p(\theta) \cdot \log p(\theta) \, d\theta.$$

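For a *multivariate* Normal distribution in $n$ dimensions with diagonal covariance and standard deviations $\sigma_i$:

$$H(p)=\frac{n}{2}\log(2\pi e)+\sum_{i=1}^{n} \log \sigma_i,$$

so, up to an additive constant that does not depend on the $\sigma_i$, the entropy is $\sum_i \log \sigma_i$.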
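
To see where the $\log \sigma$ term comes from, here is a rough numerical check for the one-dimensional case. It compares a Riemann-sum approximation of $-\int p \log p$ with the closed form $\log \sigma + \frac{1}{2}\log(2\pi e)$ (natural logarithms, so the result is in nats). The helper names, integration width and step count are arbitrary choices for this sketch.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def numeric_entropy(mu, sigma, n=20000, width=12.0):
    """Approximate -integral of p*log(p) with a midpoint Riemann sum over mu +/- width*sigma."""
    lo = mu - width * sigma
    dx = 2 * width * sigma / n
    total = 0.0
    for k in range(n):
        p = normal_pdf(lo + (k + 0.5) * dx, mu, sigma)
        if p > 0:
            total -= p * math.log(p) * dx
    return total

def closed_form_entropy(sigma):
    # log(sigma) plus a constant that does not depend on sigma
    return math.log(sigma) + 0.5 * math.log(2 * math.pi * math.e)

h1 = numeric_entropy(0.0, 1.0)
h2 = numeric_entropy(0.0, 2.0)
```

Doubling $\sigma$ should raise the entropy by $\log 2$, whatever the mean is; only the spread matters.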

#### KL-Divergence:

For two probability distributions: $p=\langle p_1, ..., p_n \rangle$ and $q=\langle q_1, ..., q_n \rangle$, the *Kullback-Leibler Divergence* between $p$ and $q$ is:

$$D_{KL}(p \parallel q) = \sum_{i=1}^{n} p_i (\log_2 p_i - \log_2 q_i).$$

This gives a measure of dissimilarity, or 'distance', between two probability distributions: it is zero when $p = q$ and positive otherwise. It is not a true distance, because in general $D_{KL}(p \parallel q) \neq D_{KL}(q \parallel p)$.
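
The asymmetry is easy to see numerically. The sketch below implements the discrete formula above; the two example distributions are made up for illustration, and the function assumes $q_i > 0$ wherever $p_i > 0$.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions over the same outcomes."""
    # Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    return sum(pi * (math.log2(pi) - math.log2(qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)    # D_KL(p || q)
backward = kl_divergence(q, p)   # D_KL(q || p)
same = kl_divergence(p, p)       # a distribution compared with itself
```

Both directions are positive but they differ, while comparing a distribution with itself gives zero.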

For continuous distributions $p$ and $q$, the KL-divergence is given by:

$$D_{KL}(p \parallel q) = \int_\theta p(\theta)(\log p(\theta) - \log q(\theta)) d\theta.$$




### Entropy and Huffman Encoding:

The entropy value of a frequency distribution of characters to send in a message gives the *average number of bits* required to represent a character drawn randomly from the message (in the most efficient character encoding scheme).

Example 1. Suppose a message consists only of $A$ and $B$ with equal frequency. The Huffman encoding would assign $A=\texttt{0}$ and $B=\texttt{1}$, for example.

The entropy value here would be $H(\langle 0.5, 0.5 \rangle)=-(0.5 \cdot \log_2\frac{1}{2} + 0.5 \cdot \log_2\frac{1}{2}) = 1$ bit.

Example 2. Suppose a message consists of $A, B \text{ and } C$ with frequencies $0.5, 0.25 \text{ and } 0.25$ respectively. The Huffman encoding would assign $A=\texttt{0}, B=\texttt{10}, C=\texttt{11}$, for example.

The entropy value here would be $H(\langle 0.5, 0.25, 0.25 \rangle)=-(0.5 \cdot \log_2\frac{1}{2} + 0.25 \cdot \log_2\frac{1}{4} + 0.25 \cdot \log_2\frac{1}{4}) = 1.5$ bits. 
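
Example 2 can be reproduced with a small Huffman construction. This is a sketch using Python's `heapq`; the function name and the choice to track only code *lengths* (rather than the actual bit strings) are simplifications made here. Note that a one-symbol alphabet would get length 0 under this sketch.

```python
import heapq
import math

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from {symbol: probability}."""
    # Heap entries are (probability, tie_breaker, {symbol: depth so far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"A": 0.5, "B": 0.25, "C": 0.25}
lengths = huffman_code_lengths(freqs)
avg_bits = sum(freqs[s] * lengths[s] for s in freqs)
entropy_bits = -sum(p * math.log2(p) for p in freqs.values())
```

Here the average code length matches the entropy exactly because every frequency is a power of $\frac{1}{2}$; for other frequencies the Huffman average is at least the entropy and less than one bit above it.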


Note: if symbols are actually drawn from $p$ but we use an encoding designed for $q$, then $D_{KL}(p \parallel q)$ is the average number of 'extraneous' bits transmitted per symbol, compared with an encoding designed for $p$ itself.

E.g. if we designed a Huffman encoding for the alphabet around the letter frequencies of English ($q$), but applied it to German text ($p$), then the average number of 'wasted' bits per character would be approximately $D_{KL}(p \parallel q)$.
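
A small numerical sketch of the wasted-bits idea, writing `actual` for the distribution the symbols are really drawn from and `design` for the distribution the code was built around (both made up for illustration). It assumes an idealised code that spends exactly $-\log_2$ of the design probability on each symbol; a real Huffman code has whole-number lengths, so the figures would only be approximate.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits per symbol when symbols follow p but code lengths are -log2(q_i)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

actual = [0.5, 0.25, 0.25]   # distribution the symbols are really drawn from
design = [0.8, 0.1, 0.1]     # distribution the code was designed for

optimal_bits = entropy(actual)                 # best achievable average
used_bits = cross_entropy(actual, design)      # average with the mismatched code
wasted_bits = used_bits - optimal_bits         # equals D_KL(actual || design)
```

The overhead is the KL-divergence with the *actual* distribution in the first slot and the *design* distribution in the second.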



### Forward/Reverse KL-Divergence:
Forward KL-Divergence: we have a distribution $P$ and want to choose a Normal distribution $Q$ which is 'close' to $P$, using $D_{KL}(P \parallel Q)$ as the loss function to minimise. This penalises $Q$ for being small wherever $P$ is large, so $Q$ tends to spread out to cover all of $P$.

Reverse KL-Divergence: we have a distribution $P$ and want to choose a Normal distribution $Q$ that minimises $D_{KL}(Q \parallel P)$ rather than $D_{KL}(P \parallel Q)$. This penalises $Q$ for placing mass where $P$ is small, so $Q$ tends to concentrate on a single mode of $P$.


Forward             |  Reverse
:-------------------------:|:-------------------------:
![](images/forward-kl-divergence.png)   |  ![](images/reverse-kl-divergence.png)
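
The difference between the two directions can be explored with a brute-force sketch. Everything specific here is an assumption made for illustration: the target $P$ is taken to be an equal mixture of $N(-3, 1)$ and $N(3, 1)$, the integrals are approximated by a Riemann sum on a fixed grid, and the best Normal $Q$ is found by a coarse grid search over $(\mu, \sigma)$ rather than by gradient descent.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2 * math.pi)

def log_normal(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - LOG_SQRT_2PI

def log_p(x):
    """Log-density of the assumed target: an equal mixture of N(-3, 1) and N(3, 1)."""
    a, b = log_normal(x, -3.0, 1.0), log_normal(x, 3.0, 1.0)
    hi, lo = max(a, b), min(a, b)
    return math.log(0.5) + hi + math.log1p(math.exp(lo - hi))

DX = 0.05
XS = [-20.0 + DX * k for k in range(801)]   # integration grid over [-20, 20]
LOG_P = [log_p(x) for x in XS]

def forward_kl(mu, sigma):
    # D_KL(P || Q): integrate p * (log p - log q)
    return sum(math.exp(lp) * (lp - log_normal(x, mu, sigma)) for x, lp in zip(XS, LOG_P)) * DX

def reverse_kl(mu, sigma):
    # D_KL(Q || P): integrate q * (log q - log p)
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return total * DX

# Candidate (mu, sigma) pairs: mu in [-4, 4], sigma in [0.5, 4], both in steps of 0.5.
candidates = [(0.5 * i, 0.5 * j) for i in range(-8, 9) for j in range(1, 9)]
best_forward = min(candidates, key=lambda c: forward_kl(*c))
best_reverse = min(candidates, key=lambda c: reverse_kl(*c))
```

For this target, minimising the forward divergence amounts to matching the mean and variance of $P$ (mean $0$, standard deviation $\sqrt{10} \approx 3.16$), so `best_forward` should be a wide Normal centred between the modes, while `best_reverse` should sit on one of the two modes with $\sigma$ close to 1.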

## Variations on Backpropagation:

### Cross Entropy
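Cross-entropy, $H(p, q) = H(p) + D_{KL}(p \parallel q)$, is the average number of bits needed to encode events drawn from $p$ using a code designed for $q$, where $p$ and $q$ are two probability distributions over the same set of events. It differs from the relative entropy (KL-divergence) only by the term $H(p)$, which does not depend on $q$.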

For classification tasks where the output should be either 0 or 1, the mean squared error loss function works poorly. 

Instead, we can use the cross-entropy error function $E=-(t\log (z) + (1-t)\log(1-z))$, where $z$ is the network's output (the sigmoid of the weighted input sum) and $t$ is the target value.

- If $t=1$, $E=-\log (z)$
- If $t=0$, $E=-\log(1-z)$

This forces the network to put much heavier emphasis on confident misclassifications: as $z \to 0$ with $t=1$ (or $z \to 1$ with $t=0$), $E \to \infty$, whereas the squared error stays bounded.

E.g. in credit card fraud detection, there is a far larger proportion of negative instances than positive instances. Cross-entropy heavily penalises the positive instances that the network confidently misclassifies, which is extremely important in this application.
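
To make the comparison concrete, the sketch below evaluates both error functions for a target of 1 as the output gets progressively more wrong. The function names and the sample outputs are illustrative, and the squared error is written with a factor of $\frac{1}{2}$ to match the usual backpropagation convention.

```python
import math

def cross_entropy_error(z, t):
    """E = -(t log z + (1 - t) log(1 - z)) for output z in (0, 1) and target t in {0, 1}."""
    return -(t * math.log(z) + (1 - t) * math.log(1 - z))

def squared_error(z, t):
    return 0.5 * (z - t) ** 2

# Target is 1; the output z gets progressively more wrong.
for z in (0.9, 0.5, 0.1, 0.01):
    print(f"z={z:<5} cross-entropy={cross_entropy_error(z, 1):.3f}  squared={squared_error(z, 1):.3f}")
```

The squared error can never exceed $\frac{1}{2}$ here, while the cross-entropy keeps growing as the output approaches the wrong extreme.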

Choosing the cross-entropy error function also makes the backpropagation computations simpler. Suppose we have the logistic sigmoid activation function $z=\frac{1}{1+e^{-s}}$, for which $\frac{\partial z}{\partial s}=z(1-z)$. Then

$$\frac{\partial E}{\partial z}=\frac{z-t}{z(1-z)},$$

and applying the chain rule, we have

$$\frac{\partial E}{\partial s}=\frac{\partial E}{\partial z}\cdot \frac{\partial z}{\partial s}=z-t,$$

a very simple result.
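
The result $\frac{\partial E}{\partial s}=z-t$ can be sanity-checked numerically. This sketch compares it with a central finite-difference estimate at a few arbitrary values of $s$; the step size `h` and the test points are choices made for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def error(s, t):
    """Cross-entropy error as a function of the pre-activation s."""
    z = sigmoid(s)
    return -(t * math.log(z) + (1 - t) * math.log(1 - z))

def numeric_grad(s, t, h=1e-6):
    # Central difference approximation of dE/ds.
    return (error(s + h, t) - error(s - h, t)) / (2 * h)

# Each entry: (s, t, finite-difference gradient, analytic gradient z - t)
checks = [(s, t, numeric_grad(s, t), sigmoid(s) - t)
          for s in (-2.0, 0.3, 1.5) for t in (0, 1)]
```

The last two entries of each tuple should agree to several decimal places.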

<a href="https://towardsdatascience.com/cross-entropy-for-dummies-5189303c7735">Good article for clarification on cross-entropy</a>

#### Maximum Likelihood:
A hypothesis is a particular set of weights.

$P(D|h)= \text{probability of data } D \text{ being generated under hypothesis } h \in H$, where $H$ is a class of hypotheses. I.e. $P(D|h)$ is the probability that we observe $D$, given a particular set of weights.

$P(D|h)$ is called the *likelihood* of $h$, and $\log{P(D|h)}$ the *log-likelihood*.

We want to find the $h$ that maximises $P(D|h)$ or, equivalently, $\log{P(D|h)}$.

Here the data $D$ is the training set: the observed inputs together with their target values.
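
As a worked illustration of how this connects to cross-entropy, assume the data $D$ consists of independent training items with binary targets $t_i \in \{0, 1\}$, and that the network output $z_i$ for item $i$ is interpreted as the probability that $t_i = 1$ under the weights $h$ (these assumptions are introduced here for the sketch). Then:

$$P(D|h)=\prod_i z_i^{t_i}(1-z_i)^{1-t_i},$$

$$\log P(D|h)=\sum_i \big(t_i \log z_i + (1-t_i)\log(1-z_i)\big) = -\sum_i E_i,$$

where $E_i$ is the cross-entropy error for item $i$. Under these assumptions, maximising the likelihood is the same as minimising the total cross-entropy error.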


### Weight Decay
When weights 'blow up', it can inhibit the network's ability to learn. For instance, when the weights into a neuron are large, its weighted input sum will be large in magnitude, so a sigmoid activation function saturates and behaves almost like a step function, with gradients close to zero.

To 'encourage' the weights to remain small, we add a *penalty* term to the loss function. 

$$E=\frac{1}{2}\sum_i (z_i-t_i)^2+\underbrace{\frac{\lambda}{2}\sum_j w_j^2}_{\text{Penalty term}}$$

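Since the goal is to *minimise* the error function, the additional term $\frac{\lambda}{2}\sum_j w_j^2$ makes large weights costly: its gradient with respect to $w_j$ is $\lambda w_j$, so each gradient descent step with learning rate $\eta$ also shrinks every weight by a factor of $(1-\eta\lambda)$. A weight therefore 'decays' towards zero unless the data term keeps pushing it away.

The coefficient $\lambda$ is determined empirically by tuning, e.g. $\lambda=0.00001$.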
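
A minimal sketch of what the penalty term does during gradient descent. The gradient of $\frac{\lambda}{2}\sum_j w_j^2$ with respect to $w_j$ is $\lambda w_j$, so it is simply added to the gradient of the data error. The function name, the learning rate `lr` and the value of `lam` are hypothetical choices (a much larger $\lambda$ than one would normally use, so the effect is visible).

```python
def decay_step(weights, data_grads, lr=0.1, lam=0.01):
    """One gradient-descent step on E = data_error + (lam / 2) * sum(w_j ** 2)."""
    # d/dw_j of the penalty term is lam * w_j, so each weight is pulled towards zero.
    return [w - lr * (g + lam * w) for w, g in zip(weights, data_grads)]

# With zero data gradient, each step multiplies every weight by (1 - lr * lam).
w = [5.0, -3.0]
for _ in range(100):
    w = decay_step(w, [0.0, 0.0])
```

With no pressure from the data, the weights shrink geometrically towards zero, which is where the name 'weight decay' comes from.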


TODO: What is Bayesian inference?




### Momentum

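Momentum dampens oscillations in a 'rain gutter' part of the error landscape (a long, narrow valley with steep sides). Instead of stepping directly along the negative gradient, we keep a running update $\delta w \leftarrow \alpha\,\delta w - \eta \frac{\partial E}{\partial w}$ and apply $w \leftarrow w + \delta w$, where $\alpha \in [0, 1)$ is the momentum factor and $\eta$ is the learning rate.

Across the gutter, successive gradients point in opposite directions and largely cancel. Along the gutter, they point the same way and accumulate: for a constant gradient, $\delta w$ converges to $-\frac{\eta}{1-\alpha}\frac{\partial E}{\partial w}$, so momentum amplifies the descent towards the bottom by a factor of $\frac{1}{1-\alpha}$.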
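
The factor $\frac{1}{1-\alpha}$ can be checked with a short sketch. It assumes the common momentum update $\delta w \leftarrow \alpha\,\delta w - \eta \frac{\partial E}{\partial w}$ and a gradient that stays constant, as it roughly does along the bottom of the gutter. The learning rate, $\alpha$ and gradient value are arbitrary illustrative choices.

```python
def momentum_velocity(grad, lr=0.1, alpha=0.9, steps=200):
    """Iterate delta <- alpha * delta - lr * grad with a constant gradient."""
    delta = 0.0
    for _ in range(steps):
        delta = alpha * delta - lr * grad
    return delta

plain_step = -0.1 * 2.0                     # step without momentum: -lr * grad
steady_step = momentum_velocity(2.0)        # step that momentum settles into
amplification = steady_step / plain_step    # approaches 1 / (1 - alpha)
```

With $\alpha = 0.9$ the steady-state step is about ten times the plain gradient step, matching $\frac{1}{1-\alpha} = 10$. When successive gradients alternate in sign instead, the accumulated terms largely cancel, which is the damping effect.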


