## Information and Entropy

Entropy is a concept from information theory that describes the amount of uncertainty we have about a random variable with a given probability distribution. The idea is that the more uncertain we are, the less information we have. In essence, to find the entropy we calculate the average amount of information needed to describe the random variable. So how do we measure information? We use "bits". To see how this works, let's look at two examples.

#### Example 1
Suppose we have the following distribution:

$$Pr(x) = \begin{cases} 
          0.25 & x =A \\
          0.25 & x =B \\
          0.25 & x =C \\
          0.25 & x =D \\
       \end{cases}
$$

Given a random $x$, how many yes/no questions do we need to ask to determine its value?
Since all probabilities are the same, the most efficient way of doing it is to ask two questions:

1. Is it A or B?

2. If the answer was "Yes", we ask: Is it A? If the answer was "No", we ask: Is it C?

So, we always ask two questions, therefore the average number of questions asked is two as well. And so, we say that the entropy here is 2 bits.

#### Example 2

Suppose now our probabilities are not the same:

$$Pr(x) = \begin{cases} 
          0.5 & x =A \\
          0.25 & x =B \\
          0.125 & x =C \\
          0.125 & x =D \\
       \end{cases}
$$

We can do the same as in Example 1, but can we do better? Since the probability $Pr(x=A)$ is already a half, $A$ will appear very often, so let's ask our questions in the following way:

1. Is it A?

2. If the answer was "No", we ask: Is it B?

3. If the answer was "No", we ask: Is it C?

At first glance, it may seem like we are asking more questions now, but it depends on $x$. If $x=A$, for example, we ask only 1 question. So what is the average number of questions we ask? To find this, we calculate the expected value of the number of questions. Let $N(x)$ denote the number of questions we ask to get to $x$ (so $N(A)=1$ and $N(C)=3$, for example). Then

$$E[N(x)]=\sum_{x=A, B, C, D } Pr(x) \cdot N(x)=0.5 \cdot 1+0.25\cdot 2+0.125\cdot 3+0.125\cdot 3=1.75$$

So, on average we are asking fewer than two questions, and we say that the entropy for this distribution is 1.75 bits.
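The average can be checked with a few lines of Python. This is only a small sketch: the dictionaries `probs` and `questions` are names chosen here for illustration, and `questions` simply records how many questions the strategy above needs to reach each outcome.

```python
# Probabilities from Example 2.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Number of yes/no questions the strategy above asks before x is known.
# D also needs 3: it is what remains after "Is it C?" is answered "No".
questions = {"A": 1, "B": 2, "C": 3, "D": 3}

# Expected number of questions: sum of Pr(x) * N(x) over all outcomes.
expected_questions = sum(probs[x] * questions[x] for x in probs)
print(expected_questions)  # 1.75
```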

In Example 2 we have less entropy, because we have more predictive power than in Example 1.

Let's formalize our calculation. To find $N(x)$, notice that it depends on $Pr(x)$. The lower the probability, the more questions we need to ask, and since each question has a binary answer, we need to look at powers of 2:

$$N(x)=\log_2\left(\frac{1}{Pr(x)}\right)$$

So, suppose we have some probability distribution $P$ in which $P(i)=p_i$, then the entropy is:

$$H(P)=\sum_{i}p_i\cdot\log_2(1/p_i) = - \sum_{i}p_i\cdot\log_2(p_i)$$
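As a quick check, the formula can be applied to the two example distributions. The sketch below assumes a distribution is passed as a plain list of probabilities; the function name `entropy_bits` is ours, and terms with probability 0 are skipped by convention.

```python
import math

def entropy_bits(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Each term is p * log2(1 / p); outcomes with p == 0 are skipped.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # Example 1
skewed = [0.5, 0.25, 0.125, 0.125]   # Example 2

print(entropy_bits(uniform))  # 2.0
print(entropy_bits(skewed))   # 1.75
```

Both values agree with the average number of questions we counted by hand.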

Finally, we will only care about entropy as a relative quantity, so we can replace $\log_2()$ with $\ln()$, as they differ only by a constant factor ($\log_2(x) = \ln(x)/\ln(2)$). Entropy is then measured in nats rather than bits:

$$H(P)= - \sum_{i}p_i\cdot\ln(p_i)$$

#### Example 3

Suppose I have a bag with 10 blue and 10 red marbles and I want to pick one at random. I can't really predict what kind of marble I will get. But if the bag had 19 red marbles and 1 blue marble, I should expect to get a red marble. So the first case should have the larger entropy.

For the first bag, the entropy is

$$H= -\frac{1}{2}\ln\left(\frac{1}{2}\right)-\frac{1}{2}\ln\left(\frac{1}{2}\right) \approx 0.6931$$

And in the second case:

$$H=-\frac{1}{20}\ln\left(\frac{1}{20}\right)-\frac{19}{20}\ln\left(\frac{19}{20}\right) \approx 0.198515$$


Note: If $p_k=0$, we would get $0 \cdot \ln(0)$, which is undefined. However, since probability 0 means perfect certainty (the outcome never occurs), we set this term to 0. This agrees with the limit $p\ln(p) \to 0$ as $p \to 0^+$.
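The marble calculations, including the zero-probability convention from the note, can be reproduced with a short sketch. The function name `entropy_nats` and the variable names are assumptions made for this illustration.

```python
import math

def entropy_nats(probs):
    """Entropy in nats; terms with p == 0 are treated as 0, as in the note above."""
    return sum(p * math.log(1 / p) for p in probs if p > 0)

even_bag = [10 / 20, 10 / 20]    # 10 blue, 10 red
skewed_bag = [1 / 20, 19 / 20]   # 1 blue, 19 red
certain_bag = [0.0, 1.0]         # hypothetical bag with red marbles only

print(round(entropy_nats(even_bag), 4))    # 0.6931
print(round(entropy_nats(skewed_bag), 4))  # 0.1985
print(entropy_nats(certain_bag))           # 0.0
```

The more predictable the bag, the smaller the entropy, down to 0 when the outcome is certain.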

## KL-Divergence

KL-Divergence (also known as relative entropy) is a way to compare two probability distributions. The comparison is one-directional: one of the distributions is taken as the ground truth and the other as the distribution being tested. This means KL-Divergence is not symmetric. Let's try to develop this measure. We will consider $P$ as the true distribution (or observations) and $Q$ as the testing distribution (theory or predictions).

Suppose we have a coin. As we flip our coin many times, we record its distribution. Suppose we get $Pr(T|Q)=0.25$ and $Pr(H|Q)=0.75$. We want to compare this distribution to the distribution of a fair coin: $Pr(T|P)=Pr(H|P)=0.5$. One way to do this is to take a large random sample, compute the likelihood of that sample under each distribution, and take the ratio of the two.

For example, let's say we have $S=\{H, T, T, H, H, H, H, T, T, H \}$. Let $N_H$ and $N_T$ be the number of heads and tails, so $N_H=6$, $N_T=4$. Then

$$Pr(S|P)=Pr(H|P)^{N_H}\cdot Pr(T|P)^{N_T}=\left(\frac{1}{2}\right)^6 \cdot \left(\frac{1}{2}\right)^4=\left(\frac{1}{2}\right)^{10} \approx 0.00097656$$
$$Pr(S|Q)=Pr(H|Q)^{N_H}\cdot Pr(T|Q)^{N_T}=\left(\frac{3}{4}\right)^6 \cdot \left(\frac{1}{4}\right)^4=\frac{3^6}{4^{10}}\approx 0.00069523$$

Taking the ratio:

$$\frac{Pr(S|P)}{Pr(S|Q)}=\frac{0.00097656}{0.00069523} \approx 1.404664$$
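The same likelihood ratio can be computed directly from the sample. This is a minimal sketch; the helper `likelihood` and the dictionary representation of $P$ and $Q$ are choices made here for illustration.

```python
sample = ["H", "T", "T", "H", "H", "H", "H", "T", "T", "H"]
n_heads = sample.count("H")
n_tails = sample.count("T")

P = {"H": 0.5, "T": 0.5}     # fair coin
Q = {"H": 0.75, "T": 0.25}   # biased coin

def likelihood(dist, n_heads, n_tails):
    """Probability of a sample with the given counts, assuming independent flips."""
    return dist["H"] ** n_heads * dist["T"] ** n_tails

ratio = likelihood(P, n_heads, n_tails) / likelihood(Q, n_heads, n_tails)
print(n_heads, n_tails)  # 6 4
print(round(ratio, 6))   # 1.404664
```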

This is almost what we want. There are three things we need to fix:

1. So far this depends on the sample. If we want to compare distributions, we want the result to be independent of the sample taken. We can achieve this by letting the sample size go to infinity.

2. This creates a new problem. As the sample size grows, the likelihoods are products of more and more probabilities, so they shrink toward zero and their ratio is no longer well behaved. So we normalize by taking the $1/N$ power of the ratio, where $N$ is the sample size. Note that in this case $N_H/N$ and $N_T/N$ will approach the corresponding probabilities of the true distribution.

3. Finally, we want the value to have the same magnitude if, for example, a probability is doubled or halved. We take the natural log to achieve this, since $\ln(2)= -\ln(1/2)$.


Let's build our formula and apply it to our example.

Let $Pr(T|P)=p_1$ and $Pr(H|P)=p_2$; let $Pr(T|Q)=q_1$ and $Pr(H|Q)=q_2$. Then the ratio we had before would be:

$$\frac{p_1^{N_T}p_2^{N_H}}{q_1^{N_T}q_2^{N_H}}.$$

Let's take the power $1/N$ and a natural log:

$$\ln\left(\left(\frac{p_1^{N_T}p_2^{N_H}}{q_1^{N_T}q_2^{N_H}}\right)^{1/N}\right)=\frac{1}{N}\left(\ln(p_1^{N_T})+\ln(p_2^{N_H})-\ln(q_1^{N_T})-\ln(q_2^{N_H})\right)$$
$$= \frac{N_T}{N}\ln(p_1)+\frac{N_H}{N}\ln(p_2)-\frac{N_T}{N}\ln(q_1)-\frac{N_H}{N}\ln(q_2)$$

And as $N \rightarrow \infty$, we get:

$$ p_1\ln(p_1)+p_2\ln(p_2)-p_1\ln(q_1)-p_2\ln(q_2) = p_1 \ln \left(\frac{p_1}{q_1}\right) + p_2\ln\left(\frac{p_2}{q_2}\right)=\sum_i p_i\ln \left(\frac{p_i}{q_i}\right)$$

This is the KL-Divergence. The notation for it is

$$D_{KL}(P||Q)=\sum_i p_i\ln \left(\frac{p_i}{q_i}\right),$$
where $p_i=Pr(i|P)$ and $q_i=Pr(i|Q)$.
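The limiting argument can also be checked by simulation. The sketch below draws flips from the true distribution $P$ and evaluates the normalized log-ratio $\frac{N_T}{N}\ln(p_1/q_1)+\frac{N_H}{N}\ln(p_2/q_2)$ for growing $N$. The function name `normalized_log_ratio`, the fixed seed, and the sample sizes are arbitrary choices made for illustration. The computation works with logs directly, since the raw likelihoods would underflow for large $N$.

```python
import math
import random

def normalized_log_ratio(p, q, n, rng):
    """(1/N) * ln(Pr(S|P) / Pr(S|Q)) for a sample S of n flips drawn from p."""
    n_heads = sum(1 for _ in range(n) if rng.random() < p["H"])
    n_tails = n - n_heads
    return ((n_tails / n) * math.log(p["T"] / q["T"])
            + (n_heads / n) * math.log(p["H"] / q["H"]))

P = {"H": 0.5, "T": 0.5}     # true distribution (fair coin)
Q = {"H": 0.75, "T": 0.25}   # testing distribution

# The limit derived above: sum of p_i * ln(p_i / q_i).
limit = sum(P[k] * math.log(P[k] / Q[k]) for k in P)

rng = random.Random(0)  # seeded only so that reruns give the same numbers
for n in (10, 1_000, 100_000):
    print(n, normalized_log_ratio(P, Q, n, rng))
print("limit:", limit)
```

For small $N$ the value fluctuates from sample to sample; for large $N$ it should settle close to the limit.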

Applying this to our coin example, we get:

$$D_{KL}(P||Q)=\frac{1}{2}\ln\left(\frac{1/2}{3/4}\right)+\frac{1}{2}\ln\left(\frac{1/2}{1/4}\right)\approx 0.14384 $$

If, for example, our $Q$ distribution was $Pr(T|Q)=0.1$ and $Pr(H|Q)=0.9$, then we should expect to get a larger value for the KL-Divergence:

$$D_{KL}(P||Q)=\frac{1}{2}\ln\left(\frac{1/2}{9/10}\right)+\frac{1}{2}\ln\left(\frac{1/2}{1/10}\right)\approx 0.5108 $$
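Both values can be reproduced with a small helper. This is a sketch under the assumption that distributions are given as equal-length lists in the same outcome order and that $q_i > 0$ wherever $p_i > 0$; the name `kl_divergence` is ours.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair = [0.5, 0.5]            # P, ordered as (tails, heads)
biased = [0.25, 0.75]        # first Q
very_biased = [0.1, 0.9]     # second Q

print(round(kl_divergence(fair, biased), 5))       # 0.14384
print(round(kl_divergence(fair, very_biased), 4))  # 0.5108

# Swapping the arguments gives a different value: the measure is not symmetric.
print(round(kl_divergence(biased, fair), 5))       # 0.13081
```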

## Entropy and Cross-Entropy




If we look carefully at the KL-Divergence formula, we notice that it is related to entropy:

$$D_{KL}(P||Q)=\sum_i p_i\ln \left(\frac{p_i}{q_i}\right)=\sum_i p_i\ln \left(p_i\right)-\sum_i p_i\ln \left(q_i\right)=-H(P)-\sum_i p_i\ln \left(q_i\right)$$

The remaining term in the formula above, including its minus sign, is called the cross-entropy:

$$H(P,Q)=-\sum_i p_i\ln \left(q_i\right)$$

and so the final relation is:

$$D_{KL}(P||Q)=H(P,Q)-H(P)$$

So what is cross-entropy? How is it different from KL-Divergence?

Both are, in essence, measures of how different two distributions are. If KL-Divergence is the extra surprise from using distribution $Q$ when $P$ is the actual distribution, then cross-entropy is the total surprise from using $Q$ when $P$ is the true distribution.
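The relation $D_{KL}(P||Q)=H(P,Q)-H(P)$ is easy to verify numerically. The sketch below uses the coin distributions from the previous section; the three function names are chosen here for illustration, and all quantities are in nats.

```python
import math

def entropy(p):
    """H(P): the surprise inherent in P itself."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q): total surprise when Q is used but outcomes follow P."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q): the extra surprise caused by using Q instead of P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]     # true distribution (fair coin)
q = [0.25, 0.75]   # testing distribution

lhs = kl_divergence(p, q)
rhs = cross_entropy(p, q) - entropy(p)
print(math.isclose(lhs, rhs))  # True
```

Since $H(P)$ does not depend on $Q$, comparing several candidate $Q$ distributions by cross-entropy or by KL-Divergence gives the same ranking.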


### Cross-Entropy Loss

Suppose we have some labeled training data and a classification model parameterized by some parameter $\theta$. Then, given a data point $x^{(i)}$, our model produces a prediction $\hat{y}^{(i)}$. This prediction is a vector with as many entries as there are classes.