## Information and Entropy

Entropy is a concept from information theory that describes how much uncertainty a random variable carries under a given probability distribution. The idea is that the more uncertain we are, the less information we have. In essence, to find the entropy we calculate the average amount of information needed to describe the random variable. So, how do we measure information? We use "bits". To see how this works, let's look at two examples.

#### Example 1
Suppose we have the following distribution:

$$Pr(x) = \begin{cases}
          0.25 & x =A \\
          0.25 & x =B \\
          0.25 & x =C \\
          0.25 & x =D \\
       \end{cases}
$$

Given a random $x$, how many yes/no questions do we need to ask to determine its value?
Since all probabilities are the same, the most efficient way of doing it is to ask two questions:

1. Is it A or B?

2. If the answer was "Yes", we ask: Is it A? If the answer was "No", we ask: Is it C?

So, we always ask two questions, therefore the average number of questions asked is two as well. And so, we say that the entropy here is 2 bits.
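The two-question strategy can be sketched in a few lines. This is only an illustration: the function name `identify_uniform` and the dictionary `uniform` are made up for this sketch.

```python
def identify_uniform(x):
    """Identify x in {A, B, C, D} with the two-question strategy.

    Returns (guess, questions_asked).
    """
    questions = 1  # Question 1: "Is it A or B?"
    if x in ("A", "B"):
        questions += 1  # Question 2: "Is it A?"
        guess = "A" if x == "A" else "B"
    else:
        questions += 1  # Question 2: "Is it C?"
        guess = "C" if x == "C" else "D"
    return guess, questions


uniform = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# Average number of questions, weighted by the probability of each outcome.
average_questions = sum(p * identify_uniform(x)[1] for x, p in uniform.items())
print(average_questions)  # 2.0
```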

#### Example 2

Suppose now our probabilities are not the same:

$$Pr(x) = \begin{cases}
          0.5 & x =A \\
          0.25 & x =B \\
          0.125 & x =C \\
          0.125 & x =D \\
       \end{cases}
$$

We can do the same as in Example 1, but can we do better? Since the probability $Pr(x=A)$ is already a half, $A$ will appear very often, so let's ask our questions in the following way:

1. Is it A?

2. If the answer was "No", we ask: Is it B?

3. If the answer was "No", we ask: Is it C?

At first glance, it may seem like we are asking more questions now, but it depends on $x$. If $x=A$, for example, we ask only 1 question. So what is the average number of questions we ask? To find this, we calculate the expected value of the number of questions. Let $N(x)$ denote the number of questions we ask to get to $x$ (so $N(A)=1$ and $N(C)=3$, for example). Then

$$E[N(x)]=\sum_{x \in \{A, B, C, D\}} Pr(x) \cdot N(x)=0.5 \cdot 1+0.25\cdot 2+0.125\cdot 3+0.125\cdot 3=1.75$$

So, on average we are asking less than two questions and we say that the entropy for this distribution is 1.75 bits.
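As a quick check of this average, here is a minimal sketch of the sequential strategy. The helper `count_questions` is a hypothetical name, and it assumes that three "No" answers identify $D$ without a further question.

```python
def count_questions(x, order=("A", "B", "C")):
    """Count the yes/no questions asked before x is identified."""
    asked = 0
    for candidate in order:
        asked += 1  # "Is it <candidate>?"
        if x == candidate:
            return asked
    # Every answer was "No", so x must be the remaining outcome (D).
    return asked


P = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Expected number of questions: sum of Pr(x) * N(x).
expected_questions = sum(p * count_questions(x) for x, p in P.items())
print(expected_questions)  # 1.75
```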

In Example 2 we have less entropy because we have more predictive power than in Example 1.

Let's formalize our calculation. To find $N(x)$, notice that it depends on $Pr(x)$. The lower the probability, the more questions we need to ask, and since each question has a binary answer, we need to look at powers of 2 (in Example 2, an outcome with probability $1/2^k$ took $k$ questions):

$$N(x)=\log_2\left(\frac{1}{Pr(x)}\right)$$

So, suppose we have some probability distribution $P$ in which $P(i)=p_i$, then the entropy is:

$$H(P)=\sum_{i}p_i\cdot\log_2(1/p_i) = - \sum_{i}p_i\cdot\log_2(p_i)$$
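The formula can be checked against the two examples with a short sketch. The name `entropy_bits` is illustrative, and skipping outcomes with zero probability is an assumption made here to avoid dividing by zero.

```python
import math


def entropy_bits(probs):
    """Entropy in bits: sum of p * log2(1/p) over outcomes with p > 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


example_1 = [0.25, 0.25, 0.25, 0.25]
example_2 = [0.5, 0.25, 0.125, 0.125]

print(entropy_bits(example_1))  # 2.0
print(entropy_bits(example_2))  # 1.75
```

Both values agree with the average number of questions counted by hand.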

Finally, we will only care about entropy as a relative quantity, so we can replace $\log_2()$ with $\ln()$, as they differ only by a constant factor:

$$H(P)= - \sum_{i}p_i\cdot\ln(p_i)$$
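To make the constant explicit, here is a short worked step (the subscripts "bits" and "nats" are labels introduced only for this illustration). Since $\ln(p) = \ln(2)\cdot\log_2(p)$,

$$H_{\text{nats}}(P) = -\sum_i p_i \ln(p_i) = \ln(2)\cdot\left(-\sum_i p_i \log_2(p_i)\right) = \ln(2)\cdot H_{\text{bits}}(P) \approx 0.6931 \cdot H_{\text{bits}}(P)$$

For instance, the 1.75 bits from Example 2 correspond to $1.75 \cdot \ln(2) \approx 1.213$ in the natural-log version.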

### Example 3

Suppose I have a bag with 10 blue and 10 red marbles and I pick one at random. Then I can't really predict which kind of marble I will get. But if the bag had 19 red marbles and 1 blue marble, I should expect to get a red marble. So the first case should have the larger entropy.

In our first bag case, entropy is

$$H= -\frac{1}{2}\ln\left(\frac{1}{2}\right)-\frac{1}{2}\ln\left(\frac{1}{2}\right) \approx 0.6931$$

And in the second case:

$$H=-\frac{1}{20}\ln\left(\frac{1}{20}\right)-\frac{19}{20}\ln\left(\frac{19}{20}\right) \approx 0.198514$$


Note: If some $p_k=0$, we would get $0 \cdot \ln(0)$, which is undefined. However, since $p\ln(p) \to 0$ as $p \to 0^+$ (an outcome that never occurs contributes no uncertainty), we set this term to 0.
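The marble calculations, including the zero-probability convention, can be sketched as follows. The function name `entropy` is illustrative, and the third bag (20 red, 0 blue) is a hypothetical case added only to exercise the convention.

```python
import math


def entropy(probs):
    """Entropy with natural log; terms with p = 0 are taken to be 0."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


print(round(entropy([10 / 20, 10 / 20]), 4))  # 0.6931
print(round(entropy([1 / 20, 19 / 20]), 4))   # 0.1985
print(entropy([0 / 20, 20 / 20]))             # 0.0
```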

#### Cross-Entropy and KL-Divergence


Suppose we have the two distributions from Examples 1 and 2. I will think of them as 4-sided dice. I will call the distribution from Example 1 $Q$ and the one from Example 2 $P$. I would like to compare these two distributions. To do that, we will look at the probabilities of getting specific samples. In other words, suppose I roll die $P$ 16 times; then we would expect to get 8 $A$'s, 4 $B$'s, 2 $C$'s and 2 $D$'s. (In fact, if we roll the die $N$ times, then the expected number of, say, $A$'s is $N \cdot Pr(A) = N \cdot 0.5$.)

Suppose we get these values in this specific order. Then the probability of getting this sequence is

$$Prob=0.5^8 \cdot 0.25^4 \cdot 0.125^2 \cdot 0.125^2$$

This value is very tiny, and as $N$ grows it gets even smaller. To fix this we take the $N$-th root to normalize it with respect to sample size. Since the exponents in the above calculation are of the form $N\cdot Pr$, this root eliminates $N$ from each exponent, leaving just the probability.

The second problem is that our numbers are still quite small, and multiplying them makes it even worse. The usual way to fix this is to take the logarithm. We take the negative logarithm to keep our answer positive:

$$-\ln\left(Prob^{1/N}\right)=-\frac{8}{16}\ln(0.5)-\frac{4}{16}\ln(0.25)-\frac{2}{16}\ln(0.125)-\frac{2}{16}\ln(0.125)$$

$$=-0.5\ln(0.5)-0.25\ln(0.25)-0.125\ln(0.125)-0.125\ln(0.125) \approx 1.213$$

Notice that what we got is just the entropy of $P$: $H(P)=-\sum_i p_i \ln(p_i)$. However, in this case I will think of it as the cross-entropy of $P$ with respect to itself. (I am using a specific distribution to get the expected counts in my sample, and I am using a specific distribution to calculate the probability of getting that sample. It just happens to be the same distribution.)
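The normalization steps can be reproduced numerically. This is a minimal sketch with illustrative variable names; it assumes Python 3.8 or later for `math.prod`.

```python
import math

P = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
N = 16

# Expected counts in N rolls of die P: N * Pr(x).
counts = {x: round(N * p) for x, p in P.items()}

# Probability of one specific sequence with exactly these counts.
prob = math.prod(P[x] ** counts[x] for x in P)

# Normalize by the sample size (N-th root), then take the negative log.
normalized = -math.log(prob ** (1 / N))

# Entropy of P, computed directly from the formula.
entropy_P = -sum(p * math.log(p) for p in P.values())

print(prob)                  # roughly 3.7e-09, that is 2**-28
print(round(normalized, 3))  # 1.213
print(round(entropy_P, 3))   # 1.213
```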

So, what if we look at the probability of getting the same sample, but using die $Q$? The closer $Q$ is to $P$, the closer the probabilities should be to the ones we got with die $P$. So, we calculate:

$$Prob=0.25^8 \cdot 0.25^4 \cdot 0.25^2 \cdot 0.25^2$$

Performing the same normalization:

$$-\ln\left(Prob^{1/N}\right)=-0.5\ln(0.25)-0.25\ln(0.25)-0.125\ln(0.25)-0.125\ln(0.25) \approx 1.386$$

What we got is the cross-entropy of $P$ relative to $Q$. In general, if $p_i$ are the probabilities from $P$ and $q_i$ are the probabilities from $Q$, the cross-entropy is equal to

$$H(P,Q)=-\sum_i p_i \ln(q_i)$$

A few properties of cross-entropy:

1. $H(P,P)=H(P)$

2. $H(P,Q) \geq H(P) \geq 0$

3. In general, $H(P,Q) \neq H(Q,P)$
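These properties can be checked on the two dice with a small sketch. The function name `cross_entropy` is illustrative, and terms with $p_i = 0$ are skipped as an assumption, since they contribute nothing to the sum.

```python
import math


def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i * ln(q_i), skipping terms with p_i = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.25, 0.125, 0.125]  # die from Example 2
Q = [0.25, 0.25, 0.25, 0.25]   # die from Example 1

print(round(cross_entropy(P, P), 3))  # 1.213, which is H(P)
print(round(cross_entropy(P, Q), 3))  # 1.386
print(round(cross_entropy(Q, P), 3))  # 1.56
```

The last two lines differ, which illustrates that swapping the arguments changes the result.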


Using these cross-entropies we can measure how far $Q$ is from $P$ by subtraction:

$$D_{KL}(P||Q)= H(P,Q)-H(P)$$

This is called KL-Divergence. So, in our example,

$$D_{KL}(P||Q) \approx 1.386- 1.213=0.173$$

Properties of KL-Divergence:

1. $D_{KL}(P||Q) \geq 0$

2. $D_{KL}(P||P) = 0$

3. In general, $D_{KL}(P||Q) \neq D_{KL}(Q||P)$
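Substituting the definitions of $H(P,Q)$ and $H(P)$ into the subtraction gives an equivalent direct form (a short worked step, using the same $p_i$ and $q_i$ as before):

$$D_{KL}(P||Q) = -\sum_i p_i\ln(q_i) + \sum_i p_i \ln(p_i) = \sum_i p_i \ln\left(\frac{p_i}{q_i}\right)$$

For the dice $P$ and $Q$ this gives $0.5\ln(2)+0.25\ln(1)+0.125\ln(0.5)+0.125\ln(0.5) = 0.25\ln(2) \approx 0.173$, the same value as the subtraction.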

### Example 4

Suppose we have a third die $R$ with the following likelihoods:

$$Pr(x) = \begin{cases}
          0.1 & x =A \\
          0.1 & x =B \\
          0.4 & x =C \\
          0.4 & x =D \\
       \end{cases}
$$

If you look carefully, it should be quite obvious that $R$ is much further from $P$ than $Q$ was. Let's confirm that by calculating $D_{KL}(P||R)$:

$$H(P, R)=-0.5\ln(0.1)-0.25\ln(0.1)-0.125\ln(0.4)-0.125\ln(0.4) \approx 1.956$$

And so, $$D_{KL}(P||R) \approx 1.956- 1.213=0.743$$
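The comparison of the three dice can be reproduced with a short sketch that follows the definition $D_{KL}(P||Q)=H(P,Q)-H(P)$. The function names are illustrative, and the last line (the reversed direction) is an extra check not computed in the text.

```python
import math


def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i * ln(q_i), skipping terms with p_i = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P), using H(P) = H(P, P)."""
    return cross_entropy(p, q) - cross_entropy(p, p)


P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]
R = [0.1, 0.1, 0.4, 0.4]

print(round(kl_divergence(P, Q), 3))  # 0.173
print(round(kl_divergence(P, R), 3))  # 0.743
print(round(kl_divergence(R, P), 3))  # 0.678
```

The last two values differ, so for $P$ and $R$ the divergence clearly depends on the order of the arguments.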

### Cross-Entropy Loss

Suppose we have some labeled training data and a classification model parametrized by some parameter $\theta$. Then, given a data point $x$, our model produces a prediction $\hat{y}$. This prediction is a vector whose number of entries equals the number of classes. Each entry is the probability that our data point belongs to the corresponding class. For example, if
$$\hat{y}=\begin{bmatrix} 0.2 \\ 0.3 \\0.5 \end{bmatrix}$$
Then the probability that $x$ belongs to class 0 is 0.2, the probability that $x$ belongs to class 1 is 0.3, and the probability that $x$ belongs to class 2 is 0.5. So, we get a distribution for $x$. I will denote this as $$ \hat{P}(x|\theta)=\begin{bmatrix} \hat{p}_1 \\ \hat{p}_2 \\\hat{p}_3 \end{bmatrix}$$

However, our data is labeled, so we do have the actual label $y$ for $x$. This is also a probability vector. However, it contains a single 1 in the position corresponding to the true class of $x$, and it has zeros everywhere else. Nevertheless, it is still a distribution, and I will denote it as

$$P(x)=\begin{bmatrix} {p}_1 \\ {p}_2 \\{p}_3 \end{bmatrix}$$

Note that it doesn't depend on $\theta$, since it doesn't come from our training model.

We would like our model to predict the correct class, so we want $\hat{P}(x|\theta)$ to be close to $P(x)$. We can use KL-divergence to see how close the two distributions are and try to minimize it.

$$D_{KL}\left(P(x) || \hat{P}(x|\theta)\right)=H\left(P(x), \hat{P}(x|\theta)\right) - H\left(P(x)\right)$$

Two things to note here:

1. We cannot reverse the distributions inside $D_{KL}$: the reversed form $D_{KL}\left(\hat{P}(x|\theta) || P(x)\right)$ would involve the entropy of $\hat{P}$, which depends on $\theta$, and we don't want this entropy to affect our optimization.

2. Since $H(P(x))$ doesn't depend on $\theta$, our optimization doesn't depend on this term. In other words, the parameter $\theta$ that minimizes KL-divergence will also minimize cross-entropy:

$$\mathop{\arg\min}_{\theta} D_{KL}\left(P(x) || \hat{P}(x|\theta)\right) = \mathop{\arg\min}_{\theta}H\left(P(x), \hat{P}(x|\theta)\right)$$

So, we now define the Cross-Entropy Loss function for a single data point $x$ as follows:

$$CELoss(x)=H\left(P(x), \hat{P}(x|\theta)\right)=-\sum_i p_i\ln(\hat{p}_i)$$

And for the whole data set, as before, we add losses for each point in the data set.
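A minimal sketch of this loss is shown below. The function name `ce_loss`, the choice of class 2 as the true class, and the second data point are all assumptions made for illustration; only the prediction $(0.2, 0.3, 0.5)$ comes from the text.

```python
import math


def ce_loss(p_true, p_pred):
    """Cross-entropy loss -sum_i p_i * ln(p_hat_i) for one data point.

    Terms with p_i = 0 contribute nothing and are skipped.
    """
    return -sum(p * math.log(q) for p, q in zip(p_true, p_pred) if p > 0)


y_hat = [0.2, 0.3, 0.5]  # model prediction from the text
y = [0, 0, 1]            # assumed label: the true class is class 2

loss = ce_loss(y, y_hat)
print(round(loss, 4))  # 0.6931, that is -ln(0.5)

# Loss over a hypothetical two-point data set: add the per-point losses.
dataset = [([0, 0, 1], [0.2, 0.3, 0.5]), ([1, 0, 0], [0.7, 0.2, 0.1])]
total_loss = sum(ce_loss(p, p_hat) for p, p_hat in dataset)
print(round(total_loss, 4))  # 1.0498
```

With a one-hot label, the sum collapses to a single term: the negative log of the probability the model assigned to the true class.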

## Relation of CELoss to NLLoss

When we have only two classes, then

$$P(x)=\begin{bmatrix} p \\ 1-p  \end{bmatrix}, \quad \hat{P}(x|\theta)=\begin{bmatrix} \hat{p} \\ 1-\hat{p}  \end{bmatrix}$$

Then,

$$CELoss(x)=-p\ln(\hat{p})-(1-p)\ln(1-\hat{p})=NLLoss(x)$$
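The equality can be checked numerically with a small sketch. The names `ce_loss` and `binary_nll` and the predicted probability 0.8 are hypothetical choices for this illustration.

```python
import math


def ce_loss(p_true, p_pred):
    """General cross-entropy loss, skipping terms with p_i = 0."""
    return -sum(p * math.log(q) for p, q in zip(p_true, p_pred) if p > 0)


def binary_nll(p, p_hat):
    """-p * ln(p_hat) - (1 - p) * ln(1 - p_hat), the two-class formula."""
    return -p * math.log(p_hat) - (1 - p) * math.log(1 - p_hat)


p_hat = 0.8  # hypothetical predicted probability of the first class
for p in (1, 0):
    two_class = ce_loss([p, 1 - p], [p_hat, 1 - p_hat])
    print(p, round(two_class, 4), round(binary_nll(p, p_hat), 4))
# prints:
# 1 0.2231 0.2231
# 0 1.6094 1.6094
```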