## $\textcolor{green}{\text{Information and Entropy}}$

Entropy is a concept from information theory that describes the amount of uncertainty or disorder in a given probability distribution. The idea is that the more uncertain we are, the less information we have about the outcome. To find entropy, we calculate the average amount of information we need to describe a random variable. In other words, the more information we need to find out the value of a random variable, the more disorder we have, and so the more entropy we have.

So, how do we measure information? We use "bits". A bit is a binary variable (0 or 1). So, we ask Yes/No questions and count how many questions on average we need to ask to find out the value of a random variable. To see how this works, let's look at two examples.

#### $\textcolor{red}{\text{Example 1}}$
Suppose we have the following distribution:

$$Q(x) = \begin{cases}
          0.25 & x =A \\
          0.25 & x =B \\
          0.25 & x =C \\
          0.25 & x =D \\
       \end{cases}
$$

Given a random variable $x$, how many questions do we need to ask to determine its value?
Since all probabilities are the same, the most efficient way is to ask two questions:

1. Is it A or B?

2. If the answer was "Yes", then we ask: Is it A?; If the answer was "No", we ask: Is it C?

So, we always ask two questions, therefore the average number of questions asked is two as well. And so, we say that the entropy here is 2 bits. We can denote this by $H(Q)=2$.

#### $\textcolor{red}{\text{Example 2}}$

Suppose now we have the following non-uniform distribution:

$$P(x) = \begin{cases}
          0.5 & x =A \\
          0.25 & x =B \\
          0.125 & x =C \\
          0.125 & x =D \\
       \end{cases}
$$

We can still determine the value of $x$ by asking the same two questions as in Example 1, but can we do better than that? Well, since the probability $P(x=A)$ is already a half, $A$ will appear very often, so let's ask our questions in the following way:

1. Is it A?

2. If the answer was "No", we ask: Is it B?

3. If the answer was "No", we ask: Is it C?

At first glance, it may seem like we are asking more questions now, but it depends on the outcome. If $x=A$, for example, we ask only 1 question. So, what is the average number of questions we are asking? Let $N(x)$ denote the number of questions we ask to get to $x$ (so $N(A)=1$ and $N(C)=3$, for example). The average is the expected value of $N(x)$: we multiply the probability of each outcome $P(x)$ by the number of questions needed to reach that outcome $N(x)$, and then add all the products:

$$E[N(x)]=\sum_{x=A, B, C, D } P(x) \cdot N(x)=0.5 \cdot 1+0.25\cdot 2+0.125\cdot 3+0.125\cdot 3=1.75$$

So, on average we are asking fewer than two questions, and we say that the entropy for this distribution is 1.75 bits, denoted by $H(P)=1.75$.

In Example 2, we have less entropy, because we have more predictive power than in Example 1.
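The averages from Examples 1 and 2 can be checked with a few lines of Python. This is only a sketch: the dictionaries and the helper `expected_questions` are illustrative names introduced here, not part of any library.

```python
# Probabilities from Example 1 (Q) and Example 2 (P).
Q = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
P = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Number of yes/no questions needed to reach each outcome.
two_questions = {"A": 2, "B": 2, "C": 2, "D": 2}   # "Is it A or B?" then one more
chain = {"A": 1, "B": 2, "C": 3, "D": 3}           # "Is it A?", "Is it B?", "Is it C?"


def expected_questions(dist, n_questions):
    """Probability-weighted average of the number of questions asked."""
    return sum(dist[x] * n_questions[x] for x in dist)


avg_q = expected_questions(Q, two_questions)      # Example 1
avg_p_chain = expected_questions(P, chain)        # Example 2, new strategy
avg_p_two = expected_questions(P, two_questions)  # Example 2, old strategy
print(avg_q, avg_p_chain, avg_p_two)
```

The chained strategy only pays off because $A$ is likely; applying it to the uniform distribution $Q$ would raise the average above two questions.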

#### $\textcolor{blue}{\text{Formalizing the formula}}$
Let's formalize our calculation. Suppose we have some probability distribution $P(x)$ in which $P(i)=p_i$. Notice from Example 2 that $N(x)$ depends on $P(x)$. The lower the probability, the more questions we need to ask, and since each question has a binary answer, we look at the powers of 2. When the probabilities are powers of $1/2$, as in our examples, the number of questions is exactly:

$$N(x)=\log_2\left(\frac{1}{P(x)}\right)$$

Then the entropy is:

$$H(P)=\sum_{i}p_i\cdot N(i)=\sum_{i}p_i\cdot\log_2(1/p_i) = - \sum_{i}p_i\cdot\log_2(p_i)$$

Finally, we will only care about entropy as a relative quantity, so we can replace $\log_2()$ with $\ln()$, as they differ only by a constant factor ($\ln(x) = \ln(2) \cdot \log_2(x)$):

$$\textcolor{blue}{H(P)= - \sum_{i}p_i\cdot\ln(p_i)}$$

Note that we can use either the natural logarithm or the base-2 logarithm. While they produce different values, they keep relative differences intact. This means that if you want to compare the entropies of different distributions, you should use the same logarithm for all of them. The base-2 logarithm has a concrete meaning (the average number of questions we have to ask), but the natural logarithm is generally considered easier to work with.
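As a quick check of the formula, the sketch below computes the entropy of the distributions from Examples 1 and 2 in both bits and nats. The function name `shannon_entropy` and its `base` argument are choices made for this illustration.

```python
import math


def shannon_entropy(probs, base=None):
    """Entropy -sum p_i log(p_i); natural log unless `base` is given."""
    h_nats = sum(-p * math.log(p) for p in probs if p > 0)
    # Changing the base only rescales the result by a constant factor.
    return h_nats if base is None else h_nats / math.log(base)


Q = [0.25, 0.25, 0.25, 0.25]    # Example 1
P = [0.5, 0.25, 0.125, 0.125]   # Example 2

# N(x) = log2(1 / P(x)) reproduces the question counts 1, 2, 3, 3.
n_questions = [math.log2(1 / p) for p in P]

h_q_bits = shannon_entropy(Q, base=2)
h_p_bits = shannon_entropy(P, base=2)
h_p_nats = shannon_entropy(P)
print(n_questions, h_q_bits, h_p_bits, h_p_nats)
```

The value in nats is the value in bits multiplied by $\ln(2)$, so the ordering of the two entropies is the same in either unit.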

#### $\textcolor{red}{\text{Example 3}}$

If we have a bag with 10 blue and 10 red marbles and we pick one marble at random, then we can't really predict what kind of marble we will get, since it could be either color with equal probability. However, if the bag had 19 red marbles and 1 blue marble, then we should be expecting to get a red marble. So, in the first case we have larger entropy.

In our first bag case, entropy is

$$H= -\frac{1}{2}\ln\left(\frac{1}{2}\right)-\frac{1}{2}\ln\left(\frac{1}{2}\right) \approx 0.6931$$

And in the second case:

$$H=-\frac{1}{20}\ln\left(\frac{1}{20}\right)-\frac{19}{20}\ln\left(\frac{19}{20}\right) \approx 0.1985$$


$\bf{Note}$: If $p_k=0$, we would get $0 \cdot \ln(0)$, which is undefined. However, since $p\ln(p) \to 0$ as $p \to 0$ (an outcome that never occurs adds no uncertainty), we set this term equal to 0.
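The marble bags can be checked numerically, including the zero-probability convention from the note. In this sketch, `entropy_nats` is an illustrative helper, and the third bag (all red) is a hypothetical case added only to exercise the $0 \cdot \ln(0) = 0$ rule.

```python
import math


def entropy_nats(probs):
    """Entropy with natural log; terms with p == 0 are treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0)


bag_even = [10 / 20, 10 / 20]    # 10 red, 10 blue
bag_skewed = [19 / 20, 1 / 20]   # 19 red, 1 blue
bag_all_red = [1.0, 0.0]         # hypothetical: 20 red, 0 blue

h_even = entropy_nats(bag_even)
h_skewed = entropy_nats(bag_skewed)
h_all_red = entropy_nats(bag_all_red)
print(round(h_even, 4), round(h_skewed, 4), h_all_red)
```

The more predictable the bag, the smaller the entropy, and it reaches zero when the outcome is certain.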

## $\textcolor{green}{\text{Cross-Entropy and KL-Divergence}}$


Suppose we have the two distributions from Examples 1 and 2. We will think of them as some kind of 4-sided dice. We would like to compare these two distributions. To do that, we look at the probabilities of getting specific samples. In other words, suppose I roll die $P$ 16 times. Then we would expect to get on average 8 $A$'s, 4 $B$'s, 2 $C$'s and 2 $D$'s. (In fact, if we roll the die $N$ times, then the expected number of, say, $A$'s is $N \cdot P(A) = N \cdot 0.5$.)

Suppose we get these values in this specific order (8 $A$'s followed by 4 $B$'s followed by 2 $C$'s followed by 2 $D$'s). Then the probability of getting this sequence is

$$Prob=0.5^8 \cdot 0.25^4 \cdot 0.125^2 \cdot 0.125^2$$

This value is very tiny, and as $N$ grows, it gets even smaller. To fix this, we take the $N$-th root to normalize it with respect to the sample size. Since the powers in the above calculation are of the form $N\cdot p_i$, this root eliminates $N$ from each power, leaving just the probability $p_i$.

The second problem is that our numbers are still quite small, and multiplying them makes it even worse. The usual way to fix this is to take a logarithm. We take the negative logarithm to keep our answer positive:

$$-\ln\left(Prob^{1/N}\right)=-\frac{8}{16}\ln(0.5)-\frac{4}{16}\ln(0.25)-\frac{2}{16}\ln(0.125)-\frac{2}{16}\ln(0.125)$$

$$=-0.5\ln(0.5)-0.25\ln(0.25)-0.125\ln(0.125)-0.125\ln(0.125) \approx 1.213$$

Notice what we got is just entropy of $P$: $H(P)=-\sum_i p_i \ln(p_i)$. However, in this case we will think of this as cross-entropy of $P$ with respect to itself (I am using specific distribution ($P$) to get expected distribution in my sample and I am using specific distribution ($P$ again) to calculate the probability of getting that sample. So, I am using $P$ for both parts.)

Now, what if we look for the probability of getting the same sample (AAAAAAAA BBBB CC DD) but using die $Q$? The closer $Q$ is to $P$, the closer the probabilities should be to the ones we got with die $P$. Recall that the probabilities of getting any of the values in $Q$ are all 0.25. So, we calculate:

$$Prob=0.25^8 \cdot 0.25^4 \cdot 0.25^2 \cdot 0.25^2$$

Performing the same normalization:

$$-\ln\left(Prob^{1/N}\right)=-0.5\ln(0.25)-0.25\ln(0.25)-0.125\ln(0.25)-0.125\ln(0.25) \approx 1.386$$
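The two normalized values can be reproduced directly from the 16-roll sample. The sketch below follows the same steps (multiply the probabilities, take the $N$-th root, take the negative log); the helper name `normalized_neg_log_prob` is made up for this illustration, and `math.prod` requires Python 3.8 or later.

```python
import math

P = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
Q = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# The expected sample from 16 rolls of die P, in the order used in the text.
sample = "A" * 8 + "B" * 4 + "C" * 2 + "D" * 2


def normalized_neg_log_prob(sample, die):
    """-ln(Prob^(1/N)) for the probability of seeing `sample` with `die`."""
    prob = math.prod(die[s] for s in sample)   # probability of this exact sequence
    return -math.log(prob ** (1 / len(sample)))


value_p = normalized_neg_log_prob(sample, P)
value_q = normalized_neg_log_prob(sample, Q)
print(round(value_p, 3), round(value_q, 3))
```

For much longer samples the raw product would underflow to zero, which is one practical reason to sum logarithms instead of multiplying probabilities.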

What we got is the cross-entropy of $P$ relative to $Q$. To generalize this, let the $p_i$'s be probabilities from $P$ and the $q_i$'s be probabilities from $Q$. Then the cross-entropy of $P$ relative to $Q$ is equal to

$$\textcolor{blue}{H(P,Q)=-\sum_i p_i \ln(q_i)}$$

A few properties of cross-entropy:

$\textcolor{blue}{\text{Properties of H}}$

If $P$ and $Q$ are two probability distributions, then
1. $H(P,P)=H(P)$

2. $H(P,Q) \geq H(P) \geq 0$

3. In general, $H(P,Q) \neq H(Q,P)$
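These properties can be spot-checked on the dice $P$ and $Q$. The following is a minimal sketch with an illustrative `cross_entropy` helper; a numerical check on one pair of distributions is of course not a proof.

```python
import math


def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i ln(q_i), in nats; terms with p_i == 0 are skipped."""
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]

h_p = cross_entropy(P, P)    # property 1: H(P, P) is the entropy H(P)
h_pq = cross_entropy(P, Q)
h_qp = cross_entropy(Q, P)
print(round(h_p, 3), round(h_pq, 3), round(h_qp, 3))
```

Here $H(P,Q)$ and $H(Q,P)$ come out different, which illustrates property 3.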


Using these cross-entropies we can measure how far $Q$ is from $P$ by subtraction:

$$\textcolor{blue}{D_{KL}(P||Q)= H(P,Q)-H(P)}$$

This is called $\bf{\text{KL-Divergence}}$. So, in our example,

$$D_{KL}(P||Q) \approx 1.386- 1.213=0.173$$
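Substituting the definitions of $H(P,Q)$ and $H(P)$ into this difference gives an equivalent single-sum form, which many references use as the definition. The step is spelled out here only as a check on the formulas above:

$$D_{KL}(P||Q)= -\sum_i p_i \ln(q_i) + \sum_i p_i \ln(p_i) = \sum_i p_i \ln\left(\frac{p_i}{q_i}\right)$$

For the dice $P$ and $Q$ this gives

$$0.5\ln(2)+0.25\ln(1)+0.125\ln(0.5)+0.125\ln(0.5)=0.25\ln(2)\approx 0.173$$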

$\textcolor{blue}{\text{Properties of KL-Divergence}}$:

If $P$ and $Q$ are two probability distributions, then

1. $D_{KL}(P||Q) \geq 0$

2. $D_{KL}(P||P) = 0$

3. In general, $D_{KL}(P||Q) \neq D_{KL}(Q||P) $

The main difference between cross-entropy and KL-divergence is as follows:

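1. Cross-entropy $H(P,Q)$ is the average number of "bits" we need to represent an event drawn from $P$ when we use a questioning strategy (code) optimized for $Q$ instead of $P$.
2. KL-divergence $D_{KL}(P||Q)$ is the average number of extra "bits" we need to represent an event drawn from $P$ when we use a questioning strategy (code) optimized for $Q$ instead of $P$.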

Since $D_{KL}(P||Q) \neq D_{KL}(Q||P)$ in general, KL-divergence is technically not a distance between two distributions.

#### $\textcolor{red}{\text{Example 4}}$

Suppose we have a third die $R$ with the following likelihoods:

$$R(x) = \begin{cases}
          0.1 & x =A \\
          0.1 & x =B \\
          0.4 & x =C \\
          0.4 & x =D \\
       \end{cases}
$$

If you look carefully, it should be quite obvious that $R$ is much further from $P$ than $Q$ was. So let's see that by calculating $D_{KL}(P||R)$

$$H(P, R)=-0.5\ln(0.1)-0.25\ln(0.1)-0.125\ln(0.4)-0.125\ln(0.4) \approx 1.956$$

And so, $$D_{KL}(P||R) \approx 1.956- 1.213=0.743$$
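The comparison between $Q$ and $R$ can be reproduced in a few lines. This sketch uses the algebraically equivalent form $\sum_i p_i \ln(p_i/q_i)$ of $H(P,Q)-H(P)$; the helper `kl_divergence` is an illustrative name, not a library function.

```python
import math


def kl_divergence(p, q):
    """D_KL(P||Q) = sum_i p_i ln(p_i / q_i), in nats; skips terms with p_i == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]
R = [0.1, 0.1, 0.4, 0.4]

d_pq = kl_divergence(P, Q)
d_pr = kl_divergence(P, R)
d_rp = kl_divergence(R, P)   # reversed order, to illustrate the asymmetry
print(round(d_pq, 3), round(d_pr, 3), round(d_rp, 3))
```

Swapping the arguments for $P$ and $R$ changes the value, while $D_{KL}(P||P)$ is exactly zero.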

## $\textcolor{green}{\text{Cross-Entropy Loss}}$

Suppose we have some labeled training data and a classification model parametrized by some parameter $\theta$. Then, given a data point $x$, our model produces a prediction $\hat{y}$. This prediction is a vector whose number of entries equals the number of classes. Each entry is the probability that our data point belongs to the corresponding class. Such a vector is called a probability vector. For example, if
$$\hat{y}=\begin{bmatrix} 0.2 \\ 0.3 \\0.5 \end{bmatrix}$$
Then the probability that $x$ belongs to class 0 is 0.2, the probability that $x$ belongs to class 1 is 0.3, and the probability that $x$ belongs to class 2 is 0.5. So, we get a distribution for $x$. I will denote this as $$ \hat{P}(x|\theta)=\begin{bmatrix} \hat{p}_1 \\ \hat{p}_2 \\\hat{p}_3 \end{bmatrix}$$

However, our data is labeled, so we do have the actual label $y$ for $x$. This is also a probability vector. However, it contains a single 1 in the position corresponding to a true class of $x$, and it has zeros everywhere else. Nevertheless, it is still a distribution, and we will denote it as

$$P(x)=\begin{bmatrix} {p}_1 \\ {p}_2 \\{p}_3 \end{bmatrix}$$ Note that it doesn't depend on $\theta$ since it doesn't come from our training model.

We would like our model to predict the correct class, so we want $ \hat{P}(x|\theta)$ to be close to $P(x)$. We can use KL-divergence to see how close two distributions are and try to minimize it.

$$D_{KL}\left(P(x) || \hat{P}(x|\theta)\right)=H\left(P(x), \hat{P}(x|\theta)\right) - H\left(P(x)\right)$$

Two things to note here:

1. We cannot reverse the distributions inside $D_{KL}$, since then we would have to use the entropy of $\hat{P}$, which depends on $\theta$, and we don't want this entropy to affect our optimization.

2. Since $H(P(x))$ doesn't depend on $\theta$, our optimization doesn't depend on this term. In other words, the parameter $\theta$ that minimizes KL-divergence will also minimize cross-entropy:

$$\mathop{\arg\min}_{\theta} \ D_{KL}\left(P(x) || \hat{P}(x|\theta)\right) = \mathop{\arg\min}_{\theta} \ H\left(P(x), \hat{P}(x|\theta)\right)$$

Here $\mathop{\arg\min}_{\theta}$ means the value of $\theta$ that minimizes the given quantity.

So, we now define the Cross-Entropy Loss function for a single datapoint $x$ as follows:

$$\textcolor{blue}{CELoss(x)=H\left(P(x), \hat{P}(x|\theta)\right)=-\sum_i p_i\ln(\hat{p}_i)}$$

And for the whole dataset, as before, we find the average of losses over the whole dataset:

$$\textcolor{blue}{CELoss(X)=\frac{1}{N}\sum_{x \in X}CELoss(x)=-\frac{1}{N}\sum_{x \in X}\sum_i p_i\ln(\hat{p}_i)}$$
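To see the loss in action, here is a minimal sketch that evaluates it on a tiny made-up batch. The labels and predictions are hypothetical, and `ce_loss` and `ce_loss_dataset` are names chosen for this illustration. Each per-point loss is already non-negative, so the dataset loss is simply their average.

```python
import math


def ce_loss(p_true, p_pred):
    """Cross-entropy of one data point: -sum_i p_i ln(p_hat_i)."""
    return sum(-p * math.log(q) for p, q in zip(p_true, p_pred) if p > 0)


def ce_loss_dataset(labels, predictions):
    """Average of the per-point losses over the whole dataset."""
    losses = [ce_loss(y, y_hat) for y, y_hat in zip(labels, predictions)]
    return sum(losses) / len(losses)


# Hypothetical batch: one-hot labels and model outputs for three classes.
labels = [[0, 0, 1], [1, 0, 0]]
predictions = [[0.2, 0.3, 0.5], [0.7, 0.2, 0.1]]

loss_first = ce_loss(labels[0], predictions[0])
loss_batch = ce_loss_dataset(labels, predictions)
print(round(loss_first, 4), round(loss_batch, 4))
```

Because the label is one-hot, only the predicted probability of the true class enters the sum, so the loss for a single point reduces to $-\ln(\hat{p}_{\text{true}})$.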

## $\textcolor{green}{\text{Relation of CELoss to NLLoss}}$

When we have only two classes, then

$$P(x)=\begin{bmatrix} p \\ 1-p  \end{bmatrix}, \hat{P}(x)=\begin{bmatrix} \hat{p} \\ 1-\hat{p}  \end{bmatrix}$$

Then,

$$CELoss(x)=-p\ln(\hat{p})-(1-p)\ln(1-\hat{p})=NLLoss(x)$$
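The identity can be verified numerically. In the sketch below, `binary_nll` implements the two-term expression and `cross_entropy` the general sum; both names and the test values are assumptions made for illustration, and the predicted probability must lie strictly between 0 and 1 for the logarithms to be defined.

```python
import math


def cross_entropy(p, q):
    """General form: -sum_i p_i ln(q_i), skipping terms with p_i == 0."""
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def binary_nll(p, p_hat):
    """Two-class form: -p ln(p_hat) - (1 - p) ln(1 - p_hat)."""
    return -p * math.log(p_hat) - (1 - p) * math.log(1 - p_hat)


# Hypothetical values: true label p (0 or 1) and predicted probability p_hat.
pairs = [(1, 0.8), (0, 0.8), (1, 0.3)]

results = [
    (binary_nll(p, p_hat), cross_entropy([p, 1 - p], [p_hat, 1 - p_hat]))
    for p, p_hat in pairs
]
print(results)
```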

##  $\textcolor{green}{\text{Homework}}$

### Problem Set 1

In Problem Set 1, use logarithm base 2 for calculating entropies.

1. Suppose we have the following distribution $P$:
    $$P(x) = \begin{cases}
          0.5 & x =A \\
          0.3 & x =B \\
          0.1 & x =C \\
          0.1 & x =D \\
       \end{cases}$$

    a. Compute Entropy for this distribution.

    b. Suppose we get a random letter and we ask "Is it A?". If the answer is "Yes", then we are left with only one possibility: it is an A. So in this case, we have a new distribution that just says $P(A)=1$. What is the entropy of this distribution?

    c. Suppose the answer to "Is it A?" was "No". What is the new distribution, and what is its entropy?

    d. Since the probability of "Yes" or "No" was 50% each in this case, we can find the expected entropy of the new distribution by just computing the average of the two new entropies. Find this expected entropy. Then subtract it from the original entropy.

    e. What do you think the final answer in part d. represents?



2. Repeat problem 1, but in part b. ask "Is it B or C?" instead. Note that in part d., to find the expected entropy you need to use the fact that the probability of "Yes" is 0.4 and the probability of "No" is 0.6.

3. In the above problems, which question was better: "Is it A?" or "Is it B or C?"



### Problem Set 2

Consider two distributions $Q$ and $R$:

$$Q(x) = \begin{cases}
      0.55 & x =A \\
      0.25 & x =B \\
      0.05 & x =C \\
      0.15 & x =D \\
   \end{cases}$$

and 

$$R(x) = \begin{cases}
      0.4 & x =A \\
      0.3 & x =B \\
      0.15 & x =C \\
      0.15 & x =D \\
   \end{cases}$$

1. Which of these two distributions is "closer" to the distribution $P$ from Problem Set 1?

2. Generate three random sets of size 1000 using each of the three distributions. Calculate KL-Divergence between $P$ and $Q$, and between $P$ and $R$. Do we get similar results?

Note:
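You can generate sets using `np.random.choice`. For example, for distribution $P$: `np.random.choice(np.arange(1, 5), size=1000, p=[0.5, 0.3, 0.1, 0.1])`.

You can compute KL-divergence using `entropy(P,Q)` after `from scipy.stats import entropy`. Note that `entropy` expects probability (or count) vectors rather than raw samples, so first convert each generated set into per-letter counts (for example with `np.unique(sample, return_counts=True)`); the counts are normalized automatically. If you use `entropy(P)`, you get the entropy of $P$. Both use the natural log. If you wish to use the base-2 log, add `base=2` inside the `entropy` call.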
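The sketch below shows one way to go from raw samples to a KL estimate with these tools. To avoid giving away the homework, it uses the dice $P$ and $Q$ from Examples 1 and 2 rather than the distributions above. The seed, the use of `np.random.default_rng`, and the encoding of the letters as 0..3 are arbitrary choices for this illustration.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
letters = np.arange(4)           # 0..3 stand for A..D

p = [0.5, 0.25, 0.125, 0.125]    # die P from Example 2
q = [0.25, 0.25, 0.25, 0.25]     # die Q from Example 1

sample_p = rng.choice(letters, size=1000, p=p)
sample_q = rng.choice(letters, size=1000, p=q)

# entropy() needs one number per letter, not the raw draws,
# so count how often each letter appears in each sample.
counts_p = np.bincount(sample_p, minlength=4)
counts_q = np.bincount(sample_q, minlength=4)

kl_exact = entropy(p, q)                      # from the true probabilities
kl_empirical = entropy(counts_p, counts_q)    # from the samples (counts are normalized)
print(kl_exact, kl_empirical)
```

If a letter never appears in the second sample, its count is zero and the estimate becomes infinite, so very small samples or very rare letters need extra care.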