<a href="https://colab.research.google.com/github/M-Amrollahi/Personal-Notes/blob/master/ML-notes/how-crossentropy-loss-works.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import numpy as np

In [None]:
y_p = torch.tensor([[.0,.4,.6],[.7,0,.3]], dtype=torch.float32)
y_p1 = y_p.softmax(dim=1)
y = torch.tensor([[0,0,1],[1,0,0]], dtype=torch.float32)
y1 = torch.tensor([2,0])

### Entropy

$H(X)=-\sum_{i=1}^{C} p(x_i)\log p(x_i)$

$\text{Entropy measures uncertainty: how unsure are we about which class a sample belongs to?}$

$\text{If we have two classes with probabilities .1 and .9, we can predict the class of a sample with high certainty, so the entropy is low. With probabilities .5 and .5, we cannot predict the class at all, so the entropy is at its maximum.}$
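
As a quick check of the two-class example above, here is a minimal sketch that computes the entropy by hand and compares it with `torch.distributions.Categorical`. The helper name `entropy` and the variable names are illustrative, and the result is in nats because `torch.log` is the natural logarithm.

```python
import torch
from torch.distributions import Categorical

def entropy(p):
    # H = -sum_i p_i * log(p_i); natural log, so the unit is nats
    return -(p * torch.log(p)).sum()

p_skewed = torch.tensor([0.1, 0.9])   # one class is very likely
p_uniform = torch.tensor([0.5, 0.5])  # both classes are equally likely

h_skewed = entropy(p_skewed)
h_uniform = entropy(p_uniform)
h_builtin = Categorical(probs=p_skewed).entropy()  # same quantity from the library

print(h_skewed, h_uniform, h_builtin)
```

The skewed distribution should give the smaller value, and the uniform one should equal log(2).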

## How CrossEntropy is calculated

$loss = -\sum_{i=1}^{C} y_i \log(\hat y_i)$

$\text{The target } y \text{ can be either a probability list over the classes like [.2,.3,.5] or the index of the correct class like [2].}$

$\text{Note: CrossEntropyLoss applies softmax to its input automatically, so } \hat y \text{ is computed from the raw scores.}$

In [None]:
criterion = torch.nn.CrossEntropyLoss()
torch.allclose(-1*torch.sum( y * torch.log(y_p1))/2 , criterion.forward(y_p, y1))

True
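
The two target formats and the built-in softmax can be checked side by side. This is a minimal sketch using the same scores as above; the variable names are illustrative, and passing class probabilities as the target assumes PyTorch 1.10 or newer.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.0, 0.4, 0.6], [0.7, 0.0, 0.3]])   # raw scores, not probabilities
target_idx = torch.tensor([2, 0])                            # class indices
target_prob = F.one_hot(target_idx, num_classes=3).float()  # the same targets as probabilities

loss_idx = F.cross_entropy(logits, target_idx)
loss_prob = F.cross_entropy(logits, target_prob)
# Cross entropy is log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target_idx)

print(loss_idx, loss_prob, loss_manual)
```

All three values should agree, which is why the raw scores must not be passed through softmax before calling `CrossEntropyLoss`.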

## How to calculate BinaryCrossEntropy (Log-Loss)

$loss=-\dfrac{1}{N} \sum_{i=1}^{N} \left[ y_i\log(\hat y_i) + (1-y_i)\log(1-\hat y_i) \right]$

$\text{Note: All inputs are treated as probabilities, so no softmax or sigmoid is applied.}$

$\:$ N is number_of_samples * count_prob_list (the total number of elements) $\:$

In [None]:
y_p = torch.tensor([[.2,.8],[.3,.7]], dtype=torch.float32)
y = torch.tensor([[1,0],[0,1]], dtype=torch.float32)

cri3 = torch.nn.BCELoss()
torch.allclose(-1*torch.sum(y*torch.log(y_p) + (1-y)*torch.log(1-y_p))/4, cri3.forward(y_p, y))

True

In [None]:
y_p = torch.tensor([[.2],[.8]], dtype=torch.float32)
y_p1 = y_p.softmax(dim=1)
y = torch.tensor([[0],[1]], dtype=torch.float32)
y1 = torch.tensor([2,0])

In [None]:
m = torch.nn.Sigmoid()
loss = torch.nn.BCELoss()
input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
output = loss(m(input), target)
output

tensor(0.6612, grad_fn=<BinaryCrossEntropyBackward0>)
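
Because `BCELoss` expects probabilities, the sigmoid has to be applied by hand as in the cell above. A minimal sketch, assuming you would rather pass raw scores, is to use `torch.nn.BCEWithLogitsLoss`, which combines both steps; the seed and variable names here are illustrative.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(3)             # raw scores
target = torch.empty(3).random_(2)  # random 0/1 labels

# Sigmoid applied explicitly, then BCELoss on probabilities
loss_two_steps = torch.nn.BCELoss()(torch.sigmoid(logits), target)
# Sigmoid applied inside the loss
loss_one_step = torch.nn.BCEWithLogitsLoss()(logits, target)

print(loss_two_steps, loss_one_step)
```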

### How to calculate the LogSoftmax

In [None]:
cri2 = torch.nn.LogSoftmax(dim=1)
torch.allclose(torch.log(y_p.softmax(dim=1)) ,cri2.forward(y_p))

True
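
The two forms match for ordinary inputs, but they can differ for extreme scores. The sketch below uses a deliberately large, made-up score to show why the fused `log_softmax` is usually preferred over taking `log` of `softmax`.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[1000.0, 0.0]])  # exaggerated gap between the two classes

naive = torch.log(scores.softmax(dim=1))  # softmax underflows to 0 for the second class
stable = F.log_softmax(scores, dim=1)     # computed directly in log space

print(naive)
print(stable)
```

In the naive version the second entry becomes `-inf`, while `log_softmax` keeps a finite value.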

## How Negative-LogLikelihood works


$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad
        l_n = - w_{y_n} x_{n,y_n}, \quad
        w_{c} = \text{weight}[c] \cdot \mathbb{1}\{c \not= \text{ignore\_index}\}$

$LogSoftmax=
\begin{bmatrix}
-1.5722 & -1.2722 & -1.7722 & -1.0722 \\
-1.0391 & -2.1391 & -0.8391 & -2.3391 \\
-2.1627 & -1.1627 & -1.3727 & -1.1427
\end{bmatrix}
$

$Target=\begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \implies$

$loss = -\dfrac{1}{3}\left(-1.2722 - 0.8391 - 1.1427\right) \approx 1.0847$
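
The worked example above can be reproduced directly. This is a small sketch that copies the LogSoftmax matrix and the targets from the text and compares a manual pick-and-average with `torch.nn.NLLLoss`; the variable names are illustrative.

```python
import torch

log_probs = torch.tensor([
    [-1.5722, -1.2722, -1.7722, -1.0722],
    [-1.0391, -2.1391, -0.8391, -2.3391],
    [-2.1627, -1.1627, -1.3727, -1.1427],
])
target = torch.tensor([1, 2, 3])

# Pick the log-probability at the target index of each row, negate, and average
picked = log_probs[torch.arange(3), target]
manual = -picked.mean()

loss = torch.nn.NLLLoss()(log_probs, target)
print(manual, loss)
```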


In [None]:
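y_p = torch.tensor([[.0,.4,.6],[.7,0,.3]])  # restore the 3-class example; y_p and y were overwritten above
y = torch.nn.functional.one_hot(y1, num_classes=3)  # one-hot form of the targets [2, 0]
cri1 = torch.nn.NLLLoss()
torch.allclose(-torch.sum(y_p*y)/2, cri1(y_p, y1))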

True

In [None]:
cri1.forward(y_p,y1)

tensor(-0.6500)

Entropy allows us to quantify the expected value of the information contained in a message. It provides insights into the average amount of “surprise” or “information” that we receive when we learn the outcome of a random variable.

However, entropy alone may not be sufficient to capture the differences between probability distributions or the amount of information gained when moving from one distribution to another. This is where Kullback-Leibler (KL) Divergence comes into play.

KL Divergence, named after Solomon Kullback and Richard Leibler, is a measure of the difference between two probability distributions. It provides a way to quantify how one distribution diverges from another. KL Divergence is often used in various fields, including information theory, statistics, and machine learning, to compare and analyze probability distributions.

The information content of an event decreases as the probability of the event increases. We can define the information content I(x) of an event x with probability P(x) as follows:

$$I(x) = -\log P(x)$$

Entropy is defined as the expected value of the information content.

$$H(X) = E[I(x)] = \sum_x P(x) I(x) = -\sum_x P(x)\log P(x)$$

This is the formula for entropy. It tells us that the entropy of a random variable is the sum over all possible outcomes of the product of the probability of each outcome and the logarithm of the probability of each outcome, with the whole sum multiplied by -1.

The negative logarithm ensures that the information content falls as the probability of an event rises. Rare events (with low probability) have high information content, while common events (with high probability) have low information content.

## Example
Consider a fair coin toss. The coin has two outcomes: heads (H) and tails (T), each with a probability of 0.5. We can calculate the entropy of this system as follows (using base-2 logarithms, so the unit is bits):

$$H(X) = -\left(0.5\log_2 0.5 + 0.5\log_2 0.5\right) = 1$$

So, the entropy of a fair coin toss is 1 bit, which means that each toss of the coin provides 1 bit of information.


In the case of a fair coin, the probability of getting heads (H) or tails (T) is the same, 0.5. This means that each outcome is equally likely, and thus, the uncertainty or randomness is at its maximum. This is why the entropy, denoted as H(X), is 1 bit. Each toss of the coin provides 1 bit of information because you are completely uncertain about the outcome before the toss.

In the case of a biased coin, the probability of getting heads is 0.9 and tails is 0.1. This means that getting heads is much more likely than getting tails. Because one outcome (heads) is much more likely, there is less uncertainty or randomness in the toss. You can often predict the outcome (it’s likely to be heads). This is why the entropy, H(X), is lower for the biased coin (approximately 0.469 bits). The outcome is less uncertain, so each toss of the coin provides less than 1 bit of information.


In other words, the entropy is lower for the biased coin toss because the outcome is less uncertain: we can predict the outcome of the toss more accurately (due to the bias), so each toss provides less new information, and thus the entropy (a measure of this new information or uncertainty) is lower.

The entropy is 0 when the outcome is certain. For example, if you have a coin that always lands on heads, there’s no uncertainty, so the entropy is 0.

The entropy is 1 bit when the outcomes are equally likely, as in the case of a fair coin. This is the maximum entropy for a binary system because there’s maximum uncertainty — you have no way of predicting whether the next toss will result in heads or tails.

So, when we say the entropy of a fair coin toss is 1 bit, it means that the uncertainty is at its maximum possible value for this system. Similarly, when we say the entropy of the biased coin toss is approximately 0.469 bits, it means the uncertainty is less than half of what it could be in the most uncertain (or random) case.
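
The three coins discussed above (fair, biased, and always-heads) can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `entropy_bits` is illustrative, and base-2 logarithms are used so the result is in bits.

```python
import math

def entropy_bits(probs):
    # Outcomes with probability 0 are skipped: p * log2(p) tends to 0 as p -> 0
    return sum(-p * math.log2(p) for p in probs if p > 0)

h_fair = entropy_bits([0.5, 0.5])     # maximum uncertainty for two outcomes
h_biased = entropy_bits([0.9, 0.1])   # heads is much more likely
h_certain = entropy_bits([1.0, 0.0])  # the coin always lands on heads

print(f"fair: {h_fair:.3f}  biased: {h_biased:.3f}  certain: {h_certain:.3f}")
```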


Moreover, entropy doesn’t measure the difference between two probability distributions. For example, if we have two different distributions over the same set of events, entropy doesn’t tell us how much information we gain or lose when moving from one distribution to the other.


https://medium.com/@hosamedwee/kullback-leibler-kl-divergence-with-examples-part-1-8650ee4b329c


## KL Divergence:
When Q is used to approximate P (note that the measure is not symmetric):

$$D_{KL}(P||Q) = \sum_x P(x)\log\left(\frac{P(x)}{Q(x)}\right)$$

$P(x)$: This is the probability of event x according to the first distribution. This term is used as a weighting factor, meaning events that are more probable in the first distribution have a larger impact on the divergence.

We use this term because we’re interested in the difference between P and Q where P assigns more probability. If an event is highly probable in P but not in Q, we want it to contribute more to our divergence measure. Conversely, if an event is highly improbable in P, we don’t want it to contribute much to our divergence measure, even if Q assigns it a high probability. This is because we’re measuring the divergence from P to Q, not the other way around.

$\log(\frac{P(x)}{Q(x)})$: This term is the logarithm of the ratio of the probabilities assigned to event x by P and Q. If P and Q assign the same probability to x, then this ratio is 1, and the logarithm of 1 is 0, so events that P and Q agree on don’t contribute to the divergence. If P assigns more probability to x than Q does, then this ratio is greater than 1, and the logarithm is positive, so this event contributes to the divergence. If P assigns less probability to x than Q does, then this ratio is less than 1, and the logarithm is negative, but remember that we’re multiplying this by P(x), so events that P assigns low probability to don’t contribute much to the divergence.

The logarithm function is used for a few reasons. One is that it turns ratios into differences, which are often easier to work with. Another is that, because the logarithm is concave, Jensen’s inequality guarantees that the KL divergence is never negative. The logarithm also has a nice interpretation in terms of information theory.

We then sum over all possible outcomes. This gives us a single number that represents the total difference between P and Q.

Non-negativity: $D_{KL}(P||Q)$ is always greater than or equal to 0. This is known as Gibbs’ inequality. The KL Divergence is 0 if and only if P and Q are the same distribution, in which case there is no information loss.

Not symmetric: in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, so the order of the two distributions matters.
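
Both properties are easy to see numerically. The sketch below uses two made-up two-outcome distributions and an illustrative helper `kl_divergence` that follows the formula above; it assumes every probability is strictly positive.

```python
import torch

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); assumes all entries are > 0
    return (p * torch.log(p / q)).sum()

p = torch.tensor([0.9, 0.1])
q = torch.tensor([0.5, 0.5])

kl_pq = kl_divergence(p, q)  # Q approximates P
kl_qp = kl_divergence(q, p)  # P approximates Q
kl_pp = kl_divergence(p, p)  # identical distributions

print(kl_pq, kl_qp, kl_pp)
```

The first two values are positive and differ from each other, and the third is zero.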


Take a spam filter as an example. The KL divergence can be used to measure how well the spam filter’s estimated distribution `Q` approximates the true distribution `P`. If the KL divergence is high, it means your spam filter’s estimates are quite different from the true distribution, and it might not be very good at distinguishing spam from non-spam emails. If the KL divergence is low, it means your spam filter’s estimates are close to the true distribution, and it’s likely doing a good job.

In this setting, even though you know the true distribution `P`, you can’t directly use it to filter emails, because your spam filter needs to make its estimates based on the emails it sees. The KL divergence is a way to measure how well it’s doing that.

Moreover, KL divergence can be beneficial in cases where you have a large number of potential features (like all the words that could appear in an email). Metrics like false positives and true positives give you a single number that summarizes the performance of your classifier, but they don’t tell you anything about which features (words) your classifier is handling well and which ones it’s not. KL divergence, on the other hand, can provide this information.


https://medium.com/@hosamedwee/kullback-leibler-kl-divergence-with-examples-part-2-9123bff5dc10

In [1]:
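import torch
import torch.nn.functional as F

y_true = torch.randn((5,4)).softmax(dim=1)         # target probabilities P
y_pred = F.log_softmax(torch.randn((5,4)), dim=1)  # kl_div expects log-probabilities for Q

# reduction="mean" averages over all 20 elements; "batchmean" gives the per-sample KL divergence
F.kl_div(y_pred, y_true, reduction="mean")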



tensor(0.1202)

In [2]:
(y_true*(torch.log(y_true)-y_pred)).mean()

tensor(0.1202)
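
The check above averages over every element of the 5x4 tensor. A minimal sketch of the per-sample version, assuming the same shapes but freshly seeded random tensors (the names `p` and `log_q` are illustrative), uses `reduction="batchmean"`, which sums over the classes and averages over the batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = torch.randn(5, 4).softmax(dim=1)               # target probabilities P
log_q = F.log_softmax(torch.randn(5, 4), dim=1)    # log-probabilities of Q

kl_batchmean = F.kl_div(log_q, p, reduction="batchmean")
# Manual version: sum P * (log P - log Q) over classes, then average over samples
kl_manual = (p * (p.log() - log_q)).sum(dim=1).mean()
# The element-wise mean is smaller by a factor equal to the number of classes
kl_mean = F.kl_div(log_q, p, reduction="mean")

print(kl_batchmean, kl_manual, kl_mean * 4)
```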

In [25]:
y_true,y_pred

(tensor([[0.3288, 0.0666, 0.1597, 0.4450],
         [0.5660, 0.0756, 0.1100, 0.2483],
         [0.1961, 0.1280, 0.4610, 0.2149],
         [0.1619, 0.2797, 0.2355, 0.3228],
         [0.3394, 0.1385, 0.4086, 0.1135]]),
 tensor([[-1.4077, -2.2463, -3.9153, -0.4627],
         [-2.3076, -0.8515, -1.2788, -1.6330],
         [-2.2855, -2.5299, -1.1992, -0.6594],
         [-1.0311, -0.9609, -2.7155, -1.6366],
         [-1.8571, -3.1742, -3.8880, -0.2465]]))