# Cross Entropy

Cross entropy is often our target function. Suppose we have some fixed model that predicts n classes {1, 2, ..., n} with hypothetical occurrence probabilities y1, y2, ..., yn. Now suppose that in reality we observe k1 instances of class 1, k2 instances of class 2, and so on up to kn instances of class n.

In [1]:
%%latex
\begin{align}
P(A|B) = \text{The Conditional Probability of A given that B has happened}
\newline
P(data|model) = y_{1}^{k_{1}}y_{2}^{k_{2}}...y_{n}^{k_{n}}
\newline
\text{Take the negative logarithm}
\newline
-logP(data|model) = -k_{1}log(y_{1})-k_{2}log(y_{2})-...-k_{n}log(y_{n}) = -\sum_{i=1}^{n}{k_{i}log(y_{i})}
\newline
\text{This comes from the logarithm rules of}
\newline
log_a(uv) = log_a(u) + log_a(v)
\newline
log_a(u^n) = n*log_a(u)
\newline
\text{Now let's say we have N observations, } N = k_1 + k_2 + ... + \:k_n \text{ and we denote the observed (empirical) probabilities as } y_{i}^{'} = k_i \: / \: N
\newline
\text{Now we divide what we have by our number of observations}
\newline
-\frac{1}{N}logP(data|model) = -\frac{1}{N}\sum_{i=1}^{n}{k_{i}log(y_{i})} = -\sum_{i=1}^{n}{y_{i}^{'}log(y_{i})}
\newline
\text{This is how we derive the cross-entropy}
\end{align}

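The derivation above can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3-class model and made-up counts (the values of `y` and `k` are illustrative, not from any real dataset). It computes the negative log-likelihood divided by N and the cross entropy between the empirical frequencies and the model probabilities, which should agree.

```python
import math

# Hypothetical model probabilities y_i and observed counts k_i (illustrative values).
y = [0.5, 0.3, 0.2]
k = [6, 3, 1]

# Total number of observations: N = k_1 + k_2 + ... + k_n
N = sum(k)

# Negative log-likelihood of the data: -sum_i k_i * log(y_i)
neg_log_likelihood = -sum(k_i * math.log(y_i) for k_i, y_i in zip(k, y))

# Empirical probabilities y'_i = k_i / N
y_prime = [k_i / N for k_i in k]

# Cross entropy between empirical and model distributions: -sum_i y'_i * log(y_i)
cross_entropy = -sum(yp_i * math.log(y_i) for yp_i, y_i in zip(y_prime, y))

# Both quantities should match up to floating-point error.
print(neg_log_likelihood / N, cross_entropy)
```

The natural logarithm is used here, so the result is in nats; any other base would only rescale both sides by the same constant.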

What is __probabilistic classification__ in machine learning? It means that, given an input, the model outputs a probability distribution over a set of classes rather than only the most likely class.

Let's say we are trying to create a model that has 3 distinct classes. For one of the inputs the ground truth is (1.0, 0, 0), but our model predicts a different distribution, say (0.4, 0.1, 0.5). We would then like to nudge our parameters so that our prediction gets closer to the ground truth.

But what exactly does it mean to "get closer to"? In particular, how should we measure the difference between our prediction and the ground truth?

Cross entropy is the average number of bits we'll need if we encode symbols from $y$ using the wrong tool $\hat{y}$. This consists of encoding the $i$-th symbol using $\log\frac{1}{\hat{y}_i}$ bits instead of $\log\frac{1}{y_i}$ bits. We of course still take the expected value with respect to the true distribution $y$, since it's the true distribution that generates the symbols:

$$ H(y, \hat{y}) = \sum_i y_i \log \frac{1}{\hat{y}_i} = -\sum_i y_i \log \hat{y}_i $$
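As a minimal sketch of this formula, the snippet below evaluates $H(y, \hat{y})$ for the 3-class example above. The helper name `cross_entropy` and the second prediction `(0.8, 0.1, 0.1)` are assumptions made for illustration. It uses base-2 logarithms so the result is in bits, and it skips terms where $y_i = 0$, following the usual convention that such terms contribute nothing.

```python
import math

def cross_entropy(y, y_hat):
    """H(y, y_hat) = -sum_i y_i * log2(y_hat_i), measured in bits.

    Terms with y_i == 0 are skipped, since they contribute 0 to the sum.
    """
    return -sum(y_i * math.log2(y_hat_i) for y_i, y_hat_i in zip(y, y_hat) if y_i > 0)

y_true = [1.0, 0.0, 0.0]   # ground truth from the example above
y_pred = [0.4, 0.1, 0.5]   # model prediction from the example above

# With a one-hot ground truth, only the true class contributes: -log2(0.4)
loss = cross_entropy(y_true, y_pred)

# A hypothetical prediction that puts more mass on the true class gives a lower loss.
better_loss = cross_entropy(y_true, [0.8, 0.1, 0.1])

print(loss, better_loss)
```

This gives one concrete answer to what "getting closer" means: a prediction is closer to the ground truth when its cross entropy with the ground truth is smaller.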

Cross entropy is never smaller than entropy: encoding symbols according to the wrong distribution $\hat{y}$ makes us use more bits on average. The only exception is the trivial case where $y$ and $\hat{y}$ are equal, in which case entropy and cross entropy are equal.

The __KL divergence__ from $\hat{y}$ to $y$ is simply the difference between cross entropy and entropy: $D_{KL}(y \,\|\, \hat{y}) = H(y, \hat{y}) - H(y)$.
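The relationship between entropy, cross entropy, and KL divergence can be sketched numerically. The distributions below are made up for illustration, and the helper names `entropy`, `cross_entropy`, and `kl_divergence` are assumptions rather than part of any library. The KL divergence is computed twice: as the difference of cross entropy and entropy, and directly as $\sum_i y_i \log \frac{y_i}{\hat{y}_i}$.

```python
import math

def entropy(y):
    """H(y) = -sum_i y_i * log2(y_i), in bits; zero-probability terms contribute 0."""
    return -sum(y_i * math.log2(y_i) for y_i in y if y_i > 0)

def cross_entropy(y, y_hat):
    """H(y, y_hat) = -sum_i y_i * log2(y_hat_i), in bits."""
    return -sum(y_i * math.log2(y_hat_i) for y_i, y_hat_i in zip(y, y_hat) if y_i > 0)

def kl_divergence(y, y_hat):
    """Direct form: sum_i y_i * log2(y_i / y_hat_i), in bits."""
    return sum(y_i * math.log2(y_i / y_hat_i) for y_i, y_hat_i in zip(y, y_hat) if y_i > 0)

# Illustrative true distribution y and mismatched model distribution y_hat.
y = [0.5, 0.25, 0.25]
y_hat = [0.25, 0.25, 0.5]

h = entropy(y)                   # bits needed with the right code
h_cross = cross_entropy(y, y_hat)  # bits needed with the wrong code
kl_from_difference = h_cross - h   # extra bits paid for using y_hat
kl_direct = kl_divergence(y, y_hat)

print(h, h_cross, kl_from_difference, kl_direct)
```

When `y_hat` equals `y`, the cross entropy reduces to the entropy and the KL divergence is zero, which matches the trivial case described above.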