## Machine Learning for Neuroscience, <br>Department of Brain Sciences, Faculty of Medicine, <br> Imperial College London
### Contributors: Payam Barnaghi, Francesca Palermo, Nan Fletcher-Lloyd, Alex Capstick, Yu Chen, Tianyu Cui, Marirena Bafaloukou, Ruxandra Mihai
**Spring 2024**

<h2>Cross Entropy</h2>

The cross entropy formula takes in two distributions, $𝑝(𝑥)$, the true distribution, and $𝑞(𝑥)$, the estimated distribution, defined over the discrete variable $𝑥$ and is given by

$$H(p,q) = -\sum_{\forall x} p(x)\log(q(x))$$

For a neural network, the calculation is independent of the following:

- What kind of layer was used.

- What kind of activation was used - although many activations will not be compatible with the calculation because their outputs are not interpretable as probabilities (i.e., their outputs are negative, greater than 1, or do not sum to 1). Softmax is often used for multiclass classification because it guarantees a well-behaved probability distribution function.
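
To illustrate why softmax outputs are compatible with the cross entropy calculation, here is a minimal sketch of the function in plain Python. The `logits` values are hypothetical and chosen only for illustration; they stand in for the raw outputs of a final layer.

```python
from math import exp


def softmax(logits):
    """Map arbitrary real-valued scores to a probability distribution."""
    # Subtracting the maximum is a common numerical-stability trick;
    # it does not change the result.
    m = max(logits)
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw layer outputs: one is negative, one is greater than 1.
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)

print(probs)       # every entry lies strictly between 0 and 1
print(sum(probs))  # the entries sum to 1 (up to floating-point error)
```

Whatever the raw scores are, each output is positive and the outputs sum to 1, so they can be used as $q(x)$ in the formula above.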

For a neural network, you will usually see the equation written in a form where $𝐲$ is the ground truth vector and $𝐲̂$ (or some other value taken directly from the last layer output) is the estimate. For a single example, it would look like this:

$$L = -\mathbf{y} \cdot \log(\hat{\mathbf{y}})$$

where $\cdot$ is the inner product.

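For example, if the ground truth $\mathbf{y}$ assigns all probability to the first class, every other entry of $\mathbf{y}$ is zero, so those terms vanish and only the matching entry of the estimate $\hat{\mathbf{y}}$ contributes. With estimates $\hat{\mathbf{y}} = [0.1, 0.5, \ldots]$ and the natural logarithm:

$$L = -(1 \times \log(0.1) + 0 \times \log(0.5) + \ldots)$$

$$L = -\log(0.1) \approx 2.303$$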

An important point from the comments on the source answer:

Does that mean the loss would be the same whether the predictions are [0.1, 0.5, 0.1, 0.1, 0.2] or [0.1, 0.6, 0.1, 0.1, 0.1]?

Yes. This is a key feature of multiclass log loss: it rewards or penalises only the probability assigned to the correct class. The value is independent of how the remaining probability is split between the incorrect classes.
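
The point can be checked numerically. The sketch below assumes a one-hot ground truth with the first class correct and uses the two prediction vectors from the question; the function name `multiclass_log_loss` is introduced here for illustration only.

```python
from math import log


def multiclass_log_loss(y, y_hat):
    """Cross entropy for one example, using the natural logarithm."""
    # Terms where the true probability is zero contribute nothing, so skip them.
    return -sum(t * log(e) for t, e in zip(y, y_hat) if t > 0)


y = [1, 0, 0, 0, 0]                 # one-hot ground truth: first class is correct
pred_a = [0.1, 0.5, 0.1, 0.1, 0.2]
pred_b = [0.1, 0.6, 0.1, 0.1, 0.1]

loss_a = multiclass_log_loss(y, pred_a)
loss_b = multiclass_log_loss(y, pred_b)

print('Loss A: %.3f' % loss_a)  # Loss A: 2.303
print('Loss B: %.3f' % loss_b)  # Loss B: 2.303
```

Both predictions give the correct class a probability of 0.1, so both losses equal $-\log(0.1)$.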

You will often see this equation averaged over all examples as a cost function. The distinction is not always strictly adhered to in descriptions, but usually a loss function is lower level and describes how a single instance or component determines an error value, whilst a cost function is higher level and describes how a complete system is evaluated for optimisation. A cost function based on multiclass log loss for a data set of size $N$ might look like this:

$$J = -\frac{1}{N}\left(\sum_{i=1}^{N} \mathbf{y}_i \cdot \log(\hat{\mathbf{y}}_i)\right)$$

Many implementations will require your ground truth values to be one-hot encoded (with a single true class), because that allows for some extra optimisation. However, in principle the cross entropy loss can be calculated - and optimised - when this is not the case.
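
As a rough sketch of the cost function above, the following averages the per-example loss over a small, made-up data set. The name `log_loss_cost` and all values are assumptions for illustration; the third example deliberately uses a ground truth that is not one-hot to show that the calculation still works.

```python
from math import log


def log_loss_cost(Y, Y_hat):
    """Mean cross entropy over N examples; rows of Y need not be one-hot."""
    total = 0.0
    for y, y_hat in zip(Y, Y_hat):
        # Per-example loss: -y . log(y_hat), skipping zero-probability terms.
        total += -sum(t * log(e) for t, e in zip(y, y_hat) if t > 0)
    return total / len(Y)


# Illustrative ground truth: two one-hot rows and one soft label.
Y = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.5, 0.5, 0.0]]

# Illustrative predictions; each row sums to 1.
Y_hat = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.4, 0.4, 0.2]]

J = log_loss_cost(Y, Y_hat)
print('J = %.3f' % J)
```

With one-hot rows only a single term per example is non-zero, which is the shortcut many library implementations exploit.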


<i>source: https://datascience.stackexchange.com/questions/20296/cross-entropy-loss-explanation</i>

<br>
<br>

In [4]:
from math import log2
 

def cross_entropy(p, q):
    return -sum([p[i]*log2(q[i]) for i in range(len(p))])
 

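# note: these example values do not sum to 1, so strictly speaking
# they are not valid probability distributions
p = [0.10, 0.20, 0.30]
q = [0.90, 0.80, 0.70]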

ce_pp = cross_entropy(p, p)
print('H(P,P): %.3f' % ce_pp)

ce_qq = cross_entropy(q, q)
print('H(Q,Q): %.3f' % ce_qq)

ce_pq = cross_entropy(p, q)
print('H(P,Q): %.3f' % ce_pq)

ce_qp = cross_entropy(q, p)
print('H(Q,P): %.3f' % ce_qp)

H(P,P): 1.318
H(Q,Q): 0.755
H(P,Q): 0.234
H(Q,P): 6.063


In [5]:
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))

In [9]:
kl_pp = kl_divergence(p, p)
print('KL(P || P): %.3f' % kl_pp)

kl_qq = kl_divergence(q, q)
print('KL(Q || Q): %.3f' % kl_qq)


kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f' % kl_pq)

kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f' % kl_qp)

KL(P || P): 0.000
KL(Q || Q): 0.000
KL(P || Q): -1.084
KL(Q || P): 5.309
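
The negative value of KL(P || Q) above arises because the example `p` and `q` sum to 0.6 and 2.4 rather than 1. For proper probability distributions the KL divergence is never negative, and it equals the cross entropy minus the entropy of the first distribution. The sketch below checks both properties; the values `p_norm` and `q_norm` and the function names are made up for illustration.

```python
from math import log2


def cross_entropy_bits(p, q):
    """Cross entropy H(p, q) in bits."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q))


def kl_bits(p, q):
    """KL divergence KL(p || q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q))


# Illustrative distributions that each sum to 1.
p_norm = [0.10, 0.40, 0.50]
q_norm = [0.80, 0.15, 0.05]

kl_pq = kl_bits(p_norm, q_norm)
h_pq = cross_entropy_bits(p_norm, q_norm)
h_pp = cross_entropy_bits(p_norm, p_norm)  # H(P, P) is the entropy of P

print('KL(P || Q):        %.3f' % kl_pq)
print('H(P,Q) - H(P,P):   %.3f' % (h_pq - h_pp))
```

The two printed values agree, and the divergence is positive because the two distributions differ.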


In [21]:
from math import log
from numpy import mean
 
def cross_entropy(p, q):
    return -sum([p[i]*log(q[i]) for i in range(len(p))])
 

p = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
q = [0.8, 0.9, 0.9, 0.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3]

cross_entropies = list()
for i in range(len(p)):
    
    # transform each value into a two-class distribution: [P(class 0), P(class 1)]
    expected = [1.0 - p[i], p[i]]
    predicted = [1.0 - q[i], q[i]]
    
    cross_ent = cross_entropy(expected, predicted)
    print('[p=%.1f, q=%.1f] Cross Entropy: %.3f' % (p[i], q[i], cross_ent))
    # collect the per-example value so the mean is taken over all examples
    cross_entropies.append(cross_ent)

mean_cross_entropy = mean(cross_entropies)
print('\nMean Cross Entropy: %.3f' % mean_cross_entropy)

[p=1.0, q=0.8] Cross Entropy: 0.223
[p=1.0, q=0.9] Cross Entropy: 0.105
[p=1.0, q=0.9] Cross Entropy: 0.105
[p=1.0, q=0.6] Cross Entropy: 0.511
[p=1.0, q=0.8] Cross Entropy: 0.223
[p=0.0, q=0.1] Cross Entropy: 0.105
[p=0.0, q=0.4] Cross Entropy: 0.511
[p=0.0, q=0.2] Cross Entropy: 0.223
[p=0.0, q=0.1] Cross Entropy: 0.105
[p=0.0, q=0.3] Cross Entropy: 0.357

Mean Cross Entropy: 0.247


In [11]:
expected

[1.0, 0]

In [12]:
predicted

[0.7, 0.3]