# Lesson 4: Cross-Entropy Loss Function

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import torch
plt.style.use(['science', 'notebook'])

## An initial problem

Suppose I have a set of numbers $\vec{p} = [1,3,5,2,...]$ of length $N$. I want to find the set of positive numbers $\vec{q} = [q_1,q_2,q_3,...]$ of the same length that minimizes the following function:

$$H(\vec{p}, \vec{q}) = - \sum^{N}_{i=1} p_i \ln (q_i)$$

with the constraint that $\sum_{i} p_i = \sum_{i} q_i$.

In [2]:
p = np.array([5,1,4,6,2,4])
q1 = p
q2 = np.array([3,7,1,4,1,6])
q3 = np.array([2,5,7,2,1,5])

The vectors defined above all satisfy the constraint:

In [9]:
print(sum(p), sum(q1), sum(q2), sum(q3))

22 22 22 22


In [10]:
def Sum_Log(p, q):
    return - sum(p*np.log(q))

In [12]:
[Sum_Log(p, i) for i in [q1, q2, q3]]

[-31.27439562761785, -22.923775636027425, -23.455449144551153]

It turns out that the $\vec{q}$ that minimizes $H(\vec{p}, \vec{q})$ is $\vec{q} = \vec{p}$ (this can be proved with Lagrange multipliers). This agrees with the values above, where $q_1 = p$ gives the smallest result.
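The Lagrange-multiplier argument can be sketched as follows, assuming every $p_i > 0$ and $q_i > 0$. The symbols $G$, $\lambda$ and $S = \sum_i p_i$ are introduced here for the sketch and are not part of the original statement.

$$G(\vec{q}, \lambda) = - \sum^{N}_{i=1} p_i \ln (q_i) + \lambda \left( \sum^{N}_{i=1} q_i - S \right)$$

$$\frac{\partial G}{\partial q_i} = - \frac{p_i}{q_i} + \lambda = 0 \quad \Rightarrow \quad q_i = \frac{p_i}{\lambda}$$

$$\sum^{N}_{i=1} q_i = \frac{S}{\lambda} = S \quad \Rightarrow \quad \lambda = 1 \quad \Rightarrow \quad q_i = p_i$$

Because $-\ln$ is convex, this stationary point is a minimum rather than a maximum.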

### Cross Entropy Loss

In classification problems, the input is something like an image, which we'll call $x$. This image may, for example, contain 1 of 5 different objects (dog, cat, house, ...). We'll write the **true likelihood** of an image $x$ belonging to class $i$ as $p_i$, so $\vec{p}$ is a length $5$ vector. The goal of a classifier is to create a function $f$ such that

$$f(x) = \vec{q}$$

where $\vec{q}$ (length $5$ vector) is as close to $\vec{p}$ as possible.

- Note that $\vec{p}$ and $\vec{q}$ are probability mass functions, and each element of the vector represents a different class. It follows that $\sum_{i} p_i = \sum_{i} q_i = 1$.
- Typically we know what class $\hat{c}$ an image $x$ belongs to. In this case, $p_{\hat{c}} = 1$ for the known class $i = \hat{c}$, and all the other components are equal to zero.

To minimize the difference between $\vec{p}$ and $\vec{q}$, we can minimize the **cross entropy loss function**

$$H(\vec{p}, \vec{q}) = - \sum^{N}_{i=1} p_i \ln (q_i)$$

This works because the minimum of this function occurs precisely when $q_i = p_i$ for all $i$. In the case where we know which class an image belongs to (i.e. one of the $p_i$ is $1$ and the rest are $0$), every term except the one for the true class vanishes and we have:

$$H(\vec{p}, \vec{q}) = - \ln (q_{\hat{c}})$$
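A minimal sketch of this reduction, assuming a made-up 5-class problem: with a one-hot $\vec{p}$, the full sum and the single-term shortcut give the same number. The seed, `n_classes` and `c_hat` are illustrative choices, not values from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility

n_classes = 5
c_hat = 2                        # hypothetical index of the known true class

# one-hot p: 1 at the true class, 0 elsewhere
p = np.zeros(n_classes)
p[c_hat] = 1.0

# an arbitrary normalized prediction q with strictly positive entries
q = rng.random(n_classes) + 1e-3
q /= q.sum()

full_sum = -np.sum(p * np.log(q))   # general cross entropy formula
single_term = -np.log(q[c_hat])     # one-hot shortcut

print(full_sum, single_term)
```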

Example:

In [8]:
# p is the pmf with 10 categories and in this case the image belongs to the fifth since p[4] = 1
p = np.zeros(10, dtype = int); p[4] = 1
p

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

In [7]:
# q is the random pmf our 'model' predicts initially
q = np.random.rand(10); q /= sum(q) # normalized
q

array([0.05832266, 0.00710061, 0.09202471, 0.17511918, 0.06094468,
       0.19554355, 0.04165075, 0.05929879, 0.2142208 , 0.09577427])

In [11]:
# cross entropy loss for that prediction
H = -np.log(q[p>0])
H

array([2.79778871])

What if we raise the predicted probability of this fifth category?

In [13]:
q[4] = 20; q /= sum(q) # renormalization of q
q

array([1.38956060e-04, 1.69174805e-05, 2.19252545e-04, 4.17228425e-04,
       9.97762663e-01, 4.65890284e-04, 9.92345696e-05, 1.41281738e-04,
       5.10389584e-04, 2.28186009e-04])

In [14]:
H = -np.log(q[p>0])
H

array([0.00223984])

Now the model assigns a much higher probability to the fifth class, and the loss drops from about $2.80$ to about $0.002$. This makes sense: cross entropy measures how far $\vec{q}$ is from $\vec{p}$, so a smaller value means the two distributions are more similar.
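To make the trend concrete, here is a small sketch that evaluates $-\ln(q)$ for a few values of the true-class probability. The values in `q_true` are made up for illustration.

```python
import numpy as np

# hypothetical predicted probabilities for the true class, in increasing order
q_true = np.array([0.01, 0.1, 0.5, 0.9, 0.99])

# one-hot cross entropy for each of them
losses = -np.log(q_true)

for prob, loss in zip(q_true, losses):
    print(f"q = {prob:4.2f}  ->  H = {loss:.4f}")
```

The loss falls monotonically as the true-class probability approaches $1$, and it would be exactly $0$ at $q = 1$.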

### Multiple images case:

Now consider computing this loss over $N$ images $x_n$. Suppose also that we know exactly which class each image belongs to, so we can write the true class of the $n$th image as $\hat{c}(n)$. We will write the predicted probability of image $n$ belonging to class $c$ as $q_n(c)$.

- Thus the predicted probability of the image $x_n$ belonging to its true class $\hat{c}(n)$ is $q_n(\hat{c}(n))$

We sum the per-image losses together:

$$L(p, q) = \sum^{N}_{n=1} H(p_n, q_n) = - \sum^{N}_{n=1} \ln (q_n(\hat{c}(n)))$$

Example:

In [16]:
# pmf of each image (one row per image); each image can belong to one of ten classes
p = np.zeros((4, 10), dtype = int)
p[0, 4] = 1
p[1, 2] = 1
p[2, 8] = 1
p[3, 6] = 1
p

array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]])

In [20]:
q = np.random.rand(40).reshape(4,10)
# normalize each row so that every image's probabilities sum to 1
for i in range(4):
    q[i, :] /= sum(q[i, :])
q

array([[0.18186801, 0.05261113, 0.14999476, 0.07704911, 0.1938232 ,
        0.02940752, 0.00988255, 0.09391811, 0.04980343, 0.16164218],
       [0.03718438, 0.14422244, 0.19718861, 0.03476578, 0.10705033,
        0.1433361 , 0.03161328, 0.1358876 , 0.16380204, 0.00494945],
       [0.15827281, 0.12243628, 0.09479612, 0.12261051, 0.06296463,
        0.08616485, 0.1297719 , 0.02392835, 0.06144067, 0.13761388],
       [0.13533321, 0.10988344, 0.00276084, 0.11270723, 0.18003496,
        0.11734929, 0.12199301, 0.00741246, 0.03662287, 0.1759027 ]])

In [29]:
# normalizing q USING BROADCASTING RULES 
q = np.random.rand(40).reshape(4,10)
q /= np.expand_dims(np.sum(q, axis = 1), axis = 1)
q

array([[0.01580981, 0.14170686, 0.00465834, 0.13535099, 0.1361967 ,
        0.12893708, 0.10660179, 0.1026213 , 0.10377294, 0.12434419],
       [0.13202032, 0.06455847, 0.13208139, 0.16878477, 0.16132915,
        0.00127524, 0.18918898, 0.04025855, 0.08213276, 0.02837036],
       [0.07193565, 0.16561049, 0.07452134, 0.08693843, 0.014086  ,
        0.12766727, 0.13311377, 0.05452521, 0.16718668, 0.10441517],
       [0.12610882, 0.04518115, 0.01601851, 0.14483608, 0.18883643,
        0.12819185, 0.06250566, 0.03809882, 0.11044634, 0.13977633]])
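The same broadcasting normalization can also be written with `keepdims=True`, which avoids the explicit `np.expand_dims` call. This is a sketch on freshly generated random numbers, so the values differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, chosen only for reproducibility
q = rng.random((4, 10))

# keepdims=True keeps the summed axis with size 1, so the (4, 1) row sums
# broadcast against the (4, 10) array directly
q_keepdims = q / q.sum(axis=1, keepdims=True)

# the same normalization written with expand_dims, as in the cell above
q_expand = q / np.expand_dims(np.sum(q, axis=1), axis=1)

print(q_keepdims.sum(axis=1))
```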

Compute $H$ for each term in the sum and then sum together to get $L$:

In [33]:
Hs = -np.log(q[p>0])
L = sum(Hs)
L

8.579134447144245
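When the true classes are stored as integer labels rather than a one-hot matrix, the same loss can be computed with integer indexing. The sketch below assumes the labels `[4, 2, 8, 6]` that match the one-hot rows above and uses its own random `q`, so the number it produces is not the one printed above.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, chosen only for reproducibility

# true class index of each of the 4 images
labels = np.array([4, 2, 8, 6])
n_images, n_classes = labels.size, 10

# one-hot matrix rebuilt from the labels
p = np.zeros((n_images, n_classes), dtype=int)
p[np.arange(n_images), labels] = 1

# random predictions, normalized row by row
q = rng.random((n_images, n_classes))
q /= q.sum(axis=1, keepdims=True)

# boolean-mask version, as in the cell above
L_mask = np.sum(-np.log(q[p > 0]))

# integer-index version: pick q_n(c_hat(n)) for every row n
L_index = np.sum(-np.log(q[np.arange(n_images), labels]))

print(L_mask, L_index)
```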

## How do we get the q's in Machine Learning?

$\vec{q}$ should be a probability mass function:

- Each component is bounded between $0$ and $1$
- The closer $q_n(c)$ is to $0$, the less confident we are that image $n$ is class $c$
- The closer $q_n(c)$ is to $1$, the more confident we are that image $n$ is class $c$
- $\sum^{C}_{c=1} q_n(c) = 1$ for each image (the previous examples enforce this in a different way than PyTorch does)

Suppose a neural network outputs $f(x_n) = \hat{y}_n$, where $\hat{y}_n$ is a vector with the same length as the number of classes, but $\hat{y}_n$ is not normalized like $\vec{q}$ should be. We can enforce the last condition by normalizing with the softmax function, which is PyTorch's way:

$$q_n(c) = \frac{ \exp(\hat{y}_n (c)) }{ \sum^{C}_{c'=1}\exp(\hat{y}_n (c')) }$$

So we can rewrite our loss function as follows, where $\tilde{c}(n)$ denotes the true class of image $n$ (written $\hat{c}(n)$ above):

$$L(\hat{y}) = - \sum^{N}_{n=1} \ln (q_n(\tilde{c}(n))) = - \sum^{N}_{n=1} \ln \left( \frac{ \exp(\hat{y}_n (\tilde{c}(n))) }{ \sum^{C}_{c=1}\exp(\hat{y}_n (c)) } \right)$$
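Taking the logarithm of the softmax expression turns the quotient into a difference. As a sketch for a single image, with classes indexed $c = 1, \dots, C$ (an indexing choice made here):

$$- \ln (q_n(\tilde{c}(n))) = - \hat{y}_n(\tilde{c}(n)) + \ln \left( \sum^{C}_{c=1} \exp(\hat{y}_n (c)) \right)$$

Summing this expression over all images gives $L(\hat{y})$, so the loss is small when the score of the true class is large compared with the scores of the other classes.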

Get a sample $\hat{y}$:

In [3]:
# each row corresponds to one image, each column to one possible class
# as usual for raw neural network outputs, the values are not normalized
yhat = 20*np.random.rand(40).reshape(4, 10)**2
yhat

array([[4.57614589e-01, 5.59978652e+00, 1.80661801e+01, 1.32723878e+01,
        1.17101353e+01, 1.93854974e+00, 1.40678832e+01, 1.52104690e+01,
        5.99066838e+00, 1.82897163e+01],
       [1.22847764e+00, 8.14917532e-01, 4.99672157e-01, 6.51800729e-01,
        7.53291377e+00, 1.90712688e+01, 1.29011070e-02, 5.56937232e+00,
        2.20279184e+00, 2.35152472e+00],
       [2.43325579e-01, 1.23062427e+01, 9.09630917e-01, 9.45112296e+00,
        1.95408592e+01, 1.65091263e+01, 1.01041873e+01, 1.94139224e+00,
        9.84814328e+00, 1.94443891e-01],
       [1.92347521e+01, 4.19035110e+00, 2.31099677e-01, 8.41312546e+00,
        7.11109410e+00, 1.31115778e+00, 1.83897563e+00, 2.09734709e+00,
        5.78187038e+00, 1.11305076e+00]])

In [7]:
q = np.exp(yhat)
# normalize each row the way PyTorch does (softmax)
q /= np.expand_dims(np.sum(q, axis = 1), axis = 1)
np.sum(q, axis = 1)

array([1., 1., 1., 1.])
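Calling `np.exp` directly works for the sample above, but it overflows once the scores get large. A common remedy, sketched here under the assumption that scores may reach the thousands, is to subtract each row's maximum before exponentiating, which leaves the softmax result unchanged. The helper `softmax_rows` and the array `yhat_big` are hypothetical names introduced for this example.

```python
import numpy as np

def softmax_rows(yhat):
    """Row-wise softmax. Subtracting each row's max does not change the
    result but keeps np.exp from overflowing."""
    shifted = yhat - yhat.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# hypothetical scores; np.exp(1000.0) on its own would overflow to inf
yhat_big = np.array([[1000.0, 1001.0, 1002.0],
                     [-5.0, 0.0, 5.0]])

q_big = softmax_rows(yhat_big)
print(q_big.sum(axis=1))
```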

Compute $\tilde{c}(n)$, the true class of each image:

In [8]:
# pmf of each image (one row per image); each image can belong to one of ten classes
p = np.zeros((4, 10), dtype = int)
p[0, 4] = 1
p[1, 2] = 1
p[2, 8] = 1
p[3, 6] = 1
p

array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]])

In [20]:
# boolean mask marking the true class of each image (known beforehand)
c_tilde = p>0
c_tilde

array([[False, False, False, False,  True, False, False, False, False,
        False],
       [False, False,  True, False, False, False, False, False, False,
        False],
       [False, False, False, False, False, False, False, False,  True,
        False],
       [False, False, False, False, False, False,  True, False, False,
        False]])

In [22]:
Hs = -np.log(q[c_tilde])
L = np.sum(Hs)
L

52.91274203723633

## Check that this is equal to PyTorch

Create the loss function:

In [23]:
L = torch.nn.CrossEntropyLoss(reduction = 'sum')

Evaluate it on the data above. The result matches the value of $L$ computed by hand:

In [24]:
L(torch.tensor(yhat), torch.tensor(p, dtype = torch.float))

tensor(52.9127, dtype=torch.float64)
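PyTorch's cross entropy also accepts integer class indices as targets instead of a one-hot probability matrix. The sketch below compares `torch.nn.functional.cross_entropy` with the manual softmax calculation on freshly generated scores. The seed and the labels `[4, 2, 8, 6]` are assumptions for this example, so the value differs from the one above.

```python
import numpy as np
import torch
import torch.nn.functional as F

rng = np.random.default_rng(3)   # fixed seed, chosen only for reproducibility

# hypothetical unnormalized scores for 4 images and 10 classes
yhat = 20 * rng.random((4, 10)) ** 2
labels = np.array([4, 2, 8, 6])  # true class index of each image

# manual loss: softmax, pick the true-class probability, sum of -log
q = np.exp(yhat)
q /= q.sum(axis=1, keepdims=True)
L_manual = np.sum(-np.log(q[np.arange(4), labels]))

# PyTorch loss with integer (long) class indices as targets
L_torch = F.cross_entropy(
    torch.tensor(yhat),
    torch.tensor(labels, dtype=torch.long),
    reduction='sum',
)

print(L_manual, L_torch.item())
```

Note that `reduction='sum'` is needed to match the summed loss $L$ used in this lesson; the default reduction averages over the images instead.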