## Softmax and cross entropy

### $S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$

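Softmax turns raw scores into probabilities. Comparing these predicted probabilities with the one-hot encoded true label (`y_true`) gives the cross entropy loss:

$D(Y, \hat{Y}) = - \frac{1}{N}\sum_i Y_i \times \log(\hat{Y}_i)$

where: 
$\hat{Y}_i$ = predicted probability for class $i$<br>
$Y_i$ = true label (one-hot entry for class $i$)<br>
$N$ = number of samples being averaged over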

In [1]:
import numpy as np 
import torch 
import torch.nn

In [8]:
# softmax in numpy

def softmax(x):
    return np.exp(x)/np.sum(np.exp(x), axis = 0)

In [9]:
x = np.array([1.0, 2.0, 0.3])

softmax(x)

array([0.2372554 , 0.64492705, 0.11781755])

In [11]:
# softmax on tensors
t = torch.tensor([1.0, 2.0, 0.3])

torch.softmax(t, dim=0)

tensor([0.2373, 0.6449, 0.1178])
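One caveat about the NumPy version above: `np.exp` overflows for large inputs, and the division then yields `nan`. A common remedy is to subtract the maximum before exponentiating; the shift cancels in the ratio, so the probabilities are unchanged. A minimal sketch (the name `stable_softmax` is made up for this note, not part of the notebook):

```python
import numpy as np

def stable_softmax(x):
    # shifting by the max leaves the result unchanged, because
    # e^(x - c) / sum(e^(x - c)) == e^x / sum(e^x)
    shifted = x - np.max(x, axis=0)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=0)

x = np.array([1.0, 2.0, 0.3])
probs = stable_softmax(x)            # same probabilities as softmax(x)

big = np.array([1000.0, 1001.0, 1002.0])
big_probs = stable_softmax(big)      # stays finite; np.exp(1000.0) alone overflows

print(probs)
print(big_probs)
```

`torch.softmax` can be used directly on tensors, so this only matters for the hand-written NumPy function.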

The softmax function is often used in multiclass classification together with the cross entropy loss.

For example: 

- $\hat{Y}$ = [0.7, 0.2, 0.1] and $Y$ = [1, 0, 0] $\implies D(Y, \hat{Y})$  = 0.36

- $\hat{Y}$ = [0.1, 0.3, 0.6] and $Y$ = [1, 0, 0] $\implies D(Y, \hat{Y})$  =  2.30

So: the further _away_ the prediction is from the true label, the larger the loss, which is exactly what we want from a loss function. 
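Since $Y$ is one-hot, only the true-class term survives in the sum, so the loss is just $-\log$ of the probability assigned to the true class. A quick check of the two examples, taking the first prediction to be [0.7, 0.2, 0.1] so that it sums to 1 (the variable names are illustrative only):

```python
import numpy as np

y = np.array([1, 0, 0])
true_class = int(np.argmax(y))      # index of the 1 in the one-hot label

predictions = [
    np.array([0.7, 0.2, 0.1]),      # most probability on the true class
    np.array([0.1, 0.3, 0.6]),      # little probability on the true class
]

losses = []
for y_hat in predictions:
    full_sum = -np.sum(y * np.log(y_hat))     # the formula as written
    shortcut = -np.log(y_hat[true_class])     # only the true-class term
    losses.append((full_sum, shortcut))
    print(f"{full_sum:.2f} {shortcut:.2f}")
# prints:
# 0.36 0.36
# 2.30 2.30
```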

### CE loss in numpy:

In [None]:
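def CEloss(actual, predicted):
    # actual: one-hot encoded true label, predicted: predicted probabilities
    # debug prints (enabled for the run shown below); uncomment to inspect intermediates
    # print(np.log(predicted))
    # print(actual * np.log(predicted))
    # print(np.sum(actual * np.log(predicted)))
    return -np.sum(actual * np.log(predicted))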

In [15]:
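# arbitrary numbers to exercise the function; a real call would use a
# one-hot label for a1 and probabilities that sum to 1 for p1
a1 = np.array([1, -2, 3])
p1 = np.array([0.1, 8, 3])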

CEloss(a1,p1)

[-2.30258509  2.07944154  1.09861229]
[-2.30258509 -4.15888308  3.29583687]
-3.1656313103493883


np.float64(3.1656313103493883)

### CE loss in pytorch

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">CAUTION:</span>

`nn.CrossEntropyLoss()` applies (log-)softmax and the negative log likelihood loss _automatically_.

$\implies$ we must pass the raw logits and not apply softmax ourselves. 

Further, `y_true` must hold the class index (the actual label), __not__ a one-hot encoded vector, unlike the NumPy case!

In [20]:
loss_fn = torch.nn.CrossEntropyLoss()

y_true = torch.tensor([1])  # actual label is class 1

# raw logit scores -> no softmax
y_pred_bad = torch.tensor([[2.0, 0.7, 0.2]])   # shape (n_samples, n_classes); highest score on class 0
y_pred_good = torch.tensor([[1.0, 3.0, 0.2]])  # highest score on class 1, the true label

l1 = loss_fn(y_pred_good, y_true)
l2 = loss_fn(y_pred_bad, y_true)  # expect higher than l1

print(f'Loss with good prediction = {l1.item()} \nLoss with bad prediction = {l2.item()}')


Loss with good prediction = 0.17910423874855042 
Loss with bad prediction = 1.6631355285644531
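
To make the caution concrete, the sketch below rebuilds the loss from `log_softmax` followed by `nll_loss` (both from `torch.nn.functional`) and compares it with `nn.CrossEntropyLoss()`. It also shows that applying softmax first gives a different, wrong value. The logits are the same as `y_pred_good`; the other variable names are only illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 3.0, 0.2]])   # same values as y_pred_good
target = torch.tensor([1])                 # class index, not one-hot

builtin = torch.nn.CrossEntropyLoss()(logits, target)

# step 1: log-softmax over the class dimension
log_probs = F.log_softmax(logits, dim=1)
# step 2: negative log likelihood of the true class
two_step = F.nll_loss(log_probs, target)

# the same thing by hand: negate the log-probability of the true class
by_hand = -log_probs[0, target[0]]

# the mistake to avoid: softmax applied before CrossEntropyLoss
double_softmax = torch.nn.CrossEntropyLoss()(torch.softmax(logits, dim=1), target)

print(builtin.item(), two_step.item(), by_hand.item())  # all three agree
print(double_softmax.item())                            # differs from the values above
```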


And the corresponding output labels for `y_pred_bad` and `y_pred_good` are:

In [21]:
_, pred1 = torch.max(y_pred_bad, 1)   # dim=1: max over the class scores
_, pred2 = torch.max(y_pred_good, 1)

print(f"Label from pred1 = {pred1.item()} \nLabel from pred2 = {pred2.item()}")

Label from pred1 = 0 
Label from pred2 = 1


Classifying several samples at once (a batch) works in exactly the same way: `y_true` becomes a 1-D tensor with one class index per sample, and `y_pred` a 2-D tensor with one row of logits per sample.

For example:

    Y_pred_bad = torch.tensor(
        [[0.9, 0.2, 0.1],
        [0.1, 0.3, 1.5],
        [1.2, 0.2, 0.5]])

    y_true = torch.tensor([1, 0, 2])

The loss and the predicted label for each sample are obtained just as before.
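
A minimal sketch of that three-sample case, reusing the tensors above. With its default settings `nn.CrossEntropyLoss()` returns the mean over the samples; the `reduction='none'` variant is included only to expose the per-sample values.

```python
import torch

# one row of raw logits per sample: 3 samples, 3 classes
Y_pred_bad = torch.tensor(
    [[0.9, 0.2, 0.1],
     [0.1, 0.3, 1.5],
     [1.2, 0.2, 0.5]])
y_true = torch.tensor([1, 0, 2])           # one class index per sample

loss_fn = torch.nn.CrossEntropyLoss()      # default reduction: mean over samples
batch_loss = loss_fn(Y_pred_bad, y_true)

# per-sample losses, to see what is being averaged
per_sample = torch.nn.CrossEntropyLoss(reduction='none')(Y_pred_bad, y_true)

# predicted label per sample = index of the largest logit in each row
_, preds = torch.max(Y_pred_bad, 1)

print(batch_loss.item())
print(per_sample)
print(preds)   # tensor([0, 2, 0]): none of the three matches y_true
```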