In [4]:
import torch

## Loss function

In supervised training of neural networks (and, to a lesser extent, in evaluation), one of the key concepts is the loss function. The loss function measures the distance between the known target values and the neural network's predictions. A good loss function is differentiable and has non-zero gradients.

As an example, Accuracy, a measure often used to evaluate neural networks in classification tasks, has gradients equal to zero almost everywhere. This makes it a poor fit for training: the weights are updated using gradients computed by backpropagation, and a zero gradient produces no update. Thus, Accuracy isn't suitable as a loss function.
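To make this concrete, the following minimal sketch nudges a set of predictions slightly and compares the effect on Accuracy and on a smooth loss (Binary Cross Entropy, introduced later in this notebook). The predictions, the 0.5 threshold, and the helper `accuracy` are illustrative assumptions, not part of the original material.

```python
import torch

def accuracy(probabilities, target):
    # Hypothetical helper: threshold at 0.5, then count matches.
    predicted_class = (probabilities > 0.5).float()
    return (predicted_class == target).float().mean()

target = torch.tensor([0.0, 1.0, 0.0, 1.0])
probabilities = torch.tensor([0.2, 0.7, 0.4, 0.9])
nudged = probabilities + 0.01  # small change in the predictions

# Accuracy is piecewise constant: a small nudge leaves it unchanged.
acc_before = accuracy(probabilities, target)
acc_after = accuracy(nudged, target)

# Binary Cross Entropy is smooth: the same nudge changes its value.
bce_before = torch.nn.functional.binary_cross_entropy(probabilities, target)
bce_after = torch.nn.functional.binary_cross_entropy(nudged, target)

print(acc_before.item(), acc_after.item())
print(bce_before.item(), bce_after.item())
```

Because Accuracy does not react to small changes in the predictions, it gives the optimizer no direction in which to move the weights.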

Examples of loss functions:

- Mean Squared Error (MSE) - a standard choice for regression,
- Hinge Loss - a classification loss,
- Binary Cross Entropy - used for classification with two classes; it serves as a smooth stand-in for Accuracy that has non-zero gradients,
- Cross Entropy - used for classification with any number of classes.

Let us provide an exact definition of MSE.

If a target vector is

$T=(T_i), i=1, \ldots, N$

and a prediction vector of a regressor is 

$P=(P_i), i=1, \ldots, N$

then $MSE(P, T) = \frac{\sum_{i=1}^N (T_i-P_i)^2}{N}$.

## Your task

Using Python, calculate the MSE loss of the prediction $P=(1.1, 4.12, 8.9, 14.85)$ versus the target values $T=(1,4,9,16)$.

In [6]:

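target = [1, 4, 9, 16]
prediction = [1.1, 4.12, 8.9, 14.85]

# Accumulate the squared differences (avoid shadowing the built-in `sum`).
squared_error_sum = 0.0
for t, p in zip(target, prediction):
    squared_error_sum += (t - p) ** 2

# Divide by the number of elements to get the mean.
print(squared_error_sum / len(target))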
    

0.3392250000000002
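The same calculation can also be written without an explicit loop. A minimal sketch, assuming the values are first wrapped in PyTorch tensors (the variable names are illustrative):

```python
import torch

target = torch.tensor([1.0, 4.0, 9.0, 16.0])
prediction = torch.tensor([1.1, 4.12, 8.9, 14.85])

# Element-wise squared differences, averaged over all N elements.
mse = ((target - prediction) ** 2).mean()
print(mse)
```

The result should agree with the loop above up to float32 precision.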


PyTorch provides predefined loss functions. Of course, you are free to define your own, but the predefined ones have some advantages:
- their implementations are numerically stable. As an example, Binary Cross Entropy computed from raw scores takes the logarithm of an expression containing an exponential. Mathematically the two largely cancel, but in floating-point arithmetic the exponential of even a moderately large value overflows to infinity,
- they usually have more efficient implementations,
- they have built-in reduction methods.
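The stability point can be illustrated with a small sketch. The term $\log(1 + e^x)$, which appears when Binary Cross Entropy is written in terms of raw scores, is used here as an example; the input value 100.0 is an assumption chosen only to trigger the overflow in float32.

```python
import torch

x = torch.tensor(100.0)  # float32 by default

# Naive evaluation: exp(100) overflows float32, so the result is inf.
naive = torch.log(1 + torch.exp(x))

# softplus computes the same mathematical function, log(1 + exp(x)),
# in a numerically stable way and stays finite.
stable = torch.nn.functional.softplus(x)

print(naive, stable)
```

The naive expression returns infinity, while the stable version returns a value close to 100, which is the mathematically correct answer.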

In [7]:
torch.nn.functional.mse_loss(torch.tensor([1.1, 4.12, 8.9, 14.85]), torch.tensor([1, 4, 9, 16]))

tensor(0.3392)

In [10]:
torch.nn.functional.mse_loss(torch.tensor([1.1, 4.12, 8.9, 14.85]), torch.tensor([1, 4, 9, 16]), reduction="sum")

tensor(1.3569)

In [11]:
torch.nn.functional.mse_loss(torch.tensor([1.1, 4.12, 8.9, 14.85]), torch.tensor([1, 4, 9, 16]), reduction="none")

tensor([0.0100, 0.0144, 0.0100, 1.3225])

With `reduction="none"` we can see that the fourth element of the prediction contributes the most to the MSE. If you want to sum the per-element values rather than average them, use `reduction="sum"`.
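The relationship between the three reduction modes can be checked directly. A minimal sketch, reusing the numbers from the cells above (the variable names are illustrative):

```python
import torch
import torch.nn.functional as F

prediction = torch.tensor([1.1, 4.12, 8.9, 14.85])
target = torch.tensor([1.0, 4.0, 9.0, 16.0])

per_element = F.mse_loss(prediction, target, reduction="none")
total = F.mse_loss(prediction, target, reduction="sum")
mean = F.mse_loss(prediction, target)  # reduction="mean" is the default

# "sum" adds up the per-element losses; "mean" divides that sum by N.
print(torch.isclose(per_element.sum(), total))
print(torch.isclose(total / per_element.numel(), mean))

# Index (0-based) of the element with the largest contribution.
print(per_element.argmax())
```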

## Classification loss

### Two classes - Binary Cross Entropy

Now let's examine loss functions used for classification. If the targets are the classes 0 or 1 and the predictions are floats between 0 and 1, the task is binary classification and the appropriate loss is Binary Cross Entropy. A usage example in PyTorch:

In [23]:
torch.nn.functional.binary_cross_entropy(torch.tensor([0.0, 0.77, 0.11, 0.99]), torch.tensor([0,1,0,1]).type(torch.float), reduction="none")

tensor([0.0000, 0.2614, 0.1165, 0.0101])
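For reference, the per-element Binary Cross Entropy for a target $t$ and a predicted probability $p$ is $-(t \log p + (1-t) \log(1-p))$. The sketch below recomputes it by hand for the three predictions that lie strictly between 0 and 1. The first element of the cell above is left out because $\log 0$ is not finite, a case the built-in function handles by clamping the logarithm.

```python
import torch

prediction = torch.tensor([0.77, 0.11, 0.99])
target = torch.tensor([1.0, 0.0, 1.0])

# Per-element formula: -(t * log(p) + (1 - t) * log(1 - p))
manual = -(target * torch.log(prediction)
           + (1 - target) * torch.log(1 - prediction))
builtin = torch.nn.functional.binary_cross_entropy(
    prediction, target, reduction="none")

print(manual)
print(builtin)
```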

If the targets are the classes 0 or 1 but the predictions are arbitrary floats, the task is still binary classification. Before Binary Cross Entropy is applied, however, the predictions must be mapped into the interval $[0,1]$ with a sigmoid:

In [24]:
torch.nn.functional.binary_cross_entropy(torch.sigmoid(torch.tensor([-2.0, 3.13, 0.0, -120.0])), torch.tensor([0,1,0,1]).type(torch.float), reduction="none")

tensor([1.2693e-01, 4.2789e-02, 6.9315e-01, 1.0000e+02])

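Alternatively, you can let PyTorch apply the sigmoid internally by using `torch.nn.functional.binary_cross_entropy_with_logits()`. This is recommended for numerical stability: in the previous cell, the sigmoid of -120.0 underflows to 0 in float32 and `binary_cross_entropy` clamps the logarithm at -100, so the last loss is reported as 100 instead of the correct value of about 120.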

In [31]:
torch.nn.functional.binary_cross_entropy_with_logits(torch.tensor([-2.0, 3.13, 0.0, -120.0]), torch.tensor([0,1,0,1]).type(torch.float), reduction="none")

tensor([1.2693e-01, 4.2789e-02, 6.9315e-01, 1.2000e+02])
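One way to see why the logits version stays accurate is to rewrite the loss algebraically. For a raw score $x$ and target $t$, substituting $p = \sigma(x)$ into the Binary Cross Entropy formula gives $\log(1 + e^x) - t x$, which needs no explicit sigmoid. The sketch below checks this identity with `softplus`; it illustrates the idea and is not a claim about how PyTorch implements the function internally.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([-2.0, 3.13, 0.0, -120.0])
target = torch.tensor([0.0, 1.0, 0.0, 1.0])

# softplus(x) = log(1 + exp(x)), evaluated without overflow.
manual = F.softplus(logits) - target * logits
builtin = F.binary_cross_entropy_with_logits(logits, target, reduction="none")

print(manual)
print(builtin)
```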

### Arbitrary number of classes - Cross Entropy

Cross Entropy is used for multiclass classification but, in principle, nothing stops us from using it for two classes, too. Note that instead of a single logit per sample, you must provide the loss function with raw predictions (logits) for every class, so the predictions tensor has one more dimension. A softmax is applied internally.

In [32]:
torch.nn.functional.cross_entropy(torch.tensor([[-2.0, -2.0], [3.14, 3.14], [0.0, 0.0], [120.0, 0.0]]), torch.tensor([0,1,0,1]).type(torch.long), reduction="none")

tensor([  0.6931,   0.6931,   0.6931, 120.0000])
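To connect the two classification losses: for two classes, Cross Entropy applied to the logit pair $(0, x)$ gives the same value as Binary Cross Entropy with the single logit $x$, because the softmax of $(0, x)$ assigns probability $\sigma(x)$ to class 1. A minimal sketch of this check, reusing the logits from the binary examples (padding with a zero column is an illustrative construction, not something the original cells do):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([-2.0, 3.13, 0.0, -120.0])
target = torch.tensor([0, 1, 0, 1])

# Build an (N, 2) tensor: column 0 is the class-0 logit (fixed at 0),
# column 1 is the class-1 logit.
two_class_logits = torch.stack([torch.zeros_like(logits), logits], dim=1)

ce = F.cross_entropy(two_class_logits, target, reduction="none")
bce = F.binary_cross_entropy_with_logits(
    logits, target.float(), reduction="none")

# Cross Entropy is the negative log-softmax picked at the target class.
log_probs = F.log_softmax(two_class_logits, dim=1)
manual = -log_probs.gather(1, target.unsqueeze(1)).squeeze(1)

print(ce)
print(bce)
print(manual)
```

All three tensors should agree up to float32 precision.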