# When to use which loss?

You can find an overview of the implemented losses in the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/losses).
## Classification

### Binary cross entropy

In [1]:
import tensorflow as tf
import numpy as np

When you have a binary classification problem, you will typically have labels `True` and `False`, which correspond to the labels `1` and `0`. When using a sigmoid in your output layer, you will get values in the range $[0,1]$, which are interpreted as probabilities.

The base formula for cross entropy is $-y \cdot \log(p(y))$, where $y$ is the true label and $p(y)$ is the predicted probability that the label is 1. The probability will be in the range $[0,1]$.

If your label is 1, and you predict a probability of 0.9, you can calculate the binary cross entropy by 
multiplying the negative label with the log of the probability:

In [2]:
y = 1
py = 0.9

-y * np.log(py)


0.10536051565782628

If we had a label 1, and predicted a low probability of 0.1, we need the loss to be high.

In [3]:
y = 1
py = 0.1

-y * np.log(py)

2.3025850929940455

Yet, if the label is 0 and we predicted a low probability of 0.1, that would have been right, so the loss should be low. We can obtain this by subtracting both the label and the probability from 1:

In [4]:
y = 0
py = 0.1

-(1-y) * np.log(1-py)


0.10536051565782628

In [5]:
y = 0
py = 0.9

-(1-y) * np.log(1-py)

2.302585092994046

Now, we can combine these two situations in one formula. Because the label is either 1 or 0, one of the two terms is always multiplied by zero and drops out.

$$J(\theta) = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(p(y_i)) + (1-y_i) \log(1-p(y_i)) \right]$$
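
As a sanity check, the combined formula can be written out directly in NumPy. This is a minimal sketch: the function name `binary_cross_entropy` and the `eps` clipping constant are illustrative choices, not part of the TensorFlow API.

```python
import numpy as np


def binary_cross_entropy(y, p, eps=1e-7):
    """Mean binary cross entropy for labels y and predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities away from 0 and 1 so log() never sees 0 (assumed safeguard).
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# A single confident, correct prediction reproduces the value computed by hand above.
single = binary_cross_entropy([1], [0.9])

# A small batch with two negatives and two positives.
batch = binary_cross_entropy([0, 1, 0, 1], [0.1, 0.7, 0.3, 0.9])
print(single, batch)
```

For the same inputs, the batch value should agree with `tf.keras.losses.BinaryCrossentropy` up to float32 rounding.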

TensorFlow implements this as the `BinaryCrossentropy` loss. With `from_logits` set to `False`, the predicted value is expected to be in the range $[0,1]$.

In [6]:
y = [0, 1, 0, 1]
yhat = [0.1, 0.7, 0.3, 0.9]
bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce(y, yhat)


<tf.Tensor: shape=(), dtype=float32, numpy=0.23101759>

If we don't apply a sigmoid, we get the raw output of a linear model, which is called a **logit** and takes values in the range $(-\infty, \infty)$. In that case, set `from_logits=True`:

In [7]:
y = [0, 1, 0, 1]
yhat = [-10.3, 3.2, -17.18, 12.92]
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
bce(y, yhat)

<tf.Tensor: shape=(), dtype=float32, numpy=0.00999736>
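
Conceptually, `from_logits=True` means the sigmoid is folded into the loss. The sketch below illustrates this in NumPy under that assumption. `sigmoid` and `bce_from_logits` are illustrative helper names, and the numerically stable formulation shown is a common textbook one, not necessarily the exact code path TensorFlow uses.

```python
import numpy as np


def sigmoid(x):
    """Map logits from (-inf, inf) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def bce_from_logits(y, logits):
    """Mean binary cross entropy computed directly on logits.

    Uses the identity max(x, 0) - x * y + log(1 + exp(-|x|)),
    which avoids taking the log of a value that rounds to 0.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(logits, dtype=float)
    return np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x))))


y = np.array([0, 1, 0, 1])
logits = np.array([-10.3, 3.2, -17.18, 12.92])

# Route 1: apply the sigmoid first, then the probability-based formula.
probs = sigmoid(logits)
naive = -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))

# Route 2: compute the loss on the logits directly.
stable = bce_from_logits(y, logits)
print(probs.round(4), naive, stable)
```

Both routes give the same loss for these moderate logits. The second one keeps working when a logit is so extreme that the sigmoid rounds to exactly 0 or 1.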

### Categorical cross entropy

If there are two or more classes and the labels are one-hot encoded, you can use categorical cross entropy.

Let's say we have three possible classes and the label is the first class, so the label is $[1, 0, 0]$. If we predict the first class with high probability and the other two with small probabilities, we could get something like $[0.95, 0.04, 0.01]$.

Note how the three probabilities add up to 1. This is a fundamental property of probabilities and, while in the binary case we enforced it by applying the sigmoid function to the logits, in the multiclass case we use the **softmax** function.

If your model outputs logits in the range $(-\infty, \infty)$ (i.e., you don't use an activation function), you can set `from_logits=True`.

In [8]:
y = [[0, 1, 0], [0, 0, 1]]
yhat = [[0.04, 0.95, 0.01], [0.1, 0.8, 0.1]]
# Using 'auto'/'sum_over_batch_size' reduction type.

loss = tf.keras.losses.CategoricalCrossentropy()
loss(y, yhat).numpy()


1.1769392
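
To make the softmax and the loss itself concrete, here is a minimal NumPy sketch. The helper names `softmax` and `categorical_cross_entropy` and the example logits are illustrative assumptions. The labels and probabilities are the ones from the cell above.

```python
import numpy as np


def softmax(logits):
    """Turn each row of logits into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the row maximum does not change the result but avoids overflow in exp().
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)


def categorical_cross_entropy(y_onehot, probs):
    """Mean over the batch of -sum_k y_k * log(p_k)."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return np.mean(-np.sum(y_onehot * np.log(probs), axis=-1))


# Hypothetical logits for one sample with three classes.
probs_from_logits = softmax([2.0, 1.0, -1.0])

# The labels and predicted probabilities from the cell above.
y = [[0, 1, 0], [0, 0, 1]]
yhat = [[0.04, 0.95, 0.01], [0.1, 0.8, 0.1]]
cce = categorical_cross_entropy(y, yhat)
print(probs_from_logits, probs_from_logits.sum(), cce)
```

Only the probability assigned to the true class contributes to the loss, which is why the second sample (true class predicted with 0.1) dominates the average.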

### Sparse categorical cross entropy

When there are many classes, a one-hot encoding can be impractical. Instead, we can use a sparse representation that stores only the class index, writing $[0,1,0]$ as 1 and $[0,0,1]$ as 2. If your model outputs logits in the range $(-\infty, \infty)$ instead of probabilities, you can set `from_logits=True`.



In [9]:
y = [1, 2]
yhat = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y, yhat).numpy()


1.1769392
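
The relation between the sparse and the one-hot representation can be sketched in a few lines of NumPy. This is an illustration under the assumption that labels are integer class indices starting at 0. The variable names are hypothetical.

```python
import numpy as np

y_sparse = np.array([1, 2])  # class indices
yhat = np.array([[0.05, 0.95, 0.0], [0.1, 0.8, 0.1]])

# One-hot encoding of the same labels: row i of the identity matrix encodes class i.
y_onehot = np.eye(yhat.shape[1])[y_sparse]

# With sparse labels, the loss only needs the probability assigned to the true class.
p_true = yhat[np.arange(len(y_sparse)), y_sparse]
sparse_loss = -np.mean(np.log(p_true))
print(y_onehot, p_true, sparse_loss)
```

Indexing with the class index avoids building the one-hot matrix at all, which is the practical benefit when the number of classes is large.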

## Multi-class, Multi-label

The most general case occurs when there are multiple classes to predict and each instance can carry several labels from the set of categories at the same time. For example, you can have three classes for movies, `['comedy', 'sci-fi', 'horror']`, and you are watching a movie that is both a comedy and sci-fi. Your label will be $[1, 1, 0]$ and your prediction might be something like $[0.7, 0.9, 0.1]$. Or, in the case of a chest X-ray, you might have pneumonia and/or cancer, or neither.

**Note**: in this case, the probabilities for the classes do not have to add up to one, because the classes are not mutually exclusive.

For multi-label problems, you should use binary cross entropy. If you use a softmax, your values will sum to one. But that is not what you want! Because the problem is multi-label, you want to allow multiple values to get close to one, so use a sigmoid as the activation function. You can also use this with logits.

In [10]:
y_true = [[1, 0, 1], [0, 0, 1]]
y_pred = [[5.0, -10.0, 5], [-5.0, -10, 20]]
loss = tf.keras.losses.binary_crossentropy(y_true, y_pred, from_logits=True)
loss.numpy()

array([0.00449203, 0.00225358], dtype=float32)
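
To see why the sigmoid is the right activation here, the following NumPy sketch applies both a sigmoid and a softmax to the first row of logits used above. It is only an illustration. The variable names are hypothetical.

```python
import numpy as np

logits = np.array([5.0, -10.0, 5.0])  # first row of y_pred above

# Sigmoid treats every class independently, so several classes can be close to 1.
sigmoid_probs = 1.0 / (1.0 + np.exp(-logits))

# Softmax makes the classes compete: the outputs always sum to 1.
exps = np.exp(logits - logits.max())
softmax_probs = exps / exps.sum()

print(sigmoid_probs.round(4), sigmoid_probs.sum())
print(softmax_probs.round(4), softmax_probs.sum())
```

With the sigmoid, both active labels get a probability close to one. With the softmax, they have to share the probability mass and end up at roughly one half each.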

In [11]:
y = [[0, 1, 0], [0, 0, 1]]
yhat = [[0.04, 0.95, 0.01], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.binary_crossentropy(y, yhat, from_logits=False)
loss.numpy()

array([0.0340551, 1.3391274], dtype=float32)

## Regression

For regression problems, a common choice is the mean squared error (MSE), the average of the squared differences between targets and predictions:

$$\mathcal{L}(\hat{y}, y)=\frac{1}{m}\sum_{i=1}^m (y_i-\hat{y}_i)^2$$

In [69]:
y = np.array([10.2, 5.1, 8.12])
yhat = np.array([5.2, 6.0, 9.2])
loss = tf.keras.losses.MSE(y, yhat)

# Check against a plain NumPy computation; allclose tolerates float rounding.
assert np.allclose(
    loss.numpy(), np.mean(np.square(y - yhat), axis=-1))

loss.numpy()


8.992133333333332

Where the square punishes outliers heavily (can you find the outlier in `yhat`?), the mean absolute error (MAE) puts a smaller penalty on them.

$$\mathcal{L}(\hat{y}, y)=\frac{1}{m}\sum_{i=1}^m |y_i-\hat{y}_i|$$

Try to change the outlier in the code, and note how the two loss functions react differently to the outlier.

In [70]:
y = np.array([10.2, 5.1, 8.12])
yhat = np.array([5.2, 6.0, 9.2])
loss = tf.keras.losses.MAE(y, yhat)

# Check against a plain NumPy computation; allclose tolerates float rounding.
assert np.allclose(
    loss.numpy(), np.mean(np.abs(y - yhat), axis=-1))

loss.numpy()

2.3266666666666667
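
One way to run the suggested experiment is sketched below in plain NumPy. The "fixed" prediction of 9.2 is a hypothetical value chosen so that the first error shrinks from 5.0 to 1.0. `mse` and `mae` are illustrative helpers, not the Keras functions.

```python
import numpy as np

y = np.array([10.2, 5.1, 8.12])
yhat_outlier = np.array([5.2, 6.0, 9.2])  # first prediction is off by 5.0
yhat_fixed = np.array([9.2, 6.0, 9.2])    # hypothetical: same point off by only 1.0


def mse(y, yhat):
    """Mean squared error."""
    return np.mean(np.square(y - yhat))


def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))


for name, pred in [("with outlier", yhat_outlier), ("without outlier", yhat_fixed)]:
    print(f"{name:>16}: MSE={mse(y, pred):.4f}  MAE={mae(y, pred):.4f}")

# How much each loss grows because of the single outlier.
mse_ratio = mse(y, yhat_outlier) / mse(y, yhat_fixed)
mae_ratio = mae(y, yhat_outlier) / mae(y, yhat_fixed)
print(f"MSE grows {mse_ratio:.1f}x, MAE grows {mae_ratio:.1f}x")
```

The single outlier inflates the MSE far more than the MAE, because its error is squared before averaging.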

If the target values have a huge spread, you might want to be more lenient on errors for the very large values. In that case, you can use the mean squared logarithmic error:

$$\mathcal{L}(\hat{y}, y)=\frac{1}{m}\sum_{i=1}^m \left(\log(y_i+1) - \log(\hat{y}_i + 1)\right)^2$$

In [71]:
y = [1, 10, 1000]
yhat = [1.2, 13, 1100]
loss = tf.keras.losses.mean_squared_logarithmic_error(y, yhat)
loss.numpy()

0.02543662

Compare that to a regular MSE:

In [72]:
loss = tf.keras.losses.MSE(y, yhat)
loss.numpy()

3336.3467
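
To see where the difference comes from, the per-sample contributions can be inspected with NumPy. This is a minimal sketch that follows the formula above. It is not the Keras implementation, and the variable names are illustrative.

```python
import numpy as np

y = np.array([1.0, 10.0, 1000.0])
yhat = np.array([1.2, 13.0, 1100.0])

# Per-sample contributions to each loss.
squared_error = np.square(y - yhat)
# log1p(x) computes log(1 + x), matching the formula above.
squared_log_error = np.square(np.log1p(y) - np.log1p(yhat))

print(squared_error)      # dominated by the sample with the largest target
print(squared_log_error)  # depends on the ratio (yhat + 1) / (y + 1)

mse = squared_error.mean()
msle = squared_log_error.mean()
print(mse, msle)
```

The MSE is driven almost entirely by the third sample, although its relative error is only 10%. The logarithmic version weighs relative errors instead, so the second sample (off by 30%) contributes most.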