In [1]:
# Imports
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


# [Loss Functions](https://keras.io/api/losses/)

The loss function quantifies how well a model is performing a task by calculating a single number, **the loss**, from the model output and the desired target.

If the model predictions are totally wrong, the loss will be a high number. If they’re pretty good, it will be close to zero.

You select the loss function when you compile the model, by passing it as the `loss` argument of `model.compile()`.

During training, the optimizer tunes the model to minimize the loss on the training examples. After each epoch, Keras reports the average value of the loss over all the training examples and, if validation data is provided, over the validation examples.


![Loss Functions](image9.png)


Choosing an appropriate loss function is an important step. The following table can be helpful in this regard:

![ChoosingFunctions](image10.png)

**Compatibility with activation functions**

Some loss functions can only be calculated for a limited range of model outputs. You can ensure that the model output is always in the correct range by using an appropriate activation function on the last layer of the model.

**Example**

The **categorical crossentropy** loss function needs to calculate the logarithm of the model prediction, which is only possible if the model output is strictly positive.

1. The TanH activation outputs values between -1 and 1. Negative outputs have no logarithm, which makes TanH incompatible with the categorical crossentropy.

2. The sigmoid activation outputs values between 0 and 1, so its output range is compatible with the categorical crossentropy. (In practice, softmax is the recommended choice for multi-class problems, as discussed in the Categorical Crossentropy section below.)
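
To make the range argument concrete, here is a minimal plain-Python sketch (no TensorFlow needed) that takes the logarithm of a TanH output and of a sigmoid output for the same raw score. The helper `log_of_activation` and the score `-2.0` are illustrative assumptions, not part of Keras.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def log_of_activation(activation, raw_score):
    """Return log(activation(raw_score)), or None if the log is undefined."""
    value = activation(raw_score)
    if value <= 0.0:
        return None  # the logarithm is undefined for zero or negative inputs
    return math.log(value)

raw_score = -2.0  # hypothetical output of the last layer before activation

tanh_result = log_of_activation(math.tanh, raw_score)    # tanh(-2) is negative
sigmoid_result = log_of_activation(sigmoid, raw_score)   # sigmoid(-2) lies in (0, 1)

print("log(tanh(x)):   ", tanh_result)
print("log(sigmoid(x)):", sigmoid_result)
```

The TanH branch has no valid logarithm for a negative score, while the sigmoid branch always yields a finite (negative) value.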

***

## Probabilistic Losses

In this section, we will learn about the following probabilistic loss functions, most of which are used for classification:
1. Binary crossentropy
2. Categorical crossentropy
3. Sparse categorical crossentropy
4. Poisson

### Binary Crossentropy Loss Function

Binary crossentropy is a loss function that is used in binary classification tasks. Mathematically, it can be defined as:

\begin{equation}
Loss = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \cdot \log\hat{y}_i+(1-y_i) \cdot \log(1-\hat{y}_i)\right)
\end{equation}

where $\hat{y}_i$ and $y_i$ represent the predicted and actual output for an example $i$.

**Please note that sigmoid is the only activation function compatible with the binary crossentropy loss function when the loss is given probabilities (`from_logits=False`).**

***

In Keras, we can either use **BinaryCrossentropy class** or **binary_crossentropy function**.
***

#### binary_crossentropy function

The following code can be used to compute the binary crossentropy using the `binary_crossentropy` function.
```
tf.keras.losses.binary_crossentropy(
y_true, y_pred, from_logits=False,
label_smoothing=0, axis=-1)
```

**Example 1:**

In [2]:
y_true = [[0, 1], [0, 0]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
loss.numpy()


array([0.9162905 , 0.71355796], dtype=float32)
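
As a cross-check, the formula above can be evaluated by hand. The following is a plain-Python sketch under the assumption that the loss is averaged over the last axis (one value per row), which is what the Keras output suggests; the function name simply mirrors the Keras one and is a local re-implementation, not the library code.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Per-row binary crossentropy, averaged over the last axis."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        terms = [
            -(t * math.log(p) + (1 - t) * math.log(1 - p))
            for t, p in zip(true_row, pred_row)
        ]
        losses.append(sum(terms) / len(terms))
    return losses

y_true = [[0, 1], [0, 0]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
losses = binary_crossentropy(y_true, y_pred)
print(losses)  # approximately [0.9163, 0.7136], in line with the Keras output above
```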

***
#### BinaryCrossentropy class

The following code can be used to compute the binary crossentropy using the `BinaryCrossentropy` class.

```
tf.keras.losses.BinaryCrossentropy(
    from_logits=False,
    label_smoothing=0,
    axis=-1,
    reduction="auto",
    name="binary_crossentropy")
```

**_`from_logits`: when it is True, `y_pred` is interpreted as raw, unnormalized scores (logits), and the loss applies the sigmoid internally; when it is False (the default), `y_pred` is expected to contain probabilities, i.e. values in [0, 1]._**

**Example 1:**


In [3]:
# Example 1: (batch_size = 1, number of samples = 4)  
y_true = [0, 1, 0, 0]
y_pred = [-18.6, 0.51, 2.94, -12.8]
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
bce(y_true, y_pred).numpy()

0.865458
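
The following plain-Python sketch illustrates what `from_logits=True` means for this example: the raw scores are first passed through a sigmoid, and the binary crossentropy is then computed on the resulting probabilities. This naive version is for illustration only; library implementations typically use a numerically more stable formulation, and the helper name `bce_from_logits` is an assumption.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_from_logits(y_true, logits):
    """Apply the sigmoid to each logit, then average the binary crossentropy."""
    probs = [sigmoid(z) for z in logits]
    terms = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, probs)
    ]
    return sum(terms) / len(terms)

y_true = [0, 1, 0, 0]
logits = [-18.6, 0.51, 2.94, -12.8]
loss = bce_from_logits(y_true, logits)
print(round(loss, 3))  # 0.865, in line with the Keras result above
```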

**Example 2:**

In [4]:
# Same labels as in Example 1, but y_pred now holds probabilities instead of logits,
# so from_logits is set to False (the default).
y_true = [0, 1, 0, 0]
y_pred = [0.6, 0.3, 0.2, 0.8]
bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce(y_true, y_pred).numpy()

0.988211

**Example 3 (With tf.keras API):**
```
model.compile(
  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
  ....
)
```

***

### [Categorical Crossentropy](https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/categorical-crossentropy)



Categorical crossentropy is a loss function that is used in **multi-class classification** tasks. These are tasks where an example can only belong to one out of many possible categories, and the model must decide which one.

The categorical crossentropy loss function calculates the loss of an example by computing the following sum:

\begin{equation}
Loss = - \sum_{i=1}^{N} y_i \cdot \log\left(\hat{y}_i\right)
\end{equation}

where $\hat{y}_i$ and $y_i$ represent the predicted and actual output for class $i$, and $N$ is the number of classes.

**Softmax is the only activation function recommended to use with the categorical crossentropy loss function.**
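
A short plain-Python sketch may help show why softmax fits this loss: it turns arbitrary raw scores into strictly positive values that sum to 1, so the logarithm in the formula is always defined. The raw scores and the target below are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred):
    """-sum(y_i * log(y_hat_i)) over the classes of one example."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

raw_scores = [2.0, 1.0, 0.1]   # hypothetical outputs of the last layer
y_true = [1, 0, 0]             # one-hot target: the example belongs to the first class

probs = softmax(raw_scores)
loss = categorical_crossentropy(y_true, probs)
print(probs, sum(probs), loss)
```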

***
In Keras, we can either use the **CategoricalCrossentropy class** or the **categorical_crossentropy function**.
***

#### CategoricalCrossentropy class

The following code can be used to compute the categorical crossentropy using the `CategoricalCrossentropy` class.

```
tf.keras.losses.CategoricalCrossentropy(
    from_logits=False,
    label_smoothing=0,
    axis=-1,
    reduction="auto",
    name="categorical_crossentropy",
)
```

It computes the crossentropy loss between the labels and predictions.

Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided in a **one_hot** representation. If you want to provide labels as integers, please use **SparseCategoricalCrossentropy** loss. 

**Example 1**

In [5]:
y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
cce = tf.keras.losses.CategoricalCrossentropy()
cce(y_true, y_pred).numpy()


1.1769392

**Example 2:**

In [6]:
# Using 'sum' reduction type.  
cce = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.SUM)
cce(y_true, y_pred).numpy()

2.3538785

**Example 3:**

In [7]:
# Using 'none' reduction type.  
cce = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)
cce(y_true, y_pred).numpy()


array([0.05129331, 2.3025851 ], dtype=float32)
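
The three reduction types in Examples 1 to 3 differ only in how the per-example losses are combined. The plain-Python sketch below reproduces the arithmetic; the small `eps` guard against `log(0)` is an assumption that mirrors common practice, and the helper name is illustrative.

```python
import math

def per_sample_cce(y_true, y_pred, eps=1e-7):
    """Categorical crossentropy for each example; eps avoids log(0)."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        row_loss = -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
        losses.append(row_loss)
    return losses

y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]

none_reduction = per_sample_cce(y_true, y_pred)        # 'none': one loss per example
sum_reduction = sum(none_reduction)                    # 'sum': add them up
mean_reduction = sum_reduction / len(none_reduction)   # 'sum_over_batch_size': average

print(none_reduction, sum_reduction, mean_reduction)
```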

**Example 4 (Usage with the compile() API):**
```
model.compile(
optimizer='sgd', 
loss=tf.keras.losses.CategoricalCrossentropy()
)
```

#### categorical_crossentropy function

The following code can be used to compute the categorical crossentropy using the `categorical_crossentropy` function.
```
tf.keras.losses.categorical_crossentropy(
    y_true, y_pred, from_logits=False, 
    label_smoothing=0, axis=-1
)
```

**Example 1:**

In [8]:
y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
loss.numpy()

array([0.05129331, 2.3025851 ], dtype=float32)

### [Sparse Categorical Crossentropy](https://sanjivgautamofficial.medium.com/categorical-cross-entropy-vs-sparse-categorical-cross-entropy-b6a24de2b7f0)


If your targets are one-hot encoded, use the **Categorical Crossentropy** loss function.

Examples of one-hot encodings:
- $[1, 0, 0]$
- $[0, 1, 0]$
- $[0, 0, 1]$

***

But if your targets are integers, use **Sparse Categorical Crossentropy**.

Examples of integer encodings (Keras expects zero-based class indices, so these correspond to the one-hot vectors above):
- $0$
- $1$
- $2$

***

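Like binary crossentropy and categorical crossentropy, the Keras library provides both a class (`tf.keras.losses.SparseCategoricalCrossentropy`) and a function (`tf.keras.losses.sparse_categorical_crossentropy`) for sparse categorical crossentropy.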
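
The two encodings describe the same target, so the resulting loss is the same. The plain-Python sketch below illustrates this with hypothetical predicted probabilities; the functions are local re-implementations of the formulas, not the Keras code.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot target: -sum(y_i * log(y_hat_i))."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Loss for an integer target: -log of the probability of the true class."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]      # hypothetical predicted probabilities for 3 classes

one_hot_target = [0, 1, 0]   # one-hot encoding of the class with index 1
integer_target = 1           # integer encoding of the same class

dense_loss = categorical_crossentropy(one_hot_target, probs)
sparse_loss = sparse_categorical_crossentropy(integer_target, probs)
print(dense_loss, sparse_loss)  # both equal -log(0.7)
```

The sparse variant only needs to look up one probability, which is why it is convenient when there are many classes.
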
***

### Poisson Loss Function

The Poisson loss function is used for regression when modeling count data, that is, when the target follows a Poisson distribution. Example: the number of customers who will churn next week. The loss takes the form:

\begin{equation}
L\left(y, \hat{y}\right) = \frac{1}{N} \sum_{i=1}^{N}\left(\hat{y}_i - y_i \log \hat{y}_i\right)
\end{equation}



where $\hat{y}$ is the predicted expected value.

Minimizing the Poisson loss is equivalent to maximizing the likelihood of the data under the assumption that the target comes from a Poisson distribution, conditioned on the input.

1. In Keras, the `Poisson` class can be used as follows:
```
tf.keras.losses.Poisson(reduction="auto", name="poisson")
```
2. Alternatively, the `poisson` function can be used to compute the Poisson loss:
```
tf.keras.losses.poisson(y_true, y_pred)
```
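
To get a feel for the Poisson loss, the following plain-Python sketch evaluates the formula for one observed count and several candidate predictions. The observed count of 3 is a made-up example, and the `eps` term is an assumed safeguard against `log(0)`.

```python
import math

def poisson_loss(y_true, y_pred, eps=1e-7):
    """Mean of y_hat - y * log(y_hat); eps guards against log(0)."""
    terms = [p - t * math.log(p + eps) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

observed_count = [3.0]  # hypothetical count, e.g. customers churning in one week

# Evaluate the loss for several candidate predictions of the expected count.
for candidate in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(candidate, round(poisson_loss(observed_count, [candidate]), 4))
```

Among these candidates, the loss is lowest when the predicted expected value equals the observed count, which is consistent with the maximum likelihood interpretation.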

***
***
## Regression Losses

Regression models are another family of machine learning and statistical models, which are used to predict continuous target values. They have a wide range of applications, from house price prediction, e-commerce pricing systems, weather forecasting, and stock market prediction, to image super-resolution, feature learning via autoencoders, and image compression.

Models such as linear regression, random forests, XGBoost, convolutional neural networks, and recurrent neural networks are some of the most popular regression models.


***

### MeanSquaredError Loss

Mean Squared Error (MSE) is perhaps the most popular metric used for regression problems. It essentially finds the average squared error between the predicted and actual values. In Keras, we can use the metric class

`tf.keras.metrics.MeanSquaredError(name="mean_squared_error", dtype=None)` 

to track MSE as a metric; the corresponding loss class and function are covered later in this section.

Let’s assume we have a regression model which predicts the price of houses in the Seattle area (denoted by $\hat{y}_i$), and let’s say for each house we also have the actual price the house was sold for (denoted by $y_i$). Then the MSE can be calculated as:

\begin{equation}
MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2
\end{equation}

Sometimes people use **RMSE**, which is essentially the **square root of MSE**, to have a metric on the same scale as the target values.

**Why use mean squared error?**

MSE is sensitive to outliers. Given several examples with the same input feature values, the optimal prediction is their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median. MSE is thus a good choice if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it’s important to penalize large errors heavily.
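
The mean-versus-median claim can be checked numerically. The sketch below uses made-up house prices that share the same input features and searches a grid of constant predictions for the one with the lowest loss; the data and the grid are illustrative assumptions.

```python
import statistics

def mse(targets, prediction):
    """Mean squared error of one constant prediction against all targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)

def mae(targets, prediction):
    """Mean absolute error of one constant prediction against all targets."""
    return sum(abs(t - prediction) for t in targets) / len(targets)

# Hypothetical sale prices (in $1000s) for houses with identical features;
# the last one is an outlier.
targets = [300.0, 310.0, 320.0, 330.0, 900.0]

# Search a grid of constant predictions for the one with the lowest loss.
candidates = [float(c) for c in range(250, 951)]
best_for_mse = min(candidates, key=lambda c: mse(targets, c))
best_for_mae = min(candidates, key=lambda c: mae(targets, c))

print(best_for_mse, statistics.mean(targets))    # MSE optimum vs. mean
print(best_for_mae, statistics.median(targets))  # MAE optimum vs. median
```

The MSE optimum is pulled towards the outlier, while the MAE optimum stays with the bulk of the data.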


**When to use mean squared error?**

Use MSE when you are doing regression, believe that your target, conditioned on the input, is normally distributed, and want large errors to be significantly (quadratically) more penalized than small ones.

**Example:** You want to predict future house prices. The price is a continuous value, and therefore we want to do regression. MSE can here be used as the loss function.

***

We can use classes or functions, in Keras, to compute MSE.

#### MeanSquaredError class

The following code can be used to compute the MSE using the `MeanSquaredError` class.

```
tf.keras.losses.MeanSquaredError(reduction="auto", name="mean_squared_error")
```

It computes the mean of squares of errors between labels and predictions:

`loss = mean(square(y_true - y_pred))`

**Example 1**


In [9]:
y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
mse = tf.keras.losses.MeanSquaredError()
mse(y_true, y_pred).numpy()

0.5

**Example 2**

In [10]:
# Calling with 'sample_weight'.  
mse(y_true, y_pred, sample_weight=[0.7, 0.3]).numpy()


0.25
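
The value 0.25 can be reproduced by hand: each per-example loss is multiplied by its weight, and the result is averaged over the batch. The plain-Python sketch below assumes this is how the default reduction combines the weighted losses, which is consistent with the output above; `weighted_mse` is an illustrative helper.

```python
def weighted_mse(y_true, y_pred, sample_weight):
    """Per-example MSE, scaled by sample weights, then averaged over the batch."""
    per_sample = [
        sum((t - p) ** 2 for t, p in zip(true_row, pred_row)) / len(true_row)
        for true_row, pred_row in zip(y_true, y_pred)
    ]
    weighted = [loss * w for loss, w in zip(per_sample, sample_weight)]
    return sum(weighted) / len(weighted)

y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
result = weighted_mse(y_true, y_pred, sample_weight=[0.7, 0.3])
print(result)  # approximately 0.25: (0.5 * 0.7 + 0.5 * 0.3) / 2
```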

**Example 3:**

In [11]:
# Using 'sum' reduction type.  
mse = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.SUM)
mse(y_true, y_pred).numpy()

1.0

**Example 4:**

In [12]:
# Using 'none' reduction type.  
mse = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)
mse(y_true, y_pred).numpy()


array([0.5, 0.5], dtype=float32)

**Example 5 (Usage with the compile() API):**

In [13]:
model = Sequential()
model.add(Dense(1, activation = 'linear', input_shape = (2,)))
model.compile(optimizer='sgd', 
              loss=tf.keras.losses.MeanSquaredError())

***
Similarly, the root mean squared error (RMSE) can be tracked as a metric, in Keras, using 

`tf.keras.metrics.RootMeanSquaredError(name="root_mean_squared_error", dtype=None)`.

***

#### mean_squared_error function

```
tf.keras.losses.mean_squared_error(y_true, y_pred)
```


In [14]:
y_true = np.random.randint(0, 2, size=(2, 3))
y_pred = np.random.random(size=(2, 3))
loss = tf.keras.losses.mean_squared_error(y_true, y_pred)
loss.numpy()

array([0.1604751 , 0.41266547])

### MeanAbsoluteError Loss

Mean absolute error (or mean absolute deviation) is another loss function that finds the average absolute distance between the predicted and target values. MAE is defined as below:

\begin{equation}
MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|
\end{equation}

**MAE is known to be more robust to outliers than MSE.** The main reason is that MSE squares the errors, so outliers (which usually have larger errors than other samples) get more weight in the final error and have a larger impact on the model parameters.
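
A small numeric comparison illustrates the point. In the plain-Python sketch below, the targets and predictions are made up; the only difference between the two sets of predictions is one large error.

```python
def mse(y_true, y_pred):
    """Mean squared error over a list of scalar targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error over a list of scalar targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 12.0, 11.0, 13.0]
clean_pred = [11.0, 11.0, 12.0, 12.0]     # every error has magnitude 1
outlier_pred = [11.0, 11.0, 12.0, 33.0]   # the last error has magnitude 20

print("MSE:", mse(y_true, clean_pred), "->", mse(y_true, outlier_pred))
print("MAE:", mae(y_true, clean_pred), "->", mae(y_true, outlier_pred))
```

With these numbers, the single outlier raises the MSE from 1.0 to 100.75, but the MAE only from 1.0 to 5.75.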

It is also worth mentioning that there is a nice maximum likelihood estimation (MLE) interpretation behind the MSE and MAE metrics. If we assume a linear dependence between features and targets, then minimizing MSE and MAE corresponds to the MLE of the model parameters under the assumption of Gaussian and Laplace distributed model errors, respectively.


**Why use mean absolute error?**

MAE is not sensitive to outliers. Given several examples with the same input feature values, the optimal prediction is their median target value. This should be compared with Mean Squared Error, where the optimal prediction is the mean. A disadvantage of MAE is that the gradient magnitude does not depend on the error size, only on the sign of $y - \hat{y}$. As a result, the gradient magnitude stays large even when the error is small, which in turn can lead to convergence problems.
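
The gradient behaviour can be seen by differentiating both losses for a single example with respect to the prediction. The sketch below is a hand-derived illustration in plain Python (the subgradient at zero error is taken to be 0 by assumption), not the way Keras computes gradients internally.

```python
def mse_gradient(y_true, y_pred):
    """Derivative of (y_true - y_pred)^2 with respect to y_pred."""
    return -2.0 * (y_true - y_pred)

def mae_gradient(y_true, y_pred):
    """Derivative of |y_true - y_pred| with respect to y_pred (0 at zero error)."""
    error = y_true - y_pred
    if error == 0:
        return 0.0
    return -1.0 if error > 0 else 1.0

y_true = 5.0
for y_pred in (0.0, 4.0, 4.9, 4.999):
    print(y_pred, mse_gradient(y_true, y_pred), mae_gradient(y_true, y_pred))
```

As the prediction approaches the target, the MSE gradient shrinks towards zero, while the MAE gradient keeps the same magnitude.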

**When to use mean absolute error?**

Use Mean absolute error when you are doing regression and don’t want outliers to play a big role. It can also be useful if you know that your distribution is multimodal, and it’s desirable to have predictions at one of the modes, rather than at the mean of them.

Example: When doing image reconstruction, MAE encourages less blurry images compared to MSE. This is used for example in the paper Image-to-Image Translation with Conditional Adversarial Networks by Isola et al.

In Keras, we can use **MeanAbsoluteError class** as follows:

```
tf.keras.losses.MeanAbsoluteError(
    reduction="auto", name="mean_absolute_error"
)
```

In Keras, we can use **mean_absolute_error function** as follows:

```
tf.keras.losses.mean_absolute_error(y_true, y_pred)
```
***

**Example 1:**

In [15]:
y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
mae = tf.keras.losses.MeanAbsoluteError()
mae(y_true, y_pred).numpy()


0.5

**Example 2:**

In [16]:
# Calling with 'sample_weight'.  
mae(y_true, y_pred, sample_weight=[0.7, 0.3]).numpy()


0.25

**Example 3:**

In [17]:
# Using 'sum' reduction type.  
mae = tf.keras.losses.MeanAbsoluteError(
    reduction=tf.keras.losses.Reduction.SUM)
mae(y_true, y_pred).numpy()

1.0

**Example 4:**

In [18]:
# Using 'none' reduction type.  
mae = tf.keras.losses.MeanAbsoluteError(
    reduction=tf.keras.losses.Reduction.NONE)
mae(y_true, y_pred).numpy()

array([0.5, 0.5], dtype=float32)

**Example 5 (Usage with compile() API):**

In [19]:
model = Sequential()
model.add(Dense(1, activation = 'linear', input_shape = (2,)))
model.compile(optimizer='sgd', 
              loss=tf.keras.losses.MeanAbsoluteError())


***
### Mean Squared Logarithmic Error Function

The MSLE is defined as 

\begin{equation}
L(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i+1)-\log(\hat{y}_i+1)\right)^2
\end{equation}

In Keras, we can use **MeanSquaredLogarithmicError class** as follows:

```
tf.keras.losses.MeanSquaredLogarithmicError(
    reduction="auto", name="mean_squared_logarithmic_error"
)
```

In Keras, we can use **mean_squared_logarithmic_error function** as follows:

```
tf.keras.losses.mean_squared_logarithmic_error(y_true, y_pred)
```
***

**Use MSLE when you are doing regression, believe that your target, conditioned on the input, is normally distributed, and don’t want large errors to be significantly more penalized than small ones, in those cases where the range of the target value is large.**
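
The effect of the logarithm is easiest to see by comparing the same relative error at two different scales. The following plain-Python sketch uses made-up targets and predictions that are both 10% too low.

```python
import math

def msle(y_true, y_pred):
    """Mean of (log(y + 1) - log(y_hat + 1))^2."""
    terms = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

def mse(y_true, y_pred):
    """Mean squared error, for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Two hypothetical predictions that are both 10% below the target,
# at very different scales.
small = ([100.0], [90.0])
large = ([100000.0], [90000.0])

print("MSE: ", mse(*small), mse(*large))
print("MSLE:", msle(*small), msle(*large))
```

The MSE differs by a factor of a million between the two cases, while the MSLE is nearly the same, because it responds to the ratio between prediction and target rather than to their absolute difference.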


**There are other loss functions as well; you can learn more about them by visiting the references given at the end of this tutorial.**

# References

1. [Loss Functions in Keras](https://keras.io/api/losses/)
2. [Loss Functions](https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions)
3. [Sparse Categorical Crossentropy](https://sanjivgautamofficial.medium.com/categorical-cross-entropy-vs-sparse-categorical-cross-entropy-b6a24de2b7f0)
4. [Regression Losses](https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23)