In [40]:
import torch
import torchvision as tv
from PIL import Image
import torch.nn as nn
import numpy as np
torch.manual_seed(42)

<torch._C.Generator at 0x1114011f0>

<span style="font-size: 15px;">If we build a model that predicts an output quantity from some input, we need a way to **measure how the predicted quantity differs from the true one**. Let $y_{\rm true}$ be the true value for a given input, and $y_{\rm pred}$ the corresponding prediction of our model. The **loss function** $L$ quantifies the difference between $y_{\rm true}$ and $y_{\rm pred}$. The **type and shape of the output** play a crucial role in defining a loss function.

Let us now look at several of the loss functions provided by PyTorch:
</span>

# Continuous-valued Outputs

## **L1Loss**

<span style="font-size: 15px;">

- For an output of arbitrary shape, L1Loss can define the following three types of loss functions:

   1. Scalar loss: the mean absolute error (MAE) over all elements of the true and predicted tensors:
   $$
\begin{equation} 
L = \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} \left| {{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| \hspace{3cm} (1)
\end{equation}
   $$

        where $b$ denotes the sample number in the considered batch of size $B$, $N$ is the total number of elements in the output batch, i.e., $N = \texttt{y\_pred.numel()} = \texttt{y\_true.numel()}$.


   2. Scalar loss: the summed absolute error (**without taking the mean**) over all elements of the true and predicted tensors:
   $$
\begin{equation} 
L = \sum_{b=0}^{B-1} \sum_{i,j, \cdots} \left| {{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| \hspace{3.5cm} (2)
\end{equation}
   $$
   3. Tensor-valued loss function of the same shape as the output quantity:
      $$
\begin{equation} 
{L}^{b, i,j, \cdots} = \left| {{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| \hspace{4.2cm} (3)
\end{equation}
   $$

- L1Loss is used primarily in regression tasks and is most useful when the dataset contains outliers: it is more robust to extreme values than the L2 (MSE) loss, because the magnitude of its gradient does not grow with the size of the error. It is applied, as we will see, in tasks like image reconstruction, or whenever errors should scale linearly rather than quadratically, which matters most when the errors are large in magnitude.

Next, let us see how this loss is used in PyTorch:
</span>

In [70]:
# Generate random true and predicted quantities. Each output has shape (3, 5, 7),
# and the batch size is 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))

In [71]:
# nn.L1Loss uses reduction='mean' by default, which gives the scalar MAE loss of equation (1):
loss = nn.L1Loss()
print(loss(y_pred, y_true).item())
# This matches a direct evaluation of equation (1):
print(torch.abs(y_pred - y_true).mean().item())

1.1848663091659546
1.1848663091659546


In [72]:
# To obtain the non-averaged case, we change the argument reduction from its default 'mean' to 'sum':
loss = nn.L1Loss(reduction='sum')
print(loss(y_pred, y_true).item())
# This matches a direct evaluation of equation (2):
print(torch.abs(y_pred - y_true).sum().item())

1244.109619140625
1244.109619140625
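
Equations (1) and (2) differ only by the factor $\frac{1}{N}$, so the two reductions can be converted into each other. The following is a minimal sketch of that relation; the seed and the names `mae` and `sae` are illustrative choices, not part of the cells above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, used only to make this sketch reproducible
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))

mae = nn.L1Loss(reduction='mean')(y_pred, y_true)  # equation (1)
sae = nn.L1Loss(reduction='sum')(y_pred, y_true)   # equation (2)

N = y_pred.numel()  # total number of elements: 10 * 3 * 5 * 7
print(N)
# Up to floating-point rounding, the summed loss is N times the mean loss
print(torch.allclose(sae, N * mae, rtol=1e-4))
```

Because the summed loss grows with $N$, its scale depends on the batch size, whereas the mean does not.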


In [73]:
# To obtain the tensor-valued loss function, we change the argument reduction from its default 'mean' to 'none':
loss = nn.L1Loss(reduction='none')
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# This matches a direct, element-wise evaluation of equation (3):
print(torch.all(torch.abs(y_pred - y_true) == loss_tensor))

torch.Size([10, 3, 5, 7])
tensor(True)
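
The robustness of the L1 loss to outliers can be seen in its gradient. The sketch below uses hand-picked values (an assumption made for illustration) in which the last prediction is a deliberate outlier, and compares the gradients of the mean-reduced L1 and MSE losses with respect to the predictions.

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
# The last prediction is a deliberate outlier (its error is 10)
y_pred = torch.tensor([1.5, 1.5, 3.5, 14.0], requires_grad=True)

# Gradient of each mean-reduced loss with respect to the predictions
(grad_l1,) = torch.autograd.grad(nn.L1Loss()(y_pred, y_true), y_pred)
(grad_mse,) = torch.autograd.grad(nn.MSELoss()(y_pred, y_true), y_pred)

print(grad_l1)   # sign(error) / N: the same magnitude for every element
print(grad_mse)  # 2 * error / N: proportional to the error
```

With these values, the three small errors receive the same gradient under both losses, while the outlier receives a gradient 20 times larger under the MSE loss than under the L1 loss.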


## **MSELoss**

<span style="font-size: 15px;">

- For an output of arbitrary shape, MSELoss can define the following three types of loss functions:

   1. Scalar loss: the mean squared error (MSE) over all elements of the true and predicted tensors:
   $$
\begin{equation} 
L = \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} \left( {{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right)^2 \hspace{3cm} (1)
\end{equation}
   $$

        where $b$ denotes the sample number in the considered batch of size $B$, $N$ is the total number of elements in the output batch, i.e., $N = \texttt{y\_pred.numel()} = \texttt{y\_true.numel()}$.


   2. Scalar loss: the summed squared error (**without taking the mean**) over all elements of the true and predicted tensors:
   $$
\begin{equation} 
L = \sum_{b=0}^{B-1} \sum_{i,j, \cdots} \left( {{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right)^2 \hspace{3.5cm} (2)
\end{equation}
   $$
   3. Tensor-valued loss function of the same shape as the output quantity:
      $$
\begin{equation} 
{L}^{b, i,j, \cdots} = \left( {{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right)^2 \hspace{4.2cm} (3)
\end{equation}
   $$



- MSELoss is used primarily in regression tasks where large deviations should be penalized more heavily: the squared error grows quadratically with the difference between predicted and true values. This makes it suitable when the model should prioritize reducing large mistakes, and its gradient is smooth everywhere, which facilitates optimization.

Next, let us see how this loss is used in PyTorch:
</span>

In [64]:
# Generate random true and predicted quantities. Each output has shape (3, 5, 7),
# and the batch size is 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))

In [67]:
# nn.MSELoss uses reduction='mean' by default, which gives the scalar MSE loss of equation (1):
loss = nn.MSELoss()
print(loss(y_pred, y_true).item())
# This matches a direct evaluation of equation (1):
print(((y_pred - y_true)**2).mean().item())

1.864696979522705
1.864696979522705


In [68]:
# To obtain the non-averaged case, we change the argument reduction from its default 'mean' to 'sum':
loss = nn.MSELoss(reduction='sum')
print(loss(y_pred, y_true).item())
# This matches a direct evaluation of equation (2):
print(((y_pred - y_true)**2).sum().item())

1957.931884765625
1957.931884765625


In [69]:
# To obtain the tensor-valued loss function, we change the argument reduction from its default 'mean' to 'none':
loss = nn.MSELoss(reduction='none')
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# This matches a direct, element-wise evaluation of equation (3):
print(torch.all((y_pred - y_true)**2 == loss_tensor))

torch.Size([10, 3, 5, 7])
tensor(True)
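
One use of the tensor-valued form (3) is to obtain one loss value per sample, for example to find the worst-predicted samples in a batch. The sketch below assumes freshly generated random tensors and averages the element-wise loss over all non-batch dimensions; the names `elementwise` and `per_sample` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, used only to make this sketch reproducible
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))

# Element-wise squared errors, equation (3): shape (10, 3, 5, 7)
elementwise = nn.MSELoss(reduction='none')(y_pred, y_true)

# Average over every dimension except the batch dimension: one value per sample
per_sample = elementwise.mean(dim=(1, 2, 3))
print(per_sample.shape)

# Index of the sample with the largest loss in this batch
print(per_sample.argmax().item())

# All samples have the same number of elements, so averaging the per-sample
# losses recovers the mean-reduced loss of equation (1) up to rounding
overall = nn.MSELoss()(y_pred, y_true)
print(torch.allclose(per_sample.mean(), overall, rtol=1e-4))
```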


## **HuberLoss**

<span style="font-size: 15px;">

- For an output of arbitrary shape, HuberLoss takes the predicted and true quantities together with an additional positive threshold $\delta$ and can define the following three types of loss functions:

   1. Scalar loss: the mean Huber loss over all elements of the true and predicted tensors:
   $$
\begin{aligned} 
L_\delta &= \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} \Biggl\{ 
\frac{1}{2}\times \left({{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right)^2 
\times
\Theta\left( \delta - \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| \right)
\\
&+
\delta\times \left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \frac{1}{2} \delta \right)
\times
\Theta\left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \delta  \right)
\Biggr\}
\hspace{3cm}(1)
\end{aligned}
   $$

        where $b$ denotes the sample number in the considered batch of size $B$, $N$ is the total number of elements in the output batch, i.e., $N = \texttt{y\_pred.numel()} = \texttt{y\_true.numel()}$. Moreover, $\Theta$ is the Heaviside function, defined as follows:
 
   $$
\begin{equation}
\Theta(x) =
\begin{cases}
1, & x > 0 \\
\frac{1}{2}, & x = 0 \\
0, & x < 0
\end{cases}
\hspace{3cm}(2)
\end{equation}
   $$      


   2. Scalar loss function without dividing by the number of elements. This is equation (1) above without the factor $\frac{1}{N}$.
   3. Tensor-valued loss function of the same shape as the output quantity:
   $$
\begin{aligned} 
{L}^{b, i,j, \cdots}_\delta &=
\frac{1}{2}\times \left({{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right)^2 
\times
\Theta\left( \delta - \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| \right)
\\
&+
\delta\times \left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \frac{1}{2} \delta \right)
\times
\Theta\left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \delta  \right)
\hspace{3cm}(3)
\end{aligned}
   $$

- The Huber loss is commonly used in regression tasks where robustness to outliers is important. For errors smaller than $\delta$ it is quadratic, like the squared error, which gives smooth gradients and stable optimization; for larger errors it switches to a linear penalty, which reduces the influence of outliers. This combination makes the Huber loss well suited to problems where most targets are well-behaved but occasional large deviations or noisy labels are present.

Next, let us see how this loss is used in PyTorch:
</span>

In [75]:
# Generate random true and predicted quantities. Each output has shape (3, 5, 7),
# and the batch size is 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))
# Choose a value for the threshold delta:
delta = 2

In [82]:
# nn.HuberLoss uses reduction='mean' by default, which gives the scalar Huber loss of equation (1):
loss = nn.HuberLoss(delta=delta)
print(loss(y_pred, y_true).item())
# This matches a direct evaluation of equation (1), which we compute as follows:
error = y_pred - y_true
huber_loss = torch.where(error.abs() < delta, 0.5 * error**2, delta * (error.abs() - 0.5 * delta))
print(huber_loss.mean().item())

0.898237943649292
0.898237943649292
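
The Heaviside form of the Huber loss can also be written out literally with `torch.heaviside`. This is a sketch under two assumptions: the input values are hand-picked for illustration, and $\Theta(0)$ is taken to be $\frac{1}{2}$, so that an error exactly equal to $\delta$ receives half of each branch.

```python
import torch
import torch.nn as nn

delta = 2.0
y_true = torch.zeros(5)
# Hand-picked predictions; the second one has |error| exactly equal to delta
y_pred = torch.tensor([-10.0, -2.0, 0.5, 1.0, 3.0])

abs_err = (y_true - y_pred).abs()
theta_at_zero = torch.tensor([0.5])  # value of Theta at 0 (assumption of this sketch)

# Quadratic branch, switched on where |error| < delta
quadratic = 0.5 * abs_err**2 * torch.heaviside(delta - abs_err, theta_at_zero)
# Linear branch, switched on where |error| > delta
linear = delta * (abs_err - 0.5 * delta) * torch.heaviside(abs_err - delta, theta_at_zero)
huber_manual = quadratic + linear

huber_torch = nn.HuberLoss(reduction='none', delta=delta)(y_pred, y_true)
print(huber_manual)
print(torch.allclose(huber_manual, huber_torch))
```

Both branches take the value $\frac{1}{2}\delta^2$ at $|{\rm error}| = \delta$, which is why the `torch.where` version above gives the same result whether the comparison is strict or not.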


In [83]:
# To obtain the non-averaged case, we change the argument reduction from its default 'mean' to 'sum':
loss = nn.HuberLoss(reduction='sum', delta=delta)
print(loss(y_pred, y_true).item())
# We can also obtain it manually as follows:
print(huber_loss.sum().item())

943.1498413085938
943.1498413085938


In [85]:
# To obtain the tensor-valued loss function, we change the argument reduction from its default 'mean' to 'none':
loss = nn.HuberLoss(reduction='none', delta=delta)
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# This matches the element-wise evaluation of equation (3) computed above:
print(torch.all(huber_loss == loss_tensor))

torch.Size([10, 3, 5, 7])
tensor(True)
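
The reduced influence of outliers shows up directly in the gradient of the Huber loss: with `reduction='sum'`, the gradient with respect to each prediction is the error clipped to the interval $[-\delta, \delta]$. The sketch below checks this on hand-picked values, which are an assumption made for illustration.

```python
import torch
import torch.nn as nn

delta = 2.0
y_true = torch.zeros(4)
# Two small errors (0.5 and 1.0) and two large ones (-10.0 and 3.0)
y_pred = torch.tensor([-10.0, 0.5, 1.0, 3.0], requires_grad=True)

loss = nn.HuberLoss(reduction='sum', delta=delta)(y_pred, y_true)
(grad,) = torch.autograd.grad(loss, y_pred)

error = (y_pred - y_true).detach()
print(grad)
# The gradient equals the error clipped to [-delta, delta]
print(torch.allclose(grad, error.clamp(-delta, delta)))
```

Small errors therefore behave as under the squared error, while no single outlier can contribute a gradient larger than $\delta$ in magnitude.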


## **SmoothL1Loss**

<span style="font-size: 15px;">

- For an output of arbitrary shape, SmoothL1Loss takes the predicted and true quantities together with an additional threshold $\beta$ and can define the following three types of loss functions:

   1. Scalar loss: the mean SmoothL1 loss over all elements of the true and predicted tensors:
   $$
\begin{aligned} 
L_\beta &= \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} \Biggl\{ 
\frac{1}{2 \beta}\times \left({{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right)^2 
\times
\Theta\left( \beta - \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| \right)
\\
&+
\left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \frac{1}{2} \beta \right)
\times
\Theta\left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \beta  \right)
\Biggr\}
\hspace{3cm}(1)
\end{aligned}
   $$

        where $b$ denotes the sample number in the considered batch of size $B$, $N$ is the total number of elements in the output batch, i.e., $N = \texttt{y\_pred.numel()} = \texttt{y\_true.numel()}$. Moreover, $\Theta$ is the Heaviside function, defined as follows:
 
   $$
\begin{equation}
\Theta(x) =
\begin{cases}
1, & x > 0 \\
\frac{1}{2}, & x = 0 \\
0, & x < 0
\end{cases}
\hspace{3cm}(2)
\end{equation}
   $$      


   2. Scalar loss function without dividing by the number of elements. This is equation (1) above without the factor $\frac{1}{N}$.
   3. Tensor-valued loss function of the same shape as the output quantity:
   $$
\begin{aligned} 
{L}^{b, i,j, \cdots}_\beta &=
\frac{1}{2 \beta}\times \left({{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right)^2 
\times
\Theta\left( \beta - \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| \right)
\\
&+
\left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \frac{1}{2} \beta \right)
\times
\Theta\left( \left|{{y^b}_{\rm true}}^{i,j, \cdots} - {{y^b}_{\rm pred}}^{i,j, \cdots} \right| - \beta  \right)
\hspace{3cm}(3)
\end{aligned}
   $$

- The SmoothL1 loss is commonly used in regression tasks where robustness to outliers is important. For errors smaller than $\beta$ it is quadratic, like the squared error scaled by $\frac{1}{2\beta}$, which gives smooth gradients and stable optimization; for larger errors it switches to a linear penalty with slope 1, which reduces the influence of outliers. It is closely related to the Huber loss: for $\delta = \beta$, the SmoothL1 loss equals the Huber loss divided by $\beta$. This combination makes the SmoothL1 loss well suited to problems where most targets are well-behaved but occasional large deviations or noisy labels are present.

Next, let us see how this loss is used in PyTorch:
</span>

In [88]:
# Generate random true and predicted quantities. Each output has shape (3, 5, 7),
# and the batch size is 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))
# Choose a value for the threshold beta:
beta = 3

In [98]:
# nn.SmoothL1Loss uses reduction='mean' by default, which gives the scalar SmoothL1 loss of equation (1):
loss = nn.SmoothL1Loss(beta=beta)
print(loss(y_pred, y_true).item())
# This matches a direct evaluation of equation (1), which we compute as follows:
error = y_pred - y_true
SmoothL1_Loss = torch.where(error.abs() < beta, 0.5 * error**2 / beta, error.abs() - 0.5 * beta)
print(SmoothL1_Loss.mean().item())

0.36600905656814575
0.36600905656814575


In [101]:
# To obtain the non-averaged case, we change the argument reduction from its default 'mean' to 'sum':
loss = nn.SmoothL1Loss(reduction='sum', beta=beta)
print(loss(y_pred, y_true).item())
# We can also obtain it manually as follows:
print(SmoothL1_Loss.sum().item())

384.30950927734375
384.30950927734375


In [103]:
# To obtain the tensor-valued loss function, we change the argument reduction from its default 'mean' to 'none':
loss = nn.SmoothL1Loss(reduction='none', beta=beta)
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# This matches the element-wise evaluation of equation (3) computed above:
print(torch.all(SmoothL1_Loss == loss_tensor))

torch.Size([10, 3, 5, 7])
tensor(True)
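
Comparing the SmoothL1 and Huber formulas term by term suggests that, for $\delta = \beta$, the SmoothL1 loss is the Huber loss divided by $\beta$. The following minimal sketch checks this numerically, assuming freshly generated random tensors; the seed and variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, used only to make this sketch reproducible
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))
beta = 3.0

smooth_l1 = nn.SmoothL1Loss(beta=beta)(y_pred, y_true)
huber = nn.HuberLoss(delta=beta)(y_pred, y_true)

# SmoothL1 with threshold beta equals Huber with delta = beta, divided by beta
print(torch.allclose(smooth_l1, huber / beta))

# For beta = delta = 1 the two losses coincide
smooth_l1_unit = nn.SmoothL1Loss(beta=1.0)(y_pred, y_true)
huber_unit = nn.HuberLoss(delta=1.0)(y_pred, y_true)
print(torch.allclose(smooth_l1_unit, huber_unit))
```

The practical difference lies in the linear regime: the slope of the SmoothL1 loss is always 1, whereas the slope of the Huber loss is $\delta$.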
