In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision as tv
from PIL import Image
import numpy as np
torch.manual_seed(42)

<torch._C.Generator at 0x117db1250>

<span style="font-size: 15px;">If we build a model that predicts an output quantity from some input, we need a way to **measure how the predicted quantity differs from the true one (the target)**. Let $y_{\rm true}$ be the true value for a given input, and $y_{\rm pred}$ the corresponding prediction from our model. The **loss function** $L$ quantifies the difference between $y_{\rm true}$ and $y_{\rm pred}$. When defining a loss function, the **type and shape of the output** play a crucial role.

Let us now investigate the various loss functions available in PyTorch. In what follows:

1. The index $b$ always denotes the sample number in a considered batch of size $B$.
2. $N$ is the number of elements across the *entire predicted tensor* (batch + all other dimensions). It can be obtained through, e.g., $N = \texttt{y\_pred.numel()}$.
</span>
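As a small illustration of the two conventions above, the following sketch checks that $N$ equals `numel()` and that the `'mean'` reduction is the `'sum'` reduction divided by $N$ (not by the batch size $B$). The tensor shapes are hypothetical and chosen only for this example; `nn.L1Loss` is used merely as a convenient representative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shapes chosen only for illustration: batch size B = 4, per-sample shape (2, 3)
y_pred = torch.randn(4, 2, 3)
y_true = torch.randn(4, 2, 3)

B = y_pred.shape[0]
N = y_pred.numel()  # total number of elements: 4 * 2 * 3 = 24

loss_none = nn.L1Loss(reduction='none')(y_pred, y_true)  # tensor of shape (4, 2, 3)
loss_sum = nn.L1Loss(reduction='sum')(y_pred, y_true)    # scalar
loss_mean = nn.L1Loss(reduction='mean')(y_pred, y_true)  # scalar

# 'mean' is 'sum' divided by N, the number of all elements
mean_matches = torch.allclose(loss_mean, loss_sum / N)
print(B, N, loss_none.shape, mean_matches)
```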

# Continuous-valued outputs

**Overview**

This table summarizes common regression loss functions in PyTorch, their key properties, and typical use cases.

| Loss Function | Basic Properties | Best Used For | Not Recommended For |
|---------------|------------------|---------------|---------------------|
| **L1Loss** | Computes mean absolute error; robust to outliers; gradient has constant magnitude (the sign of the error) | Datasets with outliers; when you want less sensitivity to extreme values | When outliers contain important information; when a smooth optimization path is needed (the gradient is discontinuous at zero error) |
| **MSELoss** | Computes mean squared error; penalizes large errors heavily; smooth gradients | General regression tasks; normally distributed errors; when large errors should be heavily penalized | Data with many outliers (very sensitive); when all errors should be treated equally |
| **HuberLoss** | Combines L1 and L2 loss; quadratic for small errors, linear for large ones; controlled by delta parameter | Datasets with moderate outliers; balance between MSE smoothness and L1 robustness | When you need pure L1 or L2 behavior; computational efficiency is critical |
| **SmoothL1Loss** | Similar to Huber but with different transition; smooth everywhere; less sensitive to outliers than MSE | Object detection and localization tasks; bounding box regression | Pure regression without geometric constraints; when outliers are extremely rare |
| **GaussianNLLLoss** | Models heteroscedastic uncertainty; predicts both mean and variance; negative log-likelihood for Gaussian distribution | Uncertainty estimation; when prediction confidence matters; heteroscedastic noise | Simple point predictions; when variance is constant; classification tasks |
| **PoissonNLLLoss** | Based on Poisson distribution; for count data; handles non-negative integer outputs | Count prediction (events, items); rare event modeling; rate estimation | Continuous unbounded data; negative values; binary outcomes |
| **KLDivLoss** | Measures divergence between probability distributions; asymmetric; expects log-probabilities as input | Distillation; distribution matching; variational inference; comparing probability distributions | Direct regression; when inputs aren't probability distributions; symmetric distance needed |

Detailed explanations of each loss function, including their mathematical formulations and implementation examples, follow below.

## **L1Loss**

<span style="font-size: 15px;">

The L1Loss is primarily used in regression tasks and most useful when the dataset contains outliers, as it is more robust to extreme values than L2 (MSE) loss, providing stable gradients for noisy data. It is applied, as we will see, in tasks like image reconstruction or when we want errors to scale linearly, not quadratically, which can be of big importance when the errors are large in magnitude.

- Consider an output of arbitrary shape $y_{\rm pred}$ and a target of the same shape $y_{\rm true}$, then the L1Loss defines the following loss tensor:


   $$
  \begin{aligned}
{L_{\rm L1}}^{b, i,j, \cdots} = \left| {y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right|
  \end{aligned}
  $$

  
- Moreover, the L1Loss can define the following two scalar loss functions:
   1. Scalar loss function for the mean absolute error (MAE) between each element in the true and predicted tensors:
   $$
\begin{aligned} 
L = \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} \, {L_{\rm L1}}^{b, i,j, \cdots}
\end{aligned}
   $$



   2. Scalar loss function for the absolute error (**without taking the mean**) between each element in the true and predicted tensors:
   $$
\begin{aligned} 
L = \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{\rm L1}}^{b, i,j, \cdots}
\end{aligned}
   $$

Next, let us investigate how this function can be used in PyTorch:
</span>

In [5]:
# Let us generate some true and predicted quantities randomly. For that purpose, we take the output quantities to be of shape (3, 5, 7)
# and we take a batch size of 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))
# The corresponding loss tensor is obtained through:
L1_LossTensor = torch.abs(y_pred - y_true)

In [7]:
# The init method of nn.L1Loss contains the default argument: reduction='mean'
# This choice generates for us the MAE scalar loss function as explained above. The corresponding loss is:
loss = nn.L1Loss()
print(loss(y_pred, y_true).item())
# This gives the same result as applying the MAE formula above directly, i.e., taking the mean of L1_LossTensor:
print(L1_LossTensor.mean().item())

1.134202480316162
1.134202480316162


In [8]:
# To obtain the non-averaged case, we have to change the default argument reduction from 'mean' to 'sum', i.e.,
loss = nn.L1Loss(reduction='sum')
print(loss(y_pred, y_true).item())
# This gives us the same as if we would just sum all elements of L1_LossTensor:
print(L1_LossTensor.sum().item())

1190.91259765625
1190.91259765625


In [9]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.L1Loss(reduction='none')
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# To see that the resulting loss tensor is identical to L1_LossTensor, we subtract the two tensors from each other and check whether we get a zero tensor:
print(torch.all(L1_LossTensor - loss_tensor == torch.zeros(y_true.shape)))

torch.Size([10, 3, 5, 7])
tensor(True)
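To make the robustness claim from the beginning of this section concrete, here is a minimal sketch comparing how a single outlier affects `nn.L1Loss` and `nn.MSELoss`. The numbers are made up for illustration; only the relative growth of the two losses matters.

```python
import torch
import torch.nn as nn

# Hypothetical targets and predictions; the last entry of y_outlier is a deliberate outlier
y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_clean = torch.tensor([1.1, 1.9, 3.2, 3.8])
y_outlier = torch.tensor([1.1, 1.9, 3.2, 14.0])

l1, mse = nn.L1Loss(), nn.MSELoss()

l1_clean, l1_out = l1(y_clean, y_true).item(), l1(y_outlier, y_true).item()
mse_clean, mse_out = mse(y_clean, y_true).item(), mse(y_outlier, y_true).item()

# The single outlier (an error of 10) inflates the MSE far more than the L1 loss
print(f"L1 : {l1_clean:.3f} -> {l1_out:.3f}")
print(f"MSE: {mse_clean:.3f} -> {mse_out:.3f}")
```

The L1 loss grows in proportion to the error, whereas the MSE grows with its square, so one bad sample dominates the MSE average.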


## **MSELoss**

<span style="font-size: 15px;">

The MSE (also called L2) is primarily used in regression tasks where large deviations should be penalized more heavily, because the squared error grows faster for bigger differences between predicted and true values. This property makes it suitable when we want the model to prioritize reducing large mistakes, and it provides smooth gradients that facilitate optimization.

- Consider an output of arbitrary shape $y_{\rm pred}$ and a target of the same shape $y_{\rm true}$, then the MSELoss defines the following loss tensor:
  
   $$
\begin{aligned} 
{L_{\rm MSE}}^{b, i,j, \cdots} = \left( {y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right)^2
\end{aligned}
   $$
- Moreover, the MSELoss can define the following two scalar loss functions:
   1. Scalar loss function for the mean squared error between each element in the true and predicted tensors:
   $$
\begin{aligned} 
L = \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{\rm MSE}}^{b, i,j, \cdots}
\end{aligned}
   $$
   2. Scalar loss function for the squared error (**without taking the mean**) between each element in the true and predicted tensors:
   $$
\begin{aligned} 
L = \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{\rm MSE}}^{b, i,j, \cdots}
\end{aligned}
   $$






Next, let us investigate how this function can be used in PyTorch:
</span>

In [10]:
# Let us generate some true and predicted quantities randomly. For that purpose, we take the output quantities to be of shape (3, 5, 7)
# and we take a batch size of 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))
# The corresponding loss tensor is obtained through:
L2_LossTensor = (y_pred - y_true)**2

In [12]:
# The init method of nn.MSELoss contains the following default argument: reduction='mean'
# This choice generates for us the mean squared error. The corresponding loss is:
loss = nn.MSELoss()
print(loss(y_pred, y_true).item())
# This gives the same result as taking the mean of L2_LossTensor:
print( L2_LossTensor.mean().item() )

1.8992314338684082
1.8992314338684082


In [13]:
# To obtain the non-averaged case, we have to change the default argument reduction from 'mean' to 'sum', i.e.,
loss = nn.MSELoss(reduction='sum')
print(loss(y_pred, y_true).item())
# This gives the same result as summing all elements of L2_LossTensor:
print( L2_LossTensor.sum().item() )

1994.1929931640625
1994.1929931640625


In [14]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.MSELoss(reduction='none')
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# To see that the resulting loss tensor is identical to L2_LossTensor, we subtract the two tensors from each other and check whether we get a zero tensor:
print(torch.all(L2_LossTensor - loss_tensor == torch.zeros(y_true.shape)))

torch.Size([10, 3, 5, 7])
tensor(True)
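The statement that the squared error penalizes large deviations more heavily can also be seen in the gradients. The sketch below, with made-up error values, compares the gradients of the summed MSE and L1 losses with respect to the predictions using autograd.

```python
import torch
import torch.nn.functional as F

# Hypothetical predictions with errors of increasing size relative to a zero target
y_true = torch.zeros(3)
pred_mse = torch.tensor([0.1, 1.0, 10.0], requires_grad=True)
pred_l1 = torch.tensor([0.1, 1.0, 10.0], requires_grad=True)

# reduction='sum' keeps the per-element gradients free of the 1/N factor
F.mse_loss(pred_mse, y_true, reduction='sum').backward()
F.l1_loss(pred_l1, y_true, reduction='sum').backward()

print(pred_mse.grad)  # 2 * error: grows with the size of the error
print(pred_l1.grad)   # sign(error): constant magnitude
```

With MSE, the element with the largest error receives the largest gradient, which is why the optimizer prioritizes reducing large mistakes.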


## **HuberLoss**

<span style="font-size: 15px;">

The Huber loss is commonly used in regression tasks where robustness to outliers is important. It behaves like the mean squared error for small prediction errors, providing smooth gradients and stable optimization, while transitioning to a linear penalty for large errors, which reduces the influence of outliers. This combination makes the Huber loss well suited for problems where most outputs are well-behaved but occasional large deviations or noisy labels are present.

- Consider an output of arbitrary shape $y_{\rm pred}$ and a target of the same shape $y_{\rm true}$. The HuberLoss takes an additional parameter $\delta$ and defines the following loss tensor:
  
   $$
\begin{aligned} 
{L_h}^{b, i,j, \cdots}_\delta &=
\frac{1}{2}\times \left({y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right)^2 
\times
\Theta\left( \delta - \left|{y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right| \right)
\\
&+
\delta\times \left( \left|{y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right| - \frac{1}{2} \delta \right)
\times
\left[ 1 - \Theta\left( \delta - \left|{y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right| \right) \right]
\end{aligned}
   $$
  where $\Theta$ is the Heaviside function, defined as follows:
 
   $$
\begin{aligned}
\Theta(x) =
\begin{cases}
1, & x > 0 \\
0, & \text{otherwise}
\end{cases}
\end{aligned}
   $$

- Moreover, the HuberLoss can define the following two scalar loss functions:
   1. Averaged scalar loss function:
   $$
\begin{aligned} 
L = \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_h}^{b, i,j, \cdots}_\delta
\end{aligned}
   $$
   2. Scalar loss function without averaging:
   $$
\begin{aligned} 
L = \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_h}^{b, i,j, \cdots}_\delta
\end{aligned}
   $$






Next, let us investigate how this function can be used in PyTorch:
</span>

In [17]:
# Let us generate some true and predicted quantities randomly. For that purpose, we take the output quantities to be of shape (3, 5, 7)
# and we take a batch size of 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))
# We now assign a value to delta
delta = 2
# The corresponding loss tensor is obtained through:
error = y_pred - y_true
Huber_LossTensor = torch.where(error.abs() < delta, 0.5 * (error**2), (delta * (error.abs() - 0.5 * delta)))

In [19]:
# The init method of nn.HuberLoss contains the following default argument: reduction='mean'
# This choice generates the averaged scalar loss function as explained above:
loss = nn.HuberLoss(delta=delta)
print(loss(y_pred, y_true).item())
# This gives the same result as taking the mean of Huber_LossTensor:
print(Huber_LossTensor.mean().item())

1.0335075855255127
1.0335075855255127


In [20]:
# To obtain the non-averaged case, we have to change the default argument reduction from 'mean' to 'sum', i.e.,
loss = nn.HuberLoss(reduction='sum', delta=delta)
print(loss(y_pred, y_true).item())
# This is the same as summing all elements of Huber_LossTensor
print(Huber_LossTensor.sum().item())

1085.1829833984375
1085.1829833984375


In [21]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.HuberLoss(reduction='none', delta=delta)
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# To see that the resulting loss tensor is identical to Huber_LossTensor, we subtract the two tensors from each other and check whether we get a zero tensor:
print(torch.all( Huber_LossTensor - loss_tensor == torch.zeros(y_true.shape)) )

torch.Size([10, 3, 5, 7])
tensor(True)
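The role of $\delta$ as the switch between the quadratic and the linear regime can be checked numerically. The following sketch uses random data and an arbitrarily chosen large threshold; it shows that if $\delta$ exceeds every absolute error, the Huber loss reduces to half the MSE, and that a smaller $\delta$ can only lower the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
y_pred = torch.randn(100)
y_true = torch.randn(100)
max_err = (y_pred - y_true).abs().max().item()

# delta above the largest error: every element falls in the quadratic branch
huber_large = nn.HuberLoss(delta=max_err + 1.0)(y_pred, y_true)
half_mse = 0.5 * nn.MSELoss()(y_pred, y_true)

# delta = 1.0 is the default; errors above it are penalized only linearly
huber_default = nn.HuberLoss()(y_pred, y_true)

print(torch.allclose(huber_large, half_mse))
print(huber_default.item() <= half_mse.item())
```

The second check follows from $\frac{1}{2}e^2 - \delta\left(|e| - \frac{1}{2}\delta\right) = \frac{1}{2}\left(|e| - \delta\right)^2 \geq 0$, i.e., the linear branch never exceeds the quadratic one.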


## **SmoothL1Loss**

<span style="font-size: 15px;">

The SmoothL1 loss is closely related to the Huber loss and is likewise used in regression tasks where robustness to outliers is important, for example bounding box regression in object detection. It is quadratic for small prediction errors, providing smooth gradients and stable optimization, and transitions to a linear penalty for large errors, which reduces the influence of outliers. The threshold between the two regimes is set by the parameter $\beta$; for $\delta = \beta$, the SmoothL1 loss equals the Huber loss divided by $\beta$.

- Consider an output of arbitrary shape $y_{\rm pred}$ and a target of the same shape $y_{\rm true}$. The SmoothL1Loss takes an additional parameter $\beta$ and defines the following loss tensor:
  
   $$
\begin{aligned} 
{L_s}^{b, i,j, \cdots}_\beta &=
\frac{1}{2 \beta}\times \left({y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right)^2 
\times
\Theta\left( \beta - \left|{y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right| \right)
\\
&+
\left( \left|{y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right| - \frac{1}{2} \beta \right)
\times
\left[ 1 - \Theta\left( \beta - \left|{y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right| \right) \right]
\end{aligned}
   $$
    where $\Theta$ is the Heaviside function explained in the previous section.

- Moreover, the SmoothL1Loss can define the following two scalar loss functions:
   1. Averaged scalar loss function:
   $$
\begin{aligned} 
L = \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_s}^{b, i,j, \cdots}_\beta
\end{aligned}
   $$
   2. Scalar loss function without averaging:
   $$
\begin{aligned} 
L = \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_s}^{b, i,j, \cdots}_\beta
\end{aligned}
   $$

Next, let us investigate how this function can be used in PyTorch:
</span>

In [22]:
# Let us generate some true and predicted quantities randomly. For that purpose, we take the output quantities to be of shape (3, 5, 7)
# and we take a batch size of 10:
y_pred = torch.randn((10, 3, 5, 7))
y_true = torch.randn((10, 3, 5, 7))
# We now assign a value to beta
beta = 3
# The corresponding loss tensor is obtained through:
error = y_pred - y_true
SmoothL1_LossTensor = torch.where(error.abs() < beta, 0.5 * (error**2)/beta, (error.abs() - 0.5 * beta))

In [23]:
# The init method of nn.SmoothL1Loss contains the following default argument: reduction='mean'
# This choice generates the averaged scalar loss function as explained above:
loss = nn.SmoothL1Loss(beta=beta)
print(loss(y_pred, y_true).item())
# This gives the same result as taking the mean of SmoothL1_LossTensor:
print(SmoothL1_LossTensor.mean().item())

0.3367563486099243
0.3367563486099243


In [24]:
# To obtain the non-averaged case, we have to change the default argument reduction from 'mean' to 'sum', i.e.,
loss = nn.SmoothL1Loss(reduction='sum', beta=beta)
print(loss(y_pred, y_true).item())
# This is the same as summing all elements of SmoothL1_LossTensor
print(SmoothL1_LossTensor.sum().item())

353.59417724609375
353.59417724609375


In [25]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.SmoothL1Loss(reduction='none', beta=beta)
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# To see that the resulting loss tensor is identical to SmoothL1_LossTensor, we subtract the two tensors from each other and check whether we get a zero tensor:
print(torch.all( SmoothL1_LossTensor - loss_tensor == torch.zeros(y_true.shape)) )

torch.Size([10, 3, 5, 7])
tensor(True)
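Comparing the SmoothL1 formula with the Huber formula from the previous section suggests that, for $\delta = \beta$, the two differ only by a factor of $1/\beta$. The sketch below checks this relation numerically; the value of `beta` is an arbitrary choice made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
y_pred = torch.randn(10, 3, 5, 7)
y_true = torch.randn(10, 3, 5, 7)
beta = 0.5  # illustrative value; delta is set equal to beta below

smooth_l1 = nn.SmoothL1Loss(beta=beta, reduction='none')(y_pred, y_true)
huber = nn.HuberLoss(delta=beta, reduction='none')(y_pred, y_true)

# Element-wise, SmoothL1 with beta equals Huber with delta = beta, divided by beta
same = torch.allclose(smooth_l1, huber / beta)
print(same)
```

In particular, for `beta = delta = 1` the two losses coincide.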


## **GaussianNLLLoss**

<span style="font-size: 15px;">


- If our target is expected to be sampled from Gaussian distributions and we build a neural network that predicts the **mean** and the **variance**, then we can use the GaussianNLLLoss to compute the loss on the mean and the variance at the same time. To use the GaussianNLLLoss function we need the following components:
     1. The target tensor ${{y}_{\rm true}}$ of shape $(B, *)$, where $*$ denotes some arbitrary shape.
     2. The mean tensor ${{y}_{\rm pred}}$, predicted by the neural network, of shape $(B, *)$.
     3. The variance tensor ${{y}_{\rm var}}$, predicted by the neural network, of shape $(B, *)$.
- It is worth mentioning that in PyTorch both the **target** and the **variance** can use broadcasting (a dimension of size 1 is repeated to match the shape of the mean).

- The GaussianNLLLoss can give us the following tensor-valued error:

   $$
\begin{aligned} 
{L_g}^{b, i,j, \cdots} &=   \frac{1}{2}\log\left({\rm max}\left({y_{\rm var}}^{b,i,j, \cdots},\, \epsilon  \right) \right) + \frac{\left( {y_{\rm true}}^{b,i,j, \cdots} - {y_{\rm pred}}^{b,i,j, \cdots} \right)^2}{2 {\rm \, max}\left({y_{\rm var}}^{b,i,j, \cdots},\, \epsilon  \right)}
\end{aligned}
   $$

     where $\epsilon$ is a small quantity needed for numerical stability. Moreover, it can also give us the following two scalar loss functions:

   1. Averaged scalar loss function:


   $$
\begin{aligned} 
L &=  \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_g}^{b, i,j, \cdots}
\end{aligned}
   $$

   2. Scalar loss function without averaging:

   $$
\begin{aligned} 
L &=   \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_g}^{b, i,j, \cdots}
\end{aligned}
   $$

Next, let us investigate how this function can be used in PyTorch:
</span>

In [27]:
# Let us generate some true and predicted quantities randomly. For that purpose, we take the output quantities (i.e., the means) to be of shape (3, 5, 7)
# and we take a batch size of 10:
y_pred = torch.randn((10, 3, 5, 7))
y_var = torch.abs(torch.randn((10, 3, 5, 7)))
y_true = torch.randn((10, 3, 5, 7))
# The corresponding loss tensor is obtained through:
error = y_pred - y_true
max_var = torch.clamp(y_var, min=1e-6)
GaussianNLLL_Tensor = 0.5 * (torch.log(max_var) + error**2 / max_var)

In [28]:
# The init method of nn.GaussianNLLLoss contains the following default arguments: full=False, eps=1e-06, reduction='mean'
# To generate the mean scalar loss function we do the following:
loss = nn.GaussianNLLLoss()
print(loss(y_pred, y_true, y_var).item())
# Manually, the above result can be obtained as follows
print(GaussianNLLL_Tensor.mean().item())

8.570807456970215
8.570807456970215


In [29]:
# If we want the scalar loss function without the mean, we have to change the default argument reduction from 'mean' to 'sum', i.e.,
loss = nn.GaussianNLLLoss(reduction='sum')
print(loss(y_pred, y_true, y_var).item())
# Or we can also obtain it manually as follows:
print(GaussianNLLL_Tensor.sum().item())

8999.34765625
8999.34765625


In [30]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.GaussianNLLLoss(reduction='none')
loss_tensor = loss(y_pred, y_true, y_var)
print(loss_tensor.shape)
# To see that the resulting loss tensor is identical to GaussianNLLL_Tensor, we subtract the two tensors from each other and check whether we get a zero tensor:
print(torch.all( GaussianNLLL_Tensor - loss_tensor == torch.zeros(y_true.shape)) )

torch.Size([10, 3, 5, 7])
tensor(True)
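The broadcasting of the variance mentioned in this section can be sketched as follows. The setup is hypothetical: one variance value is shared along the last dimension, so `y_var` has a final dimension of size 1, and the variances are kept well above `eps` so that the clamp has no effect.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
y_pred = torch.randn(10, 3, 5, 7)      # predicted means
y_true = torch.randn(10, 3, 5, 7)      # targets
# Hypothetical setup: one variance shared along the last dimension
y_var = torch.rand(10, 3, 5, 1) + 0.1  # strictly positive, last dimension of size 1

loss = nn.GaussianNLLLoss(reduction='none')
loss_tensor = loss(y_pred, y_true, y_var)

# Manual computation; ordinary broadcasting expands y_var to the shape of y_pred
manual = 0.5 * (torch.log(y_var) + (y_pred - y_true) ** 2 / y_var)

print(loss_tensor.shape)
print(torch.allclose(loss_tensor, manual))
```

Sharing a variance across a dimension in this way corresponds to assuming the same noise level for all elements along that dimension.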


## **PoissonNLLLoss**

<span style="font-size: 15px;">


- If our target is expected to be sampled from a Poisson distribution (non-negative count data) and we build a neural network that predicts the rate parameter $\lambda$ (the average number of events expected to occur in a fixed interval of time or space), then we can use the PoissonNLLLoss to compute the loss of this prediction. To use the PoissonNLLLoss function we need the following components:
     1. The target tensor $y_{\rm true}$ of arbitrary shape.
     2. The tensor $\lambda$ predicted by the neural network, which should have the same shape as the target tensor, i.e., each element in the target has its own prediction. By default (`log_input=True`) PyTorch interprets this prediction as the logarithm of the rate; with `log_input=False` it is interpreted as the rate itself.

- The PoissonNLLLoss can give us the following two tensor-valued errors:

   $$
\begin{aligned} 
{L_{p_1}}^{b, i,j, \cdots} &= \exp\left( \lambda^{b,i,j, \cdots} \right) - \lambda^{b,i,j, \cdots} \times {y_{\rm true}}^{b,i,j, \cdots}
\\
{L_{p_2}}^{b, i,j, \cdots} &= \lambda^{b,i,j, \cdots} - {y_{\rm true}}^{b,i,j, \cdots} \times \log\left(\lambda^{b,i,j, \cdots} + \epsilon \right)
\end{aligned}
   $$

     where $\epsilon$ is a small quantity needed for numerical stability; ${L_{p_1}}$ corresponds to `log_input=True` (the prediction is the log-rate) and ${L_{p_2}}$ to `log_input=False` (the prediction is the rate). Moreover, each of the tensor losses above yields the following two scalar loss functions:

   1. Averaged scalar loss functions:


   $$
\begin{aligned} 
L_1 &=  \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{p_1}}^{b, i,j, \cdots}
\\
L_2 &=  \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{p_2}}^{b, i,j, \cdots}
\end{aligned}
   $$

  
   2. Scalar loss functions without averaging:

   $$
\begin{aligned} 
L_1 &=   \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{p_1}}^{b, i,j, \cdots}
\\
L_2 &=   \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{p_2}}^{b, i,j, \cdots}
\end{aligned}
   $$

Next, let us investigate how this function can be used in PyTorch:
</span>

In [50]:
# Let us generate some true and predicted quantities randomly. For that purpose, we take the output quantities (i.e., the rate parameters) to be of shape (3, 5, 7)
# and we take a batch size of 10:
lmbda = torch.abs(torch.randn((10, 3, 5, 7)))
# Note: random normal targets are used here only to verify the formulas; real Poisson targets are non-negative counts
y_true = torch.randn((10, 3, 5, 7))
# The corresponding loss tensors are obtained through:
epsilon = 1e-8
Lp1 = torch.exp(lmbda) - lmbda * y_true
Lp2 = lmbda - y_true * torch.log(lmbda + epsilon)

In [55]:
# The init method of nn.PoissonNLLLoss contains the following default arguments: log_input=True, full=False, eps=1e-08, reduction='mean'
# To generate the mean scalar loss function L1 we do the following:
loss = nn.PoissonNLLLoss()
print(loss(lmbda, y_true).item())
# Manually, the above result can be obtained as follows
print(Lp1.mean().item())

# To generate the mean scalar loss function L2 we set the log_input argument to False:
loss = nn.PoissonNLLLoss(log_input=False)
print(loss(lmbda, y_true).item())
# Manually, the above result can be obtained as follows
print(Lp2.mean().item())

2.695701837539673
2.695701837539673
0.8175972104072571
0.8175972104072571


In [60]:
# To obtain the scalar loss functions without taking the mean, we set the default argument reduction from 'mean' to 'sum', i.e.:
loss = nn.PoissonNLLLoss(reduction='sum')
print(loss(lmbda, y_true).item())
# Manually, the above result can be obtained as follows
print(Lp1.sum().item())

# To generate L2 we set the log_input argument to False:
loss = nn.PoissonNLLLoss(log_input=False, reduction='sum')
print(loss(lmbda, y_true).item())
# Manually, the above result can be obtained as follows
print(Lp2.sum().item())

2830.48681640625
2830.48681640625
858.47705078125
858.47705078125


In [65]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.PoissonNLLLoss(reduction='none')
loss_tensor = loss(lmbda, y_true)
print(loss_tensor.shape)
# To see that the resulting loss tensor is identical to Lp1, we subtract the two tensors from each other and check whether we get a zero tensor:
print(torch.all(Lp1 - loss_tensor == torch.zeros(y_true.shape)) )


loss = nn.PoissonNLLLoss(reduction='none', log_input=False)
loss_tensor = loss(lmbda, y_true)
print(loss_tensor.shape)
# To see that the resulting loss tensor is identical to Lp2, we subtract the two tensors from each other and check whether we get a zero tensor:
print(torch.all(Lp2 - loss_tensor == torch.zeros(y_true.shape)) )

torch.Size([10, 3, 5, 7])
tensor(True)
torch.Size([10, 3, 5, 7])
tensor(True)
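For a setting closer to real count data, the following sketch samples non-negative integer targets with `torch.poisson` from hypothetical rates and checks that the two input conventions (`log_input=True` with the log-rate, `log_input=False` with the rate) give the same loss up to the small `eps` inside the logarithm. The rate range is an arbitrary choice that keeps the rates away from zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical ground-truth rates, kept away from zero so that log(rate) is well behaved
rate = torch.rand(10, 3, 5, 7) * 5 + 0.5
y_true = torch.poisson(rate)  # non-negative integer-valued counts
log_rate = torch.log(rate)    # what a network would output with log_input=True

loss_log = nn.PoissonNLLLoss(log_input=True)(log_rate, y_true)
loss_rate = nn.PoissonNLLLoss(log_input=False)(rate, y_true)

# Both conventions describe the same likelihood and differ only by the eps inside the log
print(torch.allclose(loss_log, loss_rate, atol=1e-4))
print(bool((y_true >= 0).all()))
```

Predicting the log-rate is often convenient because the network output is then unconstrained, while the rate itself must be positive.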


## **KLDivLoss**

<span style="font-size: 15px;">

- If our target is a probability distribution and we build a neural network that predicts another probability distribution, then we can use the KLDivLoss to measure how much the predicted distribution diverges from the target distribution. To use the KLDivLoss function we need the following components:
     1. The target probability tensor ${{y}_{\rm true}}$ of arbitrary shape.
     2. The predicted probability tensor ${{y}_{\rm pred}}$, which should have the same shape as the target tensor. Note that the loss expects its input as log-probabilities, so $\log({{y}_{\rm pred}})$ is what gets passed to it.

- The KLDivLoss can give us the following two tensor-valued errors:

   $$
\begin{aligned} 
{L_{k_1}}^{b, i,j, \cdots} &= {y_{\rm true}}^{b,i,j, \cdots} \times  \left(\log\left( {y_{\rm true}}^{b,i,j, \cdots} \right) -  \log\left({y_{\rm pred}}^{b,i,j, \cdots} \right) \right)
\\
{L_{k_2}}^{b, i,j, \cdots} &= \exp\left({y_{\rm true}}^{b,i,j, \cdots}\right) \times  \left({y_{\rm true}}^{b,i,j, \cdots} -  \log\left({y_{\rm pred}}^{b,i,j, \cdots} \right) \right)
\end{aligned}
   $$

     Moreover, each of the tensor errors yields the following two scalar loss functions:

   1. Averaged scalar loss functions:


   $$
\begin{aligned} 
L_1 &=  \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{k_1}}^{b, i,j, \cdots}
\\
L_2 &=  \frac{1}{N} \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{k_2}}^{b, i,j, \cdots}
\end{aligned}
   $$

  
   2. Scalar loss functions without averaging:

   $$
\begin{aligned} 
L_1 &=   \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{k_1}}^{b, i,j, \cdots}
\\
L_2 &=   \sum_{b=0}^{B-1} \sum_{i,j, \cdots} {L_{k_2}}^{b, i,j, \cdots}
\end{aligned}
   $$
**Important Note:** Both ${L_{k_1}}$ and ${L_{k_2}}$ compute the **same KL divergence mathematically**. The difference is only in the **input format**:
   - ${L_{k_1}}$ (`log_target=False`, the default): the target $y_{\rm true}$ is given as **probabilities**, while the input passed to the loss is the **log-probability** $\log(y_{\rm pred})$.
   - ${L_{k_2}}$ (`log_target=True`): both the target and the input are given as **log-probabilities**, i.e., $y_{\rm true}$ in the formula for ${L_{k_2}}$ denotes the logarithm of the target probabilities.

Using `log_target=True` provides better numerical stability when your target is already in log-space, as it avoids taking the logarithm of target probabilities that may be very close to zero.

Next, let us investigate how this function can be used in PyTorch:
</span>

In [21]:
# Let us generate some true and predicted quantities randomly. For that purpose, we take the output quantities to be of shape (3, 5, 7)
# and we take a batch size of 10:
y_pred_raw = torch.abs(torch.randn((10, 3, 5, 7))) + 0.1
y_true_raw = torch.abs(torch.randn((10, 3, 5, 7))) + 0.1

# For L1
# Target: regular probabilities
# Input: LOG probabilities
y_true = F.softmax(y_true_raw, dim=-1)
y_pred_log = F.log_softmax(y_pred_raw, dim=-1)  # ← LOG space

# Manual calculation

Lk1 = y_true * (torch.log(y_true) - y_pred_log)


# For L2
# Target: LOG probabilities
# Input: LOG probabilities
y_true_log = F.log_softmax(y_true_raw, dim=-1)  # ← LOG space
y_pred_log = F.log_softmax(y_pred_raw, dim=-1)  # ← LOG space

# Manual calculation
Lk2 = torch.exp(y_true_log) * (y_true_log - y_pred_log)

In [23]:
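# The init method of nn.KLDivLoss contains the following default arguments: log_target=False, reduction='mean'
# Note: reduction='mean' averages over all N elements; the PyTorch documentation recommends reduction='batchmean'
# when the mathematically defined KL divergence per sample is needed.
# To generate the mean scalar loss function L1 we do the following: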
loss = nn.KLDivLoss()
print(loss(y_pred_log, y_true).item())
# Manually, the above result can be obtained as follows
print(Lk1.mean().item())

# To generate the mean scalar loss function L2 we set the log_target argument to True:
loss = nn.KLDivLoss(log_target=True)
print(loss(y_pred_log, y_true_log).item())
# Manually, the above result can be obtained as follows
print(Lk2.mean().item())

0.05183225870132446
0.05183225870132446
0.05183225870132446
0.05183225870132446


In [24]:
# To obtain the scalar loss functions without taking the mean, we set the default argument reduction from 'mean' to 'sum', i.e.:
loss = nn.KLDivLoss(reduction='sum')
print(loss(y_pred_log, y_true).item())
# Manually, the above result can be obtained as follows
print(Lk1.sum().item())

# To generate the summed scalar loss function L2 we set the log_target argument to True:
loss = nn.KLDivLoss(log_target=True, reduction='sum')
print(loss(y_pred_log, y_true_log).item())
# Manually, the above result can be obtained as follows
print(Lk2.sum().item())

54.42387008666992
54.42387008666992
54.42387008666992
54.42387008666992


In [31]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.KLDivLoss(reduction='none')
loss_tensor = loss(y_pred_log, y_true)
print(loss_tensor.shape)
# Lets see if this gives us the same as Lk1 by using allclose
print(torch.allclose(Lk1, loss_tensor) )


loss = nn.KLDivLoss(reduction='none', log_target=True)
loss_tensor = loss(y_pred_log, y_true_log)
print(loss_tensor.shape)
# Again, we use allclose to see if this gives the same as Lk2
print(torch.allclose(Lk2, loss_tensor) )

torch.Size([10, 3, 5, 7])
True
torch.Size([10, 3, 5, 7])
True
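Besides `'mean'`, `'sum'` and `'none'`, `nn.KLDivLoss` also accepts `reduction='batchmean'`, which divides the summed loss by the batch size rather than by the total number of elements. The sketch below, using a hypothetical batch of 8 distributions over 5 classes, checks that this matches the KL divergence of each sample (summed over the class dimension) averaged over the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical batch of 8 distributions over 5 classes
y_pred_log = F.log_softmax(torch.randn(8, 5), dim=-1)  # input: log-probabilities
y_true = F.softmax(torch.randn(8, 5), dim=-1)          # target: probabilities

kl_batchmean = nn.KLDivLoss(reduction='batchmean')(y_pred_log, y_true)
kl_sum = nn.KLDivLoss(reduction='sum')(y_pred_log, y_true)

# KL divergence of each sample: sum over the class dimension
kl_per_sample = (y_true * (torch.log(y_true) - y_pred_log)).sum(dim=-1)

print(torch.allclose(kl_batchmean, kl_sum / y_true.shape[0]))
print(torch.allclose(kl_batchmean, kl_per_sample.mean()))
```

With `reduction='mean'` the result would additionally be divided by the number of classes, which is why it does not correspond to the KL divergence per sample.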


# Discrete Outputs

**Overview**

This table summarizes common classification and ranking loss functions in PyTorch, their key properties, and typical use cases.

| Loss Function | Basic Properties | Best Used For | Not Recommended For |
|---------------|------------------|---------------|---------------------|
| **CrossEntropyLoss** | Combines softmax and NLLLoss; expects raw logits; numerically stable; handles multi-class classification | Multi-class classification (mutually exclusive classes); standard classification tasks | Multi-label problems; binary classification (use BCEWithLogitsLoss); when probabilities are already computed |
| **NLLLoss** | Negative log-likelihood; expects log-probabilities as input; no built-in softmax | When you need custom preprocessing before loss; when log-probabilities are already computed | Raw logits (use CrossEntropyLoss instead); when you need softmax built-in |
| **BCELoss** | Binary cross-entropy; expects probabilities in [0,1]; requires manual sigmoid | Binary classification when probabilities are pre-computed; multi-label with sigmoid already applied | Raw logits (numerical instability); prefer BCEWithLogitsLoss for better stability |
| **BCEWithLogitsLoss** | Combines sigmoid and BCELoss; expects raw logits; numerically stable | Binary classification; multi-label classification; imbalanced datasets (supports pos_weight) | Multi-class single-label problems; when probabilities are already computed |
| **MultiLabelSoftMarginLoss** | Multi-label version of logistic loss; expects raw logits; treats each label independently | Multi-label classification; when labels are not mutually exclusive; text categorization | Single-label classification; when class dependencies matter; small number of classes |
| **SoftMarginLoss** | Binary logistic loss with margin; expects raw predictions; labels are {-1, +1} | Binary classification with confidence margins; when using {-1,+1} labels; SVM-style objectives | Standard {0,1} binary classification; multi-class problems; when margin concept doesn't apply |
| **MultiMarginLoss** | Multi-class hinge loss; margin-based; penalizes incorrect predictions; expects raw scores | Multi-class classification with margin enforcement; SVM-style multi-class; when you want geometric margin | Standard softmax-based classification; probability estimation; when cross-entropy works well |
| **MarginRankingLoss** | Learns relative ranking; expects two inputs and ranking preference; margin-based | Ranking tasks; learning to rank; pairwise preference learning; recommendation systems | Standard classification; absolute value prediction; when ranking relationships don't exist |
| **HingeEmbeddingLoss** | Metric learning loss; binary similarity; expects distances and labels {-1, +1} | Similarity learning; determining if pairs are similar/dissimilar; Siamese networks | Multi-class classification; when absolute distances matter more than similarity; complex relationships |
| **CosineEmbeddingLoss** | Measures cosine similarity; expects embeddings and labels {-1, +1}; angle-based metric | Face recognition; semantic similarity; when angle matters more than magnitude; text similarity | Distance-based similarity; when magnitude is important; binary classification without embeddings |
| **TripletMarginLoss** | Anchor-positive-negative triplets; learns relative distances; enforces margin between similar/dissimilar | Face recognition; person re-identification; metric learning; when you have triplet data | Standard classification; binary similarity; when triplet mining is difficult; small datasets |
| **TripletMarginWithDistanceLoss** | Generalized triplet loss; custom distance functions; more flexible than TripletMarginLoss | Same as TripletMarginLoss but with non-Euclidean distances; custom similarity metrics | Standard classification; when Euclidean distance suffices (use TripletMarginLoss) |

Detailed explanations of each loss function, including their mathematical formulations and implementation examples, follow below.
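
Two of the "combines" relationships listed in the table can be checked numerically. The following is a minimal sketch, not part of the original notebook: the tensor names (`logits`, `targets`, `bin_logits`, `bin_targets`) and the sizes are illustrative assumptions, and the comparison uses a small absolute tolerance because the fused and the two-step computations may differ in the last floating-point digits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class case: 4 samples, 3 mutually exclusive classes (illustrative sizes).
logits = torch.randn(4, 3)
targets = torch.randint(0, 3, (4,))

# CrossEntropyLoss on logits vs. NLLLoss on explicitly computed log-probabilities.
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll, atol=1e-6))

# Binary / multi-label case: BCE-style targets are floats in [0, 1].
bin_logits = torch.randn(4, 3)
bin_targets = torch.randint(0, 2, (4, 3)).float()

# BCEWithLogitsLoss on logits vs. BCELoss on explicitly computed probabilities.
bce_with_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)
print(torch.allclose(bce_with_logits, bce, atol=1e-6))
```

For moderate logits like these, both routes agree. The fused versions are preferable mainly because they remain well-behaved for extreme logits.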

## CrossEntropyLoss

<span style="font-size: 15px;">

CrossEntropyLoss is the most commonly used loss function for multi-class classification problems in which each sample belongs to exactly one class. It combines a log-softmax activation and the negative log-likelihood loss in a single, numerically stable operation. It expects raw, unnormalized scores (logits) from the model, not probabilities.

- Consider logits of shape $(B, C)$ where $C$ is the number of classes, denoted as ${y_{\rm pred}}^{b,c}$, and class labels ${y_{\rm true}}^b \in \{0, 1, \ldots, C-1\}$ for each sample $b$ in the batch. The CrossEntropyLoss first computes the softmax probabilities:

   $$
\begin{aligned} 
p^{b,c} = \frac{\exp({y_{\rm pred}}^{b,c})}{\sum_{c'=0}^{C-1} \exp({y_{\rm pred}}^{b,c'})}
\end{aligned}
   $$

   Then it defines the following loss tensor:

   $$
\begin{aligned} 
{L_{\rm CE}}^{b} = -\log(p^{b, {y_{\rm true}}^b})
\end{aligned}
   $$

- From this loss tensor, CrossEntropyLoss can produce two scalar losses, selected through its `reduction` argument:
   1. The mean cross-entropy over the batch (`reduction='mean'`, the default):
   $$
\begin{aligned} 
L = \frac{1}{B} \sum_{b=0}^{B-1} {L_{\rm CE}}^{b}
\end{aligned}
   $$

   2. The summed cross-entropy, without taking the mean (`reduction='sum'`):
   $$
\begin{aligned} 
L = \sum_{b=0}^{B-1} {L_{\rm CE}}^{b}
\end{aligned}
   $$

Next, let us see how this function is used in PyTorch:
</span>

In [32]:
# Let us generate some logits and true labels randomly. We take 5 classes and a batch size of 10:
C = 5  # number of classes
B = 10  # batch size
y_pred = torch.randn((B, C))  # raw logits
y_true = torch.randint(0, C, (B,))  # class labels

# The corresponding loss tensor is obtained through manual calculation:
log_probs = F.log_softmax(y_pred, dim=1)
CE_LossTensor = -log_probs[range(B), y_true]

# The init method of nn.CrossEntropyLoss contains the default argument: reduction='mean'
# This choice generates for us the mean cross-entropy loss. The corresponding loss is:
loss = nn.CrossEntropyLoss()
print(loss(y_pred, y_true).item())
# This matches the result of applying the formula directly:
print(CE_LossTensor.mean().item())

# To obtain the non-averaged case, we have to change the default argument reduction from 'mean' to 'sum':
loss = nn.CrossEntropyLoss(reduction='sum')
print(loss(y_pred, y_true).item())
# This matches the result of summing all elements of CE_LossTensor:
print(CE_LossTensor.sum().item())

1.5826647281646729
1.5826647281646729
15.826647758483887
15.826647758483887
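
The claim that CrossEntropyLoss is numerically stable can be illustrated with deliberately extreme logits. The sketch below is an added example under stated assumptions: the values in `big_logits` and the names `naive_loss` and `stable_loss` are made up for illustration. It compares a naive "softmax, then log" computation with the fused `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

# One sample with extreme logits; the true class (index 2) has the lowest score.
big_logits = torch.tensor([[1000.0, 0.0, -1000.0]])
big_target = torch.tensor([2])

# Naive route: softmax first, then log. The probability of class 2 underflows
# to exactly 0 in float32, so its logarithm is -inf and the loss becomes inf.
probs = torch.softmax(big_logits, dim=1)
naive_loss = -torch.log(probs[0, 2])

# Fused route: cross_entropy works in log-space and stays finite.
stable_loss = F.cross_entropy(big_logits, big_target)

print(naive_loss.item(), stable_loss.item())
```

An infinite loss also yields unusable gradients, which is why passing raw logits to CrossEntropyLoss is preferred over applying softmax manually.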


In [36]:
# To obtain the tensor-valued loss function, we have to change the default argument reduction from 'mean' to 'none':
loss = nn.CrossEntropyLoss(reduction='none')
loss_tensor = loss(y_pred, y_true)
print(loss_tensor.shape)
# To check that the resulting loss tensor matches CE_LossTensor up to floating-point tolerance, we use isclose:
print(torch.isclose(CE_LossTensor, loss_tensor).all())

torch.Size([10])
tensor(True)
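
To make the softmax and per-sample loss formulas concrete, it can help to evaluate them by hand for a single sample. The following sketch uses plain Python arithmetic and compares the result with PyTorch. The logits `[1.0, 2.0, 3.0]` and the true class `2` are arbitrary illustrative choices, not values from the cells above.

```python
import math
import torch
import torch.nn.functional as F

# Illustrative single sample with C = 3 classes and true class 2.
sample_logits = [1.0, 2.0, 3.0]
true_class = 2

# Softmax probability of the true class, written out term by term.
exp_scores = [math.exp(z) for z in sample_logits]
p_true = exp_scores[true_class] / sum(exp_scores)

# Per-sample cross-entropy: minus the log of that probability.
manual_ce = -math.log(p_true)

# The same quantity from PyTorch, using a batch of size 1.
torch_ce = F.cross_entropy(torch.tensor([sample_logits]), torch.tensor([true_class]))

print(manual_ce, torch_ce.item())
```

Here the true class has the largest logit, so its probability is the largest of the three and the loss is small. Choosing `true_class = 0` instead would give a noticeably larger loss for the same logits.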
