Copyright 2021-2023 Lawrence Livermore National Security, LLC and other MuyGPyS
Project Developers. See the top-level COPYRIGHT file for details.

SPDX-License-Identifier: MIT

# Loss Function Tutorial

This notebook illustrates the loss functions available in the `MuyGPyS` library.
These functions are used to formulate the objective function to be optimized while fitting hyperparameters, and so have a large effect on the outcome of training.
We will describe each of these loss functions and plot their behaviors to help the user to select the right loss for their problem.

Each function in this notebook is available for import from `MuyGPyS.optimize.loss`, and is an object of class `MuyGPyS.optimize.loss.LossFn`.
It is possible to define new loss functions by creating a new `LossFn` object.
View its documentation for more details.

We assume throughout a vector of targets $y$, a prediction (posterior mean) vector $\bar{\mu}$, and a posterior variance vector $\sigma^2$ for a training batch $B$ with $b$ elements.

In [None]:
import sys
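# Iterate over a snapshot of the keys: popping entries while iterating the
# live dictionary view raises a RuntimeError.
for m in list(sys.modules.keys()):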
    if m.startswith("Muy"):
        sys.modules.pop(m)
%env MUYGPYS_BACKEND=numpy
%env MUYGPYS_FTYPE=64

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from matplotlib.colors import SymLogNorm
from MuyGPyS.optimize.loss import mse_fn, cross_entropy_fn, lool_fn, lool_fn_unscaled, pseudo_huber_fn, looph_fn
from MuyGPyS._src.optimize.loss.numpy import _looph_fn_unscaled

In [None]:
plt.style.use('tableau-colorblind10')

In [None]:
mmax = 3.0
point_count = 50
ys = np.zeros(point_count)
residuals = np.linspace(-mmax, mmax, point_count)
smax = 3.0
smin = 1e-1
sigma_count = 50
sigmas = np.linspace(smin, smax, sigma_count)

## Variance-free Loss Functions

`MuyGPyS` features several loss functions that depend only upon the targets $y$ and posterior mean predictions $\bar{\mu}$ of your training batch.
These loss functions are situationally useful, although they leave the fitting of variance parameters entirely up to the separate, analytic `sigma_sq` optimization function and might not be sensitive to certain variance parameters.
As they do not require evaluating the posterior variance $\sigma^2$ or optimizing the variance scale parameter, these loss functions are generally more efficient to use in practice.

### Mean Squared Error (`mse_fn`)

The mean squared error (MSE) is a classic loss that computes

\begin{equation*}
\ell_\textrm{MSE}(\bar{\mu}, y) = \frac{1}{b} \sum_{i \in B} (\bar{\mu}_i - y_i)^2.
\end{equation*}
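
As a minimal, dependency-free sketch of the formula above (it does not call `mse_fn`; the helper name `mse` and the sample values are illustrative assumptions), the loss of a small batch can be worked out by hand:

```python
def mse(predictions, targets):
    """Mean squared error of a batch, transcribed from the displayed formula."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    b = len(targets)
    return sum((mu - y) ** 2 for mu, y in zip(predictions, targets)) / b


# Hypothetical three-element batch with residuals 1, 0 and -2.
targets = [0.0, 1.0, 2.0]
predictions = [1.0, 1.0, 0.0]
batch_mse = mse(predictions, targets)  # (1 + 0 + 4) / 3
```

Because each residual is squared, the single residual of magnitude 2 contributes four times as much as the residual of magnitude 1.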

The following plot illustrates the MSE as a function of the residual.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(4,3))
ax.set_title("MSE as a function of the residual", fontsize=20)
ax.set_ylabel("loss", fontsize=15)
ax.set_xlabel(r"$\mu_i - y_i$", fontsize=15)  # raw string keeps the LaTeX backslash literal
mses = [mse_fn(ys[i].reshape(1, 1), residuals[i].reshape(1, 1)) for i in range(point_count)]
ax.plot(residuals, mses)
plt.show()

### Cross Entropy Loss (`cross_entropy_fn`)

The cross entropy loss is a classic classification loss often used in the fitting of neural networks.
For targets in $\{0, 1\}$, the library first transforms the predictions to be row-stochastic and then computes

\begin{equation*}
\ell_\textrm{cross-entropy}(\bar{\mu}, y) =
- \sum_{i \in B} \left [ y_i \log(\bar{\mu}_i) + (1 - y_i) \log(1 - \bar{\mu}_i) \right ].
\end{equation*}
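
The sketch below evaluates the conventional textbook binary cross entropy, in which the whole sum is negated so that lower values are better. It is an illustration only: it does not call `cross_entropy_fn`, it does not reproduce the library's row-stochastic transformation, and the helper name `binary_cross_entropy` and the `eps` clipping are assumptions made here.

```python
import math


def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Textbook binary cross entropy for targets in {0, 1}."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        # Clip so that log(0) is never evaluated (an assumption of this sketch).
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


targets = [1, 0, 1]
confident = binary_cross_entropy([0.9, 0.1, 0.8], targets)
hesitant = binary_cross_entropy([0.6, 0.4, 0.5], targets)
```

Predictions that place more probability on the correct class (`confident`) yield a smaller loss than predictions near 0.5 (`hesitant`).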

⚠️ This section is under construction. ⚠️ 

### Pseudo-Huber Loss (`pseudo_huber_fn`)

The pseudo-Huber loss is a smooth approximation to the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss), which is approximately quadratic for small residuals and approximately linear for large residuals.
This means that the pseudo-Huber loss is less sensitive to large outliers, which might otherwise force the optimizer to overcompensate in undesirable ways.
The pseudo-Huber loss computes

\begin{equation*}
\ell_\textrm{Pseudo-Huber}(\bar{\mu}, y \mid \delta) =
\sum_{i=1}^b \delta^2 \left ( 
\sqrt{1 + \left ( \frac{\bar{\mu}_i - y_i}{\delta} \right )^2} - 1
\right ),
\end{equation*}

where $\delta$ is a parameter that indicates the scale of the boundary between the quadratic and linear parts of the function.
The `pseudo_huber_fn` accepts this parameter as the `boundary_scale` keyword argument.
Note that the scale of $\delta$ depends on the units of $y$ and $\mu$.
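
To make the two regimes concrete, here is a minimal sketch that transcribes the formula above in plain Python. It does not call `pseudo_huber_fn`; the helper name `pseudo_huber` and the sample residuals are illustrative assumptions.

```python
import math


def pseudo_huber(predictions, targets, boundary_scale=1.0):
    """Pseudo-Huber loss of a batch, transcribed from the displayed formula."""
    delta = boundary_scale
    return sum(
        delta**2 * (math.sqrt(1.0 + ((mu - y) / delta) ** 2) - 1.0)
        for mu, y in zip(predictions, targets)
    )


# Residual far below delta = 1: the loss is close to residual**2 / 2.
small = pseudo_huber([0.01], [0.0])
# Residual far above delta = 1: the loss is close to delta * (|residual| - delta).
large = pseudo_huber([100.0], [0.0])
```

A squared-error term would assign the large residual a value of order $10^4$, whereas the pseudo-Huber value grows only linearly with it.
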
The following plots show the behavior of the pseudo-Huber loss for a few values of $\delta$.

In [None]:
boundary_scales = [0.5, 1.0, 2.5]
phs = np.array([
    [pseudo_huber_fn(ys[i].reshape(1, 1), residuals[i].reshape(1, 1), boundary_scale=bs) for i in range(point_count)]
    for bs in boundary_scales
])
fig, ax = plt.subplots(1, 1, figsize=(4, 3))
ax.set_title("Pseudo-Huber", fontsize=20)
ax.set_ylabel("loss", fontsize=15)
ax.set_xlabel(r"$\mu_i - y_i$", fontsize=15)
# One curve per boundary scale, each with its own line style.
for ph, bs, ls in zip(phs, boundary_scales, ["solid", "dotted", "dashed"]):
    ax.plot(residuals, ph, linestyle=ls, label=rf"$\delta = {bs}$")
ax.legend()
plt.show()

## Variance-Sensitive Loss Functions

`MuyGPyS` also includes loss functions that explicitly depend upon the posterior variances $\bar{\Sigma}$, which is a diagonal matrix for a univariate MuyGPs model.
These loss functions penalize large variances, and so tend to be more sensitive to variance parameters.
This sensitivity comes at a price: the linear algebra involved in each evaluation of the objective function grows by a constant factor.
The compute time per optimization loop increases as a result, but the added sensitivity is often worth the trade in practice.

Computing $\sigma^2$ involves multiplying the unscaled `MuyGPS` variance by the `sigma_sq` variance scaling parameter, which at present must be optimized during each evaluation of the objective function.

### Leave-One-Out Loss (`lool_fn`)

The leave-one-out loss (lool) scales and regularizes the MSE to make the loss more sensitive to parameters that primarily act on the variance.
lool computes

\begin{equation*}
\ell_\textrm{lool}(\bar{\mu}, y \mid \bar{\Sigma}) = 
\sum_{i \in B} \left ( \frac{\bar{\mu}_i - y_i}{\bar{\Sigma}_{ii}} \right )^2 + \log \bar{\Sigma}_{ii}.
\end{equation*}
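
The following sketch transcribes the displayed formula literally, treating each $\bar{\Sigma}_{ii}$ as a plain positive number and assuming that the logarithmic term belongs inside the sum. It does not call `lool_fn`, the helper name `lool` is hypothetical, and the library's implementation may differ in details such as how the variance enters the ratio.

```python
import math


def lool(predictions, targets, variances):
    """Literal transcription of the displayed lool formula (illustrative only)."""
    return sum(
        ((mu - y) / var) ** 2 + math.log(var)
        for mu, y, var in zip(predictions, targets, variances)
    )


# A perfect prediction: only the logarithmic term remains.
perfect_tight = lool([1.0], [1.0], [0.5])  # log(0.5), which is negative
perfect_loose = lool([1.0], [1.0], [2.0])  # log(2.0), which is positive
# A unit residual with a small variance: the scaled residual term dominates.
miss_tight = lool([2.0], [1.0], [0.5])  # (1 / 0.5)**2 + log(0.5)
```

For a perfect prediction the larger variance is penalized, while for a poor prediction a small variance is penalized through the scaled residual term.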

The next plot illustrates the loss as a function of both the residual and of $\sigma^2$.

In [None]:
lools = np.array([
    [
        lool_fn_unscaled(
            ys[i].reshape(1, 1),
            residuals[i].reshape(1, 1),
            sigmas[sigma_count - 1 - j]
        )
        for i in range(point_count)
    ]
    for j in range(sigma_count)
])

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].set_title("lool", fontsize=20)
axes[0].set_ylabel(r"$\sigma_i$", fontsize=15)
axes[0].set_xlabel(r"$\mu_i - y_i$", fontsize=15)
im = axes[0].imshow(
    lools, extent=[-mmax, mmax, smin, smax], norm=SymLogNorm(1e-1), cmap="coolwarm", aspect=2.0
)
fig.colorbar(im, ax=axes[0])

axes[1].set_title("lool residual cross-section", fontsize=14)
axes[1].set_ylabel("lool", fontsize=15)
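axes[1].set_xlabel(r"$\mu_i - y_i$", fontsize=15)
# Row j of `lools` holds sigmas[sigma_count - 1 - j], so map each index into
# `sigmas` (15, 7, 0 for roughly 1.0, 0.5, 0.1) to its row before slicing.
axes[1].plot(residuals, lools[sigma_count - 1 - 15, :], linestyle="solid", label=r"$\sigma_i = 1.0$")
axes[1].plot(residuals, lools[sigma_count - 1 - 7, :], linestyle="dotted", label=r"$\sigma_i = 0.5$")
axes[1].plot(residuals, lools[sigma_count - 1 - 0, :], linestyle="dashed", label=r"$\sigma_i = 0.1$")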
axes[1].legend()

axes[2].set_title("lool, variance cross-section", fontsize=14)
axes[2].set_ylabel("lool", fontsize=15)
axes[2].set_xlabel(r"$\sigma_i$", fontsize=15)
# np.flip undoes the reversed row order so that each column lines up with `sigmas`.
axes[2].plot(sigmas, np.flip(lools[:, 33]), linestyle="solid", label=r"$\mid \mu_i - y_i \mid = 1.0$")
axes[2].plot(sigmas, np.flip(lools[:, 29]), linestyle="dotted", label=r"$\mid \mu_i - y_i \mid = 0.5$")
axes[2].plot(sigmas, np.flip(lools[:, 24]), linestyle="dashed", label=r"$\mid \mu_i - y_i \mid = 0.0$")
axes[2].legend()

plt.tight_layout()
plt.show()

Notice that the cross-section of the lool surface for a fixed $\sigma$ is quadratic, while the cross section of the lool surface for a fixed residual is logarithmic.
For small enough residuals, this curve inverts and assumes negative values for small $\sigma$.
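
As a worked single-element case, assume the logarithmic term sits inside the sum and write $r = \bar{\mu}_i - y_i$ and $s = \bar{\Sigma}_{ii}$ (shorthand introduced here for illustration). A zero residual leaves only the logarithm:

\begin{equation*}
\ell_\textrm{lool} \big|_{r = 0} = \log s,
\qquad
\log s < 0 \iff s < 1.
\end{equation*}

Since the residual term is never negative, the loss can only be negative where $s < 1$, which is why the negative region appears only for small residuals combined with small variances.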

### Leave-One-Out Pseudo-Huber (`looph_fn`)

The leave-one-out pseudo-Huber loss (looph) is similar in nature to the lool, but is applied to the pseudo-Huber loss instead of MSE.
looph computes

\begin{equation*}
\ell_\textrm{looph}(\bar{\mu}, y \mid \delta, \bar{\Sigma}) =
\sum_{i=1}^b \delta^2 \left ( 
\sqrt{1 + \left ( \frac{\bar{\mu}_i - y_i}{\delta \bar{\Sigma}_{ii}} \right )^2} - 1
\right ) + \log \bar{\Sigma}_{ii},
\end{equation*}

where again $\delta$ is the boundary scale.
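
The sketch below transcribes the displayed looph formula literally, again treating each $\bar{\Sigma}_{ii}$ as a plain positive number and assuming the logarithmic term belongs inside the sum. It does not call `looph_fn`, the helper name `looph` and the sample outlier are hypothetical, and the library's implementation may differ in its details.

```python
import math


def looph(predictions, targets, variances, boundary_scale=1.0):
    """Literal transcription of the displayed looph formula (illustrative only)."""
    delta = boundary_scale
    return sum(
        delta**2 * (math.sqrt(1.0 + ((mu - y) / (delta * var)) ** 2) - 1.0)
        + math.log(var)
        for mu, y, var in zip(predictions, targets, variances)
    )


# A single outlier with residual 10 and unit variance, for three boundary scales.
outlier = ([10.0], [0.0], [1.0])
by_scale = {bs: looph(*outlier, boundary_scale=bs) for bs in (0.5, 1.0, 2.5)}
# With a zero residual only the logarithmic term remains.
zero_residual = looph([0.0], [0.0], [0.5])
```

Smaller boundary scales damp the outlier more strongly, and every value stays far below the squared scaled residual of 100 that a purely quadratic term would contribute.
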
The next plots illustrate the looph as a function of the residual, $\sigma$, and $\delta$.

In [None]:
loophs = np.array([
    [
        [
            _looph_fn_unscaled(
                ys[i].reshape(1, 1),
                residuals[i].reshape(1, 1),
                sigmas[sigma_count - 1 - j],
                boundary_scale=bs
            )
            for i in range(point_count)
        ]
        for j in range(sigma_count)
    ]
    for bs in boundary_scales
])

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(14,12))
for i, bs in enumerate(boundary_scales):
    axes[i, 0].set_title(rf"looph ($\delta={bs}$)", fontsize=20)
    axes[i, 0].set_ylabel(r"$\sigma_i$", fontsize=15)
    axes[i, 0].set_xlabel(r"$\mu_i - y_i$", fontsize=15)
    im = axes[i, 0].imshow(
        loophs[i, :, :], extent=[-mmax, mmax, smin, smax], norm=SymLogNorm(1e-1), cmap="coolwarm", aspect=2.0
    )
    fig.colorbar(im, ax=axes[i, 0])

    axes[i, 1].set_title(rf"looph residual cross-section ($\delta={bs}$)", fontsize=14)
    axes[i, 1].set_ylabel("looph", fontsize=15)
    axes[i, 1].set_xlabel(r"$\mu_i - y_i$", fontsize=15)
    # Row j of each looph surface holds sigmas[sigma_count - 1 - j], so map each
    # index into `sigmas` (15, 7, 0 for roughly 1.0, 0.5, 0.1) to its row.
    axes[i, 1].plot(residuals, loophs[i, sigma_count - 1 - 15, :], linestyle="solid", label=r"$\sigma_i = 1.0$")
    axes[i, 1].plot(residuals, loophs[i, sigma_count - 1 - 7, :], linestyle="dotted", label=r"$\sigma_i = 0.5$")
    axes[i, 1].plot(residuals, loophs[i, sigma_count - 1 - 0, :], linestyle="dashed", label=r"$\sigma_i = 0.1$")
    axes[i, 1].legend()

    axes[i, 2].set_title(rf"looph variance cross-section ($\delta={bs}$)", fontsize=14)
    axes[i, 2].set_ylabel("looph", fontsize=15)
    axes[i, 2].set_xlabel(r"$\sigma_i$", fontsize=15)
    axes[i, 2].plot(sigmas, np.flip(loophs[i, :, 33]), linestyle="solid", label=r"$\mid \mu_i - y_i \mid = 1.0$")
    axes[i, 2].plot(sigmas, np.flip(loophs[i, :, 29]), linestyle="dotted", label=r"$\mid \mu_i - y_i \mid = 0.5$")
    axes[i, 2].plot(sigmas, np.flip(loophs[i, :, 24]), linestyle="dashed", label=r"$\mid \mu_i - y_i \mid = 0.0$")
    axes[i, 2].legend()

plt.tight_layout()
plt.show()

These plots show that the looph function can exhibit a more exaggerated upward slope where the residual lies in the linear part of the pseudo-Huber curve but is not yet large enough to outweigh the variance component of the loss.
Note that, in practice, both pseudo-Huber loss functions may require more training iterations to converge than their alternatives.