In [1]:
import math
import numpy as np
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torch.multiprocessing as mp

import torchvision
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms

<br>

# MSE Loss
---

For regression, the MSE loss corresponds to the assumption that the data-generating process follows a Gaussian distribution centered on the value to be predicted (in other words, the noise is assumed to be Gaussian and the output unimodal). Minimizing the MSE is then equivalent to maximizing the likelihood under a Gaussian with fixed variance.
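As a quick sanity check of this link, the sketch below compares the MSE with the negative log likelihood of a unit-variance Gaussian centered on the prediction. The tensors `pred` and `target` are arbitrary random values chosen for illustration, and fixing the standard deviation to 1 is an assumption of this example.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.randn(10, 3)    # hypothetical model outputs
target = torch.randn(10, 3)  # hypothetical regression targets

mse_value = nn.MSELoss(reduction='mean')(pred, target)

# Negative log likelihood of the targets under N(pred, 1)
gaussian = torch.distributions.Normal(loc=pred, scale=1.0)
gaussian_nll = -gaussian.log_prob(target).mean()

# With sigma = 1: NLL = MSE / 2 + log(2 * pi) / 2
shifted_mse = 0.5 * mse_value + 0.5 * math.log(2 * math.pi)
print(torch.allclose(gaussian_nll, shifted_mse, atol=1e-6))
```

The two quantities differ only by a scale factor and an additive constant, so they share the same minimizer.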

In [12]:
torch.random.manual_seed(0)
x = torch.zeros(size=(10, 3, 32, 32))
y = torch.zeros(size=(10, 3, 32, 32))
x.normal_()
y.normal_()

mse = nn.MSELoss(reduction='mean')
print(f"{mse(x, y).item():.2f}")
print(f"{mse(y, x).item():.2f}")

1.99
1.99


<br>

# Cross Entropy Loss
---

In classification problems, the goal is to maximize the joint probability of guessing the right class for a list of samples $(x_i, y_i)$. Maximizing this probability, assuming i.i.d. samples, takes the form of maximizing the product:

&emsp; $\displaystyle P = \prod_i p(y_i|x_i) \implies \log P = \sum_i \log p(y_i|x_i)$

This is equivalent to minimizing the **negative log likelihood**, which serves as the loss function:

&emsp; $\displaystyle \mathcal{L} = - \sum_i \log p(y_i|x_i)$

Since the outputs of most networks are not bounded, we generally apply a **softmax** function to transform the **logits** of each class into probabilities:

&emsp; $\displaystyle p(y = c \mid x) = \frac{\exp(l_c)}{\sum_j \exp(l_j)}$ where $l_c$ is the logit output for class *c*
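To make the two formulas concrete, here is a minimal sketch that writes the softmax out by hand, picks the probability of the correct class for each sample, and sums the negative logs. The names `example_logits` and `example_target` and their values are made up for the illustration; the result is compared against `nn.CrossEntropyLoss(reduction='sum')`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
example_logits = torch.randn(4, 3)           # 4 samples, 3 classes (arbitrary values)
example_target = torch.tensor([0, 2, 1, 2])  # arbitrary class indices

# Softmax written out: exp(l_c) / sum_j exp(l_j)
exp_logits = torch.exp(example_logits)
manual_probs = exp_logits / exp_logits.sum(dim=-1, keepdim=True)

# Probability assigned to the correct class of each sample
p_correct = manual_probs[torch.arange(4), example_target]

# Negative log likelihood, summed over the samples
manual_nll = -torch.log(p_correct).sum()
reference = nn.CrossEntropyLoss(reduction='sum')(example_logits, example_target)
print(torch.allclose(manual_nll, reference, atol=1e-5))
```

Note that `nn.CrossEntropyLoss` averages over the batch by default, so `reduction='sum'` is needed to match the summed formula.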

This softmax function corresponds to the assumption that the classes are linearly separable in their final representation (which the previous layers of the network create). This in turn corresponds to the assumption that the N classes are centered around N points, each class following a Gaussian distribution centered on its point, with the same variance for all classes.

The *CrossEntropyLoss* class of PyTorch combines the application of softmax and the negative log likelihood in a single class (and is more numerically stable than chaining the two by hand):

In [23]:
# Example for 5 classes, and a batch size of 10

logits = torch.zeros(size=(10, 5))
target = torch.LongTensor([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])

ce = nn.CrossEntropyLoss()
print(ce(logits, target))

# Equivalent through Softmax

softmax = nn.Softmax(dim=-1)
probs = softmax(logits)
print(probs) # each class has same probability by construction here
nll = nn.NLLLoss()
print(nll(torch.log(probs), target))

# Equivalent through LogSoftmax

log_softmax = nn.LogSoftmax(dim=-1)
nll = nn.NLLLoss()
print(nll(log_softmax(logits), target))

tensor(1.6094)
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000]])
tensor(1.6094)
tensor(1.6094)
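The three variants agree on well-behaved logits, but the numerical stability claim can be illustrated with extreme values. In the sketch below (the logit values are chosen only to trigger the problem), the softmax probability of the target class underflows to zero, so taking its log yields an infinite loss, while `CrossEntropyLoss` works in log space and stays finite.

```python
import torch
import torch.nn as nn

extreme_logits = torch.tensor([[1000.0, 0.0, -1000.0]])
extreme_target = torch.tensor([2])  # the class with the lowest logit

# Softmax followed by log: the target probability underflows to 0
naive_loss = nn.NLLLoss()(torch.log(nn.Softmax(dim=-1)(extreme_logits)), extreme_target)

# CrossEntropyLoss never materializes the probabilities
stable_loss = nn.CrossEntropyLoss()(extreme_logits, extreme_target)

print(naive_loss)   # infinite
print(stable_loss)  # finite
```

Using `nn.LogSoftmax` followed by `nn.NLLLoss` avoids the same underflow, since the log is never applied to an already rounded probability.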


<br>

# Contrastive Loss
---

The contrastive loss is widely used in self-supervised learning for images and videos. To make a model invariant to rotations, crops, noise, color jittering and similar transformations, the model is trained to recognize that two random augmentations of the same image are identical.

This notion of "identical" is typically measured by the **cosine similarity** between the representations of the two augmentations of the image.

&emsp; $\displaystyle \mathcal{S}(x_i, x_j) = \frac{x_i \cdot x_j}{\Vert x_i \Vert \Vert x_j \Vert}$ (but any form of similarity function $\mathcal{S}$ can do)
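As a small illustration, the sketch below computes the cosine similarity by hand and with `torch.nn.functional.cosine_similarity`. The vectors `a`, `b` and `c` are arbitrary examples, and `manual_cosine` is a helper written for this illustration only.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])   # same direction as a
c = torch.tensor([3.0, 0.0, -1.0])  # orthogonal to a

def manual_cosine(u, v):
    """Dot product divided by the product of the norms."""
    return torch.dot(u, v) / (u.norm() * v.norm())

print(manual_cosine(a, b), F.cosine_similarity(a, b, dim=0))
print(manual_cosine(a, c), F.cosine_similarity(a, c, dim=0))
```

The similarity depends only on the direction of the vectors, not on their norm, which is why `a` and `b` are considered identical here.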

The problem with training a neural network to do so is that a trivial solution exists: assigning all images to the same representation (a constant function), so that all augmentations are indeed perfectly recognized. The idea of contrastive learning is to avoid such trivial solutions by also making sure that two unrelated images have different representations.

The network is therefore trained to increase the similarity between the representations of two random augmentations of the same image, and to decrease this similarity between two random augmentations of two different images. To do so, we deal with images the same way as we would with words in a sentence:

* each image is treated as having its own label
* we try to "classify" the representations correctly

We therefore use a modified softmax function (well suited to classification) together with the similarity function introduced above, giving the loss for a **positive pair** (two representations $x_i$ and $x_j$ of two augmentations of the same image):

&emsp; $\displaystyle \mathcal{L}(i,j) = - \log \frac{\exp \mathcal{S}(x_i, x_j)}{\sum_{k \ne i} \exp \mathcal{S}(x_i, x_k)}$ (classify $x_j$ as the sole matching element)

We usually add a hyperparameter named the **temperature** $\tau$ to get the full formula:

&emsp; $\displaystyle \mathcal{L}(i,j) = - \log \frac{\exp \big( \mathcal{S}(x_i, x_j) / \tau \big)}{\sum_{k \ne i} \exp \big( \mathcal{S}(x_i, x_k) / \tau \big)}$
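One possible way to compute this loss over a batch is sketched below. Following SimCLR, it takes the negative log of the softmax ratio and averages it over all positive pairs, which amounts to a cross entropy over the similarity matrix. The function name `nt_xent`, the default `tau=0.5`, and the convention that row `i` of `z1` and row `i` of `z2` come from the same image are assumptions of this example, not something prescribed by the text.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Contrastive loss averaged over the 2n positive pairs of a batch.

    z1, z2: (n, d) representations of two augmentations of the same n images.
    """
    n = z1.shape[0]
    # Unit-normalize so that dot products are cosine similarities
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2n, d)
    sim = z @ z.T / tau                                    # (2n, 2n)
    # Exclude k = i from the denominator
    self_mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float('-inf'))
    # The positive of element i is i + n, and the positive of i + n is i
    positives = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    # Cross entropy = -log softmax at the positive index, averaged over rows
    return F.cross_entropy(sim, positives)

torch.manual_seed(0)
view_a = torch.randn(4, 8)  # random stand-ins for network outputs
view_b = torch.randn(4, 8)
print(nt_xent(view_a, view_b))
```

Reusing `F.cross_entropy` makes the "classification" reading explicit: each row of the similarity matrix plays the role of the logits, and the index of the positive element plays the role of the label.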

The parameter $\tau$ controls how sharply the softmax concentrates on the most similar elements. Since $\exp \big( s / \tau \big) = \sqrt[\tau]{\exp s}$, a small $\tau$ amplifies the differences between similarities, whereas a large $\tau$ flattens them.
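The effect of the temperature can be seen on a toy example. The similarity values below are invented for illustration, and `softmax_with_temperature` is a helper defined only for this sketch.

```python
import torch

similarities = torch.tensor([1.0, 0.5, 0.0])  # illustrative cosine similarities

def softmax_with_temperature(s, tau):
    """Softmax of the similarities divided by the temperature tau."""
    return torch.softmax(s / tau, dim=-1)

for tau in (1.0, 0.5, 0.1):
    print(tau, softmax_with_temperature(similarities, tau))
```

As $\tau$ decreases, the weight assigned to the most similar element grows toward 1, so even small gaps in similarity translate into large differences in the loss.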

Note: for reference, check SimCLR: https://arxiv.org/pdf/2002.05709.pdf


<br>

# Hinge loss
---