# NYU Deep Learning Spring 2021 - 04: Joint embedding method and Latent Variable Energy Based Models

## From Predictive systems to multi-output systems

Consider the following change of paradigm:

<img src="material/LV_EBMs/predictive_multiout.svg">

We want to move from the classical single-output feed-forward architecture, used so far for classification tasks, to a more general type of model that can return a range of feasible outputs.

For example, in an Italian-to-French translation task, a single input sentence may have multiple correct translations. Each feasible translation complies with a set of constraints that are intrinsically defined by the grammar.

This can be described as **inference through constraint satisfaction**:

- A feed-forward model is an explicit function that computes $y$ from $x$
- A constraint-based model is an implicit function that captures the dependencies between $x$ and $y$

## Energy Based Models and energy function

The model tackling the translation task above must be able to represent multiple values of $y$. To do this, we define the model $F(x, y)$ as an **energy function** that outputs a single scalar measuring the incompatibility between $x$ and $y$. For example, given an Italian sentence and a proposed French translation, the system outputs a score that tells how good the proposed translation is (the lower the energy, the better the match).

If we define the **energy function** $F(x,y)$ as a scalar-valued function:
- $F(x,y)$ takes *low values* when $y$ is compatible with $x$ and *higher values* when $y$ is less compatible with $x$ (just like the physical case of the loaded/unloaded spring)
- The **inference** task is to find values of $y$ that make $F(x,y)$ *small*. There may be multiple solutions to this task. Formally: $\hat{y}=\text{argmin}_yF(x,y)$
- If $y$ is continuous, $F(x,y)$ should be smooth and differentiable, so that we can use gradient-based inference algorithms
- Inference is now defined NOT as forward propagation but as an optimization task
- NOTICE: the concept of energy is used for inference, not for learning
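
As a minimal sketch of inference as optimization, assume a hypothetical toy energy $F(x,y)=(y^2-x)^2$ (not part of the lecture), which for $x>0$ has two equally good answers $y=\pm\sqrt{x}$. Gradient descent on $y$ from different starting points recovers different compatible outputs:

```python
def energy(x, y):
    # Hypothetical toy energy: low when y**2 is close to x
    return (y * y - x) ** 2

def grad_y(x, y):
    # Analytic derivative of the toy energy with respect to y
    return 4.0 * y * (y * y - x)

def infer(x, y_init, lr=0.01, steps=2000):
    # Inference = minimizing the energy over y, starting from an initial guess
    y = y_init
    for _ in range(steps):
        y -= lr * grad_y(x, y)
    return y

y_a = infer(4.0, y_init=1.0)    # converges towards +2
y_b = infer(4.0, y_init=-1.0)   # converges towards -2
print(y_a, y_b, energy(4.0, y_a), energy(4.0, y_b))
```

Both results have (near) zero energy, which is exactly the situation a single-output feed-forward model cannot represent.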

#### Conditional vs Unconditional EBM:

The case presented above falls into the category of **Conditional EBM** where: $\hat{y}=\text{argmin}_yF(x,y)$

Another possible case is called **Unconditional EBM** where: $\hat{y}=\text{argmin}_yF(y)$, in this case:

- We measure the compatibility between the components of $y$
- We don't know in advance which part of $y$ is known and which one is unknown
- There is no observed part $x$

## Energy Based Models vs Probabilistic Models
A probabilistic model is based on probability theory, on the premise that randomness plays a role in predicting future events (as opposed to deterministic models). Such models incorporate random variables and probability distributions into the description of an event or phenomenon. While a deterministic model gives a single possible outcome for an event, a probabilistic model gives a probability distribution as a solution.

#### Idea: Probabilistic models are a special case of EBM
We can see energies as unnormalized negative log probabilities. EBMs are preferred because they give more flexibility in the choice of the scoring function and in the choice of the objective function for learning.

To go from energy to a probability distribution we use the **Gibbs-Boltzmann distribution**:

$$P(y|x)=\frac{e^{-\beta F(x,y)}}{\int_{y'}e^{-\beta F(x,y')}}$$

- $\beta$ is a positive constant
- $F(x,y)$ is the energy, measuring the incompatibility between $x$ and $y$
- $P(y|x)$ is strictly positive and takes high values when $F(x,y)$ is low
- The denominator is the normalization term, integrating over all values $y'$ of $y$
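
A minimal sketch of the Gibbs-Boltzmann conversion, assuming a finite set of candidate outputs so that the integral becomes a sum (the energy values are made up for illustration):

```python
import math

def gibbs(energies, beta=1.0):
    # Subtract the minimum energy for numerical stability; the ratio is unchanged
    e_min = min(energies)
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    partition = sum(weights)  # discrete stand-in for the integral over y'
    return [w / partition for w in weights]

energies = [0.1, 1.0, 3.0]            # hypothetical F(x, y) for three candidates y
p_soft = gibbs(energies, beta=1.0)
p_sharp = gibbs(energies, beta=10.0)  # larger beta concentrates mass on the minimum
print(p_soft)
print(p_sharp)
```

The lowest-energy candidate always gets the highest probability, and increasing $\beta$ makes the distribution more peaked around it.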

# Joint Embedding Architectures

<img src="material/LV_EBMs/joint_embs.svg">

- The architecture has two neural nets that may or may not be identical; if they are identical (sharing the same weights), this is called a **Siamese network**
- The two networks compute two vectors as representations of the two inputs
- The energy function computes a distance or divergence of some kind between the two vectors
- If the two vectors are "close", then the energy is "low"
- The neural net that looks at $y$ may be invariant to certain transformations of $y$ (rotation, luminance, etc.), so that its output changes little under them. This means that, given $x$, there will be multiple $y$ with low energy (their representations will be "close" to that of $x$).
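
To make the invariance argument concrete, here is a toy sketch in which the two "networks" are replaced by hand-written, hypothetical encoders (an assumption for illustration only). The encoder for $y$ returns the norm of a 2D point, so it is invariant to rotations of $y$, and the energy is the squared distance between the two embeddings:

```python
import math

def enc_x(x):
    # Hypothetical encoder for x: identity embedding of a scalar
    return [x]

def enc_y(y):
    # Hypothetical encoder for y: the norm, which is invariant to rotations of y
    return [math.hypot(y[0], y[1])]

def energy(x, y):
    # Squared Euclidean distance between the two embeddings
    return sum((a - b) ** 2 for a, b in zip(enc_x(x), enc_y(y)))

x = 2.0
rotated = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in (0.0, 1.0, 2.5)]
low = [energy(x, y) for y in rotated]   # rotated copies: all near zero energy
high = energy(x, (0.5, 0.0))            # a point with a different norm: higher energy
print(low, high)
```

Every rotated copy of $y$ maps to the same embedding, so a single $x$ is compatible with a whole family of outputs.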

# Latent Variables Architecture
<img src="material/LV_EBMs/latent_variables.svg">

- We want the system to be able to produce different predictions $\bar{y}$ given $x$
- The set of possible predictions (the ribbon surface) is parametrized by a latent variable of unknown value
- The latent variable $z$ varies within a set (in this example, a rectangle)
- Varying the value of $z$ moves the prediction along the ribbon surface
- To do inference, we take an $x$ and a proposed $y$ and find the value of $z$ that minimizes the energy

#### Latent Variable: concept

Ideally the latent variable represents **independent explanatory factors of variation** of the prediction.
In statistics, latent variables are variables that are not directly observed but are rather inferred from other variables that are observed.

E.g., we want to recognize a face in a picture using a 3D model:
- $x$ is a 3D model
- $y$ is the target photo
- The 3D model needs to be translated and rotated to overlap the target 2D picture and see whether they match
- The pose (rotation and translation) that aligns the 3D model with the target face can be thought of as the latent variable, while the process of rotating and translating the model until it overlaps can be thought of as the energy minimization process

Formally we can state the latent variables approach as a simultaneous energy minimization problem with respect to $y$ and $z$:

$$\hat{y},\hat{z}=\text{argmin}_{y,z}E(x,y,z)$$

### Redefinition of F(x,y)

We define an energy model starting from a latent variable model. Its energy is internally minimized (or marginalized) with respect to $z$, so $F$ depends only on $x$ and $y$ and can be seen as the system's **free energy**.

$$F_{\infty}(x,y)=\min_zE(x,y,z)$$

$$F_{\beta}(x,y)=-\frac{1}{\beta}\log\int_{z}e^{-\beta E(x,y,z)}$$

$$\hat{y}=\text{argmin}_yF(x,y)$$
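
The two definitions of the free energy can be compared numerically. The sketch below assumes a hypothetical toy energy with no $x$ and a scalar $z$ discretized on a grid, so the minimum becomes an exhaustive search and the integral a Riemann sum:

```python
import math

def energy(y, z):
    # Hypothetical toy energy: squared distance between y and the point (cos z, sin z)
    return (y[0] - math.cos(z)) ** 2 + (y[1] - math.sin(z)) ** 2

def free_energy_zero_temp(y, zs):
    # Minimum of E over the grid of z values
    return min(energy(y, z) for z in zs)

def free_energy(y, zs, beta):
    # -1/beta * log of a Riemann sum approximating the integral over z
    dz = zs[1] - zs[0]
    es = [energy(y, z) for z in zs]
    e_min = min(es)  # factor out exp(-beta * e_min) for numerical stability
    total = sum(math.exp(-beta * (e - e_min)) for e in es) * dz
    return e_min - math.log(total) / beta

zs = [2 * math.pi * k / 200 for k in range(200)]
y = (1.5, 0.0)
f_inf = free_energy_zero_temp(y, zs)
f_1 = free_energy(y, zs, beta=1.0)
f_100 = free_energy(y, zs, beta=100.0)
print(f_inf, f_1, f_100)
```

As $\beta$ grows, the marginalized free energy approaches the minimized one, which is why the latter carries the subscript $\infty$.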

## Limiting the information capacity of the latent variable

Imagine that the latent variable $z$ has the same dimension as the desired output $y$. Then, for every pair $x$ and $y$, there may always be a $z$ for which the energy is 0 (the $\bar{y}$ coming out of the decoder is exactly equal to $y$), meaning that the energy $F(x,y)$ is a flat function.

This happens when the latent variable $z$ has too much information capacity (for example, too high a dimension). To avoid this (and be able to train energy-based models), we need to introduce a regularizer.
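
A small sketch of this collapse, under the assumption of two hypothetical decoders (neither is from the lecture): an identity decoder with $z\in\mathbb{R}^2$, which can reproduce any $y$, and a decoder with a scalar $z$ that can only reach the unit circle:

```python
import math

def free_energy_full_capacity(y):
    # Hypothetical decoder g(z) = z with z in R^2: choosing z = y always gives zero energy
    z = y
    return (y[0] - z[0]) ** 2 + (y[1] - z[1]) ** 2

def free_energy_limited_capacity(y, n=3600):
    # Hypothetical decoder g(z) = (cos z, sin z) with scalar z: only the unit circle is reachable
    zs = (2 * math.pi * k / n for k in range(n))
    return min((y[0] - math.cos(z)) ** 2 + (y[1] - math.sin(z)) ** 2 for z in zs)

points = [(1.0, 0.0), (3.0, 0.0), (0.0, -0.2)]
flat = [free_energy_full_capacity(y) for y in points]       # zero everywhere
shaped = [free_energy_limited_capacity(y) for y in points]  # zero only on the circle
print(flat, shaped)
```

With full capacity the free energy carries no information about which $y$ are plausible, while the limited latent variable yields low energy only near the reachable manifold.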

# Training EBMs: overview

**main concept**: push down on the energy of datapoints while making sure the energy is higher elsewhere

Conceptually what we do during training is:
- Parametrize $F(x,y)$
- Select training sample pairs $x[i]$ and $y[i]$
- Shape $F(x,y)$ so that $F(x[i],y[i])$ is strictly smaller than $F(x[i],y)$ for all $y$ different from $y[i]$
- Keep $F$ smooth (maximum-likelihood probabilistic methods tend to break this)

There are two classes of learning methods: Contrastive and Regularized/Architectural methods.

## Contrastive methods
**main concept**: push down on $F(x[i],y[i])$ and push up on other points $F(x[i], y')$

The main difference among these methods is how to pick the points $y'$ to push up:
- Push down on the energy of data points, push up everywhere else: Max likelihood (needs tractable partition function or variational approximation)
- Push down on the energy of data points, push up on chosen locations: Max likelihood with MC/MCMC/HMC, Contrastive divergence, Metric Learning/Siamese nets, Ratio Matching, Noise Contrastive Estimation, Min Probability Flow, Adversarial Generator/GANs
- Train a function that maps points off the data manifold to points on the data manifold: denoising auto-encoder, masked auto-encoder (e.g. BERT)
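
The push-down/push-up idea can be sketched with a margin (hinge) loss on a one-parameter energy. Everything here is an illustrative assumption (the energy $F_w(y)=(y-w)^2$, the margin, and the hand-picked contrastive points); it is not a faithful implementation of any specific method listed above:

```python
def energy(w, y):
    # Hypothetical one-parameter energy: low when y is close to w
    return (y - w) ** 2

def hinge_step(w, y_data, y_contrast, margin=0.5, lr=0.05):
    # Margin loss: max(0, margin + F(y_data) - F(y_contrast))
    loss = margin + energy(w, y_data) - energy(w, y_contrast)
    if loss <= 0:
        return w  # margin already satisfied, no update
    # d/dw [F(y_data) - F(y_contrast)] = -2*(y_data - w) + 2*(y_contrast - w)
    grad = 2.0 * (y_contrast - y_data)
    return w - lr * grad

w = 3.0                               # initial parameter, far from the data
y_data, contrastive = 1.0, [0.0, 2.0]  # one data point, two points to push up
for _ in range(100):
    for y_c in contrastive:
        w = hinge_step(w, y_data, y_c)
print(w, energy(w, y_data), [energy(w, y_c) for y_c in contrastive])
```

Training stops updating once the data point's energy is lower than that of each contrastive point by at least the margin.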

## Regularized/Architectural methods
**main concept**: build $F(x,y)$ so that the volume of low energy regions is limited or minimized through regularization

The main difference among these methods is how we limit the information capacity of the latent representation (how we build the regularizer):
- Build the machine so that the volume of low-energy space is bounded: PCA, K-means, Gaussian Mixture Models, Square ICA, normalizing flows...
- Use a regularization term that measures the volume of space that has low energy: Sparse Coding, Sparse Auto-Encoder, LISTA, Variational Auto-Encoders, discretization/VQ/VQVAE
- $F(x,y)=C(y,G(x,y))$, make $G(x,y)$ as "constant" as possible with respect to $y$: Contractive auto-encoder, saturating auto-encoder
- Minimize the gradient and maximize the curvature around data points: score matching

# LAB: Inference for Latent Variable Energy-Based Models (LV-EBMs)
## Unconditional Case (no labels)

original: https://github.com/Atcold/pytorch-Deep-Learning/blob/master/slides/12%20-%20EBM.pdf

#### Training Samples:
- $x$ observable variable (input)
- $\theta$ not observable
- $\epsilon$ is a noise term
- $\alpha = 1.5$
- $\beta = 2$

$y = \begin{bmatrix} \rho_1(x)\cos(\theta)+\epsilon \\ \rho_2(x)\sin(\theta)+\epsilon \end{bmatrix}$

$\rho:\mathbb{R}\rightarrow\mathbb{R}^2$

$x \mapsto \begin{bmatrix} \alpha x + \beta(1-x) \\ \beta x + \alpha(1-x) \end{bmatrix} \cdot \exp(2x)$

$x \sim U(0,1)$

$\theta \sim U(0,2\pi)$

$\epsilon \sim \mathcal{N}\left(0,\left(\frac{1}{20}\right)^2\right)$

It's basically an ellipse that expands with $x$, following an exponential envelope.

<img src="material/LV_EBMs/manifolds.png">

### Question: why do we need Energy models for this?

For every $x$ we can get an infinite number of $y$ values, all of them lying on an ellipse. The approach we normally use in a classification problem with, for example, MLPs, that is asking "*which is the correct value of $y$ given input $x$?*", does NOT make sense anymore, since there are multiple valid values of $y$ for a single $x$.

### Simplification: we set x=0

$$y = \begin{bmatrix} 2\cdot\cos(\theta)+\epsilon \\ 1.5\cdot\sin(\theta)+\epsilon \end{bmatrix}$$

We collect 24 samples from the ellipse to form our training set. There is no $x$, meaning that this is an unconditional and unsupervised case.
<img src="material/LV_EBMs/ellipse.png">
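
The sampling procedure can be sketched as follows. The random seed and the choice of drawing independent noise for each component of $y$ are assumptions (the formula writes a single $\epsilon$ symbol):

```python
import math
import random

ALPHA, BETA = 1.5, 2.0

def rho(x):
    # rho(x) = [alpha*x + beta*(1-x), beta*x + alpha*(1-x)] * exp(2x)
    scale = math.exp(2 * x)
    return ((ALPHA * x + BETA * (1 - x)) * scale,
            (BETA * x + ALPHA * (1 - x)) * scale)

def sample_y(x, rng):
    theta = rng.uniform(0, 2 * math.pi)   # unobserved angle
    r1, r2 = rho(x)
    # Assumption: independent Gaussian noise (std 1/20) on each component
    return (r1 * math.cos(theta) + rng.gauss(0, 1 / 20),
            r2 * math.sin(theta) + rng.gauss(0, 1 / 20))

rng = random.Random(0)                    # fixed seed, only for reproducibility
samples = [sample_y(0.0, rng) for _ in range(24)]
print(rho(0.0), samples[:3])
```

With $x=0$ the semi-axes returned by `rho` are 2 and 1.5, matching the simplified formula above.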

### Untrained manifold

The untrained manifold is the manifold generated by the decoder before training. Once the model is properly trained, it should resemble the target ellipse.

We introduce a latent variable $z$ that replaces the unobservable $\theta$:

$$z = [0:\frac{\pi}{24}:2\pi[$$

Sweeping $z$ across this range and feeding its values to a **decoder** gives us predicted values $\tilde{y}$ lying on an ellipse. From this example we can see that the model is untrained: if it were trained, the purple dots would lie on the ellipse manifold defined by the observed blue dots.

<img src="material/LV_EBMs/untrained_manifold.png">

### Energy function

The energy function $E$ is a function of the observed variable $y$ and of the latent variable $z$, defined as the sum of the squared Euclidean distance between $y_1$ and $\tilde{y}_1=g_1(z)$ and the squared Euclidean distance between $y_2$ and $\tilde{y}_2=g_2(z)$:

$$E(y,z)=[y_1-g_1(z)]^2+[y_2-g_2(z)]^2,\quad y\in Y$$

$E(y,z)$ is computed for each observation

<img src="material/LV_EBMs/example_energy_f.svg">

The **decoder**, which is the "smart" component of this system, is defined as:

$$g=[g_1,g_2]^T:\mathbb{R}\rightarrow\mathbb{R}^2$$

The decoder maps $z$ to the following two expressions, with two parameters $\omega_1$ and $\omega_2$; varying $\omega_1$ and $\omega_2$ yields different ellipses.

$$z\mapsto[\omega_1\cos(z)\quad\omega_2\sin(z)]^T$$

PLEASE NOTICE: this is a toy example. In real cases the decoder $g$ would be a neural network able to represent arbitrarily complex manifolds and to learn an energy function $E$ that is efficient enough for the given case.
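
The toy decoder and its energy can be written down directly. The default weights $\omega_1=\omega_2=1$ are placeholder "untrained" values chosen for illustration, not values taken from the lab:

```python
import math

def decoder(z, w1=1.0, w2=1.0):
    # g(z) = [w1 * cos(z), w2 * sin(z)]; default weights are placeholder "untrained" values
    return (w1 * math.cos(z), w2 * math.sin(z))

def energy(y, z, w1=1.0, w2=1.0):
    # E(y, z) = [y1 - g1(z)]^2 + [y2 - g2(z)]^2
    g1, g2 = decoder(z, w1, w2)
    return (y[0] - g1) ** 2 + (y[1] - g2) ** 2

y = (2.0, 0.0)  # a noise-free point of the target ellipse (theta = 0)
e_untrained = energy(y, 0.0)                 # placeholder weights: nonzero energy
e_matched = energy(y, 0.0, w1=2.0, w2=1.5)   # weights matching the ellipse: zero energy
print(e_untrained, e_matched)
```

When the weights match the semi-axes of the data ellipse, some $z$ reproduces each noise-free observation exactly and the energy vanishes there.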

### Energy functions shape

We will have as many energy curves as samples in our observation set (24 $y$ samples in this case).

For each $y$ we vary $z$ from 0 to $2\pi$ (sampling 24 points).

<img src="material/LV_EBMs/energies.png">
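
A sketch of how such a family of energy curves could be computed. It assumes noise-free, evenly spaced stand-ins for the 24 observations and the placeholder decoder weights $\omega_1=\omega_2=1$:

```python
import math

def energy(y, z, w1=1.0, w2=1.0):
    # E(y, z) for the toy decoder g(z) = [w1 * cos(z), w2 * sin(z)]
    return (y[0] - w1 * math.cos(z)) ** 2 + (y[1] - w2 * math.sin(z)) ** 2

n = 24
thetas = [2 * math.pi * k / n for k in range(n)]
# Noise-free stand-ins for the 24 observations (the real samples include noise)
ys = [(2.0 * math.cos(t), 1.5 * math.sin(t)) for t in thetas]
zs = [2 * math.pi * k / n for k in range(n)]  # 24 values of z in [0, 2*pi)

# energies[i][j] = E(ys[i], zs[j]): one energy curve per observation
energies = [[energy(y, z) for z in zs] for y in ys]
best_z = [zs[row.index(min(row))] for row in energies]
print(best_z[:4])
```

Each row is one curve of the plot, and `best_z` holds the grid value of $z$ at which that curve attains its minimum.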

### Free Energy: 0-Temperature limit

We can now define the 0-Temperature limit of the free energy of our energy-based model as the minimum value that the energy function $E(y,z)$ takes with respect to the latent variable $z$:

$$F_\infty(y) = \min_z E(y,z) = E(y,\check{z})$$

We can then write the value of $z$ that minimizes $E(y,z)$ as:

$$\check{z}=\arg\min_z E(y,z)$$

**idea:** if we imagine a comparison with a thermodynamic system, where the temperature $T$ is proportional to the average kinetic energy of all the particles, then the region with the lowest energy of such a system corresponds to its "coldest" region. If the energy is 0, then all the particles are frozen, meaning that we are at absolute zero (0 Kelvin).

There are many ways to find $\check{z}$ such as exhaustive search, conjugate gradient, line search, LBFGS...

Let's now consider the 0-Temperature free energy for $y'=Y[10]$:

<img src="material/LV_EBMs/e10.png" style="width: 100px;">

Given an initial guess $\tilde{z}$, we perform gradient descent (NON-STOCHASTIC) until we reach $\check{z}$

<img src="material/LV_EBMs/e10gd.png">

In the $y_1,y_2$ space, this translates into starting at a guess location $g(\tilde{z})$ on the manifold and then, as gradient descent runs, travelling along the manifold until we reach the point $g(\check{z})$, which is the point of the manifold closest to the observed data point $y'$

<img src="material/LV_EBMs/e10gd2.png">
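
A minimal sketch of this gradient descent on $z$, using an analytic gradient of the toy energy. The observation `y_prime`, the starting guess, the learning rate, and the placeholder weights $\omega_1=\omega_2=1$ are all assumptions chosen for illustration:

```python
import math

def energy(y, z, w1=1.0, w2=1.0):
    return (y[0] - w1 * math.cos(z)) ** 2 + (y[1] - w2 * math.sin(z)) ** 2

def grad_z(y, z, w1=1.0, w2=1.0):
    # dE/dz obtained by differentiating the expression above
    return (2.0 * (y[0] - w1 * math.cos(z)) * w1 * math.sin(z)
            - 2.0 * (y[1] - w2 * math.sin(z)) * w2 * math.cos(z))

def find_z_check(y, z_init, lr=0.1, steps=500):
    z = z_init
    for _ in range(steps):
        z -= lr * grad_z(y, z)   # deterministic gradient step on a single observation
    return z

y_prime = (1.2, 1.6)             # hypothetical observation, at distance 2 from the origin
z_check = find_z_check(y_prime, z_init=3.0)
print(z_check, energy(y_prime, z_check))
```

For the unit-circle decoder the closest point lies in the direction of `y_prime`, so `z_check` ends up at the angle of the observation and the remaining energy is the squared gap between the observation and the manifold.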

If we compute the free energy for every point in the $y_1,y_2$ space, we get the following:

<img src="material/LV_EBMs/free_energy_manifold.png">

The blue area is the energy sink corresponding to our decoder manifold. If the model were properly trained, the blue energy sink would correspond exactly to the ellipse manifold generated by the observed $y$ values (blue dots).
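
A free-energy map like this one could be approximated by exhaustive search over $z$ at every grid point. The grid extent, its resolution, and the placeholder weights $\omega_1=\omega_2=1$ are assumptions:

```python
import math

def free_energy(y, zs, w1=1.0, w2=1.0):
    # F_inf(y) = min over z of E(y, z), approximated by exhaustive search over zs
    return min((y[0] - w1 * math.cos(z)) ** 2 + (y[1] - w2 * math.sin(z)) ** 2
               for z in zs)

zs = [k * math.pi / 24 for k in range(48)]      # z = [0 : pi/24 : 2*pi[
axis = [-3.0 + 0.5 * k for k in range(13)]      # grid values from -3 to 3
# grid[i][j] = F_inf at (y1, y2) = (axis[j], axis[i])
grid = [[free_energy((y1, y2), zs) for y1 in axis] for y2 in axis]

print(grid[6][8])   # at (1, 0), a point on the decoder manifold
print(grid[6][6])   # at (0, 0), the centre of the manifold
```

The zero-valued cells trace the decoder manifold (the energy sink), and the free energy grows with the distance from it.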

**idea:** Training an EBM means moving and reshaping the energy manifold in an appropriate way, while doing inference involves computing the energy for a given observation to see if it's a good match to our target.