# NYU Deep Learning Spring 2021 - 05: Latent Variable Energy Based Models for structured prediction

### Recap from previous lesson - Training concepts:
#### Contrastive Method
- Push down on the energy of training samples
- Push up on the energy of other "well chosen" points (e.g. to avoid creating canyons)

**General Contrastive Loss formula**:

$$\mathcal{L}(x_1 \dots x_{p^+}, y_1 \dots y_{p^+}, \hat{y}_1 \dots \hat{y}_{p^-}, \omega) =
H(E(x_1,y_1), \dots, E(x_{p^+}, y_{p^+}), E(x_1,\hat{y}_1), \dots, E(x_{p^-}, \hat{y}_{p^-}), M(Y_{1 \dots p^+}, \hat{Y}_{1 \dots p^-}))$$

### Margin Loss

Push down on the energy of data points, push up on the energy of other points (well chosen contrastive points).

Note: margin losses consider only one "good guy / bad guy" pair at a time.

**General Margin Loss**:
$$\mathcal{L}(x,y,\hat{y},\omega)=H(F_\omega(x,y), F_\omega(x,\hat{y}), m(y,\hat{y}))$$

where, writing $F^+ = F_\omega(x,y)$ and $F^- = F_\omega(x,\hat{y})$:
- $H(F^+,F^-,m)$ is a strictly increasing function of $F^+$
- $H(F^+,F^-,m)$ is a strictly decreasing function of $F^-$
- both conditions hold at least whenever $F^{-}-F^{+}<m$ (with the margin $m$ positive)

Main concept: when we minimize $\mathcal{L}$ the energy of the "bad guys" $F_\omega(x,\hat{y})$ will be greater than the energy of the "good guys" $F_\omega(x,y)$ by at least $m$. 

#### Examples:
- **Simple** [Bromley 1993] 
$$\mathcal{L}(x,y,\hat{y},\omega)=[F_\omega(x,y)]^+ + [m(y,\hat{y})-F_\omega(x,\hat{y})]^+$$
Explicitly tries to push the energy of the good guys down to 0 and the energy of the bad guys up to at least $m$

- **Hinge pair loss** (Triplet Loss) [Altun 2003], **Ranking loss** [Weston 2010]:
$$\mathcal{L}(x,y,\hat{y},\omega)=[F_\omega(x,y) - F_\omega(x,\hat{y}) + m(y,\hat{y})]^+$$
Only tries to make the energy of the bad guys exceed the energy of the good guys by at least the margin $m$; the absolute energy values are not constrained

- **Square-Square** [Chopra CVPR 2005]
$$\mathcal{L}(x,y,\hat{y},\omega)=([F_\omega(x,y)]^+)^2 + ([m(y,\hat{y})-F_\omega(x,\hat{y})]^+)^2$$
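
The three example losses above can be compared numerically. The following is a minimal sketch in plain Python, assuming the energies $F_\omega(x,y)$ and $F_\omega(x,\hat{y})$ have already been computed and are passed in as numbers; the function names and the sample values are illustrative and not from the lecture.

```python
def pos_part(z):
    """Positive part [z]^+ = max(0, z)."""
    return max(0.0, z)

def simple_loss(f_pos, f_neg, m):
    """Simple loss: [F+]^+ + [m - F-]^+."""
    return pos_part(f_pos) + pos_part(m - f_neg)

def hinge_pair_loss(f_pos, f_neg, m):
    """Hinge pair (triplet / ranking) loss: [F+ - F- + m]^+."""
    return pos_part(f_pos - f_neg + m)

def square_square_loss(f_pos, f_neg, m):
    """Square-square loss: ([F+]^+)^2 + ([m - F-]^+)^2."""
    return pos_part(f_pos) ** 2 + pos_part(m - f_neg) ** 2

# Hypothetical energies: the bad guy is already exactly m above the good guy.
f_pos, f_neg, m = 0.5, 1.5, 1.0
print(simple_loss(f_pos, f_neg, m))         # 0.5
print(hinge_pair_loss(f_pos, f_neg, m))     # 0.0
print(square_square_loss(f_pos, f_neg, m))  # 0.25
```

With these values the hinge pair loss is already zero because the margin is met, while the simple and square-square losses still penalize the good guy for having non-zero energy.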

### Generalised Additive Margin Loss

Instead of applying the margin to a single pair of samples, compute the margin loss of a "good guy" against a whole set of "bad guys" and sum the contributions:
$$\mathcal{L}(x,y,\omega)=\sum_{\hat{y}\in Y}H(F_\omega(x,y), F_\omega(x,\hat{y}),m(y,\hat{y}))$$

### InfoNCE, Contrastive Predictive Coding

- Used a lot in Siamese nets and joint embedding architectures
- Margin is implicit and infinite
- Contrastive samples compete for gradient

$$\mathcal{L}(x,y,\hat{y}_1,\dots,\hat{y}_{p^-},\omega)=-\log\frac{e^{-E_\omega(x,y)}}{e^{-E_\omega(x,y)}+\sum_{i=1}^{p^-}e^{-E_\omega(x,\hat{y}_i)}}$$

- The energies of the "good guy" and of all the "bad guys" are put into a softmax
- If all "bad guys" have very high energy, $\sum_{i=1}^{p^-}e^{-E_\omega(x,\hat{y}_i)}\approx 0$ and therefore $\mathcal{L}\approx 0$
- If one "bad guy" has low energy, the softmax gives it almost all of the gradient, so its energy is pushed up very hard, while the other, high-energy "bad guys" are barely affected
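
The competition for gradient can be checked with a short numerical sketch. The code below is plain Python, written under the assumption that the energies are given as numbers; it uses the identity $\mathcal{L} = E_\omega(x,y) + \log Z$, where $Z$ is the denominator of the loss, so that $\partial\mathcal{L}/\partial E_\omega(x,\hat{y}_i) = -e^{-E_\omega(x,\hat{y}_i)}/Z$. The function names and the sample energies are illustrative only.

```python
import math

def info_nce(e_pos, e_negs):
    """Return (loss, probs), where probs is the softmax over negated energies.

    probs[0] belongs to the good guy, probs[1:] to the bad guys.
    """
    scores = [-e_pos] + [-e for e in e_negs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    probs = [v / z for v in exps]
    return -math.log(probs[0]), probs

def info_nce_grads(e_pos, e_negs):
    """Gradient of the loss with respect to each energy.

    dL/dE_pos = 1 - probs[0]; dL/dE_neg_i = -probs[i].
    Gradient descent therefore raises each bad guy's energy
    in proportion to its softmax weight.
    """
    _, probs = info_nce(e_pos, e_negs)
    return 1.0 - probs[0], [-p for p in probs[1:]]

# Hypothetical energies: one low-energy bad guy and two high-energy ones.
grad_pos, grad_negs = info_nce_grads(0.0, [0.5, 10.0, 12.0])
print(grad_pos, grad_negs)

# When every bad guy has very high energy the loss is close to zero.
loss_easy, _ = info_nce(0.0, [20.0, 25.0])
print(loss_easy)
```

In this example the bad guy with energy 0.5 receives a gradient several orders of magnitude larger than the two high-energy bad guys, which matches the behaviour described in the list above.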