In [1]:
from collections import *
import math
import numpy as np
import matplotlib.pyplot as plt
from typing import *

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

<br>

# Motivation: Density Estimation
---

Say we have a collection of data points $X$, where each row is a data point $x_i$. We would like to model the data-generating process that produced these points. This is called **density estimation**.

Density estimation is useful whenever another model needs a probability distribution as its input. For instance, if you want to estimate how many people take the bus on average, you are interested in modeling the probability that a person arrives at the bus station. To do so, you might fit a distribution to data that records the comings and goings at the bus stations.

<br>

### Latent variables

Some of those distributions can be more easily described by introducing **latent variables, i.e. unobserved data that underlies the distribution of X**. For instance, when modeling the arrival of people at the bus station, we can easily imagine two distributions, based on the type of passenger:

* People $z_1 = (1\;0)^T$ who commute to or from work
* People $z_2 = (0\;1)^T$ who are out for other business

Let us call $Z$ the random variable that represents the type of passenger. We could try to model one probability distribution for each of the two classes of people. Our overall probability distribution is then the mixture (composition) of both:

&emsp; $p(x) = \sum_{z \in Z} p(x,z) = \sum_{z \in Z} p(x|z) p(z)$

Here $p(x|z)$ can have a different shape, or even be a different type of distribution, depending on the value of $z$.
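
As a minimal sketch of the mixture formula above, the snippet below evaluates $p(x) = \sum_z p(x|z) p(z)$ for the bus example. It assumes that both conditionals $p(x|z)$ are 1-D Gaussians over the arrival time, which is only one possible choice; the prior, means and standard deviations are hypothetical values picked for illustration, and `gaussian_pdf` and `mixture_pdf` are helper names introduced here.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical numbers, for illustration only (arrival time in hours).
prior = np.array([0.7, 0.3])    # p(z): commuters, other passengers
means = np.array([8.0, 14.0])   # mean of p(x|z) for each value of z
stds = np.array([0.5, 3.0])     # standard deviation of p(x|z) for each value of z

def mixture_pdf(x):
    """p(x) = sum_z p(x|z) p(z), evaluated for each entry of x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Shape (n_points, n_components): one column per value of z.
    conditional = gaussian_pdf(x[:, None], means[None, :], stds[None, :])
    # Weighted sum over z.
    return conditional @ prior

grid = np.linspace(-20.0, 40.0, 60001)
density = mixture_pdf(grid)
# Riemann-sum approximation of the integral of p(x); it should be close to 1.
total_mass = float(np.sum(density) * (grid[1] - grid[0]))
```

Because the weights $p(z)$ sum to one and each $p(x|z)$ is a density, the mixture is itself a valid density, which the `total_mass` check illustrates numerically.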

<br>

### Why latent variables?

Latent variables can serve several purposes. One is simply **pure mathematical convenience**: it might be easier to model (i.e. approximate) a distribution as a mixture of several Gaussians than to try to find a custom function for it.

But these latent variables could also **represent some underlying data, or confounder**, that we do not have access to but would be interested in. For instance, we might suspect that, of the two clusters of people who use the bus, one is made of people who travel for business reasons. If we later get some information about a given category of passengers, we are able to better reproduce their behavior. For instance, in the case of the bus, if we can later correlate the probability of $z_1$ (travel for work) with the time of day, we are able to make better predictions.

Finally, latent variables can also be used for **data compression**. If we are able to show that some N-dimensional data is in fact generated by an underlying D-dimensional latent vector, with $D \ll N$, we can summarize our data using this D-dimensional vector, plus some noise if need be.
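
The compression idea can be illustrated with a small synthetic experiment. The sketch below assumes the simplest possible setting, a linear map from the latent space to the data space plus Gaussian noise, and uses a truncated SVD to recover a D-dimensional code; the sizes, the matrix `W` and the noise level are all hypothetical choices, not something the text above prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, N, D = 500, 20, 2    # hypothetical sizes, with D much smaller than N

Z = rng.normal(size=(n_points, D))    # latent vectors
W = rng.normal(size=(D, N))           # assumed linear map from latent to data space
X = Z @ W + 0.01 * rng.normal(size=(n_points, N))   # observed data = signal + small noise

# Singular values of the centered data: only D of them are large.
mean = X.mean(axis=0)
X_centered = X - mean
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep D numbers per data point, then reconstruct the N-dimensional data from them.
codes = X_centered @ Vt[:D].T    # shape (n_points, D)
X_rec = codes @ Vt[:D] + mean
relative_error = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
```

In this toy setting the reconstruction error is on the order of the noise, so storing `codes` (D numbers per point) together with `Vt[:D]` and `mean` loses very little information.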

<br>

### Difficulties of dealing with latent variables

Say we are interested in modeling the distribution of data $p(X|\theta)$, where $\theta$ are the parameters of our distribution, and we decompose this distribution as a mixture of distributions $p_z = p(x|z)$:

&emsp; $p(x|\theta) = \sum_z p(x|z,\theta) p(z|\theta)$

To model the distribution, we might follow the Maximum Likelihood approach, which consists of maximizing the likelihood of seeing the data $X$ with respect to the parameters $\theta$. If we assume that all rows $x_i$ of the *design matrix* $X$ are independent and identically distributed, the likelihood factorizes into a product over the $x_i$. The classical trick is then to maximize the logarithm of the likelihood instead, which has the same maximizer because the logarithm is increasing, and which turns the product into a sum:

&emsp; $\theta^* = \underset{\theta}{\operatorname{argmax}} p(X|\theta) = \underset{\theta}{\operatorname{argmax}} \log p(X|\theta) = \underset{\theta}{\operatorname{argmax}} \log \prod_i p(x_i|\theta) =\underset{\theta}{\operatorname{argmax}} \sum_i \log p(x_i|\theta)$
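
To make the maximum likelihood recipe concrete, here is a minimal sketch for a single 1-D Gaussian, where the maximizer of $\sum_i \log p(x_i|\theta)$ is known to be the sample mean and the (biased) sample standard deviation. The data are synthetic, the ground-truth values are hypothetical, and `gaussian_log_likelihood` is a helper name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 2.0, 1.5    # hypothetical ground truth
x = rng.normal(true_mean, true_std, size=5000)

def gaussian_log_likelihood(x, mean, std):
    """sum_i log p(x_i | mean, std) for a 1-D Gaussian."""
    per_point = -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2
    return float(np.sum(per_point))

# Closed-form maximum likelihood estimates.
mean_mle = x.mean()
std_mle = x.std()    # ddof=0, i.e. the biased estimator, which is the MLE

best = gaussian_log_likelihood(x, mean_mle, std_mle)
# Moving away from the estimates can only lower the log-likelihood.
perturbed = [
    gaussian_log_likelihood(x, mean_mle + dm, std_mle * ds)
    for dm in (-0.1, 0.1)
    for ds in (0.9, 1.1)
]
```

Every value in `perturbed` is smaller than `best`, which is what being the argmax means here.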

For exponential-family distributions, such as Gaussians, the logarithm makes it very easy to get a *closed form* solution. But for mixture distributions, even mixtures of exponential-family distributions, the sum inside the logarithm prevents us from getting any closed form solution:

&emsp; $\theta^* = \underset{\theta}{\operatorname{argmax}} \sum_i \log \sum_z p(x_i,z|\theta)$
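
Even without a closed form, the objective itself is easy to evaluate. The sketch below computes $\sum_i \log \sum_z p(x_i,z|\theta)$ for a mixture of 1-D Gaussians (an assumed choice of $p(x|z)$) using the log-sum-exp trick, which avoids the underflow that a direct evaluation suffers from for points far in the tails. The parameter values and the helper names `log_gaussian` and `mixture_log_likelihood` are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean, std):
    """log p(x | mean, std) for a 1-D Gaussian."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def mixture_log_likelihood(x, log_prior, means, stds):
    """sum_i log sum_z p(x_i, z | theta), computed stably."""
    # log p(x_i, z) = log p(x_i | z) + log p(z); shape (n_points, n_components).
    log_joint = log_gaussian(x[:, None], means[None, :], stds[None, :]) + log_prior[None, :]
    # Log-sum-exp over z: subtract the row maximum before exponentiating.
    row_max = log_joint.max(axis=1, keepdims=True)
    log_marginal = row_max[:, 0] + np.log(np.exp(log_joint - row_max).sum(axis=1))
    return float(log_marginal.sum())

x = np.array([7.5, 8.2, 13.0, 200.0])    # the last point is far in both tails
log_prior = np.log(np.array([0.7, 0.3]))
means = np.array([8.0, 14.0])
stds = np.array([0.5, 3.0])

stable = mixture_log_likelihood(x, log_prior, means, stds)

# Direct evaluation: exp() underflows to 0 for the outlier, so log() returns -inf.
with np.errstate(divide="ignore"):
    joint = np.exp(log_gaussian(x[:, None], means[None, :], stds[None, :]) + log_prior[None, :])
    naive = float(np.sum(np.log(joint.sum(axis=1))))
```

Here `stable` is finite while `naive` is `-inf`, which is why mixture log-likelihoods are usually computed in log space.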

Note that this does not prevent us from handling these latent variables with approximate optimization techniques, such as gradient descent on the negative log-likelihood.
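
As a rough sketch of that gradient-based route, the snippet below fits a two-component mixture of 1-D Gaussians by minimizing the average negative log-likelihood with PyTorch. The synthetic data, the initial values, the choice of the Adam optimizer, the learning rate and the number of steps are all assumptions made for illustration; the parametrization through logits and log standard deviations is one common way to keep the parameters valid without constraints.

```python
import math
import torch

torch.manual_seed(0)
# Synthetic data drawn from a hypothetical two-component mixture.
x = torch.cat([8.0 + 0.5 * torch.randn(700), 14.0 + 3.0 * torch.randn(300)])

# Unconstrained parameters: logits for p(z), means and log-stds for p(x|z).
logits = torch.zeros(2, requires_grad=True)
means = torch.tensor([5.0, 15.0], requires_grad=True)
log_stds = torch.zeros(2, requires_grad=True)

def negative_log_likelihood(x):
    """Average of -log sum_z p(x_i, z | theta) over the data points."""
    log_prior = torch.log_softmax(logits, dim=0)
    stds = log_stds.exp()
    # log p(x_i | z) for every point and component; shape (n_points, 2).
    log_cond = -0.5 * math.log(2.0 * math.pi) - log_stds - 0.5 * ((x[:, None] - means) / stds) ** 2
    return -torch.logsumexp(log_cond + log_prior, dim=1).mean()

optimizer = torch.optim.Adam([logits, means, log_stds], lr=0.05)
initial_loss = negative_log_likelihood(x).item()
for step in range(500):
    optimizer.zero_grad()
    loss = negative_log_likelihood(x)
    loss.backward()
    optimizer.step()
final_loss = negative_log_likelihood(x).item()

# Fitted mixture weights p(z), recovered from the logits.
weights = torch.softmax(logits, dim=0).detach()
```

This approach only finds a local optimum and depends on the initialization, which is part of the motivation for dedicated algorithms for latent variable models.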

<br>

# EM algorithm
---

TODOs - discrete variables:

* apply to a mixture of Gaussians
* GEM

<br>

# Continuous latent variables
---

* PCA
* Autoencoders for MNIST
* Can it work for other tasks such as classification or regression? Does it even make sense for $p(y|x)$?