### What are we trying to do?
Autoregressive models estimate a distribution $p_\theta(x)$ from samples $x^{(1)}, \ldots, x^{(n)}$. The name "autoregressive" comes from predicting each $x_i$ from the values that precede it, $x_{<i}$.

#### The Chain Rule 
Take any old distribution
$$p(x_1, \ldots, x_n)$$
The chain rule states
$$ p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i ~|~ x_1, \ldots, x_{i-1}) = \prod_{i=1}^n p(x_i ~|~ x_{< i})$$
This will be important when we count how many parameters it takes to hold every possible value of a joint distribution.
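As a quick numerical check of the chain rule, here is a minimal sketch using only the Python standard library. The joint table over three binary variables and the helper names (`marginal`, `conditional`, `chain_rule_product`) are hypothetical, chosen purely for illustration.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
# The eight weights are arbitrary illustrative values, normalized to sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {xs: w / total for xs, w in zip(product([0, 1], repeat=3), weights)}

def marginal(prefix):
    """p(x_1..x_k = prefix), found by summing the joint over the remaining variables."""
    k = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:k] == prefix)

def conditional(xs, i):
    """p(x_i | x_{<i}) for 0-based position i, as a ratio of two marginals."""
    return marginal(xs[: i + 1]) / marginal(xs[:i])

def chain_rule_product(xs):
    """Product of the conditionals p(x_i | x_{<i}) over all positions."""
    prob = 1.0
    for i in range(len(xs)):
        prob *= conditional(xs, i)
    return prob

# The product of conditionals recovers every entry of the joint table.
for xs, p in joint.items():
    assert abs(chain_rule_product(xs) - p) < 1e-12
print("chain rule holds for all", len(joint), "configurations")
```

The marginal of the empty prefix is the sum of the whole table, so the first factor reduces to $p(x_1)$ as expected.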

#### Curse of Dimensionality
Fitting a distribution means learning a distribution that best represents our collection of data points. One issue with fitting higher-dimensional distributions is representation. For example, compare holding the distribution of a single Bernoulli random variable in memory
$$
\begin{pmatrix}
p(x_1 = 1) & p(x_1 = 0)
\end{pmatrix}
$$
to holding the joint distribution of 2 Bernoulli random variables
$$
\begin{pmatrix}
p(x_2 = 1 | x_1 = 1) & p(x_2 = 1 | x_1 = 0)
& p(x_2 = 0 | x_1 = 1) & p(x_2 = 0 | x_1 = 0)
\end{pmatrix}
$$
Already the number of entries has doubled. With 3 Bernoulli random variables we get
$$\begin{pmatrix} p(x_3=0 | x_2=0, x_1=0) & p(x_3=0 | x_2=0, x_1=1) & p(x_3=0 | x_2=1, x_1=0) & p(x_3=0 | x_2=1, x_1=1)  & \ldots \end{pmatrix}$$
In this simple case with Bernoulli random variables, each additional random variable doubles the number of entries required, so $n$ variables need $2^n$ entries.


Now imagine we are trying to generate a simple 28 by 28 black and white image, where every pixel is a single binary variable. The memory complexity would be
$$2^{784} \approx 10^{236}$$
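The arithmetic can be checked directly, since Python integers have arbitrary precision. This is a small sketch; `table_entries` is a hypothetical helper name, not something from a library.

```python
import math

def table_entries(n):
    """Number of entries in a full table over n binary variables."""
    return 2 ** n

# Table size for a few small variable counts.
for n in (1, 2, 3, 10):
    print(n, "variables ->", table_entries(n), "entries")

# A 28 by 28 binary image has 784 binary variables.
pixels = 28 * 28
entries = table_entries(pixels)
print(pixels, "pixels ->", len(str(entries)), "decimal digits")
print("log10 of the entry count:", pixels * math.log10(2))
```

A number with 237 decimal digits lies between $10^{236}$ and $10^{237}$, which matches the estimate above.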

Also imagine this from a modeling perspective!
- We could never have all the possible training examples!
- Each image has its own parameter, so there is no generalization or extrapolation whatsoever!

#### Maximum Likelihood Estimate
The goal of MLE is to find the parameters $\theta$ that minimize the negative log likelihood (NLL), averaged over the training samples $x^{(i)}$
$$\frac{1}{n}\sum_{i=1}^n \text{NLL}({p_\theta(x^{(i)})}) $$
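To make the objective concrete, here is a minimal sketch that fits a single Bernoulli parameter by brute-force search over the average NLL. The data set, the parameter name `theta`, and the helper functions are all hypothetical. A grid search stands in for a real optimizer only to keep the example short.

```python
import math

# Hypothetical training set of binary observations.
samples = [1, 1, 0, 1]

def bernoulli_nll(x, theta):
    """Negative log likelihood of one binary observation x when p(x = 1) = theta."""
    return -(x * math.log(theta) + (1 - x) * math.log(1 - theta))

def average_nll(theta, data):
    """The MLE objective: mean NLL over the data set."""
    return sum(bernoulli_nll(x, theta) for x in data) / len(data)

# Candidate parameters on a grid, skipping the endpoints 0 and 1 where log diverges.
candidates = [k / 100 for k in range(1, 100)]
best_theta = min(candidates, key=lambda t: average_nll(t, samples))

print("theta minimizing the average NLL:", best_theta)
print("fraction of ones in the data:   ", sum(samples) / len(samples))
```

For this toy model the minimizer coincides with the fraction of ones in the data, which is the familiar closed-form MLE for a Bernoulli variable.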

#### Negative Log Likelihood Definition
If $p_\theta(x^{(i)}) = \hat{x}$, where $\hat{x}$ is the model output (the predicted probability that the target is 1) and $x \in \{0, 1\}$ is the target, then
$$\text{NLL}(p_\theta(x^{(i)})) = \text{NLL}(\hat{x}) = -\left( x \cdot \log(\hat{x}) + (1 - x) \cdot \log(1- \hat{x}) \right)$$

[Why do I need to add log to my nll in pytorch?](https://discuss.pytorch.org/t/why-there-is-no-log-operator-in-implementation-of-torch-nn-nllloss/16610/3)

In [1]:
import torch
import torch.nn.functional as F
import math

In [2]:
x_hat = torch.tensor([[.2, .8]])
x = torch.tensor([1])

F.nll_loss(torch.log(x_hat), x), -1 *(0 * math.log(.2) + 1 * math.log(.8))

(tensor(0.2231), 0.2231435513142097)

#### Cross Entropy
One thing we will see is that minimizing the NLL is the same as minimizing the cross entropy
$$ -\sum x \cdot \log(\hat{x}) = \text{Cross Entropy}$$

In [3]:
F.cross_entropy(x_hat, x), F.nll_loss(F.log_softmax(x_hat, dim=1), x)

(tensor(0.4375), tensor(0.4375))
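Note that this result (0.4375) differs from the earlier 0.2231 because `F.cross_entropy` reads its input as unnormalized scores (logits) and applies a log-softmax internally, whereas the earlier cell took the log of `x_hat` as if it already held probabilities. The following standard-library sketch reproduces both numbers by hand; the `log_softmax` helper is a hypothetical reimplementation for illustration, not PyTorch's own code.

```python
import math

values = [0.2, 0.8]   # the same numbers as x_hat
target = 1            # index of the true class

def log_softmax(scores):
    """Log of the softmax, computed with the max subtracted for numerical stability."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

# Reading the values as logits: normalize first, then take the NLL of the true class.
cross_entropy = -log_softmax(values)[target]

# Reading the values as probabilities: take the log directly.
nll_from_probs = -math.log(values[target])

print(round(cross_entropy, 4))
print(round(nll_from_probs, 4))
```

In both readings the loss is the negative log of the probability assigned to the true class. The only difference is whether a softmax is applied first.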