# PS2-1: Bayesian Interpretation of Regularization

## Introduction, notation, and terminology

#### Setup for Bayesian framework

We have $m$ points of data $X = \{x_1 , \dots , x_m\}$ which are *observations*.

Assume:

1. There is a family of probability distributions $p_\theta(x) = p(x|\theta)$ from which the data is drawn independently.
1. The parameters are themselves a random variable distributed according to a distribution $p(\theta)$.

The distribution of the random variable from which the samples are drawn is given by

$$p(x) = \int p(x | \theta) p(\theta) \, d\theta$$

After seeing the observations $X$, we update our belief about the distribution of $\theta$. The **posterior distribution** of the parameters is

$$p(\theta|X).$$

We then use this to update our model for the distribution of the random variable $x$. The **posterior predictive distribution** of the random variable $x$ is

$$p(x|X) = \int p(x|\theta)p(\theta|X) \, d\theta.$$

This follows from the law of total probability together with the assumption that the samples are independent given $\theta$, i.e. $p(x | X , \theta) = p(x | \theta)$, so

$$p(x|X) = \int p(x|X,\theta)p(\theta|X)\,d\theta = \int p(x|\theta)p(\theta|X) \, d\theta.$$

Summary of terms:

- **Model**: The joint distribution of all quantities, observed and unobserved, $p(x,\theta)$.
- **Prior distribution**: The marginal distribution of the (unobserved) parameters $\theta$, $p(\theta)$.
- **Sampling distribution**: The distribution the samples are drawn from, $p(x) = \int p(x|\theta)p(\theta)\,d\theta$.
- **Posterior distribution**: The distribution of the unobserved parameters *after* seeing the observed data, $p(\theta|X)$.


## Two examples of Bayesian analysis

#### Example 1: Modeling a coin toss with a uniform prior

Consider a repeated coin toss where samples are drawn iid from a Bernoulli distribution $p(H) = \phi$, $p(T) = 1 - \phi$. Suppose we have flipped a coin 5 times and gotten $X = \{H,H,T,H,T\}$ as results. Our prior assumption is that the Bernoulli parameter $\phi$ is uniform in $[0,1]$, so that $\phi$ has the density $p(\phi) = 1$. Note that before seeing any evidence, the sampling distribution is

$$p(H) = \int_0^1 p(H|\phi)p(\phi) \, d\phi = \int_0^1 \phi \, d\phi = \frac{1}{2} = p(T).$$

Now by Bayes' rule, the posterior distribution of $\phi$ is given by

$$p(\phi | X) \propto p(X | \phi) p(\phi) = \phi^3(1-\phi)^2, \quad \text{so} \quad p(\phi | X) = c\,\phi^3(1-\phi)^2,$$

where $c^{-1} = \int_0^1 \phi^3(1-\phi)^2 \,d\phi = 1/60$. Now we can calculate the posterior predictive distribution:

$$p(H|X) = \int_0^1 p(H | \phi) p(\phi | X) \,d\phi = 60 \int_0^1 \phi^4(1-\phi)^2 \,d\phi = \frac{60}{105} = \frac{4}{7} \approx 0.57$$
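The two integrals above are Beta integrals, so they can be checked exactly. The following is a minimal sketch, assuming the identity $\int_0^1 \phi^a(1-\phi)^b\,d\phi = \frac{a!\,b!}{(a+b+1)!}$ for non-negative integers $a, b$; the helper name `beta_integral` is made up for this illustration.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a: int, b: int) -> Fraction:
    """Exact value of the integral of phi**a * (1 - phi)**b over [0, 1]."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


heads, tails = 3, 2  # counts in X = {H, H, T, H, T}

# Normalizing constant c of the posterior under the uniform prior.
c = 1 / beta_integral(heads, tails)

# Posterior predictive probability of heads on the next toss.
p_heads = c * beta_integral(heads + 1, tails)

print(c, p_heads, float(p_heads))
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly with the closed-form value rather than with a rounded decimal.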

#### Example 2. A mixture of Gaussians

We consider drawing independent random samples from finitely many Gaussian distributions $q_j \sim N(\mu_j,\Sigma_j)$, $j=1,\dots,k$.
Let $L$ be a random variable taking values in $\{1,\dots,k\}$ which labels the Gaussian a sample comes from. We might take the prior distribution of $L$ to be $p(L=j) = 1/k$, i.e. each Gaussian is equally likely to be drawn from.

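Given $m$ observations of the data, $X =\{x_1,\dots,x_m\}$, we can then use Bayesian inference to calculate the posterior predictive distribution of the samples after seeing the data $X$. Writing $p_{x|L=j}(X) = \prod_{i=1}^m p_{x|L=j}(x_i)$ for the likelihood of the observations under the $j$-th Gaussian, the posterior distribution of the labels is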


$$p_{L|x=X}(j) = \frac{p_{x|L=j}(X)p_L(j)}{\sum_{l=1}^k p_{x|L=l}(X)p_L(l)}$$

and thus the posterior predictive distribution is 


$$p_{x|X}(z) = \sum_{j=1}^k p_{x|L=j}(z)p_{L|x=X}(j) = \sum_{j=1}^k p_{x|L=j}(z) \frac{p_{x|L=j}(X)p_L(j)}{\sum_{l=1}^k p_{x|L=l}(X)p_L(l)}$$
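The label posterior and the posterior predictive density can be sketched numerically. This is only an illustration under simplifying assumptions: the Gaussians are one-dimensional, the observations are treated as iid draws from a single unknown component, and the component parameters and data below are invented values.

```python
import numpy as np


def gaussian_logpdf(z, mu, sigma):
    """Log density of a 1-D normal N(mu, sigma**2), evaluated elementwise."""
    z = np.asarray(z, dtype=float)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)


def label_posterior(X, mus, sigmas, prior):
    """Posterior p(L = j | X), treating X as iid draws from one component."""
    log_lik = np.array([gaussian_logpdf(X, m, s).sum() for m, s in zip(mus, sigmas)])
    log_post = log_lik + np.log(prior)
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()


def posterior_predictive(z, X, mus, sigmas, prior):
    """Posterior predictive density p(z | X) as a posterior-weighted mixture."""
    post = label_posterior(X, mus, sigmas, prior)
    comps = np.array([np.exp(gaussian_logpdf(z, m, s)) for m, s in zip(mus, sigmas)])
    return post @ comps


# Hypothetical components and observations, chosen only for illustration.
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([1.0, 1.0, 1.0])
prior = np.full(3, 1 / 3)
X = np.array([2.4, 3.1, 2.8])

post = label_posterior(X, mus, sigmas, prior)
density_at_3 = posterior_predictive(3.0, X, mus, sigmas, prior)
print(post, density_at_3)
```

Working with log-likelihoods and subtracting the maximum before exponentiating avoids underflow when $m$ is large; the normalization step plays the role of the denominator in the formula above.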




## MAP Estimation

#### An alternative to the posterior predictive distribution

Often it is too expensive to store the entire posterior distribution and to make predictions with the posterior predictive distribution. A common approximation is to condense the posterior distribution of the parameters into a single 'most likely' choice of parameter, $\theta_0$, and then take the sampling distribution to be

$$p_{x|\theta = \theta_0}$$

The most common choice is to take the mode of the posterior distribution, $\theta_0 = \argmax_\theta p(\theta | X)$. This is called **maximum a posteriori estimation** and we usually write the mode as $\theta_{MAP}$. 
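As a small illustration, MAP estimation can be applied to the coin toss of Example 1, where the posterior is proportional to $\phi^3(1-\phi)^2$. The sketch below finds the mode on a grid (the grid resolution is an arbitrary choice) and compares it with the stationary point $\phi = 3/5$ of the log posterior.

```python
import numpy as np

heads, tails = 3, 2  # counts from Example 1

# Unnormalized log posterior under the uniform prior, on a grid in (0, 1).
phi = np.linspace(0.001, 0.999, 999)
log_post = heads * np.log(phi) + tails * np.log(1 - phi)

phi_map = phi[np.argmax(log_post)]  # numerical mode of the posterior
phi_mode_exact = heads / (heads + tails)  # stationary point of the log posterior

# Plug-in prediction p(H | phi_MAP) versus the full posterior predictive 4/7.
print(phi_map, phi_mode_exact, 4 / 7)
```

The plug-in prediction $p(H|\phi_{MAP}) = 0.6$ differs slightly from the posterior predictive value $4/7 \approx 0.57$, which shows that MAP estimation is an approximation to the full Bayesian prediction rather than a restatement of it.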

## The problem statement

We consider MAP estimation as applied to the observation of labeled data. That is, assume our labeled data $(x^i,y^i)$ is drawn independently from a joint distribution. We model the conditional distributions $p(y|x,\theta)$ by varying over some space of parameters $\theta$. We equip the parameters $\theta$ with a prior distribution $p(\theta)$.


We will assume that $p(\theta) = p(\theta|x)$. That is, the parameters are independent of the inputs. This is reasonable since we are modeling the conditional distributions $p(y|x)$.

## (A)
#### With the assumption of $p(\theta|x) = p(\theta)$, show that

#### $$\theta_{\text{MAP}} = \argmax_\theta p(y|x,\theta)p(\theta)$$

The posterior distribution is 

$$p(\theta|x,y) = \frac{p(\theta,x,y)}{p(x,y)} = \frac{p(y|x,\theta)p(x,\theta)}{p(x,y)} = \frac{p(y|x,\theta)p(\theta|x)p(x)}{p(x,y)}$$

Applying the assumption to the equality we get

$$p(\theta|x,y) = \frac{p(y|x,\theta)p(\theta)}{p(y|x)}$$

The denominator $p(y|x)$ does not depend on $\theta$, so the claim follows by taking the argmax in $\theta$.

## (B)
#### Suppose that $\theta \sim N(0,\eta^2\text{Id})$. Show that MAP estimation reduces to MLE with an $L^2$ regularization term.

From part (A), 

$$\theta_{MAP} = \argmax_\theta p(y|x,\theta)p(\theta) = \argmax_\theta \left( \log p(y|x,\theta) + \log p(\theta) \right)$$

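Writing $n$ for the dimension of $\theta$, the prior density is

$$p(\theta) = (2\pi)^{-\frac{n}{2}}\eta^{-n} \exp \left(- \frac{||\theta||^2}{2\eta^2}\right)$$

so $\log p(\theta) = -\frac{1}{2\eta^2}||\theta||^2 + \text{const}$. After discarding the term that doesn't depend on $\theta$,

$$\theta_{MAP} = \argmax_\theta \left( \log p(y|x,\theta) - \frac{1}{2\eta^2}||\theta||^2 \right) = \argmin_\theta \left( -\log p(y|x,\theta) + \frac{1}{2\eta^2}||\theta||^2 \right),$$

which is maximum likelihood estimation with an $L^2$ penalty of weight $\lambda = 1/(2\eta^2)$.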

## (C) 
#### Consider a concrete version of the setup in part (B), a linear regression model.

#### Assume that

1. $y = \theta \cdot x + \epsilon$ where $\epsilon \sim N(0,\sigma^2)$.
1. The prior is Gaussian, $\theta \sim N(0,\eta^2\text{Id})$.

#### Find a closed form expression for $\theta_{MAP}$.

Let $X$ be the matrix of sample inputs where each input is a row, and $Y$ the vector of sample labels. In this case the conditional distribution of the labels is $y^i \mid x^i,\theta \sim N(x^i\cdot\theta,\sigma^2)$. Writing out the log-likelihood, discarding the terms that don't depend on $\theta$, and flipping the sign to turn the maximization into a minimization,

$$\theta_{MAP} = \argmin_\theta \left(\frac{1}{2\sigma^2} \sum_{i=1}^m|y^i - x^i\cdot\theta|^2  + \frac{1}{2\eta^2}||\theta||^2\right)$$

The right hand side can be written as 

$$\frac{1}{2\sigma^2} ||Y - X\theta||^2 + \frac{1}{2\eta^2}||\theta||^2$$

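This is stationary in $\theta$ exactly when its gradient vanishes. Multiplying the transposed gradient by $\sigma^2$, the condition is

$$-(Y - X\theta)^TX + \frac{\sigma^2}{\eta^2}\theta^T = 0,$$

or

$$Y^TX - \theta^T\left(X^TX + \frac{\sigma^2}{\eta^2}\text{I}\right) = 0,$$

which means

$$\theta_{MAP} = \left(X^TX + \frac{\sigma^2}{\eta^2}\text{I}\right)^{-1}X^TY.$$

Since the objective is strictly convex, this stationary point is its unique minimizer.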
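The closed form can be checked numerically. Below is a minimal sketch on synthetic data; the function name `map_linear_gaussian` and the values of `theta_true`, `sigma`, and `eta` are assumptions made for the example. It verifies that the gradient of $\frac{1}{2\sigma^2}||Y - X\theta||^2 + \frac{1}{2\eta^2}||\theta||^2$ vanishes at the computed estimate.

```python
import numpy as np


def map_linear_gaussian(X, Y, sigma, eta):
    """Closed-form MAP estimate for linear regression with a N(0, eta^2 I) prior."""
    n = X.shape[1]
    lam = sigma**2 / eta**2
    # Solve (X^T X + lam I) theta = X^T Y rather than forming the inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y)


# Synthetic data; theta_true, sigma, and eta are hypothetical values.
rng = np.random.default_rng(0)
m, n = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
sigma, eta = 0.5, 2.0
X = rng.normal(size=(m, n))
Y = X @ theta_true + sigma * rng.normal(size=m)

theta_map = map_linear_gaussian(X, Y, sigma, eta)

# Gradient of the objective at theta_map; it should vanish up to rounding.
grad = -X.T @ (Y - X @ theta_map) / sigma**2 + theta_map / eta**2
print(theta_map, np.linalg.norm(grad))
```

Solving the linear system is generally preferable to computing the inverse explicitly, and the matrix $X^TX + \frac{\sigma^2}{\eta^2}\text{I}$ is always invertible because it is positive definite.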

## (D)
#### Consider the same linear regression problem but now suppose that the components of $\theta \in \mathbb{R}^n$ are independent with a Laplace prior of scale $b$; that is

$$p(\theta) = \frac{1}{(2b)^n} \exp \left(-\frac{||\theta||_1}{b}\right)$$

Calculate the quantity to maximize to obtain $\theta_{MAP}$.

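Repeat the analysis from the last part, replacing the Gaussian log-prior with $\log p(\theta) = -\frac{1}{b}||\theta||_1 + \text{const}$. The objective function to maximize is

$$J(\theta) = -\frac{1}{2\sigma^2}||X\theta - Y||^2 - \frac{1}{b}||\theta||_1,$$

so maximizing $J$ is the same as minimizing the squared error $||X\theta - Y||^2$ plus an $L^1$ regularization term with weight $2\sigma^2/b$.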
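Unlike the Gaussian case, this objective has no closed-form maximizer in general, so it is usually optimized numerically. The sketch below uses proximal gradient descent (ISTA) on $-J$; this is one standard technique chosen for illustration, not a method prescribed by the problem. The function names, the synthetic data, and the values of `sigma` and `b` are assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def log_posterior(theta, X, Y, sigma, b):
    """J(theta), the log posterior up to an additive constant."""
    return -np.sum((X @ theta - Y) ** 2) / (2 * sigma**2) - np.sum(np.abs(theta)) / b


def map_linear_laplace(X, Y, sigma, b, n_iter=500):
    """Approximate MAP estimate via proximal gradient steps on -J (ISTA)."""
    theta = np.zeros(X.shape[1])
    # Step size 1 / L, with L the Lipschitz constant of the smooth term's gradient.
    step = sigma**2 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - Y) / sigma**2
        theta = soft_threshold(theta - step * grad, step / b)
    return theta


# Synthetic data with a sparse, hypothetical theta_true.
rng = np.random.default_rng(0)
theta_true = np.array([2.0, 0.0, -1.0, 0.0])
sigma, b = 0.5, 0.01
X = rng.normal(size=(100, 4))
Y = X @ theta_true + sigma * rng.normal(size=100)

theta_lap = map_linear_laplace(X, Y, sigma, b)
theta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
print(theta_lap, log_posterior(theta_lap, X, Y, sigma, b))
```

The soft-thresholding step can set coordinates exactly to zero, which is the mechanism by which the $L^1$ penalty tends to produce sparse estimates. As a sanity check, the value of $J$ at `theta_lap` can be compared with its value at `theta_ols` and at the zero vector.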