# Reading Activity 14 - Bayesian Linear Regression

## Objectives

+ To introduce the probabilistic interpretation of least squares

## Probabilistic interpretation of least squares (maximum likelihood)

We wish to model the data using some **fixed** basis/features:
$$
y(\mathbf{x};\mathbf{w}) = \sum_{j=1}^{m} w_{j}\phi_{j}(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})
$$
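As a concrete illustration of a fixed basis, here is a minimal sketch that assumes a one-dimensional input and a polynomial basis $\phi_j(x) = x^{j-1}$. The basis choice and the helper names `polynomial_features` and `model` are assumptions made for this example; they are not prescribed by the text.

```python
import numpy as np


def polynomial_features(x, degree):
    """Return the n x (degree + 1) matrix with columns 1, x, ..., x^degree."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)


def model(x, w):
    """Evaluate y(x; w) = w^T phi(x) at every input in the 1D array x."""
    w = np.asarray(w, dtype=float)
    Phi = polynomial_features(x, degree=w.shape[0] - 1)
    return Phi @ w


# Hypothetical weights: y = 1 - 2 x + 0.5 x^2
x = np.array([0.0, 1.0, 2.0])
w = np.array([1.0, -2.0, 0.5])
y = model(x, w)
print(y)
```

The model is linear in the weights even though it is nonlinear in $x$, which is what makes the closed-form results in this section possible.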
However, instead of directly picking a loss function to minimize, we come up with a probabilistic description of the measurement process.
In particular, we *model the measurement process* using a **likelihood** function:
$$
\mathbf{y}_{1:n} | \mathbf{x}_{1:n}, \mathbf{w} \sim p(\mathbf{y}_{1:n}|\mathbf{x}_{1:n}, \mathbf{w}).
$$

What is the interpretation of the likelihood function?
Well, $p(\mathbf{y}_{1:n} | \mathbf{x}_{1:n}, \mathbf{w})$ tells us how plausible it is to observe $\mathbf{y}_{1:n}$ at inputs $\mathbf{x}_{1:n}$, if we know that the model parameters are $\mathbf{w}$.

In almost all the cases we consider, the measurements are independent conditioned on the model, so the likelihood of the data factorizes as follows:
$$
p(\mathbf{y}_{1:n}|\mathbf{x}_{1:n}, \mathbf{w}) = \prod_{i=1}^np(y_i|\mathbf{x}_i, \mathbf{w}),
$$
where $p(y_i|\mathbf{x}_i,\mathbf{w})$ is the likelihood of a single measurement.

The most common choice for the likelihood of a single measurement is a Gaussian.
We assign:
$$
\begin{array}{ccc}
p(y_i|\mathbf{x}_i, \mathbf{w}, \sigma) &=& \mathcal{N}\left(y_i| y(\mathbf{x}_i;\mathbf{w}), \sigma^2\right)\\
&=& \mathcal{N}\left(y_i | \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_i), \sigma^2\right),
\end{array}
$$
where $\sigma$ is the standard deviation of the measurement **noise**.
This corresponds to the belief that our measurement is around the model prediction $\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})$,
but it is contaminated with Gaussian noise of variance $\sigma^2$.

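Let $\mathbf{\Phi}$ be the $n\times m$ design matrix with entries $\Phi_{ij} = \phi_j(\mathbf{x}_i)$, so that $\mathbf{\Phi}\mathbf{w}$ collects the model predictions at all inputs.
Assuming a Gaussian likelihood for a single observation, we have for all the data: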
$$
p(\mathbf{y}_{1:n} | \mathbf{x}_{1:n}, \mathbf{w}, \sigma) = \mathcal{N}\left(\mathbf{y}_{1:n} | \mathbf{\Phi}\mathbf{w}, \sigma^2\mathbf{I}_n\right).
$$
Writing out the density of the multivariate Gaussian (see [Wikipedia](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)) with covariance $\sigma^2\mathbf{I}_n$ gives:
$$
p(\mathbf{y}_{1:n} | \mathbf{x}_{1:n}, \mathbf{w}, \sigma) 
= (2\pi)^{-\frac{n}{2}}\sigma^{-n} e^{-\frac{1}{2\sigma^2}\lVert\mathbf{\Phi}\mathbf{w}-\mathbf{y}_{1:n}\rVert^2}.
$$
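The factorization and the closed-form joint density can be checked against each other numerically. The sketch below works with logarithms to avoid underflow and uses synthetic data from a quadratic polynomial basis; the data, the basis, and the function names are assumptions made only for this illustration.

```python
import numpy as np


def joint_log_likelihood(y, Phi, w, sigma):
    """Log of N(y | Phi w, sigma^2 I_n), written out in closed form."""
    n = y.shape[0]
    residual = Phi @ w - y
    return (
        -0.5 * n * np.log(2.0 * np.pi)
        - n * np.log(sigma)
        - 0.5 * (residual @ residual) / sigma**2
    )


def sum_of_single_log_likelihoods(y, Phi, w, sigma):
    """Sum over i of log N(y_i | w^T phi(x_i), sigma^2)."""
    mean = Phi @ w
    terms = -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * (y - mean) ** 2 / sigma**2
    return terms.sum()


# Synthetic data (hypothetical): quadratic model plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
Phi = np.vander(x, 3, increasing=True)  # columns 1, x, x^2
w_true = np.array([0.5, -1.0, 2.0])
sigma_true = 0.1
y = Phi @ w_true + sigma_true * rng.standard_normal(x.shape[0])

joint = joint_log_likelihood(y, Phi, w_true, sigma_true)
factored = sum_of_single_log_likelihoods(y, Phi, w_true, sigma_true)
print(joint, factored)
```

The two numbers should agree up to floating-point error, since a product of independent Gaussians is a multivariate Gaussian with diagonal covariance.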

### Maximum Likelihood Estimate of $\mathbf{w}$

Once we have a likelihood, we can train the model by maximizing the likelihood:
$$
\mathbf{w}_{\mbox{MLE}} = \arg\max_{\mathbf{w}} p(\mathbf{y}_{1:n} | \mathbf{x}_{1:n}, \mathbf{w}, \sigma).
$$
When we do this we are essentially selecting the model that makes the observations most likely.
For the Gaussian likelihood, we have:
$$
\log p(\mathbf{y}_{1:n} | \mathbf{x}_{1:n}, \mathbf{w}, \sigma) =
-\frac{n}{2}\log(2\pi)
-n\log\sigma
- \frac{1}{2\sigma^2}\lVert\mathbf{\Phi}\mathbf{w}-\mathbf{y}_{1:n}\rVert^2.
$$
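Only the last term depends on $\mathbf{w}$, so maximizing the log-likelihood is equivalent to minimizing the sum of squared errors $\lVert\mathbf{\Phi}\mathbf{w}-\mathbf{y}_{1:n}\rVert^2$.
Taking the derivatives with respect to $\mathbf{w}$ and setting them equal to zero (a necessary condition, which is also sufficient here because the log-likelihood is concave in $\mathbf{w}$) yields the same solution as least squares: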
$$
\mathbf{w}_{\mbox{MLE}} \equiv \mathbf{w}_{\mbox{LS}}.
$$
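A minimal numerical sketch of this equivalence, assuming synthetic data and a quadratic polynomial basis (both hypothetical): the weights returned by a least squares solver also zero the gradient of the log-likelihood with respect to $\mathbf{w}$.

```python
import numpy as np

# Synthetic data (hypothetical): quadratic model plus Gaussian noise
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
Phi = np.vander(x, 3, increasing=True)  # design matrix, columns 1, x, x^2
w_true = np.array([0.5, -1.0, 2.0])
y = Phi @ w_true + 0.1 * rng.standard_normal(n)

# Least squares solution, which is also the maximum likelihood estimate
w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Same solution from the normal equations Phi^T Phi w = Phi^T y
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Gradient of the log-likelihood w.r.t. w, up to the positive factor 1 / sigma^2
grad = -Phi.T @ (Phi @ w_mle - y)

print(w_mle)
print(np.abs(grad).max())
```

Note that $\sigma$ only rescales the gradient, so $\mathbf{w}_{\mbox{MLE}}$ does not depend on the noise level. The solver `np.linalg.lstsq` is used instead of forming $\mathbf{\Phi}^T\mathbf{\Phi}$ explicitly because it is typically better conditioned numerically; the normal equations appear only as a cross-check.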

### Maximum Likelihood Estimate of $\sigma$
The probabilistic interpretation above gives the same solution as least squares.
To start understanding its power, notice that it can also give us an estimate for the measurement noise variance $\sigma^2$.
All you have to do is maximize the likelihood with respect to $\sigma$.
For the Gaussian likelihood:

+ Take the derivative of $\log p(\mathbf{y}_{1:n}|\mathbf{x}_{1:n},\mathbf{w}_{\mbox{MLE}},\sigma)$ with respect to $\sigma$.
+ Set to zero, and solve for $\sigma$.
+ You will get:
$$
\sigma_{\mbox{MLE}}^2 = \frac{\lVert \mathbf{\Phi}\mathbf{w}_{\mbox{MLE}} - \mathbf{y}_{1:n}\rVert^2}{n}.
$$
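The steps above can be sketched numerically as follows, again on synthetic data with a quadratic polynomial basis (assumptions made for this example). The sketch computes $\sigma_{\mbox{MLE}}$ from the residuals and checks that nearby values of $\sigma$ give a lower log-likelihood.

```python
import numpy as np

# Synthetic data (hypothetical): quadratic model plus Gaussian noise
rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(-1.0, 1.0, size=n)
Phi = np.vander(x, 3, increasing=True)
w_true = np.array([0.5, -1.0, 2.0])
sigma_true = 0.3
y = Phi @ w_true + sigma_true * rng.standard_normal(n)

# Step 1: maximum likelihood (least squares) weights
w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Step 2: noise variance estimate = mean squared residual
residual = Phi @ w_mle - y
sigma2_mle = (residual @ residual) / n
sigma_mle = np.sqrt(sigma2_mle)


def log_likelihood(sigma):
    """Gaussian log-likelihood at w = w_mle as a function of sigma."""
    return (
        -0.5 * n * np.log(2.0 * np.pi)
        - n * np.log(sigma)
        - 0.5 * (residual @ residual) / sigma**2
    )


print(sigma_mle)
print(log_likelihood(sigma_mle) >= log_likelihood(0.9 * sigma_mle))
print(log_likelihood(sigma_mle) >= log_likelihood(1.1 * sigma_mle))
```

With this many observations the estimate should land close to the value of `sigma_true` used to generate the data.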

### Making Predictions
How do we make predictions about $y$ at a new point $\mathbf{x}$?
We simply plug the estimated parameters into the likelihood of a single measurement.
For the Gaussian likelihood, the **point predictive distribution** is:
$$
p(y|\mathbf{x}, \mathbf{w}_{\mbox{MLE}}, \sigma^2_{\mbox{MLE}}) = 
\mathcal{N}\left(y\middle|\mathbf{w}_{\mbox{MLE}}^T\boldsymbol{\phi}(\mathbf{x}), \sigma_{\mbox{MLE}}^2\right).
$$
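As a rough sketch of how the point predictive distribution might be used, the code below fits the model on synthetic data (hypothetical, with a quadratic polynomial basis) and reports the predictive mean together with a central 95% interval, taken as the mean plus or minus 1.96 standard deviations of the Gaussian.

```python
import numpy as np

# Synthetic training data (hypothetical): quadratic model plus Gaussian noise
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
Phi = np.vander(x, 3, increasing=True)
w_true = np.array([0.5, -1.0, 2.0])
y = Phi @ w_true + 0.2 * rng.standard_normal(n)

# Maximum likelihood estimates of the weights and the noise level
w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residual = Phi @ w_mle - y
sigma_mle = np.sqrt((residual @ residual) / n)

# Point predictive distribution at new inputs: N(y | w_mle^T phi(x), sigma_mle^2)
x_new = np.array([0.0, 0.5])
Phi_new = np.vander(x_new, 3, increasing=True)
pred_mean = Phi_new @ w_mle
pred_std = np.full_like(pred_mean, sigma_mle)

# Central 95% predictive interval of a Gaussian
lower = pred_mean - 1.96 * pred_std
upper = pred_mean + 1.96 * pred_std

for xi, m, lo, hi in zip(x_new, pred_mean, lower, upper):
    print(f"x = {xi:.2f}: mean = {m:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

The predictive standard deviation is the same at every $\mathbf{x}$ because this distribution only accounts for measurement noise; it treats $\mathbf{w}_{\mbox{MLE}}$ and $\sigma_{\mbox{MLE}}$ as if they were known exactly.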