# Tutorial: Gaussians and Least Squares

So far in the notes and problems, we've mostly avoided one of the most commonly used probability distributions, the Gaussian or normal distribution:

$\mathrm{Normal}(x|\mu,\sigma) \equiv p_\mathrm{Normal}(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$. [Endnote 1]

There are two reasons for this:
1. The symmetry between $x$ and $\mu$ makes it easy to miss the distinction between the sampling distribution and the likelihood function, and to conflate the model parameter $\sigma$ with an "error bar" associated strictly with the data (which it may or may not be).
2. The assumption of Gaussian PDFs is baked into various classical statistics methods to the extent that it isn't always obvious to the user. As always, it's important to think about whether an assumption or approximation is justified, and thus to see examples of when it is not.

That said, it is certainly common to use Gaussian distributions in practice, particularly in cases where
1. the approximation is well justified, as in the large-count limit of the Poisson distribution (typical of optical astronomy and longer wavelengths); or
2. we are effectively handed a table of data with "error bars" and have no better alternative than to assume a Gaussian sampling distribution.
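To get a feel for the first case, the sketch below compares the Poisson PMF with a Gaussian of the same mean and variance for a few expected counts. The function name `poisson_vs_normal`, the rates, and the 5-sigma window are arbitrary choices made for illustration, not part of the original notes.

```python
import numpy as np
import scipy.stats as st

def poisson_vs_normal(rate):
    """Largest gap between the Poisson PMF and a Gaussian with matching mean
    and variance, relative to the PMF peak, for counts within 5 sigma."""
    width = 5.0 * np.sqrt(rate)
    k = np.arange(max(0, int(rate - width)), int(rate + width) + 1)
    pmf = st.poisson.pmf(k, rate)
    approx = st.norm.pdf(k, loc=rate, scale=np.sqrt(rate))
    return np.max(np.abs(pmf - approx)) / np.max(pmf)

# illustrative expected counts; the mismatch shrinks as the rate grows
for rate in [5.0, 50.0, 1000.0]:
    print(rate, poisson_vs_normal(rate))
```

The mismatch should shrink as the expected number of counts grows, which is the sense in which the approximation is "well justified" at large counts.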

Gaussians have lots of nice mathematical features that make them convenient to work with when we can. For example, see a list of identities for the multivariate Gaussian [here](https://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gaussians.pdf) or [here](https://cs.nyu.edu/~roweis/notes/gaussid.pdf).

There are a few cases that it's useful to work through if you haven't before, to build intuition. We'll do that here, with:

* the product of two Gaussians
* showing conjugacy
* linear transformations
* extending classical weighted least squares 

In [None]:
exec(open('tbc.py').read()) # define TBC and TBC_above
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Multiplication

The product of Gaussians comes up, for example, when the sampling distributions for different data points are independent Gaussians, or when the sampling distribution and prior are both Gaussian (this is a conjugate pair).

So, consider

$\mathrm{Normal}(x|\mu_1,\sigma_1) \, \mathrm{Normal}(x|\mu_2,\sigma_2)$.

This can be manipulated into a different product of two Gaussians, with $x$ appearing in only one of them. Do so. (Note that this is a proportionality, not an equality - the coefficient in front will not perfectly normalize things when you're done.)

$\mathrm{Normal}(x|\mu_1,\sigma_1) \, \mathrm{Normal}(x|\mu_2,\sigma_2) \propto \mathrm{Normal}(x|\mu_a,\sigma_a) \, \mathrm{Normal}(0|\mu_b,\sigma_b)$.

If $x$ were a model parameter, and $\mu_i$ and $\sigma_i$ were independent measurements of $x$ with error bars, how do you interpret each term of this factorization?
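Before doing the algebra, it can be reassuring to see numerically that the product really is Gaussian in $x$. The sketch below multiplies two PDFs on a grid, normalizes the result, and compares it with a Gaussian that has the same mean and standard deviation. All of the values and variable names here are hypothetical, and the grid is simply chosen to be wide and fine enough for these numbers.

```python
import numpy as np
import scipy.stats as st

# hypothetical values, chosen only for illustration
demo_m1, demo_s1 = 0.5, 1.0
demo_m2, demo_s2 = 2.0, 0.7

xgrid = np.linspace(-10.0, 10.0, 20001)
dx = xgrid[1] - xgrid[0]

# product of the two PDFs, normalized numerically over x
prod = st.norm.pdf(xgrid, demo_m1, demo_s1) * st.norm.pdf(xgrid, demo_m2, demo_s2)
prod_norm = prod / (np.sum(prod) * dx)

# mean and standard deviation of the normalized product, from the grid
num_mean = np.sum(xgrid * prod_norm) * dx
num_sd = np.sqrt(np.sum((xgrid - num_mean)**2 * prod_norm) * dx)

# largest difference from a Gaussian with the same mean and standard deviation
max_gap = np.max(np.abs(prod_norm - st.norm.pdf(xgrid, num_mean, num_sd)))
print('mean:', num_mean, ' sd:', num_sd, ' max gap:', max_gap)
```

The printed mean and standard deviation give you numbers to compare against your own expressions for $\mu_a$ and $\sigma_a$, evaluated at the same inputs.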

> math, math, math, math,

Check your solution by plugging in some values for $x$, $\mu_i$ and $\sigma_i$. The function below returns the $\frac{(x-\mu)^2}{\sigma^2}$ term from the exponent of the PDF, which is what we care about here (since it's where $x$ appears).

In [None]:
TBC()

# pick some values (where m is mu, s sigma)
# x = 
# m1 = 
# s1 = 
# m2 = 
# s2 = 

# compute things
# sa = 
# ma = 
# mb = 
# sb = 

def exp_part(y, m, s):
    return ((y - m) / s)**2

print('This should be a pretty small number:', 
      exp_part(x,m1,s1) + exp_part(x,m2,s2) - ( exp_part(x,ma,sa) + exp_part(0,mb,sb) ) )

## 2. Conjugacy

When the sampling distribution is normal with a fixed variance, the conjugate prior for the mean is also normal. Show this for the case of a single data point, $y$; that is,

$p(\mu|y,\sigma) \propto \mathrm{Normal}(y|\mu,\sigma)\,\mathrm{Normal}(\mu|m_0,s_0) \propto \mathrm{Normal}(\mu|m_1,s_1)$

and find $m_1$ and $s_1$ in terms of $y$, $\sigma$, $m_0$ and $s_0$.

> math, math, math, math

Again, check your work by choosing some fiducial values and 
looking at the ratio $\mathrm{Normal}(y|\mu,\sigma)\,\mathrm{Normal}(\mu|m_0,s_0) / \mathrm{Normal}(\mu|m_1,s_1)$ over a range of $\mu$. It should be constant.

In [None]:
TBC()

# pick some values
# y = 
# sigma = 
# m0 = 
# s0 = 

# compute things
# s1 = 
# m1 = 

# plot
mugrid = np.arange(-1.0, 2.0, 0.01)
# we'll compare the log-probabilities, since that's a good habit to be in
diff = st.norm.logpdf(y, loc=mugrid, scale=sigma)+st.norm.logpdf(mugrid, loc=m0, scale=s0) - st.norm.logpdf(mugrid, loc=m1, scale=s1)

print('This should be constant (though not necessarily zero):')
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(mugrid, diff, 'b-');
plt.xlabel(r'$\mu$');
plt.ylabel('log-posterior difference');

## 3. Linear transformation

Consider the distribution

$\mathrm{Normal}\left[y\,\big|\,\mu_y(x;a,b),\sigma_y\right]$,

where $\mu_y(x;a,b)=a+bx$. Re-express this in terms of a distribution over $x$, i.e.

$\mathrm{Normal}\left[x|\mu_x(y;a,b),\sigma_x(y;a,b)\right]$.

> math, math, math, math
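Once you have an answer, a numerical check along the following lines may help. It evaluates the density over $y$ at a fixed $y$ as a function of $x$, normalizes it on a grid, and compares the result with a Gaussian of matching mean and standard deviation. The values and names are hypothetical, picked only so that the curve sits comfortably inside the grid.

```python
import numpy as np
import scipy.stats as st

# hypothetical values, for illustration only
lin_a, lin_b = 1.0, 2.5
lin_y, lin_sy = 3.0, 0.8

xg = np.linspace(-10.0, 10.0, 20001)
step = xg[1] - xg[0]

# Normal(y | a + b*x, sigma_y) at fixed y, viewed as a function of x
f = st.norm.pdf(lin_y, loc=lin_a + lin_b * xg, scale=lin_sy)
area = np.sum(f) * step  # not 1 in general: f is normalized over y, not x
f_norm = f / area

# mean and standard deviation in x, from the grid
x_mean = np.sum(xg * f_norm) * step
x_sd = np.sqrt(np.sum((xg - x_mean)**2 * f_norm) * step)

# largest difference from a Gaussian in x with the same moments
lin_gap = np.max(np.abs(f_norm - st.norm.pdf(xg, x_mean, x_sd)))
print('area:', area, ' mean:', x_mean, ' sd:', x_sd, ' max gap:', lin_gap)
```

Compare the printed mean and standard deviation with your $\mu_x$ and $\sigma_x$; the printed area is a reminder that re-expressing the distribution over $x$ changes the normalization.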

## 4. Classical weighted least squares

Classical WLS is a simple method for fitting a line to data that you've almost certainly seen before. Consider data consisting of $n$ triplets $(x_i,y_i,\sigma_i)$, where $x_i$ are assumed to be known perfectly and $\sigma_i$ is interpreted as a "measurement error" for $y_i$. WLS maximizes the likelihood function

$\mathcal{L}(a,b;x,y,\sigma) = \prod_{i=1}^n \mathrm{Normal}(y_i|a+bx_i,\sigma_i)$.

In fact, we can get away with being more general and allowing for the possibility that the different measurements are not independent, with their measurement errors jointly characterized by a known covariance matrix, $\Sigma$, rather than the individual $\sigma_i$:

$\mathcal{L}(a,b;x,y,\Sigma) = \mathrm{Normal}(y|X\beta,\Sigma) = \frac{1}{\sqrt{(2\pi)^n|\Sigma|}}\exp \left[-\frac{1}{2}(y-X\beta)^\mathrm{T}\Sigma^{-1}(y-X\beta)\right]$,

where $X$ is called the _design matrix_, with each row equal to $(1, x_i)$, and $\beta = \left(\begin{array}{c}a\\b\end{array}\right)$.
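As a quick check that the two forms of the likelihood agree when the measurements are independent, the sketch below evaluates both for a diagonal $\Sigma$ with entries $\sigma_i^2$. The mock data, seed, and trial parameter values are arbitrary assumptions made for this illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)  # arbitrary seed
n_pts = 5
x_demo = rng.normal(size=n_pts)
sig_demo = rng.uniform(1.0, 3.0, size=n_pts)
y_demo = rng.normal(loc=x_demo, scale=sig_demo)

X_demo = np.column_stack([np.ones(n_pts), x_demo])  # each row is (1, x_i)
beta_demo = np.array([0.3, 1.2])                    # trial values of (a, b)

# product of independent univariate Gaussians (a sum, in log space)
log_like_product = np.sum(
    st.norm.logpdf(y_demo, loc=X_demo @ beta_demo, scale=sig_demo))

# single multivariate Gaussian with a diagonal covariance matrix
log_like_joint = st.multivariate_normal.logpdf(
    y_demo, mean=X_demo @ beta_demo, cov=np.diag(sig_demo**2))

print(log_like_product, log_like_joint)
```

Note that the univariate call takes standard deviations while the multivariate call takes variances on the diagonal of the covariance.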

With a certain amount of algebra, it can be shown that $\mathcal{L}$ is proportional to a bivariate Gaussian over $\beta$,

$\mathcal{L} \propto \mathrm{Normal}(\beta | \mu_\beta, \Sigma_\beta)$,

with

$\Sigma_\beta = (X^\mathrm{T}\Sigma^{-1}X)^{-1}$;

$\mu_\beta = \Sigma_\beta X^\mathrm{T}\Sigma^{-1} y$.

In classical WLS, $\mu_\beta$ is the "best fit" estimate of $a$ and $b$, and $\Sigma_\beta$ is the covariance matrix of those estimates; the square roots of its diagonal elements are the standard errors on $a$ and $b$.
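The expressions for $\Sigma_\beta$ and $\mu_\beta$ translate almost directly into `numpy`. What follows is a minimal sketch rather than a production implementation: the helper name `wls_gaussian` and the mock data are hypothetical, and the explicit matrix inverses are kept for readability rather than numerical robustness. For independent errors, the result is cross-checked against `np.polyfit` with weights $1/\sigma_i$.

```python
import numpy as np

def wls_gaussian(x, y, Sigma):
    """Return (mu_beta, Sigma_beta) for the straight-line model y = a + b*x,
    given the data covariance matrix Sigma."""
    X = np.column_stack([np.ones_like(x), x])        # design matrix, rows (1, x_i)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_beta = np.linalg.inv(X.T @ Sigma_inv @ X)  # covariance of (a, b)
    mu_beta = Sigma_beta @ X.T @ Sigma_inv @ y       # best-fit (a, b)
    return mu_beta, Sigma_beta

# mock data with independent errors (arbitrary seed and true parameters)
rng = np.random.default_rng(1)
x_mock = rng.normal(size=10)
sig_mock = rng.uniform(1.0, 3.0, size=10)
y_mock = rng.normal(loc=0.0 + 1.0 * x_mock, scale=sig_mock)

mu_check, Sigma_check = wls_gaussian(x_mock, y_mock, np.diag(sig_mock**2))

# cross-check: np.polyfit takes weights 1/sigma and returns [slope, intercept]
b_fit, a_fit = np.polyfit(x_mock, y_mock, 1, w=1.0 / sig_mock)
print(mu_check, [a_fit, b_fit])
print(np.sqrt(np.diag(Sigma_check)))  # standard errors on a and b
```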

The relative simplicity of the computations above, not to mention the fact that they are efficiently implemented in numerous packages, can be useful even in situations beyond the assumption-heavy scenario where WLS is derived. As a simple example, consider a case where the sampling distribution corresponds to the likelihood function above, but we wish to use an informative prior on $a$ and $b$.

Taking advantage of the results you derived above (all of which have straightforward multivariate analogs), 
1. What is the form of the prior, $p(a,b|\alpha)$, that makes this problem conjugate? (Here $\alpha$ is a stand-in for whatever parameters determine the prior.)
2. What are the form and parameters of the posterior, $p(a,b|x,y,\Sigma,\alpha)$?
3. Verify that you recover the WLS solution in the limit of the prior being uniform over the $(a,b)$ plane.

> 1.

> 2. 

> 3. 

Below, we will explicitly show the correspondence in (3) for a WLS fit of some mock data.

In [None]:
# generate some fake data
a = 0.0
b = 1.0
n = 10
x = st.norm.rvs(size=n)
sigma = st.uniform.rvs(1.0, 2.0, size=n)
y = st.norm.rvs(loc=a+b*x, scale=sigma, size=n)

plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.errorbar(x, y, yerr=sigma, fmt='bo');
plt.xlabel('x');
plt.ylabel('y');

The next cell uses the `statsmodels` package to perform the WLS calculations. You are encouraged to implement the matrix algebra above to verify the results. What we get at the end are $\mu_\beta$ and $\Sigma_\beta$, as defined above.

In [None]:
import statsmodels.api as sm

model = sm.WLS(y, sm.add_constant(x), weights=sigma**-2)
wls = model.fit()
mu_beta = np.matrix(wls.params).T # cast as a column vector
Sigma_beta = np.asmatrix(wls.normalized_cov_params)

Now, compute the parameters of the posterior for $\beta$ based on $\mu_\beta$ and $\Sigma_\beta$ (parameters that appear in the sampling distribution) and the parameters of the conjugate prior. Set the prior parameters to be equivalent to the uniform distribution for the check below (you can put in something different to see how it looks later).

Transform `post_mean` to a shape (2,) numpy array for convenience (as opposed to, say, a 2x1 matrix).

In [None]:
TBC()

# define prior parameters

# do some calculations, possibly

# parameters of the posterior:
# post_cov = ...
# post_mean = ...

Compare the WLS and posterior parameters (they should be identical for a uniform prior):

In [None]:
print('WLS mean and covariance:')
print(mu_beta)
print(Sigma_beta)

In [None]:
print('Posterior mean and covariance:')
print(post_mean)
print(post_cov)

Below, we can compare your analytic solution to a brute-force calculation of the posterior:

In [None]:
def log_post_brute(a, b):
    like = np.sum( st.norm.logpdf(y, loc=a+b*x, scale=sigma) )
    prior = st.multivariate_normal.logpdf([a,b], mean=np.asarray(prior_mean)[:,0], cov=prior_cov)
    return prior + like

print('Difference between elegant and brute-force log posteriors for some random parameter values:')
print('(The third column should be basically constant, though non-zero.)\n')
for i in range(10):
    a = np.random.rand() * 10.0 - 5.0
    b = np.random.rand() * 10.0 - 5.0
    diff = st.multivariate_normal.logpdf([a,b], mean=post_mean, cov=post_cov) - log_post_brute(a,b)
    print([a, b, diff])

#### Endnotes

1. In statistics literature, a more common convention is $p(x|\mu,\sigma^2)$, with the second parameter being the variance rather than the standard deviation; this is then consistent with the multivariate Gaussian notation in which the second parameter is the covariance matrix. However, most code implementations of the univariate normal distribution take the standard deviation as an argument rather than the variance, so we'll stick with that notation.
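The difference in conventions is easy to trip over in code, so here is a small sketch of it using `scipy.stats`, with arbitrary illustrative values: the univariate `norm` takes the standard deviation as `scale`, while `multivariate_normal` takes the (co)variance as `cov`.

```python
import numpy as np
import scipy.stats as st

mu_ex, sigma_ex, x_ex = 1.0, 2.0, 0.3  # arbitrary illustrative values

# the univariate Gaussian PDF written out, with sigma the standard deviation
by_hand = (np.exp(-0.5 * ((x_ex - mu_ex) / sigma_ex)**2)
           / (np.sqrt(2.0 * np.pi) * sigma_ex))

# univariate normal: `scale` is the standard deviation
uni = st.norm.pdf(x_ex, loc=mu_ex, scale=sigma_ex)

# multivariate normal (here in one dimension): `cov` is the variance
multi = st.multivariate_normal.pdf(x_ex, mean=mu_ex, cov=sigma_ex**2)

print(by_hand, uni, multi)  # all three should agree
```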