*Credit*: some material here has been adapted from [Machine Learning: A Probabilistic Perspective](https://www.cs.ubc.ca/~murphyk/MLbook/) by Kevin P. Murphy (Chapter 2).


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# The Gaussian and its Longer-tailed Cousins

In this notebook we will explore some common distributions for continuous-valued random variables.

## Gaussian (normal) distribution

The most widely used distribution in statistics and machine learning is the Gaussian or normal distribution. Its pdf is given by

$$\mathcal{N}(x|\mu, \sigma^2) \triangleq \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where $\mu = \mathbb{E}_X[x]$ is the mean (and mode), and $\sigma^2 = \mathbb{V}_X[x]$ is the variance. The factor $\frac{1}{\sqrt{2 \pi \sigma^2}}$ is the normalization constant that ensures the density integrates to 1.

We write $X \sim \mathcal{N}(\mu, \sigma^2)$ to denote $p(x)=\mathcal{N}(x|\mu, \sigma^2)$. If $X \sim \mathcal{N}(0,1)$, we say $X$ follows a **standard normal** distribution. 

We will sometimes talk about the **precision** of a Gaussian, by which we mean the inverse variance: $\lambda = 1/\sigma^2$.
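
As a quick sanity check, the pdf above can be evaluated directly and compared against `scipy.stats.norm.pdf`. This is a minimal sketch: the helper name `gaussian_pdf` and the values of `mu` and `sigma2` are illustrative choices, not part of the original material. Note that scipy takes the standard deviation $\sigma$ as its `scale` argument, not the variance.

```python
import numpy as np
from scipy import stats


def gaussian_pdf(x, mu, sigma2):
    """Evaluate N(x | mu, sigma2) directly from the formula (hypothetical helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)


x = np.linspace(-3, 3, 7)
mu, sigma2 = 1.0, 4.0  # illustrative values

manual = gaussian_pdf(x, mu, sigma2)
# scipy parameterises the Gaussian by its standard deviation, not its variance
reference = stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))
print(np.allclose(manual, reference))

# the precision is just the inverse variance
precision = 1 / sigma2
print(precision)
```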

There are several reasons the Gaussian is so widely used:

* It has two parameters that are easy to interpret: the mean and the variance.
* The central limit theorem tells us that sums of many independent random variables with finite variance have an approximately Gaussian distribution, making it a good fit for modeling residual errors or "noise".
* Among all distributions with a given mean and variance, the Gaussian makes the fewest additional assumptions (i.e. it has maximum entropy), which makes it a good default choice in many cases.
* It has a simple mathematical form, which results in methods that are easy to implement but often highly effective.
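
The central limit theorem point can be illustrated numerically. The sketch below sums independent Uniform(0, 1) draws, standardizes the sums, and compares the fraction falling within one standard deviation of the mean to the Gaussian value. The choice of uniform draws, the number of terms, and the sample size are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_terms, n_sums = 50, 10000  # illustrative sizes

# each row holds n_terms independent Uniform(0, 1) draws; sum across the row
sums = rng.uniform(0, 1, size=(n_sums, n_terms)).sum(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so standardize accordingly
z = (sums - n_terms * 0.5) / np.sqrt(n_terms / 12)

# fraction of standardized sums within one standard deviation of zero
frac_within_1sd = np.mean(np.abs(z) < 1)
gaussian_frac = stats.norm.cdf(1) - stats.norm.cdf(-1)
print(frac_within_1sd, gaussian_frac)
```

The two printed fractions should be close, even though the individual terms are far from Gaussian.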

## The Student $t$ distribution

One problem with the Gaussian distribution is that it is sensitive to outliers: its log-probability decays quadratically with distance from the centre, so a single distant point incurs a very large penalty and can pull the fit strongly towards it. A more robust distribution is the **Student** $t$ **distribution**. Its pdf is as follows

$$\mathcal{T}(x|\mu, \sigma^2, \nu) \propto \left[ 1 + \frac{1}{\nu} \left( \frac{x-\mu}{\sigma}\right)^2\right]^{-\left(\frac{\nu + 1}{2}\right)}$$

where $\mu$ is the mean, $\sigma^2>0$ is the scale parameter, and $\nu > 0$ is called the **degrees of freedom**.

The distribution has the following properties:

mean = $\mu$, mode = $\mu$, var = $\frac{\nu \sigma^2}{(\nu - 2)}$

The variance is only defined if $\nu > 2$. The mean is only defined if $\nu > 1$. It is common to use $\nu = 4$, which gives good performance in a range of problems. For $\nu \gg 5$, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties.
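
The variance formula and the heavier tails can be checked with `scipy.stats.t`. This is a small sketch: the values of `nu` and `sigma` are illustrative, and the cut-off of 4 scale units for the tail comparison is an arbitrary choice.

```python
import numpy as np
from scipy import stats

nu, sigma = 4, 1.5  # illustrative values, not taken from the text

# closed-form variance nu * sigma^2 / (nu - 2), valid only for nu > 2
var_formula = nu * sigma ** 2 / (nu - 2)
var_scipy = stats.t.var(df=nu, loc=0, scale=sigma)
print(var_formula, var_scipy)

# two-sided tail mass beyond 4 scale units: Student (nu = 4) vs standard normal
tail_t = 2 * stats.t.sf(4, df=nu)
tail_g = 2 * stats.norm.sf(4)
print(tail_t, tail_g)
```

The Student tail mass is orders of magnitude larger than the Gaussian one, which is why distant points are penalized far less severely.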

## The Laplace distribution

Another distribution with heavy tails is the **Laplace distribution**, also known as the **double-sided exponential** distribution. It has the following pdf:

$$\text{Lap}(x|\mu,b) \triangleq \frac{1}{2b} \exp \left( - \frac{|x - \mu|}{b}\right)$$

Here $\mu$ is a location parameter and $b>0$ is a scale parameter. This distribution has the following properties:

mean = $\mu$, mode = $\mu$, var = $2b^2$

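Not only does it have heavier tails than the Gaussian, it also puts more probability density at its mode (0 when $\mu = 0$) than a Gaussian with the same variance. This property is useful for encouraging sparsity in machine learning models, for example when the Laplace is used as a prior on model weights.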
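
To make the comparison concrete, the sketch below matches the variance of a Laplace to a standard normal by setting $b = 1/\sqrt{2}$ (so that $2b^2 = 1$) and compares the two densities at the mode. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

b = 1 / np.sqrt(2)  # scale chosen so that the variance 2 * b**2 equals 1

laplace_var = stats.laplace.var(loc=0, scale=b)
print(laplace_var)

# density at the mode for a unit-variance Laplace and a standard normal
laplace_at_0 = stats.laplace.pdf(0, loc=0, scale=b)  # equals 1 / (2 * b)
gauss_at_0 = stats.norm.pdf(0, loc=0, scale=1)       # equals 1 / sqrt(2 * pi)
print(laplace_at_0, gauss_at_0)
```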

In [None]:
# Show Gaussian, Student, Laplace pdfs and log pdfs
fig, ax = plt.subplots(2, 1, sharex=True)

g = lambda x : stats.norm.pdf(x, loc=0, scale=1)
t = lambda x : stats.t.pdf(x, df=1, loc=0, scale=1)
l = lambda x : stats.laplace.pdf(x, loc=0, scale=1/np.sqrt(2))

x = np.arange(-4, 4, 0.1)


ax[0].plot(x, g(x), 'b-', label='Gaussian')
ax[0].plot(x, t(x), 'r.', label='Student')
ax[0].plot(x, l(x), 'g--', label='Laplace')

ax[0].legend(loc='best')
ax[0].set_title('pdfs')

ax[1].plot(x, np.log(g(x)), 'b-', label='Gaussian')
ax[1].plot(x, np.log(t(x)), 'r.', label='Student')
ax[1].plot(x, np.log(l(x)), 'g--', label='Laplace')
ax[1].set_title('log pdfs')

Let's fit each of these densities to data, with and without outliers. This should make it concrete what we mean by saying that the heavier-tailed densities are more robust.

In [None]:
n = 30  # n data points
np.random.seed(0)
data = np.random.randn(n)

outliers = np.array([8, 8.75, 9.5])
nn = len(outliers)
nbins = 7

# fit each of the models to the data (no outliers)
model_g = stats.norm.fit(data)
model_t = stats.t.fit(data)
model_l = stats.laplace.fit(data)

fig, ax = plt.subplots(2, 1, sharex=True)

x = np.arange(-10, 10, 0.1)

g = lambda x : stats.norm.pdf(x, loc=model_g[0], scale=model_g[1])
t = lambda x : stats.t.pdf(x, df=model_t[0], loc=model_t[1], scale=model_t[2])
l = lambda x : stats.laplace.pdf(x, loc=model_l[0], scale=model_l[1])

ax[0].hist(data, bins=25, range=(-10, 10),
           density=True, alpha=0.25, facecolor='gray')
ax[0].plot(x, g(x), 'b-', label='Gaussian')
ax[0].plot(x, t(x), 'r.', label='Student')
ax[0].plot(x, l(x), 'g--', label='Laplace')

ax[0].legend(loc='best')
ax[0].set_title('no outliers')

# fit each of the models to the data (with outliers)
newdata = np.r_[data, outliers]  # row concatenation
model_g = stats.norm.fit(newdata)
model_t = stats.t.fit(newdata)
model_l = stats.laplace.fit(newdata)


g = lambda x : stats.norm.pdf(x, loc=model_g[0], scale=model_g[1])
t = lambda x : stats.t.pdf(x, df=model_t[0], loc=model_t[1], scale=model_t[2])
l = lambda x : stats.laplace.pdf(x, loc=model_l[0], scale=model_l[1])

ax[1].hist(newdata, bins=25, range=(-10, 10),
           density=True, alpha=0.25, facecolor='gray')
ax[1].plot(x, g(x), 'b-', label='Gaussian')
ax[1].plot(x, t(x), 'r.', label='Student')
ax[1].plot(x, l(x), 'g--', label='Laplace')


ax[1].set_title('with outliers')

So we see that when outliers are present, the Gaussian spreads its density out considerably to cover them, while the Student and Laplace fits are far less affected.
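
The effect can also be quantified by comparing fitted parameters rather than reading them off a plot. The sketch below is self-contained and uses its own random draw, so the numbers will differ from the figure above. It fits only the Gaussian and Laplace models, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.standard_normal(30)
outliers = np.array([8, 8.75, 9.5])
noisy = np.concatenate([clean, outliers])

# maximum likelihood fits: (loc, scale) for each model
g_loc0, g_scale0 = stats.norm.fit(clean)
g_loc1, g_scale1 = stats.norm.fit(noisy)
l_loc0, l_scale0 = stats.laplace.fit(clean)
l_loc1, l_scale1 = stats.laplace.fit(noisy)

# how far does each model's location move once the outliers are added?
gauss_shift = abs(g_loc1 - g_loc0)
laplace_shift = abs(l_loc1 - l_loc0)
print('Gaussian: loc shift', gauss_shift, 'scale', g_scale0, '->', g_scale1)
print('Laplace:  loc shift', laplace_shift, 'scale', l_scale0, '->', l_scale1)
```

The Gaussian location is the sample mean, which every outlier drags along, whereas the Laplace location is driven by the sample median and so moves much less.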

## The multivariate Gaussian

The **multivariate Gaussian** or **multivariate normal (MVN)** is the most widely used joint probability density function for continuous variables. The pdf of the MVN in $D$ dimensions is defined by the following

$$\mathcal{N}(\mathbf{x}|\boldsymbol\mu,\mathbf{\Sigma}) \triangleq \frac{1}{(2 \pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp \left[ - \frac{1}{2} \left(\mathbf{x} - \boldsymbol\mu \right)^T \mathbf{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol\mu\right)\right]$$

where $\boldsymbol\mu = \mathbb{E}_X[\mathbf{x}] \in \mathbb{R}^D$ is the mean vector, and $\Sigma = \mathbb{V}_X[\mathbf{x}]=\text{Cov}[\mathbf{x}, \mathbf{x}]$ is the $D \times D$ covariance matrix. Sometimes we will work in terms of the **precision matrix** or **concentration matrix** instead. This is just the inverse covariance matrix, $\Lambda = \Sigma^{-1}$. The normalization constant $(2 \pi)^{-D/2}|\Lambda|^{1/2}$ ensures that the pdf integrates to 1.

The figure below plots some MVN densities in 2d for three different kinds of covariance matrices. A full covariance matrix has $D(D+1)/2$ parameters (we divide by 2 since $\Sigma$ is symmetric). A diagonal covariance matrix has $D$ parameters, and has 0s on the off-diagonal terms. A **spherical** or **isotropic** covariance, $\Sigma = \sigma^2 \mathbf{I}_D$, has one free parameter.
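
As a check on the MVN formula and the parameter counts, the sketch below evaluates the density directly and compares it with `scipy.stats.multivariate_normal`. The helper `mvn_pdf` and the test point `x` are hypothetical names and values introduced for illustration.

```python
import numpy as np
from scipy import stats


def mvn_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) from the formula (hypothetical helper)."""
    D = len(mu)
    diff = x - mu
    # solve Sigma z = diff instead of forming the inverse explicitly
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * maha)


mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.8],
                  [1.8, 2.0]])
x = np.array([1.0, -0.5])  # arbitrary test point

manual = mvn_pdf(x, mu, Sigma)
reference = stats.multivariate_normal.pdf(x, mean=mu, cov=Sigma)
print(np.isclose(manual, reference))

# free parameters in the covariance for each structure
D = len(mu)
n_full, n_diag, n_spherical = D * (D + 1) // 2, D, 1
print(n_full, n_diag, n_spherical)
```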

In [None]:
# plot a MVN in 2D and 3D
from scipy.linalg import eigh
from mpl_toolkits.mplot3d import Axes3D  # registers '3d' on older matplotlib
from matplotlib import cm

delta = 0.05
x = np.arange(-10.0, 10.0, delta)
y = np.arange(-10.0, 10.0, delta)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))  # grid points stacked into shape (len(y), len(x), 2)

S = np.asarray([[2.0, 1.8],
                [1.8, 2.0]])
mu = np.asarray([0, 0])

# mlab.bivariate_normal has been removed from matplotlib (and it expected
# standard deviations, not variances), so evaluate the density with scipy
Z = stats.multivariate_normal.pdf(pos, mean=mu, cov=S)

fig = plt.figure(figsize=(10, 10))

ax = fig.add_subplot(2, 2, 1)
CS = ax.contour(X, Y, Z)
ax.clabel(CS, inline=True, fontsize=10)
ax.set_xlim((-6, 6))
ax.set_ylim((-6, 6))
ax.set_title('full')

# Decorrelate: rotate into the eigenbasis of S
D, U = eigh(S)  # S is symmetric, so eigh returns real eigenvalues (ascending)
D, U = D[::-1], U[:, ::-1]  # largest variance first, i.e. along the x-axis
S1 = U.T @ S @ U  # diagonal up to rounding error

Z = stats.multivariate_normal.pdf(pos, mean=mu, cov=S1)

ax = fig.add_subplot(2, 2, 2)
CS = ax.contour(X, Y, Z)
ax.clabel(CS, inline=True, fontsize=10)
ax.set_xlim((-10, 10))
ax.set_ylim((-5, 5))
ax.set_title('diagonal')

# Whiten: additionally rescale each eigen-direction to unit variance
A = np.diag(1 / np.sqrt(D)) @ U.T
mu2 = A @ mu
S2 = A @ S @ A.T  # may not be numerically equal to I
print(np.allclose(S2, np.eye(2)))

# plot centred on original mu, not shifted mu
Z = stats.multivariate_normal.pdf(pos, mean=mu, cov=S2)

ax = fig.add_subplot(2, 2, 3)
CS = ax.contour(X, Y, Z)
ax.clabel(CS, inline=True, fontsize=10)
ax.set_xlim((-6, 6))
ax.set_ylim((-6, 6))
ax.set_title('spherical')

# demonstration of how to do a surface plot
axx = fig.add_subplot(2, 2, 4, projection='3d')
surf = axx.plot_surface(X, Y, Z, rstride=5, cstride=5, cmap=cm.coolwarm,
                        linewidth=0, antialiased=False)
axx.set_title('spherical')