In [1]:
import numpy as np
import matplotlib.pyplot as plt

<br>

# Univariate Gaussian
---

The formula for a univariate Gaussian (or normal) distribution:

&emsp; $\boxed{\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\textstyle - \frac{1}{2 \sigma^2} (x-\mu)^2} = \sqrt{\frac{\beta}{2 \pi}} \, e^{\textstyle - \frac{\beta}{2}(x-\mu)^2}}$
&emsp; where
&emsp; $\mathbb{E}[x] = \mu$
&emsp; and
&emsp; $\mathbb{V}[x] = \sigma^2 = \beta^{-1}$

Knowing the mean $\mu$ and the variance $\sigma^2$ (or alternatively the precision $\beta$) of a Gaussian distribution is enough to completely characterize it. The mean is also the mode (the point where the density is highest):

* The higher the variance $\sigma^2$, the smaller the precision $\beta$, and the more spread out the distribution is
* The higher the precision $\beta$, the smaller the variance $\sigma^2$, and the more concentrated the distribution is
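
As a quick sanity check, the density can be evaluated on a grid with NumPy. This is only a minimal sketch: the helper names `gaussian_pdf` and `gaussian_pdf_precision` and the values of `mu` and `sigma2` are illustrative choices, not part of the notes above.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density N(x | mu, sigma2), parameterized by the variance."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gaussian_pdf_precision(x, mu, beta):
    """Same density, parameterized by the precision beta = 1 / sigma2."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-beta / 2 * (x - mu) ** 2)

# Illustrative parameters (assumed values)
mu, sigma2 = 1.0, 4.0
xs = np.linspace(mu - 40, mu + 40, 200_001)
dx = xs[1] - xs[0]
pdf = gaussian_pdf(xs, mu, sigma2)

total_mass = np.sum(pdf) * dx   # Riemann sum of the density, close to 1
mode = xs[np.argmax(pdf)]       # grid point with the highest density, close to mu
same = np.allclose(pdf, gaussian_pdf_precision(xs, mu, 1 / sigma2))

print(total_mass, mode, same)
```

Increasing `sigma2` (lowering the precision) flattens the curve, while the total mass stays at 1.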

<br>

### Product of gaussians

Products of Gaussians appear often in the Bayesian treatment of machine learning, where we marginalize the probability of $x$ over the parameters $w$ of a model $p(x|w)$:

&emsp; $\displaystyle p(x) = \int_{\mathcal{W}} p(x|w) \, p(w) \, dw$

If $w$ follows a Gaussian distribution, and if $x$ given $w$ is Gaussian with a mean that is a linear function of $w$, then the marginal distribution of $x$ is also Gaussian:

&emsp; $p(w) = \mathcal{N}(\mu_{w}, \; \sigma_{w}^2)$
&emsp; and
&emsp; $p(x|w) = \mathcal{N}(a w + b, \; \alpha^2)$
&emsp; $\implies$
&emsp; $p(x) = \mathcal{N}(a \mu_w + b, \; a^2 \sigma_w^2 + \alpha^2)$
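
The marginal mean and variance can be checked by simulation. The sketch below draws $w$ from its prior and then $x$ given $w$. The numerical values of `mu_w`, `sigma_w`, `a`, `b` and `alpha` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed values)
mu_w, sigma_w = 0.5, 2.0        # p(w) = N(mu_w, sigma_w^2)
a, b, alpha = 3.0, -1.0, 1.5    # p(x|w) = N(a w + b, alpha^2)

n = 1_000_000
w = rng.normal(mu_w, sigma_w, size=n)   # sample the parameters
x = rng.normal(a * w + b, alpha)        # sample one x for each sampled w

expected_mean = a * mu_w + b
expected_var = a**2 * sigma_w**2 + alpha**2

print("mean:", x.mean(), "vs", expected_mean)
print("var: ", x.var(), "vs", expected_var)
```

The empirical values should agree with the formula up to Monte Carlo noise, which shrinks as `n` grows.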

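This is linked to correlation: the squared correlation coefficient $r^2$ is the fraction of the variance of $x$ that is removed by knowing $w$ (here $\sigma_x^2 = a^2 \sigma_w^2 + \alpha^2$ is the marginal variance of $x$ and $\alpha^2$ is its variance given $w$):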

&emsp; $\displaystyle r^2 = \frac{\sigma_x^2 - \alpha^2}{\sigma_x^2} = \frac{a^2 \sigma_w^2}{\sigma_x^2}$
&emsp; $\implies$
&emsp; $\displaystyle r = \frac{a \sigma_w}{\sigma_x} = \frac{a \sigma_w^2}{\sigma_x \sigma_w} = \frac{\mathbb{C}\text{ov}[x,w]}{\sqrt{\mathbb{V}[x] \mathbb{V}[w]}}$
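
The same kind of simulation can illustrate the correlation formula. Again this is only a sketch under assumed parameter values. `np.corrcoef` is used to estimate the correlation coefficient from samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (assumed values)
mu_w, sigma_w = 0.0, 1.0
a, b, alpha = 1.0, 2.0, 1.0

n = 200_000
w = rng.normal(mu_w, sigma_w, size=n)
x = rng.normal(a * w + b, alpha)

sigma_x = np.sqrt(a**2 * sigma_w**2 + alpha**2)   # marginal standard deviation of x
r_formula = a * sigma_w / sigma_x                 # correlation predicted by the formula
r_empirical = np.corrcoef(x, w)[0, 1]             # correlation estimated from samples
explained = 1 - alpha**2 / sigma_x**2             # fraction of variance removed by knowing w

print(r_formula, r_empirical, explained)
```

With these values, half of the variance of $x$ comes from $w$, so `explained` equals `r_formula ** 2`.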

<br>

### Convolutions of gaussians

The previous formula can be applied to the convolution of two Gaussians. This occurs whenever we want the distribution of the **sum of two independent** variables $t = x + y$ that each follow a Gaussian distribution:

&emsp; $\displaystyle p(t) = \int_{\mathcal{X}} p(t|x) \, p(x) \, dx$
&emsp; where
&emsp; $p(t|x) = \mathcal{N}(x + \mu_y, \, \sigma_y^2)$
&emsp; $\implies$
&emsp; $\displaystyle \boxed{p(x+y) = \mathcal{N}(\mu_x + \mu_y, \; \sigma_x^2 + \sigma_y^2)}$

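Note that the formulas for the mean and the variance of the sum hold for any two independent distributions with finite variance (although the sum is then generally not Gaussian). The mean follows from linearity of the expectation, and the variance from the fact that the covariance of independent variables is zero:

&emsp; $\mathbb{E}[x+y] = \mathbb{E}[x] + \mathbb{E}[y]$
&emsp; and
&emsp; $\mathbb{V}[x+y] = \mathbb{E}[(x - \mu_x + y - \mu_y)^2] = \mathbb{V}[x] + \mathbb{V}[y] + 2 \, \mathbb{C}\text{ov}[x,y] = \mathbb{V}[x] + \mathbb{V}[y]$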
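
Both statements can be illustrated numerically. The sketch below first convolves two Gaussian densities on a grid with `np.convolve` and compares the result with the predicted Gaussian. It then checks that means and variances add for two independent non-Gaussian variables. All parameter values, the grid, and the choice of a uniform and an exponential variable are assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Illustrative parameters (assumed values)
mu_x, s2_x = 1.0, 1.0
mu_y, s2_y = -2.0, 2.25

# Symmetric grid with an odd number of points, so that mode="same"
# returns the convolution aligned with the same grid.
xs = np.linspace(-30.0, 30.0, 6001)
dx = xs[1] - xs[0]

p_t = np.convolve(gaussian_pdf(xs, mu_x, s2_x),
                  gaussian_pdf(xs, mu_y, s2_y), mode="same") * dx
expected = gaussian_pdf(xs, mu_x + mu_y, s2_x + s2_y)
max_abs_diff = np.max(np.abs(p_t - expected))

# Means and variances also add for independent non-Gaussian variables.
rng = np.random.default_rng(2)
u = rng.uniform(0.0, 1.0, size=1_000_000)          # mean 1/2, variance 1/12
e = rng.exponential(scale=2.0, size=1_000_000)     # mean 2, variance 4
sum_mean, sum_var = (u + e).mean(), (u + e).var()

print(max_abs_diff, sum_mean, sum_var)
```

The second part only checks the first two moments: the sum of a uniform and an exponential variable is not Gaussian.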

<br>

### Moments

The $p$-th moment of a distribution is defined as the expected value of the function $x \mapsto x^p$. The $p$-th central moment is defined as the expected value of the function $x \mapsto (x-\mu)^p$. Because the Gaussian distribution is symmetric around its mean, its central moments are zero for odd values of $p$.

&emsp; $\mathbb{E}[x] = \mu$ (first moment)

&emsp; $\mathbb{E}[x^2] = \mu^2 + \sigma^2$ (second moment)

&emsp; $\mathbb{E}[(x-\mu)^2] = \sigma^2 = \mathbb{V}[x]$ (central second moment, the variance)
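
These moments can be approximated by integrating numerically against the density. The following is a minimal sketch using a Riemann sum. The helper `moment` and the values of `mu` and `sigma` are hypothetical and only serve the illustration.

```python
import numpy as np

# Illustrative parameters (assumed values)
mu, sigma = 1.5, 0.8
xs = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
dx = xs[1] - xs[0]
pdf = np.exp(-(xs - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def moment(p, central=False):
    """Riemann-sum approximation of E[x^p], or of E[(x - mu)^p] if central."""
    base = xs - mu if central else xs
    return np.sum(base**p * pdf) * dx

first = moment(1)                          # close to mu
second = moment(2)                         # close to mu^2 + sigma^2
variance = moment(2, central=True)         # close to sigma^2
third_central = moment(3, central=True)    # close to 0 by symmetry

print(first, second, variance, third_central)
```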

<br>

### Exponential family

The Gaussian distribution is a member of the exponential family, whose distributions have the form:

&emsp; $\displaystyle p(x|\theta) = h(x) \, g(\theta ) \, \exp \big( \eta (\theta )\cdot T(x) \big)$

For the Gaussian, the parameters are $\theta = (\mu, \sigma^2)$ and one possible (not unique) decomposition is:

&emsp; $T(x) = \begin{pmatrix} 1 \\ x \\ x^2 \end{pmatrix}$
&emsp; $\eta(\theta) = \begin{pmatrix} -\mu^2 / (2 \sigma^2) \\ \mu / \sigma^2 \\ - 1 / (2 \sigma^2) \end{pmatrix}$
&emsp; $\displaystyle h(x) = \frac{1}{\sqrt{2 \pi}}$
&emsp; $\displaystyle g(\theta) = \frac{1}{\sqrt{\sigma^2}}$

So if a density is an exponential with an **order 2 polynomial in x** inside it (with a negative coefficient on $x^2$), it is a Gaussian.
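
The decomposition above can be checked by rebuilding the density from $h$, $g$, $\eta$ and $T$ and comparing it with the usual formula. This is a sketch with arbitrary assumed values for `mu` and `sigma2`.

```python
import numpy as np

# Illustrative parameters (assumed values)
mu, sigma2 = -0.5, 2.0

def T(x):
    """Sufficient statistics (1, x, x^2), stacked as rows."""
    return np.stack([np.ones_like(x), x, x**2])

eta = np.array([-mu**2 / (2 * sigma2), mu / sigma2, -1 / (2 * sigma2)])
h = 1 / np.sqrt(2 * np.pi)
g = 1 / np.sqrt(sigma2)

xs = np.linspace(-6.0, 6.0, 1001)
from_family = h * g * np.exp(eta @ T(xs))   # h(x) g(theta) exp(eta . T(x))
direct = np.exp(-(xs - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
max_abs_diff = np.max(np.abs(from_family - direct))

print(max_abs_diff)
```

The dot product `eta @ T(xs)` expands to $-(x-\mu)^2 / (2 \sigma^2)$, which is why the two expressions agree up to rounding.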

<br>

### Indefinite integral

The Gaussian density **does not have an indefinite integral** that can be written with elementary functions. We can still compute the square of its definite integral over the real line, and this is how we find the normalizing constant. To evaluate the integral of $e^{-x^2}$, we use a two-variable (multivariate) Gaussian integral:

&emsp; $\displaystyle I = \int_{-\infty}^{\infty} e^{-x^2} dx = 2 \int_0^{\infty} e^{-x^2} dx$
&emsp; $\implies$
&emsp; $\displaystyle I^2 = 4 \int_0^{\infty} e^{-x^2} dx \int_0^{\infty} e^{-y^2} dy = 4 \int_0^{\infty} \Bigg( \int_0^{\infty} e^{-x^2(1 + \frac{y^2}{x^2})} dy \Bigg) dx$

We then do the change of variables $s = \frac{y}{x}$ which implies $dy = x \, ds$:

&emsp; $\displaystyle I^2 = 4 \int_0^{\infty} \Bigg( \int_0^{\infty} e^{-x^2(1 + s^2)} x ds \Bigg) dx = 4 \int_0^{\infty} \Bigg( \int_0^{\infty} \frac{-2 x (1 + s^2)}{- 2(1 + s^2)} e^{-x^2(1 + s^2)} dx \Bigg) ds$

&emsp; $\displaystyle I^2 = 4 \int_0^{\infty} \Big[ \frac{1}{- 2(1 + s^2)} e^{-x^2(1 + s^2)} \Big]_{x=0}^{x=\infty} ds = 4 \int_0^{\infty} \frac{1}{2(1 + s^2)} ds = 2 \Big[ \arctan s \Big]_{s=0}^{s=\infty} = \pi$
&emsp; $\implies$
&emsp; $I = \sqrt{\pi}$
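
Even without a closed-form antiderivative, the definite integral is easy to approximate numerically. The sketch below checks that $I^2$ is close to $\pi$ and, as a consequence, that the unnormalized Gaussian integrates to $\sqrt{2 \pi \sigma^2}$. The grids and the values of `mu` and `sigma2` are assumptions for the example.

```python
import numpy as np

# Riemann sum of exp(-x^2); the tails beyond |x| = 10 are negligible
xs = np.linspace(-10.0, 10.0, 20_001)
dx = xs[1] - xs[0]
I = np.sum(np.exp(-xs**2)) * dx
print(I**2, np.pi)

# Same idea for the unnormalized Gaussian exp(-(x - mu)^2 / (2 sigma^2))
mu, sigma2 = 1.0, 4.0   # illustrative values
ys = np.linspace(mu - 40.0, mu + 40.0, 80_001)
dy = ys[1] - ys[0]
Z = np.sum(np.exp(-(ys - mu) ** 2 / (2 * sigma2))) * dy
print(Z, np.sqrt(2 * np.pi * sigma2))
```

The second result follows from the first by the change of variables $u = (x - \mu) / \sqrt{2 \sigma^2}$.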

<br>

### Maximizing entropy

The Gaussian distribution maximizes the differential entropy **for a given mean and variance**:

&emsp; $\displaystyle h(p) = - \int_{-\infty}^{\infty} p(x) \log p(x) \, dx$

To prove it, we consider the KL divergence between any other distribution $f$ with the same mean $\mu$ and variance $\sigma^2$ and the normal distribution $g$, and use the property that the KL divergence is always non-negative:

&emsp; $\displaystyle D_{KL}(f||g) = - \int f(x) \log \frac{g(x)}{f(x)} dx = - h(f) - \int f(x) \big( \log \frac{1}{\sqrt {2 \pi \sigma^2}} - \frac{(x-\mu)^2}{2 \sigma^2} \big) dx$

&emsp; $\displaystyle D_{KL}(f||g) = - h(f) + \frac{1}{2} \log (2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \int f(x) (x-\mu)^2 dx$

&emsp; $\displaystyle D_{KL}(f||g) = - h(f) + \frac{1}{2} \log (2 \pi \sigma^2) + \frac{\sigma^2}{2 \sigma^2} = -h(f) + h(g) \ge 0$

Where the differential entropy of the gaussian distribution is:

&emsp; $\displaystyle h(p) = - \int_{-\infty}^{\infty} p(x) \log p(x) \, dx = \log (\sigma \sqrt{2 \pi}) + \frac{1}{2} = \frac{1}{2} \log ( 2 \pi \sigma^2 ) + \frac{1}{2}$
&emsp; (see [this link](https://proofwiki.org/wiki/Differential_Entropy_of_Gaussian_Distribution)).
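
To illustrate the result, the differential entropy of a Gaussian can be compared numerically with that of two other densities of the same mean and variance. The Laplace and uniform densities used here are illustrative choices that do not appear in the notes above, and `differential_entropy` is a hypothetical helper based on a Riemann sum.

```python
import numpy as np

def differential_entropy(p, dx):
    """Riemann-sum approximation of -integral p log p, skipping points where p = 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask])) * dx

sigma2 = 1.0   # common variance (assumed value), common mean 0
xs = np.linspace(-20.0, 20.0, 40_001)
dx = xs[1] - xs[0]

# Gaussian with variance sigma2
gauss = np.exp(-xs**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Laplace with scale b has variance 2 b^2, so b = sqrt(sigma2 / 2)
b = np.sqrt(sigma2 / 2)
laplace = np.exp(-np.abs(xs) / b) / (2 * b)

# Uniform on [-c, c] has variance c^2 / 3, so c = sqrt(3 sigma2)
c = np.sqrt(3 * sigma2)
uniform = np.where(np.abs(xs) <= c, 1 / (2 * c), 0.0)

h_gauss = differential_entropy(gauss, dx)
h_laplace = differential_entropy(laplace, dx)
h_uniform = differential_entropy(uniform, dx)
h_closed_form = 0.5 * np.log(2 * np.pi * sigma2) + 0.5

print(h_gauss, h_closed_form, h_laplace, h_uniform)
```

The Gaussian value should match the closed form and exceed the other two, as the KL argument predicts.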

<br>

# Multivariate Gaussian
---

* slicing
* not the same as conditional gaussian
* laws of deductions
* PCA to find the axis

<br>

# Multiple data points
---

* independence => covariance matrix of the form $\alpha^{-1} I_n$

<br>

# Conjugate priors
---

* todo