In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('paper')

# Hands-On Activity 5.2: The Gaussian Distribution

## Objectives

+ To practice with the Gaussian distribution.

## The Normal distribution

The normal (or Gaussian) distribution is a ubiquitous one.
It appears over and over again.
There are two explanations as to why it appears so often:

+ It is the distribution of maximum uncertainty that matches a known mean and a known variance.
+ It is the distribution that arises when you add a lot of random variables together.

We will learn about both of these in the next lectures.

We write:
$$
X | \mu, \sigma \sim N(\mu, \sigma^2),
$$
and we read "$X$ conditioned on $\mu$ and $\sigma$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$."

When $\mu=0$ and $\sigma^2=1$, we say that we have a *standard normal* distribution.
Let
$$
Z\sim N(0,1).
$$
The PDF of the standard normal is:
$$
\phi(z) := N(z|0,1) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{z^2}{2}\right\}.
$$
The CDF of the standard normal,
$$
\Phi(z) := p(Z \le z) = \int_{-\infty}^z \phi(z')dz',
$$
is not analytically available, i.e., it cannot be written in closed form using elementary functions.
However, there are numerical routines that can compute it.
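One standard way to evaluate $\Phi$ numerically is through the error function, using the identity $\Phi(z) = \frac{1}{2}\left[1 + \operatorname{erf}\left(z/\sqrt{2}\right)\right]$. The sketch below uses only Python's built-in `math.erf`; the helper name `std_normal_cdf` is made up for this illustration and is not part of any library.

```python
import math


def std_normal_cdf(z):
    """CDF of the standard normal, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Evaluate the CDF at a few points
for z in [-1.96, 0.0, 1.96]:
    print('Phi({0:1.2f}) = {1:1.4f}'.format(z, std_normal_cdf(z)))
```

In practice you would rarely write this yourself, since ``scipy.stats`` already provides the CDF.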

In [None]:
import scipy.stats as st

# Here is how you can get the PDF of the standard normal
Z = st.norm()
fig, ax = plt.subplots(dpi=150)
zs = np.linspace(-4.0, 4.0, 100)
ax.plot(zs, Z.pdf(zs), lw=2)
ax.set_xlabel('$z$')
ax.set_ylabel(r'$\phi(z) = N(z|0,1)$');  # raw string so that '\p' is not treated as an escape sequence

In [None]:
# And here is the CDF of the standard normal
fig, ax = plt.subplots(dpi=150)
ax.plot(zs, Z.cdf(zs), lw=2)
ax.set_xlabel('$z$')
ax.set_ylabel(r'$\Phi(z)$');  # raw string so that '\P' is not treated as an escape sequence

In [None]:
# Here is the expectation:
print('E[Z] = {0:1.2f}'.format(Z.expect()))

In [None]:
# Here is the variance:
print('V[Z] = {0:1.2f}'.format(Z.var()))

In [None]:
# Here is the probability that Z is between two numbers
a = 1.0
b = 3.0
prob_Z_in_ab = Z.cdf(b) - Z.cdf(a)
print('p({0:1.2f} <= Z <= {1:1.2f}) = {2:1.2f}'.format(a, b, prob_Z_in_ab))

In [None]:
# And here is how you can sample
Z.rvs(100)

In [None]:
# And, of course, you can also sample using the functionality of numpy:
np.random.randn(100)
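The NumPy documentation recommends the newer `Generator` interface for new code. Here is a minimal sketch of the same sampling with a seeded generator; the seed `12345` is arbitrary and is only there to make the draw reproducible.

```python
import numpy as np

# Create a random number generator with a fixed (arbitrary) seed
rng = np.random.default_rng(12345)

# Draw 100 samples from the standard normal
z_samples = rng.standard_normal(100)
print(z_samples[:5])
```

Re-running the cell with the same seed gives the same samples, which is convenient when you want results that others can reproduce.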

There are a few more interesting things to know about the standard normal.
For example, how can you find a value $z_q$ such that the probability of $Z$ being less than $z_q$ is $q$\%?
Mathematically, you wish to find this:
$$
\Phi(z_q) = \frac{q}{100}.
$$
The point $z_q$ is called the $q$\% quantile.
To find it, you need to do this:
$$
z_q = \Phi^{-1}\left(\frac{q}{100}\right).
$$
For example, $z_{50}$ is called the median (and it coincides with the expectation here).
Another interesting pair of quantiles is $z_{2.5}$ and $z_{97.5}$.
Why? Because the probability between them is $95$\%.
Here it is:
$$
p(z_{2.5} \le Z \le z_{97.5}) = \Phi(z_{97.5}) - \Phi(z_{2.5}) = \frac{97.5}{100} - \frac{2.5}{100} = \frac{95}{100}.
$$
Let's find these quantiles and visualize them using the functionality of ``scipy.stats``.

In [None]:
z_025 = Z.ppf(0.025) # ppf = percent point function and it is essentially the inverse of the CDF
z_500 = Z.ppf(0.5)
z_975 = Z.ppf(0.975)
print('2.5% quantile of Z = {0:1.2f}'.format(z_025))
print('50% quantile of Z = {0:1.2f}'.format(z_500))
print('97.5% quantile of Z = {0:1.2f}'.format(z_975))

In [None]:
# Here is how much probability there is between the two extreme quantiles:
print('p({0:1.2f} <= Z <= {1:1.2f}) = {2:1.2f}'.format(z_025, z_975, Z.cdf(z_975) - Z.cdf(z_025)))
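Since `ppf` is the inverse of the CDF, applying `cdf` to a quantile should return the probability level you started from. The following sketch checks this round trip for a few levels at once; the variable names `qs` and `zq` are chosen just for this example.

```python
import numpy as np
import scipy.stats as st

Z = st.norm()

# Probability levels (as fractions, not percentages)
qs = np.array([0.025, 0.5, 0.975])

# ppf accepts arrays, so we get all quantiles in one call
zq = Z.ppf(qs)
print('quantiles   :', zq)

# Applying the CDF to the quantiles recovers the levels
print('cdf(ppf(q)) :', Z.cdf(zq))
```

Notice also that the 2.5% and 97.5% quantiles are mirror images of each other, because the standard normal PDF is symmetric about zero.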

In [None]:
# Let's also visualize the quantiles on top of the PDF
fig, ax = plt.subplots(dpi=150)
ax.plot(zs, Z.pdf(zs), lw=2)
ax.plot(z_025, [0.0], 'o', label='2.5% quantile')
ax.plot(z_975, [0.0], 'o', label='97.5% quantile')
plt.legend(loc='best')

## Question

+ Modify the code above so that you find and visualize $z_{0.001}$ and $z_{99.999}$.
+ What is the difference between $z_{99.999}$ and $z_{0.001}$?
+ What is the probability that $Z$ is between $z_{0.001}$ and $z_{99.999}$?

## Getting any normal from the standard normal
Using the standard normal, we can express any normal.
It is easy to show that:
$$
X = \mu + \sigma Z,
$$
follows a $N(\mu,\sigma^2)$ if $Z$ follows $N(0,1)$.
Note that $\sigma$ is called the **standard deviation** of $X$ (the standard deviation of a random variable is just the square root of the variance).
You must remember this!
It is extremely useful and it will appear again and again.
For example, using this relationship you can sample from any normal using samples from the standard normal.
Let's take some samples exploiting this relationship and then compare the histogram to the true PDF.

In [None]:
mu = 1.0
sigma = 0.1
X = st.norm(mu, sigma)
xs = np.linspace(mu - 6.0 * sigma, mu + 6.0 * sigma, 100)
x_samples = mu + sigma * Z.rvs(size=10000)
fig, ax = plt.subplots(dpi=150)
ax.hist(x_samples, density=True, alpha=0.5)
ax.plot(xs, X.pdf(xs))
ax.set_xlabel('$x$')
ax.set_ylabel('$p(x)$');
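Besides comparing the histogram to the PDF by eye, you can check numerically that samples built as $\mu + \sigma Z$ have roughly the right mean and standard deviation. This is a small sketch under the same values of `mu` and `sigma`; it draws the standard normal samples with a seeded NumPy generator (seed `0`, chosen arbitrarily) instead of `Z.rvs`.

```python
import numpy as np

mu = 1.0
sigma = 0.1

# Standard normal samples, shifted and scaled to get samples from N(mu, sigma^2)
rng = np.random.default_rng(0)
x_samples = mu + sigma * rng.standard_normal(10000)

# The sample statistics should be close to mu and sigma
print('sample mean = {0:1.3f}'.format(x_samples.mean()))
print('sample std  = {0:1.3f}'.format(x_samples.std()))
```

The agreement is only approximate because we are using a finite number of samples.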

How can you find the quantiles of this normal? Well, you can simply use the functionality of ``scipy.stats``.
As an example, let's find $x_{2.5}$:

In [None]:
x_025 = X.ppf(0.025)
print('2.5% quantile of N({0:1.2f}, {1:1.2f}^2) = {2:1.2f}'.format(mu, sigma, x_025))

But we can also find this quantile by exploiting the connection between $X$ and $Z$.
The definition of a quantile of $X$ is:
$$
p(X \le x_q) = \frac{q}{100}.
$$
But, since $X=\mu+\sigma Z$, this is equivalent to:
$$
p(\mu + \sigma Z \le x_q) = \frac{q}{100},
$$
which becomes:
$$
p(\sigma Z \le x_q-\mu) = \frac{q}{100},
$$
and then:
$$
p\left(Z \le \frac{x_q-\mu}{\sigma}\right) = \frac{q}{100}.
$$
This is just:
$$
\Phi\left(\frac{x_q-\mu}{\sigma}\right) = \frac{q}{100},
$$
which tells us that $\frac{x_q-\mu}{\sigma}$ is the $q$\% quantile of $Z$, i.e.,
$$
z_q = \frac{x_q-\mu}{\sigma}.
$$
Solving for $x_q$, we get:
$$
x_q = \mu + \sigma z_q.
$$
Let's do a sanity check:

In [None]:
z_025 = Z.ppf(0.025)
print('mu + sigma * z_025 = {0:1.2f}'.format(mu + sigma * z_025))

This is the same as what we found before. Let's also find the 97.5% quantile:

In [None]:
z_975 = Z.ppf(0.975)
x_975 = mu + sigma * z_975
print('97.5% quantile of N({0:1.2f}, {1:1.2f}^2) = {2:1.2f}'.format(mu, sigma, x_975))

In [None]:
# Let's visualize the quantiles like we did before:
fig, ax = plt.subplots(dpi=150)
ax.plot(xs, X.pdf(xs))
ax.plot(x_025, 0, 'o', label='2.5% quantile')
ax.plot(x_975, 0, 'o', label='97.5% quantile')
ax.set_xlabel('$x$')
ax.set_ylabel('$p(x)$')
plt.legend(loc='best');
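The relationship $x_q = \mu + \sigma z_q$ can be wrapped in a small helper and compared against the quantiles that ``scipy.stats`` computes directly. This is only a sketch: the function name `normal_quantile` is hypothetical, and it takes $q$ as a percentage to match the notation used in the text.

```python
import numpy as np
import scipy.stats as st


def normal_quantile(q, mu, sigma):
    """Return the q% quantile of N(mu, sigma^2) using x_q = mu + sigma * z_q."""
    z_q = st.norm().ppf(np.asarray(q) / 100.0)
    return mu + sigma * z_q


mu, sigma = 1.0, 0.1
qs = [2.5, 50.0, 97.5]

# Quantiles through the standard normal
via_z = normal_quantile(qs, mu, sigma)

# Quantiles computed directly from N(mu, sigma^2)
direct = st.norm(mu, sigma).ppf(np.array(qs) / 100.0)

print('via standard normal:', via_z)
print('directly           :', direct)
```

The two rows should agree up to floating point error, and the 50% quantile should equal $\mu$.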

Now, let's find the distance between $x_{2.5}$ and $x_{97.5}$ in terms of the standard deviation $\sigma$.
We have:
$$
x_{97.5} - x_{2.5} = \mu + \sigma z_{97.5} - \mu - \sigma z_{2.5} = \sigma (z_{97.5} - z_{2.5}).
$$
This is:

In [None]:
print('x_975 - x_025 ~= sigma * {0:1.2f}'.format(z_975 - z_025))

Okay. So we see that 95% of the probability is contained within a $3.92\sigma$ interval.
This interval is centered at the median (which here happens to be the same as the mode and the expectation of the probability density).
The value 3.92 is a little bit awkward, so we are going to round it up to $4\sigma$, i.e., two standard deviations on each side of the mean.
That interval contains slightly more than 95% of the probability, but it's simpler to remember.
So, remember:
$$
p(\mu - 2\sigma < X < \mu + 2 \sigma) \approx 0.95,
$$
for a normal random variable $N(\mu,\sigma^2)$.

## Questions

+ Write code that finds exactly how much probability there is between $\mu - 2\sigma$ and $\mu + 2\sigma$, i.e., find $p(\mu - 2\sigma < X < \mu + 2 \sigma)$.
+ Modify the code you just wrote to find how much probability there is between $\mu - 3\sigma$ and $\mu + 3\sigma$, i.e., find $p(\mu - 3\sigma < X < \mu + 3 \sigma)$. This is the six-sigma interval about the mean. Have you ever heard of the [six-sigma process improvement technique](https://en.wikipedia.org/wiki/Six_Sigma)?