## Distributions

See Chapters 2-4 & 6 of [Think Stats 2nd Edition](https://greenteapress.com/wp/think-stats-2e/).

In this notebook we will be working with tensors with at least two dimensions `(batch_size, num_features)`.

This is to get you used to the idea of having multiple samples in a single array (when training neural networks your `x_train` and `y_train` will have at least two dimensions).

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from common import make_pmf, make_cdf, percentile, percentile_rank

plt.style.use('ggplot')

Sample data from **two normal** distributions and **two uniform** distributions (four in total):

In [None]:
normal = np.random.normal((10, 10), (10, 20), size=(1000, 2)).astype(int)

uniform = np.random.uniform(low=-50, high=50, size=(1000, 2)).astype(int)
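To make the array layout concrete, here is a minimal sketch of the same sampling call with a seeded generator. The names `rng` and `demo_normal` are assumptions for illustration and are not part of the notebook. Each row is one sample and each column is one distribution:

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the notebook cells above are unseeded)
rng = np.random.default_rng(0)

# loc and scale broadcast across the last axis:
# column 0 is drawn with mean 10, std 10; column 1 with mean 10, std 20
demo_normal = rng.normal((10, 10), (10, 20), size=(1000, 2)).astype(int)

print(demo_normal.shape)        # (1000, 2): 1000 samples, 2 distributions
print(demo_normal.std(axis=0))  # one value per column, roughly 10 and 20
```

Reducing over `axis=0` collapses the sample dimension and leaves one statistic per column, which is the pattern used throughout the rest of this notebook.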

## Central tendency

The **mean** is also known as the **expected value** (*on expectation* == on average):

In [None]:
np.mean(normal, axis=0)

The **median** is a percentile-based statistic (more on them later) and is more robust to outliers than the mean:

In [None]:
np.median(normal, axis=0)

The difference between the mean & median can characterize skew:

In [None]:
np.mean(normal, axis=0) - np.median(normal, axis=0)
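For our roughly symmetric normal samples this difference should be close to zero. A small hand-made example (the `skewed` array is hypothetical) shows how a single outlier moves the mean but not the median:

```python
import numpy as np

# Hypothetical right-skewed sample: one large outlier pulls the mean upward
skewed = np.array([1, 2, 3, 4, 100])

mean = np.mean(skewed)      # 22.0
median = np.median(skewed)  # 3.0

# A positive difference suggests right (positive) skew
print(mean - median)        # 19.0
```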

## Spread / variability of the data

**Variance** - the average squared distance of a variable from its mean:

$$ \sigma^2_x = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2 $$

In [None]:
np.var(uniform, axis=0)
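As a sanity check, the variance formula can be written out by hand and compared with `np.var`. This is a sketch on a tiny made-up array `x`; note that `np.var` divides by `n` by default (`ddof=0`), which matches the formula above, while `ddof=1` would give the sample variance:

```python
import numpy as np

# Hypothetical data: 3 samples, 2 features
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

mu = x.mean(axis=0)                         # per-column mean
manual_var = ((x - mu) ** 2).mean(axis=0)   # (1/n) * sum of squared deviations

print(manual_var)
assert np.allclose(manual_var, np.var(x, axis=0))
```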

**Standard deviation** - square root of the variance (in the same units as the data):

In [None]:
#  test that sqrt(variance) == standard deviation (allclose tolerates float rounding)
assert np.allclose(np.sqrt(np.var(uniform, axis=0)), np.std(uniform, axis=0))

np.std(uniform, axis=0)

## Histogram

We can use a **histogram** to show shape & outliers (the histogram performs binning on our continuous variables):

In [None]:
f = plt.hist(normal)
_ = plt.ylabel('frequency')

In [None]:
f = plt.hist(uniform)
_ = plt.ylabel('frequency')
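The binning step can be inspected without plotting. The sketch below uses `np.histogram` on a small made-up array to show the bin edges and the count that falls into each bin:

```python
import numpy as np

# Hypothetical data with one value far from the rest
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# 4 equal-width bins spanning min..max; counts[i] is the number of values in bin i
counts, edges = np.histogram(data, bins=4)

print(edges)   # bin boundaries: 1, 3, 5, 7, 9
print(counts)  # values per bin: 3, 5, 1, 1
```

Every bin is half-open `[left, right)` except the last, which also includes its right edge. That is why the value 9 is counted in the final bin.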

Histograms can also compare distributions (note that we have four!), but if our variables have different ranges the result can be hard to read:

In [None]:
f = plt.hist(normal)
f = plt.hist(uniform)
_ = plt.ylabel('frequency')

## Probability mass functions (PMF)

The PMF is a simple **normalizing of the counts** of discrete bins:
- this makes the y-axis comparable 
- look at `make_pmf` definition in `common.py`

Problems with PMFs:
- the more values we have, the smaller their probabilities become (and the larger the effect of noise on the probabilities)
- even with the correct scale, it is still hard to compare distributions

In [None]:
plt.bar(*make_pmf(normal[:, 0]))
plt.bar(*make_pmf(uniform[:, 0]))
_ = plt.ylabel('probability')
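The definition of `make_pmf` lives in `common.py` and is not shown here. As a rough idea of what such a helper might do, here is a hypothetical stand-in, `pmf_sketch`, which counts each distinct value and divides by the total. The `(values, probabilities)` return order is an assumption chosen to match the `plt.bar(*...)` call above:

```python
import numpy as np

def pmf_sketch(samples):
    """Hypothetical stand-in for make_pmf: distinct values and their probabilities."""
    values, counts = np.unique(samples, return_counts=True)
    return values, counts / counts.sum()

values, probs = pmf_sketch(np.array([1, 2, 2, 3, 3, 3]))
print(values)       # the distinct values 1, 2, 3
print(probs)        # their probabilities 1/6, 2/6, 3/6
print(probs.sum())  # normalized counts sum to 1
```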

## Percentile rank

Value & samples -> **percentile rank**

90th percentile = a value that is higher than 90% of the group

In [None]:
values = np.array([55, 77, 88, 66, 99])
value = 66

percentile_rank(value, values)
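The `percentile_rank` helper comes from `common.py`, so its exact definition is not visible here. One common definition, used in this hypothetical `percentile_rank_sketch`, is the fraction of samples less than or equal to the value. Whether the real helper uses `<=` or `<`, or returns a 0-100 scale instead, is an assumption:

```python
import numpy as np

def percentile_rank_sketch(value, samples):
    """Fraction of samples less than or equal to value (hypothetical helper)."""
    samples = np.asarray(samples)
    return np.mean(samples <= value)

scores = np.array([55, 77, 88, 66, 99])

# Two of the five scores (55 and 66) are <= 66
print(percentile_rank_sketch(66, scores))  # 0.4
```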

## Cumulative distribution functions (CDF)

The CDF is a function that maps from **value -> normalized percentile rank** (in the range of 0 to 1):

In [None]:
cdf = [(percentile_rank(v, values), v) for v in values]

In [None]:
cdf

We can then evaluate the CDF for any value of x using our percentile rank function:

In [None]:
percentile_rank(0, normal[:, 0])

Let's plot a CDF for two of our distributions.  We can see:
- the range (min & max)
- the median

In [None]:
y, x = zip(*make_cdf(normal[:, 0]))
plt.plot(x, y, label='normal')

y, x = zip(*make_cdf(uniform[:, 0]))
plt.plot(x, y, label='uniform')
plt.ylabel('cumulative %')

_ = plt.legend()
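An empirical CDF can also be built directly by sorting the samples. The sketch below is a hypothetical `ecdf_sketch`, not the `make_cdf` from `common.py`; it assumes the "less than or equal" definition of rank and returns the sorted values with their cumulative fractions:

```python
import numpy as np

def ecdf_sketch(samples):
    """Sorted values and the fraction of samples <= each value (hypothetical helper)."""
    x = np.sort(np.asarray(samples))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf_sketch([55, 77, 88, 66, 99])
print(x)  # sorted values: 55, 66, 77, 88, 99
print(y)  # cumulative fractions: 0.2, 0.4, 0.6, 0.8, 1.0

# Reading the median off the CDF: the first value whose cumulative fraction reaches 0.5
median_from_cdf = x[np.searchsorted(y, 0.5)]
print(median_from_cdf)  # 77
```

The first and last entries of `x` give the range, and the point where `y` crosses 0.5 gives the median. These are the two things to look for in the plot above.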

## Quantiles

Above we have mapped from 
- value & samples -> percentile rank

Let's now do the opposite 
- **percentile rank & samples -> value**

One example of a percentile-based statistic is the median (measuring the central tendency of the distribution):

In [None]:
percentile(0.5, normal[:, 0])

In [None]:
np.median(normal[:, 0])

Also useful are other percentile-based statistics such as the **interquartile range (IQR)**, which is the difference between the 75th and 25th percentiles:

In [None]:
percentile(0.75, normal[:, 0]) - percentile(0.25, normal[:, 0])

Also common are **quantiles** - values at equally spaced percentile ranks, such as the quartiles at 25%, 50% and 75%:

In [None]:
print(percentile(0.25, normal[:, 0]), percentile(0.5, normal[:, 0]), percentile(0.75, normal[:, 0]))
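NumPy offers the same mapping through `np.quantile`. The sketch below computes the quartiles and the IQR on a made-up array. `np.quantile` interpolates linearly between neighbouring samples by default, so its results may differ slightly from the `percentile` helper in `common.py`, whose method is not shown here:

```python
import numpy as np

data = np.arange(1, 101)  # hypothetical data: 1, 2, ..., 100

# Quartiles: values at ranks 0.25, 0.5 and 0.75
q25, q50, q75 = np.quantile(data, [0.25, 0.5, 0.75])
iqr = q75 - q25

print(q25, q50, q75)  # 25.75 50.5 75.25
print(iqr)            # 49.5
```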

## Sampling from CDFs

Because the distribution of percentile ranks is uniform, we can easily sample from a CDF by drawing uniform ranks between 0 and 1 and mapping each one back to a value:

In [None]:
f = plt.hist(
    [percentile(s, normal[:, 0]) for s in np.random.uniform(0, 1, 500)]
)
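This technique is usually called inverse transform sampling. A vectorized sketch using `np.quantile` in place of the `percentile` helper (an assumption, since the two may interpolate differently) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Hypothetical source samples whose distribution we want to resample
data = rng.normal(10, 10, size=1000)

# Draw uniform ranks in [0, 1) and map each rank back to a value
u = rng.uniform(0, 1, size=500)
resampled = np.quantile(data, u)

print(resampled.shape)                   # (500,)
print(data.mean(), resampled.mean())     # the two means should be similar
```

The resampled values can never fall outside the range of the original data, which is one limitation of sampling from an empirical CDF.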

## Comparing percentile ranks

We can compare values from one distribution with another - for example our two normal distributions:

In [None]:
value = 10

rank = percentile_rank(value, normal[:, 0])
rank

In [None]:
percentile(rank, normal[:, 1])
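The two steps above (value to rank in one distribution, then rank to value in the other) can be checked on arrays where the answer is easy to verify by hand. The arrays `a` and `b` are hypothetical, and `np.quantile` stands in for the `percentile` helper:

```python
import numpy as np

a = np.arange(1, 101)  # 1, 2, ..., 100
b = 2 * a              # 2, 4, ..., 200 (same shape, twice the scale)

# Step 1: percentile rank of 50 within a (fraction of samples <= 50)
rank = np.mean(a <= 50)

# Step 2: value in b that sits at the same rank
equivalent = np.quantile(b, rank)

print(rank)        # 0.5
print(equivalent)  # 101.0
```

A value at the median of `a` maps to the median of `b`, so the comparison is made on relative position rather than raw magnitude.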