## 1 Characterizing a Distribution

In [1]:
from scipy import stats

In [8]:
import numpy as np

In [4]:
import pandas as pd

### 1.1 Distribution Center

#### Mode

The *mode* is the most frequently occurring value in a distribution.

In [2]:
data = [1, 3, 4, 4, 7]

In [3]:
stats.mode(data)

ModeResult(mode=array([4]), count=array([2]))

In [5]:
data = pd.Series(data)

In [6]:
data.mode()

0    4
dtype: int64
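When a sample has more than one most frequent value, the two APIs behave differently. The sketch below uses a made-up sample (`bimodal` is an illustrative name, not part of the original notebook): `pandas` returns every mode, while `scipy.stats.mode` reports only one of them, the smallest. The `np.atleast_1d` call is there only because the shape of the SciPy result differs between SciPy versions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample with two modes (1 and 2 each occur twice)
bimodal = [1, 1, 2, 2, 3]

# pandas returns all modes, sorted in ascending order
pandas_modes = pd.Series(bimodal).mode().tolist()

# scipy returns a single mode (the smallest one); atleast_1d handles
# both the older array-valued and the newer scalar-valued result
scipy_mode = int(np.atleast_1d(stats.mode(bimodal).mode)[0])

print(pandas_modes, scipy_mode)
```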

#### Geometric Mean

In [9]:
x = np.arange(1, 101)

In [10]:
stats.gmean(x)

37.992689344834304
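The geometric mean of $n$ positive values is $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$, which equals the exponential of the arithmetic mean of the logarithms. As a rough check of the result above, the following sketch computes it by hand through logarithms (the names `gmean_manual` and `gmean_scipy` are illustrative):

```python
import numpy as np
from scipy import stats

x = np.arange(1, 101)

# Exponential of the mean of the logs; this avoids the overflow that
# multiplying all 100 values directly would cause
gmean_manual = np.exp(np.mean(np.log(x)))
gmean_scipy = stats.gmean(x)

print(gmean_manual, gmean_scipy)
```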

### 1.2 Quantifying Variability

#### Range

The *range* is simply the difference between the highest and the lowest data value.

In [12]:
np.ptp(data)

6

`ptp` stands for "peak-to-peak".
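As a small sanity check, the same number can be obtained by subtracting the minimum from the maximum. This sketch rebuilds the sample from the mode example as a NumPy array (`value_range` is an illustrative name):

```python
import numpy as np

data = np.array([1, 3, 4, 4, 7])

# Range computed directly from its definition
value_range = data.max() - data.min()

print(value_range, np.ptp(data))
```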

#### Percentiles

$$
CDF(x) = \int_{-\infty}^{x} PDF(t)\,dt
$$

Percentiles are the inverse of the CDF (cumulative distribution function): for a given CDF value $p$, the percentile is the $x$ for which $CDF(x) = p$. The 50th percentile, for example, is the median.
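A minimal sketch of both views, assuming NumPy's default linear interpolation for empirical percentiles and a standard normal distribution for the theoretical case. The sample and the names `median`, `q1`, `q3`, `x_p`, and `p_back` are illustrative; `stats.norm.ppf` is SciPy's "percent point function", the inverse of `stats.norm.cdf`.

```python
import numpy as np
from scipy import stats

# Empirical percentiles of a sample
sample = np.arange(1, 101)
median = np.percentile(sample, 50)
q1, q3 = np.percentile(sample, [25, 75])

# Theoretical percentile: ppf inverts the CDF of the standard normal
p = 0.975
x_p = stats.norm.ppf(p)       # x such that CDF(x) = p
p_back = stats.norm.cdf(x_p)  # applying the CDF recovers p

print(median, q1, q3, x_p, p_back)
```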

#### Standard Deviation and Variance

The maximum likelihood estimator of the variance (for normally distributed data) divides the sum of squared deviations by $n$:

$$
var=\frac{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}{n}
$$

The unbiased estimator of the population variance, usually called the sample variance, divides by $n - 1$ instead:

$$
var = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}
$$

For a proof that this is an unbiased estimator of $\sigma^2$, see:
https://www.ma.utexas.edu/users/mks/M358KInstr/SampleSDPf.pdf
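To make the difference between the two denominators concrete, the sketch below evaluates both formulas by hand on a small sample. The variable names are illustrative.

```python
import numpy as np

data = np.arange(7, 14)  # 7, 8, ..., 13; the mean is 10
n = data.size

# Sum of squared deviations from the mean: 9 + 4 + 1 + 0 + 1 + 4 + 9 = 28
sq_dev = np.sum((data - data.mean()) ** 2)

var_n = sq_dev / n                # divide by n
var_n_minus_1 = sq_dev / (n - 1)  # divide by n - 1

print(var_n, var_n_minus_1)
print(np.sqrt(var_n), np.sqrt(var_n_minus_1))
```

The gap between the two estimates shrinks as $n$ grows, since the denominators differ only by one.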

By default, `numpy` computes the variance and standard deviation with $n$ in the denominator (`ddof=0`). To obtain the sample variance, set `ddof=1`.

In [13]:
data = np.arange(7, 14)

In [14]:
np.std(data)

2.0

In [15]:
np.std(data, ddof=1)

2.1602468994692869

In pandas, `ddof=1` is the default.
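A short sketch comparing the two libraries on the same values; the names `values` and `series` are illustrative. Passing `ddof` explicitly is a simple way to avoid depending on either default.

```python
import numpy as np
import pandas as pd

values = np.arange(7, 14)
series = pd.Series(values)

std_pandas = series.std()          # pandas default: ddof=1
std_pandas_n = series.std(ddof=0)  # matches the numpy default
std_numpy = np.std(values)         # numpy default: ddof=0

print(std_pandas, std_pandas_n, std_numpy)
```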