# Reference: Statistical Measures

In [None]:
import numpy as np

### Mean

Synonymous with the term "average", the **mean** is calculated by summing all the values in a dataset and dividing by the number of values:

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i $$

In [None]:
data = [10, 8, 12, 15, 10, 9]
mean = sum(data) / len(data)
print(f'Mean: {mean:.2f}')

Mean: 10.67


### Median

The **median** is the midpoint value in the sorted collection of values.

$$
\text{Median} = \begin{cases}\text{the } \left(\frac{n + 1}{2}\right)^{\text{th}} \text{ term, if } n \text{ is odd} \\ \\ \text{the average of the } \left(\frac{n}{2}\right)^{\text{th}} \text{ and } \left(\frac{n}{2} + 1\right)^{\text{th}} \text{ terms, if } n \text{ is even}\end{cases}
$$

In [None]:
data = [10, 8, 12, 15, 10, 9]
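sorted_data = sorted(data)  # sort a copy so the original order is preserved
midpoint = len(sorted_data) // 2

if len(sorted_data) % 2 == 0:
    # Even count: average the two middle values
    median = (sorted_data[midpoint - 1] + sorted_data[midpoint]) / 2
else:
    # Odd count: take the single middle value
    median = sorted_data[midpoint]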

print(f'Median: {median:.2f}')

Median: 10.00
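
As a cross-check on the manual calculation, the same value can be obtained from library functions. This is a minimal sketch that assumes the standard library's `statistics.median` and NumPy's `np.median` are acceptable stand-ins for the hand-written logic. Both handle the odd and even cases internally.

```python
import statistics

import numpy as np

data = [10, 8, 12, 15, 10, 9]

# Both functions sort internally and average the two middle values when n is even
median_stdlib = statistics.median(data)
median_numpy = np.median(data)

print(f'Median (statistics): {median_stdlib:.2f}')
print(f'Median (NumPy): {median_numpy:.2f}')
```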


### Mode

The **mode** is simply the value that occurs most frequently.

In [None]:
data = [10, 8, 12, 15, 10, 9]

frequency_counts = {}
for num in data:
    frequency_counts[num] = frequency_counts.get(num, 0) + 1
mode = max(frequency_counts, key=frequency_counts.get)

print(f'Mode: {mode:.2f}')

Mode: 10.00
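
A dataset can have more than one mode, and the dictionary approach with `max` reports only the first of the tied values. The sketch below uses a hypothetical dataset (the `bimodal` list is an assumption for illustration, not part of the original example) and `statistics.multimode` from the standard library (Python 3.8+) to list every value that shares the highest count.

```python
import statistics

# Hypothetical data in which 10 and 8 both occur twice
bimodal = [10, 8, 12, 15, 10, 9, 8]

# multimode returns all of the most frequent values, in order of first appearance
modes = statistics.multimode(bimodal)

print(f'Modes: {modes}')
```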



- With small sample sizes, the mean typically provides the most accurate measure of central tendency.
- With larger sample sizes, the mean, median, and mode tend to coincide, as long as the distribution isn't skewed.
- Skewed distributions, in contrast, drag the mean away from the center and toward the tail, so the median is a better representation of the center.
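
The effect of skew on the mean can be seen with a small illustration. The `skewed` list below is a hypothetical dataset (the earlier example plus one large value of 100 added to the tail); it is an assumption for demonstration only.

```python
import statistics

# Hypothetical right-skewed sample: the earlier data plus one large tail value
skewed = [10, 8, 12, 15, 10, 9, 100]

skewed_mean = statistics.mean(skewed)
skewed_median = statistics.median(skewed)

# The single tail value pulls the mean upward; the median stays with the bulk of the data
print(f'Mean: {skewed_mean:.2f}')
print(f'Median: {skewed_median:.2f}')
```

Here the mean rises to roughly 23.43 while the median remains 10, which is why the median is the more representative choice for skewed data.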

## Measures of Dispersion

The most widely used measures of the dispersion of values around the center of a distribution are:

* Range
* Variance
* Standard deviation
* Standard error

The interquartile range (IQR) is a less commonly used measure of dispersion.
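
For reference, the IQR is the difference between the 75th and 25th percentiles. The following is a minimal sketch that assumes NumPy's `np.percentile` with its default linear interpolation; other quartile conventions can give slightly different values on small datasets.

```python
import numpy as np

data = [10, 8, 12, 15, 10, 9]

# 25th and 75th percentiles, using NumPy's default (linear) interpolation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(f'IQR: {iqr:.2f}')
```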

### Range

The **range** is simply the maximum value minus the minimum value.

In [None]:
data = [10, 8, 12, 15, 10, 9]
range_value = max(data) - min(data)
print(f'Range: {range_value}')

Range: 7


### Variance

**Variance** (denoted with $\sigma^2$) is the average squared distance between each point and the mean of the distribution: 
$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2 $$

In [None]:
data = [10, 8, 12, 15, 10, 9]
variance = np.var(data) 
print(f'Variance: {variance:.2f}')

Variance: 5.22


> Side notes:
> - Technically speaking, we should divide by $n-1$ with a sample of data, but with the large datasets typical of machine learning, the difference is negligible. If $n$ were a small number like 8, then it would matter.
> - Also technically speaking, the variance of a sample is typically denoted with $s^2$ as opposed to the Greek $\sigma^2$, akin to how $\bar{x}$ denotes the mean of a sample while the Greek $\mu$ is reserved for population mean.
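
To make the first side note concrete, the sketch below compares dividing by $n$ with dividing by $n-1$ on the six-value example. It relies on the `ddof` ("delta degrees of freedom") parameter of `np.var`, which defaults to 0.

```python
import numpy as np

data = [10, 8, 12, 15, 10, 9]

# ddof=0 (the default) divides by n; ddof=1 divides by n - 1
population_variance = np.var(data)
sample_variance = np.var(data, ddof=1)

print(f'Variance (divide by n): {population_variance:.2f}')
print(f'Variance (divide by n - 1): {sample_variance:.2f}')
```

With only six values, the two results differ noticeably (about 5.22 versus 6.27); the gap shrinks as $n$ grows.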

### Standard Deviation

Derived directly from variance is the **standard deviation** (denoted with $\sigma$), which is convenient because it is expressed in the same units as the values in the distribution:
$$ \sigma = \sqrt{\sigma^2} $$

A standard deviation close to zero indicates that data points are very close to the mean, whereas a larger standard deviation indicates data points are spread further away from the mean.

In [None]:
data = [10, 8, 12, 15, 10, 9]
standard_deviation = np.std(data)
print(f'Standard Deviation: {standard_deviation:.2f}')

Standard Deviation: 2.29
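
The relationship between the two formulas can also be traced by hand. This is an illustrative sketch that computes the variance from its definition (dividing by $n$), takes the square root, and compares the result with `np.std`.

```python
import math

import numpy as np

data = [10, 8, 12, 15, 10, 9]

# Variance from its definition: the mean of squared deviations from the mean
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation is the square root of the variance
standard_deviation = math.sqrt(variance)

print(f'Variance: {variance:.2f}')
print(f'Standard Deviation: {standard_deviation:.2f}')
print(f'Matches np.std: {np.isclose(standard_deviation, np.std(data))}')
```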
