# 3.4 Measures of Dispersion
In the previous chapter we looked at measures of central location to find a typical or central value that describes a variable. We may also want to know how observations vary around that center. For example, let's take another look at the Growth and Value data.

## The Range
The simplest measure of dispersion is the **range**. The range is the difference between the maximum and minimum values in a sample or a population.
$$
\text{Range} = \text{Max} - \text{Min}
$$

In [1]:
import pandas as pd
import math
gv = pd.read_csv('Growth_Value.csv', index_col=0)
gv.head(5)

Year,Growth,Value
1984,-5.5,-8.59
1985,39.91,22.1
1986,13.03,14.74
1987,-1.7,-8.58
1988,16.05,29.05


In [2]:
# Min/max for growth and value.
growth_min, growth_max = gv.Growth.min(), gv.Growth.max()
value_min, value_max = gv.Value.min(), gv.Value.max()

# Ranges.
growth_range = growth_max - growth_min
value_range = value_max - value_min

print('Growth Range:', growth_range)
print('Value Range:', value_range)

Growth Range: 120.38
Value Range: 90.6


Note that the range is a limited measure of dispersion, since it depends solely on the two extreme values. The interquartile range (IQR) used for a boxplot has the opposite property: it ignores the extreme values, but it still does not incorporate all observations.
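To make the contrast between the range and the IQR concrete, here is a minimal sketch that uses only the first five Growth observations shown in the table above, not the full data set. It assumes the pandas default (linear interpolation) for the quartiles; other quartile conventions can give slightly different values.

```python
import pandas as pd

# First five Growth observations from the table above (percent returns).
growth = pd.Series([-5.5, 39.91, 13.03, -1.7, 16.05])

# Quartiles with pandas' default linear interpolation.
q1 = growth.quantile(0.25)
q3 = growth.quantile(0.75)

# The IQR spans the middle half of the data; the range spans all of it.
iqr = q3 - q1
sample_range = growth.max() - growth.min()

print('IQR:', round(iqr, 2))
print('Range:', round(sample_range, 2))
```

In this small sample the range is much wider than the IQR because it is driven entirely by the single largest and smallest returns.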

## The Mean Absolute Deviation
A useful measure of dispersion should consider all observations and how they differ from the mean. Simply averaging the differences from the mean does not work: values above the mean give positive differences and values below the mean give negative ones, so they cancel out and the average is always zero. To address this, we take the absolute value of each difference. The **mean absolute deviation** (MAD) is the average of the absolute differences between the observations and the mean.

For sample values $x_1, x_2, ..., x_n$ the sample MAD is computed as
$$
MAD = \frac{\sum |x_i - \bar{x}|}{n}
$$
The population MAD is calculated similarly, using the population mean $\mu$ and all $N$ population observations $x_1, ..., x_N$.
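The reason for taking absolute values can be checked numerically. The sketch below again uses only the first five Growth observations from the table above as a small illustrative sample: the raw deviations average to zero (up to floating-point error), while the absolute deviations do not.

```python
import pandas as pd

# First five Growth observations from the table above (percent returns).
growth = pd.Series([-5.5, 39.91, 13.03, -1.7, 16.05])

# Signed deviations from the sample mean.
deviations = growth - growth.mean()

# Positive and negative deviations cancel, so their average is ~0.
mean_deviation = deviations.mean()

# Taking absolute values first gives the MAD.
mad = deviations.abs().mean()

print('Mean of raw deviations:', round(mean_deviation, 10))
print('MAD:', round(mad, 2))
```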

In [3]:
# Means.
growth_mean, value_mean = gv.Growth.mean(), gv.Value.mean()

# MADs.
g_MAD = round(sum(abs(gv.Growth - growth_mean)) / len(gv), 2)
v_MAD = round(sum(abs(gv.Value - value_mean)) / len(gv), 2)

print('Growth MAD:', g_MAD)
print('Value MAD:', v_MAD)

Growth MAD: 17.49
Value MAD: 13.67


Notice that the MAD for Growth is greater than the MAD for Value, indicating that the Growth observations are more dispersed around their mean.

## The Variance and the Standard Deviation
The **variance** and the **standard deviation** are the two most commonly used measures of dispersion. Instead of calculating the absolute differences, we calculate the squared differences from the mean. 

Squaring the differences makes later mathematics easier when we build on this measure. It also emphasizes larger differences more than smaller ones, whereas the MAD weights each difference only in proportion to its size.
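A small sketch can show how squaring emphasizes larger differences. The two sets of deviations below are hypothetical (they are not taken from the Growth and Value data) and were chosen so that both have the same average absolute deviation.

```python
import pandas as pd

# Hypothetical deviations from the mean; both sets average to zero.
even = pd.Series([-2.0, -2.0, 2.0, 2.0])    # four moderate deviations
spiky = pd.Series([-4.0, 0.0, 0.0, 4.0])    # two large deviations

# Average absolute deviation: identical for both sets.
abs_even, abs_spiky = even.abs().mean(), spiky.abs().mean()

# Average squared deviation: the large deviations dominate.
sq_even, sq_spiky = (even ** 2).mean(), (spiky ** 2).mean()

print('Mean absolute deviation:', abs_even, abs_spiky)  # 2.0 2.0
print('Mean squared deviation:', sq_even, sq_spiky)     # 4.0 8.0
```

The absolute measure cannot tell the two sets apart, while the squared measure is twice as large for the set with the two big deviations.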

Note that the sample and population versions of the variance and standard deviation have distinct formulas. For the sample variance $s^2$ and the sample standard deviation $s$ we have
$$
s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}
$$
$$
s = \sqrt{s^2}
$$
for sample observations $x_1, ..., x_n$.

For the population observations $x_1, ..., x_N$ the population variance $\sigma^2$ and the population standard deviation $\sigma$ are computed as
$$
\sigma^2 = \frac{\sum(x_i - \mu)^2}{N}
$$
$$
\sigma = \sqrt{\sigma^2}
$$
Note that the difference between these two formulas lies in the denominator for the variance. The sample variance uses $n-1$ rather than $n$ to ensure that the sample variance is an unbiased estimator for the population variance. 
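The two denominators correspond to the `ddof` ("delta degrees of freedom") argument of pandas' `Series.var` and `Series.std`, which divide by $n - \text{ddof}$. pandas defaults to `ddof=1`, the sample formula. The sketch below uses the first five Growth observations from the table above; treating them as a whole population in the second calculation is only an illustration.

```python
import pandas as pd

# First five Growth observations from the table above (percent returns).
growth = pd.Series([-5.5, 39.91, 13.03, -1.7, 16.05])
n = len(growth)

# Sum of squared deviations from the mean, shared by both formulas.
sum_sq = ((growth - growth.mean()) ** 2).sum()

# Sample variance: divide by n - 1 (pandas default, ddof=1).
sample_var = growth.var()

# Population variance: divide by n (ddof=0).
population_var = growth.var(ddof=0)

print('Matches n - 1 formula:', abs(sample_var - sum_sq / (n - 1)) < 1e-9)
print('Matches n formula:', abs(population_var - sum_sq / n) < 1e-9)
```

Because the sample formula divides by a smaller number, the sample variance is always at least as large as the population-style variance computed from the same values. Note that NumPy's `np.var` defaults to `ddof=0`, so the two libraries give different answers unless `ddof` is set explicitly.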

In [4]:
# Variance and standard deviation.
g_var = sum((gv.Growth - growth_mean)**2) / (len(gv) - 1)
g_std = math.sqrt(g_var)

v_var = sum((gv.Value - value_mean)**2) / (len(gv) - 1)
v_std = math.sqrt(v_var)

print('Growth var/stddev:', round(g_var,3), round(g_std,3))
print('Value var/stddev:', round(v_var,3), round(v_std,3))

Growth var/stddev: 566.406 23.799
Value var/stddev: 323.251 17.979


## The Coefficient of Variation
Sometimes we need to compare the variability of multiple variables that have different means or units of measurement. The **coefficient of variation** (CV) serves as a relative measure of dispersion by adjusting for differences in the magnitudes of the means. To do so, we divide the variable's standard deviation by its mean. The CV is a unitless measure that allows for direct comparisons of mean-adjusted dispersion.

Note that in our previous calculations of the variance and standard deviation, these measures still have units. The unit for the variances of the Growth and Value variables is $(\%^2)$, while for the standard deviations it is $(\%)$.

For a sample we have
$$
CV = s / \bar{x}
$$
and for a population
$$
CV = \sigma / \mu
$$
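The claim that the CV is unitless can be checked by changing the unit of measurement. In the sketch below, the first five Growth observations from the table above are expressed once in percent and once as decimal fractions. The rescaling is an illustrative assumption, not a step from the original analysis.

```python
import pandas as pd

# First five Growth observations from the table above, in percent.
growth_pct = pd.Series([-5.5, 39.91, 13.03, -1.7, 16.05])

# The same returns expressed as decimal fractions (e.g. 39.91% -> 0.3991).
growth_frac = growth_pct / 100

# The standard deviation changes with the unit of measurement...
std_pct, std_frac = growth_pct.std(), growth_frac.std()

# ...but the CV does not, because the mean is rescaled by the same factor.
cv_pct = std_pct / growth_pct.mean()
cv_frac = std_frac / growth_frac.mean()

print('Std ratio (pct / frac):', round(std_pct / std_frac, 6))
print('CVs equal:', abs(cv_pct - cv_frac) < 1e-9)
```

One caveat worth keeping in mind: because the CV divides by the mean, it becomes unstable and hard to interpret when the mean is close to zero.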

In [5]:
# CV for growth and value.
CV_g = g_std / growth_mean
CV_v = v_std / value_mean

print('Growth CV:', round(CV_g,2))
print('Value CV:', round(CV_v,2))

Growth CV: 1.51
Value CV: 1.5
