Dispersion measures how spread out a set of data is. This is especially important in finance because one of the main ways risk is measured is by how spread out returns have been historically. If returns have clustered tightly around a central value, we have less reason to worry. If returns have been widely scattered, the asset is riskier.

In [1]:
# Import libraries
import numpy as np

np.random.seed(74)

X = np.random.randint(100, size=20)

# Sort them
X = np.sort(X)
print('X: %s' %(X))

mu = np.mean(X)
print('Mean of X:', mu)

X: [ 4  9  9  9 11 28 30 31 35 35 37 37 40 53 62 73 74 91 94 95]
Mean of X: 42.85


Range is simply the difference between the maximum and minimum values in a dataset. Because it depends only on the two most extreme observations, it is very sensitive to outliers.

In [2]:

print('Range of X: %s' %(np.ptp(X)))

Range of X: 91
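
As a quick illustration of how the range reacts to a single extreme value, here is a minimal sketch. The array repeats the sorted sample printed earlier, and the added observation of 500 is a made-up value used only for demonstration.

```python
import numpy as np

# Values copied from the sorted sample X printed earlier
X = np.array([4, 9, 9, 9, 11, 28, 30, 31, 35, 35,
              37, 37, 40, 53, 62, 73, 74, 91, 94, 95])

# np.ptp ("peak to peak") is the maximum minus the minimum
range_ptp = np.ptp(X)
range_manual = X.max() - X.min()
print('Range via np.ptp:', range_ptp)
print('Range via max - min:', range_manual)

# Hypothetical outlier: 500 is an invented value, not part of the data
X_outlier = np.append(X, 500)
range_outlier = np.ptp(X_outlier)
print('Range after adding one outlier:', range_outlier)
```

One added observation is enough to change the range completely, which is why the range is usually read alongside the other measures in this section.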


The mean absolute deviation is the average of the distances of observations from the arithmetic mean. We use the absolute value of each deviation, so that 5 above the mean and 5 below the mean both contribute 5. Without the absolute value, the signed deviations from the mean would always sum to 0.

In [3]:
abs_dispersion = [np.abs(mu - x) for x in X]
MAD = np.sum(abs_dispersion)/len(abs_dispersion)
print('Mean absolute deviation of X:', MAD)

Mean absolute deviation of X: 24.204999999999995
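
The same quantity can be computed without a Python loop. The sketch below hardcodes the sample values printed earlier so that it runs on its own, and also checks the claim that signed deviations cancel out. The names `deviations` and `mad_vectorized` are introduced here for illustration.

```python
import numpy as np

# Values copied from the sorted sample X printed earlier
X = np.array([4, 9, 9, 9, 11, 28, 30, 31, 35, 35,
              37, 37, 40, 53, 62, 73, 74, 91, 94, 95])
mu = np.mean(X)

# Signed deviations from the mean cancel out (up to floating-point rounding)
deviations = X - mu
print('Sum of signed deviations:', np.sum(deviations))

# Taking absolute values first gives the mean absolute deviation
mad_vectorized = np.mean(np.abs(deviations))
print('Mean absolute deviation (vectorized):', mad_vectorized)
```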


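# Variance and standard deviation
Variance is the average of the squared deviations from the mean, and standard deviation is the square root of the variance. Squaring weights large deviations more heavily than the mean absolute deviation does. Note that `np.var` and `np.std` divide by the number of observations by default (`ddof=0`), which is the population form of these statistics.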

In [4]:
print('Variance of X:', np.var(X))
print('Standard deviation of X:', np.std(X))

Variance of X: 834.5274999999999
Standard deviation of X: 28.888189628289272
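
To make the definition concrete, the following sketch recomputes the variance by hand and compares it with NumPy. It also shows the `ddof=1` variant, which divides by n - 1 instead of n. The names `var_manual`, `std_manual` and `var_sample` are introduced here for illustration.

```python
import numpy as np

# Values copied from the sorted sample X printed earlier
X = np.array([4, 9, 9, 9, 11, 28, 30, 31, 35, 35,
              37, 37, 40, 53, 62, 73, 74, 91, 94, 95])
mu = np.mean(X)
n = len(X)

# Variance from the definition: mean of squared deviations (divides by n)
var_manual = np.sum((X - mu) ** 2) / n
std_manual = np.sqrt(var_manual)
print('Variance by hand:', var_manual)
print('Standard deviation by hand:', std_manual)

# ddof=1 divides by n - 1, the usual unbiased estimator of variance
var_sample = np.var(X, ddof=1)
print('Variance with ddof=1:', var_sample)
```

With only 20 observations the two versions differ noticeably. The gap shrinks as the sample grows.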


One way to interpret standard deviation is by referring to Chebyshev's inequality. This tells us that the proportion of samples within $k$ standard deviations of the mean (that is, within a distance of $k$ times the standard deviation) is at least $1 - 1/k^2$ for all $k>1$.

In [5]:
k = 1.25
dist = k*np.std(X)
# Collect the observations that lie within k standard deviations of the mean
within = [x for x in X if abs(x - mu) <= dist]
print('Observations within', k, 'stds of mean:', within)
print('Confirming that', len(within)/len(X), '>', 1 - 1/k**2)

Observations within 1.25 stds of mean: [9, 9, 9, 11, 28, 30, 31, 35, 35, 37, 37, 40, 53, 62, 73, 74]
Confirming that 0.8 > 0.36


The bound given by Chebyshev's inequality seems fairly loose in this case. The bound is rarely tight, but it is useful because it holds for all data sets and distributions.
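
To see how the observed proportion compares with the bound for more than one value of $k$, here is a small sketch. The list of $k$ values is an arbitrary choice, and the `results` dictionary is a name introduced here for illustration.

```python
import numpy as np

# Values copied from the sorted sample X printed earlier
X = np.array([4, 9, 9, 9, 11, 28, 30, 31, 35, 35,
              37, 37, 40, 53, 62, 73, 74, 91, 94, 95])
mu = np.mean(X)
sigma = np.std(X)

results = {}
for k in [1.25, 1.5, 2.0, 3.0]:  # arbitrary choices, all greater than 1
    # Fraction of observations within k standard deviations of the mean
    observed = np.mean(np.abs(X - mu) <= k * sigma)
    # Chebyshev's lower bound on that fraction
    bound = 1 - 1 / k**2
    results[k] = (observed, bound)
    print(f'k = {k}: observed {observed:.2f}, Chebyshev bound {bound:.2f}')
```

In every row the observed fraction should be at least as large as the bound, since the inequality holds for any data set.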

# Semivariance and semideviation
Although variance and standard deviation tell us how volatile a quantity is, they do not differentiate between deviations upward and deviations downward. Often, such as in the case of returns on an asset, we are more worried about deviations downward. This is addressed by semivariance and semideviation, which only count the observations that fall below the mean. Semivariance is defined as

$$ \frac{\sum_{X_i < \mu} (X_i - \mu)^2}{n_<} $$

where $n_<$ is the number of observations which are smaller than the mean. Semideviation is the square root of the semivariance.

In [6]:
# Keep only the observations strictly below the mean, as in the definition
lows = X[X < mu]

semivar = np.sum((lows - mu) ** 2) / len(lows)

print('Semivariance of X:', semivar)
print('Semideviation of X:', np.sqrt(semivar))

Semivariance of X: 514.3917307692309
Semideviation of X: 22.680205703856192


A related notion is target semivariance (and target semideviation), where we average the squared distance from a target $B$ over the values that fall below that target:

$$ \frac{\sum_{X_i < B} (X_i - B)^2}{n_{<B}} $$

In [7]:
B = 19
# Keep only the observations strictly below the target B
lows_B = X[X < B]
semivar_B = np.mean((lows_B - B) ** 2)

print('Target semivariance of X:', semivar_B)
print('Target semideviation of X:', np.sqrt(semivar_B))

Target semivariance of X: 117.8
Target semideviation of X: 10.853570840972107
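
Since semivariance is just target semivariance with the mean as the target, both can be wrapped in one helper. The function below is a sketch, not a library routine: the name `target_semivariance` is made up for this example, and returning NaN when no observation falls below the target is an assumption about how to handle that edge case.

```python
import numpy as np

def target_semivariance(values, target):
    """Mean squared shortfall of the observations strictly below `target`."""
    values = np.asarray(values, dtype=float)
    below = values[values < target]
    if below.size == 0:
        # Assumption: treat the statistic as undefined when nothing is below target
        return float('nan')
    return np.mean((below - target) ** 2)

# Values copied from the sorted sample X printed earlier
X = np.array([4, 9, 9, 9, 11, 28, 30, 31, 35, 35,
              37, 37, 40, 53, 62, 73, 74, 91, 94, 95])

semivar_mean = target_semivariance(X, np.mean(X))  # ordinary semivariance
semivar_19 = target_semivariance(X, 19)            # target semivariance, B = 19
print('Semivariance of X:', semivar_mean)
print('Target semivariance of X (B = 19):', semivar_19)
print('Target semideviation of X (B = 19):', np.sqrt(semivar_19))
```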


All of these computations give you sample statistics, that is, the standard deviation of a particular sample of data. Whether a sample statistic reflects the current true population standard deviation is not always obvious, and determining that takes additional work. This is especially problematic in finance because most data are time series, and the mean and variance may change over time.
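
The sketch below illustrates that last point with synthetic data. The return series is invented (a calm first half followed by a more volatile second half), and the window length of 50 is an arbitrary choice. A single standard deviation over the whole series hides the change that a rolling estimate reveals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily returns: volatility triples halfway through the series
returns = np.concatenate([rng.normal(0, 0.01, 250),
                          rng.normal(0, 0.03, 250)])

# One number for the whole sample
full_std = np.std(returns)

# Rolling standard deviation over a trailing window (window length is arbitrary)
window = 50
rolling_std = np.array([np.std(returns[i - window:i])
                        for i in range(window, len(returns) + 1)])

print('Standard deviation of full sample:', full_std)
print('Rolling std, first window:', rolling_std[0])
print('Rolling std, last window:', rolling_std[-1])
```

The full-sample figure falls between the two regimes and describes neither of them well, which is the risk of treating a sample statistic as a fixed property of the series.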