Measures of variability 

In Six Sigma, variability in a process is at the root of many defects.

Mean, median and mode are important summary metrics. However, they don't tell us anything about variability in the data. For example, the two samples below have an identical mean and median although their spread (variability) is different:

In [1]:
import numpy as np
from scipy import stats

In [2]:
Sample_A = [4,4,4,4]
Sample_B = [0,8,8,0]

mean_A = np.mean(Sample_A)
mean_B = np.mean(Sample_B)

median_A = np.median(Sample_A)
median_B = np.median(Sample_B)

In [3]:
print('The mean of sample A is', mean_A)
print('The mean of sample B is', mean_B)
print('The median of sample A is', median_A)
print('The median of sample B is', median_B)

The mean of sample A is 4.0
The mean of sample B is 4.0
The median of sample A is 4.0
The median of sample B is 4.0


Summary statistics also need to capture variability. This notebook covers the following measures:
    - 1) Range
    - 2) Mean absolute deviation (including 2a, the mean deviation)
    - 3) Variance
    - 4) Standard deviation: the long way
    - 5) Standard deviation: the short way
    and, for completeness,
    - 6) Limitations of these measures for different types of data

In [4]:
#1) Range

Z = [1,1,1,1,1,1,1,1,1,21]

#The range is simply max - min. Its limitation is that only the two extreme values are considered,
#which makes it sensitive to outliers. In this example the range is 20, although we would expect
#a value closer to zero given how often 1 appears.
value_range = max(Z) - min(Z)  #named value_range to avoid shadowing the built-in range()
value_range

20
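To make the outlier sensitivity concrete, here is a minimal sketch that recomputes the range with and without the single extreme value. It uses `np.ptp` ("peak to peak"), which returns max - min; the name `Z_without_outlier` is my own and is only there for illustration.

```python
import numpy as np

Z = [1, 1, 1, 1, 1, 1, 1, 1, 1, 21]

# np.ptp returns max - min, the same as the manual calculation above.
range_with_outlier = np.ptp(Z)

# Dropping the single extreme value (21) leaves only 1s, so the range collapses.
Z_without_outlier = [value for value in Z if value != 21]
range_without_outlier = np.ptp(Z_without_outlier)

print('Range with the outlier:', range_with_outlier)
print('Range without the outlier:', range_without_outlier)
```

One value out of ten is responsible for the entire range, which is why the range on its own is a fragile summary of spread.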

In [5]:
#2a) Mean deviation... doesn't really work...
#...because the average signed distance from the mean is always zero.

#To calculate an average deviation we need a reference point. The mean is often used for this.
mean_array = np.mean(Z)
print('The mean of the array is', mean_array)

The mean of the array is 3.0


In this example 9 values are below the mean and 1 is above.
If we sum the signed distances from the mean, they cancel out and the total (and therefore the average) is zero.


In [6]:
((1-3) * 9) + (21-3)

0
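The cancellation is not specific to this array: signed distances from the mean sum to zero for any data. A small sketch, reusing Z and Sample_B from earlier in the notebook (the helper name `signed_deviation_sum` is hypothetical):

```python
import numpy as np

def signed_deviation_sum(values):
    """Sum of the signed distances between each value and the mean."""
    values = np.asarray(values, dtype=float)
    return np.sum(values - values.mean())

Z = [1, 1, 1, 1, 1, 1, 1, 1, 1, 21]
Sample_B = [0, 8, 8, 0]

sum_Z = signed_deviation_sum(Z)
sum_B = signed_deviation_sum(Sample_B)

print('Sum of signed deviations for Z:', sum_Z)
print('Sum of signed deviations for Sample_B:', sum_B)
```

With other data the result may differ from zero by a tiny floating point error, so compare with `np.isclose` rather than `==`.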

2) The Mean Absolute Deviation

To solve the problem of summing to zero, we can take the absolute value of each distance, and then sum up the 
absolute values. The absolute value (also called modulus) of a number is the positive version of 
that number, regardless of its sign. Below I calculate the mean absolute distance (deviation). It is larger than 1, but much less than 20 as expected.

In [7]:
def mean_absolute_deviation(array):
    reference_point = np.mean(array)
    
    distances = []
    for value in array:
        absolute_distance = abs(value - reference_point)
        distances.append(absolute_distance)
        
    return np.mean(distances) # this is sum(distances) / len(distances); we are dividing by n.

mad = mean_absolute_deviation(Z)
mad

3.6
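The loop above spells out each step. As a cross-check, the same calculation can be sketched in vectorised numpy; the function name `mean_absolute_deviation_vectorised` is my own and is not a library function.

```python
import numpy as np

def mean_absolute_deviation_vectorised(values):
    """Mean of the absolute distances from the mean (dividing by n)."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()))

Z = [1, 1, 1, 1, 1, 1, 1, 1, 1, 21]
mad_vectorised = mean_absolute_deviation_vectorised(Z)
print('Mean absolute deviation:', mad_vectorised)
```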

3) The Variance

An alternative solution is to square each distance and then find the mean of all the squared distances. This measure of variability is sometimes called mean squared distance or mean squared deviation. However, it's more commonly known as variance.

In [8]:
def variance(array):
    reference_point = np.mean(array)
    
    distances = []
    for value in array:
        squared_distance = (value - reference_point)**2
        distances.append(squared_distance)
        
    return np.mean(distances) # this is basically sum(distances) / len(distances). We are dividing by n.

variance_Z = variance(Z)
variance_Z

36.0
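As a sanity check, numpy's built-in `np.var` should agree with the hand-written function, because it also divides by n by default. A minimal sketch:

```python
import numpy as np

Z = [1, 1, 1, 1, 1, 1, 1, 1, 1, 21]

# Mean of the squared distances from the mean, written in one line.
variance_manual = np.mean((np.asarray(Z, dtype=float) - np.mean(Z)) ** 2)

# numpy's built-in version (divides by n unless told otherwise).
variance_numpy = np.var(Z)

print('Manual variance:', variance_manual)
print('numpy variance:', variance_numpy)
```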

The variance (36) is larger than the range (20). This is the result of squaring each distance from the mean: the variance is expressed in squared units, so it is not directly comparable to the original values. To get back to the original units, we can take the square root of the variance.

4) Standard Deviation

The square root of variance is called standard deviation (remember that "deviation" is synonymous with "distance"). In practice, standard deviation is perhaps the most used measure of variability. It is simply a measure of the spread in a distribution.

In [9]:
from math import sqrt

def standard_deviation(array):
    reference_point = np.mean(array)
    
    distances = []
    for value in array:
        squared_distance = (value - reference_point)**2
        distances.append(squared_distance)
        
    variance = sum(distances) / len(distances) # note we are dividing by n
    #variance = np.mean(distances)
                
    return sqrt(variance)

standard_deviation_Z = standard_deviation(Z)
standard_deviation_Z


6.0

In [10]:
# 5) Standard Deviation the short way

#Standard deviation functions can be found in several Python packages. However, we need to consider whether
#each one divides by n or by n-1 by default.
#Let's do this the short way.

#Option 1: numpy

standard_deviation_Z_shortway = np.std(Z)
standard_deviation_Z_shortway

6.0

In [11]:
#Applying option 1 to Samples A and B at the top of the file.

standard_deviation_Sample_A = np.std(Sample_A)
print('The standard deviation of sample A is', standard_deviation_Sample_A)

standard_deviation_Sample_B = np.std(Sample_B)
print('The standard deviation of sample B is', standard_deviation_Sample_B)
#We are seeing higher SD for sample B as expected.

The standard deviation of sample A is 0.0
The standard deviation of sample B is 4.0


In [12]:
#Option 2: pandas

import pandas as pd

Z = [1,1,1,1,1,1,1,1,1,21]

standard_deviation_Z_pandas = pd.Series(Z).std()
standard_deviation_Z_pandas


6.324555320336759

So using pandas we get 6.32 rather than 6. What is going on?

In practice, when calculating standard deviation (and statistics more generally) we are working with samples. Most of the time we're not actually interested in describing the samples themselves. Rather, we want to use the samples to make inferences about their corresponding populations. If we calculate many sample standard deviations (dividing by n) and compare them against the population standard deviation for a dataset (not done here), we notice that most sample standard deviations are clustered below the population standard deviation. Put simply, the sample standard deviation usually underestimates the population standard deviation.

To correct the underestimation, we can modify the sample standard deviation formula to return slightly higher values by dividing by n-1 rather than n, where n is the number of observations in the sample. This is called Bessel's correction. It removes the bias in the estimate of the population variance; the resulting standard deviation is still slightly biased, but much less so.
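The comparison described above is not run in this notebook, so here is a rough simulation sketch of it. Everything about the setup is an assumption made for illustration: the population (10,000 uniform values between 0 and 100), the sample size (10), the number of samples (5,000) and the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

# Hypothetical population: 10,000 values drawn uniformly between 0 and 100.
population = rng.uniform(0, 100, size=10_000)
population_sd = np.std(population)  # whole population, so divide by N

# Draw many small samples (with replacement) and compute each sample's SD twice.
n_samples, sample_size = 5_000, 10
samples = rng.choice(population, size=(n_samples, sample_size))
sd_n = np.std(samples, axis=1, ddof=0)          # divide by n
sd_n_minus_1 = np.std(samples, axis=1, ddof=1)  # divide by n-1

# Share of uncorrected sample SDs that fall below the population SD.
share_below = np.mean(sd_n < population_sd)

print('Population SD:', population_sd)
print('Average sample SD dividing by n:', sd_n.mean())
print('Average sample SD dividing by n-1:', sd_n_minus_1.mean())
print('Share of n-based sample SDs below the population SD:', share_below)
```

Under these assumptions, most of the n-based sample standard deviations land below the population value, and the n-1 version is closer on average.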

Using n-1 is also the default in scipy's tstd function (option 3 below).
Put simply, pandas and scipy.stats.tstd apply Bessel's correction by default; numpy does not.

It is important to remember that Excel's STDEV and STDEV.S functions also use n-1 (i.e. Bessel's correction); STDEV.P divides by n.

Remember that when working with a whole population we continue to divide by N (rather than n-1).
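Both numpy and pandas expose the choice of denominator through a `ddof` ("delta degrees of freedom") argument: the divisor is n - ddof. A short sketch showing how to make the two libraries agree in either direction (the variable names are my own):

```python
import numpy as np
import pandas as pd

Z = [1, 1, 1, 1, 1, 1, 1, 1, 1, 21]

# numpy: ddof=0 by default (divide by n); pass ddof=1 for Bessel's correction.
sd_numpy_default = np.std(Z)
sd_numpy_bessel = np.std(Z, ddof=1)

# pandas: ddof=1 by default (divide by n-1); pass ddof=0 to divide by n.
sd_pandas_default = pd.Series(Z).std()
sd_pandas_population = pd.Series(Z).std(ddof=0)

print('numpy default (n):', sd_numpy_default)
print('numpy with ddof=1 (n-1):', sd_numpy_bessel)
print('pandas default (n-1):', sd_pandas_default)
print('pandas with ddof=0 (n):', sd_pandas_population)
```

Stating `ddof` explicitly in every call is a cheap way to avoid silent disagreements when results are moved between tools.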


In [13]:
#Option 3: Scipy

from scipy import stats

#ddof=1 (divide by n-1) is the default for tstd; it is spelled out here for clarity.
stats.tstd(Z, ddof=1)

6.324555320336759

In [14]:
#Cross-check against the Six Sigma book (pp. 65-68)

measurements = [2,3.5,2.3,2,2.5,3.1,2.2,3.2,4]  #named measurements to avoid shadowing the built-in list
np.std(measurements)

0.6784150009988332

In [15]:
pd.Series([2,3.5,2.3,2,2.5,3.1,2.2,3.2,4]).std()

0.7195677714974301

Important! Excel's default is n-1 (i.e. it includes Bessel's correction): we divide by a smaller denominator and get a larger standard deviation.

6) Limitations

The variability metrics discussed here are useful for distributions whose values are measured on an interval or ratio scale. Measuring variability for ordinal and nominal data is much harder because we can't quantify the differences between values. For this reason, little is written in the literature about measuring variability for ordinal and nominal data. 