# Statistics Tutorial - Lesson 3
## Standard Deviation and Variance

Standard deviation and variance, along with the interquartile range, are measures of [statistical dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion), that is, the variability or spread of the data.

## Standard Deviation

[Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) is calculated as the square root of the average squared difference between each value and the data's mean, i.e.
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^{2}}$$

A low standard deviation indicates that the data tend to be close to the mean of the set, while a high standard deviation indicates that the data are spread out over a wider range. 
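To make the contrast concrete, here is a minimal sketch using two made-up data sets (`tight` and `wide` are hypothetical names and values, not part of the tutorial's data). Both have the same mean, but their values sit at very different distances from it.

```python
from statistics import mean, pstdev

# Two hypothetical data sets with the same mean but different spread
tight = [98, 99, 100, 101, 102]
wide = [60, 80, 100, 120, 140]

for name, data in (("tight", tight), ("wide", wide)):
    print('{}: mean = {:.2f}, standard deviation = {:.2f}'.format(
        name, mean(data), pstdev(data)))
```

Both means are 100, while the standard deviations are about 1.41 and 28.28, so the second set is spread much further from its mean.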

In [1]:
# Example 1
import math

def get_mean(given_list):
    """
    Function for calculating arithmetic mean
    """
    return sum(given_list)/len(given_list)

def get_sd(given_list):
    """
    Function for calculating standard deviation
    """
    mean = get_mean(given_list)
    return math.sqrt(get_mean([math.pow(x - mean, 2) for x in given_list]))

stock_prices = [505, 492, 509, 522, 538, 528, 527]
sd = get_sd(stock_prices)
print('Standard Deviation is {:.2f}'.format(sd))

Standard Deviation is 14.73


In [2]:
# Example 2
# by built-in library
from statistics import pstdev
sd = pstdev(stock_prices)
print('Standard Deviation is {:.2f}'.format(sd))

Standard Deviation is 14.73


In [3]:
# Example 3
# by NumPy
import numpy as np
stock_price_array = np.array(stock_prices)
sd = np.std(stock_price_array)
print('Standard Deviation is {:.2f}'.format(sd))

Standard Deviation is 14.73


## Variance

[Variance](https://en.wikipedia.org/wiki/Variance) is calculated as the average squared difference between each value and the data's mean, i.e.
$$Var(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^{2}$$

The formula shows that variance is simply the square of the standard deviation. In practice, standard deviation is reported more often than variance because it is expressed in the same unit as the data, whereas variance is expressed in that unit squared.
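As a quick check of both points, the sketch below squares the population standard deviation and compares it with the population variance, then rescales the data to see how each measure responds. The dollars-to-cents conversion is an assumption made purely for illustration; the tutorial does not state the unit of `stock_prices`.

```python
from statistics import pstdev, pvariance

stock_prices = [505, 492, 509, 522, 538, 528, 527]

# Squaring the population standard deviation recovers the population variance
sd_dollars = pstdev(stock_prices)
var_dollars = pvariance(stock_prices)
print('sd squared = {:.2f}, variance = {:.2f}'.format(sd_dollars**2, var_dollars))

# Assumption for illustration: the prices are in dollars; convert them to cents
prices_in_cents = [100 * p for p in stock_prices]
sd_ratio = pstdev(prices_in_cents) / sd_dollars
var_ratio = pvariance(prices_in_cents) / var_dollars
print('sd scales by {:.0f}, variance scales by {:.0f}'.format(sd_ratio, var_ratio))
```

Multiplying the data by 100 multiplies the standard deviation by 100 but the variance by 100², which is why the standard deviation stays in the data's unit and the variance does not.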

In [4]:
# Example 1
def get_var(given_list):
    """
    Function for calculating variance
    """
    mean = get_mean(given_list)
    return get_mean([math.pow(x - mean, 2) for x in given_list])

var = get_var(stock_prices)
print('Variance is {:.2f}'.format(var))

Variance is 217.06


In [5]:
# Example 2
# by built-in library
from statistics import pvariance
var = pvariance(stock_prices)
print('Variance is {:.2f}'.format(var))

Variance is 217.06


In [6]:
# Example 3
# by NumPy
stock_price_array = np.array(stock_prices)
var = np.var(stock_price_array)
print('Variance is {:.2f}'.format(var))

Variance is 217.06


An advantage of variance as a measure of dispersion is that it is more amenable to algebraic manipulation than other measures of dispersion. Two examples follow.

Example 1: The variance of X is equal to the mean of the square of X minus the square of the mean of X, i.e.
    
$$Var(X) = E[(X - E[X])^{2}] = E[X^{2}] - E[X]^{2}$$    
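
The identity follows from expanding the square and using the linearity of expectation. Writing $\mu = E[X]$ (a shorthand introduced here for readability), a short derivation is:

$$\begin{aligned}
E[(X - \mu)^{2}] &= E[X^{2} - 2\mu X + \mu^{2}] \\
&= E[X^{2}] - 2\mu E[X] + \mu^{2} \\
&= E[X^{2}] - 2\mu^{2} + \mu^{2} \\
&= E[X^{2}] - E[X]^{2}
\end{aligned}$$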

In [7]:
# Verify Example 1 numerically with the statistics module
from statistics import mean, pvariance

def get_var_method2(given_list):
    """
    Function for calculating variance (Method 2)
    """
    return mean([x**2 for x in given_list]) - mean(given_list)**2
var_1 = pvariance(stock_prices)
var_2 = get_var_method2(stock_prices)
print('Difference between two values is {:.2f}'.format(var_1 - var_2))

Difference between two values is 0.00


Example 2: The variance of a set of n values can be equivalently expressed, without directly referring to the mean, in terms of squared deviations of all points from each other:
$$Var(X) = \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j>i}(x_i - x_j)^{2}$$

In [8]:
# Verify Example 2 numerically (var_1 comes from the previous cell)
from statistics import mean, pvariance

def get_var_method3(given_list):
    """
    Function for calculating variance (Method 3)
    """
    n = len(given_list)
    temp_sum = 0
    for i in range(0, n):
        for j in range(i+1, n):
            temp_sum += (given_list[i] - given_list[j])**2
    return temp_sum / n**2

var_3 = get_var_method3(stock_prices)
print('Difference between two values is {:.2f}'.format(var_1 - var_3))

Difference between two values is 0.00


## Bessel's correction

[Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction) is the use of (n − 1) instead of n in the formula for the sample variance and sample standard deviation. It corrects the bias that arises when the population variance is estimated from a finite sample, using the sample mean in place of the true mean.

Therefore, the unbiased sample variance is calculated as
$$s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})^{2}$$

Equivalently, multiplying the biased sample variance (computed with the population formula) by the factor n/(n-1) gives the unbiased sample variance, i.e.
$$s_u^{2} = (\frac{n}{n-1})s_b^{2}$$
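
One way to see the bias is to treat the seven stock prices as a complete population and average each estimator over every possible sample of a fixed size drawn with replacement. This is only a sketch under those assumptions: the sample size `n = 3` and the exhaustive enumeration are illustrative choices, not part of the tutorial's method.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [505, 492, 509, 522, 538, 528, 527]
n = 3  # illustrative sample size (an assumption)

# Every ordered sample of size n drawn with replacement (7**3 = 343 samples)
samples = list(product(population, repeat=n))

pop_var = pvariance(population)
avg_biased = mean(pvariance(s) for s in samples)    # divides by n
avg_unbiased = mean(variance(s) for s in samples)   # divides by n - 1

print('Population variance:      {:.2f}'.format(pop_var))
print('Average biased estimate:  {:.2f}'.format(avg_biased))
print('Average unbiased estimate: {:.2f}'.format(avg_unbiased))
```

Averaged over all samples, the estimator that divides by n comes out at (n-1)/n times the population variance, while the estimator that divides by n-1 matches the population variance.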

In [9]:
# Example 1
import math

def get_sample_var(given_list):
    """
    Function for calculating sample variance
    """
    mean = get_mean(given_list)
    n = len(given_list)
    return sum([math.pow(x - mean, 2) for x in given_list]) / (n-1)

sample_var = get_sample_var(stock_prices)
sample_sd = math.sqrt(sample_var)
print('Sample variance is {:.2f}'.format(sample_var))
print('Sample standard deviation is {:.2f}'.format(sample_sd))

Sample variance is 253.24
Sample standard deviation is 15.91


In [10]:
# Example 2
# by built-in library
from statistics import variance, stdev
sample_var = variance(stock_prices)
sample_sd = stdev(stock_prices)
print('Sample variance is {:.2f}'.format(sample_var))
print('Sample standard deviation is {:.2f}'.format(sample_sd))

Sample variance is 253.24
Sample standard deviation is 15.91


In [11]:
# Example 3
# by NumPy
sample_var = np.var(stock_price_array, ddof=1)
sample_sd = np.std(stock_price_array, ddof=1)
print('Sample variance is {:.2f}'.format(sample_var))
print('Sample standard deviation is {:.2f}'.format(sample_sd))

Sample variance is 253.24
Sample standard deviation is 15.91
