# Mean, median, mode, variance and standard deviation

In [16]:
# Import packages
import numpy as np
import pandas as pd
from scipy import stats

## Mean

The mean, or the average, measures the centre of a dataset. Find the average by adding all of the observations together and dividing by the total number of observations.

In [23]:
# Mean in NumPy

example_array = np.array([24, 16, 30, 10, 12, 28, 38, 2, 4, 36])
example_average = np.average(example_array)
print(example_average)

20.0
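The same result can be reproduced by hand, following the definition above. A minimal sketch in plain Python; the names `observations` and `manual_mean` are illustrative and not part of the original notebook:

```python
import numpy as np

observations = [24, 16, 30, 10, 12, 28, 38, 2, 4, 36]

# Add all observations together, then divide by how many there are
manual_mean = sum(observations) / len(observations)

print(manual_mean)  # 20.0
print(np.isclose(manual_mean, np.mean(observations)))  # True
```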


## Median

The median is the middle value in a dataset, assuming the values are ordered from smallest to largest.

Find the median by ordering the values in the dataset from smallest to largest and finding the middle number. If the dataset has an even number of values, the median is the average of the two middle numbers.

In [24]:
# Median in NumPy

example_array = np.array([24, 16, 30, 10, 12, 28, 38, 2, 4, 36, 42])
example_median = np.median(example_array)
print(example_median)

24.0
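The array above has an odd number of values, so it has a single middle value. For an even number of values, the two middle values are averaged. A minimal sketch of that case, assuming a 10-value array (the names `even_array` and `manual_median` are illustrative):

```python
import numpy as np

even_array = np.array([24, 16, 30, 10, 12, 28, 38, 2, 4, 36])  # 10 values

sorted_values = np.sort(even_array)
middle = len(sorted_values) // 2

# With an even count there is no single middle value,
# so average the two values on either side of the centre
manual_median = (sorted_values[middle - 1] + sorted_values[middle]) / 2

print(manual_median)          # 20.0
print(np.median(even_array))  # 20.0
```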


## Mode

The mode is the most frequently occurring observation in the dataset. A dataset can have multiple modes if there is more than one value with the same maximum frequency.

The mode may not always be a good measure of where the data is centred. It only reports the most frequent observation in the dataset; it does not indicate the tallest bin in a histogram, because a bin groups a range of values together.

In [25]:
# Mode in SciPy

example_array = np.array([15,8,9,15,12,13,2,15,13,8,13,6,7])
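# keepdims=True keeps the result as length-1 arrays so the indexing below works
# (assumes SciPy 1.9 or later, where the keepdims argument is available)
example_mode = stats.mode(example_array, keepdims=True)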

print(example_mode) # Returns an object containing mode value and its count

# Accessing the individual values
print('Mode value is: '+ str(example_mode[0][0]))
print('Mode value count is: '+ str(example_mode[1][0]))

ModeResult(mode=array([13]), count=array([3]))
Mode value is: 13
Mode value count is: 3
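`stats.mode` reports a single value, even when several values tie for the highest count, as 13 and 15 do in this array. A minimal sketch for listing every mode, using `np.unique` (the name `all_modes` is illustrative):

```python
import numpy as np

example_array = np.array([15, 8, 9, 15, 12, 13, 2, 15, 13, 8, 13, 6, 7])

# Count how often each distinct value occurs
values, counts = np.unique(example_array, return_counts=True)

# Keep every value whose count equals the highest count
all_modes = values[counts == counts.max()]

print(all_modes)     # [13 15]
print(counts.max())  # 3
```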




## Variance

Finding the mean, mode and median of a dataset is a good way to start understanding the shape of your data. 

In [None]:
dataset_one = [-4, -2, 0, 2, 4]
dataset_two = [-400, -200, 0, 200, 400]

However, in the datasets above, the mean and median both equal zero. Reporting only those statistics would hide the fact that the second dataset is far more spread out than the first.

Variance is a descriptive statistic that describes how spread out the points in a dataset are. It is calculated from the difference between every datapoint in the set and the mean. If the data is close together, each datapoint will tend to be close to the mean and the differences will be small. If the data is spread out, the differences between the datapoints and the mean will be larger.

The squared difference for a single datapoint can be written as (X−μ)², where X is a datapoint and μ is the mean.

To work out the variance, the squared differences are averaged: add up the squared difference for every datapoint and divide by the number of datapoints.

The result of X−μ is squared so that negative differences do not cancel out positive ones. If the differences weren't squared, they would sum to 0 for both datasets above and the variance would be 0. Squared, the sum of differences for dataset_one is 40 and the variance is 8; for dataset_two the sum is 400,000 and the variance is 80,000.
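The cancelling effect can be checked directly on the two datasets. A minimal sketch; the helper `spread_summary` is a hypothetical name introduced here for illustration:

```python
import numpy as np

dataset_one = np.array([-4, -2, 0, 2, 4])
dataset_two = np.array([-400, -200, 0, 200, 400])

def spread_summary(data):
    """Return (sum of differences, sum of squared differences, variance)."""
    differences = data - data.mean()   # X − μ for every datapoint
    squared = differences ** 2         # squaring removes the negative signs
    return (
        float(differences.sum()),      # positives and negatives cancel out
        float(squared.sum()),
        float(squared.sum() / len(data)),
    )

print(spread_summary(dataset_one))  # (0.0, 40.0, 8.0)
print(spread_summary(dataset_two))  # (0.0, 400000.0, 80000.0)
```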

<img src="Assets/Variance equation.png">

Sigma squared σ² represents variance. ∑ represents 'the sum of'.

For each datapoint in the dataset, from point 1 to point N (the last datapoint), find the difference between that point and the mean. Square each difference to make them all positive. Average the squared differences by adding them together and dividing by N, the total number of points in the dataset.
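Written out in LaTeX, these steps correspond to the population variance formula. The subscript i, indexing the datapoints from 1 to N, is notation assumed here rather than taken from the text:

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
$$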

In [27]:
# Variance in NumPy

dataset = [3, 5, -2, 49, 10]
variance = np.var(dataset)
variance

338.8
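`np.var` divides by N by default, which matches the formula above and is often called the population variance. When the data is a sample from a larger population, a common alternative is to divide by N − 1 instead; in NumPy this is done with the `ddof` argument. A brief sketch (the variable names are illustrative):

```python
import numpy as np

dataset = [3, 5, -2, 49, 10]

population_variance = np.var(dataset)      # divides by N (ddof=0, the default)
sample_variance = np.var(dataset, ddof=1)  # divides by N - 1

print(population_variance)  # 338.8
print(sample_variance)      # 423.5
```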

## Standard deviation

Variance is a difficult statistic to interpret because the formula squares the difference between the data and the mean. As a result, the variance is measured in squared units rather than the units of the mean and the data itself, making it hard to interpret in context.

Standard deviation is the square root of the variance, which returns the measure of spread to the same units as the data.

<img src="Assets/standard-deviation.png">

Sigma σ represents standard deviation.
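The relationship between the two statistics can be confirmed numerically. A minimal sketch that takes the square root of `np.var` and compares it with `np.std`:

```python
import numpy as np

dataset = [4, 8, 15, 16, 23, 42]

variance = np.var(dataset)
standard_deviation = np.sqrt(variance)  # square root returns to the data's units

print(np.isclose(standard_deviation, np.std(dataset)))  # True
```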

In [28]:
# Standard deviation in NumPy

dataset = [4, 8, 15, 16, 23, 42]
standard_deviation = np.std(dataset)
standard_deviation

12.315302134607444

By finding the number of standard deviations a datapoint is away from the mean, we can begin to see how unusual it is. For data that is approximately normally distributed, about 68% of the data will be within 1 standard deviation of the mean, about 95% will fall within 2 standard deviations and about 99.7% will fall within 3.
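These percentages can be checked empirically on simulated data. A rough sketch, assuming normally distributed samples; the mean of 50, standard deviation of 10, sample size and seed are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)                        # fixed seed, arbitrary choice
samples = rng.normal(loc=50, scale=10, size=100_000)  # hypothetical normal data

mean = samples.mean()
std = samples.std()

# Fraction of samples within k standard deviations of the mean
fractions = {
    k: float(np.mean(np.abs(samples - mean) <= k * std))
    for k in (1, 2, 3)
}

for k, fraction in fractions.items():
    print(f"within {k} standard deviation(s): {fraction:.3f}")
```

Data that is skewed or heavy-tailed will generally not follow these proportions.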



In [29]:
# Using standard deviation

dataset_mean = np.average(dataset)

# Calculate difference between datapoint and mean

difference = 42 - dataset_mean

# Use the difference between the point and the mean to find how many standard deviations it is from the mean
num_deviations = difference / standard_deviation

num_deviations

1.9487950630587603
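The same calculation can be applied to every datapoint at once. A sketch using `scipy.stats.zscore`, which returns the number of standard deviations each value lies from the mean (the name `z_scores` is illustrative):

```python
import numpy as np
from scipy import stats

dataset = [4, 8, 15, 16, 23, 42]

# stats.zscore subtracts the mean and divides by the standard deviation
# (population standard deviation by default, matching np.std)
z_scores = stats.zscore(dataset)

print(np.round(z_scores, 2))
```

The last entry corresponds to the value 42 and matches the result computed by hand above.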