In [1]:
import numpy as np

***
***
# Mean and Variance!

## *Mean* - measures the average of a set of numbers
# $\bar x = \dfrac{\sum^n_{i=1}x_i}{n}$
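- aka "average", "arithmetic mean"
- **interpretation** - used for finding the *central tendency* of a roughly normally distributed (Gaussian) data set
- **use case**: most reliable for roughly symmetric distributions without extreme outliers (e.g. Gaussian); on skewed data a few extreme values can pull the mean away from the typical value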
- data types: interval, ratio
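
A quick sketch of why the mean is less dependable away from a roughly symmetric distribution: a single extreme value shifts it noticeably. The data and the names `symmetric` and `with_outlier` are made up for illustration, and the median appears only as a point of comparison.

```python
import numpy as np

# hypothetical data: a roughly symmetric set, then the same set plus one extreme value
symmetric = [4, 5, 5, 6, 5, 4, 6]
with_outlier = symmetric + [100]

mean_sym = np.mean(symmetric)          # mean of the symmetric set
mean_out = np.mean(with_outlier)       # mean after adding the outlier
median_out = np.median(with_outlier)   # median after adding the outlier

print("Mean (symmetric):", mean_sym)
print("Mean (with outlier):", mean_out)
print("Median (with outlier):", median_out)
```

The mean moves well away from the bulk of the values once the outlier is included, while the median stays with them.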

In [7]:
# example 1 - calculating mean

# data set
x = [1, 6, 7, 6, -2, -14, 0]
n = len(x) # length of list == total count of numbers in set

# finding mean with two different methods

# using built in function
avg1 = np.mean(x)
print("Average with numpy function:", avg1)

# manual
avg2 = np.sum(x)/n
print("Average with manual function:", avg2)

# these should have the same result

Average with numpy function: 0.5714285714285714
Average with manual function: 0.5714285714285714


***
***
## *Variance* - measures dispersion of data points around mean or average
# $s^2 =\dfrac{\sum^n_{i=1}(x_i-\bar x)^2}{n-1}$
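- **interpretation** - a larger variance indicates data points are spread farther from the mean
- **use case**: can be computed for any numeric data set, whatever its distribution; because deviations are squared, it is sensitive to outliers and is expressed in squared units of the data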
- data types: numerical, ordinal with computed mean
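
To make the interpretation concrete, here is a small sketch comparing two data sets that share the same mean but differ in spread. The data and the names `tight` and `wide` are hypothetical.

```python
import numpy as np

# hypothetical data: both sets have mean 10, but different spread
tight = [9, 10, 11, 10, 10]
wide = [0, 20, 5, 15, 10]

# sample variance (divides by n - 1)
var_tight = np.var(tight, ddof=1)
var_wide = np.var(wide, ddof=1)

print("Means:", np.mean(tight), np.mean(wide))
print("Variance (tight):", var_tight)
print("Variance (wide):", var_wide)
```

The set whose points sit farther from the mean reports the larger variance.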

#### - note: when using the built-in function, must consider the *ddof* ("delta degrees of freedom") keyword argument
- **degrees of freedom** - number of variables / values in a calculation that are free to vary

#### - biased vs unbiased variance:
- **biased** - numpy's *var()* defaults to *ddof = 0*, which divides by *n*. When the data are a sample, this tends to underestimate the population variance.
- **unbiased** - pass *ddof = 1* to divide by *n - 1*, matching the equation above. This removes the bias.
- the difference between the two shrinks as the data set grows
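
The effect of *ddof* can be sketched by computing both versions on a small and a larger data set. The helper `variance_gap` and the 1000-value set are assumptions made for illustration; only `np.var` and its `ddof` argument come from numpy.

```python
import numpy as np

def variance_gap(data):
    """Return (biased, unbiased, relative gap) for a 1-D data set."""
    biased = np.var(data, ddof=0)    # divides by n
    unbiased = np.var(data, ddof=1)  # divides by n - 1
    return biased, unbiased, (unbiased - biased) / unbiased

small = [1, 6, 7, 6, -2, -14, 0]   # the 7-value set from example 1
large = np.arange(1000)            # hypothetical larger set with 1000 values

for name, data in [("small", small), ("large", large)]:
    b, u, gap = variance_gap(data)
    print(name, "biased:", b, "unbiased:", u, "relative gap:", gap)
```

The relative gap works out to 1/n, so it is about 14% for 7 values and 0.1% for 1000 values.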



In [9]:
# example 2 - calculating variance

# using built in numpy function
variance1 = np.var(x, ddof=1)
print("Variance 1:", variance1)

# manual computation
variance2 = np.sum((np.asarray(x) - avg1)**2) / (n - 1) # unbiased variance; asarray makes the list-minus-scalar step explicit
print("Variance 2:", variance2)


Variance 1: 53.285714285714285
Variance 2: 53.285714285714285
