### Task 3. November 16th, 2020: 
Research the Excel standard deviation functions STDEV.P and STDEV.S, noting the difference between them. Then use numpy to perform a simulation demonstrating that the STDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample.

### Solution

Standard deviation is a measure of the amount of variation in a set of numbers. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range ([1] Wikipedia: Standard deviation, https://en.wikipedia.org/wiki/Standard_deviation). The figure below shows a bell curve, which is how you would expect normally distributed data to fall around the mean.
![StandardDeviation](Images/StandardDeviation.PNG)
(https://towardsdatascience.com/using-standard-deviation-in-python-77872c32ba9b)
For example, if you were to measure the heights of the adult male population between the ages of 30 and 50 in a country, you would get a distribution with a similar shape: the largest number of people would be around the mean height, and the further you move from the mean, the fewer people you would expect to find. To emphasise this, you would expect to find the majority of the population between 5'7 and 6'2, and very few people at 4'8 or 6'7. For normally distributed data, about 68% of the population lies within one standard deviation of the mean. If the standard deviation is relatively small, the dataset is tightly clustered around the mean, whereas a large standard deviation indicates the dataset is more spread out.
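
The 68% figure can be checked numerically. The following is a minimal sketch, assuming hypothetical heights in centimetres drawn from a normal distribution; the mean of 175, the standard deviation of 7 and the seed are illustrative choices, not values from this notebook.

```python
import numpy as np

# Seed chosen arbitrarily so the run is repeatable
rng = np.random.default_rng(0)

# Hypothetical heights in cm (illustrative parameters, not real survey data)
heights = rng.normal(loc=175.0, scale=7.0, size=100_000)

mean = heights.mean()
sd = heights.std()

# Fraction of values that lie within one standard deviation of the mean
within_one_sd = np.mean(np.abs(heights - mean) <= sd)
print(round(within_one_sd, 3))
```

The printed fraction should come out close to 0.68, whatever mean and standard deviation are chosen.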

To calculate standard deviation, Excel provides two different calculations, depending on whether the data is a <b>sample</b> of the population or the <b>entire</b> population.
If you are calculating the standard deviation of the entire population/dataset the following formula is used:
$$ STDEV.P = \sqrt{\frac{1}{N}\sum{(x - \bar{x})^2}} $$

If you are calculating the standard deviation of a sample of the population/dataset, the following formula is used (here N is the number of values in the sample):
$$ STDEV.S = \sqrt{\frac{1}{N-1}\sum{(x - \bar{x})^2}} $$
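
In numpy, the two formulas correspond to the `ddof` ("delta degrees of freedom") argument of `np.std`: `ddof=0` divides by N and `ddof=1` divides by N-1. The sketch below checks this on a small set of illustrative values (the array is made up for the example) by computing both formulas by hand and comparing them with `np.std`.

```python
import numpy as np

# Illustrative values only (mean is 5, sum of squared deviations is 32)
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(data)
squared_dev = np.sum((data - data.mean()) ** 2)

stdev_p = np.sqrt(squared_dev / n)        # STDEV.P: divide by N
stdev_s = np.sqrt(squared_dev / (n - 1))  # STDEV.S: divide by N - 1

# np.std reproduces both formulas through its ddof argument
assert np.isclose(stdev_p, np.std(data, ddof=0))
assert np.isclose(stdev_s, np.std(data, ddof=1))

print(stdev_p, stdev_s)
```

Because the only difference is the divisor, STDEV.S is always slightly larger than STDEV.P on the same data, and the gap shrinks as N grows.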


In [1]:
# import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statistics as st

# Create an array of 100 values drawn from a normal distribution with mean 10 and standard deviation 2
x = np.random.normal(10, 2, 100)
x

array([ 9.95696151,  9.22587619,  9.84538449, 11.07953304,  7.77421325,
        7.40471212, 10.47458468,  7.17780555,  9.5740912 ,  8.24275866,
       10.87259929,  6.5855951 , 10.72091397, 14.51707235, 14.01462712,
       11.35020208, 14.81165452, 10.57898073,  9.94770938, 13.99698028,
       10.0300999 ,  9.9209332 ,  9.72338647,  9.89471072,  8.14472442,
        8.47175695,  7.67608252,  9.86189762, 11.90414238,  5.75073504,
       11.0528432 ,  9.38058309,  8.45821315, 10.10041123, 11.76588355,
        8.65256255, 10.10272893,  9.33016617, 11.68257123, 10.69174863,
       12.69426979, 10.92971787,  8.71704129,  8.18678847, 15.42238663,
        9.82149314, 13.10686164, 13.84888557,  9.63247342, 10.97206248,
       10.77346323, 10.08314111,  8.16705472,  8.61831582, 10.11629397,
       10.19740449,  7.43173854,  6.54505877,  7.43636386, 11.08902138,
       10.24394246,  6.40475343,  8.82156126,  9.6534003 , 12.6962477 ,
        9.27539714,  7.46727836, 10.85822911, 11.11377471, 10.50

In [2]:
# Calculate STDEV.P of the entire population of x
a = np.sqrt(np.sum((x - np.mean(x))**2)/(len(x)))
a

1.9424800163477083

In [3]:
# Take the first 20 values of the above generated array. This is a sample of x
y = x[0:20]
y

array([ 9.95696151,  9.22587619,  9.84538449, 11.07953304,  7.77421325,
        7.40471212, 10.47458468,  7.17780555,  9.5740912 ,  8.24275866,
       10.87259929,  6.5855951 , 10.72091397, 14.51707235, 14.01462712,
       11.35020208, 14.81165452, 10.57898073,  9.94770938, 13.99698028])

In [4]:
# Calculate STDEV.S of a sample of the population of x
b = np.sqrt(np.sum((y - np.mean(y))**2)/(len(y)-1))
b

2.4318690829645724

In [5]:
# Calculate STDEV.P of a sample of the population of x
c = np.sqrt(np.sum((y - np.mean(y))**2)/(len(y)))
c

2.3702927825154854

In [6]:
# Absolute error of STDEV.S on the sample, measured against STDEV.P of the entire population
s = abs(a - b)

In [7]:
# Absolute error of STDEV.P on the sample, measured against STDEV.P of the entire population
p = abs(a - c)

In [8]:
# If s is less than p, then STDEV.S is a more accurate estimate of the population standard deviation than STDEV.P
print(s < p)

False


In [9]:
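# Repeat the experiment above (a single iteration as written; increase the range to repeat it more times)
for i in range(1):
    x = np.random.normal(10, 2, 100)
    # STDEV.P of the entire population
    a = np.sqrt(np.sum((x - np.mean(x))**2)/(len(x)))
    # Sample: the first 20 values of the population
    y = x[0:20]
    # STDEV.S and STDEV.P of the sample
    b = np.sqrt(np.sum((y - np.mean(y))**2)/(len(y)-1))
    c = np.sqrt(np.sum((y - np.mean(y))**2)/(len(y)))
    s = abs(a - b)
    print(s)
    p = abs(a - c)
    print(p)
    if s >= p:
        print("Theory is disproven")
        break
else:
    # Runs only if the loop finished without hitting break
    print("Theory is proven")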

0.0468564951608621
0.09753569393298966
Theory is proven
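
A single trial can go either way, so a more robust check is to average over many independent samples. The sketch below is one possible way to do that and differs from the cells above in two respects, both of which are assumptions of this example: each sample is drawn directly from the distribution, and the estimates are compared against the known population standard deviation of 2 rather than against STDEV.P of a 100-value array. The seed, the number of trials and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for repeatability

true_sd = 2.0       # population standard deviation used to generate the data
n_trials = 10_000   # number of independent samples (arbitrary)
sample_size = 20    # values per sample, matching the sample size used above

# Each row is one independent sample from a normal distribution with mean 10, sd 2
samples = rng.normal(loc=10.0, scale=true_sd, size=(n_trials, sample_size))

sd_p = samples.std(axis=1, ddof=0)  # STDEV.P formula applied to each sample
sd_s = samples.std(axis=1, ddof=1)  # STDEV.S formula applied to each sample

print("mean STDEV.P estimate:", sd_p.mean())
print("mean STDEV.S estimate:", sd_s.mean())
print("true standard deviation:", true_sd)
```

On average, both estimates fall below the true value, but the STDEV.S estimate sits closer to it. This is the sense in which STDEV.S is the better estimate when only a sample is available: the N-1 divisor compensates for measuring deviations from the sample mean instead of the unknown population mean.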
