#### Task 3: Research STDEV.P and STDEV.S

The standard deviation of an array of numbers `x` is calculated using numpy as `np.sqrt(np.sum((x - np.mean(x)) ** 2)/len(x))`.

However, Microsoft Excel has two different versions of the standard deviation calculation, STDEV.P and STDEV.S. The STDEV.P function performs the above calculation, but in the STDEV.S calculation the division is by `len(x)-1` rather than `len(x)`.
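As a quick check of the two formulas, the sketch below computes both by hand and compares them with `np.std`, whose `ddof` argument sets the value subtracted from the divisor. The array `x` is made-up data chosen only for illustration.

```python
import numpy as np

# Hypothetical data, chosen only so the numbers are easy to follow.
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

# STDEV.P: divide the sum of squared deviations by len(x).
stdev_p = np.sqrt(np.sum((x - np.mean(x)) ** 2) / len(x))

# STDEV.S: divide by len(x) - 1 instead.
stdev_s = np.sqrt(np.sum((x - np.mean(x)) ** 2) / (len(x) - 1))

# np.std uses ddof=0 by default (STDEV.P); ddof=1 gives STDEV.S.
print(f"STDEV.P: {stdev_p:.3f}   np.std(x):         {np.std(x):.3f}")
print(f"STDEV.S: {stdev_s:.3f}   np.std(x, ddof=1): {np.std(x, ddof=1):.3f}")
```

For the same data STDEV.S is always the larger of the two, because the same sum is divided by a smaller number.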

Research these Excel functions, writing a note in a Markdown cell about the difference between them.

Then use numpy to perform a simulation demonstrating that the STDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample. Note that part of this task is to figure out the terminology in the previous sentence.

### Difference between population and sample

##### STDEV.P
Standard Deviation: Population.
- Data contains all members of the population.
- Used if the data represents the entire population.
- The result is the exact standard deviation of the population, because the dataset is complete.


##### STDEV.S
Standard Deviation: Sample.
- Data does not contain all members of the population.
- Used if the data is just a sample, and you want to generalise to the entire population.
- The result is only an estimate of the population's standard deviation, because the dataset is not complete.

In Excel, STDEV.S uses Bessel's correction to give a better estimate of the population's standard deviation from a sample.

##### Bessel's correction
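Bessel's correction refers to the "n-1" found in STDEV.S. STDEV.P assumes that the whole population is available, so when it is applied to a sample it tends to underestimate the population's standard deviation: the deviations are measured from the sample mean, which sits closer to the sample's values than the true population mean does.
Bessel's correction compensates for this deflated sum of squared deviations by dividing by n-1 instead of n, which gives a slightly larger result that is more representative of the population.[1]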
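Written out, the two calculations and the reason for the correction look roughly as follows. The symbols are introduced here for illustration: N is the population size, n the sample size, μ the population mean and x̄ the sample mean. The last line assumes the sample values are drawn independently from a population with variance σ².

```latex
\begin{align*}
\text{STDEV.P:}\quad \sigma &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2} \\
\text{STDEV.S:}\quad s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2} \\
\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right] &= \frac{n-1}{n}\,\sigma^2
\end{align*}
```

The last line shows that dividing by n gives a variance that is too small on average by a factor of (n-1)/n; dividing by n-1 removes that factor.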

### Example

In [2]:
import numpy as np


size = [5, 10, 100, 500, 1000, 5000, 10000, 50000, 100000]

# For each size, create a population of random integers and draw a sample
# a quarter of that size from it (np.random.choice samples with replacement).
# Print STDEV.P of the population, then STDEV.S of the sample and of the population.
# Note: STDEV.S of a single-value sample is nan, because n - 1 is 0.
for i in size:
    nums = np.random.randint(1, 1000, i)
    sample = np.random.choice(nums, i // 4)
    
    print("Size of population: ", len(nums))
    print("\tSTDEV.P:\t\t%.3f" % np.std(nums))
    print("Size of sample: ", len(sample))
    print("\tSTDEV.S of sample:\t%.3f" % np.std(sample, ddof=1))
    print("\tSTDEV.S of whole:\t%.3f" % np.std(nums, ddof=1))
    print()


Size of population:  5
	STDEV.P:		262.922
Size of sample:  1
	STDEV.S of sample:	nan
	STDEV.S of whole:	293.956

Size of population:  10
	STDEV.P:		283.274
Size of sample:  2
	STDEV.S of sample:	333.754
	STDEV.S of whole:	298.598

Size of population:  100
	STDEV.P:		285.059
Size of sample:  25
	STDEV.S of sample:	271.087
	STDEV.S of whole:	286.495

Size of population:  500
	STDEV.P:		291.275
Size of sample:  125
	STDEV.S of sample:	284.398
	STDEV.S of whole:	291.567

Size of population:  1000
	STDEV.P:		288.412
Size of sample:  250
	STDEV.S of sample:	293.904
	STDEV.S of whole:	288.557

Size of population:  5000
	STDEV.P:		291.301
Size of sample:  1250
	STDEV.S of sample:	291.013
	STDEV.S of whole:	291.331

Size of population:  10000
	STDEV.P:		287.656
Size of sample:  2500
	STDEV.S of sample:	282.988
	STDEV.S of whole:	287.670

Size of population:  50000
	STDEV.P:		288.767
Size of sample:  12500
	STDEV.S of sample:	287.083
	STDEV.S of whole:	288.770

Size of population:  100000
	STDEV

### Conclusion

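The difference between STDEV.P and STDEV.S decreases as the size of the array increases, because the two results differ only by a factor of sqrt(n/(n-1)), which approaches 1 as n grows.
Bessel's correction therefore matters most for small samples; for large samples both functions give almost the same value.[2]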
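A single sample per size is quite noisy, so one way to see the effect more clearly is to average over many samples. The sketch below is one possible simulation, not a definitive one: the seed, the number of trials, and the sample sizes are all assumptions made for illustration, and the population is built in the same way as in the example above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the run repeatable

# Hypothetical population, built like the one in the example above.
population = rng.integers(1, 1000, size=100_000)
true_sd = np.std(population)  # complete data, so STDEV.P is the exact value

n_trials = 10_000  # samples drawn per sample size (an arbitrary choice)
results = {}
for n in (2, 5, 10, 100):
    # Each row is one sample of n values, drawn with replacement.
    samples = rng.choice(population, size=(n_trials, n))
    mean_p = np.std(samples, axis=1).mean()          # STDEV.P on each sample
    mean_s = np.std(samples, axis=1, ddof=1).mean()  # STDEV.S on each sample
    results[n] = (mean_p, mean_s)
    print(f"n={n:>3}  true: {true_sd:.1f}  "
          f"mean STDEV.P: {mean_p:.1f}  mean STDEV.S: {mean_s:.1f}")
```

Under these assumptions the averaged STDEV.S values should sit closer to the population value than the averaged STDEV.P values, with the gap largest for the smallest samples. STDEV.S remains an estimate, so any individual sample can still land above or below the true value.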

### References

[1] Bessel's correction: https://www.statisticshowto.com/bessels-correction/  
[2] Standard deviation calculation: https://exceljet.net/formula/standard-deviation-calculation