# Tasks
This is my proposed solution to the given assessment. The objective is to research the two implementations of the standard deviation function provided by Microsoft Excel, *STDEV.P* and *STDEV.S*, and highlight the differences between them. **numpy** is then used to perform a simulation demonstrating that the *STDEV.S* calculation is a better estimate of the standard deviation of a population when performed on a sample.

## Research
Consulting the Microsoft Excel documentation [1,2], I investigated why two different formulas are needed for calculating the standard deviation and in what context each one is used.

#### STDEV.S

> Estimates standard deviation based on a sample (ignores logical values and text in the sample).

#### STDEV.P

> Calculates standard deviation based on the entire population given as arguments (ignores logical values and text).

On further investigation [3], I was able to establish that *STDEV.P* is to be used for calculating the standard deviation of an entire population. For example, a class of 20 students where data has been collected for all 20 of them.

By extension, *STDEV.S* is to be used in instances where only a sample of a larger population has been taken. For example, a class of 20 students where data was obtained for only 12 of them.

Therefore, *STDEV.S* functions as a type of "correction" when the data collected is only a sample of a larger population [4]: it divides the sum of squared deviations by n - 1 rather than n (known as Bessel's correction), which gives a slightly larger result to compensate for a sample tending to understate the spread of the population it came from.
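For reference, the two calculations can be written out as below. The symbols are my own notation rather than anything taken from the cited pages: N is the population size, μ the population mean, n the sample size and x̄ the sample mean.

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \qquad \text{(STDEV.P)}$$

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad \text{(STDEV.S)}$$

The only structural difference is the divisor: N for the population formula and n - 1 for the sample formula.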

[1] STDEV.S function - https://support.microsoft.com/en-us/office/stdev-s-function-7d69cf97-0c1f-4acf-be27-f3e83904cc23

[2] STDEV.P function - https://support.microsoft.com/en-us/office/stdev-p-function-6e917c05-31a0-496f-ade7-4f4e7462f285

[3] Standard Deviation - https://en.wikipedia.org/wiki/Standard_deviation

[4] Standard Deviation and Variance - https://www.mathsisfun.com/data/standard-deviation.html


## Application of numpy.std() for baseline testing

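*numpy.std*(arr, axis = None, ddof = 0) : Compute the standard deviation of the given data (array elements) along the specified axis [5]. With the default `ddof=0` the divisor is N, which matches *STDEV.P*; with `ddof=1` the divisor is N - 1, which matches *STDEV.S*.

**Standard Deviation** (SD) is a measure of how spread out the values in a data set are [6].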

[5] Numpy Docs - https://numpy.org/doc/stable/reference/generated/numpy.std.html#:~:text=The%20standard%20deviation%20is%20the,N%20%3D%20len(x)%20.

[6] numpy.std() in Python - https://www.geeksforgeeks.org/numpy-std-in-python/
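As a small illustration of the `axis` argument mentioned above, the sketch below runs `np.std` on a 2x2 array. The array `grid` is a hypothetical example chosen only because the results are easy to check by hand; it is not part of the assessment data.

```python
import numpy as np

# Hypothetical 2x2 array, used only to show how the axis argument behaves.
grid = np.array([[1, 2],
                 [3, 4]])

std_all = np.std(grid)           # axis=None: one value over all four elements
std_cols = np.std(grid, axis=0)  # one value per column
std_rows = np.std(grid, axis=1)  # one value per row

print("all values :", std_all)
print("per column :", std_cols)
print("per row    :", std_rows)
```

The rest of this notebook only uses one-dimensional data, so `axis` is left at its default.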


In [1]:
# Python Program illustrating STDEV.P
# using numpy.std() method  
import numpy as np
    
# population array  
pop = [20, 2, 7, 1, 34] 

## get std dev  
print("pop : ", pop)  
print("std of pop : ", np.std(pop))  

pop :  [20, 2, 7, 1, 34]
std of pop :  12.576167937809991


In [2]:
# Python Program illustrating STDEV.S
# using numpy.std() method with ddof=1

# Keep only the first 3 of the 5 population values, giving a sample of 3/5
sample = [20, 2, 7] 

## get std dev
print("sample : ", sample)  
print("std of sample : ", np.std(sample, ddof=1))

sample :  [20, 2, 7]
std of sample :  9.291573243177568


## Application of formula for result validation
#### STDEV.P = `np.sqrt(np.sum((x - np.mean(x))**2) / len(x))`
#### STDEV.S = `np.sqrt(np.sum((x - np.mean(x))**2) / (len(x) - 1))`
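The two formulas above can also be wrapped in small functions and checked against `np.std` directly. This is a minimal sketch: the function names `stdev_p` and `stdev_s` are my own and are not part of numpy or Excel.

```python
import numpy as np

def stdev_p(values):
    """Population standard deviation: divide by the number of values."""
    x = np.asarray(values, dtype=float)
    return np.sqrt(np.sum((x - np.mean(x)) ** 2) / len(x))

def stdev_s(values):
    """Sample standard deviation: divide by the number of values minus one."""
    x = np.asarray(values, dtype=float)
    return np.sqrt(np.sum((x - np.mean(x)) ** 2) / (len(x) - 1))

pop = [20, 2, 7, 1, 34]
sample = [20, 2, 7]

# Compare each hand-written formula with the matching numpy.std() call.
print("STDEV.P matches np.std(ddof=0):", np.isclose(stdev_p(pop), np.std(pop)))
print("STDEV.S matches np.std(ddof=1):", np.isclose(stdev_s(sample), np.std(sample, ddof=1)))
```

Using `np.isclose` rather than `==` avoids a false mismatch from floating-point rounding.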

In [3]:
# formula implementation STDEV.P
# population array
pop = [20, 2, 7, 1, 34] 

## get std dev
print("pop : ", pop)
print("std of pop : ", np.sqrt(np.sum((pop - np.mean(pop))**2)/len(pop)))

pop :  [20, 2, 7, 1, 34]
std of pop :  12.576167937809991


In [4]:
# formula implementation STDEV.S
# Keep only the first 3 of the 5 population values, giving a sample of 3/5
sample = [20, 2, 7] 

## get std dev
print("sample : ", sample)
print("std of sample : ", np.sqrt(np.sum((sample - np.mean(sample))**2)/(len(sample) -1)))

sample :  [20, 2, 7]
std of sample :  9.291573243177568


## Simulation to demonstrate STDEV.S is a better estimate for the standard deviation when performed on a sample of a population

Having validated the numpy.std() results against a direct implementation of the formulas, we can be confident that the calculations obtained are correct. For the simulation we will perform a *STDEV.P* calculation on a **sample** of the population data and compare it to the result of the *STDEV.S* calculation on the same sample.

Technically this is an incorrect application of the *STDEV.P* formula. However, it is done here to demonstrate that *STDEV.S* is a more accurate estimate in this situation. As before, we will use numpy.std() as the baseline and validate the results with our own formula.

In [5]:
# Python Program illustrating STDEV.P on a sample
# using numpy.std() method  

# Keep only the first 3 of the 5 population values, giving a sample of 3/5
sample = [20, 2, 7] 

## get std dev  
print("sample : ", sample)  
print("std of sample : ", np.std(sample)) 

sample :  [20, 2, 7]
std of sample :  7.586537784494028


In [6]:
# formula implementation STDEV.P on a sample
# Keep only the first 3 of the 5 population values, giving a sample of 3/5
sample = [20, 2, 7] 

## get std dev
print("sample : ", sample)
print("std of sample : ", np.sqrt(np.sum((sample - np.mean(sample))**2)/len(sample)))

sample :  [20, 2, 7]
std of sample :  7.586537784494028


## Observation
Applying the *STDEV.P* calculation to a sample of the population data yielded 7.586537784494028, which is lower than the *STDEV.S* result of 9.291573243177568 obtained earlier on the same sample. Furthermore, the difference between the incorrect application of *STDEV.P* on a sample and the actual *STDEV.P* on the population was greater than the difference for the correct application of *STDEV.S* on the sample. The results are summarised below:

Population : [20, 2, 7, 1, 34].

Sample : [20, 2, 7].

Actual STDEV.P = 12.576167937809991

Actual STDEV.S = 9.291573243177568

Incorrect STDEV.P = 7.586537784494028 (Using STDEV.P on a sample)

#### Calculate the difference between the results
Percent Difference = (v1 - v2) / ((v1 + v2) / 2) * 100

In [7]:
# calculate the difference between Actual STDEV.P and Actual STDEV.S
v1 = 12.576167937809991
v2 = 9.291573243177568

dif = ((v1 - v2) / ((v1 + v2) / 2)) * 100
print("Actual Percentage Difference: ", dif)

Actual Percentage Difference:  30.040548472268767


In [8]:
# calculate the difference between Actual STDEV.P and Incorrect STDEV.P
v1 = 12.576167937809991
v2 = 7.586537784494028

dif = ((v1 - v2) / ((v1 + v2) / 2)) * 100
print("Incorrect Percentage Difference: ", dif)

Incorrect Percentage Difference:  49.49365647683313
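The two comparisons above can be folded into one helper that works from the data rather than from hard-coded values. This is only a sketch; the function name `percent_difference` is my own choice.

```python
import numpy as np

def percent_difference(v1, v2):
    """Difference between v1 and v2 as a percentage of their average."""
    return (v1 - v2) / ((v1 + v2) / 2) * 100

pop = [20, 2, 7, 1, 34]
sample = [20, 2, 7]

population_std = np.std(pop)  # STDEV.P on the full population

# Correct use: STDEV.S (ddof=1) on the sample.
diff_s = percent_difference(population_std, np.std(sample, ddof=1))
# Incorrect use: STDEV.P (ddof=0) on the sample.
diff_p = percent_difference(population_std, np.std(sample))

print("Actual Percentage Difference: ", diff_s)
print("Incorrect Percentage Difference: ", diff_p)
```

Computing the inputs with `np.std` each time removes the risk of a copy-and-paste error in the long decimal values.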


## Conclusion
On the small sample of data used in these tests, the correct application of *STDEV.S* on a sample was within approximately 30% of the actual *STDEV.P* result from the full population.

However, the incorrect application of *STDEV.P* on a sample was out by approximately 49.5% compared to the actual *STDEV.P* result from the full population.

By testing and validating our baseline data with two methods, and then simulating the incorrect use of *STDEV.P*, we have shown that on this data *STDEV.S* is a more accurate estimate of the standard deviation of a population when performed on a sample.
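The comparison above rests on a single hand-picked sample. A broader check is to repeat the experiment on many random samples and average the results. The sketch below does this under assumptions that are mine rather than part of the assessment: the population is taken to be normally distributed with a known standard deviation (`true_sigma`), each sample has 3 values to mirror the sample size used above, and the seed and trial count are arbitrary.

```python
import numpy as np

# Assumed setup (not from the assessment): a normal population with known spread.
rng = np.random.default_rng(seed=0)
true_sigma = 10.0
sample_size = 3
n_trials = 100_000

# Each row is one independent sample drawn from the population.
samples = rng.normal(loc=50.0, scale=true_sigma, size=(n_trials, sample_size))

std_p = np.std(samples, axis=1)          # STDEV.P applied to each sample
std_s = np.std(samples, axis=1, ddof=1)  # STDEV.S applied to each sample

mean_p = std_p.mean()
mean_s = std_s.mean()

print("true population std :", true_sigma)
print("mean STDEV.P result :", mean_p)
print("mean STDEV.S result :", mean_s)
```

Under these assumptions both averages fall below the true value, but the *STDEV.S* average is the closer of the two. Dividing by n - 1 makes the variance estimate unbiased; the standard deviation itself still tends to be slightly underestimated for small samples.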