# Tasks 2020

## Task One
Write a Python function called sqrt2 that calculates and
prints to the screen the square root of 2 to 100 decimal places. Your code should
not depend on any module from the standard library or otherwise. You should
research the task first and include references and a description of your algorithm.


In [174]:
def sqrt2(x):
    # Initial guess for the square root of x.
    z = x / 2
    # Loop until z * z is within the tolerance of x.
    while abs(x - (z * z)) > 0.0000001:
        # Newton's method: calculate a better guess for the square root.
        z -= (z * z - x) / (2 * z)
    # Return the approximate square root of x formatted to 100 decimal places.
    # A float holds only about 15 to 17 significant digits, so later digits are not accurate.
    return format(z, '.100f')

In [175]:
#Calling the square root function "sqrt2" with input of 2
sqrt2(2)

'1.4142135623746898698271934335934929549694061279296875000000000000000000000000000000000000000000000000'

In [176]:
#Calling the square root function "sqrt2" with input of 4
sqrt2(4)

'2.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'

In [177]:
#Calling the square root function "sqrt2" with input of 16
sqrt2(16)

'4.0000000000000044408920985006261616945266723632812500000000000000000000000000000000000000000000000000'

In [178]:
#Importing and calling the sqrt function from the math library to compare
import math
math.sqrt(2)

1.4142135623730951
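
Python floats carry only about 15 to 17 significant decimal digits, so the later digits in the `format(z, '.100f')` output above are the exact expansion of a binary double, not digits of the square root of 2. One way to get all 100 decimal places without importing anything is to run the same Newton iteration on Python's arbitrary-precision integers. The sketch below is an alternative approach, not the method used in the cells above; the name `sqrt2_digits` and its `places` parameter are hypothetical.

```python
def sqrt2_digits(places=100):
    """Return sqrt(2) truncated to `places` decimal places (places >= 1)."""
    # sqrt(2) * 10**places equals sqrt(2 * 10**(2 * places)), so the
    # integer square root of this scaled target holds the digits we want.
    target = 2 * 10 ** (2 * places)
    # Start at or above the true root; the integer Newton iteration then
    # decreases monotonically until it reaches floor(sqrt(target)).
    z = target
    while True:
        better = (z + target // z) // 2
        if better >= z:
            break
        z = better
    digits = str(z)
    # Everything before the last `places` digits is the integer part.
    return digits[:-places] + "." + digits[-places:]


print(sqrt2_digits())
```

Because the arithmetic is done on integers, the result is truncated (not rounded) at the last place, and it can be checked by squaring the digits and comparing against 2 followed by 200 zeros.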

## Task Two

The Chi-squared test for independence is a statistical
hypothesis test like a t-test. It is used to analyse whether two categorical variables
are independent. The Wikipedia article gives the table below as an example [4],
stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats
to verify this value and calculate the associated p value. You should include a short
note with references justifying your analysis in a markdown cell.
![title](images/task2.png)


In [179]:
from scipy.stats import chi2_contingency
import pandas as pd

#### Create a data frame similar to the one in the description

In [180]:
collars = pd.DataFrame(
    [
        [90,60,104,95],
        [30,50,51,20],
        [30,40,45,35]
    ],
    index=["White Collar","Blue Collar","No Collar"],
    columns=["A","B","C","D"])
collars

| | A | B | C | D |
|---|---|---|---|---|
| White Collar | 90 | 60 | 104 | 95 |
| Blue Collar | 30 | 50 | 51 | 20 |
| No Collar | 30 | 40 | 45 | 35 |


#### chi2_contingency
SciPy's chi2_contingency() returns four values: the χ² statistic, the p-value, the degrees of freedom, and the array of expected values.

The returned values are shown below.

In [181]:
chi2_contingency(collars)

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))
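
Since the four values come back together, they can be unpacked in a single call instead of indexing the result each time. The following is a minimal sketch of that pattern; the variable names (`observed`, `chi2_stat`, `p_value`, `dof`, `expected`) are illustrative choices, and the counts are copied from the table above.

```python
from scipy.stats import chi2_contingency

# Observed counts from the table above (rows: collar type, columns: A to D).
observed = [
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
]

# Unpack the statistic, p-value, degrees of freedom and expected counts.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-squared: {chi2_stat:.1f}")
print(f"p-value: {p_value:.6f}")
print(f"degrees of freedom: {dof}")
```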

#### Expected Values
You can find the expected values in the fourth element of the returned value. It is an array, so I will create a new table to display them.

The table below shows the expected counts for the contingency table. Each value comes from the following equation:
![title](images/task2-equation.png)

So, for example, "White Collar (A)" has a row total of 349, a column total of 150 and a grand total of 650, giving (349*150)/650 = 80.54.
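
The same row-total times column-total calculation can be applied to every cell at once. This is a rough sketch using NumPy's outer product; the variable names are illustrative, and the counts are copied from the table in the task description.

```python
import numpy as np

# Observed counts (rows: collar type, columns: A to D).
observed = np.array([
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
])

row_totals = observed.sum(axis=1)   # 349, 151, 150
col_totals = observed.sum(axis=0)   # 150, 150, 200, 150
grand_total = observed.sum()        # 650

# Expected count for each cell: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
```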

In [182]:
# The expected values are the fourth element of the returned value.
expected = chi2_contingency(collars)[3]
pd.DataFrame(
    data=expected,
    index=["White Collar","Blue Collar","No Collar"],
    columns=["A","B","C","D"]
).round(2)

| | A | B | C | D |
|---|---|---|---|---|
| White Collar | 80.54 | 80.54 | 107.38 | 80.54 |
| Blue Collar | 34.85 | 34.85 | 46.46 | 34.85 |
| No Collar | 34.62 | 34.62 | 46.15 | 34.62 |


#### Chi-squared value

The chi-squared value is the first returned value.

In [183]:
chi2_contingency(collars)

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))

In [184]:
chisquare=chi2_contingency(collars)[0]
chisquare.round(1)

24.6
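
As a cross-check on the value above, the statistic can also be computed by hand as the sum over all cells of (observed - expected)² / expected. The sketch below assumes the same table of counts; the variable names are illustrative.

```python
import numpy as np

# Observed counts (rows: collar type, columns: A to D).
observed = np.array([
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
])

# Expected counts under independence: row total * column total / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-squared statistic: sum of squared differences scaled by expected counts.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(round(float(chi2_stat), 1), dof)
```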

#### P-Value
The p-value is the probability of seeing a test statistic at least as extreme as the one observed when the null hypothesis (here, that the two variables are independent) is true.
It is the second element of the returned value.

In [185]:
chi2_contingency(collars)

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))

In [186]:
pvalue=chi2_contingency(collars)[1]
pvalue

0.0004098425861096696
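
The p-value is the upper-tail probability of a chi-squared distribution at the observed statistic. When the degrees of freedom are even, as they are here (6), that tail has a closed form: exp(-x/2) multiplied by the sum of (x/2)^i / i! for i from 0 to dof/2 - 1. The sketch below uses this to check the value above by hand; the function name `chi2_sf_even_dof` is hypothetical and the formula only applies to even degrees of freedom.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-squared variable; dof must be even."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive, even dof")
    half = x / 2
    term = 1.0    # holds half**i / i! for the current i
    total = 0.0
    for i in range(dof // 2):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total


# Statistic and degrees of freedom taken from the output above.
p_value = chi2_sf_even_dof(24.5712028585826, 6)
print(p_value)
```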

### Chi-Squared Value simplified

Below I show how to get the chi-squared value in only two lines of code. The third line rounds the value to one decimal place.

In [187]:
CollarRows = [[90,60,104,95],[30,50,51,20],[30,40,45,35]]
ChiValue = chi2_contingency(CollarRows)[0]
ChiValue.round(1)

24.6

## Task Three
The standard deviation of an array of numbers x is
calculated using numpy as np.sqrt(np.sum((x - np.mean(x))**2)/len(x)).
However, Microsoft Excel has two different versions of the standard deviation
calculation, STDEV.P and STDEV.S. The STDEV.P function performs the above
calculation but in the STDEV.S calculation the division is by len(x)-1 rather
than len(x).

Research these Excel functions, writing a note in a Markdown cell
about the difference between them. Then use numpy to perform a simulation
demonstrating that the STDEV.S calculation is a better estimate for the standard
deviation of a population when performed on a sample. Note that part of this task
is to figure out the terminology in the previous sentence.


#### STDEV.P vs STDEV.S
Standard deviation measures how spread out a set of numbers is around its mean.

STDEV.P assumes that its arguments are the entire population. The standard deviation is calculated using the "n" method, which gives the exact standard deviation of the data supplied.

STDEV.S assumes that its arguments are a sample of the population. The standard deviation is calculated using the "n-1" method, which gives an estimate of the true population standard deviation based on the sample. If your data represents the entire population, compute the standard deviation using STDEV.P instead.

Given the same values, STDEV.S returns a higher value than STDEV.P because it divides by "n-1" rather than "n". The smaller denominator compensates for the fact that dividing by "n" on a sample tends to underestimate the true population standard deviation: the deviations are measured from the sample mean, which sits closer to the sample's own values than the population mean does.

For large sample sizes, STDEV.S and STDEV.P return approximately equal values.

Which function to use depends on the scenario. The sample standard deviation is more commonly used, as we usually do not have access to the entire population of data.

STDEV.P Formula   ![title](images/STDEVP.png)

STDEV.S Formula   ![title](images/STDEVS.png)
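
In NumPy the two calculations differ only in the `ddof` ("delta degrees of freedom") argument of `np.std`, which is subtracted from n in the denominator. The short sketch below uses a small made-up data set chosen so the arithmetic is easy to follow: the mean is 5 and the squared deviations sum to 32.

```python
import numpy as np

# Made-up data: mean is 5, squared deviations sum to 32.
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

stdev_p = np.std(x, ddof=0)   # divides by n = 8, like STDEV.P
stdev_s = np.std(x, ddof=1)   # divides by n - 1 = 7, like STDEV.S

print(f"STDEV.P equivalent: {stdev_p}")
print(f"STDEV.S equivalent: {stdev_s}")
```

Here the population version is sqrt(32/8) = 2 and the sample version is sqrt(32/7), which is slightly larger.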


### Calculating STDEV.S
Formula: np.sqrt(np.sum((x - np.mean(x))**2)/(len(x)-1))

In [188]:
import numpy as np # importing numpy to generate random numbers for the dataset
# Generate an array of 1000 random integers from 1 to 99 for the population.
population = np.random.randint(1,100,1000)
# Create an array sample_x that takes a sample of 200 values from the population data.
sample_x = population[200:400]
# Get the length of sample_x.
sample_len = len(sample_x)
# Get the average (mean) of sample_x.
sample_mean = sample_x.mean()
# Put together the formula above: squared deviations divided by n-1, summed, then square-rooted.
sum1 = (sample_x - sample_mean)**2/(sample_len-1)
sample_sum = np.sum(sum1)
STDEVS = np.sqrt(sample_sum)
# Print results.
print(f"Population Size: {len(population)}\nSample Size: {sample_len}\nSample Standard Deviation: {STDEVS}")

Population Size: 1000
Sample Size: 200
Sample Standard Deviation: 27.73808540453987


### Calculating STDEV.P
Formula: np.sqrt(np.sum((x - np.mean(x))**2)/len(x))

In [189]:
#get the length of the population
pop_len = len(population)
#get the average(mean) of population
pop_mean = population.mean()
#start to put together the formula in description above
sum2 = (population - pop_mean)**2/(pop_len)
pop_sum = np.sum(sum2)
STDEVP=np.sqrt(pop_sum)
#print Results
print(f"Population Size: {pop_len}\nPopulation Standard Deviation: {STDEVP}\nSample Size: {sample_len}\nSample Standard Deviation: {STDEVS}")

Population Size: 1000
Population Standard Deviation: 28.392928978884864
Sample Size: 200
Sample Standard Deviation: 27.73808540453987
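
The comparison above rests on a single sample, so the gap between the two numbers is partly down to chance. A simulation that draws many small samples and averages the estimates makes the difference between the "n" and "n-1" methods easier to see. The following is a sketch under assumed settings (a normal population with standard deviation 10, samples of size 5, a fixed seed); none of these values come from the cells above.

```python
import numpy as np

# Assumed simulation settings (illustrative, not from the notebook).
rng = np.random.default_rng(seed=0)
true_sd = 10.0
sample_size = 5
n_samples = 100_000

# Each row is one sample drawn from a population with known spread.
samples = rng.normal(loc=50.0, scale=true_sd, size=(n_samples, sample_size))

# "n" method (like STDEV.P) and "n-1" method (like STDEV.S) on every sample.
sd_p = samples.std(axis=1, ddof=0)
sd_s = samples.std(axis=1, ddof=1)

mean_sd_p = sd_p.mean()
mean_sd_s = sd_s.mean()
mean_var_p = (sd_p ** 2).mean()
mean_var_s = (sd_s ** 2).mean()

print(f"True standard deviation:    {true_sd}")
print(f"Average 'n' estimate:       {mean_sd_p:.3f}")
print(f"Average 'n-1' estimate:     {mean_sd_s:.3f}")
print(f"True variance:              {true_sd ** 2}")
print(f"Average 'n' variance:       {mean_var_p:.3f}")
print(f"Average 'n-1' variance:     {mean_var_s:.3f}")
```

On average the "n-1" estimates land closer to the true value. The correction is exact for the variance; the square root still leaves the "n-1" standard deviation slightly low, but less so than the "n" version.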


### References

1. https://tour.golang.org/flowcontrol/8
2. https://docs.python.org/3/tutorial/floatingpoint.html
3. https://realpython.com/python-square-root-function/
4. https://en.wikipedia.org/wiki/Chi-squared_test
5. https://towardsdatascience.com/gentle-introduction-to-chi-square-test-for-independence-7182a7414a95
6. https://exceljet.net/excel-functions/excel-stdev.p-function#:~:text=of%20the%20numbers.-,The%20STDEV.,S%20function.
7. https://exceljet.net/excel-functions/excel-stdev.s-function#:~:text=The%20STDEV.,The%20STDEV.
8. https://stackoverflow.com/questions/10897339/python-fetch-first-10-results-from-a-list
9. https://www.youtube.com/watch?v=W7q8kfs1bNI
10. https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/variance-standard-deviation-sample/a/population-and-sample-standard-deviation-review
11. https://support.microsoft.com/en-us/office/stdev-p-function-6e917c05-31a0-496f-ade7-4f4e7462f285
12. https://numpy.org/doc/stable/reference/random/generated/numpy.random.standard_normal.html