### Task 2. November 2nd, 2020
The purpose of this task is to use `scipy.stats` to verify the chi-squared value calculated for a set of data. In addition, we are asked to calculate the associated p value.

### Solution
### Explanation of chi-squared test

The chi-squared test for independence is a statistical hypothesis test used to analyse whether two categorical variables are independent. Pearson's chi-squared test determines whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.[1] To best explain this, consider the example given in that reference:\
Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar".

||A|B|C|D|Total|
|-|-|-|-|-|-|
|White collar|90|60|104|95|349|
|Blue collar|30|50|51|20|151|
|No collar|30|40|45|35|150|
|**Total**|150|150|200|150|650|

The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. Assuming the null hypothesis, the chi-squared test takes the marginal totals of the table (the number of people in each occupation, the number in each neighborhood, and the overall total). From these totals, it calculates the expected frequency for each cell of the table. Finally, it compares each expected frequency with the corresponding observed frequency:

$$ x = \frac{(expected - observed)^2}{expected} $$

The chi-squared statistic is the sum of all x. 
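The same calculation can be written with indices. The notation here is introduced for illustration and is not taken from the reference: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, and $N$ is the overall total. Under the null hypothesis, the expected count and the statistic are:

$$ E_{ij} = \frac{R_i \times C_j}{N}, \qquad \chi^2 = \sum_{i,j} \frac{(E_{ij} - O_{ij})^2}{E_{ij}} $$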

So, for the above data, the expected frequency of "white collar" in Neighbourhood A would be:

$$ 150\times\frac{349}{650} \approx80.54 $$

The observed frequency of "white collar" in Neighbourhood A is 90.

Using these values, we calculate x to be:

$$ \frac{(80.54 - 90)^2}{80.54} \approx 1.1 $$

The chi-squared statistic is calculated by summing the values of x obtained for every cell in the table. The lower the value of this statistic, the closer the observed data are to what the null hypothesis predicts. If the test statistic is improbably large according to the chi-squared distribution, then one rejects the null hypothesis of independence.
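The manual calculation described above can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, using the table from the example. The variable names (`row_totals`, `col_totals`, `contributions`, `chi2_stat`) are chosen here and do not come from the reference.

```python
import numpy as np

# Observed counts: rows are occupations, columns are neighborhoods A-D
obs = np.array([[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]])

# Marginal totals, kept 2-D so that they broadcast against each other
row_totals = obs.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = obs.sum(axis=0, keepdims=True)  # shape (1, 4)
n = obs.sum()                                # overall total (650)

# Expected count for each cell under independence
expected = row_totals * col_totals / n

# Per-cell contribution x, then the statistic as the sum of all x
contributions = (expected - obs) ** 2 / expected
chi2_stat = contributions.sum()

print(expected.round(2))
print(contributions.round(2))
print(chi2_stat)
```

The top-left entry of `expected` corresponds to the 80.54 worked out by hand above, and the top-left entry of `contributions` to the value of roughly 1.1.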

### Using python to calculate chi-squared statistic and associated p value

In [1]:
# Libraries required for this solution
import numpy as np # Efficient numeric arrays
from scipy.stats import chi2_contingency # scipy-stats contains probability distributions and statistical functions

Create a NumPy array of the observed data.

In [2]:
obs = np.array([[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]])
obs

array([[ 90,  60, 104,  95],
       [ 30,  50,  51,  20],
       [ 30,  40,  45,  35]])

Using the `chi2_contingency` function in `scipy.stats`, calculate the chi-squared statistic and associated p value for the above data. [2], [3]

In [3]:
chi2_contingency(obs)

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))

This confirms that the chi-squared statistic for this data is approximately 24.57 (24.5712028585826).
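For readability, the four returned values can also be unpacked into named variables. The sketch below assumes the same `obs` array as above. The names `stat`, `p_value`, `dof` and `expected` are chosen here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same observed counts as in the cells above
obs = np.array([[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]])

# The result unpacks as: statistic, p value, degrees of freedom, expected counts
stat, p_value, dof, expected = chi2_contingency(obs)

print(f"chi-squared statistic: {stat:.4f}")
print(f"p value:               {p_value:.6f}")
print(f"degrees of freedom:    {dof}")
print(expected.round(2))
```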

### Interpreting the results

The test statistic for this data is 24.5712028585826 \
The p value is 0.0004098425861096696 \
The number of degrees of freedom is 6
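The degrees of freedom follow from the shape of the table. For a contingency table with $r$ rows and $c$ columns (symbols introduced here for illustration), with 3 occupations and 4 neighborhoods in this example:

$$ (r - 1)\times(c - 1) = (3 - 1)\times(4 - 1) = 6 $$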

As stated in the explanation above, the lower the value calculated for this statistic, the closer the data are to what the null hypothesis predicts. If the test statistic is improbably large according to the chi-squared distribution with the relevant degrees of freedom, then one rejects the null hypothesis of independence. A test statistic of about 24.57 is hard to judge on its own, so from the statistic alone one may be unsure whether to reject the null hypothesis.

A more direct way to decide is the p value. The p value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one calculated. [4] In this case, assuming the null hypothesis that each person's neighborhood of residence is independent of the person's occupational classification, there is roughly a 0.04% probability of obtaining a chi-squared statistic of 24.57 or greater. This is far below the commonly used 5% significance level, so the null hypothesis of independence is rejected for this data.
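As a cross-check, the p value can be recovered from the chi-squared distribution itself. This minimal sketch uses `scipy.stats.chi2` and plugs in the statistic and degrees of freedom reported above. The 5% significance level `alpha` is an assumption made here for illustration, not something the task specifies.

```python
from scipy.stats import chi2

stat = 24.5712028585826  # chi-squared statistic reported above
dof = 6                  # degrees of freedom reported above
alpha = 0.05             # assumed significance level (a common convention)

# Survival function: probability of a value at least as large as stat
p_value = chi2.sf(stat, dof)

# Critical value: statistics above this are rejected at level alpha
critical = chi2.ppf(1 - alpha, dof)

reject_null = p_value < alpha

print(f"p value:        {p_value:.6f}")
print(f"critical value: {critical:.3f}")
print(f"reject null:    {reject_null}")
```

Comparing the statistic with the critical value and comparing the p value with `alpha` are two views of the same decision, so they always agree.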

[1] https://en.wikipedia.org/wiki/Chi-squared_test \
[2] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html \
[3] https://medium.com/@nhan.tran/the-chi-square-statistic-p3-programming-with-python-87eb079f36af \
[4] https://stattrek.com/chi-square-test/independence.aspx#:~:text=The%20P%2Dvalue%20is%20the%20probability%20that%20a%20chi%2Dsquare,Interpret%20results.