##### BSc (Hons) in Computing in Software Development
##### Emerging Technologies Task 2
###### Grace Keane - G00359990

---


###### Task:

The Chi-squared test for independence, also called Pearson's Chi-squared test, is a statistical
hypothesis test like a t-test. It is used to analyse whether two categorical variables
are independent. The Wikipedia article gives the table below as an example,
stating that the Chi-squared value based on it is approximately $24.6$. Use `scipy.stats`
to verify this value and calculate the associated p-value. You should include a short
note with references justifying your analysis in a markdown cell.

|   Title      |   A  |   B  |   C  |   D  | Total    |
|:-------------|:----:|:----:|:----:|:----:|:--------:|
| White Collar | 90   | 60   | 104  | 95   |   349    |
| Blue Collar  | 30   | 50   | 51   | 20   |   151    |
| No Collar    | 30   | 40   | 45   | 35   |   150    |
| Total        | 150  | 150  | 200  | 150  |   650    |


###### Task research

A contingency table (also called a crosstab) is used in statistics to summarise the relationship between several categorical variables. For this task I am given data from a city of $1,000,000$ residents with four neighborhoods: $A, B, C,$ and $D$. A random sample of $650$ residents of the city is taken and each resident's occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data is recorded in the contingency table above. Essentially, the aim of this task is to determine whether the two variables (neighborhood and occupation) are related to each other.
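As a quick sanity check on the data, the row and column totals can be recomputed with NumPy. This is only a minimal sketch: the variable names are my own, and the counts are the ones used in the code cells of this notebook.

```python
import numpy as np

# Observed counts: rows are occupations (white, blue, no collar),
# columns are neighborhoods A, B, C, D.
observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

row_totals = observed.sum(axis=1)  # one total per occupation
col_totals = observed.sum(axis=0)  # one total per neighborhood
sample_size = observed.sum()       # number of residents sampled

print("Row totals:   ", row_totals)
print("Column totals:", col_totals)
print("Sample size:  ", sample_size)
```

If the totals agree with the "Total" row and column of the table, the counts have been entered consistently.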

###### Chi-squared test

A chi-squared test, also written as a $\chi^2$ test, is a statistical hypothesis test. It is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. The purpose of the test is to evaluate how likely the observed frequencies would be, assuming the null hypothesis is true.
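To make the comparison of observed and expected frequencies concrete, the statistic can be computed by hand. Under independence, the expected count of a cell is (row total × column total) / sample size, and Pearson's statistic is the sum of (observed − expected)² / expected over all cells. The sketch below assumes the counts used in this notebook's code cells; the variable names are illustrative only.

```python
import numpy as np

# Observed counts: rows are occupations, columns are neighborhoods A-D.
observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

# Keep the summed axis so the totals broadcast into a 3 x 4 table.
row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 4)
n = observed.sum()

# Expected counts if occupation and neighborhood were independent.
expected = row_totals * col_totals / n

# Pearson's chi-squared statistic: sum over all cells.
chi_squared = ((observed - expected) ** 2 / expected).sum()

print("Chi-squared statistic:", chi_squared)
```

Doing the calculation manually is a useful cross-check on the value that `scipy.stats` reports.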

For this calculation, the first thing I did was define the null hypothesis ($H_0$), which states that there is no relation between the two variables, neighborhood and occupation. I used a p-value to test this hypothesis. I defined a significance level (alpha) to decide whether the relation between the variables is statistically significant. Generally an alpha value of $0.05$ is chosen. This alpha value denotes the probability of erroneously rejecting $H_0$ when it is true. A lower alpha value is chosen when stronger evidence is required before rejecting $H_0$. If the p-value for the test is strictly greater than alpha, we fail to reject $H_0$ (the data are consistent with independence); otherwise $H_0$ is rejected and the variables are considered dependent.
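The decision rule can also be sketched directly with the chi-squared distribution in `scipy.stats`. This is an illustrative sketch, not part of the original method: it assumes the statistic reported in the output of this notebook and takes the degrees of freedom as (rows − 1) × (columns − 1). The variable names are my own.

```python
from scipy.stats import chi2

alpha = 0.05                    # significance level
chi_squared = 24.5712028585826  # statistic reported by chi2_contingency in this notebook
dof = (3 - 1) * (4 - 1)         # (rows - 1) * (columns - 1)

# Upper-tail probability of seeing a statistic at least this large
# if the null hypothesis of independence were true.
p_value = chi2.sf(chi_squared, dof)

# Equivalent view: the statistic must exceed this critical value
# for the null hypothesis to be rejected at the chosen alpha.
critical_value = chi2.ppf(1 - alpha, dof)

reject_h0 = p_value <= alpha

print("p-value:", p_value)
print("Critical value:", critical_value)
print("Reject the null hypothesis:", reject_h0)
```

Comparing the p-value with alpha and comparing the statistic with the critical value are two views of the same test, so they should always lead to the same decision.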

###### Code

In [1]:
"""
This cell calculates the p-value from the given data and then
compares it with alpha to determine whether occupation and neighborhood
are related.
"""

# Adapted from [3]
# Imports
from scipy.stats import chi2_contingency 
import numpy as np

# defining the table 
myTableData = [[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]] 

# Takes an array as input representing the contingency table
# for the two categorical variables. It returns the calculated statistic and
# p-value for interpretation as well as the calculated degrees of freedom.
stat, p, dof, expected = chi2_contingency(myTableData) 
  
# Interpret p-value
alpha = 0.05

# Printing p value to screen
print("P value is:  " + str(p)) 

# Tests the null hypothesis by comparing the p value with alpha (0.05)
# p <= alpha means independence is rejected, i.e. occupation & neighborhood are related
if p <= alpha: 
    print('Dependent - has a significant relation') 
else: 
    print('Independent - does not have a significant relation')


P value is:  0.0004098425861096696
Dependent - has a significant relation


In [2]:
"""
This is another method, which calculates the Chi-squared value,
the p value and the DOF (degrees of freedom)
"""

# Adapted from [4, 5]
# Imports
from scipy.stats import chi2_contingency
import numpy as np

# Defining data from contingency table
dataSet = np.array([[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]])

# Passing the data to chi2_contingency
# The returned result should contain...
# 1) Chi-squared value (approximately 24.6)
# 2) Calculated p value (0.0004098425861096696)
# 3) DOF - degrees of freedom (6)
# 4) Array of expected frequencies
chi2_contingency(dataSet)

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))

###### Calculation Conclusion
Since p ($\approx 0.00041$) is less than alpha ($0.05$), the null hypothesis of independence is rejected: neighborhood and occupation have a significant relation. From method two I calculated the Chi-squared value to be $24.5712028585826$, which rounds to $24.6$; therefore the Chi-squared value ($24.6$) stated in the assignment task is correct.

###### References

[1] StackOverflow; Tables in Markdown (in Jupyter);

https://stackoverflow.com/a/63609790 

[2] Wikipedia; Chi-squared test;

https://en.wikipedia.org/wiki/Chi-squared_test

[3] GeeksforGeeks; Python – Pearson’s Chi-Square Test;

https://www.geeksforgeeks.org/python-pearsons-chi-square-test/

[4] SciPy.org; Examples;

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

[5] Towards Data Science; Running Chi-Square Tests with Die Roll Data in Python;

https://towardsdatascience.com/running-chi-square-tests-in-python-with-die-roll-data-b9903817c51b