#### Task 2: Chi-squared test

The chi-squared test for independence is a statistical hypothesis test, like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article [3] gives the contingency table used below and states that the chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p-value. You should include a short note with references justifying your analysis in a markdown cell.


In [1]:
'''
Creating the data.
'''

import pandas as pd

data = {'A': [90,30,30],
        'B': [60,50,40],
        'C': [104,51,45],
        'D': [95,20,35]
       }
pf = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'], index=['White collar', 'Blue collar', 'No collar'])

Data output.

In [2]:
pf2 = pf.copy()

#Adding totals

pf2.loc['Total']= pf2.sum(numeric_only=True, axis=0)
pf2.loc[:,'total'] = pf2.sum(numeric_only=True, axis=1)

print(pf2)

                A    B    C    D  total
White collar   90   60  104   95    349
Blue collar    30   50   51   20    151
No collar      30   40   45   35    150
Total         150  150  200  150    650


#### Explanation of code

chi2_contingency is imported from the scipy.stats module; it performs a chi-squared test of independence on a contingency table [1].
It returns the following values:

- **chi2:** The chi-squared test statistic, which is expected to be approximately 24.6.
- **p:** The p-value: the probability, assuming the null hypothesis (that the two variables are independent) is true, of obtaining a test statistic at least as extreme as the one actually observed [4]. A small p-value is evidence against independence; it is not a measure of confidence in the result.
- **dof:** Degrees of freedom: the number of cell values that are free to vary once the row and column totals are fixed.
- **expected:** The expected frequencies under independence, based on the marginal sums of the table.

Expected results = each cell is assigned the value column_total * (row_total / total).  
Chi2 = the sum over all cells of (observed - expected)^2 / expected.  
DoF = (num_rows - 1) * (num_cols - 1), which is (3 - 1) * (4 - 1) = 6 for this table.  
A p-value calculator can be found here [5].
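
The formulas above can be checked by hand. The following is a minimal sketch, assuming NumPy is available; the names `observed`, `expected_manual`, `chi2_manual` and `dof_manual` are introduced here for illustration and are not part of the notebook. It uses the plain Pearson statistic with no continuity correction.

```python
import numpy as np

# Observed counts from the table above (rows: collar type, columns: A-D).
observed = np.array([
    [90, 60, 104, 95],   # White collar
    [30, 50, 51, 20],    # Blue collar
    [30, 40, 45, 35],    # No collar
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)
grand_total = observed.sum()

# Expected count per cell: row_total * column_total / grand_total.
expected_manual = row_totals * col_totals / grand_total

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"chi2 (manual): {chi2_manual:.5g}")
print(f"dof (manual):  {dof_manual}")
```

If the formulas are right, these values should agree with what chi2_contingency returns for the same table.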


In [3]:
'''
Using scipy.stats.chi2_contingency to confirm the results.
'''

from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(pf)

print(f"chi2 statistic:     {chi2:.5g}")
print(f"p-value:            {p:.5g}")
print(f"degrees of freedom: {dof}")
print("expected frequencies:")
print(expected)

chi2 statistic:     24.571
p-value:            0.00040984
degrees of freedom: 6
expected frequencies:
[[ 80.53846154  80.53846154 107.38461538  80.53846154]
 [ 34.84615385  34.84615385  46.46153846  34.84615385]
 [ 34.61538462  34.61538462  46.15384615  34.61538462]]
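
As a cross-check on the p-value, it can also be derived directly from the statistic and the degrees of freedom using the survival function of the chi-squared distribution. This is a sketch only: the statistic is copied from the printed output (so it is rounded), the 0.05 significance level is an assumed convention rather than something stated in the task, and the distribution is imported under the alias `chi2_dist` so that it does not shadow the `chi2` variable used in the notebook.

```python
from scipy.stats import chi2 as chi2_dist

chi2_value = 24.571   # statistic copied from the output above (rounded)
dof_value = 6         # degrees of freedom reported above

# Survival function: P(X >= chi2_value) for X ~ chi-squared(dof_value).
p_from_sf = chi2_dist.sf(chi2_value, dof_value)

alpha = 0.05  # assumed significance level, not given in the task
print(f"p-value from sf: {p_from_sf:.5g}")
print(f"Reject independence at alpha={alpha}: {p_from_sf < alpha}")
```

A p-value this far below the assumed threshold would lead to rejecting the null hypothesis that the two variables are independent.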


#### Tests

In [4]:
'''
Confirm the result by comparing it with the published value.
The published value (24.6) is quoted to 1 decimal place, so an absolute
tolerance of 0.05 (half a unit in the last quoted digit) is used.
'''
import math

expectedChi2 = 24.6

# Results
print(f"Expected Chi-squared: {expectedChi2}\nActual Chi-squared: {round(chi2, 1)}")
print(f"Match found: {math.isclose(expectedChi2, chi2, abs_tol=0.05)}")

Expected Chi-squared: 24.6
Actual Chi-squared: 24.6
Match found: True


References:  
[1] Information on how chi2_contingency works:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html  
[2] Chi2_contingency implementation:
https://stackoverflow.com/questions/64669448/understanding-scipy-stats-chisquare  
[3] Chi-squared test:
https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=983024096  
[4] p-value:
https://en.wikipedia.org/wiki/P-value  
[5] Calculating p-value:
https://www.gigacalculator.com/calculators/chi-square-to-p-value-calculator.php  